P4 timing update
Guys,
As I said, the first timings I sent out were from a default ATLAS install.
I have since done some tuning, which gives the P4 an additional 15% or so
speedup. With this speedup, the 1.5GHz P4 finally beats the 1GHz Athlon for
double precision flops. As a matter of fact, this chip does pretty well
against the IA64 too . . .
The trick was twofold. First, since the L1 cache is so small and the L2
cache is so fast, I found it was better to ignore the L1 cache and choose
a large NB (say in the range of 72-80). Second, the P4 seems to be better
than the PIII at out-of-order execution, or register renaming, or something
along those lines, 'cause you can choose lat=1 (rather than the real
lat=12 or so) and use the extra registers for better register blocking.
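
To put rough numbers on the first trick, here is a minimal Python sketch of
the working set for one NB x NB blocked multiply step. The cache sizes (8KB
L1 data, 256KB L2) are assumed here and are not stated above, and the helper
name is made up for illustration. ATLAS picks NB empirically, not with a
formula like this.

```python
# Rough working-set estimate for one blocked matrix-multiply step.
# ASSUMPTION: the cache sizes below are illustrative values for this P4,
# not figures taken from the timings in this message.
L1_DATA_BYTES = 8 * 1024
L2_BYTES = 256 * 1024
DOUBLE_BYTES = 8


def block_footprint_bytes(nb, elem_bytes=DOUBLE_BYTES, num_blocks=3):
    """Bytes needed to hold num_blocks square NB x NB blocks (A, B and C)."""
    return num_blocks * nb * nb * elem_bytes


for nb in (72, 80):
    one = block_footprint_bytes(nb, num_blocks=1)
    three = block_footprint_bytes(nb)
    print(f"NB={nb}: one block {one} bytes, three blocks {three} bytes, "
          f"fits L1: {three <= L1_DATA_BYTES}, fits L2: {three <= L2_BYTES}")
```

Under those assumed sizes, even a single double precision block at NB=72
is several times larger than L1, while all three blocks sit comfortably
in L2.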
I include the double precision results below. With the new install, I can't
yet time single precision, because the new, larger NB breaks the SSE cleanup
routines, but it looks like if we can fix that, sMM peak will go from
around 3.7Gflop to 4.2Gflop.
Some people seemed confused, so here's an explanation of the P4's theoretical
peak:
(1) If you are using the x86 FPU, theoretical peak for all precisions is
one flop per cycle, i.e. the MHz of the machine (so 1.5Gflop for our P4)
(2) If you are using SSE1 instructions for single precision, the theoretical
peak is 4*MHz (6Gflop)
(3) If you are using SSE2 instructions for double precision, the theoretical
peak is 2*MHz (3Gflop)
ATLAS presently uses (2) for single precision, and (1) for double. Thus the
1262.3MFLOP observed dmatmul timing represents roughly 84% of the 1.5Gflop
theoretical peak.
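
The arithmetic in (1)-(3) can be sketched in a few lines of Python. This is
only an illustration of the calculation: the unit names and helper functions
are made up here, and the flops-per-cycle values are taken from the three
cases above.

```python
# Theoretical peak = clock rate (MHz) x flops issued per cycle.
# Flops-per-cycle values follow cases (1)-(3); the names are illustrative.
FLOPS_PER_CYCLE = {
    "x87": 1,          # (1) x86 FPU, any precision
    "sse1_single": 4,  # (2) SSE1, single precision
    "sse2_double": 2,  # (3) SSE2, double precision
}


def peak_mflops(clock_mhz, unit):
    """Theoretical peak in MFLOPS for a given clock and instruction set."""
    return clock_mhz * FLOPS_PER_CYCLE[unit]


def fraction_of_peak(observed_mflops, clock_mhz, unit):
    """Observed rate as a fraction of the theoretical peak."""
    return observed_mflops / peak_mflops(clock_mhz, unit)


P4_MHZ = 1500
for unit in FLOPS_PER_CYCLE:
    print(f"{unit}: {peak_mflops(P4_MHZ, unit)} MFLOPS peak")

dmm = fraction_of_peak(1262.3, P4_MHZ, "x87")
print(f"dMM at 1262.3 MFLOPS: {dmm:.1%} of x87 peak")
```

The same helper puts the sMM figures mentioned earlier, 3.7Gflop and
4.2Gflop, at roughly 62% and 70% of the 6Gflop SSE1 peak.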
Cheers,
Clint
ATH : 1GHz Athlon, SDRAM               $1269
P4  : 1.5GHz Pentium 4, Rambus         $2109
IA64: 666MHz Itanium, no idea on mem   ?????
P40 : my original, non-optimal, ATLAS install on the P4
All results are in MFLOPS; column headings are the problem size N.

             100     200     300     400     500     600     700     800     900    1000
          ======  ======  ======  ======  ======  ======  ======  ======  ======  ======
ATH  dMM   909.1  1010.5  1080.0  1163.6  1087.0  1136.8  1143.3  1190.7  1205.0  1156.1
P4   dMM  1025.6  1194.0  1181.2  1238.7  1209.7  1234.3  1247.3  1264.2  1276.8  1242.2
P40  dMM   952.4  1010.5  1080.0   984.6  1041.7  1080.0  1055.4  1077.9  1088.1  1075.3
IA64 dMM   866.3  1247.9  1472.4  1566.6  1570.6  1708.0  1645.1  1730.3  1710.2  1741.5

ATH  dLU   477.4   611.8   695.0   709.8   780.1   777.4   815.8   793.1   823.0   865.2
P4   dLU   428.7   659.8   763.1   851.7   887.6   933.9   951.8   974.3  1033.2  1040.9
P40  dLU   435.8   611.8   718.2   788.6   805.2   821.8   878.5   874.4   882.9   888.2
IA64 dLU   241.2   419.4   554.3   652.8   754.2   800.4   832.4   873.0   926.0   937.0

            1200    1400    1600    1800    2000    2200    2400    2600    2800    3000
          ======  ======  ======  ======  ======  ======  ======  ======  ======  ======
ATH  dMM  1183.6  1172.6  1175.3  1192.6  1179.9  1175.3  1189.7  1191.2  1190.1  1187.3
P4   dMM  1256.7  1250.1  1254.5  1262.3  1261.8  1258.6  1261.3  1261.7  1262.0  1260.5
P40  dMM  1066.7  1067.7  1066.7  1065.3  1071.0  1073.9  1073.3  1072.7  1073.4   852.3
IA64 dMM  1789.2  1809.9  1820.4  1858.1  1840.1  1823.3  1832.5  1810.5  1862.2  1866.3

ATH  dLU   878.8   923.4   925.2   943.2   950.3   965.5   974.9   983.5   994.6  1003.1
P4   dLU  1046.6  1075.5  1091.8  1107.2  1110.7  1128.2  1126.3  1130.7  1139.5  1158.0
P40  dLU   906.5   932.8   937.9   950.2   955.4   965.5   969.8   977.8   975.4   986.1
IA64 dLU   990.7  1047.1  1077.9  1149.2  1179.4  1208.7  1240.7  1272.7  1305.7  1336.0

         N    GEMM    SYMM    SYRK   SYR2K    TRMM    TRSM
       ===  ======  ======  ======  ======  ======  ======
ATH    500  1136.4  1000.0   835.0  1087.0   961.5   961.5
P4     500  1209.7  1171.9  1002.0  1209.7  1056.3  1056.3
P40    500  1041.7   961.5   835.0  1000.0   892.9  1041.7
IA64   500  1610.1  1201.9  1462.9  1462.9  1082.5   816.6
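
For anyone who wants to see what the re-tuned install buys per kernel, here
is a small Python sketch that divides the P4 row by the P40 row of the last
table. The numbers are copied from that table; the variable and function
names are just for illustration.

```python
# MFLOPS at N=500, copied from the GEMM/SYMM/SYRK/SYR2K/TRMM/TRSM table above.
KERNELS = ["GEMM", "SYMM", "SYRK", "SYR2K", "TRMM", "TRSM"]
P4_TUNED = [1209.7, 1171.9, 1002.0, 1209.7, 1056.3, 1056.3]    # the "P4" row
P4_DEFAULT = [1041.7, 961.5, 835.0, 1000.0, 892.9, 1041.7]     # the "P40" row


def speedups(new, old):
    """Per-kernel ratio new/old; 1.15 means 15% faster."""
    return {k: n / o for k, n, o in zip(KERNELS, new, old)}


ratios = speedups(P4_TUNED, P4_DEFAULT)
for kernel, r in ratios.items():
    print(f"{kernel:6s} {r:.3f}x")
```

At this size TRSM barely moves (about 1.4%), while the other five kernels
gain roughly 16-22%.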