P4 timing update



Guys,

As I said, the first timings I sent out came from a default ATLAS install.
I have since done some twiddling, which has given the P4 an additional
15% or so speedup.  With this speedup, the 1.5GHz P4 finally beats the
1GHz Athlon for double precision flops.  As a matter of fact, this chip
does pretty well up against the IA64 as well . . .

The trick was twofold.  First, since the L1 cache is so small and the L2
cache is so fast, I found it was better to ignore the L1 cache and choose
a large NB (say in the range of 72-80).  Second, the P4 seems to be better
at out-of-order execution, or register renaming, or something along those
lines than the PIII, because you can choose lat=1 (rather than the real
lat=12 or so) and use the extra registers for better register blockings.
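To make the NB tradeoff concrete, here is a rough Python sketch of the cache footprint of a blocked matmul.  The cache sizes (8KB L1 data, 256KB L2) are my assumption for a first-generation P4 and are not stated in this message, and the "three NB x NB blocks resident at once" model is a simplification for illustration, not how ATLAS actually picks NB.  The function names are hypothetical.

```python
# Assumed cache sizes for a first-generation Pentium 4 (not from the message).
L1_DATA_BYTES = 8 * 1024
L2_BYTES = 256 * 1024
DOUBLE_BYTES = 8


def block_footprint_bytes(nb, elem_bytes=DOUBLE_BYTES, n_blocks=3):
    """Bytes needed to hold n_blocks square NB x NB blocks (one each of A, B, C)."""
    return n_blocks * nb * nb * elem_bytes


def largest_nb_that_fits(cache_bytes, elem_bytes=DOUBLE_BYTES, n_blocks=3):
    """Largest NB whose simplified working set still fits in cache_bytes."""
    nb = 0
    while block_footprint_bytes(nb + 1, elem_bytes, n_blocks) <= cache_bytes:
        nb += 1
    return nb


if __name__ == "__main__":
    for nb in (72, 80):
        fp = block_footprint_bytes(nb)
        print(f"NB={nb}: {fp} bytes, fits L1: {fp <= L1_DATA_BYTES}, "
              f"fits L2: {fp <= L2_BYTES}")
    print("largest NB for L1:", largest_nb_that_fits(L1_DATA_BYTES))
    print("largest NB for L2:", largest_nb_that_fits(L2_BYTES))
```

Under these assumptions, a double precision NB in the 72-80 range is far too big for L1 but sits comfortably inside L2, which is consistent with blocking for L2 instead.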

I include the double precision results below.  In the new install, I can't
yet time single precision, because the new, larger NB breaks the SSE cleanup
routines.  If we can fix that, it looks like sMM peak will go from around
3.7Gflop to 4.2Gflop.

Some people seemed confused, so here's an explanation of the P4's theoretical
peak:
(1) If you are using the x86 FPU, theoretical peak for all precisions is
    1 flop/cycle, i.e. the MHz of the machine (so 1.5Gflop for our P4)
(2) If you are using SSE1 instructions for single precision, the theoretical
    peak is 4 flops/cycle, i.e. 4*MHz (6Gflop)
(3) If you are using SSE2 instructions for double precision, the theoretical
    peak is 2 flops/cycle, i.e. 2*MHz (3Gflop)

ATLAS presently uses (2) for single precision, and (1) for double.  Thus the
1262.3MFLOP dMM timing observed at N=1800 is roughly 84% of the 1500MFLOP
theoretical peak for (1).
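The peak arithmetic above can be checked with a few lines of Python.  This is only a sketch of the calculation in this message: the flops-per-cycle figures are the ones listed in (1)-(3), the 1262.3 figure is the P4 dMM timing at N=1800, and the names are made up for illustration.

```python
CLOCK_MHZ = 1500  # the 1.5GHz P4

# Flops per cycle for the three cases listed in the message.
FLOPS_PER_CYCLE = {
    "x86_fpu": 1,      # case (1), all precisions
    "sse1_single": 4,  # case (2)
    "sse2_double": 2,  # case (3)
}


def theoretical_peak_mflops(clock_mhz, flops_per_cycle):
    """Theoretical peak in MFLOPS: clock rate times flops issued per cycle."""
    return clock_mhz * flops_per_cycle


def fraction_of_peak(observed_mflops, peak_mflops):
    """Observed rate as a fraction of the theoretical peak."""
    return observed_mflops / peak_mflops


peaks = {name: theoretical_peak_mflops(CLOCK_MHZ, fpc)
         for name, fpc in FLOPS_PER_CYCLE.items()}

# ATLAS currently uses the x86 FPU for double precision, so dMM is
# measured against the case (1) peak.
dmm_eff = fraction_of_peak(1262.3, peaks["x86_fpu"])

if __name__ == "__main__":
    for name, peak in peaks.items():
        print(f"{name}: {peak} MFLOPS")
    print(f"dMM fraction of x86 FPU peak: {dmm_eff:.3f}")
```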

Cheers,
Clint

ATH : 1GHz Athlon, SDRAM                           $1269
P4  : 1.5GHz Pentium 4, Rambus                     $2109
IA64: 666MHz Itanium, no idea on mem               ?????
P40 : my original, non-optimal, ATLAS install on the P4

All timings below are in MFLOPS; the column headings are the problem size N.

             100    200    300    400    500    600    700    800    900   1000
          ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
ATH  dMM   909.1 1010.5 1080.0 1163.6 1087.0 1136.8 1143.3 1190.7 1205.0 1156.1
P4   dMM  1025.6 1194.0 1181.2 1238.7 1209.7 1234.3 1247.3 1264.2 1276.8 1242.2
P40  dMM   952.4 1010.5 1080.0  984.6 1041.7 1080.0 1055.4 1077.9 1088.1 1075.3
IA64 dMM   866.3 1247.9 1472.4 1566.6 1570.6 1708.0 1645.1 1730.3 1710.2 1741.5

ATH  dLU   477.4  611.8  695.0  709.8  780.1  777.4  815.8  793.1  823.0  865.2
P4   dLU   428.7  659.8  763.1  851.7  887.6  933.9  951.8  974.3 1033.2 1040.9
P40  dLU   435.8  611.8  718.2  788.6  805.2  821.8  878.5  874.4  882.9  888.2
IA64 dLU   241.2  419.4  554.3  652.8  754.2  800.4  832.4  873.0  926.0  937.0

            1200   1400   1600   1800   2000   2200   2400   2600   2800   3000
          ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
ATH  dMM  1183.6 1172.6 1175.3 1192.6 1179.9 1175.3 1189.7 1191.2 1190.1 1187.3
P4   dMM  1256.7 1250.1 1254.5 1262.3 1261.8 1258.6 1261.3 1261.7 1262.0 1260.5
P40  dMM  1066.7 1067.7 1066.7 1065.3 1071.0 1073.9 1073.3 1072.7 1073.4  852.3
IA64 dMM  1789.2 1809.9 1820.4 1858.1 1840.1 1823.3 1832.5 1810.5 1862.2 1866.3

ATH  dLU   878.8  923.4  925.2  943.2  950.3  965.5  974.9  983.5  994.6 1003.1
P4   dLU  1046.6 1075.5 1091.8 1107.2 1110.7 1128.2 1126.3 1130.7 1139.5 1158.0
P40  dLU   906.5  932.8  937.9  950.2  955.4  965.5  969.8  977.8  975.4  986.1
IA64 dLU   990.7 1047.1 1077.9 1149.2 1179.4 1208.7 1240.7 1272.7 1305.7 1336.0

                          GEMM   SYMM   SYRK  SYR2K   TRMM   TRSM
                         =====  =====  =====  =====  =====  =====
ATH       500           1136.4 1000.0  835.0 1087.0  961.5  961.5
P4        500           1209.7 1171.9 1002.0 1209.7 1056.3 1056.3
P40       500           1041.7  961.5  835.0 1000.0  892.9 1041.7
IA64      500           1610.1 1201.9 1462.9 1462.9 1082.5  816.6
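For anyone who wants to compare the tuned and default P4 installs directly, here is a small Python sketch that computes the per-size dMM speedup of P4 over P40.  The numbers are copied from the N=1200 to N=3000 dMM rows above; the variable names are just for illustration.  The P40 value at N=3000 (852.3) is well below its neighbours, so the sketch also reports the median, which is less sensitive to that one point.

```python
from statistics import median

SIZES = [1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800, 3000]

# dMM MFLOPS, copied from the large-size table above.
P4_DMM = [1256.7, 1250.1, 1254.5, 1262.3, 1261.8,
          1258.6, 1261.3, 1261.7, 1262.0, 1260.5]
P40_DMM = [1066.7, 1067.7, 1066.7, 1065.3, 1071.0,
           1073.9, 1073.3, 1072.7, 1073.4, 852.3]

# Speedup of the tuned install over the default install at each size.
speedups = [tuned / default for tuned, default in zip(P4_DMM, P40_DMM)]
median_speedup = median(speedups)

if __name__ == "__main__":
    for n, s in zip(SIZES, speedups):
        print(f"N={n}: {s:.3f}x")
    print(f"median: {median_speedup:.3f}x")
```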