Re: P4 timing update
Hi Clint! Great!
R Clint Whaley <rwhaley@cs.utk.edu> writes:
> Guys,
>
> As I said, the first timings I sent out were just a default ATLAS install.
> I have since done some twiddling, which has resulted in the P4 getting
> an additional 15% or so speedup. With this speedup, the 1.5Ghz P4 finally
> beats the 1Ghz Athlon for double precision flops. As a matter of fact,
> this chip does pretty well up against the IA64 as well . . .
>
> The trick was twofold: first, since the L1 cache is so small, and the L2
> cache is so fast, I found it was better to ignore the L1 cache and choose
> a large NB (say in the range of 72-80). The second trick is that the P4
> seems to be better at out of order, or register renaming, or something along
> those lines than the PIII, 'cause you can choose lat=1 (rather than the real
> lat=12 or so) and use the extra registers for better register blockings.
>
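A rough way to see why an NB in the 72-80 range targets L2 rather than L1 is to compute the working set of the blocks. The sketch below assumes an 8 KB L1 data cache and a 256 KB L2 (figures for the original P4 that are not stated in this thread) and assumes three NB x NB double precision blocks are live at once; the helper name is hypothetical.

```python
# Assumed cache sizes for the original Pentium 4 (not given in the thread).
L1_BYTES = 8 * 1024
L2_BYTES = 256 * 1024


def block_footprint(nb, elem_bytes=8, n_blocks=3):
    """Bytes needed to hold n_blocks square NB x NB blocks of elem_bytes elements."""
    return n_blocks * nb * nb * elem_bytes


for nb in (72, 80):
    fp = block_footprint(nb)
    print(f"NB={nb}: {fp} bytes, fits L1: {fp <= L1_BYTES}, fits L2: {fp <= L2_BYTES}")
```

Under these assumptions both block sizes overflow L1 by a wide margin but sit comfortably inside L2, which is consistent with the choice to ignore L1.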
I wonder if this will apply to the SSE stuff too. As you may recall,
we found, much to our surprise, that no pipelining performed best on
the P3, i.e. each multiply immediately followed by its dependent add
(mul a,b ; add b,c). I think I now understand your latency parameter
to refer to what I'd been calling pipeline depth. If so, is it worth
reinvestigating the SSE pipeline conclusion? Has Intel documented
anywhere how this *should* work?
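To make the pipeline depth / latency idea concrete, here is a minimal sketch that prints a mul/add schedule for a given separation. It only illustrates the concept and is not ATLAS's actual code generator; the function name, the three-operand pseudo-assembly, and the register names are all made up for the example.

```python
def schedule(n_terms, lat):
    """Emit a multiply/add sequence for c += a[i] * b[i].

    Each multiply result sits in one of `lat` temporaries, and its add is
    issued after lat - 1 further multiplies. lat=1 is the unpipelined case:
    every multiply is followed at once by its dependent add.
    """
    ops = []
    for i in range(n_terms + lat - 1):
        if i < n_terms:
            ops.append(f"mul t{i % lat}, a{i}, b{i}")
        j = i - (lat - 1)  # index of the product whose add is due now
        if j >= 0:
            ops.append(f"add c, c, t{j % lat}")
    return ops


for lat in (1, 3):
    print(f"lat={lat}:")
    for op in schedule(4, lat):
        print("   ", op)
```

With lat=1 only one temporary is live at a time, so the registers a deeper schedule would spend on in-flight products are free for a larger register block.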
Take care,
> I include the double precision results below. In the new install, I can't
> yet time single precision, because the new, larger, NB breaks the SSE cleanup
> routines, but it looks like if we can fix that, sMM peak will go from
> around 3.7Gflop to 4.2Gflop.
>
> Some people seemed confused, so here's an explanation of P4's theoretical peak:
> (1) If you are using the x86 FPU, theoretical peak for all precisions is
> one flop per cycle, i.e. the clock rate in MHz (so 1.5Gflop for our P4)
> (2) If you are using SSE1 instructions for single precision, the theoretical
> peak is 4 flops per cycle, i.e. 4*MHz (6Gflop)
> (3) If you are using SSE2 instructions for double precision, the theoretical
> peak is 2 flops per cycle, i.e. 2*MHz (3Gflop)
>
> ATLAS presently uses (2) for single precision, and (1) for double. Thus the
> 1262.3MFLOP observed dmatmul timing represents roughly 84% of theoretical
> peak.
>
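The peak figures and the 84% number can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation described in the message; the dictionary keys and function names are hypothetical labels.

```python
CLOCK_MHZ = 1500.0  # the 1.5GHz P4 from the message

# Flops per cycle for each case listed in the message.
FLOPS_PER_CYCLE = {
    "x87": 1,          # (1) x86 FPU, any precision
    "sse1_single": 4,  # (2) SSE1, single precision
    "sse2_double": 2,  # (3) SSE2, double precision
}


def peak_mflops(unit, clock_mhz=CLOCK_MHZ):
    """Theoretical peak in MFLOPS: flops per cycle times clock rate in MHz."""
    return FLOPS_PER_CYCLE[unit] * clock_mhz


def efficiency(observed_mflops, unit, clock_mhz=CLOCK_MHZ):
    """Fraction of theoretical peak reached by an observed timing."""
    return observed_mflops / peak_mflops(unit, clock_mhz)


for unit in FLOPS_PER_CYCLE:
    print(f"{unit}: {peak_mflops(unit):.0f} MFLOPS peak")

# Double precision currently goes through the x87 FPU, so compare against (1).
print(f"dMM efficiency: {efficiency(1262.3, 'x87'):.1%}")
```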
> Cheers,
> Clint
>
> ATH : 1Ghz Athlon, SDRAM             $1269
> P4  : 1.5Ghz Pentium 4, Rambus       $2109
> IA64: 666Mhz Itanium, no idea on mem ?????
> P40 : my original, non-optimal, ATLAS install on the P4
>
>             100    200    300    400    500    600    700    800    900   1000
>          ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
> ATH  dMM  909.1 1010.5 1080.0 1163.6 1087.0 1136.8 1143.3 1190.7 1205.0 1156.1
> P4   dMM 1025.6 1194.0 1181.2 1238.7 1209.7 1234.3 1247.3 1264.2 1276.8 1242.2
> P40  dMM  952.4 1010.5 1080.0  984.6 1041.7 1080.0 1055.4 1077.9 1088.1 1075.3
> IA64 dMM  866.3 1247.9 1472.4 1566.6 1570.6 1708.0 1645.1 1730.3 1710.2 1741.5
>
> ATH  dLU  477.4  611.8  695.0  709.8  780.1  777.4  815.8  793.1  823.0  865.2
> P4   dLU  428.7  659.8  763.1  851.7  887.6  933.9  951.8  974.3 1033.2 1040.9
> P40  dLU  435.8  611.8  718.2  788.6  805.2  821.8  878.5  874.4  882.9  888.2
> IA64 dLU  241.2  419.4  554.3  652.8  754.2  800.4  832.4  873.0  926.0  937.0
>
>            1200   1400   1600   1800   2000   2200   2400   2600   2800   3000
>          ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
> ATH  dMM 1183.6 1172.6 1175.3 1192.6 1179.9 1175.3 1189.7 1191.2 1190.1 1187.3
> P4   dMM 1256.7 1250.1 1254.5 1262.3 1261.8 1258.6 1261.3 1261.7 1262.0 1260.5
> P40  dMM 1066.7 1067.7 1066.7 1065.3 1071.0 1073.9 1073.3 1072.7 1073.4  852.3
> IA64 dMM 1789.2 1809.9 1820.4 1858.1 1840.1 1823.3 1832.5 1810.5 1862.2 1866.3
>
> ATH  dLU  878.8  923.4  925.2  943.2  950.3  965.5  974.9  983.5  994.6 1003.1
> P4   dLU 1046.6 1075.5 1091.8 1107.2 1110.7 1128.2 1126.3 1130.7 1139.5 1158.0
> P40  dLU  906.5  932.8  937.9  950.2  955.4  965.5  969.8  977.8  975.4  986.1
> IA64 dLU  990.7 1047.1 1077.9 1149.2 1179.4 1208.7 1240.7 1272.7 1305.7 1336.0
>
>            GEMM   SYMM   SYRK  SYR2K   TRMM   TRSM
>          ====== ====== ====== ====== ====== ======
> ATH  500 1136.4 1000.0  835.0 1087.0  961.5  961.5
> P4   500 1209.7 1171.9 1002.0 1209.7 1056.3 1056.3
> P40  500 1041.7  961.5  835.0 1000.0  892.9 1041.7
> IA64 500 1610.1 1201.9 1462.9 1462.9 1082.5  816.6
>
>
>
--
Camm Maguire camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah