Re: Timing roundup (P4, IA64, Athlon)
Hi Clint! This looks great! Congratulations!
R Clint Whaley <rwhaley@cs.utk.edu> writes:
> Guys,
>
> I include below some interesting timings, where I compare the following three
> systems, all running linux, and using a default ATLAS 3.2.0 install:
>
> ATH : 1GHz Athlon, SDRAM          $1269
> P4  : 1.5GHz Pentium 4, Rambus    $2109
> IA64: 666MHz Itanium, no idea on mem  ?????
>
> So, the first thing to note is that the Athlon is using the old memory
> type (SDRAM, not the newer SDDRAM, or whatever the hell it is), and the
> P4 is using Rambus. I have no idea what the Itanium has. The price above
> is not what we paid for the machines (I have no idea), it's what Gateway
> tells me those machines with 256MB of memory cost.
>
> All the numbers here are using the P4's normal FPU. This machine will
> need SSE2 to really shine (that will pump its theoretical peak to 2*MHz).
> However, the normal FPU is what you get just using gcc on linux, so it's
> what linux people will be getting for a while, as well as MSVC++ people
> (Intel has a compiler for Windows that apparently generates SSE2 code
> automatically, which is what MKL is apparently already using to get
> dmatmul > MHz).
>
Should be very little work (comparatively) to port the current SSE
stuff to SSE2. Problem is, I don't have access to any such machine.
If you're ever interested in this project and would like my help (and
if I can find the time :-)), I'd love to see ATLAS shine in this area
too and could perhaps assist if you could provide ssh access to a
linux P4 somewhere. Or I'm sure Peter's generator could do just fine
too. Perhaps this is best put off though for some time to give us a
rest from the release!

These timings are very interesting. How does the Athlon manage to
double the FPU peak with the ordinary instructions? They must have
some scheduler on the chip farming out consistent sets of FPU
instructions to two different units?

Take care,
> So, the good news is that the P4 looks a lot like a PIII at the greater
> clock speed, even when using the normal FPU (I had heard rumors that the
> P4 FPU was crippled), since you get roughly 72% of peak with dgemm (the
> exact number the PII gets; PIIIs typically get more like 76%). Here are
> some peak numbers (extracted from the detailed timings below):
>
>
>               Theo    dMatmul       dLU    dMM %    dLU %
>        MHz    peak    (MFLOP)   (MFLOP)   of MHz   of MHz
>       ====    ====    =======   =======   ======   ======
> ATH : 1000    2000     1192.6    1003.1    119.3    100.3
> P4  : 1500    1500     1073.9     986.1     71.6     65.7
> IA64:  666    2664     1866.3    1336.0    280.2    200.6
>
>        Theoretical     dMatmul      dLU     dLU %
>        peak (MFLOP)     % peak   % peak    of dMM
>        ============    =======   ======    ======
> ATH :        2000         59.6     50.2      84.1
> P4  :        1500         71.6     65.7      91.8
> IA64:        2664         70.1     50.2      71.6
>
>
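The percentage columns in the two tables above follow directly from the measured MFLOPS, the clock rate, and the theoretical peak. Here is a minimal Python sketch that recomputes them. It assumes the theoretical peak is simply clock rate times flops per cycle (2 for the Athlon, 1 for the P4's normal FPU, 4 for the Itanium); those per-cycle figures are inferred from the "Theo peak" column, not stated in the timings, and the names `MACHINES` and `percentages` are made up for illustration.

```python
# Peak numbers copied from the summary tables (MFLOPS at the best N).
# flops_per_cycle is an assumption inferred from "Theo peak" / MHz.
MACHINES = {
    "ATH":  {"mhz": 1000, "flops_per_cycle": 2, "dmm": 1192.6, "dlu": 1003.1},
    "P4":   {"mhz": 1500, "flops_per_cycle": 1, "dmm": 1073.9, "dlu": 986.1},
    "IA64": {"mhz": 666,  "flops_per_cycle": 4, "dmm": 1866.3, "dlu": 1336.0},
}


def percentages(m):
    """Return the derived columns of both summary tables for one machine."""
    peak = m["mhz"] * m["flops_per_cycle"]  # theoretical peak in MFLOPS
    return {
        "peak": peak,
        "dmm_pct_of_mhz": 100.0 * m["dmm"] / m["mhz"],
        "dlu_pct_of_mhz": 100.0 * m["dlu"] / m["mhz"],
        "dmm_pct_of_peak": 100.0 * m["dmm"] / peak,
        "dlu_pct_of_peak": 100.0 * m["dlu"] / peak,
        "dlu_pct_of_dmm": 100.0 * m["dlu"] / m["dmm"],
    }


if __name__ == "__main__":
    for name, m in MACHINES.items():
        p = percentages(m)
        print(f"{name:5s} peak={p['peak']:5d} "
              f"dMM%MHz={p['dmm_pct_of_mhz']:6.1f} "
              f"dLU%MHz={p['dlu_pct_of_mhz']:6.1f} "
              f"dMM%peak={p['dmm_pct_of_peak']:5.1f} "
              f"dLU%peak={p['dlu_pct_of_peak']:5.1f} "
              f"dLU%dMM={p['dlu_pct_of_dmm']:5.1f}")
```

Running it should reproduce the table entries to within rounding, which is a handy sanity check when adding a new machine to the roundup.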
> OK, so peak performance-wise (N=3000 is the largest size I timed: both
> Athlon and IA64 LU numbers were still getting better, as you would expect by
> looking at their LU % of MM numbers), without SSE2, it looks like the P4 will
> need to be clocked about 1.66 times faster than an Athlon to maintain the
> same GEMM peak, and about 1.53 times faster to maintain the same LU peak.
> Since the LU peak should be perked up quite a bit by faster memory, it may
> look more like the MM numbers soon. So, under these conditions, Athlon is
> the fp king of the two. Athlon is far and away the flops/$ champion, and as
> far as I know, this is true of any machine on the market.
>
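The clock ratios and the flops/$ claim in the paragraph above can be checked with a few lines of arithmetic. This is only a sketch: it assumes performance scales linearly with clock speed (which ignores memory effects) and uses the Gateway list prices quoted at the top as the cost. The helper names `clock_ratio` and `mflops_per_dollar` are hypothetical.

```python
# Peak MFLOPS, clock rate (MHz) and quoted Gateway price (USD) from the post.
# The Itanium is left out because no price was given for it.
ATH = {"mhz": 1000, "dmm": 1192.6, "dlu": 1003.1, "price": 1269}
P4 = {"mhz": 1500, "dmm": 1073.9, "dlu": 986.1, "price": 2109}


def clock_ratio(fast, slow, key):
    """How much higher `slow` must be clocked than `fast` to match its
    MFLOPS on benchmark `key`, assuming MFLOPS scales linearly with MHz."""
    per_mhz_fast = fast[key] / fast["mhz"]
    per_mhz_slow = slow[key] / slow["mhz"]
    return per_mhz_fast / per_mhz_slow


def mflops_per_dollar(m, key):
    """Peak MFLOPS on benchmark `key` divided by the quoted system price."""
    return m[key] / m["price"]


if __name__ == "__main__":
    print("P4/Athlon clock ratio for equal GEMM:", clock_ratio(ATH, P4, "dmm"))
    print("P4/Athlon clock ratio for equal LU:  ", clock_ratio(ATH, P4, "dlu"))
    print("Athlon GEMM MFLOPS/$:", mflops_per_dollar(ATH, "dmm"))
    print("P4 GEMM MFLOPS/$:    ", mflops_per_dollar(P4, "dmm"))
```

With the numbers above the ratios come out at roughly 1.67 for GEMM and 1.53 for LU, and the Athlon delivers close to twice the GEMM MFLOPS per dollar of the P4.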
> Anyway, the full timings are given below. You'll see that the P4 does well
> early (probably due to superior memory), with the IA64 doing really poorly
> for small problems (memory again).
>
> Cheers,
> Clint
>
>
>              100     200     300     400     500     600     700     800     900    1000
>            =====   =====   =====   =====   =====   =====   =====   =====   =====   =====
> ATH  dMM   909.1  1010.5  1080.0  1163.6  1087.0  1136.8  1143.3  1190.7  1205.0  1156.1
> P4   dMM   952.4  1010.5  1080.0   984.6  1041.7  1080.0  1055.4  1077.9  1088.1  1075.3
> IA64 dMM   866.3  1247.9  1472.4  1566.6  1570.6  1708.0  1645.1  1730.3  1710.2  1741.5
>
> ATH  dLU   477.4   611.8   695.0   709.8   780.1   777.4   815.8   793.1   823.0   865.2
> P4   dLU   435.8   611.8   718.2   788.6   805.2   821.8   878.5   874.4   882.9   888.2
> IA64 dLU   241.2   419.4   554.3   652.8   754.2   800.4   832.4   873.0   926.0   937.0
>
>             1200    1400    1600    1800    2000    2200    2400    2600    2800    3000
>            =====   =====   =====   =====   =====   =====   =====   =====   =====   =====
> ATH  dMM  1183.6  1172.6  1175.3  1192.6  1179.9  1175.3  1189.7  1191.2  1190.1  1187.3
> P4   dMM  1066.7  1067.7  1066.7  1065.3  1071.0  1073.9  1073.3  1072.7  1073.4   852.3
> IA64 dMM  1789.2  1809.9  1820.4  1858.1  1840.1  1823.3  1832.5  1810.5  1862.2  1866.3
>
> ATH  dLU   878.8   923.4   925.2   943.2   950.3   965.5   974.9   983.5   994.6  1003.1
> P4   dLU   906.5   932.8   937.9   950.2   955.4   965.5   969.8   977.8   975.4   986.1
> IA64 dLU   990.7  1047.1  1077.9  1149.2  1179.4  1208.7  1240.7  1272.7  1305.7  1336.0
>
>                   GEMM    SYMM    SYRK   SYR2K    TRMM    TRSM
>                  =====   =====   =====   =====   =====   =====
> ATH-1     500  1136.4  1000.0   835.0  1087.0   961.5   961.5
> P4-1.5    500  1041.7   961.5   835.0  1000.0   892.9  1041.7
> IA64-666  500  1610.1  1201.9  1462.9  1462.9  1082.5   816.6
>
>
--
Camm Maguire camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah