updated P4 timings

OK, I include below the updated P4 timings (all rates in MFLOPS). I went
ahead and timed all 4 types/precisions, and threw in Cholesky as well. I've
kept my original timings (labeled P40 instead of P4) for comparison. In
particular, we see that the large blocking factor for SSE provides much
better GEMM performance, but that single-precision LU is slower than before
until we get to very large cases (~2000), probably due to the inadequacy of
the cleanup (if you can call 3Gflop LU inadequate :) ...
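
To make the crossover concrete, here is a small sketch that scans the
single-precision LU rows from the tables in this message (P40 sLU and
P4 sLU) for the first problem size at which the new build is faster. The
function name is hypothetical, chosen only for illustration.

```python
# Problem sizes and single-precision LU rates (MFLOPS), copied from the
# two LU/GEMM tables in this message.
sizes = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000,
         1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800, 3000]
p40_slu = [606.5, 1153.8, 1529.5, 1703.5, 1808.9, 2054.6, 2284.2, 2200.1,
           2428.0, 2467.3, 2616.5, 2688.8, 2785.1, 2922.1, 2913.3, 2994.2,
           3040.6, 2074.5, 3093.2, 3118.8]
p4_slu = [537.8, 921.9, 1182.9, 1525.5, 1664.2, 1830.4, 2003.7, 2087.8,
          2174.3, 2337.4, 2398.5, 2539.4, 2702.4, 2796.0, 2961.8, 2932.3,
          3050.7, 3090.8, 3153.2, 3168.2]


def first_size_where_new_wins(sizes, old_rates, new_rates):
    """Return the smallest size at which new_rates exceeds old_rates."""
    for n, old, new in zip(sizes, old_rates, new_rates):
        if new > old:
            return n
    return None  # the new build never wins in the measured range


crossover = first_size_where_new_wins(sizes, p40_slu, p4_slu)
print("P4 sLU first beats P40 sLU at N =", crossover)
```

The two builds stay close just past that point, so the crossover is best
read as a rough threshold rather than a sharp one.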

For the larger problem sizes, we see that zMM loses performance. This is
because the machine is running out of memory: ATLAS has to use less and less
workspace (which causes more and more cache thrashing) until around N=2400,
where swapping sets in.
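
For a rough sense of scale, the sketch below estimates the memory taken by
the three N x N operands of a matrix multiply. It assumes 16 bytes per
double-complex element and counts only the operands, not ATLAS's workspace
or anything else the timer allocates. The amount of memory on the test
machine is not stated in this message, so this only shows how quickly the
footprint grows; the function name is hypothetical.

```python
def gemm_operand_bytes(n, bytes_per_element):
    """Bytes needed for the A, B and C operands of an n x n matrix multiply."""
    return 3 * n * n * bytes_per_element


# 16 bytes per element assumes double-precision complex (two 8-byte doubles).
for n in (2000, 2200, 2400):
    mib = gemm_operand_bytes(n, 16) / 2**20
    print(f"N = {n}: about {mib:.1f} MiB for A, B and C")
```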

The relatively terrible TRSM performance of the SSE-enabled code arises
because accuracy concerns prevent us from inverting the diagonal blocks and
using a GEMM-based kernel; we therefore have to drop to the x86 FPU (with
its associated 1/4 of theoretical peak) for that part of the computation.
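
The trade-off can be illustrated with a minimal blocked triangular solve.
This is a pure-Python sketch for illustration only, not ATLAS code, and all
function names are hypothetical. Each diagonal block is either solved by
substitution (divide-based, the accurate route) or handled by forming its
explicit inverse and multiplying, which turns that step into a matrix
multiply but is generally less accurate for poorly conditioned blocks.

```python
def forward_substitute(L, B):
    """Solve L X = B for lower-triangular L by substitution (uses divides)."""
    n, m = len(L), len(B[0])
    X = [row[:] for row in B]
    for i in range(n):
        for c in range(m):
            s = X[i][c]
            for k in range(i):
                s -= L[i][k] * X[k][c]
            X[i][c] = s / L[i][i]
    return X


def invert_lower(L):
    """Explicit inverse of a lower-triangular block (solve L X = I)."""
    n = len(L)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return forward_substitute(L, identity)


def matmul(A, B):
    """Plain matrix product, standing in for a GEMM kernel."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def blocked_trsm(L, B, nb, invert_diagonal):
    """Solve L X = B block row by block row with block size nb."""
    n, m = len(L), len(B[0])
    X = [row[:] for row in B]
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # GEMM-style update: remove the contribution of rows already solved.
        for i in range(k0, k1):
            for c in range(m):
                X[i][c] -= sum(L[i][j] * X[j][c] for j in range(k0))
        Lkk = [row[k0:k1] for row in L[k0:k1]]
        if invert_diagonal:
            X[k0:k1] = matmul(invert_lower(Lkk), X[k0:k1])  # multiply only
        else:
            X[k0:k1] = forward_substitute(Lkk, X[k0:k1])    # divides
    return X


L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 3.0, 0.0, 0.0],
     [1.0, 1.0, 4.0, 0.0],
     [1.0, 1.0, 1.0, 5.0]]
B = [[2.0, 4.0], [7.0, 5.0], [9.0, 3.0], [8.0, 1.0]]
x_sub = blocked_trsm(L, B, 2, invert_diagonal=False)
x_inv = blocked_trsm(L, B, 2, invert_diagonal=True)
```

On a small, well-conditioned matrix like this one the two variants agree to
rounding error; the accuracy concern shows up when the diagonal blocks are
badly conditioned.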
Cheers,
Clint

             100     200     300     400     500     600     700     800     900    1000
          ======  ======  ======  ======  ======  ======  ======  ======  ======  ======
P40 dMM    952.4  1010.5  1080.0   984.6  1041.7  1080.0  1055.4  1077.9  1088.1  1075.3
P4 dMM    1025.6  1194.0  1181.2  1238.7  1209.7  1234.3  1247.3  1264.2  1276.8  1242.2
P40 dLU    435.8   611.8   718.2   788.6   805.2   821.8   878.5   874.4   882.9   888.2
P4 dLU     428.7   659.8   763.1   851.7   887.6   933.9   951.8   974.3  1033.2  1040.9
P4 dLLt    400.1   547.1   615.1   713.8   759.8   820.2   929.0   876.9   918.5  1011.6
P40 sMM   2500.0  3674.1  3240.0  3584.0  3571.4  3600.0  3811.1  3657.1  3645.0  3703.7
P4 sMM    2631.6  3100.0  3351.7  3895.7  4000.0  3756.5  3811.1  4096.0  4050.0  4000.0
P40 sLU    606.5  1153.8  1529.5  1703.5  1808.9  2054.6  2284.2  2200.1  2428.0  2467.3
P4 sLU     537.8   921.9  1182.9  1525.5  1664.2  1830.4  2003.7  2087.8  2174.3  2337.4
P4 sLLt    430.4   727.8   753.8  1162.4  1266.4  1546.7  1697.5  1832.0  1872.3  1963.7
P4 zMM    1184.0  1163.6  1200.0  1219.0  1219.5  1216.9  1219.6  1219.0  1212.5  1201.2
P4 zLU     521.0   749.2   846.0   897.4   951.7   992.5  1027.2  1041.8  1056.1  1062.1
P4 zLLt    604.5   780.1   837.1   876.0   899.6   889.1   928.0   928.0
P4 cMM    2755.6  2968.7  3085.7  4266.7  4000.0  3927.3  2976.8  4137.4  4107.0  4060.9
P4 cLU     636.8   986.8  1232.7  1598.5  1708.1  1856.9  2031.5  2201.1  2341.2  2423.3
P4 cLLt   1078.7   906.8  1716.3  1674.2  1927.2  1911.7  2013.5  2118.3  2121.2

            1200    1400    1600    1800    2000    2200    2400    2600    2800    3000
          ======  ======  ======  ======  ======  ======  ======  ======  ======  ======
P40 dMM   1066.7  1067.7  1066.7  1065.3  1071.0  1073.9  1073.3  1072.7  1073.4   852.3
P4 dMM    1256.7  1250.1  1254.5  1262.3  1261.8  1258.6  1261.3  1261.7  1262.0  1260.5
P40 dLU    906.5   932.8   937.9   950.2   955.4   965.5   969.8   977.8   975.4   986.1
P4 dLU    1046.6  1075.5  1091.8  1107.2  1110.7  1128.2  1126.3  1130.7  1139.5  1158.0
P4 dLLt    961.2  1017.4  1035.3  1074.9  1050.7  1089.5  1072.3  1081.6  1097.6  1108.9
P40 sMM   3676.6  3683.2  3673.5  3679.5  3661.3  3665.4  3671.7  3657.9  3670.9  3658.5
P4 sMM    4114.3  4126.3  4137.4  4107.0  4166.7  4127.1  4182.8  4140.4  4189.3  4150.7
P40 sLU   2616.5  2688.8  2785.1  2922.1  2913.3  2994.2  3040.6  2074.5  3093.2  3118.8
P4 sLU    2398.5  2539.4  2702.4  2796.0  2961.8  2932.3  3050.7  3090.8  3153.2  3168.2
P4 sLLt   2306.9  2543.5  2484.8  2560.0  2668.7  2732.1  2828.8  2916.4  2940.3  3062.8
P4 cMM    4065.9  3991.3  4030.5  3957.3  4015.1  3806.3  3818.8  3773.7  3063.2  2544.2
P4 cLU    1712.5  1750.1  1835.3  1893.9  1930.3  1972.9  2016.3  2030.4  2064.6  2092.8
P4 zMM    1189.7  1159.6  1159.1  1157.1  1156.0   794.2  ..............................
P4 zLU     872.5   806.5   772.3   763.3   737.8   714.9   691.4   680.9   674.4   670.6

            GEMM    SYMM    SYRK   SYR2K    TRMM    TRSM
          ======  ======  ======  ======  ======  ======
P40 d500  1041.7   961.5   835.0  1000.0   892.9  1041.7
P4 d500   1209.7  1171.9  1002.0  1209.7  1056.3  1056.3
P4 z500   1204.8  1190.5  1043.8  1204.8  1042.7   981.4
P40 s500  3571.4  3125.0  2636.8  3333.3  2941.2  2500.0
P4 s500   3571.4  4166.7  3006.0  3571.4  3125.0  2419.4
P4 c500   4000.0  3846.2  3131.3  4166.7  3575.0  1787.5