updated P4 timings
OK, I include below the updated P4 timings.  I went ahead and timed all 4
types/precisions, and threw in Cholesky (LLt) as well.  I've kept my original
timings (indicated by P40 instead of P4) for comparison.  In particular, we
see that the large blocking factor for SSE provides much better GEMM
performance, but that LU (see sLU) is slowed down until we get to very large
cases (~2000), probably due to the inadequacy of the cleanup, if you can call
3Gflop LU inadequate :)
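For anyone wanting to relate the table entries to wall-clock times, here is a minimal sketch. It assumes the figures are MFLOPS computed from the usual leading-order flop counts (2n^3 for GEMM, 2n^3/3 for LU, n^3/3 for Cholesky, with a factor of 4 more for complex arithmetic); the message does not state the unit or the counts, and the function names and the 0.24 s timing are made up for illustration.

```python
def gemm_flops(n, complex_arith=False):
    """Leading-order flop count for an n x n x n matrix multiply."""
    return (8 if complex_arith else 2) * n ** 3


def lu_flops(n, complex_arith=False):
    """Leading-order flop count for LU factorization of an n x n matrix."""
    return (8 if complex_arith else 2) * n ** 3 / 3.0


def cholesky_flops(n, complex_arith=False):
    """Leading-order flop count for Cholesky (LLt) of an n x n matrix."""
    return (4 if complex_arith else 1) * n ** 3 / 3.0


def mflops(flops, seconds):
    """Convert a flop count and an elapsed time into MFLOPS."""
    return flops / seconds / 1.0e6


if __name__ == "__main__":
    # Hypothetical timing: a real N=500 GEMM that took 0.24 seconds.
    print(round(mflops(gemm_flops(500), 0.24), 1))  # -> 1041.7
```

Under these assumptions, a rate in the ~1000 range at N=500 corresponds to roughly a quarter of a second per real GEMM call.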
For the larger problem sizes, we see that zMM loses performance.  This is due
to running out of memory, with ATLAS having to use less and less workspace
(which causes more and more cache thrashing), until around 2400, where
swapping sets in.
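To give a feel for the memory pressure, here is a rough sketch of the operand footprint for a square GEMM. It counts only the three matrices A, B and C and assumes 16 bytes per double-complex element; the machine's memory size and the amount of workspace ATLAS requests are not given in this message, so neither appears here. The helper name is made up.

```python
def matrix_mbytes(n, bytes_per_element):
    """Storage for one dense n x n matrix, in units of 10^6 bytes."""
    return n * n * bytes_per_element / 1.0e6


if __name__ == "__main__":
    # Double complex = two 8-byte doubles per element (assumed).
    for n in (1000, 2000, 2400):
        total = 3 * matrix_mbytes(n, 16)  # A, B and C only, no workspace
        print(n, round(total, 2))
```

At N=2400 the three operands alone come to about 276 MB before any copy workspace is added.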
The relatively terrible performance of TRSM for the SSE-enabled code is
because accuracy concerns prevent us from inverting the diagonal blocks and
using a GEMM-based kernel for them, so we have to drop to the x86 FPU (with
its associated 1/4 of theoretical peak) for that part of the computation.
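To show which part of TRSM is meant, here is a minimal pure-Python sketch of a blocked lower-triangular solve L X = B. It is not ATLAS's code; the function names and the block size nb are illustrative assumptions. The off-diagonal update is a plain matrix multiply and can go through the fast GEMM kernel, while the diagonal-block solve is the piece that would only become a multiply if the block were explicitly inverted.

```python
def solve_diagonal_block(L, B, lo, hi):
    """Forward substitution on rows lo..hi-1 (the non-GEMM part)."""
    ncols = len(B[0])
    for i in range(lo, hi):
        for j in range(ncols):
            s = B[i][j]
            for k in range(lo, i):
                s -= L[i][k] * B[k][j]
            B[i][j] = s / L[i][i]


def gemm_update(L, B, lo, hi, n):
    """B[hi:n] -= L[hi:n, lo:hi] * B[lo:hi] (the GEMM-shaped part)."""
    ncols = len(B[0])
    for i in range(hi, n):
        for j in range(ncols):
            s = 0.0
            for k in range(lo, hi):
                s += L[i][k] * B[k][j]
            B[i][j] -= s


def blocked_trsm_lower(L, B, nb):
    """Overwrite B with the solution X of L X = B, nb rows at a time."""
    n = len(L)
    for lo in range(0, n, nb):
        hi = min(lo + nb, n)
        solve_diagonal_block(L, B, lo, hi)  # stays on the slow path
        gemm_update(L, B, lo, hi, n)        # can use the fast GEMM kernel
    return B


if __name__ == "__main__":
    L = [[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [4.0, 2.0, 4.0]]
    B = [[2.0], [3.0], [20.0]]
    print(blocked_trsm_lower(L, B, nb=2))
```

The larger nb is relative to the problem size, the larger the share of work that falls in the diagonal-block solve rather than in the multiply.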
Cheers,
Clint
             100    200    300    400    500    600    700    800    900   1000
          ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
P40  dMM   952.4 1010.5 1080.0  984.6 1041.7 1080.0 1055.4 1077.9 1088.1 1075.3
P4   dMM  1025.6 1194.0 1181.2 1238.7 1209.7 1234.3 1247.3 1264.2 1276.8 1242.2
P40  dLU   435.8  611.8  718.2  788.6  805.2  821.8  878.5  874.4  882.9  888.2
P4   dLU   428.7  659.8  763.1  851.7  887.6  933.9  951.8  974.3 1033.2 1040.9
P4   dLLt  400.1  547.1  615.1  713.8  759.8  820.2  929.0  876.9  918.5 1011.6
P40  sMM  2500.0 3674.1 3240.0 3584.0 3571.4 3600.0 3811.1 3657.1 3645.0 3703.7
P4   sMM  2631.6 3100.0 3351.7 3895.7 4000.0 3756.5 3811.1 4096.0 4050.0 4000.0
P40  sLU   606.5 1153.8 1529.5 1703.5 1808.9 2054.6 2284.2 2200.1 2428.0 2467.3
P4   sLU   537.8  921.9 1182.9 1525.5 1664.2 1830.4 2003.7 2087.8 2174.3 2337.4
P4   sLLt  430.4  727.8  753.8 1162.4 1266.4 1546.7 1697.5 1832.0 1872.3 1963.7
P4   zMM  1184.0 1163.6 1200.0 1219.0 1219.5 1216.9 1219.6 1219.0 1212.5 1201.2
P4   zLU   521.0  749.2  846.0  897.4  951.7  992.5 1027.2 1041.8 1056.1 1062.1
P4   zLLt                604.5  780.1  837.1  876.0  899.6  889.1  928.0  928.0
P4   cMM  2755.6 2968.7 3085.7 4266.7 4000.0 3927.3 2976.8 4137.4 4107.0 4060.9
P4   cLU   636.8  986.8 1232.7 1598.5 1708.1 1856.9 2031.5 2201.1 2341.2 2423.3
P4   cLLt        1078.7  906.8 1716.3 1674.2 1927.2 1911.7 2013.5 2118.3 2121.2
            1200   1400   1600   1800   2000   2200   2400   2600   2800   3000
          ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
P40  dMM  1066.7 1067.7 1066.7 1065.3 1071.0 1073.9 1073.3 1072.7 1073.4  852.3
P4   dMM  1256.7 1250.1 1254.5 1262.3 1261.8 1258.6 1261.3 1261.7 1262.0 1260.5
P40  dLU   906.5  932.8  937.9  950.2  955.4  965.5  969.8  977.8  975.4  986.1
P4   dLU  1046.6 1075.5 1091.8 1107.2 1110.7 1128.2 1126.3 1130.7 1139.5 1158.0
P4   dLLt  961.2 1017.4 1035.3 1074.9 1050.7 1089.5 1072.3 1081.6 1097.6 1108.9
P40  sMM  3676.6 3683.2 3673.5 3679.5 3661.3 3665.4 3671.7 3657.9 3670.9 3658.5
P4   sMM  4114.3 4126.3 4137.4 4107.0 4166.7 4127.1 4182.8 4140.4 4189.3 4150.7
P40  sLU  2616.5 2688.8 2785.1 2922.1 2913.3 2994.2 3040.6 2074.5 3093.2 3118.8
P4   sLU  2398.5 2539.4 2702.4 2796.0 2961.8 2932.3 3050.7 3090.8 3153.2 3168.2
P4   sLLt 2306.9 2543.5 2484.8 2560.0 2668.7 2732.1 2828.8 2916.4 2940.3 3062.8
P4   cMM  4065.9 3991.3 4030.5 3957.3 4015.1 3806.3 3818.8 3773.7 3063.2 2544.2
P4   cLU  1712.5 1750.1 1835.3 1893.9 1930.3 1972.9 2016.3 2030.4 2064.6 2092.8
P4   zMM  1189.7 1159.6 1159.1 1157.1 1156.0  794.2 ...........................
P4   zLU   872.5  806.5  772.3  763.3  737.8  714.9  691.4  680.9  674.4  670.6
                          GEMM   SYMM   SYRK  SYR2K   TRMM   TRSM
                         =====  =====  =====  =====  =====  =====
P40      d500           1041.7  961.5  835.0 1000.0  892.9 1041.7
P4       d500           1209.7 1171.9 1002.0 1209.7 1056.3 1056.3
P4       z500           1204.8 1190.5 1043.8 1204.8 1042.7  981.4
P40      s500           3571.4 3125.0 2636.8 3333.3 2941.2 2500.0
P4       s500           3571.4 4166.7 3006.0 3571.4 3125.0 2419.4
P4       c500           4000.0 3846.2 3131.3 4166.7 3575.0 1787.5