UPDATED MKL 5.0 vs. ATLAS 3.2.0 on 933MHz PIII
Guys,
This is the same timing mail, but with an error fixed. I got to wondering
why MKL was beating us for DGER, and when I scoped the defaults, I noticed
ATLAS was not using Camm's prefetch GER, but my old axpy-based implementation.
I got rid of the erroneous default, and Camm's stuff provided GER speedup.
Updated timings and discussion follow,
Clint
Some guys at Intel have been asking me to publish some ATLAS vs MKL numbers,
since most of my previous graphs compared against Greg Henry's BLAS. I used
to compare against Greg's BLAS 'cause MKL wasn't available under Linux, and
it is always a pain for me to get access to a Windows platform. MKL 5.1
is presently in BETA, and it has a Linux version. Since it's a BETA,
however, Intel requires you to agree to an NDA saying you won't publish
any benchmarks using it, and the Intel people have been unable to free
me from the NDA.
I've been working on the Windows stuff lately, however, and once I figured
out how to call MKL, I was able to get numbers with MKL 5.0, which does not
have a no-publish NDA. Because I agree that comparing against Greg's stuff
is not the thing to do, I tried to do a fairly wide range of timings to
clear the air here. I include these timings below.
If I had to summarize these PIII timings, it would be that ATLAS blows chunks
for Level 1 BLAS, tends to beat MKL for Level 2 BLAS, and varies between
quite a bit slower and quite a bit faster than MKL for Level 3 BLAS,
depending on problem size and data type.
The Level 1 results are easily explained. ATLAS's present Level 1 gets
its optimization mainly from the compiler. This gives MKL two huge
advantages. The first is that MKL can use the SSE prefetch instructions to
speed up pretty much all Level 1 ops. The second is in how ABS() is done.
ABS() *should* be a 1-cycle operation, since you can just mask off the
sign bit. However, you cannot standardly do bit operations on floats in
ANSI C, so ATLAS has to use an if-type construct instead. This spells
absolute doom for the performance of NRM2, ASUM and AMAX.
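
The two ways of computing ABS() described above can be sketched as follows.
This is only an illustration, not ATLAS or MKL code: the function names are
hypothetical, and the mask version assumes IEEE 754 doubles and C99's
<stdint.h>. That representation is exactly what ANSI C does not guarantee,
which is the portability problem in question.

```c
#include <stdint.h>
#include <string.h>

/* Branch-based absolute value: the portable "if-type construct". */
double abs_branch(double x)
{
    return (x < 0.0) ? -x : x;
}

/* Mask-based absolute value: clear the sign bit (bit 63).
 * ASSUMPTION: double is a 64-bit IEEE 754 value. */
double abs_mask(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);       /* reinterpret without aliasing issues */
    bits &= UINT64_C(0x7FFFFFFFFFFFFFFF); /* drop the sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* ASUM-style reduction (sum of absolute values) using the mask version. */
double asum_mask(int n, const double *x, int incx)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < n; i++)
        sum += abs_mask(x[i * incx]);
    return sum;
}
```

Whether a given compiler turns abs_mask() into a single mask instruction is
not something this sketch can promise; it only shows the operation that a
hand-written assembly routine is free to use directly.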
For the Level 2 and 3, ATLAS has its usual advantage of leveraging basic
kernels to the maximum. This means that all Level 3 ops follow the performance
of GEMM, and Level 2 ops follow GER or GEMV. MKL has the usual disadvantage
of optimizing all these routines separately, leading to widely varying
performance.
For Level 2, ATLAS wins for pretty much all operations, sizes and precisions
other than small case [S,D] TRSV and TRMV. ATLAS's success here is due mainly
to Camm's excellent prefetched Level 2 GEMV and GER kernels.
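
For readers who have not seen the idea, here is a minimal sketch of an
axpy-based GER with a software prefetch hint added. It is NOT Camm's kernel,
which is not shown in this mail. The name ger_prefetch and the PREFETCH_W
macro are hypothetical, and the hint relies on the GCC/Clang builtin
__builtin_prefetch (it compiles to nothing on other compilers).

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
/* Hint: the line at p will be written soon; keep it in all cache levels. */
#define PREFETCH_W(p) __builtin_prefetch((p), 1, 3)
#else
#define PREFETCH_W(p) ((void)0)
#endif

/* A += alpha * x * y^T, with A an M x N column-major matrix of leading
 * dimension lda. Each column update is one axpy: A(:,j) += (alpha*y[j]) * x. */
void ger_prefetch(int M, int N, double alpha, const double *x,
                  const double *y, double *A, int lda)
{
    int i, j;
    for (j = 0; j < N; j++) {
        double *col = A + (size_t)j * lda;
        const double t = alpha * y[j];
        /* Ask for the start of the next column while working on this one. */
        if (j + 1 < N)
            PREFETCH_W(col + lda);
        for (i = 0; i < M; i++)
            col[i] += t * x[i];
    }
}
```

A tuned kernel would prefetch at a carefully chosen distance inside the inner
loop and use SSE directly; this version only shows where such a hint goes.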
For the Level 3, we really have a mixed bag. ATLAS's main weakness is in its
complex TRSM. This is because TRSM cannot use the GEMM kernel as much as
the rest of the operations. Anytime TRSM runs slower than TRMM, this is
the reason. Complex is hit harder than real because I wrote a hand-tuned
kernel for real, while for complex we must recur all the way down to a
problem size of 1. The fix for this poor performance requires some theory
that we don't yet have: details of the problem are posted on the developer
site, if anyone is interested. ATLAS is also in general less good at small
problems than MKL.
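
To make the recursion concrete, here is a minimal sketch of a recursive
Left/Lower/Notrans/Nonunit TRSM in real double precision. It is not the
ATLAS source. The names trsm_small and trsm_rec and the stop parameter are
hypothetical: stop=1 models recurring all the way to 1, while a larger stop
models handing small triangles to a dedicated kernel.

```c
#include <stddef.h>

/* Small-case solve by forward substitution: L * X = B, with L an n x n
 * lower-triangular, non-unit matrix and B an n x nrhs matrix overwritten
 * by X. Both are column-major. */
void trsm_small(int n, int nrhs, const double *L, int ldl,
                double *B, int ldb)
{
    int i, j, k;
    for (j = 0; j < nrhs; j++) {
        double *b = B + (size_t)j * ldb;
        for (i = 0; i < n; i++) {
            double s = b[i];
            for (k = 0; k < i; k++)
                s -= L[i + (size_t)k * ldl] * b[k];
            b[i] = s / L[i + (size_t)i * ldl];
        }
    }
}

/* Recursive solve: split L into [L11 0; L21 L22], solve with L11, apply
 * the GEMM-shaped update B2 -= L21 * X1, then solve with L22. */
void trsm_rec(int n, int nrhs, const double *L, int ldl,
              double *B, int ldb, int stop)
{
    int n1, n2, i, j, k;
    const double *L21, *L22;
    double *B2;

    if (n <= stop || n <= 1) {
        trsm_small(n, nrhs, L, ldl, B, ldb);
        return;
    }
    n1 = n / 2;
    n2 = n - n1;
    L21 = L + n1;
    L22 = L + n1 + (size_t)n1 * ldl;
    B2 = B + n1;

    trsm_rec(n1, nrhs, L, ldl, B, ldb, stop);
    /* This triple loop is the part a GEMM kernel would perform. */
    for (j = 0; j < nrhs; j++)
        for (k = 0; k < n1; k++) {
            const double x = B[k + (size_t)j * ldb];
            for (i = 0; i < n2; i++)
                B2[i + (size_t)j * ldb] -= L21[i + (size_t)k * ldl] * x;
        }
    trsm_rec(n2, nrhs, L22, ldl, B2, ldb, stop);
}
```

Only the update between the two recursive calls is GEMM-shaped; the
triangular solves at the leaves are not, which is why the cost of the base
case matters so much for small problems.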
The main weakness of MKL in the Level 3 operations is in its handling of
single precision complex, where it doesn't look like they have SSE
optimizations yet. MKL also tends to lose to ATLAS on pretty much everything
except GEMM for large problems.
For the factorizations, ATLAS tends to lose for small problems, and win for
large. In part, this is because we recur down to 1; I am hoping to include
LU and possibly LLt that stop the recursion before one in the next developer
release. Preliminary timings show this to make a large performance difference
for small problem sizes. For complex, the poor small-size TRSM performance
also has a definite impact, and a crushing one for LLt.
Cheers,
Clint
*******************************************************************************
* NOTES *
*******************************************************************************
All timings were taken on a 933MHz PIII, 256K L2, under Windows 2000, using
MKL 5.0 and ATLAS 3.2.0.
The ATLAS timers were used: this may mean performance is less than with
other timers, as ATLAS flushes the data caches before each call.
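
As a rough sketch of what timing with flushed caches means (this is not the
ATLAS timer itself), one can sweep a buffer larger than the cache before
each timed call. All names here are hypothetical, the buffer size is left to
the caller, and the 2*N^3 flop count mentioned below is the usual convention
for GEMM rather than something stated in this mail.

```c
#include <stddef.h>
#include <time.h>

/* Write and then read n doubles so that previously cached operand data is
 * evicted. ASSUMPTION: the caller passes a buffer larger than the cache. */
double flush_cache(double *buf, size_t n)
{
    double sum = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        buf[i] = 1.0;
    for (i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* Time a single call of op(arg) with cold caches; returns CPU seconds. */
double time_cold(void (*op)(void *), void *arg, double *buf, size_t n)
{
    clock_t t0;
    volatile double sink = flush_cache(buf, n); /* keep the sweep alive */
    (void)sink;
    t0 = clock();
    op(arg);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Convert a flop count and a time in seconds into MFLOPS. */
double mflops(double flops, double seconds)
{
    return flops / (seconds * 1.0e6);
}
```

For example, an N=1000 GEMM counted as 2*N^3 = 2e9 flops that takes one
second would be reported as mflops(2.0e9, 1.0), i.e. 2000 MFLOPS.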
For all timings, M=K=N, alpha=1.0, beta=1.0, Side='Left', Uplo='Lower',
TRANS='Notrans', DIAG='Nonunit', except for the Level 1, where alpha=2.0 for
real, and (2.0, 2.2) for complex.
No timings are given for 500x500 HERK and HER2K for MKL, 'cause this call gave
an access violation.
MKL does not possess the Level 1 routines DSDOT and SDSDOT.
No timings are given for N=100 or 200 complex Cholesky, 'cause our timer
couldn't get enough accuracy to be repeatable.
There's a lot of other timings that could be done, but I'm unlikely to do them.
I will be posting the library I built to do these timings to the prebuilt page
(and it was just a standard ATLAS install, anyway, if you want to install
yourself), if other people would like to time further.
Timings either have problem size or operation along X axis. When problem
size is along the X axis, library (MKL for MKL 5.0, ATL for ATLAS 3.2.0),
data type (S: single real, D: double real, C: single complex, Z: double complex)
and operation are given along Y. When operation is along the X axis,
library, data type and problem size are given along Y.
LU is GETRF, LLT is POTRF.
Theoretical peak for double precision for this machine is 933 MFLOPS. For
single precision using SSE (as both libraries do), theoretical peak is
3.732 GFLOPS.
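
The peak figures above are just clock rate times flops per cycle. A small
sketch of the arithmetic follows; the per-cycle counts of 1 (double) and
4 (single with SSE) are inferred from the stated peaks and are not given
explicitly in this mail, and the function names are hypothetical.

```c
/* Peak rate in MFLOPS = clock in MHz * floating point ops per cycle. */
double peak_mflops(double clock_mhz, double flops_per_cycle)
{
    return clock_mhz * flops_per_cycle;
}

/* Fraction of peak achieved by a measured rate, as a percentage. */
double percent_of_peak(double measured_mflops, double peak)
{
    return 100.0 * measured_mflops / peak;
}
```

With these, peak_mflops(933.0, 1.0) gives 933 and peak_mflops(933.0, 4.0)
gives 3732. Taking two entries from the Level 3 table at N=1000, ATL DGEMM
at 677.0 is roughly 73% of double precision peak, and ATL SGEMM at 1610.3
is roughly 43% of single precision peak.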
*******************************************************************************
* LEVEL 3 TIMINGS *
*******************************************************************************
              100    200    300    400    500    600    700    800    900   1000
           ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
MKL SGEMM  1327.7 1445.3 1400.6 1672.4 1584.3 1592.5 1661.6 1724.6 1675.9 1662.5
ATL SGEMM   911.6 1359.5 1347.4 1492.8 1502.4 1543.9 1544.5 1569.3 1599.9 1610.3
MKL DGEMM   640.2  648.4  648.0  664.4  680.3  673.9  697.2  704.7  691.3  699.5
ATL DGEMM   551.9  622.3  635.3  646.5  653.6  673.9  665.4  682.7  675.9  677.0
MKL CGEMM   773.8  818.8  766.0  819.2  810.4  825.6  820.6  829.5  825.7  825.8
ATL CGEMM  1094.9 1449.1 1542.9 1561.0 1524.4 1556.8 1554.7 1588.8 1595.2 1610.0
MKL ZGEMM   610.8  664.4  692.3  745.3  727.3  747.4  734.9  753.4  737.6  740.9
ATL ZGEMM   599.0  647.6  727.9  668.4  681.2  682.7  683.3  682.7  688.6  690.0
MKL SLU     477.8  751.1  846.4  839.1  810.1  837.4  812.9  909.4  887.7  906.3
ATL SLU     385.7  633.3  748.1  860.3  931.9  995.3 1019.7 1064.0 1109.9 1152.5
MKL DLU     366.5  462.0  475.6  487.3  484.7  497.6  504.2  519.0  518.2  526.6
ATL DLU     337.5  430.5  459.4  504.6  514.7  525.9  541.3  560.0  555.0  568.4
MKL CLU     606.4  667.2  644.5  641.8  646.1  669.3  664.9  682.3  690.8  696.4
ATL CLU     459.0  681.4  768.5  910.2  969.7 1052.4 1083.1 1134.4 1173.4 1201.3
MKL SLLT    288.4  459.2  568.5  644.8  683.2  753.0  763.9  782.0  779.3  821.2
ATL SLLT    244.3  407.1  530.0  632.1  730.0  808.4  833.9  887.4  953.3  970.4
MKL DLLT    298.5  416.3  428.9  442.4  461.3  461.7  473.4  775.6  486.8  496.8
ATL DLLT    256.5  348.6  403.9  428.3  445.5  478.0  505.9  508.9  501.9  534.1
MKL CLLT        -      -  585.0  613.0  629.4  616.4  639.0  635.1  642.8  648.1
ATL CLLT        -      -  585.0  686.5  715.5  840.3  862.4  912.8  959.1  983.3
MKL ZLLT        -      -  465.0  550.1  695.8  597.3  599.7  616.7  629.9  638.2
ATL ZLLT        -      -  385.9  456.5  466.3  499.3  506.4  527.8  537.8  524.7

                    HEMM   HERK  HER2K
             GEMM   SYMM   SYRK  SYR2K   TRMM   TRSM
           ====== ====== ====== ====== ====== ======
MKL S100   1362.4  581.7  504.0  414.8  800.0  711.2
ATL S100    941.6 1049.3  688.2  912.3  598.1  542.3
MKL S500   1560.1 1000.0 1079.7  959.1 1453.5  901.7
ATL S500   1524.4 1422.5 1144.6 1500.0 1305.5 1102.5
MKL S1000  1662.5 1163.5 1256.0 1142.9 1560.1 1033.1
ATL S1000  1600.0 1524.4 1334.7 1600.0 1455.6 1333.3
MKL D100    640.0  419.7  376.3  326.5  569.0  512.2
ATL D100    556.3  543.0  400.0  541.5  473.9  465.7
MKL D500    693.4  551.9  572.8  551.9  648.8  545.1
ATL D500    666.7  615.8  522.6  666.7  600.0  600.0
MKL D1000   699.3  606.6  621.7  598.1  666.7  566.6
ATL D1000   688.2  656.4  587.4  666.7  639.8  639.8
MKL C100    771.0  533.3  487.5  492.1  522.8  607.9
ATL C100   1067.2 1033.1  608.1 1023.8  725.1  425.3
MKL C500    810.4  718.9  712.7  703.2  727.5  728.5
ATL C500   1522.1 1488.1 1187.2 1488.1 1334.7 1001.0
MKL C1000   825.8  756.3  753.6  748.6  778.4  740.2
ATL C1000  1605.1 1585.1 1370.8 1475.5 1515.3 1249.5
MKL Z100    656.8  473.9  441.0  411.3  579.3  620.9
ATL Z100    609.8  595.2  457.1  597.6  462.8  392.1
MKL Z500    718.9  646.8      -      -  681.0  653.4
ATL Z500    681.2  659.6  553.0  681.2  616.4  582.7
MKL Z1000   725.2  683.6  692.6  679.0  719.5  672.3
ATL Z1000   689.1  678.1  625.0  681.8  660.1  638.7
*******************************************************************************
* LEVEL 2 TIMINGS *
*******************************************************************************
                    HEMV                 GERU    HER   HER2
             GEMV   SYMV   TRMV   TRSV    GER    SYR   SYR2
           ====== ====== ====== ====== ====== ====== ======
MKL s100    253.3  178.8  230.2  223.8  155.4   96.4  164.1
ATL s100    301.7  323.2  176.8  175.9  188.2  163.3  246.2
MKL s500    211.9  183.9  175.8  227.0  165.0  101.6  191.6
ATL s500    340.6  463.8  227.0  223.8  192.8  172.1  283.1
MKL s1000   319.0  215.5  201.2  301.8  173.3  105.6  195.9
ATL s1000   414.2  358.5  340.4  333.3  185.4  174.9  273.0
MKL d100    202.5  146.8  193.9  205.2   97.9   83.8  118.9
ATL d100    186.1  145.4   89.9   86.0   95.8   83.8  119.0
MKL d500    166.7  151.7  122.6  178.8  100.9   63.6  115.1
ATL d500    203.8  192.8  157.6  159.2   96.4   93.0  147.5
MKL d1000   167.8  152.6  123.1  173.0  100.0   65.6  117.9
ATL d1000   208.4  189.8  176.8  176.8   95.2   90.4  147.9
MKL c100    381.1  266.6  200.0  228.6  301.9  177.7  271.1
ATL c100    695.4  615.7  355.6  296.2  323.2  275.9  421.2
MKL c500    414.6  275.2  249.7  363.3  312.9  190.3  282.3
ATL c500    693.7  693.7  581.5  570.9  322.4  307.4  419.9
MKL c1000   429.0  276.4  258.2  397.0  311.4  182.9  284.1
ATL c1000   706.1  676.3  635.4  622.6  314.4  300.3  408.2
MKL z100    268.9  203.8  130.6  154.6  155.3  102.2  191.6
ATL z100    380.8  304.9  205.1  189.3  192.7  169.3  225.3
MKL z500    303.9  208.7  153.7  253.7  158.8  106.2  198.1
ATL z500    375.6  301.0  316.5  307.4  186.7  178.6  234.6
MKL z1000   317.8  207.7  159.6  294.2  160.4  105.4  195.2
ATL z1000   373.8  305.5  345.1  341.5  177.5  174.8  224.2
*******************************************************************************
* LEVEL 1 TIMINGS *
*******************************************************************************
                                                 DOTU
             ROTM   SWAP   SCAL   COPY   AXPY    DOT   NRM2   ASUM   AMAX
           ====== ====== ====== ====== ====== ====== ====== ====== ======
MKL s500    246.3  118.5   76.1  114.2  106.6  168.9  276.4  267.4  357.1
ATL s500    152.4   53.3   11.8   56.1   69.6   94.2   26.0   65.3   57.1
MKL d500    168.3   59.3   71.0   54.2   59.2  145.8  145.8  290.7  213.7
ATL d500     82.1   44.4   38.1   30.8   54.2   91.4   22.7   44.4   40.0
MKL c500        -  110.1   21.8  110.4  188.0  320.5  641.0  320.5  400.0
ATL c500        -   53.3   20.8   56.1  138.9  145.3   52.5   66.7   57.1
MKL z500        -   57.1  118.5   60.3  103.3  127.9  454.5  228.3  228.3
ATL z500        -   44.4   83.2   30.8   78.0  118.5   45.7   43.8   38.1