Re: sgemm questions
Greetings! Well, I suppose curiosity got the better of me. At
http://people.debian.org/~/camm/sgemm_20001122.tgz
I've put the results of my investigation into Peter's outstanding
code. KB=NB=MB=lda=ldb, KB%4=0. Briefly, here are the results:
=============================================================================
Peter's code (sent in email, couldn't get complex to work?):
=============================================================================
1689.400606 +- 4.709633 ATL_out 56 s 0 moves=
1346.513939 +- 8.287574 ATL_out 56 s 0
1751.309394 +- 5.058435 ATL_out 56 s 1 moves=
1384.111818 +- 8.147836 ATL_out 56 s 1
1712.988182 +- 3.164270 ATL_out 56 s 2 moves=
1293.062424 +- 6.591418 ATL_out 56 s 2
=============================================================================
New submission:
=============================================================================
1865.305455 +- 6.159026 ATL_sgemm_SSE 56 s 0 moves=
1441.084848 +- 8.923852 ATL_sgemm_SSE 56 s 0
1865.305455 +- 6.159026 ATL_sgemm_SSE 56 s 1 moves=
1409.389697 +- 6.385470 ATL_sgemm_SSE 56 s 1
1753.685455 +- 7.090097 ATL_sgemm_SSE 56 s 2 moves=
1352.806364 +- 6.675575 ATL_sgemm_SSE 56 s 2
1877.224848 +- 6.401376 ATL_sgemm_SSE 60 s 0 moves=
1468.712727 +- 5.514439 ATL_sgemm_SSE 60 s 0
1870.535758 +- 6.330379 ATL_sgemm_SSE 60 s 1 moves=
1473.095455 +- 6.562474 ATL_sgemm_SSE 60 s 1
1771.200000 +- 0.000000 ATL_sgemm_SSE 60 s 2 moves=
1420.441515 +- 7.363922 ATL_sgemm_SSE 60 s 2
1868.018182 +- 6.256875 ATL_sgemm_SSE 64 s 0 moves=
1458.247273 +- 7.546355 ATL_sgemm_SSE 64 s 0
1863.560000 +- 6.036407 ATL_sgemm_SSE 64 s 1 moves=
1461.693636 +- 5.455752 ATL_sgemm_SSE 64 s 1
1753.680000 +- 4.903616 ATL_sgemm_SSE 64 s 2 moves=
1396.104848 +- 6.463921 ATL_sgemm_SSE 64 s 2
1849.863939 +- 4.590667 ATL_sgemm_SSE 52 s 0 moves=
1461.447273 +- 5.518950 ATL_sgemm_SSE 52 s 0
1865.305455 +- 6.159026 ATL_sgemm_SSE 56 s 0 moves=
1429.503939 +- 9.230962 ATL_sgemm_SSE 56 s 0
1877.224848 +- 6.401376 ATL_sgemm_SSE 60 s 0 moves=
1516.266364 +- 8.898227 ATL_sgemm_SSE 60 s 0
1870.247273 +- 6.328658 ATL_sgemm_SSE 64 s 0 moves=
1487.890303 +- 5.682958 ATL_sgemm_SSE 64 s 0
1745.235152 +- 5.377374 ATL_sgemm_SSE 68 s 0 moves=
1434.725455 +- 6.320064 ATL_sgemm_SSE 68 s 0
1627.469091 +- 4.598963 ATL_sgemm_SSE 72 s 0 moves=
1356.362727 +- 5.500429 ATL_sgemm_SSE 72 s 0
1847.635152 +- 4.178696 ATL_sgemm_SSE 52 s 1 moves=
1424.103333 +- 8.197587 ATL_sgemm_SSE 52 s 1
1863.076667 +- 6.035586 ATL_sgemm_SSE 56 s 1 moves=
1434.814545 +- 9.466845 ATL_sgemm_SSE 56 s 1
1868.306061 +- 6.258576 ATL_sgemm_SSE 60 s 1 moves=
1475.670909 +- 5.923164 ATL_sgemm_SSE 60 s 1
1863.560000 +- 6.036407 ATL_sgemm_SSE 64 s 1 moves=
1468.908182 +- 7.004336 ATL_sgemm_SSE 64 s 1
1731.824848 +- 5.377374 ATL_sgemm_SSE 68 s 1 moves=
1429.246970 +- 5.253022 ATL_sgemm_SSE 68 s 1
1615.819394 +- 4.724985 ATL_sgemm_SSE 72 s 1 moves=
1343.587576 +- 5.246843 ATL_sgemm_SSE 72 s 1
1809.829091 +- 5.858913 ATL_sgemm_SSE 52 s 2 moves=
1379.219091 +- 7.690277 ATL_sgemm_SSE 52 s 2
1828.243333 +- 4.249878 ATL_sgemm_SSE 56 s 2 moves=
1441.033939 +- 8.667444 ATL_sgemm_SSE 56 s 2
1839.320000 +- 0.000004 ATL_sgemm_SSE 60 s 2 moves=
1461.923939 +- 5.521045 ATL_sgemm_SSE 60 s 2
1841.269091 +- 2.195057 ATL_sgemm_SSE 64 s 2 moves=
1464.526364 +- 5.946164 ATL_sgemm_SSE 64 s 2
1712.667273 +- 3.163770 ATL_sgemm_SSE 68 s 2 moves=
1407.878182 +- 6.316569 ATL_sgemm_SSE 68 s 2
1605.833939 +- 4.097082 ATL_sgemm_SSE 72 s 2 moves=
1348.433636 +- 6.104257 ATL_sgemm_SSE 72 s 2
1801.574545 +- 5.902800 ATL_sgemm_SSE 52 c 0 moves=
1573.307576 +- 4.372838 ATL_sgemm_SSE 52 c 0
1839.436970 +- 2.192968 ATL_sgemm_SSE 56 c 0 moves=
1597.456970 +- 4.226868 ATL_sgemm_SSE 56 c 0
1834.340000 +- 0.000004 ATL_sgemm_SSE 60 c 0 moves=
1616.339394 +- 4.769277 ATL_sgemm_SSE 60 c 0
1804.527273 +- 8.006376 ATL_sgemm_SSE 64 c 0 moves=
1599.050909 +- 4.231042 ATL_sgemm_SSE 64 c 0
1708.967879 +- 3.224055 ATL_sgemm_SSE 68 c 0 moves=
1512.530000 +- 4.098100 ATL_sgemm_SSE 68 c 0
1605.941515 +- 4.678871 ATL_sgemm_SSE 72 c 0 moves=
1449.201515 +- 2.315129 ATL_sgemm_SSE 72 c 0
1807.765455 +- 5.902800 ATL_sgemm_SSE 52 c 1 moves=
1533.680909 +- 4.929485 ATL_sgemm_SSE 52 c 1
1826.900909 +- 4.246757 ATL_sgemm_SSE 56 c 1 moves=
1567.348182 +- 4.468261 ATL_sgemm_SSE 56 c 1
1819.928485 +- 4.834932 ATL_sgemm_SSE 60 c 1 moves=
1574.220606 +- 4.102429 ATL_sgemm_SSE 60 c 1
1806.016970 +- 5.925492 ATL_sgemm_SSE 64 c 1 moves=
1570.467273 +- 4.457004 ATL_sgemm_SSE 64 c 1
1703.352727 +- 2.444810 ATL_sgemm_SSE 68 c 1 moves=
1496.558485 +- 2.828707 ATL_sgemm_SSE 68 c 1
1594.184242 +- 1.638833 ATL_sgemm_SSE 72 c 1 moves=
1425.814848 +- 3.704411 ATL_sgemm_SSE 72 c 1
1795.383636 +- 5.702647 ATL_sgemm_SSE 52 c 2 moves=
1517.324848 +- 4.192475 ATL_sgemm_SSE 52 c 2
1816.591818 +- 5.443238 ATL_sgemm_SSE 56 c 2 moves=
1556.658485 +- 5.173491 ATL_sgemm_SSE 56 c 2
1815.810909 +- 5.267221 ATL_sgemm_SSE 60 c 2 moves=
1560.235152 +- 4.411821 ATL_sgemm_SSE 60 c 2
1735.065455 +- 18.207197 ATL_sgemm_SSE 64 c 2 moves=
1556.541212 +- 4.513313 ATL_sgemm_SSE 64 c 2
1687.300000 +- 4.830110 ATL_sgemm_SSE 68 c 2 moves=
1488.329394 +- 4.328542 ATL_sgemm_SSE 68 c 2
1587.850000 +- 2.570750 ATL_sgemm_SSE 72 c 2 moves=
1411.768788 +- 3.646260 ATL_sgemm_SSE 72 c 2
=============================================================================
A few comments:
1) I like Peter's idea of using a generator to write C code and then
compiling it better than my approach of having the cpp preprocessor
generate assembly from defined macros. I'd originally adopted the
latter because I couldn't get rid of register thrashing as gcc
switched between its asm and mine, but Peter's code generates very
clean assembly, and gcc always handles the loop overhead best. I
was also a little concerned about the documentation, which seems
to indicate that gcc is free to insert whatever it wishes between
asm() calls. We can currently produce good asm using multiple
asm() calls because a) gcc currently doesn't reference the extended
registers, and b) if we don't reference the ordinary registers in
the asm() explicitly, gcc's optimizer can do a good job of
maximizing register use across asm() calls. If and when gcc
starts emitting references to SSE/MMX registers, of course, things
will have to change.
2) Peter's ideas of a) unrolling fully with KB ~ 56, b) a 1x4 strategy,
c) loading C at the beginning rather than at the end, and
(shockingly) d) doing no pipelining at all, all seem to be wins. I
couldn't believe d) when I saw it, but it's apparently true -- the
PIII likes code like load(a) mul(b,a) add(a,c) best. Apparently,
the parallelism between muls and adds mentioned by Doug Aberdeen in
his earlier email only appears fully when the intermediate register
is the same. Doug, maybe you can try this and see if you can get
better than 0.75 clock? Or maybe I misunderstand you?
3) I noticed that the loop conditions are checked at the end of each
loop, so the code fails if called with any length = 0. This seems
reasonable, but I thought I'd point it out to ensure that atlas is
making the calls accordingly.
4) I really only did three things, and a few minor cleanups, to
Peter's code: a) shaved an instruction off the main block of 4
multiplies, b) tightened the writing of C, and c) with these, and
the elimination of a few extraneous instructions, increased the
optimal KB to 60 or 64.
5) Peter, if you'd like to make these changes in your generator, and
maintain this code or its equivalent, that would be just fine with
me. You're doing a great job, and atlas is all the better for it!
6) I've got a cleanup too, which works but isn't fully optimized, if
anyone would like to look at it.
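As a rough illustration of points 2 and 3, here is a plain C sketch of a
1x4 kernel that loads C before the K loop, feeds each product straight
into its add through one temporary, and tests every loop at the bottom.
It is not Peter's generated code or the SSE submission: the function
names sgemm_1x4_sketch and sgemm_1x4_checked, the scalar arithmetic, and
the storage layout (A and B both contiguous in K, C column-major with
leading dimension ldc, N a multiple of 4) are all assumptions made for
the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar stand-in for a 1x4 kernel computing C += A^T * B.
   Assumed layout: row i of the product reads A[i*lda + k], column j
   reads B[j*ldb + k], and C is column-major with leading dimension ldc. */
static void sgemm_1x4_sketch(int M, int N, int K,
                             const float *A, int lda,
                             const float *B, int ldb,
                             float *C, int ldc)
{
    int i, j, k;

    /* Every loop below tests its condition at the bottom, so each
       length must allow at least one trip (point 3). */
    assert(M >= 1 && K >= 1 && N >= 4 && N % 4 == 0);

    j = 0;
    do {
        const float *b0 = B + (size_t)j * ldb;
        const float *b1 = b0 + ldb;
        const float *b2 = b1 + ldb;
        const float *b3 = b2 + ldb;
        float *c = C + (size_t)j * ldc;

        i = 0;
        do {
            const float *a = A + (size_t)i * lda;
            /* 2c: load the four C entries before the K loop. */
            float c0 = c[i];
            float c1 = c[i + ldc];
            float c2 = c[i + 2 * ldc];
            float c3 = c[i + 3 * ldc];

            k = 0;
            do {
                float t;
                /* 2d: load, multiply and add through the same temporary,
                   with no software pipelining across iterations. */
                t = b0[k]; t *= a[k]; c0 += t;
                t = b1[k]; t *= a[k]; c1 += t;
                t = b2[k]; t *= a[k]; c2 += t;
                t = b3[k]; t *= a[k]; c3 += t;
            } while (++k < K);

            c[i] = c0;
            c[i + ldc] = c1;
            c[i + 2 * ldc] = c2;
            c[i + 3 * ldc] = c3;
        } while (++i < M);

        j += 4;
    } while (j < N);
}

/* Caller-side guard: returns 0 and leaves C untouched when the
   bottom-tested kernel could not be called safely, 1 otherwise. */
static int sgemm_1x4_checked(int M, int N, int K,
                             const float *A, int lda,
                             const float *B, int ldb,
                             float *C, int ldc)
{
    if (M < 1 || K < 1 || N < 4 || N % 4 != 0)
        return 0;
    sgemm_1x4_sketch(M, N, K, A, lda, B, ldb, C, ldc);
    return 1;
}
```

The wrapper only makes explicit what point 3 asks of the caller: a zero
length has to be filtered out before the kernel is entered, since the
do/while loops would otherwise run one trip on data that isn't there.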
Take care,
--
Camm Maguire camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah