Re: SSE Level 3 drop in gemm
Greetings!
R Clint Whaley <rwhaley@cs.utk.edu> writes:
> Doug,
>
> >I've (finally) found the time to finish adding my SSE sgemm into ATLAS
> >as a drop in kernel. Atlas timing says it runs up to 2.39 times faster
> >than ATLAS when it's computing the cross over points. Two questions:
>
This is great! Thanks for all the work, you guys! Congratulations!
I had been working on an L3 sgemm *kernel* (not a complete
implementation), which I'm sure won't perform as well as Doug's, and
which is thus now obsolete. So far it gives about 550 MFLOPS with
'make ummcase' -- the best that ATLAS previously found was 223
(res/sMMRES). These were compiled with -g, so I don't know what the
real speedup would be. This is on a PIII 450 MHz. xl3blastst from a
previous optimized (i.e. no -g) build gives around 370.
In any case, I thought I might turn this into a complex gemm
contribution. Reading the docs, it seems one only needs to double ldc?
Will atlas call the kernel repeatedly for all real/imaginary matrix
combos?
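To make the question concrete, here is a minimal sketch of what I take
"calling the real kernel for all real/imaginary combos" to mean. Everything
in it is an assumption for illustration: real_kernel, complex_via_four_real
and the tiny N = 2 block are made-up names, not ATLAS's actual interface.
A and B are assumed to be already split into real and imaginary planes,
while C stays interleaved (re, im) in place, which is why the column stride
doubles.

```c
#include <assert.h>

enum { N = 2 };  /* tiny block size, for illustration only */

/* Hypothetical real kernel: C += sign * A * B on N x N column-major blocks.
 * Elements of C within a column are `cinc` floats apart, and columns of C
 * are `ldc` floats apart. */
static void real_kernel(float sign, const float *A, const float *B,
                        float *C, int cinc, int ldc)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) {
            float s = 0.0f;
            for (int k = 0; k < N; k++)
                s += A[i + k * N] * B[k + j * N];
            C[i * cinc + j * ldc] += sign * s;
        }
}

/* Complex C += A * B built from four real kernel calls.
 * ldc_complex is the leading dimension of C counted in complex elements. */
static void complex_via_four_real(const float *Ar, const float *Ai,
                                  const float *Br, const float *Bi,
                                  float *C, int ldc_complex)
{
    assert(ldc_complex >= N);
    int ldc = 2 * ldc_complex;  /* doubled: one complex entry is two floats */

    real_kernel(+1.0f, Ar, Br, C,     2, ldc);  /* Cr += Ar * Br */
    real_kernel(-1.0f, Ai, Bi, C,     2, ldc);  /* Cr -= Ai * Bi */
    real_kernel(+1.0f, Ar, Bi, C + 1, 2, ldc);  /* Ci += Ar * Bi */
    real_kernel(+1.0f, Ai, Br, C + 1, 2, ldc);  /* Ci += Ai * Br */
}
```

Each call streams through C again, so a kernel that handled the real and
imaginary parts in one pass might save some of that traffic.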
A few thoughts:
1) One ought to be able to do better with a true complex kernel than
calling the routines 4 times, no?
2) The xsmmtst always doubles ldc, even with single real precision.
This makes it difficult to fully capitalize on the compile-time
constant nature of the dimensions (i.e. one must read ldc at runtime
if one wants a routine that will pass both the tester and the
timer.)
3) I found it useful to also define NB4, MB4, and KB4 in emit_mm.c,
for obvious reasons (in the case of SSE).
4) Believe it or not, prefetch added about 50-80 MFLOPS on a base of
450. Still, I don't imagine that would warrant double precision
kernels?
5) xmmsearch still reports the old atlas kernel as the best to
stdout, at 223, but adds mine at the bottom of res/sMMRES.
Haven't tried installing the whole library yet, but I had doubts
on whether this would result in my kernel being selected.
6) These are a *lot* easier to write, IMHO, than the l2 stuff.
7) I remember reading that the AMD 3dNow! had the same KNI throughput
as the PIII, even though its mm registers were half as big.
Something else was doubled, but I can't find it now. I know there
are still only 8 mm regs. Anyone know the answer? Should be easy
to make Athlon stuff from what we have.
8) Sure would be nice, since a copy is being done anyway, to align
data to 16 bytes. Anywhere I can change this locally just to see
what it adds to the performance?
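For point 8, this is roughly the kind of local change I have in mind: a
minimal sketch, assuming the copy buffer comes from plain malloc. The helper
name alloc_aligned16 is made up, and this is not where or how ATLAS actually
allocates its workspace. It over-allocates by 15 bytes and rounds the
pointer up to the next 16-byte boundary, so that aligned SSE loads could be
used on the copied data.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: return a 16-byte-aligned buffer of `nfloats` floats.
 * The pointer actually obtained from malloc is stored in *raw; pass that
 * one to free(), not the aligned pointer. Returns NULL on failure. */
static float *alloc_aligned16(size_t nfloats, void **raw)
{
    *raw = malloc(nfloats * sizeof(float) + 15);  /* room to round up */
    if (*raw == NULL)
        return NULL;

    /* Round the address up to the next multiple of 16. */
    uintptr_t p = ((uintptr_t)*raw + 15) & ~(uintptr_t)15;
    assert(p % 16 == 0);
    return (float *)p;
}
```

The caller keeps both pointers: the aligned one for the kernel, and the raw
one for free() once the copy is no longer needed.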
Take care,
> Great news! I was hoping we'd have some L3 SSE stuff before release . . .
> Is it a kernel or a complete GEMM implementation? I'm not sure from the
> info below . . .
>
> >It compiles fine using the documented instructions for forcing
> >compilation, but it doesn't seem to automatically detect it during a
> >normal compilation. For this to work I am guessing all I need to do is
> >add the correct UMMdir definition to ATLAS/Make.<arch> before starting the
> >./make arch=<arch> install? There is an ATLAS/makes/Make.goto. Do I
> >need one of these?
>
> Depends on whether you've got a kernel or a GEMM replacement. For a kernel,
> you shouldn't need to fool with all this stuff. . . .
>
> >2) What's the best way to send in the changes? Complete tar file, tar
> >file with the changes, patch file?
>
> I like a tarfile with just your codes best . . .
>
> Thanks,
> Clint
>
>
--
Camm Maguire camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah