Re: SSE Level 3 drop in gemm
Camm,
No knowledge/understanding of the register reservation, unfortunately . . .
>Otherwise, the kernel is working fine. Performance fluctuates on the
>short timer runs, but is somewhere between 670 and 700 MFLOPS for the
>beta=0 case, and about 670 for arbitrary beta.
Great, that represents something like a 1.9 speedup over ATLAS's kernel,
doesn't it?
>On another front -- Do you have any word on the complex compilation
>procedure, Clint? The deal is that all beta cases seem to be
>referenced by the same timer (fc.c) program, regardless of beta= flag.
Yep, ATLAS/doc/atlas_contrib.ps explains this in the section on complex
matmul: it's done with 4 calls to what is essentially a real matmul. Even
the complex beta=1 case requires a real beta=X kernel, 'cause you need
beta=-1.0: the product of the two imaginary elements contributes to the
real component with a negative sign (notice steps 1 and 3 on page 14 use
negative). The timer compiles your complex code 3 times to get the b1,
b0, and bX cases. What exactly is the problem you are having with it?
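
To make the 4-call idea concrete, here is a rough sketch for the complex
beta=1 case. It is not the ATLAS code: real_gemm and complex_gemm_b1 are
hypothetical names, the real and imaginary parts are assumed to live in
separate row-major N x N arrays (ATLAS itself does not store them this
way), and the call order is just one ordering consistent with the
description above, so the steps in atlas_contrib.ps may differ.

```c
#define N 2  /* hypothetical fixed size, for illustration only */

/* Hypothetical real kernel: C = A*B + beta*C, row-major N x N. */
static void real_gemm(const double *A, const double *B,
                      double beta, double *C)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum + beta * C[i * N + j];
        }
    }
}

/* Complex C = A*B + C (complex beta=1) via 4 real calls.
 * The real part needs Ar*Br - Ai*Bi + Cr; since the kernel can only
 * add A*B, the minus sign is obtained by using beta=-1.0 twice. */
static void complex_gemm_b1(const double *Ar, const double *Ai,
                            const double *Br, const double *Bi,
                            double *Cr, double *Ci)
{
    real_gemm(Ai, Bi, -1.0, Cr); /* Cr = Ai*Bi - Cr                */
    real_gemm(Ai, Br,  1.0, Ci); /* Ci = Ai*Br + Ci                */
    real_gemm(Ar, Br, -1.0, Cr); /* Cr = Ar*Br - Ai*Bi + Cr        */
    real_gemm(Ar, Bi,  1.0, Ci); /* Ci = Ar*Bi + Ai*Br + Ci        */
}
```

In this sketch the two calls that update the real part are the ones
using beta=-1.0, which is why a real bX kernel is needed even when the
complex beta is 1.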
Cheers,
Clint