Re: sgemm questions
Hi Clint!
R Clint Whaley <rwhaley@cs.utk.edu> writes:
> Camm,
>
> >lda=ldb=KB, yes, that's right. It appears that I've also assumed that
> >all dimensions are KB, i.e. MB=NB=KB, as described in the doc for the
> >L1 kernel case. It would be trivial to separate out MB and NB if
> >needed -- just two macros need to be changed. Please let me know if
> >you want this edit. The input array dimension args are ignored.
>
> OK, I need to get a new developer release out with the cleanup stuff in.
> That is where this need for MB != NB != KB is coming from. The kernel you
> already have is fine for the non-cleanup case, and I would recommend leaving
> it alone for that. If you have the time, I think it would be worth doing to
> produce a second kernel, modified from the first, so that M and N are passed
> in as parameters to the routine, rather than fixed at MB and NB. This has
> not caused serious slowdown on any platform I've tested so far (since these
> dimensions do not affect lda/ldb and the innermost loop), and it allows the
> routine to be used for M and N loop cleanup without compiling NB different
> instantiations of the routine (leading to code bloat, and reducing performance
> through repetitive instruction load). For the K-cleanup, it *is* often necessary
> to use compile-time KB, since it controls lda and ldb, as well as the inner
> loop, especially on Intels, where the inner loop needs heavy unrolling. So,
> a second kernel taking M & N as input parameters, and then probably fixing K to
> KB would be a good cleanup (obviously, if it didn't kill performance, taking
> K as an input parameter would be great, but I don't think it is doable).
> The idea would be to use the input file's flag variable to indicate your first
> routine is to be used for kernel only, and the second to be used for cleanup
> only.
>
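For reference, the shape of cleanup routine described above might look roughly like the plain C sketch below: M and N arrive as run-time arguments, while KB stays a compile-time constant that fixes lda, ldb and the inner loop trip count. The name cleanup_mm, the value KB=8, and the C <- A^T*B + beta*C storage layout are assumptions made for illustration only, not the actual ATLAS kernel interface.

```c
#include <assert.h>
#include <stddef.h>

#ifndef KB
#define KB 8   /* compile-time K block; 8 is an arbitrary value for this sketch */
#endif

/* Hypothetical cleanup kernel: C <- A^T * B + beta * C.
 * Assumed layout: A is KB x M and B is KB x N, both column-major with
 * lda = ldb = KB; C is M x N column-major with leading dimension ldc.
 * M and N are run-time values, so one compiled routine covers every
 * M/N cleanup case. */
static void cleanup_mm(int M, int N, const float *A, const float *B,
                       float beta, float *C, int ldc)
{
    assert(M >= 0 && N >= 0 && ldc >= M);
    for (int j = 0; j < N; j++) {
        const float *b = B + (size_t)j * KB;      /* column j of B */
        for (int i = 0; i < M; i++) {
            const float *a = A + (size_t)i * KB;  /* column i of A */
            float sum = 0.0f;
            /* Trip count is the constant KB, so the compiler (or a
             * hand-written kernel) can unroll this loop fully. */
            for (int k = 0; k < KB; k++)
                sum += a[k] * b[k];
            float *c = C + i + (size_t)j * ldc;
            *c = beta * (*c) + sum;
        }
    }
}
```

Only the two outer loops depend on the run-time arguments; the innermost loop and the strides into A and B still see a constant, which is the property the quoted paragraph relies on.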
OK, it seems as though if we can insist that KB be a multiple of 4 (2
for complex), we can even input kb without too much trouble. Please
let me know if this is workable. What I'm unsure of is whether to
write a 1x1xkb cleanup kernel, or something that can branch from
2x1xkb, 1x2xkb, to 1x1xkb. Do you think this is worth it? How will
kb%4!=0 work? It can be done of course, but the normal fpu needs to
be used in this case, and there may be issues of getting into and out
of xmm mode.
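To make the kb%4 question concrete, here is a rough plain C sketch of one way the split could go: the bulk of the dot product runs in groups of four (standing in for one packed xmm operation), and any kb%4 leftover is finished with scalar code. The helper name dot_kb is hypothetical, and real SSE code would replace the four-accumulator loop with packed instructions.

```c
#include <assert.h>

/* Hypothetical helper: dot product of two length-kb vectors, with kb
 * passed at run time. */
static float dot_kb(int kb, const float *a, const float *b)
{
    assert(kb >= 0);
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int k4 = kb - (kb % 4);   /* largest multiple of 4 not exceeding kb */
    int k;

    /* Main part: four independent partial sums per step, mimicking the
     * four lanes of a packed single-precision register. */
    for (k = 0; k < k4; k += 4) {
        acc[0] += a[k]     * b[k];
        acc[1] += a[k + 1] * b[k + 1];
        acc[2] += a[k + 2] * b[k + 2];
        acc[3] += a[k + 3] * b[k + 3];
    }
    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    /* Remainder: at most three scalar multiply-adds when kb%4 != 0. */
    for (; k < kb; k++)
        sum += a[k] * b[k];
    return sum;
}
```

If kb is required to be a multiple of 4, the remainder loop never executes and can be dropped from the kernel entirely.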
> You can still insist that M be a multiple of 2, for instance, though this
> will mean that your cleanup will only be called when M%2 == 0, and the
> generated cleanup will be called otherwise . . .
>
> Normally, you can leave the cleanup to ATLAS's generated cleanup, but your
> kernel is 1.8 times faster than the generated code, so cleanup could really
> hurt your performance . . .
>
OK. What percent of peak is this, BTW?
> Thanks,
> Clint
>
>
Take care,
--
Camm Maguire camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah