Re: [Math-atlas-devel] prefetch ravings
Hello all,
I can only say a few words about prefetching on the Athlon. Prefetches
must be placed very carefully here. One reason is that the Athlon can
handle "only" six outstanding prefetches at a time; any further ones are
simply ignored. The second reason is that they sometimes decrease
performance in an unpredictable way (hello Clint!). BTW, the K6 series
can handle only one outstanding prefetch, which makes it worthless in
practice.
My personal strategy in my dgemm kernel was:
- Make sure that the prefetches do not decrease the performance of the
"pure" kernel. That is, there should be no (or very little) performance
difference between the kernel with and without prefetching enabled when
running the kernel in a loop with always the same three matrices (all in
L1 cache).
- Place the prefetches right before the register exchange part of the
kernel. This is the place where the mul and add pipelines are emptied.
At least on the Athlon, the first store must wait 4 cycles for its data
here.
- Unroll the two inner loops completely, as the prefetch instructions
for a column of B and of A+1 must be separated by at least
mem_latency+5*(cacheline refill time) cycles.
- Make sure that the prefetch of the next column of B starts
mem_latency+5*(cacheline refill time) cycles before the column is
needed. (If one wonders why I need 5 cachelines to prefetch: as the
blocks are not cacheline aligned, each column of each matrix can touch
5 cachelines in the worst case.)
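
To make the "5 cachelines" figure and the lead distance concrete, here is
a small sketch in C. It is not taken from the kernel itself. It assumes a
block column of 30 doubles (guessed from the LDC=30 test case) and 64-byte
cachelines; the function names and any cycle counts passed in are purely
illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cachelines touched by `bytes` bytes that start `offset`
   bytes past a cacheline boundary. */
size_t lines_touched(size_t offset, size_t bytes, size_t line_size)
{
    assert(line_size > 0 && bytes > 0);
    size_t first = offset / line_size;
    size_t last  = (offset + bytes - 1) / line_size;
    return last - first + 1;
}

/* Worst case over every element-aligned start offset within one line.
   The column is only guaranteed to be aligned to elem_size, not to the
   cacheline size. */
size_t worst_case_lines(size_t n_elems, size_t elem_size, size_t line_size)
{
    assert(elem_size > 0);
    size_t worst = 0;
    for (size_t off = 0; off < line_size; off += elem_size) {
        size_t n = lines_touched(off, n_elems * elem_size, line_size);
        if (n > worst)
            worst = n;
    }
    return worst;
}

/* Lead distance in cycles: mem_latency + n_lines * (cacheline refill time).
   Both cycle counts are inputs; no real Athlon values are assumed here. */
unsigned prefetch_lead_cycles(unsigned mem_latency, unsigned refill_cycles,
                              size_t n_lines)
{
    return mem_latency + (unsigned)n_lines * refill_cycles;
}
```

Under these assumptions, worst_case_lines(30, 8, 64) gives 5: a 240-byte
column fills 4 lines when it starts on a line boundary, but spills into a
fifth when it starts 56 bytes into a line.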
I was not able to implement prefetching of C successfully, although C
is the most critical matrix because it is not copied. The loss when
going from LDC=30 (test) to LDC=M (reality) is ~40 MFLOPS on my Athlon
classic 600. I think this is due to TLB misses, so C would normally be
a good candidate for prefetching.
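
A rough way to see why the leading dimension matters for the TLB is to
count the distinct pages a block of C touches. The sketch below is only
an illustration, not a measurement: it assumes a 30x30 column-major block
of doubles, 4 KiB pages, and a hypothetical LDC of 1000 for the
"reality" case; pages_touched is a made-up helper.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct pages touched by an m x n column-major block of
   doubles with leading dimension ldc, whose first element lies `base`
   bytes past a page boundary. Columns sit at increasing addresses, so
   page indices never decrease and each new page is counted once. */
size_t pages_touched(size_t base, size_t m, size_t n, size_t ldc,
                     size_t page_size)
{
    assert(page_size > 0 && m > 0 && n > 0 && ldc >= m);
    size_t count = 0;
    size_t last_page = 0;
    int have_last = 0;
    for (size_t j = 0; j < n; ++j) {
        size_t start = base + j * ldc * sizeof(double);
        size_t end   = start + m * sizeof(double) - 1;
        for (size_t p = start / page_size; p <= end / page_size; ++p) {
            if (!have_last || p > last_page) {
                ++count;
                last_page = p;
                have_last = 1;
            }
        }
    }
    return count;
}
```

With LDC=30 the whole block is 7200 contiguous bytes, so
pages_touched(0, 30, 30, 30, 4096) is 2. With the hypothetical LDC=1000
the columns are 8000 bytes apart, so every column lands on its own page
and the block touches at least 30 pages.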
Julian