Re: UltraSparc kernel results
Peter,
>with zgemm(), I feel that it will always be hard for a kernel to do better
>than a hand-written implementation (you can get close, but second-order
>things like stride-2 accesses may take that few % off that makes the
>difference).
I agree; when you are dealing with such high percentages of peak, even
a low-order term like the access of C can be an insurmountable hurdle;
I increased the speed of your kernel by a few percentage points by fixing
the loops and so on, and in doing so was amazed at the 10% drops you could get
in performance by moving single instructions . . .
>I'd be interested to hear how the full user-supplied US zgemm()
>implementations compare to SunPerf.
A good point, and I'll be interested as well. However, it will be a bit
before I get to them; I've been doing the work on your kernel as part of
the debugging of the new GEMM kernel install (which now allows user-supplied
cleanup; very important if you are using nb=80); I will not be looking at
user-supplied full gemm's (I have Doug's SSE sgemm as well as your stuff)
until the tarfile for the release is practically complete . . .
> >> That's the good news. The bad news is I got access to an Ultra-5/10,
> >> sun's PCI-based low-end ultrasparc, and the submitted kernels don't
> >> seem to do very well on those machines; ATLAS's generated code is
> >> as good as the kernel there, and both get *completely* waxed by
> >> sunperf. My guess is the motherboard can have such an effect
> >> because the UltraSparc II has an off-chip cache, and the PCI-based
> >> one makes the code really different . . . Anyway, I'll have to
> >> investigate this further, maybe I just messed up the build . . .
> >>
>Hmm, this must be the one based on the Ultra IIi chip. I ran a
>benchmark on one of these some time ago, and was so disgusted with the
>performance (relative to clock speed), I vowed never to run numeric
>codes on that processor again :).
>
>I read an article on the IIi, but there was nothing to suggest that it
>should be significantly different from the II for floating point.
>Possibly you need to use an explicit prefetch instruction (which SunPerf
>uses) to get good performance?
As I said, I suspect the difference is in the L2 caches. The chip is pretty
much the same. With the suspicion on the L2 cache, I would say the prefetch
instruction is probably the culprit. It's worth noticing that the gap between
Sunperf and ATLAS is only 1/2 as wide on an UltraSparc I (which does not
implement the prefetch instruction) as it is on an UltraSparc II. After the
release, we are planning to have an atlas_prefetch.h with some macros that
use the various computer-specific prefetches (SSE/3DNow/MMX/UltraSparc/Power3).
My hope is this might allow us to play with this kind of thing ourselves . . .
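
A rough sketch of what such a prefetch wrapper might look like follows. It is
not the planned atlas_prefetch.h: the macro names, the prefetch distance, and
the dot-product loop are all hypothetical, and it leans on the GCC/Clang
`__builtin_prefetch` builtin rather than per-architecture instructions. On
compilers without the builtin the macros expand to nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro names, for illustration only. */
#if defined(__GNUC__) || defined(__clang__)
  /* Second argument: 0 = read, 1 = write. Third: temporal locality hint. */
  #define SKETCH_PREFETCH_R(p) __builtin_prefetch((p), 0, 3)
  #define SKETCH_PREFETCH_W(p) __builtin_prefetch((p), 1, 3)
#else
  /* No known prefetch on this compiler: compile to a no-op. */
  #define SKETCH_PREFETCH_R(p) ((void)0)
  #define SKETCH_PREFETCH_W(p) ((void)0)
#endif

/* Assumed prefetch distance in elements; a real value would need tuning. */
#define SKETCH_PF_DIST 64

/* Dot product that requests data SKETCH_PF_DIST elements ahead of use. */
double sketch_dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i + SKETCH_PF_DIST < n) {
            SKETCH_PREFETCH_R(&x[i + SKETCH_PF_DIST]);
            SKETCH_PREFETCH_R(&y[i + SKETCH_PF_DIST]);
        }
        sum += x[i] * y[i];
    }
    return sum;
}
```

Prefetching changes only timing, never results, so the same loop can be built
with and without the macros to see whether the hint helps on a given machine.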
Here are some numbers supporting the idea that the Ultra5/Ultra2 difference
comes from the L2 or memory system:
               out-of-cache   in-cache
Ultra5-269MHz  285.7 (53%)    435.1 (81%)
Ultra2-200MHz  283.9 (71%)    346.2 (87%)
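
The percentages in the table can be reproduced under two assumptions that the
message does not state outright: that the raw figures are MFLOPS, and that peak
is two floating-point operations per clock (one add plus one multiply). The
helper name below is hypothetical.

```c
#include <assert.h>

/*
 * Percent of theoretical peak.
 * Assumption: peak MFLOPS = clock in MHz * flops issued per cycle.
 */
double percent_of_peak(double mflops, double clock_mhz, double flops_per_cycle)
{
    assert(clock_mhz > 0.0 && flops_per_cycle > 0.0);
    return 100.0 * mflops / (clock_mhz * flops_per_cycle);
}
```

For example, `percent_of_peak(285.7, 269.0, 2.0)` is about 53.1 and
`percent_of_peak(283.9, 200.0, 2.0)` is about 71.0, which agrees with the
out-of-cache column after rounding.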
So you see that the in-L1 performance is comparable, but when you exceed it,
the non-pci solution pulls ahead. I did a little more work after the last
mail, and what I have found is the best blocking for your kernel is dependent
on the system:
Ultra5: 40
Ultra2: 80
Ultra4: 120
And, again, my guess is that the growth in block factor corresponds to better
L2s; it's too bad the ultrasparc L2 is not on-die to avoid this problem . . .
Also, as long as you adjust NB, your kernel is better than the ATLAS kernel
even on the Ultra 5, though its percentage lead is much smaller . . .
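
To make the role of the block factor concrete, here is a minimal blocked
matrix multiply with NB as a run-time parameter. It is an illustrative sketch,
not the submitted kernel or ATLAS's generated code: it handles real doubles in
row-major order, the function name is hypothetical, and partial blocks at the
edges are handled by clipping loop bounds rather than by separate cleanup
kernels.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/*
 * C += A * B for n x n row-major matrices, processed in nb x nb blocks.
 * Smaller nb keeps the working set of one block step in a smaller cache;
 * larger nb amortizes loop overhead when the cache can hold it.
 */
void blocked_gemm(size_t n, size_t nb,
                  const double *A, const double *B, double *C)
{
    assert(nb > 0);
    for (size_t ii = 0; ii < n; ii += nb) {
        size_t iend = min_sz(ii + nb, n);          /* clip partial block */
        for (size_t kk = 0; kk < n; kk += nb) {
            size_t kend = min_sz(kk + nb, n);
            for (size_t jj = 0; jj < n; jj += nb) {
                size_t jend = min_sz(jj + nb, n);
                /* Multiply one block of A by one block of B. */
                for (size_t i = ii; i < iend; i++) {
                    for (size_t k = kk; k < kend; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jend; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

Timing this routine with nb set to 40, 80, and 120 on matrices too large for
cache is one simple way to see how the best block factor shifts between
machines.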
Cheers,
Clint