UltraSPARC dgemm user contribution
Hi,
I have at last got around to my promise to Clint of putting these together.
They are available from:
http://cs.anu.edu.au/~Peter.Strazdins/projects/SparcBLAS/UserUSATLAS.tar.gz
The tarball contains ATLAS dgemm kernels (L1 and complete). The directory
structure reflects that of ATLAS.
These kernels are C codes best compiled with:
gcc -mcpu=ultrasparc -O -fomit-frame-pointer -mtune=ultrasparc ...
and were primarily written by a Viet Nguyen, who worked with me last year
on (mainly complex) UltraSPARC BLAS (and did an excellent job too).
The kernels use `lookahead over the level 1 cache' (equivalent to prefetching),
so they can perform well for large blocksizes (e.g. 60-90).
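To give a rough idea of what lookahead can mean, here is a minimal sketch. It is not the code from the tarball. It is a dot product in which the operands for iteration k+1 are loaded before the multiply-add for iteration k is done, so that a load which misses the level 1 cache can overlap with useful arithmetic. The name dot_lookahead and the lookahead distance of one iteration are assumptions made for illustration; the real kernels are more elaborate.

```c
/* Hypothetical illustration of lookahead, not the ANU kernel itself.
 * Computes sum over k of x[k] * y[k], loading the operands for
 * iteration k+1 before the multiply-add of iteration k. */
double dot_lookahead(int n, const double *x, const double *y)
{
    double sum = 0.0;
    double xc, yc;              /* operands for the current iteration */
    int k;

    if (n <= 0)
        return 0.0;

    xc = x[0];
    yc = y[0];
    for (k = 0; k + 1 < n; k++) {
        double xn = x[k + 1];   /* loaded one iteration ahead of use */
        double yn = y[k + 1];
        sum += xc * yc;         /* arithmetic on already-loaded values */
        xc = xn;
        yc = yn;
    }
    sum += xc * yc;             /* last pair, nothing left to look ahead to */
    return sum;
}
```

An optimising compiler is free to reschedule these loads, so whether the lookahead survives into the generated code depends on the compiler and flags used.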
Their performance on a 170 MHz Ultra does not look terribly
fast when run from ATLAS:
The L1 kernel:
peter@kaffa make -e ummcase pre=d nb=40 mmrout=../CASES/ALT_ANUUltraL1mm.c beta=1
...
dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.510000, mflop=245.458824
dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.510000, mflop=245.458824
dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.530000, mflop=236.196226
But for the equivalent test in my own test program (in which
everything is warm in cache and there is no copying), it gets roughly
300 mflops or more.
The full kernel generally shows `speed ups' of > 1.0,
except for small matrices, e.g.:
...
TEST  TA  TB    M    N    K  alpha   beta    Time  Mflop  SpUp
====  ==  ==  ===  ===  ===  =====  =====  ======  =====  ====
  86   T   N   10   10   40    1.0    1.0    1.80   69.6  1.00
  86   T   N   10   10   40    1.0    1.0    2.63   47.6  0.68
  87   T   N  750  750   40    1.0    1.0    0.75  180.0  1.00
  87   T   N  750  750   40    1.0    1.0    0.58  232.8  1.29
..
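For anyone wanting to check the figures, the Mflop and SpUp columns can be reproduced with the usual 2*M*N*K flop count for dgemm. The sketch below assumes that count, and assumes the row with SpUp 1.00 is the baseline. The function names are made up for illustration. The repetition count is not printed in the output; a value of 3 for test 87 is only inferred because it makes the numbers agree.

```c
/* Hypothetical helpers, not part of ATLAS.
 * MFLOPS for `reps` calls of an M x N x K dgemm, counting
 * 2*M*N*K floating point operations per call. */
double gemm_mflops(int m, int n, int k, long reps, double seconds)
{
    double flops = 2.0 * (double)m * (double)n * (double)k * (double)reps;
    return flops / (seconds * 1.0e6);
}

/* Speedup of the tested kernel relative to the baseline timing. */
double speedup(double base_seconds, double test_seconds)
{
    return base_seconds / test_seconds;
}
```

With reps assumed to be 3, gemm_mflops(750, 750, 40, 3, 0.58) comes to about 232.8 and speedup(0.75, 0.58) to about 1.29, which agrees with the second row of test 87.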
Again somehow from my own test program it runs a little faster (funny
how my test programs are so optimistic!): for test 87, it runs at 247
MFLOPS; for K=60, it climbs to 263 MFLOPS, and peaks at K=88 with 273
MFLOPS.
Anyway, I will let the ATLAS team check it out (I hope there is still
time left!). It's all been pretty rushed, so if there are minor problems
(like it does not give the correct answers :) I will try to fix them up.
Regards, Peter