latest blas contribs
Greetings! OK, my last stuff for the release is at:
http://people.debian.org/~camm/blas_20001204.tgz
A few notes:
1) Beware: the ?cases.dsc files are in here, and will overwrite what
you have if you simply unpack the tarball in the root directory.
2) No changes to the l3 stuff from the previous version.
3) All l2 integrated into two files include/contrib/camm_dpa.h and
include/contrib/ATL_gemv_ger_SSE.c. These are included with
appropriate macro parameter settings in
ATL_{ger1,gemvT,gemvN}_SSE.c.
4) Current parameters that are settable with macros:
a) NO_TRANSPOSE (indicates an axpy strategy)
b) GER (self explanatory, invokes NO_TRANSPOSE automatically)
c) PREFETCH (how far ahead to prefetch in bytes)
d) LUNROLL (how many TYPE elements to unroll in the inner
loop)
e) NDPM (How many rows to process at a time, most routines can
do up to 4, DCPLX only 2, SCPLX NO_TRANSPOSE only 3)
f) STRIDE (how many rows to skip when processing multiple rows
at once)
g) ALIGN (SREAL only, and only when STRIDE%4==0 || NDPM==1;
aligns the inner loop to 16 bytes and uses aligned assembler
instructions thereafter)
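
To make the NO_TRANSPOSE, NDPM and STRIDE descriptions above concrete, here is a minimal plain-C sketch (no SSE, and not the code in the tarball). The function names, the column-major layout and the exact pairing rule for the stride are my assumptions for illustration only. The first routine uses the axpy strategy, the second the dot strategy, and the third is the dot strategy handling two columns at a time, `stride` columns apart, roughly what NDPM==2 with a nonzero STRIDE is described as doing.

```c
#include <stddef.h>

/* y += A*x for column-major A (m x n, leading dimension lda).
 * axpy strategy: each column of A, scaled by x[j], is added into y. */
void gemv_n_axpy(size_t m, size_t n, const double *a, size_t lda,
                 const double *x, double *y)
{
    for (size_t j = 0; j < n; j++) {
        const double xj = x[j];
        const double *col = a + j * lda;
        for (size_t i = 0; i < m; i++)
            y[i] += xj * col[i];
    }
}

/* y += A^T*x. dot strategy: y[j] accumulates dot(column j of A, x). */
void gemv_t_dot(size_t m, size_t n, const double *a, size_t lda,
                const double *x, double *y)
{
    for (size_t j = 0; j < n; j++) {
        const double *col = a + j * lda;
        double acc = 0.0;
        for (size_t i = 0; i < m; i++)
            acc += col[i] * x[i];
        y[j] += acc;
    }
}

/* Same result as gemv_t_dot, but columns j and j+stride are processed
 * together so each x[i] is loaded once and used twice (hypothetical
 * analogue of NDPM==2 with a given STRIDE). */
void gemv_t_dot2(size_t m, size_t n, const double *a, size_t lda,
                 const double *x, double *y, size_t stride)
{
    if (stride == 0)
        stride = 1;
    for (size_t b = 0; b < n; b += 2 * stride) {
        for (size_t j = b; j < b + stride && j < n; j++) {
            const size_t k = j + stride;      /* partner column */
            const double *c0 = a + j * lda;
            double acc0 = 0.0, acc1 = 0.0;
            if (k < n) {
                const double *c1 = a + k * lda;
                for (size_t i = 0; i < m; i++) {
                    const double xi = x[i];
                    acc0 += c0[i] * xi;
                    acc1 += c1[i] * xi;
                }
                y[k] += acc1;
            } else {                          /* no partner: single column */
                for (size_t i = 0; i < m; i++)
                    acc0 += c0[i] * x[i];
            }
            y[j] += acc0;
        }
    }
}
```

For the 2 x 3 matrix with columns (1,2), (3,4), (5,6), gemv_n_axpy with x = (1,1,1) adds (9,12) to y, and both transposed routines with x = (1,2) add (5,11,17) to y for any stride.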
5) Performance: This code is selected over the default atlas code in
all cases, but in some cases the margin is small:
Key:
850n -- Coppermine 850, new code
850o -- atlas 3.0 lib compiled on PII 350 run on
Coppermine 850
450n -- Katmai 450, new code
450o -- atlas 3.0 lib compiled on PII 350 run on
Katmai 450
------------------------------- GEMV --------------------------------
TST# TR M N ALPHA LDA INCX BETA INCY TIME MFLOP SpUp TEST
==== == ==== ==== ===== ==== ==== ===== ==== ====== ===== ===== =====
s850n N 1000 1000 1.0 1000 1 1.0 1 0.01 319.5 3.26 PASS
s850n T 1000 1000 1.0 1000 1 1.0 1 0.01 330.2 3.13 PASS
d850n N 1000 1000 1.0 1000 1 1.0 1 0.01 151.2 2.69 PASS
d850n T 1000 1000 1.0 1000 1 1.0 1 0.01 157.2 1.64 PASS
c850n N 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.02 436.7 2.50 PASS
c850n T 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.02 480.4 2.65 PASS
z850n N 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.03 286.8 2.55 PASS
z850n T 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.03 305.0 2.73 PASS
s850o N 1000 1000 1.0 1000 1 1.0 1 0.01 176.9 1.00 PASS
s850o T 1000 1000 1.0 1000 1 1.0 1 0.01 145.7 1.00 PASS
d850o N 1000 1000 1.0 1000 1 1.0 1 0.02 82.9 1.00 PASS
d850o T 1000 1000 1.0 1000 1 1.0 1 0.02 91.7 1.00 PASS
c850o N 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.03 237.2 0.99 PASS
c850o T 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.04 200.2 1.00 PASS
z850o N 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.06 133.4 1.01 PASS
z850o T 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.06 142.3 1.01 PASS
s450n N 1000 1000 1.0 1000 1 1.0 1 0.01 279.0 4.38 PASS
s450n T 1000 1000 1.0 1000 1 1.0 1 0.01 300.1 4.17 PASS
d450n N 1000 1000 1.0 1000 1 1.0 1 0.02 117.2 3.25 PASS
d450n T 1000 1000 1.0 1000 1 1.0 1 0.01 143.6 2.12 PASS
c450n N 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.02 392.1 3.45 PASS
c450n T 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.02 417.7 3.43 PASS
z450n N 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.04 218.3 3.10 PASS
z450n T 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.03 249.5 3.48 PASS
s450o N 1000 1000 1.0 1000 1 1.0 1 0.01 159.8 0.95 PASS
s450o T 1000 1000 1.0 1000 1 1.0 1 0.02 131.2 1.00 PASS
d450o N 1000 1000 1.0 1000 1 1.0 1 0.02 95.2 1.00 PASS
d450o T 1000 1000 1.0 1000 1 1.0 1 0.02 93.9 1.00 PASS
c450o N 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.04 192.1 1.00 PASS
c450o T 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.05 174.7 1.09 PASS
z450o N 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.06 134.4 1.15 PASS
z450o T 1000 1000 1.0 0.0 1000 1 1.0 0.0 1 0.06 126.4 1.02 PASS
------------------------------ GER -----------------------------
TST# M N ALPHA INCX INCY LDA TIME MFLOP SpUp TEST
==== ===== ===== ===== ==== ==== ===== ====== ====== ===== =====
s850n 1000 1000 1.0 1 1 1000 0.02 100.1 1.18 PASS
d850n 1000 1000 1.0 1 1 1000 0.04 46.5 1.03 PASS
c850n 1000 1000 1.0 0.0 1 1 1000 0.04 202.3 1.28 PASS
c850n 1000 1000 1.0 0.0 1 1 1000 0.04 202.3 1.29 PASS
z850n 1000 1000 1.0 0.0 1 1 1000 0.08 99.6 1.19 PASS
z850n 1000 1000 1.0 0.0 1 1 1000 0.09 85.0 1.08 PASS
s850o 1000 1000 1.0 1 1 1000 0.02 83.6 1.00 PASS
d850o 1000 1000 1.0 1 1 1000 0.05 42.1 0.96 PASS
c850o 1000 1000 1.0 0.0 1 1 1000 0.05 155.0 1.00 PASS
c850o 1000 1000 1.0 0.0 1 1 1000 0.05 155.0 1.00 PASS
z850o 1000 1000 1.0 0.0 1 1 1000 0.10 83.2 1.00 PASS
z850o 1000 1000 1.0 0.0 1 1 1000 0.10 83.5 1.00 PASS
s450n 1000 1000 1.0 1 1 1000 0.02 105.9 1.98 PASS
d450n 1000 1000 1.0 1 1 1000 0.04 51.9 1.35 PASS
c450n 1000 1000 1.0 0.0 1 1 1000 0.04 196.1 1.88 PASS
c450n 1000 1000 1.0 0.0 1 1 1000 0.04 196.1 1.87 PASS
z450n 1000 1000 1.0 0.0 1 1 1000 0.08 101.7 1.92 PASS
z450n 1000 1000 1.0 0.0 1 1 1000 0.08 102.2 1.92 PASS
s450o 1000 1000 1.0 1 1 1000 0.03 75.0 1.00 PASS
d450o 1000 1000 1.0 1 1 1000 0.04 49.0 1.00 PASS
c450o 1000 1000 1.0 0.0 1 1 1000 0.08 104.4 0.99 PASS
c450o 1000 1000 1.0 0.0 1 1 1000 0.08 105.0 1.00 PASS
z450o 1000 1000 1.0 0.0 1 1 1000 0.15 54.4 1.00 PASS
z450o 1000 1000 1.0 0.0 1 1 1000 0.15 54.7 1.00 PASS
6) Issues:
a) The prefetch distance seems to be a function of the cpu/bus
speed ratio, and may also be different for Coppermine
vs. Katmai. I therefore left this as a settable parameter,
even though I could not find significant repeatable gains over
the default of two cache-line lengths ahead for any case on the
850 MHz Coppermine I used for testing. This may also interact
with b) below.
b) STRIDE: Double precision loves a stride around the
regrettably large value of 20 (10 for complex). This is
apparently getting around the blocking in some way I don't
really understand. I leave the stride out of the
{d,z}cases.dsc unroll value, and seem to get good results.
This doesn't seem satisfactory, but it's what works best here
so far.
c) Inlining: the routine in camm_dpa.h cannot currently be
inlined. I have included an effective workaround for gcc by
defining a NO_INLINE macro in camm_util.h and invoking it in this
function. Don't know about other compilers. I believe I can
fix this quickly, but I didn't want to hold up releasing for
this.
d) L3 arbitrary KB cleanup: With a few extra macros, this code
should make a nice K cleanup when looped externally over B. I
also have one that doesn't worry about alignment which gets
~1100 MFLOPS on an 850 (if memory serves), but this is not
included here. I thought the loop over l2 would be better,
but again didn't want to hold up a release.
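
For readers who want to experiment with issues a) and c) outside the tarball, here is a rough standalone C sketch of a settable prefetch distance and a no-inline marker. It is not the code in camm_dpa.h or camm_util.h. The function name, the default of 64 bytes (two cache lines, assuming 32-byte lines), and the use of gcc's `__builtin_prefetch` and `__attribute__((noinline))` are assumptions for illustration; the real NO_INLINE macro may well be defined differently.

```c
#include <stddef.h>

/* Prefetch distance in bytes, overridable with -DPREFETCH=... at
 * compile time. The default assumes two 32-byte cache lines. */
#ifndef PREFETCH
#define PREFETCH 64
#endif

/* Hypothetical stand-in for a NO_INLINE macro: a gcc attribute where
 * available, nothing elsewhere. */
#if defined(__GNUC__)
#define NO_INLINE __attribute__((noinline))
#define PREFETCH_HINT(p) __builtin_prefetch(p)
#else
#define NO_INLINE
#define PREFETCH_HINT(p) ((void)0)
#endif

/* Single-precision dot product, unrolled by four, issuing a prefetch
 * hint PREFETCH bytes ahead of the current position in each vector. */
NO_INLINE float dot_prefetch(size_t n, const float *x, const float *y)
{
    const size_t ahead = PREFETCH / sizeof(float);
    float acc = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        if (i + ahead < n) {          /* stay inside the arrays */
            PREFETCH_HINT(&x[i + ahead]);
            PREFETCH_HINT(&y[i + ahead]);
        }
        acc += x[i]     * y[i];
        acc += x[i + 1] * y[i + 1];
        acc += x[i + 2] * y[i + 2];
        acc += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)                /* remainder when n % 4 != 0 */
        acc += x[i] * y[i];
    return acc;
}
```

Rebuilding with different -DPREFETCH values and timing the routine is one way to check whether the distance matters on a given cpu/bus combination; the hint never changes the numerical result.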
Take care,
--
Camm Maguire camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah