Re: Testing ATLAS with user contributed code.
Peter,
>I am trying to benchmark ATLAS using my generated kernel and cleanups for
>a varying number of blocksizes. I would like to build ATLAS with a
>blocksize ranging from e.g. 2 to 100 and to test each blocksize on
>problems from size 100 to 1000.
>
>So, the questions:
>
>1: Is it at all possible to force ATLAS to use my code for a blocksize of
>say 2, or will it choose its own generated code with a more sane
>blocksize?
I gotta ask: why on god's green earth do you want to try NB<16? I understand
not trusting the kernel timer to give you best NB, but I usually only try
a couple (usually smaller, since it has less cleanup overhead) . . .
Unfortunately, forcing a particular NB is not that simple, and is very
time consuming. I usually figure out I've screwed it up myself when my
tester seg faults. In general, changing tune/blas/gemm/<arch>/res/<pre>NB
to your blocking factor is the first step, then you rerun the search.
This may or may not be all you need to do . . .
I include below a half-complete, and undoubtedly incorrect, documentation file
I started to produce describing the generation process and the output
files the search produces. Understanding and playing with these guys,
and the intermediate output files produced in res/, is the key to success.
Examining how the various searches work (ummsearch and mmsearch being the
big-ticket items), and the intermediate outputs they produce, will be revealing . . .
As for forcing ATLAS to use a suboptimal case, this is also not always
easy. The way I do it is to run the search, and if my case is not chosen, I
go look at its timer output file in res/, increase the numbers there so it is
the fastest, and then rerun the search, letting it choose my artificially
inflated case. For instance, if you want the third user index entry to win,
with nb=32, you first cause the search to call it, then you edit
res/duser003_32x32x32, and pump up the mflop numbers written there so that they
are the best; the next time the search is run, it will be taken as the
best case . . .
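To make the file naming in that example concrete, here is a minimal sketch that builds the name of the timing file to edit. The helper user_result_path is hypothetical, and the three-digit zero padding and the NBxNBxNB suffix are assumptions generalized from the single example res/duser003_32x32x32; check the names the search actually writes into res/ before relying on it.

```python
def user_result_path(pre, index, nb):
    """Return the res/ file name for one user-supplied kernel timing.

    Assumption: the user index is zero-padded to three digits and all
    three dimensions equal nb, as in res/duser003_32x32x32.
    """
    if pre not in ("s", "d", "c", "z"):
        raise ValueError("unknown precision prefix: %r" % (pre,))
    return "res/%suser%03d_%dx%dx%d" % (pre, index, nb, nb, nb)


# Third user index entry, double precision, nb=32.
print(user_result_path("d", 3, 32))
```

The sketch only names the file; the layout of the numbers inside it is not described here, so editing the mflop values is still best done by hand after looking at the file.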
>2: How do I run the tests for varying problem sizes? It is a standard test
>that I am thinking of, I have just forgotten how to call it.
ATLAS/doc/TestTime.txt, Section 4. You can also build the testers, and
for instance, type "./xdl3blastst -help" . . .
>3: How do I do this with the least amount of overhead since I have to
>search for kernels, build ATLAS and run tests some 50 times for each
>architecture.
The basics are on page 22 of ATLAS/doc/atlas_contrib.ps. I think for all
the stuff you are planning to do, you are going to have to do quite a bit
of learning if you want to automate this whole process. It would take me
several days to produce a document with the canned answers for you, and I
don't have that kind of time at the moment . . .
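For what it is worth, the outer structure of such an automation can be sketched independently of the ATLAS details. The sketch below is only a skeleton under stated assumptions: run_sweep, build, and time_case are hypothetical names, the two callables stand in for "force this NB, rerun the search, and build" and "time one problem size", and the step sizes (NB in steps of 2, problem sizes in steps of 100) are guesses chosen to match "some 50 times". The one design point it shows is that the expensive build happens once per blocksize, and all problem sizes are then timed against that build.

```python
def run_sweep(nb_values, sizes, build, time_case):
    """Run build(nb) once per blocksize, then time every problem size.

    build and time_case are caller-supplied callables (hypothetical
    stand-ins for the real ATLAS build and tester invocations).
    """
    results = {}
    for nb in nb_values:
        build(nb)  # expensive step: once per blocksize
        for n in sizes:
            results[(nb, n)] = time_case(nb, n)  # cheaper step: once per size
    return results


# Dry run with stand-ins, just to count how much work the sweep implies.
builds = []
timings = run_sweep(range(2, 101, 2), range(100, 1001, 100),
                    builds.append, lambda nb, n: None)
print(len(builds), "builds,", len(timings), "timings")
```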
Cheers,
Clint
This file documents the order in which files are generated in ATLAS. If you
are crazy enough, it can be used as a starting point for building ATLAS
by hand, rather than letting install do it.
Stage 1 : System discovery/aux compile
(1) cd ATLAS/src/auxil/<arch> ; make lib
    HEADERS                        RESULTS
    atlas_type.h                   res/[s,d]MULADD
    atlas_[s,d,c,z]sysinfo.h       res/[s,d]nreg
                                   res/L1CacheSize
Stage 2 : Type-dependent tuning (pre = d, s, z, c)
NOTE: right now, the Level 3 are tuned first, followed by Level 2
because the Level 2 can call the Level 3 for gemv. It should be
the other way around, but it ain't :)
(1) Run ATLAS/tune/blas/gemm/<arch>/xmmsearch -p <pre>, creating
    ATLAS/include/<arch>/<pre>mm.h & ATL_<pre>NCmm.h, and
    res/:
       dgMMRES         : generated NBmm kernel results
       dMMRES          : generated & user NBmm kernel results
       dClean[M,N,K]   : generated cleanup results
       duMMRES         : user-supplied kernel NBmm results
       duClean[M,N,K]  : best user-supplied cleanups
       duClean[M,N,K]F : user-supplied cleanups that beat generated cases
       dbest[N,T][N,T]_0x0x0 : best no-copy case with no fixed loop dimension
       dbest[N,T][N,T]_0x0x<nb> : best no-copy case with M and N loop
                                  parameters variable, but K-loop fixed at <nb>
       dbest[N,T][N,T]_<nb>x<nb>x<nb> : best no-copy case with all loop
                                        dimensions fixed to <nb>
(2) If first precision, run ATLAS/tune/blas/gemm/<arch>/x<pre>findCE,
    creating ATLAS/include/<arch>/atlas_cacheedge.h
(3) Run ATLAS/tune/blas/gemm/<arch>/x<pre>Run_tfc, creating
    ATLAS/include/<arch>/<pre>Xover.h
(4) GEMV tune, creating ATLAS/include/<arch>/atlas_<pre>mv.h,
    atlas_<pre>mv[N,T].h
(5) GER tune, creating ATLAS/include/<arch>/atlas_<pre>r1.h
Stage 3 : General library build
(1) Finish all compilation
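
The bracketed patterns in the Stage 2, step (1) listing, such as dbest[N,T][N,T]_0x0x<nb>, each stand for several concrete files. As a reading aid, the following sketch expands the three no-copy patterns for a given precision and blocking factor. The helper best_nocopy_files is hypothetical, and it assumes each bracket means "either N or T" (presumably one transpose setting per GEMM operand); it does not check that the search actually writes every one of these files.

```python
from itertools import product


def best_nocopy_files(pre, nb):
    """Expand the <pre>best[N,T][N,T]_* patterns into concrete names.

    Assumption: each [N,T] bracket is an independent choice of N or T.
    """
    shapes = [
        "0x0x0",                      # no fixed loop dimension
        "0x0x%d" % nb,                # only the K loop fixed at nb
        "%dx%dx%d" % (nb, nb, nb),    # all loop dimensions fixed at nb
    ]
    return [
        "%sbest%s%s_%s" % (pre, ta, tb, shape)
        for shape in shapes
        for ta, tb in product("NT", repeat=2)
    ]


for name in best_nocopy_files("d", 32):
    print("res/" + name)
```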