Re: FW: newbie Athlon optimization question...
Reading over the documentation online, it seems like what I need to do is
basically just write some faster kernel implementation and stick it in
ATLAS/tune/blas/gemm/CASES, as well as create one of those description
files. It also mentioned that the main bulk of time was taken up by the
matrix-matrix multiply algorithm. I'm trying to base a faster Athlon
implementation off of some existing ATLAS code. Of the files listed
below, which one actually is the matrix-matrix multiply source for the
kernel? BTW, this was taken from my tracing of HPL, as mentioned in the
earlier email:
...
cblas_dgemm.c: cblas_dgemm()
ATL_gemm.c: ATL_dgemm()
ATL_gemmXX.c: ATL_dGEMM2NN()
ATL_mmJIK.c: ATL_dmmJIK()
ATL_mmJIK.c: ATL_dmmJIK2()
ATL_dNBmm_b1.c: ATL_dJIK60x60x60TN60x60x0_a1_b1()
I'm guessing it's ATL_mmJIK.c. So in a nutshell, I could spend my time
examining that file, produce an optimized version, stick it in the proper
directory and follow the instructions to compile ATLAS with my new kernel,
correct? Lastly, is there an easy way to find out if ATLAS really did use
my source file/function? Or do I have to trace the execution of a program
to find out which function was called? Thanks for your help.
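For reference while reading the trace, here is a rough sketch in plain C of
what the innermost routine in that list appears to compute. It assumes my
reading of the name is right: "TN" meaning A is transposed and B is not,
"a1_b1" meaning alpha = 1 and beta = 1, and column-major storage. The
function name ref_mm_tn_a1_b1 is made up for illustration; this is not
ATLAS source, only an unoptimized reference to compare a tuned kernel
against.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical reference version of a "TN", alpha = 1, beta = 1 kernel:
 *     C += A^T * B        (column-major storage assumed)
 * A is stored as a K x M array (lda >= K), so column i of the stored
 * array is row i of A^T.  B is K x N (ldb >= K), C is M x N (ldc >= M).
 */
void ref_mm_tn_a1_b1(int M, int N, int K,
                     const double *A, int lda,
                     const double *B, int ldb,
                     double *C, int ldc)
{
    assert(lda >= K && ldb >= K && ldc >= M);

    for (int j = 0; j < N; j++) {              /* J loop: columns of C and B */
        const double *b = B + (size_t)j * ldb; /* column j of B */
        for (int i = 0; i < M; i++) {          /* I loop: rows of C */
            const double *a = A + (size_t)i * lda; /* row i of A^T */
            double sum = C[i + (size_t)j * ldc];   /* beta = 1: keep old C */
            for (int k = 0; k < K; k++)        /* K loop: dot product */
                sum += a[k] * b[k];            /* alpha = 1: no scaling */
            C[i + (size_t)j * ldc] = sum;
        }
    }
}
```

With M = N = K = 60 and lda = ldb = 60 this would cover the 60x60x60 case
named in the trace. Both operands are walked contiguously along k, which is
presumably why the transposed form is used for the inner block.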
original message:
Hi Jeff,
you have found the main ATLAS kernel, so it is no wonder that the program
spends 73% of its time here. I don't think messing with ATLAS's code would
do much good. It's hard to optimize something to be fast on all platforms,
because you would have to test the code on every supported platform to make
sure that it was faster than the code already contained in ATLAS. A better
approach would be to optimize the kernel you mentioned for the Athlon. You
can do that by submitting your own kernel, and ATLAS will then choose your
kernel if it is faster than the ones generated by ATLAS. Check out
atlas_contrib.ps in the ATLAS/doc/ directory. Since work done by the kernel
takes up 73% of the time, any speedup would be good. Theoretical peak
performance for the Athlon is 2 flops per clock cycle, and ATLAS currently
gets around 1.2, so there is room for optimization if you like x87
assembly. There has been some discussion previously on the list about
Athlon optimizations. For some general techniques you can also look at
http://www.cs.utk.edu/~soender/atlas/doc/atl_report.ps
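To put the flops-per-cycle figures in concrete terms, here is a small
sketch in C of the arithmetic. It assumes the usual 2*M*N*K operation count
for a matrix multiply; the function names are made up for illustration, and
the timing and clock rate are whatever you measure on your own machine.

```c
#include <assert.h>

/* Floating-point operations in an M x N x K matrix multiply:
 * one multiply and one add per inner-loop iteration. */
double gemm_flops(double m, double n, double k)
{
    return 2.0 * m * n * k;
}

/* Achieved flops per clock cycle for a run that took `seconds`
 * on a CPU clocked at `clock_hz` cycles per second. */
double flops_per_cycle(double flops, double seconds, double clock_hz)
{
    assert(seconds > 0.0 && clock_hz > 0.0);
    return flops / (seconds * clock_hz);
}

/* Fraction of theoretical peak, e.g. peak = 2.0 flops per cycle. */
double fraction_of_peak(double achieved, double peak)
{
    assert(peak > 0.0);
    return achieved / peak;
}
```

With the numbers above, fraction_of_peak(1.2, 2.0) comes out at 0.6, so
the current kernel reaches about 60% of peak.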
Cheers,
-- /---------------------------------\
Jeff W., jeff@dark-techno.org ICQ# 17989474
"It's substance, not process"
http://dark-techno.org
http://logic-slave.org