Re: ATLAS developer release 3.3.1 is out
Julian,
>(I don't know how you calculate MFLOPS in a copy routine, but I
>think 920MB/s=58 MFLOPS).
Yeah, for routines like copy, we're calling it 1 (2 for cplx) flop per element,
though the correct number is, of course, 0.
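For anyone checking the arithmetic, here is a minimal sketch of that convention. It assumes the 920MB/s figure counts bytes read plus bytes written (as STREAM does) and 8-byte doubles; `copy_mflops` is a made-up helper name, not anything in ATLAS.

```c
#include <assert.h>

/* Hypothetical helper: convert a copy bandwidth in MB/s (bytes read plus
 * bytes written, an assumption about how the figure was measured) into the
 * nominal MFLOPS number, charging flops_per_elem flops per element. */
static double copy_mflops(double mb_per_s, int bytes_per_elem, int flops_per_elem)
{
    assert(bytes_per_elem > 0);
    /* A copy reads each element once and writes it once. */
    double bytes_moved_per_elem = 2.0 * bytes_per_elem;
    /* Millions of elements copied per second. */
    double melems_per_s = mb_per_s / bytes_moved_per_elem;
    return melems_per_s * flops_per_elem;
}
```

With 920MB/s, 8-byte reals and 1 flop per element this gives 57.5, which rounds to the 58 quoted above; double complex (16 bytes, 2 flops per element) comes out the same.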
>Maybe you recall that I have sent you a mail with an Athlon optimized
>STREAM some weeks ago.
>In the sources you find a vector copy routine (dassign.asm) that copies
>a vector with ~920MB/s on my Athlon classic 600/PC133
Great! I really couldn't get prefetch to do anything for me on the
copy the way I was doing things. I knew there had to be a better way.
Just so you know, if you are hinting I should grab your stuff for the
Level 1, it will be a long time before I'm ready to get to this level
of detail. I took the one day to proof the tools, but I'll be busy
adding level 1 ops and getting ATLAS CVS-ready for quite a bit of time . . .
As I said before, I mainly wanted to get something out so others had the
option of playing with the Level 1 (a couple of people have asked about
tuning the Level 1) while I did this boring infrastructure stuff in parallel,
thus possibly leaving me with less level 1 work to do once I'm ready to
start. Also, I must admit that I'm still looking for some applications
to motivate me to get real excited about tuning the level 1 . . .
>It uses MMX/3dnow instructions and bypasses the caches via movntq.
Hmm. This is an interesting point. I must say that when I use dcopy,
I usually expect my output vector to be in cache for reuse, but obviously
skipping the caches for a copy is the way to go, and will kick butt for
those cases you do not plan to immediately reuse Y . . .
Can you cache the output vector and not the input? That would be best
for most of my operations . . .
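To make the cache-bypassing idea concrete, here is a rough C sketch of a copy that writes Y with non-temporal stores. It is not Julian's dassign.asm: it swaps in the SSE2 intrinsic `_mm_stream_pd` (movntpd) in place of MMX movntq, falls back to memcpy when SSE2 is not available, and `copy_nt` is a made-up name. It does not attempt the reverse case I asked about (caching the output but not the input).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical sketch: copy n doubles from x to y (unit stride assumed),
 * storing to y with non-temporal stores so y does not displace cached data. */
static void copy_nt(size_t n, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
#if defined(__SSE2__)
    size_t i = 0;
    /* Scalar prologue until y+i is 16-byte aligned, which _mm_stream_pd requires. */
    while (i < n && ((uintptr_t)(y + i) & 15u) != 0) {
        y[i] = x[i];
        i++;
    }
    /* Main loop: unaligned load from x, streaming (cache-bypassing) store to y. */
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(x + i);
        _mm_stream_pd(y + i, v);
    }
    /* Scalar epilogue for a leftover element. */
    for (; i < n; i++)
        y[i] = x[i];
    /* Order the streaming stores before any later stores. */
    _mm_sfence();
#else
    /* No SSE2: plain copy, which goes through the caches as usual. */
    if (n > 0)
        memcpy(y, x, n * sizeof *x);
#endif
}
```

Whether this wins depends on the caller: if Y is reused right away, the ordinary cached copy is likely the better choice.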
>> Along the same lines, I'm already considering adding support for
>> atlas_set (set a vector to a constant)
>
>dfill() of my STREAM fills a vector with zeros with amazing 1020MB/s on
>my machine. It can be easily modified for all precisions.
Great. I just finished the first hack at ATL_set tuning, and the only
special case for alpha that I allow is for 0; I had heard of special
instructions for zeroing memory; glad to know we'll have it for x86 . . .
ATL_set with alpha=0 is used by ATLAS itself in various places, so this
should be nice indeed . . .
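For illustration only, the alpha=0 special case can be sketched in plain C like this. The name `set_vec` and the unit-stride signature are made up for the example and are not the actual ATL_set interface; memset stands in for whatever fast zeroing routine the platform provides, and it assumes IEEE 754 doubles, where all-zero bytes represent 0.0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: set the n doubles in x to alpha (unit stride assumed). */
static void set_vec(size_t n, double alpha, double *x)
{
    assert(n == 0 || x != NULL);
    if (n == 0)
        return;
    if (alpha == 0.0) {
        /* Special case: zero the memory in bulk. Note that alpha == -0.0
         * also lands here and is written as +0.0. */
        memset(x, 0, n * sizeof *x);
        return;
    }
    /* General case: ordinary store loop. */
    for (size_t i = 0; i < n; i++)
        x[i] = alpha;
}
```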
Thanks for the info,
Clint