
Re: ATHLON performance tips



Hi Julian,

I have been trying to use some of your tricks to speed up ATLAS on the
Athlon (and on my own K6-2). So far I have only gotten a speedup on my
K6-2, but I am optimistic :-)

I would like to ask you (or anyone else) if you know an easy way to force
an assembly instruction to use a specific addressing mode. For example, if
I load the first element with an offset of 0 by:
       movq   (%ecx),%mm6
the instruction will be 3 bytes long, whereas all subsequent instructions
that load data with an offset bigger than 0:
       movq 0x8(%ecx),%mm4
will be 4 bytes long. How do I most easily make
the first instruction 4 bytes long? Putting a rep prefix in front is fun
and works, but using the same address mode would be the proper way to do it.
I have the same problem if I want all instructions to use the "offset
bigger than 128" address mode, even though the offset is lower than 128.
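
To make the size difference concrete, here is a minimal sketch that builds
the instruction bytes by hand. It is not an assembler feature: the helper
names (modrm_mem, movq_load) and the "force" parameter are made up for
illustration, and bases that need special handling (esp, ebp) are left out.
It only shows that the same load can be encoded with no displacement, an
8-bit displacement, or a 32-bit displacement.

```python
def modrm_mem(reg, base, disp=0, force=None):
    """Encode ModRM + displacement for [base + disp] in 32-bit addressing.

    reg and base are 3-bit register numbers. force is None (shortest form),
    "disp8" or "disp32". esp (4) and ebp (5) as base are not supported here
    because they need a SIB byte or a mandatory displacement.
    """
    if base in (4, 5):
        raise ValueError("esp/ebp bases are not handled in this sketch")
    if force is None:
        if disp == 0:
            force = "none"
        elif -128 <= disp <= 127:
            force = "disp8"
        else:
            force = "disp32"
    if force == "none":
        if disp != 0:
            raise ValueError("a nonzero offset needs a displacement")
        mod, tail = 0b00, b""
    elif force == "disp8":
        if not -128 <= disp <= 127:
            raise ValueError("offset does not fit in 8 bits")
        mod, tail = 0b01, disp.to_bytes(1, "little", signed=True)
    elif force == "disp32":
        mod, tail = 0b10, disp.to_bytes(4, "little", signed=True)
    else:
        raise ValueError("unknown form: %r" % (force,))
    return bytes([(mod << 6) | (reg << 3) | base]) + tail


def movq_load(mm, base, disp=0, force=None):
    """Bytes for 'movq disp(base),%mmN'; 0F 6F /r is the MMX load opcode."""
    return b"\x0f\x6f" + modrm_mem(mm, base, disp, force)


ECX = 1  # register number of %ecx in the ModRM byte
```

With this, movq_load(6, ECX) gives the 3-byte form 0f 6f 31, while
movq_load(6, ECX, 0, "disp8") gives 0f 6f 71 00, i.e. the same load with an
explicit zero displacement byte. Such bytes could be emitted with a .byte
directive, which is of course not the elegant solution asked for.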

I hope someone knows an elegant solution.

cheers,

Peter. 

On Fri, 18 May 2001, R Clint Whaley wrote:

> I've been corresponding with Julian Ruhe, and he had some tips on Athlon
> optimization I thought might be of general interest.  I forward on what
> he sent to me, in case someone is looking at this architecture . . .
> 
> Cheers,
> Clint
> 
> Now some information how to achieve peak performance on Athlon:
> (1) The first and most important thing is that you must put three x86 instructions into packages
> of exactly 8 bytes to make the decoders run as smoothly as possible. Additionally, these
> packages must be 8-byte aligned. If one of these packages consists, for example, of three instructions but
> only 7 bytes, you can use the REP prefix as a natural code filler. If you already have two (longer) instructions
> in one package and your next one does not fit, use a suitable neutral instruction ("nop", "fnop"; see the
> Athlon manual) as code filler and move your instruction into the next block. Here are some examples:
> 
> bad!                                 		
>     20 00000010 DD4038                  		fld qword [eax+7*8]
>     21 00000013 D8C9                    		fmul st0,st1
>     22 00000015 D8C1                    		fadd st0,st1
> 
> good!                   		
>     26 00000020 DD4038                  		fld qword [eax+7*8]
>     27 00000023 D8C9                    		fmul st0,st1
>     28 00000025 F3                      		db 0F3H
>     29 00000026 D8C1                    		fadd st0,st1
> 
> good!
>     33 00000030 DC4C1818                		fmul qword [eax+ebx+3*8]
>     34 00000034 DC4020                  		fadd qword [eax+4*8]
>     35 00000037 90                      		nop              
> 
> (2) To keep instructions short and to achieve FP peak performance you MUST use only 8-bit displacements when
> operating with memory operands. Otherwise it is not possible to put 1 fadd + 1 fmul + 1 fld/fst into an
> 8-byte package. Example:
> 
> bad!
>     39 00000040 DD8038020000            		fld qword [eax+71*8]
>     40 00000046 D8C9                    		fmul st0,st1
>     41 00000048 D8C1                    		fadd st0,st1		
> 
> (3) In practice it is not possible to achieve peak in 3DNow! because 3DNow! instructions are too long. Example:
> 
>     45 00000050 0F6F01                  		movq mm0,[ecx]
>     46 00000053 0F0FC1B4                		pfmul mm0,mm1
>     47 00000057 0F0FD09E                		pfadd mm2,mm0
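
The packing rule from (1) to (3) can be checked mechanically from the
listings. The following is a minimal sketch, assuming the rule is exactly
"three instructions per 8-byte-aligned, 8-byte packet"; the function name
check_packets is made up, and the instruction lengths are read off the byte
columns above (a REP prefix is counted as part of the instruction it
precedes).

```python
def check_packets(start, lengths, per_packet=3, size=8):
    """Return one (address, total_bytes, ok) tuple per group of instructions.

    start   -- address of the first instruction
    lengths -- encoded length in bytes of each instruction, in order
    A group is ok when it starts on a 'size'-byte boundary, holds exactly
    'per_packet' instructions, and fills exactly 'size' bytes.
    """
    report = []
    addr = start
    for i in range(0, len(lengths), per_packet):
        group = lengths[i:i + per_packet]
        total = sum(group)
        ok = addr % size == 0 and len(group) == per_packet and total == size
        report.append((addr, total, ok))
        addr += total
    return report


# Lengths taken from the listings in (1), (2) and (3).
bad_7_bytes = check_packets(0x10, [3, 2, 2])   # fld, fmul, fadd
good_rep = check_packets(0x20, [3, 2, 3])      # fadd carries the F3 prefix
good_nop = check_packets(0x30, [4, 3, 1])      # fmul, fadd, nop filler
bad_disp32 = check_packets(0x40, [6, 2, 2])    # 32-bit displacement on fld
bad_3dnow = check_packets(0x50, [3, 4, 4])     # movq, pfmul, pfadd
```

Only the two "good!" groups come out as 8 bytes; the 32-bit displacement
case needs 10 bytes and the 3DNow! triple needs 11, so neither can fit a
single packet.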
> 
> (4) Athlon's prefetch instructions are not compatible with Intel's. Athlon can only prefetch into the L1 cache, so
> all variants behave identically: prefetch = prefetchnta = prefetcht0 = prefetcht1 = prefetcht2.
> 
> (5) Athlon's instruction scheduler is very aggressive and its register renaming is very intelligent. Let them do
> most of the work. It is not a problem to initiate the calculation of a dependency chain like b=b*a and c=c+b in one cycle.
> Additionally, you can use 'a' in the next cycle without waiting for b=b*a to complete, because Athlon's
> register renaming makes a copy of 'a' for you internally. Example:
> 
> Assuming our stack already looks like this:  c0 c1 c2 c3 a0 <-top
> If we want to calculate the following:
> 
> c0=c0+a0*b0
> c1=c1+a0*b1
> c2=c2+a0*b2
> c3=c3+a0*b3
> c0=c0+a0*b4
> c1=c1+a0*b5
> [...]
> 
> ... the pseudo-assembly code would look like this (with stack view):
> 
> fld b0       c0c1c2c3a0b0   
> fmul b0,a0   c0c1c2c3a0b0   
> faddp c0,b0    c0c1c2c3a0
> 
> fld b1       c0c1c2c3a0b1
> fmul b1,a0   c0c1c2c3a0b1
> faddp c1,b1    c0c1c2c3a0
> [...]
> 
> This code runs at peak performance, which is clear when you look at how the instructions are scheduled by the CPU:
> 
> cycle 0 >| 
>          |
> fld      oo
> fmul       oooo
> faddp          oooo        <- c0 calculated
> fld       oo
> fmul        oooo           <- uses copy of original 'a0'
> faddp           oooo       <- c1 calculated
> fld        oo
> fmul         oooo
> faddp            oooo      <- c2 calculated
> fld         oo
> fmul          oooo
> faddp             oooo     <- c3 calculated
> fld          oo
> fmul           oooo        
> faddp              oooo
>                    |<- in this cycle c0 is free again (see above) and exactly here the new c0 calculation starts
> 
> As you can see, this code needs only 6 stack registers to achieve peak performance.
>
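
The register count in (5) can be checked with a small model of the diagram.
This is only a sketch under stated assumptions: the latencies (fld 2 cycles,
fmul 4, faddp 4) are read off the "oo"/"oooo" bars above, one
fld/fmul/faddp triple is assumed to start per cycle, and the only hazard
modelled is an faddp waiting for the previous faddp on the same accumulator.
The function name faddp_schedule is made up.

```python
FLD, FMUL, FADD = 2, 4, 4  # latencies in cycles, as drawn in the diagram


def faddp_schedule(n_triples, n_acc):
    """Return (start_cycle, stall_cycles) of the faddp in each triple.

    Triple k starts its fld in cycle k and accumulates into c[k % n_acc].
    Its faddp can start once its own fmul is done and the previous faddp
    on the same accumulator has finished.
    """
    acc_ready = [0] * n_acc          # cycle in which each accumulator is free
    result = []
    for k in range(n_triples):
        earliest = k + FLD + FMUL    # own fld and fmul have completed
        acc = k % n_acc
        start = max(earliest, acc_ready[acc])
        acc_ready[acc] = start + FADD
        result.append((start, start - earliest))
    return result


four_acc = faddp_schedule(32, 4)   # c0..c3 as in the example
three_acc = faddp_schedule(32, 3)  # one accumulator fewer
```

With four accumulators the fifth faddp starts in cycle 10, exactly when the
first one (started in cycle 6) has finished, and no faddp ever stalls. With
three accumulators the fourth faddp already has to wait one cycle. Four
accumulators plus a0 plus the freshly loaded b give the 6 stack registers
mentioned above.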