Re: ATHLON performance tips
Hi Julian,
I have been trying to use some of your tricks to speed up ATLAS on the
athlon (and on my own K6-2), so far I have only gotten a speedup on my
K6-2, but I am optimistic :-)
I would like to ask you (or anyone else) if you know an easy way to force
an assembly instruction to use a specific addressing mode. For example, if
I load the first element with an offset of 0 by:
movq (%ecx),%mm6
the instruction will be 3 bytes long, whereas all subsequent instructions
that load data with a nonzero offset, such as:
movq 0x8(%ecx),%mm4
will be 4 bytes long. How do I most easily make
the first instruction 4 bytes long? Putting a rep prefix in front is fun
and works, but using the same address mode would be the proper way to do it.
I have the same problem if I want all instructions to use the "offset
of 128 or more" address mode, even though the offset is lower than 128.
I hope someone knows an elegant solution.
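One possible approach, sketched here under assumptions rather than offered
as a known answer: compute the encoding by hand and emit it with the
assembler's .byte directive. The Python helper below builds the bytes for
"movq disp(%base),%mmN" with the displacement width forced to 8 or 32 bits.
The names movq_load and as_gas are hypothetical, not part of any tool, and
only plain base registers are covered (%esp is left out because it needs a
SIB byte).

```python
# ModRM register numbers in 32-bit mode. %esp is omitted on purpose:
# using it as a base requires an extra SIB byte, which this sketch skips.
GPR = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3, "ebp": 5, "esi": 6, "edi": 7}


def movq_load(mm, base, disp=0, width=8):
    """Bytes for `movq disp(%base),%mmN` with a forced displacement width."""
    if not 0 <= mm <= 7:
        raise ValueError("mm register must be 0..7")
    rm = GPR[base]
    if width == 8:
        if not -128 <= disp <= 127:
            raise ValueError("displacement does not fit in 8 bits")
        mod, tail = 0b01, disp.to_bytes(1, "little", signed=True)
    elif width == 32:
        mod, tail = 0b10, disp.to_bytes(4, "little", signed=True)
    else:
        raise ValueError("width must be 8 or 32")
    modrm = (mod << 6) | (mm << 3) | rm
    # 0F 6F /r is the load form: movq mm, mm/m64
    return bytes([0x0F, 0x6F, modrm]) + tail


def as_gas(code):
    """Format the bytes as a .byte line for the assembler source."""
    return ".byte " + ", ".join(f"0x{b:02x}" for b in code)


print(as_gas(movq_load(6, "ecx", 0, 8)))   # .byte 0x0f, 0x6f, 0x71, 0x00
```

The printed line is 4 bytes long and could stand in for "movq (%ecx),%mm6";
with width=32 the same load comes out 7 bytes long.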
cheers,
Peter.
On Fri, 18 May 2001, R Clint Whaley wrote:
> I've been corresponding with Julian Ruhe, and he had some tips on Athlon
> optimization I thought might be of general interest. I forward on what
> he sent to me, in case someone is looking at this architecture . . .
>
> Cheers,
> Clint
>
> Now some information on how to achieve peak performance on Athlon:
> (1) The first and most important thing is that you must put three x86 instructions into packages
> of exactly 8 bytes to make the decoders run as smoothly as possible. Additionally, these
> packages must be 8-byte aligned. If one of these packages, for example, consists of three instructions but
> only 7 bytes, you can use the REP prefix as a natural code filler. If you already have two (longer) instructions
> in one package and your next does not fit, use a suitable neutral instruction ("nop", "fnop", see the
> Athlon manual) as code filler and move your instruction into the next block. Here are some examples:
>
> bad!
> 00000010 DD4038        fld qword [eax+7*8]
> 00000013 D8C9          fmul st0,st1
> 00000015 D8C1          fadd st0,st1
>
> good!
> 00000020 DD4038        fld qword [eax+7*8]
> 00000023 D8C9          fmul st0,st1
> 00000025 F3            db 0F3H
> 00000026 D8C1          fadd st0,st1
>
> good!
> 00000030 DC4C1818      fmul qword [eax+ebx+3*8]
> 00000034 DC4020        fadd qword [eax+4*8]
> 00000037 90            nop
>
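The packing rule in (1) can be checked mechanically. The sketch below is one
reading of that rule, not something from the Athlon manual: code must start
on an 8-byte boundary, no instruction may straddle a boundary, each package
holds at most three instructions, and the code must end exactly on a
boundary. The function name fills_packages is hypothetical; a prefix byte
such as REP is counted as part of the instruction that follows it.

```python
def fills_packages(start, lengths, package=8, max_insns=3):
    """True if back-to-back instructions exactly fill aligned packages.

    start   -- address of the first instruction
    lengths -- instruction lengths in bytes, in program order
    """
    if start % package:
        return False                      # first package is misaligned
    addr = start
    per_package = {}
    for n in lengths:
        first = addr // package
        last = (addr + n - 1) // package
        if first != last:
            return False                  # instruction straddles a boundary
        per_package[first] = per_package.get(first, 0) + 1
        if per_package[first] > max_insns:
            return False                  # too many instructions in one package
        addr += n
    return addr % package == 0            # last package must be full


# The three listings from (1): fld/fmul/fadd, the same with a REP prefix
# on fadd (2 + 1 bytes), and fmul/fadd padded with a nop.
print(fills_packages(0x10, [3, 2, 2]))    # False
print(fills_packages(0x20, [3, 2, 3]))    # True
print(fills_packages(0x30, [4, 3, 1]))    # True
```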
> (2) To keep instructions short and to achieve FP peak performance you MUST use only 8-bit displacements when
> operating with memory operands. Otherwise it is not possible to put 1 fadd + 1 fmul + 1 fld/fst into an
> 8-byte package. Example:
>
> bad!
> 00000040 DD8038020000  fld qword [eax+71*8]
> 00000046 D8C9          fmul st0,st1
> 00000048 D8C1          fadd st0,st1
>
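The size limit behind (2) follows from the x86 encoding: an offset that fits
in a signed byte (-128..127) costs one byte, anything larger costs four. The
sketch below works this out for "fld qword [base+disp]" with a plain base
register and no index. The helper names are hypothetical, and the %ebp
special case (a zero offset still needs a displacement byte) is included for
completeness.

```python
def disp_size(disp, base="eax"):
    """Displacement bytes the encoding needs (no SIB byte, base is not esp)."""
    if disp == 0 and base != "ebp":
        return 0          # [base] form, no displacement at all
    if -128 <= disp <= 127:
        return 1          # signed 8-bit displacement
    return 4              # 32-bit displacement


def fld_qword_len(disp, base="eax"):
    """Length of `fld qword [base+disp]`: opcode DD + ModRM + displacement."""
    return 2 + disp_size(disp, base)


print(fld_qword_len(7 * 8))     # 3, matches DD4038
print(fld_qword_len(71 * 8))    # 6, matches DD8038020000
# With 8-byte elements, indices -16..15 are the ones reachable in 3 bytes.
print([i for i in range(-20, 20) if fld_qword_len(i * 8) <= 3][::31])  # [-16, 15]
```

With the 6-byte load, fld + fmul + fadd add up to 10 bytes, which is why the
triple in the listing above cannot fit into one 8-byte package.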
> (3) In practice it is not possible to achieve peak in 3DNow! because 3DNow! instructions are too long. Example
> (11 bytes for three instructions):
>
> 00000050 0F6F01        movq mm0,[ecx]
> 00000053 0F0FC1B4      pfmul mm0,mm1
> 00000057 0F0FD09E      pfadd mm2,mm0
>
> (4) Athlon's prefetch instructions are not compatible with Intel's. Athlon can only prefetch into the L1 cache, so
> prefetch, prefetchnta, prefetcht0, prefetcht1 and prefetcht2 all behave the same.
>
> (5) The Athlon's instruction scheduler is very aggressive and its register renaming is very intelligent. Let them do
> most of the work. It is no problem to initiate the calculation of a dependency chain like b=b*a and c=c+b in one cycle.
> Additionally, you can use 'a' in the next cycle without waiting for b=b*a to complete, because Athlon's
> register renaming makes a copy of 'a' for you internally. Example:
>
> Assume our stack already looks like this: c0 c1 c2 c3 a0 <-top
> If we want to calculate the following:
>
> c0=c0+a0*b0
> c1=c1+a0*b1
> c2=c2+a0*b2
> c3=c3+a0*b3
> c0=c0+a0*b4
> c1=c1+a0*b5
> [...]
>
> .. the pseudo assembly code would look like this (with stack view)
>
> fld b0 c0c1c2c3a0b0
> fmul b0,a0 c0c1c2c3a0b0
> faddp c0,b0 c0c1c2c3a0
>
> fld b1 c0c1c2c3a0b1
> fmul b1,a0 c0c1c2c3a0b1
> faddp c1,b1 c0c1c2c3a0
> [...]
>
> This code runs at peak performance, which is clear when you look at how the instructions are scheduled by the CPU:
>
> cycle 0 >|
> |
> fld oo
> fmul oooo
> faddp oooo <- c0 calculated
> fld oo
> fmul oooo <- uses copy of original 'a0'
> faddp oooo <- c1 calculated
> fld oo
> fmul oooo
> faddp oooo <- c2 calculated
> fld oo
> fmul oooo
> faddp oooo <- c3 calculated
> fld oo
> fmul oooo
> faddp oooo
> |<- in this cycle c0 is free again (see above) and exactly here the new c0 calculation starts
>
> As you can see, this code needs only 6 stack registers to achieve peak performance.
>
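The register count in (5) can be confirmed by emulating the x87 stack. The
sketch below is a plain Python model, not a timing simulation: it replays
the fld / fmul / faddp pattern from the pseudo assembly on a list, tracks
the deepest stack use, and returns the accumulators so they can be compared
with the c_k = c_k + a0*b_i formulas. The name run_kernel and the sample
values are made up for illustration.

```python
def run_kernel(c, a0, b):
    """Replay fld/fmul/faddp on a list standing in for the x87 stack.

    c  -- initial accumulators (c0..c3 in the example)
    a0 -- the value kept just above the accumulators
    b  -- values loaded one at a time
    Returns (final accumulators, maximum stack depth).
    """
    stack = list(c) + [a0]              # c0 c1 c2 c3 a0 <- top
    n_acc = len(c)
    max_depth = len(stack)
    for i, bi in enumerate(b):
        stack.append(bi)                # fld b_i
        max_depth = max(max_depth, len(stack))
        stack[-1] *= stack[-2]          # fmul: top = b_i * a0
        product = stack.pop()           # faddp pops the product ...
        stack[i % n_acc] += product     # ... after adding it to c_k
    return stack[:n_acc], max_depth


acc, depth = run_kernel([0.0, 0.0, 0.0, 0.0], 2.0, [1, 2, 3, 4, 5, 6, 7, 8])
print(acc)     # [12.0, 16.0, 20.0, 24.0]
print(depth)   # 6
```

Four accumulators, a0 and one freshly loaded b give the 6 registers quoted
above. Assuming the timing in the diagram (one fld/fmul/faddp group started
per cycle, four cycles per faddp), four accumulators are also what it takes
for c0 to be free again by the time its next update starts.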