ATHLON performance tips

To: atlas-comm@cs.utk.edu
Subject: ATHLON performance tips
From: R Clint Whaley <rwhaley@cs.utk.edu>
Date: Fri, 18 May 2001 20:08:30 -0400 (EDT)

I've been corresponding with Julian Ruhe, and he had some tips on Athlon
optimization I thought might be of general interest. I am forwarding what
he sent me, in case someone is looking at this architecture . . .

Cheers,
Clint

Now some information on how to achieve peak performance on Athlon:
(1) The first and most important thing is that you must put three x86 instructions into packages
of exactly 8 bytes to make the decoders run as smoothly as possible. Additionally, these
packages must be 8-byte aligned. If one of these packages consists, for example, of three instructions but
only 7 bytes, you can use the REP prefix as a natural code filler. If you already have two (longer) instructions
in one package and your next one does not fit, use a suitable neutral instruction ("nop" or "fnop"; see the
Athlon manual) as a code filler and move your instruction into the next package. Here are some examples:

bad!
00000010  DD4038        fld qword [eax+7*8]
00000013  D8C9          fmul st0,st1
00000015  D8C1          fadd st0,st1

good!
00000020  DD4038        fld qword [eax+7*8]
00000023  D8C9          fmul st0,st1
00000025  F3            db 0F3H
00000026  D8C1          fadd st0,st1

good!
00000030  DC4C1818      fmul qword [eax+ebx+3*8]
00000034  DC4020        fadd qword [eax+4*8]
00000037  90            nop
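
The rule in (1) can be checked mechanically against a listing. The following is a minimal Python sketch and not part of Julian's mail: the helper name check_packages and the input layout are made up for illustration. It assumes each entry is one instruction given as (offset, hex bytes), with a filler prefix such as the 0F3H byte already merged into the instruction it precedes.

```python
def check_packages(listing, package_size=8, per_package=3):
    """Group instructions into aligned windows and test the rule from (1).

    listing: iterable of (offset, hex_bytes) pairs, one per instruction.
    Returns {window_start: (instruction_count, byte_count, ok)}.
    """
    windows = {}
    for offset, hex_bytes in listing:
        length = len(bytes.fromhex(hex_bytes))
        start = offset - offset % package_size
        # An instruction that crosses a window boundary breaks the rule outright.
        straddles = offset + length > start + package_size
        count, total, clean = windows.get(start, (0, 0, True))
        windows[start] = (count + 1, total + length, clean and not straddles)
    return {
        start: (count, total,
                clean and count == per_package and total == package_size)
        for start, (count, total, clean) in windows.items()
    }


# The three examples above; F3 is merged with the fadd it prefixes.
bad = [(0x10, "DD4038"), (0x13, "D8C9"), (0x15, "D8C1")]
good_rep = [(0x20, "DD4038"), (0x23, "D8C9"), (0x25, "F3D8C1")]
good_nop = [(0x30, "DC4C1818"), (0x34, "DC4020"), (0x37, "90")]

print(check_packages(bad))       # {16: (3, 7, False)}
print(check_packages(good_rep))  # {32: (3, 8, True)}
print(check_packages(good_nop))  # {48: (3, 8, True)}
```

The sketch only counts bytes and instructions per aligned window; it does not decode anything, so it is no substitute for checking the assembler's listing output.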

(2) To keep instructions short and to achieve FP peak performance you MUST use only 8-bit displacements
(offsets from -128 to 127) when operating on memory operands. Otherwise it is not possible to put 1 fadd + 1 fmul + 1 fld/fst into an
8-byte package. Example:

bad!
00000040  DD8038020000  fld qword [eax+71*8]
00000046  D8C9          fmul st0,st1
00000048  D8C1          fadd st0,st1
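
The size difference in (2) follows from the x86 addressing-mode encoding: an offset between -128 and 127 is stored in one byte, anything outside that range takes four. Here is a minimal Python sketch of the resulting instruction length. The function names are hypothetical, and it models only the simple [reg+offset] form without a SIB byte.

```python
def disp_bytes(offset):
    """Bytes needed to encode the offset of a [reg+offset] operand."""
    if offset == 0:
        return 0  # [eax] needs no offset byte (this shortcut does not apply to [ebp])
    if -128 <= offset <= 127:
        return 1  # sign-extended 8-bit offset
    return 4      # full 32-bit offset


def x87_mem_length(offset):
    """Opcode byte + ModRM byte + offset, e.g. fld qword [eax+offset]."""
    return 2 + disp_bytes(offset)


print(x87_mem_length(7 * 8))   # 3, matches DD4038
print(x87_mem_length(71 * 8))  # 6, matches DD8038020000

# Indices of 8-byte doubles reachable with the short encoding.
reachable = [i for i in range(-32, 32) if disp_bytes(i * 8) <= 1]
print(reachable[0], reachable[-1])  # -16 15
```

Since the short form covers indices -16 through 15, pointing the base register into the middle of a block reaches 32 doubles with short encodings. That is a general property of the encoding rather than something stated in the mail.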

(3) In practice it is not possible to achieve peak with 3DNow! because 3DNow! instructions are too long. Example:

00000050  0F6F01        movq mm0,[ecx]
00000053  0F0FC1B4      pfmul mm0,mm1
00000057  0F0FD09E      pfadd mm2,mm0

(4) Athlon's prefetch instructions are not compatible with Intel's. Athlon can only prefetch into the L1 cache, so all
variants behave the same: prefetch=prefetchnta=prefetcht0=prefetcht1=prefetcht2

(5) Athlon's instruction scheduler is very aggressive and its register renaming is very intelligent. Let them do
most of the work. It is no problem to initiate the calculation of a dependency chain like b=b*a and c=c+b in one cycle.
Additionally, you can use 'a' in the next cycle without waiting for b=b*a to complete, because Athlon's
register renaming makes a copy of 'a' for you internally. Example:

Assuming our stack already looks like this: c0 c1 c2 c3 a0 <-top
If we want to calculate the following:

c0=c0+a0*b0
c1=c1+a0*b1
c2=c2+a0*b2
c3=c3+a0*b3
c0=c0+a0*b4
c1=c1+a0*b5
[...]

... the pseudo-assembly code would look like this (with stack view):

fld   b0         c0 c1 c2 c3 a0 b0
fmul  b0,a0      c0 c1 c2 c3 a0 b0
faddp c0,b0      c0 c1 c2 c3 a0

fld   b1         c0 c1 c2 c3 a0 b1
fmul  b1,a0      c0 c1 c2 c3 a0 b1
faddp c1,b1      c0 c1 c2 c3 a0
[...]
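
For reference, the arithmetic this sequence performs can be written out in a few lines. This is a plain Python sketch; the function name accumulate and the sample values are invented for illustration. It shows that the four accumulators are updated round-robin, so consecutive multiply-adds do not depend on each other.

```python
def accumulate(c, a0, b):
    """c[i % 4] += a0 * b[i], mirroring c0..c3 += a0*b0..b3, then b4, b5, ..."""
    c = list(c)  # work on a copy so the caller's list is untouched
    for i, bi in enumerate(b):
        c[i % len(c)] += a0 * bi
    return c


result = accumulate([0.0, 0.0, 0.0, 0.0], 2.0, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(result)  # [12.0, 16.0, 6.0, 8.0]
```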

This code runs at peak performance, which is clear when you look at how the instructions are scheduled by the CPU:

cycle 0 >|
         |
fld      oo
fmul       oooo
faddp          oooo       <- c0 calculated
fld       oo
fmul        oooo          <- uses copy of original 'a0'
faddp           oooo      <- c1 calculated
fld        oo
fmul         oooo
faddp            oooo     <- c2 calculated
fld         oo
fmul          oooo
faddp             oooo    <- c3 calculated
fld          oo
fmul           oooo
faddp              oooo
                   |<- in this cycle c0 is free again (see above) and exactly here the new c0 calculation starts

As you can see, this code needs only 6 stack registers to achieve peak performance.
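
The timing argument can be reproduced with a small model. The sketch below is Python and rests on assumptions read off the diagram rather than stated in the mail: latencies of 2 cycles for fld and 4 cycles each for fmul and faddp, and one fld/fmul/faddp triple entering the pipeline per cycle. All names are hypothetical.

```python
FLD, FMUL, FADD = 2, 4, 4  # assumed latencies in cycles, inferred from the diagram


def schedule(groups, accumulators=4):
    """Return (faddp_start, faddp_done) per group, one group issued per cycle."""
    done = {}   # accumulator index -> cycle at which its latest value is ready
    times = []
    for k in range(groups):
        acc = k % accumulators
        product_ready = k + FLD + FMUL                # fld at cycle k, then fmul
        start = max(product_ready, done.get(acc, 0))  # wait for the previous c[acc]
        done[acc] = start + FADD
        times.append((start, done[acc]))
    return times


def stalls(groups, accumulators):
    """Cycles each faddp waits beyond the moment its product is ready."""
    return [start - (k + FLD + FMUL)
            for k, (start, _) in enumerate(schedule(groups, accumulators))]


times = schedule(8)
print(times[0], times[4])  # (6, 10) (10, 14)
print(stalls(8, 4))        # [0, 0, 0, 0, 0, 0, 0, 0]
print(any(stalls(8, 3)))   # True: three accumulators are not enough in this model
```

Under these assumptions the first c0 is ready in cycle 10, exactly when the fifth faddp wants to start, so four accumulators keep the adder busy without stalls. Together with a0 and one slot for the freshly loaded b value that gives the 6 stack registers mentioned above.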

Follow-Ups:
- Re: ATHLON performance tips
  - From: Peter Soendergaard <soender@cs.utk.edu>
