HPL Tuning
After having built the executable hpl/bin/<arch>/xhpl,
one may want to modify the input data file HPL.dat. This file
should reside in the same directory as the executable
hpl/bin/<arch>/xhpl. An example HPL.dat file is
provided by default. This file contains information about the
problem sizes, machine configuration, and algorithm features
to be used by the executable. It is 31 lines long. All the
selected parameters will be printed in the output generated
by the executable.
We first describe the meaning of each line of this input file
below. Finally, a few useful
experimental guide lines to set up the file are given at
the end of this page.
Line 1: (unused) Typically one would use
this line for its own good. For example, it could be used
to summarize the content of the input file. By default this
line reads:
HPL Linpack benchmark input file
Line 2: (unused) same as line 1. By default
this line reads:
Innovative Computing Laboratory, University of Tennessee
Line 3: the user can choose where the
output should be redirected to. In the case of a file, a
name is necessary, and this is the line where one wants to
specify it. Only the first name on this line is significant.
By default, the line reads:
HPL.out output file name (if any)
This means that if one chooses to redirect the output to a
file, the file will be called "HPL.out". The rest of the line
is unused, and this space to put some informative comment on
the meaning of this line.
Line 4: This line specifies where the output
should go. The line is formatted, it must begin with a
positive integer, the rest is unsignificant. 3 choices are
possible for the positive integer, 6 means that the output
will go the standard output, 7 means that the output will
go to the standard error. Any other integer means that the
output should be redirected to a file, which name has been
specified in the line above. This line by default reads:
6 device out (6=stdout,7=stderr,file)
which means that the output generated by the executable
should be redirected to the standard output.
Line 5: This line specifies the number of
problem sizes to be executed. This number should be less than
or equal to 20. The first integer is significant, the rest
is ignored. If the line reads:
3 # of problems sizes (N)
this means that the user is willing to run 3 problem sizes
that will be specified in the next line.
Line 6: This line specifies the problem sizes
one wants to run. Assuming the line above started with 3,
the 3 first positive integers are significant, the rest is
ignored. For example:
3000 6000 10000 Ns
means that one wants xhpl to run 3 (specified in line 5)
problem sizes, namely 3000, 6000 and 10000.
Line 7: This line specifies the number of
block sizes to be runned. This number should be less than or
equal to 20. The first integer is significant, the rest is
ignored. If the line reads:
5 # of NBs
this means that the user is willing to use 5 block sizes that
will be specified in the next line.
Line 8: This line specifies the block sizes
one wants to run. Assuming the line above started with 5,
the 5 first positive integers are significant, the rest is
ignored. For example:
80 100 120 140 160 NBs
means that one wants xhpl to use 5 (specified in line 7)
block sizes, namely 80, 100, 120, 140 and 160.
Line 9: This line specifies how the MPI
processes should be mapped onto the nodes of your platform.
There are currently two possible mappings, namely row- and
column-major. This feature is mainly useful when these nodes
are themselves multi-processor computers. A row-major mapping
is recommended.
Line 10: This line specifies the number of
process grid to be runned. This number should be less than
or equal to 20. The first integer is significant, the rest is
ignored. If the line reads:
2 # of process grids (P x Q)
this means that you are willing to try 2 process grid sizes
that will be specified in the next line.
Line 11-12: These two lines specify the
number of process rows and columns of each grid you want to
run on. Assuming the line above (10) started with 2, the 2
first positive integers of those two lines are significant,
the rest is ignored. For example:
1 2 Ps
6 8 Qs
means that one wants to run xhpl on 2 process grids (line
10), namely 1-by-6 and 2-by-8. Note: In this example, it is
required then to start xhpl on at least 16 nodes (max
of Pi-by-Qi). The runs on the two grids will be consecutive.
If one was starting xhpl on more than 16 nodes, say 52, only
6 would be used for the first grid (1x6) and then 16 (2x8)
would be used for the second grid. The fact that you started
the MPI job on 52 nodes, will not make HPL use all of them.
In this example, only 16 would be used. If one wants to run
xhpl with 52 processes one needs to specify a grid of 52
processes, for example the following lines would do the job:
4 2 Ps
13 8 Qs
Line 13: This line specifies the threshold
to which the residuals should be compared with. The residuals
should be or order 1, but are in practice slightly less than
this, typically 0.001. This line is made of a real number,
the rest is not significant. For example:
16.0 threshold
In practice, a value of 16.0 will cover most cases. For
various reasons, it is possible that some of the residuals
become slightly larger, say for example 35.6. xhpl will flag
those runs as failed, however they can be considered as
correct. A run should be considered as failed if the residual
is a few order of magnitude bigger than 1 for example 10^6 or
more. Note: if one was to specify a threshold of 0.0, all
tests would be flagged as failed, even though the answer is
likely to be correct. It is allowed to specify a negative
value for this threshold, in which case the checks will be
by-passed, no matter what the threshold value is, as soon as
it is negative. This feature allows to save time when
performing a lot of experiments, say for instance during the
tuning phase. Example:
-16.0 threshold
The remaning lines allow to specifies algorithmic features.
xhpl will run all possible combinations of those for each
problem size, block size, process grid combination. This is
handy when one looks for an "optimal" set of parameters. To
understand a little bit better, let say first a few words
about the algorithm implemented in HPL. Basically this is a
right-looking version with row-partial pivoting. The panel
factorization is matrix-matrix operation based and recursive,
dividing the panel into NDIV subpanels at each step. This
part of the panel factorization is denoted below by
"recursive panel fact. (RFACT)". The recursion stops when
the current panel is made of less than or equal to NBMIN
columns. At that point, xhpl uses a matrix-vector operation
based factorization denoted below by "PFACTs". Classic
recursion would then use NDIV=2, NBMIN=1. There are
essentially 3 numerically equivalent LU factorization
algorithm variants (left-looking, Crout and right-looking).
In HPL, one can choose every one of those for the RFACT, as
well as the PFACT. The following lines of HPL.dat allows you
to set those parameters.
Lines 14-21: (Example 1)
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
This example would try all variants of PFACT, 4 values for
NBMIN, namely 1, 2, 4 and 8, 3 values for NDIV namely 2, 3
and 4, and all variants for RFACT.
Lines 14-21: (Example 2)
2 # of panel fact
2 0 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
4 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
This example would try 2 variants of PFACT namely right
looking and left looking, 2 values for NBMIN, namely 4 and 8,
1 value for NDIV namely 2, and one variant for RFACT.
In the main loop of the algorithm, the current panel of
column is broadcast in process rows using a virtual ring
topology. HPL offers various choices and one most likely want
to use the increasing ring modified encoded as 1. 3 and 4 are
also good choices.
Lines 22-23: (Example 1)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
This will cause HPL to broadcast the current panel using the
increasing ring modified topology.
Lines 22-23: (Example 2)
2 # of broadcast
0 4 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
This will cause HPL to broadcast the current panel using the
increasing ring virtual topology and the long message
algorithm.
Lines 24-25 allow to specify the look-ahead
depth used by HPL. A depth of 0 means that the next panel
is factorized after the update by the current panel is
completely finished. A depth of 1 means that the next
panel is immediately factorized after being updated. The
update by the current panel is then finished. A depth of k
means that the k next panels are factorized immediately after
being updated. The update by the current panel is then
finished. It turns out that a depth of 1 seems to give the
best results, but may need a large problem size before one
can see the performance gain. So use 1, if you do not know
better, otherwise you may want to try 0. Look-ahead of
depths 3 and larger will probably not give you better
results.
Lines 24-25: (Example 1):
1 # of lookahead depth
1 DEPTHs (>=0)
This will cause HPL to use a look-ahead of depth 1.
Lines 24-25: (Example 2):
2 # of lookahead depth
0 1 DEPTHs (>=0)
This will cause HPL to use a look-ahead of depths 0 and 1.
Lines 26-27 allow to specify the swapping
algorithm used by HPL for all tests. There are currently
two swapping algorithms available, one based on "binary
exchange" and the other one based on a "spread-roll"
procedure (also called "long" below). For large problem
sizes, this last one is likely to be more efficient. The user
can also choose to mix both variants, that is "binary-exchange"
for a number of columns less than a threshold value, and then
the "spread-roll" algorithm. This threshold value is then
specified on Line 27.
Lines 26-27: (Example 1):
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
This will cause HPL to use the "long" or "spread-roll"
swapping algorithm. Note that a threshold is specified in
that example but not used by HPL.
Lines 26-27: (Example 2):
2 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
This will cause HPL to use the "long" or "spread-roll"
swapping algorithm as soon as there is more than 60 columns
in the row panel. Otherwise, the "binary-exchange" algorithm
will be used instead.
Line 28 allows to specify whether the upper
triangle of the panel of columns should be stored in
no-transposed or transposed form. Example:
0 L1 in (0=transposed,1=no-transposed) form
Line 29 allows to specify whether the panel
of rows U should be stored in no-transposed or transposed
form. Example:
0 U in (0=transposed,1=no-transposed) form
Line 30 enables / disables the equilibration
phase. This option will not be used unless you selected 1 or
2 in Line 26. Example:
1 Equilibration (0=no,1=yes)
Line 31 allows to specify the alignment in
memory for the memory space allocated by HPL. On modern
machines, one probably wants to use 4, 8 or 16. This may
result in a tiny amount of memory wasted. Example:
8 memory alignment in double (> 0)
- Figure out a good block size for the matrix multiply
routine. The best method is to try a few out. If you happen
to know the block size used by the matrix-matrix multiply
routine, a small multiple of that block size will do fine.
This particular topic is discussed in the
FAQs section.
- The process mapping should not matter if the nodes of
your platform are single processor computers. If these nodes
are multi-processors, a row-major mapping is recommended.
- HPL likes "square" or slightly flat process grids. Unless
you are using a very small process grid, stay away from the
1-by-Q and P-by-1 process grids. This particular topic is also
discussed in the FAQs section.
- Panel factorization parameters: a good start are the
following for the lines 14-21:
1 # of panel fact
1 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
4 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
- Broadcast parameters: at this time it is far from obvious
to me what the best setting is, so i would probably try them
all. If I had to guess I would probably start with the
following for the lines 22-23:
2 # of broadcast
1 3 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
The best broadcast depends on your problem size and harware
performance. My take is that 4 or 5 may be competitive for
machines featuring very fast nodes comparatively to the
network.
- Look-ahead depth: as mentioned above 0 or 1 are likely to
be the best choices. This also depends on the problem size
and machine configuration, so I would try "no look-ahead (0)"
and "look-ahead of depth 1 (1)". That is for lines 24-25:
2 # of lookahead depth
0 1 DEPTHs (>=0)
- Swapping: one can select only one of the three algorithm
in the input file. Theoretically, mix (2) should win, however
long (1) might just be good enough. The difference should be
small between those two assuming a swapping threshold of the
order of the block size (NB) selected. If this threshold is
very large, HPL will use bin_exch (0) most of the time and if
it is very small (< NB) long (1) will always be used. In
short and assuming the block size (NB) used is say 60, I
would choose for the lines 26-27:
2 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
I would also try the long variant. For a very small number
of processes in every column of the process grid (say < 4),
very little performance difference should be observable.
- Local storage: I do not think Line 28 matters. Pick 0 in
doubt. Line 29 is more important. It controls how the panel
of rows should be stored. No doubt 0 is better. The caveat is
that in that case the matrix-multiply function is called with
( Notrans, Trans, ... ), that is C := C - A B^T. Unless the
computational kernel you are using has a very poor (with
respect to performance) implementation of that case, and is
much more efficient with ( Notrans, Notrans, ... ) just pick
0 as well. So, my choice:
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
- Equilibration: It is hard to tell whether equilibration
should always be performed or not. Not knowing much about the
random matrix generated and because the overhead is so small
compared to the possible gain, I turn it on all the time.
1 Equilibration (0=no,1=yes)
- For alignment, 4 should be plenty, but just to be safe,
one may want to pick 8 instead.
8 memory alignment in double (> 0)
[Home]
[Copyright and Licensing Terms]
[Algorithm]
[Scalability]
[Performance Results]
[Documentation]
[Software]
[FAQs]
[Tuning]
[Errata-Bugs]
[References]
[Related Links]