HPL Scalability Analysis
The machine model used for the
analysis is first described. This crude model is then used to
estimate the parallel running time of the various phases of the
algorithm, namely the panel factorization and broadcast, the trailing
submatrix update, and the backward substitution. Finally, the
parallel efficiency of the entire algorithm is estimated according to
this machine model.
We show that for a given set of parameters HPL is scalable
not only with respect to the amount of computation, but also with
respect to the communication volume.
Distributed-memory computers consist of processors that are connected
using a message passing interconnection network. Each processor has
its own memory called the local memory, which is accessible only to
that processor. As the time to access a remote memory is longer than
the time to access a local one, such computers are often referred to
as Non-Uniform Memory Access (NUMA) machines.
The interconnection network of our machine model is static, meaning
that it consists of point-to-point communication links among
processors. This type of network is also referred to as a direct
network as opposed to dynamic networks. The latter are constructed
from switches and communication links. These links are dynamically
connected to one another by the switching elements to establish, at
run time, the paths between the processors' memories.
The interconnection network of the two-dimensional machine model
considered here is a static, fully connected physical topology. It
is also assumed that the processors can be treated equally in terms
of local performance and that the communication rate between two
processors does not depend on the processors considered.
Our model assumes that a processor can send or receive data on only
one of its communication ports at a time (assuming it has more than
one). In the literature, this assumption is also referred to as the
one-port communication model.
The time spent to communicate a message between two given processors
is called the communication time Tc. In our machine model, Tc is
approximated by a linear function of the number L of double
precision (64-bit) items communicated. Tc is the sum of the time to
prepare the message for transmission (alpha) and the time (beta * L)
taken by the message of length L to traverse the network to its
destination, i.e.,
Tc = alpha + beta L.
Finally, the model assumes that the communication links are
bi-directional, that is, the time for two processors to send each
other a message of length L is also Tc. A processor can send and/or
receive a message on only one of its communication links at a time.
In particular, a processor can send a message to another processor
while simultaneously receiving a message from that same processor.
Since this document is only concerned with regular local dense linear
algebra operations, the time taken to perform one floating point
operation is assumed to be summarized by three constants gam1,
gam2 and gam3. These quantities approximate the time per floating
point operation achieved by the vector-vector, matrix-vector and
matrix-matrix operations, respectively, on each processor. This very
crude approximation summarizes all the steps performed by a processor
to achieve such a computation. Obviously, such a model neglects all
the phenomena occurring in the processor components, such as cache
misses, pipeline startups, memory loads and stores, floating point
arithmetic and so on, that may influence the value of these constants
as a function of, for example, the problem size.
Similarly, the model does not make any assumption on the amount of
physical memory per node. It is assumed that if a process has been
spawned on a processor, one has ensured that enough memory was
available on that processor. In other words, swapping will not occur
during the modeled computation.
This machine model is a very crude approximation that is designed
specifically to illustrate the cost of the dominant factors of our
particular case.
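As an illustration of the model, the communication time can be
written down directly. The following C sketch is not part of HPL; the
names simply mirror the symbols used above, with alpha expressed in
seconds and beta in seconds per double precision item.

/*
 * Model sketch (not HPL code): communication time Tc = alpha + beta * L
 * for a message of L double precision (64-bit) items, following the
 * linear model described above.
 */
static double Tc( double alpha, double beta, double L )
{
   return( alpha + beta * L );
}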
Let us consider an M-by-N panel distributed over a P-process column.
Because of the recursive formulation of the panel factorization, it
is reasonable to consider that the floating point operations will
be performed at matrix-matrix multiply "speed". For every column in
the panel a binary-exchange is performed on 2*N data items. When this
panel is broadcast, what matters is the time that the next process
column will spend in this communication operation. Assuming one
chooses the increasing-ring (modified)
variant, only one message needs to be taken into account. The
execution time of the panel factorization and broadcast can thus be
approximated by:
Tpfact( M, N ) = (M/P - N/3) N^2 gam3 + N log(P)( alpha + beta 2 N ) +
alpha + beta M N / P.
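For illustration only, this estimate can be encoded as follows; the
sketch below is not HPL code, and the logarithm is taken base 2 as
suggested by the binary-exchange.

#include <math.h>

/*
 * Model sketch (not HPL code): estimated time of the factorization
 * and broadcast of an M-by-N panel distributed over a P-process
 * column, as given by the formula above.
 */
static double Tpfact( double M, double N, double P,
                      double alpha, double beta, double gam3 )
{
   return( ( M / P - N / 3.0 ) * N * N * gam3 +
           N * log2( P ) * ( alpha + beta * 2.0 * N ) +
           alpha + beta * M * N / P );
}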
Let us consider the update phase of an N-by-N trailing submatrix
distributed on a P-by-Q process grid. From a computational point of
view one has to (triangular) solve N right-hand-sides and perform a
local rank-NB update of this trailing submatrix. Assuming one chooses
the long variant, the execution
time of the update operation can be approximated by:
Tupdate( N, NB ) = gam3 ( N NB^2 / Q + 2 N^2 NB / ( P Q ) ) +
alpha ( log( P ) + P - 1 ) + 3 beta N NB / Q.
The constant "3" in front of the "beta" term is obtained by counting
one for the (logarithmic) spread phase and two for the rolling
phase. In the case of bi-directional links, this constant 3 should
therefore be only a 2.
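In the same illustrative spirit, the update estimate can be sketched
as follows (again, this is a model evaluation, not HPL code, with log
taken base 2).

#include <math.h>

/*
 * Model sketch (not HPL code): estimated time of the update of an
 * N-by-N trailing submatrix distributed on a P-by-Q process grid
 * with block size NB, as given by the formula above.
 */
static double Tupdate( double N, double NB, double P, double Q,
                       double alpha, double beta, double gam3 )
{
   return( gam3 * ( N * NB * NB / Q + 2.0 * N * N * NB / ( P * Q ) ) +
           alpha * ( log2( P ) + P - 1.0 ) +
           3.0 * beta * N * NB / Q );
}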
The number of floating point operations performed during the backward
substitution is given by N^2 / (P Q). Because of the lookahead, the
communication cost can be approximated at each step by two messages
of length NB, i.e., the time to communicate the NB-piece of the
solution vector from one diagonal block of the matrix to another. It
follows that the execution time of the backward substitution can be
approximated by:
Tbacks( N, NB ) = gam2 N^2 / (P Q) + N ( alpha / NB + 2 beta ).
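The backward substitution estimate admits the same kind of sketch
(not HPL code):

/*
 * Model sketch (not HPL code): estimated time of the backward
 * substitution for an N-by-N system on a P-by-Q process grid with
 * block size NB, as given by the formula above.
 */
static double Tbacks( double N, double NB, double P, double Q,
                      double alpha, double beta, double gam2 )
{
   return( gam2 * N * N / ( P * Q ) +
           N * ( alpha / NB + 2.0 * beta ) );
}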
The total execution time of the algorithm described above is given by
Sum(k=0,N,NB)[Tpfact( N-k, NB ) + Tupdate( N-k-NB, NB )] +
Tbacks( N, NB ).
That is, by considering only the dominant terms in alpha, beta and
gam3:
Thpl = 2 gam3 N^3 / ( 3 P Q ) + beta N^2 (3 P + Q) / ( 2 P Q ) +
alpha N ((NB + 1) log(P) + P) / NB.
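For illustration, the sum above can also be accumulated numerically
by reusing the Tpfact, Tupdate and Tbacks sketches given earlier;
comparing this evaluation with the closed form provides a quick check
of the approximation. As before, this is a model sketch, not HPL
code.

/*
 * Model sketch (not HPL code): total estimated time obtained by
 * accumulating the panel and update estimates over the steps of the
 * factorization and adding the backward substitution.  Reuses the
 * Tpfact, Tupdate and Tbacks sketches above.
 */
static double Thpl_sum( double N, double NB, double P, double Q,
                        double alpha, double beta,
                        double gam2, double gam3 )
{
   double k, t = 0.0;

   for( k = 0.0; k < N; k += NB )
   {
      t += Tpfact(  N - k,      NB, P,    alpha, beta, gam3 ) +
           Tupdate( N - k - NB, NB, P, Q, alpha, beta, gam3 );
   }
   return( t + Tbacks( N, NB, P, Q, alpha, beta, gam2 ) );
}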
The serial execution time is given by Tser = 2 gam3 N^3 / 3. If we
define the parallel efficiency E as the ratio Tser / ( P Q Thpl ), we
obtain:
E = 1 / ( 1 + 3 beta (3 P + Q) / ( 4 gam3 N ) +
3 alpha P Q ((NB + 1) log(P) + P) / (2 N^2 NB gam3) ).
This last equality shows that when the memory usage per processor
N^2 / (P Q) is maintained constant, the parallel efficiency slowly
decreases only because of the alpha term. The contribution of the
communication volume (the beta term), however, remains constant for a
fixed P-by-Q aspect ratio. Due to these results, HPL is said to be
scalable not only with respect to the amount of computation, but also
with respect to the communication volume.
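For illustration, the efficiency estimate can be evaluated directly
from its closed form; the sketch below is not HPL code, and log is
again taken base 2.

#include <math.h>

/*
 * Model sketch (not HPL code): estimated parallel efficiency E for a
 * problem of size N with block size NB on a P-by-Q process grid,
 * using the closed form derived above.
 */
static double E_hpl( double N, double NB, double P, double Q,
                     double alpha, double beta, double gam3 )
{
   return( 1.0 / ( 1.0 +
           3.0 * beta  * ( 3.0 * P + Q ) / ( 4.0 * gam3 * N ) +
           3.0 * alpha * P * Q * ( ( NB + 1.0 ) * log2( P ) + P ) /
           ( 2.0 * N * N * NB * gam3 ) ) );
}

Evaluating this sketch while holding N^2 / (P Q) and the grid aspect
ratio fixed reproduces the behavior described above: the beta
contribution stays constant while the alpha contribution slowly
erodes the efficiency as P and Q grow.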