We suggest the following approach to obtain high performance with ScaLAPACK codes:
The standard data distribution will typically achieve 25-50%
of the peak performance possible (depending
in part on how many processors are ignored, i.e., the difference
between the number of processors available and the number actually
used in the process grid). We do not
recommend experimenting with different data distributions until
acceptable (or nearly acceptable) performance has been achieved.
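
The number of ignored processors mentioned above can be estimated with a short calculation. The following is a minimal sketch, not part of ScaLAPACK: it assumes, for illustration only, that the process grid is the largest square that fits in the available processes, and the function name square_grid is hypothetical.

```python
import math


def square_grid(p):
    """Return (rows, cols, idle) for the largest square grid within p processes.

    Assumption for illustration: the grid side is floor(sqrt(p)).
    'idle' is the number of processes left out of the grid.
    """
    if p < 1:
        raise ValueError("need at least one process")
    side = math.isqrt(p)  # integer floor of the square root
    return side, side, p - side * side
```

A count that is not a perfect square leaves some processes outside the grid under this assumption, and those processes contribute nothing to the computation.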
If each individual node requires a block size larger than 64 to
achieve near-peak performance on local matrix-matrix multiply,
the block size may have to be increased. This step is unlikely to be
necessary, however, unless each node of the computer is a
shared-memory multiprocessor with more than four processors.
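
One way to decide whether a block size of 64 is large enough is to time the local matrix-matrix multiply on one node for several block sizes and pick the smallest one that comes close to the best rate. The sketch below only does the selection step and is not a ScaLAPACK routine. It assumes the caller supplies the measured rates, and the 0.9 threshold for "near-peak" is an illustrative assumption.

```python
def smallest_adequate_block_size(rates, fraction=0.9):
    """Return the smallest block size whose rate is near the best measured rate.

    rates:    dict mapping block size -> measured local matrix-multiply rate
              (any consistent unit), supplied by the caller.
    fraction: assumed meaning of "near-peak", as a share of the best rate.
    """
    if not rates:
        raise ValueError("no measurements supplied")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if any(rate <= 0 for rate in rates.values()):
        raise ValueError("rates must be positive")
    best = max(rates.values())
    # Scan block sizes in increasing order and stop at the first adequate one.
    for block_size in sorted(rates):
        if rates[block_size] >= fraction * best:
            return block_size
```

If the value returned for a given node is larger than 64, that suggests the block size may need to be increased.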