==================================================================
=== ===
=== GENESIS Distributed Memory Benchmarks ===
=== ===
=== FFT1 ===
=== ===
=== One-dimensional FFT ===
=== ===
=== Versions: Std F77, PARMACS, Subset HPF, ===
=== PVM 3.1 ===
=== ===
=== Author: Vladimir Getov ===
=== ===
=== Inquiries: HPC Centre ===
=== Computing Services ===
=== University of Southampton ===
=== Southampton SO17 4BJ, U.K. ===
=== ===
=== Fax: +44 703 593939 E-mail: support@par.soton.ac.uk ===
=== ===
=== Last update: Jul 1993; Release: 3.0 ===
=== ===
==================================================================
1. Description
--------------
The FFT1 benchmark is based on the classical radix-2 1-dimensional FFT
algorithm by Cooley and Tukey. The radix-2 FFT algorithm for length
N = 2^{m} consists of m radix-2 stages. At each stage, N/2 radix-2
butterflies are computed by performing two nested DO-loops, with
uniform indexing within these. Depending on the stage of the FFT
computation, the length of the two DO-loops varies from N/2 to 1 for
the number of data points and from 1 to N/2 for the number of partial
transforms. It is possible to swap the order of these DO-loops half-way
through the transform to increase the average vector length.
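The stage structure described above can be sketched in a few lines. The following is an illustrative Python version (the benchmark itself is written in Fortran 77); it is an in-place radix-2 decimation-in-frequency transform in which the trip counts of the two nested loops trade off between N/2 and 1 across the stages, exactly as described, and the output is left in bit-reversed order:

```python
import cmath

def fft_dif(x):
    """In-place radix-2 decimation-in-frequency FFT sketch.
    Output is left in bit-reversed order, as in the `in-place'
    algorithms discussed in the text. Length must be a power of two."""
    n = len(x)
    span = n // 2                    # distance between butterfly partners
    while span >= 1:
        # outer loop: butterflies within one partial transform
        # (trip count shrinks from N/2 down to 1 across the stages)
        for j in range(span):
            w = cmath.exp(-2j * cmath.pi * j / (2 * span))  # twiddle factor
            # inner loop: one butterfly per partial transform
            # (number of partial transforms grows from 1 up to N/2)
            for base in range(0, n, 2 * span):
                a = x[base + j]
                b = x[base + j + span]
                x[base + j] = a + b
                x[base + j + span] = (a - b) * w
        span //= 2
    return x
```

Swapping the two loops half-way through the transform, as the text suggests, changes only the iteration order, not the arithmetic.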
The Fourier transform comprises two main phases:
- the generation of twiddle factors;
- the calculation of harmonic amplitudes (butterfly phase).
The latter usually dominates the computation time. The
first phase requires fewer arithmetic operations and,
furthermore, it may be precomputed and used many times for different
problem sizes in some applications. Hence in this benchmark, although
both the generation of twiddle factors and the butterfly phases are
timed, only the operations and time for the butterfly phase are used
in calculating the benchmark performance.
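The first phase amounts to tabulating the complex roots of unity W_N^k = exp(-2*pi*i*k/N). As an illustration (Python, not the benchmark's Fortran 77 source), precomputing them once for a given N looks like this:

```python
import cmath

def twiddle_factors(n):
    """Precompute the n/2 twiddle factors W_n^k = exp(-2*pi*i*k/n),
    k = 0 .. n/2 - 1, reused by every butterfly stage."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
```

Because this table depends only on N, an application performing many transforms of the same size pays the cost once, which is why only the butterfly phase counts towards the benchmark performance figure.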
The FFT algorithm offers an exceptional reduction of the number of
operations for the calculation of harmonic amplitudes. The price paid
for that reduction is the variable stride indexation for the classical
FFT derivation, or the disordered bit-reversed form of the resulting
output sequence for the `in-place' FFT algorithms. The first option
causes problems for both distribution and vectorisation. On the one
hand, its distributed version combines data from two processors and
puts the result in a third processor, which is clearly inefficient. On
the other hand, variable stride indexation is usually inconvenient for
vectorisation and causes significant performance losses. Thus only the
`in-place' transforms are suitable for implementation on distributed
memory computers, but unfortunately their output sequence is in
bit-reversed order. Most frequently the solution of this problem includes a
bit-reversal of the input or the output sequence, which is actually a
third component in the transformation evaluation. The bit-reversal
comprises two steps:
- Preparation of a bit-reversed index vector. This calculation
depends only on the problem size and the data allocation across the
processors in distributed versions and may be precomputed for each
different problem size, as was the case for the twiddle factors.
- Shuffle of the input (or output) sequence in accordance with the
bit-reversed index vector, which can be done during the calculation
of harmonic amplitudes.
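The two steps above can be sketched as follows (illustrative Python for the single-processor case; the benchmark's own implementation is Fortran 77 and also accounts for the data allocation across processors):

```python
def bit_reverse_index(logn):
    """Step 1: the bit-reversed index vector for n = 2**logn points.
    Depends only on the problem size, so it can be precomputed,
    just like the twiddle factors."""
    n = 1 << logn
    idx = []
    for i in range(n):
        r, v = 0, i
        for _ in range(logn):          # reverse logn bits of i
            r = (r << 1) | (v & 1)
            v >>= 1
        idx.append(r)
    return idx

def bit_reverse_shuffle(x, idx):
    """Step 2: shuffle the sequence according to the index vector."""
    return [x[idx[i]] for i in range(len(x))]
```

Note that bit reversal is an involution: applying the shuffle twice restores the original ordering.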
The natural communication structure (the minimal connectivity of a
network in which the communications in the optimal distributed
solution are local) for the butterfly phase is a D-dimensional
hypercube if 2^{D} processors are used. In the bit-reversal phase
the natural communication structure is a completely connected network.
The current version of the benchmark assumes that the host
collects the output vectors (each N/P long), if necessary, in
bit-reversed order without any additional communication. Therefore the
timings reported for the bit-reversal phase include the intra-processor
bit-reversal shuffle only.
2. Operating Instructions
-------------------------
Changing the problem size and number of processors:
The problem size and number of processors used by the benchmark can be
adjusted by changing the parameters LOGN and LOGP in the include file
"fft1.inc". The user may also change a third parameter NITER to alter
the accuracy of the benchmark timing relative to the resolution of the
system clock. The use of these parameters is described below.
LOGN
The problem size is determined by the total number of data-points to
be transformed (N). The radix-2 algorithm used requires the total
number of points to be an integer power of two. Thus the problem size is
conveniently specified in terms of the parameter LOGN which is defined
as the logarithm to base two of the number of transform points.
i.e. N = 2**LOGN
LOGP
Similarly the number of processors (P) should also be an integer power
of two and is specified by the parameter LOGP such that P = 2**LOGP.
This parameter is not required for the sequential benchmark.
NITER
For small problem sizes it may be that the execution time of the FFT
is comparable with the resolution of the system clock. To achieve
a given accuracy in the timing measurement the calculation
can be repeated by use of the parameter NITER. This specifies the
number of repeats and should be set so that the benchmarked time is at
least 100 times the clock resolution (if the clock resolution is
unknown, it can be determined using the TICK1 benchmark).
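The repeat-and-average scheme that NITER implements can be sketched as follows (illustrative Python; `run_fft` stands in for the benchmarked kernel and is not a name from the benchmark source):

```python
import time

def timed_kernel(run_fft, niter):
    """Repeat the benchmarked kernel niter times and return the mean
    time per run, so that the total measured interval is well above
    the resolution of the system clock."""
    t0 = time.perf_counter()
    for _ in range(niter):
        run_fft()
    t1 = time.perf_counter()
    return (t1 - t0) / niter
```

Choosing niter so that `(t1 - t0)` exceeds 100 clock ticks bounds the relative quantisation error of a single measurement at about 1%.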
Suggested Problem Sizes :
It is recommended that the benchmark be run with four standard problem sizes,
given by LOGN = 11, 14, 17 and 20. These values give the following
numbers of transform points: N = 2048, 16384, 131072 and 1048576.
Note that it may not be possible to run the largest problem size on all
machines because of restrictions on the available memory.
The approximate memory required on each processor is given by the formula:
Memory required (bytes) = 10 * N * (1 + 2/P)
Where N is the total number of points in the transform and P is the
number of processors over which the transform is distributed.
From this it can be seen that to run the largest problem size requires
between 12 and 30 Mbyte of memory, depending on the number of processors used.
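The quoted range can be checked directly from the formula (illustrative Python, not part of the benchmark):

```python
def memory_per_processor(logn, logp):
    """Approximate per-processor memory in bytes, from the formula
    10 * N * (1 + 2/P) given in the text, with N = 2**LOGN and
    P = 2**LOGP."""
    n = 2 ** logn
    p = 2 ** logp
    return 10 * n * (1 + 2 / p)
```

For LOGN = 20 this gives about 31.5 Mbyte on a single processor (P = 1) and about 11.8 Mbyte per processor with P = 16, matching the 12 to 30 Mbyte range quoted above.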
The number of processors to be used will obviously depend on the
system available. The most important measurement is likely to be for
the largest power of two that will fit into the machine. If time permits,
the variation of performance with number of processors should be
investigated by reducing the number of processors by successive
factors of two or four.
Compiling and Running The Benchmark:
1) Set required problem size (plus number of nodes in distributed
version) and number of iterations through the parameters LOGN, (LOGP)
and NITER in the include file fft1.inc.
2) To compile and link the benchmark type: `make' for the distributed
version or `make slave' for the single-node version.
(If any of the parameters in the include files have been changed, the
makefile will automatically recompile only the affected files.)
3) On some systems it may be necessary to allocate the appropriate
resources before running the benchmark, e.g. on the iPSC/860
to reserve a cube of 8 processors, type: getcube -t8
4) To run either the sequential or the distributed version of the benchmark,
type: fft1
A permanent copy of the benchmark output is written to a file
called 'fft1.res'.
5) If the run is successful and a permanent record is required, the
file 'fft1.res' should be copied to another file before the next run
overwrites it.
$Id: ReadMe,v 1.3 1994/07/05 19:19:06 igl Exp igl $
Submitted by Mark Papiani,
last updated on 10 Jan 1995.