==================================================================
===                                                            ===
===           GENESIS Distributed Memory Benchmarks            ===
===                                                            ===
===                            FFT1                            ===
===                                                            ===
===                    One-dimensional FFT                     ===
===                                                            ===
===        Versions:  Std F77, PARMACS, Subset HPF,            ===
===                   PVM 3.1                                  ===
===                                                            ===
===        Author:    Vladimir Getov                           ===
===                                                            ===
===        Inquiries: HPC Centre                               ===
===                   Computing Services                       ===
===                   University of Southampton                ===
===                   Southampton SO17 4BJ, U.K.               ===
===                                                            ===
===    Fax: +44 703 593939  E-mail: support@par.soton.ac.uk    ===
===                                                            ===
===        Last update: Jul 1993; Release: 3.0                 ===
===                                                            ===
==================================================================


1. Description
--------------

The FFT1 benchmark is based on the classical radix-2 one-dimensional
FFT algorithm by Cooley and Tukey. The radix-2 FFT algorithm for
length N = 2^{m} consists of m radix-2 stages. At each stage, N/2
radix-2 butterflies are computed by two nested DO-loops with uniform
indexing. Depending on the stage of the FFT computation, the lengths
of the two DO-loops vary from N/2 to 1 for the number of data points
and from 1 to N/2 for the number of partial transforms. It is
possible to swap the order of these DO-loops half-way through the
transform to increase the average vector length.

The Fourier transform comprises two main phases:

 - the generation of twiddle factors;
 - the calculation of harmonic amplitudes (butterfly phase).

The latter is commonly critical for the speed of computation. The
first phase needs a smaller number of arithmetic operations and,
furthermore, in some applications it may be precomputed once and
reused many times for different problem sizes. Hence in this
benchmark, although both the generation of twiddle factors and the
butterfly phase are timed, only the operations and time for the
butterfly phase are used in calculating the benchmark performance.

The FFT algorithm offers an exceptional reduction of the number of
operations for the calculation of harmonic amplitudes.
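The stage and loop structure described above can be sketched as
follows. This is a minimal Python illustration of the
decimation-in-frequency loop nest, not the benchmark's Fortran 77
code; all names are illustrative.

```python
import cmath

def fft_butterflies(x):
    """Radix-2 decimation-in-frequency butterfly phase, in place.

    For n = 2**m points there are m stages.  At each stage the loop
    over data points shrinks from n/2 to 1 while the loop over
    partial transforms grows from 1 to n/2; the result is left in
    bit-reversed order, as for the `in-place' algorithms.
    """
    n = len(x)
    length = n
    while length > 1:                    # one radix-2 stage
        half = length // 2
        for base in range(0, n, length): # partial transforms: 1 .. n/2
            for j in range(half):        # data points: n/2 .. 1
                w = cmath.exp(-2j * cmath.pi * j / length)  # twiddle factor
                a, b = x[base + j], x[base + j + half]
                x[base + j] = a + b
                x[base + j + half] = (a - b) * w
        length = half

# A constant input transforms to a single DC term; with bit-reversed
# output that term is still element 0.
data = [1.0 + 0j] * 8
fft_butterflies(data)
print(round(data[0].real))   # -> 8
```

Swapping the two inner loops half-way through the transform, as the
text mentions, changes only the iteration order of this nest, not
the arithmetic.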
The price paid for that reduction is either variable-stride
indexing, for the classical FFT derivation, or the bit-reversed
order of the resulting output sequence, for the `in-place' FFT
algorithms. The first option causes problems for both vectorisation
and distribution: the distributed version combines data from two
processors and puts the result in a third processor, which is
clearly inefficient, while the variable stride itself is
inconvenient for vectorisation and causes significant performance
losses. Thus only the `in-place' transforms are suitable for
implementation on distributed memory computers, but these
unfortunately produce their output sequence in bit-reversed order.
Most frequently the solution of this problem includes a
bit-reversal of the input or the output sequence, which is in
effect a third component of the transform evaluation. The
bit-reversal comprises two steps:

 - Preparation of a bit-reversed index vector. This calculation
   depends only on the problem size and, in distributed versions,
   on the data allocation across the processors, and may be
   precomputed for each problem size, as was the case for the
   twiddle factors.

 - Shuffle of the input (output) sequence in accordance with the
   bit-reversed index vector, which can be done during the
   calculation of harmonic amplitudes.

The natural communication structure (the minimal connectivity of a
network in which the communications in the optimal distributed
solution are local) for the butterfly phase is a D-dimensional
hypercube if 2^{D} processors are used. In the bit-reversal phase
the natural communication structure is a completely connected
network. The current version of the benchmark assumes that the host
collects the output vectors (each N/P long), if necessary, in
bit-reversed order without any additional communication. Therefore
the timings reported for the bit-reversal phase include the
intra-processor bit-reversal shuffle only.
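The two bit-reversal steps above can be sketched as follows. This
is an illustrative Python fragment for the single-processor case,
not the benchmark's Fortran 77 code, and the function names are
invented for the example.

```python
def bit_reversed_indices(m):
    """Index vector for n = 2**m points.

    Depends only on the problem size, so, like the twiddle factors,
    it can be precomputed once per problem size.
    """
    n = 1 << m
    rev = [0] * n
    for i in range(n):
        r = 0
        for bit in range(m):         # mirror the m address bits of i
            if i & (1 << bit):
                r |= 1 << (m - 1 - bit)
        rev[i] = r
    return rev

def shuffle(x, rev):
    """Reorder a sequence according to the bit-reversed index vector."""
    return [x[rev[i]] for i in range(len(x))]

rev = bit_reversed_indices(3)        # n = 8
print(rev)                           # -> [0, 4, 2, 6, 1, 5, 3, 7]
```

Because bit reversal is its own inverse, shuffling twice with the
same index vector restores the original order, which is why the
shuffle can be applied equally to the input or the output sequence.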
2. Operating Instructions
-------------------------

Changing problem size and number of processes:

The problem size and number of processors used by the benchmark can
be adjusted by changing the parameters LOGN and LOGP in the include
file "fft1.inc". The user may also change a third parameter, NITER,
to alter the accuracy of the benchmark timing relative to the
resolution of the system clock. The use of these parameters is
described below.

LOGN    The problem size is determined by the total number of data
        points to be transformed (N). The radix-2 algorithm used
        requires the total number of points to be an integer power
        of two. Thus the problem size is conveniently specified by
        the parameter LOGN, defined as the logarithm to base two of
        the number of transform points, i.e. N = 2**LOGN.

LOGP    Similarly, the number of processors (P) must also be an
        integer power of two and is specified by the parameter
        LOGP, such that P = 2**LOGP. This parameter is not required
        for the sequential benchmark.

NITER   For small problem sizes the execution time of the FFT may
        be comparable with the resolution of the system clock. To
        achieve a given accuracy in the timing measurement, the
        calculation can be repeated under control of the parameter
        NITER. This specifies the number of repeats and should be
        set so that the benchmarked time is at least 100 times the
        clock resolution. (If the clock resolution is unknown, it
        can be determined using the TICK1 benchmark.)

Suggested problem sizes:

It is recommended that the benchmark is run with four standard
problem sizes, given by LOGN = 11, 14, 17 and 20. These values give
the following numbers of transform points: N = 2048, 16384, 131072
and 1048576. Note that it may not be possible to run the largest
problem size on all machines because of restrictions on the
available memory.
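The role of NITER can be illustrated with a simple repeat-timing
scheme. This is a hedged Python sketch of the idea; the benchmark
itself implements the repeat loop in Fortran 77 with its own timer,
and the names here are illustrative.

```python
import time

def timed_repeats(work, niter):
    """Run `work` niter times and return the mean time per call.

    Repeating the calculation makes the measured interval large
    relative to the clock resolution; NITER should be chosen so
    that the total time is at least 100 clock ticks.
    """
    start = time.perf_counter()
    for _ in range(niter):
        work()
    elapsed = time.perf_counter() - start
    return elapsed / niter

# e.g. the per-call time of a small computation, averaged over
# 1000 repeats
per_call = timed_repeats(lambda: sum(range(100)), 1000)
print(per_call >= 0.0)   # -> True
```

The same reasoning applies to any benchmark whose single-run time
approaches the granularity of the system clock.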
The approximate memory required on each processor is given by the
formula:

   Memory required (bytes) = 10 * N * (1 + 2/P)

where N is the total number of points in the transform and P is the
number of processors over which the transform is distributed. From
this it can be seen that running the largest problem size requires
between 12 and 30 Mbyte of memory per processor (e.g. for
N = 2**20 the formula gives 10 * 2**20 * 3 bytes = 30 Mbyte on a
single processor), depending on the number of processors used.

The number of processors to be used will obviously depend on the
system available. The most important measurement is likely to be
for the largest power of two that will fit into the machine. If
time permits, the variation of performance with the number of
processors should be investigated by reducing the number of
processors by successive factors of two or four.

Compiling and running the benchmark:

1) Set the required problem size (plus the number of nodes in the
   distributed version) and the number of iterations through the
   parameters LOGN, (LOGP) and NITER in the include file fft1.inc.

2) To compile and link the benchmark, type `make' for the
   distributed version or `make slave' for the single-node version.
   (If any of the parameters in the include files have been
   changed, the makefile will automatically send only the affected
   files to the compiler.)

3) On some systems it may be necessary to allocate the appropriate
   resources before running the benchmark; e.g. on the iPSC/860, to
   reserve a cube of 8 processors, type:

      getcube -t8

4) To run either the sequential or the distributed version of the
   benchmark, type:

      fft1

   A permanent copy of the benchmark output is written to a file
   called 'fft1.res'.

5) If the run is successful and a permanent record is required, the
   file 'fft1.res' should be copied to another file before the next
   run overwrites it.

$Id: ReadMe,v 1.3 1994/07/05 19:19:06 igl Exp igl $
Submitted by Mark Papiani,
last updated on 10 Jan 1995.