==================================================================
=== ===
=== GENESIS Distributed Memory Benchmarks ===
=== ===
=== FFT1 ===
=== ===
=== One-dimensional FFT ===
=== ===
=== Versions: Std F77, PARMACS, Subset HPF, ===
=== PVM 3.1 ===
=== ===
=== Author: Vladimir Getov ===
=== ===
=== Inquiries: HPC Centre ===
=== Computing Services ===
=== University of Southampton ===
=== Southampton SO17 4BJ, U.K. ===
=== ===
=== Fax: +44 703 593939 E-mail: support@par.soton.ac.uk ===
=== ===
=== Last update: Jul 1993; Release: 3.0 ===
=== ===
==================================================================
1. Description
--------------
The FFT1 benchmark is based on the classical radix-2 1-dimensional FFT
algorithm by Cooley and Tukey. The radix-2 FFT algorithm for length
N = 2^{m} consists of m radix-2 stages. At each stage, N/2 radix-2
butterflies are computed by performing two nested DO-loops, with
uniform indexing within these. Depending on the stage of the FFT
computation, the length of the two DO-loops varies from N/2 to 1 for
the number of data points and from 1 to N/2 for the number of partial
transforms. It is possible to swap the order of these DO-loops half-way
through the transform to increase the average vector length.
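The stage structure described above can be sketched in a few lines. The following is an illustrative Python version (the benchmark itself is written in Fortran 77); it is an in-place radix-2 decimation-in-frequency transform in which the trip counts of the two nested loops trade off between N/2 and 1 across the stages, exactly as described, and the output is left in bit-reversed order:

```python
import cmath

def fft_dif(x):
    """In-place radix-2 decimation-in-frequency FFT sketch.
    Output is left in bit-reversed order, as in the `in-place'
    algorithms discussed in the text. Length must be a power of two."""
    n = len(x)
    span = n // 2                    # distance between butterfly partners
    while span >= 1:
        # outer loop: butterflies within one partial transform
        # (trip count shrinks from N/2 down to 1 across the stages)
        for j in range(span):
            w = cmath.exp(-2j * cmath.pi * j / (2 * span))  # twiddle factor
            # inner loop: one butterfly per partial transform
            # (number of partial transforms grows from 1 up to N/2)
            for base in range(0, n, 2 * span):
                a = x[base + j]
                b = x[base + j + span]
                x[base + j] = a + b
                x[base + j + span] = (a - b) * w
        span //= 2
    return x
```

Swapping the two loops half-way through the transform, as the text suggests, changes only the iteration order, not the arithmetic.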
The Fourier transform comprises two main phases:
- the generation of twiddle factors;
- the calculation of harmonic amplitudes (butterfly phase).
The latter usually dominates the computation time. The
first phase requires fewer arithmetic operations and,
furthermore, it may be precomputed and used many times for different
problem sizes in some applications. Hence in this benchmark, although
both the generation of twiddle factors and the butterfly phases are
timed, only the operations and time for the butterfly phase are used
in calculating the benchmark performance.
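The first phase amounts to tabulating the complex roots of unity W_N^k = exp(-2*pi*i*k/N). As an illustration (Python, not the benchmark's Fortran 77 source), precomputing them once for a given N looks like this:

```python
import cmath

def twiddle_factors(n):
    """Precompute the n/2 twiddle factors W_n^k = exp(-2*pi*i*k/n),
    k = 0 .. n/2 - 1, reused by every butterfly stage."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
```

Because this table depends only on N, an application performing many transforms of the same size pays the cost once, which is why only the butterfly phase counts towards the benchmark performance figure.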
The FFT algorithm offers an exceptional reduction of the number of
operations for the calculation of harmonic amplitudes. The price paid
for that reduction is the variable stride indexation for the classical
FFT derivation, or the disordered bit-reversed form of the resulting
output sequence for the `in-place' FFT algorithms. The first option
causes problems for both distribution and vectorisation. On the one
hand, its distributed version combines data from two processors and
puts the result in a third processor, which is clearly inefficient. On
the other hand, variable stride indexation is usually inconvenient for
vectorisation and causes significant performance losses. Thus only the
`in-place' transforms are suitable for implementation on distributed
memory computers, but unfortunately their output sequence is in
bit-reversed order. Most frequently the solution of this problem includes a
bit-reversal of the input or the output sequence, which is actually a
third component in the transformation evaluation. The bit-reversal
comprises two steps:
- Preparation of a bit-reversed index vector. This calculation
depends only on the problem size and the data allocation across the
processors in distributed versions and may be precomputed for each
different problem size, as was the case for the twiddle factors.
- Shuffle of the input (or output) sequence in accordance with the
bit-reversed index vector, which can be done during the calculation
of harmonic amplitudes.
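The two steps above can be sketched as follows (illustrative Python for the single-processor case; the benchmark's own implementation is Fortran 77 and also accounts for the data allocation across processors):

```python
def bit_reverse_index(logn):
    """Step 1: the bit-reversed index vector for n = 2**logn points.
    Depends only on the problem size, so it can be precomputed,
    just like the twiddle factors."""
    n = 1 << logn
    idx = []
    for i in range(n):
        r, v = 0, i
        for _ in range(logn):          # reverse logn bits of i
            r = (r << 1) | (v & 1)
            v >>= 1
        idx.append(r)
    return idx

def bit_reverse_shuffle(x, idx):
    """Step 2: shuffle the sequence according to the index vector."""
    return [x[idx[i]] for i in range(len(x))]
```

Note that bit reversal is an involution: applying the shuffle twice restores the original ordering.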
The natural communication structure (the minimal connectivity of a
network in which the communications in the optimal distributed
solution are local) for the butterfly phase is a D-dimensional
hypercube if 2^{D} processors are used. In the bit-reversal phase
the natural communication structure is a completely connected network.
The current version of the benchmark assumes that the host
collects the output vectors (each N/P long), if necessary, in
bit-reversed order without any additional communication. Therefore the
timings reported for the bit-reversal phase include the intra-processor
bit-reversal shuffle only.
2. Operating Instructions
-------------------------
Changing the problem size and number of processors:
The problem size and number of processors used by the benchmark can be
adjusted by changing the parameters LOGN and LOGP in the include file
"fft1.inc". The user may also change a third parameter NITER to alter
the accuracy of the benchmark timing relative to the resolution of the
system clock. The use of these parameters is described below.
LOGN
The problem size is determined by the total number of data-points to
be transformed (N). The radix-2 algorithm used requires the total
number of points to be an integer power of two. Thus the problem size is
conveniently specified in terms of the parameter LOGN which is defined
as the logarithm to base two of the number of transform points.
i.e. N = 2**LOGN
LOGP
Similarly the number of processors (P) should also be an integer power
of two and is specified by the parameter LOGP such that P = 2**LOGP.
This parameter is not required for the sequential benchmark.
NITER
For small problem sizes it may be that the execution time of the FFT
is comparable with the resolution of the system clock. To achieve
a given accuracy in the timing measurement the calculation
can be repeated by use of the parameter NITER. This specifies the
number of repeats and should be set so that the benchmarked time is at
least 100 times the clock resolution (if the clock resolution is
unknown, it can be determined using the TICK1 benchmark).
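The repeat-and-average scheme that NITER implements can be sketched as follows (illustrative Python; `run_fft` stands in for the benchmarked kernel and is not a name from the benchmark source):

```python
import time

def timed_kernel(run_fft, niter):
    """Repeat the benchmarked kernel niter times and return the mean
    time per run, so that the total measured interval is well above
    the resolution of the system clock."""
    t0 = time.perf_counter()
    for _ in range(niter):
        run_fft()
    t1 = time.perf_counter()
    return (t1 - t0) / niter
```

Choosing niter so that `(t1 - t0)` exceeds 100 clock ticks bounds the relative quantisation error of a single measurement at about 1%.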
Suggested Problem Sizes :
It is recommended that the benchmark be run with four standard problem sizes,
given by LOGN = 11, 14, 17 and 20. These values give the following
numbers of transform points: N = 2048, 16384, 131072 and 1048576.
Note that it may not be possible to run the largest problem size on all
machines because of restrictions on the available memory.
The approximate memory required on each processor is given by the formula:
Memory required (bytes) = 10 * N * (1 + 2/P)
Where N is the total number of points in the transform and P is the
number of processors over which the transform is distributed.
From this it can be seen that to run the largest problem size requires
between 12 and 30 Mbyte of memory, depending on the number of processors used.
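The quoted range can be checked directly from the formula (illustrative Python, not part of the benchmark):

```python
def memory_per_processor(logn, logp):
    """Approximate per-processor memory in bytes, from the formula
    10 * N * (1 + 2/P) given in the text, with N = 2**LOGN and
    P = 2**LOGP."""
    n = 2 ** logn
    p = 2 ** logp
    return 10 * n * (1 + 2 / p)
```

For LOGN = 20 this gives about 31.5 Mbyte on a single processor (P = 1) and about 11.8 Mbyte per processor with P = 16, matching the 12 to 30 Mbyte range quoted above.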
The number of processors to be used will obviously depend on the
system available. The most important measurement is likely to be for
the largest power of two that will fit into the machine. If time permits,
the variation of performance with number of processors should be
investigated by reducing the number of processors by successive
factors of two or four.
Compiling and Running The Benchmark:
1) Set required problem size (plus number of nodes in distributed
version) and number of iterations through the parameters LOGN, (LOGP)
and NITER in the include file fft1.inc.
2) To compile and link the benchmark type: `make' for the distributed
version or `make slave' for the single-node version.
(If any of the parameters in the include files have been changed, the
makefile will automatically recompile only the affected files.)
3) On some systems it may be necessary to allocate the appropriate
resources before running the benchmark, e.g. on the iPSC/860
to reserve a cube of 8 processors, type: getcube -t8
4) To run either the sequential or the distributed version of the benchmark,
type: fft1
A permanent copy of the benchmark output is written to a file
called 'fft1.res'.
5) If the run is successful and a permanent record is required, the
file 'fft1.res' should be copied to another file before the next run
overwrites it.
$Id: ReadMe,v 1.3 1994/07/05 19:19:06 igl Exp igl $
Submitted by Mark Papiani,
last updated on 10 Jan 1995.