Further details, provided by Pat Worley and last updated on 5/20/95, are given below.
-------------------------------------------------------------------------------
Name of Program         : PSTSWM
                          (Parallel Spectral Transform Shallow Water Model)
-------------------------------------------------------------------------------
Submitter's Name        : Patrick H. Worley
Submitter's Organization: Oak Ridge National Laboratory
Submitter's Address     : Bldg. 6012/MS-6367
                          P. O. Box 2008
                          Oak Ridge, TN 37831-6367
Submitter's Telephone # : (615) 574-3128
Submitter's Fax #       : (615) 574-0680
Submitter's Email       : worleyph@ornl.gov
-------------------------------------------------------------------------------
Major Application Field : Fluid Dynamics
Application Subfield(s) : Climate Modeling
-------------------------------------------------------------------------------
Application "pedigree"  :

PSTSWM Version 4.0 is a message-passing benchmark code and parallel algorithm
testbed that solves the nonlinear shallow water equations using the spectral
transform method. The spectral transform algorithm of the code closely follows
the way CCM2, the NCAR Community Climate Model, handles the dynamical part of
the primitive equations, and the parallel algorithms implemented in the model
include those currently used in the message-passing parallel implementation of
CCM2. PSTSWM was written by Patrick Worley of Oak Ridge National Laboratory and
Ian Foster of Argonne National Laboratory, and is based partly on previous
parallel algorithm research by John Drake, David Walker, and Patrick Worley of
Oak Ridge National Laboratory. Both the code development and the parallel
algorithms research were funded by the DOE Computer Hardware, Advanced
Mathematics, and Model Physics (CHAMMP) program.

PSTSWM is a parallel implementation of a sequential code (STSWM 2.0) written by
James Hack and Ruediger Jakob at NCAR to solve the shallow water equations on a
sphere using the spectral transform method. STSWM evolved from a spectral
shallow water model written by Hack (NCAR/CGD) to compare numerical schemes
designed to solve the divergent barotropic equations in spherical geometry.
STSWM was developed to provide the reference solutions to the test cases
proposed by Williamson et al. (see citation [4] below), which were chosen to
test the ability of numerical methods to simulate important flow phenomena.
These test cases are embedded in the code and are selectable at run time via
input parameters specifying initial conditions, forcing, and analytic solutions
(for error analysis). The solutions are also published in a Technical Note by
Jakob et al. [3]. In addition, this code is meant to serve as an educational
tool for numerical studies of the shallow water equations. A detailed
description of the spectral transform method, and a derivation of the equations
used in this software, can be found in the Technical Note by Hack and Jakob
[2].

In addition to modifications for parallelization, we also modified STSWM to add
vertical levels (in order to get the correct communication and computation
granularity for 3-D weather and climate codes), to increase modularity and
support code reuse, and to allow the problem size to be selected at run time
without depending on dynamic memory allocation. PSTSWM is meant to be a
compromise between paper benchmarks and the usual fixed benchmarks by allowing
a significant amount of runtime-selectable algorithm tuning.
Thus, the goal is to see how quickly the numerical simulation can be run on
different machines without fixing the parallel implementation, but forcing all
implementations to execute the same numerical code (to guarantee fairness). The
code has also been written in such a way that linking in optimized library
functions for common operations instead of the "portable" code will be simple.
-------------------------------------------------------------------------------
May this code be freely distributed (if not specify restrictions) :

Yes, but users are requested to acknowledge the authors (Worley and Foster) and
the program that supported the development of the code (DOE CHAMMP program) in
any resulting research or publications, and are encouraged to send reprints of
their work with this code to the authors. Also, the authors would appreciate
being notified of any modifications to the code. Finally, the code has been
written to allow easy reuse of code in other applications, and for educational
purposes. The authors encourage this, but also request that they be notified
when pieces of the code are used.
-------------------------------------------------------------------------------
Give length in bytes of integers and floating-point numbers that should be
used in this application:

The program currently uses either (a) INTEGER, REAL*8, and COMPLEX*16, or
(b) INTEGER, REAL*4, COMPLEX*8, and REAL*8. In option (b), REAL*8 variables are
used only in set-up, computing Gauss weights and nodes and Legendre polynomial
values, not in the body of the computation. Option (a) is the default, but
option (b) can be chosen at compile time by setting a make variable
appropriately. The required precision for meteorological studies is a matter of
debate in the scientific community, so we support both options. The length in
bytes of INTEGER needs to be set in the platform-specific makefile. Note that
parameter values, e.g., PI and TWOPI, may need to be modified for floating
point number systems with large mantissas.
-------------------------------------------------------------------------------
Documentation describing the implementation of the application (at module
level, or lower) :

The sequential code is documented in a file included in the distribution of the
code from NCAR:

  Jakob, Ruediger, Description of Software for the Spectral Transform Shallow
  Water Model Version 2.0, National Center for Atmospheric Research, Boulder,
  CO 80307-3000, August 1992

and in

  Hack, J.J. and R. Jakob, Description of a global shallow water model based
  on the spectral transform method, NCAR Technical Note TN-343+STR, January
  1992.

A draft user guide for the parallel code is available from
http://www.epm.ornl.gov/chammp/pstswm/index.html. The code distribution also
comes with a README file, and extensive documentation is present in the code.
-------------------------------------------------------------------------------
Research papers describing sequential code and/or algorithms :

 1) Browning, G.L., J.J. Hack and P.N. Swarztrauber, A comparison of three
    numerical methods for solving differential equations on the sphere,
    Monthly Weather Review, 117:1058-1075, 1989.
 2) Hack, J.J. and R. Jakob, Description of a global shallow water model based
    on the spectral transform method, NCAR Technical Note TN-343+STR, January
    1992.
 3) Jakob, R., J.J. Hack and D.L. Williamson, Reference solutions to shallow
    water test set using the spectral transform method, NCAR Technical Note
    TN-388+STR.
 4) Williamson, D.L., J.B. Drake, J.J. Hack, R.
    Jakob and P.S. Swarztrauber, A standard test set for numerical
    approximations to the shallow water equations in spherical geometry,
    Journal of Computational Physics, Vol. 102, pp. 211-224, 1992.
-------------------------------------------------------------------------------
Research papers describing parallel code and/or algorithms :

 5) Worley, P. H. and J. B. Drake, Parallelizing the Spectral Transform Method,
    Concurrency: Practice and Experience, Vol. 4, No. 4 (June 1992),
    pp. 269-291.
 6) Walker, D. W., P. H. Worley, and J. B. Drake, Parallelizing the Spectral
    Transform Method. Part II, Concurrency: Practice and Experience, Vol. 4,
    No. 7 (October 1992), pp. 509-531.
 7) Worley, P. H. and I. T. Foster, Parallel Spectral Transform Shallow Water
    Model: A runtime-tunable parallel benchmark code, in Proceedings of the
    Scalable High Performance Computing Conference, eds. J. J. Dongarra and
    D. W. Walker, IEEE Computer Society Press, 1994, pp. 207-214.
 8) Foster, I. T. and P. H. Worley, Parallel algorithms for the spectral
    transform method, ORNL Technical Report ORNL/TM-12507, April 1994.
 9) Foster, I. T., B. Toonen, and P. H. Worley, Performance of parallel
    computers for spectral atmospheric models, ORNL Technical Report
    ORNL/TM-12986, April 1995.
-------------------------------------------------------------------------------
Other relevant research papers :

10) Foster, I., W. Gropp, and R. Stevens, The parallel scalability of the
    spectral transform method, Monthly Weather Review, 120(5), 1992,
    pp. 835-850.
11) Drake, J. B., R. E. Flanery, I. T. Foster, J. J. Hack, J. G. Michalakes,
    R. L. Stevens, D. W. Walker, D. L. Williamson, and P. H. Worley, The
    Message-Passing Version of the Parallel Community Climate Model,
    Proceedings of the Fifth ECMWF Workshop on Use of Parallel Processors in
    Meteorology (Nov. 23-27, 1992), Hoffman, G.-R. and T. Kauranne, eds.,
    World Scientific Publishing Co. Pte. Ltd., Singapore, 1993, pp. 500-513.
12) Sato, R. K. and R. D. Loft, Implementation of the NCAR CCM2 on the
    Connection Machine, Proceedings of the Fifth ECMWF Workshop on Use of
    Parallel Processors in Meteorology (Nov. 23-27, 1992), Hoffman, G.-R. and
    T. Kauranne, eds., World Scientific Publishing Co. Pte. Ltd., Singapore,
    1993, pp. 371-393.
13) Barros, S. R. M. and T. Kauranne, On the Parallelization of Global Spectral
    Eulerian Shallow-Water Models, Proceedings of the Fifth ECMWF Workshop on
    Use of Parallel Processors in Meteorology (Nov. 23-27, 1992), Hoffman,
    G.-R. and T. Kauranne, eds., World Scientific Publishing Co. Pte. Ltd.,
    Singapore, 1993, pp. 36-43.
14) Kauranne, T. and S. R. M. Barros, Scalability Estimates of Parallel
    Spectral Atmospheric Models, Proceedings of the Fifth ECMWF Workshop on Use
    of Parallel Processors in Meteorology (Nov. 23-27, 1992), Hoffman, G.-R.
    and T. Kauranne, eds., World Scientific Publishing Co. Pte. Ltd.,
    Singapore, 1993, pp. 312-328.
15) Pelz, R. B. and W. F. Stern, A Balanced Parallel Algorithm for Parallel
    Processing, Proceedings of the Sixth SIAM Conference on Parallel Processing
    for Scientific Computing (March 22-24, 1993), pp. 126-128.
16) Drake, J. B., I. T. Foster, J. J. Hack, J. G. Michalakes, B. D. Semeraro,
    B. Toonen, D. L. Williamson, and P. H. Worley, PCCM2: A GCM adapted for
    scalable parallel computers, Proceedings of the Fifth Global Change
    Symposium, American Meteorological Society, 1994.
-------------------------------------------------------------------------------
Application available in the following languages (give message passing system
used, if applicable, and machines application runs on) :

PSTSWM is written in Fortran 77 with VMS extensions and a small number of C
preprocessor directives. PSTSWM currently comes in three forms: a full
distribution, a PVM-only distribution, and an MPI-only distribution. In the
full distribution, message passing is implemented using MPI, PICL, PVM, and/or
native message passing libraries, with the choice being made at compile time.
Additionally, all message passing is encapsulated in three high-level routines
for broadcast, global minimum, and global maximum, and in two classes of
low-level routines representing variants and/or stages of the swap operation
and the send/receive operation. Porting the code to another message passing
system requires porting the PICL, MPI, or PVM libraries and/or implementing
the (few) communication routines in PSTSWM using native message passing
primitives. The PVM-only and MPI-only distributions of the code are simply
stripped-down versions, simplifying the makefile structure and directory
layout of the distribution.

As of 4/1/95, the full distribution has been run on the Cray Research T3D, the
IBM SP-1 and SP-2, the Intel iPSC/2, iPSC/860, DELTA, and Paragon (on both GP
and MP nodes and using either the NX or SUNMOS operating systems), the nCUBE/2
and nCUBE/2S, across a network of workstations, and on a Cray vector machine
(as a serial application), using both native and PICL communication libraries.
The PVM-only distribution has been run across a network of IBM and SUN
workstations, and the MPI-only distribution has been run on an Intel Paragon
(using MPICH) and on an IBM SP-2 (using both MPICH and MPI-F).
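For illustration only, the following is a minimal sketch of the kind of
pairwise swap primitive that these low-level routines provide. PSTSWM itself
implements such routines in Fortran 77 on top of MPI, PICL, PVM, or native
libraries; the Python/mpi4py rendering and the routine name used here are
purely hypothetical.

    # Hypothetical illustration of a pairwise "swap" primitive of the kind
    # PSTSWM encapsulates; not PSTSWM source.
    import numpy as np
    from mpi4py import MPI

    def swap(comm, partner, sendbuf, recvbuf, tag=0):
        """Exchange sendbuf/recvbuf with rank `partner` (numpy arrays)."""
        comm.Sendrecv(sendbuf, dest=partner, sendtag=tag,
                      recvbuf=recvbuf, source=partner, recvtag=tag)

    if __name__ == "__main__":
        comm = MPI.COMM_WORLD
        me = comm.Get_rank()
        partner = me ^ 1                 # pair up neighbouring ranks
        send = np.full(4, float(me))
        recv = np.empty(4)
        if partner < comm.Get_size():    # last rank idles if size is odd
            swap(comm, partner, send, recv)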
-------------------------------------------------------------------------------
Total number of lines in source code: 34,385 in PVM-only version
Number of lines excluding comments  : 14,118 in PVM-only version
Size in bytes of source code        : 1,242,410 in PVM-only version
-------------------------------------------------------------------------------
List input files (filename, number of lines, size in bytes, and if formatted) :

  problem:      24 lines, 561 bytes, ascii
  algorithm:    20 lines, 532 bytes, ascii
  measurements: 13 lines, 339 bytes, ascii
-------------------------------------------------------------------------------
List output files (filename, number of lines, size in bytes, and if
formatted) :

  standard output: The number of lines and bytes is a function of the input
                   specifications, but for benchmarking would normally be 63
                   lines (2000 bytes) of meaningful output.
  timings:         Each run produces one line of output, containing
                   approximately 200 bytes.

  Both files are ascii.
-------------------------------------------------------------------------------
Brief, high-level description of what application does:

(P)STSWM solves the nonlinear shallow water equations on the sphere. The
nonlinear shallow water equations constitute a simplified atmosphere-like fluid
prediction model that exhibits many of the features of more complete models,
and that has been used to investigate numerical methods and to benchmark a
number of machines. Each run of PSTSWM uses one of six embedded initial
conditions and forcing functions. These cases were chosen to stress test
numerical methods for this problem and to represent important flows that
develop in atmospheric modeling. STSWM also supports reading in arbitrary
initial conditions, but this capability was removed from the parallel code to
simplify the development of the parallel implementation.
-------------------------------------------------------------------------------
Main algorithms used:

PSTSWM uses the spectral transform method to solve the shallow water equations.
During each timestep, the state variables of the problem are transformed
between the physical domain, where most of the physical forces are calculated,
and the spectral domain, where the terms of the differential equation are
evaluated. The physical domain is a tensor product longitude-latitude grid. The
spectral domain is the set of spectral coefficients in a spherical harmonic
expansion of the state variables, and is normally characterized as a triangular
array (using a "triangular" truncation of spectral coefficients).

Transforming from physical coordinates to spectral coordinates involves
performing a real FFT for each line of constant latitude, followed by
integration over latitude using Gaussian quadrature (approximating the Legendre
transform) to obtain the spectral coefficients. The inverse transformation
involves evaluating sums of spectral harmonics and inverse real FFTs, analogous
to the forward transform.

Parallel algorithms are used to compute the FFTs and to compute the vector sums
used to approximate the forward and inverse Legendre transforms. Two major
alternatives are available for both transforms: distributed algorithms, which
use a fixed data decomposition and compute results where they are assigned, and
transpose algorithms, which remap the domains to allow the transforms to be
calculated sequentially. This translates to four major parallel algorithms:

  a) distributed FFT/distributed Legendre transform (LT)
  b) transpose FFT/distributed LT
  c) distributed FFT/transpose LT
  d) transpose FFT/transpose LT

Multiple implementations are supported for each type of algorithm, and the
assignment of processors to transforms is also determined by input parameters.
For example, input parameters specify a logical 2-D processor grid and define
the data decomposition of the physical and spectral domains onto this grid. If
16 processors are used, they can be arranged as a 4x4 grid, an 8x2 grid, a
16x1 grid, a 2x8 grid, or a 1x16 grid. This specification determines how many
processors are used to calculate each parallel FFT and how many are used to
calculate each parallel LT.
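For illustration only, the following schematic, serial Python/NumPy sketch
shows the structure of the forward transform just described (a real FFT along
each latitude line, followed by Gaussian quadrature in latitude). It is not
PSTSWM source; the array P of precomputed, normalized associated Legendre
values is assumed as input rather than computed here.

    # Schematic serial forward spectral transform (illustration only).
    # Assumes: grid is an (NLAT, NLON) field sampled on a Gaussian grid, and
    # P[n, m, j] holds precomputed associated Legendre values at the Gaussian
    # latitude nodes -- computing P is outside the scope of this sketch.
    import numpy as np

    def forward_transform(grid, P, mm):
        nlat, nlon = grid.shape
        # Gaussian nodes/weights in mu = sin(latitude) for the quadrature
        _, weights = np.polynomial.legendre.leggauss(nlat)
        # 1) real FFT along each line of constant latitude
        fourier = np.fft.rfft(grid, axis=1) / nlon      # (nlat, nlon//2 + 1)
        # 2) Gaussian quadrature in latitude, approximating the Legendre
        #    transform, for a triangular truncation (n = m..mm for each m)
        spec = np.zeros((mm + 1, mm + 1), dtype=complex)
        for m in range(mm + 1):
            for n in range(m, mm + 1):
                spec[n, m] = np.sum(weights * fourier[:, m] * P[n, m, :])
        return spec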
-------------------------------------------------------------------------------
Skeleton sketch of application:

The main program calls INPUT to read problem and algorithm parameters and set
up arrays for spectral transformations, and then calls INIT to set up the test
case parameters. Routines ERRANL and NRGTCS are called once before the main
timestepping loop for error normalization, once after the main timestepping
loop for calculating energetics data and errors, and periodically during the
timestepping, as requested. The prognostic fields are initialized using routine
ANLYTC, which provides the analytic solution. Each call to STEP advances the
computed fields by a timestep DT. Timing logic surrounds the timestepping loop,
so the initialization phase is not timed. Also, a fake timestep is calculated
before beginning timing to eliminate the first-time "paging" effect seen on
some systems.

STEP computes the first two time levels by two semi-implicit timesteps; all
other timesteps are calculated using a centered leapfrog scheme. STEP calls
COMP1, which chooses between an explicit numerical algorithm, a semi-implicit
algorithm, and a simplified algorithm associated with solving the advection
equation, one of the embedded test cases. The numerical algorithm used is an
input parameter. The basic outline of each timestep is the following (see also
the sketch after this field):

  1) Evaluate nonlinear product and forcing terms.
  2) Fourier transform the nonlinear terms in place as a block transform.
  3) Compute and update divergence, geopotential, and vorticity spectral
     coefficients. (Much of the calculation of the time update is "bundled"
     with the Legendre transform.)
  4) Compute velocity fields and transform divergence, geopotential, and
     vorticity back to gridpoint space using
       a) an inverse Legendre transform and associated computations, and
       b) an inverse real block FFT.

PSTSWM has "fictitious" vertical levels, and all computations are duplicated on
the different levels, potentially significantly increasing the granularity of
the computation. (The number of vertical levels is an input parameter.) For
error analysis, a single vertical level is extracted and analyzed.
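For illustration only, the centered leapfrog update used for all but the first
two timesteps can be shown with a small, self-contained Python toy. PSTSWM
bootstraps the scheme with two semi-implicit timesteps, whereas this toy uses a
single explicit step just to start the recurrence, and the test equation here
is arbitrary.

    # Toy illustration (not PSTSWM) of the centered leapfrog pattern:
    #   x(t+dt) = x(t-dt) + 2*dt*f(x(t))
    import numpy as np

    def integrate(f, x0, dt, nsteps):
        x_prev = np.asarray(x0, dtype=float)
        x_now = x_prev + dt * f(x_prev)            # simple bootstrap step
        for _ in range(nsteps - 1):
            x_next = x_prev + 2.0 * dt * f(x_now)  # centered leapfrog step
            x_prev, x_now = x_now, x_next
        return x_now

    if __name__ == "__main__":
        # example: d/dt [u, v] = [v, -u] (a circular oscillation)
        f = lambda x: np.array([x[1], -x[0]])
        print(integrate(f, [1.0, 0.0], dt=0.01, nsteps=1000))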
-------------------------------------------------------------------------------
Brief description of I/O behavior:

Processor 0 reads the input files and broadcasts the input parameters to the
rest of the processors. Processor 0 also receives the error analysis and timing
results from the other processors and writes them out.
-------------------------------------------------------------------------------
Describe the data distribution (if appropriate) :

The processors are treated as a logical 2-D grid. There are three domains to be
distributed:

  a) physical domain: tensor product longitude-latitude grid
  b) Fourier domain:  tensor product wavenumber-latitude grid
  c) spectral domain: triangular array, where each column contains the spectral
     coefficients associated with a given wavenumber. The larger the wavenumber
     is, the shorter the column is.

An unordered FFT is used, and the Fourier and spectral domains use the
"unordered" permutation when the data is being distributed.

I) distributed FFT/distributed LT
  1) The tensor product longitude-latitude grid is mapped onto the processor
     grid by assigning a block of contiguous longitudes to each processor
     column and by assigning one or two blocks of contiguous latitudes to each
     processor row. The vertical dimension is not distributed.
  2) After the FFT, the subsequent wavenumber-latitude grid is similarly
     distributed over the processor grid, with a block of the permuted
     wavenumbers assigned to each processor column.
  3) After the LT, the wavenumbers are distributed as before, and the spectral
     coefficients associated with any given wavenumber are either distributed
     evenly over the processors in the column containing that wavenumber or are
     duplicated over the column. Which happens is a function of the particular
     distributed LT algorithm used.

IIa) single transpose FFT/distributed LT
  1) Same as in (I).
  2) Before the FFT, the physical domain is first remapped to a vertical
     layer-latitude decomposition, with a block of contiguous vertical layers
     assigned to each processor column and the longitude dimension not
     distributed. After the transform, the vertical layer-latitude grid is
     distributed as before, and the wavenumber dimension is not distributed.
  3) After the LT, the spectral coefficients for a given vertical layer are
     either distributed evenly over the processors in a column or are
     duplicated over that column. Which happens is a function of the particular
     distributed LT algorithm used.

IIb) double transpose FFT/distributed LT
  1) Same as in (I).
  2) Before the FFT, the physical domain is first remapped to a (vertical
     layer/field)-latitude decomposition, with a subset of the vertical layers
     and the fields being transformed assigned to each processor column and the
     longitude dimension not distributed. After the FFT, the Fourier domain is
     remapped to a wavenumber-latitude decomposition, reordering the
     wavenumbers to improve load balance.
  3) After the LT, the wavenumbers are distributed as before, and the spectral
     coefficients associated with any given wavenumber are either distributed
     evenly over the processors in the column containing that wavenumber or are
     duplicated over the column. Which happens is a function of the particular
     distributed LT algorithm used.

III) distributed FFT/transpose LT
  1) Same as in (I).
  2) Same as in (I).
  3) Before the LT, the wavenumber-latitude grid is first remapped to a
     wavenumber-vertical layer decomposition, with a block of contiguous
     vertical layers assigned to each processor row and the latitude dimension
     not distributed. After the transform, the spectral coefficients associated
     with a given wavenumber and vertical layer are all on one processor, and
     the wavenumbers and vertical layers are distributed as before.

IV) transpose FFT/transpose LT
  1) Same as in (I).
  2) Same as in (IIa).
  3) Before the LT, the vertical layer-latitude grid is first remapped to a
     vertical layer-wavenumber decomposition, reordering the wavenumbers to
     improve load balance, with a block of (load-balanced) wavenumbers now
     assigned to each processor row and the latitude dimension not distributed.
     After the transform, the spectral coefficients associated with a given
     wavenumber and vertical layer are all on one processor, and the
     wavenumbers and vertical layers are distributed as before.
-------------------------------------------------------------------------------
Give parameters of the data distribution (if appropriate) :

The distribution is a function of the problem size (longitude, latitude,
vertical levels), the logical processor grid (PX, PY), and the algorithm
(transpose vs. distributed for FFT and LT).
-------------------------------------------------------------------------------
Brief description of load balance behavior :

The load is fairly well balanced. If PX and PY evenly divide the number of
longitudes, latitudes, and vertical levels, then all load imbalances are due to
the unequal distribution of spectral coefficients. As described above, the
spectral coefficients are laid out as a triangular array in most runs, where
each column corresponds to a different Fourier wavenumber. The wavenumbers are
partitioned among the processors in most of the parallel algorithms. Since each
column is a different length, a modified wrap mapping of the columns will
approximately balance the load. For the distributed FFT algorithms, the natural
"unordered" ordering of the FFT is used with a block partitioning, which also
does a reasonable job of load balancing without any additional data movement.
The load imbalance is quantified in Walker et al. [6] and in Foster and Worley
[8]. If PX and PY do not evenly divide the dimensions of the physical domain,
then other load imbalances may be as large as a factor of PX or PY in the worst
(reasonable) case. See Foster and Worley [8] for more details.
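For illustration only, the following self-contained Python sketch shows why a
wrap-style mapping of the triangular spectral array's columns approximately
balances the load. The reflected-wrap rule used here is a stand-in, not the
exact modified wrap mapping used in PSTSWM.

    # Assign the columns of a triangular spectral array (column m has
    # MM+1-m coefficients) to PX processor columns with a reflected wrap
    # mapping, and report the per-processor coefficient counts.
    def wrap_partition(mm, px):
        counts = [0] * px
        for m in range(mm + 1):            # wavenumber m = 0 .. MM
            length = mm + 1 - m            # column length, triangular truncation
            # pair short columns with long ones by reflecting every other pass
            p = m % (2 * px)
            owner = p if p < px else 2 * px - 1 - p
            counts[owner] += length
        return counts

    if __name__ == "__main__":
        print(wrap_partition(mm=42, px=4))   # roughly equal totals per column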
-------------------------------------------------------------------------------
Give parameters that determine the problem size :

  MM, NN, KK       - specify the number of Fourier wavenumbers and the spectral
                     truncation used. For a triangular truncation,
                     MM = NN = KK.
  NLON, NLAT, NVER - number of longitudes, latitudes, and vertical levels.
                     There are required relationships between NLON, NLAT, and
                     NVER, and between these and MM. These relationships are
                     checked in the code. We will also provide a selection of
                     input files that specify legal (and interesting) problems.
  DT               - timestep (in seconds). (Must be small enough to satisfy
                     the Courant stability condition. The code warns if the
                     timestep is too large, but does not abort.)
  TAUE             - end of model run (in hours)
-------------------------------------------------------------------------------
Give memory as function of problem size :

Executable size is determined at compile time by setting the makefile parameter
WORKSPACE. Per-node memory requirements are approximately (in REALs):

  associated Legendre polynomial values:
    MM*MM*NLAT/(2*PX*PY)
    or (if the vertical dimension of the spectral domain is decomposed)
    MM*MM*NLAT/(2*PY)

  physical grid fields:
    8*NLON*NLAT*NVER/(PX*PY)

  spectral grid fields:
    3*MM*MM*NVER/(PX*PY)
    or (if spectral coefficients are duplicated within a processor column)
    3*MM*MM*NVER/PX

  work space:
    8*NLON*NLAT*NVER*BUFS1/(PX*PY) + 3*MM*MM*NVER*BUFS2/(PX*PY)
    or (if spectral coefficients are duplicated within a processor column)
    8*NLON*NLAT*NVER*BUFS1/(PX*PY) + 3*MM*MM*NVER*BUFS2/PX

where BUFS1 and BUFS2 are input parameters (number of communication buffers).
BUFS1 and BUFS2 can be as small as 0 and as large as PX or PY. In standard test
cases, NLON=2*NLAT and NLON=3*MM+1, so memory requirements are approximately:

  (3/4)*(MM**3)/(PX*PY) + (36*(1+BUFS1) + 3*(1+BUFS2))*(MM**2)*NVER/(PX*PY)
  or
  (3/4)*(MM**3)/PX + (36*(1+BUFS1) + 3*(1+BUFS2))*(MM**2)*NVER/(PX*PY)
  or
  (3/4)*(MM**3)/(PX*PY) + 36*(1+BUFS1)*(MM**2)*NVER/(PX*PY)
                        + 3*(1+BUFS2)*(MM**2)*NVER/PX
-------------------------------------------------------------------------------
Give number of floating-point operations as function of problem size :

For a serial run, per timestep (approximate):

  nonlinear terms:
    12*NLON*NLAT*NVER
  forward FFT:
    20*NLON*NLAT*NVER*LOG2(NLON) + 36*NLON*NLAT*NVER
  forward LT and time update:
    7*(MM**2)*NLAT*NVER + 61*MM*NLAT*NVER + 7*(MM**2)*NVER
  inverse LT and calculation of velocities:
    7*(MM**2)*NLAT*NVER + 10*MM*NLAT*NVER + 9*(MM**2)*NVER
  inverse FFT:
    13*NLON*NLAT*NVER*LOG2(NLON) + 18*NLON*NLAT*NVER

Using the standard assumptions (NLON=2*NLAT and NLON=3*MM+1), this is
approximately:

  21*(MM**3)*NVER + 149*(MM**2)*LOG2(MM)*NVER + 662*(MM**2)*NVER

For a total run, multiply by TAUE/DT. (A small calculator for these memory and
flop estimates is sketched below.)
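For illustration only, the following Python sketch evaluates the approximate
per-node memory and serial flop formulas above for a given problem size. It
assumes 8-byte REALs (precision option (a)), the standard relationships
NLON=2*NLAT and NLON=3*MM+1, and the variant in which spectral coefficients
are not duplicated within a processor column; it is not part of PSTSWM.

    # Evaluate the approximate formulas above (illustration only).
    import math

    def per_node_memory_reals(mm, nver, px, py, bufs1=1, bufs2=1):
        nlon = 3 * mm + 1
        nlat = nlon // 2
        legendre = mm * mm * nlat / (2.0 * px * py)
        physical = 8.0 * nlon * nlat * nver / (px * py)
        spectral = 3.0 * mm * mm * nver / (px * py)
        work = (8.0 * nlon * nlat * nver * bufs1
                + 3.0 * mm * mm * nver * bufs2) / (px * py)
        return legendre + physical + spectral + work

    def serial_flops_per_timestep(mm, nver):
        nlon = 3 * mm + 1
        nlat = nlon // 2
        lg = math.log2(nlon)
        nonlinear = 12.0 * nlon * nlat * nver
        ffts = (20.0 + 13.0) * nlon * nlat * nver * lg \
               + (36.0 + 18.0) * nlon * nlat * nver
        lts = 14.0 * mm * mm * nlat * nver + 71.0 * mm * nlat * nver \
              + 16.0 * mm * mm * nver
        return nonlinear + ffts + lts

    if __name__ == "__main__":
        # T42-like case (TAUE=120 hours, DT=2000 s) on a 4x4 processor grid
        mem_mb = per_node_memory_reals(42, 16, 4, 4) * 8 / 1e6  # 8-byte REALs
        gflop = serial_flops_per_timestep(42, 16) * (120.0 * 3600 / 2000.0) / 1e9
        print(round(mem_mb, 1), "MB/node (approx.)",
              round(gflop, 1), "Gflop total (approx.)")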
-------------------------------------------------------------------------------
Give communication overhead as function of problem size and data
distribution :

This is a function of the algorithm chosen.

  Ia) single transpose FFT
      a) forward + inverse FFT: let D = 13*NLON*NLAT*NVER/(PX*PY)
         2*(PX-1) steps, 2*D volume
         or 2*LOG2(PX) steps, D*LOG2(PX) volume

  Ib) double transpose FFT
      a) forward + inverse FFT: let D = 13*NLON*NLAT*NVER/(PX*PY)
         4*(PX-1) steps, 4*D volume
         or 4*LOG2(PX) steps, 2*D*LOG2(PX) volume

  II) distributed FFT
      a) forward + inverse FFT: let D = 13*NLON*NLAT*NVER/(PX*PY)
         2*LOG2(PX) steps, D*LOG2(PX) volume

 III) transpose LT
      a) forward LT: let D = 16*MM*NLAT*NVER/(PX*PY)
         PY-1 steps, D volume
         or LOG2(PY) steps, (D/2)*LOG2(PY) volume
      b) inverse LT: let D = 10*MM*NLAT*NVER/(PX*PY)
         PY-1 steps, D volume
         or LOG2(PY) steps, (D/2)*LOG2(PY) volume

  IV) distributed LT
      a) forward + inverse LT: let D = 3*(MM**2)*NVER/(PX*PY)
         2*(PY-1) steps, D*PY volume
         or 2*LOG2(PY) steps, D*PY volume

These are per-timestep costs. Multiply by TAUE/DT for total communication
overhead. (A small calculator for these estimates is sketched below.)
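For illustration only, the following Python sketch evaluates the per-timestep
communication estimates above for the ring-style variants of the transpose FFT
and transpose LT; the other cases follow the same pattern. It is not part of
PSTSWM, and the standard problem-size relationships NLON=2*NLAT and
NLON=3*MM+1 are assumed.

    # Evaluate the per-timestep (steps, volume in REALs) estimates above.
    def transpose_fft_cost(mm, nver, px, py, double=False):
        nlon = 3 * mm + 1
        nlat = nlon // 2
        d = 13.0 * nlon * nlat * nver / (px * py)
        factor = 2 if double else 1
        # ring-style variant: 2*(PX-1) steps and 2*D volume, doubled for the
        # double-transpose algorithm
        return factor * 2 * (px - 1), factor * 2 * d

    def transpose_lt_cost(mm, nver, px, py):
        nlon = 3 * mm + 1
        nlat = nlon // 2
        d_fwd = 16.0 * mm * nlat * nver / (px * py)
        d_inv = 10.0 * mm * nlat * nver / (px * py)
        # ring-style variant: (PY-1) steps, D volume for each direction
        return 2 * (py - 1), d_fwd + d_inv

    if __name__ == "__main__":
        # T42-like case on an 8x8 processor grid
        print(transpose_fft_cost(42, 16, 8, 8))
        print(transpose_lt_cost(42, 16, 8, 8))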
-------------------------------------------------------------------------------
Give three problem sizes, small, medium, and large, for which the benchmark
should be run (give parameters for problem size, sizes of I/O files, memory
required, and number of floating point operations) :

Standard input files will be provided for:

  T42:  MM=NN=KK=42     T170: MM=NN=KK=170    T340: MM=NN=KK=340
        NLON=128              NLON=512              NLON=1024
        NLAT=64               NLAT=256              NLAT=512
        NVER=16               NVER=32               NVER=64
        ICOND=2               ICOND=2               ICOND=2
        DT=2000.0             DT=500.0              DT=250.0
        TAUE=120.0            TAUE=30.0             TAUE=15.0

These are runs of the "benchmark" case specified in Williamson et al. [4] for
different resolutions. The length of each run has been chosen in accordance
with the ParkBench committee's suggestions for the sizes of small, medium, and
large example problems. Approximate flop counts and memory requirements for
serial runs are as follows:

  T42:  30 MBytes of workspace when run with 64-bit precision, 13 Gflop
  T170: 1000 MBytes of workspace when run with 64-bit precision, 1000 Gflop
  T340: 1300 MBytes of workspace when run with 64-bit precision, 13000 Gflop

Both memory and flops scale well, so, for example, the T42 run fits in
approximately 8 MBytes of memory for a 4-processor run. But different
algorithms and different aspect ratios of the processor grid use different
amounts of memory.
-------------------------------------------------------------------------------
How did you determine the number of floating-point operations (hardware
monitor, count by hand, etc.) :

The T42 and T170 counts were determined using the hardware monitor on a Cray
vector machine, in the following fashion. The code was compiled and run using a
number of different compiler options. The counts here are based on the minimum
number of floating point operations reported during these experiments. The T340
count is based on extrapolation using T42, T85, and T170 hardware monitor
values, and was verified using the complexity results described above.
-------------------------------------------------------------------------------