-------------------------------------------------------------------------------
Name of Program          : POLMP (Proudman Oceanographic Laboratory
                           Multiprocessing Program)
-------------------------------------------------------------------------------
Submitter's Name         : Mike Ashworth
Submitter's Organization : NERC Computer Services
Submitter's Address      : Bidston Observatory
                           Birkenhead, L43 7RA, UK
Submitter's Telephone #  : +44-51-653-8633
Submitter's Fax #        : +44-51-653-6269
Submitter's Email        : M.Ashworth@ncs.nerc.ac.uk
-------------------------------------------------------------------------------
Major Application Field  : Fluid Dynamics
Application Subfield(s)  : Ocean and Shallow Sea Modeling
-------------------------------------------------------------------------------
Application "pedigree" (origin, history, authors, major mods) :

The POLMP project was created to develop numerical algorithms for shallow sea
3D hydrodynamic models that run efficiently on modern parallel computers. A
code was developed, using a set of portable programming conventions based upon
standard Fortran 77, which models the wind-induced flow in a closed rectangular
basin containing a number of arbitrary land areas. The model solves a set of
hydrodynamic partial differential equations, subject to a set of initial
conditions, using a mixed explicit/implicit forward time integration scheme.
The explicit component corresponds to a horizontal finite difference scheme and
the implicit component to a functional expansion in the vertical (Davies,
Grzonka and Stephens, 1989).

By the end of 1989 the code had been implemented on the RAL 4-processor Cray
X-MP using Cray's microtasking system, which provides parallel processing at
the level of the Fortran DO loop. Acceptable parallel performance was achieved
by integrating each of the vertical modes in parallel, referred to in Ashworth
and Davies (1992) as vertical partitioning. In particular, a speed-up of 3.15
over single-processor execution was obtained, with an execution rate of 548
Megaflops corresponding to 58 per cent of the peak theoretical performance of
the machine. Execution on an 8-processor Cray Y-MP gave a speed-up of 7.9 and
1768 Megaflops, or 67 per cent of peak (Davies, Proctor and O'Neill, 1991). The
latter result earned the project a prize in the 1990 Cray Gigaflop Performance
Awards.

The project has been extended by implementing the shallow sea model in a form
which is more appropriate to a variety of parallel architectures, especially
distributed memory machines, and to a larger number of processors. It is
especially desirable to be able to compare shared memory parallel architectures
with distributed memory architectures. Such a comparison is currently relevant
to NERC science generally and will be a factor in considerations for the
purchase of new machines, in bids for allocations on other academic machines,
and in the design of new codes and the restructuring of existing codes. In
order to simplify development of the new code and to ensure a proper comparison
between machines, a restructured version of the original code was designed
which performs partitioning of the region in the horizontal dimensions. This
has the advantage over vertical partitioning that the communication between
processors is limited to a few points at the boundaries of each sub-domain.
The ratio of interior points to boundary points, which determines the ratio of
computation to communication and hence the efficiency on message-passing,
distributed-memory machines, may be increased by increasing the size of the
individual sub-domains. This design may also improve the efficiency on shared
memory machines by reducing the time spent in critical sections and reducing
memory conflicts between processors. In addition, the required number of
vertical modes is only about 16, which, though well suited to a 4 or 8
processor machine, does not contain sufficient parallelism for more highly
parallel machines. The code has been designed with portability in mind, so that
essentially the same code may be run on parallel computers with a range of
architectures.
-------------------------------------------------------------------------------
May this code be freely distributed (if not specify restrictions) :

Yes, but users are requested to acknowledge the author (Mike Ashworth) in any
resulting research or publications, and are encouraged to send reprints of
their work with this code to the authors. Also, the authors would appreciate
being notified of any modifications to the code.
-------------------------------------------------------------------------------
Give length in bytes of integers and floating-point numbers that should be
used in this application:

Some 8-byte floating-point numbers are used in parts of the initialization
code, but calculations on the main field arrays may be done using 4-byte
floating-point variables without grossly affecting the solution. Nevertheless,
precision conversion is facilitated by a switch supplied to the C preprocessor.
By specifying -DSINGLE, variables will be declared as REAL, normally 4 bytes,
whereas -DDOUBLE will cause declarations to be DOUBLE PRECISION, normally 8
bytes.
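One common way to realize such a preprocessor switch is sketched below,
assuming the source is passed through the C preprocessor before compilation;
the macro name FLTYPE and the routine name ZEROFD are illustrative and are not
taken from the POLMP source.

#ifdef DOUBLE
#define FLTYPE DOUBLE PRECISION
#endif
#ifdef SINGLE
#define FLTYPE REAL
#endif

      SUBROUTINE ZEROFD ( A, N )
C     Illustrative only: FLTYPE expands to REAL under -DSINGLE and to
C     DOUBLE PRECISION under -DDOUBLE, so the same source yields
C     4-byte or 8-byte field arrays.
      INTEGER N, I
      FLTYPE A(N)
      DO 10 I = 1, N
         A(I) = 0.0
 10   CONTINUE
      RETURN
      END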
-------------------------------------------------------------------------------
Documentation describing the implementation of the application (at module
level, or lower) :

The documentation supplied with the code describes how the various versions of
the code should be built. Extensive documentation, including the definition of
all variables in COMMON, is present as comments in the code.
-------------------------------------------------------------------------------
Research papers describing sequential code and/or algorithms :

1) Davies, A.M., Formulation of a linear three-dimensional hydrodynamic sea
   model using a Galerkin-eigenfunction method, Int. J. Num. Meth. in Fluids,
   1983, Vol. 3, 33-60.

2) Davies, A.M., Solution of the 3D linear hydrodynamic equations using an
   enhanced eigenfunction approach, Int. J. Num. Meth. in Fluids, 1991,
   Vol. 13, 235-250.
-------------------------------------------------------------------------------
Research papers describing parallel code and/or algorithms :

1) Ashworth, M. and Davies, A.M., Restructuring three-dimensional hydrodynamic
   models for computers with low and high degrees of parallelism, in Parallel
   Computing '91, eds D.J. Evans, G.R. Joubert and H. Liddell (North Holland,
   1992), 553-560.

2) Ashworth, M., Parallel Processing in Environmental Modelling, in Proceedings
   of the Fifth ECMWF Workshop on Use of Parallel Processors in Meteorology
   (Nov. 23-27, 1992), eds G.-R. Hoffman and T. Kauranne, World Scientific
   Publishing Co. Pte. Ltd, Singapore, 1993.

3) Ashworth, M. and Davies, A.M., Performance of a Three Dimensional
   Hydrodynamic Model on a Range of Parallel Computers, in Proceedings of the
   Euromicro Workshop on Parallel and Distributed Computing, Gran Canaria,
   27-29 January 1993, pp. 383-390 (IEEE Computer Society Press).

4) Davies, A.M., Ashworth, M., Lawrence, J. and O'Neill, M., Implementation of
   three dimensional shallow sea models on vector and parallel computers,
   1992a, CFD News, Vol. 3, No. 1, 18-30.

5) Davies, A.M., Grzonka, R.G. and Stephens, C.V., The implementation of
   hydrodynamic numerical sea models on the Cray X-MP, 1992b, in Advances in
   Parallel Computing, Vol. 2, edited by D.J. Evans.

6) Davies, A.M., Proctor, R. and O'Neill, M., "Shallow Sea Hydrodynamic Models
   in Environmental Science", Cray Channels, Winter 1991.
-------------------------------------------------------------------------------
Other relevant research papers:
-------------------------------------------------------------------------------
Application available in the following languages (give message passing system
used, if applicable, and machines application runs on) :

Fortran 77 version

A sequential version of POLMP is available, which conforms to the Fortran 77
standard. This version has been run on a large number of machines, from
workstations to supercomputers, and any code which caused problems, even if it
conformed to the standard, has been changed or removed. Thus its conformance
to the Fortran 77 standard is well established.

In order to allow the code to run on a wide range of problem sizes without
recompilation, the major arrays are defined dynamically by setting up pointers,
with names starting with IX, which point to locations in a single large data
array, SA. Most pointers are allocated in subroutine MODSUB and the starting
locations are passed down into subroutines in which they are declared as
arrays. For example:

      IX1 = 1
      IX2 = IX1 + N*M
      CALL SUB ( SA(IX1), SA(IX2), N, M )

      SUBROUTINE SUB ( A1, A2, N, M )
      DIMENSION A1(N,M), A2(N,M)
      END

Although this is probably against the spirit of the Fortran 77 standard, it is
considered the best compromise between portability and utility, and has caused
no problems on any of the machines on which it has been tried. The code has
been run on a number of traditional vector supercomputers, mainframes and
workstations. In addition, key loops can be parallelized automatically by some
compilers on shared (or virtual shared) memory MIMD machines, allowing parallel
execution on the Convex C2 and C3, Cray X-MP, Y-MP and Y-MP/C90, and Kendall
Square Research KSR-1. Cray macrotasking calls may also be enabled for an
alternative mode of parallel execution on Cray multiprocessors.

PVM and Parmacs versions

POLMP has been implemented on a number of message-passing machines: Intel
iPSC/2 and iPSC/860, Meiko CS-1 i860 and CS-2, and nCUBE 2. Code is also
present for the PVM and Parmacs portable message-passing systems, and POLMP
has run successfully, though not efficiently, on a network of Silicon Graphics
workstations. Calls to message-passing routines are concentrated in a small
number of routines for ease of portability and maintenance. POLMP performs
housekeeping tasks on one node of the parallel machine, usually node zero,
referred to in the code as the driver process, the remaining processes being
workers.

PVM and Parmacs are software environments for implementing the message-passing
parallel programming model. Portability is achieved in PVM using library calls
and in Parmacs using macro definitions which are then translated into machine
dependent code, or into calls to machine dependent library routines. Both have
been implemented on a wide range of parallel architectures, and have thus
become de facto standards for message-passing codes. However, both are likely
to be superseded by the Message Passing Interface when it comes into
widespread usage.

Parmacs version 5 requires the use of a host process, which had not previously
been used in the POLMP code, since housekeeping is handled by the driver
process on one of the nodes. For Parmacs, a simple host program has been
provided which loads the node program onto a two-dimensional torus and then
takes no further part in the run, other than to receive a completion code from
the driver, in case terminating the host early would interfere with execution
of the nodes.
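To illustrate how the message-passing calls can be concentrated in a few small
routines, the following is a minimal sketch of a boundary send and receive
pair written against the PVM 3 Fortran interface; the routine names BNDSND and
BNDRCV, the use of 4-byte reals and the argument names are illustrative and do
not reproduce the actual SNDBND/RCVBND code.

      SUBROUTINE BNDSND ( BUF, NPTS, IDEST, ITAG )
C     Illustrative halo send: pack NPTS boundary values into a PVM
C     message buffer and send them to the neighbouring task whose PVM
C     task identifier is IDEST, using message tag ITAG.
      INCLUDE 'fpvm3.h'
      INTEGER NPTS, IDEST, ITAG, IBUF, INFO
      REAL BUF(NPTS)
      CALL PVMFINITSEND ( PVMDEFAULT, IBUF )
      CALL PVMFPACK ( REAL4, BUF, NPTS, 1, INFO )
      CALL PVMFSEND ( IDEST, ITAG, INFO )
      RETURN
      END

      SUBROUTINE BNDRCV ( BUF, NPTS, ISRC, ITAG )
C     Illustrative halo receive: block until the matching message
C     arrives from task ISRC and unpack it into BUF.
      INCLUDE 'fpvm3.h'
      INTEGER NPTS, ISRC, ITAG, IBUF, INFO
      REAL BUF(NPTS)
      CALL PVMFRECV ( ISRC, ITAG, IBUF )
      CALL PVMFUNPACK ( REAL4, BUF, NPTS, 1, INFO )
      RETURN
      END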
Subset-HPF version

A data parallel version of the code has been run on the Thinking Machines CM-2
and CM-200 and the MasPar MP-1. High Performance Fortran (HPF) defines
extensions to the Fortran 90 language in order to provide support for parallel
execution on a wide variety of machines using a data parallel programming
model. The subset-HPF version of the POLMP code has been written to the draft
standard specified by the High Performance Fortran Forum in the HPF Language
Specification version 0.4, dated November 6, 1992. The Fortran 90 code was
developed on a Thinking Machines CM-200 and checked for conformance with the
Fortran 90 standard using the NAGWare Fortran 90 compiler. HPF directives were
inserted by translating from the CM Fortran directives, but have not been
tested owing to the lack of access to an HPF compiler. The only HPF features
used are the PROCESSORS, TEMPLATE, ALIGN and DISTRIBUTE directives and the
system inquiry intrinsic function NUMBER_OF_PROCESSORS. In order to allow the
code to run on a wide range of problem sizes without recompilation, the major
arrays are defined in subroutine MODSUB using automatic array declarations
with the dimensions of the arrays being passed as subroutine arguments.
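As a sketch of the kind of declarations involved (the routine, template,
processor and array names below are illustrative and are not taken from the
POLMP source), a block distribution of a two-dimensional field might be
declared as follows:

      SUBROUTINE DISTEX ( NX, NY )
!     Illustrative subset-HPF directives: distribute a template in
!     blocks over the available processors and align a field array
!     with it, so that the compiler partitions the data.
      INTEGER NX, NY
      REAL Z(NX,NY)
!HPF$ PROCESSORS PROCS(NUMBER_OF_PROCESSORS())
!HPF$ TEMPLATE DOM(NX,NY)
!HPF$ DISTRIBUTE DOM(*,BLOCK) ONTO PROCS
!HPF$ ALIGN Z(I,J) WITH DOM(I,J)
      Z = 0.0
      END SUBROUTINE DISTEX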
-------------------------------------------------------------------------------
Total number of lines in source code: 26,000 (approx.)
Number of lines excluding comments  : 13,000
Size in bytes of source code        : 700,000
-------------------------------------------------------------------------------
List input files (filename, number of lines, size in bytes, and if
formatted) :

steering file: 48 lines, 2170 bytes, ascii (typical size)
-------------------------------------------------------------------------------
List output files (filename, number of lines, size in bytes, and if
formatted) :

standard output: 700 lines, 62,000 bytes, ascii (typical size)
-------------------------------------------------------------------------------
Brief, high-level description of what application does:

POLMP solves the linear three-dimensional hydrodynamic equations for the
wind-induced flow in a closed rectangular basin of constant depth which may
include an arbitrary number of land areas.
-------------------------------------------------------------------------------
Main algorithms used:

The discretized form of the hydrodynamic equations is solved for the field
variables z, the surface elevation, and u and v, the horizontal components of
velocity. The fields are represented in the horizontal by a staggered finite
difference grid. The profile of vertical velocity with depth is represented by
the superposition of a number of spectral components. The functions used in
the vertical are arbitrary, although the computational advantages of using
eigenfunctions (modes) of the eddy viscosity profile have been demonstrated
(Davies, 1983). Velocities at the closed boundaries are set to zero.

Each timestep in the forward time integration of the model involves successive
updates to the three fields z, u and v. New field values computed in each
update are used in the subsequent calculations. A five-point finite difference
stencil is used, requiring only nearest neighbours on the grid. A number of
different data storage and data processing methods are included, mainly for
handling cases with significant amounts of land, e.g. index arrays and packed
data. In particular, the program may be switched between masked operation,
more suitable for vector processors, in which computation is done on all
points but land and boundary points are masked out, and strip-mining, more
suitable for scalar and RISC processors, in which calculations are done only
for sea points.
-------------------------------------------------------------------------------
Skeleton sketch of application:

The call chart of the major subroutines is represented thus:

AAAPOL -> APOLMP -> INIT -> RUNPOL -> INIT2 -> MAP -> GLOBAL -> SETDEP ->
INDP -> IRISH -> INTERP -> EIRISH -> GRAPH -> DIVIDE -> PRMAP -> SNDWRK ->
RCVWRK -> SETUP -> GENSTP -> SPEC -> ROOTS -> TRANS -> MODSUB -> MODEL ->
ASSIGN -> GENMSK -> GENSTP -> GENIND -> GENPAC -> METRIC -> STATS -> SEQ ->
CLRFLD -> TIME* -> SNDBND -> PSTBND -> RCVBND -> RESULT -> SNDRES -> RCVRES ->
MODOUT -> OZUVW -> OUTFLD -> GETRES -> OUTARR -> GRYARR -> WSTATE

AAAPOL is a dummy main program calling APOLMP. APOLMP calls INIT, which reads
parameters from the steering file and checks and monitors them. RUNPOL is then
called, which calls another initialization routine, INIT2. Called from INIT2,
MAP forms a map of the domain to be modelled; GLOBAL, IRISH and EIRISH are
hard-wired maps of three real-world models; GRAPH generates a connection graph
for the network of grid points; DIVIDE divides the domain between processors;
and PRMAP maps sub-domains onto processes. SNDWRK on the driver process sends
details of the sub-domain to be worked on to each worker, and RCVWRK receives
that information. SETUP does some array allocation, GENSTP generates indexes
for strip-mining, and SPEC, ROOTS and TRANS set up the coefficients for the
spectral expansion. MODSUB does the main allocation of array space to the
field and ancillary arrays. MODEL is the main driver subroutine for the model.
ASSIGN calls routines to generate masks, strip-mining indexes, packing indexes
and measurement metrics. CLRFLD initializes the main data arrays. One of seven
time-stepping routines, TIME*, is chosen depending on the vectorization and
packing/indexing method used to cope with the presence of land and boundary
data. SNDBND and RCVBND handle the sending and reception of boundary data
between sub-domains. After the required number of time-steps is complete,
RESULT saves results from the desired region, and SNDRES on the workers and
RCVRES on the driver collect the result data. MODOUT handles the writing of
model output to standard output and disk files, as required. For a non-trivial
run, 99% of the time is spent in whichever of the timestepping routines,
TIME*, has been chosen.
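For illustration, a masked update of the elevation field for a single vertical
mode might have the following shape. This is a sketch only, with placeholder
coefficients rather than the actual POLMP finite difference scheme, but it
shows the five-point staggered-grid stencil and the use of a mask to suppress
land and boundary points:

      SUBROUTINE TSTEPX ( Z, U, V, ZMSK, NX, NY, DTX, DTY )
C     Illustrative masked update: every point is computed, and ZMSK,
C     which is 1.0 at sea points and 0.0 at land and boundary points,
C     discards the contribution at masked points.  Only nearest
C     neighbours on the grid are referenced.
      INTEGER NX, NY, I, J
      REAL Z(NX,NY), U(NX,NY), V(NX,NY), ZMSK(NX,NY)
      REAL DTX, DTY
      DO 20 J = 2, NY-1
         DO 10 I = 2, NX-1
            Z(I,J) = Z(I,J) - ZMSK(I,J) *
     &               ( DTX * ( U(I+1,J) - U(I,J) )
     &               + DTY * ( V(I,J+1) - V(I,J) ) )
 10      CONTINUE
 20   CONTINUE
      RETURN
      END

The strip-mined variants instead loop over a precomputed list of sea-point
indexes, so that no work is done at land points.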
-------------------------------------------------------------------------------
Brief description of I/O behavior:

The driver process, usually processor 0, reads in the input parameters and
broadcasts them to the rest of the processors. The driver also receives the
results from the other processors and writes them out.
-------------------------------------------------------------------------------
Describe the data distribution (if appropriate) :

The processors are treated as a logical 2-D grid. The simulation domain is
divided into a number of sub-domains which are allocated, one sub-domain per
processor.
-------------------------------------------------------------------------------
Give parameters of the data distribution (if appropriate) :

The number of processors, p, and the number of sub-domains are provided as
steering parameters, as is a switch which requests either one-dimensional or
two-dimensional partitioning. Partitioning is only actually carried out for
the message-passing versions of the code. For two-dimensional partitioning p
is factored into px and py, where px and py are as close as possible to
sqrt(p). For the data parallel version the number of sub-domains is set to one
and decomposition is performed by the compiler via data distribution
directives.
-------------------------------------------------------------------------------
Brief description of load balance behavior :

Unless land areas are specified, the load is fairly well balanced. If px and
py evenly divide the number of grid points, then the model is perfectly
balanced, except that boundary sub-domains have fewer communications. No tests
with land areas have yet been performed with the parallel code, and more
sophisticated domain decomposition algorithms have not yet been included.
-------------------------------------------------------------------------------
Give parameters that determine the problem size :

nx, ny   Size of horizontal grid
m        Number of vertical modes
nts      Number of timesteps to be performed
-------------------------------------------------------------------------------
Give memory as function of problem size :

See below for specific examples.
-------------------------------------------------------------------------------
Give number of floating-point operations as function of problem size :

Assuming standard compiler optimizations, there is a requirement for 29
floating-point operations (18 add/subtracts and 11 multiplies) per grid point,
so the total computational load is

   29 * nx * ny * m * nts
-------------------------------------------------------------------------------
Give communication overhead as function of problem size and data
distribution :

During each timestep each sub-domain of size nsubx=nx/px by nsuby=ny/py
requires the following communications (in words) :

   nsubx * m   from N
   nsubx       from S
   nsubx * m   from S
   nsuby * m   from W
   nsuby       from E
   nsuby * m   from E
   m           from NE
   m           from SW

making a total of (2*m + 1) * (nsubx + nsuby) + 2*m words in eight messages
from six directions.
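The small program below, which is not part of POLMP, simply evaluates the
estimates above: it factors p into px and py as close as possible to sqrt(p),
forms the sub-domain sizes, and computes the per-timestep halo traffic per
sub-domain and the total operation count. The parameter values shown are
illustrative.

      PROGRAM COSTEX
C     Illustrative cost estimate for the formulas given above.
      INTEGER P, PX, PY, NX, NY, M, NTS, NSUBX, NSUBY, IWORDS
      REAL FLOPS
C     Example parameters (illustrative values only)
      P   = 16
      NX  = 512
      NY  = 512
      M   = 16
      NTS = 200
C     Factor p: largest px <= sqrt(p) that divides p exactly
      PX = INT( SQRT( REAL(P) ) )
 10   CONTINUE
      IF ( MOD( P, PX ) .NE. 0 ) THEN
         PX = PX - 1
         GOTO 10
      ENDIF
      PY = P / PX
C     Sub-domain sizes (assuming px and py divide nx and ny evenly)
      NSUBX = NX / PX
      NSUBY = NY / PY
C     Words received per sub-domain per timestep, as derived above
      IWORDS = (2*M + 1) * (NSUBX + NSUBY) + 2*M
C     Total floating-point operations: 29 per grid point per mode
C     per timestep
      FLOPS = 29.0 * REAL(NX) * REAL(NY) * REAL(M) * REAL(NTS)
      PRINT *, 'px, py          = ', PX, PY
      PRINT *, 'halo words/step = ', IWORDS
      PRINT *, 'total flop      = ', FLOPS
      END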
-------------------------------------------------------------------------------
Give three problem sizes, small, medium, and large for which the benchmark
should be run (give parameters for problem size, sizes of I/O files, memory
required, and number of floating point operations) :

The data sizes and computational requirements for the three benchmarking
problems supplied are :

   Name     nx  x   ny x  m x nts    Computational Load   Memory
                                          (Gflop)         (Mword)

   v200     512 x  512 x 16 x 200            24              14
   wa200   1024 x 1024 x 40 x 200           226             126
   xb200   2048 x 2048 x 80 x 200          1812             984

The memory sizes are the number of Fortran real elements (words) required for
the strip-mined case on a single processor. For the masked case the memory
requirement is approximately doubled for the extra mask arrays. For the
message passing versions, the total memory requirement will also tend to
increase slightly (<10%) with the number of processors employed.
-------------------------------------------------------------------------------
How did you determine the number of floating-point operations (hardware
monitor, count by hand, etc.) :

Count by hand, looking at inner loops and making reasonable assumptions about
common compiler optimizations.
-------------------------------------------------------------------------------
Other relevant information:
-------------------------------------------------------------------------------