-------------------------------------------------------------------------------
Name of Program          : POLMP (Proudman Oceanographic Laboratory
                           Multiprocessing Program)
-------------------------------------------------------------------------------
Submitter's Name         : Mike Ashworth
Submitter's Organization : NERC Computer Services
Submitter's Address      : Bidston Observatory
                           Birkenhead, L43 7RA, UK
Submitter's Telephone #  : +44-51-653-8633
Submitter's Fax #        : +44-51-653-6269
Submitter's Email        : M.Ashworth@ncs.nerc.ac.uk
-------------------------------------------------------------------------------
Major Application Field  : Fluid Dynamics
Application Subfield(s)  : Ocean and Shallow Sea Modeling
-------------------------------------------------------------------------------
Application "pedigree" (origin, history, authors, major mods) :

The POLMP project was created to develop numerical algorithms for shallow sea
3D hydrodynamic models that run efficiently on modern parallel computers. A
code was developed, using a set of portable programming conventions based upon
standard Fortran 77, which models the wind-induced flow in a closed rectangular
basin containing a number of arbitrary land areas. The model solves a set of
hydrodynamic partial differential equations, subject to a set of initial
conditions, using a mixed explicit/implicit forward time integration scheme.
The explicit component corresponds to a horizontal finite difference scheme and
the implicit component to a functional expansion in the vertical (Davies,
Grzonka and Stephens, 1989).

By the end of 1989 the code had been implemented on the RAL 4-processor Cray
X-MP using Cray's microtasking system, which provides parallel processing at
the level of the Fortran DO loop. Acceptable parallel performance was achieved
by integrating each of the vertical modes in parallel, referred to in Ashworth
and Davies (1992) as vertical partitioning. In particular, a speed-up of 3.15
over single-processor execution was obtained, with an execution rate of 548
Megaflops corresponding to 58 per cent of the peak theoretical performance of
the machine. Execution on an 8-processor Cray Y-MP gave a speed-up of 7.9 and
1768 Megaflops, or 67 per cent of peak (Davies, Proctor and O'Neill, 1991). The
latter result earned the project a prize in the 1990 Cray Gigaflop Performance
Awards.

The project has been extended by implementing the shallow sea model in a form
which is more appropriate to a variety of parallel architectures, especially
distributed memory machines, and to a larger number of processors. It is
especially desirable to be able to compare shared memory parallel architectures
with distributed memory architectures. Such a comparison is currently relevant
to NERC science generally and will be a factor in considerations for the
purchase of new machines, in bids for allocations on other academic machines,
and in the design of new codes and the restructuring of existing codes. In
order to simplify development of the new code and to ensure a proper comparison
between machines, a restructured version of the original code was designed
which performs partitioning of the region in the horizontal dimensions. This
has the advantage over vertical partitioning that the communication between
processors is limited to a few points at the boundaries of each sub-domain.
The ratio of interior points to boundary points, which determines the ratio of
computation to communication and hence the efficiency on message-passing,
distributed-memory machines, may be increased by increasing the size of the
individual sub-domains. This design may also improve the efficiency on shared
memory machines by reducing the time spent in critical sections and reducing
memory conflicts between processors. In addition, the required number of
vertical modes is only about 16, which, though well suited to a 4 or 8
processor machine, does not contain sufficient parallelism for more highly
parallel machines. The code has been designed with portability in mind, so that
essentially the same code may be run on parallel computers with a range of
architectures.
-------------------------------------------------------------------------------
May this code be freely distributed (if not specify restrictions) :

Yes, but users are requested to acknowledge the author (Mike Ashworth) in any
resulting research or publications, and are encouraged to send reprints of
their work with this code to the authors. Also, the authors would appreciate
being notified of any modifications to the code.
-------------------------------------------------------------------------------
Give length in bytes of integers and floating-point numbers that should be
used in this application:

Some 8-byte floating-point numbers are used in parts of the initialization
code, but calculations on the main field arrays may be done using 4-byte
floating-point variables without grossly affecting the solution. Nevertheless,
precision conversion is facilitated by a switch supplied to the C preprocessor.
By specifying -DSINGLE, variables will be declared as REAL, normally 4 bytes,
whereas -DDOUBLE will cause declarations to be DOUBLE PRECISION, normally 8
bytes.
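One common way to realize such a preprocessor switch is sketched below,
assuming the source is passed through the C preprocessor before compilation;
the macro name FLTYPE and the routine name ZEROFD are illustrative and are not
taken from the POLMP source.

#ifdef DOUBLE
#define FLTYPE DOUBLE PRECISION
#endif
#ifdef SINGLE
#define FLTYPE REAL
#endif

      SUBROUTINE ZEROFD ( A, N )
C     Illustrative only: FLTYPE expands to REAL under -DSINGLE and to
C     DOUBLE PRECISION under -DDOUBLE, so the same source yields
C     4-byte or 8-byte field arrays.
      INTEGER N, I
      FLTYPE A(N)
      DO 10 I = 1, N
         A(I) = 0.0
 10   CONTINUE
      RETURN
      END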
-------------------------------------------------------------------------------
Documentation describing the implementation of the application (at module
level, or lower) :

The documentation supplied with the code describes how the various versions of
the code should be built. Extensive documentation, including the definition of
all variables in COMMON, is present as comments in the code.
-------------------------------------------------------------------------------
Research papers describing sequential code and/or algorithms :

1) Davies, A.M., Formulation of a linear three-dimensional hydrodynamic sea
   model using a Galerkin-eigenfunction method, Int. J. Num. Meth. in Fluids,
   1983, Vol. 3, 33-60.

2) Davies, A.M., Solution of the 3D linear hydrodynamic equations using an
   enhanced eigenfunction approach, Int. J. Num. Meth. in Fluids, 1991,
   Vol. 13, 235-250.
-------------------------------------------------------------------------------
Research papers describing parallel code and/or algorithms :

1) Ashworth, M. and Davies, A.M., Restructuring three-dimensional hydrodynamic
   models for computers with low and high degrees of parallelism, in Parallel
   Computing '91, eds D.J. Evans, G.R. Joubert and H. Liddell (North Holland,
   1992), 553-560.

2) Ashworth, M., Parallel Processing in Environmental Modelling, in Proceedings
   of the Fifth ECMWF Workshop on Use of Parallel Processors in Meteorology
   (Nov. 23-27, 1992), eds G.-R. Hoffman and T. Kauranne, World Scientific
   Publishing Co. Pte. Ltd, Singapore, 1993.

3) Ashworth, M. and Davies, A.M., Performance of a Three Dimensional
   Hydrodynamic Model on a Range of Parallel Computers, in Proceedings of the
   Euromicro Workshop on Parallel and Distributed Computing, Gran Canaria,
   27-29 January 1993, pp. 383-390 (IEEE Computer Society Press).

4) Davies, A.M., Ashworth, M., Lawrence, J. and O'Neill, M., Implementation of
   three dimensional shallow sea models on vector and parallel computers,
   1992a, CFD News, Vol. 3, No. 1, 18-30.

5) Davies, A.M., Grzonka, R.G. and Stephens, C.V., The implementation of
   hydrodynamic numerical sea models on the Cray X-MP, 1992b, in Advances in
   Parallel Computing, Vol. 2, edited by D.J. Evans.

6) Davies, A.M., Proctor, R. and O'Neill, M., "Shallow Sea Hydrodynamic Models
   in Environmental Science", Cray Channels, Winter 1991.
-------------------------------------------------------------------------------
Other relevant research papers:
-------------------------------------------------------------------------------
Application available in the following languages (give message passing system
used, if applicable, and machines application runs on) :

Fortran 77 version

A sequential version of POLMP is available, which conforms to the Fortran 77
standard. This version has been run on a large number of machines, from
workstations to supercomputers, and any code which caused problems, even if it
conformed to the standard, has been changed or removed. Thus its conformance
to the Fortran 77 standard is well established.

In order to allow the code to run on a wide range of problem sizes without
recompilation, the major arrays are defined dynamically by setting up pointers,
with names starting with IX, which point to locations in a single large data
array, SA. Most pointers are allocated in subroutine MODSUB and the starting
locations are passed down into subroutines in which they are declared as
arrays. For example:

      IX1 = 1
      IX2 = IX1 + N*M
      CALL SUB ( SA(IX1), SA(IX2), N, M )

      SUBROUTINE SUB ( A1, A2, N, M )
      DIMENSION A1(N,M), A2(N,M)
      END

Although this is probably against the spirit of the Fortran 77 standard, it is
considered the best compromise between portability and utility, and has caused
no problems on any of the machines on which it has been tried. The code has
been run on a number of traditional vector supercomputers, mainframes and
workstations. In addition, key loops can be parallelized automatically by some
compilers on shared (or virtual shared) memory MIMD machines, allowing parallel
execution on the Convex C2 and C3, Cray X-MP, Y-MP and Y-MP/C90, and Kendall
Square Research KSR-1. Cray macrotasking calls may also be enabled for an
alternative mode of parallel execution on Cray multiprocessors.

PVM and Parmacs versions

POLMP has been implemented on a number of message-passing machines: Intel
iPSC/2 and iPSC/860, Meiko CS-1 i860 and CS-2, and nCUBE 2. Code is also
present for the PVM and Parmacs portable message-passing systems, and POLMP
has run successfully, though not efficiently, on a network of Silicon Graphics
workstations. Calls to message-passing routines are concentrated in a small
number of routines for ease of portability and maintenance. POLMP performs
housekeeping tasks on one node of the parallel machine, usually node zero,
referred to in the code as the driver process, the remaining processes being
workers.

PVM and Parmacs are software environments for implementing the message-passing
parallel programming model. Portability is achieved in PVM using library calls
and in Parmacs using macro definitions which are then translated into machine
dependent code, or into calls to machine dependent library routines. Both have
been implemented on a wide range of parallel architectures, and have thus
become de facto standards for message-passing codes. However, both are likely
to be superseded by the Message Passing Interface when it comes into
widespread usage.

Parmacs version 5 requires the use of a host process, which had not previously
been used in the POLMP code, since housekeeping is handled by the driver
process on one of the nodes. For Parmacs, a simple host program has been
provided which loads the node program onto a two-dimensional torus and then
takes no further part in the run, other than to receive a completion code from
the driver, in case terminating the host early would interfere with execution
of the nodes.
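To illustrate how the message-passing calls can be concentrated in a few small
routines, the following is a minimal sketch of a boundary send and receive
pair written against the PVM 3 Fortran interface; the routine names BNDSND and
BNDRCV, the use of 4-byte reals and the argument names are illustrative and do
not reproduce the actual SNDBND/RCVBND code.

      SUBROUTINE BNDSND ( BUF, NPTS, IDEST, ITAG )
C     Illustrative halo send: pack NPTS boundary values into a PVM
C     message buffer and send them to the neighbouring task whose PVM
C     task identifier is IDEST, using message tag ITAG.
      INCLUDE 'fpvm3.h'
      INTEGER NPTS, IDEST, ITAG, IBUF, INFO
      REAL BUF(NPTS)
      CALL PVMFINITSEND ( PVMDEFAULT, IBUF )
      CALL PVMFPACK ( REAL4, BUF, NPTS, 1, INFO )
      CALL PVMFSEND ( IDEST, ITAG, INFO )
      RETURN
      END

      SUBROUTINE BNDRCV ( BUF, NPTS, ISRC, ITAG )
C     Illustrative halo receive: block until the matching message
C     arrives from task ISRC and unpack it into BUF.
      INCLUDE 'fpvm3.h'
      INTEGER NPTS, ISRC, ITAG, IBUF, INFO
      REAL BUF(NPTS)
      CALL PVMFRECV ( ISRC, ITAG, IBUF )
      CALL PVMFUNPACK ( REAL4, BUF, NPTS, 1, INFO )
      RETURN
      END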
Subset-HPF version

A data parallel version of the code has been run on the Thinking Machines CM-2
and CM-200 and the MasPar MP-1. High Performance Fortran (HPF) defines
extensions to the Fortran 90 language in order to provide support for parallel
execution on a wide variety of machines using a data parallel programming
model. The subset-HPF version of the POLMP code has been written to the draft
standard specified by the High Performance Fortran Forum in the HPF Language
Specification version 0.4, dated November 6, 1992. The Fortran 90 code was
developed on a Thinking Machines CM-200 and checked for conformance with the
Fortran 90 standard using the NAGWare Fortran 90 compiler. HPF directives were
inserted by translating from the CM Fortran directives, but have not been
tested owing to the lack of access to an HPF compiler. The only HPF features
used are the PROCESSORS, TEMPLATE, ALIGN and DISTRIBUTE directives and the
system inquiry intrinsic function NUMBER_OF_PROCESSORS. In order to allow the
code to run on a wide range of problem sizes without recompilation, the major
arrays are defined in subroutine MODSUB using automatic array declarations
with the dimensions of the arrays being passed as subroutine arguments.
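As a sketch of the kind of declarations involved (the routine, template,
processor and array names below are illustrative and are not taken from the
POLMP source), a block distribution of a two-dimensional field might be
declared as follows:

      SUBROUTINE DISTEX ( NX, NY )
!     Illustrative subset-HPF directives: distribute a template in
!     blocks over the available processors and align a field array
!     with it, so that the compiler partitions the data.
      INTEGER NX, NY
      REAL Z(NX,NY)
!HPF$ PROCESSORS PROCS(NUMBER_OF_PROCESSORS())
!HPF$ TEMPLATE DOM(NX,NY)
!HPF$ DISTRIBUTE DOM(*,BLOCK) ONTO PROCS
!HPF$ ALIGN Z(I,J) WITH DOM(I,J)
      Z = 0.0
      END SUBROUTINE DISTEX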
-------------------------------------------------------------------------------
Total number of lines in source code: 26,000 (approx.)
Number of lines excluding comments  : 13,000
Size in bytes of source code        : 700,000
-------------------------------------------------------------------------------
List input files (filename, number of lines, size in bytes, and if
formatted) :

steering file: 48 lines, 2170 bytes, ascii (typical size)
-------------------------------------------------------------------------------
List output files (filename, number of lines, size in bytes, and if
formatted) :

standard output: 700 lines, 62,000 bytes, ascii (typical size)
-------------------------------------------------------------------------------
Brief, high-level description of what application does:

POLMP solves the linear three-dimensional hydrodynamic equations for the
wind-induced flow in a closed rectangular basin of constant depth which may
include an arbitrary number of land areas.
-------------------------------------------------------------------------------
Main algorithms used:

The discretized form of the hydrodynamic equations is solved for the field
variables z, the surface elevation, and u and v, the horizontal components of
velocity. The fields are represented in the horizontal by a staggered finite
difference grid. The profile of vertical velocity with depth is represented by
the superposition of a number of spectral components. The functions used in
the vertical are arbitrary, although the computational advantages of using
eigenfunctions (modes) of the eddy viscosity profile have been demonstrated
(Davies, 1983). Velocities at the closed boundaries are set to zero.

Each timestep in the forward time integration of the model involves successive
updates to the three fields z, u and v. New field values computed in each
update are used in the subsequent calculations. A five-point finite difference
stencil is used, requiring only nearest neighbours on the grid. A number of
different data storage and data processing methods are included, mainly for
handling cases with significant amounts of land, e.g. index arrays and packed
data. In particular, the program may be switched between masked operation,
more suitable for vector processors, in which computation is done on all
points but land and boundary points are masked out, and strip-mining, more
suitable for scalar and RISC processors, in which calculations are done only
for sea points.
-------------------------------------------------------------------------------
Skeleton sketch of application:

The call chart of the major subroutines is represented thus:

AAAPOL -> APOLMP -> INIT -> RUNPOL -> INIT2 -> MAP -> GLOBAL -> SETDEP ->
INDP -> IRISH -> INTERP -> EIRISH -> GRAPH -> DIVIDE -> PRMAP -> SNDWRK ->
RCVWRK -> SETUP -> GENSTP -> SPEC -> ROOTS -> TRANS -> MODSUB -> MODEL ->
ASSIGN -> GENMSK -> GENSTP -> GENIND -> GENPAC -> METRIC -> STATS -> SEQ ->
CLRFLD -> TIME* -> SNDBND -> PSTBND -> RCVBND -> RESULT -> SNDRES -> RCVRES ->
MODOUT -> OZUVW -> OUTFLD -> GETRES -> OUTARR -> GRYARR -> WSTATE

AAAPOL is a dummy main program calling APOLMP. APOLMP calls INIT, which reads
parameters from the steering file and checks and monitors them. RUNPOL is then
called, which calls another initialization routine, INIT2. Called from INIT2,
MAP forms a map of the domain to be modelled; GLOBAL, IRISH and EIRISH are
hard-wired maps of three real-world models; GRAPH generates a connection graph
for the network of grid points; DIVIDE divides the domain between processors;
and PRMAP maps sub-domains onto processes. SNDWRK on the driver process sends
details of the sub-domain to be worked on to each worker, and RCVWRK receives
that information. SETUP does some array allocation, GENSTP generates indexes
for strip-mining, and SPEC, ROOTS and TRANS set up the coefficients for the
spectral expansion. MODSUB does the main allocation of array space to the
field and ancillary arrays. MODEL is the main driver subroutine for the model.
ASSIGN calls routines to generate masks, strip-mining indexes, packing indexes
and measurement metrics. CLRFLD initializes the main data arrays. One of seven
time-stepping routines, TIME*, is chosen depending on the vectorization and
packing/indexing method used to cope with the presence of land and boundary
data. SNDBND and RCVBND handle the sending and reception of boundary data
between sub-domains. After the required number of time-steps is complete,
RESULT saves results from the desired region, and SNDRES on the workers and
RCVRES on the driver collect the result data. MODOUT handles the writing of
model output to standard output and disk files, as required. For a non-trivial
run, 99% of the time is spent in whichever of the timestepping routines,
TIME*, has been chosen.
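For illustration, a masked update of the elevation field for a single vertical
mode might have the following shape. This is a sketch only, with placeholder
coefficients rather than the actual POLMP finite difference scheme, but it
shows the five-point staggered-grid stencil and the use of a mask to suppress
land and boundary points:

      SUBROUTINE TSTEPX ( Z, U, V, ZMSK, NX, NY, DTX, DTY )
C     Illustrative masked update: every point is computed, and ZMSK,
C     which is 1.0 at sea points and 0.0 at land and boundary points,
C     discards the contribution at masked points.  Only nearest
C     neighbours on the grid are referenced.
      INTEGER NX, NY, I, J
      REAL Z(NX,NY), U(NX,NY), V(NX,NY), ZMSK(NX,NY)
      REAL DTX, DTY
      DO 20 J = 2, NY-1
         DO 10 I = 2, NX-1
            Z(I,J) = Z(I,J) - ZMSK(I,J) *
     &               ( DTX * ( U(I+1,J) - U(I,J) )
     &               + DTY * ( V(I,J+1) - V(I,J) ) )
 10      CONTINUE
 20   CONTINUE
      RETURN
      END

The strip-mined variants instead loop over a precomputed list of sea-point
indexes, so that no work is done at land points.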
-------------------------------------------------------------------------------
Brief description of I/O behavior:

The driver process, usually processor 0, reads in the input parameters and
broadcasts them to the rest of the processors. The driver also receives the
results from the other processors and writes them out.
-------------------------------------------------------------------------------
Describe the data distribution (if appropriate) :

The processors are treated as a logical 2-D grid. The simulation domain is
divided into a number of sub-domains which are allocated, one sub-domain per
processor.
-------------------------------------------------------------------------------
Give parameters of the data distribution (if appropriate) :

The number of processors, p, and the number of sub-domains are provided as
steering parameters, as is a switch which requests either one-dimensional or
two-dimensional partitioning. Partitioning is only actually carried out for
the message-passing versions of the code. For two-dimensional partitioning p
is factored into px and py, where px and py are as close as possible to
sqrt(p). For the data parallel version the number of sub-domains is set to one
and decomposition is performed by the compiler via data distribution
directives.
-------------------------------------------------------------------------------
Brief description of load balance behavior :

Unless land areas are specified, the load is fairly well balanced. If px and
py evenly divide the number of grid points, then the model is perfectly
balanced, except that boundary sub-domains have fewer communications. No tests
with land areas have yet been performed with the parallel code, and more
sophisticated domain decomposition algorithms have not yet been included.
-------------------------------------------------------------------------------
Give parameters that determine the problem size :

nx, ny   Size of horizontal grid
m        Number of vertical modes
nts      Number of timesteps to be performed
-------------------------------------------------------------------------------
Give memory as function of problem size :

See below for specific examples.
-------------------------------------------------------------------------------
Give number of floating-point operations as function of problem size :

Assuming standard compiler optimizations, there is a requirement for 29
floating-point operations (18 add/subtracts and 11 multiplies) per grid point,
so the total computational load is

   29 * nx * ny * m * nts
-------------------------------------------------------------------------------
Give communication overhead as function of problem size and data
distribution :

During each timestep each sub-domain of size nsubx=nx/px by nsuby=ny/py
requires the following communications (in words) :

   nsubx * m   from N
   nsubx       from S
   nsubx * m   from S
   nsuby * m   from W
   nsuby       from E
   nsuby * m   from E
   m           from NE
   m           from SW

making a total of (2*m + 1) * (nsubx + nsuby) + 2*m words in eight messages
from six directions.
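The small program below, which is not part of POLMP, simply evaluates the
estimates above: it factors p into px and py as close as possible to sqrt(p),
forms the sub-domain sizes, and computes the per-timestep halo traffic per
sub-domain and the total operation count. The parameter values shown are
illustrative.

      PROGRAM COSTEX
C     Illustrative cost estimate for the formulas given above.
      INTEGER P, PX, PY, NX, NY, M, NTS, NSUBX, NSUBY, IWORDS
      REAL FLOPS
C     Example parameters (illustrative values only)
      P   = 16
      NX  = 512
      NY  = 512
      M   = 16
      NTS = 200
C     Factor p: largest px <= sqrt(p) that divides p exactly
      PX = INT( SQRT( REAL(P) ) )
 10   CONTINUE
      IF ( MOD( P, PX ) .NE. 0 ) THEN
         PX = PX - 1
         GOTO 10
      ENDIF
      PY = P / PX
C     Sub-domain sizes (assuming px and py divide nx and ny evenly)
      NSUBX = NX / PX
      NSUBY = NY / PY
C     Words received per sub-domain per timestep, as derived above
      IWORDS = (2*M + 1) * (NSUBX + NSUBY) + 2*M
C     Total floating-point operations: 29 per grid point per mode
C     per timestep
      FLOPS = 29.0 * REAL(NX) * REAL(NY) * REAL(M) * REAL(NTS)
      PRINT *, 'px, py          = ', PX, PY
      PRINT *, 'halo words/step = ', IWORDS
      PRINT *, 'total flop      = ', FLOPS
      END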
-------------------------------------------------------------------------------
Give three problem sizes, small, medium, and large for which the benchmark
should be run (give parameters for problem size, sizes of I/O files, memory
required, and number of floating point operations) :

The data sizes and computational requirements for the three benchmarking
problems supplied are :

   Name     nx  x   ny x  m x nts    Computational Load   Memory
                                          (Gflop)         (Mword)

   v200     512 x  512 x 16 x 200            24              14
   wa200   1024 x 1024 x 40 x 200           226             126
   xb200   2048 x 2048 x 80 x 200          1812             984

The memory sizes are the number of Fortran real elements (words) required for
the strip-mined case on a single processor. For the masked case the memory
requirement is approximately doubled for the extra mask arrays. For the
message passing versions, the total memory requirement will also tend to
increase slightly (<10%) with the number of processors employed.
-------------------------------------------------------------------------------
How did you determine the number of floating-point operations (hardware
monitor, count by hand, etc.) :

Count by hand, looking at inner loops and making reasonable assumptions about
common compiler optimizations.
-------------------------------------------------------------------------------
Other relevant information:
-------------------------------------------------------------------------------