Further details, provided by Pat Worley and last updated on 5/20/95, are given below.
-------------------------------------------------------------------------------
Name of Program         : PSTSWM
                          (Parallel Spectral Transform Shallow Water Model)
-------------------------------------------------------------------------------
Submitter's Name        : Patrick H. Worley
Submitter's Organization: Oak Ridge National Laboratory
Submitter's Address     : Bldg. 6012/MS-6367
                          P. O. Box 2008
                          Oak Ridge, TN 37831-6367
Submitter's Telephone # : (615) 574-3128
Submitter's Fax #       : (615) 574-0680
Submitter's Email       : worleyph@ornl.gov
-------------------------------------------------------------------------------
Major Application Field : Fluid Dynamics
Application Subfield(s) : Climate Modeling
-------------------------------------------------------------------------------
Application "pedigree"  :

PSTSWM Version 4.0 is a message-passing benchmark code and parallel algorithm
testbed that solves the nonlinear shallow water equations using the spectral
transform method. The spectral transform algorithm of the code closely follows
the way CCM2, the NCAR Community Climate Model, handles the dynamical part of
the primitive equations, and the parallel algorithms implemented in the model
include those currently used in the message-passing parallel implementation of
CCM2. PSTSWM was written by Patrick Worley of Oak Ridge National Laboratory and
Ian Foster of Argonne National Laboratory, and is based partly on previous
parallel algorithm research by John Drake, David Walker, and Patrick Worley of
Oak Ridge National Laboratory. Both the code development and the parallel
algorithms research were funded by the DOE Computer Hardware, Advanced
Mathematics, and Model Physics (CHAMMP) program.

PSTSWM is a parallel implementation of a sequential code (STSWM 2.0) written by
James Hack and Ruediger Jakob at NCAR to solve the shallow water equations on a
sphere using the spectral transform method. STSWM evolved from a spectral
shallow water model written by Hack (NCAR/CGD) to compare numerical schemes
designed to solve the divergent barotropic equations in spherical geometry.
STSWM was developed to provide the reference solutions to the test cases
proposed by Williamson et al. (see citation [4] below), which were chosen to
test the ability of numerical methods to simulate important flow phenomena.
These test cases are embedded in the code and are selectable at run time via
input parameters specifying initial conditions, forcing, and analytic solutions
(for error analysis). The solutions are also published in a Technical Note by
Jakob et al. [3]. In addition, this code is meant to serve as an educational
tool for numerical studies of the shallow water equations. A detailed
description of the spectral transform method, and a derivation of the equations
used in this software, can be found in the Technical Note by Hack and Jakob
[2].

In addition to modifications for parallelization, we also modified STSWM to add
vertical levels (in order to get the correct communication and computation
granularity for 3-D weather and climate codes), to increase modularity and
support code reuse, and to allow the problem size to be selected at run time
without depending on dynamic memory allocation. PSTSWM is meant to be a
compromise between paper benchmarks and the usual fixed benchmarks by allowing
a significant amount of runtime-selectable algorithm tuning.
Thus, the goal is to see how quickly the numerical simulation can be run on
different machines without fixing the parallel implementation, but forcing all
implementations to execute the same numerical code (to guarantee fairness). The
code has also been written in such a way that linking in optimized library
functions for common operations instead of the "portable" code will be simple.
-------------------------------------------------------------------------------
May this code be freely distributed (if not specify restrictions) :

Yes, but users are requested to acknowledge the authors (Worley and Foster) and
the program that supported the development of the code (DOE CHAMMP program) in
any resulting research or publications, and are encouraged to send reprints of
their work with this code to the authors. Also, the authors would appreciate
being notified of any modifications to the code. Finally, the code has been
written to allow easy reuse of code in other applications, and for educational
purposes. The authors encourage this, but also request that they be notified
when pieces of the code are used.
-------------------------------------------------------------------------------
Give length in bytes of integers and floating-point numbers that should be
used in this application:

The program currently uses either (a) INTEGER, REAL*8, and COMPLEX*16, or
(b) INTEGER, REAL*4, COMPLEX*8, and REAL*8. In option (b), REAL*8 variables are
used only in set-up, computing Gauss weights and nodes and Legendre polynomial
values, not in the body of the computation. Option (a) is the default, but
option (b) can be chosen at compile time by setting a make variable
appropriately. The required precision for meteorological studies is a matter of
debate in the scientific community, so we support both options. The length in
bytes of INTEGER needs to be set in the platform-specific makefile. Note that
parameter values, e.g., PI and TWOPI, may need to be modified for floating
point number systems with large mantissas.
-------------------------------------------------------------------------------
Documentation describing the implementation of the application (at module
level, or lower) :

The sequential code is documented in a file included in the distribution of the
code from NCAR:

  Jakob, Ruediger, Description of Software for the Spectral Transform Shallow
  Water Model Version 2.0, National Center for Atmospheric Research, Boulder,
  CO 80307-3000, August 1992

and in

  Hack, J.J. and R. Jakob, Description of a global shallow water model based
  on the spectral transform method, NCAR Technical Note TN-343+STR, January
  1992.

A draft user guide for the parallel code is available from
http://www.epm.ornl.gov/chammp/pstswm/index.html. The code distribution also
comes with a README file, and extensive documentation is present in the code.
-------------------------------------------------------------------------------
Research papers describing sequential code and/or algorithms :

 1) Browning, G.L., J.J. Hack and P.N. Swarztrauber, A comparison of three
    numerical methods for solving differential equations on the sphere,
    Monthly Weather Review, 117:1058-1075, 1989.
 2) Hack, J.J. and R. Jakob, Description of a global shallow water model based
    on the spectral transform method, NCAR Technical Note TN-343+STR, January
    1992.
 3) Jakob, R., J.J. Hack and D.L. Williamson, Reference solutions to shallow
    water test set using the spectral transform method, NCAR Technical Note
    TN-388+STR.
 4) Williamson, D.L., J.B. Drake, J.J. Hack, R.
    Jakob and P.S. Swarztrauber, A standard test set for numerical
    approximations to the shallow water equations in spherical geometry,
    Journal of Computational Physics, Vol. 102, pp. 211-224, 1992.
-------------------------------------------------------------------------------
Research papers describing parallel code and/or algorithms :

 5) Worley, P. H. and J. B. Drake, Parallelizing the Spectral Transform Method,
    Concurrency: Practice and Experience, Vol. 4, No. 4 (June 1992),
    pp. 269-291.
 6) Walker, D. W., P. H. Worley, and J. B. Drake, Parallelizing the Spectral
    Transform Method. Part II, Concurrency: Practice and Experience, Vol. 4,
    No. 7 (October 1992), pp. 509-531.
 7) Worley, P. H. and I. T. Foster, Parallel Spectral Transform Shallow Water
    Model: A runtime-tunable parallel benchmark code, in Proceedings of the
    Scalable High Performance Computing Conference, eds. J. J. Dongarra and
    D. W. Walker, IEEE Computer Society Press, 1994, pp. 207-214.
 8) Foster, I. T. and P. H. Worley, Parallel algorithms for the spectral
    transform method, ORNL Technical Report ORNL/TM-12507, April 1994.
 9) Foster, I. T., B. Toonen, and P. H. Worley, Performance of parallel
    computers for spectral atmospheric models, ORNL Technical Report
    ORNL/TM-12986, April 1995.
-------------------------------------------------------------------------------
Other relevant research papers :

10) Foster, I., W. Gropp, and R. Stevens, The parallel scalability of the
    spectral transform method, Monthly Weather Review, 120(5), 1992,
    pp. 835-850.
11) Drake, J. B., R. E. Flanery, I. T. Foster, J. J. Hack, J. G. Michalakes,
    R. L. Stevens, D. W. Walker, D. L. Williamson, and P. H. Worley, The
    Message-Passing Version of the Parallel Community Climate Model,
    Proceedings of the Fifth ECMWF Workshop on Use of Parallel Processors in
    Meteorology (Nov. 23-27, 1992), Hoffman, G.-R. and T. Kauranne, eds.,
    World Scientific Publishing Co. Pte. Ltd., Singapore, 1993, pp. 500-513.
12) Sato, R. K. and R. D. Loft, Implementation of the NCAR CCM2 on the
    Connection Machine, Proceedings of the Fifth ECMWF Workshop on Use of
    Parallel Processors in Meteorology (Nov. 23-27, 1992), Hoffman, G.-R. and
    T. Kauranne, eds., World Scientific Publishing Co. Pte. Ltd., Singapore,
    1993, pp. 371-393.
13) Barros, S. R. M. and T. Kauranne, On the Parallelization of Global Spectral
    Eulerian Shallow-Water Models, Proceedings of the Fifth ECMWF Workshop on
    Use of Parallel Processors in Meteorology (Nov. 23-27, 1992), Hoffman,
    G.-R. and T. Kauranne, eds., World Scientific Publishing Co. Pte. Ltd.,
    Singapore, 1993, pp. 36-43.
14) Kauranne, T. and S. R. M. Barros, Scalability Estimates of Parallel
    Spectral Atmospheric Models, Proceedings of the Fifth ECMWF Workshop on Use
    of Parallel Processors in Meteorology (Nov. 23-27, 1992), Hoffman, G.-R.
    and T. Kauranne, eds., World Scientific Publishing Co. Pte. Ltd.,
    Singapore, 1993, pp. 312-328.
15) Pelz, R. B. and W. F. Stern, A Balanced Parallel Algorithm for Parallel
    Processing, Proceedings of the Sixth SIAM Conference on Parallel Processing
    for Scientific Computing (March 22-24, 1993), pp. 126-128.
16) Drake, J. B., I. T. Foster, J. J. Hack, J. G. Michalakes, B. D. Semeraro,
    B. Toonen, D. L. Williamson, and P. H. Worley, PCCM2: A GCM adapted for
    scalable parallel computers, Proceedings of the Fifth Global Change
    Symposium, American Meteorological Society, 1994.
-------------------------------------------------------------------------------
Application available in the following languages (give message passing system
used, if applicable, and machines application runs on) :

PSTSWM is written in Fortran 77 with VMS extensions and a small number of C
preprocessor directives. PSTSWM currently comes in three forms: a full
distribution, a PVM-only distribution, and an MPI-only distribution. In the
full distribution, message passing is implemented using MPI, PICL, PVM, and/or
native message passing libraries, with the choice being made at compile time.
Additionally, all message passing is encapsulated in three high-level routines
for broadcast, global minimum, and global maximum, and in two classes of
low-level routines representing variants and/or stages of the swap operation
and the send/receive operation. Porting the code to another message passing
system requires porting the PICL, MPI, or PVM libraries and/or implementing
the (few) communication routines in PSTSWM using native message passing
primitives. The PVM-only and MPI-only distributions of the code are simply
stripped-down versions, simplifying the makefile structure and directory
layout of the distribution.

As of 4/1/95, the full distribution has been run on the Cray Research T3D, the
IBM SP-1 and SP-2, the Intel iPSC/2, iPSC/860, DELTA, and Paragon (on both GP
and MP nodes and using either the NX or SUNMOS operating systems), the nCUBE/2
and nCUBE/2S, across a network of workstations, and on a Cray vector machine
(as a serial application), using both native and PICL communication libraries.
The PVM-only distribution has been run across a network of IBM and SUN
workstations, and the MPI-only distribution has been run on an Intel Paragon
(using MPICH) and on an IBM SP-2 (using both MPICH and MPI-F).
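For illustration only, the following is a minimal sketch of the kind of
pairwise swap primitive that these low-level routines provide. PSTSWM itself
implements such routines in Fortran 77 on top of MPI, PICL, PVM, or native
libraries; the Python/mpi4py rendering and the routine name used here are
purely hypothetical.

    # Hypothetical illustration of a pairwise "swap" primitive of the kind
    # PSTSWM encapsulates; not PSTSWM source.
    import numpy as np
    from mpi4py import MPI

    def swap(comm, partner, sendbuf, recvbuf, tag=0):
        """Exchange sendbuf/recvbuf with rank `partner` (numpy arrays)."""
        comm.Sendrecv(sendbuf, dest=partner, sendtag=tag,
                      recvbuf=recvbuf, source=partner, recvtag=tag)

    if __name__ == "__main__":
        comm = MPI.COMM_WORLD
        me = comm.Get_rank()
        partner = me ^ 1                 # pair up neighbouring ranks
        send = np.full(4, float(me))
        recv = np.empty(4)
        if partner < comm.Get_size():    # last rank idles if size is odd
            swap(comm, partner, send, recv)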
-------------------------------------------------------------------------------
Total number of lines in source code: 34,385 in PVM-only version
Number of lines excluding comments  : 14,118 in PVM-only version
Size in bytes of source code        : 1,242,410 in PVM-only version
-------------------------------------------------------------------------------
List input files (filename, number of lines, size in bytes, and if formatted) :

  problem:      24 lines, 561 bytes, ascii
  algorithm:    20 lines, 532 bytes, ascii
  measurements: 13 lines, 339 bytes, ascii
-------------------------------------------------------------------------------
List output files (filename, number of lines, size in bytes, and if
formatted) :

  standard output: The number of lines and bytes is a function of the input
                   specifications, but for benchmarking would normally be 63
                   lines (2000 bytes) of meaningful output.
  timings:         Each run produces one line of output, containing
                   approximately 200 bytes.

  Both files are ascii.
-------------------------------------------------------------------------------
Brief, high-level description of what application does:

(P)STSWM solves the nonlinear shallow water equations on the sphere. The
nonlinear shallow water equations constitute a simplified atmosphere-like fluid
prediction model that exhibits many of the features of more complete models,
and that has been used to investigate numerical methods and to benchmark a
number of machines. Each run of PSTSWM uses one of six embedded initial
conditions and forcing functions. These cases were chosen to stress test
numerical methods for this problem and to represent important flows that
develop in atmospheric modeling. STSWM also supports reading in arbitrary
initial conditions, but this capability was removed from the parallel code to
simplify the development of the parallel implementation.
-------------------------------------------------------------------------------
Main algorithms used:

PSTSWM uses the spectral transform method to solve the shallow water equations.
During each timestep, the state variables of the problem are transformed
between the physical domain, where most of the physical forces are calculated,
and the spectral domain, where the terms of the differential equation are
evaluated. The physical domain is a tensor product longitude-latitude grid. The
spectral domain is the set of spectral coefficients in a spherical harmonic
expansion of the state variables, and is normally characterized as a triangular
array (using a "triangular" truncation of spectral coefficients).

Transforming from physical coordinates to spectral coordinates involves
performing a real FFT for each line of constant latitude, followed by
integration over latitude using Gaussian quadrature (approximating the Legendre
transform) to obtain the spectral coefficients. The inverse transformation
involves evaluating sums of spectral harmonics and inverse real FFTs, analogous
to the forward transform.

Parallel algorithms are used to compute the FFTs and to compute the vector sums
used to approximate the forward and inverse Legendre transforms. Two major
alternatives are available for both transforms: distributed algorithms, which
use a fixed data decomposition and compute results where they are assigned, and
transpose algorithms, which remap the domains to allow the transforms to be
calculated sequentially. This translates to four major parallel algorithms:

  a) distributed FFT/distributed Legendre transform (LT)
  b) transpose FFT/distributed LT
  c) distributed FFT/transpose LT
  d) transpose FFT/transpose LT

Multiple implementations are supported for each type of algorithm, and the
assignment of processors to transforms is also determined by input parameters.
For example, input parameters specify a logical 2-D processor grid and define
the data decomposition of the physical and spectral domains onto this grid. If
16 processors are used, they can be arranged as a 4x4 grid, an 8x2 grid, a
16x1 grid, a 2x8 grid, or a 1x16 grid. This specification determines how many
processors are used to calculate each parallel FFT and how many are used to
calculate each parallel LT.
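For illustration only, the following schematic, serial Python/NumPy sketch
shows the structure of the forward transform just described (a real FFT along
each latitude line, followed by Gaussian quadrature in latitude). It is not
PSTSWM source; the array P of precomputed, normalized associated Legendre
values is assumed as input rather than computed here.

    # Schematic serial forward spectral transform (illustration only).
    # Assumes: grid is an (NLAT, NLON) field sampled on a Gaussian grid, and
    # P[n, m, j] holds precomputed associated Legendre values at the Gaussian
    # latitude nodes -- computing P is outside the scope of this sketch.
    import numpy as np

    def forward_transform(grid, P, mm):
        nlat, nlon = grid.shape
        # Gaussian nodes/weights in mu = sin(latitude) for the quadrature
        _, weights = np.polynomial.legendre.leggauss(nlat)
        # 1) real FFT along each line of constant latitude
        fourier = np.fft.rfft(grid, axis=1) / nlon      # (nlat, nlon//2 + 1)
        # 2) Gaussian quadrature in latitude, approximating the Legendre
        #    transform, for a triangular truncation (n = m..mm for each m)
        spec = np.zeros((mm + 1, mm + 1), dtype=complex)
        for m in range(mm + 1):
            for n in range(m, mm + 1):
                spec[n, m] = np.sum(weights * fourier[:, m] * P[n, m, :])
        return spec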
-------------------------------------------------------------------------------
Skeleton sketch of application:

The main program calls INPUT to read problem and algorithm parameters and set
up arrays for spectral transformations, and then calls INIT to set up the test
case parameters. Routines ERRANL and NRGTCS are called once before the main
timestepping loop for error normalization, once after the main timestepping
loop for calculating energetics data and errors, and periodically during the
timestepping, as requested. The prognostic fields are initialized using routine
ANLYTC, which provides the analytic solution. Each call to STEP advances the
computed fields by a timestep DT. Timing logic surrounds the timestepping loop,
so the initialization phase is not timed. Also, a fake timestep is calculated
before beginning timing to eliminate the first-time "paging" effect seen on
some systems.

STEP computes the first two time levels by two semi-implicit timesteps; all
other timesteps are calculated using a centered leapfrog scheme. STEP calls
COMP1, which chooses between an explicit numerical algorithm, a semi-implicit
algorithm, and a simplified algorithm associated with solving the advection
equation, one of the embedded test cases. The numerical algorithm used is an
input parameter. The basic outline of each timestep is the following (see also
the sketch after this field):

  1) Evaluate nonlinear product and forcing terms.
  2) Fourier transform the nonlinear terms in place as a block transform.
  3) Compute and update divergence, geopotential, and vorticity spectral
     coefficients. (Much of the calculation of the time update is "bundled"
     with the Legendre transform.)
  4) Compute velocity fields and transform divergence, geopotential, and
     vorticity back to gridpoint space using
       a) an inverse Legendre transform and associated computations, and
       b) an inverse real block FFT.

PSTSWM has "fictitious" vertical levels, and all computations are duplicated on
the different levels, potentially significantly increasing the granularity of
the computation. (The number of vertical levels is an input parameter.) For
error analysis, a single vertical level is extracted and analyzed.
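For illustration only, the centered leapfrog update used for all but the first
two timesteps can be shown with a small, self-contained Python toy. PSTSWM
bootstraps the scheme with two semi-implicit timesteps, whereas this toy uses a
single explicit step just to start the recurrence, and the test equation here
is arbitrary.

    # Toy illustration (not PSTSWM) of the centered leapfrog pattern:
    #   x(t+dt) = x(t-dt) + 2*dt*f(x(t))
    import numpy as np

    def integrate(f, x0, dt, nsteps):
        x_prev = np.asarray(x0, dtype=float)
        x_now = x_prev + dt * f(x_prev)            # simple bootstrap step
        for _ in range(nsteps - 1):
            x_next = x_prev + 2.0 * dt * f(x_now)  # centered leapfrog step
            x_prev, x_now = x_now, x_next
        return x_now

    if __name__ == "__main__":
        # example: d/dt [u, v] = [v, -u] (a circular oscillation)
        f = lambda x: np.array([x[1], -x[0]])
        print(integrate(f, [1.0, 0.0], dt=0.01, nsteps=1000))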
-------------------------------------------------------------------------------
Brief description of I/O behavior:

Processor 0 reads the input files and broadcasts the input parameters to the
rest of the processors. Processor 0 also receives the error analysis and timing
results from the other processors and writes them out.
-------------------------------------------------------------------------------
Describe the data distribution (if appropriate) :

The processors are treated as a logical 2-D grid. There are three domains to be
distributed:

  a) physical domain: tensor product longitude-latitude grid
  b) Fourier domain:  tensor product wavenumber-latitude grid
  c) spectral domain: triangular array, where each column contains the spectral
     coefficients associated with a given wavenumber. The larger the wavenumber
     is, the shorter the column is.

An unordered FFT is used, and the Fourier and spectral domains use the
"unordered" permutation when the data is being distributed.

I) distributed FFT/distributed LT
  1) The tensor product longitude-latitude grid is mapped onto the processor
     grid by assigning a block of contiguous longitudes to each processor
     column and by assigning one or two blocks of contiguous latitudes to each
     processor row. The vertical dimension is not distributed.
  2) After the FFT, the subsequent wavenumber-latitude grid is similarly
     distributed over the processor grid, with a block of the permuted
     wavenumbers assigned to each processor column.
  3) After the LT, the wavenumbers are distributed as before, and the spectral
     coefficients associated with any given wavenumber are either distributed
     evenly over the processors in the column containing that wavenumber or are
     duplicated over the column. Which happens is a function of the particular
     distributed LT algorithm used.

IIa) single transpose FFT/distributed LT
  1) Same as in (I).
  2) Before the FFT, the physical domain is first remapped to a vertical
     layer-latitude decomposition, with a block of contiguous vertical layers
     assigned to each processor column and the longitude dimension not
     distributed. After the transform, the vertical layer-latitude grid is
     distributed as before, and the wavenumber dimension is not distributed.
  3) After the LT, the spectral coefficients for a given vertical layer are
     either distributed evenly over the processors in a column or are
     duplicated over that column. Which happens is a function of the particular
     distributed LT algorithm used.

IIb) double transpose FFT/distributed LT
  1) Same as in (I).
  2) Before the FFT, the physical domain is first remapped to a (vertical
     layer/field)-latitude decomposition, with a subset of the vertical layers
     and the fields being transformed assigned to each processor column and the
     longitude dimension not distributed. After the FFT, the Fourier domain is
     remapped to a wavenumber-latitude decomposition, reordering the
     wavenumbers to improve load balance.
  3) After the LT, the wavenumbers are distributed as before, and the spectral
     coefficients associated with any given wavenumber are either distributed
     evenly over the processors in the column containing that wavenumber or are
     duplicated over the column. Which happens is a function of the particular
     distributed LT algorithm used.

III) distributed FFT/transpose LT
  1) Same as in (I).
  2) Same as in (I).
  3) Before the LT, the wavenumber-latitude grid is first remapped to a
     wavenumber-vertical layer decomposition, with a block of contiguous
     vertical layers assigned to each processor row and the latitude dimension
     not distributed. After the transform, the spectral coefficients associated
     with a given wavenumber and vertical layer are all on one processor, and
     the wavenumbers and vertical layers are distributed as before.

IV) transpose FFT/transpose LT
  1) Same as in (I).
  2) Same as in (IIa).
  3) Before the LT, the vertical layer-latitude grid is first remapped to a
     vertical layer-wavenumber decomposition, reordering the wavenumbers to
     improve load balance, with a block of (load-balanced) wavenumbers now
     assigned to each processor row and the latitude dimension not distributed.
     After the transform, the spectral coefficients associated with a given
     wavenumber and vertical layer are all on one processor, and the
     wavenumbers and vertical layers are distributed as before.
-------------------------------------------------------------------------------
Give parameters of the data distribution (if appropriate) :

The distribution is a function of the problem size (longitude, latitude,
vertical levels), the logical processor grid (PX, PY), and the algorithm
(transpose vs. distributed for FFT and LT).
-------------------------------------------------------------------------------
Brief description of load balance behavior :

The load is fairly well balanced. If PX and PY evenly divide the number of
longitudes, latitudes, and vertical levels, then all load imbalances are due to
the unequal distribution of spectral coefficients. As described above, the
spectral coefficients are laid out as a triangular array in most runs, where
each column corresponds to a different Fourier wavenumber. The wavenumbers are
partitioned among the processors in most of the parallel algorithms. Since each
column is a different length, a modified wrap mapping of the columns will
approximately balance the load. For the distributed FFT algorithms, the natural
"unordered" ordering of the FFT is used with a block partitioning, which also
does a reasonable job of load balancing without any additional data movement.
The load imbalance is quantified in Walker et al. [6] and in Foster and Worley
[8]. If PX and PY do not evenly divide the dimensions of the physical domain,
then other load imbalances may be as large as a factor of PX or PY in the worst
(reasonable) case. See Foster and Worley [8] for more details.
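For illustration only, the following self-contained Python sketch shows why a
wrap-style mapping of the triangular spectral array's columns approximately
balances the load. The reflected-wrap rule used here is a stand-in, not the
exact modified wrap mapping used in PSTSWM.

    # Assign the columns of a triangular spectral array (column m has
    # MM+1-m coefficients) to PX processor columns with a reflected wrap
    # mapping, and report the per-processor coefficient counts.
    def wrap_partition(mm, px):
        counts = [0] * px
        for m in range(mm + 1):            # wavenumber m = 0 .. MM
            length = mm + 1 - m            # column length, triangular truncation
            # pair short columns with long ones by reflecting every other pass
            p = m % (2 * px)
            owner = p if p < px else 2 * px - 1 - p
            counts[owner] += length
        return counts

    if __name__ == "__main__":
        print(wrap_partition(mm=42, px=4))   # roughly equal totals per column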
-------------------------------------------------------------------------------
Give parameters that determine the problem size :

  MM, NN, KK       - specify the number of Fourier wavenumbers and the spectral
                     truncation used. For a triangular truncation,
                     MM = NN = KK.
  NLON, NLAT, NVER - number of longitudes, latitudes, and vertical levels.
                     There are required relationships between NLON, NLAT, and
                     NVER, and between these and MM. These relationships are
                     checked in the code. We will also provide a selection of
                     input files that specify legal (and interesting) problems.
  DT               - timestep (in seconds). (Must be small enough to satisfy
                     the Courant stability condition. The code warns if the
                     timestep is too large, but does not abort.)
  TAUE             - end of model run (in hours)
-------------------------------------------------------------------------------
Give memory as function of problem size :

Executable size is determined at compile time by setting the makefile parameter
WORKSPACE. Per-node memory requirements are approximately (in REALs):

  associated Legendre polynomial values:
    MM*MM*NLAT/(2*PX*PY)
    or (if the vertical dimension of the spectral domain is decomposed)
    MM*MM*NLAT/(2*PY)

  physical grid fields:
    8*NLON*NLAT*NVER/(PX*PY)

  spectral grid fields:
    3*MM*MM*NVER/(PX*PY)
    or (if spectral coefficients are duplicated within a processor column)
    3*MM*MM*NVER/PX

  work space:
    8*NLON*NLAT*NVER*BUFS1/(PX*PY) + 3*MM*MM*NVER*BUFS2/(PX*PY)
    or (if spectral coefficients are duplicated within a processor column)
    8*NLON*NLAT*NVER*BUFS1/(PX*PY) + 3*MM*MM*NVER*BUFS2/PX

where BUFS1 and BUFS2 are input parameters (number of communication buffers).
BUFS1 and BUFS2 can be as small as 0 and as large as PX or PY. In standard test
cases, NLON=2*NLAT and NLON=3*MM+1, so memory requirements are approximately:

  (3/4)*(MM**3)/(PX*PY) + (36*(1+BUFS1) + 3*(1+BUFS2))*(MM**2)*NVER/(PX*PY)
  or
  (3/4)*(MM**3)/PX + (36*(1+BUFS1) + 3*(1+BUFS2))*(MM**2)*NVER/(PX*PY)
  or
  (3/4)*(MM**3)/(PX*PY) + 36*(1+BUFS1)*(MM**2)*NVER/(PX*PY)
                        + 3*(1+BUFS2)*(MM**2)*NVER/PX
-------------------------------------------------------------------------------
Give number of floating-point operations as function of problem size :

For a serial run, per timestep (approximate):

  nonlinear terms:
    12*NLON*NLAT*NVER
  forward FFT:
    20*NLON*NLAT*NVER*LOG2(NLON) + 36*NLON*NLAT*NVER
  forward LT and time update:
    7*(MM**2)*NLAT*NVER + 61*MM*NLAT*NVER + 7*(MM**2)*NVER
  inverse LT and calculation of velocities:
    7*(MM**2)*NLAT*NVER + 10*MM*NLAT*NVER + 9*(MM**2)*NVER
  inverse FFT:
    13*NLON*NLAT*NVER*LOG2(NLON) + 18*NLON*NLAT*NVER

Using the standard assumptions (NLON=2*NLAT and NLON=3*MM+1), this is
approximately:

  21*(MM**3)*NVER + 149*(MM**2)*LOG2(MM)*NVER + 662*(MM**2)*NVER

For a total run, multiply by TAUE/DT. (A small calculator for these memory and
flop estimates is sketched below.)
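For illustration only, the following Python sketch evaluates the approximate
per-node memory and serial flop formulas above for a given problem size. It
assumes 8-byte REALs (precision option (a)), the standard relationships
NLON=2*NLAT and NLON=3*MM+1, and the variant in which spectral coefficients
are not duplicated within a processor column; it is not part of PSTSWM.

    # Evaluate the approximate formulas above (illustration only).
    import math

    def per_node_memory_reals(mm, nver, px, py, bufs1=1, bufs2=1):
        nlon = 3 * mm + 1
        nlat = nlon // 2
        legendre = mm * mm * nlat / (2.0 * px * py)
        physical = 8.0 * nlon * nlat * nver / (px * py)
        spectral = 3.0 * mm * mm * nver / (px * py)
        work = (8.0 * nlon * nlat * nver * bufs1
                + 3.0 * mm * mm * nver * bufs2) / (px * py)
        return legendre + physical + spectral + work

    def serial_flops_per_timestep(mm, nver):
        nlon = 3 * mm + 1
        nlat = nlon // 2
        lg = math.log2(nlon)
        nonlinear = 12.0 * nlon * nlat * nver
        ffts = (20.0 + 13.0) * nlon * nlat * nver * lg \
               + (36.0 + 18.0) * nlon * nlat * nver
        lts = 14.0 * mm * mm * nlat * nver + 71.0 * mm * nlat * nver \
              + 16.0 * mm * mm * nver
        return nonlinear + ffts + lts

    if __name__ == "__main__":
        # T42-like case (TAUE=120 hours, DT=2000 s) on a 4x4 processor grid
        mem_mb = per_node_memory_reals(42, 16, 4, 4) * 8 / 1e6  # 8-byte REALs
        gflop = serial_flops_per_timestep(42, 16) * (120.0 * 3600 / 2000.0) / 1e9
        print(round(mem_mb, 1), "MB/node (approx.)",
              round(gflop, 1), "Gflop total (approx.)")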
-------------------------------------------------------------------------------
Give communication overhead as function of problem size and data
distribution :

This is a function of the algorithm chosen.

  Ia) single transpose FFT
      a) forward + inverse FFT: let D = 13*NLON*NLAT*NVER/(PX*PY)
         2*(PX-1) steps, 2*D volume
         or 2*LOG2(PX) steps, D*LOG2(PX) volume

  Ib) double transpose FFT
      a) forward + inverse FFT: let D = 13*NLON*NLAT*NVER/(PX*PY)
         4*(PX-1) steps, 4*D volume
         or 4*LOG2(PX) steps, 2*D*LOG2(PX) volume

  II) distributed FFT
      a) forward + inverse FFT: let D = 13*NLON*NLAT*NVER/(PX*PY)
         2*LOG2(PX) steps, D*LOG2(PX) volume

 III) transpose LT
      a) forward LT: let D = 16*MM*NLAT*NVER/(PX*PY)
         PY-1 steps, D volume
         or LOG2(PY) steps, (D/2)*LOG2(PY) volume
      b) inverse LT: let D = 10*MM*NLAT*NVER/(PX*PY)
         PY-1 steps, D volume
         or LOG2(PY) steps, (D/2)*LOG2(PY) volume

  IV) distributed LT
      a) forward + inverse LT: let D = 3*(MM**2)*NVER/(PX*PY)
         2*(PY-1) steps, D*PY volume
         or 2*LOG2(PY) steps, D*PY volume

These are per-timestep costs. Multiply by TAUE/DT for total communication
overhead. (A small calculator for these estimates is sketched below.)
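For illustration only, the following Python sketch evaluates the per-timestep
communication estimates above for the ring-style variants of the transpose FFT
and transpose LT; the other cases follow the same pattern. It is not part of
PSTSWM, and the standard problem-size relationships NLON=2*NLAT and
NLON=3*MM+1 are assumed.

    # Evaluate the per-timestep (steps, volume in REALs) estimates above.
    def transpose_fft_cost(mm, nver, px, py, double=False):
        nlon = 3 * mm + 1
        nlat = nlon // 2
        d = 13.0 * nlon * nlat * nver / (px * py)
        factor = 2 if double else 1
        # ring-style variant: 2*(PX-1) steps and 2*D volume, doubled for the
        # double-transpose algorithm
        return factor * 2 * (px - 1), factor * 2 * d

    def transpose_lt_cost(mm, nver, px, py):
        nlon = 3 * mm + 1
        nlat = nlon // 2
        d_fwd = 16.0 * mm * nlat * nver / (px * py)
        d_inv = 10.0 * mm * nlat * nver / (px * py)
        # ring-style variant: (PY-1) steps, D volume for each direction
        return 2 * (py - 1), d_fwd + d_inv

    if __name__ == "__main__":
        # T42-like case on an 8x8 processor grid
        print(transpose_fft_cost(42, 16, 8, 8))
        print(transpose_lt_cost(42, 16, 8, 8))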
-------------------------------------------------------------------------------
Give three problem sizes, small, medium, and large, for which the benchmark
should be run (give parameters for problem size, sizes of I/O files, memory
required, and number of floating point operations) :

Standard input files will be provided for:

  T42:  MM=NN=KK=42     T170: MM=NN=KK=170    T340: MM=NN=KK=340
        NLON=128              NLON=512              NLON=1024
        NLAT=64               NLAT=256              NLAT=512
        NVER=16               NVER=32               NVER=64
        ICOND=2               ICOND=2               ICOND=2
        DT=2000.0             DT=500.0              DT=250.0
        TAUE=120.0            TAUE=30.0             TAUE=15.0

These are runs of the "benchmark" case specified in Williamson et al. [4] for
different resolutions. The length of each run has been chosen in accordance
with the ParkBench committee's suggestions for the sizes of small, medium, and
large example problems. Approximate flop counts and memory requirements for
serial runs are as follows:

  T42:  30 MBytes of workspace when run with 64-bit precision, 13 Gflop
  T170: 1000 MBytes of workspace when run with 64-bit precision, 1000 Gflop
  T340: 1300 MBytes of workspace when run with 64-bit precision, 13000 Gflop

Both memory and flops scale well, so, for example, the T42 run fits in
approximately 8 MBytes of memory for a 4-processor run. But different
algorithms and different aspect ratios of the processor grid use different
amounts of memory.
-------------------------------------------------------------------------------
How did you determine the number of floating-point operations (hardware
monitor, count by hand, etc.) :

The T42 and T170 counts were determined using the hardware monitor on a Cray
vector machine, in the following fashion. The code was compiled and run using a
number of different compiler options. The counts here are based on the minimum
number of floating point operations reported during these experiments. The T340
count is based on extrapolation using T42, T85, and T170 hardware monitor
values, and was verified using the complexity results described above.
-------------------------------------------------------------------------------