==================================================================
===                                                            ===
===           GENESIS Distributed Memory Benchmarks            ===
===                                                            ===
===                            LPM1                            ===
===                                                            ===
===           Local Particle-Mesh Device Simulation            ===
===                                                            ===
===                   Author: Roger Hockney                    ===
===       Department of Electronics and Computer Science       ===
===                 University of Southampton                  ===
===                 Southampton SO9 5NH, U.K.                  ===
===      fax.:+44-703-593045   e-mail:rwh@uk.ac.soton.ecs      ===
===                                                            ===
===        Copyright: SNARC, University of Southampton         ===
===                                                            ===
===            Last update: June 1993; Release: 2.2            ===
===                                                            ===
==================================================================


1. Description
--------------

This benchmark is the simulation of an electronic device using a
particle-mesh (PM) method, often also called a particle-in-cell (PIC)
simulation.  In each timestep the electric and magnetic fields on an
(LMAX x MMAX) mesh are advanced explicitly in time using Maxwell's
equations, and the particles (electrons) are advanced in the fields
using Newton's equations.  The benchmark is described as local
because the time scale is such that the fields may be computed
explicitly, using fields only local to each mesh point.

Four benchmark cases are provided (NBEN3=1,2,3,4), giving four
problem sizes described by the size factor alpha=1,2,4,8 and mesh
numbers (75*alpha,33).  The number of particles at the end of the
run of 1 picosecond is given empirically by

        628*alpha**1.172

As the number of mesh-points increases for the same physical
dimension, the time-step must be reduced to satisfy the CFL
stability criterion.  This effect has an important influence on the
meaning of the performance metrics.

The performance is expressed in several different metrics (and
units) for comparison purposes.  As well as the traditional Speedup
and Efficiency, we give the Temporal (tstep/s), Simulation
(sim-ps/s) and Benchmark (Mflop/s(LPM1)) performance, which are
much more meaningful and useful measures.

Parallelisation is by one-dimensional domain decomposition in the
first coordinate.  Each processor is responsible for a slab of
space, and stores the mesh-points and the coordinates of the
particles in its region of space.  During each timestep, particle
coordinates are transferred between processors as the particles
move from region to region.
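
As an illustration of the problem sizes, the short Fortran sketch
below (it is not part of the distributed sources, and the program
and variable names are invented for this example) tabulates the
mesh dimensions and the empirical final particle count for the
four cases:

      PROGRAM SIZES
C     Sketch only: tabulate the mesh size and the empirical final
C     particle count for the four benchmark cases (alpha = 1,2,4,8).
      INTEGER NBEN3, ALPHA, LMAX, MMAX
      REAL NPART
      DO 10 NBEN3 = 1, 4
         ALPHA = 2**(NBEN3-1)
         LMAX  = 75*ALPHA
         MMAX  = 33
         NPART = 628.0*REAL(ALPHA)**1.172
         WRITE (*,'(A,I2,A,I4,A,I3,A,F9.1)')
     &      ' case', NBEN3, ':  mesh ', LMAX, ' x', MMAX,
     &      ',  particles ~', NPART
   10 CONTINUE
      END

Compiling this with any Fortran 77 compiler and running it simply
prints one line per case.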

Error Check
-----------

Because the simulation uses random numbers, the multi-processor
calculation cannot be expected to give results identical to those of
the uni-processor calculation.  However, the percentage differences
in the particle number, NP, and in the average B-field, BAV, in the
last timestep should not exceed a few percent.  Calculations are
accepted if the differences are < 10%.

Temporal Performance
--------------------

Temporal performance is the inverse of the execution time, here
expressed in units of timesteps per second (tstep/s).  This is the
fundamental metric of performance, because it is in absolute units
and one can guarantee that the code with the highest temporal
performance executes in the least time.

Speedup and Efficiency
----------------------

Speedup, Sp, has the traditional definition of the ratio of the
1-processor to the n-processor execution time, and Efficiency, Ep,
is the Speedup per processor.  Because Speedup is a relative
measure, the program with the highest Speedup may not execute in
the least time!  Be warned.

Simulation Performance
----------------------

This metric measures the amount of simulated time computed in one
real wall-clock second.  It is the most meaningful metric for a
simulation, because it is what the user actually wishes to maximise.
For this benchmark, the units are simulated picoseconds per second
(sim-ps/s).  In this metric larger problems with more mesh points
run slower (which in fact they do), even though they generate more
Speedup and Mflop/s!  This metric also includes the fact that
problems with a smaller space step often must use a smaller
timestep, and therefore take more timesteps to cover the same
amount of simulated time.

Benchmark Performance
---------------------

This metric is calculated from the nominal number of floating-point
operations needed to perform the benchmark on a single processor.
For the one-picosecond benchmark set up here, the average number of
floating-point operations per timestep is defined to be

        F_b(alpha) = 46*75*33*alpha + 58*628*alpha**1.172

where the size factor alpha=1,2,4,8 for cases NBEN3=1,2,3,4.  The
first term is the work to update the fields on the mesh, and the
second term is the work to move the particles.  The benchmark
performance is then

        R_b(alpha,p) = F_b(alpha)/Tp(alpha,p)

where Tp(alpha,p) is the execution time per timestep on p
processors.  Performance calculated in this way has the units
Mflop/s(LPM1).  Different parallel implementations may, in fact,
perform more or fewer operations than the above, but they are only
credited with the number given by the formula.  Because F_b is
fixed for all codes, we can guarantee that the code with the
highest benchmark performance executes in the least time.
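
As a worked example of how these measures relate, the hedged Fortran
sketch below (not part of the benchmark sources; the processor count
and the values T1, TP and DT are invented placeholders, with T1 and
TP taken as wall-clock seconds per timestep so that their inverses
are in tstep/s) evaluates them for a hypothetical case-3 run:

      PROGRAM METRIC
C     Sketch only: evaluate the LPM1 performance metrics from
C     placeholder measurements.  All numbers below are invented for
C     illustration and are not results from the benchmark.
      REAL ALPHA, T1, TP, DT, FB, RT, SP, EP, RSIM, RB
      INTEGER P
      ALPHA = 4.0
C     P processors; T1, TP = 1-proc and P-proc seconds per timestep;
C     DT = simulated picoseconds advanced per timestep (placeholders)
      P  = 25
      T1 = 0.50
      TP = 0.025
      DT = 0.0005
C     Temporal performance (tstep/s), Speedup and Efficiency
      RT = 1.0/TP
      SP = T1/TP
      EP = SP/REAL(P)
C     Simulation performance (sim-ps/s)
      RSIM = DT/TP
C     Benchmark performance (Mflop/s(LPM1)), credited from the
C     nominal operation count F_b(alpha) per timestep
      FB = 46.0*75.0*33.0*ALPHA + 58.0*628.0*ALPHA**1.172
      RB = FB/(TP*1.0E6)
      WRITE (*,*) 'tstep/s       = ', RT
      WRITE (*,*) 'Speedup       = ', SP, '  Efficiency = ', EP
      WRITE (*,*) 'sim-ps/s      = ', RSIM
      WRITE (*,*) 'Mflop/s(LPM1) = ', RB
      END

With these invented inputs the sketch prints, for example, a
temporal performance of 40 tstep/s, a Speedup of 20 and an
Efficiency of 0.8.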

Operating Instructions
----------------------

To compile and link the benchmark type `make' for the distributed
version, or `make slave' for the single-node version.

To run the compiled program (e.g. on the Intel iPSC), type:

        getcube -t4     ! to allocate the cube
        lpm1            ! to run the benchmark

On some systems the allocate command may not be necessary.

Then answer one question:

(1) Number of nodes for mimd run

    This is the number of nodes (processors) to be used in the
    calculation.  It is, at maximum, equal to the number of nodes
    allocated by getcube (4 in the above example).
    Value may be: 1, 2, ... , maximum nodes (here 4).

Note: For every problem size, the 1-processor calculation must be
performed once, to obtain the reference time for the Speedup
measure.  These timing results are stored in the four check result
files: res1p.size1, ... , res1p.size4.  We recommend that your first
run uses 1 processor, otherwise the Speedup will be printed as zero.
You do not have to rerun it when you change the number of processors
allocated.

The results for the four problem sizes, cases 1, 2, 3 and 4, and
different numbers of processors are put automatically in different
output files, with the notation (for example):

        lpm1c3p25  -  output for the lpm1 benchmark, case 3,
                      on 25 processors

If you wish to put the files elsewhere, there is a prompt to tell
you when to do it with a Unix cp command.

Files
-----

lpm1.u      - host program, contains PARMACS for the host.
node.u      - node main program and all communication interface
              routines; therefore all node PARMACS calls are here.
benctl.f    - benchmark control, may be changed to modify output,
              but usually left alone.  No PARMACS here.
lpm1bk.f    - body of the benchmark code.  Not to be touched.
res1p.size1 - correct results on one processor for the standard size
              problem, case 1, (75x33) mesh.
res1p.size2 - results for the case 2 problem, (150x33) mesh.
res1p.size3 - results for the case 3 problem, (300x33) mesh.
res1p.size4 - results for the case 4 problem, (600x33) mesh.
secowa.f    - LPM1 program second timer, which calls
timer.f     - the standard benchmark system timer.
header.f    - standard header information.
setdat.f    - puts the date on the results.
setdtl.f    - compiler and system details.
lpm1c4p100  - etc., output files generated by the program.

$Id: ReadMe,v 1.2 1994/04/20 17:33:35 igl Rel igl $

Submitted by Mark Papiani,
last updated on 10 Jan 1995.