.. pic file | tbl | eqn | troff -ms

.LP
.ds HT "\h'-.65m'\v'-.40m'^\v'.40m' 
.ds TD "\h'-.65m'\v'-.40m'~\v'.40m' 
.ds CT "\(mu\h'-.66m'\(ci 
.ds NI "\(mo\h'-.65m'/
.nr EP 1
.nr VS 14
.EQ
gsize 11
delim @@
define times '\(mu'
define ctimes '\(mu back 80 \(ci ^ '
define bl '\0'
.EN
.TL
SCHEDULE USERS GUIDE
.AU
.ps 11
.in 0
J. J. Dongarra and D. C. Sorensen
.AI
.ps 10
.in 0
Mathematics and Computer Science Division\h'.20i'   
Argonne National Laboratory\h'.20i'   
9700 South Cass Avenue\h'.20i'
Argonne, Illinois 60439-4844\h'.20i'
.nr PS 11
.nr VS 16
.nr PD 0.5v
.FS
* Work supported in part by the Applied Mathematical Sciences
subprogram of the Office of Energy Research, 
U.S. Department of Energy, under Contracts W-31-109-Eng-38,
DE-AC05-840R21400.
.FE
.NH
Introduction
.PP
Many new parallel computers are now emerging as commercial products
Exploitation of the parallel capabilities requires either extensions to an 
existing language such as Fortran or development of an entirely new 
language.  A number of activities
are under way to develop new languages 
that promise to provide the ability to exploit parallelism without the 
considerable effort that may be required in using an inherently serial 
language that has been extended for parallelism.  We applaud such activities
and expect they will offer a true solution to the software dilemma in the
future.  However, in the short term we feel there is a need to confront some
of the software issues, with particular emphasis placed on transportability
and use of existing software.
.PP
Our interests lie mainly with mathematical software typically associated
with scientific computations.  Therefore, we concentrate here on using
the Fortran language.
Each vendor of a parallel machine designed primarily for numerical calculations
has provided its own
parallel extensions to Fortran.  These extensions have taken many forms already
and are usually dictated by the underlying hardware and by the capabilities
that the vendor wishes to supply the user.   This has led to widely 
different extensions ranging from the ability to synchronize on every assignment
of a variable with a full empty property
to attempts at automatically
detecting loop-based parallelism with a preprocessing compiler aided by 
user directives
The act of getting a parallel process executing on
a physical processor ranges from a simple "create" statement
which imposes
the overhead of a subroutine call, to "tskstart" 
which imposes an 
overhead on the order of @10 sup 6 @ machine cycles, to no formal mechanism
whatsoever
These different approaches reflect characteristics 
of underlying hardware and operating systems and to a large extent are dictated 
by the vendors view of which aspects of parallelism are marketable.  
It is too early to impose
a standard on these vendors, yet it is disconcerting that
there is no agreement among any of them
on which extensions should be included.  There is not even an agreed-upon
naming convention for extensions that have identical functionality.
Program developers interested in producing
implementations of parallel algorithms that will
run on a number of different parallel machines are therefore faced with
an overwhelming task.
The process of developing portable parallel packages is complicated by 
additional factors that lie beyond 
each computer manufacturer supplying its own, very different 
mechanism for parallel processing.
A given implementation may require several different 
communicating parallel
processes, perhaps with different levels of granularity.
An efficient implementation may require the ability to 
dynamically start processes, perhaps many more than the number of physical
processors in the system.  This feature is either lacking or prohibitively
expensive on most commercially available parallel computers.  Instead, many of 
the manufacturers have limited themselves to providing one-level loop-based
parallelism.
.PP
This paper describes
an environment for the transportable implementation of parallel
algorithms in a Fortran setting.  By this we mean that a user's code is 
virtually identical for each machine.  
The main tool in this environment is a package called
SCHEDULE which has been designed to aid a programmer familiar with a 
Fortran programming
environment to implement a parallel algorithm in a manner
that will lend itself to transporting the resulting program across a wide
variety of parallel machines. 
The package is designed to allow existing Fortran subroutines to be called
through SCHEDULE, without modification, thereby permitting users 
access to a wide body of existing library software in a parallel setting.
Machine intrinsics are invoked within the SCHEDULE package, and 
considerable effort may be required on our part to move SCHEDULE from 
one machine to another.  On the other hand, the user of SCHEDULE is 
relieved of the burden
of modifying each code he disires to transport from one machine to another.
.PP
Our work has primarily been influenced by the work of R. Babb 
J.C. Browne 
and
E. Lusk and R. Overbeek .
We present here our approach, which 
aids in the programming of explicitly parallel algorithms in Fortran and which
allows one to make use of existing Fortran
libraries in the parallel setting.
The approach taken here should be regarded as minimalist:  it has a very 
limited scope.  There are two reasons for this.  First, the goal of 
portability of user code will be less difficult to achieve.  Second,
the real hope for a solution to the software problems associated with
parallel programming lies with new programming languages or perhaps
with the "right" extension to Fortran.  Our approach is expected to 
have a limited lifetime.  Its purpose is to allow us to exploit 
existing hardware immediately.  

.NH
Terminology
.PP
Within the science of parallel computation there seems to 
be no standard definition of terms.  A certain terminology will be adopted here
for the sake of dialogue.  It will not be "standard" and is 
intended only 
to apply within the scope of this document.
.R

Process - A unit of computation, an independently executable Fortran 
          subroutine together with calling sequence parameters, common 
          data, and externals.

Task    - A main program, processes, and a virtual processor.

Virtual Processor - A process designed to assume the identity of every 
          process within a given task (through an appropriate subroutine call).

Processor - A physical device capable of executing a main program or a 
         virtual processor.

Shared Data - Variables that are read and/or written by 
         more than one process (including copies of processes).  

Data Dependency - A situation wherein one process (A) reads any shared 
        data that another process (B) writes.  This data dependency
         is satisfied when B has written the shared data.

Schedulable Process - A process whose data dependencies have all
         been satisfied.

.NH
Parallel Programming Ideas
.PP
When designing a parallel algorithm one is required to describe the
data dependencies, parallel structures, and shared variables 
involved in the solution.
Typically, such algorithms are first designed at a conceptual level and 
later implemented
in Fortran and its extensions.
Each manufacturer provides a different set of extensions and targets
these extensions at different implementation levels. For example,
some manufacturers allow only test-and-set along with spawn-a-process, while
others allow concurrent execution of different loop iterations.
.PP
Our attempt here is to allow the user to define the 
data dependencies, parallel structures, and shared variables 
in his application and
then to implement these ideas in a Fortran program 
written in terms of subroutine calls to SCHEDULE.
Each set of subroutine calls 
specifies a unit of computation
or process which consists of a subroutine name
along with the calling parameters 
and the
data dependencies necessary to 
coordinate the parallel execution.
.PP
The basic philosophy here is that Fortran programs are naturally
broken into subroutines that identify 
self-contained 
units of computation 
which operate on shared data structures.  
This allows one to call on 
existing library subroutines in a parallel setting
without modification, and without having
to write an envelope around the library subroutine call in order to conform
to some unusual data-passing conventions imposed by a given parallel programming
environment.  
.PP
A parallel(izable) 
program is written in terms of calls to subroutines which, in principle,
may be performed either independently or according to data dependency 
requirements that the user is responsible for defining.  
The result is a serial program that can run in parallel given
a way to schedule the units of computation on a system of parallel 
processors while obeying the data dependencies.
.NH
Parallel Programming Using SCHEDULE
.PP
The package SCHEDULE requires a user
to specify the subroutine calls along with the execution dependencies
in order to carry out a parallel computation.  Each of these calls 
represents a process, and the user must take the responsibility of 
ensuring that the data dependencies represented by the graph are valid.
This concept is perhaps difficult to grasp without some experience with
writing parallel programs.  We shall try to explain it in this section
by example; in the following section we shall describe the
underlying concepts and the SCHEDULE mechanism.
.PP
To use SCHEDULE, one must be able to express (i.e., program) 
an algorithm in 
terms of processes and execution dependencies among the processes.  A convenient
way to view this is through a computational graph.  For example, the following
graph 
.KS
.sp 2
.PS
.ft R
.ps 8
.vs 9
scale = 1
circlerad = 0.282i
boxwid = 0.918i
boxht = 0.282i
circle  rad 0.169 at (2.485i, 6.995i) "A"
circle  rad 0.169 at (2.033i, 6.544i) "B"
circle  rad 0.169 at (2.936i, 6.544i) "C"
circle  rad 0.169 at (2.033i, 5.866i) "D"
circle  rad 0.169 at (1.355i, 5.866i) "D"
circle  rad 0.169 at (2.711i, 5.866i) "E"
arrow from (2.623i, 6.850i) to (2.598i, 6.875i)
line  from (2.824i, 6.664i) to (2.623i, 6.850i)
arrow from (2.347i, 6.850i) to (2.372i, 6.875i)
line  from (2.146i, 6.664i) to (2.347i, 6.850i)
arrow from (1.895i, 6.399i) to (1.920i, 6.424i)
line  from (1.468i, 5.986i) to (1.895i, 6.399i)
arrow from (2.033i, 6.346i) to (2.033i, 6.381i)
line  from (2.033i, 6.028i) to (2.033i, 6.346i)
arrow from (2.171i, 6.399i) to (2.146i, 6.424i)
line  from (2.598i, 5.986i) to (2.171i, 6.399i)
.ps
.vs
.ft
.PE
.sp
.I
.ce
Figure 4.1.  
.ce
Data Dependency Graph
.KE
.R
.sp
denotes five subroutines A, B, C, D, and E  (here with two 
"copies" of subroutine D operating on different data).
We intend the execution to start simultaneously on subroutines C,D,D, and E 
since they appear as leaves in the data dependency graph
(D will be initiated twice with different data).
Once D,D,and E have completed then B may execute.  When B and C have 
completed execution then A may start and the entire computation is finished 
when A has completed.
To use SCHEDULE, one is required to specify the subroutine calling
sequence of each of the six units of computation, along with a
representation of this dependency graph.
.PP
For each node in the graph, SCHEDULE requires two subroutine calls.
One contains information about the user's routine to be called, such as
the name of the routine, calling sequence parameters, and a simple tag 
to identify the process.
The second subroutine call defines the dependency in the
graph to nodes above and below the one being specified, and specifies the
tag to identify the process.
In this example, after an initial call to set up the environment
for SCHEDULE, six pairs of calls would be made to define the relationships
and data in the computational graph.
.PP
These concepts are perhaps more easily grasped through an actual
Fortran example.  A very simple example is a parallel algorithm for computing
the inner product of two vectors.   The intention here is to illustrate the
mechanics of using SCHEDULE.  This algorithm and the use of SCHEDULE on
a problem of such small granularity are not necessarily recommended.
.nf

Problem: Given real vectors @a@ and @b@, each of length @n@, compute @sigma ~=~ a sup T  b @.

Parallel Algorithm:  
                     Let @ a sup T ~=~ ( a sub 1  sup T , a sub 2 sup T ,...,a sub k sup T ) @ and @ b sup T ~=~ ( b sub 1 sup T , b sub 2 sup T ,...,b sub k sup T ) @ 
                     be a partitioning of the vectors @a@ and @b@ into smaller vectors @ a sub i @ and @ b sub i @.
.sp
                     Compute ( in parallel )
           
                         @sigma sub j ~=~ a sub j sup T b sub j@ , @ j ~=~ 1,k@.

                     When all done

                         @ sigma ~=~ sigma sub 1 ~+~ sigma sub 2 ~+~ ... ~+~ sigma sub k @.

Each of the parallel processes will execute code of the following form:

.nf
.cs 1 24
.vs 10
.ps 9

      subroutine inprod(m,a,b,sig)
      integer m
      real a(*),b(*),sig
      sig = 0.0
      do 100 j = 1,m
         sig = sig + a(j)*b(j)
  100 continue
      return
      end

.fi
.ps 12
.cs 1
.vs 14
The following routine is used
to accumulate the result:
.nf
.cs 1 24
.vs 10
.ps 9

      subroutine addup(k,sigma,temp)
      integer k
      real sigma,temp(*)
      sigma = 0.0
      do 100 j = 1,k
         sigma = sigma + temp(j)
  100 continue
      return
      end

.fi
.ps 12
.cs 1
.vs 14
.PP
The first step in constructing a code is to understand the parallel
algorithm in terms of schedulable processes and a data dependency graph.
Then the algorithm is expressed in a standard (serial) Fortran code.
This code consists of a main program which initializes the shared data and
a "parallel" subroutine \f3parprd\f1 to compute the inner product by invoking
the parallel processes \f3inprd\f1 and \f3addup\f1.  The program and associated
data dependency graph are shown below.

Serial Code:

.nf
.cs 1 24
.vs 10
.ps 9
      program main
      integer n,k
      real a(1000),b(1000),temp(50),sigma
      read (5,*) n,k
      do 100 j = 1,n 
         a(j) = j
         b(j) = 1
  100 continue
c
      call parprd(n,k,a,b,temp,sigma)
c
      write(6,*) ' sigma = ',sigma
      stop
      end


      subroutine parprd(n,k,a,b,temp,sigma)
c
c     declare shared variables
c
      integer n,k
      real a(*),b(*),temp(*),sigma
c
c     declare local variables
c
      integer m,indx,j   
c
      m = n/k
      indx = 1
      do 200 j = 1,k
c
         call inprod(m,a(indx),b(indx),temp(j))
c
         indx = indx + m
         if (j .eq. k-1) m = n - indx + 1
  200 continue
c
      call addup(k,sigma,temp)
c
      return
      end
.fi
.ps 12
.cs 1
.vs 14
.fi
.PS
.ft R
.ps 8
.vs 9
scale = 1
circlerad = 0.282i
boxwid = 0.918i
boxht = 0.282i
circle  rad 0.176 at (2.485i, 6.769i) "k+1"
circle  rad 0.155 at (1.355i, 6.092i) "1"
circle  rad 0.155 at (1.807i, 6.092i) "2"
circle  rad 0.155 at (2.259i, 6.092i) "3"
circle  rad 0.155 at (3.162i, 6.092i) "k-1"
circle  rad 0.155 at (3.614i, 6.092i) "k"
arrow from (2.306i, 6.659i) to (2.336i, 6.678i)
line  from (1.482i, 6.169i) to (2.306i, 6.659i)
arrow from (2.340i, 6.617i) to (2.365i, 6.642i)
line  from (1.913i, 6.198i) to (2.340i, 6.617i)
arrow from (2.424i, 6.574i) to (2.435i, 6.607i)
line  from (2.301i, 6.233i) to (2.424i, 6.574i)
arrow from (2.630i, 6.617i) to (2.605i, 6.642i)
line  from (3.056i, 6.198i) to (2.630i, 6.617i)
arrow from (2.663i, 6.659i) to (2.633i, 6.678i)
line  from (3.487i, 6.169i) to (2.663i, 6.659i)
.ps
.vs
.ft
.PE
.I
.ce
Figure 4.2.  
.ce
Depedency Graph for Parallel Inner Product
.R
In this data dependency graph we have identified @k@ processes 
.EQ
         inprod(~ m,~ a(indx),~ b(indx),~ temp(j)~ ) ~,~~~ j~=~1,~ 2,...,~ k ~ ~~, 
          ~ ~indx~=~1~+~ (j-1)*m
.EN
which are not data dependent.  Each of them reads a segment of the 
shared data @a ~,~b @ and 
writes on its own entry of the array @temp@, but none of them needs to
read data that some other process will write.  This fact is evident in the
graphical representation shown in Figure 4.2  where they are @leaves@.
One process,
.EQ
      addup(k,sigma,temp),
.EN
labeled @k+1@ is data dependent on each of the 
processes @ 1,2,...,k@.  This is because \f3addup\f1 needs to read each entry of
the array @temp@ in order to compute the sum and place it into @sigma@.
.PP
From this data dependency graph we may proceed to write the parallel
program.   Once we have understood the computation well enough to 
have carried out these two steps, the invocation of SCHEDULE to provide for
the parallel execution of schedulable processes is straightforward.
Calls to \f3parprd\f1, \f3inprod\f1, and \f3addup\f1 are replaced by calls to 
SCHEDULE to identify the routines to be executed as well
as the information relating to the dependency graph.  
The modified code follows.

 
.nf
Parallel Main:

.cs 1 24
.vs 10
.ps 9
      program main
      integer n,k
c
      EXTERNAL PARPRD
c
      real a(1000),b(1000),temp(50),sigma
      read (5,*) n,k,NPROCS
      do 100 j = 1,n 
         a(j) = j
         b(j) = 1
  100 continue
c
      CALL SCHED(nprocs,PARPRD,n,k,a,b,temp,sigma)
c
      write(6,*) ' sigma = ',sigma
      stop
      end


      subroutine parprd(n,k,a,b,temp,sigma)
c
c     declare shared variables
c
      integer n,k
      real a(*),b(*),temp(*),sigma
c
c     declare local variables
c
      integer m1,m2,indx,j,jobtag,icango,ncheks,mychkn(2) 
c
      EXTERNAL INPROD,ADDUP
      save m1,m2
c
      m1   = n/k
      indx = 1
      do 200 j = 1,k-1
c
c        express data dependencies
c
         JOBTAG = j
         ICANGO = 0
         NCHEKS = 1
         MYCHKN(1) = k+1
c
         CALL  DEP(jobtag,icango,ncheks,mychkn)
         CALL  PUTQ(jobtag,INPROD,m1,a(indx),b(indx),temp(j))
c
         indx = indx + m1   
  200 continue
      m2 = n - indx + 1
c
c     express data dependencies for clean up step
c
         JOBTAG = k
         ICANGO = 0
         NCHEKS = 1
         MYCHKN(1) = k+1
c
         CALL  DEP(jobtag,icango,ncheks,mychkn)
         CALL  PUTQ(jobtag,INPROD,m2,a(indx),b(indx),temp(k))
c
         indx = indx + m1   
c
         JOBTAG = k+1
         ICANGO = k
         NCHEKS = 0
c
      CALL  DEP(jobtag,icango,ncheks,mychkn)
      CALL  PUTQ(jobtag,ADDUP,k,sigma,temp)
c
      return
      end
.ps 12
.cs 1
.vs 14

.fi
.PP
The code that will execute in parallel has been derived from the serial
code by replacing calls to \f3parprd\f1, \f3inprd\f1, \f3addup\f1 with calls to 
SCHEDULE routines that invoke these routines.  The modifications are
signified by putting calls to SCHEDULE routines in capital letters.
Let us now describe the purpose of each of these calls.
.nf
.cs 1 24
.vs 10
.ps 9

      CALL SCHED(nprocs,PARPRD,n,k,a,b,temp,sigma)
.ps 12
.cs 1
.vs 14

.fi
This replaces the call to \f3parprd\f1 in the serial code.  The effect is 
to devote @nprocs@ virtual processors to the parallel subroutine \f3parprd\f1.
The parameter list following the subroutine name consist of the calling
sequence one would use to make a normal call to \f3parprd\f1.  Each of 
these parameters must be called by reference and not by value.  No 
constants or arithmetic expressions should be passed as parameters
through a call to \f3sched\f1.  This call to \f3sched\f1 will 
activate @nprocs@ copies of a virtual processor \f3work\f1.  This virtual
processor is a SCHEDULE procedure (written in C) that is internal to 
the package and not explicitly available to the user. 
.nf
.cs 1 24
.vs 10
.ps 9

         JOBTAG = j
         ICANGO = 0
         NCHEKS = 1
         MYCHKN(1) = k+1
c
         CALL  DEP(jogtag,icango,ncheks,mychkn)
         CALL  PUTQ(jobtag,INPROD,m,a(indx),b(indx),temp(j))

.cs 1
.vs 14
.ps 12
.fi
This code fragment shows the @j-th@ instance of the process \f3inprod\f1
being placed on a queue.  The information needed to schedule this process
is contained in the data dependency graph.  In this case, the @j-th@ instance
of a call to \f3inprod\f1 is being placed on the queue, so @jobtag@ is set to @j@.
The value zero is placed in @icango@, indicating that this process does not 
depend on any others.  If this process were dependent on @p@, other processes
then @icango@ would be set to @p@.
.PP
The mechanism just described allows static scheduling of parallel 
processes.  In this program the partitioning and data dependencies 
are known in advance even though they are parameterized.  It is possible 
to dynamically allocate processes; this capability will be explained later.  It 
might be worthwhile at this point to discuss the mechanism that
this package relies on.  
.NH
The SCHEDULE Mechanism
.PP

The call to the SCHEDULE routines \f3dep\f1 and \f3putq\f1, respectively, places process
dependencies and process descriptors on a queue.  A unique user 
supplied identifier 
@jobtag@ is associated with each node of the dependency graph.  This identifier
is a positive integer.  Internally it 
represents a pointer to a process.  The items needed to specify a 
data dependency are non-negative integers @icango@ and @ncheks@ and an integer 
array @mychkn@.  The @icango@ specifies
the number of processes that process @jobtag@ depends on.  The @ncheks@ 
specifies the number of processes that depend on process @jobtag@.  The 
@mychkn@ is an integer array whose first @ncheks@ entries contain
the identifiers (i.e., @jobtag@s) of the processes that depend on 
process @jobtag@.
.KS
.PS
.ft R
.ps 8
.vs 9
scale = 1
circlerad = 0.282i
boxwid = 0.918i
boxht = 0.282i
circle  rad 0.169 at (1.807i, 6.769i)
circle  rad 0.169 at (1.807i, 6.318i)
circle  rad 0.169 at (1.807i, 5.866i)
circle  rad 0.282 at (0.904i, 6.318i)
box invis ht 0.918 wid 0.282 at (0.452i, 5.640i)
box invis ht 0.918 wid 0.282 at (1.129i, 5.640i)
box invis wid 0.918 ht 0.282 at (0.000i, 6.995i) "icango = 2"
box invis wid 0.918 ht 0.282 at (0.000i, 6.769i) "ncheks = 3"
spline -> from (1.024i, 6.565i) to (1.129i, 6.769i)to (1.645i, 6.769i)
spline -> from (1.179i, 6.318i) to (1.355i, 6.318i)to (1.694i, 5.986i)
spline -> from (1.151i, 6.438i) to (1.355i, 6.544i)to (1.659i, 6.388i)
arrow from (0.729i, 6.055i) to (0.748i, 6.085i)
line  from (0.452i, 5.640i) to (0.729i, 6.055i)
arrow from (0.999i, 6.023i) to (0.988i, 6.056i)
line  from (1.129i, 5.640i) to (0.999i, 6.023i)
.ps
.vs
.ft
.PE
.I
.ce
Figure 5.1.  
.ce
A Node in a Dependency Graph
.R
.KE
.sp
In Figure 5.1 a typical node of a data dependency graph is shown.  This node
has two incoming arcs and three outgoing arcs.  As shown to the left of the
node one would set @icango ~=~ 2@, @ncheks ~=~ 3@, and the first three 
entries of @mychkn@ to the identifiers of the processes pointed to by 
the outgoing arcs.
.PP
The initial call to @sched@(nprocs,subname,<parms>) results in @nprocs@ 
virtual processors called \f3work\f1 to begin executing on @nprocs@ separate
physical processors.  Typically @nprocs@ should be set to a value that
is less than or equal to the number of physical processors available
on the given system.  These \f3work\f1 routines access a ready queue of @jobtag@s
for schedulable
processes.  Recall that a schedulable process is one whose data dependencies
have been satisfied.  After a \f3work\f1 routine has been successful in 
obtaining the @jobtag@ of a schedulable process, it makes the subroutine
call associated with that @jobtag@ during the call to \f3putq\f1.
When this subroutine executes a @return@, control is returned to \f3work\f1,
and a SCHEDULE routine \f3chekin\f1 is called which decrements the @icango@ 
counter of each of the @ncheks@ processes that depend on process @jobtag@.
If any of these @icango@ values has been decremented to zero, the identifier
of that process is placed on the ready queue immediately.
.PP
We depict this mechanism in Figure 5.2 .  The array labeled \f3parmq\f1
holds a process descriptor for each @jobtag@.  A process descriptor consists
of data dependency information and a subroutine name together with a
calling sequence for that subroutine.  This information is placed
on \f3parmq\f1 through the two calls
.sp
.cs 1 24
.vs 10
.ps 9
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,subname,<parms>).

.fi
.cs 1
.vs 14
.ps 12
When making these two calls the user has assured that a call to @subname@
with the argument list @<parms>@  is valid in a data dependency sense 
whenever the
counter @icango@ has been decremented to the value zero.  When a \f3work\f1 
routine has finished a call to \f3chekin\f1 , it gets the @jobtag@ of the 
next available 
schedulable process off the @readyq@  and then assumes the identity of 
the appropriate subroutine by making a call to @subname@ with the argument list
@<parms>@.  
.KS

.PS
.ft R
.ps 8
.vs 9
scale = 1
circlerad = 0.282i
boxwid = 0.918i
boxht = 0.282i
box invis ht 0.918 wid 0.282 at (1.581i, 7.673i)
circle  rad 0.282 at (2.033i, 6.544i) "SCHEDULE"
box  wid 0.275 ht 0.762 at (1.355i, 5.414i) "W" "O" "R" "K"
box  wid 0.275 ht 0.762 at (1.807i, 5.414i) "W" "O" "R" "K"
box  wid 0.275 ht 0.762 at (2.711i, 5.414i) "W" "O" "R" "K"
box invis ht 0.918 wid 0.282 at (2.711i, 6.995i) "Head"
box invis ht 0.918 wid 0.282 at (1.355i, 6.995i) "Tail"
box invis ht 0.918 wid 0.282 at (1.581i, 7.673i)
box  wid 0.918 ht 0.282 at (2.033i, 6.995i) "READYQ"
box  wid 1.285 ht 0.593 at (2.033i, 7.673i) "PARMQ"
box invis ht 0.918 wid 0.282 at (3.614i, 6.092i) "10 myjob = getprb(jobtag)"
box invis ht 0.918 wid 0.282 at (3.614i, 6.092i)
box invis ht 0.918 wid 0.282 at (3.656i, 5.633i) "call checkin(jobtag)"
box invis ht 0.918 wid 0.282 at (3.346i, 5.456i) "go to 10"
box invis ht 0.918 wid 0.282 at (3.656i, 5.866i) "call subname(<parms>)"
box invis ht 0.918 wid 0.282 at (2.936i, 6.318i)
box invis ht 0.918 wid 0.282 at (3.614i, 6.318i)
box invis ht 0.918 wid 0.282 at (2.936i, 5.188i)
box invis ht 0.918 wid 0.282 at (3.614i, 5.188i)
box invis ht 0.918 wid 0.282 at (2.259i, 5.414i)
box invis ht 0.918 wid 0.282 at (2.033i, 5.414i)
box invis ht 0.918 wid 0.282 at (2.259i, 5.414i)
box invis ht 0.918 wid 0.282 at (2.259i, 5.414i)
box invis ht 0.918 wid 0.282 at (2.485i, 5.414i)
spline -> from (2.252i, 6.374i) to (2.485i, 6.318i)to (2.661i, 5.795i)
spline -> from (1.864i, 6.325i) to (1.807i, 6.092i)to (1.807i, 5.795i)
spline -> from (1.793i, 6.402i) to (1.355i, 6.318i)to (1.355i, 5.795i)
spline -> from (1.595i, 6.854i) to (1.355i, 6.769i)to (1.772i, 6.628i)
line dashed from (2.809i, 5.795i) to (2.936i, 6.318i)
line dashed from (2.936i, 6.318i) to (3.614i, 6.318i)
line dashed from (2.845i, 5.273i) to (2.936i, 5.188i)
line dashed from (2.936i, 5.188i) to (3.614i, 5.188i)
spline -> from (2.301i, 6.600i) to (2.929i, 6.727i)to (2.492i, 6.854i)
.ps
.vs
.ft
.PE
.I
.ce
Figure 5.2.  
.ce
The SCHEDULE Mechanism
.R
.KE
.sp
.NH
How to Use Schedule
.PP
As explained in the previous sections the basic idea in writing a 
Schedule program is
to partition a computation into units of computation and data dependencies
between these units.  In this section we assume familiarity with these 
concepts.  We also assume familiarity with the notion of a data dependence
graph and the association of nodes in this graph with units of computation.
Data dependencies are really assertions made by
the user and the responsibility for the correctness of these assertions
rests with the user.  Schedule provides a mechanism to record the units of
computation, express the 
dependencies between them, and then execute the computation described by
the Schedule program.  It is very important
to clearly define and understand the shared data of a problem and how it
is to be partitioned for parallel processing.  
.sp  
\fBPartitioning a Problem\fR
.PP
The first thing that must be understood is the distinction between
local and global variables.  Global variables are shared amongst
all the processes which reference them.  Local variables are defined 
within a subroutine and a local version of these variables is obtained
each time a unit of computation based upon this subroutine is executed.
.PP
All shared data should be grouped in named COMMON. This is not
a strict requirement but it is a good programming practice
and can increase performance on some machines.  Before discussing
how data is partitioned it is usefule to have an idea of what
a Schedule program will look like.  The structure of a
Schedule program generally takes the following form:
.sp2
Main Program
.sp
.nf
.cs 1 24
      common /prbdef/<array declarations >
      external \fBparalg\fR
.sp
      \fIinitialize variables\fR
            .
            .
            .
         call \fBsched\fR(nprocs,\fBparalg\fR,<parms>)
            .
            .
            .
      \fIoutput result\fR
.sp
      stop
      end
.cs 1
.fi
.sp2
Static Data Dependency Graph
.sp1
.nf
.cs 1 24
      subroutine paralg(<parms>)
      \fIdeclare global variables\fR
.sp1
      \fIdeclare local variables\fR
      external \fBsubname\fR
.sp1
      do 100 j = 1,nunits
               .
               .
               .
c
         jobtag = @the~identifier~of~this~unit~of~computation@
         icango = @number~of~nodes~jobtag~depends~on @
         ncheks = @number~of~nodes~which~depend~on~jobtag@
         list   = @list~of~identifiers~of~these~ncheks~dependents@
c
         call \fBdep\fR(jobtag,icango,ncheks,list)
         call \fBputq\fR(jobtag,\fBsubname\fR,<parms>)
               .
               .
               .
  100 continue
      return
      end
.cs 1
.fi
.sp2
Unit of Computation with Dynamic Spawning
.sp1
.nf
.cs 1 24
      subroutine subname(<parms>)
      \fIdeclare global variables\fR
.sp1
      \fIdeclare local variables\fR
      external \fBsublow\fR
.sp1
      do 200 j = 1,nkids
         .
         .
         .
         call \fBnxtag\fR(mytag,jdummy)
         call \fBspawn\fR(mytag,jdummy,\fBsublow\fR,<parms>)
         .
         .
         .
  200 continue
      return
      end
.cs 1
.fi
.sp2
In this generic example @nunits@ units of computation were defined
in the the 100 loop of the subroutine @paralg@.  Here, 
all of the units of computation involve the subroutine @subname@
operating on different data.  There may be several different
subroutines associated with various units of computation.  
The initial call to @sched@
acquires @nprocs@ physical processors devoted to the execution of 
the computation defined by subroutine @paralg@.  There can be 
several calls to @sched@ in the main program.  Each call involves the
overhead of acquiring @nprocs@ physical processors on the given
system. Care must be exercised on some systems because the expense of
acquiring a number of physical processors may be substantial. 
.PP
This example shows the three levels of parallelism typically
available in a Schedule program.  
The subroutine @paralg@
defines the static data dependency graph which is to be obeyed throughout
the course of the computation.  The subroutine @subname@ exploits
the lowest level of parallelism through dynamic spawning of processes
involving @sublow@.
This subroutine was constructed to dynamically spawn processes 
at run time using the calls to @nxtag@ and @spawn@ in the 200 loop.
.PP
More will be said about the details of this mechanism below.  At this
point it should be stressed that the subroutine @paralg@ only records
the data dependencies and the unit of computation associated
with each node in the data dependency graph.  None of these will execute
until @paralg@ completes.  When @paralg@ does complete (ie. executes a
return statement) then the computation begins by executing those nodes 
which have @icango~=~0@.  During the execution, these units
of computation will execute as their data dependencies are satisfied.  
Additional units of computation may be placed on the pool of work to be done 
during execution through the @spawn@ mechanism.  
.sp2
\fBStatic Partition\fR
.PP 
In the generic example above, the subroutine @paralg@ defines the static
data dependency graph.  It is important to understand that none
of the units of computation defined through @dep@ and @putq@ will
execute until a return has been executed by @paralg@.  Thus
the parameters passed through the calling sequence to @subname@
must be global variables.  That is, every variable passed through
the calling sequence to @subname@ with a call to @putq@ must
either reference a variable passed through the
calling sequence of @paralg@ or must be in shared common.
.PP
To define the static data dependency graph one must write a program
that includes a call to @dep@ followed by a call to @putq@ corresponding
to every node on the graph.  
Each node of the graph should be assigned a @jobtag@.  These jobtags
must be successive positive integers  beginning at 1 and not exceeding 1000.
In the program defining the static data dependency graph there must be the 
statements
.sp
         call \fBdep\fR(jobtag,icango,ncheks,list)
         call \fBputq\fR(jobtag,\fBsubname\fR,<parms>)
.sp
corresponding to each node.  The call to @dep@ records the data dependencies
for the node that has been labeled @jobtag@.  The dependencies are
defined through specification of 
.sp
.nf
.cs 1 24
         icango = @an~integer~specifying~the~number~of~nodes~jobtag~depends~on @
.sp
         ncheks = @an~integer~specifying~the~number~of~nodes~which~depend~on 
                   ~jobtag@
.sp
         list   = @an~integer~array~containing~list~of~identifiers~
                   (ie.~jobtags)@
                  @~of~the~ncheks~processes~dependent~upon~completion~of~
                   jobtag@.
.sp
.cs 1
.fi
.PP
It is essential that at least one node
has @ icango~=~0 @ set.  Without this condition the program cannot start.
There can be several nodes with @icango~=~0@.  Exactly one node must have
the condition @ncheks = 0@ set.  The program cannot finish without this.
Error messages will be output and the Schedule program will execute a Fortran 
stop if either of these conditions are not met.
A unit of computation is "scheduled" for execution when its @icango@
has been set to 0 indicating that it is not waiting upon the completion
of any other unit of computation.  The @icango@ counter  of each of the
@ncheks@ dependents will be decremented automatically when @jobtag@ completes.
The unit of computation labeled @jobtag@ is recorded through the 
call to @putq@.  Once the data dependencies for @jobtag@ have been
satisfied a 
.sp
        call \fBsubname\fR ( <parms> ) 
.sp
will be executed where <parms> represents the list of parameters
to be passed to the subroutine @subname@.  The call to @putq@ records
the entry point to @subname@ and the addresses of each parameter 
in the calling sequence <parms>.  Execution of this unit of 
computation only occurs after the data dependencies have
been satisfied and a physical processor has become available.
The order of execution is dictated by the order in which schedulable
processes get placed on a "ready to execute" queue.  When the number
of physical processors is 1 then this order is predetermined, but
when more than one processor is active the order is somewhat 
nondeterministic.  This queue is served by 
@nprocs@ workers on a self scheduled.
.sp2
\fBSimple Spawn\fR
.PP
If one wishes to do dynamic spawning within one of the executing processes 
that has been defined statically, there are three options.  The simplest
situation is when the spawning process does not do any additional work
which depends upon the results of the spawned processes.
In this case the spawning process has the form
.sp2
.nf
.cs 1 24
      subroutine subname(<parms>)
      \fIdeclare global variables\fR
.sp1
      \fIdeclare local variables\fR
.sp1
      external \fBsublow\fR
      do 200 j = 1,nkids
         .
         .
         .
         call \fBnxtag\fR(mytag,jdummy)
         call \fBspawn\fR(mytag,jdummy,\fBsublow\fR,<parms>)
         .
         .
         .
  200 continue
      return
      end
.cs 1
.fi
.sp2
Let us call the executing process defined by @subname@ together with
its arguments the parent.  In the computation above, the parent process will
spawn @nkids@ child processes.  The parameter @mytag@ which appears in the
calling sequence of both @nxtag@ and @spawn@ must contain the @jobtag@
of the parent process.
The parameter @mytag@ must be stored in a global variable which has been
passed to the parent @subname@ either through the calling sequence or 
through named
common.  The calls to @nxtag@ and @spawn@ must be done together.  
The statement
.sp
         call \fBnxtag\fR(mytag,jdummy)
.sp
returns a @jobtag@ in the parameter @jdummy@ which has been assigned
by Schedule internally.  The data dependencies are automatically 
defined: the each spawned process will be forced to complete 
before the parent can finish.  The statement 
.sp
         call \fBspawn\fR(mytag,jdummy,\fBsublow\fR,<parms>)
.sp
is similar in nature to a call to @putq@ in the sense that
will invoke a call to subroutine @sublow@(<parms>).  
.PP
If this mechanism for dynamic spawning is used then the parent
@subname@ must not do any computations dependent upon completion of 
any of the @nkids@ invocations of the subroutine @sublow@ associated with
the spawning within the 200 loop.  Any processes dependent upon completion of
@subname@ will be assured that each of these @nkids@ spawned processes have
completed when @subname@ has completed.
.sp1
\fBSpawning with Multiple Entry Points\fR .
.PP
On many occasions one wishes return execution control to parent   
process after the child processes have completed and resume computations which 
are dependent upon the completion of these child processes.
This may be accomplished in the following way.
.sp2
.nf
.cs 1 24
      subroutine subname(<parms>)
      \fIdeclare global variables\fR
.sp
      \fIdeclare local variables\fR
.sp
      external \fBsublow\fR
c
c     jump to entry point L000 if integer function ientry
c     returns the value L 
c
      go to (1000,2000,...,N000),ientry(mytag,N)
 1000 continue
         .
         .
         .
      nkids = \fIthe number of child process to be spawned\fR
              @ this~number~may~be~determined~at~run~time@
      do 200 j = 1,nkids
         .
         .
         .
         call \fBnxtag\fR(mytag,jdummy)
         call \fBspawn\fR(mytag,jdummy,\fBsublow\fR,<parms>)
         .
         .
         .
  200 continue
.sp
      return
 2000 continue
         .
         .
         .
      return
 N000 continue
         .
         .
         .
      return
      end
.cs 1
.fi
.sp2
In this subroutine the labels 1000,2000,...,N000 should be considered
as entry points.   The 
entry points are taken in succession beginning with label 1000 and continuing
through label N000.  A @return@ statement must precede each of the 
labels in this list.  The subroutine @ientry(myid,N) @ returns the number
of times process @myid@ has been entered so that the label L000 will
be the entry point on the L -th time this process is entered.
The integer N must correspond to the total number of entry points on
this list.  It is important to understand that this process will be exited
at the @return@ just before each entry point.
Each time the subroutine is reentered execution begins at the first executable 
statement just as a normal subroutine, but when the 
computed \fIgo to\fR statement is reached
the function @ientry@ will force the computation to execute at the label
following the last return.
Upon completion of the @nkids@ spawned processes in the 200 loop computation
will resume at the entry point 2000 just as in the simple case.  
The user is responsible for restoring
the state of any local variables defined in subname at this point.  
In this method the subroutine @subname@ is 
exited after each round of spawning.  Another mechanism may 
be used to avoid this forced exit and also to specify a specific return 
label.  This mechanism uses the logical function @wait@ .
.sp2
.nf
.cs 1 24
      subroutine subname(<parms>)
      \fIdeclare global variables\fR
.sp
      \fIdeclare local variables\fR
.sp
      logical \fBwait\fR
      external \fBsublow\fR
      go to (1000,2000,...,N000),ientry(mytag,N)
 1000 continue
         .
         .
         .
      do 200 j = 1,nkids
         .
         .
         .
         call \fBnxtag\fR(mytag,jdummy)
         call \fBspawn\fR(mytag,jdummy,\fBsublow\fR,<parms>)
         .
         .
         .
  200 continue
.sp
      if (wait(mytag,2)) return
 2000 continue
         .
         .
         .
      return
 N000 continue
         .
         .
         .
      return
      end
.sp2
.cs 1
.fi
Of course @wait@ must be declared logical in @subname@ .   The arguments
to @wait@ are @mytag@ which has been set to the @jobtag@ of the particular
instance of @subname@ followed by an integer which has been set to
the number that ientry will return when @subname@ is reentered.  This 
mechanism sets a logical barrier which is not a busy wait.  The functionality
provides for exiting @subname@ only if a copy of @sublow@ has not yet 
completed.  
.sp2
\fBRecursive Spawn\fR      
To be supplied later.
.sp5
.NH
Low-Level Synchronization
.PP
Ideally,  the mechanism we have just described will relieve the user of 
explicitly invoking any synchronization primitives.  Unfortunately, 
some powerful parallel constructs are not so easily described by
this mechanism.  It may be desirable to have two processes executing 
simultaneously that are not truly data independent of each other.
A typical example is in pipelining a computation, that is, when several
parallel processes are writing on the same data in a specified order
which is coordinated through explicit synchronization.  To provide this
capability,
two low-level synchronization primitives have been made available within
SCHEDULE.  They are \f3lockon\f1
and \f3lockoff\f1.  Each takes an integer argument.  An example of 
usage is
.sp2
.nf
.ps 9
.cs 1 24
.vs 10
      call lockon(ilock)
           ilocal = indx
           indx = indx + 5
      call unlock(ilock)
.fi
.ps 12
.cs 1
.vs 14
.sp2
In this example a critical section has been placed around the act of getting
a local copy of the shared variable @indx@ and updating the value of @indx@.
If several concurrent processes are executing this code, then only one
of them will be able to occupy this critical section at any given time.
The variable @ilock@ must be a globally shared variable and it must be 
initialized before any usage as an argument to \f3lockon\f1 or \f3unlock\f1
by calling the routine \f3lckasn\f1.
In the above example the statement 
.sp
.nf
.ps 9
.cs 1 24
.vs 10
       call lckasn(ilock)
.fi
.ps 12
.cs 1
.vs 14
.sp
must execute exactly once and before any of the calls to \f3lockon\f1 are made.
After execution of this statement the lock variable @ilock@ has been declared
as a lock variable and has been initiated to be @off@.  
If there are low-level
data dependencies among any of the processes that will be scheduled, then
it will be necessary to enforce those data dependencies using locks. 
It is preferable to avoid using locks if possible.  However, in certain cases
such as pipelining, locks will be required.
mmmmm
.NH
Standard Tricks
.PP
In this section we discuss some standard graphs, performance
issues and frequent errors.
   A. Typical Graphs

       d and c
       fork join
       tsolve

\fBCommon Errors and How to Avoid Them\fR
.PP
Certainly the most common error is failing to declare a subroutine external 
before referencing it in a call to putq or spawn.  This will cause unusual
behavior which usually results in a bus error.  It is the first thing
one should check when such unexplained behaviour occurs.
.PP
Perhaps the most serious error is to pass an argument to static or spawned 
subroutine which is not global.  These errors are difficult to trace
and find.   The first clue of this type of error is to have a correctly running
program when only one physical processor is active which fails
when more than one processor is active.
.PP
Another error which gives similar behavior is to povide incorrect analysis of 
data dependencies.  This is when the user describes the data dependency graph 
incorrectly in the sense that two processes which read and write the same
data item are allowed to execute simultaneously.  Again the symptom
is a correctly running program on one processor that fails on
more than one processor.
.PP
A related error which is more easily found is the improper specification of 
the graph through gross programmingerror.  This will usually result
in a deadlock situation or in incorrect results even with only
one processor active.  An easy way to check for such error is to use the 
graphics facility described later.
.PP
Dynamic spawning is usually straightforward unless there are reentry
points.  When there are multiple entry points a typical error is
failing to restore state after spawning processes. For example
.sp2
         .
         .
         .
         call spawn(...)
         k = 5
      return
 2000 continue
.sp
      m = k*(k+1)
         .
         .
         .
.sp2
will give an incorrect result for @m@ if @k@ is a local variable in 
the subroutine that is spawning.  To make this correct the user
must re-initialize @k@ after entry point 2000.
.sp
\fBTransporting Users Code\fR
.sp
To be supplied later
.sp5
.NH
GRAPHICS TRACE FACILITY
.PP
Performance issues are vital to the effective use of computers.
Yet the current status of software environments for implementing parallel
algorithms 
is in a primitive state.
Consequently,
we felt compelled to develop a system (described in the
previous section) to allow transportable implementation of parallel algorithms.
We feel equally compelled to develop tools to aid in understanding
how our parallel implementation performs when run on a parallel processor.
.PP
Our goals here are simple: to be able to trace the flow of
execution within our application and to understand where bottlenecks
occur.
Monitoring the run time by using a timing routine has been
employed in the past. This approach, however, has a number of
limitations in the parallel setting.
We want an 
animation of run-time behavior,
visualizing the parallel parts in execution
as the application is running.
We would like to 
understand what performance issues arise during
execution and what bottlenecks develop, and to view where programming errors
cause a parallel programming to "hang."
.PP
To achieve these goals, we have developed graphics tools
that allow us to 
trace the flow of execution.
In addition, we have devised tools to
aid in 
constructiing programs using our SCHEDULE package.
.PP
The graphics interface that provides trace information has been
designed to use a Sun-3 workstation.
Our plan in the near term is to convert the trace program to
use X-windows, a portable windowing system.
.PP
We hope these tools can be used to
influence the design of an algorithm by
providing insight into how
the program behaves when run in parallel,
thus allowing use to tune the algorithm.
Additionally, we hope that these tools will
further our understanding of
the nature of the parallel algorithm and of parallel 
processing in general.
.sp
.NH
Tracing an Execution
.PP
To trace the flow of execution,
an algorithm is implemented 
using the SCHEDULE package,
linked with the trace version of the Schedule library, and run
on a parallel machine.
.PP
The execution produces an output file of trace
events in a file named \fItrace.graph\fR.
The information in the file is a trace of the user specification
of the computational graph and the execution of the graph
by the computer.
Each node in the graph represents a subroutine, and the
arcs represent the dependencies between the codes.
The trace file
records the definition of the graph along with the 
execution and completion of the subroutine.
When execution has completed,
this file is then moved to a Sun workstation where the
trace facility has been installed. (It is assumed that 
suntools are used.)
.PP
The information in the trace files has the following meaning from the
Scheduler.

.KS
.ce
.B
Trace File Items
.R

.nf
.TS
center;
l l l l l l.
0	id	# child	# parents	parent ids
1	id
2	id	timing info
3	owner id	id
4	owner id	id
5	owner id	id	timing info
.TE
.KE
.fi

Rather than write each event out as it occurs, the Scheduler buffers
up a set of trace events.
When a limit is reached, the next set of trace events is written
to the trace file.
.PP
Trace events begin with a digit that defines
the operation the Scheduler has performed.

.KS
.ce
.B
Definition of Events
.R
.TS
center;
l l.
0	create a static node
1	start a static node executing
2	stop a static node executing
3	create a dynamic node
4	start a dynamic node executing
5	stop a dynamic node executing
.TE
.KE
.NH
Replaying a Trace
.PP
After the program has completed execution, the trace file can be
viewed on the Sun workstation.
The trace file is moved to the Sun file system 
and given a name of the form
.I trace.xxxxx.
(If NFS is
used between the Sun and the parallel machine, the file can be
read directly.)
To start the trace program, one types \fIsched.trace\fR.
Figure 1 shows what will appear on the 
canvas on the Sun screen.
.KS
.sp 25
.ce 1
Figure 1: Information on the Sun Screen
.KE

.PP
A number of buttons and slide bars help in
controlling the information to be viewed.
After the canvas is on the screen, a trace file must be loaded.
The file that is currently referenced is displayed on the
second line next to the words
.B 
Trace file:.
.R
The file name can be changed by moving the mouse cursor to point to
.B 
Trace file:
.R
and clicking the left mouse button
through all the
trace files available for viewing.
Alternatively, one can move the mouse cursor to
.B 
Trace file:
.R
and hold down the right button;
this will result in the display of a popup menu showing all the
trace files to choose from.
To select a file, one moves the mouse cursor into the popup box while
still holding down the left button, and releases
the button
on the desired
file. 
.PP
After the desired file is selected, it must be loaded
into the system.
This is
accomplished by clicking the 
.B Load
button.
The count of
number of events loaded is displayed at the end of the slider bar marked
.B Events.
.PP
To start the execution, one clicks on the 
.B Go
button.
When the 
.B Go
button is clicked initially, the events are processed internally
until the first
executable event (one of the events that has as its leading digit 1-5)
is reached; then the data dependency graph is displayed
on the screen.
The nodes in the dependency graph will change their state as the trace proceeds.
There are four types of nodes in the graph: clear, lined, hashed,
and black.
The \fIclear\fR nodes represent subroutines waiting to have their 
data dependencies satisified before starting execution.
The \fIlined\fR nodes represent subroutines whose data dependencies
are satisfied, but whose execution has not started.
The \fIhashed\fR nodes represent subroutines that are currently executing.
The \fIblack\fR nodes represent subroutines that have finished execution.
The arcs represent the data dependenciess between subroutines.
The numbers associated with each node in the graph represent
the 
.I
jobtag 
.R
for that node as defined through the user calls
to 
.I dep
and
.I putq.
.PP
The viewing of events can be stopped or restarted 
by clicking on the 
.B Stop
or the 
.B Go
buttons at any time.
Execution begins at the bottom of the graph and works its way to the
top, with the top node the last thing to execute.
.PP
The 
.B Quit
button is used to terminate the trace facility.
.PP
The 
.B Full
button is used in conjunction with the 
.B Redraw
button to expand the canvas by a factor of four.
First the 
.B Full
button is clicked;
then the 
.B Redraw 
button clicked to redisplay the expanded canvas.
The 
.B
Full
.R
button is really a toggle.
Thus, to get the "normal" canvas size
displayed, one clicks the
.B Full
button, then the 
.B Redraw
button.
.PP
The next two buttons,
.B Path 
and
.B Cpath,
are used together after the execution has completed.
They are intended to display the critical path of execution in the
graph
(i.e., the path that is taking the most time to execute).
The critical path determines the running time of the entire application.
To use this feature, one clicks on the 
.B Cpath
button and then the 
.B Path
button.
.B Cpath 
is used to compute the critical paths in the graph and
.B Path
to display the path. *
.FS
* In the current release of the graphics program,
the critical path is chosen by the maximum number of nodes from the start to finish
in the graph. This should really be based on time, not number of nodes.
.FE
.PP
The 
.B Subtree
button is used to display a subtree of the graph.
To use this feature, one places the 
mouse cursor on a node.
The node number clicked on will be displayed in bottom
right line of the top panel.
To see the subtree,
one clicks the 
.B Subtree 
button.
This step can be repeated to zoom in on a particular part of the graph.
To redisplay the full graph, one clicks the
.B Redraw 
button.
.PP
The 
.B Step 
button is used to single-step through the execution of the trace
file.
A way to use this is by loading a trace file, starting execution,
and stopping execution at the desired point.
Then the 
.B Step
button is used to single-step through the execution.
The activity can be resumed at normal speeds by clicking the
.B Go
button.
.PP
The last button is 
.B Histogram.
When clicked, this will display a histogram of the events run versus
the number of processors active.

.NH
Building a Schedule Program

.NH
Futures
.PP
Use sockets to pass information between processor and
graphics.

Use X-windows to provide a portable environment.
.NH
Conclusions
.PP
Proving useful in understanding run-time behavior of our parallel
programs.
Allows us to see into the machine.
Simplifies the task, no longer required examine output.
.sp5
.ps 10
.NH
 Schedule Subroutines
.PP
This section contains a list of Schedule subroutines and a brief 
description of their functionality.
.sp4

\fBPUTQ\fR

.nf
Used to place descriptor of a statically defined unit of computation 
on an internal problem list
.fi
________________________________________________________________________________
.sp
\fBSyntax\fR
.sp
putq(jobtag,subname,<parameter list>)
.sp
\fBComments\fR
.sp
This subroutine is used to place the descriptor of the unit of 
computation indexed by @jobtag@ on the problem list.
Every Schedule program must have at least one call to @putq@.  Moreover,
a call to @putq@ must be preceded by a call to @dep@:  
.sp
            call dep(jobtag,icango,nchks,list)
            call putq(jobtag,subname,<parameter list>)
.sp
The parameter
list to the right of @subname@ should be a valid set of parameters to
make the call:  
.sp
      call subname(<parameter list>) .
.sp
This call will be made when the data dependencies associated with @jobtag@
have been satisfied.  The program calling putq must have @subname@ declared
external.
.sp
\fBCalling Sequence\fR
.sp
.nf
.cs 1 24
      jobtag - an integer which labels the unit of computation 
               this integer may be a local variable in the subroutine
         calling @putq@
.sp
      subname - the name of the Fortran subroutine to be called when
               data dependencies for the process indexed by @jobtag@
               have been satisfied.
.sp
      <parameter list> - a list of parameters for a call to @subname@.
                they must be global variables and there should not be
                more than 20 parameters. 
.cs 1
.fi
.sp
.sp
\fBRestrictions\fR
.sp
The number of parameters in <parameter list> must not exceed 20.
.sp
Each parameter passed in <parameter list > must be a location in 
shared memory.  
.sp
Constants and expressions should not 
be passed. 
.sp
Variables declared locally within the subroutine
making the call to @putq@ should not be passed in <parameter list>.
.sp
.bp
\fBDEP\fR 

.nf
defines the data dependencies for a unit of computation
.fi
________________________________________________________________________________
.sp
\fBSyntax\fR
.sp
dep(jobtag,icango,ncheks,list)       
.sp
\fBComments\fR
This subroutine puts data dependencies for process @jobtag@ on the problem
list.  A call to @dep@ is always associated with a call to @putq@ :
.sp
            call dep(jobtag,icango,nchks,list)
            call putq(jobtag,yyy,<parameter list>)
.sp
.sp
\fBCalling Sequence\fR
.nf
.cs 1 24
.sp
      jobtag - a positive integer indexing a unique unit of computation 
.sp
      icango - a positive integer specifying how many processes must check in
               to this process before it can be placed on the readyq.
.sp
      nchks  - the number of processes that depend upon the completion of 
               this process.
.sp
      list -   an integer array specifying the jobtags of the processes
               which depend upon completion of this process.
.cs 1 
.fi
.sp
\fBRestrictions\fR
.sp
The integer @jobtag@ must be between 1 and 1000.  
.sp
The integer @nchks@ must be between 0 and  25.  
.sp
The array @list@ must have @nchks@ entries
which are the @jobtag@s of dependent processes.
.bp
\fBNXTAG\fR 

.nf
Claims the next available @jobtag@ for dynamic spawning
.fi
________________________________________________________________________________
.sp
\fBSyntax\fR
.sp
nxtag(mypar,jobtag)
.sp
\fBComments\fR
This subroutine is used to claim the next available @jobtag@ for dynamic
spawning.  The parent process @mypar@ makes a call to @nxtag@ and receives
a valid @jobtag@ for a child process when @nxtag@ returns.  A call to @nxtag@
is always followed by a call to @spawn@:
.sp
      call nxtag(mypar,jobtag)
      call spawn(mypar,jobtag,subname,<parameter list>)
.sp
Data dependencies
are set automatically so that @mypar@ depends upon the completion of 
the child process.  The child @jobtag@ will have its @icango@ counter set
to 0.  
.sp
\fBCalling Sequence\fR
.nf
.cs 1 24
.sp
.sp
      mypar -  an integer containing the index of the parent process
               that will spawn a child process with index @jobtag@. 
.sp
      jobtag - an integer labeling a dynamically spawned unit of computation 
               which will be given the data dependency of a child of @mypar@.
.sp
.cs 1
.fi
\fBRestrictions\fR
.sp
The number of calls to @nxtag@ plus the number of calls to @dep@ may
not exceed 1000.
.bp
\fBWAIT\fR 

.nf
Used to set a logical barrier after dynamic spawning
.fi
________________________________________________________________________________
.sp
\fBSyntax\fR
.sp
wait(mytag,lablno)
.sp
\fBComments\fR
.sp
This logical function will be used to set a logical barrier in process @mytag@
after the dynamic spawning of a number of child processes.  The process
will be allowed to continue execution immediately after this barrier once
all of the spawned child processes have completed.  It should only be used
if child processes have been spawned by @mytag@ through the use of 
the subroutine @spawn@.  This routine must be used in conjunction with
integer function @ientry@.  The required usage is 
.sp2
.nf 
.cs 1 24
           go to (1000,...,L000,...,N000), ientry(mytag,N)
      1000 continue
             .	
             .	
             .	
           do 100 j = 1,nproc
                  .
                  . (set parameters to define spawned process)
                  .
              call nxtag(mytag,jobtag)
              call spawn(mytag,jobtag,subname,<parms>)
       100 continue
           label = L
           if (wait(mytag,label)) return
      L000 continue
             .
             .
             .
.cs 1
.fi
If this function  returns a value of .true. then the calling process
@mytag@ should issue a return.  
A return value .true. indicates that some spawned process has not yet completed 
and checked in.  When this happens the parent process @mytag@ must be exited
to avoid busy waiting and also to allow this code to execute on a 
single physical processor.  When all of the spawned processes have 
completed then
the parent process @mytag@ will be reentered, execute the code before the
computed \fIgo to\fR referencing @ientry@, and then resume execution at the 
statement immediately following the reference 
to @wait@ (ie. at L000 in the example above). 
A return value .false. indicates all spawned processes have 
checked in.
If a value of .false. is returned then the parent process simply 
continues to execute without exiting (ie. at L000 in the example above). 
.sp
\fBCalling Sequence\fR
.sp
.nf
.cs 1 24
      mytag - an integer containing the index of the parent process
              that has spawned at least one  child process
.sp
      label - an integer indicating where computation is to resume after
              all spawned child processes have completed.  If @wait@
              returns .true. then a return should be made and the
              next reentry of process @mytag@ the routine @ientry@
              will return the value @label@.
.sp
.cs 1
.fi
\fBRestrictions\fR
.sp
There must be an invocation of @ientry@ in the process @mytag@.
.bp
\fBSPAWN\fR 

.nf
Used to dynamically spawn processes during execution
.fi
________________________________________________________________________________
.sp
\fBSyntax\fR
.sp
spawn(mypar,jobtag,subname,<parameter list>)
.sp
\fBComments\fR
.sp
This subroutine puts the descriptor of a unit of computation labeled by
@jobtag@ onto the problem list.  This process will be scheduled for execution
immediately since it is a leaf to the parent node @mypar@ in the dynamic
data dependency graph.  Moreover, a call to @spawn@ must be preceded by a 
call to @nxtag@:  
.sp
              call nxtag(mytag,jobtag)
              call spawn(mytag,jobtag,subname,<parmeter list>)
.sp
The parameter list to the right of @subname@ should be a 
valid set of parameters to make the call:  
.sp
      call subname(<parameter list>) .
.sp
This call will be made when the data dependencies associated with @jobtag@
have been satisfied.  The subroutine calling @spawn@ must have @subname@ 
declared external.  The action of this procedure differs from @putq@ in that 
the user does not assign jobtags or data dependencies.  A parent may spawn 
any number of children (within limits given below) but these child processes 
only report to the parent.
.sp
\fBCalling Sequence\fR
.sp
.nf
.cs 1 24
     mypar -    an integer which labels the parent process which is 
                spawning @jobtag@ with a call to @spawn@.
.sp
     jobtag -   an integer which labels the unit of computation 
                this integer may be a local variable in the 
                subroutine calling @spawn@.  Its value must have 
                been assigned by Schedule through a call to @nxtag@.
.sp
     subname -  the name of the Fortran subroutine to be called when
                data dependencies for the process labeled by @jobtag@
                have been satisfied.
.sp
      <parameter list> - a list of parameters for a call to @subname@.
                they must be global variables and there should 
                not be more than 20 parameters. 
.cs 1
.fi
.sp
.sp
\fBRestrictions\fR
.sp
The number of parameters in <parameter list> must not exceed 20.
.sp
Each parameter passed in <parameter list > must be a location in 
shared memory.  
.sp
Constants and expressions should not be passed 
.sp
Variables declared locally within the subroutine
making the call to @putq@ should not be passed in <parameter list>.
.sp
.bp
\fBSCHED\fR 

.nf
Acquires physical processors and executes the Schedule program
.fi
________________________________________________________________________________
.sp
\fBSyntax\fR
.sp
sched(nprocs,static,<parameter list>)
.sp
\fBComments\fR
.sp
This subroutine initializes locks and queues internal to Schedule, acquires
@nprocs@ physical processors, executes @static(<parameter list>)@ and
then executes the Schedule program defined by @static@.  It is very important
to understand that the execution of @static@ merely describes a Schedule 
program by placing units of computation their data dependencies on a problem 
list.  This computation proceeds only after @static@ has completed.
The parameter list to the right of @static@ should be a valid set of parameters 
to make the call:  
.sp
      call static(<parameter list>) .
.sp
The program calling @sched@ must have @static@ declared external.  Control
is returned to the calling program from @sched@ once the entire computation
defined by @static@ has been completed and a single thread of execution
is resumed within the calling program.  There may be any number of calls
to @sched@ within the calling program but the user must be aware that
the cost of obtaining @nprocs@ physical processors is incurred with
each call to @sched@.  Note it is useful to do an initial debugging
phase with @nprocs@ set to 1.  Once the correctness of the program
is established with @nprocs = 1@ the remaining bugs are generally
associated with incorrect partitioning of shared data and incorrect
specification of data dependencies.
.sp
\fBCalling Sequence\fR
.sp
.nf
.cs 1 24
      nprocs - an integer which specifies the number of physical processors
               which are to be devoted to the execution of the Schedule
               program defined by @static@. 
.sp
      static - the name of the Fortran subroutine to be called to
               build the static data dependency graph and record the descriptors
               of the units of computation associated with each node of the graph.
.sp
      <parameter list> - a list of parameters for a call to @static@.
                they must be global variables and there should not be
                more than 20 parameters. 
.cs 1
.fi
.sp
.sp
\fBRestrictions\fR
.sp
The number of parameters in <parameter list> must not exceed 20.
.sp
Each parameter passed in <parameter list > must be a location in 
shared memory.  
.sp
Constants and expressions should not be passed 
.sp
Variables declared locally within the subroutine
making the call to @putq@ should not be passed in <parameter list>.
.sp
.bp
       lockon
       lockoff
.sp
.bp
\fBLOCKON, LOCKOFF\fR 

.nf
Low level synchronization primitives
.fi
________________________________________________________________________________
.sp
\fBSyntax\fR
.sp
lockon(lock)
lockoff(lock)
.sp
\fBComments\fR
.sp
Subroutine lockon takes an integer argument @lock@.
the process which calls @lockon@ spin-waits until the value of @lock@ is 
has the value 0. this is read and set to 1 in an autonomous operation.
Subroutine lockoff sets the value of @lock@ to 0.
Caution must be exercised in using locks.  In particular 
one must be careful to initialize properly.