|
An Example - Matrix multiplication
Like it's namesake, HPF (High Performance Fortran), HPJava provides
some special syntax to define logical groupings within the set of
processors executing a program, and for declaring and manipulating
distributed arrays. These are special arrays whose elements are
distributed over a group of processors, rather than all living in the
memory of a single processor.
Here is an example of an HPJava program for multiplying together
two matrices, implemented as distributed arrays:
Procs2 p = new Procs2(P, P) ;
on(p) {
Range x = new BlockRange(N, p.dim(0)) ;
Range y = new BlockRange(N, p.dim(1)) ;
float [[-,-]] c = new float [[x, y]] ;
float [[-,*]] a = new float [[x, N]] ;
float [[*,-]] b = new float [[N, y]] ;
... initialize `a', `b'
overall(i = x for :)
overall(j = y for :) {
float sum = 0 ;
for(int k = 0 ; k < N ; k++)
sum += a [i, k] * b [k, j] ;
c [i, j] = sum ;
}
}
The code looks relatively simple (short, at least) but it is a
fairly sophisticated parallel program. It
already incorporates much of the special syntax of HPJava, so once this
example is clear you understand a large part of what HPJava is about.
A process group object p is created, representing a
2-dimensional, P by P grid of processes.
Generally the P * P processes will be some subset of the
processes in which the program was first started. The remainder of
this program is executed only by processes belonging to the subset.
This is specified through the on(p) construct.
Distributed arrays a, b and
c are declared
through special syntax, using double brackets. This distinguishes them
from the sequential arrays of ordinary Java. HPJava is a superset of
ordinary Java, so ordinary Java arrays are available in HPJava.
But there are several important differences between standard Java arrays
and distributed arrays. Rather than try to force two incompatible
models of an array into the same mold, HPJava accepts the distinction
pragmatically, and keeps the idea of a distributed array strictly separate
from the standard Java idea of a sequential array.
Whereas all the elements of an ordinary Java array reside in the memory
of the Java Virtual Machine that creates the array, the elements of
a distributed array are divided across the memories of the processes
that collectively create the array. Distributed arrays may be
multidimensional. In general their shape and distribution is specified
in the constructor using a list of distributed range objects.
These are members the Range class or its subclasses.
The 2-dimensional array
c has ranges x and y. (Other
languages - for example Ada - provide special syntax for describing ranges
of integers. These ranges can be used for declaring arrays and parametrizing
for loops. HPJava elevates its range entities to full object
status, and endows them with a distribution format.) The range
x represents a mapping of the index set 0 ... N - 1
into the first dimension of the process grid p
(this dimension is denoted by the expression p.dim(0)).
The range y is similar, but the interval is distributed
over the second dimension of p.
The choice of subclass BlockRange
for these ranges implies that the intervals are divided across the
process grid dimensions using block format.
This is the most obvious distribution strategy, in which
a consecutive blocks of the index space are allocated to
consecutive processes. The preferred block size is determined by dividing the
array size by the process dimension size, rounding up as necessary.
Various other distribution formats are supported through different
subclasses of Range. In general a multidimensional array
can have ranges with an arbitrary mix of these distribution formats.
The figure shows the distribution of the elements of c over
the 4 processors of p if P = 2 and
N = 5.
The arrays a and b are the same shape
as c - they are N by N
matrices - but their distribution patterns are different.
The second dimension of a and the first dimension
of b are sequential, or collapsed,
dimensions.
The type signature of an array with a sequential dimension includes
an asterisk in the associated slot, and a simple integer extent is used
in the same slot of the constructor, instead of a range object.
The layout of the elements of a looks like:
The first dimension of a is distributed over the ``vertical''
process dimension. But, because the second dimension of a is
sequential, every processor contains complete rows of the matrix.
Notice that each column of processes contains a replica of the whole
array. Because a has no dimension (range) distributed
over the second dimension of p, the whole array is
replicated across that process dimension. Each column of processes
has a copy of the whole array.
The distribution of b is complimentary:
Now every row of processes has a copy of the whole array.
Processes contain whole columns of the matrix, and every row of processes
contains a replica of the complete array. (The code that initializes
the elements of a and b should respect this
replication pattern, and enter identical values in each copy of the
elements of the arrays. The language definition doesn't insist on or
require this consistency of copies of replicated arrays, but it is usually
a good programming practice.)
Because we have carefully selected the distribution patterns for the
matrices, the parallel implementation of the matrix multiplication is
very straightforward. All the operands for the computation of
c[0,0], say, (namely a[0,0], a[0,1],
a[0,2], a[0,3], a[0,4],
b[0,0], b[1,0], b[2,0],
b[3,0], and b[4,0])
are already resident on the processor that holds
that element, so no communication is needed to compute the element
(or any other).
The overall construct is a loop
parametrized by a range object. Because a Range generally
represents a distributed range of integers, an overall
is generally a parallel, distributed loop. In the specific example
illustrated above the body of the construct
overall(i = x for :) {
...
}
is executed 3 times by processes in the first row of processes and
twice by processes in the second row. In each iteration the index symbol
i takes on a value that selects a particular, locally
held ``element'' of the range x. Note the symbol
i is not an integer value. Elements of distributed
ranges are special entities called locations. Locations can be
used as subscripts to distributed arrays - in fact, with one exception
discussed later, subscripts in distributed dimensions of arrays
must be index symbols, bound to locations in the array range.
In this way subscripting operations on distributed arrays are restricted
to ensure all direct array accesses involve locally held elements.
The implementation of the matrix multiplication should now be reasonably
transparent. A pair of nested parallel loops iterate over the locally
held elements of c. An inner sequential loop accumulates
the inner product of the relevant row of a with the
relevant column of b. Because the second
dimension of a and the first dimension of b
were explicitly declared
to be sequential, the compiler allows them to be subscripted
unrestrictedly with integer expressions. The variable sum is
simply a local scalar variable.
Next:
Process groups
|