
An Example - Matrix multiplication

Like its namesake, HPF (High Performance Fortran), HPJava provides special syntax for defining logical groupings within the set of processors executing a program, and for declaring and manipulating distributed arrays. These are special arrays whose elements are distributed over a group of processors, rather than all living in the memory of a single processor.

Here is an example of an HPJava program for multiplying together two matrices, implemented as distributed arrays:

  Procs2 p = new Procs2(P, P) ;
  on(p) {
    Range x = new BlockRange(N, p.dim(0)) ;
    Range y = new BlockRange(N, p.dim(1)) ;

    float [[-,-]]  c = new float [[x, y]] ;

    float [[-,*]] a = new float [[x, N]] ;
    float [[*,-]] b = new float [[N, y]] ;

    ... initialize `a', `b'

    overall(i = x for :)
      overall(j = y for :) {

        float sum = 0 ;
        for(int k = 0 ; k < N ; k++)
          sum += a [i, k] * b [k, j] ;

        c [i, j] = sum ;
      }
  }
The code looks relatively simple (short, at least) but it is a fairly sophisticated parallel program. It already incorporates much of the special syntax of HPJava, so once this example is clear you understand a large part of what HPJava is about.

A process group object p is created, representing a 2-dimensional, P by P grid of processes. Generally the P * P processes will be some subset of the processes in which the program was first started. The remainder of this program is executed only by processes belonging to the subset. This is specified through the on(p) construct.

Distributed arrays a, b and c are declared through special syntax, using double brackets. This distinguishes them from the sequential arrays of ordinary Java. HPJava is a superset of ordinary Java, so ordinary Java arrays are available in HPJava. But there are several important differences between standard Java arrays and distributed arrays. Rather than try to force two incompatible models of an array into the same mold, HPJava accepts the distinction pragmatically, and keeps the idea of a distributed array strictly separate from the standard Java idea of a sequential array.
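To make the distinction concrete, here is a small sketch (not part of the example above) contrasting an ordinary Java array with a one-dimensional distributed array, assuming a range x and group p set up as in the program:

  float [] u = new float [N] ;          // ordinary Java array: all N elements live
                                        // in the memory of a single JVM

  float [[-]] v = new float [[x]] ;     // distributed array: elements spread over the
                                        // process dimension that the range x maps to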

Whereas all the elements of an ordinary Java array reside in the memory of the Java Virtual Machine that creates the array, the elements of a distributed array are divided across the memories of the processes that collectively create the array. Distributed arrays may be multidimensional. In general their shape and distribution are specified in the constructor using a list of distributed range objects. These are members of the Range class or its subclasses. The 2-dimensional array c has ranges x and y. (Other languages - for example Ada - provide special syntax for describing ranges of integers. These ranges can be used for declaring arrays and parametrizing for loops. HPJava elevates its range entities to full object status, and endows them with a distribution format.) The range x represents a mapping of the index set 0 ... N - 1 into the first dimension of the process grid p (this dimension is denoted by the expression p.dim(0)). The range y is similar, but the interval is distributed over the second dimension of p.

The choice of subclass BlockRange for these ranges implies that the intervals are divided across the process grid dimensions using block format. This is the most obvious distribution strategy, in which consecutive blocks of the index space are allocated to consecutive processes. The preferred block size is determined by dividing the array size by the process dimension size, rounding up as necessary. Various other distribution formats are supported through different subclasses of Range. In general a multidimensional array can have ranges with an arbitrary mix of these distribution formats.

The figure shows the distribution of the elements of c over the 4 processors of p if P = 2 and N = 5.
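To make this concrete, a few lines of ordinary Java (which is also legal HPJava, since HPJava is a superset of Java) print the grid coordinates of the process owning each element of c under the block-size rule just described, for P = 2 and N = 5:

  int P = 2, N = 5 ;
  int blockSize = (N + P - 1) / P ;                  // ceiling of N / P, here 3
  for(int i = 0 ; i < N ; i++) {
    for(int j = 0 ; j < N ; j++)
      System.out.print("(" + i / blockSize + "," + j / blockSize + ") ") ;
    System.out.println() ;
  }

Rows 0 to 2 and columns 0 to 2 map to coordinate 0 of the respective process dimension; rows 3 and 4 and columns 3 and 4 map to coordinate 1.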

The arrays a and b are the same shape as c - they are N by N matrices - but their distribution patterns are different. The second dimension of a and the first dimension of b are sequential, or collapsed, dimensions. The type signature of an array with a sequential dimension includes an asterisk in the associated slot, and a simple integer extent is used in the same slot of the constructor, instead of a range object. The layout of the elements of a looks like:

The first dimension of a is distributed over the ``vertical'' process dimension. But, because the second dimension of a is sequential, every process holds complete rows of the matrix. And because a has no range distributed over the second dimension of p, the whole array is replicated across that process dimension: each column of processes holds a copy of the entire array.
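A similar fragment (reusing P, N and blockSize from the sketch above) spells out this replication:

  for(int r = 0 ; r < P ; r++) {
    int lo = r * blockSize, hi = Math.min(N, lo + blockSize) ;
    System.out.println("process row " + r + " (every one of the " + P
                       + " process columns) holds complete rows " + lo
                       + " to " + (hi - 1) + " of a") ;
  }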

The distribution of b is complementary:

Now processes hold whole columns of the matrix, and every row of processes holds a replica of the complete array. (The code that initializes the elements of a and b should respect this replication pattern, entering identical values into each copy of the replicated elements. The language definition does not enforce this consistency between copies of replicated arrays, but it is usually good programming practice.)
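For illustration only (the actual initialization is elided in the program above), one way to honour the replication is to write values that depend only on the sequential subscript k, so every copy of a given element automatically receives the same number:

  overall(i = x for :)
    for(int k = 0 ; k < N ; k++)
      a [i, k] = k ;                   // depends only on k: identical in every replica

  overall(j = y for :)
    for(int k = 0 ; k < N ; k++)
      b [k, j] = k ;                   // likewise identical in every replica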

Because we have carefully selected the distribution patterns for the matrices, the parallel implementation of the matrix multiplication is very straightforward. All the operands for the computation of, say, c[0,0] (namely a[0,0], a[0,1], a[0,2], a[0,3], a[0,4], b[0,0], b[1,0], b[2,0], b[3,0], and b[4,0]) are already resident on the processor that holds that element, so no communication is needed to compute it (or any other element).

The overall construct is a loop parametrized by a range object. Because a Range object generally represents a distributed range of integers, an overall construct is generally a parallel, distributed loop. In the example above, the body of the construct

 
    overall(i = x for :) {
      ...
    }
is executed three times by processes in the first row of the process grid and twice by processes in the second row (for the case N = 5, P = 2 illustrated above). In each iteration the index symbol i takes on a value that selects a particular, locally held ``element'' of the range x. Note that the symbol i is not an integer value. Elements of distributed ranges are special entities called locations. Locations can be used as subscripts to distributed arrays - in fact, with one exception discussed later, subscripts in distributed dimensions of arrays must be index symbols bound to locations in the corresponding array range. In this way subscripting operations on distributed arrays are restricted so that all direct array accesses involve locally held elements.
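As a small illustration of the rule (not part of the original program): the distributed first dimension of a must be subscripted with an index symbol bound to its range x, while the collapsed second dimension accepts any integer expression:

  overall(i = x for :)
    a [i, 0] = 1.0F ;        // legal: i is a location in x, the range of a's
                             // distributed first dimension; the collapsed second
                             // dimension takes an ordinary integer subscript

  // a [0, 0] = 1.0F ;       // not allowed: an integer expression cannot directly
                             // subscript the distributed first dimension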

The implementation of the matrix multiplication should now be reasonably transparent. A pair of nested parallel loops iterate over the locally held elements of c. An inner sequential loop accumulates the inner product of the relevant row of a with the relevant column of b. Because the second dimension of a and the first dimension of b were explicitly declared to be sequential, the compiler allows them to be subscripted unrestrictedly with integer expressions. The variable sum is simply a local scalar variable.
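For comparison, here is the computation the parallel loops perform collectively, written as ordinary sequential Java (with a, b and c taken here to be plain float [N][N] arrays rather than distributed arrays):

  for(int i = 0 ; i < N ; i++)
    for(int j = 0 ; j < N ; j++) {
      float sum = 0 ;
      for(int k = 0 ; k < N ; k++)
        sum += a [i][k] * b [k][j] ;
      c [i][j] = sum ;
    }

In the HPJava version the i and j iterations are divided across the rows and columns of the process grid, while the k loop runs sequentially within each process, exactly as here.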

Next: Process groups


Bryan Carpenter, (dbc@ecs.soton.ac.uk). Last updated May 2007.