

More Distribution Formats

HPJava provides additional distribution formats for the dimensions of its multiarrays, without requiring any further extension of the syntax. Instead, the Range class hierarchy is extended. The full HPJava Range hierarchy is illustrated in Figure 3.8.
Figure 3.8: The HPJava Range hierarchy.
\includegraphics[width=2.1in]{Figures/hpjava-range}
In the previous section we have already seen the BlockRange distribution format and objects of class Dimension. The Dimension class is familiar, although it was not previously presented as a range class. CyclicRange corresponds to the cyclic distribution format available in HPF. Cyclic distributions can result in better load balancing than simple block distributions. For example, dense matrix algorithms do not have the locality that favors the block distributions used in stencil updates, but they do involve phases in which parallel computation is restricted to subsections of the whole array. In block distribution format these subsections may map to only a fraction of the available processes, leading to a very poor distribution of workload. Here is an artificial example:

  Procs2 p = new Procs2(P, P) ;

  on(p) {
    Range x = new BlockRange(N, p.dim(0)) ;
    Range y = new BlockRange(N, p.dim(1)) ;

    double [[-,-]] a = new double [[x, y]] on p ;

    overall(i = x for 0 : N / 2 - 1)
      overall(j = y for 0 : N / 2 - 1)
        a [i, j] = complicatedFunction(i`, j`) ;
  }

The point is that the overall constructs only traverse half the ranges of the array they process. As shown in Figure 3.9, this leads to a very poor distribution of workload. The process with coordinates $ (0,0)$ does nearly all the work. The process at $ (0,1)$ has a few elements to work on, and all the other processes are idle.
Figure 3.9: Work distribution for block distribution.
\includegraphics[width=3.5in]{Figures/blockBalance}
Figure 3.10: Work distribution for cyclic distribution.
\includegraphics[width=3.5in]{Figures/cyclicBalance}
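The imbalance can be greatly reduced by distributing both dimensions cyclically. As a minimal sketch (reusing the grid p and extent N from the example above, and assuming CyclicRange accepts the same constructor arguments as BlockRange), only the range constructors change:

    Range x = new CyclicRange(N, p.dim(0)) ;
    Range y = new CyclicRange(N, p.dim(1)) ;

    double [[-,-]] a = new double [[x, y]] on p ;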
In cyclic distribution format, the index space is mapped to the process dimension in round-robin fashion. Simply by changing the range constructor from BlockRange to CyclicRange, we obtain a much better distribution of workload. Figure 3.10 shows that the imbalance is not nearly as extreme. Notice that nothing changed in the program apart from the choice of range constructors. This is an attractive feature that HPJava shares with HPF: if HPJava programs are written in a sufficiently pure data-parallel style (using overall loops and collective array operations), it is often possible to change the distribution format of arrays dramatically while leaving much of the code that processes them unchanged.

The ExtBlockRange distribution describes a BlockRange distribution extended with ghost regions. A ghost region is extra space ``around the edges'' of the locally held block of multiarray elements. These extra locations can cache some of the element values properly belonging to adjacent processes. With ghost regions, the inner loop of algorithms for stencil updates can be written in a simple way, because the edges of the block do not need special treatment when accessing neighboring elements: shifted indices can locate the proper values cached in the ghost region. This is a very important feature in real codes, so HPJava has a specific extension to make it possible. For ghost regions, we relax the rule that the subscript of a multiarray must be a distributed index. As special syntax, the following expression

   name $\pm$ expression

is a legal subscript if name is a distributed index declared in an overall or at control construct and expression is an integer expression, generally a small constant. This is called a shifted index. The significance of a shifted index is that an element displaced from the original location will be accessed. If the shift puts the location outside the local block plus its surrounding ghost region, an exception will occur.

Ghost regions are not magic. The values cached around the edges of a local block can only be made consistent with the values held in blocks on adjacent processes by a suitable communication. A library function called writeHalo updates the cached values in the ghost regions with the proper element values from neighboring processes. Figure 3.11 is a version of the solution of the Laplace equation by red-black relaxation, using ghost regions. The last two arguments of the ExtBlockRange constructor define the widths of the ghost regions added to the bottom and top (respectively) of the local block.
Figure 3.11: Solution of the Laplace equation using red-black relaxation with ghost regions, in HPJava.
  Procs2 p = new Procs2(P, P) ;

  on(p) {
    Range x = new ExtBlockRange(N, p.dim(0), 1, 1);
    Range y = new ExtBlockRange(N, p.dim(1), 1, 1);

    double [[-,-]] a = new double [[x, y]] on p ;

    // Initialize `a': set boundary values.    
    overall(i = x for :) 
      overall(j = y for : )
        if(i` == 0 || i` == N - 1 || j` == 0 || j` == N - 1)
          a [i, j] = i` * i` - j` * j`;
        else
          a [i, j] = 0.0;
    
    // Main loop.  (`iter' is assumed to be the index of an enclosing
    // iteration loop, not shown in this figure.)
    Adlib.writeHalo(a) ;

    overall(i = x for 1 : N - 2)
      overall(j = y for 1 + (i` + iter) % 2 : N - 2 : 2) {
        a [i, j] = 0.25 * (a [i - 1, j] + a [i + 1, j] +
                           a [i, j - 1] + a [i, j + 1]) ;
      }
  }
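To make the shifted-index rule above concrete, here is a hedged sketch reusing the declarations of Figure 3.11, where the ghost regions have width 1. A shift of 1 always stays inside the local block plus its ghost region; a shift of 2 would exceed the declared ghost width and, near the edges of a local block, would cause a run-time exception:

    overall(i = x for 1 : N - 2)
      overall(j = y for 1 : N - 2) {
        double ok = a [i - 1, j] ;   // shift of 1: within the ghost region

        // A shift of 2 exceeds the ghost width declared above, so near the
        // edges of a local block the access would fail at run time:
        // double bad = a [i - 2, j] ;
      }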
The CollapsedRange is a range that is not distributed, i.e. all elements of the range are mapped to a single process. The example

  Procs1 q = new Procs1(P) ;

  on(q) {
    Range x = new CollapsedRange(N) ;
    Range y = new BlockRange(N, q.dim(0)) ;

    double [[-,-]] a = new double [[x, y]] on q ;
    ...
  }

creates an array in which the second dimension is distributed over processes in q, with the first dimension collapsed. Figure 3.12 illustrates the situation for the case N = 8.
Figure 3.12: A two-dimensional array, a, distributed over the one-dimensional grid q.
\includegraphics[width=3.5in]{Figures/2d-array-distribution-over-1d-array}
However, if an array is declared in this way with a collapsed range, one still has the awkward restriction that a subscript must be a distributed index; one cannot exploit the implicit locality of a collapsed range, because the compiler treats CollapsedRange as just another distributed range class. To resolve this common problem, HPJava has a language extension that adds the idea of sequential dimensions. If the type signature of a distributed array has the asterisk symbol, *, in a specific dimension, that dimension is implicitly collapsed, and it can be subscripted with an ordinary integer expression, just like a dimension of an ordinary multiarray. For example:

  Procs1 p = new Procs1(P) ;

  on(p) {
    Range y = new BlockRange(N, p.dim(0)) ;

    double [[*,-]] a = new double [[N, y]] on p ;
    ...
    at(j = y [1]) a [6, j] = a [1, j] ;
  }
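The sequential first dimension can equally well be traversed by an ordinary Java for loop, while the distributed second dimension is still handled by an overall construct. A minimal sketch, reusing p, y, N and a from the example above:

    overall(j = y for :)
      for(int i = 0 ; i < N ; i++)
        a [i, j] = i ;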

In the assignment a [6, j] = a [1, j] above, an at construct is retained to deal with the distributed dimension, but there is no need for distributed indices in the sequential dimension. The array constructor is passed integer extent expressions for sequential dimensions. All operations normally applicable to distributed arrays are also applicable to arrays with sequential dimensions. As we saw in the previous section, it is also possible for all dimensions to be sequential, in which case we recover sequential multidimensional arrays. Now suppose that, on a two-dimensional process grid, a one-dimensional distributed array with BlockRange distribution format is created, with the array dimension distributed over the first dimension of the process grid and no array dimension distributed over the second:

  Procs2 p = new Procs2(P, P) ;

  on(p) {
    Range x = new BlockRange(N, p.dim(0)) ;

    double [[-]] b = new double [[x]] on p ;
    ...
  }

The interpretation is that the array b is replicated over the second process dimension. Independent copies of the whole array are created at each coordinate where replication occurs. Replication and collapsing can both occur in a single array. For example,

  Procs2 p = new Procs2(P, P) ;

  on(p) {
    double [[*]] c = new double [[N]] on p ;
    ...
  }

The range of c is sequential, and the array is replicated over both dimensions of p.
Figure 3.13: A direct matrix multiplication in HPJava.
  Procs2 p = new Procs2(P, P) ;
  on(p) {
    Range x = new BlockRange(N, p.dim(0)) ;
    Range y = new BlockRange(N, p.dim(1)) ;

    double [[-,-]] c = new double [[x, y]] on p ;
    double [[-,*]] a = new double [[x, N]] on p ;
    double [[*,-]] b = new double [[N, y]] on p ;

    ... initialize `a', `b'

    overall(i = x for :)
      overall(j = y for :) {
        double sum = 0 ;
        for(int k = 0 ; k < N; k++) sum += a [i, k] * b [k, j] ;
        c [i, j] = sum ;
      }
  }
Figure 3.14: Distribution of array elements in example of Figure 3.13. Array a is replicated in every column of processes, array b is replicated in every row.
\includegraphics[width=3in]{Figures/directMatDist}
A simple and potentially efficient implementation of matrix multiplication can be given if the operand arrays have carefully chosen replicated/collapsed distributions. The program is given in Figure 3.13. As illustrated in Figure 3.14, the rows of a are replicated in the process dimension associated with y. Similarly, the columns of b are replicated in the dimension associated with x. Hence all arguments for the inner scalar product are already in place for the computation; no communication is needed.

We would be very lucky to come across three arrays with such a special alignment relation (distribution format relative to one another). There is an important function in the Adlib communication library called remap, which takes a pair of arrays as arguments. These must have the same shape and type, but they can have unrelated distribution formats. The elements of the source array are copied to the destination array. In particular, if the destination array has a replicated mapping, the values in the source array are broadcast appropriately.
Figure 3.15: A general matrix multiplication in HPJava.
  void matmul(double [[,]] c, double [[,]] a, double [[,]] b) {
    Group p = c.grp() ;
    Range x = c.rng(0), y = c.rng(1) ;
    int   N = a.rng(1).size() ; 

    double [[-,*]] ta = new double [[x, N]] on p ;
    double [[*,-]] tb = new double [[N, y]] on p ;

    Adlib.remap(ta, a) ;
    Adlib.remap(tb, b) ;

    on(p)
      overall(i = x for :)
        overall(j = y for :) {
          double sum = 0 ;
          for(int k = 0 ; k < N ; k++) sum += ta [i, k] * tb [k, j] ;
          c [i, j] = sum ;
        }
  }
Figure 3.15 shows how we can use remap to adapt the program in Figure 3.13 and create a general-purpose matrix multiplication routine. Besides the remap function, this example introduces the two inquiry methods grp() and rng(), which are defined for any multiarray. The inquiry grp() returns the distribution group of the array, and the inquiry rng($ r$) returns the $ r$th range of the array. The argument $ r$ is in the range $ 0, \ldots, R - 1$, where $ R$ is the rank (dimensionality) of the array.

Figure 3.15 also introduces the most general form of the multiarray constructor. So far we have seen arrays distributed over the whole of the active process group, as defined by an enclosing on construct. In general, an on clause attached to the array constructor itself can specify that the array is distributed over some subset of the active group. This allows one to create an array outside the on construct that will process its elements.
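A minimal sketch of this last point (assuming the usual grid p and block-distributed ranges x and y) constructs the array before entering the on construct, with the on clause of the constructor naming p as its distribution group:

    Procs2 p = new Procs2(P, P) ;

    Range x = new BlockRange(N, p.dim(0)) ;
    Range y = new BlockRange(N, p.dim(1)) ;

    // Constructed outside any on(p) construct, but distributed over p.
    double [[-,-]] a = new double [[x, y]] on p ;

    on(p)
      overall(i = x for :)
        overall(j = y for :)
          a [i, j] = 0.0 ;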