
Case Study 2: Red-Black Relaxation

Red-black relaxation is an old favorite. It is still interesting because it is a kernel in some practical solvers (for example, we have an HPJava version of a multigrid solver in which relaxation is the dominant, time-consuming part). It also conveniently exemplifies a whole family of similar, local, grid-based algorithms and simulations.

Figure 4: Red-black iteration.
Procs2 p = new Procs2(2, 3) ;
on(p) {
  Range x = new ExtBlockRange(N, p.dim(0)) ;
  Range y = new ExtBlockRange(N, p.dim(1)) ;

  double [[-,-]] a = new double [[x, y]] ;

  ... initialization for `a'

  for(int iter=0; iter<count; iter++){
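    // refresh the ghost regions of `a'; wlo, whi give the requested halo widths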

    Adlib.writeHalo(a, wlo, whi);

    overall(i=x for 1 : N - 2) 
      overall(j=y for 1+(i`+iter)%2 : N-2 : 2) {
        a[i,j] = 0.25F * (a [i-1,j] + a [i+1,j] +
                          a [i,j-1] + a [i,j+1]);
      }
  }
}

We can see an HPJava version of red-black relaxation of the two-dimensional Laplace equation in Figure 4. Here we use a different class of distributed range. The class ExtBlockRange adds ghost regions [4] to the distributed arrays that use it. A library function called Adlib.writeHalo updates the cached values in the ghost regions with the proper element values from neighboring processes.
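For comparison, the following is a minimal plain-Java sketch of the same red-black update on an ordinary, undistributed array (an illustration of the algorithm only, not the sequential Java code benchmarked in Table 2). The start column 1 + (i + iter) % 2 and the stride of 2 give the checkerboard coloring, with ``red'' and ``black'' points updated on alternate sweeps, just as in the overall triplet of Figure 4.

  // Sequential red-black relaxation on an n x n grid (illustration only).
  static void relax(double [][] a, int count) {
    int n = a.length ;
    for (int iter = 0 ; iter < count ; iter++) {
      for (int i = 1 ; i < n - 1 ; i++) {
        // Starting at column 1 or 2 means only points with
        // (i + j) % 2 == (1 + iter) % 2 are updated in this sweep.
        for (int j = 1 + (i + iter) % 2 ; j < n - 1 ; j += 2) {
          a[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] +
                            a[i][j - 1] + a[i][j + 1]) ;
        }
      }
    }
  }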

There are a few additional pieces of syntax here. The range of iteration of the overall construct can be restricted by adding a general triplet after the for keyword. The i` is read ``i-primed'', and yields the integer global index value for the distributed loop (i itself does not have a numeric value; it is a symbolic subscript).

Figure 5: Naive translation of main loop from Figure 4.


Finally, if the array ranges have ghost regions, the general policy that an array subscript must be a simple distributed index is relaxed slightly: a subscript can be a shifted index, as in Figure 4. The value of the numeric shift, symbolically added to or subtracted from the index, must not exceed the width of the ghost regions, and the index that is shifted must be a location in the distributed range of the array, as before.

Figure 5 gives the translation of the ``loop nest'' from Figure 4. Here the subscripting expressions are particularly complex, so strength reduction is likely to be very effective. There is a relatively expensive method call:

    y.localBlock(1 + (i` + iter) % 2, N - 2, 2)

used in the translation of the overall construct:

    overall(j = y for 1 + (i` + iter) % 2 : N - 2 : 2)

This call is repeated in every iteration of the loop associated with the outer overall. Because red-black iteration is a common pattern, the translator should probably try to recognize this idiom. If a nested overall includes expressions of the form

    (i` + expr) % 2

where i` is the global index of the outer overall, and expr is invariant with respect to the outer loop, this should be taken as an indication to unroll the outer loop by 2. The arguments of the call to localBlock() then become loop invariant, and the whole invocation is a candidate to be lifted out of the outer loop (of course this depends on some special knowledge of the method localBlock()). In Table 2 this is what we mean by ``loop unrolling''.
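To make the transformation concrete, here is a schematic plain-Java sketch of the unrolled loop structure, again on an ordinary undistributed array. It is not the translator's actual output: in the generated node code the quantity hoisted out of the outer loop is the result of the localBlock() call, for which the fixed start columns below merely stand in. The sketch also assumes the number of interior rows, N - 2, is even, so no clean-up row is needed.

  for (int iter = 0 ; iter < count ; iter++) {
    // After unrolling by 2, the start column for each copy of the body is
    // fixed, so the corresponding setup (localBlock() in the real
    // translation) can be hoisted to this point.
    int startOdd  = 1 + (1 + iter) % 2 ;   // rows with odd global index
    int startEven = 1 + iter % 2 ;         // rows with even global index

    for (int i = 1 ; i <= N - 3 ; i += 2) {           // outer loop unrolled by 2
      for (int j = startOdd ; j <= N - 2 ; j += 2)    // row i (odd)
        a[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]) ;
      for (int j = startEven ; j <= N - 2 ; j += 2)   // row i + 1 (even)
        a[i+1][j] = 0.25 * (a[i][j] + a[i+2][j] + a[i+1][j-1] + a[i+1][j+1]) ;
    }
  }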


Table 2: Red-black relaxation performance. HPJava code for a single processor; other codes sequential. All speeds in MFLOPS.

            Naive   Strength Reduction   Loop unrolling
  HPJava    113.5   168.1                212.0
  Java      238.5   239.6                242.0
  C++       248.0   249.3                248.0
  F77       246.8   246.8                171.6


The table shows that in this more complex example the relative performance of the optimized HPJava node code is actually somewhat better than in the previous test case: about 85% of straightforward C++ or Fortran. Perhaps the reference codes could be improved, but note that whenever we quote C++ or Fortran performance we are using the highest optimization level supported by the compiler (-O5). Because we are only interested in the performance of the generated node code, the HPJava timings omit the call to Adlib.writeHalo().

