

Case Study: Laplace Equation Using Red-Black Relaxation

Next, we apply PRE, SR, and DCE to the Laplace equation program using red-black relaxation from Figure 3.11.
Figure 5.9: The innermost overall loop of naively translated Laplace equation using red-black relaxation
Block __$t42 = y.localBlock (1 + (__$t40 + iter) % 2, N - 2, 2) ;
int   __$t43 = y.str () ; Dimension __$t44 = y.dim () ;
APGGroup  __$t45 = __$t38.restrict (__$t44) ;

int __$t47 = __$t42.glb_bas, __$t48 = __$t42.sub_bas ;

for (int __$t46 = 0 ; __$t46 < __$t42.count ; __$t46 ++) {
    a__$DD [a__$bas.base + a__$0.stride * __$t41 + 
                           a__$1.stride * __$t48] = (double) 0.25 * 
        (a__$DD [a__$bas.base + a__$0.stride * (__$t41 - __$t36 * 1) +
                                a__$1.stride * __$t48] + 
         a__$DD [a__$bas.base + a__$0.stride * (__$t41 + __$t36 * 1) + 
                                a__$1.stride * __$t48] + 
         a__$DD [a__$bas.base + a__$0.stride * __$t41 + 
                                a__$1.stride * (__$t48 - __$t43 * 1)] + 
         a__$DD [a__$bas.base + a__$0.stride * __$t41 + 
                                a__$1.stride * (__$t48 + __$t43 * 1)]) ;

    __$t47 += __$t42.glb_stp ; __$t48 += __$t42.sub_stp ;
}
Figure 5.9 shows the innermost overall loop of the naively translated Laplace equation program using red-black relaxation in HPJava, using the new translation schemes from Figure 5.4. Here, let us focus on the control variables. When overall loops are nested, the control variables generated for the inner loop reside inside the outer loop after translation, so the expensive method calls that initialize them are repeated in every iteration of the outer loop. Since HPJava is aimed at large-scale scientific and engineering applications, the number of iterations is likely to be large. Moreover, using information available to the compiler, we know in advance that the return values of str(), dim(), and restrict() do not change no matter what computations occur in the inner loop; localBlock() is the exception. This means that these control variables are partially redundant for the outer overall. Furthermore, __$t45 is never used and __$t44 is used only to compute __$t45, so both are useless, and __$t47 is useless because it is only used in a useless statement, __$t47 += __$t42.glb_stp.

LU is quite a useful optimization technique for this experimental study since the innermost overall looks as follows:

which generates an expensive method call in the translation:

y.localBloc...
... iter) % 2, N-2, 2) ;

where iter is invariant with respect to the outer loop. The outer loop can therefore be unrolled by 2, after which the arguments of localBlock() become loop invariant in each copy of the loop body and the whole invocation becomes a candidate for hoisting outside the outer loop. The following simplified code results from applying LU to the outer loop:

Because (__$t40 + iter) % 2 does not change when __$t40 is incremented by a multiple of 2, this expression (for example) is loop invariant. Both invocations of localBlock() therefore become invariant and are ready to be hoisted outside the outer loop.
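The parity argument can be checked with a small plain-Java sketch. It is hand-written for illustration and is not output of the HPJava translator: startOf is a hypothetical stand-in for the start argument passed to localBlock(), the call counter exists only to make the saving visible, and the outer index is assumed to advance with a global step of 1.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; all names here are assumptions, not HPJava API.
final class UnrollByTwoSketch {
    static int calls = 0; // counts evaluations of the stand-in below

    // Hypothetical stand-in for the start argument 1 + (i + iter) % 2.
    static int startOf(int i, int iter) {
        calls++;
        return 1 + (i + iter) % 2;
    }

    // Original shape: the start is recomputed in every outer iteration.
    static List<Integer> naive(int n, int iter) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            starts.add(startOf(i, iter));
        }
        return starts;
    }

    // Unrolled by 2: i advances by 2, so (i + iter) % 2 keeps its parity in
    // each half of the body and both starts can be computed before the loop.
    static List<Integer> unrolled(int n, int iter) {
        List<Integer> starts = new ArrayList<>();
        int startEven = startOf(0, iter); // hoisted, used when i is even
        int startOdd = startOf(1, iter);  // hoisted, used when i is odd
        for (int i = 0; i < n; i += 2) {
            starts.add(startEven);
            if (i + 1 < n) {
                starts.add(startOdd);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        calls = 0;
        List<Integer> a = naive(7, 3);
        int naiveCalls = calls;
        calls = 0;
        List<Integer> b = unrolled(7, 3);
        int unrolledCalls = calls;
        // prints "true 7 2"
        System.out.println(a.equals(b) + " " + naiveCalls + " " + unrolledCalls);
    }
}
```

Both versions produce the same sequence of starts, but the unrolled one evaluates the stand-in twice in total rather than once per outer iteration.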
Figure 5.10: Optimized HPJava program of Laplace equation using red-black relaxation by HPJOPT2
// Hoisted variables
Block __$t35 = x.localBlock (1, N - 2) ; int __$t36 = x.str () ;
int __$t40 = __$t35.glb_bas, __$t41 = __$t35.sub_bas ;
int __$t43 = y.str () ;
int glb_iter = __$t35.glb_bas + iter ;
Block __$t42_1 = y.localBlock (1 + (glb_iter), N - 2, 2) ;
Block __$t42_2 = y.localBlock (1 + (glb_iter + __$t35.glb_stp), N - 2, 2) ;
int a3 = a__$bas.base, a4 = a__$0.stride, a5 = a__$1.stride ;
int a6 = a3 + a4 * __$t41, a7 = a4 * __$t36, a9 = a5 * __$t43 ;

// Main body of Laplace equation using red-black relaxation
int __$t39 = 0 ;
if (__$t39 < __$t35.count) {
  int b1 = __$t35.glb_stp, b2 = __$t35.sub_stp ;
  do {
    int __$t48 = __$t42.sub_bas ;
    int __$t46 = 0 ;
    if (__$t46 < __$t42.count) {
      int a1 = __$t42.glb_stp, a2 = __$t42.sub_stp ;
      int a8 = a5 * __$t48, a10 = a6 + a8 ;
      do {
        a__$DD [a10] = 
            (double) 0.25 * (a__$DD [a10 - a7] + a__$DD [a10 + a7] + 
                             a__$DD [a10 - a9] + a__$DD [a10 + a9]) ;
        __$t48 += a2 ; __$t46 ++ ;
      } while (__$t46 < __$t42.count) ;
    }
    __$t48 += a2 ; __$t46 = 0 ;
    if (__$t46 < __$t42.count) {
      int a1 = __$t42.glb_stp, a2 = __$t42.sub_stp ;
      int a8 = a5 * __$t48, a10 = a6 + a8 ;
      do {
        a__$DD [a10] = 
            (double) 0.25 * (a__$DD [a10 - a7] + a__$DD [a10 + a7] + 
                             a__$DD [a10 - a9] + a__$DD [a10 + a9]) ;
        __$t48 += a2 ; __$t46 ++ ;
      } while (__$t46 < __$t42.count) ;
    }
    __$t40 += 2 * b1 ; __$t41 += 2 * b2 ; __$t39 += 2 ;
  } while (__$t39 < __$t35.count) ;
}
To eliminate complicated distributed index subscript expressions and to hoist control variables in the inner loops, we adopt the following algorithm:
Step 1: (Optional) Apply Loop Unrolling.
Step 2: Hoist control variables to the outermost loop by using compiler information if loop invariant.
Step 3: Apply Partial Redundancy Elimination and Strength Reduction.
Step 4: Apply Dead Code Elimination.
We will call this algorithm HPJOPT2 (HPJava OPTimization Level 2). Applying LU is optional since it is only useful when a nested overall loop involves the pattern (i` + expr) % 2. We do not treat Step 2 of HPJOPT2 as a part of PRE. It is feasible for control variables and control subscripts to be hoisted by applying PRE, but using information available to the compiler, we often know in advance that they are loop invariant. Thus the compiler hoists them directly, without requiring PRE, whenever they are loop invariant.

Figure 5.10 shows a complete example of the Laplace equation using red-black relaxation optimized by HPJOPT2. All control variables are successfully lifted out of the outer loop, and the complicated distributed index subscript expressions are simplified. We also need to observe what effect PRE and SR have on the subscript expression of a multiarray element access. In the innermost overall loop in Figure 3.11, we have the multiarray element access

a [i - 1, j]

After the naive translation, in Figure 5.9, this access transforms into

a__$DD [a__$bas.base + a__$0.stride * (__$t41 - __$t36 * 1) +
                        a__$1.stride * __$t48]

Using data flow analysis, PRE determines that __$t41 and __$t36 do not change inside the loop where the multiarray element access resides, so computations involving them can be hoisted outside that loop. More interesting is the variable __$t48, an induction variable and hence a candidate for SR: the multiplication involving __$t48 can be replaced with cheaper additions. Thus SR also affects part of the subscript expression of this multiarray element access. Finally, this multiarray element access is optimized as follows:

a__$DD [a10 - a7]

Since most of the generated method calls for control variables are hoisted outside the loops, and expensive multiplications are replaced with cheaper additions using PRE and SR, we expect the optimized Laplace equation program using red-black relaxation to be faster than the naive translation. This will be confirmed in the following two chapters.
Bryan Carpenter 2004-06-09