next up previous contents
Next: 3-Dimensional Diffusion Equation Up: Partial Differential Equation Previous: Background on Partial Differential   Contents


Laplace Equation Using Red-Black Relaxation

In this section, we benchmark the well-known Laplace equation using red-black relaxation. Figure 3.11 is an implementation of the `red-black' scheme, in which all even sites are stencil-updated in one phase, then all odd sites are stencil-updated in a separate phase. Before we start, we want to point out two issues for PDE benchmark applications: the overhead of run-time communication library calls, and the relation between problem size and cache size.

First, an HPJava program such as the red-black Laplace solver of Figure 3.11 makes a run-time communication library call, Adlib.writeHalo(), which updates the cached values in the ghost regions with the proper element values from neighboring processes. Because our concern here is optimization of node code, not the communication library, we delete communication library calls when benchmarking node performance of HPJava throughout this chapter.

Second, we run two experiments, one with all data in the cache and one without, and see what we learn from them. When a matrix in a PDE application is larger than the physical cache, the cache hit ratio drops, so for the first experiment we choose a matrix size that keeps all data in cache and thereby maximizes the efficiency of the code on the machine. The physical cache of the Linux machine is 256 KB and the data type is double (8 bytes), so the matrix should hold fewer than 256 KB / 8 B = 32,768 elements, i.e. be smaller than about 181 $ \times $ 181. We therefore choose a 150 $ \times $ 150 matrix for benchmarking the HPJava red-black Laplace program.

In section 5.4, we discussed that a relatively expensive method call, localBlock(), can't be lifted out of the outer loop since it is not an invariant there, and we introduced the loop unrolling (LU) technique to handle this problem.
Is LU, then, the best way to make the method call an invariant and hoist it outside the outer loop? As an alternative to LU, we introduce another technique, called splitting. This is a programming technique rather than a compiler optimization scheme. The main body of the HPJava red-black Laplace solver from Figure 3.11 looks like:
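The HPJava source itself is not reproduced here. As a rough plain-Java equivalent (with hypothetical array name u and size n; the actual HPJava code uses distributed arrays and nested overall constructs), the red-black update pattern is:

```java
public class RedBlackSketch {

    // One relaxation phase: update only the interior sites whose
    // coordinate parity (i + j) % 2 matches `parity' (0 = red, 1 = black).
    static void sweep(double[][] u, int parity) {
        int n = u.length;
        for (int i = 1; i < n - 1; i++) {
            // First interior column with (i + j) % 2 == parity --
            // note this start value depends on the outer index i.
            int j0 = 2 - (i + parity) % 2;
            for (int j = j0; j < n - 1; j += 2) {
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1]);
            }
        }
    }

    public static void main(String[] args) {
        int n = 16;
        double[][] u = new double[n][n];
        // Boundary condition u = 1; the interior then relaxes toward 1.
        for (int k = 0; k < n; k++) {
            u[0][k] = u[n - 1][k] = u[k][0] = u[k][n - 1] = 1.0;
        }
        for (int iter = 0; iter < 500; iter++) {
            sweep(u, 0);   // red phase
            sweep(u, 1);   // black phase
        }
        System.out.println(u[n / 2][n / 2]);   // converges toward 1.0
    }
}
```

The sketch makes the problematic dependence visible: the inner loop's start column is recomputed from the outer index on every row, which is the analogue of the i-dependent expression that keeps localBlock() from being hoisted.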

In the naive translation, the inner overall generates a localBlock() method call that is not an invariant and can't be hoisted outside the outer loop. This problem can be solved if the index triplet of the inner overall is invariant with respect to the outer loop. We can therefore split each nested overall in the main body into two nested overall constructs, one updating the `red' sites and the other updating the `black' sites, like:

// Updating ...
overall(i = x for ... : N - 2)
    overall(j = y for 1 : N - 2 : 2) { ... }
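To make the benefit concrete, here is a plain-Java sketch of the same splitting idea (hypothetical names, not the thesis code): each parity phase becomes strided loop nests whose bounds and strides are constants, so nothing in the inner loop depends on the outer index.

```java
public class SplitSketch {

    // Update every site (i, j) with i = i0, i0+2, ... and j = j0, j0+2, ...
    // Bounds and strides are loop-invariant constants.
    static void stridedUpdate(double[][] u, int i0, int j0) {
        int n = u.length;
        for (int i = i0; i < n - 1; i += 2) {
            for (int j = j0; j < n - 1; j += 2) {
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1]);
            }
        }
    }

    // Red sites have (i + j) even, black sites have (i + j) odd; each
    // colour is covered by two of the four constant-stride nests.
    static void redBlackSweep(double[][] u) {
        stridedUpdate(u, 1, 1);   // red, odd rows
        stridedUpdate(u, 2, 2);   // red, even rows
        stridedUpdate(u, 1, 2);   // black, odd rows
        stridedUpdate(u, 2, 1);   // black, even rows
    }
}
```

Because within one parity phase the updated sites read only sites of the opposite colour, the split version computes exactly the same result as the combined loop; only the iteration order over each colour changes.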

Figure 6.2: Performance comparison between the 150 $ \times $ 150 original Laplace equation from Figure 3.11 and the split version of the Laplace equation.
Figure 6.2 shows the performance comparison of the red-black Laplace solvers with problem size 150 $ \times $ 150. The original version is the one from Figure 3.11; HPJOPT2 optimizes it with loop unrolling because the pattern (i` + expr) % 2 is detected. The split version is the newly introduced program; because it avoids the pattern (i` + expr) % 2, the heavy localBlock() method call can be hoisted outside the outer loop immediately, without loop unrolling. The Java and C reference programs are not loop-unrolled; rather, they are maximally optimized with the -O and -O5 options, respectively.

The naive translations of the two HPJava programs perform almost identically to each other, and well behind Java and C (at about 30% of their performance). With only PRE applied, the split version is slightly (10%) better than the original, but both are still well behind Java and C (at about 49% of their performance). PRE alone nevertheless gives a substantial improvement over the naive translation: 186% for the original and 205% for the split version. HPJOPT2 gives an even more dramatic improvement over the naive translation: 361% for the original and 287% for the split version.
Figure 6.3: Memory locality for loop unrolling and split versions.
The interesting comparison is between the programs with HPJOPT2 applied. The original with HPJOPT2 (loop-unrolled) runs at 127% of the speed of the split version with HPJOPT2, because of memory locality: as Figure 6.3 illustrates, the data touched by the original with HPJOPT2 stay local for each inner loop, unlike the data touched by the split version. This means that applying the HPJOPT2 compiler optimization (unrolling the outer loop by 2) is a better choice than programmatically splitting the overall nest. One of the most interesting results is that the performance of the original with HPJOPT2 (loop-unrolled) is almost identical to Java and C performance (6% and 10% behind, respectively). That is, HPJava node code for the red-black Laplace solver optimized by HPJOPT2 is quite competitive with the Java and C implementations.
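For comparison with the split sketch, a plain-Java sketch of the unrolled alternative (hypothetical names, not the generated code) shows where the locality comes from: the outer loop advances by 2, handling an odd row and the following even row together, so the parity-dependent start column is a constant in each unrolled body and consecutive rows are touched close together in time.

```java
public class UnrolledSketch {

    // Red-black phase with the outer loop unrolled by 2.
    static void sweepUnrolled(double[][] u, int parity) {
        int n = u.length;
        int i = 1;
        for (; i + 1 < n - 1; i += 2) {
            // Odd row i: start column is a constant for the whole loop.
            for (int j = 2 - (1 + parity) % 2; j < n - 1; j += 2) {
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1]);
            }
            // Even row i + 1: likewise a constant start column.
            for (int j = 2 - parity % 2; j < n - 1; j += 2) {
                u[i + 1][j] = 0.25 * (u[i][j] + u[i + 2][j]
                                    + u[i + 1][j - 1] + u[i + 1][j + 1]);
            }
        }
        // Remainder row when the interior row count is odd.
        if (i < n - 1) {
            for (int j = 2 - (i + parity) % 2; j < n - 1; j += 2) {
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1]);
            }
        }
    }
}
```

Rows i and i + 1 are swept back to back, so their shared neighbor row is still warm in the cache, whereas the split version revisits the whole array once per strided nest.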
Figure 6.4: Performance comparison between the 512 $ \times $ 512 original Laplace equation from Figure 3.11 and the split version of the Laplace equation.
For our second experiment, Figure 6.4 shows the performance comparison of the red-black Laplace solvers with problem size 512 $ \times $ 512. The pattern of performance is virtually identical to that of the 150 $ \times $ 150 problem size, except that the in-cache version performs about 12% better than the large problem size.
Bryan Carpenter 2004-06-09