Laplace Equation Using Red-Black Relaxation
In this section, we will benchmark the well-known Laplace equation
using red-black relaxation. Figure 3.11 is an
implementation of the `red-black' scheme, in which all even sites
are stencil-updated, then all odd sites are stencil-updated in a
separate phase.
There are two things we want to point out for PDE benchmark
applications before we start: the overhead of run-time communication
library calls and the relation between problem size and cache size.
First, an HPJava program such as Laplace equation using red-black
relaxation from Figure 3.11 introduces a run-time
communication library call, Adlib.writeHalo(), which updates the
cached values in the ghost regions with the proper element values from
neighboring processes. Because our concern here is with optimization
of node code, not the communication library, we will ignore (delete)
communication library calls when benchmarking node performance of
HPJava throughout this chapter.
We do two experiments, one with all data in the cache, and the
other not, and see what we learn from the experiments.
In the first experiment, in order to maximize performance, we choose
a suitable size for each matrix in this benchmark. When a matrix in a
PDE application is larger than the physical cache, the cache hit
ratio decreases. Thus, we choose a matrix size for Laplace equation
using red-black relaxation that maximizes the efficiency of the code
on the machine. Because the physical cache of the Linux machine is
256 KB and the data type is double (8 bytes), the side of a square
matrix should be less than about
$\sqrt{256\,\mathrm{KB} / 8\,\mathrm{B}} \approx 180$ elements. So, we
choose a 150 × 150 matrix for benchmarking an
HPJava Laplace equation using red-black relaxation program.
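The sizing argument above is simple arithmetic and can be checked with
a short sketch. The Java below is illustrative only: the class and
method names are hypothetical, 256 KB is taken to mean 256 × 1024
bytes, and the estimate ignores everything else that competes for
cache space.

```java
public class CacheFitSketch {
    // Bytes occupied by an n x n matrix whose elements take elemBytes each.
    static long footprintBytes(int n, int elemBytes) {
        return (long) n * n * elemBytes;
    }

    // Largest n such that an n x n matrix of elemBytes-sized elements
    // fits in cacheBytes (a rough bound; nothing else is accounted for).
    static int maxSide(long cacheBytes, int elemBytes) {
        return (int) Math.floor(Math.sqrt((double) cacheBytes / elemBytes));
    }

    public static void main(String[] args) {
        long cache = 256L * 1024;  // assumption: 256 KB = 262144 bytes
        int dbl = Double.BYTES;    // 8 bytes per double
        System.out.println("max side:  " + maxSide(cache, dbl));
        System.out.println("150 x 150: " + footprintBytes(150, dbl) + " bytes");
        System.out.println("512 x 512: " + footprintBytes(512, dbl) + " bytes");
    }
}
```

Under these assumptions a 150 × 150 matrix of doubles occupies 180,000
bytes and fits in the cache, while a 512 × 512 matrix occupies
2,097,152 bytes and does not.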
In section 5.4, we discussed that a relatively
expensive method call, localBlock(), can't be lifted out of the
outer loop since it is not an invariant. We introduced the loop
unrolling (LU) technique to handle this problem. Is LU, then, the best
way to make the method call an invariant and hoist it outside the
outer loop?
As an alternative to LU, we introduce another programming technique,
called splitting. This is a programming technique rather than a
compiler optimization scheme. The main body of the HPJava Laplace
equation using red-black relaxation from Figure 3.11
is a nested overall construct. In the naive translation, the inner
overall generates a localBlock() method call whose arguments depend
on the outer index through the pattern (i` + expr) % 2, so the call
is not an invariant and can't be hoisted outside the outer
loop. This problem can be solved if the index triplet of the
inner overall is invariant with respect to the outer loop. We
can split the inner overall in the main body, turning each
nested overall into two nested overall constructs, one for
updating `red' and the other for updating `black'.
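The idea of splitting can be sketched in plain Java. This is not the
HPJava source from Figure 3.11 and not the author's split program: the
class, method, and variable names are hypothetical, the grid is assumed
to be square, and splitting by row parity is an assumption about how
the inner range is made invariant. In the original form the inner start
column depends on the outer index i. In the split form each nest has a
start column that can be computed once, outside the outer loop.

```java
import java.util.Arrays;

public class SplitSketch {
    // Four-point stencil update of one interior site.
    static void update(double[][] a, int i, int j) {
        a[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1]);
    }

    // One colour phase, original form: the inner start column depends on i,
    // so the inner range is not invariant with respect to the outer loop.
    static void phaseOriginal(double[][] a, int iter) {
        int n = a.length;  // assumption: a is n x n
        for (int i = 1; i <= n - 2; i++)
            for (int j = 1 + (i + iter) % 2; j <= n - 2; j += 2)
                update(a, i, j);
    }

    // Same phase split by row parity: each nest has a fixed inner range.
    static void phaseSplit(double[][] a, int iter) {
        int n = a.length;
        int startOdd = 1 + (1 + iter) % 2;  // start column for odd rows
        int startEven = 1 + iter % 2;       // start column for even rows
        for (int i = 1; i <= n - 2; i += 2)
            for (int j = startOdd; j <= n - 2; j += 2)
                update(a, i, j);
        for (int i = 2; i <= n - 2; i += 2)
            for (int j = startEven; j <= n - 2; j += 2)
                update(a, i, j);
    }

    // n x n grid with a non-zero top boundary row and zeros elsewhere.
    static double[][] grid(int n) {
        double[][] a = new double[n][n];
        for (int k = 0; k < n; k++) a[0][k] = 1.0 + k;
        return a;
    }

    public static void main(String[] args) {
        double[][] p = grid(10), q = grid(10);
        for (int iter = 0; iter < 6; iter++) {
            phaseOriginal(p, iter);
            phaseSplit(q, iter);
        }
        System.out.println("same result: " + Arrays.deepEquals(p, q));
    }
}
```

Within one colour phase every updated site reads only sites of the
other colour (or boundary values), so the order in which rows are
visited does not change the result. Only the memory access pattern
differs between the two forms.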
Figure 6.2:
Performance comparison between the 150 × 150 original
Laplace equation from Figure 3.11 and the split version
of the Laplace equation.
Figure 6.2 shows the performance comparison of
Laplace equation using red-black relaxation with a problem size of
150 × 150. The original Laplace
equation using red-black relaxation is from Figure 3.11.
It is optimized by HPJOPT2 with loop unrolling, since the pattern
(i` + expr) % 2 is detected. The split version is the newly
introduced program that avoids the pattern (i` + expr) % 2, so we can
immediately hoist the heavy method call, localBlock(), outside
the outer loop (without loop unrolling). Moreover, the original
versions of the Java and C programs are not loop-unrolled. Rather, they
are maximally optimized by the -O and -O5 options, respectively.
The performance of the naive translations of the two programs is almost
identical, and quite behind Java and C performance
(about 30%). For the programs with only PRE applied, the split version
is slightly better (10%) than the original. Both are still quite
behind Java and C performance (about 49%).
The programs with only PRE applied improve on the performance of the
naive translation. The original is improved by 186% and the
split version is improved by 205%.
Moreover, we got a more dramatic performance improvement from HPJOPT2
over the naive translation. The original is improved by 361% and
the split version is improved by 287%.
Figure 6.3:
Memory locality for loop unrolling and split versions.
The interesting thing is the performance of programs where HPJOPT2 is
applied. The speed of the original with HPJOPT2 (and loop-unrolled)
is 127% over the split version with HPJOPT2 because of memory
locality. Data for the original with HPJOPT2 are held locally for each
inner loop, in contrast to data for the split version, as illustrated
in Figure 6.3. This means that applying the HPJOPT2 compiler
optimization (unrolling the outer loop by 2) is a better choice
than programmatically splitting the overall nest.
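To make the locality argument concrete, here is a minimal plain-Java
sketch of an outer loop unrolled by 2. It is not the output of HPJOPT2,
and all names are hypothetical. Each outer iteration handles rows i and
i + 1 back to back, so the rows read by the two stencil sweeps overlap
while they are likely still in cache. A version split by row parity
would instead sweep all odd rows and only then return to the even rows.
As with splitting, both inner start columns are invariant and are
computed once, outside the outer loop.

```java
import java.util.Arrays;

public class UnrollSketch {
    // Four-point stencil update of one interior site.
    static void update(double[][] a, int i, int j) {
        a[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1]);
    }

    // Reference form: the inner start column depends on the outer index.
    static void phaseOriginal(double[][] a, int iter) {
        int n = a.length;  // assumption: a is n x n
        for (int i = 1; i <= n - 2; i++)
            for (int j = 1 + (i + iter) % 2; j <= n - 2; j += 2)
                update(a, i, j);
    }

    // Outer loop unrolled by 2: rows i (odd) and i + 1 (even) are
    // processed together, each with its own invariant start column.
    static void phaseUnrolled(double[][] a, int iter) {
        int n = a.length;
        int startOdd = 1 + (1 + iter) % 2;  // start column for odd rows
        int startEven = 1 + iter % 2;       // start column for even rows
        int i = 1;
        for (; i + 1 <= n - 2; i += 2) {
            for (int j = startOdd; j <= n - 2; j += 2) update(a, i, j);
            for (int j = startEven; j <= n - 2; j += 2) update(a, i + 1, j);
        }
        if (i <= n - 2)  // leftover odd row when the interior row count is odd
            for (int j = startOdd; j <= n - 2; j += 2) update(a, i, j);
    }

    // n x n grid with a non-zero top boundary row and zeros elsewhere.
    static double[][] grid(int n) {
        double[][] a = new double[n][n];
        for (int k = 0; k < n; k++) a[0][k] = 1.0 + k;
        return a;
    }

    public static void main(String[] args) {
        double[][] p = grid(11), q = grid(11);
        for (int iter = 0; iter < 6; iter++) {
            phaseOriginal(p, iter);
            phaseUnrolled(q, iter);
        }
        System.out.println("same result: " + Arrays.deepEquals(p, q));
    }
}
```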
One of the most interesting results is that the performance of the
original with HPJOPT2 (and loop-unrolled) is almost identical to Java
and C performance (the original is 6% and 10% behind,
respectively). That is, HPJava node code for Laplace equation using
red-black relaxation optimized by HPJOPT2 is quite competitive with Java
and C implementations.
Figure 6.4:
Performance comparison between the 512 × 512 original
Laplace equation from Figure 3.11 and the split version
of the Laplace equation.
For our second experiment, Figure 6.4 shows the
performance comparison of
Laplace equation using red-black relaxation with a problem size of
512 × 512. The pattern of performance is virtually identical
to that of the 150 × 150 problem size, except that the
version with all data in cache performs 12% better than the larger
problem size.
Bryan Carpenter
2004-06-09