Next: 3-Dimensional Diffusion Equation
Up: Benchmarking HPJava, Part II:
Previous: Direct Matrix Multiplication
Contents
Laplace Equation Using Red-Black Relaxation
In Figure 3.11, we introduced a run-time communication library
call, Adlib.writeHalo(), which updates the cached values in the
ghost regions with the proper element values from neighboring
processes. With ghost regions, the inner loop of a stencil-update
algorithm can be written simply, since the edges of the block need
no special treatment when accessing neighboring elements: shifted
indices locate the proper values cached in the ghost region. This is
a very important feature in real codes. The main issue for this
subsection is therefore to assess the performance bottleneck that
run-time communication library calls introduce in an HPJava program.
By benchmarking the Laplace equation both with and without
Adlib.writeHalo(), we analyze the effect and latency of the library
call in real codes.
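To make the role of the ghost regions concrete, the following is a minimal sketch in plain sequential Java. It is not HPJava syntax and not the code of Figure 3.11. It assumes two side-by-side blocks, each stored with a one-cell ghost frame; the class RedBlackSketch and its methods sweep, exchangeHalo, solveWhole and solveSplit are hypothetical names chosen for illustration. exchangeHalo only stands in for the role that a halo update plays, using a local array copy instead of inter-process communication.

```java
public class RedBlackSketch {
    // Relax the cells of one color (0 or 1) in a block that carries a
    // one-cell ghost frame. colOffset is the global column index of the
    // block's local column 0, so that colors agree across blocks.
    public static void sweep(double[][] a, int colOffset, int color) {
        int rows = a.length - 2;
        int cols = a[0].length - 2;
        for (int i = 1; i <= rows; i++) {
            for (int j = 1; j <= cols; j++) {
                if ((i + j + colOffset) % 2 == color) {
                    // No edge test: neighbors outside the owned block
                    // are read from the ghost frame via shifted indices.
                    a[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j]
                                    + a[i][j - 1] + a[i][j + 1]);
                }
            }
        }
    }

    // Hypothetical stand-in for a halo update between two side-by-side
    // blocks: copy each block's edge column into the neighbor's ghost column.
    public static void exchangeHalo(double[][] left, double[][] right) {
        int lastOwned = left[0].length - 2;
        for (int i = 0; i < left.length; i++) {
            left[i][lastOwned + 1] = right[i][1];
            right[i][0] = left[i][lastOwned];
        }
    }

    // Allocate a block with a ghost frame; the top boundary row is held
    // at 1.0 and every other cell starts at 0.0 (an assumed test problem).
    static double[][] newBlock(int rows, int cols) {
        double[][] a = new double[rows + 2][cols + 2];
        java.util.Arrays.fill(a[0], 1.0);
        return a;
    }

    // Reference: the whole n x n grid in a single block.
    public static double[][] solveWhole(int n, int iters) {
        double[][] a = newBlock(n, n);
        for (int it = 0; it < iters; it++) {
            sweep(a, 0, 0);
            sweep(a, 0, 1);
        }
        return a;
    }

    // Same problem split into two column blocks (n assumed even), with a
    // ghost refresh before each color sweep.
    public static double[][] solveSplit(int n, int iters) {
        int h = n / 2;
        double[][] left = newBlock(n, h);
        double[][] right = newBlock(n, h);
        for (int it = 0; it < iters; it++) {
            for (int color = 0; color < 2; color++) {
                exchangeHalo(left, right);
                sweep(left, 0, color);
                sweep(right, h, color);
            }
        }
        // Gather the owned cells back into one array for comparison.
        double[][] a = newBlock(n, n);
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= h; j++) {
                a[i][j] = left[i][j];
                a[i][j + h] = right[i][j];
            }
        }
        return a;
    }

    public static void main(String[] args) {
        double[][] w = solveWhole(8, 100);
        double[][] s = solveSplit(8, 100);
        System.out.println("whole[4][4] = " + w[4][4]
                         + ", split[4][4] = " + s[4][4]);
    }
}
```

In this sketch the same sweep routine serves both the single block and the split blocks. The only extra work in the split version is the ghost refresh before each color sweep, which is the cost this subsection sets out to measure for the real library call.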
Figure 7.2: 512 × 512 Laplace Equation using Red-Black Relaxation
without Adlib.writeHalo() on shared memory machine
Figure 7.2 shows the performance of the
Laplace equation using red-black relaxation without
Adlib.writeHalo() on the shared memory machine.
Again, we first compare the Java performance with the C performance
on the shared memory machine: Java reaches 98% of the C performance,
which is quite satisfactory for Java.
Table 7.4 shows the speedup of the naive translation and of HPJOPT2
over the sequential Java program, and the speedup of HPJOPT2 over
the naive translation.
Table 7.4: Speedup of the naive translation over sequential Java and C
programs for the Laplace equation using red-black relaxation without
Adlib.writeHalo() on SMP.
Number of Processors           | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8
Naive translation over Java    | 0.43 | 0.86 | 1.29 | 1.73 | 2.25 | 2.73 | 3.27 | 3.60
HPJOPT2 over Java              | 0.59 | 1.15 | 1.73 | 2.35 | 2.95 | 3.60 | 4.29 | 4.87
HPJOPT2 over Naive translation | 1.36 | 1.34 | 1.35 | 1.37 | 1.31 | 1.32 | 1.31 | 1.35
The speedup of the naive translation over sequential Java reaches
360% (a factor of 3.60) with 8 processors. The speedup of HPJOPT2 over
sequential Java reaches 487% with 8 processors. The speedup of HPJOPT2
over the naive translation is up to 137%.
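The three data rows of Table 7.4 are related: the speedup of HPJOPT2 over the naive translation should equal the HPJOPT2-over-Java row divided by the naive-over-Java row. The following sketch recomputes that quotient from the table entries. The class name SpeedupRatios and the method ratio are hypothetical, and small differences in the second decimal are to be expected because the published entries are rounded.

```java
public class SpeedupRatios {
    // Rows copied from Table 7.4 (1 to 8 processors).
    public static final double[] NAIVE_OVER_JAVA =
        {0.43, 0.86, 1.29, 1.73, 2.25, 2.73, 3.27, 3.60};
    public static final double[] HPJOPT2_OVER_JAVA =
        {0.59, 1.15, 1.73, 2.35, 2.95, 3.60, 4.29, 4.87};
    public static final double[] HPJOPT2_OVER_NAIVE =
        {1.36, 1.34, 1.35, 1.37, 1.31, 1.32, 1.31, 1.35};

    // Element-wise quotient num[i] / den[i].
    public static double[] ratio(double[] num, double[] den) {
        double[] r = new double[num.length];
        for (int i = 0; i < num.length; i++) {
            r[i] = num[i] / den[i];
        }
        return r;
    }

    public static void main(String[] args) {
        double[] r = ratio(HPJOPT2_OVER_JAVA, NAIVE_OVER_JAVA);
        for (int p = 1; p <= r.length; p++) {
            System.out.printf("%d processors: recomputed %.2f, table %.2f%n",
                              p, r[p - 1], HPJOPT2_OVER_NAIVE[p - 1]);
        }
    }
}
```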
Table 7.5 shows the speedup of the
naive translation and HPJOPT2 for each number of processors over the
performance with one processor.
Table 7.5: Speedup of the naive translation and HPJOPT2 for each number of
processors over the performance with one processor for the Laplace
equation using red-black relaxation without Adlib.writeHalo() on SMP.
Number of Processors | 2    | 3    | 4    | 5    | 6    | 7    | 8
Naive translation    | 1.97 | 2.96 | 3.98 | 5.17 | 6.27 | 7.53 | 8.28
HPJOPT2              | 1.94 | 2.92 | 4.00 | 4.98 | 6.08 | 7.24 | 8.21
The naive translation achieves up to 828% speedup (a factor of 8.28)
using 8 processors on the shared memory machine, and HPJOPT2 achieves
up to 821%. Thus, if we could ignore Adlib.writeHalo(), the HPJava
system would give a tremendous performance improvement on the shared
memory machine. Moreover, the HPJOPT2 optimization scheme works quite
well.
Figure 7.3: Laplace Equation using Red-Black Relaxation on shared memory machine
Figure 7.3 shows the performance of the Laplace equation
using red-black relaxation with Adlib.writeHalo() on the
shared memory machine.
Table 7.6 shows the speedup of the naive translation and of HPJOPT2
over the sequential Java and C programs, and the speedup of HPJOPT2
over the naive translation.
Table 7.6: Speedup of the naive translation over sequential Java and C
programs for the Laplace equation using red-black relaxation on SMP.
Number of Processors           | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8
Naive translation over Java    | 0.43 | 0.79 | 1.01 | 1.24 | 1.38 | 1.50 | 1.64 | 1.71
HPJOPT2 over Java              | 0.77 | 1.14 | 1.38 | 1.71 | 1.89 | 2.07 | 2.25 | 2.36
HPJOPT2 over Naive translation | 1.77 | 1.45 | 1.37 | 1.38 | 1.37 | 1.38 | 1.37 | 1.38
The speedup of the naive translation with 8 processors over sequential
Java and C is up to 171%. The speedup of HPJOPT2 with 8 processors
over sequential Java and C is up to 236%. The speedup of
HPJOPT2 over the naive translation is up to 177%.
Table 7.7 shows the speedup of the naive
translation and HPJOPT2 for each number of processors over the
performance with one processor.
Table 7.7: Speedup of the naive translation and HPJOPT2 for each number of
processors over the performance with one processor for the Laplace
equation using red-black relaxation on SMP.
Number of Processors | 2    | 3    | 4    | 5    | 6    | 7    | 8
Naive translation    | 1.81 | 2.33 | 2.86 | 3.19 | 3.46 | 3.77 | 3.95
HPJOPT2              | 1.49 | 1.80 | 2.23 | 2.47 | 2.69 | 2.93 | 3.07
The naive translation achieves up to 395% speedup (a factor of 3.95)
using 8 processors on the shared memory machine, and HPJOPT2 achieves
up to 307%. Compared with the performance without
Adlib.writeHalo(), the program is up to about two times slower: on
8 processors the naive translation reaches 1.71 times sequential Java
with the call, versus 3.60 without it. However, in both programs, with
and without the library call, the gain from optimization is quite
consistent. This suggests that the optimization strategies are not
affected by the run-time communication library calls; the calls only
slow down the running time of HPJava programs.
Thus, when using run-time communication libraries, HPJava programs
need to be designed carefully, since the latency cannot simply be
disregarded.
Note that the shared memory version of Adlib was implemented very
naively, by porting the transport layer, mpjdev, to exchange messages
between Java threads. A more careful port for the shared memory
version would probably give much better performance.
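The cost of the library call can be estimated from Tables 7.4 and 7.6. The following sketch does so under two assumptions that the text does not state explicitly: both tables use the same sequential Java baseline (the one-processor entries for the naive translation are 0.43 in both), and all of the difference in running time is attributable to Adlib.writeHalo(). Under these assumptions the fraction of the running time added by the call is 1 - S_with / S_without, where S denotes the speedup over sequential Java. The class name HaloOverhead and the method addedFraction are hypothetical.

```java
public class HaloOverhead {
    // "Naive translation over Java" rows, 1 to 8 processors.
    public static final double[] WITHOUT_HALO =   // Table 7.4
        {0.43, 0.86, 1.29, 1.73, 2.25, 2.73, 3.27, 3.60};
    public static final double[] WITH_HALO =      // Table 7.6
        {0.43, 0.79, 1.01, 1.24, 1.38, 1.50, 1.64, 1.71};

    // Fraction of the running time with the call that is not present
    // without it, assuming a common sequential baseline.
    public static double[] addedFraction(double[] with, double[] without) {
        double[] f = new double[with.length];
        for (int i = 0; i < with.length; i++) {
            f[i] = 1.0 - with[i] / without[i];
        }
        return f;
    }

    public static void main(String[] args) {
        double[] f = addedFraction(WITH_HALO, WITHOUT_HALO);
        for (int p = 1; p <= f.length; p++) {
            System.out.printf("%d processors: %.1f%% of running time%n",
                              p, 100.0 * f[p - 1]);
        }
    }
}
```

Read this way, the share of time attributed to the call grows with the number of processors, which is consistent with the flattening of the speedup curve in Table 7.7.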
Figure 7.4: Laplace Equation using Red-Black Relaxation on distributed memory machine
Table 7.8: Speedup of the naive translation for each number of
processors over the performance with one processor for the Laplace
equation using red-black relaxation on the distributed memory machine.
Number of Processors | 4    | 9    | 16   | 25    | 36
Naive translation    | 4.06 | 7.75 | 9.47 | 12.18 | 17.04
Figure 7.4 shows the performance of the Laplace equation
using red-black relaxation on the distributed memory machine. Table
7.8 shows the corresponding speedup of the naive translation
for each number of processors over the performance with one processor.
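One way to read Table 7.8 is through parallel efficiency, defined here as the speedup divided by the number of processors. The following sketch computes it from the table entries; the class name ParallelEfficiency and the method efficiency are hypothetical, and the definition is the standard one rather than a measure used in the text.

```java
public class ParallelEfficiency {
    // Entries copied from Table 7.8.
    public static final int[] PROCESSORS = {4, 9, 16, 25, 36};
    public static final double[] SPEEDUP = {4.06, 7.75, 9.47, 12.18, 17.04};

    // Efficiency = speedup / number of processors, element by element.
    public static double[] efficiency(double[] speedup, int[] procs) {
        double[] e = new double[speedup.length];
        for (int i = 0; i < speedup.length; i++) {
            e[i] = speedup[i] / procs[i];
        }
        return e;
    }

    public static void main(String[] args) {
        double[] e = efficiency(SPEEDUP, PROCESSORS);
        for (int i = 0; i < e.length; i++) {
            System.out.printf("%2d processors: efficiency %.2f%n",
                              PROCESSORS[i], e[i]);
        }
    }
}
```

By this measure the efficiency is close to 1 on 4 processors and falls to a little under one half on 36 processors.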
Bryan Carpenter
2004-06-09