

Laplace Equation Using Red-Black Relaxation

In Figure 3.11, we introduced a run-time communication library call, Adlib.writeHalo(), which updates the cached values in the ghost regions with the proper element values from neighboring processes. With ghost regions, the inner loop of a stencil-update algorithm can be written simply, since the edges of a block need no special treatment when accessing neighboring elements: shifted indices locate the proper values cached in the ghost region. This is a very important feature in real codes. The main issue for this subsection is therefore to assess the performance bottleneck that run-time communication library calls introduce into an HPJava program. By benchmarking the Laplace equation both with and without Adlib.writeHalo(), we analyze the effect and latency of the library call in real codes.
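As a reminder of the kernel being timed, the following is a minimal HPJava-style sketch of red-black relaxation over a block-distributed array with ghost regions, along the lines of Figure 3.11. The grid size N, iteration count NITER, and process grid parameter P are illustrative placeholders and the initialization is omitted, so this is a sketch rather than the exact benchmark code.

  Procs2 p = new Procs2(P, P) ;
  on(p) {
    // Block-distributed ranges extended with ghost (halo) regions.
    Range x = new ExtBlockRange(N, p.dim(0)) ;
    Range y = new ExtBlockRange(N, p.dim(1)) ;

    double [[-,-]] u = new double [[x, y]] ;
    // ... initialize u, including boundary values ...

    for(int iter = 0 ; iter < NITER ; iter++) {
      for(int parity = 0 ; parity < 2 ; parity++) {
        // Refresh cached ghost-region values from neighboring processes.
        Adlib.writeHalo(u) ;

        // Update only the sites of the current parity ("red" or "black");
        // shifted subscripts like u[i - 1, j] read the local ghost region.
        overall(i = x for 1 : N - 2)
          overall(j = y for 1 + (i` + parity) % 2 : N - 2 : 2)
            u[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j] +
                              u[i, j - 1] + u[i, j + 1]) ;
      }
    }
  }

The measurements below are aimed precisely at the cost of the Adlib.writeHalo() call inside this doubly nested iteration.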
Figure 7.2: 512 $\times$ 512 Laplace Equation using Red-Black Relaxation without Adlib.writeHalo() on the shared memory machine
Figure 7.2 shows the performance of the Laplace equation using red-black relaxation without Adlib.writeHalo() on the shared memory machine. Again, we first compare the Java performance with the C performance on the shared memory machine: Java reaches 98% of the C performance, which is quite a satisfactory result for Java. Table 7.4 shows the speedups of the naive translation and HPJOPT2 over the sequential Java program, as well as the speedup of HPJOPT2 over the naive translation.


Table 7.4: Speedup of the naive translation over sequential Java and C programs for the Laplace equation using red-black relaxation without Adlib.writeHalo() on SMP.

Number of Processors               1     2     3     4     5     6     7     8
Naive translation over Java      0.43  0.86  1.29  1.73  2.25  2.73  3.27  3.60
HPJOPT2 over Java                0.59  1.15  1.73  2.35  2.95  3.60  4.29  4.87
HPJOPT2 over Naive translation   1.36  1.34  1.35  1.37  1.31  1.32  1.31  1.35

With 8 processors, the speedup of the naive translation over sequential Java reaches 360%, and the speedup of HPJOPT2 over sequential Java reaches 487%. The speedup of HPJOPT2 over the naive translation is up to 137%. Table 7.5 shows the speedup of the naive translation and of HPJOPT2 for each number of processors over the corresponding one-processor performance.
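For reference, these speedup figures can be read as ratios of wall-clock execution times; in our notation, writing $T(\cdot)$ for execution time and $P$ for the number of processors,

$\displaystyle S_{\rm Java}(P) = \frac{T_{\rm seq\ Java}}{T_{\rm HPJava}(P)} , \qquad S_{\rm self}(P) = \frac{T_{\rm HPJava}(1)}{T_{\rm HPJava}(P)} = \frac{S_{\rm Java}(P)}{S_{\rm Java}(1)} . $

For example, for the naive translation $S_{\rm Java}(8)/S_{\rm Java}(1) = 3.60/0.43 \approx 8.4$, which is consistent, up to rounding of the tabulated values, with the 8.28 entry of Table 7.5.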


Table 7.5: Speedup of the naive translation and HPJOPT2 for each number of processors over the performance with one processor for the Laplace equation using red-black relaxation without Adlib.writeHalo() on SMP.

Number of Processors    2     3     4     5     6     7     8
Naive translation     1.97  2.96  3.98  5.17  6.27  7.53  8.28
HPJOPT2               1.94  2.92  4.00  4.98  6.08  7.24  8.21

The naive translation achieves up to 828% speedup with 8 processors on the shared memory machine, and HPJOPT2 achieves up to 821%. Thus, if we could ignore Adlib.writeHalo(), the HPJava system would give a tremendous performance improvement on the shared memory machine, and the HPJOPT2 optimization scheme works quite well.
Figure 7.3: Laplace Equation using Red-Black Relaxation on the shared memory machine
Figure 7.3 shows the performance of the Laplace equation using red-black relaxation with Adlib.writeHalo() on the shared memory machine. Table 7.6 shows the speedups of the naive translation and HPJOPT2 over the sequential Java and C programs, as well as the speedup of HPJOPT2 over the naive translation.


Table 7.6: Speedup of the naive translation over sequential Java and C programs for the Laplace equation using red-black relaxation on SMP.

Number of Processors               1     2     3     4     5     6     7     8
Naive translation over Java      0.43  0.79  1.01  1.24  1.38  1.50  1.64  1.71
HPJOPT2 over Java                0.77  1.14  1.38  1.71  1.89  2.07  2.25  2.36
HPJOPT2 over Naive translation   1.77  1.45  1.37  1.38  1.37  1.38  1.37  1.38

With 8 processors, the speedup of the naive translation over sequential Java and C is up to 171%, and the speedup of HPJOPT2 is up to 236%. The speedup of HPJOPT2 over the naive translation is up to 177%. Table 7.7 shows the speedup of the naive translation and of HPJOPT2 for each number of processors over the corresponding one-processor performance.


Table 7.7: Speedup of the naive translation and HPJOPT2 for each number of processors over the performance with one processor for the Laplace equation using red-black relaxation on SMP.

Number of Processors    2     3     4     5     6     7     8
Naive translation     1.81  2.33  2.86  3.19  3.46  3.77  3.95
HPJOPT2               1.49  1.80  2.23  2.47  2.69  2.93  3.07

The naive translation achieves up to 395% speedup with 8 processors on the shared memory machine, and HPJOPT2 achieves up to 307%. Compared with the performance without Adlib.writeHalo(), the code with the library call is up to 200% slower. However, in both programs, with and without the library call, the optimization gains are quite consistent. This means the optimization strategies are not affected by the run-time communication library calls; the calls only slow down the running time of HPJava programs. Thus, when using run-time communication libraries, we need to design HPJava programs carefully, since the latency cannot simply be disregarded. We note that the shared memory version of Adlib was implemented very naively, by porting the transport layer, mpjdev, to exchange messages between Java threads. A more careful port for the shared memory version would probably give much better performance.
Figure 7.4: Laplace Equation using Red-Black Relaxation on the distributed memory machine


Table 7.8: Speedup of the naive translation for each number of processors over the performance with one processor for the Laplace equation using red-black relaxation on the distributed memory machine.

Number of Processors     4     9    16     25     36
Naive translation      4.06  7.75  9.47  12.18  17.04

Figure 7.4 shows the performance of the Laplace equation using red-black relaxation on the distributed memory machine. Table 7.8 shows the speedup of the naive translation for each number of processors over the one-processor performance on the same machine.