Direct Matrix Multiplication
Figure 6.1: Performance for Direct Matrix Multiplication in HPJOPT2, PRE, Naive HPJava, Java, and C on the Linux machine

Matrix multiplication is one of the most basic mathematical and
scientific algorithms. Several matrix multiplication algorithms have
been described in HPJava: general matrix multiplication
(Figure 3.15), pipelined matrix multiplication from the HPJava
manual [9], and direct matrix multiplication (Figure 3.13).
The ``direct'' matrix multiplication algorithm is relatively simple,
and potentially more efficient, because the operand arrays have
carefully chosen replicated/collapsed distributions. As illustrated in
Figure 3.14, the rows of a are replicated in the process dimension
associated with y, and similarly the columns of b are replicated in
the dimension associated with x. Hence all arguments for the inner
scalar product are already in place for the computation: no runtime
communication is needed.
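For reference, the kernel of the direct algorithm looks roughly as
follows. This is a sketch in HPJava syntax, reconstructed from the
description above and the general style of the HPJava manual [9],
not a copy of Figure 3.13; N is the matrix size, p is an assumed
P x P process grid, and initialization of a and b is omitted.

    Procs2 p = new Procs2(P, P);
    on(p) {
        Range x = new BlockRange(N, p.dim(0));
        Range y = new BlockRange(N, p.dim(1));

        // c is distributed over both dimensions of the process grid.
        float [[-,-]] c = new float [[x, y]];

        // The second dimension of a is collapsed (sequential), so whole
        // rows are held locally, replicated over the process dimension
        // associated with y.
        float [[-,*]] a = new float [[x, N]];

        // Likewise whole columns of b are replicated over the process
        // dimension associated with x.
        float [[*,-]] b = new float [[N, y]];

        // ... initialize a and b ...

        overall(i = x for :)
            overall(j = y for :) {
                float sum = 0;
                for(int k = 0; k < N; k++)
                    sum += a[i, k] * b[k, j];   // all operands already local
                c[i, j] = sum;
            }
    }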
Figure 6.1 shows the performance of the direct matrix
multiplication programs, in Mflops, for matrix sizes
50 x 50, 80 x 80, 100 x 100, 128 x 128, and
150 x 150, in HPJava, Java, and C on the Linux machine.
First, consider the Java performance in Figure 6.1.
Java achieves up to 74% of the C performance, 67% on
average. This means that, with a favorable choice
of matrix size, we can expect the Java implementation to be
competitive with the C implementation. Secondly, consider the HPJava
performance. The naive translation achieves up to 82% of the Java
performance, 77% on average, which is quite acceptable. An interesting
result, which we had expected, is the PRE performance relative to
Java: up to 122%, 112% on average. HPJava with
PRE optimization can match or exceed the performance of Java on one
processor. Thirdly, consider the speedup of PRE over the naive
translation: up to 150%, 140% on average. The
optimization gives a dramatic speedup to the HPJava translation.
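To give a flavor of what PRE buys here, consider the inner product in
the translated code. The sketch below is a simplified plain-Java
illustration, not actual compiler output; the dat/base/stride names
are hypothetical stand-ins for the subscripting machinery of the
translated distributed arrays.

    static float innerNaive(float[] aDat, float[] bDat,
                            int aBase, int aStride0,
                            int bBase, int bStride0,
                            int iSub, int jSub, int N) {
        float sum = 0;
        // Naive translation: the full offset expression for each
        // element access is re-evaluated on every iteration of k.
        for (int k = 0; k < N; k++)
            sum += aDat[aBase + iSub * aStride0 + k]
                 * bDat[bBase + k * bStride0 + jSub];
        return sum;
    }

    static float innerPRE(float[] aDat, float[] bDat,
                          int aBase, int aStride0,
                          int bBase, int bStride0,
                          int iSub, int jSub, int N) {
        // After PRE: the subexpressions invariant in k are computed
        // once outside the loop, so each access is much cheaper.
        int aOff = aBase + iSub * aStride0;
        int bOff = bBase + jSub;
        float sum = 0;
        for (int k = 0; k < N; k++)
            sum += aDat[aOff + k] * bDat[bOff + k * bStride0];
        return sum;
    }

In the naive form the offset expressions are rebuilt on every
iteration of the innermost loop; after PRE the k-invariant parts are
computed once, which plausibly accounts for much of the speedup
reported above.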
Figure 6.1 also shows that HPJOPT2 has no advantage over simple PRE
here. The reason is that the innermost loop of the direct matrix
multiplication algorithm in HPJava is a ``for'' loop, i.e. a
``sequential'' loop, as in Figure 5.5. This means the HPJOPT2 scheme
has nothing to optimize (e.g. no control variables to hoist to the
outermost overall construct) for this algorithm.
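For contrast, a schematic fragment (hypothetical, not from the thesis)
of a loop nest where HPJOPT2 would go beyond PRE: here the innermost
loop is itself an overall, so its translated control variables can be
hoisted out of the enclosing construct. In the direct multiplication
kernel sketched earlier, the innermost loop is a plain for, so no such
hoisting applies.

    // Innermost loop is an "overall": the naive translation recomputes
    // its control variables (local subscript, global index) on every
    // iteration of i, and HPJOPT2 can hoist that work out of the i loop.
    overall(i = x for :)
        overall(j = y for :)
            c[i, j] = a[i, j] + b[i, j];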