

Direct Matrix Multiplication

Figure 6.1: Performance of Direct Matrix Multiplication in HPJOPT2, PRE, Naive HPJava, Java, and C on the Linux machine
Matrix multiplication is one of the most basic mathematical and scientific algorithms. Several matrix multiplication algorithms have been described in HPJava: general matrix multiplication in Figure 3.15, pipelined matrix multiplication from the HPJava manual [9], and direct matrix multiplication in Figure 3.13. The ``direct'' matrix multiplication algorithm is relatively simple and potentially more efficient, because the operand arrays have carefully chosen replicated/collapsed distributions. As illustrated in Figure 3.14, the rows of a are replicated in the process dimension associated with y, and the columns of b are replicated in the dimension associated with x. Hence all arguments for the inner scalar product are already in place for the computation; no run-time communication is needed. (A sketch of this code structure appears at the end of this section.)

Figure 6.1 shows the performance of the direct matrix multiplication programs, in Mflops, for matrix sizes $50 \times 50$, $80 \times 80$, $100 \times 100$, $128 \times 128$, and $150 \times 150$, in HPJava, Java, and C on the Linux machine. First, consider the Java performance in Figure 6.1. Java reaches up to 74% of the C performance, with an average of 67%. This means that, with a favorable choice of matrix size, we can expect the Java implementation to be competitive with the C implementation.

Second, consider the HPJava performance. The naive translation reaches up to 82% of the Java performance, with an average of 77%, which is quite acceptable. An interesting result, and one we expected, is the PRE performance relative to Java: up to 122%, with an average of 112%. With PRE optimization, HPJava can match or exceed the Java performance on one processor.

Third, consider the speedup of PRE over the naive translation: up to 150%, with an average of 140%. The optimization gives a dramatic speedup to the HPJava translation.

Finally, Figure 6.1 shows that HPJOPT2 has no advantage over simple PRE here. The reason is that the innermost loop of the direct matrix multiplication algorithm in HPJava is a ``for'' loop, i.e. a sequential loop, as shown in Figure 5.5. Consequently the HPJOPT2 scheme has nothing left to optimize for this algorithm (for example, no control variables to hoist out to the outermost overall construct).
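As a point of reference, the sketch below shows the general shape of the direct algorithm in HPJava. It is a minimal reconstruction from the description above, not a verbatim copy of Figure 3.13: the grid size P and the matrix size N are assumed to be defined elsewhere, the initialization of a and b is elided, and the surrounding class and imports for the HPJava runtime are omitted. The code requires the HPJava translator; it is not plain Java.

    // Direct matrix multiplication on a P x P process grid.
    Procs2 p = new Procs2(P, P);
    on(p) {
        Range x = new BlockRange(N, p.dim(0));  // block-distributed over grid dim 0
        Range y = new BlockRange(N, p.dim(1));  // block-distributed over grid dim 1

        // c is distributed over both process dimensions.  a is distributed
        // over x only, so its rows are replicated across the dimension
        // associated with y; b is distributed over y only, so its columns
        // are replicated across the dimension associated with x.  The '*'
        // dimensions are sequential (collapsed).
        float [[-,-]] c = new float [[x, y]];
        float [[-,*]] a = new float [[x, N]];
        float [[*,-]] b = new float [[N, y]];

        // ... initialize a and b ...

        overall(i = x for :)
            overall(j = y for :) {
                float sum = 0;
                for(int k = 0; k < N; k++)      // sequential innermost loop
                    sum += a[i, k] * b[k, j];
                c[i, j] = sum;
            }
    }

Because every operand of the inner product is locally available under these distributions, the loop nest runs without communication; and because the innermost k loop is an ordinary ``for'' loop rather than an overall, there are no control variables for HPJOPT2 to hoist beyond what PRE already eliminates.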