

Communication Evaluation

In this section we timed each part of an Adlib communication call, both to compare the underlying communication latency with C/MPI and to find the most time-consuming part of the operation. This data can be used for further optimization of the Adlib communication calls.

We divided HPJava communication into two parts: the actual communication calls, such as isend(), irecv(), and iwaitany(), and communication preparation. The preparation part includes the work done by high-level Adlib collective communication schedules such as remap() and writeHalo(), and the message packing and unpacking in mpjdev. We measured the timing of the communication and preparation parts in a Laplace equation solver on 9 processors, and also measured a sendrecv() call in C/MPI with a 1368-byte message on 2 processors (Table 6.9).
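Per-iteration latencies like those in Table 6.9 can be collected by accumulating the time spent in each part over the solver's main loop and dividing by the iteration count. The fragment below is only a minimal sketch of that style of measurement, with a hypothetical Step callback standing in for one communication phase; the actual instrumentation separating preparation from communication was placed inside the Adlib and mpjdev implementations themselves.

// A minimal per-iteration timing harness (a sketch, not the actual
// instrumentation behind Table 6.9).  The communication step is passed
// in as a callback so the same harness can wrap Adlib.writeHalo(),
// remap(), or the low-level mpjdev calls.
public class CommTimer {

    /** Hypothetical stand-in for one communication step of the solver. */
    public interface Step {
        void run();
    }

    /** Average latency of one step in microseconds over many iterations. */
    public static double microsPerIteration(Step step, int iterations) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            step.run();                         // e.g. Adlib.writeHalo(a)
        }
        long elapsed = System.currentTimeMillis() - start;  // milliseconds
        return (elapsed * 1000.0) / iterations;             // microseconds
    }
}

Averaging over many iterations also absorbs the coarse (millisecond) resolution of the timer; wrapping the writeHalo() call of the Laplace solver in such a callback would give a figure comparable to the 300 microsecond total reported in Table 6.9.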

Table 6.9: Per-iteration latency of the Laplace equation communication library and of a C/MPI sendrecv() call, in microseconds.
Adlib.writeHalo() Preparation Communication C/MPI sendrecv() (1368 bytes)  
300.00 100.00 200.00 44.97  


We used 9 processors for this measurement because it is the smallest number of processors that exhibits the most expensive communication pattern. Figure 6.14 illustrates the communication pattern of the Adlib.writeHalo() method on 9 processors arranged as a 3 by 3 process grid. The figure shows that processor 4 carries the heaviest communication load, with 4 pairs of sends and receives. Since the problem size is $ 512^2$, each send and receive on processor 4 carries 171 double values (1368 bytes).
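The 1368-byte message size follows directly from the block decomposition, assuming a ghost region one element deep so that each message carries one block edge, with a double occupying 8 bytes:

\begin{displaymath}
\left\lceil 512/3 \right\rceil = 171 \mbox{ doubles per edge}, \qquad
171 \times 8 \mbox{ bytes} = 1368 \mbox{ bytes}.
\end{displaymath}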

According to Table 6.9 the communication in writeHalo() takes about 200 microseconds, while 4 pairs of sends and receives take about 180 microseconds with C/MPI. Our communication library therefore performs very close to the C/MPI version, with only marginal overhead. This overhead comes from the language binding and extra work such as finding the Java class and storing communication result values during the JNI call. As Table 6.9 also shows, one third (100 microseconds) of the total writeHalo() time (300 microseconds) is consumed by the preparation of communication. This is a little high, but it is not bad performance for an initial implementation, and useful optimization can be done on this part in the future.
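Concretely, the "marginal overhead" claim can be read off the table entries for processor 4:

\begin{displaymath}
4 \times 44.97 \approx 180 \mbox{ $\mu$s (C/MPI)}, \qquad
200 - 180 \approx 20 \mbox{ $\mu$s of binding overhead}.
\end{displaymath}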

Figure 6.14: writeHalo() communication patterns on 9 processors.
\includegraphics[width=3.5in,height=2.5in]{Figs/wh9.eps}

We see similar behavior in the CFD benchmark (Table 6.10). This run uses 9 processors with a 9 by 1 process grid and a problem size of 97 by 25. In this process grid two sends and receives occur on each processor, except on the first and last processors, where only one send and receive happens. The communication overhead relative to C/MPI is about 20 microseconds, about the same as in the previous case. Roughly one quarter of the total time is spent on preparation; this reduction in preparation time is due to the smaller problem size.

Table 6.10: Per-iteration latency of the CFD communication library and of a C/MPI sendrecv() call, in microseconds.
Adlib.writeHalo() Preparation Communication C/MPI sendrecv() (1728 bytes)  
181.57 42.10 139.47 60.00  
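The same arithmetic applied to an interior processor of the CFD run (two sendrecv() exchanges of 1728 bytes each) gives essentially the same overhead as before:

\begin{displaymath}
2 \times 60.00 = 120 \mbox{ $\mu$s (C/MPI)}, \qquad
139.47 - 120 \approx 20 \mbox{ $\mu$s of overhead}.
\end{displaymath}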


The above data indicate that further optimization is needed in the preparation part of Adlib. In the future we may also adopt a platform-specific communication library (for example, LAPI on AIX) instead of MPI to reduce the actual communication latency, as discussed in section 5.5.3.

