We divide HPJava communication into two parts: the actual communication calls,
such as isend(), irecv(), and iwaitany(), and the
communication preparation.
The preparation part includes the high-level Adlib collective communication
schedules, such as those of the methods remap() and writeHalo(), and the
message packing and unpacking in mpjdev.
We measured the time spent in the communication and preparation parts
of a Laplace equation solver on 9 processors, and
also measured the sendrecv() function call in C/MPI
with 1368 bytes on 2 processors (Table 6.9).
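The split between preparation and communication can be illustrated with a small timing harness. The following is only a minimal sketch under stated assumptions: PhaseTimer is a hypothetical helper that is not part of HPJava, Adlib, or mpjdev, the packing step is imitated with a java.nio.ByteBuffer, and the communication step is a placeholder array copy rather than a real isend()/irecv() pair.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: accumulates wall-clock time for two phases. */
final class PhaseTimer {
    private long preparationNanos = 0;
    private long communicationNanos = 0;

    /** Runs the given work and adds its elapsed time to the preparation total. */
    void timePreparation(Runnable work) {
        long start = System.nanoTime();
        work.run();
        preparationNanos += System.nanoTime() - start;
    }

    /** Runs the given work and adds its elapsed time to the communication total. */
    void timeCommunication(Runnable work) {
        long start = System.nanoTime();
        work.run();
        communicationNanos += System.nanoTime() - start;
    }

    double preparationMicros() { return preparationNanos / 1_000.0; }

    double communicationMicros() { return communicationNanos / 1_000.0; }

    /** Fraction of the total measured time that was spent on preparation. */
    double preparationFraction() {
        long total = preparationNanos + communicationNanos;
        return total == 0 ? 0.0 : (double) preparationNanos / total;
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        double[] halo = new double[171];  // 171 doubles = 1368 bytes
        ByteBuffer sendBuffer = ByteBuffer.allocate(halo.length * Double.BYTES);
        byte[] receiveBuffer = new byte[sendBuffer.capacity()];

        for (int i = 0; i < 1000; i++) {
            // Stand-in for message packing (assumption, not mpjdev code).
            timer.timePreparation(() -> sendBuffer.asDoubleBuffer().put(halo));
            // Stand-in for the actual send/receive (assumption, not MPI).
            timer.timeCommunication(() -> System.arraycopy(
                    sendBuffer.array(), 0, receiveBuffer, 0, receiveBuffer.length));
        }

        System.out.printf("preparation:   %.1f us%n", timer.preparationMicros());
        System.out.printf("communication: %.1f us%n", timer.communicationMicros());
        System.out.printf("preparation fraction: %.2f%n", timer.preparationFraction());
    }
}
```

The numbers printed by this sketch depend on the machine and the JVM and say nothing about the measurements reported here; it only shows how the two phases can be timed separately.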
We use 9 processors to measure timing because this is the smallest number of
processors that exhibits the most time-consuming communication pattern.
Figure 6.14 illustrates the communication pattern of the
Adlib.writeHalo() method among 9 processors arranged in a 3 by 3 process grid.
The figure shows that processor 4 is the most
communication-intensive processor, with 4 pairs of send and receive operations.
Since the size of the problem is
, processor 4 performed each send
and receive with 171 double values (1368 bytes).
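The neighbour count and message size quoted above can be checked with a short sketch. It assumes that ranks are numbered from 0 in row-major order, that the grid has no wraparound, and that only edge-adjacent neighbours exchange data; HaloPattern and its methods are hypothetical names, not part of Adlib.

```java
/** Hypothetical helper for reasoning about halo exchanges on a process grid. */
final class HaloPattern {

    /**
     * Number of edge-adjacent neighbours of a process in a rows x cols grid.
     * Assumes row-major rank numbering from 0 and non-periodic boundaries.
     */
    static int neighbourCount(int rows, int cols, int rank) {
        int row = rank / cols;
        int col = rank % cols;
        int count = 0;
        if (row > 0) count++;         // neighbour above
        if (row < rows - 1) count++;  // neighbour below
        if (col > 0) count++;         // neighbour to the left
        if (col < cols - 1) count++;  // neighbour to the right
        return count;
    }

    /** Size in bytes of a message carrying the given number of doubles. */
    static int messageBytes(int doubles) {
        return doubles * Double.BYTES;
    }

    public static void main(String[] args) {
        int rows = 3, cols = 3;
        for (int rank = 0; rank < rows * cols; rank++) {
            System.out.println("processor " + rank + ": "
                    + neighbourCount(rows, cols, rank) + " send/receive pairs");
        }
        System.out.println("171 doubles = " + messageBytes(171) + " bytes");
    }
}
```

Under these assumptions only the centre of the 3 by 3 grid has four neighbours, which is why processor 4 determines the cost of the exchange.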
According to Table 6.9, the communication in writeHalo() takes about 200 microseconds, whereas 4 pairs of send and receive operations take about 180 microseconds with C/MPI. Our communication library therefore performs very close to the C/MPI version, with only marginal overhead. This overhead is due to the language binding and to extra work during the JNI call, such as looking up the Java class and storing the communication result values. As Table 6.9 shows, one third (100 microseconds) of the total writeHalo() time (300 microseconds) is consumed by communication preparation. This is somewhat high, but it is acceptable for an initial implementation, and this part can be optimized in the future.
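The arithmetic behind these statements can be written out explicitly. The sketch below uses only the rounded figures quoted in the text (200, 180, 100, and 300 microseconds); WriteHaloBreakdown is a hypothetical name, and the relative overhead it prints is derived from those rounded values rather than taken from Table 6.9.

```java
/** Hypothetical helper: derives overhead figures from the rounded timings in the text. */
final class WriteHaloBreakdown {

    /** Absolute overhead of a measured time over a reference time, in microseconds. */
    static double overheadMicros(double measured, double reference) {
        return measured - reference;
    }

    /** Share of a total time taken by one part, as a value between 0 and 1. */
    static double fraction(double part, double total) {
        return part / total;
    }

    public static void main(String[] args) {
        double hpjavaCommunication = 200.0;  // communication in writeHalo()
        double cMpiCommunication = 180.0;    // 4 send/receive pairs in C/MPI
        double preparation = 100.0;          // preparation in writeHalo()
        double total = 300.0;                // total writeHalo() time

        double overhead = overheadMicros(hpjavaCommunication, cMpiCommunication);
        System.out.printf("overhead over C/MPI: %.0f us (%.0f%% of the C/MPI time)%n",
                overhead, 100.0 * fraction(overhead, cMpiCommunication));
        System.out.printf("preparation share of writeHalo(): %.0f%%%n",
                100.0 * fraction(preparation, total));
    }
}
```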
We see similar behavior in the CFD benchmark (Table 6.10).
This benchmark was run on 9 processors with a 9 by 1 process grid and
a problem size of 97 by 25.
In this process grid, two send and receive communications occur on each
processor, except for the first and last processors, where only one send and
receive occurs. We have about 20 microseconds of communication latency, which is
about the same as in the previous case. About one fourth of the total time is
spent on preparation.
This reduction in preparation time is due to the smaller problem size.
The above data indicate that further optimization is needed in the preparation part of Adlib. In the future, we may also adopt a platform-specific communication library (for example, LAPI on AIX) instead of MPI to reduce the actual communication latency, as discussed in section 5.5.3.