

Implementation of Collectives

In this section we discuss the Java implementation of the Adlib collective operations. For illustration we concentrate on the important Remap operation. Although it is a powerful and general operation, it is actually one of the simpler collectives to implement in the HPJava framework. General algorithms for this primitive have been described by other authors; for example, it is essentially equivalent to the operation called Regular_Section_Copy_Sched in [2]. Here we want to illustrate how this kind of operation can be implemented in terms of the particular Range and Group hierarchies of HPJava (complemented by a suitable set of messaging primitives).

All collective operations in the library are based on communication schedule objects. Each kind of operation has an associated class of schedules. Particular instances of these schedules, involving particular data arrays and other parameters, are created by the class constructors. Executing a schedule initiates the communications required to effect the operation. A single schedule may be executed many times, repeating the same communication pattern. In this way, especially for iterative programs, the cost of computations and negotiations involved in constructing a schedule can often be amortized over many executions. This paradigm was pioneered in the CHAOS/PARTI libraries [8]. If a communication pattern is to be executed only once, simple wrapper functions can be made available to construct a schedule, execute it, then destroy it. The overhead of creating the schedule is essentially unavoidable, because even in the single-use case individual data movements generally have to be sorted and aggregated for efficiency. The associated data structures are just those associated with schedule construction.

The constructor and public methods of the remap schedule for distributed arrays of float elements can be described as follows:

class RemapFloat extends Remap {

  public RemapFloat(float # dst, float # src) {...}

  public void execute() {...}
}
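As a simple illustration of the schedule-reuse paradigm described above, the following sketch constructs a remap schedule once and executes it repeatedly inside an iterative loop, so that the setup cost is amortized. The array names and iteration count are invented for illustration:

// Sketch only: `a' and `b' are distributed float arrays with
// (possibly) different distribution formats; NITER is hypothetical.
RemapFloat schedule = new RemapFloat(b, a) ;    // build once

for (int iter = 0 ; iter < NITER ; iter++) {
  // ... locally update the elements of `a' ...
  schedule.execute() ;    // repeat the same communication pattern
  // ... use the elements of `b' ...
}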

The remap schedule combines two functionalities: it reorganizes data in the way indicated by the distribution formats of the source and destination arrays, and, if the destination array has a replicated distribution format, it broadcasts data to all copies of the destination. Here we will concentrate on the former aspect, which is handled by an object of class RemapSkeleton contained in every Remap object.

During construction of a RemapSkeleton schedule, all send messages, receive messages, and internal copy operations implied by execution of the schedule are enumerated and stored in lightweight data structures. These messages have to be sorted before sending, for possible message agglomeration, and to ensure a deadlock-free communication schedule. These algorithms, and maintenance of the associated data structures, are dealt with in a base class of RemapSkeleton called BlockMessSchedule. The API for this superclass is outlined in Figure 11.

To set up such a low-level schedule, one makes a series of calls to sendReq and recvReq to define the required messages. Messages are characterized by an offset in some local array segment, and a set of strides and extents parameterizing a multi-dimensional patch of the (flat Java) array. Finally, the build() operation does any necessary processing of the message lists. The schedule is executed in a ``forward'' or ``backward'' direction by invoking gather() or scatter().
Figure 11: API of the class BlockMessSchedule
public abstract class BlockMessSchedule {

  ...

  void sendReq(int offset, int [] strs, int [] exts, int dstId) { ... }

  void recvReq(int offset, int [] strs, int [] exts, int srcId) { ... }

  void build() { ... }

  void gather() { ... }

  void scatter() { ... }

  ...
}
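To make the protocol concrete, here is a hedged sketch of how a higher-level schedule might drive this API. The variable names are invented, and the calls follow the signatures outlined in Figure 11:

// Sketch only: `sched' is some concrete subclass of BlockMessSchedule.
// Each message is a multi-dimensional patch of a flat local array,
// described by an offset plus per-dimension strides and extents.
sched.sendReq(sendOffset, sendStrs, sendExts, dstId) ;
sched.recvReq(recvOffset, recvStrs, recvExts, srcId) ;

sched.build() ;     // sort and aggregate the message lists

sched.gather() ;    // execute in the ``forward'' direction
// ... or sched.scatter() for the ``backward'' direction.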
The implementation of BlockMessSchedule will not be discussed in detail here, because it is not particularly specific to our HPJava system and the principles are fairly well known. However, we do wish to describe in a little more detail the implementation of the higher-level RemapSkeleton schedule on top of BlockMessSchedule. This provides some insight into the structure of HPJava distributed arrays, and the underlying role of the special Range and Group classes.

To produce an implementation of the RemapSkeleton class that works independently of the detailed distribution format of the arrays, we rely on virtual functions of the Range class to enumerate the blocks of index values held on each processor. These virtual functions, implemented differently for different distribution formats, encode all important information about those formats. To a large extent the communication code itself is distribution-format independent.
Figure 12: The HPJava Range hierarchy
[Image: range.eps]
The range hierarchy of HPJava is illustrated in Figure 12, and some of the relevant virtual functions are displayed in the API of Figure 13. The most relevant methods optionally take arguments that allow one to specify a contiguous or strided subrange of interest. The Triplet and Block classes represent simple struct-like objects holding a few int fields, describing respectively a ``triplet'' interval and the strided interval of ``global'' and ``local'' subscripts that the distribution format maps to a particular process. In the examples here, Triplet is used only to describe the range of process coordinates that a range or subrange is distributed over.
Figure 13: Partial API of the class Range
public abstract class Range {

  public int size() {...}
  ...
  public Triplet crds() {...}

  public Block localBlock() {...}
  public Block localBlock(int lo, int hi) {...}
  public Block localBlock(int lo, int hi, int stp) {...}

  public Block block(int crd) {...}
  public Block block(int crd, int lo, int hi) {...}
  public Block block(int crd, int lo, int hi, int stp) {...}

  ...
}
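As an informal illustration of how these inquiries fit together, the following hedged sketch enumerates all blocks of a range by iterating over the process coordinates returned by crds(). The field names on Triplet (lo, hi, stp) are assumptions made for illustration, since the text above only states that the class holds a few int fields:

// Sketch only: Triplet field names are assumed.
Triplet crds = rng.crds() ;            // process coords holding rng
for (int crd = crds.lo ; crd <= crds.hi ; crd += crds.stp) {
  Block b = rng.block(crd) ;           // subscripts held at coordinate crd
  // ... b parameterizes a strided patch of global subscripts, and the
  // corresponding local subscripts, on the process with coordinate crd ...
}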
Figure 14: sendLoop method for Remap
private void sendLoop(int offset, Group remGrp, int r) {
  if (r == rank) {
    // All dimensions processed: register the accumulated message.
    sendReq(offset, steps, exts, ...) ;      // destination given by remGrp
  } else {
    Block loc = src.rng(r).localBlock() ;    // locally held block of src
    Range rng = dst.rng(r) ;
    for (...) {      // process coords `crd' of rng overlapping `loc'
      ...            // accumulate stride and extent for dimension r
      sendLoop(offset + ...,
               remGrp.restrict(rng.dim(), crd),
               r + 1) ;
    }
  }
}
Now the RemapSkeleton communication schedule is built by two subroutines, called sendLoop and recvLoop, that enumerate the messages to be sent and received respectively. Figure 14 sketches the implementation of sendLoop. This is a recursive function: it implements a multidimensional loop over the rank dimensions of the arrays, and is initially called with r = 0. There is little point going into full detail of the algorithm here, but an important thing to note is how this function uses the virtual methods on the range objects of the source and destination arrays to enumerate the relevant blocks, local and remote, of the subranges, and thereby the messages that must be sent. Figure 15 illustrates the significance of some of the variables in the code. When the offset and all extents and strides of a particular message have been accumulated, the sendReq() method of the base class is invoked. The variables src and dst represent the distributed array arguments. The inquiries rng() and grp() extract the range and group objects of these arrays.
Figure 15: Illustration of sendLoop operation for remap
[Image: remap.eps]
Not all the schedules of Adlib are as ``pure'' as Remap. A few, like WriteHalo, have a built-in dependency on the distribution format of the arrays (the existence of ghost regions, in the case of WriteHalo). But all rely heavily on the methods and inquiries of the Range and Group classes, which abstract the distribution format of arrays. The specific API of these classes has evolved through C++ and Java versions of Adlib over a fairly lengthy period of development.

In the HPJava version currently under development, the lower-level, underlying schedules like BlockMessSchedule (which do not depend on higher-level ideas like distributed ranges and distributed arrays) are in turn implemented on top of a messaging API called mpjdev, described in the next section. To perform the actual communication and to prepare the data, they use mpjdev methods such as isend(), irecv(), read(), write(), strGather(), and strScatter(). The isend() and irecv() methods perform the actual communication. The write() and strGather() methods pack data, and read() and strScatter() unpack it; read() and write() deal with contiguous data, while strGather() and strScatter() deal with non-contiguous (strided) data.
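Schematically, and with the caveat that the exact mpjdev signatures are not reproduced here (the buffer and communicator objects, and all parameter lists, are assumptions for illustration), the sending side of a low-level schedule combines these methods as follows:

// Sketch only: not the exact mpjdev API.
// Pack a patch of the flat local array into a message buffer.
if (contiguous)
  buf.write(array, offset, count) ;            // contiguous section
else
  buf.strGather(array, offset, strs, exts) ;   // strided section

comm.isend(buf, dstId, tag) ;                  // non-blocking send

// The receiving side mirrors these steps: irecv() into a buffer,
// then read() or strScatter() to unpack into the destination patch.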