
Implementation of Collectives

In this section we discuss the Java implementation of the Adlib collective operations. For illustration we concentrate on the important Remap operation. Although it is a powerful and general operation, it is actually one of the simpler collectives to implement in the HPJava framework.

General algorithms for this primitive have been described by other authors in the past. For example, it is essentially equivalent to the operation called Regular_Section_Copy_Sched in [6]. In this section we want to illustrate how this kind of operation can be implemented in terms of the particular Range and Group classes of HPJava, complemented by a suitable set of messaging primitives.

All collective operations in the library are based on communication schedule objects. Each kind of operation has an associated class of schedules. Particular instances of these schedules, involving particular data arrays and other parameters, are created by the class constructors. Executing a schedule initiates the communications required to effect the operation. A single schedule may be executed many times, repeating the same communication pattern. In this way, especially for iterative programs, the cost of computations and negotiations involved in constructing a schedule can often be amortized over many executions. This pattern was pioneered in the CHAOS/PARTI libraries [21]. If a communication pattern is to be executed only once, simple wrapper functions are made available to construct a schedule, execute it, then destroy it. The overhead of creating the schedule is essentially unavoidable, because even in the single-use case individual data movements generally have to be sorted and aggregated, for efficiency. The data structures for this are just those associated with schedule construction.
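
The construct-once, execute-many pattern described above can be illustrated with a minimal single-process sketch. The class ReverseCopySchedule and its methods are hypothetical and are not part of Adlib; the sketch only assumes that the index computation done in the constructor stands in for the more expensive sorting and negotiation of a real schedule.

```java
import java.util.Arrays;

public class ScheduleReuseDemo {

    /** Hypothetical schedule: the index mapping is computed once, in the constructor. */
    static final class ReverseCopySchedule {
        private final float[] dst;
        private final float[] src;
        private final int[] srcIndex;   // precomputed source position for each destination element

        ReverseCopySchedule(float[] dst, float[] src) {
            if (dst.length != src.length) {
                throw new IllegalArgumentException("length mismatch");
            }
            this.dst = dst;
            this.src = src;
            this.srcIndex = new int[dst.length];
            for (int i = 0; i < dst.length; i++) {
                srcIndex[i] = src.length - 1 - i;
            }
        }

        /** Repeats the same data movement; no index computation happens here. */
        void execute() {
            for (int i = 0; i < dst.length; i++) {
                dst[i] = src[srcIndex[i]];
            }
        }
    }

    /** One-shot wrapper (also hypothetical): construct, execute, discard. */
    static void reverseCopy(float[] dst, float[] src) {
        new ReverseCopySchedule(dst, src).execute();
    }

    public static void main(String[] args) {
        float[] src = {1f, 2f, 3f};
        float[] dst = new float[3];
        ReverseCopySchedule schedule = new ReverseCopySchedule(dst, src);
        for (int iter = 0; iter < 3; iter++) {
            schedule.execute();   // same pattern, current contents of src
            src[0] += 10f;        // the data changes between iterations
        }
        System.out.println(Arrays.toString(dst));
    }
}
```

In an iterative program the constructor cost is paid once, outside the loop, while execute() is called on every iteration.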

The constructor and public method of the remap schedule for distributed arrays of float elements can be summarized as follows:

class Remap...  ...

The # notation was explained in section 3.3.

The remap schedule combines two functionalities. First, it reorganizes data in the way indicated by the distribution formats of the source and destination arrays. Second, if the destination array has a replicated distribution format, it broadcasts data to all copies of the destination. Here we will concentrate on the former aspect, which is handled by an object of class RemapSkeleton contained in every Remap object.

During construction of a RemapSkeleton schedule, all send messages, receive messages, and internal copy operations implied by execution of the schedule are enumerated and stored in lightweight data structures. These messages have to be sorted before sending, both to allow message agglomeration and to ensure a deadlock-free communication schedule. These algorithms, and maintenance of the associated data structures, are dealt with in a base class of RemapSkeleton called BlockMessSchedule. The API for the superclass is outlined in Figure 4.1. To set up such a low-level schedule, one makes a series of calls to sendReq and recvReq to define the required messages. Messages are characterized by an offset in some local array segment, and a set of strides and extents parameterizing a multi-dimensional patch of the (flat Java) array. Finally the build() operation does any necessary processing of the message lists. The schedule is executed in a "forward" or "backward" direction by invoking gather() or scatter().
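
The record-then-build idea can be sketched in a few lines. This is not the BlockMessSchedule API: the class MessageListDemo and the methods addSend(), peers() and aggregatedLength() are hypothetical, and the sketch assumes that grouping requests by peer rank is an adequate stand-in for the sorting and agglomeration step. It does not attempt to show how deadlock freedom is achieved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class MessageListDemo {

    /** One requested transfer: 'length' elements at 'offset' in the local segment, for process 'peer'. */
    static final class Request {
        final int peer;
        final int offset;
        final int length;

        Request(int peer, int offset, int length) {
            this.peer = peer;
            this.offset = offset;
            this.length = length;
        }
    }

    private final List<Request> pending = new ArrayList<>();
    private final SortedMap<Integer, List<Request>> byPeer = new TreeMap<>();

    /** Records a request; nothing is sorted or sent yet. */
    void addSend(int peer, int offset, int length) {
        pending.add(new Request(peer, offset, length));
    }

    /** Groups the recorded requests by peer, in increasing peer order. */
    void build() {
        byPeer.clear();
        for (Request r : pending) {
            byPeer.computeIfAbsent(r.peer, k -> new ArrayList<>()).add(r);
        }
    }

    /** Peers that would each receive one aggregated message, in sorted order. */
    List<Integer> peers() {
        return new ArrayList<>(byPeer.keySet());
    }

    /** Number of elements in the single aggregated message for 'peer'. */
    int aggregatedLength(int peer) {
        List<Request> list = byPeer.get(peer);
        if (list == null) {
            return 0;
        }
        int total = 0;
        for (Request r : list) {
            total += r.length;
        }
        return total;
    }

    public static void main(String[] args) {
        MessageListDemo s = new MessageListDemo();
        s.addSend(2, 0, 4);
        s.addSend(0, 4, 2);
        s.addSend(2, 8, 4);   // second request for peer 2: merged into the same message
        s.build();
        System.out.println(s.peers() + " " + s.aggregatedLength(2));
    }
}
```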

Figure 4.1: API of the class BlockMessSchedule
public abstract class BlockMessSchedule {...
...) { ... }
void scatter() { ... }...

In general, top-level schedules such as Remap, which deal explicitly with distributed arrays, are implemented in terms of lower-level schedules such as BlockMessSchedule that simply operate on blocks and words of data. These lower-level schedules do not directly depend on the Range and Group classes. The lower-level schedules are tabulated in Table 4.1.

Table 4.1: Low-level Adlib schedules
                   Operations on "words"    Operations on "blocks"
Point-to-point     MessSchedule             BlockMessSchedule
Remote access      DataSchedule             BlockDataSchedule
Tree operations    TreeSchedule             BlockTreeSchedule
                   RedxSchedule             BlockRedxSchedule
                   Redx2Schedule            BlockRedx2Schedule

Here "words" means contiguous memory blocks of constant (for a given schedule instance) size. "Blocks" means multidimensional (r-dimensional) local array sections, parameterized by a vector of r extents and a vector of memory strides. The point-to-point schedules are used to implement collective operations that are deterministic in the sense that both sender and receiver have advance knowledge of all required communications. Hence Remap and other regular communications such as Shift are implemented on top of BlockMessSchedule. The "remote access" schedules are used to implement operations where one side must inform the other end that a communication is needed. These negotiations occur at schedule-construction time. Irregular communication operations such as collective Gather and Scatter are implemented on these schedules. The tree schedules are used for various sorts of broadcast, multicast, synchronization, and reduction.

We will describe in more detail the implementation of the higher-level RemapSkeleton schedule on top of BlockMessSchedule. This provides some insight into the structure of HPJava distributed arrays, and the underlying role of the special Range and Group classes.

To produce an implementation of the RemapSkeleton class that works independently of the detailed distribution format of the arrays, we rely on virtual functions of the Range class to enumerate the blocks of index values held on each processor. These virtual functions, implemented differently for different distribution formats, encode all important information about those formats. To a large extent the communication code itself is independent of the distribution format.

The range hierarchy of HPJava was illustrated in Figure 3.2, and some of the relevant virtual functions are displayed in the API of Figure 4.2. Most methods optionally take arguments that allow one to specify a contiguous or strided subrange of interest. The Triplet and Block instances represent simple struct-like objects holding a few int fields. These integer fields describe, respectively, a "triplet" interval, and the strided interval of "global" and "local" subscripts that the distribution format maps to a particular process. In the examples here Triplet is used only to describe a range of process coordinates that a range or subrange is distributed over.
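
To make the idea of a struct-like Block concrete, the following sketch computes the subscripts that a block and a cyclic distribution format would assign to one process coordinate. Everything here is an assumption for illustration: the class BlockRangeDemo, the field names of Block, and the methods blockOf() and cyclicOf() are hypothetical and are not taken from the HPJava Range API.

```java
public class BlockRangeDemo {

    /** Hypothetical struct-like record of the index values held by one process. */
    static final class Block {
        final int count;      // number of index values held
        final int globBase;   // first global subscript
        final int globStep;   // step between consecutive global subscripts
        final int locBase;    // first local subscript
        final int locStep;    // step between consecutive local subscripts

        Block(int count, int globBase, int globStep, int locBase, int locStep) {
            this.count = count;
            this.globBase = globBase;
            this.globStep = globStep;
            this.locBase = locBase;
            this.locStep = locStep;
        }
    }

    /** Block distribution of n index values over p processes: contiguous chunks. */
    static Block blockOf(int n, int p, int crd) {
        int blockSize = (n + p - 1) / p;                 // ceiling division
        int lo = crd * blockSize;
        int count = Math.max(0, Math.min(n, lo + blockSize) - lo);
        return new Block(count, lo, 1, 0, 1);
    }

    /** Cyclic distribution: process crd holds global subscripts crd, crd + p, crd + 2p, ... */
    static Block cyclicOf(int n, int p, int crd) {
        int count = crd < n ? (n - crd + p - 1) / p : 0;
        return new Block(count, crd, p, 0, 1);
    }

    public static void main(String[] args) {
        int n = 10, p = 4;
        for (int crd = 0; crd < p; crd++) {
            Block b = blockOf(n, p, crd);
            Block c = cyclicOf(n, p, crd);
            System.out.println("crd " + crd
                    + ": block holds " + b.count + " from " + b.globBase
                    + ", cyclic holds " + c.count + " from " + c.globBase
                    + " step " + c.globStep);
        }
    }
}
```

Code that only consumes Block values, as the communication code does, never needs to know which of the two formats produced them.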

Figure 4.2: Partial API of the class Range
public abstract class Range {
public int...
...d, int lo, int hi, int stp) {...}
. . .

Figure 4.3: sendLoop method for Remap
private void sendLoop(int offset, Group r...
...strict(rng.dim(), crd),
r + 1) ;

The RemapSkeleton communication schedule is built by two methods, sendLoop and recvLoop, that enumerate the messages to be sent and received respectively. Figure 4.3 sketches the implementation of sendLoop. This is a recursive function that implements a multidimensional loop over the rank dimensions of the arrays. It is initially called with r = 0. An important thing to note is how this function uses the virtual methods on the range objects of the source and destination arrays to enumerate blocks, both local and remote, of relevant subranges, and so enumerates the messages that must be sent. Figure 4.4 illustrates the significance of some of the variables in the code. When the offset and all extents and strides of a particular message have been accumulated, the sendReq() method of the base class is invoked. The variables src and dst represent the distributed array arguments. The inquiries rng() and grp() extract the range and group objects of these arrays.
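
The recursive structure can be sketched in a simplified, self-contained form. This is not the code of Figure 4.3: it assumes that every dimension of both arrays uses a plain block distribution, it replaces the Range and Group objects with integer arrays, and it collects messages in a list instead of calling sendReq(). The class SendLoopDemo, the Message record and all field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SendLoopDemo {

    /** One message: destination process coordinates plus the local patch to send. */
    static final class Message {
        final int[] dstCoord;
        final int offset;
        final int[] extents;
        final int[] strides;

        Message(int[] dstCoord, int offset, int[] extents, int[] strides) {
            this.dstCoord = dstCoord;
            this.offset = offset;
            this.extents = extents;
            this.strides = strides;
        }
    }

    private final int[] n;             // global extent of each dimension
    private final int[] srcProcs;      // shape of the source process grid
    private final int[] dstProcs;      // shape of the destination process grid
    private final int[] srcCoord;      // coordinates of this (sending) process
    private final int[] localStrides;  // row-major strides of the local segment
    final List<Message> messages = new ArrayList<>();

    static int blockSize(int extent, int procs) {
        return (extent + procs - 1) / procs;
    }

    SendLoopDemo(int[] n, int[] srcProcs, int[] dstProcs, int[] srcCoord) {
        this.n = n;
        this.srcProcs = srcProcs;
        this.dstProcs = dstProcs;
        this.srcCoord = srcCoord;
        int rank = n.length;
        localStrides = new int[rank];
        int stride = 1;
        for (int d = rank - 1; d >= 0; d--) {
            localStrides[d] = stride;
            stride *= blockSize(n[d], srcProcs[d]);
        }
        sendLoop(0, 0, new int[rank], new int[rank]);   // start at dimension r = 0
    }

    /** Dimension r of a multidimensional loop over overlapping blocks. */
    private void sendLoop(int r, int offset, int[] dstCoord, int[] extents) {
        if (r == n.length) {
            // Offset and all extents have been accumulated: record the message.
            messages.add(new Message(dstCoord.clone(), offset,
                                     extents.clone(), localStrides.clone()));
            return;
        }
        int sb = blockSize(n[r], srcProcs[r]);
        int srcLo = srcCoord[r] * sb;
        int srcHi = Math.min(n[r], srcLo + sb);          // exclusive upper bound
        int db = blockSize(n[r], dstProcs[r]);
        for (int crd = 0; crd < dstProcs[r]; crd++) {
            // Intersect the locally held block with the block held by destination crd.
            int lo = Math.max(srcLo, crd * db);
            int hi = Math.min(srcHi, Math.min(n[r], (crd + 1) * db));
            if (lo >= hi) {
                continue;                                 // no overlap in this dimension
            }
            dstCoord[r] = crd;
            extents[r] = hi - lo;
            sendLoop(r + 1, offset + (lo - srcLo) * localStrides[r], dstCoord, extents);
        }
    }

    public static void main(String[] args) {
        // 8 elements: source grid of 2 processes, destination grid of 4; we are source process 1.
        SendLoopDemo s = new SendLoopDemo(new int[]{8}, new int[]{2}, new int[]{4}, new int[]{1});
        for (Message m : s.messages) {
            System.out.println("to " + m.dstCoord[0] + " offset " + m.offset
                    + " extent " + m.extents[0]);
        }
    }
}
```

A matching receive-side loop would swap the roles of the two grids, so that each pair of processes derives the same set of patches without negotiating.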

Figure 4.4: Illustration of sendLoop operation for remap

Not all the schedules of Adlib are as "pure" as Remap. A few, like WriteHalo, have a built-in dependency on the distribution format of the arrays (the existence of ghost regions in the case of WriteHalo). But they all rely heavily on the methods and inquiries of the Range and Group classes, which abstract the distribution format of arrays. The API of these classes has evolved through C++ and Java versions of Adlib over a long period.

In the HPJava version, the lower-level, underlying schedules like BlockMessSchedule (which are not dependent on higher-level ideas like distributed ranges and distributed arrays) are in turn implemented on top of a messaging API called mpjdev, described in section 5.1. To prepare the data and to perform the actual communication, these schedules use mpjdev methods such as read(), write(), strGather(), strScatter(), isend(), and irecv().

The write() and strGather() methods pack data into a buffer, while read() and strScatter() unpack it. Two of these methods, read() and write(), deal with contiguous data; the other two, strGather() and strScatter(), deal with non-contiguous data. strGather() writes a section to the buffer from a multi-dimensional, strided patch of the source array. strScatter() does the opposite: it reads a section from the buffer into a multi-dimensional, strided patch of the destination array. The isend() and irecv() methods perform the actual communication.
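
The packing and unpacking of a strided patch can be sketched as follows. These are not the mpjdev methods: the static methods gather() and scatter() below are hypothetical stand-ins operating on a plain float[] buffer, and their signatures are assumptions chosen for the example. The patch is described, as in the text, by an offset, a vector of extents and a vector of strides in a flat Java array.

```java
import java.util.Arrays;

public class StridedPackDemo {

    /**
     * Copies a strided patch of src into buf starting at position pos.
     * Returns the next free position in buf.
     */
    static int gather(float[] buf, int pos, float[] src, int offset,
                      int[] extents, int[] strides, int dim) {
        if (dim == extents.length) {
            buf[pos] = src[offset];
            return pos + 1;
        }
        for (int i = 0; i < extents[dim]; i++) {
            pos = gather(buf, pos, src, offset + i * strides[dim], extents, strides, dim + 1);
        }
        return pos;
    }

    /**
     * Inverse of gather: copies consecutive buffer elements into a strided patch of dst.
     * Returns the next unread position in buf.
     */
    static int scatter(float[] buf, int pos, float[] dst, int offset,
                       int[] extents, int[] strides, int dim) {
        if (dim == extents.length) {
            dst[offset] = buf[pos];
            return pos + 1;
        }
        for (int i = 0; i < extents[dim]; i++) {
            pos = scatter(buf, pos, dst, offset + i * strides[dim], extents, strides, dim + 1);
        }
        return pos;
    }

    public static void main(String[] args) {
        // A 3 x 4 array stored row-major in a flat Java array: element (r, c) is at 4 * r + c.
        float[] a = new float[12];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
        }
        // The 2 x 2 patch whose corner is element (1, 1).
        int offset = 5;
        int[] extents = {2, 2};
        int[] strides = {4, 1};

        float[] buf = new float[4];
        gather(buf, 0, a, offset, extents, strides, 0);    // pack on the sending side

        float[] b = new float[12];
        scatter(buf, 0, b, offset, extents, strides, 0);   // unpack on the receiving side

        System.out.println(Arrays.toString(buf));
        System.out.println(Arrays.toString(b));
    }
}
```

Because the buffer is contiguous, several patches bound for the same process can be gathered one after another into a single buffer and sent as one message.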

Bryan Carpenter 2004-06-09