Next: mpjdev
Up: Collective Communication for the
Previous: An Application
Implementation of Collectives
In this section we will discusses Java implementation of the Adlib collective
operations. For illustration we concentrate on the important Remap
operation. Although it is a powerful and general operation, it is actually
one of the more simple collectives to implement in the HPJava framwork.
General algorithms for this primitive have been described by other authors.
For example it is essentially equivalent to the operation called
Regular_Section_Copy_Sched in [2].
In this section we want to illustrate how this kind of operation can be
implemented in term of the particular Range and Group
hierarchies of HPJava (complemented by suitable set of messaging
primitives).
All collective operations in the library are based on communication
schedule objects. Each kind of operation has an associated class of
schedules. Particular instances of these schedules, involving particular data
arrays and other parameters, are created by the class constructors. Executing
a schedule initiates the communications required to effect the operation.
A single schedule may be executed many times, repeating the same communication
pattern. In this way, especially for iterative programs, the cost of
computations and negotiations involved in constructing a schedule can often be
amortized over many executions. This paradigm was pioneered in the
CHAOS/PARTI libraries [8]. If a communication pattern is to be
executed only once, simple wrapper functions can be made available to
construct a schedule, execute it, then destroy it. The overhead of creating
the schedule is essentially unavoidable, because even in the single-use case
individual data movements generally have to be sorted and aggregated, for
efficiency. The associated data structures are just those associated with
schedule construction.
Constructor and public method of the remap schedule for
distributed arrays of float element can be described as follows:
The remap schedule combines two functionalities: it reorganizes data
in the way indicated by the distribution formats of source and
destination array. Also, if the destination array has a replicated
distribution format, it broadcasts data to all copies of the destination.
Here we will concentrate on the former aspect, which is handled by
an object of class RemapSkeleton contained in every Remap
object.
During construction of a RemapSkeleton schedule, all send messages,
receive messages, and internal copy operations implied by execution
of the schedule are enumerated and stored in light-weight data structures.
These messages have to be sorted before sending, for possible message
agglomeration, and to ensure a deadlock-free communication schedule.
These algorithms, and maintenance of the associated data structures,
are dealt with in a base class of RemapSkeleton called
BlockMessSchedule. The API for the superclass is outlined in
Figure 11. To set-up such a low-level schedule, one makes
a series of calls to sendReq and recvReq to define the
required messages. Messages are characterized by an offset in some
local array segment, and a set of strides and extents parameterizing
a multi-dimensional patch of the (flat Java) array. Finally the
build() operation does any necessary processing of the message lists.
The schedule is executed in a ``forward'' or ``backward'' direction by
invoking gather() or scatter().
Figure 11:
API of the class BlockMessSchedule
 |
The implementation details of BlockMessSchedule will not be
discussed in greater detail here because they are not particularly
specific to our HPJava system, and the principles are fairly
well-known.
However we do wish to describe in a little more detail the implementation
of the higher-level RemapSkeleton schedule on top of
BlockMessSchedule. This provides some insight into the structure
HPJava distributed arrays, and the underlying role of the special
Range and Group classes.
To produce an implementation of the RemapSkeleton class that
works independently of the detailed distribution format of the arrays
we rely on virtual functions of the Range class to enumerate
the blocks of index values held on each processor. These virtual
functions, implemented differently for different distribution formats,
encode all important information about those formats.
To a large extent the communication code itself is distribution
format independent.
Figure 12:
The HPJava Range hierarchy
|
2in2in./range.eps
|
The range hierarchy of HPJava is illustrated in Figure 12,
and some of the relevant virtual functions are displayed in the
API of Figure 13. The most relevant methods optionally
take arguments that allow one to specify a contiguous or strided
subrange of interest. The Triplet and Block
classes represent simple struct-like objects holding a few int
fields describing respectively a ``triplet'' interval, and the strided
interval of ``global'' and ``local'' subscripts that the distribution format
maps to a particular process. In the examples here Triplet is
used only to describe a range of process coordinates that a range
or subrange is distributed over.
Figure 13:
Partial API of the class Range
 |
Figure 14:
sendLoop method for Remap
 |
Now the RemapSkeleton communication schedule is built by two
subroutines called sendLoop and recvLoop
that enumerate messages to be sent and received respectively.
Figure 14 sketches the implementation of sendLoop.
This is a recursive function--it implements a
multidimensional loop over the rank dimensions of the arrays.
It is initially called with r = 0. There is little point
going into full detail of the algorithm here, but an important thing
to note is how this function uses the virtual methods on the range objects
of the source and destination arrays to enumerate
blocks--local and remote--of relevant subranges, and enumerates
the messages that must be sent. Figure 15 illustrates the
significance of some of the variables in the code. When the offset and
all extents and strides of a particular message have been accumulated,
the sendReq() method of the base class is invoked. The variables
src and dst represent the distributed array arguments.
The inquiries rng() and grp() extract the range and group
objects of these arrays.
Figure 15:
Illustration of sendLoop operation for ramp
|
3.3in1.8in./remap.eps
|
Not all the schedules of Adlib are as ``pure'' as Remap.
A few, like WriteHalo have built-in dependency on the distribution
format of the arrays (the existence of ghost regions in the case
of WriteHalo). But all rely heavily on the methods and
inquiries of the Range and Group classes, which abstract
the distribution format of arrays. The specific API of these classes
has evolved through C++ and Java versions of Adlib over a fairly lengthy
period of development.
In the HPJava version currently under development, the lower-level,
underlying schedules like BlockMessSchedule (which are not dependent
on higher-level ideas like distributed ranges and distributed arrays)
are in turn implemented on top of a messaging API, called mpjdev,
described the next section. To perform the actual communication and
to deal with preparation of the data, it uses methods of the mpjdev
like isend(), irecv(), read(), write(),
strGatjer(), and strScatter(). The isend() and
irecv() are used for actual communication. The write() and
strGatjer() are used for packing the data and read() and
strScatter() are used for unpacking the data where two of those methods
(read() and write()) are deling with a contiguous data and
the other two (strGatjer() and strScatter()) are dealing with
non-contiguous data.
Next: mpjdev
Up: Collective Communication for the
Previous: An Application
Bryan Carpenter
2003-01-23