The CHAOS/PARTI series of libraries was developed at the University of Maryland.
The original PARTI library was designed to deal with irregular scientific computations. A classic example is a physical problem discretized on an unstructured mesh. A characteristic inner loop for this kind of problem might look something like:
   DO I = 1, NEDGE
      Y(EDGE1(I)) = Y(EDGE1(I)) + F(X(EDGE1(I)), X(EDGE2(I)))
      Y(EDGE2(I)) = Y(EDGE2(I)) + G(X(EDGE1(I)), X(EDGE2(I)))
   ENDDO

We assume the problem is defined on some graph (or mesh). The arrays X and Y are defined over the nodes of the graph. The Ith edge in the graph connects the nodes EDGE1(I) and EDGE2(I). The value of Y at a node is a sum of terms, each depending on the values of X at the ends of an edge connected to the node.
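The structure of this edge loop can be sketched in plain Python (PARTI itself targets Fortran codes; the functions f and g below are placeholders for the problem's edge contributions, and all names here are illustrative):

```python
def edge_loop(x, edge1, edge2, f, g):
    """Accumulate edge contributions into y, accessed through the
    indirection vectors edge1 and edge2 (0-based here)."""
    y = [0.0] * len(x)
    for e1, e2 in zip(edge1, edge2):
        y[e1] += f(x[e1], x[e2])
        y[e2] += g(x[e1], x[e2])
    return y

# Example: a triangle graph with simple stand-in contribution functions.
x = [1.0, 2.0, 3.0]
edge1 = [0, 1, 2]
edge2 = [1, 2, 0]
y = edge_loop(x, edge1, edge2,
              f=lambda a, b: a + b,
              g=lambda a, b: a * b)
```

The point to notice is that the subscripts of Y and X are themselves array elements, so the access pattern is not known until run time.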
PARTI is particularly optimized for problems that have some ``locality of reference''. It assumes that a mapping of nodes to processors can be chosen such that most edges connect pairs of nodes on the same processor. The situation is illustrated in Figure 5. The local loops will be over the edges held on each processor. Here only one edge, the edge (2, 3) held on processor P0, causes non-local references.
Figure 5 is copied from one of the illustrations in . It assumes an irregular distribution of X and Y: elements 1, 2, 5 of the data arrays are stored on the first processor and elements 3, 4, 6 are stored on the second processor. This property is not particularly important to the discussion here. We could assume that the node numbering is permuted so that arrays like X and Y have a block distribution.
In any case the important point is that there is a class of problems with the following property: edges and nodes of the problem graph can be partitioned in such a way that most references become local. To be specific, they can be partitioned so that the majority of locally held elements of indirection vectors like EDGE1, EDGE2 reference locally held elements of data arrays, like X and Y.
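The locality property can be made concrete with a small sketch: for a block distribution, count the fraction of entries in a processor's slice of an indirection vector (like EDGE1 or EDGE2) that reference locally held data elements. The function name is illustrative, not part of PARTI:

```python
def locality_fraction(local_indices, lo, hi):
    """Fraction of references in a local slice of an indirection vector
    that hit data elements [lo, hi) owned by this processor."""
    hits = sum(1 for g in local_indices if lo <= g < hi)
    return hits / len(local_indices)

# A processor owning data elements [0, 4); one of its five
# references is off-processor:
frac = locality_fraction([0, 3, 1, 5, 2], 0, 4)
```

A good partition is precisely one for which this fraction is close to 1 on every processor.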
Based on this observation, PARTI assumes irregular loops are parallelized by a technique similar to the method of ghost regions discussed for regular stencil problems in section 1.2. First the indirection vectors are preprocessed to convert global index values to local subscripts. The locality property of the partition implies that the majority of these local subscripts refer to locally held data elements. If the global index actually referenced a data element held on another processor, the local subscript references an element in a ``ghost extension'' of the local data segment. These ghost regions are filled or flushed by suitable PARTI primitives: collective communication routines, called outside the local processing loop.
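The global-to-local translation can be sketched as follows, assuming a block distribution in which this processor owns global elements [lo, hi). This is an illustration of the idea, not the PARTI implementation; local references map into the physical segment, and each distinct off-processor reference is assigned a slot in the ghost extension:

```python
def localize(global_indices, lo, hi):
    """Translate global indices to local subscripts for the processor
    owning global elements [lo, hi).  Off-processor indices get
    subscripts in a ghost extension beyond the local segment; the
    returned ghost list records which global element each slot shadows."""
    n_local = hi - lo
    ghost = []       # global index shadowed by each ghost slot
    slot = {}        # global index -> ghost subscript (deduplicated)
    local = []
    for g in global_indices:
        if lo <= g < hi:
            local.append(g - lo)
        else:
            if g not in slot:
                slot[g] = n_local + len(ghost)
                ghost.append(g)
            local.append(slot[g])
    return local, ghost

# Processor owning elements [0, 3): globals 4 and 5 are off-processor,
# so they map to ghost subscripts 3 and 4.
local, ghost = localize([0, 2, 4, 2, 4, 5], 0, 3)
```

The ghost list is exactly the set of elements the communication schedule must fetch (or, for a scatter, flush).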
Here is a simpler sequential loop with irregular accesses:
   DO I = 1, N
      X(IA(I)) = X(IA(I)) + Y(IB(I))
   ENDDO

A parallel version (from ) is given in Figure 6.
The first call to the subroutine LOCALIZE deals with the X(IA(I)) terms. It does two things. It translates the I_BLK_COUNT global subscripts in IA to local subscripts, returned in the array LOCAL_IA, and it sets up a communication schedule. A handle to this data structure is returned in SCHEDULE_IA.
A communication schedule is created by analysing the requested set of accesses, sending lists of accessed elements to the processors that own them where necessary, detecting appropriate aggregations and redundancy eliminations, and so on. The end result is some digested list of messages that must be sent and received, including the local sources and destinations of the data in those messages.
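The digesting step can be sketched as grouping the requested off-processor elements by owner, with duplicates eliminated so each element travels once per message. This is a simplified illustration of the idea of a schedule, not PARTI's data structure; the names and the owner_of function are assumptions:

```python
from collections import defaultdict

def build_schedule(ghost_globals, owner_of):
    """Digest a list of off-processor global indices into per-owner
    message lists: for each owning processor, the sorted, de-duplicated
    set of elements to request in one aggregated message."""
    requests = defaultdict(set)
    for g in ghost_globals:
        requests[owner_of(g)].add(g)
    return {p: sorted(s) for p, s in requests.items()}

# Block distribution with block size 3, so element g lives on
# processor g // 3.  Duplicate requests for element 4 are merged.
schedule = build_schedule([4, 5, 4, 7], lambda g: g // 3)
```

A real schedule also records the mirror-image information (which local elements to pack into replies, and where incoming values land), so that the executor can run without further analysis.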
Another input parameter to LOCALIZE is the distribution of the data array--DAD_X in the first call. Another output parameter is the number of elements in the ghost region that will actually be needed--OFF_PROC_X here.
The second call to LOCALIZE performs a similar analysis for the Y(IB(I)) terms.
Together these calls comprise what is called the inspector phase for the loop. It is followed by the executor phase, in which results are actually computed and data is actually communicated.
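A minimal Python sketch of the executor phase for the loop X(IA(I)) = X(IA(I)) + Y(IB(I)) on one processor might look as follows, assuming the inspector has already produced localized subscripts and ghost lists. All names here are illustrative, not the PARTI API: fetch_y stands in for the GATHER communication, and the ghost tail of x is what the combining scatter would later flush to the owners (indices are 0-based, unlike the Fortran):

```python
def executor(x_seg, y_seg, local_ia, local_ib, n_ghost_x, ghost_y, fetch_y):
    x = x_seg + [0.0] * n_ghost_x               # cf. ZERO_OUT_BUFFER
    y = y_seg + [fetch_y(g) for g in ghost_y]   # cf. GATHER into ghost region
    for li, lj in zip(local_ia, local_ib):      # purely local main loop
        x[li] += y[lj]
    return x   # x[len(x_seg):] holds contributions for the combining scatter

# This processor owns elements [0, 3) of global X and Y of size 6.
# Hypothetical inspector output for IA = [0, 2, 4], IB = [1, 5, 3]:
local_ia, n_ghost_x = [0, 2, 3], 1       # global 4 -> ghost slot 3
local_ib, ghost_y = [1, 3, 4], [5, 3]    # globals 5, 3 -> slots 3, 4
y_global = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = executor([0.0] * 3, y_global[:3], local_ia, local_ib,
             n_ghost_x, ghost_y, fetch_y=lambda g: y_global[g])
```

Note that after the gather, the main loop makes no communication calls at all: every subscript, local or ghost, lands in a contiguous local array.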
The collective call to GATHER communicates necessary element values from physical segments of Y--the second argument--into the target ghost region for Y, which starts at Y(Y_BLK_SIZE + 1)--the first argument. The third argument is the communication schedule for this operation. The call to ZERO_OUT_BUFFER just sets all elements in the ghost region of X to zero.
The main loop is self-explanatory. Contributions to locally owned
X(IA) elements are accumulated directly into the local physical
segment of X. Contributions to non-locally owned elements are
accumulated into the ghost region of X.
Finally the call SCATTER_ADD sends the values in the ghost region of X to the relevant owners, where they are added to the appropriate elements in the physical region of the array segment. This is an example of a combining scatter operation.
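The owner's side of the combining scatter can be sketched as follows. This is an illustration of the idea, not the PARTI routine: inbox simulates the (global index, value) pairs that the schedule would deliver in messages from the other processors, and each pair is added into the owner's physical segment:

```python
def scatter_add(x_seg, lo, inbox):
    """Add each received (global_index, value) contribution into the
    local physical segment owning global elements [lo, lo + len(x_seg))."""
    for g, v in inbox:
        x_seg[g - lo] += v
    return x_seg

# The owner of global elements [3, 6) receives ghost contributions
# from its peers; two separate contributions to element 4 combine.
x_seg = [10.0, 20.0, 30.0]
scatter_add(x_seg, 3, inbox=[(4, 4.0), (5, 1.0), (4, 2.0)])
```

Because the operation combines with addition rather than overwriting, contributions to the same element from different processors accumulate correctly regardless of message arrival order.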
Besides the PARTI software, a major contribution here is the elaboration of the inspector-executor model, and the insight that construction of communication schedules should be separated from execution of those schedules. One immediate benefit of this separation arises in the common situation where the form of the inner loop (the pattern of subscripting) is constant over many iterations of some outer loop. The same communication schedule can be reused many times--in other words the inspector phase can be lifted out of the main loop.
The importance of the inspector-executor model and the idea of communication schedules are not tied to the details of the PARTI primitives. For example, they do not depend on PARTI's particular assumptions about locality or its special use of ghost regions, nor are they specific only to irregular computations.