
PARTI

The PARTI [16] series of libraries was developed at the University of Maryland. PARTI was originally designed for irregular scientific computations. In irregular problems (e.g. PDEs on unstructured meshes, sparse matrix algorithms, etc.) a compiler cannot anticipate data access patterns until run-time, since the patterns may depend on the input data, or be multiply indirected in the program. The data access time can be decreased by pre-computing which data elements will be sent and received. PARTI transforms the original sequential loop into two constructs, called the inspector and executor loops. First, the inspector loop analyzes the data references and calculates what data need to be fetched and where they should be stored. Second, the executor loop performs the actual computation using the information generated by the inspector. Figure 2.7 is a basic sequential loop with irregular accesses. A parallel version from [16] is illustrated in Figure 2.8.
Figure 2.7: Simple sequential irregular loop.
```
      DO I = 1, N
         X(IA(I)) = X(IA(I)) + Y(IB(I))
      ENDDO
```
Figure 2.8: PARTI code for simple parallel irregular loop.
```
C     Create required schedules (Inspector):
      CALL LOCALIZE(DAD_X, SCHEDULE_IA, IA, LOCAL_IA, I_BLK_COUNT, OFF_PROC_X)
      CALL LOCALIZE(DAD_Y, SCHEDULE_IB, IB, LOCAL_IB, I_BLK_COUNT, OFF_PROC_Y)
C     Actual computation (Executor):
      CALL GATHER(Y(Y_BLK_SIZE + 1), Y, SCHEDULE_IB)
      CALL ZERO_OUT_BUFFER(X(X_BLK_SIZE + 1), OFF_PROC_X)
      DO I = 1, I_BLK_COUNT
         X(LOCAL_IA(I)) = X(LOCAL_IA(I)) + Y(LOCAL_IB(I))
      ENDDO
      CALL SCATTER_ADD(X(X_BLK_SIZE + 1), X, SCHEDULE_IA)
```
The first call to `LOCALIZE` in the parallel version corresponds to the `X(IA(I))` terms in the sequential loop. It translates the `I_BLK_COUNT` global subscripts in `IA` to local subscripts, which are returned in the array `LOCAL_IA`. It also builds a communication schedule, which is returned in `SCHEDULE_IA`. Setting up the communication schedule involves resolving the access requests, sending lists of accessed elements to the owner processors, detecting opportunities for accumulation and redundancy elimination, and so on. The result is a list of messages containing the local sources and destinations of the data. Another input argument of `LOCALIZE` is the descriptor of the data array, `DAD_X`. The second call works in a similar way with respect to `Y(IB(I))`.

This completes the inspector phase for the loop. Next comes the executor phase, where the actual computation and communication of data elements occur. The collective call `GATHER` fetches the necessary data elements from `Y` into the target ghost region, which begins at `Y(Y_BLK_SIZE + 1)`. The argument `SCHEDULE_IB` holds the communication schedule. The next call, `ZERO_OUT_BUFFER`, sets all elements of the ghost region of `X` to zero. In the main loop, results for locally owned `X(IA)` elements are accumulated directly into the local segment of `X`, while results for non-locally-owned elements are accumulated into the ghost region of `X`. The final call, `SCATTER_ADD`, sends the values in the ghost region of `X` to the relevant owners, where the values are added into the physical region of the segment. An important lesson from the inspector-executor model of PARTI is that construction of communication schedules is isolated from execution of those schedules. The immediate benefit of this separation arises in the common situation where the form of the inner loop is constant over many iterations of some outer loop.
The same communication schedule can then be reused many times, and the inspector phase can be moved out of the main loop. This pattern is supported by the Adlib library used in HPJava.
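The reuse pattern can be summarized in pseudocode (the names here are illustrative stand-ins, not PARTI or Adlib routines):

```
sched = build_schedule(indices)          ! inspector: runs once
do t = 1, NSTEPS
    gather(ghost, remote_data, sched)    ! reuses the same schedule
    execute_loop(x, y, local_indices)    ! executor: runs every step
enddo
```

As long as the subscript arrays do not change between time steps, the (relatively expensive) schedule construction is amortized over all `NSTEPS` iterations.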

Bryan Carpenter 2004-06-09