For the sake of concrete discussion we will assume that, in the kind of Grande applications where MPI is likely to be used, some of the most pressing performance issues concern arrays and multidimensional arrays of small objects--especially arrays of primitive elements such as ints and floats. For benchmarks we therefore concentrated on the overheads introduced by object serialization when the objects contain many arrays of primitive elements. Specifically we concentrated on communication of two-dimensional arrays with primitive elements. We note that there is some debate about whether the Java model of multidimensional arrays is the most appropriate one for high-performance computing. There are various proposals for optimized HPC array class libraries [MIDKIFF]. See Section 5.6 for some further discussion.
The ``ping-pong'' method was used to time point-to-point communication of an $N \times N$ array of primitive elements treated as a one-dimensional array of objects, and compare it with communication of an $N^2$-element array without using serialization. As an intermediate case we also timed communication of a $1 \times N^2$ array treated as a one-dimensional (size 1) array of objects. This allows us to extract an estimate of the overhead to ``serialize'' an individual primitive element. The code for sending and receiving the various array shapes is given schematically in Figure 5.1.
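Figure 5.1 itself is not reproduced here; as a sketch, assuming the mpiJava point-to-point interface (`Comm.Send` with the `MPI.OBJECT` and `MPI.FLOAT` datatypes), the three send cases look schematically like this (receive side and timing loop omitted):

```
// Schematic sketch only -- mpiJava binding assumed.

float[][] a = new float[N][N];                  // N x N, sent as N float[] objects
MPI.COMM_WORLD.Send(a, 0, N, MPI.OBJECT, dst, tag);

Object[] b = new Object[] { new float[N * N] }; // same data as one float[N^2] object
MPI.COMM_WORLD.Send(b, 0, 1, MPI.OBJECT, dst, tag);

float[] c = new float[N * N];                   // sent directly, no serialization
MPI.COMM_WORLD.Send(c, 0, N * N, MPI.FLOAT, dst, tag);
```

In the first two cases the elements of the send buffer are objects (the row arrays, or a single flat array), so the MPI layer falls back on Java object serialization; in the third the buffer is an array of primitives and no serialization occurs.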
As a crude timing model for these benchmarks, one can assume that there is a cost $t^T_{ser}$ to serialize each primitive element of type T, an additional cost $t^T_{vec}$ to serialize each subarray, similar constants $t^T_{unser}$ and $t^T_{unvec}$ for unserialization, and a cost $t^T_{com}$ to physically transfer each element of data. Then the total times for the benchmarked communications should be

$$ t_{T[N][N]} = c + (t^T_{ser} + t^T_{unser} + t^T_{com}) N^2 + (t^T_{vec} + t^T_{unvec}) N $$
$$ t_{T[N^2]} = c + (t^T_{ser} + t^T_{unser} + t^T_{com}) N^2 + t^T_{vec} + t^T_{unvec} $$
$$ t^{raw}_T = c' + t^T_{com} N^2 $$

where $c$ and $c'$ are constant overheads.
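To make the model concrete, the predicted times can be evaluated for illustrative parameter values. The per-element magnitudes below (about 2 microseconds to serialize a float, a comparable but smaller unserialization cost, and a few hundredths of a microsecond to transfer a word through shared memory) are the rough figures quoted later in this section; the per-subarray constants are purely hypothetical placeholders, not measured values.

```java
// Evaluate the crude timing model for the three benchmarked cases.
// All costs are in microseconds.
public class TimingModel {
    // Rough magnitudes for float elements over shared memory, as discussed
    // in the text; the per-subarray constants are hypothetical.
    public static final double T_SER   = 2.0;   // serialize one float element
    public static final double T_UNSER = 1.0;   // unserialize one float element
    public static final double T_COM   = 0.05;  // transfer one word (shared memory)
    public static final double T_VEC   = 10.0;  // per-subarray serialization (hypothetical)
    public static final double T_UNVEC = 10.0;  // per-subarray unserialization (hypothetical)

    // N x N array sent as N float[] objects: N^2 element costs, N subarray costs.
    public static double t2d(int n) {
        return (T_SER + T_UNSER + T_COM) * n * n + (T_VEC + T_UNVEC) * n;
    }

    // Same data as a single float[N*N] object: only one subarray cost.
    public static double t1d(int n) {
        return (T_SER + T_UNSER + T_COM) * n * n + T_VEC + T_UNVEC;
    }

    // Direct send with no serialization.
    public static double tRaw(int n) {
        return T_COM * n * n;
    }

    public static void main(String[] args) {
        int n = 1024;
        System.out.printf("N=%d: 2d=%.0f us, flat=%.0f us, raw=%.0f us%n",
                          n, t2d(n), t1d(n), tRaw(n));
    }
}
```

Note that as N grows the ratio of serialized to raw time approaches (T_SER + T_UNSER + T_COM) / T_COM, which for these illustrative values is roughly the two orders of magnitude discussed below.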
All measurements were performed on a cluster of 2-processor, 200 MHz UltraSPARC nodes connected through a SunATM-155/MMF network. The underlying MPI implementation was Sun MPI 3.0 (part of the Sun HPC package). The JDK was jdk1.2beta4. Shared-memory results quoted were obtained by running two processes on the processors of a single node. Non-shared-memory results were obtained by running peer processes on different nodes.
In a series of measurements, the element serialization and unserialization timing parameters $t^T_{ser}$ and $t^T_{unser}$ were estimated by independent benchmarks of the serialization code. The per-subarray parameters $t^T_{vec}$ and $t^T_{unvec}$ were estimated by plotting the difference between serialization and unserialization times for $T[N][N]$ and $T[N^2]$ arrays. Our timing model assumes the values of these parameters are independent of the element type. This is only approximately true, and the values quoted in the table and used in the plotted curves are averages. Separately measured values for byte arrays were smaller than these averages, and for int and float arrays they were larger. The raw communication speed, $t^T_{com}$, was estimated from ping-pong results for arrays sent without serialization. Table 5.1 contains the resulting estimates of the various parameters for byte and float elements.
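The first of these steps, timing the serialization code in isolation, can be sketched with plain java.io. This is a minimal illustration, not the benchmark code actually used: the array sizes, the min-of-several-runs scheme, and the differencing of two sizes (which cancels the constant stream-header overhead) are our own choices.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Estimate the per-element serialization cost for float arrays by timing
// ObjectOutputStream.writeObject on two array sizes and differencing.
public class SerCost {
    // Time one serialization of the given array, in nanoseconds.
    public static long timeSer(float[] a) throws IOException {
        ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream());
        long t0 = System.nanoTime();
        oos.writeObject(a);
        oos.flush();
        return System.nanoTime() - t0;
    }

    // Per-element cost estimate, in nanoseconds per float.
    public static double estimate() throws IOException {
        int small = 1 << 12, large = 1 << 16;
        // Warm up the JIT before taking measurements.
        for (int i = 0; i < 20; i++) {
            timeSer(new float[small]);
            timeSer(new float[large]);
        }
        // Take the minimum over several runs to suppress timing noise.
        long tSmall = Long.MAX_VALUE, tLarge = Long.MAX_VALUE;
        for (int i = 0; i < 10; i++) {
            tSmall = Math.min(tSmall, timeSer(new float[small]));
            tLarge = Math.min(tLarge, timeSer(new float[large]));
        }
        // Differencing cancels the constant stream-header overhead.
        return (double) (tLarge - tSmall) / (large - small);
    }

    public static void main(String[] args) throws IOException {
        System.out.printf("estimated per-element cost: %.1f ns per float%n", estimate());
    }
}
```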
Figures 5.2 and 5.3 plot actual measured times from ping-pong benchmarks for the mpiJava sends and receives of arrays with byte and float elements. In the plots the array extent, $N$, ranges between 128 and 1024. The measured times for the $T[N][N]$, $T[N^2]$ and direct cases are compared with the formulae given above (setting the constant terms to zero). The agreement is good, so our parametrization is assumed to be realistic in the regime considered.
According to Table 5.1 the overhead of Java serialization nearly always dominates other communication costs. In the worst case--floating point numbers--it takes around 2 microseconds to serialize each number and a smaller but comparable time to unserialize. But it only takes a few hundredths of a microsecond to communicate the word through shared memory. Serialization slows communication by nearly two orders of magnitude. When the underlying communication is over a fast network rather than through shared memory the raw communication time is still only a fraction of a microsecond, and serialization still dominates that time by about one order of magnitude. For byte elements serialization costs are smaller, but still larger than the communication costs in the fast network and still much larger than the communication cost through shared memory. Serialization costs for int elements are intermediate.
The constant overheads for serializing and unserializing each subarray, characterized by the parameters $t^T_{vec}$ and $t^T_{unvec}$, are also quite large, although for the array sizes considered here they only make a dominant contribution for the byte arrays, where individual element serialization is relatively fast.