For the sake of concrete discussion we will make an assumption that, in the kind of Grande applications where MPI is likely to be used, some of the most pressing performance issues concern arrays and multidimensional arrays of small objects--especially arrays of primitive elements such as ints and floats. For benchmarks we therefore concentrated on the overheads introduced by object serialization when the objects contain many arrays of primitive elements. Specifically we concentrated on communication of two-dimensional arrays with primitive elements. We note that there some debate about whether the Java model of multidimensional arrays is the most appropriate one for high performance computing. There are various proposals for optimized HPC array class libraries [#!MIDKIFF!#]. See Section 5.6 for some further discussion.
The ``ping-pong'' method was used to time
point-to-point communication of an by array of primitive
elements treated as a one dimensional array of objects, and compare it
with communication of an
array without using serialization. As
an intermediate case we also timed communication of a
by
array treated as a one-dimensional (size 1) array of objects.
This allows us to extract an estimate of the overhead to ``serialize''
an individual primitive element.
The code for sending and receiving the various array shapes
is given schematically in Figure 5.1.
|
As a crude timing model for these benchmarks, one can assume that there
is a cost
to
serialize each primitive element of type T, an additional cost
to serialize each
subarray, similar constants
and
for
unserialization, and a cost
to
physically tranfser each element of data.
Then the total time for benchmarked communications should be
All measurements were performed on a cluster of 2-processor, 200 Mhz UltraSparc nodes connected through a SunATM-155/MMF network. The underlying MPI implementation was Sun MPI 3.0 (part of the Sun HPC package). The JDK was jdk1.2beta4. Shared memory results quoted are obtained by running two processes on the processors of a single node. Non-shared-memory results are obtained by running peer processes in different nodes.
In a series of measurements, element serialization and unserialization
timing parameters were estimated by independent benchmarks of the
serialization code. The parameters
and
were estimated by plotting the
difference between serialization and unserialization times for T[1][
] and T[][]
Our timing
model assumed the values of these parameters is independent of the
element type. This is only approximately true, and the values quoted in
the table and used in the plotted curves are averages.
Separately measured values for byte arrays were smaller than
these averages, and for int and float arrays they were
larger.
The raw communication speed was
estimated from ping-pong results for
.
Table 5.1
contains the resulting estimates of the various parameters for byte
and float elements.
|
Figures 5.2 and 5.3 on Page
and
plot actual measured
times from ping-pong benchmarks for the mpiJava sends and receives of
arrays with byte and float elements. In the
plots the array extent, , ranges between 128 and 1024. The measured
times for
,
and
are compared
with the formulae given above (setting the
constants to zero). The
agreement is good, so our parametrization is assumed to be realistic in
the regime considered.
According to table 5.1 the overhead of Java serialization nearly always dominates other communication costs. In the worst case--floating point numbers--it takes around 2 microseconds to serialize each number and a smaller but comparable time to unserialize. But it only takes a few hundredths of a microsecond to communicate the word through shared memory. Serialization slows communication by nearly two orders of magnitude. When the underlying communication is over a fast network rather than through shared memory the raw communication time is still only a fraction of a microsecond, and serialization still dominates that time by about one order of magnitude. For byte elements serialization costs are smaller, but still larger than the communication costs in the fast network and still much larger than the communication cost through shared memory. Serialization costs for int elements are intermediate.
The constant overheads for serializing each subarray, characterized by
the parameters
and
are also
quite large, although, for the array sizes considered here they only
make a dominant contribution for the byte arrays, where
individual element serialization is relatively fast.