

Benchmark Results for Multidimensional Arrays

For the sake of concrete discussion we will assume that, in the kind of Grande applications where MPI is likely to be used, some of the most pressing performance issues concern arrays and multidimensional arrays of small objects--especially arrays of primitive elements such as ints and floats. Our benchmarks therefore concentrate on the overheads introduced by object serialization when the objects contain many arrays of primitive elements, and specifically on communication of two-dimensional arrays with primitive elements. We note that there is some debate about whether the Java model of multidimensional arrays is the most appropriate one for high performance computing; there are various proposals for optimized HPC array class libraries [MIDKIFF]. See Section 5.6 for some further discussion.

The ``ping-pong'' method was used to time point-to-point communication of an $N \times N$ array of primitive elements treated as a one-dimensional array of objects, and to compare it with communication of an $N^2$-element array of primitive elements sent without serialization. As an intermediate case we also timed communication of a $1 \times N^2$ array treated as a one-dimensional (size 1) array of objects. This allows us to extract an estimate of the overhead of ``serializing'' an individual primitive element. The code for sending and receiving the various array shapes is given schematically in Figure 5.1.

Figure 5.1: Send and receive operations for various array shapes.

$N^2$ float vector

  Sender:

    float [] buf = new float [N * N] ;
    MPJ.COMM_WORLD.send(buf, 0, N * N, MPJ.FLOAT, dst, tag) ;

  Receiver:

    float [] buf = new float [N * N] ;
    MPJ.COMM_WORLD.recv(buf, 0, N * N, MPJ.FLOAT, src, tag) ;

$N \times N$ float array

  Sender:

    float [] [] buf = new float [N] [N] ;
    MPJ.COMM_WORLD.send(buf, 0, N, MPJ.OBJECT, dst, tag) ;

  Receiver:

    float [] [] buf = new float [N] [] ;
    MPJ.COMM_WORLD.recv(buf, 0, N, MPJ.OBJECT, src, tag) ;

$1 \times N^2$ float array

  Sender:

    float [] [] buf = new float [1] [N * N] ;
    MPJ.COMM_WORLD.send(buf, 0, 1, MPJ.OBJECT, dst, tag) ;

  Receiver:

    float [] [] buf = new float [1] [] ;
    MPJ.COMM_WORLD.recv(buf, 0, 1, MPJ.OBJECT, src, tag) ;

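Figure 5.1 shows only the individual send and receive operations. As a point of reference, a minimal sketch of a ping-pong timing harness for the $N \times N$ case might look as follows. It assumes MPJ.init(), MPJ.finish() and a rank() method in the style of the MPJ API used above; the class name, repetition count and use of System.currentTimeMillis() are illustrative choices rather than the actual benchmark code.

    // Illustrative ping-pong harness for the N x N float array case
    // (a sketch, not the original benchmark code).  Rank 0 sends and
    // waits for the echo; rank 1 echoes each message back.
    public class PingPong {
      public static void main(String [] args) throws Exception {
        MPJ.init(args) ;

        final int N = 512, REPS = 100, TAG = 0 ;
        int me   = MPJ.COMM_WORLD.rank() ;
        int peer = 1 - me ;                      // two processes, ranks 0 and 1

        float [] [] buf = new float [N] [N] ;

        long start = System.currentTimeMillis() ;
        for(int r = 0 ; r < REPS ; r++) {
          if(me == 0) {
            MPJ.COMM_WORLD.send(buf, 0, N, MPJ.OBJECT, peer, TAG) ;
            MPJ.COMM_WORLD.recv(buf, 0, N, MPJ.OBJECT, peer, TAG) ;
          }
          else {
            MPJ.COMM_WORLD.recv(buf, 0, N, MPJ.OBJECT, peer, TAG) ;
            MPJ.COMM_WORLD.send(buf, 0, N, MPJ.OBJECT, peer, TAG) ;
          }
        }
        long elapsed = System.currentTimeMillis() - start ;

        if(me == 0)                              // one-way time per message
          System.out.println("t = " + elapsed / (2.0 * REPS) + " ms") ;

        MPJ.finish() ;
      }
    }

In practice one would also discard a few warm-up iterations (to allow for JIT compilation) and sweep $N$ over the range of interest.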
As a crude timing model for these benchmarks, one can assume that there is a cost $t^{\mathrm{T}}_{\mathrm{ser}}$ to serialize each primitive element of type T, an additional cost $t^{\mathrm{vec}}_{\mathrm{ser}}$ to serialize each subarray, similar constants $t^{\mathrm{T}}_{\mathrm{unser}}$ and $t^{\mathrm{vec}}_{\mathrm{unser}}$ for unserialization, and a cost $t^{\mathrm{T}}_{\mathrm{com}}$ to physically transfer each element of data. Then the total times for the benchmarked communications should be

$t^{\mathtt{T}[N^2]} = c + t^{\mathrm{T}}_{\mathrm{com}} N^2$   (5.1)

$t^{\mathtt{T}[1][N^2]} = c' + (t^{\mathrm{T}}_{\mathrm{ser}} + t^{\mathrm{T}}_{\mathrm{com}} + t^{\mathrm{T}}_{\mathrm{unser}}) N^2$   (5.2)

$t^{\mathtt{T}[N][N]} = c'' + (t^{\mathrm{vec}}_{\mathrm{ser}} + t^{\mathrm{vec}}_{\mathrm{unser}}) N + (t^{\mathrm{T}}_{\mathrm{ser}} + t^{\mathrm{T}}_{\mathrm{com}} + t^{\mathrm{T}}_{\mathrm{unser}}) N^2$   (5.3)

These formulae do not attempt to explain the constant initial overheads, do not take into account the extra bytes for type description that serialization introduces into the stream, and ignore possible non-linear costs associated with analysing object graphs. Empirically these effects are small for the range of $N$ we consider.

All measurements were performed on a cluster of two-processor, 200 MHz UltraSparc nodes connected through a SunATM-155/MMF network. The underlying MPI implementation was Sun MPI 3.0 (part of the Sun HPC package), and the JDK was jdk1.2beta4. The shared-memory results quoted were obtained by running two processes on the processors of a single node; non-shared-memory results were obtained by running peer processes on different nodes.

In a series of measurements, the element serialization and unserialization timing parameters were estimated by independent benchmarks of the serialization code. The parameters $t^{\mathrm{vec}}_{\mathrm{ser}}$ and $t^{\mathrm{vec}}_{\mathrm{unser}}$ were estimated by plotting the difference between the serialization (and unserialization) times for the T[1][$N^2$] and T[$N$][$N$] shapes. Our timing model assumes that the values of these parameters are independent of the element type. This is only approximately true, and the values quoted in the table and used in the plotted curves are averages: separately measured values for byte arrays were smaller than these averages, while those for int and float arrays were larger. The raw communication speed was estimated from ping-pong results for $t^{\mathtt{T}[N^2]}$. Table 5.1 contains the resulting estimates of the various parameters for byte and float elements.
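As an illustration of the kind of independent serialization benchmark involved, the following minimal sketch times serialization and unserialization of a float[$N$][$N$] array using the standard java.io object streams. The class name, repetition count and use of System.currentTimeMillis() are assumptions made for the sketch; it is in the spirit of the measurements behind Table 5.1, not the code actually used to produce them.

    import java.io.* ;

    // Sketch of a serialization micro-benchmark: times writing a
    // float [N][N] array to an ObjectOutputStream and reading it back.
    // Dividing by N*N (after allowing for per-subarray and constant
    // overheads) gives estimates in the spirit of t_ser and t_unser.
    public class SerBench {
      public static void main(String [] args) throws Exception {
        final int N = 512, REPS = 10 ;
        float [] [] data = new float [N] [N] ;

        long serTime = 0, unserTime = 0 ;
        for(int r = 0 ; r < REPS ; r++) {
          ByteArrayOutputStream bytes = new ByteArrayOutputStream() ;
          ObjectOutputStream out = new ObjectOutputStream(bytes) ;

          long t0 = System.currentTimeMillis() ;
          out.writeObject(data) ;                          // serialize
          out.flush() ;
          serTime += System.currentTimeMillis() - t0 ;

          ObjectInputStream in = new ObjectInputStream(
              new ByteArrayInputStream(bytes.toByteArray())) ;

          long t1 = System.currentTimeMillis() ;
          float [] [] copy = (float [] []) in.readObject() ;  // unserialize
          unserTime += System.currentTimeMillis() - t1 ;
        }

        // Report average per-element costs in microseconds.
        System.out.println("serialize:   " +
            1000.0 * serTime / (REPS * N * N) + " us per element") ;
        System.out.println("unserialize: " +
            1000.0 * unserTime / (REPS * N * N) + " us per element") ;
      }
    }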


Table 5.1: Estimated parameters in the serialization and communication timing model. The $t^{\mathrm{T}}_{\mathrm{com}}$ values are given respectively for the non-shared-memory ($\dag$) and shared-memory ($\S$) implementations of the underlying communication.

  $t^{\mathrm{byte}}_{\mathrm{ser}}$   = 0.043 $\mu$s        $t^{\mathrm{float}}_{\mathrm{ser}}$   = 2.1 $\mu$s         $t^{\mathrm{vec}}_{\mathrm{ser}}$   = 100 $\mu$s
  $t^{\mathrm{byte}}_{\mathrm{unser}}$ = 0.027 $\mu$s        $t^{\mathrm{float}}_{\mathrm{unser}}$ = 1.4 $\mu$s         $t^{\mathrm{vec}}_{\mathrm{unser}}$ = 53 $\mu$s
  $t^{\mathrm{byte}}_{\mathrm{com}}$   = 0.062 $\mu$s$^\dag$   $t^{\mathrm{float}}_{\mathrm{com}}$   = 0.25 $\mu$s$^\dag$
  $t^{\mathrm{byte}}_{\mathrm{com}}$   = 0.008 $\mu$s$^\S$    $t^{\mathrm{float}}_{\mathrm{com}}$   = 0.038 $\mu$s$^\S$
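As a quick sanity check on the scale these parameters imply, consider float elements with $N = 1024$ communicated over the network ($t^{\mathrm{float}}_{\mathrm{com}} = 0.25\mu$s). Plugging the Table 5.1 values into Equation 5.3 and neglecting the constant $c''$ gives a per-subarray contribution of $(100 + 53)\mu\mbox{s} \times 1024 \approx 0.16$ seconds and a per-element contribution of $(2.1 + 0.25 + 1.4)\mu\mbox{s} \times 1024^2 \approx 3.9$ seconds, for a predicted total of roughly 4.1 seconds. Equation 5.1 predicts only $0.25\mu\mbox{s} \times 1024^2 \approx 0.26$ seconds for the corresponding raw $N^2$ send. This back-of-the-envelope estimate is indicative only, but it previews the scale of the serialization overheads discussed below.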


Figures 5.2 and 5.3 plot actual measured times from the ping-pong benchmarks for mpiJava sends and receives of arrays with byte and float elements. In the plots the array extent $N$ ranges between 128 and 1024. The measured times for $t^{\mathtt{T}[N^2]}$, $t^{\mathtt{T}[1][N^2]}$ and $t^{\mathtt{T}[N][N]}$ are compared with the formulae given above (setting the $c$ constants to zero). The agreement is good, so our parametrization is assumed to be realistic in the regime considered.

Figure 5.2: Communication times from the ping-pong benchmark in the non-shared-memory case. The lines represent the model defined by Equations 5.1 to 5.3 in the text, with parameters from Table 5.1.
[Two plots: byte elements (upper panel) and float elements (lower panel).]

Figure 5.3: Communication times from the ping-pong benchmark in the shared-memory case. The lines represent the model defined by Equations 5.1 to 5.3 in the text, with parameters from Table 5.1.
[Two plots: byte elements (upper panel) and float elements (lower panel).]

According to Table 5.1 the overhead of Java serialization nearly always dominates the other communication costs. In the worst case--floating point numbers--it takes around 2 microseconds to serialize each number and a smaller but comparable time to unserialize it, yet only a few hundredths of a microsecond to communicate the resulting word through shared memory: serialization slows communication by nearly two orders of magnitude. When the underlying communication is over a fast network rather than through shared memory, the raw communication time is still only a fraction of a microsecond, and serialization still dominates that time by about one order of magnitude. For byte elements the serialization costs are smaller, but still larger than the communication costs over the fast network, and much larger than the communication cost through shared memory. Serialization costs for int elements are intermediate.

The constant overheads for serializing each subarray, characterized by the parameters $t^{\mathrm{vec}}_{\mathrm{ser}}$ and $t^{\mathrm{vec}}_{\mathrm{unser}}$, are also quite large, although for the array sizes considered here they make a dominant contribution only for the byte arrays, where individual element serialization is relatively fast.

