HPJava Home Page

mpiJava
HPJava language
SC '02 demo

Project flier

Java HPC Course
HPspmd lectures


Community Grids
Bloomington photos

Java Grande Home
Numerics working group


PCRC Home



Viewing these pages

An Example - Choleski Decomposition

Any real, symmetric, positive definite matrix A can be decomposed into the form

where L is a lower triangular matrix and the upper triangular matrix LT is the transpose of L.

Now write A and L as follows:

where b and m are column vectors and A′ and L′ are square matrices of dimension one less than A. Clearly A′ is symmetric and L′ is lower triangular. We have

Comparing terms we find

and

in other words, L′ is itself the Choleski decomposition of A′ - m mT .

These observations yield the following algorithm for computing a Choleski decomposition:

where now the result for L overwrites the lower triangle of A.

An HPJava parallel version is

  Procs1 p = new Procs1(P) ;
  on(p) {
    Range x = new CyclicRange(N, p.dim(0));

    float [[*,-]] a = new float [[N, x]] ;

    float [[*]]   m = new float [[N]] ;

    ... some code to initialize `a' ...

    for(int k = 0 ; k < N - 1 ; k++) {

      at(j = x[k]) {
        float d =  Math.sqrt(a[k, j]) ;

        a[k, j] = d ;
        for(int i = k + 1 ; i < N ; i++)
          a[i, j] /= d ;
      }

      Adlib.remap(m[[k + 1 : N - 1]], a[[k + 1 : N - 1, k]]) ;

      overall(j = x for k + 1 : N - 1)
        for(int i = j` ; i < N ; i++)
          a[i, j] -= m[i] * m[j`] ;
    }

    at(j = x[N - 1])
      a[N - 1, j] = Math.sqrt(a[N - 1, j]) ;
  }
A cyclic range is used to distribute elements of each row of the array over the set of processors. Every column is stored in its entirety on a particular processor - this is specified by making the first dimension sequential. In this kind of algorithm the cyclic ("round robin") assignment of columns to processors generally makes for better load balancing than a block-wise assignment would.

The program introduces the third and final special control construct of HPJava. The at construct is a poorer cousin of the overall construct. It only specifies a single value for its distributed index. So the statement

    at(i = x[n]) {
      ...
    }
is completely equivalent to
    overall(i = x for n : n) {
      ...
    }
The final statement of our program:
    at(j = x[N - 1])
      a[N - 1, j] = Math.sqrt(a[N - 1, j]) ;
should be read to say that the assignment in the body is only executed by the process (or, in a more general, multidimensional context, the set of processes) that holds the N-1 element of range x. In this sense the at may also be interpretted as a specialized on construct, except that it has the additional role of scoping a distributed index. In typical parallel programs the at construct is used less frequently than the general overall construct or the on construct.

During the computation, the collective communication function remap is used for broadcasting successive columns of a after they are updated to hold the value of value of the column vector m. As in the earlier matrix multiplication example, the fact that the temporary array m has no distributed dimension implies that its storage is replicated. Only the section m[[k + 1 : N - 1]] is required by the algorithm, so only this section is broadcast.

Note that it would be illegal to use the distributed index j directly as a subscript of m; the index j represents a location in the distributed range x, whereas the array m has a different (collapsed) range - use of j here would produce a run-time error. One must use the integer expression j` instead.


Bryan Carpenter, (dbc@ecs.soton.ac.uk). Last updated May 2007.