

Experimental Study - Q3

Figure 6.6: Q3 Index Algorithm in HPJava
overall (i = x for : )                // Compute probability
  for (int j = 0; j<NUM_RESPONSES; j++) {
    double exp = Math.exp(1.7 * par[j, 0] * (tht[i] - par[j, 1])) ;
    prob[i, j] = par[j, 2] + (1.0 - par[j, 2]) * exp / (1.0 + exp) ;
  }

overall (i = x for : )                // Calculate the difference scores
  for (int j = 0; j<NUM_RESPONSES; j++)
    diff[i, j] = raw[i, j] - prob[i, j] ;

Adlib.sumDim(sum_diff, diff, 0) ;     // sum(di) / n
for (int i = 0; i<NUM_RESPONSES; i++)
  sum_diff[i] /= NUM_STUDENTS ;

overall (i = x for : )                // sum(power(di, 2)) / n
  for (int j = 0; j<NUM_RESPONSES; j++)
    power_diff_2[i, j] = diff[i, j] * diff[i, j] / NUM_STUDENTS ;
Adlib.sumDim(sum_power_diff_2, power_diff_2, 0) ;

overall (k = x for : )                // calculating didj
  for (int i = 0; i<NUM_RESPONSES; i++)
    for (int j = 0; j<NUM_RESPONSES; j++)
      if (i != j)
        didj[k, i*NUM_RESPONSES + j] = diff[k, i] * diff[k, j] / NUM_STUDENTS ;

Adlib.sumDim(sum_didj, didj, 0) ;     // sum(di * dj) / n

for (int i=0; i<NUM_RESPONSES; i++)   // covariance
  for (int j=0; j<NUM_RESPONSES; j++)
    if (i != j)
      cov[i][j] = sum_didj[i*NUM_RESPONSES + j] - sum_diff[i] * sum_diff[j] ;

for (int i=0; i<NUM_RESPONSES; i++)   // variance
  var[i] = sum_power_diff_2[i] - sum_diff[i] * sum_diff[i] ;

for (int i = 0; i<NUM_RESPONSES; i++) // Calculate Q3
  for (int j = 0; j<NUM_RESPONSES; j++)
    if (i != j) {
      q3[i][j] = cov[i][j] / Math.sqrt(var[i] * var[j]) ;
      avg += q3[i][j] ;
    }
Q3 is a real-world application with a large number of computations, rather than a scientific benchmark program. Figure 6.6 shows an implementation of Q3 in HPJava. Here, rather than benchmarking the whole program, we need to know which part dominates the algorithm. The Q3 application spends most of its time in the overall construct that computes didj for every pair of diff elements, and in the summation of those values. According to our timing measurements, this kernel accounts for 97% of the running time of the whole program. Thus, we benchmark only the kernel of the Q3 algorithm.
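For comparison with Figure 6.6, the same arithmetic can be written with plain sequential loops. The following is only a minimal sketch under stated assumptions: the class name Q3Sketch and the method q3 are hypothetical, the input is assumed to be an already computed matrix of difference scores, and this is not the Java benchmark program measured in this study.

```java
// Sequential sketch of the Q3 arithmetic in Figure 6.6 (hypothetical names).
final class Q3Sketch {

    /**
     * diff[k][i] is the difference score of student k on item i
     * (raw response minus model probability). Returns the matrix of
     * Q3 values; diagonal entries are left at 0.
     */
    static double[][] q3(double[][] diff) {
        int n = diff.length;      // number of students
        int m = diff[0].length;   // number of items (responses)

        double[] mean = new double[m];           // sum(di) / n
        double[] meanSq = new double[m];         // sum(di^2) / n
        double[][] meanProd = new double[m][m];  // sum(di * dj) / n

        // This triple loop corresponds to the didj kernel and its summation.
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < m; i++) {
                mean[i] += diff[k][i] / n;
                meanSq[i] += diff[k][i] * diff[k][i] / n;
                for (int j = 0; j < m; j++) {
                    if (i != j) {
                        meanProd[i][j] += diff[k][i] * diff[k][j] / n;
                    }
                }
            }
        }

        double[][] result = new double[m][m];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < m; j++) {
                if (i != j) {
                    double cov = meanProd[i][j] - mean[i] * mean[j];
                    double varI = meanSq[i] - mean[i] * mean[i];
                    double varJ = meanSq[j] - mean[j] * mean[j];
                    result[i][j] = cov / Math.sqrt(varI * varJ);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tiny made-up data set: item 1 is exactly twice item 0.
        double[][] diff = { {0.1, 0.2}, {-0.3, -0.6}, {0.5, 1.0} };
        System.out.println(q3(diff)[0][1]);
    }
}
```

The innermost loop runs once per student for every ordered pair of items, which is why this part of the computation outweighs everything else in the program.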
Figure 6.7: Performance for Q3 on Linux machine
Figure 6.7 compares the performance of Q3 in HPJava, Java, and C on the Linux machine. The Java performance is 61% of the C performance. What we focus on here is the performance of HPJava relative to Java: the HPJava performance is 119% of the Java performance. This is the kind of result we have always hoped for from HPJava. One reason HPJava is faster than Java is that, to calculate the sums over the distributed dimension (sum_diff, sum_power_diff_2, and sum_didj in Figure 6.6, where $ i = 1, \ldots , 2551$), the HPJava implementation of Q3 uses a call to the run-time communication library, Adlib.sumDim(), instead of the loops used in the sequential Java implementation.
In addition, a design decision makes the HPJava implementation of Q3 efficient. To calculate the covariance, variance, and Q3 values (cov, var, and q3 in Figure 6.6), we use nested for loops instead of overall constructs. If a small matrix, such as one of size 75 $ \times $ 75, is distributed, the cost of communication is higher than the computing time. When a multiarray is created, we generally suggest that a dimension of small size (e.g. 100) be a sequential dimension rather than a distributed one, to avoid unnecessary communication time.
The code optimized by PRE and by HPJOPT2 performs at up to 113% and 115%, respectively, of the naive translation. In particular, the HPJOPT2 result is up to 84% of the C performance and up to 138% of the Java performance. That is, an HPJava program for the Q3 index maximally optimized by HPJOPT2 is very competitive with C implementations.
As discussed in section 6.3.2, because the problem size of Q3 is too large to fit in the physical cache of the Linux machine, the net speed of Q3 is lower (17.5 Mflops with HPJOPT2 optimization) than that of the other applications ($ \geq$ 250 Mflops with HPJOPT2 optimization). However, the Java and C versions of Q3 are likewise slower than the Java and C versions of the other applications.
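As a rough consistency check on the percentages quoted for Figure 6.7, each one can be read as a ratio of execution speeds and the ratios composed. This is only a sketch: the "up to" figures are maxima, so the products need not match the reported values exactly.

```latex
\[
\frac{\text{HPJOPT2}}{\text{C}}
  \approx \frac{\text{HPJOPT2}}{\text{Java}} \times \frac{\text{Java}}{\text{C}}
  = 1.38 \times 0.61 \approx 0.84
\]
\[
\frac{\text{HPJOPT2}}{\text{Java}}
  \approx \frac{\text{HPJOPT2}}{\text{naive HPJava}} \times \frac{\text{naive HPJava}}{\text{Java}}
  = 1.15 \times 1.19 \approx 1.37
\]
```

Both products agree with the reported 84% and 138% to within rounding.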
Even though Q3 is relatively slow on a single processor compared with the other applications, we expect better performance on multiple processors: the data will be divided among them and, with the right choice of the number of processors, each part may fit in the cache of the parallel machine.
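To make the cache argument concrete, the size of the largest array in Figure 6.6, didj, can be estimated per process. The sketch below rests on several assumptions that are not stated in this section: NUM_STUDENTS = 2551 and NUM_RESPONSES = 75 (inferred from the index range and the 75 $ \times $ 75 matrix mentioned above), 8-byte double elements, and a block distribution of the student dimension. The class and method names are hypothetical.

```java
// Estimate of the didj storage held by each process (hypothetical names).
final class Q3Footprint {

    /**
     * Bytes of didj held by the process with the largest share, assuming
     * the student dimension is block-distributed and each element is an
     * 8-byte double.
     */
    static long didjBytesPerProcess(int students, int items, int processes) {
        long rows = (students + processes - 1) / processes; // ceiling division
        return rows * items * items * 8L;
    }

    public static void main(String[] args) {
        // Assumed sizes: 2551 students, 75 items.
        for (int p = 1; p <= 16; p *= 2) {
            System.out.println(p + " process(es): "
                    + didjBytesPerProcess(2551, 75, p) + " bytes");
        }
    }
}
```

Under these assumptions didj alone occupies roughly 115 MB on one process, and the per-process share shrinks in proportion to the number of processes; whether a given share fits in cache depends on the cache size of the target machine, which is not given here.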
Bryan Carpenter 2004-06-09