Parallel For

PowerPC AltiVec SIMD multimedia instructions

Data Layout: contiguous and aligned data blocks
The following example parallel-for loop is shown together with a simplified translation. The loop body is given twice: first as sequential code, then as an abstract vector-instruction version of it. The vector version is translated into architecture-dependent vector intrinsics.


Grid1 *g = new Grid1(0, n+1);
 
Grid1IteratorSub it(1, n, g);
 
DistArray x(g), y(g);
 
...

float e = 0;
 
ForEach(int i, it,`
 
  	x(i) += ( y(i+1) + y(i-1) )*.5;
 
  	e    += sqr( y(i) );', `

      x.vecstore(i, x.vec(i) + ( y.vecu(i,1) + y.vecu(i,-1) )*.5);

      PVECSTORE(PVECTOR(e), PVECTOR(e) + sqr(y.vec(i)));' )
 
...
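For reference, the sequential loop body can be written in plain C++ as in the sketch below. It is only an illustration: `std::vector<float>` stands in for `DistArray`, the function name `smooth_and_sum` is hypothetical, and both arrays are assumed to hold n+2 values (indices 0..n+1), matching the grid above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the library: applies the loop body to the
// interior points 1..n and returns the sum of squares e.
// Assumes x.size() == y.size() >= 2.
float smooth_and_sum(std::vector<float>& x, const std::vector<float>& y) {
    const std::size_t n = y.size() - 2;          // interior points 1..n
    float e = 0.0f;
    for (std::size_t i = 1; i <= n; ++i) {
        x[i] += (y[i + 1] + y[i - 1]) * 0.5f;    // x(i) += ( y(i+1) + y(i-1) )*.5
        e    += y[i] * y[i];                     // e    += sqr( y(i) )
    }
    return e;
}
```

A vectorized translation should produce the same x values; e may differ slightly in the last bits, because the partial sums are added in a different order.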

AltiVec code:


#include <altivec.h>

...

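// The grid has indices 0..n+1, so each array needs n+2 elements.

// The loop below assumes that n is a multiple of 4 and that

// &x[1] and &y[1] lie on 16-byte boundaries.

float *x = new float[n+2];

float *y = new float[n+2];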

...

float e = 0;

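// vec_ld and vec_st need a 16-byte aligned address (GCC attribute syntax)

float ve[4] __attribute__((aligned(16))) = {0, 0, 0, 0};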

for (int i=1; i<n; i+=4) {
 
   float *yp = &y[i+1], *y0 = &y[i], *ym = &y[i-1];

   vec_st(vec_madd(

     vec_splats(.5f),

     vec_add(

       vec_perm(vec_ld(0,ym), vec_ld(16,ym),

          vec_lvsl(0,ym)),

       vec_perm(vec_ld(0,yp), vec_ld(16,yp),

          vec_lvsl(0,yp))),

     vec_ld(0,&x[i])),   // add to the old values: x(i) += ...

     0, &x[i]);

   vec_st(vec_add(

      vec_ld(0,&ve[0]),vec_madd(

       vec_ld(0,y0),

       vec_ld(0,y0),

       vec_splats(0.f))),

      0, &ve[0]);

}

e += ve[0] + ve[1] + ve[2] + ve[3];

...

delete[] x;

delete[] y;
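The loop above advances four elements per iteration, so it covers the points 1..n exactly only when n is a multiple of 4. One common way to handle other values of n is to run the blocked loop over the whole blocks and finish with a short sequential remainder loop. The sketch below shows that structure in portable C++. The inner loop over k stands in for one 4-wide vector instruction, and the function name `update_and_sum` is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: updates x[1..n] and returns the sum of squares of
// y[1..n] for arbitrary n. Both arrays must hold n+2 elements (0..n+1).
float update_and_sum(float* x, const float* y, std::size_t n) {
    float ve[4] = {0.0f, 0.0f, 0.0f, 0.0f};  // four partial sums, like the vector accumulator
    std::size_t i = 1;
    // Main part: whole blocks of 4 elements. Each k corresponds to one lane
    // of a 4-wide vector operation.
    for (; i + 3 <= n; i += 4) {
        for (std::size_t k = 0; k < 4; ++k) {
            x[i + k] += 0.5f * (y[i + k + 1] + y[i + k - 1]);
            ve[k]    += y[i + k] * y[i + k];
        }
    }
    float e = ve[0] + ve[1] + ve[2] + ve[3];  // horizontal sum of the lanes
    // Remainder: at most 3 leftover elements, processed sequentially.
    for (; i <= n; ++i) {
        x[i] += 0.5f * (y[i + 1] + y[i - 1]);
        e    += y[i] * y[i];
    }
    return e;
}
```

In an intrinsics version only the main part would use vector instructions; the remainder loop stays scalar.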

Values of n that are not multiples of 4 and unaligned data (the code assumes that x[1] and y[1] lie on 16-byte boundaries) require some code modifications. Compare this with the SSE instruction sets of x86 processors. The vector instructions of the Cell BE SPUs differ slightly.
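As for the alignment assumption, a plain `new float[...]` does not promise that element 1 sits on a 16-byte boundary. One possible way to arrange it is to over-allocate by three floats and shift the base pointer. The sketch below is an illustration only: the type name `AlignedAtOne` is hypothetical, and it assumes `sizeof(float) == 4`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: owns the storage and exposes a pointer `data` to
// element 0 such that &data[1] is a multiple of 16 bytes.
struct AlignedAtOne {
    std::vector<float> storage;  // count elements plus 3 floats of padding
    float* data;                 // element 0 of the usable array

    explicit AlignedAtOne(std::size_t count)
        : storage(count + 3, 0.0f), data(storage.data()) {
        // Float addresses advance in steps of 4 bytes, so at most 3 steps
        // are needed until the address of element 1 is a multiple of 16.
        while (reinterpret_cast<std::uintptr_t>(data + 1) % 16 != 0)
            ++data;
    }

    // Copying would leave `data` pointing into the other object's storage.
    AlignedAtOne(const AlignedAtOne&) = delete;
    AlignedAtOne& operator=(const AlignedAtOne&) = delete;
};
```

With `AlignedAtOne x(n+2), y(n+2);` the pointers `x.data` and `y.data` could take the place of the raw arrays in the listing, and the explicit `delete[]` is no longer needed.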