PowerPC AltiVec SIMD multimedia instructions

Data Layout: contiguous and aligned data blocks
A simplified translation of the following example parallel-for loop is given below. The sequential code is first rewritten with abstract vector instructions, which are then translated into architecture-dependent vector intrinsics.


Grid1 *g = new Grid1(0, n+1);
Grid1IteratorSub it(1, n, g);
DistArray x(g), y(g);
...
float e = 0;
ForEach(int i, it,`
   x(i) += ( y(i+1) + y(i-1) )*.5;
   e += sqr( y(i) );', `
   x.vecstore(i, x.vec(i,1) + y.vecu(i,1) + y.vecu(i,-1));
   PVECSTORE(PVECTOR(e), PVECTOR(e) + sqr(y.vec(i)));' )
...
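
For reference, the sequential branch of the ForEach loop corresponds roughly to the plain C++ loop below. This is only a sketch: the function name update_scalar and the raw float arrays standing in for DistArray are assumptions made for illustration, and sqr is written out as a multiplication.

```cpp
// Scalar reference for the loop body:
//   x(i) += (y(i+1) + y(i-1)) * .5   and   e += sqr(y(i))   for i = 1 .. n.
// x and y are assumed to hold n+2 elements (indices 0 .. n+1),
// matching Grid1(0, n+1). Returns the accumulated sum of squares.
float update_scalar(float* x, const float* y, int n) {
    float e = 0.0f;
    for (int i = 1; i <= n; ++i) {
        x[i] += (y[i + 1] + y[i - 1]) * 0.5f;
        e += y[i] * y[i];
    }
    return e;
}
```

Any vectorized version should reproduce these values of x and e, up to floating-point rounding caused by the different summation order.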


AltiVec code:

#include <altivec.h>
...
float *x = new float[n+2];   // indices 0 .. n+1, as in Grid1(0, n+1)
float *y = new float[n+2];   // the loop reads y[0] and y[n+1]
...
float e = 0;
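float ve[4] __attribute__((aligned(16))) = {0, 0, 0, 0};  // vec_ld/vec_st need 16-byte alignment
for (int i=1; i<n; i+=4) {
   float *yp = &y[i+1], *y0 = &y[i], *ym = &y[i-1];
   // x[i..i+3] += .5 * (y[i-1..i+2] + y[i+1..i+4]);
   // each unaligned y load combines two aligned loads via vec_perm and vec_lvsl
   vec_st(vec_madd(
    vec_splats(.5f),
    vec_add(
     vec_perm(vec_ld(0,ym), vec_ld(16,ym),
      vec_lvsl(0,ym)),
     vec_perm(vec_ld(0,yp), vec_ld(16,yp),
      vec_lvsl(0,yp))),
    vec_ld(0,&x[i])),
    0, &x[i]);
   // ve[0..3] += y[i..i+3] * y[i..i+3]
   vec_st(vec_add(
    vec_ld(0,&ve[0]),vec_madd(
     vec_ld(0,y0),
     vec_ld(0,y0),
     vec_splats(0.f))),
    0, &ve[0]);
}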
}
e += ve[0] + ve[1] + ve[2] + ve[3];
...
delete[] x;
delete[] y;
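
The combination of vec_perm, two vec_ld calls and vec_lvsl in the loop above is the usual way to read four floats from an address that is not 16-byte aligned. The following portable C++ model illustrates the idea as it works on big-endian PowerPC. It is a sketch, not AltiVec code: the type Vec4 and the functions load_aligned and load_unaligned are hypothetical names introduced here.

```cpp
#include <cstdint>
#include <cstring>

// Portable stand-in for a 16-byte vector register holding four floats.
struct Vec4 { float v[4]; };

// Models vec_ld(offset, p): the low four bits of the effective address
// are ignored, so the load always starts at a 16-byte boundary.
Vec4 load_aligned(std::uintptr_t offset_bytes, const float* p) {
    std::uintptr_t addr = (reinterpret_cast<std::uintptr_t>(p) + offset_bytes)
                          & ~static_cast<std::uintptr_t>(15);
    Vec4 r;
    std::memcpy(r.v, reinterpret_cast<const void*>(addr), sizeof r.v);
    return r;
}

// Models vec_perm(vec_ld(0,p), vec_ld(16,p), vec_lvsl(0,p)) for a float
// pointer: pick the four consecutive floats starting at p out of the
// two neighbouring aligned blocks.
Vec4 load_unaligned(const float* p) {
    Vec4 lo = load_aligned(0, p);
    Vec4 hi = load_aligned(16, p);
    // Misalignment of p in floats (0 .. 3); vec_lvsl encodes it in bytes.
    unsigned shift = (reinterpret_cast<std::uintptr_t>(p) & 15) / sizeof(float);
    Vec4 r;
    for (unsigned k = 0; k < 4; ++k) {
        unsigned j = k + shift;
        r.v[k] = (j < 4) ? lo.v[j] : hi.v[j - 4];
    }
    return r;
}
```

Note that the second aligned block is read even when p is already aligned, so the memory following the last element used must be readable.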

The code above assumes that n is a multiple of 4 and that x[1] and y[1] lie on 16-byte boundaries. Other values of n and unaligned data require some code modifications. Compare with the SSE instruction sets of x86 processors. The Cell BE vector instructions on the SPUs differ slightly.
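
One possible shape of these modifications is sketched below in portable C++. The inner loop over k stands in for a single 4-wide vector operation, a scalar remainder loop handles values of n that are not multiples of 4, and a small helper shifts an array so that index 1 falls on a 16-byte boundary. The names align_index1 and update_blocked are hypothetical and chosen for illustration only.

```cpp
#include <cstddef>
#include <cstdint>

// Returns a pointer x into buf such that &x[1] lies on a 16-byte boundary.
// buf must provide at least three spare floats beyond the elements used.
float* align_index1(float* buf) {
    std::uintptr_t a = reinterpret_cast<std::uintptr_t>(buf + 1);
    std::size_t pad = ((16 - (a & 15)) & 15) / sizeof(float);
    return buf + pad;
}

// Blocked loop with a scalar remainder; x and y hold indices 0 .. n+1.
float update_blocked(float* x, const float* y, int n) {
    float ve[4] = {0, 0, 0, 0};          // four partial sums, one per lane
    int i = 1;
    for (; i + 3 <= n; i += 4) {         // full blocks of four elements
        for (int k = 0; k < 4; ++k) {    // stands in for one vector operation
            x[i + k] += (y[i + k + 1] + y[i + k - 1]) * 0.5f;
            ve[k] += y[i + k] * y[i + k];
        }
    }
    float e = ve[0] + ve[1] + ve[2] + ve[3];
    for (; i <= n; ++i) {                // remainder: at most three elements
        x[i] += (y[i + 1] + y[i - 1]) * 0.5f;
        e += y[i] * y[i];
    }
    return e;
}
```

In an AltiVec version the body of the k loop would be replaced by the intrinsics shown earlier, while the remainder loop can stay scalar.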