PowerPC AltiVec SIMD multimedia instructions
Data Layout: contiguous and aligned data blocks
A simplified translation of the following example parallel-for loop is given below. The ForEach macro takes two loop bodies: the sequential code and an abstract vector-instruction version of it. The abstract vector version is then translated into architecture-dependent vector intrinsics.
Grid1 *g = new Grid1(0, n+1);
Grid1IteratorSub it(1, n, g);
DistArray x(g), y(g);
...
float e = 0;
ForEach(int i, it,`
x(i) += ( y(i+1) + y(i-1) )*.5;
e += sqr( y(i) );', `
x.vecstore(i, x.vec(i,1) + y.vecu(i,1) + y.vecu(i,-1));
PVECSTORE(PVECTOR(e), PVECTOR(e) + sqr(y.vec(i)));' )
...
AltiVec code:
#include <altivec.h>
...
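// Grid points 0..n+1 need n+2 values. The loop below assumes that n is a
// multiple of 4 and that x[1] and y[1] lie on 16-byte boundaries.
float *x = new float[n+2];
float *y = new float[n+2];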
...
float e = 0;
float ve[4] __attribute__((aligned(16))) = {0, 0, 0, 0};  // vec_ld/vec_st need 16-byte alignment
for (int i=1; i<n; i+=4) {
  float *yp = &y[i+1], *y0 = &y[i], *ym = &y[i-1];
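  // x[i..i+3] += .5 * (y[i-1..i+2] + y[i+1..i+4]); ym and yp are unaligned,
  // so each operand is assembled from two aligned loads with vec_perm
  vec_st(vec_madd(
      vec_splats(.5f),
      vec_add(
        vec_perm(vec_ld(0,ym), vec_ld(16,ym),
                 vec_lvsl(0,ym)),
        vec_perm(vec_ld(0,yp), vec_ld(16,yp),
                 vec_lvsl(0,yp))),
      vec_ld(0,&x[i])),
    0, &x[i]);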
  // ve += y[i..i+3]^2 (lane-wise partial sums)
  vec_st(vec_add(
      vec_ld(0,&ve[0]), vec_madd(
        vec_ld(0,y0),
        vec_ld(0,y0),
        vec_splats(0.f))),
    0, &ve[0]);
}
e += ve[0] + ve[1] + ve[2] + ve[3];
...
delete[] x;
delete[] y;
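The loop reads y[i-1] and y[i+1] through the combination vec_perm(vec_ld(0,p), vec_ld(16,p), vec_lvsl(0,p)), because vec_ld ignores the low four address bits and always returns an aligned 16-byte block. The following portable C++ model may help to see what that combination computes. It is only a sketch: Vec4 and the three helper functions are hypothetical names introduced here, element indices stand in for addresses, and data[0] is assumed to sit on a 16-byte boundary.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<float, 4>;  // stand-in for `vector float`

// Hypothetical model of vec_ld: returns the aligned block of 4 floats that
// contains data[idx] (index rounded down to a multiple of 4).
Vec4 load_aligned_block(const float* data, std::size_t idx) {
    std::size_t start = idx & ~static_cast<std::size_t>(3);
    return {data[start], data[start + 1], data[start + 2], data[start + 3]};
}

// Hypothetical model of vec_perm(lo, hi, vec_lvsl(0, p)): picks 4 consecutive
// floats out of the concatenation lo:hi, starting at lane `shift` (0..3).
Vec4 select_shifted(const Vec4& lo, const Vec4& hi, std::size_t shift) {
    Vec4 r{};
    for (std::size_t k = 0; k < 4; ++k) {
        std::size_t s = shift + k;
        r[k] = (s < 4) ? lo[s] : hi[s - 4];
    }
    return r;
}

// Unaligned load of data[idx..idx+3] built from two aligned loads.
// Note that the second block is read even when idx is already aligned.
Vec4 load_unaligned(const float* data, std::size_t idx) {
    Vec4 lo = load_aligned_block(data, idx);      // like vec_ld(0, p)
    Vec4 hi = load_aligned_block(data, idx + 4);  // like vec_ld(16, p)
    return select_shifted(lo, hi, idx & 3);       // like vec_perm with vec_lvsl
}
```

For example, with data holding 0, 1, ..., 11, load_unaligned(data, 3) yields the four values 3, 4, 5, 6, taken from the blocks 0..3 and 4..7.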
The code above assumes that n is a multiple of 4 and that the data is aligned (x[1] and y[1] at 16-byte boundaries); other values of n and unaligned data require some code modifications. Compare to the SSE processor instruction sets. The Cell BE vector instructions on the SPUs differ slightly.
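One common modification for loop lengths that are not a multiple of 4 is to run the vector loop over the full blocks only and to finish the remaining elements with scalar code. The portable C++ sketch below illustrates that split for the example loop. It is an assumption about how the remainder could be handled, not the translation used here: the function name smooth_and_norm is hypothetical, and the inner loop over k merely stands in for one 4-wide vector operation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: x[i] += .5*(y[i+1] + y[i-1]) for i = 1..n and
// returns the sum of y[i]^2. x and y hold n+2 values (indices 0..n+1).
float smooth_and_norm(std::vector<float>& x, const std::vector<float>& y,
                      std::size_t n) {
    float ve[4] = {0, 0, 0, 0};  // per-lane partial sums, as in the AltiVec loop
    std::size_t i = 1;

    // Main part: full blocks of 4 elements. Each pass of the k loop
    // corresponds to one lane of a vector instruction.
    for (; i + 3 <= n; i += 4) {
        for (std::size_t k = 0; k < 4; ++k) {
            x[i + k] += .5f * (y[i + k + 1] + y[i + k - 1]);
            ve[k] += y[i + k] * y[i + k];
        }
    }
    float e = ve[0] + ve[1] + ve[2] + ve[3];

    // Remainder: the last n % 4 elements are processed one at a time.
    for (; i <= n; ++i) {
        x[i] += .5f * (y[i + 1] + y[i - 1]);
        e += y[i] * y[i];
    }
    return e;
}
```

With n = 6, the first loop covers i = 1..4 as one block and the scalar loop handles i = 5 and i = 6. Unaligned x or y would additionally need a scalar prologue up to the first 16-byte boundary, which this sketch leaves out.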