IA32 SSE SIMD multimedia instructions
Data layout: contiguous and aligned data blocks
A simplified translation of the following example parallel-for loop is given below. The sequential loop body is first written as an abstract vector-instruction version, which is then translated into architecture-dependent vector intrinsics:
Grid1 *g = new Grid1(0, n+1);
Grid1IteratorSub it(1, n, g);
DistArray x(g), y(g);
...
float e = 0;
ForEach(int i, it,`
x(i) += ( y(i+1) + y(i-1) )*.5;
e += sqr( y(i) );', `
x.vecstore(i, x.vec(i,1) + y.vecu(i,1) + y.vecu(i,-1));
PVECSTORE(PVECTOR(e), PVECTOR(e) + sqr(y.vec(i)));' )
...
Intel IA32/AMD SSE code:
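#include <xmmintrin.h>
...
// Indices 0..n+1 are used, so n+2 elements are needed.
// Assumed here: n is a multiple of 4 and &x[1], &y[1] are
// 16-byte aligned (plain new[] does not guarantee this).
float *x = new float[n+2];
float *y = new float[n+2];
...
float e = 0;
const __m128 half = _mm_set1_ps(.5f);
__m128 ve = _mm_setzero_ps();  // four partial sums of y[i]^2
for (int i=1; i<n; i+=4) {
  // x[i..i+3] += .5 * (y[i+1..i+4] + y[i-1..i+2])
  _mm_store_ps(&x[i],
    _mm_add_ps(_mm_load_ps(&x[i]),
      _mm_mul_ps(half,
        _mm_add_ps(_mm_loadu_ps(&y[i+1]),
                   _mm_loadu_ps(&y[i-1])))));
  // accumulate y[i..i+3]^2 in a register
  ve = _mm_add_ps(ve,
    _mm_mul_ps(_mm_load_ps(&y[i]),
               _mm_load_ps(&y[i])));
}
// horizontal sum of the four partial sums
float s[4];
_mm_storeu_ps(s, ve);
e += s[0] + s[1] + s[2] + s[3];
...
delete[] x;
delete[] y;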
The SSE code assumes that n is a multiple of 4 and that x[1] and y[1] lie on 16-byte boundaries. Other values of n and unaligned data require some code modifications. Compare to the AltiVec processor instruction sets.