Parallel For

IA32 SSE SIMD multimedia instructions

Data layout: contiguous and aligned data blocks
A simplified translation of the following example parallel-for loop is given below. The loop carries two bodies: the sequential code and an abstract vector-instruction version of it. The vector version is translated into architecture-dependent vector intrinsics.


Grid1 *g = new Grid1(0, n+1);
Grid1IteratorSub it(1, n, g);
DistArray x(g), y(g);
...
float e = 0;
ForEach(int i, it,`
    x(i) += ( y(i+1) + y(i-1) )*.5;
    e    += sqr( y(i) );', `
    x.vecstore(i, x.vec(i,1) + y.vecu(i,1) + y.vecu(i,-1));
    PVECSTORE(PVECTOR(e), PVECTOR(e) + sqr(y.vec(i)));' )
...
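For checking a vectorized translation, it can help to have the sequential body written out on plain arrays. The following is only a sketch: the function name smooth_and_norm is hypothetical, std::vector stands in for DistArray, and the iterator range is assumed to be 1..n inclusive on a grid with indices 0..n+1.

```cpp
#include <cstddef>
#include <vector>

// Scalar reference for the sequential loop body (hypothetical helper,
// not part of the library shown above).
// Assumption: x and y hold indices 0..n+1 and the loop visits 1..n inclusive.
// Returns the sum of squares of y[1..n], i.e. the contribution added to e.
float smooth_and_norm(std::vector<float>& x, const std::vector<float>& y,
                      std::size_t n) {
    float e = 0.0f;
    for (std::size_t i = 1; i <= n; ++i) {
        x[i] += (y[i + 1] + y[i - 1]) * 0.5f;  // average of the two neighbours
        e += y[i] * y[i];                      // sqr(y(i))
    }
    return e;
}
```

A vectorized version can be compared against this reference element by element; the reduction e may differ in the last bits because the summation order changes.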

Intel IA32/AMD SSE code:


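#include <xmmintrin.h>
...
// Indices 0..n+1 are accessed, so each array needs n+2 elements.
// Assumes n is a multiple of 4 and &x[1], &y[1] are 16-byte aligned;
// plain new[] does not guarantee this alignment.
float *x = new float[n+2];
float *y = new float[n+2];
...
float e = 0;
alignas(16) float ve[4] = {0, 0, 0, 0};  // partial sums, one per SSE lane
const float half = .5f;
for (int i=1; i<=n; i+=4) {
    // x[i..i+3] += .5 * (y[i+1..i+4] + y[i-1..i+2])
    _mm_store_ps(&x[i],
        _mm_add_ps(_mm_load_ps(&x[i]),
            _mm_mul_ps(_mm_load1_ps(&half),
                _mm_add_ps(_mm_loadu_ps(&y[i+1]),
                           _mm_loadu_ps(&y[i-1])))));
    // ve += y[i..i+3] * y[i..i+3]
    _mm_store_ps(&ve[0],
        _mm_add_ps(_mm_load_ps(&ve[0]),
            _mm_mul_ps(_mm_load_ps(&y[i]),
                       _mm_load_ps(&y[i]))));
}
// horizontal sum of the four partial sums
e += ve[0] + ve[1] + ve[2] + ve[3];
...
delete[] x;
delete[] y;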

The code assumes that n is a multiple of 4 and that x[1] and y[1] lie on 16-byte boundaries. Other values of n and unaligned data require some code modifications. Compare with the AltiVec instruction set.
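One possible modification, sketched here under stated assumptions, is to use unaligned loads and stores throughout and to finish the leftover elements with a scalar loop. The function name smooth_and_norm_sse is hypothetical; the update follows the sequential loop body (x is accumulated into), and x and y are assumed to hold indices 0..n+1. Unaligned accesses may be slower than aligned ones on some processors, so this trades speed for generality.

```cpp
#include <xmmintrin.h>
#include <cstddef>

// Hypothetical helper: works for any n and any alignment of x and y.
// Assumption: x and y hold indices 0..n+1; elements 1..n are updated.
// Returns the sum of squares of y[1..n].
float smooth_and_norm_sse(float* x, const float* y, std::size_t n) {
    const __m128 half = _mm_set1_ps(0.5f);
    __m128 ve = _mm_setzero_ps();  // four partial sums, one per lane
    std::size_t i = 1;

    // Vector part: runs while a full block i..i+3 fits inside 1..n.
    for (; i + 3 <= n; i += 4) {
        __m128 yi = _mm_loadu_ps(&y[i]);
        __m128 s  = _mm_add_ps(_mm_loadu_ps(&y[i + 1]),
                               _mm_loadu_ps(&y[i - 1]));
        _mm_storeu_ps(&x[i],
                      _mm_add_ps(_mm_loadu_ps(&x[i]), _mm_mul_ps(half, s)));
        ve = _mm_add_ps(ve, _mm_mul_ps(yi, yi));
    }

    // Horizontal sum of the four lanes.
    float lanes[4];
    _mm_storeu_ps(lanes, ve);
    float e = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    // Scalar remainder: at most three trailing elements.
    for (; i <= n; ++i) {
        x[i] += (y[i + 1] + y[i - 1]) * 0.5f;
        e += y[i] * y[i];
    }
    return e;
}
```

Keeping the partial sums in a register instead of storing ve to memory in every iteration is a design choice of this sketch, not something the original listing does.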