Parallel For
Parallel-For Loop Example

The following code example is translated to different programming models.

declaration:

Grid1 *g = new Grid1(0, n+1);
Grid1IteratorSub it(1, n, g);
DistArray x(g), y(g);
double e = 0;

main code:

ForEach(int i, it,
   x(i) += ( y(i+1) + y(i-1) )*.5;
   e += sqr( x(i) ); )

The following code examples can also be compiled with run-time libraries that provide the implementations of the grids, the arrays, and the remaining parts of the iterators. The result is parallel code. Currently, threads are used for shared-memory environments, message passing for distributed-memory environments, and mixed models for hybrid environments.

Arrays

int n = 64;
Grid1 *g = new Grid1(0, n+1);
Grid1IteratorSub it(1, n, g);
DistArray1<double> x(g), y(g);
double e = 0.;
ForEach(int i, it, x(i) += ( y(i+1) + y(i-1) )*.5; e += sqr( x(i) ); )

The `foreach' loop based on the index set of the iterator expands to the sequential code

for(int i=1; i<n; i++) {x(i) += (y(i+1) + y(i-1))*.5; e += sqr(x(i));}

but carries the semantics of independent operations; the loop body may consist of arbitrary (reentrant) C++ code, including nested function calls. The result is guaranteed to be independent of the execution sequence. For the reduction into the variable e this holds only up to floating-point rounding.
Replicated data, i.e. scalars and small data structures allocated on each process, can be used within a loop either read-only (a store is disallowed) or as a reduction variable (a plain load is disallowed). In the case of a reduction, special code is generated for thread and/or message-passing environments.

Grid2 *g2 = new Grid2(0, n+1, 0, n+1);
DistArray2<double> z(g2), a(g2);
Grid2IteratorSub ita(1, n, 1, n, g2);
ForEach(`int i, int j', ita, `z(i,j) = ( a(i-1,j-1) + a(i+1,j+1) +
         a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1) )/6.;')
Grid2IteratorOutside itb(1, n, 1, n, g2);
ForEach(`int i, int j', itb, `z(i,j) = 0.;')

The nesting of the i and j loops and the execution order are not specified.

Arrays of Different Shapes

class fine { public: int map(int i) {return i / 2;} } f;
Grid1 *gf = new Grid1(0, 2*n+1, g, &f);
DistArray1map<double, fine> z(gf);
ForEach(int i, it, x(i) = z(2*i)*.5 + ( z(2*i-1) + z(2*i+1) )*.25; )
ForEach(int i, it, z(2*i) = x(i);
         z(2*i+1) = ( x(i) + x(i+1) )*.5; )

The first `foreach' loop triggers a left process fetch for the array z on the finer grid. The second one triggers a right fetch for the array x, while the store to z remains a local operation under the transformation pi.

Tree Traversal

class tree : public KAryTree<class tree, 2> {
public: // generic binary tree provides tree* child(int);
   complex<double> m, l, f, x;
... };
tree *root = new tree;

Tree creation proceeds by (parallel) insertion of particles or (parallel) sorting according to some partitioning scheme, where algorithms of different types are involved.

TopDownIterator<tree> down(root);
ForEach(tree *b, down, b->f = b->l; )

The order of execution is no longer arbitrary but partially ordered, in this case top down from the root to the leaves, which exposes a large amount of parallelism. Operations on the replicated coarse tree are executed on all processes, and operations on the remaining subtrees are executed by their respective owners. The first `foreach' loop shows a local assignment that is completely independent of the execution order.

TopDownIterator<tree> down(root);
ForEach(tree *b, down, `
   for (int i=0; i<2; i++)
      if (b->child(i)) b->child(i)->l += b->l; ')

The second one performs a local store at the child nodes, such that a strict parent-before-child order has to be enforced.
Except for possible global reduction operations, no parallel communication is needed.

BottomUpIterator<tree> up(root);
ForEach(tree *b, up, `
   for (int i=0; i<2; i++)
      if (b->child(i)) b->m += b->child(i)->m; ')

A bottom-up, leaf-to-root execution order is used here. The processes first execute the operations on their own subtrees in a children-before-parent order. A communication step at the replicated coarse tree's leaf nodes is necessary to update them, which currently uses a global message-passing gather operation. Afterwards the operations can be executed on all processes on the coarse tree.

Tree Traversal with Neighborhood Communication

The following statements within the tree class declaration define the relation fetch, which guards all accesses to nodes pointed to by elements of the interaction list inter.

Require( list<tree*> inter, fetch );
double x0, x1;
int fetch(tree *b) {
   return (x0==b->x1) || (x1==b->x0);
}

The `Require' statement both declares the variable inter and attaches the relation fetch to it. The following code shows a tree iterator using indirect addressing.

ForEach(tree *b, down, `
for (list<tree*>::const_iterator i = b->inter.begin();
      i != b->inter.end(); i++)
   b->l += log(abs(b->x - (*i)->x)) * (*i)->m; ')

Each process collects all nodes at the coarse tree's leaves that are selected by `fetch' and may be needed by another process, and forwards these.

A Fast Multipole-Summation Code

The first part of the fast multipole summation method computes the far-field multipole expansions in a bottom-up order. The children's expansions are shifted to the origin of the parent node's expansion and summed up.

void InitMPExp(); // store mp_exp
void ComputeMPExp(); // store mp_exp
void ShiftMPExp(tree *cb); // store mp_exp, load cb->mp_exp

The main code instantiates and uses a tree iterator and contains the actual algorithmic atom to be executed for each tree node.

BottomUpIterator<tree> iu(root);

ForEach(tree *b, iu, `
if (b->isleaf())
   b->ComputeMPExp();
else {
   b->InitMPExp();
   for (int i=0; i<tree::dim; i++)
      if (b->child[i])
         b->ShiftMPExp(b->child[i]);
} ')

The second stage of the fast multipole summation computes the interaction lists. For a balanced tree, the interactions are a set of siblings of a node. However, in the case of unbalanced trees, additional nodes on finer or coarser levels may be needed for the interactions. Nevertheless, the interaction lists can be computed in a top-down tree traversal.
The final stage of the algorithm, which can also be executed in conjunction with the second stage, computes the local expansions and finally the fields. Here, the far-field multipole expansions are either evaluated directly or converted into near-field local expansions, which are evaluated afterwards. This can be performed in a top-down tree traversal with a set of methods in the tree class,

void VListInter(tree *src); // store local_exp, load src->mp_exp
void UListInter(tree *src); // store field, load src->x
void WListInter(tree *src); // store field, load src->mp_exp
void XListInter(tree *src); // store local_exp, load src->x
void ShiftLocalExp(tree *cb); // store cb->local_exp, load local_exp

which perform the four types of possible conversions and shift the local expansions for propagation to the children nodes.

ForEach(tree *b, it, `
for (int i=0; i<tree::dim; i++)
   if (b->child[i]) {
      b->ShiftLocalExp(b->child[i]);
      b->child[i]->field = 0.0;
      if (b->child[i]->isleaf())
         for (list<tree*>::iterator n = b->child[i]->inter.begin();
              n != b->child[i]->inter.end(); n++) {
            if ((*n)->isleaf()) b->child[i]->UListInter(*n);
            else b->child[i]->WListInter(*n);
         }
      else
         for (list<tree*>::iterator n = b->child[i]->inter.begin();
              n != b->child[i]->inter.end(); n++) {
            if ((*n)->isleaf()) b->child[i]->XListInter(*n);
            else b->child[i]->VListInter(*n);
         }
      b->child[i]->EvaluateLocalExp();
   } ')

Both sequential and thread parallel versions are relatively easy to generate, since there is only a parent to child data dependence in the local_exp arrays. However, in the distributed memory version of this code, the dependence on sibling nodes' mp_exp arrays and particle data x turns out to be a severe problem.
The solution we chose here is an additional hint (code annotation) in the application program. The hint provides a criterion that selects a superset of the nodes that may be needed. The transformed code initiates a message-passing step to exchange the variables determined by the dependence analysis.

Require(list<tree*> neighbor, fetch);
Require(list<tree*> inter, fetch);
int fetch(tree *b) {
   return (distance(b) <= 2 * fmin(diam, b->diam));
}