Parallel-For Loop Example
The following code example is translated into different programming models.
declaration:
Grid1 *g = new Grid1(0, n+1);
Grid1IteratorSub it(1, n, g);
DistArray1<double> x(g), y(g);
double e = 0;
main code:
ForEach(int i, it,
x(i) += ( y(i+1) + y(i-1) )*.5;
e += sqr( x(i) ); )
The following code examples can also be compiled with run-time libraries that
provide the implementations of grids, arrays, and the remaining parts of
the iterator. The result is parallel code. Currently, threads are used for shared memory environments, message passing is used for distributed memory environments, and mixed models are used for hybrid environments.
Arrays
int n = 64;
Grid1 *g = new Grid1(0, n+1);
Grid1IteratorSub it(1, n, g);
DistArray1<double> x(g), y(g);
double e = 0.;
ForEach(int i, it, x(i) += ( y(i+1) + y(i-1) )*.5; e += sqr( x(i) ); )
The `foreach' loop based on the index set of the iterator expands to
the sequential code
for(int i=1; i<n; i++) {x(i) += (y(i+1) + y(i-1))*.5; e += sqr(x(i));}
but provides the semantics of independent operations, and may consist of
arbitrary (reentrant) C++ code including nested function calls. The
result is guaranteed to be independent of the execution sequence. For the
reduction of the variable e this holds only up to floating
point rounding.
Replicated data, i.e. scalars and small data structures allocated on each
process, can be used within a loop either as read-only data (stores are
disallowed) or as a reduction variable (plain loads are disallowed). In the
case of a reduction, special code is generated for thread and/or message
passing environments.
Grid2 *g2 = new Grid2(0, n+1, 0, n+1);
DistArray2<double> z(g2), a(g2);
Grid2IteratorSub ita(1, n, 1, n, g2);
ForEach(`int i, int j', ita, `z(i,j) = ( a(i-1,j-1) + a(i+1,j+1) +
a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1) )/6.;')
Grid2IteratorOutside itb(1, n, 1, n, g2);
ForEach(`int i, int j', itb, `z(i,j) = 0.;')
The nesting of the i and j loops and the execution order are not
specified.
Arrays of Different Shapes
class fine { public: int map(int i) {return i / 2;} } f;
Grid1 *gf = new Grid1(0, 2*n+1, g, &f);
DistArray1map<double, fine> z(gf);
ForEach(int i, it, x(i) = z(2*i)*.5 + ( z(2*i-1) + z(2*i+1) )*.25; )
ForEach(int i, it, z(2*i) = x(i);
z(2*i+1) = ( x(i) + x(i+1) )*.5; )
The first `foreach' loop triggers a left process fetch for the array
z on the finer grid, while the second one triggers a right fetch
for the array x; the stores to z remain local operations under
the transformation pi.
Tree Traversal
class tree : public KAryTree<class tree, 2> {
public: // generic binary tree provides tree* child(int);
complex<double> m, l, f, x;
... };
tree *root = new tree;
Tree creation proceeds by (parallel) insertion of particles or by
(parallel) sorting according to some partitioning scheme, involving
algorithms of different types.
TopDownIterator<tree> down(root);
ForEach(tree *b, down, b->f = b->l; )
The order of execution is no longer arbitrary, but partially ordered,
in this case top down from the root to the leaves, which still exposes
a large amount of parallelism. Operations on the replicated coarse tree are
executed on all processes, and operations on the remaining subtrees are
executed by their respective owners. The first `foreach' loop shows a local
assignment that is completely independent of the execution order.
TopDownIterator<tree> down(root);
ForEach(tree *b, down, `
for (int i=0; i<2; i++)
if (b->child(i)) b->child(i)->l += b->l; ')
The second one performs a local store at the child nodes, so that a
strict parent-before-child order has to be enforced.
Except for possible global reduction operations,
no parallel communication is needed.
BottomUpIterator<tree> up(root);
ForEach(tree *b, up, `
for (int i=0; i<2; i++)
if (b->child(i)) b->m += b->child(i)->m; ')
A bottom-up, leaf-to-root execution order is shown in the example above.
The processes first execute the operations on their own subtrees
in a children-before-parent order. A communication step at the leaf
nodes of the replicated coarse tree is necessary to update them, which
currently uses a global message passing gather operation. Afterwards,
the operations can be executed on all processes on the coarse tree.
Tree Traversal with Neighborhood Communication
The following statements within class tree declaration define the
relation fetch, which guards all accesses to nodes
pointed to by elements of the interaction list inter.
Require( list<tree*> inter, fetch );
double x0, x1;
int fetch(tree *b) {
return (x0==b->x1) || (x1==b->x0);
}
The `Require' statement both declares the variable inter and
attaches the relation fetch to it. The following code shows a
tree iterator using indirect addressing.
ForEach(tree *b, down, `
for (list<tree*>::const_iterator i = b->inter.begin();
i != b->inter.end(); i++)
b->l += log(abs(b->x - (*i)->x)) * (*i)->m; ')
Each process collects all nodes, determined by `fetch' on the leaves
of the coarse tree, that may be needed by another process, and
forwards these.
A Fast Multipole-Summation Code
The first part of the fast multipole summation method computes the far
field multipole expansions in a bottom-up order. Child node
expansions are shifted to the origin of the parent node expansion and
summed up.
void InitMPExp(); // store mp_exp
void ComputeMPExp(); // store mp_exp
void ShiftMPExp(tree *cb); // store mp_exp, load cb->mp_exp
The main code instantiates and uses a tree iterator and contains the
actual algorithmic atom to be executed for each tree node.
BottomUpIterator<tree> iu(root);
ForEach(tree *b, iu, `
if (b->isleaf())
b->ComputeMPExp();
else {
b->InitMPExp();
for (int i=0; i<tree::dim; i++)
if (b->child[i])
b->ShiftMPExp(b->child[i]);
} ')
The second stage of the fast multipole summation computes the
interaction lists. For a balanced tree, the interactions are a set of
siblings of a node. However, in the case of unbalanced trees, additional
nodes on finer or coarser levels may be needed for the interactions. Nevertheless, the
interaction lists can be computed in a top-down tree traversal.
The final stage of the algorithm, which can also be executed in
conjunction with the second stage, computes the local expansions and
finally the fields. Here, the far field multipole expansions are
evaluated directly or converted into near field local expansions,
which need to be evaluated. This can be performed in a top-down
tree traversal with a set of methods in the tree class,
void VListInter(tree *src); // store local_exp, load src->mp_exp
void UListInter(tree *src); // store field, load src->x
void WListInter(tree *src); // store field, load src->mp_exp
void XListInter(tree *src); // store local_exp, load src->x
void ShiftLocalExp(tree *cb); // store cb->local_exp, load local_exp
which perform the four types of possible conversions and shift the
local expansions for propagation to the children nodes.
TopDownIterator<tree> it(root);
ForEach(tree *b, it, `
for (int i=0; i<tree::dim; i++)
if (b->child[i]) {
b->ShiftLocalExp(b->child[i]);
b->child[i]->field = 0.0;
if (b->child[i]->isleaf())
for (list<tree*>::iterator n = b->child[i]->inter.begin();
n != b->child[i]->inter.end(); n++) {
if ((*n)->isleaf())
b->child[i]->UListInter(*n);
else b->child[i]->WListInter(*n);
}
else
for (list<tree*>::iterator n = b->child[i]->inter.begin();
n != b->child[i]->inter.end(); n++) {
if ((*n)->isleaf()) b->child[i]->XListInter(*n);
else b->child[i]->VListInter(*n);
}
b->child[i]->EvaluateLocalExp();
} ')
Both sequential and thread parallel versions are relatively easy to
generate, since there is only a parent to child data dependence in the
local_exp arrays. However, in the distributed memory version of
this code, the dependence on sibling nodes' mp_exp arrays and
particle data x turns out to be a severe problem.
The solution we chose here is an additional hint (a code annotation) in
the application program. The hint provides a criterion to select a
superset of the nodes that may be needed. The transformed code initiates
a message passing step to exchange the variables determined by the
dependence analysis.
Require(list<tree*> neighbor, fetch);
Require(list<tree*> inter, fetch);
int fetch(tree *b) {
return (distance(b) <= 2 * fmin(diam, b->diam));
}