Mixed and Hybrid Parallel Programming Models

Current computing platforms offer several levels of parallelism: SIMD instructions (e.g. 4 floating point operations at once) within a processor core, multiple cores on a processor chip, multiple processor chips in a shared-memory node, and multiple nodes connected (possibly in several layers) to form a parallel computer.
Similar hierarchies can be constructed from a cluster of Cell processor nodes, each with 8 co-processor SPUs and SIMD instructions. An alternative is a cluster of nodes with attached GPUs for numerical computations. CUDA organizes the GPU into multiprocessors of 8 scalar processors running strictly in SIMD mode; 4 such groups form a warp of 32 threads sharing the multiprocessor's resources. A number of warps can be executed in parallel on the available multiprocessors of a GPU board. Brook+ organizes data into streams, which are written to and read from the GPU running in SIMD mode.

Parallel programming models for hierarchically structured parallel computers use a corresponding hierarchy of programming models: an outer MPI message-passing layer for the distributed-memory part of the machine, a middle layer of thread parallelism (Pthreads, OpenMP) for the shared-memory multiprocessor nodes, and an inner SIMD layer for the instruction-level parallelism.
Similarly, message passing can be combined with Cell or GPU inner programming models, where again a mixture of thread and SIMD programming is employed. Note that multi-threading is not necessarily the optimal way to use (larger, virtually) shared memory computers: at some point message-passing or one-sided communication may be superior.