Hybrid MPI and OpenMP Parallel Programming



Rolf Rabenseifner, High Performance Computing Center Stuttgart (HLRS), Germany

Georg Hager, Erlangen Regional Computing Center (RRZE), Germany

Gabriele Jost, AMD (Advanced Micro Devices), Sunnyvale, CA, USA


Half-day Tutorial proposed for Supercomputing 2012 (SC12)


Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with single/multi-socket and multi-core SMP nodes, but also “constellation” type systems with large SMP nodes. Parallel programming may combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside of each node.

This tutorial analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Multi-socket-multi-core systems in highly parallel environments are given special consideration. This includes a discussion on planned future OpenMP support for accelerators. Various hybrid MPI+OpenMP approaches are compared with pure MPI, and benchmark results on different platforms are presented. Numerous case studies demonstrate the performance-related aspects of hybrid MPI/OpenMP programming, and application categories that can take advantage of this model are identified. Tools for hybrid programming such as thread/process placement support and performance analysis are presented in a "how-to" section.


Detailed Description


Tutorial goals:


Straightforward programming of clusters of shared memory nodes often leads to unsatisfactory performance. The participant learns hybrid parallel programming. Both pure message passing (one MPI process on each core) and mixed model programming (multi-threaded MPI processes) only partially fit the architecture of modern HPC systems. The tutorial teaches how to solve the resulting performance problems and also covers technical aspects of mixed model programming. At the end of the tutorial, the attendee will be aware of many pitfalls in parallel programming on clusters of SMP nodes.


Targeted audience:


People who are in charge of developing efficient parallel software on clusters of shared memory nodes.


Content level:


25% Introductory, 50% Intermediate, 25% Advanced


Audience prerequisites:


Some knowledge about parallel programming with MPI and OpenMP.


Why the topic is relevant to SC attendees:


Most systems in HPC and supercomputing environments are clusters of SMP nodes, ranging from clusters of dual/quad-core CPUs to large constellations in tera- and petascale computing. Numerical software for these systems often scales worse than expected. This tutorial helps to find the appropriate programming model and to avoid pitfalls with mixed model (MPI+OpenMP) programming.


General description of tutorial content:


Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with multi-core single/multi-CPU boards, but also "constellation" type systems with large SMP nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside of each node.

This tutorial analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Various hybrid MPI+OpenMP programming models are compared with pure MPI. Benchmark results from several platforms are presented. Bandwidth and latency are shown for intra-socket, inter-socket and inter-node communication. The affinity of processes and their threads and memory is a key factor. The thread-safety status of several existing MPI libraries is also discussed. Case studies with the multi-zone NAS Parallel Benchmarks will be provided to demonstrate various aspects of hybrid MPI/OpenMP programming.


This tutorial analyzes strategies to overcome typical drawbacks of easily usable programming schemes on clusters of SMP nodes.


Detailed Outline


         Introduction  /  Motivation

         Programming models on clusters of SMP nodes

o         Major programming models

o         Pure MPI

o         Hybrid Masteronly Style

o         Overlapping Communication and Computation

o         Pure OpenMP

         Case Studies  /  pure MPI vs. hybrid MPI+OpenMP


o         The Multi-Zone NAS Parallel Benchmarks
with results on different multi-core SMP-clusters (Cray XE6, SGI Altix ICE, IBM Power 6, Westmere Cluster)

o         Examples of thread and process placement with numactl

          “How-to” on hybrid programming

o         Compilation and linkage of hybrid MPI+OpenMP programs

o         Special considerations for multi-socket-multi-core systems in highly parallel environments,
e.g., process/thread pinning, cross-socket/on-socket communication

o         ccNUMA memory locality

o         Intra-socket, inter-socket, and inter-node communication characteristics

o         Control policy for processes, threads and memory, e.g., with taskset, numactl, likwid-pin

o         Hybrid implementation of sparse matrix-vector multiply (e.g. in an iterative solver)

         Mismatch Problems

o         Topology problem

o         Unnecessary intra-node communication

o         Inter-node bandwidth problem

o         Sleeping threads and saturation problem

o         Additional OpenMP overhead

o         Overlapping communication and computation

o         Communication overhead with DSM

o         No silver bullet





         Application categories that can benefit from hybrid parallelization

o         Nested parallelism

o         Load balancing

o         Memory consumption

o         Scaling problems

         Thread-safety quality of MPI libraries

         Tools support for multi-threaded MPI processes

         Other options on clusters of SMP nodes

o         Future OpenMP support for accelerators



o         References


Resume / Curriculum Vitae


Dr. Rolf Rabenseifner


Rolf Rabenseifner studied mathematics and physics at the University of Stuttgart. Since 1984, he has worked at the High Performance Computing Center Stuttgart (HLRS). He led the projects DFN-RPC, a remote procedure call tool, and MPI-GLUE, the first metacomputing MPI combining different vendors' MPIs without losing the full MPI interface. In his dissertation, he developed a controlled logical clock as global time for trace-based profiling of parallel and distributed applications. Since 1996, he has been a member of the MPI-2 Forum. Since Dec. 2007, he has been on the steering committee of the MPI-3 Forum, and he was responsible for the new MPI-2.1 standard. From January to April 1999, he was an invited researcher at the Center for High-Performance Computing at Dresden University of Technology. Currently, he is head of Parallel Computing - Training and Application Services at HLRS. He is involved in MPI profiling and benchmarking, e.g., in the HPC Challenge Benchmark Suite. In recent projects, he studied parallel I/O, parallel programming models for clusters of SMP nodes, and optimization of MPI collective routines. In workshops and summer schools, he teaches parallel programming models at many universities and labs in Germany. In January 2012, the Gauss Center of Supercomputing (GCS), with HLRS, LRZ in Garching and the Jülich Supercomputing Center as members, was selected as one of six PRACE Advanced Training Centers (PATCs), and he was appointed GCS' PATC director.


Homepage: http://www.hlrs.de/people/rabenseifner/

List of publications: https://fs.hlrs.de//projects/rabenseifner/publ/


Dr. Georg Hager


Georg Hager studied theoretical physics at the University of Bayreuth, specializing in nonlinear dynamics, and holds a PhD in Computational Physics from the University of Greifswald. He is a senior researcher in the HPC Services group at Erlangen Regional Computing Center (RRZE), which is part of the University of Erlangen-Nuremberg. Recent research includes architecture-specific optimization strategies for current microprocessors, performance engineering of scientific codes on chip and system levels, and special topics in shared memory and hybrid programming. His daily work encompasses all aspects of user support in High Performance Computing, such as tutorials and training, code parallelization, profiling and optimization, and the assessment of novel computer architectures and tools. His textbook “Introduction to High Performance Computing for Scientists and Engineers” is recommended or required reading in many HPC-related lectures and courses worldwide. In his teaching activities, he puts a strong focus on performance modeling techniques that lead to a better understanding of the interaction of program code with the hardware.


Homepage: http://blogs.fau.de/hager

List of publications: http://blogs.fau.de/hager/publications/

List of talks and teaching activities: http://blogs.fau.de/hager/talks/


Book: Georg Hager and Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Press, ISBN 978-1439811924, 356 pages, July 2010.


Award: Informatics Europe “Curriculum Best Practices Award 2011: Parallelism and Concurrency” in recognition of the outstanding educational initiative “Teaching high performance computing to scientists and engineers: A model-based approach” (together with Jan Treibig and Gerhard Wellein).


Dr. Gabriele Jost


Gabriele Jost obtained her doctorate in Applied Mathematics from the University of Göttingen, Germany. For more than a decade, she worked for various vendors of high performance parallel computers (Suprenum GmbH, Thinking Machines Corporation, and NEC) in the areas of vectorization, parallelization, performance analysis and optimization of scientific and engineering applications. In 1998, she joined the NASA Ames Research Center in Moffett Field, California, USA as a Research Scientist. There, her work focused on evaluating and enhancing tools for parallel program development and investigating the usefulness of different parallel programming paradigms. In 2005, she moved from California to the Pacific Northwest and joined Sun Microsystems as a staff engineer in the Compiler Performance Engineering team. In 2006, she joined Oracle as a Principal Software Engineer and worked on performance analysis of application server software. In 2008, she decided to return to California and pursue her passion for High Performance Computing. She joined the Texas Advanced Computing Center and worked as a Research Scientist supporting TeraGrid users remotely from Monterey, CA. In October 2011, Gabriele joined Advanced Micro Devices (AMD) as a design engineer in the Systems Performance Optimization group.


List of publications:


Book: Barbara Chapman, Gabriele Jost, and Ruud van der Pas: Using OpenMP. MIT Press, Oct. 2007, ISBN 978-0262533027.


Keywords: Clusters, Optimization, Parallel Programming, Performance, Tools


URL of this page: https://fs.hlrs.de/projects/rabenseifner/publ/SC2012-hybrid.html