Rolf Rabenseifner, High Performance Computing Center Stuttgart (HLRS), Germany
Georg Hager, Erlangen Regional Computing Center (RRZE), Germany
Gabriele Jost, Supersmith, Monterey, CA, USA
Half-day Tutorial at Supercomputing 2013 (SC13)
Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with single/multi-socket and multi-core SMP nodes, but also “constellation” type systems with large SMP nodes. Parallel programming may combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node.
This tutorial analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Multi-socket multi-core systems in highly parallel environments are given special consideration. MPI-3.0 introduced a new shared memory programming interface, which can be combined with MPI message passing and remote memory access on the cluster interconnect. It can be used for direct neighbor accesses similar to OpenMP or for direct halo copies, and it enables new hybrid programming models. These models are compared with various hybrid MPI+OpenMP approaches and pure MPI. This tutorial also includes a discussion of OpenMP support for accelerators. Benchmark results on different platforms are presented. Numerous case studies demonstrate the performance-related aspects of hybrid programming, and application categories that can take advantage of this model are identified. Tools for hybrid programming, such as thread/process placement support and performance analysis, are presented in a "how-to" section.
Details: https://fs.hlrs.de/projects/rabenseifner/publ/SC2013-hybrid.html
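To make the new MPI-3.0 interface concrete, the following is a minimal sketch (an illustrative assumption, not taken from the tutorial material) of allocating a shared memory window on each node and reading a neighbor's segment with a plain load; the ring-style neighbor choice and the variable names are made up for this example:

    /* Sketch: MPI-3.0 shared memory window with direct neighbor access */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Split MPI_COMM_WORLD into one communicator per shared memory node */
        MPI_Comm nodecomm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &nodecomm);
        int nrank, nsize;
        MPI_Comm_rank(nodecomm, &nrank);
        MPI_Comm_size(nodecomm, &nsize);

        /* Each rank contributes one double to a node-local shared window */
        double *mybase;
        MPI_Win win;
        MPI_Win_allocate_shared(sizeof(double), sizeof(double),
                                MPI_INFO_NULL, nodecomm, &mybase, &win);

        /* Query the base address of the left neighbor's window segment */
        int left = (nrank + nsize - 1) % nsize;
        MPI_Aint segsize;
        int disp_unit;
        double *leftbase;
        MPI_Win_shared_query(win, left, &segsize, &disp_unit, &leftbase);

        MPI_Win_fence(0, win);          /* open an access/exposure epoch      */
        *mybase = 100.0 + nrank;        /* store into my own window segment   */
        MPI_Win_fence(0, win);          /* make the stores visible node-wide  */
        printf("node rank %d reads %.1f from rank %d\n",
               nrank, *leftbase, left); /* plain load from the neighbor       */

        MPI_Win_free(&win);
        MPI_Comm_free(&nodecomm);
        MPI_Finalize();
        return 0;
    }

The same window can also hold halo cells, so that halo updates between ranks on the same node become direct memory copies instead of message passing.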
Detailed Description
Tutorial goals:
Straightforward programming of clusters of shared memory nodes often leads to unsatisfactory performance. Participants learn various hybrid parallel programming models, including the new MPI-3.0 interfaces. Pure message passing (one MPI process on each core) and mixed-model programming (multi-threaded MPI processes) only partially fit the architecture of modern HPC systems. The tutorial teaches how to solve these performance problems and also covers the technical aspects of mixed-model programming. At the end of the tutorial, attendees will be aware of the many pitfalls of parallel programming on clusters of SMP nodes.
Targeted audience:
People who are in charge of developing efficient parallel software on clusters of shared memory nodes.
Content level:
25% Introductory, 50% Intermediate, 25% Advanced
Audience prerequisites:
Some knowledge about parallel programming with MPI and OpenMP.
Why the topic is relevant to SC attendees:
Most systems in HPC and supercomputing environments are clusters of SMP nodes, ranging from clusters of multi-core CPUs to large constellations in tera- and petascale computing. Numerical software for these systems often scales worse than expected. This tutorial helps attendees find the appropriate programming model and avoid the pitfalls of mixed-model (MPI+OpenMP) programming, and it introduces new programming schemes based on the newly published MPI-3.0 standard.
General description of tutorial content:
Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with multi-core single/multi-CPU boards, but also "constellation" type systems with large SMP nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node.
This tutorial analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Various hybrid shared memory and message passing parallel programming models based on MPI-3.0 and OpenMP are compared with pure MPI. Benchmark results on several platforms are presented. Bandwidth and latency are shown for intra-socket, inter-socket, and inter-node one-sided and two-sided communication. The affinity of processes and their threads and memory is a key factor. The thread-safety status of several existing MPI libraries is also discussed. Case studies with the multi-zone NAS Parallel Benchmarks and the Mantevo Application Proxies will be provided to demonstrate various aspects of hybrid MPI/OpenMP programming.
This tutorial analyzes strategies to overcome typical drawbacks of easily usable programming schemes on clusters of SMP nodes.
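As a baseline for these comparisons, the following is a minimal sketch (an assumed example, not tutorial code) of the hybrid masteronly style: MPI is called only outside of OpenMP parallel regions, so MPI_THREAD_FUNNELED support is sufficient; the 1-D Jacobi-like kernel and the sizes are illustrative:

    /* Sketch: hybrid masteronly style -- communicate, then compute with all threads */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N     1000000
    #define NITER 10

    int main(int argc, char **argv)
    {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        static double u[N], unew[N];        /* zero-initialized local domain */
        double halo_left = 0.0, halo_right = 0.0;
        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        for (int iter = 0; iter < NITER; iter++) {
            /* Communication phase: master thread only (masteronly style) */
            MPI_Sendrecv(&u[N-1], 1, MPI_DOUBLE, right, 0,
                         &halo_left, 1, MPI_DOUBLE, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[0], 1, MPI_DOUBLE, left, 1,
                         &halo_right, 1, MPI_DOUBLE, right, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Computation phase: all OpenMP threads of this MPI process */
            #pragma omp parallel for schedule(static)
            for (int i = 1; i < N - 1; i++)
                unew[i] = 0.5 * (u[i-1] + u[i+1]);
            unew[0]   = 0.5 * (halo_left + u[1]);
            unew[N-1] = 0.5 * (u[N-2] + halo_right);

            #pragma omp parallel for schedule(static)
            for (int i = 0; i < N; i++)
                u[i] = unew[i];
        }

        if (rank == 0) printf("done, u[0] = %f\n", u[0]);
        MPI_Finalize();
        return 0;
    }

During the communication phase all threads but the master are idle; the sleeping-threads problem and the overlapping of communication and computation discussed in the outline below address exactly this drawback.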
Detailed Outline
• Introduction / Motivation
• Programming models on clusters of SMP nodes
  o Major programming models
  o Pure MPI
  o Hybrid MPI cluster communication & MPI shared memory access
    • Using MPI-3.0 shared memory for halo transfers
    • Using MPI-3.0 shared memory for PGAS-like shared memory programming
  o Hybrid MPI & OpenMP
    • Hybrid masteronly style
    • Overlapping communication and computation
  o Pure OpenMP
• Case studies / pure MPI vs. hybrid MPI+OpenMP
  o The Multi-Zone NAS Parallel Benchmarks and Mantevo Application Proxies, with results on different multi-core SMP clusters (Cray XE6, SGI Altix ICE, IBM Power 6, Westmere cluster)
  o Examples of thread and process placement with numactl
• “How-to” on hybrid programming
  o Compilation and linkage of hybrid MPI+OpenMP programs
  o Special considerations for multi-socket multi-core systems in highly parallel environments, e.g., process/thread pinning, cross-socket/on-socket communication
  o ccNUMA memory locality
  o Intra-socket, inter-socket, and inter-node communication characteristics
  o Control policy for processes, threads, and memory, e.g., with taskset, numactl, likwid-pin (see the verification sketch after this outline)
  o Hybrid implementation of sparse matrix-vector multiply (e.g., in an iterative solver)
• Mismatch problems
  o Topology problem
  o Unnecessary intra-node communication
  o Inter-node bandwidth problem
  o Sleeping threads and saturation problem
  o Additional OpenMP overhead
  o Overlapping communication and computation
  o Communication overhead with DSM
  o No silver bullet
• Application categories that can benefit from hybrid parallelization
  o Nested parallelism
  o Load balancing
  o Memory consumption
  o Scaling problems
• Thread-safety quality of MPI libraries
• Tools support for multi-threaded MPI processes
• Other options on clusters of SMP nodes
  o MPI/OpenACC and OpenMP support for accelerators
• Summary
• Appendix
  o References
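The placement and pinning items in the how-to part above refer to the following minimal verification sketch (a Linux-specific assumption, not tutorial material): every OpenMP thread of every MPI process reports the core it is currently running on, so the effect of numactl, taskset, likwid-pin, or the MPI launcher's own binding options can be checked directly:

    /* Sketch: report the core each thread of each MPI process runs on (Linux) */
    #define _GNU_SOURCE
    #include <mpi.h>
    #include <omp.h>
    #include <sched.h>      /* sched_getcpu() */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* sched_getcpu() returns the core the calling thread runs on now */
            printf("MPI rank %d, OpenMP thread %d of %d, on core %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads(),
                   sched_getcpu());
        }

        MPI_Finalize();
        return 0;
    }

Running this with the intended OMP_NUM_THREADS setting under the chosen pinning tool and comparing the reported cores with the intended placement quickly exposes wrong or missing affinity settings.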
Resume / Curriculum Vitae
Dr. Rolf Rabenseifner
Rolf Rabenseifner studied mathematics and physics at the University of Stuttgart. Since 1984, he has worked at the High Performance Computing Center Stuttgart (HLRS). He led the projects DFN-RPC, a remote procedure call tool, and MPI-GLUE, the first metacomputing MPI combining different vendors' MPIs without losing the full MPI interface. In his dissertation, he developed a controlled logical clock as global time for trace-based profiling of parallel and distributed applications. Since 1996, he has been a member of the MPI-2 Forum; since Dec. 2007 he has been on the steering committee of the MPI-3 Forum and was responsible for the new MPI-2.1 standard. From January to April 1999, he was an invited researcher at the Center for High-Performance Computing at Dresden University of Technology.
Currently, he is head of Parallel Computing - Training and Application Services at HLRS. He is involved in MPI profiling and benchmarking, e.g., in the HPC Challenge Benchmark Suite. In recent projects, he studied parallel I/O, parallel programming models for clusters of SMP nodes, and optimization of MPI collective routines. In workshops and summer schools he teaches parallel programming models at many universities and labs in Germany. In January 2012, the Gauss Centre for Supercomputing (GCS), with HLRS, LRZ in Garching, and the Jülich Supercomputing Centre as members, was selected as one of six PRACE Advanced Training Centers (PATCs), and he was appointed as the GCS' PATC director.
Homepage: http://www.hlrs.de/people/rabenseifner/
List of publications: https://fs.hlrs.de//projects/rabenseifner/publ/
Dr. Georg Hager
Georg Hager studied theoretical physics at the University of Bayreuth, specializing in nonlinear dynamics, and holds a PhD in Computational Physics from the University of Greifswald. He is a senior researcher in the HPC Services group at Erlangen Regional Computing Center (RRZE), which is part of the University of Erlangen-Nuremberg. Recent research includes architecture-specific optimization strategies for current microprocessors, performance engineering of scientific codes on the chip and system levels, and special topics in shared memory and hybrid programming. His daily work encompasses all aspects of user support in High Performance Computing, such as tutorials and training, code parallelization, profiling and optimization, and the assessment of novel computer architectures and tools. His textbook “Introduction to High Performance Computing for Scientists and Engineers” is recommended or required reading in many HPC-related lectures and courses worldwide. In his teaching activities he puts a strong focus on performance modeling techniques that lead to a better understanding of the interaction of program code with the hardware.
Homepage: http://blogs.fau.de/hager
List of publications: http://blogs.fau.de/hager/publications/
List of talks and teaching activities: http://blogs.fau.de/hager/talks/
Dr. Gabriele Jost
Gabriele Jost obtained her doctorate in Applied Mathematics from the University of Göttingen, Germany. For more than a decade she worked for various vendors of high performance parallel computers (Suprenum GmbH, Thinking Machines Corporation, and NEC) in the areas of vectorization, parallelization, performance analysis, and optimization of scientific and engineering applications.
In 1998 she joined the NASA Ames Research Center in Moffett Field, California, USA, as a Research Scientist. There her work focused on evaluating and enhancing tools for parallel program development and investigating the usefulness of different parallel programming paradigms.
In 2005 she moved from California to the Pacific Northwest and joined Sun Microsystems as a staff engineer in the Compiler Performance Engineering team. In 2006 she joined Oracle as a Principal Software Engineer and worked on performance analysis of application server software. In 2008 she decided to return to California and pursue her passion for High Performance Computing. She joined the Texas Advanced Computing Center and worked as a Research Scientist supporting TeraGrid users remotely from Monterey, CA. She is now with Supersmith, a small California corporation specializing in software creation and service support for high performance computing.
List of publications: http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/j/Jost:Gabriele.html
Book:
· Barbara Chapman, Gabriele Jost, and Ruud van der Pas: Using OpenMP. MIT Press, Oct. 2007, ISBN 978-0262533027.
Keywords: Clusters, Optimization, Parallel Programming, Performance, Tools
URL of this page: https://fs.hlrs.de/projects/rabenseifner/publ/SC2013-hybrid.html