Rolf Rabenseifner, High Performance Computing Center Stuttgart (HLRS), Germany
Georg Hager, Erlangen Regional Computing Center (RRZE), Germany
Gabriele Jost, Texas Advanced Computing Center, The University of Texas at Austin, USA
Half-day Tutorial at Supercomputing 2011 (SC11)
Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with single/multi-socket and multi-core SMP nodes, but also "constellation" type systems with large SMP nodes. Parallel programming may combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node.
This
tutorial analyzes the strengths and weaknesses of several parallel programming
models on clusters of SMP nodes. Multi-socket-multi-core systems in highly
parallel environments are given special consideration. This includes a
discussion on planned future OpenMP support for accelerators. Various hybrid
MPI+OpenMP approaches are compared with pure MPI, and benchmark results on
different platforms are presented. Numerous case studies demonstrate the
performance-related aspects of hybrid MPI/OpenMP programming, and application
categories that can take advantage of hybrid programming are identified. Tools
for hybrid programming such as thread/process placement support and performance
analysis are presented in a "how-to" section.
Detailed Description
Tutorial goals:
Straightforward programming of clusters of shared memory nodes often leads to unsatisfactory performance. Pure message passing (one MPI process on each core) and mixed-model programming (multi-threaded MPI processes) each fit the architecture of modern HPC systems only partially. The participant learns hybrid parallel programming: the tutorial teaches how to solve these performance problems and also covers the technical aspects of mixed-model programming. At the end of the tutorial, the attendee will be aware of the many pitfalls of parallel programming on clusters of SMP nodes.
Targeted audience:
People who are in charge of developing efficient parallel software on clusters of shared memory nodes.
Content level:
25% Introductory, 50% Intermediate, 25%
Advanced
Audience prerequisites:
Some
knowledge about parallel programming with MPI and OpenMP.
Why the topic is relevant to
SC attendees:
Most systems in HPC and supercomputing environments are clusters of SMP nodes, ranging from clusters of dual/quad-core CPUs to large constellations in Tera- and Peta-scale computing. Numerical software for these systems often scales worse than expected. This tutorial helps attendees find the appropriate programming model and avoid the pitfalls of mixed-model (MPI+OpenMP) programming.
General description of
tutorial content:
Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with quad-core single/multi-CPU boards, but also "constellation" type systems with large SMP nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node.
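To make this combination concrete, the following minimal C sketch (an illustration only, not part of the tutorial material; it assumes an MPI library and an OpenMP-capable compiler, e.g. built with "mpicc -fopenmp") runs one multi-threaded MPI process per node: MPI spans the node interconnect, while an OpenMP parallel region uses the cores inside each node. MPI_THREAD_FUNNELED indicates that only the master thread performs MPI calls.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, size;

        /* Only the master thread will make MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        #pragma omp parallel
        {
            /* Shared memory parallelism inside one SMP node. */
            printf("MPI rank %d of %d: OpenMP thread %d of %d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

Launched with one MPI process per node and OMP_NUM_THREADS set to the number of cores per node, this corresponds to the widely used "masteronly" hybrid scheme that is compared with pure MPI in the tutorial.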
This tutorial analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Various hybrid MPI+OpenMP programming models are compared with pure MPI. Benchmark results on several platforms are presented. Bandwidth and latency are shown for intra-socket, inter-socket, and inter-node communication. The affinity of processes, their threads, and memory is a key factor. The thread-safety status of several existing MPI libraries is also discussed. Case studies with the Multi-zone NAS Parallel Benchmarks will be provided to demonstrate various aspects of hybrid MPI/OpenMP programming.
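As a small illustration of the thread-safety and affinity points above (a sketch under the assumption of a Linux system; sched_getcpu() is a GNU extension and the code is not taken from the tutorial slides), a hybrid program can query the thread support level its MPI library actually provides and report where each thread runs:

    #define _GNU_SOURCE
    #include <mpi.h>
    #include <omp.h>
    #include <sched.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        /* Request full thread support; the library reports what it provides. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

        if (rank == 0 && provided < MPI_THREAD_MULTIPLE)
            printf("MPI library provides only thread level %d\n", provided);

        #pragma omp parallel
        {
            /* Report the core each thread runs on; placement decides whether
               memory accesses and communication stay intra-socket, go
               inter-socket, or cross the node interconnect. */
            printf("rank %d on %s: thread %d on core %d\n",
                   rank, host, omp_get_thread_num(), sched_getcpu());
        }

        MPI_Finalize();
        return 0;
    }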
This
tutorial analyzes strategies to overcome typical drawbacks of easily usable
programming schemes on clusters of SMP nodes.
Detailed Outline
Resume / Curriculum Vitae
Dr. Rolf Rabenseifner
Rolf Rabenseifner studied mathematics and physics at the University of Stuttgart. Since 1984, he has worked at the High-Performance Computing-Center Stuttgart (HLRS). He led the projects DFN-RPC, a remote procedure call tool, and MPI-GLUE, the first metacomputing MPI combining different vendors' MPIs without losing the full MPI interface. In his dissertation, he developed a controlled logical clock as global time for trace-based profiling of parallel and distributed applications. Since 1996, he has been a member of the MPI-2 Forum; since Dec. 2007 he has been on the steering committee of the MPI-3 Forum and was responsible for the new MPI-2.1 standard. From January to April 1999, he was an invited researcher at the Center for High-Performance Computing at Dresden University of Technology.
Currently, he is head of Parallel Computing - Training and Application Services at HLRS. He is involved in MPI profiling and benchmarking, e.g., in the HPC Challenge Benchmark Suite. In recent projects, he studied parallel I/O, parallel programming models for clusters of SMP nodes, and optimization of MPI collective routines. In workshops and summer schools, he teaches parallel programming models at many universities and labs in Germany.
Homepage: http://www.hlrs.de/people/rabenseifner/
List of publications: https://fs.hlrs.de//projects/rabenseifner/publ/
International teaching: https://fs.hlrs.de//projects/rabenseifner/publ/#tutorials
Dr. Georg Hager
Georg Hager studied theoretical physics at the University of Bayreuth, specializing in nonlinear dynamics, and holds a PhD in Computational Physics from the University of Greifswald. He is a senior researcher in the HPC Services group at Erlangen Regional Computing Center (RRZE), which is part of the University of Erlangen-Nuremberg. Recent research includes architecture-specific optimization strategies for current microprocessors, performance modeling on chip and system levels, and special topics in shared memory and hybrid programming. His daily work encompasses all aspects of user support in High Performance Computing, such as tutorials and training, code parallelization, profiling and optimization, and the assessment of novel computer architectures and tools.
Homepage: http://blogs.fau.de/hager
List of publications: http://blogs.fau.de/hager/publications/
List of talks and teaching activities: http://blogs.fau.de/hager/talks/
Book: Georg Hager and Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Press, ISBN 978-1439811924, 356 pages, July 2010.
Dr. Gabriele Jost
Gabriele Jost obtained her doctorate in Applied Mathematics from the University of Göttingen, Germany. For more than a decade she worked for various vendors of high performance parallel computers (Suprenum GmbH, Thinking Machines Corporation, and NEC) in the areas of vectorization, parallelization, performance analysis, and optimization of scientific and engineering applications.
In 1998 she joined the NASA Ames Research Center in Moffett Field, California, USA, as a Research Scientist. There her work focused on evaluating and enhancing tools for parallel program development and investigating the usefulness of different parallel programming paradigms.
In 2005 she
moved from California to the Pacific Northwest and joined Sun Microsystems as a
staff engineer in the Compiler Performance Engineering team. In 2006 she joined
Oracle as a Principal Software Engineer and worked on performance analysis of
application server software. In 2008 she decided to return to California and
pursue her passion for High Performance Computing. She joined the Texas
Advanced Computing Center and works as a Research Scientist supporting TeraGrid
users remotely from Monterey, CA.
List of publications:
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/j/Jost:Gabriele.html
Book: Barbara Chapman, Gabriele Jost, and Ruud van der Pas: Using OpenMP. MIT Press, Oct. 2007, ISBN 978-0262533027.
Keywords:
Clusters, Optimization, Parallel Programming, Performance, Tools
URL of this page:
https://fs.hlrs.de/projects/rabenseifner/publ/SC2011-hybrid.html