Authors
Rolf Rabenseifner, High Performance Computing Center Stuttgart (HLRS), Germany
Georg Hager, University of Erlangen-Nuremberg, Germany
Gabriele Jost, Texas Advanced Computing Center / Naval Postgraduate School, USA
Rainer Keller, High Performance Computing Center Stuttgart (HLRS), Germany
Half-day Tutorial at Supercomputing 2008 (SC2008)
Abstract
Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with dual or quad boards and single- or multi-core CPUs, but also "constellation" type systems with large SMP nodes. Parallel programming may combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node.
This tutorial analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP/multi-core nodes. Various hybrid MPI+OpenMP programming models are compared with pure MPI. Benchmark results from several platforms are presented. The thread-safety quality of several existing MPI libraries is also discussed. Case studies will be provided to demonstrate various aspects of hybrid MPI/OpenMP programming. Another option is the use of distributed virtual shared-memory technologies. Application categories that can take advantage of hybrid programming are identified. Multi-socket, multi-core systems in highly parallel environments are given special consideration.
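As a first flavor of the hybrid model and of the thread-safety levels discussed in the tutorial, the following minimal sketch (our illustration, not taken from the tutorial material) shows the basic structure of a hybrid MPI+OpenMP program and how the thread-support level actually provided by the MPI library can be queried:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request the support level needed when only the master thread
       calls MPI; the library reports what it actually provides. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 && provided < MPI_THREAD_FUNNELED)
        printf("Warning: MPI library provides only thread level %d\n", provided);

    /* Shared memory parallelization inside the node. */
    #pragma omp parallel
    printf("MPI rank %d, OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}

Such a program is typically built with the MPI compiler wrapper plus the compiler's OpenMP flag (e.g., mpicc -fopenmp) and started with one MPI process per node or per socket and several OpenMP threads each.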
Detailed Description
Tutorial goals:
Straightforward programming of clusters of shared memory nodes often leads to unsatisfactory performance. The participant learns hybrid parallel programming: pure message passing (one MPI process on each core) and mixed-model programming (multi-threaded MPI processes) each fit the architecture of modern HPC systems only partially. The tutorial teaches how to solve the resulting performance problems and also covers the technical aspects of mixed-model programming. At the end of the tutorial, attendees will be aware of the many pitfalls of parallel programming on clusters of SMP nodes, will know the thread-safety levels of MPI libraries, and will understand the limits of pure OpenMP enabled by virtual shared memory technology. Participants can also learn from sample applications such as a hybrid implementation of the sparse matrix-vector multiply used in iterative solvers.
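As an illustration of this example application, a hybrid sparse matrix-vector multiply in CRS format might be structured as in the following sketch. It is only an outline under simplifying assumptions (the input vector is gathered onto every process with MPI_Allgatherv, and the DistCrsMatrix structure is hypothetical); it is not the code presented in the tutorial:

#include <mpi.h>

/* Hypothetical container for the local part of a distributed CRS matrix. */
typedef struct {
    int     n_local;   /* number of rows owned by this process      */
    int     n_global;  /* global vector length                      */
    int    *row_ptr;   /* CRS row pointers, length n_local+1        */
    int    *col_idx;   /* CRS column indices (global numbering)     */
    double *val;       /* CRS values                                */
    int    *counts;    /* rows per process, for MPI_Allgatherv      */
    int    *displs;    /* row offsets per process                   */
} DistCrsMatrix;

void hybrid_spmv(const DistCrsMatrix *A,
                 double *x_local,    /* length A->n_local  */
                 double *x_global,   /* length A->n_global */
                 double *y_local)    /* length A->n_local  */
{
    /* Communication phase: only the master thread calls MPI, outside
       any OpenMP parallel region (MPI_THREAD_FUNNELED is sufficient). */
    MPI_Allgatherv(x_local, A->n_local, MPI_DOUBLE,
                   x_global, A->counts, A->displs, MPI_DOUBLE,
                   MPI_COMM_WORLD);

    /* Computation phase: shared memory parallelization over local rows. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < A->n_local; i++) {
        double sum = 0.0;
        for (int j = A->row_ptr[i]; j < A->row_ptr[i+1]; j++)
            sum += A->val[j] * x_global[A->col_idx[j]];
        y_local[i] = sum;
    }
}

A production iterative solver would replace the full gather by a halo exchange of only the vector entries that remote rows actually reference; the separation into a communication phase and a threaded computation phase is the aspect such examples illustrate.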
Targeted audience:
People in charge of developing efficient parallel software on clusters of shared memory nodes. The tutorial is also of interest to developers of thread-safe MPI libraries.
Content level:
25% Introductory, 50% Intermediate, 25% Advanced
Audience prerequisites:
Some knowledge of parallel programming with MPI and OpenMP.
Why the topic is relevant to SC attendees:
Most systems in HPC and supercomputing environments are clusters of SMP nodes, ranging from clusters of dual-core CPUs to large constellations in Tera-scale computing. Numerical software for these systems often scales worse than expected. This tutorial helps to find the appropriate programming model and to prevent pitfalls with mixed model (MPI+OpenMP) programming.
General description of tutorial content:
Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with dual or quad boards, but also "constellation" type systems with large SMP nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node.
This tutorial analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Various hybrid MPI+OpenMP programming models are compared with pure MPI. Benchmark results from several platforms are presented. A hybrid-masteronly programming model can be used more efficiently on some vector-type systems, but also on clusters of dual-CPU nodes. On other systems, one CPU is not able to saturate the inter-node network, and the commonly used masteronly programming model then suffers from insufficient inter-node bandwidth. The thread-safety quality of several existing MPI libraries is also discussed. Case studies from the fields of CFD (the NAS Parallel Benchmarks and Multi-zone NAS Parallel Benchmarks, in detail), climate modeling (POP2, maybe) and particle simulation (GTC, maybe) will be provided to demonstrate various aspects of hybrid MPI/OpenMP programming.
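The masteronly model referred to above restricts MPI communication to phases in which no OpenMP parallel region is active. A schematic time-step loop could look like the following sketch (halo_exchange and compute_on_subdomain are placeholder routines, not from the tutorial material):

#include <mpi.h>

void halo_exchange(double *field);        /* placeholder: MPI point-to-point exchange */
void compute_on_subdomain(double *field); /* placeholder: numerical kernel            */

void masteronly_timestep_loop(double *field, int nsteps)
{
    for (int step = 0; step < nsteps; step++) {
        /* Communication phase: executed by the master thread only,
           no OpenMP threads are active; MPI_THREAD_FUNNELED suffices. */
        halo_exchange(field);

        /* Computation phase: all OpenMP threads of the node work;
           no MPI calls inside the parallel region. */
        #pragma omp parallel
        compute_on_subdomain(field);
    }
}

The drawback mentioned above is visible in this structure: while the master thread communicates, all other threads are idle, and a single communicating CPU may not be able to saturate the inter-node network.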
Another option is the use of distributed virtual shared-memory technologies, which enable the use of "near-standard" OpenMP on distributed memory architectures. The performance issues of this approach and its impact on existing applications are discussed. The tutorial also analyzes strategies to overcome typical drawbacks of these easily usable programming schemes on clusters of SMP nodes.
Detailed Outline
Resume / Curriculum Vitae
Dr. Rolf Rabenseifner
Rolf Rabenseifner studied mathematics and physics at the University of Stuttgart. Since 1984, he has worked at the High Performance Computing Center Stuttgart (HLRS). He led the projects DFN-RPC, a remote procedure call tool, and MPI-GLUE, the first metacomputing MPI combining different vendors' MPIs without losing the full MPI interface. In his dissertation, he developed a controlled logical clock as global time for trace-based profiling of parallel and distributed applications. Since 1996, he has been a member of the MPI-2 Forum, and since December 2007 he has served on the steering committee of the MPI-3 Forum. From January to April 1999, he was an invited researcher at the Center for High-Performance Computing at Dresden University of Technology.
Currently, he is head of Parallel Computing - Training and Application Services at HLRS. He is involved in MPI profiling and benchmarking, e.g., in the HPC Challenge Benchmark Suite. In recent projects, he studied parallel I/O, parallel programming models for clusters of SMP nodes, and optimization of MPI collective routines. In workshops and summer schools, he teaches parallel programming models at many universities and labs in Germany.
Homepage: http://www.hlrs.de/people/rabenseifner/
List of publications: http://www.hlrs.de/people/rabenseifner/publ/publications.html
International teaching: http://www.hlrs.de/people/rabenseifner/publ/publications.html#tutorials
Dr. Georg Hager
Georg Hager studied theoretical physics at the
Homepage: http://www.blogs.uni-erlangen.de/hager
List of publications: http://www.blogs.uni-erlangen.de/hager/topics/Publications/
Dr. Gabriele Jost
Gabriele Jost obtained her doctorate in Applied Mathematics from the University of Göttingen, Germany. For more than a decade she worked for various vendors of high-performance parallel computers (Suprenum GmbH, Thinking Machines Corporation, and NEC) in the areas of vectorization, parallelization, performance analysis, and optimization of scientific and engineering applications.
In 1998 she joined the NASA Ames Research Center in Moffett Field, California, USA, as a Research Scientist. There her work focused on evaluating and enhancing tools for parallel program development and on investigating the usefulness of different parallel programming paradigms.
In 2005 she moved from California to the Pacific Northwest and joined Sun Microsystems as a staff engineer in the Compiler Performance Engineering team, analyzing compiler-generated code and providing feedback and suggestions for improvement to the compiler group. Her research interest remains in the area of performance analysis and evaluation of programming paradigms for high performance computing.
Currently, she is working at the Texas Advanced Computing Center / Naval Postgraduate School.
List of publications: http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/j/Jost:Gabriele.html
Rainer Keller
Rainer Keller has been a scientific employee at the High Performance Computing Center Stuttgart (HLRS) since 2001. He earned his diploma in Computer Science at the University of Stuttgart. Currently, he is head of the group Applications, Models and Tools at HLRS.
His professional interests are parallel computation with MPI (both using it and working on the Open MPI implementation), shared memory parallelization with OpenMP, and distributed computing using the Meta-Computing Library PACX-MPI.
His work includes performance analysis and optimization of parallel applications, the assessment of and porting to new hardware technologies, and the training of HLRS users in parallel application development. He is involved in several European projects, such as HPC-Europa.
Homepage: http://www.hlrs.de/people/keller/
List of publications: http://www.hlrs.de/people/keller/PAPERS/pubs.html
Keywords: Clusters, Optimization, Parallel Programming, Performance, Tools
URL of this page:
http://www.hlrs.de/people/rabenseifner/publ/SC2008-hybrid.html