Application Supercomputing on Scalable Architectures

Alice Koniges, Lawrence Livermore National Laboratory (LLNL),
Mark Seager, Lawrence Livermore National Laboratory (LLNL),
Rolf Rabenseifner, High Performance Computing Center Stuttgart (HLRS), University of Stuttgart,
David Eder, Lawrence Livermore National Laboratory (LLNL),
Michael Resch, High Performance Computing Center Stuttgart (HLRS), University of Stuttgart

Full-day Tutorial at Supercomputing 2004 (SC2004).

Abstract

Teraflop performance is no longer a thing of the future, as complex integrated 3D simulations drive supercomputer development. Today, most HPC systems are clusters of SMP nodes, ranging from dual-CPU PC clusters to the largest systems at the world's major computing centers.

What are the major issues facing application code developers today? How do the challenges vary from cluster computing to complex hybrid architectures with superscalar and vector processors? What skills and tools are required, both from the application developer and from the system itself? Finally, what are the paths, both architecturally and algorithmically, to petaflop performance?

In this tutorial, we address these questions and give tips, tricks, and tools of the trade for large-scale application development. In the introduction, we provide an overview of terminology, hardware, and performance. Advanced topics include mixed-mode (combined MPI/OpenMP) programming, vector tips, and cluster environments. We describe the latest issues in implementing scalable parallel programs, drawing on a series of large application suites to discuss the specific challenges and problems encountered in parallelizing them. Finally, we discuss upcoming architectures such as BlueGene/L and the latest vector systems.


Detailed Description

Tutorial Goals:

The skill set required of application programmers and their support teams at major computing centers continues to grow along a curve similar to Moore's Law. This tutorial covers all of the basic ideas necessary for major application development. Specific examples are used so that developers and managers can understand what constitutes good application performance and what tools are needed to attain it. The latest material on mixed-mode programming (combined MPI/OpenMP) is covered in some depth, with emphasis on performance tricks for the variety of hybrid architectures available at major computing centers throughout the world. Additional material on the nuts and bolts of application programming, from debuggers like TotalView to performance tools like Vampir and scripting languages such as Yorick and Python, helps those interested in developing practical applications sort through the variety of tools and resources available.
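As a taste of the mixed-mode material, the sketch below shows the basic hybrid pattern in C: MPI couples processes across the nodes of a cluster, while OpenMP threads parallelize the node-local work. It is a minimal illustration only; the vector length and the dot-product kernel are hypothetical placeholders, not drawn from the tutorial's application suites.

    /* Minimal hybrid MPI/OpenMP sketch (illustrative only): one MPI
       process per SMP node, OpenMP threads for the node-local loop. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int n = 1000000;   /* local vector length (hypothetical) */
        int provided, rank, nprocs;
        double local = 0.0, global = 0.0;

        /* MPI_THREAD_FUNNELED: only the master thread makes MPI calls,
           which is all this pattern needs. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *x = malloc(n * sizeof(double));
        double *y = malloc(n * sizeof(double));

        /* OpenMP threads share the node-local arrays; the reduction
           clause accumulates the per-thread partial sums. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < n; i++) {
            x[i] = 1.0;
            y[i] = 2.0;
            local += x[i] * y[i];
        }

        /* MPI combines the per-process partial sums across all nodes. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (rank == 0)
            printf("dot product = %g (%d MPI processes, up to %d threads each)\n",
                   global, nprocs, omp_get_max_threads());

        free(x);
        free(y);
        MPI_Finalize();
        return 0;
    }

Built with an MPI wrapper compiler and the compiler's OpenMP flag enabled, the program runs as one multi-threaded process per node; the performance trade-offs between this layout and pure MPI on various hybrid architectures are among the topics the tutorial covers.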

Who should attend?

Those interested in high-end applications of parallel computing should attend this tutorial. The topic appeals to a variety of SC attendees, as our past tutorial attendance records show; attendees range from managers at industrial firms to graduate students. The introductory material provides enough background for beginners to understand the basic issues of parallel code development, with references for further study. The more advanced material is aimed both at researchers in parallel computing methodology and at applications programmers who want an overview of what constitutes good parallel performance and how to attain it. The tutorial also provides guidance in picking the right architecture for the right application.

To help with this choice, we address controversial issues, such as the trade-off between a smaller number of high-performance vector processors and many COTS-like processors in large-scale systems, and bring two leaders in this field to explain why different choices are appropriate for certain application mixtures.
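To give a flavor of the loop-level issues behind this choice, the fragment below contrasts two hypothetical C loops: the first carries a dependence from one iteration to the next and resists vectorization, while the second has fully independent iterations that a vector pipeline (or a superscalar core) handles well. The arrays and the recurrence are illustrative assumptions, not taken from any of the tutorial's application codes.

    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        static double a[N], b[N], s[N];

        for (int i = 0; i < N; i++) {
            a[i] = (double)i;
            b[i] = 2.0 * i;
        }

        /* Loop-carried dependence: s[i] needs s[i-1] first, so the
           compiler cannot vectorize this recurrence directly. */
        s[0] = a[0];
        for (int i = 1; i < N; i++)
            s[i] = s[i-1] + a[i];

        /* Independent iterations: no value flows between iterations,
           so the loop vectorizes (or pipelines) cleanly. */
        for (int i = 0; i < N; i++)
            b[i] = a[i] * b[i] + 1.0;

        printf("s[N-1] = %g, b[N-1] = %g\n", s[N-1], b[N-1]);
        return 0;
    }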

Content Level

25% Introductory, 50% Intermediate, 25% Advanced

Audience Prerequisites

There are no specific prerequisites, although a basic understanding of parallel computing is helpful for the advanced material. Past audience members particularly liked the question-and-answer format at the end of the day, where they were able to directly question major decision makers on architecture choices. The tutorial setting is well suited for this because, unlike at large panel presentations, tutorial attendees are unencumbered in asking questions. We have therefore added a similar period at the end of the morning session, where we bring in a new expert with a different perspective.

Sample Material

Parallel programming materials, including mixed-mode programming, are adapted from the on-line course (http://www.hlrs.de/organization/par/par_prog_ws/). A sample of the applications is given on the web site for Industrial Strength Parallel Computing (http://www.elsevier.com/wps/find/bookdescription.cws_home/677844/description) and in some Gordon Bell Prize-winning studies (http://www.cs.odu.edu/~keyes/bell.html and http://www.llnl.gov/CASC/asciturb). Other applications discussed include Earth Simulator codes and coupled ALE simulations.

Tutorial Outline


Authors' Biographies

Alice E. Koniges is a member of the Accelerated Strategic Computing Initiative (ASCI) research team at the Lawrence Livermore National Laboratory in California. She has recently returned from an assignment at the Max Planck Institute in Garching, Germany (Computer Center and Plasma Physics Institute), where she consulted for users at the institute, helping convert application codes for MPP computers. From 1995 to 1997, she led the Parallel Applications Technology Program at Lawrence Livermore, the laboratory's portion of the largest ($40 million) CRADA (Cooperative Research and Development Agreement) ever undertaken by the Department of Energy; the agreement provided for the design of parallel industrial supercomputing codes on MPP platforms. She is also the editor of the book Industrial Strength Parallel Computing (Morgan Kaufmann Publishers, San Francisco). She has a PhD in Applied and Numerical Mathematics from Princeton University, an MA and an MSME from Princeton, and a BA in Engineering Sciences from the University of California, San Diego. (http://www.llnl.gov/CASC/people/koniges/)

 

Mark Seager is a recognized leader in terascale computing systems design, procurement, and integration, with 19 years of experience in parallel computing. He played a significant role in developing the US DOE Accelerated Strategic Computing Initiative's computing and problem-solving environment (PSE) strategies, including shaping the cluster-of-SMPs approach of "Option Blue" and "Option White." He developed the computational strategy and integrated architecture for LLNL multiprogrammatic and institutional computing, as well as LLNL's high-performance commodity (IA-32/Linux/open source) clustering strategy. He is the current principal investigator for the ASCI Platforms at Livermore, with responsibility for executing the tri-laboratory efforts in terascale computing strategy, integration, and support. He led the ASCI Purple ($290M) procurement team and has negotiated over $300M in contracts for scalable system procurements with multiple vendors. Previously, he was PI for the ASCI Problem Solving Environment, covering coordinated computing strategy, platform support, applications development support, the distributed computing environment, visualization and numerical methods, and tri-laboratory networking. He led the planning effort to develop the ten-year strategic vision "Full Spectrum Computing," and he defined the vision and architecture for the Scalable I/O Facility and obtained its funding. He drove the technical specification and evaluation for the first Federal MPP competitive procurement and supervised the design and implementation of major center networks. He defined and supervised programming teams responsible for Crays running the NLTSS and UNICOS operating systems; networks; super minicomputers, workstations, and desktop systems; department databases; and computer resource utilization accounting, and he served as the primary contact for gathering customer requirements for future department products. He coordinated aspects of the migration from locally developed products to industry-standard operating systems and network protocols, developed scheduling methodologies and evaluation models for MPPs, and developed techniques for visualization of parallel application execution. He also developed a major sparse linear algebra package (SLAP) for the solution of symmetric and nonsymmetric sparse linear systems, worked with large code groups to integrate SLAP into production applications, and developed numerical methods for parallel architectures.

 

Rolf Rabenseifner studied mathematics and physics at the University of Stuttgart. Since 1984, he has worked at the High Performance Computing Center Stuttgart (HLRS). He led the projects DFN-RPC, a remote procedure call tool, and MPI-GLUE, the first metacomputing MPI to combine different vendors' MPIs without losing the full MPI interface. In his dissertation, he developed a controlled logical clock as a global time for trace-based profiling of parallel and distributed applications. Since 1996, he has been a member of the MPI-2 Forum. From January to April 1999, he was an invited researcher at the Center for High-Performance Computing at Dresden University of Technology. Currently, he heads the parallel computing department of HLRS and is involved in MPI profiling and benchmarking. In workshops and summer schools, he teaches parallel programming models at many universities and labs in Germany. (http://www.hlrs.de/people/rabenseifner/)

 

David Eder is a computational physicist at the Lawrence Livermore National Laboratory in California. He has extensive experience with application codes for the study of multiphysics problems. His latest endeavors include ALE (Arbitrary Lagrangian-Eulerian) simulations on unstructured and block-structured grids that span many orders of magnitude. He was awarded a research prize in 2000 for the use of advanced codes to design the National Ignition Facility's 192-beam laser, currently under construction. He has a PhD in Astrophysics from Princeton University and a BS in Mathematics and Physics from the University of Colorado.

 

Prof. Dr.-Ing. Michael M. Resch has been the director of the High Performance Computing Center Stuttgart (HLRS), Germany, since 2003. Since 2002, he has held a full professorship as the chair for High Performance Computing at the Universität Stuttgart, Germany. He is the director of the Center for Simulation Technology at the Universität Stuttgart and the speaker of the board of the Center for Competence in High Performance Computing of the State of Baden-Württemberg. He is a member of the steering committee of the German Grid initiative D-Grid and a member of the scientific advisory board of the Swiss Centro Svizzero di Calcolo Scientifico (CSCS). He led the HLRS working group that received the NSF Award for High Performance Distributed Computing in 1999 and won the HPC Challenge Award at SC2003 in November 2003. He holds a Dipl.-Ing. (MSc) in Technical Mathematics from the Technical University of Graz, Austria, and a PhD in Mechanical Engineering from the University of Stuttgart, Germany. He was an Assistant Professor of Computer Science at the University of Houston until 2002. (http://www.hlrs.de/people/resch/)


URL of this page: http://www.hlrs.de/people/rabenseifner/publ/SC2004-tutorial.html.