Intro to PGAS (UPC and CAF) and Hybrid for Multicore Programming

Alice Koniges, Berkeley Lab, NERSC

Katherine Yelick, UC Berkeley and Berkeley Lab, NERSC

Rolf Rabenseifner, High Performance Computing Center Stuttgart

Reinhold Bader, Leibniz Supercomputing Center Munich

David Eder, Lawrence Livermore National Laboratory

A full-day tutorial proposed for SC11

Abstract

PGAS (Partitioned Global Address Space) languages offer both an alternative to traditional parallelization approaches (MPI and OpenMP) and the possibility of being combined with MPI for a multicore hybrid programming model. In this tutorial we cover PGAS concepts and two commonly used PGAS languages: Coarray Fortran (CAF, as specified in the Fortran standard) and Unified Parallel C (UPC), an extension to the C standard. Hands-on exercises that illustrate important concepts are interspersed with the lectures. Attendees will work in pairs to accommodate those without laptops. We discuss basic PGAS features, syntax for data distribution, intrinsic functions and synchronization primitives. Additional topics include parallel programming patterns, future extensions of both CAF and UPC, and hybrid programming. In the hybrid programming section we show how to combine PGAS languages with MPI, and contrast this approach with combining OpenMP with MPI. Real applications using hybrid models are presented.

Detailed Description

Tutorial goals

This tutorial represents a unique collaboration between the Berkeley PGAS/UPC group and experienced hands-on PGAS and hybrid instructors. Participants will be provided with the technical foundations necessary to write library or application codes using CAF or UPC, and an introduction to experimental techniques for combining MPI with PGAS languages.

The tutorial will stress some of the advantages of PGAS programming models including

· potentially easier programmability, and therefore higher productivity, than with purely MPI-based programming, due to one-sided communication semantics and the integration of the type system and other language features with the parallel facilities

· optimization potential for the language processor (compiler + runtime system)

· improved scalability compared to OpenMP at the same level of usage complexity due to better locality control

· flexibility with respect to architectures – PGAS may be deployed on shared memory multi-core systems as well as (with some care required) on large-scale MPP architectures

The tutorial's strategy of providing an integrated view of both CAF and UPC will give the audience a clear picture of the similarities and differences between these two approaches to PGAS programming. Hybrid programming that combines MPI with OpenMP and MPI with PGAS will be illustrated and compared.

Targeted Audiences and Relevance

The PGAS user base is growing, and this tutorial targets a wide range of SC attendees: application programmers, vendors and library designers from both C and Fortran backgrounds. Multicore architectures are now the norm, from high-end systems to desktops. This tutorial therefore addresses computer professionals with access to a very wide variety of programming platforms.

Content level

30% introductory, 40% intermediate, 30% advanced

Audience prerequisites

Participants should know at least one of the Fortran 95 and C programming languages, possibly both, and be comfortable with running example programs in a Linux environment. Technical assistants and other personnel will be available to help with the exercises. In addition, a basic knowledge of traditional parallel programming models (MPI and OpenMP) is useful for the more advanced parts of the tutorial. Attendees will work in pairs to accommodate those without laptops. If you bring a laptop, it should have a secure shell client installed (e.g. OpenSSH or PuTTY) so that you can log in to the parallel compute server that will be provided for the exercises, see also

http://www.nersc.gov/nusers/help/access/ssh_apps.php .

General Description

After an introduction to general PGAS concepts and to the status of the standardization efforts, the basic syntax for declaration and use of shared data is presented; the requirements and rules for synchronization of accesses to shared data are explained (PGAS memory model). This is followed by the topic of dynamic memory management for shared entities. Then, advanced synchronization mechanisms such as locks, atomic procedures and collective procedures are discussed, together with their usefulness for implementing certain parallel programming patterns. The section on hybrid programming explains the way MPI makes allowances for hybrid models, and how this can be matched with PGAS-based implementations. Finally, remaining deficiencies in the present language definitions of CAF and UPC will be indicated, and an outlook will be given on possible future extensions, which are still under discussion among language developers and should make it possible to overcome most of these deficiencies.

Description of Exercises for hands-on sessions

The hands-on sessions are interspersed with the presentations such that approximately one hour of presentation is followed by 30 minutes of exercises. The exercises will come from a pool of exercises that have been tested in courses given throughout Europe, as well as additional exercises for the newest material.

The NERSC computer center will make available a special partition of their Cray XT machines and a set of accounts to accommodate the hands-on exercises. This model was already deployed successfully at the previous SC10 PGAS tutorial. In the event that a natural disaster or a system crash takes this planned system down, attendees will have access to the same exercises on an SGI UltraViolet system at LRZ. Attendees will use laptops that can open an ssh window; they will be grouped in pairs to accommodate people without a laptop, and also to handle any account issues that come up. Attendees may do the exercises in pairs in both UPC and CAF, to allow comparison of both languages. When possible, C programmers will be paired with Fortran programmers. For advanced programmers or those who want to stay in one language, additional exercise material will be provided for efficient use of the exercise time. UC Berkeley teaching assistants from the course CS 267, “Applications of Parallel Computers,” may be available as needed to help with the hands-on exercises.

Presently planned examples include

an elementary exercise to understand the handling of the compilers and runtime systems of UPC and CAF
parallelization of a matrix-vector multiplication
parallelization of a simple 2-dimensional Jacobi code
parallelization of a ray tracing code
a real-world hybrid MPI/CAF application for inspection and test runs

and this list will be updated as the tutorial material is finalized.

Detailed outline of the tutorial

Basic PGAS concepts

execution model, memory model
resource mapping, run time environments
standardization efforts, comparison with other paradigms

Hands-on session: First UPC and CAF examples and exercises

[-- Coffee break --]

UPC and CAF basic syntax

declaration of shared data / coarrays
intrinsic procedures for handling shared data
Synchronization:

motivation – race conditions; rules for access to shared entities by different threads/images
synchronization constructs and modes
program termination

Dynamic entities and their management:

UPC pointers and allocation calls
CAF allocatable entities and dynamic type components
object-orientation in CAF and its limitations

Hands-on session: Exercises on basic syntax and dynamic data

[-- Lunch break --]

Advanced synchronization concepts

locks and split-phase barriers
atomic procedures and their usage
collective operations

Some parallel patterns and hints on library design:

parallelization concepts with and without halo cells
work sharing; master-worker
procedure interfaces

Hands-on session: Heat example parallelization

[-- Coffee break --]

Hybrid programming

Notes on current architectures
MPI allowances for hybrid models
Hybrid OpenMP examples
Hybrid PGAS examples and performance/implementation comparison

Hands-on session: hybrid

Real Applications

[-- End --]

About the Presenters

Dr. Alice Koniges is a Physicist and Computer Scientist at the National Energy Research Scientific Computing Center (NERSC) at the Berkeley Lab, where she leads the Petascale Computing Initiative and fusion research projects including co-design for exascale. Her current research interests include programming models, benchmarking and optimization, applications in plasma physics, material science, energy research, and arbitrary Lagrangian-Eulerian methods for time-dependent PDEs. Before joining the Berkeley Lab, she held various positions at the Lawrence Livermore National Laboratory, including management of the Lab’s institutional computing. She recently led the effort to develop a new code that is used to predict the impacts of target shrapnel and debris on the operation of the National Ignition Facility (NIF), the world’s most powerful laser. She was the first woman to receive a PhD in Applied and Computational Mathematics at Princeton University and also has MSE and MA degrees from Princeton and a BA in Applied Mechanics from the University of California, San Diego. She is editor and lead author of the book “Industrial Strength Parallel Computing” (Morgan Kaufmann Publishers 2000) and has published more than 80 refereed technical papers.

Dr. Katherine Yelick is the Associate Laboratory Director for Computing Sciences at Lawrence Berkeley National Laboratory, Director of the National Energy Research Scientific Computing (NERSC) Center and a Professor of Electrical Engineering and Computer Sciences at the University of California at Berkeley. She is the author or co-author of two books and more than 100 refereed technical papers on parallel languages, compilers, algorithms, libraries, architecture, and storage. She co-invented the UPC and Titanium languages and demonstrated their applicability across architectures through the use of novel runtime and compilation methods. She also co-developed techniques for self-tuning numerical libraries, including the first self-tuned library for sparse matrix kernels, which automatically adapts the code to properties of the matrix structure and machine. Her work includes performance analysis and modeling as well as optimization techniques for memory hierarchies, multicore processors, communication libraries, and processor accelerators. She earned her Ph.D. in Electrical Engineering and Computer Science from MIT and has been a professor of Electrical Engineering and Computer Sciences at UC Berkeley since 1991, with a joint research appointment at Berkeley Lab since 1996. She has received multiple research and teaching awards and is a member of the California Council on Science and Technology and of the National Academies committee on Sustaining Growth in Computing Performance.

Dr. Rolf Rabenseifner studied mathematics and physics at the University of Stuttgart. Since 1984, he has worked at the High-Performance Computing-Center Stuttgart (HLRS). He led the projects DFN-RPC, a remote procedure call tool, and MPI-GLUE, the first metacomputing MPI combining different vendors' MPIs without losses to full MPI functionality. In his dissertation, he developed a controlled logical clock as global time for trace-based profiling of parallel and distributed applications. Since 1996, he has been a member of the MPI-2 Forum, and since Dec. 2007 he has been on the steering committee of the MPI-3 Forum. From January to April 1999, he was an invited researcher at the Center for High-Performance Computing at Dresden University of Technology. Currently, he is head of Parallel Computing - Training and Application Services at HLRS. He is involved in MPI profiling and benchmarking, e.g., in the HPC Challenge Benchmark Suite. In recent projects, he studied parallel I/O, parallel programming models for clusters of SMP nodes, and optimization of MPI collective routines. In workshops and summer schools, he teaches parallel programming models at many universities and labs in Germany.

Dr. Reinhold Bader studied physics and mathematics at the Ludwig-Maximilians University in Munich, completing his studies with a PhD in theoretical solid state physics in 1998. Since the beginning of 1999, he has worked at Leibniz Supercomputing Centre (LRZ) as a member of the scientific staff, being involved in HPC user support, procurements of new systems, benchmarking of prototypes in the context of the PRACE project, courses on parallel programming, and configuration management for the HPC systems deployed at LRZ. As a member of the German delegation to WG5, the international Fortran Standards Committee, he also takes part in the discussions on further development of the Fortran language. He has published a number of contributions to ACM's Fortran Forum and is responsible for development and maintenance of the Fortran interface to the GNU Scientific Library.

Dr. David Eder is a computational physicist and group leader at the Lawrence Livermore National Laboratory in California. He has extensive experience with application codes for the study of multiphysics problems. His latest endeavors include ALE (Arbitrary Lagrangian-Eulerian) methods on unstructured and block-structured grids for simulations that span many orders of magnitude. He was awarded a research prize in 2000 for the use of advanced codes to design the National Ignition Facility 192-beam laser currently under construction. He has a PhD in Astrophysics from Princeton University and a BS in Mathematics and Physics from the University of Colorado. He has published approximately 80 research papers.

Keywords

· Languages

· Parallel Programming

· Performance

· Applications

URL of this page (shortened):
https://fs.hlrs.de/projects/rabenseifner/publ/SC2011-PGAS.html