# Hybrid MPI & OpenMP Parallel Programming



## MPI + OpenMP and other models on clusters of SMP nodes

Rolf Rabenseifner<sup>1)</sup> Rabenseifner@hlrs.de Georg Hager<sup>2)</sup> Georg.Hager@rrze.uni-erlangen.de Gabriele Jost<sup>3)</sup> gjost@tacc.utexas.edu

<sup>1)</sup> High Performance Computing Center (HLRS), University of Stuttgart, Germany
 <sup>2)</sup> Regional Computing Center (RRZE), University of Erlangen, Germany
 <sup>3)</sup> Texas Advanced Computing Center, The University of Texas at Austin, USA

Tutorial M02 at SC10, November 15, 2010, New Orleans, LA, USA

 Hybrid Parallel Programming

 Slide 1
 Höchstleistungsrechenzentrum Stuttgart





## Outline

| <b>•</b> | slide n                                                                               | umber |                          |
|----------|---------------------------------------------------------------------------------------|-------|--------------------------|
| •        | Introduction / Motivation                                                             | 2     | J                        |
| •        | Programming models on clusters of SMP nodes                                           | 6     |                          |
| •        | Case Studies / pure MPI vs hybrid MPI+OpenMP                                          | 13    | 8:30 - 10:00             |
| •        | Practical "How-To" on hybrid programming                                              | 48    | J                        |
| •        | Mismatch Problems                                                                     | 97    | )                        |
| •        | Opportunities: Application categories that can<br>benefit from hybrid parallelization | 126   |                          |
| •        | Thread-safety quality of MPI libraries                                                | 136   | <pre>10:30 - 12:00</pre> |
| •        | Tools for debugging and profiling MPI+OpenMP                                          | 143   |                          |
| •        | Other options on clusters of SMP nodes                                                | 150   |                          |
| •        | Summary                                                                               | 164   | J                        |
| ٠        | Appendix                                                                              | 172   |                          |
| •        | Content (detailed)                                                                    | 188   |                          |

Hybrid Parallel Programming Slide 2 / 169



cores

shared

memory

### **Motivation**

- Efficient programming of clusters of SMP nodes SMP nodes:
  - Dual/multi core CPUs
  - Multi CPU shared memory
  - Multi CPU ccNUMA
  - Any mixture with shared memory programming model
- Hardware range ٠
  - mini-cluster with dual-core CPUs
  - large constellations with large SMP nodes
    - ... with several sockets (CPUs) per SMP node
    - ... with several cores per socket
  - $\rightarrow$  Hierarchical system layout



SMP nodes

**Node Interconnect** 

- Cluster of ccNUMA/SMP nodes
- Hybrid MPI/OpenMP programming seems natural
  - MPI between the nodes
  - OpenMP inside of each SMP node



# Motivation





### Goals of this tutorial



• Sensitize to problems on clusters of SMP nodes

see sections  $\rightarrow$  Case studies  $\rightarrow$  Mismatch problems

- Technical aspects of hybrid programming
  - see sections  $\rightarrow$  Programming models on clusters  $\rightarrow$  Examples on hybrid programming
- Opportunities with hybrid programming
  - see section → Opportunities: Application categories that can benefit from hybrid paralleliz.
- Issues and their Solutions
  - with sections  $\rightarrow$  Thread-safety quality of MPI libraries
    - → Tools for debugging and profiling for MPI+OpenMP



Rabenseifner, Hager, Jost



•Less frustration &

 More success

with your parallel program on clusters of SMP nodes

# Outline



• Introduction / Motivation

### Programming models on clusters of SMP nodes

- Case Studies / pure MPI vs hybrid MPI+OpenMP
- Practical "How-To" on hybrid programming
- Mismatch Problems
- Opportunities: Application categories that can benefit from hybrid parallelization
- Thread-safety quality of MPI libraries
- Tools for debugging and profiling MPI+OpenMP
- Other options on clusters of SMP nodes
- Summary



Hybrid Parallel Programming





# Major Programming models on hybrid systems

- Pure MPI (one MPI process on each core)
- Hybrid MPI+OpenMP
  - shared memory OpenMP
  - distributed memory MPI



- Other: Virtual shared memory systems, PGAS, HPF, ...
- Often hybrid programming (MPI+OpenMP) slower than pure MPI

   why?







Hybrid Parallel Programming Slide 9 / 169

Rabenseifner, Hager, Jost

TACC

### **Hybrid Masteronly**



Masteronly MPI only outside of parallel regions

### **Advantages**

Rabenseifner, Hager, Jost

- No message passing inside of the SMP nodes
- No topology problem

for (iteration ....)

#pragma omp parallel
 numerical code
/\*end omp parallel \*/

Hybrid Parallel Programming

Slide 10 / 169

/\* on master thread only \*/ MPI\_Send (original data to halo areas in other SMP nodes) MPI\_Recv (halo data from the neighbors) } /\*end for loop

### **Major Problems**

- All other threads are sleeping while master thread communicates!
- Which inter-node bandwidth?
- MPI-lib must support at least MPI\_THREAD\_FUNNELED



### **Overlapping Communication and Computation**

MPI communication by one or a few threads while other threads are computing

```
if (my_thread_rank < ...) {</pre>
```

```
MPI_Send/Recv....
```

i.e., communicate all halo data

} else {

Execute those parts of the application that do <u>not</u> need halo data (on <u>non-communicating</u> threads)

Execute those parts of the application that <u>need</u> halo data (on all threads)

Hybrid Parallel Programming Slide 11 / 169

}



### Pure OpenMP (on the cluster)



OpenMP only distributed virtual shared memory

- Distributed shared virtual memory system needed
- Must support clusters of SMP nodes
- e.g., Intel<sup>®</sup> Cluster OpenMP
  - Shared memory parallel inside of SMP nodes
  - Communication of modified parts of pages at OpenMP flush (part of each OpenMP barrier)



i.e., the OpenMP memory and parallelization model is prepared for clusters!



Hybrid Parallel Programming Slide 12 / 169



# Outline



- Introduction / Motivation
- Programming models on clusters of SMP nodes

### • Case Studies / pure MPI vs hybrid MPI+OpenMP

- The Multi-Zone NAS Parallel Benchmarks
- For each application we discuss:
  - Benchmark implementations based on different strategies and programming paradigms
  - Performance results and analysis on different hardware architectures
- Compilation and Execution Summary

Gabriele Jost (University of Texas, TACC/Naval Postgraduate School, Monterey CA)

- Practical "How-To" on hybrid programming
- Mismatch Problems
- Opportunities: Application categories that can benefit from hybrid paralleli.
- Thread-safety quality of MPI libraries
- Tools for debugging and profiling MPI+OpenMP
- Other options on clusters of SMP nodes
- Summary

Hybrid Parallel Programming Slide 13 / 169





### The Multi-Zone NAS Parallel Benchmarks



|                        | MPI/OpenMP       | MLP                 | Nested<br>OpenMP |
|------------------------|------------------|---------------------|------------------|
| Time step              | sequential       | sequential          | sequential       |
| inter-zones            | MPI<br>Processes | MLP<br>Processes    | OpenMP           |
| exchange<br>boundaries | Call MPI         | data copy+<br>sync. | OpenMP           |
| intra-zones            | OpenMP           | OpenMP              | OpenMP           |

zones

- Multi-zone versions of the NAS Parallel Benchmarks • LU,SP, and BT
- Two hybrid sample implementations
- Load balance heuristics part of sample codes •
- www.nas.nasa.gov/Resources/Software/software.html







### Using MPI/OpenMP: ADI Method



. . . !\$OMP PARALLEL DEFAULT(SHARED) !\$OMP& PRIVATE(m, i, j, k...) !\$OMP DO do k = 2, nz-1do j = 2, ny-1 do i = 2, nx-1do m = 1, 5u(m,i,j,k) = dt \* rsd(m, i, j, k-1)end do end do end do end do !\$OMP END DO nowait . . . !\$OMP END PARALLEL

subroutine zsolve(u, rsd,...)



Hybrid Parallel Programming Slide 15 / 169 Rabe





### **Pipelined Thread Execution in SSOR**

```
subroutine ssor
!$OMP PARALLEL DEFAULT (SHARED)
!$OMP& PRIVATE(m, i, j, k...)
   call sync1 ()
   do k = 2, nz-1
!$OMP DO
     do j = 2, ny-1
        do i = 2, nx-1
          do m = 1, 5
       rsd(m, i, j, k) =
         dt * rsd(m, i, j, k-1)
          end do
        end do
     end do
 !$OMP END DO nowait
   end do
   call sync2 ()
   . . .
 !$OMP END PARALLEL
Hybrid Parallel Programming
                    Rabenseifner, Hager, Jost
Slide 16 / 169
```

subbroutine sync1 ...neigh = iam -1 do while (isync(neigh) .eq. 0) !\$OMP FLUSH(isync) end do isync(neigh) = 0 !\$OMP FLUSH(isync) ... subroutine sync2

```
...
neigh = iam -1
do while (isync(neigh) .eq. 1)
!$OMP FLUSH(isync)
end do
isync(neigh) = 1
!$OMP FLUSH(isync)
```





### SC10 New Orleans, LA

### **Benchmark Architectures**

- Sun Constellation (Ranger)
- Cray XT5
- IBM Power 6



Hybrid Parallel Programming Slide 18 / 169

Rabenseifner, Hager, <u>Jost</u>





### Hybrid code on cc-NUMA architectures

#### OpenMP:

- Support only per MPI process
- Version 2.5 does not provide support to control to map threads onto CPUs. Support to specify thread affinities was under discussion for 3.0 but has not been included
- MPI:
  - Initially not designed for NUMA architectures or mixing of threads and processes, MPI-2 supports threads in MPI
  - API does not provide support for memory/thread placement
- Vendor specific APIs to control thread and memory placement:
  - Environment variables
  - System commands like *numactl* → http://www.halobates.de/numaapi3.pdf







## Sun Constellation Cluster Ranger (1)

- Located at the Texas Advanced Computing Center (TACC), University of Texas at Austin (http://www.tacc.utexas.edu)
- 3936 Sun Blades, 4 AMD Quad-core 64bit 2.3GHz processors per node (blade), 62976 cores total 2
- 123TB aggregrate memory ٠
- Peak Performance 579 Tflops ٠
- InfiniBand Switch interconnect ٠
- Sun Blade x6420 Compute Node: ٠
  - 4 Sockets per node
  - 4 cores per socket
  - HyperTransport System Bus
  - 32GB memory

Core network Core 0 3 Core ore Core R





### Sun Constellation Cluster Ranger (2)

- Compilation:
  - PGI pgf90 7.1



- mpif90 -tp barcelona-64 -r8 -mp
- Cache optimized benchmarks Execution:
  - MPI MVAPICH
  - setenv OMP\_NUM\_THREADS *nthreads*
  - Ibrun numactl bt-mz.exe
- numactl controls
  - Socket affinity: select sockets to run
  - Core affinity: select cores within socket
  - Memory policy:where to allocate memory
  - http://www.halobates.de/numaapi3.pdf



Hybrid Parallel Programming Slide 21 / 169

Rabenseifner, Hager, <u>Jost</u>





- Highly hierarchical
- Shared Memory:
  - Cache-coherent, Nonuniform memory access (ccNUMA) 16-way Node (Blade)
- Distributed memory:
  - Network of ccNUMA blades
    - Core-to-Core
    - Socket-to-Socket
    - Blade-to-Blade
    - Chassis-to-Chassis







Hybrid Parallel Programming Slide 22 / 169 Ra





#### MPI ping-pong micro On-Node Communication Scaling (between 2 Sockets) 1400 1 · 0→ · benchmark results 1200 1.0→2 1:0→3 (MB/s) 1000 on Ranger Bandwidth 800 Inside one node: 600 4: 0→2 4: 0→3 Ping-pong socket 0 with 1, 2, 3 and 1, 2, or 4 simultaneous comm. (quad-core) 100 KB 1мв 10<sub>MB</sub> 10 KB 1KB Missing Connection: Communication On NEM Node-2-Node Communication Scaling between socket 0 and 3 is slower 1000 Maximum bandwidth: $\triangleright$ 800 Bandwidth (MB/s) 1 x 1180, 2 x 730, 4 x 300 MB/s 600 Node-to-node inside one chassis 400 with 1-6 node-pairs (= 2-12 procs) Effective Perfect scaling for up to 6 200 simultaneous communications Max. bandwidth : 6 x 900 MB/s 0.1 KB 10 KB 100 KB 1мв 10MF 1KB NEM to NEM Scaling Performance Chassis to chassis (distance: 7 hops) 1000 with 1 MPI process per node and 1-12 - 11 800 Communication simultaneous communication links Channel (MB/s) 009 009 Max: 2 x 900 up to 12 x 450 MB/s $\geq$ **3andwidth per** Hybrid Parallel Programming Rabenseifner, Hager, Jost Slide 23 / 169 200 "Exploiting Multi-Level Parallelism on the Sun Constellation System", L. Koesterke, et al., TACC, TeraGrid08 Paper 0.1 кв 100 KB 10<sub>MB</sub> 1<sub>KB</sub> 10 KB 1MB Message Size



### SUN: NPB-MZ Class E Scalability on Ranger





# **NUMA Control: Process Placement**

• Affinity and Policy can be changed externally through numactl at the socket and core level.





### **NUMA Operations: Memory Placement**



Memory: Socket References

Example: numactl -N 1 -1 ./a.out

Memory allocation:

- MPI
  - local allocation is best
- OpenMP
  - Interleave best for large, completely shared arrays that are randomly accessed by different threads
  - local best for private arrays
  - Once allocated, a memory-structure is fixed



٠



### NUMA Operations (cont. 3)

| • |                                            |                           |            |                                                     |                                                                   |
|---|--------------------------------------------|---------------------------|------------|-----------------------------------------------------|-------------------------------------------------------------------|
| Ī |                                            | cmd                       | option     | arguments                                           | description                                                       |
|   | Socket Affinity                            | numactl                   | -N         | {0,1,2,3}                                           | Only execute<br>process on cores<br>of this (these)<br>socket(s). |
|   | Memory Policy                              | numactl                   | -1         | {no argument}                                       | Allocate on current socket.                                       |
|   | Memory Policy                              | numactl                   | -i         | {0,1,2,3}                                           | Allocate round<br>robin (interleave)<br>on these sockets.         |
|   | Memory Policy                              | numactl                   | preferred= | {0,1,2,3}<br>select only one                        | Allocate on this<br>socket; fallback<br>to any other if<br>full . |
|   | Memory Policy                              | numactl                   | -m         | {0,1,2,3}                                           | Only allocate on<br>this (these)<br>socket(s).                    |
|   | Core Affinity                              | numactl                   | -C         | {0,1,2,3,<br>4,5,6,7,<br>8,9,10,11,<br>12,13,14,15} | Only execute<br>process on this<br>(these) Core(s).               |
|   | Hybrid Parallel Programi<br>Slide 27 / 169 | ming<br>Rabenseifner, Hag |            | H L                                                 | R S                                                               |
|   | •                                          | -                         |            | TACC                                                | <b>_</b>                                                          |



### Hybrid Batch Script: 4 tasks, 4 threads/task

|              | job script (Bourne shell)                                                                                                                                                                                                               | job script (C shell)                                                                                                                                                                                                                            |
|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|              | 4 MPI per<br>#! -pe 4way 32 node                                                                                                                                                                                                        | <br>#!-pe 4way 32<br>                                                                                                                                                                                                                           |
|              | export OMP_NUM_THREADS=4                                                                                                                                                                                                                | setenv OMP_NUM_THREADS 4                                                                                                                                                                                                                        |
|              | ibrun numa.sh                                                                                                                                                                                                                           | ibrun numa.csh                                                                                                                                                                                                                                  |
| for mvapich2 | numa.sh<br>#!/bin/bash<br>export MV2_USE_AFFINITY=0<br>export MV2_ENABLE_AFFINITY=0<br>export VIADEV_USE_AFFINITY=0<br>#TasksPerNode<br>TPN=`echo \$PE   sed 's/way//`<br>[ ! \$TPN ] && echo TPN NOT defined!<br>[ ! \$TPN ] && exit 1 | numa.csh<br>#!/bin/tcsh<br>setenv MV2_USE_AFFINITY 0<br>setenv MV2_ENABLE_AFFINITY 0<br>setenv VIADEV_USE_AFFINITY 0<br>#TasksPerNode<br>set TPN = `echo \$PE   sed 's/way//``<br>if(! \${%TPN}) echo TPN NOT defined!<br>if(! \${%TPN}) exit 0 |
| 11.          | socket=\$(( \$PMI_RANK % \$TPN ))                                                                                                                                                                                                       | @ socket = \$PMI_RANK % \$TPN                                                                                                                                                                                                                   |
| Sli          | numactl -N \$socket -m \$socket ./a.out                                                                                                                                                                                                 | numactl -N \$socket -m \$socket ./a.out                                                                                                                                                                                                         |



### Numactl – Pitfalls: Using Threads across Sockets

bt-mz.1024x8 yields best load-balance

-pe 2way 8192 export OMP\_NUM\_THREADS=8

my\_rank=\$PMI\_RANK
local\_rank=\$(( \$my\_rank % \$myway ))
numnode=\$(( \$local\_rank + 1 ))

<u>Original:</u> numactl -N \$numnode -m \$numnode \$\*

### **Bad performance!**

- Each process runs 8 threads on 4 cores
- Memory allocated on one socket





network



### Numactl – Pitfalls: Using Threads across Sockets

### bt-mz.1024x8



Hybrid Parallel Programming Slide 30 / 169

Rabenseifner, Hager, <u>Jost</u>

H L R TACC



<u>3</u>

0

3

### Cray XT5

- Results obtained by the courtesy of the HPCMO Program and the Engineer Research and Development Center Major Shared Resource Center, Vicksburg, MS (<u>http://www.erdc.hpc.mil/index</u>)
- Cray XT5 is located at the Arctic Region Supercomputing Center (ARSC) (http://www.arsc.edu/resources/pingo)
  - 432- Cray XT5 compute nodes with
    - 32 GB of shared memory per node (4 GB per core)
    - 2 quad core 2.3 GHz AMD Opteron processors per node.
    - 1 Seastar2+ Interconnect Module per node.
  - Cray Seastar2+ Interconnect between all compute and login nodes



Rabenseifner, Hager, <u>Jost</u>



R S

2 H L



## Cray XT5: NPB-MZ Class D Scalability





### Cray XT5: CrayPat Performance Analysis

- module load xt-craypat
- Compilation:
  - ➢ ftn -fastsse -tp barcelona-64 -r8 -mp=nonuma,[trace]
- Instrument:
  - > pat\_build -w -T TraceOmp, -g mpi,omp bt.exe bt.exe.pat
- Execution :
  - ➤ (export PAT\_RT\_HWPC {0,1,2,..})
  - ➢ export OMP\_NUM\_THREADS 4
  - > aprun -n NPROCS -S 1 -d 4 ./bt.exe.pat
- Generate report:
  - pat\_report -O load\_balance,thread\_times,program\_time,mpi\_callers -O profile\_pe.th \$1





# Cray XT5: BT-MZ 32x4 Function Profile





## Cray XT5: BT-MZ Load-Balance 32x4 vs 128x1

| Table 2: Load Balance across PE's by FunctionGroup                                 | Table 2: Load Balance across PE's by FunctionGroup                               |
|------------------------------------------------------------------------------------|----------------------------------------------------------------------------------|
| Time %   Time   Calls  Experiment=1<br>     Group<br>      PE[mmm]<br>      Thread | Time %   Time   Calls  Group<br>    PE[mmm]<br>100.0%   24.277514   38258  Total |
| 100.0%   1.782603   18662  Total                                                   | 54.2%   13.166225   4545 IMPI                                                    |
| 86,1%   1,535163   7783  USER                                                      | <br>   0.5%   16.454993   4846  pe.91                                            |
| <br>   2.7%   1.535987   6813  pe.0                                                | 0.5%   14.058598   2434  pe.29<br>   0.0%   0.289479   2434  pe.0                |
| 3   0.7%   1.535987   6188  thread.1<br>3   0.7%   1.535871   6188  thread.3       | ====================================                                             |
| 3   0.7%   1.535829   6188  thread.2<br>3   0.7%   1.466954   6813  thread.0       | 0.7%   23.205797   9093  pe.0<br>   0.3%   10.084200   26873  pe.110             |
| 2.7%   1.535147   7783  pe.18                                                      | 0.3%   8.070997   17983  pe.91                                                   |
| 3   0.7%   1.535147   7072  thread.1<br>3   0.7%   1.534995   7072  thread.3       | bt-mz-C.128x1                                                                    |
| 3   0.7%   1.534968   7072  thread.2<br>3   0.6%   1.290502   7783  thread.0       | • maximum, median, minimum PE are shown                                          |
| 2.7%   1.534239   7783  pe.16<br>                                                  | • bt-mz.C.128x1 shows large imbalance in User                                    |
| 3   0.7%   1.534239   7072  thread.1<br>3   0.7%   1.534101   7072  thread.3       | <ul> <li>and MPI time</li> <li>bt-mz.C.32x4 shows well balanced times</li> </ul> |
| 3   0.7%   1.534076   7072  thread.2<br>3   0.6%   1.268085   7783  thread.0       | bt-mz-C.32x4                                                                     |
| Hybrid Parallel Programming<br>Slide 35 / 169 Rabenseifner, Hag                    |                                                                                  |
|                                                                                    | TACC                                                                             |

#### skipped **Running Hybrid on Cray XT4** Shared Memory: Hyper, Transport Cache-coherent 4-way Node — Distributed memory: • <u>1</u> network Core Core Network of nodes \_ **Core-to-Core** Core • Node-to-Node • memory 1 Core Core HLRS Hybrid Parallel Programming Slide 36 / 169 Rabenseifner, Hager, Jost

Pitfalls:

skipped

### Process and Thread Placement on Cray XT4 (1)





Pitfalls:

skipped

### **Process and Thread Placement on Cray XT4 (2)**



export OMP\_NUM\_THREADS=4 export MPICH\_RANK\_REORDER\_DISPLAY=1 Rank 1 aprun -n 2 -N 1 sp-mz.B.2 network 2 nodes, 8 cores, 8 threads [PE\_0]: rank 0 is on **nid01759**; [PE\_0]: rank 1 is on nid01882; Rank 0 SP-MZ Benchmark Completed. Class. = 304x 208x 17 Size = Iterations = 47.20 Time in seconds = Total processes = Total threads 8 = Short execution time Mop/s total 6427.18 = Mop/s/thread 803.40 because both 4-way MPI = Operation type = floating point processes are running SUCCESSFUL Verification = Version 3.3 = on different sockets Compile date 28 May 2009 = Hybrid Parallel Programming нι R Rabenseifner, Hager, Jost Slide 38 / 169 

## **Example Batch Script Cray XT4**



### Cray XT4 at ERDC:

skipped

- 1 quad-core AMD Opteron per node
- ftn -fastsse -tp barcelona-64 -mp -o bt-mz.128



## **IBM Power 6**



- Results obtained by the courtesy of the HPCMO Program and the Engineer Research and Development Center Major Shared Resource Center, Vicksburg, MS (<u>http://www.erdc.hpc.mil/index</u>)
- The IBM Power 6 System is located at (http://www.navo.hpc.mil/davinci\_about.html)
- 150 Compute Nodes
- 32 4.7GHz Power6 Cores per Node (4800 cores total)
- 64 GBytes of dedicated memory per node
- QLOGOC Infiniband DDR interconnect



## NPB-MZ Class D on IBM Power 6: Exploiting SMT for 2048 Core Results









## **Conventional Multi-Threading**







## **Simultaneous Multi-Threading**

Shared functional units



Rabenseifner, Hager, Jost

Charles Grassl, IBM





## **Performance Analysis on IBM Power 6**

- Compilation:
  - mpxlf\_r -O4 -qarch=pwr6 -qtune=pwr6 -qsmp=omp -pg
- Execution :
  - > export OMP\_NUM\_THREADS 4
  - > poe launch \$PBS\_O\_WORKDIR./sp.C.16x4.exe
  - Generates a file gmount.MPI\_RANK.out for each MPI Process
- Generate report:
  - gprof sp.C.16x4.exe gmon\*

| % cumulative |         | self    |        | self    | total   |                               |
|--------------|---------|---------|--------|---------|---------|-------------------------------|
| time         | seconds | seconds | calls  | ms/call | ms/call | name                          |
| 16.7         | 117.94  | 117.94  | 205245 | 0.57    | 0.57    | .@10@x_solve@OL@1 [2]         |
| 14.6         | 221.14  | 103.20  | 205064 | 0.50    | 0.50    | .@15@z_solve@OL@1 [3]         |
| 12.1         | 307.14  | 86.00   | 205200 | 0.42    | 0.42    | .@12@y_solve@OL@1 [4]         |
| 6.2          | 350.83  | 43.69   | 205300 | 0.21    | 0.21    | .080compute_rhs00L0100L06 [5] |



## **Conclusions:**



### BT-MZ:

- Inherent workload imbalance on MPI level
- #nprocs = #nzones yields poor performance
- #nprocs < #zones => better workload balance, but decreases parallelism
- Hybrid MPI/OpenMP yields better load-balance, maintains amount of parallelism

### • SP-MZ:

- > No workload imbalance on MPI level, pure MPI should perform best
- MPI/OpenMP outperforms MPI on some platforms due contention to network access within a node
- LU-MZ:
  - Hybrid MPI/OpenMP increases level of parallelism
- "Best of category" depends on many factors
  - Depends on many factors
  - Hard to predict
  - Good thread affinity is essential



## Outline



- Introduction / Motivation
- Programming models on clusters of SMP nodes
- Case Studies / pure MPI vs hybrid MPI+OpenMP

## Practical "How-To" on hybrid programming

Georg Hager, Regionales Rechenzentrum Erlangen (RRZE)

- Mismatch Problems
- Application categories that can benefit from hybrid parallelization
- Thread-safety quality of MPI libraries
- Tools for debugging and profiling MPI+OpenMP
- Other options on clusters of SMP nodes
- Summary



Hybrid Parallel ProgrammingSlide 48 / 169Rabenseifner, Hager, Jost

TACC



## Hybrid Programming How-To: Overview

- A practical introduction to hybrid programming
  - How to compile and link
  - Getting a hybrid program to run on a cluster
- Running hybrid programs efficiently on multi-core clusters
  - Affinity issues
    - ccNUMA
    - Bandwidth bottlenecks
  - Intra-node MPI/OpenMP anisotropy
    - MPI communication characteristics
    - OpenMP loop startup overhead
  - Thread/process binding





## How to compile, link and run

- Use appropriate OpenMP compiler switch (-openmp, -xopenmp, -mp, -qsmp=openmp, ...) and MPI compiler script (if available)
- Link with MPI library
  - Usually wrapped in MPI compiler script
  - If required, specify to link against thread-safe MPI library
    - Often automatic when OpenMP or auto-parallelization is switched on
- Running the code
  - Highly non-portable! Consult system docs! (if available...)
  - If you are on your own, consider the following points
  - Make sure OMP\_NUM\_THREADS etc. is available on all MPI processes
    - Start "env VAR=VALUE ... <YOUR BINARY>" instead of your binary alone
    - Use Pete Wyckoff's *mpiexec* MPI launcher (see below): http://www.osc.edu/~pw/mpiexec
  - Figure out how to start less MPI processes than cores on your nodes

Hybrid Parallel Programming Slide 50 / 169

Rabenseifner, <u>Hager</u>, Jost



## SC10 Nev Orleans, LA

## Some examples for compilation and execution (1)

### NEC SX9

- NEC SX9 compiler
- mpif90 -C hopt -P openmp ... # -ftrace for profiling info
- Execution:
- \$ export OMP\_NUM\_THREADS=<num\_threads>
- \$ MPIEXPORT="OMP\_NUM\_THREADS"
- \$ mpirun -nn <# MPI procs per node> -nnp <# of nodes> a.out
- Standard Intel Xeon cluster (e.g. @HLRS):
  - Intel Compiler
  - mpif90 -openmp ...
  - Execution (handling of OMP\_NUM\_THREADS, see next slide):
  - \$ mpirun\_ssh -np <num MPI procs> -hostfile machines a.out





## Some examples for compilation and execution (2)

### Handling of OMP\_NUM\_THREADS

- <u>without</u> any support by mpirun:
  - E.g. with mpich-1
  - Problem:

mpirun has no features to export environment variables to the via ssh automatically started MPI processes

- Solution: Set
   export OMP\_NUM\_THREADS=<# threads per MPI process>
   in ~/.bashrc (if a bash is used as login shell)
- If you want to set OMP\_NUM\_THREADS individually when starting the MPI processes:
  - Add test -s ~/myexports && . ~/myexports in your ~/.bashrc
  - Add echo '\$OMP\_NUM\_THREADS=<# threads per MPI process>' > ~/myexports before invoking mpirun
  - Caution: Several invocations of mpirun cannot be executed at the same time with this trick!



## Some examples for compilation and execution (3)

Handling of OMP\_NUM\_THREADS (continued)

 with support by OpenMPI -x option: export OMP\_NUM\_THREADS= <# threads per MPI process> mpiexec -x OMP\_NUM\_THREADS -n <# MPI processes> ./executable



Hybrid Parallel Programming

Rabenseifner, <u>Hager</u>, <u>Jost</u>



## SC 10 New Orleans, LA

## Some examples for compilation and execution (4)

- Sun Constellation Cluster:
  - mpif90 -fastsse -tp barcelona-64 -mp ...
  - SGE Batch System
  - setenv OMP\_NUM\_THREADS
  - ibrun numactl.sh a.out
  - Details see TACC Ranger User Guide (www.tacc.utexas.edu/services/userguides/ranger/#numactl)
- Cray XT5:
  - ftn -fastsse -tp barcelona-64 -mp=nonuma ...
  - aprun -n nprocs -N nprocs\_per\_node a.out





# Interlude: Advantages of mpiexec or similar mechanisms



- Uses PBS/Torque Task Manager ("TM") interface to spawn MPI processes on nodes
  - As opposed to starting remote processes with ssh/rsh:
    - Correct CPU time accounting in batch system
    - Faster startup
    - Safe process termination
    - Understands PBS per-job nodefile
    - Allowing password-less user login not required between nodes
  - Support for many different types of MPI
    - All MPICHs, MVAPICHs, Intel MPI, ...
  - Interfaces directly with batch system to determine number of procs
  - Downside: If you don't use PBS or Torque, you're out of luck...
- Provisions for starting less processes per node than available cores
  - Required for hybrid programming
  - "-pernode" and "-npernode #" options does not require messing around with nodefiles

Hybrid Parallel Programming Slide 55 / 169

Rabenseifner, <u>Hager,</u> Jost



## Running the code

Examples with mpiexec



• Example for using mpiexec on a dual-socket quad-core cluster:

```
$ export OMP_NUM_THREADS=8
$ mpiexec -pernode ./a.out
```

• Same but 2 MPI processes per node:

```
$ export OMP_NUM_THREADS=4
$ mpiexec -npernode 2 ./a.out
```

```
• Pure MPI:
```

\$ export OMP\_NUM\_THREADS=1 # or nothing if serial code
\$ mpiexec ./a.out



## Running the code *efficiently*?



- Symmetric, UMA-type compute nodes have become rare animals
  - NEC SX
  - Intel 1-socket ("Port Townsend/Melstone/Lynnfield") see case studies
  - Hitachi SR8000, IBM SP2, single-core multi-socket Intel Xeon... (all dead)
- Instead, systems have become "non-isotropic" on the node level
  - ccNUMA (AMD Opteron, SGI Altix, IBM Power6 (p575), Intel Nehalem)
  - Multi-core, multi-socket
    - Shared vs. separate caches
    - Multi-chip vs. single-chip
    - Separate/shared buses





Rabenseifner, <u>Hager</u>, Jost



Memory

Memory

# Issues for running code efficiently on "non-isotropic" nodes

- ccNUMA locality effects
  - Penalties for inter-LD access
  - Impact of contention
  - Consequences of file I/O for page placement
  - Placement of MPI buffers
- Multi-core / multi-socket anisotropy effects
  - Bandwidth bottlenecks, shared caches
  - Intra-node MPI performance
    - Core  $\leftrightarrow$  core vs. socket  $\leftrightarrow$  socket
  - OpenMP loop overhead depends on mutual position of threads in team









## A short introduction to ccNUMA

- ccNUMA:
  - whole memory is transparently accessible by all processors
  - but physically distributed
  - with varying bandwidth and latency
  - and potential contention (shared memory paths)





## Example: HP DL585 G5

4-socket ccNUMA Opteron 8220 Server

• CPU

- 64 kB L1 per core
- 1 MB L2 per core
- No shared caches
- On-chip memory controller (MI)
- 10.6 GB/s local memory bandwidth
- HyperTransport 1000 network
  - 4 GB/s per link per direction
- 3 distance categories for core-to-memory connections:
  - same LD
  - 1 hop
  - 2 hops
- Q1: What are the real penalties for non-local accesses?
  - Q2: What is the impact of contention?

Hybrid Parallel Programming Slide 60 / 169

Rabenseifner, <u>Hager</u>, Jost

H L R S







## ccNUMA Memory Locality Problems

- Locality of reference is key to scalable performance on ccNUMA
  - Less of a problem with pure MPI, but see below
- What factors can destroy locality?
- MPI programming:
  - processes lose their association with the CPU the mapping took place on originally
  - OS kernel tries to maintain strong affinity, but sometimes fails
- Shared Memory Programming (OpenMP, hybrid):
  - threads losing association with the CPU the mapping took place on originally
  - improper initialization of distributed data
  - Lots of extra threads are running on a node, especially for hybrid
- All cases:
  - Other agents (e.g., OS kernel) may fill memory with data that prevents optimal placement of user data

Hybrid Parallel Programming Slide 63 / 169

Rabenseifner, <u>Hager</u>, Jost





## **Avoiding locality problems**

- How can we make sure that memory ends up where it is close to the CPU that uses it?
  - See the following slides
- How can we make sure that it stays that way throughout program execution?
  - See end of section



Hybrid Parallel Programming Slide 64 / 169

Rabenseifner, <u>Hager</u>, Jost





## **Solving Memory Locality Problems: First Touch**

• "Golden Rule" of ccNUMA:

```
A memory page gets mapped into the local memory of the processor that first touches it!
```

- Except if there is not enough local memory available
- this might be a problem, see later
- Some OSs allow to influence placement in more direct ways
  - cf. libnuma (Linux), MPO (Solaris), ...
- Caveat: "touch" means "write", not "allocate"
- Example:

1Portal

```
double *huge = (double*)malloc(N*sizeof(double));
// memory not mapped yet
for(i=0; i<N; i++) // or i+=PAGE_SIZE
    huge[i] = 0.0; // mapping takes place here!</pre>
```

• It is sufficient to touch a single item to map the entire page





## ccNUMA problems beyond first touch

- OS uses part of main memory for disk buffer (FS) cache
  - If FS cache fills part of memory, apps will probably allocate from foreign domains
  - $\rightarrow$  non-local access!
  - Locality problem even on hybrid and pure MPI with "asymmetric" file I/O, i.e. if not all MPI processes perform I/O



- Remedies
  - Drop FS cache pages after user job has run (admin's job)
    - Only prevents cross-job buffer cache "heritage"
  - "Sweeper" code (run by user)
  - Flush buffer cache after I/O if necessary ("sync" is not sufficient!)





## ccNUMA problems beyond first touch

- Real-world example: ccNUMA vs. UMA and the Linux buffer cache
- Compare two 4-way systems: AMD Opteron ccNUMA vs. Intel UMA, 4 GB main memory





### Intra-node MPI characteristics: IMB Ping-Pong benchmark



### IMB Ping-Pong: Latency

Intra-node vs. Inter-node on Woodcrest DDR-IB cluster (Intel MPI 3.1)







## **OpenMP Overhead**



- As with intra-node MPI, OpenMP loop start overhead varies with the mutual position of threads in a team
- Possible variations
  - Intra-socket vs. inter-socket
  - Different overhead for "parallel for" vs. plain "for"
  - If one multi-threaded MPI process spans multiple sockets,
    - ... are neighboring threads on neighboring cores?
    - ... or are threads distributed "round-robin" across cores?

```
Test benchmark: Vector triad
```

```
#pragma omp parallel
for(int j=0; j < NITER; j++){
#pragma omp (parallel) for
for(i=0; i < N; ++i)
    a[i]=b[i]+c[i]*d[i];
    if(OBSCURE)
        dummy(a,b,c,d);</pre>
```

Hybrid Parallel Programming Slide 71 / 169 Rabenseifner, Hager, Jost Look at performance for small array sizes!





## **OpenMP Overhead**



### Thread synchronization overhead

Barrier overhead in CPU cycles: pthreads vs. OpenMP vs. spin loop



| Barrier overnead in CPU cycles: pthreads vs. OpeniviP vs. spin loop |                              |                    |  |
|---------------------------------------------------------------------|------------------------------|--------------------|--|
|                                                                     |                              |                    |  |
| 2 Threads                                                           | Q9550 (shared L2)            | i7 920 (shared L3) |  |
| pthreads_barrier_wait                                               | 23739                        | 6511               |  |
| omp barrier (icc 11.0)                                              | 399                          | 469                |  |
| Spin loop                                                           | 231                          | 270                |  |
|                                                                     |                              |                    |  |
| 4 Threads                                                           | Q9550                        | i7 920 (shared L3) |  |
| pthreads_barrier_wait                                               | 42533                        | 9820               |  |
| omp barrier (icc 11.0)                                              | 977                          | 814                |  |
| Ciplin I pain                                                       | 1100                         |                    |  |
| Spin loop                                                           | 1106                         | 475                |  |
| pthreads $\rightarrow$ OS kernel call                               | op does fine for shared cach |                    |  |

### Thread synchronization overhead

Barrier overhead: OpenMP icc vs. gcc







814

gcc obviously uses a pthreads barrier for the OpenMP barrier:

| 2 Threads | Q9550 (shared L2) | i7 920 (shared L3) |
|-----------|-------------------|--------------------|
| gcc 4.3.3 | 22603             | 7333               |
| icc 11.0  | 399               | 469                |
|           |                   |                    |
| 4 Threads | Q9550             | i7 920 (shared L3) |
| gcc 4.3.3 | 64143             | 10901              |

977

Correct pinning of threads:

- Manual pinning in source code (see below) or
- likwid-pin: http://code.google.com/p/likwid/

Hybrid Parallel Programming Slide 74 / 169

icc 11.0



### Thread synchronization overhead

Barrier overhead: Topology influence





| Xeon E5420 2 Threads   | shared L2 | same socket | different socket |
|------------------------|-----------|-------------|------------------|
| pthreads_barrier_wait  | 5863      | 27032       | 27647            |
| omp barrier (icc 11.0) | 576       | 760         | 1269             |
| Spin loop              | 259       | 485         | 11602            |



| Nehalem 2 Threads      | Shared SMT<br>threads | shared L3 | different socket |
|------------------------|-----------------------|-----------|------------------|
| pthreads_barrier_wait  | 23352                 | 4796      | 49237            |
| omp barrier (icc 11.0) | 2761                  | 479       | 1206             |
| Spin loop              | 17388                 | 267       | 787              |

- SMT can be a big performance problem for synchronizing threads
  - Well known for a long time...

Hybrid Parallel Programming Slide 75 / 169





# Thread/Process Affinity ("Pinning")

- Highly OS-dependent system calls
  - But available on all systems

```
Linux: sched_setaffinity(), PLPA (see below) → hwloc
Solaris: processor_bind()
Windows: SetThreadAffinityMask()
```

- Support for "semi-automatic" pinning in some compilers/environments
  - Intel compilers > V9.1 (KMP\_AFFINITY environment variable)
  - Pathscale
  - SGI Altix dplace (works with logical CPU numbers!)
  - Generic Linux: taskset, numactl, likwid-pin (see below)
- Affinity awareness in MPI libraries

Seen on SUN Ranger slides

HLRS



- OpenMPI
- Intel MPI

Hybrid Parallel Programming Slide 76 / 169

Rabenseifner, <u>Hager</u>, Jost

Widely usable example: Using PLPA under Linux!





## **Process/Thread Binding With PLPA**



# How do we figure out the topology?



- ... and how do we enforce the mapping without changing the code?
- Compilers and MPI libs may still give you ways to do that
- But LIKWID supports all sorts of combinations:

Like I Knew What I'm Doing



- Open source tool collection (developed at RRZE):
  - http://code.google.com/p/likwid



## Likwid Tool Suite



- Command line tools for Linux:
  - works with standard linux 2.6 kernel
  - supports Intel and AMD CPUs
  - Supports all compilers whose OpenMP implementation is based on pthreads
- Current tools:
  - likwid-topology: Print thread and cache topology (similar to Istopo from the hwloc package)
  - likwid-pin: Pin threaded application without touching code
  - likwid-perfCtr: Measure performance counters (similar to SGI's perfex or lipfpm tools)
  - likwid-features: View and enable/disable hardware prefetchers (Core2 only at the moment)
  - likwid-bench: Low-level benchmark construction tool





# likwid-topology – Topology information

- Based on cpuid information
- Functionality:
  - Measured clock frequency
  - Thread topology
  - Cache topology
  - Cache parameters (-c command line switch)
  - ASCII art output (-g command line switch)
- Currently supported:
  - Intel Core 2 (45nm + 65 nm)
  - Intel Nehalem
  - AMD K10 (Quadcore and Hexacore)
  - AMD K8



ng Rabenseifner, <u>Hager</u>, Jost



# **Output of likwid-topology**



| CPU name: Intel Core i7 processor<br>CPU clock: 2666683826 Hz<br>************************************ |        |        |        |  |
|-------------------------------------------------------------------------------------------------------|--------|--------|--------|--|
| Sockets:                                                                                              |        | 2      |        |  |
| Cores per soc<br>Threads per c                                                                        |        | 4<br>2 |        |  |
| HWThread                                                                                              | Thread | Core   | Socket |  |
| 0                                                                                                     | 0      | 0      | 0      |  |
| 1                                                                                                     | 1      | 0      | 0      |  |
| 2                                                                                                     | 0      | 1      | 0      |  |
| 3                                                                                                     | 1      | 1      | 0      |  |
| 4                                                                                                     | 0      | 2      | 0      |  |
| 5                                                                                                     | 1      | 2      | 0      |  |
| 6                                                                                                     | 0      | 3      | 0      |  |
| 7                                                                                                     | 1      | 3      | 0      |  |
| 8                                                                                                     | 0      | 0      | 1      |  |
| 9                                                                                                     | 1      | 0      | 1      |  |
| 10                                                                                                    | 0      | 1      | 1      |  |
| 11                                                                                                    | 1      | 1      | 1      |  |
| 12                                                                                                    | 0      | 2      | 1      |  |
| 13                                                                                                    | 1      | 2      | 1      |  |
| 14                                                                                                    | 0      | 3      | 1      |  |
| 15                                                                                                    | 1      | 3      | 1      |  |

Hybrid Parallel Programming Slide 82 / 169



## likwid-topology continued



```
Socket 0: (01234567)
Socket 1: ( 8 9 10 11 12 13 14 15 )
Cache Topology
Level:
     1
Size:
     32 kB
Cache groups: (01)(23)(45)(67)(89)(1011)(1213)(1415)
Level:
     2
Size:
     256 kB
Cache groups: (01)(23)(45)(67)(89)(1011)(1213)(1415)
Level:
     3
Size:
     8 MB
Cache groups: (01234567) (89101112131415)
```

... and also try the ultra-cool –g option!



# likwid-pin



- Inspired and based on **ptoverride** (Michael Meier, RRZE) and **taskset**
- Pins process and threads to specific cores without touching code
- Directly supports pthreads, gcc OpenMP, Intel OpenMP
- Allows user to specify skip mask (i.e., supports many different compiler/MPI combinations)
- Can also be used as replacement for taskset
- Uses logical (contiguous) core numbering when running inside a restricted set of cores
- Supports logical core numbering inside node, socket, core
- Usage examples:
  - env OMP\_NUM\_THREADS=6 likwid-pin -t intel -c 0,2,4-6 ./myApp parameters
  - env OMP\_NUM\_THREADS=6 likwid-pin -c S0:0-2@S1:0-2 ./myApp
  - env OMP\_NUM\_THREADS=2 mpirun -npernode 2 \

```
likwid-pin -s 0x3 -c 0,1 ./myApp parameters
```







### **Case study: 3D Jacobi Solver** Basic implementation (2 arrays; no blocking etc...)



- Halo cells
- Data Exchange through cyclic SendReceive operation





## MPI/OpenMP Parallelization – 3D Jacobi

- Cubic 3D computational domain with periodic BCs in all directions
- Use single-node IB/GE cluster with one dual-core chip per node
- Homogeneous distribution of workload, e.g. on 8 procs



### Performance Data for 3D MPI/hybrid Jacobi

Strong scaling,  $N^3 = 480^3$ 

FullHybrid: Thread 0: Communication + boundary cell updates Thread 1: Inner cell updates





skipped



JDS parallel sparse matrix-vector multiply – storage scheme



### JDS Sparse MVM – Kernel Code OpenMP parallelization



- Implement c(:) = m(:,:) \* b(:)
- Operation count = 2N<sub>nz</sub>

skipped

```
do diag=1, zmax
  diagLen = jd_ptr(diag+1) - jd_ptr(diag)
  offset = jd_ptr(diag) - 1
!$OMP PARALLEL DO
  do i=1, diagLen
     c(i) = c(i) + val(offset+i) * b(col_idx(offset+i))
  enddo
!$OMP END PARALLEL DO
enddo
```

- Long inner loop (max. N<sub>r</sub>): OpenMP parallelization / vectorization
- Short outer loop (number of jagged diagonals)
- Multiple accesses to each element of result vector c[]
  - optimization potential!
- Stride-1 access to matrix data in val []
- Indexed (indirect) access to RHS vector ъ[]

Hybrid Parallel Programming Slide 91 / 169

Rabenseifner, <u>Hager</u>, Jost



## JDS Sparse MVM MPI parallelization

skipped







### JDS Sparse MVM Parallel MVM implementations: MPP

- One MPI process per processor
- Non-blocking MPI communication
- *Potential* overlap of communication and computation
  - However, MPI progress is only possible inside MPI calls on many implementations
- SMP Clusters: Intra-node and internode MPI



Hybrid Parallel Programming Slide 93 / 169

Rabenseifner, <u>Hager</u>, Jost



### JDS Sparse MVM Parallel MVM implementations: Hybrid









- Do not use hybrid if the pure MPI code scales ok
- Be aware of intranode MPI behavior
- Always observe the topology dependence of
  - Intranode MPI
  - OpenMP overheads
- Enforce proper thread/process to core binding, using appropriate tools (whatever you use, but use SOMETHING)
- Multi-LD OpenMP processes on ccNUMA nodes require correct page placement
- Finally: Always compare the best pure MPI code with the best OpenMP code!



# Outline



- Introduction / Motivation
- Programming models on clusters of SMP nodes
- Case Studies / pure MPI vs hybrid MPI+OpenMP
- Practical "How-To" on hybrid programming

### Mismatch Problems

- Opportunities: Application categories that can benefit from hybrid parallelization
- Thread-safety quality of MPI libraries
- Tools for debugging and profiling MPI+OpenMP
- Other options on clusters of SMP nodes
- Summary



Hybrid Parallel Programming





Slide 98 / 169







# The Topology Problem with

one MPI process on each core

Application example on 80 cores:

- Cartesian application with 5 x 16 = 80 sub-domains
- On system with 10 x dual socket x quad-core



- 12 x inter-node connections per node
  - 4 x inter-socket connection per node domain decomposition

## Bad affinity of cores to thread ranks

Hybrid Parallel Programming Slide 101 / 169

Rabenseifner, Hager, Jost



Two levels of

R

## pure MPI one MPI process on each core

Two levels of

R

Application example on 80 cores:

The Topology Problem with

- Cartesian application with  $5 \times 16 = 80$  sub-domains
- On system with 10 x dual socket x quad-core



- 12 x inter-node connections per node
  - 2 x inter-socket connection per node domain decomposition

# **Good** affinity of cores to thread ranks

Hybrid Parallel Programming Slide 102 / 169

Rabenseifner, Hager, Jost



ΗL

# The Topology Problem with



MPI: inter-node communication OpenMP: inside of each SMP node

hybrid MPI+OpenMP

Exa.: 2 SMP nodes, 8 cores/node

skipped



Problem

 Does application topology inside of SMP parallelization fit on inner hardware topology of each SMP node?

Solutions:

- Domain decomposition inside of each thread-parallel MPI process, and
- first touch strategy with OpenMP

Successful examples:

- Multi-Zone NAS Parallel Benchmarks (MZ-NPB)









## The Mapping Problem with mixed model



pure MPI

# Unnecessary intra-node communication

Problem:

- If several MPI process on each SMP node

 $\rightarrow$  unnecessary intra-node communication

Solution:

Only one MPI process per SMP node

Remarks:

- MPI library must use appropriate fabrics / protocol for intra-node communication
- Intra-node bandwidth higher than inter-node bandwidth
  - $\rightarrow$  problem may be small
  - MPI implementation may cause

unnecessary data copying

→ waste of memory bandwidth

Quality aspects of the MPI library

Hybrid Parallel Programming Slide 107 / 169



pure MPI Mixed model (several multi-threaded MPI processes per SMP node)

# Sleeping threads and network saturation

# with Masteronly

MPI only outside of parallel regions

#### for (iteration ....)

ł

#pragma omp parallel
numerical code
/\*end omp parallel \*/

/\* on master thread only \*/ MPI\_Send (original data to halo areas in other SMP nodes) MPI\_Recv (halo data from the neighbors) } /\*end for loop



#### Problem 1:

- Can the master thread saturate the network?
   Solution:
- If not, use mixed model
- i.e., several MPI processes per SMP node

#### Problem 2:

- Sleeping threads are wasting CPU time
   Solution:
- Overlapping of computation and communication

#### Problem 1&2 together:

R

 Producing more idle time through lousy bandwidth of master thread

Hybrid Parallel Programming Slide 108 / 169

Rabenseifner, Hager, Jost





## **OpenMP: Additional Overhead & Pitfalls**

- Using OpenMP
  - $\rightarrow$  may prohibit compiler optimization
  - ightarrow may cause significant loss of computational performance
- Thread fork / join overhead
- On ccNUMA SMP nodes:

See, e.g., the necessary –O4 flag with mpxlf\_r on IBM Power6 systems

- Loss of performance due to missing memory page locality or missing first touch strategy
- E.g. with the masteronly scheme:
  - One thread produces data
  - Master thread sends the data with MPI
  - $\rightarrow$  data may be internally communicated from one memory to the other one
- Amdahl's law for each level of parallelism
- Using MPI-parallel application libraries?  $\rightarrow$  Are they prepared for hybrid?



#### **Overlapping Communication and Computation**

MPI communication by one or a few threads while other threads are computing

Three problems:

- the application problem:
  - one must separate application into:
    - code that can run before the halo data is received
    - code that needs halo data

#### → very hard to do !!!

- the thread-rank problem:
  - comm. / comp. via thread-rank
  - cannot use work-sharing directives

→ loss of major OpenMP support (see next slide)

• the load balancing problem

```
if (my_thread_rank < 1) {
    MPI_Send/Recv....
} else {
    my_range = (high-low-1) / (num_threads-1) + 1;
    my_low = low + (my_thread_rank+1)*my_range;
    my_high=high+ (my_thread_rank+1+1)*my_range;
    my_high = max(high, my_high)
    for (i=my_low; i<my_high; i++) {
</pre>
```

```
Hybrid Parallel Programming
Slide 110 / 169 Rabense
```

```
TACC
```



#### **Overlapping Communication and Computation**

MPI communication by one or a few threads while other threads are computing

#### Subteams

 Important proposal for OpenMP 3.x or OpenMP 4.x

Barbara Chapman et al.: Toward Enhancing OpenMP's Work-Sharing Directives. In proceedings, W.E. Nagel et al. (Eds.): Euro-Par 2006, LNCS 4128, pp. 645-654, 2006.

#pragma omp parallel #pragma omp single onthreads( 0 ) MPI Send/Recv.... #pragma omp for onthreads( 1 : omp get numthreads()-1 ) for (.....) { /\* work without halo information \*/ } /\* barrier at the end is only inside of the subteam \*/ **#pragma omp barrier** #pragma omp for for (.....) { /\* work based on halo information \*/ } /\*end omp parallel \*/

Hybrid Parallel Programming Slide 111 / 169







### **Experiment: Matrix-vector-multiply (MVM)**



 Jacobi-Davidson-Solver on IBM SP Power3 nodes with 16 CPUs per node

Masteronly

funneled &

reserved

- funneled&reserved is always faster in this experiments
- Reason: Memory bandwidth is already saturated by 15 CPUs, see inset
- Inset: Speedup on 1 SMP node using different number of threads

Source: R. Rabenseifner, G. Wellein:

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures. International Journal of High Performance Computing Applications, Vol. 17, No. 1, 2003, Sage Science Press.







## **Overlapping: Using OpenMP tasks**





#### Case study: Communication and Computation in **Gyrokinetic Tokamak Simulation (GTS) shift routine**



SEMI-INDEPENDENT

!reorder remaining particles: fill holes do iterations = 1,N fill\_hole(p\_array); !compute particles to be shifted !send number of particles to move right *!\$omp parallel do* shift\_p=particles\_to\_shift(p\_array); 5 communicate amount of shifted! particles and return if equal to 0 shift p=x+yMPI ALLREDUCE(shift\_p,sum\_shift\_p) 0 INDEPENDENT if (sum\_shift\_p==0) { return; } 11 !pack particle to move right and left *!\$omp parallel do* 13 do m=1.xdo m=1.x15 sendright(m)=p array(f(m)); enddo enddo !\$omp parallel do 17 do n=1.vdo n=1, ysendleft(n)=p\_array(f(n)); 19 enddo enddo

skipped

NDEPENDENT 25 MPI SENDRECV(x, length = 2,...); !send to right and receive from left 27 MPI\_SENDRECV(sendright, length=g(x),..) !send number of particles to move left 29 MPL SENDRECV(v, length = 2...); ! send to left and receive from right 31 MPI SENDRECV(sendleft, length=g(y),..); 33 !adding shifted particles from right !\$omp parallel do 35  $p_array(h(m)) = sendright(m);$ 37 !adding shifted particles from left 39 *!*\$omp parallel do 41 p array (h(n)) = sendleft(n);43

GTS shift routine

Work on particle array (packing for sending, reordering, adding after sending) can be overlapped with data independent MPI communication using **OpenMP tasks**.

Slides, courtesy of Alice Koniges, NERSC, LBNL Hybrid Parallel Programming Slide 115 / 169 Rabenseifner, Hager, Jost



#### Overlapping can be achieved with OpenMP tasks (1st part)



Overlapping MPI\_Allreduce with particle work

- **Overlap**: Master thread encounters (!\$omp master) tasking statements and creates work for the thread team for deferred execution. MPI Allreduce call is immediately executed.
- MPI implementation has to support at least MPI\_THREAD\_FUNNELED
- Subdividing tasks into smaller chunks to allow better *load balancing* and *scalability* among threads.



skipped

Slides, courtesy of Alice Koniges, NERSC, LBNL





#### Overlapping can be achieved with OpenMP tasks (2<sup>nd</sup> part)

| !\$omp parallel                                                         | 1  |
|-------------------------------------------------------------------------|----|
| ! \$omp_master<br>! \$omp_task                                          | 3  |
| fill_hole(p_array);<br><u>!\$omp_end_task</u>                           | 5  |
| MPI_SENDRECV(x, length = 2,);<br>MPI_SENDRECV(sendright, length=g(x),); | 7  |
| $MPI_SENDRECV(y, length = 2,);$                                         | 9  |
| !\$omp end master                                                       |    |
| !\$omp end parallel<br>}                                                | 11 |

skipped

Overlapping particle reordering

Particle reordering of remaining particles (above) and adding sent particles into array (right) & sending or receiving of shifted particles can be independently executed.

| !\$omp parallel                                                    |    |
|--------------------------------------------------------------------|----|
| ! \$omp master                                                     | 2  |
| ! adding shifted particles from right                              |    |
| do m=1,x-stride , stride                                           | 4  |
| ?\$omp task<br>do mm=0, stride -1,1                                | 6  |
|                                                                    | 0  |
| p_array(h(m)) = sendright(m);<br>enddo                             | 8  |
| !\$omp end task                                                    | 0  |
| enddo                                                              | 10 |
| !\$omp task                                                        |    |
| dom=m,x                                                            | 12 |
| $p_array(h(m)) = sendright(m);$                                    |    |
| enddo                                                              | 14 |
| !\$omp end task                                                    | 1/ |
| MDI SENIDRECV( son diaft langth $-\alpha(y)$ ):                    | 16 |
| <pre>MPI_SENDRECV(sendleft,length=g(y),); ! \$omp end master</pre> | 18 |
| !\$omp end parallel                                                | 10 |
| : comp ena parattei                                                | 20 |
| ladding shifted particles from left                                | -  |
| !\$omp parallel do                                                 | 22 |
| do n=1, y                                                          |    |
| $p_{array}(h(n)) = sendleft(n);$                                   | 24 |
| enddo                                                              |    |
|                                                                    |    |

Overlapping remaining MPI\_Sendrecv



#### Slides, courtesy of Alice Koniges, NERSC, LBNL



## OpenMP tasking version outperforms original shifter, especially in larger poloidal domains





- Performance breakdown of GTS shifter routine using 4 OpenMP threads per MPI process with varying domain decomposition and particles per cell on Franklin Cray XT4.
- MPI communication in the shift phase uses a **toroidal MPI communicator** (constantly 128).
- Large performance differences in the 256 MPI run compared to 2048 MPI run!
- Speed-Up is expected to be higher on larger GTS runs with hundreds of thousands CPUs since MPI communication is more expensive.



🔁 H L R S 🕷

Slides, courtesy of Alice Koniges, NERSC, LBNL



## **OpenMP/DSM**

- Distributed shared memory (DSM) //
- Distributed virtual shared memory (DVSM) //
- Shared virtual memory (SVM)
- Principles
  - emulates a shared memory
  - on distributed memory hardware
- Implementations
  - e.g., Intel® Cluster OpenMP



Hybrid Parallel Programming Slide 119 / 169





## Intel<sup>®</sup> Compilers with Cluster OpenMP Consistency Protocol

Basic idea:

- Between OpenMP barriers, data exchange is not necessary, i.e., visibility of data modifications to other threads only after synchronization.
- When a page of sharable memory is not up-to-date, it becomes *protected*.
- Any access then faults (SIGSEGV) into Cluster OpenMP runtime library, which requests info from remote nodes and updates the page.
- Protection is removed from page.
- Instruction causing the fault is re-started, this time successfully accessing the data.



Hybrid Parallel Programming Slide 120 / 169



## Comparison: MPI based parallelization $\leftarrow \rightarrow$ DSM

- MPI based:
  - Potential of boundary exchange between two domains in one large message
    - $\rightarrow$  Dominated by **bandwidth** of the network
- DSM based (e.g. Intel<sup>®</sup> Cluster OpenMP):
  - Additional latency based overhead in each barrier
    - $\rightarrow$  May be marginal
  - Communication of updated data of pages
    - $\rightarrow$  Not all of this data may be needed
    - $\rightarrow$  i.e., too much data is transferred
    - → Packages may be to small
    - → Significant latency
  - Communication not oriented on boundaries of a domain decomposition
    - → probably more data must be transferred than necessary

Hybrid Parallel Programming Slide 121 / 169

Rabenseifner, Hager, Jost

by rule of thumb: Communication may be 10 times slower than with MPI

R S

Н

OpenMP only

## **Comparing results with heat example**

skipped



• Normal OpenMP on shared memory (ccNUMA) NEC TX-7



## Heat example: Cluster OpenMP Efficiency



Cluster OpenMP on a Dual-Xeon cluster
 Efficiency only with small
 communication foot-print



Hybrid Parallel Programming Slide 123 / 169

skipped





## Back to the mixed model – an Example



## No silver bullet



- The analyzed programming models do **not** fit on hybrid architectures
  - whether drawbacks are minor or major
    - depends on applications' needs
  - But there are major opportunities  $\rightarrow$  next section
- In the NPB-MZ case-studies
  - We tried to use optimal parallel environment
    - for pure MPI
    - for hybrid MPI+OpenMP
  - i.e., the developers of the MZ codes and we tried to minimize the mismatch problems

 $\rightarrow$  the opportunities in next section dominated the comparisons



## Outline



- Introduction / Motivation
- Programming models on clusters of SMP nodes
- Case Studies / pure MPI vs hybrid MPI+OpenMP
- Practical "How-To" on hybrid programming
- Mismatch Problems
- Opportunities: Application categories that can benefit from hybrid parallelization
- Thread-safety quality of MPI libraries
- Tools for debugging and profiling MPI+OpenMP
- Other options on clusters of SMP nodes
- Summary



## **Nested Parallelism**



- Example NPB: BT-MZ (Block tridiagonal simulated CFD application)
  - Outer loop:
    - limited number of zones
    - zones with different workload  $\rightarrow$  speedup <  $\frac{\text{Sum of workload of all zones}}{\text{Max workload of a zone}}$
  - Inner loop:
    - OpenMP parallelized (static schedule)
    - Not suitable for distributed memory parallelization
- Principles:
  - Limited parallelism on outer level
  - Additional inner level of parallelism
  - Inner level not suitable for MPI
  - Inner level may be suitable for static OpenMP worksharing



- $\rightarrow$  limited parallelism

## Load-Balancing (on same or different level of parallelism)



- OpenMP enables
  - Cheap dynamic and guided load-balancing
  - Just a parallelization option (clause on omp for / do directive)
  - Without additional software effort
  - Without explicit data movement
- On MPI level

- #pragma omp parallel for schedule(dynamic)
  for (i=0; i<n; i++) {
   /\* poorly balanced iterations \*/ ...</pre>
- Dynamic load balancing requires moving of parts of the data structure through the network
- Significant runtime overhead
- Complicated software / therefore not implemented
- MPI & OpenMP
  - Simple static load-balancing on MPI level, dynamic or guided on OpenMP level

medium quality cheap implementation

Hybrid Parallel Programming Slide 128 / 169

Rabenseifner, Hager, Jost

### **Memory consumption**



Shared nothing

- Heroic theory
- In practice: Some data is duplicated

#### MPI & OpenMP

With n threads per MPI process:

- Duplicated data may be reduced by factor n



Hybrid Parallel Programming Slide 129 / 169





## Case study: MPI+OpenMP memory usage of NPB



Hongzhang Shan, Haoqiang Jin, Karl Fuerlinger, Alice Koniges, Nicholas J. Wright: Analyzing the Effect of Different Programming Models Upon Performance and Memory Usage on Cray XT5 Platorms. Proceedings, CUG 2010, Edinburgh, GB, May 24-27, 2010.





## Memory consumption (continued)

• Future:

With 100+ cores per chip the memory per core is limited.

- Data reduction through usage of shared memory may be a key issue
- Domain decomposition on each hardware level
  - Maximizes
    - Data locality
    - Cache reuse
  - Minimizes
    - ccNUMA accesses
    - Message passing
- No halos between domains inside of SMP node
  - Minimizes
    - Memory consumption



## How many threads per MPI process?



- SMP node = with **m sockets** and **n cores/socket**
- How many threads (i.e., cores) per MPI process?
  - Too many threads per MPI process
    - $\rightarrow$  overlapping of MPI and computation may be necessary,
    - → some NICs unused?
  - Too few threads
    - $\rightarrow$  too much memory consumption (see previous slides)
- Optimum
  - somewhere between 1 and m x n threads per MPI process,
  - Typically:
    - Optimum = n, i.e., 1 MPI process per socket
    - Sometimes = n/2 i.e., 2 MPI processes per socket
    - Seldom = 2n, i.e., each MPI process on 2 sockets



# Opportunities, if MPI speedup is limited due to algorithmic problems



- Algorithmic opportunities due to larger physical domains inside of each MPI process
  - $\rightarrow$  If multigrid algorithm only inside of MPI processes
  - → If separate preconditioning inside of MPI nodes and between MPI nodes
  - $\rightarrow$  If MPI domain decomposition is based on physical zones







## To overcome MPI scaling problems

- Reduced number of MPI messages, reduced aggregated message size
   Compared to pure MPI
- MPI has a few scaling problems
  - Handling of more than 10,000 MPI processes
  - Irregular Collectives: MPI\_....v(), e.g. MPI\_Gatherv()
    - Scaling applications should not use MPI\_....v() routines
  - MPI-2.1 Graph topology (MPI\_Graph\_create)
    - MPI-2.2 MPI\_Dist\_graph\_create\_adjacent
  - Creation of sub-communicators with MPI\_Comm\_create
    - > MPI-2.2 introduces a new scaling meaning of MPI\_Comm\_create
  - ... see P. Balaji, et al.: MPI on a Million Processors. Proceedings EuroPVM/MPI 2009.
- Hybrid programming reduces all these problems (due to a smaller number of processes)





## Summary: Opportunities of hybrid parallelization (MPI & OpenMP)

Nested Parallelism

 $\rightarrow$  Outer loop with MPI / inner loop with OpenMP

• Load-Balancing

 $\rightarrow$  Using OpenMP *dynamic* and *guided* worksharing

- Memory consumption
  - $\rightarrow$  Significantly reduction of replicated data on MPI level
- Opportunities, if MPI speedup is limited due to algorithmic problem
   → Significantly reduced number of MPI processes
- Reduced MPI scaling problems
  - $\rightarrow$  Significantly reduced number of MPI processes



## Outline



- Introduction / Motivation
- Programming models on clusters of SMP nodes
- Case Studies / pure MPI vs hybrid MPI+OpenMP
- Practical "How-To" on hybrid programming
- Mismatch Problems
- Opportunities: Application categories that can benefit from hybrid parallelization

### Thread-safety quality of MPI libraries

- Tools for debugging and profiling MPI+OpenMP
- Other options on clusters of SMP nodes
- Summary



Hybrid Parallel Programming Slide 136 / 169





## **Thread-safety of MPI Libraries**

- Make most powerful usage of hierarchical structure of hardware:
- Efficient programming of clusters of SMP nodes
   SMP nodes:
  - Dual/multi core CPUs
  - Multi CPU shared memory
  - Multi CPU ccNUMA
  - Any mixture with shared memory programming model



- No restriction to the usage of OpenMP for intranode-parallelism:
  - OpenMP does not (yet) offer binding threads to processors
  - OpenMP does not guarantee thread-ids to stay fixed.
- OpenMP is based on the implementation dependant thread-library: LinuxThreads, NPTL, SolarisThreads.

Hybrid Parallel Programming Slide 137 / 169 Ra



## **MPI rules with OpenMP / Automatic SMP-parallelization**



Special MPI-2 Init for multi-threaded MPI processes:

```
int MPI_Init_thread( int * argc, char ** argv[],
                      int thread level required,
                      int * thead level provided);
int MPI_Query_thread( int * thread_level_provided);
int MPI Is main thread(int * flag);
```

REQUIRED values (increasing order):



– MPI THREAD SINGLE: Only one thread will execute THREAD MASTERONLY: MPI processes may be multi-threaded, (virtual value, but only master thread will make MPI-calls not part of the standard) AND only while other threads are sleeping - MPI THREAD FUNNELED: Only master thread will make MPI-calls - MPI THREAD SERIALIZED: Multiple threads may make MPI-calls, but only one at a time – MPI THREAD MULTIPLE: Multiple threads may call MPI, with no restrictions returned provided may be less than REQUIRED by the application Hybrid Parallel Programming ΗL Rabenseifner, Hager, Jost Slide 138 / 169

## Calling MPI inside of OMP MASTER

- Inside of a parallel region, with "**OMP MASTER**"
- Requires MPI THREAD FUNNELED, i.e., only master thread will make MPI-calls
- **Caution:** There isn't any synchronization with "OMP MASTER"! Therefore, "OMP BARRIER" normally necessary to guarantee, that data or buffer space from/for other threads is available before/after the MPI call!

**!\$OMP BARRIER !**\$OMP MASTER call MPI\_Xxx(...) **!\$OMP END MASTER !\$OMP BARRIER** 

#pragma omp barrier #pragma omp master MPI\_Xxx(...);

#pragma omp barrier

- But this implies that all other threads are sleeping!
- The additional barrier implies also the necessary cache flush!





## ... the barrier is necessary – example with MPI\_Recv



**!**\$OMP PARALLEL **!\$OMP DO** do i=1,1000 a(i) = buf(i)end do **!\$OMP END DO NOWAIT !**SOMP BARRIER **!**\$OMP MASTER call MPI\_RECV(buf,...) **!\$OMP END MASTER !**SOMP BARRIER **!\$OMP DO** do i=1,1000 c(i) = buf(i)end do **!\$OMP END DO NOWAIT !\$OMP END PARALLEL** 

skipped

Hybrid Parallel ProgrammingSlide 140 / 169Rabenseifner, Hager, Jost

#pragma omp parallel

#pragma omp for nowait
 for (i=0; i<1000; i++)
 a[i] = buf[i];</pre>

#pragma omp for nowait
 for (i=0; i<1000; i++)
 c[i] = buf[i];</pre>

/\* omp end parallel \*/





## Thread support in MPI libraries

• The following MPI libraries offer thread support:

| Implementation | Thread support level                          |
|----------------|-----------------------------------------------|
| MPlch-1.2.7p1  | Always announces MPI_THREAD_FUNNELED.         |
| MPIch2-1.0.8   | ch3:sock supports MPI_THREAD_MULTIPLE         |
|                | ch:nemesis has "Initial Thread-support"       |
| MPIch2-1.1.0a2 | ch3:nemesis (default) has MPI_THREAD_MULTIPLE |
| Intel MPI 3.1  | Full mpi_thread_multiple                      |
| SciCortex MPI  | MPI_THREAD_FUNNELED                           |
| HP MPI-2.2.7   | Full MPI_THREAD_MULTIPLE (with libmtmpi)      |
| SGI MPT-1.14   | Not thread-safe?                              |
| IBM MPI        | Full mpi_thread_multiple                      |
| Nec MPI/SX     | MPI_THREAD_SERIALIZED                         |

 Testsuites for thread-safety may still discover bugs in the MPI libraries

Hybrid Parallel Programming Slide 141 / 169 Rabenseifner, Hager, Jost





## Thread support within Open MPI

• In order to enable thread support in Open MPI, configure with:

```
configure --enable-mpi-threads
```

- This turns on:
  - Support for full MPI\_THREAD\_MULTIPLE
  - internal checks when run with threads (--enable-debug)

```
configure --enable-mpi-threads --enable-progress-threads
```

- This (additionally) turns on:
  - Progress threads to asynchronously transfer/receive data per network BTL.
- Additional Feature:
  - Compiling with debugging support, but without threads will check for recursive locking

Hybrid Parallel Programming Slide 142 / 169 Rat



## Outline



- Introduction / Motivation
- Programming models on clusters of SMP nodes
- Case Studies / pure MPI vs hybrid MPI+OpenMP
- Practical "How-To" on hybrid programming
- Mismatch Problems
- Opportunities: Application categories that can benefit from hybrid parallelization
- Thread-safety quality of MPI libraries

### Tools for debugging and profiling MPI+OpenMP

- Other options on clusters of SMP nodes
- Summary



Hybrid Parallel Programming Slide 143 / 169



## Thread Correctness – Intel ThreadChecker 1/3



- Intel ThreadChecker operates in a similar fashion to helgrind,
- Compile with -tcheck, then run program using tcheck\_cl:

Application finished

| es <br> <br> |
|--------------|
|              |
|              |
|              |
|              |
|              |
| ea           |
| e.           |
|              |
|              |
|              |
|              |
|              |
|              |





#### Thread Correctness – Intel ThreadChecker 2/3

#### One may output to HTML:

tcheck\_cl --format HTML --report pthread\_race.html pthread\_race

| ID | Short<br>Description           | Severity<br>Name | Count | Context[Best]       | Description                                                                                                                                                                   | 1st Access[Best]    | 2nd Access[Best]    |
|----|--------------------------------|------------------|-------|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------|---------------------|
| 1  | Write -><br>Write<br>data-race | Error            | 1     | "pthread_race.c":25 | Memory write of<br>global_variable at<br>"pthread_race.c":31<br>conflicts with a prior<br>memory write of<br>global_variable at<br>"pthread_race.c":31<br>(output dependence) | "pthread_race.c":31 | "pthread_race.c":31 |
| 2  | Thread<br>termination          | Information      | 1     | Whole Program 1     | Thread termination at<br>"pthread_race.c":43<br>includes stack<br>allocation of 8,004 MB<br>and use of 4,672 KB                                                               | "pthread_race.c":43 | "pthread_race.c":43 |
| 3  | Thread<br>termination          | Information      | 1     | Whole Program 2     | Thread termination at<br>"pthread_race.c":43<br>includes stack<br>allocation of 8,004 MB<br>and use of 4,672 KB                                                               | "pthread_race.c":43 | "pthread_race.c":43 |
| 4  | Thread<br>termination          | Information      | 1     | Whole Program 3     | Thread termination at<br>"pthread_race.c":37 -<br>includes stack<br>allocation of 8 MB and<br>use of 4,25 KB                                                                  | "pthread_race.c":37 | "pthread_race.c":37 |

Hybrid Parallel Programming Slide 145 / 169

Courtesy of Rainer Keller, HLRS and ORNL

5



#### Thread Correctness – Intel ThreadChecker 3/3

• If one wants to compile with threaded Open MPI (option for IB):

configure --enable-mpi-threads --enable-debug --enable-mca-no-build=memory-ptmalloc2 CC=icc F77=ifort FC=ifort CFLAGS=`-debug all -inline-debug-info tcheck' CXXFLAGS=`-debug all -inline-debug-info tcheck' FFLAGS=`-debug all -tcheck' LDFLAGS=`tcheck'

```
• Then run with:
```



Hybrid Parallel Programming Slide 146 / 169 Rabenseifner, Hager, Jost Rabenseifner, Hager, Jost HLR S



#### **Performance Tools Support for Hybrid Code**

• Paraver examples have already been shown, tracing is done with linking against (closed-source) omptrace **Or** ompitrace





Courtesy of Rainer Keller, HLRS and ORNL



#### Scalasca – Example "Wait at Barrier"



Hybrid Parallel Programming Slide 148 / 169 Rabenseifner, Hager, Jost





#### Scalasca – Example "Wait at Barrier", Solution



Hybrid Parallel Programming Slide 149 / 169 Rabenseifner, Hager, Jost

Screenshots, courtesy of KOJAK JSC, FZ Jülich

### Outline



- Introduction / Motivation
- Programming models on clusters of SMP nodes
- Case Studies / pure MPI vs hybrid MPI+OpenMP
- Practical "How-To" on hybrid programming
- Mismatch Problems
- Opportunities: Application categories that can benefit from hybrid parallelization
- Thread-safety quality of MPI libraries
- Tools for debugging and profiling MPI+OpenMP

#### Other options on clusters of SMP nodes

• Summary



Hybrid Parallel Programming Slide 150 / 169





# How to achieve a hierarchical domain decomposition (DD)?



- Cartesian grids:
  - Several levels of subdivide
  - Ranking of MPI\_COMM\_WORLD three choices:
    - a) Sequential ranks through original data structure
      - + locating these ranks correctly on the hardware
        - can be achieved with one-level DD on finest grid
           + special startup (mpiexec) with optimized rank-mapping
    - b) Sequential ranks in comm\_cart (from MPI\_CART\_CREATE)
      - requires optimized MPI\_CART\_CREATE, or special startup (mpiexec) with optimized rank-mapping
    - c) Sequential ranks in MPI\_COMM\_WORLD
      - + additional communicator with sequential ranks in the data structure
      - + self-written and optimized rank mapping.
- Unstructured grids:
  - $\rightarrow$  next slide

Hybrid Parallel Programming Slide 152 / 169



# How to achieve a hierarchical domain decomposition (DD)?



- Unstructured grids:
  - Multi-level DD:
    - Top-down: Several levels of (Par)Metis
    - Bottom-up: Low level DD + higher level recombination
  - Single-level DD (finest level)
    - Analysis of the communication pattern in a first run (with only a few iterations)
    - Optimized rank mapping to the hardware before production run
    - E.g., with CrayPAT + CrayApprentice



Hybrid Parallel Programming Slide 153 / 169





#### Top-down – several levels of (Par)Metis



Steps:

- Load-balancing (e.g., with ParMetis) on outer level, i.e., between all SMP nodes
  - Independent (Par)Metis inside of each node
- Metis inside of each socket
- Subdivide does not care on balancing of the outer boundary
- processes can get a lot of neighbors with inter-node communication
- unbalanced communication

R S





#### Bottom-up – Multi-level DD through recombination



#### **Profiling solution**



- First run with profiling
  - Analysis of the communication pattern
- Optimization step
  - Calculation of an optimal mapping of ranks in MPI\_COMM\_WORLD to the hardware grid (physical cores / sockets / SMP nodes)
- Restart of the application with this optimized locating of the ranks on the hardware grid
- Example: CrayPat and CrayApprentice



Hybrid Parallel Programming Slide 156 / 169





The vendors will

(or must) deliver

scalable MPI

libraries for their

largest systems!

### Scalability of MPI to hundreds of thousands ...

Weak scalability of pure MPI

- As long as the application does not use
  - MPI\_ALLTOALL
  - MPI\_<collectives>V (i.e., with length arrays)

and application

- distributes all data arrays
- one can expect:
  - Significant, but still scalable memory overhead for halo cells.
  - MPI library is internally scalable:
    - E.g., mapping ranks → hardware grid
      - Centralized storing in shared memory (OS level)
      - In each MPI process, only used neighbor ranks are stored (cached) in process-local memory.
    - Tree based algorithm wiith O(log N)
      - From 1000 to 1000,000 process O(Log N) only doubles!

Hybrid Parallel Programming Slide 157 / 169

Rabenseifner, Hager, Jost

r, Jost

HLRS





R S

ΗL

#### **Remarks on Cache Optimization**

- After all parallelization domain decompositions (DD, up to 3 levels) are done:
- Additional DD into data blocks
  - that fit to 2<sup>nd</sup> or 3<sup>rd</sup> level cache.
  - It is done inside of each MPI process (on each core).
  - Outer loops over these blocks
  - Inner loops inside of a block
  - Cartesian example: 3-dim loop is split into
    - do i\_block=1,ni,stride\_i do j\_block=1,nj,stride\_j do k\_block=1,nk,stride\_k do i=i\_block,min(i\_block+stride\_i-1, ni) do j=j\_block,min(j\_block+stride\_j-1, nj) do k=k\_block,min(k\_block+stride\_k-1, nk)  $a(i,j,k) = f(b(i\pm0,1,2, j\pm0,1,2, k\pm0,1,2))$ ... ... end do Access to 13-point stencil

end do

Hybrid Parallel Programming Slide 158 / 169



#### **Remarks on Cost-Benefit Calculation**

Costs

- for optimization effort
  - e.g., additional OpenMP parallelization
  - e.g., 3 person month x 5,000 € = 15,000 € (full costs)

Benefit

- from reduced CPU utilization
  - e.g., Example 1:
    - 100,000 € hardware costs of the cluster
    - x 20% used by this application over whole lifetime of the cluster
    - x 7% performance win through the optimization
    - = 1,400 € → total loss = 13,600 €
  - e.g., Example 2:
     **10 Mio € system** x 5% used x 8% performance win
     = 40,000 € → total win = 25,000 €





- Parallelization always means
  - expressing locality.

skipped

- If the application has no locality,
  - Then the parallelization needs not to model locality
  - $\rightarrow$  UPC with its round robin data distribution may fit
- If the application has locality,
  - then it must be expressed in the parallelization
- Coarray Fortran (CAF) expresses data locality explicitly through "codimension":
  - A(17,15)[3]
    - = element A(17,13) in the distributed array A in process with rank 3





- Future shrinking of memory per core implies
  - Communication time becomes a bottleneck
  - → Computation and communication must be overlapped, i.e., latency hiding is needed
- With PGAS, halos are not needed.

skipped

- But it is hard for the compiler to access data such early that the transfer can be overlapped with enough computation.
- With MPI, typically too large message chunks are transferred.
  - This problem also complicates overlapping.
- Strided transfer is expected to be slower than contiguous transfers
  - Typical packing strategies do not work for PGAS on compiler level
  - Only with MPI, or with explicit application programming with PGAS





- Point-to-point neighbor communication
  - PGAS or MPI nonblocking may fit if message size makes sense for overlapping.
- Collective communication
  - Library routines are best optimized
  - Non-blocking collectives (comes with MPI-3.0) versus calling MPI from additional communication thread
  - Only blocking collectives in PGAS library?



skipped

Hybrid Parallel Programming Slide 162 / 169





- For extreme HPC (many nodes x many cores)
  - Most parallelization may still use MPI
  - Parts are optimized with PGAS, e.g., for better latency hiding
  - PGAS efficiency is less portable than MPI
  - #ifdef ... PGAS
  - Requires mixed programming PGAS & MPI
    - $\rightarrow$  will be addressed by MPI-3.0

Hybrid Parallel Programming Slide 163 / 169

skipped



### Outline



- Introduction / Motivation
- Programming models on clusters of SMP nodes
- Case Studies / pure MPI vs hybrid MPI+OpenMP
- Practical "How-To" on hybrid programming
- Mismatch Problems
- Opportunities: Application categories that can benefit from hybrid parallelization
- Thread-safety quality of MPI libraries
- Tools for debugging and profiling MPI+OpenMP
- Other options on clusters of SMP nodes

#### Summary



Hybrid Parallel Programming Slide 164 / 169





#### Acknowledgements

- We want to thank
  - Gerhard Wellein, RRZE
  - Alice Koniges, NERSC, LBNL
  - Rainer Keller, HLRS and ORNL
  - Jim Cownie, Intel
  - KOJAK project at JSC, Research Center Jülich
  - HPCMO Program and the Engineer Research and Development Center Major Shared Resource Center, Vicksburg, MS (http://www.erdc.hpc.mil/index)



Hybrid Parallel Programming Slide 165 / 169







#### MPI + OpenMP

- Significant opportunity  $\rightarrow$  higher performance on smaller number of threads
- Seen with NPB-MZ examples
  - BT-MZ  $\rightarrow$  strong improvement (as expected)
  - SP-MZ  $\rightarrow$  small improvement (none was expected)
- Usable on higher number of cores
- Advantages
  - Load balancing
  - Memory consumption
  - Two levels of parallelism
    - Outer  $\rightarrow$  distributed memory  $\rightarrow$  halo data transfer  $\rightarrow$  MPI
    - Inner  $\rightarrow$  shared memory  $\rightarrow$  ease of SMP parallelization  $\rightarrow$  OpenMP
- You can do it  $\rightarrow$  "How To"





# Summary – the bad news ,

#### MPI+OpenMP: There is a huge amount of pitfalls

- Pitfalls of MPI
- Pitfalls of OpenMP
  - − On ccNUMA  $\rightarrow$  e.g., first touch
  - Pinning of threads on cores
- Pitfalls through combination of MPI & OpenMP
  - E.g., topology and mapping problems
  - Many mismatch problems
- Tools are available 😁
  - It is not easier than analyzing pure MPI programs  $\bigcirc$
- Most hybrid programs  $\rightarrow$  Masteronly style
- Overlapping communication and computation with several threads
  - Requires thread-safety quality of MPI library
  - Loss of OpenMP worksharing support → using OpenMP tasks as workaround

Hybrid Parallel Programming Slide 167 / 169



#### Summary – good and bad



- Optimization
  - 1 MPI process mismatch 1 MPI process problem per core ...... ^– somewhere between may be the optimum
- Efficiency of MPI+OpenMP is not for free:
   The efficiency strongly depends on
   the amount of work in the source code development

Hybrid Parallel Programming Slide 168 / 169





# Alternatives

SC10 New Orteons, LA

#### Pure MPI

- + Ease of use
- Topology and mapping problems may need to be solved (depends on loss of efficiency with these problems)
- Number of cores may be more limited than with MPI+OpenMP
- + Good candidate for perfectly load-balanced applications

#### Pure OpenMP

- + Ease of use
- Limited to problems with tiny communication footprint
- source code modifications are necessary (Variables that are used with "shared" data scope must be allocated as "sharable")
- ± (Only) for the <u>appropriate</u> application a suitable tool





#### Summary



R

- This tutorial tried to
  - help to negotiate obstacles with hybrid parallelization,
  - give hints for the design of a hybrid parallelization,
  - and technical hints for the implementation  $\rightarrow$  "How To",
  - show tools if the application does not work as designed.
- This tutorial was not an introduction into other parallelization models:
  - Partitioned Global Address Space (PGAS) languages (Unified Parallel C (UPC), Co-array Fortran (CAF), Chapel, Fortress, Titanium, and X10).
  - High Performance Fortran (HPF)
  - → Many rocks in the cluster-of-SMP-sea do not vanish into thin air by using new parallelization models
  - $\rightarrow$  Area of interesting research in next years

Hybrid Parallel Programming Slide 170 / 169



RS

Н

#### Conclusions

- Future hardware will be more complicated
  - − Heterogeneous  $\rightarrow$  GPU, FPGA, ...
  - ccNUMA quality may be lost on cluster nodes
  - ...
- High-end programming  $\rightarrow$  more complex
- Medium number of cores → more simple (if #cores / SMP-node will not shrink)
- MPI+OpenMP → work horse on large systems
- Pure MPI → still on smaller cluster
- OpenMP → on large ccNUMA nodes (not ClusterOpenMP)

Thank you for your interest —

#### Q & A Please fill in the feedback sheet – Thank you

Hybrid I Slide 17

Hybrid Parallel Programming Slide 171 / 169



# Appendix

- Abstract
- Authors
- References (with direct relation to the content of this tutorial)
- Further references



Hybrid Parallel Programming Slide 172 F



#### Abstract



Half-Day Tutorial (Level: 20% Introductory, 50% Intermediate, 30% Advanced)

Authors. Rolf Rabenseifner, HLRS, University of Stuttgart, Germany Georg Hager, University of Erlangen-Nuremberg, Germany Gabriele Jost, Texas Advanced Computing Center, The University of Texas at Austin, USA

**Abstract.** Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with single/multi-socket and multi-core SMP nodes, but also "constellation" type systems with large SMP nodes. Parallel programming may combine the distributed memory parallelization on the node inter-connect with the shared memory parallelization inside of each node.

This tutorial analyzes the strength and weakness of several parallel programming models on clusters of SMP nodes. Various hybrid MPI+OpenMP programming models are compared with pure MPI. Benchmark results of several platforms are presented. The thread-safety quality of several existing MPI libraries is also discussed. Case studies will be provided to demonstrate various aspects of hybrid MPI/OpenMP programming. Another option is the use of distributed virtual shared-memory technologies. Application categories that can take advantage of hybrid programming are identified. Multi-socket-multi-core systems in highly parallel environments are given special consideration.

Details. https://fs.hlrs.de/projects/rabenseifner/publ/SC2010-hybrid.html

Hybrid Parallel Programming Slide 173



#### **Rolf Rabenseifner**





Dr. Rolf Rabenseifner studied mathematics and physics at the University of Stuttgart. Since 1984, he has worked at the High-Performance Computing-Center Stuttgart (HLRS). He led the projects DFN-RPC, a remote procedure call tool, and MPI-GLUE, the first metacomputing MPI combining different vendor's MPIs without loosing the full MPI interface. In his dissertation, he developed a controlled logical clock as global time for trace-based profiling of parallel and distributed applications. Since 1996, he has been a member of the MPI-2 Forum and since Dec. 2007, he is in the steering committee of the MPI-3 Forum. From January to April 1999, he was an invited researcher at the Center for High-Performance Computing at Dresden University of Technology. Currently, he is head of Parallel Computing - Training and Application Services at HLRS. He is involved in MPI profiling and benchmarking, e.g., in the HPC Challenge Benchmark Suite. In recent projects, he studied parallel I/O, parallel programming models for clusters of SMP nodes, and optimization of MPI collective routines. In workshops and summer schools, he teaches parallel programming models in many universities and labs in Germany.





#### **Georg Hager**





Georg Hager holds a PhD in computational physics from the University of Greifswald. He has been working with high performance systems since 1995, and is now a senior research scientist in the HPC group at Erlangen Regional Computing Center (RRZE). His daily work encompasses all aspects of HPC user support and training, assessment of novel system and processor architectures, and supervision of student projects and theses. Recent research includes architecture-specific optimization for current microprocessors, performance modeling on processor and system levels, and the efficient use of hybrid parallel systems. A full list of publications, talks, and other HPC-related stuff he is interested in can be found in his blog: <u>http://blogs.fau.de/hager</u>.

Hybrid Parallel Programming Slide 175



#### **Gabriele Jost**





Gabriele Jost obtained her doctorate in Applied Mathematics from the University of Göttingen, Germany. For more than a decade she worked for various vendors (Suprenum GmbH, Thinking Machines Corporation, and NEC) of high performance parallel computers in the areas of vectorization, parallelization, performance analysis and optimization of scientific and engineering applications.

In 2005 she moved from California to the Pacific Northwest and joined Sun Microsystems as a staff engineer in the Compiler Performance Engineering team, analyzing compiler generated code and providing feedback and suggestions for improvement to the compiler group. She then decided to explore the world beyond scientific computing and joined Oracle as a Principal Engineer working on performance analysis for application server software. That was fun, but she realized that her real passions remains in area of performance analysis and evaluation of programming paradigms for high performance computing and that she really liked California. She is now a Research Scientist at the Texas Advanced Computing Center (TACC), working remotely from Monterey, CA on all sorts of exciting projects related to large scale parallel processing for scientific computing.

Hybrid Parallel Programming Slide 176





#### **References** (with direct relation to the content of this tutorial)

- NAS Parallel Benchmarks: http://www.nas.nasa.gov/Resources/Software/npb.html
- R.v.d. Wijngaart and H. Jin, NAS Parallel Benchmarks, Multi-Zone Versions, NAS Technical Report NAS-03-010, 2003
- H. Jin and R. v.d.Wijngaart, Performance Characteristics of the multi-zone NAS Parallel Benchmarks, Proceedings IPDPS 2004
- G. Jost, H. Jin, D. an Mey and F. Hatay,
   Comparing OpenMP, MPI, and Hybrid Programming, Proc. Of the 5th European Workshop on OpenMP, 2003
- E. Ayguade, M. Gonzalez, X. Martorell, and G. Jost,
   Employing Nested OpenMP for the Parallelization of Multi-Zone CFD Applications, Proc. Of IPDPS 2004





Rolf Rabenseifner, Hybrid Parallel Programming on HPC Platforms. In proceedings of the Fifth European Workshop on OpenMP, EWOMP '03,

Aachen, Germany, Sept. 22-26, 2003, pp 185-194, www.compunity.org.

• Rolf Rabenseifner,

**Comparison of Parallel Programming Models on Clusters of SMP Nodes.** In proceedings of the 45nd Cray User Group Conference, CUG SUMMIT 2003, May 12-16, Columbus, Ohio, USA.

• Rolf Rabenseifner and Gerhard Wellein,

Comparison of Parallel Programming Models on Clusters of SMP Nodes.
In Modelling, Simulation and Optimization of Complex Processes (Proceedings of the International Conference on High Performance Scientific Computing, March 10-14, 2003, Hanoi, Vietnam) Bock, H.G.; Kostina, E.; Phu, H.X.; Rannacher, R. (Eds.), pp 409-426, Springer, 2004.

 Rolf Rabenseifner and Gerhard Wellein,
 Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures.

In the International Journal of High Performance Computing Applications, Vol. 17, No. 1, 2003, pp 49-62. Sage Science Press.

Hybrid Parallel Programming Slide 178





- Rolf Rabenseifner, Communication and Optimization Aspects on Hybrid Architectures. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, J. Dongarra and D. Kranzlmüller (Eds.), Proceedings of the 9th European PVM/MPI Users' Group Meeting, EuroPVM/MPI 2002, Sep. 29 - Oct. 2, Linz, Austria, LNCS, 2474, pp 410-420, Springer, 2002.
   Rolf Rabenseifner and Gerhard Wellein, Communication and Optimization Aspects of Parallel Programming Medals of
  - Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures.

In proceedings of the Fourth European Workshop on OpenMP (EWOMP 2002), Roma, Italy, Sep. 18-20th, 2002.

 Rolf Rabenseifner,
 Communication Bandwidth of Parallel Programming Models on Hybrid Architectures.

Proceedings of WOMPEI 2002, International Workshop on OpenMP: Experiences and Implementations, part of ISHPC-IV, International Symposium on High Performance Computing, May, 15-17., 2002, Kansai Science City, Japan, LNCS 2327, pp 401-412.







- Georg Hager and Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Press, ISBN 978-1439811924.
- Barbara Chapman et al.: Toward Enhancing OpenMP's Work-Sharing Directives. In proceedings, W.E. Nagel et al. (Eds.): Euro-Par 2006, LNCS 4128, pp. 645-654, 2006.
- Barbara Chapman, Gabriele Jost, and Ruud van der Pas: Using OpenMP.
   The MIT Press, 2008

The MIT Press, 2008.

 Pavan Balaji, Darius Buntinas, David Goodell, William Gropp, Sameer Kumar, Ewing Lusk, Rajeev Thakur and Jesper Larsson Traeff: MPI on a Million Processors.

EuroPVM/MPI 2009, Springer.

- Alice Koniges et al.: **Application Acceleration on Current and Future Cray Platforms.** Proceedings, CUG 2010, Edinburgh, GB, May 24-27, 2010.
- H. Shan, H. Jin, K. Fuerlinger, A. Koniges, N. J. Wright: Analyzing the Effect of Different Programming Models Upon Performance and Memory Usage on Cray XT5 Platorms. Proceedings, CUG 2010, Edinburgh, GB, May 24-27, 2010.







J. Treibig, G. Hager and G. Wellein: LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments.

Proc. of PSTI2010, the First International Workshop on Parallel Software Tools and Tool Infrastructures, San Diego CA, September 13, 2010. Preprint: http://arxiv.org/abs/1004.4431

• H. Stengel:

#### Parallel programming on hybrid hardware: Models and applications.

Master's thesis, Ohm University of Applied Sciences/RRZE, Nuremberg, 2010. http://www.hpc.rrze.uni-erlangen.de/Projekte/hybrid.shtml



Hybrid Parallel Programming Slide 181





#### **Further references**

 Sergio Briguglio, Beniamino Di Martino, Giuliana Fogaccia and Gregorio Vlad, Hierarchical MPI+OpenMP implementation of parallel PIC applications on clusters of Symmetric MultiProcessors,

10th European PVM/MPI Users' Group Conference (EuroPVM/MPI'03), Venice, Italy, 29 Sep - 2 Oct, 2003

 Barbara Chapman,
 Parallel Application Development with the Hybrid MPI+OpenMP Programming Model,

Tutorial, 9th EuroPVM/MPI & 4th DAPSYS Conference, Johannes Kepler University Linz, Austria September 29-October 02, 2002

- Luis F. Romero, Eva M. Ortigosa, Sergio Romero, Emilio L. Zapata, Nesting OpenMP and MPI in the Conjugate Gradient Method for Band Systems, 11th European PVM/MPI Users' Group Meeting in conjunction with DAPSYS'04, Budapest, Hungary, September 19-22, 2004
- Nikolaos Drosinos and Nectarios Koziris, Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs,

10th European PVM/MPI Users' Group Conference (EuroPVM/MPI'03), Venice, Italy, 29 Sep - 2 Oct, 2003

Hybrid Parallel Programming Slide 182





#### **Further references**

 Holger Brunst and Bernd Mohr, Performance Analysis of Large-scale OpenMP and Hybrid MPI/OpenMP Applications with VampirNG Proceedings for IWOMP 2005, Eugene, OR, June 2005. http://www.fz-juelich.de/zam/kojak/documentation/publications/

 Felix Wolf and Bernd Mohr, Automatic performance analysis of hybrid MPI/OpenMP applications Journal of Systems Architecture, Special Issue "Evolutions in parallel distributed and network-based processing", Volume 49, Issues 10-11, Pages 421-439, November 2003.

http://www.fz-juelich.de/zam/kojak/documentation/publications/

 Felix Wolf and Bernd Mohr, Automatic Performance Analysis of Hybrid MPI/OpenMP Applications short version: Proceedings of the 11-th Euromicro Conference on Parallel, Distributed and Network based Processing (PDP 2003), Genoa, Italy, February 2003.

long version: Technical Report FZJ-ZAM-IB-2001-05.

http://www.fz-juelich.de/zam/kojak/documentation/publications/







#### **Further references**

- Frank Cappello and Daniel Etiemble, MPI versus MPI+OpenMP on the IBM SP for the NAS benchmarks, in Proc. Supercomputing'00, Dallas, TX, 2000. http://citeseer.nj.nec.com/cappello00mpi.html www.sc2000.org/techpapr/papers/pap.pap214.pdf
- Jonathan Harris, Extending OpenMP for NUMA Architectures, in proceedings of the Second European Workshop on OpenMP, EWOMP 2000. www.epcc.ed.ac.uk/ewomp2000/proceedings.html
- D. S. Henty,

Performance of hybrid message-passing and shared-memory parallelism for discrete element modeling,

in Proc. Supercomputing'00, Dallas, TX, 2000. http://citeseer.nj.nec.com/henty00performance.html www.sc2000.org/techpapr/papers/pap.pap154.pdf







 Matthias Hess, Gabriele Jost, Matthias Müller, and Roland Rühle, Experiences using OpenMP based on Compiler Directed Software DSM on a PC Cluster,

in WOMPAT2002: Workshop on OpenMP Applications and Tools, Arctic Region Supercomputing Center, University of Alaska, Fairbanks, Aug. 5-7, 2002. http://www.hlrs.de/people/mueller/papers/wompat2002/wompat2002.pdf

• John Merlin,

Distributed OpenMP: Extensions to OpenMP for SMP Clusters,

in proceedings of the Second EuropeanWorkshop on OpenMP, EWOMP 2000. www.epcc.ed.ac.uk/ewomp2000/proceedings.html

- Mitsuhisa Sato, Shigehisa Satoh, Kazuhiro Kusano, and Yoshio Tanaka, Design of OpenMP Compiler for an SMP Cluster, in proceedings of the 1st European Workshop on OpenMP (EWOMP'99), Lund, Sweden, Sep. 1999, pp 32-39. http://citeseer.nj.nec.com/sato99design.html
- Alex Scherer, Honghui Lu, Thomas Gross, and Willy Zwaenepoel, Transparent Adaptive Parallelism on NOWs using OpenMP,
  - in proceedings of the Seventh Conference on Principles and Practice of Parallel Programming (PPoPP '99), May 1999, pp 96-106.



Hybrid Parallel Programming Slide 185 Rat







Weisong Shi, Weiwu Hu, and Zhimin Tang,
 Shared Virtual Memory: A Survey,
 Technical report No. 980005, Center for High Performance Computing,
 Institute of Computing Technology, Chinese Academy of Sciences, 1998,
 www.ict.ac.cn/chpc/dsm/tr980005.ps.

- Lorna Smith and Mark Bull, Development of Mixed Mode MPI / OpenMP Applications, in proceedings of Workshop on OpenMP Applications and Tools (WOMPAT 2000), San Diego, July 2000. www.cs.uh.edu/wompat2000/
- Gerhard Wellein, Georg Hager, Achim Basermann, and Holger Fehske, Fast sparse matrix-vector multiplication for TeraFlop/s computers, in proceedings of VECPAR'2002, 5th Int'l Conference on High Performance Computing and Computational Science, Porto, Portugal, June 26-28, 2002, part I, pp 57-70. http://vecpar.fe.up.pt/



Hybrid Parallel Programming Slide 186





- Agnieszka Debudaj-Grabysz and Rolf Rabenseifner, Load Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes. In proceedings, W. E. Nagel, W. V. Walter, and W. Lehner (Eds.): Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference, Aug. 29 - Sep. 1, Dresden, Germany, LNCS 4128, Springer, 2006.
- Agnieszka Debudaj-Grabysz and Rolf Rabenseifner, Nesting OpenMP in MPI to Implement a Hybrid Communication Method of Parallel Simulated Annealing on a Cluster of SMP Nodes. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, Beniamino Di Martino, Dieter Kranzlmüller, and Jack Dongarra (Eds.), Proceedings

of the 12th European PVM/MPI Users' Group Meeting, EuroPVM/MPI 2005,

Sep. 18-21, Sorrento, Italy, LNCS 3666, pp 18-27, Springer, 2005



Hybrid Parallel Programming Slide 187 R



#### Content

| <ul> <li>Introduction / Motivation</li></ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>Major programming models</li> <li>Pure MPI</li> <li>Hybrid Masteronly Style</li> <li>Overlapping Communication and Computation</li> <li>Pure OpenMP</li> <li>Case Studies / pure MPI vs. hybrid MPI+OpenMP</li> <li>The Multi-Zone NAS Parallel Benchmarks</li> <li>The Multi-Zone NAS Parallel Benchmarks</li> <li>Benchmark Architectures</li> <li>Benchmark Architectures</li> <li>On the Sun Constellation Cluster Ranger</li> <li>On a Cray XT5 cluster</li> <li>On a Cray XT5 cluster</li> <li>On a IBM Power6 system</li> <li>Conclusions</li> <li>Practical "How-To" on hybrid programming48</li> <li>How to compile, link and run</li> </ul> |
| <ul> <li>Pure MPI</li> <li>Hybrid Masteronly Style</li> <li>Overlapping Communication and Computation</li> <li>Pure OpenMP</li> <li>Case Studies / pure MPI vs. hybrid MPI+OpenMP</li> <li>The Multi-Zone NAS Parallel Benchmarks</li> <li>The Multi-Zone NAS Parallel Benchmarks</li> <li>Benchmark Architectures</li> <li>Benchmark Architectures</li> <li>On the Sun Constellation Cluster Ranger</li> <li>On the Sun Constellation Cluster Ranger</li> <li>NUMA Control (numactl)</li> <li>On a Cray XT5 cluster</li> <li>On a Cray XT4 cluster</li> <li>On a IBM Power6 system</li> <li>Conclusions</li> <li>How to compile, link and run</li> </ul>      |
| <ul> <li>Hybrid Masteronly Style</li> <li>Overlapping Communication and Computation</li> <li>Pure OpenMP</li> <li>Case Studies / pure MPI vs. hybrid MPI+OpenMP</li> <li>The Multi-Zone NAS Parallel Benchmarks</li> <li>The Multi-Zone NAS Parallel Benchmarks</li> <li>Benchmark Architectures</li> <li>Benchmark Architectures</li> <li>On the Sun Constellation Cluster Ranger</li> <li>On the Sun Constellation Cluster Ranger</li> <li>NUMA Control (numactl)</li> <li>On a Cray XT5 cluster</li> <li>On a Cray XT4 cluster</li> <li>On a IBM Power6 system</li> <li>Conclusions</li> <li>How to compile, link and run</li> </ul>                        |
| <ul> <li>Overlapping Communication and Computation 11</li> <li>Pure OpenMP</li> <li>12</li> <li>Case Studies / pure MPI vs. hybrid MPI+OpenMP . 13</li> <li>The Multi-Zone NAS Parallel Benchmarks</li> <li>Benchmark Architectures</li> <li>Benchmark Architectures</li> <li>On the Sun Constellation Cluster Ranger</li> <li>On the Sun Constellation Cluster Ranger</li> <li>NUMA Control (numactl)</li> <li>On a Cray XT5 cluster</li> <li>On a Cray XT4 cluster</li> <li>On a IBM Power6 system</li> <li>Conclusions</li> <li>Practical "How-To" on hybrid programming</li> <li>Media</li> </ul>                                                          |
| <ul> <li>Pure OpenMP</li> <li>Case Studies / pure MPI vs. hybrid MPI+OpenMP</li> <li>The Multi-Zone NAS Parallel Benchmarks</li> <li>The Multi-Zone NAS Parallel Benchmarks</li> <li>Benchmark Architectures</li> <li>Benchmark Architectures</li> <li>On the Sun Constellation Cluster Ranger</li> <li>On the Sun Constellation Cluster Ranger</li> <li>NUMA Control (numactl)</li> <li>Practical XT5 cluster</li> <li>On a IBM Power6 system</li> <li>Conclusions</li> <li>Practical "How-To" on hybrid programming</li> <li>Mage Architecture</li> <li>How to compile, link and run</li> </ul>                                                              |
| <ul> <li>Case Studies / pure MPI vs. hybrid MPI+OpenMP . 13</li> <li>The Multi-Zone NAS Parallel Benchmarks 14</li> <li>Benchmark Architectures 18</li> <li>On the Sun Constellation Cluster Ranger 20</li> <li>NUMA Control (numactl) 25</li> <li>On a Cray XT5 cluster 31</li> <li>On a Cray XT4 cluster 36</li> <li>On a IBM Power6 system 40</li> <li>Conclusions 47 • O</li> <li>Practical "How-To" on hybrid programming 48</li> <li>How to compile, link and run 50</li> </ul>                                                                                                                                                                          |
| <ul> <li>The Multi-Zone NAS Parallel Benchmarks 14</li> <li>Benchmark Architectures 18</li> <li>On the Sun Constellation Cluster Ranger 20</li> <li>NUMA Control (numactl) 25</li> <li>On a Cray XT5 cluster 31</li> <li>On a Cray XT4 cluster 36</li> <li>On a IBM Power6 system 40</li> <li>Conclusions 47 • O</li> <li>• Practical "How-To" on hybrid programming 48</li> <li>How to compile, link and run 50</li> </ul>                                                                                                                                                                                                                                    |
| <ul> <li>Benchmark Architectures</li> <li>On the Sun Constellation Cluster Ranger</li> <li>NUMA Control (numactl)</li> <li>On a Cray XT5 cluster</li> <li>On a Cray XT4 cluster</li> <li>On a Cray XT4 cluster</li> <li>On a IBM Power6 system</li> <li>Conclusions</li> <li>Practical "How-To" on hybrid programming</li> <li>How to compile, link and run</li> </ul>                                                                                                                                                                                                                                                                                         |
| <ul> <li>On the Sun Constellation Cluster Ranger</li> <li>NUMA Control (numactl)</li> <li>On a Cray XT5 cluster</li> <li>On a Cray XT4 cluster</li> <li>On a Cray XT4 cluster</li> <li>On a IBM Power6 system</li> <li>Conclusions</li> <li>47 • O</li> <li>• Practical "How-To" on hybrid programming</li> <li>48</li> <li>How to compile, link and run</li> </ul>                                                                                                                                                                                                                                                                                            |
| <ul> <li>NUMA Control (numactl)</li> <li>On a Cray XT5 cluster</li> <li>On a Cray XT4 cluster</li> <li>On a Cray XT4 cluster</li> <li>On a IBM Power6 system</li> <li>Conclusions</li> <li>47 • O</li> <li>• Practical "How-To" on hybrid programming</li> <li>48</li> <li>How to compile, link and run</li> </ul>                                                                                                                                                                                                                                                                                                                                             |
| <ul> <li>On a Cray XT5 cluster</li> <li>On a Cray XT4 cluster</li> <li>On a Cray XT4 cluster</li> <li>On a IBM Power6 system</li> <li>Conclusions</li> <li>47 • O</li> <li>• Practical "How-To" on hybrid programming</li> <li>48</li> <li>How to compile, link and run</li> <li>50</li> </ul>                                                                                                                                                                                                                                                                                                                                                                 |
| <ul> <li>On a Cray XT4 cluster</li> <li>On a IBM Power6 system</li> <li>Conclusions</li> <li>Practical "How-To" on hybrid programming48</li> <li>How to compile, link and run</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| <ul> <li>On a IBM Power6 system</li> <li>Conclusions</li> <li>Practical "How-To" on hybrid programming</li> <li>How to compile, link and run</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| <ul> <li>Conclusions 47 • O</li> <li>• Practical "How-To" on hybrid programming 48</li> <li>– How to compile, link and run 50</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| Practical "How-To" on hybrid programming 48     – How to compile, link and run 50                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| <ul> <li>How to compile, link and run</li> <li>50</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| •                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| <ul> <li>Running the code <i>efficiently</i>?</li> <li>57</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| <ul> <li>A short introduction to ccNUMA 59</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| <ul> <li>– ccNUMA Memory Locality Problems / First Touch 63</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| <ul> <li>– ccNUMA problems beyond first touch</li> <li>66</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| <ul> <li>Bandwidth and latency</li> <li>68</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| Hybrid Parallel Programming                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| Slide 188 Rabenseifner, Hager, Jost                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |

|                                                                     | slide      |
|---------------------------------------------------------------------|------------|
| <ul> <li>OpenMP and Threading overhead</li> </ul>                   | 71         |
| <ul> <li>Thread/Process Affinity ("Pinning")</li> </ul>             | 76         |
| <ul> <li>Example: 3D Jacobi Solver</li> </ul>                       | 87         |
| <ul> <li>Example: Sparse Matrix-Vector-Multiply with JDS</li> </ul> | <b>9</b> 0 |
| <ul> <li>Hybrid MPI/OpenMP: "how-to"</li> </ul>                     | 96         |
| Mismatch Problems                                                   | 97         |
| <ul> <li>Topology problem</li> </ul>                                | 99         |
| <ul> <li>Mapping problem with mixed model</li> </ul>                | 106        |
| <ul> <li>Unnecessary intra-node communication</li> </ul>            | 107        |
| <ul> <li>Sleeping threads and network saturation problem</li> </ul> | ı 108      |
| <ul> <li>Additional OpenMP overhead</li> </ul>                      | 109        |
| <ul> <li>Overlapping communication and computation</li> </ul>       | 110        |
| <ul> <li>Communication overhead with DSM</li> </ul>                 | 119        |
| <ul> <li>Back to the mixed model</li> </ul>                         | 124        |
| <ul> <li>No silver bullet</li> </ul>                                | 125        |
| <ul> <li>Opportunities: Application categories that can</li> </ul>  | 126        |
| benefit from hybrid parallelization                                 |            |
| <ul> <li>Nested Parallelism</li> </ul>                              | 127        |
| <ul> <li>Load-Balancing</li> </ul>                                  | 128        |
| <ul> <li>Memory consumption</li> </ul>                              | 129        |
| <ul> <li>Opportunities, if MPI speedup is limited due</li> </ul>    | 133        |
| to algorithmic problem                                              |            |
| <ul> <li>To overcome MPI scaling problems</li> </ul>                | 134        |
| - Summary                                                           | 135        |
| <u> </u>                                                            |            |
|                                                                     |            |
| TRUC                                                                |            |



#### Content

| Thread-safety quality of MPI libraries                          |     |  |  |  |  |  |  |
|-----------------------------------------------------------------|-----|--|--|--|--|--|--|
| <ul> <li>MPI rules with OpenMP</li> </ul>                       | 138 |  |  |  |  |  |  |
| <ul> <li>Thread support of MPI libraries</li> </ul>             | 141 |  |  |  |  |  |  |
| <ul> <li>Thread Support within OpenMPI</li> </ul>               | 142 |  |  |  |  |  |  |
| Tools for debugging and profiling MPI+OpenMP 143                |     |  |  |  |  |  |  |
| <ul> <li>Intel ThreadChecker</li> </ul>                         | 144 |  |  |  |  |  |  |
| <ul> <li>Performance Tools Support for Hybrid Code</li> </ul>   | 146 |  |  |  |  |  |  |
| • Other options on clusters of SMP nodes                        |     |  |  |  |  |  |  |
| <ul> <li>Pure MPI – multi-core aware</li> </ul>                 | 151 |  |  |  |  |  |  |
| <ul> <li>Hierarchical domain decomposition</li> </ul>           | 152 |  |  |  |  |  |  |
| <ul> <li>Scalability of MPI to hundreds of thousands</li> </ul> | 157 |  |  |  |  |  |  |
| <ul> <li>Remarks on Cache Optimization</li> </ul>               | 158 |  |  |  |  |  |  |
| <ul> <li>Remarks on Cost-Benefit Calculation</li> </ul>         | 159 |  |  |  |  |  |  |
| <ul> <li>Remarks on MPI and PGAS (UPC &amp; CAF)</li> </ul>     | 160 |  |  |  |  |  |  |
| • Summary 16                                                    |     |  |  |  |  |  |  |
| <ul> <li>Acknowledgements</li> </ul>                            | 165 |  |  |  |  |  |  |
| – Summaries                                                     | 166 |  |  |  |  |  |  |
| - Conclusions                                                   | 171 |  |  |  |  |  |  |

| Appendix                                                                                  | . 172 |
|-------------------------------------------------------------------------------------------|-------|
| – Abstract                                                                                | 173   |
| – Authors                                                                                 | 174   |
| <ul> <li>References (with direct relation to the<br/>content of this tutorial)</li> </ul> | 177   |
| <ul> <li>Further references</li> </ul>                                                    | 181   |
| Content                                                                                   | . 188 |

Hybrid Parallel Programming Slide 189



