

# News Updates: SX-ACE's Operations and Applications Development for the Future

#### Hiroaki Kobayashi

Deputy Director for the HPC Strategy of Cyberscience Center, Professor of Graduate School of Information Sciences, Tohoku University, koba@tohoku.ac.jp

> 24th WSSP Dec. 5-6, 2016



#### Tohoku Univ.'s New Supercomputer System (2015.2.20~)



# HPCI: High Performance Computing Infrastructure in Japan





#### Organization of Tohoku Univ. SX-ACE System





## Features of Tohoku Univ. SX-ACE System

Significant Performance Improvement with Lower Power and Less Space

|                              |                         | SX-9 (2008)      | SX-ACE (2014)    | Improvement |
|------------------------------|-------------------------|------------------|------------------|-------------|
| CPU Performance              | Number of Cores         | 1                | 4                | 4x          |
|                              | Total Flop/s            | 118.4Gflop/s     | 276Gflop/s       | 2.3x        |
|                              | Memory Bandwidth        | 256GB/s          | 256GB/s          | 1x          |
|                              | ADB Capacity            | 256KB            | 4MB              | 16x         |
| Total                        | Total Flop/s            | 34.1Tflop/s      | 706.6Tflop/s     | 20.7x       |
|                              | Total Memory Bandwidth  | 73.7TB/s         | 655TB/s          | 8.9x        |
|                              | Total Memory Capacity   | 18TB             | 160TB            | 8.9x        |
| Footprint, Power Consumption | Power Consumption (Max) | 590kVA           | 1,080kVA         | 1.8x        |
|                              | Footprint               | 293m<sup>2</sup> | 430m<sup>2</sup> | 1.5x        |

Powerful CPU/Node Performance and Higher B/F Rate

| CPU (Node) Performance | SX-ACE (2014) | K (2011)   | Ratio |
|------------------------|---------------|------------|-------|
| Clock Frequency        | 1GHz          | 2GHz       | 0.5x  |
| Flop/s per Core        | 64Gflop/s     | 16Gflop/s  | 4x    |
| Cores per CPU          | 4             | 8          | 0.5x  |
| Flop/s per CPU         | 256Gflop/s    | 128Gflop/s | 2x    |
| Bandwidth              | 256GB/s       | 64GB/s     | 4x    |
| Bytes per Flop (B/F)   | 1             | 0.5        | 2x    |
| Memory Capacity        | 64GB          | 16GB       | 4x    |

A Balanced System for High Sustained Performance, resulting in High Productivity in the Wide Area of Applications in Academia and Industry
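As a quick check, the B/F values in the SX-ACE vs. K comparison follow directly from the per-CPU bandwidth and flop/s listed there (nothing beyond those table entries is assumed):

$$
\text{B/F}_{\text{SX-ACE}} = \frac{256\ \text{GB/s}}{256\ \text{Gflop/s}} = 1,
\qquad
\text{B/F}_{\text{K}} = \frac{64\ \text{GB/s}}{128\ \text{Gflop/s}} = 0.5
$$

So SX-ACE supplies twice as many bytes per floating-point operation as K, even though its per-CPU peak is only 2x higher.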



#### Cooling Facility of HPC Building





## Power Consumption of the Cooling System: Effect of Fresh-Air Cooling



## Node-Core Activity:

Effect of Automatic Core-Node Activation/Deactivation Control



Figure (not reproduced): horizontal axis is Month/Day.



## Operation Statistics of SX-ACE (Normalized by SX-9 Data)





#### Performance Evaluation of SX-ACE by Using the HPCG Benchmark

- ★ HPCG (High Performance Conjugate Gradients) is designed to exercise computational and data access patterns that more closely match a broad set of important applications.
  - ✓ HPL for the TOP500 is increasingly unreliable as a true measure of system performance for a growing collection of important science and engineering applications.
- ★ HPCG is a complete, stand-alone code that measures the performance of basic operations in a unified code:
  - ✓ Sparse matrix-vector multiplication.
  - ✓ Sparse triangular solve.
  - ✓ Vector updates.
  - ✓ Global dot products.
  - ✓ Local symmetric Gauss-Seidel smoother.
  - ✓ Driven by a multigrid preconditioned conjugate gradient algorithm that exercises the key kernels on a nested set of coarse grids.
  - ✓ Reference implementation is written in C++ with MPI and OpenMP support.
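To make the kernel list above concrete, here is a minimal, unpreconditioned conjugate gradient sketch in plain Python. It is not the HPCG reference code (which is C++ and adds the multigrid preconditioner with a symmetric Gauss-Seidel smoother); it only shows how sparse matrix-vector multiplication, dot products, and vector updates fit together. The function names and the 1-D Laplacian test matrix are illustrative assumptions.

```python
def spmv(indptr, indices, data, x):
    """Sparse matrix-vector product y = A x for a matrix stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def dot(u, v):
    """Dot product (a global reduction in a distributed run)."""
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Vector update: return alpha * x + y."""
    return [alpha * a + b for a, b in zip(x, y)]


def conjugate_gradient(indptr, indices, data, b, tol=1e-10, max_iter=200):
    """Plain CG for a symmetric positive definite CSR matrix (no preconditioner)."""
    x = [0.0] * len(b)
    r = list(b)            # residual r = b - A x with x = 0
    p = list(r)            # initial search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        ap = spmv(indptr, indices, data, p)
        alpha = rr / dot(p, ap)
        x = axpy(alpha, p, x)      # x <- x + alpha p
        r = axpy(-alpha, ap, r)    # r <- r - alpha A p
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)  # p <- r + beta p
        rr = rr_new
    return x


def laplacian_1d_csr(n):
    """Hypothetical test matrix: tridiagonal (-1, 2, -1) in CSR form."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data


if __name__ == "__main__":
    A = laplacian_1d_csr(8)
    b = spmv(*A, [1.0] * 8)          # right-hand side whose exact solution is all ones
    print(conjugate_gradient(*A, b))
```

Every kernel here streams through memory with little data reuse, which is why HPCG rewards memory bandwidth rather than peak flop/s.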



## Features of the SX-ACE Vector Processor

- 4-core configuration, each core with a high-performance vector processing unit (VPU) and a scalar processing unit (SPU)
  - 272Gflop/s of VPU + 4Gflop/s of SPU per socket
    - 68Gflop/s + 1Gflop/s per core
  - 1MB private ADB per core (4MB per socket)
    - Software-controlled on-chip memory for vector load/store
    - 4x compared with SX-9
    - 4-way set-associative
    - MSHR with 512 entries (address+data)
    - 256GB/s to/from vector registers
      - 4B/F for multiply-add operations
  - 256GB/s memory bandwidth, shared by 4 cores
    - 1B/F in 4-core multiply-add operations
      - ~4B/F in 1-core multiply-add operations
    - 128 memory banks per socket
- Other improvements and new mechanisms to enhance vector processing capability, especially for efficient handling of short-vector operations and indirect memory accesses
  - Out-of-order execution for vector load/store operations
  - Advanced data forwarding in vector-pipe chaining
  - Shorter memory latency than SX-9
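The single-core B/F figure above can be reproduced with simple arithmetic. This sketch assumes the multiply-add peak is the 64Gflop/s per core quoted in the comparison with K; using the 68Gflop/s VPU peak listed above instead gives a slightly lower value, which may be why the slide writes "~4":

$$
\frac{256\ \text{GB/s}}{1 \times 64\ \text{Gflop/s}} = 4\ \text{B/F},
\qquad
\frac{256\ \text{GB/s}}{1 \times 68\ \text{Gflop/s}} \approx 3.8\ \text{B/F}
$$

The same ratio applies to the ADB path, since each core has 256GB/s to and from its vector registers.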

# Source: NEC SX-ACE Processor Architecture



#### **Floor Plan of the CPU**





# **Optimizations of the HPCG Benchmark for SX-ACE**

Figure (not reproduced): vertical axis is MG Result (Gflop/s).

- Data packing for vector-friendly matrix memory allocation of sparse matrices
- Parallelization of 27-point stencil computation by using coloring and hyperplane methods
- Selective reusable-data caching and blocking for effective use of ADB
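The slide does not say which coloring scheme was used. As one common possibility, the sketch below shows a parity-based 8-coloring of a 3-D grid: with a 27-point stencil, no two coupled points share a color, so all points of one color can be updated in parallel (or as one long vector) in a Gauss-Seidel sweep. The function names and grid size are illustrative assumptions.

```python
from itertools import product


def color_of(i, j, k):
    """Assign one of 8 colors from the parity of each grid index."""
    return (i % 2) + 2 * (j % 2) + 4 * (k % 2)


def neighbors_27(i, j, k, n):
    """Yield the in-grid points coupled to (i, j, k) by a 27-point stencil."""
    for di, dj, dk in product((-1, 0, 1), repeat=3):
        if (di, dj, dk) == (0, 0, 0):
            continue
        ni, nj, nk = i + di, j + dj, k + dk
        if 0 <= ni < n and 0 <= nj < n and 0 <= nk < n:
            yield ni, nj, nk


def coloring_is_conflict_free(n):
    """Check that no point shares a color with any of its stencil neighbors."""
    return all(
        color_of(i, j, k) != color_of(*nb)
        for i, j, k in product(range(n), repeat=3)
        for nb in neighbors_27(i, j, k, n)
    )


def points_by_color(n):
    """Group grid points by color; each group can be swept independently."""
    groups = {c: [] for c in range(8)}
    for i, j, k in product(range(n), repeat=3):
        groups[color_of(i, j, k)].append((i, j, k))
    return groups


if __name__ == "__main__":
    print(coloring_is_conflict_free(4))
    print({c: len(pts) for c, pts in points_by_color(4).items()})
```

Coloring changes the update order relative to a lexicographic sweep, which can affect convergence; hyperplane (wavefront) ordering keeps the original dependencies at the cost of shorter vectors, which is presumably why the slide lists both.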









# HPCG Updates: Evaluation of HPCG Ver3.0 on SX-ACE







## Scalability of the HPCG Benchmark






# Performance Evaluation by Using HPGMG (High Performance Geometric Multi-Grid)

- HPGMG-FV solves variable-coefficient elliptic problems on isotropic Cartesian grids
  - Using the finite volume method (FV) and Full Multigrid (FMG).
- Filling the gap between HPL and HPCG
  - Tracking real applications' behavior
    - Memory bound, but cache friendly
      - 120-point stencil of Gauss-Seidel Red-Black (GSRB)
    - MPI, OpenMP, and OpenACC implementations are available
      - Enabling fair comparison with GPUs and accelerators

| Benchmark | Kernel      | Required B/F |
|-----------|-------------|--------------|
| HPL       | DGEMM       | < 0.1        |
| HPGMG     | GSRB        | > 1          |
| HPCG      | SpMV, SYMGS | > 4          |
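One way to read the required-B/F table is through a simplified roofline bound: if a kernel needs more bytes per flop than the machine supplies, the attainable fraction of peak is at most the ratio of the two. The sketch below is an illustration under that assumption, not a measurement. The machine B/F values (1 for SX-ACE, 0.5 for K) are the ones quoted earlier in this deck, and the table's bounds (0.1, 1, 4) are used as if they were exact, so the HPGMG and HPCG results are optimistic upper limits.

```python
def attainable_fraction(machine_bf, required_bf):
    """Upper bound on the fraction of peak flop/s under a simple roofline model."""
    return min(1.0, machine_bf / required_bf)


# Machine balance in bytes per flop, as quoted in the SX-ACE vs. K comparison.
MACHINES = {"SX-ACE": 1.0, "K": 0.5}

# Required B/F per benchmark, taking the table's bounds as representative values.
BENCHMARKS = {"HPL": 0.1, "HPGMG": 1.0, "HPCG": 4.0}


def bound_table():
    """Return {machine: {benchmark: bound}} for all combinations."""
    return {
        machine: {name: attainable_fraction(bf, req) for name, req in BENCHMARKS.items()}
        for machine, bf in MACHINES.items()
    }


if __name__ == "__main__":
    for machine, row in bound_table().items():
        print(machine, row)
```

Under these assumptions HPL is compute bound on both systems, while HPCG is limited to at most a quarter of peak on SX-ACE and an eighth on K.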





# HPGMG Results (As of Nov. 2016 at SC16)





# Leading Science and Engineering Fields supported by the Supercomputer of Tohoku University

Next-Generation CFD Analysis

Turbine Design



#### Industrial Use












Antenna Analysis

Perpendicular Magnetic Recording Medium Design



Heat Shock Analysis



**Combustion Flow Simulation** 




Nano Material Design



Tsunami Inundation Analysis

Earthquake Analysis







Ozone-hole Analysis








#### Case 1: Trade-off between Vector Length and Stride

Performance of QSFDM GLOBE codes on SX-ACE

# Interchange loops to reduce the stride length

```
! Slide annotations: loop length 629, stride length 5
      do jz=max(npol01+n2o+1,mz1b),min(idix11-1,mz1e)
!cdir select(vector)
        do jx=3,nx1-n2o
!cdir expand=nl
          do jl=1,nl
            work1=work1+rxx1(jl,jx,jz)
            rxx1(jl,jx,jz) =
     &        ((2.0e0*tau1(jl,jx,jz)-dt)/(2.0e0*tau1(jl,jx,jz)+dt))
     &        *rxx1(jl,jx,jz)
     &        -(2.0e0/(2.0e0*tau1(jl,jx,jz)+dt))
     &        *( + coeff1*( dpai1(jl,jx,jz)-2.0e0*damu1(jl,jx,jz))
     &           + coeff2*(-dpai1(jl,jx,jz)+4.0e0*damu1(jl,jx,jz))
     &           + coeff3*( dpai1(jl,jx,jz)-2.0e0*damu1(jl,jx,jz))
     &           + coeff5*( dpai1(jl,jx,jz)-2.0e0*damu1(jl,jx,jz))
     &           - coeff6*( dpai1(jl,jx,jz)-2.0e0*damu1(jl,jx,jz)) )
! (excerpt: the statement and the enclosing loops continue on the slide)
```
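The stride annotation on this slide can be understood from Fortran's column-major layout. The sketch below computes the memory stride, in elements, seen by a vector loop over each index of an array declared as `rxx1(nl, nx, nz)`. It assumes `nl = 5`, inferred from the stride length of 5 shown on the slide; the value of `nx` is hypothetical, and the helper names are not from the original code.

```python
def column_major_offset(jl, jx, jz, nl, nx):
    """Linear element offset of a(jl, jx, jz) in a Fortran array a(nl, nx, nz).

    Indices are 1-based, and the leftmost index varies fastest in memory.
    """
    return (jl - 1) + nl * (jx - 1) + nl * nx * (jz - 1)


def stride_of(loop_index, nl, nx):
    """Element stride between consecutive iterations of a loop over one index."""
    step = {"jl": (2, 1, 1), "jx": (1, 2, 1), "jz": (1, 1, 2)}[loop_index]
    return column_major_offset(*step, nl, nx) - column_major_offset(1, 1, 1, nl, nx)


if __name__ == "__main__":
    NL = 5      # assumed: matches the stride length of 5 on the slide
    NX = 640    # hypothetical extent of the second dimension
    for name in ("jl", "jx", "jz"):
        print(name, stride_of(name, NL, NX))
```

Vectorizing over `jx` gives a long loop (629 iterations on the slide) but a stride of `nl`, while vectorizing over `jl` gives unit stride but a vector length of only `nl`. That is the trade-off the slide title refers to.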



#### Case 1: Trade-off between Vector Length and Stride

Performance of QSFDM GLOBE codes on SX-ACE



Even within the same system series, HPC codes should be re-optimized as the system architecture evolves!



high-order accuracy finite difference methods," Japan-Russia Workshop @ Nagoya, Dec 10, 2015.




## Case 2: Optimize Decomposition Size:

Increasing Vectorization and Reducing Communications

#### changing the size of y, z

| x   | y   | z   | npex | npey | npez | Comm. elements: x*y | Comm. elements: y*z | Comm. elements: z*x | Comm. elements: Total | Communication [sec] | Communication [%] | GFLOPS/proc (efficiency: %) |
|-----|-----|-----|------|------|------|---------------------|---------------------|---------------------|-----------------------|---------------------|-------------------|-----------------------------|
| 600 | 25  | 400 | 2    | 8    | 8    | 15,000              | 10,000              | 240,000             | 265,000               | 5.54                | 25.8              | 24.01 (37.5%)               |
| 600 | 25  | 400 | 2    | 4    | 16   | 15,000              | 10,000              | 240,000             | 265,000               | 5.42                | 25.3              | 24.09 (37.6%)               |
| 600 | 50  | 200 | 2    | 8    | 8    | 30,000              | 10,000              | 120,000             | 160,000               | 4.93                | 23.6              | 24.86 (38.8%)               |
| 600 | 50  | 200 | 2    | 4    | 16   | 30,000              | 10,000              | 120,000             | 160,000               | 4.88                | 23.5              | 24.89 (38.9%)               |
| 600 | 100 | 100 | 2    | 8    | 8    | 60,000              | 10,000              | 60,000              | 130,000               | 4.05                | 19.2              | 24.71 (38.6%)               |
| 600 | 100 | 100 | 2    | 4    | 16   | 60,000              | 10,000              | 60,000              | 130,000               | 4.14                | 19.7              | 24.78 (38.7%)               |
| 600 | 200 | 50  | 2    | 8    | 8    | 120,000             | 10,000              | 30,000              | 160,000               | 5.12                | 23.7              | 23.95 (37.4%)               |
| 600 | 200 | 50  | 2    | 4    | 16   | 120,000             | 10,000              | 30,000              | 160,000               | 4.95                | 23.0              | 24.14 (37.7%)               |
| 600 | 400 | 25  | 2    | 8    | 8    | 240,000             | 10,000              | 15,000              | 265,000               | 7.26                | 30.4              | 21.46 (33.5%)               |
| 600 | 400 | 25  | 2    | 4    | 16   | 240,000             | 10,000              | 15,000              | 265,000               | 7.17                | 30.0              | 21.61 (33.8%)               |
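The communication-element columns in the table are the three face areas of the x × y × z block and their sum. The short sketch below recomputes them for the (y, z) pairs listed; the function name is an illustrative assumption, and the measured times and GFLOPS values are not modeled.

```python
def communication_elements(x, y, z):
    """Face sizes exchanged for an x-by-y-by-z block, plus their total."""
    faces = {"x*y": x * y, "y*z": y * z, "z*x": z * x}
    faces["Total"] = sum(faces.values())
    return faces


# (y, z) pairs from the table, all with x = 600 and y * z = 10,000.
SHAPES = [(25, 400), (50, 200), (100, 100), (200, 50), (400, 25)]


def totals_for_x(x=600):
    """Return {(y, z): total communication elements} for the listed shapes."""
    return {(y, z): communication_elements(x, y, z)["Total"] for y, z in SHAPES}


if __name__ == "__main__":
    for shape, total in totals_for_x().items():
        print(shape, total)
```

The total is smallest for y = z = 100, which matches the row with the lowest communication time in the table.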

#### Keeping y = z

#### changing the size of x



#### larger x is better

, R. Egawa, Y. Isobe, and I. Miyoshi, "Performance Evaluation of MHD Simulation Code on SX-ACE and FX100," poster in HPDC2016, Kyoto, 2016.






# **Emergency Computing on SX-ACE for Disaster Analysis and Mitigation of Tsunami**





## Design and Development of A Real-Time Tsunami Inundation Forecasting System



- Fault estimation based on GPS data (GPS observation): < 8 min
- Simulation on SX-ACE, using 10-m mesh models of coastal cities: < 8 min
- Information delivery, with just-in-time access to visualized information by local governments: < 4 min
- Total: < 20 min



## System Extension for Coverage of All of Japan and Improvement of Dependability




#### What Happened on Nov. 22

News headline: "Magnitude-7.4 quake likely an aftershock from five years ago". Photo caption: "A tsunami rushes up the Sunaoshikawa river in Tagajo, Miyagi Prefecture, early on Nov. 22."

5:59:47 Earthquake

**Events in Simulation Timeline**

- 6:00:34 Estimation started (fault estimation)
- 6:07:26 Job submission (simulation at Tohoku)
- 6:09:52 Simulations completed
- 6:07:27 Job submission (simulation at Osaka)
- 6:10:00 Simulations completed
- 7:31 Visualization started

**Events in Actual Timeline**

- 6:00:06 1st EEW (M6.0)
- 6:00:34 8th EEW (M7.1)
- 6:01:14 11th EEW (M7.3)
- 6:02:23 Tsunami Information, Tsunami Warning (Fukushima)
- 6:29 1st tsunami arrival (Iwaki Onahama)
- 7:09 Tsunami arrival (Sendai Port)
- 7:26 Tsunami Warning (Miyagi)
- 8:03 Max water level (1.4m) at Sendai Port
- 12:50 Alert withdrawal

Visualization was performed manually, because Sendai and its vicinity are not automatic visualization areas in the current implementation.

Information delivery (arbitrary): after the visualization was completed, the inundation information available on the Web was sent to the Cabinet Office, Government of Japan, and to Higashi-Matsushima-Shi and Ishinomaki-Shi.
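The elapsed times implied by the simulation timeline can be checked with a few lines of Python. The timestamps are the ones listed on the slide; the dictionary keys and the helper name are illustrative assumptions.

```python
from datetime import datetime


def seconds_between(start, end, fmt="%H:%M:%S"):
    """Elapsed seconds between two same-day clock times given as H:MM:SS strings."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


EARTHQUAKE = "5:59:47"

# Simulation-side events as listed on the slide.
EVENTS = {
    "fault estimation started": "6:00:34",
    "job submitted (Tohoku)": "6:07:26",
    "simulations completed (Tohoku)": "6:09:52",
    "job submitted (Osaka)": "6:07:27",
    "simulations completed (Osaka)": "6:10:00",
}


def elapsed_since_earthquake():
    """Return {event: seconds after the earthquake}."""
    return {name: seconds_between(EARTHQUAKE, t) for name, t in EVENTS.items()}


if __name__ == "__main__":
    for name, secs in elapsed_since_earthquake().items():
        print(f"{name}: {secs // 60} min {secs % 60} s")
```

Both sites finished their simulations a little over ten minutes after the earthquake, well before the first tsunami arrival at 6:29.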



#### What Happened on Nov. 22






# **Future Vector Systems R&D\***

\*This work is partially conducted with NEC, but the contents do not reflect any future products of NEC



## Timeline of the Cyberscience Center HPC System Development and R&D For the Future





#### Future Plan of HPCI Deployment in Japan




![](_page_32_Picture_1.jpeg)

# But the Road to Exascale is Not So Easy...

- End of Moore's Law???
  - ✓ Cost-reduction is no longer available!?
  - ✓ Post-K is delayed by up to 2 years due to semiconductor/device technology problems.
- Seeking flop/s-oriented, accelerator-based exotic architectures?
  - with heterogeneity in computing and memory models, in particular, a large gap between local and remote, and between layers in the deep memory hierarchy
    - design insivelynsidered, but excessiveo-design.ys eand tholications much ecial
      - ency KNL in PI


- Still suffering from high operational costs, mainly due to electricity expenses, and who pays?
  - ✓ Is the operation cost for the 40MW of Post-K affordable!?

![](_page_32_Figure_13.jpeg)

![](_page_33_Picture_1.jpeg)

# Scaling May Be at Its End, but Silicon Is Not! Use It Smartly!

- ✓ We are facing the end of Moore's law due to physical limitations, and the transistor cost is hard to reduce; however,
- ✓ Silicon is still the fundamental construction material for computing platforms, just as plastic, steel, and concrete are for automobiles, buildings, and home appliances.

★ So, we have to become much smarter in the design of future HEC systems.

Exploit sleeping flop/s efficiently by redesigning/reinventing memory subsystems to protect HEC systems from "Brain Infarction"

![](_page_33_Picture_7.jpeg)

Use precious silicon budget (+ advanced device technologies) to effectively design mechanisms that can supply data to computing units smoothly.

From Brute Force to Smart Force!

New Moore's law would be

Productivity doubles every two years?!

![](_page_34_Picture_1.jpeg)

#### Not Peak Performance, Turn Memory-BW into Sustained Performance!

![](_page_34_Figure_3.jpeg)

![](_page_35_Picture_1.jpeg)

## Tohoku Univ-NEC Joint Research Division of HPC Technologies and Applications

#### ★ Founded in June 2014 for a 4-year period

- R&D on HPC technologies to exploit high sustained performance of science and engineering applications on current HPC systems and to realize future HPC systems targeting 2020.
- Evaluation and Improvement of the current HPC environments through migration of SX-9 applications to SX-ACE
- Detailed Evaluation and Analysis of Modern HPC Systems, not only Vector Systems but also Scalar-Parallel and Accelerator-Based Systems
- ✓ Feasibility study of a future highly balanced HPC system for high sustained performance of practical applications in the post-peta scale era

#### **★** Faculty Members

- Hiroaki Kobayashi, Professor and division director
- Hiroyuki Takizawa, Associate Professor
- Ryusuke Egawa, Associate Professor
- Akihiko Musa (NEC), Visiting Professor
- Mitsuo Yokokawa (Kobe Univ), Visiting Professor
- Shintaro Momose (NEC), Visiting Associate Professor
- Masayuki Sato, Assistant Professor

![](_page_35_Figure_17.jpeg)

✓ In collaboration with visiting researchers from NEC and the technical staff of Cyberscience Center

![](_page_36_Picture_1.jpeg)

## Summary

- ★ SX-ACE shows high sustained performance compared with SX-9 and other modern HEC systems
  - ✓ Achieved the same single-core performance in practical applications even with 60% of the peak performance of SX-9
  - ✓ No. 1 in computing efficiency and power efficiency in the HPCG Benchmark ranking
  - ✓ Paves the way to a new social infrastructure for homeland safety in Japan
- ★ A well-balanced HEC system with regard to memory performance is the key to success in realizing high productivity in science and engineering simulations
  - ✓ Think different: from Brute Force to Smart Force in HPC design
  - ✓ Quality, not quantity, for productive HPC!
  - ✓ Demands for supercomputers for the rest of us, especially for 2020 and beyond!