System-Level Heterogeneity
with Intel® Xeon Phi™ Processors
Estela Suarez
Jülich Supercomputing Centre
This project has received funding from the European Union's Seventh Framework Programme for research, technological
development and demonstration under grant agreements 287530 (DEEP) and 610476 (DEEP-ER).
Collaborative R&D in DEEP & DEEP-ER
– European Union Exascale projects
– 20 partners
– Total budget: 28.3 M€
– EU funding: 14.5 M€
– Combined term: 5 years

www.deep-project.eu | www.deep-er.eu
Visit us @ ISC 2016, Frankfurt (Germany)
June 19 – 23, 2016 Booth #1340
DEEP/-ER Approach
Co-design between applications, system SW and HW
– Application and operational requirements shape the system architecture
– HW provides performance, scalability and energy efficiency
– System SW enables applications to leverage HW potential
Objectives
– Deliver highest scalability and workload performance
– Provide leading energy efficiency (energy per result)
– Offer a familiar, easy-to-use programming environment and standards-based APIs
– Ensure sustainability of system architecture and SW
Design elements
– Exploit benefits of processor heterogeneity
– Leverage technology advances in storage-class memory and interconnects
Co-Design Applications
DEEP + DEEP-ER applications
• Brain simulation (EPFL)
• Space weather simulation (KULeuven)
• Climate simulation (CYI)
• Computational fluid engineering (CERFACS)
• High temperature superconductivity (CINECA)
• Seismic imaging (CGG)
• Human exposure to electromagnetic fields (INRIA)
• Geoscience (BADW-LRZ)
• Radio astronomy (ASTRON)
• Oil exploration (BSC)
• Lattice QCD (UREG)
Goals
• Drive co-design cycle
• Evaluate and validate system prototypes
System-Level Heterogeneity
Accelerated Cluster
– Fixed, static ratio and assignment of accelerators to CPUs
– Static management of resources
– Accelerators do not act autonomously
– General-purpose Cluster interconnect
– Programming via local offload interfaces (OpenCL, CUDA, LEO, OpenACC, …)

Cluster-Booster Architecture
– No fixed ratio or assignment between resources (multicore and manycore nodes)
– Dynamic management and association of resources
– High-throughput network in the Booster
– Programming via MPI and “global” tasking interfaces
DEEP System Architecture
Cluster part
– High single-thread and complex-code performance → Intel® Xeon® processors
– High any-to-any connection performance and infrastructure integration → standard switched HPC fabric (InfiniBand™)

Booster part
– High throughput, autonomous operation → Intel® Xeon Phi™ coprocessor (codenamed “Knights Corner”)
– Need for low latency and spatial application structures → 3D torus direct-connected network (EXTOLL)
– Network bridging and KNC control → Booster Interface layer

Both parts
– Efficiency requires liquid cooling and dense packaging
DEEP Prototype Systems
Eurotech Aurora Prototype
Cluster part
– 128 dual-socket Intel Xeon E5-2403 nodes
– QDR InfiniBand™
– Eurotech Aurora liquid cooling & packaging

Booster part
– 384 Intel Xeon Phi 7120X nodes
– FPGA implementation of the EXTOLL interconnect
– 24 Booster Interface nodes with Intel Xeon processors
– Eurotech Aurora liquid cooling & packaging
DEEP/DEEP-ER Programming Model
ParaStation Global MPI layer
– Expert-level programming
– Efficient communication across the whole system
– Dynamic process spawning and control in both directions (see the sketch below)
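As a concrete illustration of the last bullet, here is a minimal sketch using the standard MPI dynamic-process interface, which ParaStation MPI implements across the whole system; the worker binary name and process count are hypothetical.

    /* Minimal sketch: a Cluster process spawns 16 workers (e.g. on
     * Booster nodes) at runtime via standard MPI dynamic processes. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm workers;  /* inter-communicator to the spawned side */

        MPI_Init(&argc, &argv);

        /* Spawn 16 instances of a (hypothetical) worker binary. */
        MPI_Comm_spawn("./booster_worker", MPI_ARGV_NULL, 16,
                       MPI_INFO_NULL, 0, MPI_COMM_SELF,
                       &workers, MPI_ERRCODES_IGNORE);

        /* Communication now works in both directions across the
           inter-communicator; here the parent broadcasts a value
           (workers get the communicator via MPI_Comm_get_parent). */
        int token = 42;
        MPI_Bcast(&token, 1, MPI_INT, MPI_ROOT, workers);

        MPI_Comm_disconnect(&workers);
        MPI_Finalize();
        return 0;
    }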
Task-based OmpSs programming model
– Pragma-based, emphasizes ease of use
– Efficient communication across the whole system
– Dynamic spawning of massively parallel tasks in both parts (sketched below)
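The following is a minimal sketch of the OmpSs Offload style described in the HiPC 2015 paper cited on the next slide. The onto() clause and the deep_booster_alloc()/deep_booster_free() helpers are recalled from the OmpSs Collective Offload user manual and should be treated as assumptions to verify against it.

    #include <mpi.h>

    /* Provided by the OmpSs Offload runtime (assumed signatures). */
    void deep_booster_alloc(MPI_Comm spawners, int nodes,
                            int procs_per_node, MPI_Comm *booster);
    void deep_booster_free(MPI_Comm *booster);

    MPI_Comm booster;

    /* Annotated declaration: calls to compute() become tasks that
       the runtime offloads onto the processes behind "booster". */
    #pragma omp task in([n]data) out([n]result) onto(booster, 0)
    void compute(double *data, double *result, int n);

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Dynamically allocate 4 Booster processes, 1 per node. */
        deep_booster_alloc(MPI_COMM_WORLD, 4, 1, &booster);

        double data[1024], result[1024];
        compute(data, result, 1024);  /* spawned as a parallel task */
        #pragma omp taskwait          /* wait for the offloaded task */

        deep_booster_free(&booster);
        MPI_Finalize();
        return 0;
    }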
Massively Parallel Tasks in OmpSs
Published in: F. Sainz, J. Bellón, V. Beltran, J. Labarta, “Collective Offload for Heterogeneous Clusters”, 22nd IEEE International Conference on High Performance Computing (HiPC), 2015.
[Figure 7: FWI hierarchical MPI architecture. A master rank offloads to 16 slave groups of 16 MPI ranks each; every slave group in turn spawns 64 worker ranks, for 1024 workers in total.]
[Figure 8: scalability of the FWI application on up to 1024 nodes (16 cores each); speed-up of OmpSs Offload, with and without I/O, compared to ideal scaling.]
VI. Conclusions and future work

This paper presents the OmpSs Offload model that was originally developed to ease the porting of complex applications to the highly heterogeneous cluster architecture proposed in the DEEP Exascale project. The OmpSs Offload model has completely fulfilled its design goals, combining the ease of use of Intel Offload with the flexibility, performance and scalability of the native MPI_Comm_spawn API. Moreover, our approach is fully integrated with the rest of the features provided by OmpSs, such as support for OpenMP codes and CUDA or OpenCL kernels. Although it was originally conceived for heterogeneous clusters, we have also successfully used it to develop hierarchical MPI applications such as FWI. We think that these hierarchical MPI architectures will play an important role in exploiting future Exascale systems. Hence, tools such as OmpSs Offload will be essential for designing such architectures and helping with their implementation for complex and large applications.

As future work, we plan to integrate our allocation API with a resource manager/job scheduler to avoid the need to reserve all the required resources before the program is launched. We also plan to investigate the potential of OmpSs Offload to improve the malleability of existing MPI applications, as well as the implications of using this offload model from the resilience point of view.
Measurements are for the BSC FWI (full waveform inversion) code.
From DEEP to DEEP-ER
– Simplified interconnect
– On-node NVM
– Self-booting nodes
– Network-attached memory
DEEP-ER Scalable I/O
Leverage the presence of fast local NVM storage (pattern sketched below):
– Scalable caching of read/write data close to the requesting node
– Prefetching stages read data into the caches
– Write-back scheme saves data to permanent storage
– Synchronous (done) and asynchronous (work in progress) versions/APIs
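This is not the DEEP-ER I/O API itself, but a plain-POSIX illustration of the cache/prefetch/write-back pattern the list above describes; all paths are hypothetical.

    #include <stdio.h>
    #include <stdlib.h>

    /* Stage (prefetch) a file from the parallel file system into
       node-local NVM before the compute phase reads it; the same
       routine copies results back for the write-back step. */
    static void stage(const char *src_path, const char *dst_path)
    {
        static char buf[1 << 20];
        size_t n;
        FILE *src = fopen(src_path, "rb");
        FILE *dst = fopen(dst_path, "wb");
        if (!src || !dst) { perror("stage"); exit(1); }
        while ((n = fread(buf, 1, sizeof buf, src)) > 0)
            fwrite(buf, 1, n, dst);
        fclose(src);
        fclose(dst);
    }

    int main(void)
    {
        /* Prefetch: parallel file system -> local NVM cache. */
        stage("/work/project/input.dat", "/nvme/cache/input.dat");

        /* ... compute phase reads and writes /nvme/cache/ ... */

        /* Write-back: local NVM cache -> permanent storage; the
           asynchronous variant would overlap this with compute. */
        stage("/nvme/cache/output.dat", "/work/project/output.dat");
        return 0;
    }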
DEEP-ER Resiliency Scheme
BSC Full Waveform Inversion: Results

Using 60 cores per Xeon Phi coprocessor node with 180 threads
[Chart: impact of different optimizations of the wave propagator on Xeon Phi; Gflop/s (axis 0–160) and speed-up (axis 0–12) per optimization step.]
INRIA MAXW-DGTD Results
[Charts: speed-up over the initial version (axis 0–4) and parallel efficiency (axis 0.50–1.50) vs. number of cores (16–1024), before and after optimization.]
Improvements applied:
– Non-blocking communication (see the sketch after the setup notes)
– Renumbering scheme
– Vectorisation and locality

Performance improvement of up to 3.3x; almost perfect parallel efficiency now.
Setup:
– Human head
– DEEP Cluster
– Mesh: 1.8 million cells
– 16 processes per node
– Pure MPI
– P1 approximation
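The first improvement is a standard MPI technique; the sketch below (not the MAXW-DGTD source; neighbor ranks and buffer names are illustrative) shows how posting non-blocking sends and receives lets interior computation overlap the halo exchange.

    #include <mpi.h>

    void exchange_and_compute(double *halo_out, double *halo_in, int n,
                              int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        /* Post the halo exchange early ... */
        MPI_Irecv(halo_in,  n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Isend(halo_out, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

        /* ... compute on interior cells that need no halo data ... */
        /* compute_interior(); */

        /* ... then wait for the exchange and finish the boundary. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        /* compute_boundary(); */
    }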
*R. Leger, D. Alvarez Mallon, A. Duran, S. Lanteri, “Assessing the DEEP-ER Cluster/Booster Architecture with a finite-element type solver for bioelectromagnetics”, submitted to ParCo 2015, contribution ID 25. www.parco2015.org
INRIA MAXW-DGTD I/O Results
Inria: Assessment of Human exposure to EM fields
24 MPI processes, 1 thread per process
[Chart: I/O performance of MAXW-DGTD; writing time in seconds (axis 0–60) for increasing model precision (P1 < P2 < P3 < P4), comparing the sdv-work file system with local NVMe.]
Performance gain from using Intel DC P3700 SSDs.
Partners