
Network-Attached Accelerators: Host-independent Accelerators for Future HPC Systems

Sarah Neuwirth, Dirk Frey, and Ulrich Brüning
Institute of Computer Engineering (ZITI), University of Heidelberg, Germany
{sarah.neuwirth,dirk.frey,ulrich.bruening}@ziti.uni-heidelberg.de

Introduction

Current Heterogeneous Systems
• Clusters with accelerators
• Accelerators are host-centric
• No integrated network interconnect
• Static assignment (1 CPU : N accelerators)
• PCIe can become a bottleneck
• Explicit programming required

Research Objective

Divide the Architecture into Two Parts
• Cluster based on multi-core chips
  - Executes scalar code
• Booster based on many-core technology
  - Runs highly scalable code
  - EXTOLL: switchless direct 3D torus [1]
=> Components-off-the-shelf philosophy

Goals of the Architecture
• Accelerator directly connected to the network
• Static and dynamic workload assignments
• Scale the number of accelerators and host CPUs independently in an N-to-M ratio

[Fig. 1 and Fig. 2 diagrams: compute nodes (CN) and accelerators (Acc) on an interconnection network; the conventional host-attached layout vs. the booster-based layout with network-attached accelerators.]

Results II – Application-level Evaluation
• MPI version of the LAMMPS application
• Equal thread-to-MIC distribution
• Communication time between MICs can be improved by up to 32%

[Fig. 9 panels: (a) 32 threads, 16 threads/MIC; (b) 64 threads, 32 threads/MIC.]

[Fig. 3 diagram: User View, in which any compute node CNi sees Accelerators 0..N directly; System View, in which booster nodes BN0..BNN (each an accelerator plus NIC) sit on the Booster Network, compute nodes CN1..CNM sit on the Cluster Network, and booster interfaces BI0..BIK couple the two.]

Network-Attached Accelerators

Communication Model
• Simplified, transparent user view
• Direct accelerator-to-accelerator communication
• Dynamic workload distribution
• Accelerators accessible from any node without host interaction

Prototype Implementation
• High-density booster node card (BNC) with two Intel Xeon Phis (61 cores each) & NICs

Accelerator Access
• Distributed shared memory for communication
• MMIO range can be located anywhere in the network
• Loads and stores are encapsulated into network transactions (see the sketch below)
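To make the load/store encapsulation concrete, below is a minimal user-space sketch. It assumes a hypothetical device node /dev/extoll_smfu0 that exposes a remote accelerator's MMIO window; the actual EXTOLL/SMFU interface differs, so the path, window size, and register offsets are illustrative only.

/* Minimal sketch of remote MMIO access via a network-mapped window.
 * The device path and offsets are hypothetical, for illustration. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MMIO_WINDOW_SIZE (1 << 20)   /* 1 MiB window (illustrative) */

int main(void)
{
    /* Hypothetical device node exposing a remote MMIO range. */
    int fd = open("/dev/extoll_smfu0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Map the remote accelerator's MMIO window into our address
     * space; loads and stores to this mapping are turned into
     * network transactions by the SMFU hardware. */
    volatile uint32_t *mmio = mmap(NULL, MMIO_WINDOW_SIZE,
                                   PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (mmio == MAP_FAILED) { perror("mmap"); return 1; }

    mmio[0] = 0x1;            /* store -> network write transaction */
    uint32_t st = mmio[1];    /* load  -> network read transaction  */
    printf("status register: 0x%x\n", st);

    munmap((void *)mmio, MMIO_WINDOW_SIZE);
    close(fd);
    return 0;
}

Because the mapping is ordinary virtual memory, code running on the BI (including an unmodified driver) can touch a device that physically sits on another node.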

Software Design
• Transparent to the accelerator implementation
• No accelerator driver changes
• Virtual PCI layer is responsible for device & MSI configuration and IRQ handling (modeled below)
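A toy user-space model of that idea: each PCI config-space access is packaged into a network request and forwarded to the booster node that hosts the device. The request layout and send_to_booster() are invented for illustration and are not the prototype's actual interface.

/* Toy model of the virtual PCI layer: config-space accesses become
 * network requests instead of local bus cycles. */
#include <stdint.h>
#include <stdio.h>

enum pci_op { PCI_CFG_READ, PCI_CFG_WRITE };

struct pci_cfg_req {
    uint8_t  op;        /* PCI_CFG_READ or PCI_CFG_WRITE */
    uint8_t  bus, dev, fn;
    uint16_t reg;       /* config-space offset */
    uint32_t value;     /* payload for writes */
};

/* Stand-in for the EXTOLL transport; a real implementation would
 * hand the request to the NIC and wait for the completion. */
static uint32_t send_to_booster(const struct pci_cfg_req *req)
{
    printf("net: %s %02x:%02x.%x reg 0x%03x\n",
           req->op == PCI_CFG_READ ? "read " : "write",
           req->bus, req->dev, req->fn, req->reg);
    return 0x12345678; /* fake completion data */
}

/* What an unmodified accelerator driver ends up calling through the
 * kernel's PCI accessor hooks. */
uint32_t virt_pci_cfg_read(uint8_t bus, uint8_t dev, uint8_t fn,
                           uint16_t reg)
{
    struct pci_cfg_req req = { PCI_CFG_READ, bus, dev, fn, reg, 0 };
    return send_to_booster(&req);
}

int main(void)
{
    /* Driver reads vendor/device ID; the access travels the network. */
    uint32_t id = virt_pci_cfg_read(0, 3, 0, 0x00);
    printf("vendor/device id: 0x%08x\n", id);
    return 0;
}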

Results I – MPI Performance

[Fig. 8(a): bandwidth (MB/sec) vs. message size (1 B to 4 MB); curves: mic0-mic1 Booster, mic0-mic1 OFED/SCIF, mic0-mic1 TCP/SCIF, mic0-remote mic0 EXTOLL.]

[Fig. 7(b): latency (usec) vs. message size (8 KB to 2 MB); same curves.]

[Fig. 7(a): latency (usec) vs. message size (0 B to 2 KB); curves: mic0-mic1 Booster, mic0-mic1 OFED/SCIF, mic0-remote mic0 EXTOLL.]
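The latency values in Fig. 7 are half round-trip times. A minimal MPI ping-pong of the kind typically used to measure them is sketched below; the iteration count and message size are illustrative, not the exact benchmark parameters.

/* Ping-pong between two ranks (e.g. one per MIC); latency is
 * reported as half the round-trip time. Run with: mpirun -n 2 ... */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000, size = 8;   /* e.g. 8-byte messages */
    char *buf = malloc(size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)   /* half round-trip = total / (2 * iters) */
        printf("latency: %.2f usec\n",
               (MPI_Wtime() - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    free(buf);
    return 0;
}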


The research leading to these results has been conducted within the framework of the DEEP (Dynamically Exascale Entry Platform) project, which has received funding from the European Union's Seventh Framework Programme for research, technological development, and demonstration under grant agreement no. 287530.


[Fig. 6 photos: BNC with backplane; Proto-BI.]

References
[1] H. Froening, M. Nuessle, C. Leber, and U. Bruening, "On Achieving High Message Rates," in CCGrid '13, pp. 498-505.
[2] S. Neuwirth, D. Frey, M. Nuessle, and U. Bruening, "Scalable Communication Architecture for Network-Attached Accelerators," in HPCA '15, pp. 627-638.

Fig. 1: Heterogeneous system.

Fig. 2: Booster-based architecture.

Fig. 3: Communication model and 3D torus topology [2].

Fig. 4: Memory mapping between BNs and BI.

Fig. 5: Comparison of the software stacks.

Fig. 6: Prototype system.

Fig. 7: MIC-to-MIC internode half round-trip latency. (a) Small messages. (b) Large messages.

Fig. 8: Internode MIC-to-MIC bandwidth and bi-bandwidth. (a) Bandwidth. (b) Bidirectional bandwidth.

Fig. 9: LAMMPS performance using bead-spring polymer, Lennard-Jones, and copper metallic solid benchmarks.

[Fig. 8(b): bi-bandwidth (MB/sec) vs. message size (1 B to 4 MB); curves: mic0-mic1 Booster, mic0-mic1 OFED/SCIF, mic0-mic1 TCP/SCIF, mic0-remote mic0 EXTOLL.]

More information available at www.deep-project.eu

Conclusion
• Novel communication architecture
• Scales the number of accelerators and CPUs independently
• Host-independent direct accelerator-to-accelerator communication with very low latency
• High-density implementation with promising MPI performance

[Fig. 5 diagram: software stacks. Conventional system: User Level Tools, Accelerator Driver, Kernel PCI API (loadable modules), Hardware. Booster Interface node: User Level Tools, Accelerator Driver, Virtual PCI layer and EXTOLL Driver (built-in modules), Hardware.]

[Fig. 4 diagram: memory mapping. The 00000000h-FFFFFFFFh address spaces of BN0 and BN1 expose the accelerator MMIO ranges of MIC0..MICN; via EXTOLL and the SMFU these windows appear in the BI's address space, so an address x on the BI resolves to a specific MIC's MMIO range on a specific booster node.]
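A small sketch of that address translation, assuming purely for illustration that each booster node's accelerator MMIO window appears at a fixed 4 GiB stride in the BI's address space (the prototype's actual window size and layout may differ):

/* Map a BI-local address x to (booster node, MMIO offset), given a
 * fixed per-node window stride. Window size is an assumption. */
#include <stdint.h>
#include <stdio.h>

#define MMIO_WINDOW 0x100000000ULL   /* 4 GiB per MIC (assumed) */

struct target { uint32_t node; uint64_t offset; };

static struct target translate(uint64_t x)
{
    struct target t = { (uint32_t)(x / MMIO_WINDOW),
                        x % MMIO_WINDOW };
    return t;
}

int main(void)
{
    /* An address in the second window resolves to BN 1. */
    struct target t = translate(0x180001000ULL);
    printf("BN %u, offset 0x%llx\n", t.node,
           (unsigned long long)t.offset);
    return 0;
}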