
Network-Attached Accelerators: Host-independent Accelerators for Future HPC Systems

Sarah Neuwirth, Dirk Frey, and Ulrich Brüning
Institute of Computer Engineering (ZITI), University of Heidelberg, Germany
{sarah.neuwirth,dirk.frey,ulrich.bruening}@ziti.uni-heidelberg.de

Introduction

Current Heterogeneous Systems
• Clusters with accelerators
• Accelerators are host-centric
• No integrated network interconnect
• Static assignment (1 CPU : N accelerators)
• PCIe can become a bottleneck
• Explicit programming required

Research Objective

Divide the Architecture into Two Parts
• Cluster based on multi-core chips
  - Executes scalar code
• Booster based on many-core technology
  - Runs highly scalable code
  - EXTOLL: switchless direct 3D torus [1]
=> Components-off-the-shelf philosophy

Goals of the Architecture
• Accelerator directly connected to the network
• Static and dynamic workload assignments
• Scale the number of accelerators and host CPUs independently in an N-to-M ratio

[Fig. 1 and Fig. 2 diagrams: compute nodes (CN) and accelerators (Acc) on an interconnection network; the conventional host-attached layout vs. the booster-based layout with network-attached accelerators.]

Results II – Application-level Evaluation
• MPI version of the LAMMPS application
• Equal thread-to-MIC distribution
• Communication time between MICs can be improved by up to 32%

[Fig. 9 panels: (a) 32 threads, 16 threads/MIC; (b) 64 threads, 32 threads/MIC.]

[Fig. 3 diagram: User View, in which any compute node CNi sees Accelerators 0..N directly; System View, in which booster nodes BN0..BNN (each an accelerator plus NIC) sit on the Booster Network, compute nodes CN1..CNM sit on the Cluster Network, and booster interfaces BI0..BIK couple the two.]

Network-Attached Accelerators

Communication Model
• Simplified, transparent user view
• Direct accelerator-to-accelerator communication
• Dynamic workload distribution
• Accelerators accessible from any node without host interaction

Prototype Implementation
• High-density booster node card (BNC) with two Intel Xeon Phis (61 cores each) & NICs

Accelerator Access
• Distributed shared memory for communication
• MMIO range can be located anywhere in the network
• Loads and stores are encapsulated into network transactions (see the sketch below)
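To make the load/store encapsulation concrete, below is a minimal user-space sketch. It assumes a hypothetical device node /dev/extoll_smfu0 that exposes a remote accelerator's MMIO window; the actual EXTOLL/SMFU interface differs, so the path, window size, and register offsets are illustrative only.

/* Minimal sketch of remote MMIO access via a network-mapped window.
 * The device path and offsets are hypothetical, for illustration. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MMIO_WINDOW_SIZE (1 << 20)   /* 1 MiB window (illustrative) */

int main(void)
{
    /* Hypothetical device node exposing a remote MMIO range. */
    int fd = open("/dev/extoll_smfu0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Map the remote accelerator's MMIO window into our address
     * space; loads and stores to this mapping are turned into
     * network transactions by the SMFU hardware. */
    volatile uint32_t *mmio = mmap(NULL, MMIO_WINDOW_SIZE,
                                   PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (mmio == MAP_FAILED) { perror("mmap"); return 1; }

    mmio[0] = 0x1;            /* store -> network write transaction */
    uint32_t st = mmio[1];    /* load  -> network read transaction  */
    printf("status register: 0x%x\n", st);

    munmap((void *)mmio, MMIO_WINDOW_SIZE);
    close(fd);
    return 0;
}

Because the mapping is ordinary virtual memory, code running on the BI (including an unmodified driver) can touch a device that physically sits on another node.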

Software Design
• Transparent to the accelerator implementation
• No accelerator driver changes
• Virtual PCI layer is responsible for device & MSI configuration and IRQ handling (modeled below)
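A toy user-space model of that idea: each PCI config-space access is packaged into a network request and forwarded to the booster node that hosts the device. The request layout and send_to_booster() are invented for illustration and are not the prototype's actual interface.

/* Toy model of the virtual PCI layer: config-space accesses become
 * network requests instead of local bus cycles. */
#include <stdint.h>
#include <stdio.h>

enum pci_op { PCI_CFG_READ, PCI_CFG_WRITE };

struct pci_cfg_req {
    uint8_t  op;        /* PCI_CFG_READ or PCI_CFG_WRITE */
    uint8_t  bus, dev, fn;
    uint16_t reg;       /* config-space offset */
    uint32_t value;     /* payload for writes */
};

/* Stand-in for the EXTOLL transport; a real implementation would
 * hand the request to the NIC and wait for the completion. */
static uint32_t send_to_booster(const struct pci_cfg_req *req)
{
    printf("net: %s %02x:%02x.%x reg 0x%03x\n",
           req->op == PCI_CFG_READ ? "read " : "write",
           req->bus, req->dev, req->fn, req->reg);
    return 0x12345678; /* fake completion data */
}

/* What an unmodified accelerator driver ends up calling through the
 * kernel's PCI accessor hooks. */
uint32_t virt_pci_cfg_read(uint8_t bus, uint8_t dev, uint8_t fn,
                           uint16_t reg)
{
    struct pci_cfg_req req = { PCI_CFG_READ, bus, dev, fn, reg, 0 };
    return send_to_booster(&req);
}

int main(void)
{
    /* Driver reads vendor/device ID; the access travels the network. */
    uint32_t id = virt_pci_cfg_read(0, 3, 0, 0x00);
    printf("vendor/device id: 0x%08x\n", id);
    return 0;
}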

Results I – MPI Performance

[Fig. 8(a): bandwidth (MB/sec) vs. message size (1 B to 4 MB); curves: mic0-mic1 Booster, mic0-mic1 OFED/SCIF, mic0-mic1 TCP/SCIF, mic0-remote mic0 EXTOLL.]

[Fig. 7(b): latency (usec) vs. message size (8 KB to 2 MB); same curves.]

[Fig. 7(a): latency (usec) vs. message size (0 B to 2 KB); curves: mic0-mic1 Booster, mic0-mic1 OFED/SCIF, mic0-remote mic0 EXTOLL.]
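The latency values in Fig. 7 are half round-trip times. A minimal MPI ping-pong of the kind typically used to measure them is sketched below; the iteration count and message size are illustrative, not the exact benchmark parameters.

/* Ping-pong between two ranks (e.g. one per MIC); latency is
 * reported as half the round-trip time. Run with: mpirun -n 2 ... */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000, size = 8;   /* e.g. 8-byte messages */
    char *buf = malloc(size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)   /* half round-trip = total / (2 * iters) */
        printf("latency: %.2f usec\n",
               (MPI_Wtime() - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    free(buf);
    return 0;
}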


The research leading to these results has been conducted within the framework of the DEEP (Dynamically Exascale Entry Platform) project, which has received funding from the European Union's Seventh Framework Programme for research, technological development, and demonstration under grant agreement no. 287530.


[Fig. 6 photos: BNC with backplane; Proto-BI.]

References
[1] H. Froening, M. Nuessle, C. Leber, and U. Bruening, "On Achieving High Message Rates," in CCGrid '13, pp. 498-505.
[2] S. Neuwirth, D. Frey, M. Nuessle, and U. Bruening, "Scalable Communication Architecture for Network-Attached Accelerators," in HPCA '15, pp. 627-638.

Fig. 1: Heterogeneous system.

Fig. 2: Booster-based architecture.

Fig. 3: Communication model and 3D torus topology [2].

Fig. 4: Memory mapping between BNs and BI.

Fig. 5: Comparison of the software stacks.

Fig. 6: Prototype system.

Fig. 7: MIC-to-MIC internode half round-trip latency. (a) Small messages. (b) Large messages.

Fig. 8: Internode MIC-to-MIC bandwidth and bi-bandwidth. (a) Bandwidth. (b) Bidirectional bandwidth.

Fig. 9: LAMMPS performance using bead-spring polymer, Lennard-Jones, and copper metallic solid benchmarks.

[Fig. 8(b): bi-bandwidth (MB/sec) vs. message size (1 B to 4 MB); curves: mic0-mic1 Booster, mic0-mic1 OFED/SCIF, mic0-mic1 TCP/SCIF, mic0-remote mic0 EXTOLL.]

More information available at www.deep-project.eu

Conclusion
• Novel communication architecture
• Scales the number of accelerators and CPUs independently
• Host-independent direct accelerator-to-accelerator communication with very low latency
• High-density implementation with promising MPI performance

[Fig. 5 diagram: software stacks. Conventional system: User Level Tools, Accelerator Driver, Kernel PCI API (loadable modules), Hardware. Booster Interface node: User Level Tools, Accelerator Driver, Virtual PCI layer and EXTOLL Driver (built-in modules), Hardware.]

[Fig. 4 diagram: memory mapping. The 00000000h-FFFFFFFFh address spaces of BN0 and BN1 expose the accelerator MMIO ranges of MIC0..MICN; via EXTOLL and the SMFU these windows appear in the BI's address space, so an address x on the BI resolves to a specific MIC's MMIO range on a specific booster node.]
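A small sketch of that address translation, assuming purely for illustration that each booster node's accelerator MMIO window appears at a fixed 4 GiB stride in the BI's address space (the prototype's actual window size and layout may differ):

/* Map a BI-local address x to (booster node, MMIO offset), given a
 * fixed per-node window stride. Window size is an assumption. */
#include <stdint.h>
#include <stdio.h>

#define MMIO_WINDOW 0x100000000ULL   /* 4 GiB per MIC (assumed) */

struct target { uint32_t node; uint64_t offset; };

static struct target translate(uint64_t x)
{
    struct target t = { (uint32_t)(x / MMIO_WINDOW),
                        x % MMIO_WINDOW };
    return t;
}

int main(void)
{
    /* An address in the second window resolves to BN 1. */
    struct target t = translate(0x180001000ULL);
    printf("BN %u, offset 0x%llx\n", t.node,
           (unsigned long long)t.offset);
    return 0;
}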