Vector Engine Processor of NEC’s Brand-New Supercomputer SX-Aurora TSUBASA

Yohei Yamada, NEC Corporation

Shintaro Momose, Ph.D., NEC Deutschland GmbH

21 August 2018

© NEC Corporation 2018

Agenda

• Introduction

• SX-Aurora TSUBASA

• Vector Engine

• Benchmarks

• Conclusion

Introduction

History of NEC’s Vector Supercomputer

35 years of experience for high sustained performance

SX-Aurora TSUBASA

SW Environment

• x86 / Linux OS

• Fortran/C/C++ standard programming

• Automatic vectorization and parallelization by proven vector compiler

Hardware

• Vector Engine (VE) + x86 node

• High memory bandwidth

• Flexible configuration

Aurora architecture: Aurora = VH + VE

• Vector computing in a standard environment

• High sustained performance vector processing

• Vector capability is transparently provided on x86/Linux

• Applications are entirely executed on VE side

[Figure: Aurora architecture. Labels: Application on the Vector Engine; Linux and VEOS on the x86 server; system calls; PCIe link between the two.]

Scalable Vector Supercomputer

• A500 series: Supercomputer Model. For large scale configurations; direct liquid cooling (DLC) with 40°C/104°F water; 64 VEs and up.

• A300 series: Rack Mount Model. Flexible configuration; air cooled; 2, 4, or 8 VEs.

• A100 series: Tower Model. For developer/programmer; personal supercomputer; 1 VE.

Vector Engine Card

Two types of packages

• Air Cooled Card

– Passive Cooling Type: for server

– Active Cooling Type: for tower/workstation

• Water Cooled Card

– Direct Liquid Cooling Type: for supercomputer, 40°C/104°F water

Direct liquid cooling; hot water cooling available

Vector Engine

Vector Engine Card Implementation

[Figure: Vector Engine card. Labels: vector engine processor module, voltage regulator, AUX power inlet, PCIe Gen3 x16 interface.]

▌Standard PCIe card

PCIe Gen3 x16 interface

Full-length full-height card

Dual slot

<300W power

[Figure: Power consumption under benchmark workloads (DGEMM, STREAM, HPCG). Vertical axis: power in watts, 0 to 200.]

Vector Engine Processor Module

[Figure: VE processor module. Labels: VE processor, HBM2, silicon interposer, organic substrate, stiffener.]

▌2.5D implementation

A VE processor and six 8Hi or 4Hi HBM2 modules on a silicon interposer

Lidless package to minimize thermal resistance

Package size: 60mm x 60mm

Interposer size: 32.5mm x 38mm

VE processor size: 15mm x 33mm


Vector Engine Processor Overview

▌ Components

8 vector cores

16MB LLC

2D mesh network on chip

DMA engine

6 HBM2 controllers and interfaces

PCI Express Gen3 x16 interface

▌ Technology

16nm FinFET process

[Figure: VE processor block diagram. Labels: eight cores, two 8MB LLC blocks, six HBM2 interfaces each attached to an HBM2 module, DMA engine, PCIe interface.]

▌ Specs

Core frequency: 1.6GHz

Core performance: 307GF (DP), 614GF (SP)

CPU performance: 2.45TF (DP), 4.91TF (SP)

Memory bandwidth: 1.2TB/s

Memory capacity: 24/48GB
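The headline specs can be cross-checked against each other with a little arithmetic. The sketch below uses only figures quoted in the deck (eight cores, 307.2GF per core, 1.2TB/s, six HBM2 modules, 24/48GB). The variable names are illustrative, and the bytes-per-flop ratio is a derived quantity rather than one the slides state.

```python
# Cross-check of the Vector Engine headline specs (values taken from the slides).
CORES = 8
CORE_DP_GFLOPS = 307.2      # per-core double precision peak
MEM_BW_GBS = 1200.0         # 1.2 TB/s total memory bandwidth
HBM2_MODULES = 6
CAPACITIES_GB = (24, 48)    # the two memory configurations listed

# Chip-level peak is the per-core peak times the core count.
chip_dp_tflops = CORES * CORE_DP_GFLOPS / 1000.0

# Bandwidth and capacity contributed by each HBM2 module.
bw_per_module_gbs = MEM_BW_GBS / HBM2_MODULES
gb_per_module = [c / HBM2_MODULES for c in CAPACITIES_GB]

# Derived ratio: bytes of memory traffic available per DP flop at peak.
bytes_per_flop = MEM_BW_GBS / (CORES * CORE_DP_GFLOPS)

print(f"chip peak: {chip_dp_tflops:.2f} TF (DP)")
print(f"per-module bandwidth: {bw_per_module_gbs:.0f} GB/s")
print(f"per-module capacity: {gb_per_module} GB")
print(f"bytes per flop: {bytes_per_flop:.2f}")
```

The chip peak comes out at about 2.46TF, which the slide lists as 2.45TF, and each HBM2 module accounts for 200GB/s of the total.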

Vector Core

▌Vector Processing Unit (VPU)

Powerful computing capability

• 307.2GFLOPS DP / 614.4GFLOPS SP performance

High bandwidth memory access

• 409.6GB/sec Load and Store

▌Scalar Processing Unit (SPU)

Provides the basic functionality as a processor

• Fetch, decode, branch, add, exception handling, etc…

Controls the status of the complete core

▌Address translation and data forwarding crossbar

To support contiguous vector memory access

• 16 elements/cycle vector address generation and translation, 17 requests/cycle issuing

• 409.6GB/sec load and 409.6GB/sec store data forwarding
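The per-core figures above follow from the clock rate and the datapath widths quoted elsewhere in the deck. The sketch below is one reading of that arithmetic, under these assumptions: a 1.6GHz clock, three FMA-capable pipelines of 32 lanes each (the "96 FMAs" on the Vector Processing Unit slide), two floating point operations per FMA, and 32 eight-byte elements moved per cycle in each direction. The constant names are illustrative.

```python
# Per-core peak rates derived from clock and datapath width (assumptions in the text above).
CLOCK_MHZ = 1600            # 1.6 GHz
FMA_PIPES = 3               # FMA0, FMA1, ALU0/FMA2
LANES_PER_PIPE = 32
FLOPS_PER_FMA = 2           # one multiply plus one add
ELEMENTS_PER_CYCLE = 32
BYTES_PER_ELEMENT = 8       # double precision word

fma_units = FMA_PIPES * LANES_PER_PIPE

# Double precision peak; packed 32bit x 2 data doubles it for single precision.
dp_gflops = CLOCK_MHZ * fma_units * FLOPS_PER_FMA / 1000
sp_gflops = 2 * dp_gflops

# Load (or store) bandwidth between the vector unit and the memory network.
load_gbs = CLOCK_MHZ * ELEMENTS_PER_CYCLE * BYTES_PER_ELEMENT / 1000

print(fma_units, dp_gflops, sp_gflops, load_gbs)
```

Under these assumptions the results match the slide: 307.2GFLOPS DP, 614.4GFLOPS SP, and 409.6GB/s each for load and store.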

[Figure: Vector core block diagram. Labels: Scalar Processing Unit (64 scalar registers); Vector Processing Unit (64 vector registers of 256 words, 16 vector mask registers); address generation and translation with Address Translation Buffer; request/reply crossbar with 16 RTR blocks; memory network. Rates shown: 1 inst./cycle, 32 elements/cycle, 17 requests/cycle.]

Vector Processing Unit

Four pipelines, each 32-way parallel

– FMA0: FP fused multiply-add, integer multiply

– FMA1: FP fused multiply-add, integer multiply

– ALU0/FMA2: Integer add, multiply, mask, FP FMA

– ALU1/Store: Integer add, store, complex operation

Doubled SP performance by 32bit x 2 packed vector data support

Vector register (VR) renaming with 256 physical VRs

• 64 architectural VRs are renamed

– Enhanced preload capability

– Avoidance of WAR and WAW dependencies

OoO scheduling

Dedicated complex operation pipeline to prevent pipeline stall

• Vector sum, divide, mask population count, etc.

Total: 96 FMAs
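To see why renaming removes WAR and WAW dependencies, consider the minimal sketch below. It models only the mapping table: 64 architectural vector registers backed by 256 physical ones, as stated on the slide. The class name `RenameTable`, its methods, and the simple allocation policy that never recycles registers are assumptions made for illustration; the deck does not describe the actual hardware policy.

```python
class RenameTable:
    """Toy model of vector register renaming (illustrative, not NEC's design)."""

    def __init__(self, n_arch=64, n_phys=256):
        # Initially architectural register i lives in physical register i.
        self.mapping = list(range(n_arch))
        # The remaining physical registers are free for new destinations.
        self.free = list(range(n_arch, n_phys))

    def read(self, arch):
        """A source operand uses the physical register currently mapped to arch."""
        return self.mapping[arch]

    def write(self, arch):
        """Every new destination is given a fresh physical register."""
        phys = self.free.pop(0)
        self.mapping[arch] = phys
        return phys


table = RenameTable()

# i1: v2 = v1 + v0    reads v1, writes v2
src_i1 = table.read(1)
dst_i1 = table.write(2)

# i2: v1 = load ...   WAR with i1 on v1: lands in a different physical register
dst_i2 = table.write(1)

# i3: v2 = load ...   WAW with i1 on v2: again a different physical register
dst_i3 = table.write(2)

print(src_i1, dst_i1, dst_i2, dst_i3)
```

Because i2 and i3 write physical registers that i1 neither reads nor writes, they no longer have to wait for i1, which is what lets loads be issued early. A real design also returns physical registers to the free list once no in-flight instruction needs them; that step is omitted here.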

Scalar Processing Unit

General enhancements

• 4 instructions / cycle fetch and decode

• Sophisticated branch prediction

• OoO scheduling

• 8-level speculative execution

• Four scalar instruction pipes

• Two 32kB L1 caches + unified 256kB L2 cache

• Hardware prefetch

Support for contiguous vector operation

• Dedicated vector instruction pipe

• 16 elements / cycle coherency control for vector store

[Figure: Scalar Processing Unit block diagram. Labels: fetch, 32kB L1I cache, branch prediction, predecode, decode, unified scheduler, instruction pipes (memory, branch, vector, two integer/FP), 32kB L1O cache, prefetch controller, scalar registers (64 architectural + 48 renaming), 256kB L2 cache, load miss queue, vector coherence control. Signals: vector instruction, vector address, memory request, memory reply.]

Memory Subsystem

▌High bandwidth

409.6GB/s x 2 (load and store) core bandwidth

Over 3TB/s LLC bandwidth

1.2TB/s memory bandwidth

▌Caches

Scalar L1/L2 caches in each core

16MB shared LLC

▌Two memory networks

2D mesh NoC for core memory access

Ring bus for DMA and PCIe traffic

▌DMA engine

Used by both vector cores and x86 node

Can access VE memory, VE registers, and x86 memory
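The three bandwidth tiers can be related with a short calculation. The deck does not say how the LLC figure is derived; the product below is simply one reading that is consistent with the quoted numbers, using eight cores at 409.6GB/s load bandwidth each. The variable names are illustrative.

```python
# Relating core, LLC, and memory bandwidth (values from the slides).
CORES = 8
CORE_LOAD_BW_GBS = 409.6    # per-core load bandwidth
MEM_BW_GBS = 1200.0         # HBM2 memory bandwidth

# If every core streams loads at full rate, the LLC must supply this much.
aggregate_core_load_gbs = CORES * CORE_LOAD_BW_GBS

# How much more bandwidth the cores can draw than HBM2 alone can deliver.
demand_to_memory_ratio = aggregate_core_load_gbs / MEM_BW_GBS

print(aggregate_core_load_gbs, round(demand_to_memory_ratio, 2))
```

The aggregate comes to about 3.3TB/s, in line with the "over 3TB/s" LLC figure, and is roughly 2.7 times what the memory itself provides. Data reused from the LLC can therefore be consumed faster than data streamed from HBM2.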

Network on chip (NoC)

▌2D mesh network

Maximize bandwidth with minimal wiring

Minimizing data transfer distance

16 layered mesh

▌Deadlock avoidance

Dimension-ordered routing

Virtual channels for request and reply

▌Adaptive flow control

▌Age based QoS control
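Dimension-ordered routing can be sketched in a few lines. The example below is a generic illustration on a hypothetical mesh with (x, y) node coordinates, routing along X first and then Y; the deck says only "dimension-ordered", so the X-then-Y order, the coordinates, and the function name `xy_route` are all assumptions.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst, moving along X first, then Y."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Phase 1: correct the X coordinate one hop at a time.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Phase 2: correct the Y coordinate; the packet never turns back into X.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
print(route)
```

Since a packet never turns from the Y dimension back into X, the channel dependencies cannot form a cycle, which is why this routing scheme avoids deadlock on a mesh. Separate virtual channels for requests and replies address a different problem: a reply must never be blocked behind the requests that are waiting for it.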

[Figure: 2D mesh network on chip. Labels: cores and LLC connected through crossbars; the 16 mesh layers are labeled L0 to L15.]

Last Level Cache (LLC)

Memory side cache

• Avoiding massive snoop traffic

• Increasing efficiency of indirect memory access

16MB, write back

Inclusive of L1 and L2

High bandwidth design

• 128 banks, in total more than 3TB/s bandwidth

Auto data scrubbing

Assignable data buffer feature

• Priority of data can be controlled by a flag for vector memory access instructions

[Figure: LLC organization. Labels: NoC; LLC #0 to LLC #7; HBM2 #0 to HBM2 #5. Bandwidth labels: 200GB/s, total 3TB/s.]

Benchmarks

Benchmark conditions

• SX-Aurora TSUBASA: SX-Aurora TSUBASA A500 model

• Intel Xeon: Intel Xeon Gold 6142, 2 sockets, 192GB DDR4-2666

• NVIDIA Tesla V100: Intel Xeon CPU E5-2630v4, 2 sockets, 128GB DDR4-2400, NVIDIA Tesla V100 16GB

Basic Performance

[Figure: DGEMM (floating point performance). Bars: performance and performance per watt, normalized to Xeon = 1, for SX-Aurora TSUBASA, Intel Xeon Gold 6142 2P, and NVIDIA Tesla V100.]

[Figure: STREAM Triad (memory bandwidth). Bars: bandwidth and bandwidth per watt, normalized to Xeon = 1, for the same three systems.]

▌Floating point calculation and memory bandwidth

Industry leading memory access performance and efficiency

Sufficient compute capability for memory-intensive workloads
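A rough bound shows why STREAM Triad is limited by memory bandwidth rather than by floating point capability. The sketch below implements the Triad kernel in plain Python for clarity and then applies a simple bandwidth bound using the peak figures quoted in the deck (1.2TB/s and about 2.46TF DP). It counts only the two loads and one store per element and ignores any extra traffic a real memory system may generate, so the result is an estimate, not a measurement.

```python
def triad(b, c, q):
    """STREAM Triad kernel: a[i] = b[i] + q * c[i]."""
    return [bi + q * ci for bi, ci in zip(b, c)]


a = triad([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], 0.5)

# Traffic and work per loop iteration, assuming 8-byte doubles.
BYTES_PER_ITER = 3 * 8      # load b[i], load c[i], store a[i]
FLOPS_PER_ITER = 2          # one multiply and one add

MEM_BW_GBS = 1200.0         # peak memory bandwidth from the slides
PEAK_DP_GFLOPS = 2457.6     # 8 cores x 307.2 GFLOPS

# Highest flop rate the memory system can feed for this kernel.
bandwidth_bound_gflops = MEM_BW_GBS * FLOPS_PER_ITER / BYTES_PER_ITER
fraction_of_peak = bandwidth_bound_gflops / PEAK_DP_GFLOPS

print(a, bandwidth_bound_gflops, round(fraction_of_peak, 3))
```

At peak bandwidth the kernel can sustain about 100GFLOPS, roughly 4% of the DP peak, so for workloads of this kind the memory system is the limiting resource and the available compute is not the constraint.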

Performance on General Benchmarks

[Figure: HPCG. Bars: performance and performance per watt, normalized to Xeon = 1, for SX-Aurora TSUBASA, Intel Xeon Gold 6142 2P, and NVIDIA Tesla V100.]

[Figure: Himeno benchmark. Bars: performance and performance per watt, normalized to Xeon = 1, for the same three systems.]

▌HPCG and Himeno benchmark (Poisson equation solver)

Competitive performance and power efficiency available using standard programming paradigms

Performance on Machine Learning

[Figure: Speedup of Frovedis on SX-Aurora TSUBASA over Spark on Xeon Gold 6142 2P (Spark/Xeon = 1). Workloads: LR, K-means, SVD. Bar values as extracted, in that order: 107.6, 52.3, 39.4.]

▌Statistical machine learning

Workloads

• Web ads optimization (Logistic regression)

• Document clustering (K-means)

• Recommendation (Singular value decomposition)

NEC’s Frovedis™ framework for AI/BigData processing

• Apache Spark MLlib compatible API

• Open source

– https://github.com/frovedis

Summary

▌SX-Aurora TSUBASA

A new product line of vector supercomputers based on Aurora architecture

Vector capability is provided in a standard x86/Linux environment

▌Vector Engine

High memory bandwidth from a six-HBM2 configuration

Enhancements of the vector microarchitecture to provide high sustained performance and power efficiency

▌Benchmarks

Very competitive performance and power efficiency using standard programming paradigms

Outstanding performance on statistical machine learning workloads with Frovedis framework

Thank you!
