Computing on the Lunatic Fringe: Exascale Computers and Why You Should Care
Mattan Erez
The University of Texas at Austin
Arch-focused whole-system approach
•Efficiency requirements require crossing layers
•Algorithms are key – compute less, move less, store less
•Proportional systems – minimize waste
•Utilize and improve emerging technologies
•Explore (and act) up and down – programming model to circuits
•Preferably implement at micro-arch/system
Big problems and emerging platforms
•Memory systems
– Capacity, bandwidth, efficiency – impossible to balance
– Adaptive and dynamic management helps
– New technologies to the rescue?
– Opportunities for in-memory computing
•GPUs, supercomputers, clouds, and more
– Throughput-oriented designs are a must
– Centralization trend is interesting
•Reliability and resilience
– More, smaller devices – danger of poor reliability
– Trimmed margins – less room for error
– Hard constraints – efficiency is a must
– Scale exacerbates reliability and resilience concerns
Lots of interesting multi-level projects
•Resilience/reliability
•GPU/CPU
•Memory
•A superscale computer is a high-performance system for solving big problems
•A supercomputer is a high-performance system for solving big cohesive problems
•Simulate
•Analyze
•Predict
•Supercomputers are general
– Not domain specific
•Cloud computers are general
– Algorithms keep changing
•Superscale computers must balance
– Compute
– Storage
– Communication
– Constraints
• OpEx and CapEx
•Super is never super enough
[Chart: sustained FLOP/s over time, 1E+8 to 1E+18, for the #1, #10, and #500 ranked systems vs. idealized VLSI scaling]
•What’s keeping us from exascale?
•The power problem
[Chart: total power (kW) and efficiency (GFLOPS/kW) of top systems, log scale 1 to 1,000,000]
•Solution: build more efficient systems
•Exascale computers are lunacy:
– Crazy big (1M processors, >10MW)
– Need efficiency of embedded devices
– Effective for many domains
– Fully programmable
•Why should you care?
•Harbingers of things to come
•Discovery awaits
•Enable and promote open innovation
•Where does the power go?
•Cooling and infrastructure
– Amazing mechanical and electrical systems
– Getting close to optimal efficiency
17 PFLOP/s, 18,688 nodes, >8 MW, ~200 cabinets, ~400 m² floorspace
•Actual processing
– Arithmetic
– I/O
– Memory
– Control
– Comm.
– Idle/margin
How much of each component?
•The budget: Power ≤ 20MW
•Energy ≤ 20MW / 1 exa-FLOP/s = 20pJ/op
•50 GFLOPS/W sustained
•Best supercomputer today: ~300pJ/op
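Spelling out the budget arithmetic (a worked check of the slide's numbers):

$$E_{op} \le \frac{20\,\mathrm{MW}}{10^{18}\,\mathrm{FLOP/s}} = 2\times 10^{-11}\,\mathrm{J} = 20\,\mathrm{pJ/op} \quad\Longleftrightarrow\quad \frac{10^{18}\,\mathrm{FLOP/s}}{20\times 10^{6}\,\mathrm{W}} = 50\,\mathrm{GFLOPS/W}$$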
Arithmetic: 64-bit floating-point operation (rough estimated numbers)
[Bar chart, 0–50 pJ/op: arithmetic energy and remaining headroom for Today (~40), Scaled (~20), and Researchy (~15) technology]
•Enough headroom?
•Unfortunately, hard tradeoffs (rough numbers, 10nm-era, 20mm die):
– 64-bit DP FLOP: 15 pJ
– 8 kB SRAM access (256-bit): 15 pJ
– 256-bit on-chip buses: 15 pJ to 80–100 pJ depending on distance
– Efficient off-chip link: 500 pJ
– DRAM Rd/Wr (256-bit access): 2 nJ
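Using the rough numbers above, the imbalance is easy to quantify: a 256-bit DRAM access costs more than two orders of magnitude more energy than the arithmetic it feeds:

$$\frac{E_{\mathrm{DRAM\ Rd/Wr}}}{E_{\mathrm{DP\ FLOP}}} \approx \frac{2\,\mathrm{nJ}}{15\,\mathrm{pJ}} \approx 133\times$$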
•Need more headroom
– Minimize waste
Do we care about single-unit performance?
•Must all results be equally precise?
•Must all results be correct?
•Lunacy?
•Relaxed reliability and precision
– Some lunacy: rare easy-to-detect errors + parallelism
– Lunatic fringe: bounded imprecision
– Lunacy: live with real unpredictable errors
[Bar chart, 0–50 pJ/op, rough estimated numbers for illustration purposes: arithmetic energy and headroom for Today (40), Scaled (20), Researchy (15), Some lunacy (12), Lunatic fringe (8), and Lunacy (2); headroom grows from 5 to 8, 12, and 18 pJ as more lunacy is tolerated]
•Bounded Approximate Duplication
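A minimal sketch of the idea (my illustration, not the actual implementation): re-execute each operation at reduced precision and flag a fault only when the cheap duplicate diverges from the full-precision result by more than an application-supplied bound. Names like checked_fma and report_fault are hypothetical.

#include <cmath>
#include <cstdio>

// Hypothetical fault handler; a real system would trigger recovery instead.
void report_fault(double full, float approx) {
    std::fprintf(stderr, "divergence detected: %f vs %f\n", full, approx);
}

// Full-precision multiply-add, checked by a bounded approximate duplicate.
double checked_fma(double a, double b, double c, double rel_bound) {
    double full = a * b + c;                       // primary (64-bit) result
    float approx = (float)a * (float)b + (float)c; // cheap (32-bit) duplicate
    // Errors that flip high-order bits push the two results apart by far
    // more than the precision gap; bounded divergence is treated as noise.
    if (std::fabs(full - (double)approx) > rel_bound * std::fabs(full))
        report_fault(full, approx);
    return full;
}

int main() {
    double r = checked_fma(1.5, 2.0, 0.25, 1e-5);
    std::printf("%f\n", r);
    return 0;
}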
•Supercomputers are general
•Dynamic adaptivity and tuning
– Some programmers crazier than others
– Large effort in error-tolerant algorithms
•Proportionality and overprovisioning
[Chart: energy breakdown – arithmetic, control, memory, margin – for dense, sparse, and latency-bound workloads]
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•It’s all about the algorithm
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•Architecture concepts so far:
– Proportionality
• Adaptivity
• SW-HW co-tuning
– Locality
– Parallelism
•How to control a billion arithmetic units?
•Hierarchy minimizes control cost
– Amortize control decisions
Program → Machine
Teams → Cabinets/modules
Processes → Nodes
Threads → Cores
Instructions → HW units
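A toy sketch of the amortization argument (entirely my illustration; the levels mirror the mapping above): one control decision per level fans out over many children, so decisions grow far more slowly than the number of HW units controlled.

#include <cstdio>

// One scheduling decision per node of the control tree; each decision is
// amortized over `fanout` children (machine -> cabinets -> nodes -> cores -> HW units).
long decisions = 0;

void control(int level, int depth, int fanout) {
    ++decisions;                       // a single decision at this level
    if (level == depth) return;        // leaf: the HW unit executes
    for (int i = 0; i < fanout; ++i)
        control(level + 1, depth, fanout);
}

int main() {
    int depth = 4, fanout = 8;         // 5 levels, 8-way fan-out
    control(0, depth, fanout);
    long units = 1;
    for (int i = 0; i < depth; ++i) units *= fanout;
    // 4681 decisions control 4096 leaf units (~1.14 per unit), and the
    // expensive non-leaf decisions number just 1 + 8 + 64 + 512 = 585.
    std::printf("%ld decisions for %ld HW units\n", decisions, units);
    return 0;
}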
•What does a HW instruction do?
[Figure: instruction granularity – from individual scalar ops, to wide vectors of ops, to fused sequences; amortize/specialize more going from CPU to GPU to embedded]
•Single-thread or bulk-thread
– CPU: latency oriented
– GPU: throughput oriented
Heterogeneity is a necessity → disciplined specialization
•Must balance generality and control cost
•Over-specialization stifles innovation
– Algorithms more important than hardware
Heterogeneity is lunacy
– Programming and tuning extremely tricky
– Diverse and rapidly evolving algorithms
•Disciplined heterogeneity → scarcity of choice
– Throughput oriented
– Latency oriented
– Disciplined reconfigurable accelerators
•Heterogeneous computing already here
– “GPU” for throughput
– CPU for latency
– Abstracted FPGA accelerators
•Heterogeneous computing already here
– Titan node (Cray XK7): NVIDIA GPU + AMD CPU
– Stampede node: Intel Xeon CPU + Xeon Phi
[Images: XK7 node, Kepler die, Opteron die, Xeon Phi card and die, Sandy Bridge die, Stampede node – (c) AMD, Cray, Intel, NVIDIA, XSEDE]
•Choose most efficient core
[Chart: energy breakdown – arithmetic, control, memory, margin – per core type]
•Generalize the throughput core
•Synchronization may impair execution
– Predict when necessary → robust optimization
– Adaptive compaction (compaction barrier)
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•Architecture so far:
– Proportionality
• Adaptivity
• Heterogeneity
• SW-HW co-tuning
– Locality
– Parallelism
– Hierarchy
Heterogeneity is lunacy
– Programming and tuning extremely tricky
– Diverse and rapidly evolving algorithms
•How can we program these machines?
Hierarchical languages restore some sanity
– Abstract key tuning knobs
• Parallelism
• Locality
• Hierarchy
• Throughput
Hierarchical languages restore some sanity
– CUDA and OpenCL
– Sequoia (Stanford)
– Habanero (Rice)
– Phalanx (NVIDIA)
– Legion (Stanford)
– Working into X10, Chapel
– …
•Domain specific languages even better
•A counter example: D. E. Shaw Research Anton
[Images: Anton package, a molecular-dynamics figure, and the Anton system – (c) D. E. Shaw Research]
•Another counter example?
•Microsoft Catapult
– Disciplined reconfigurability for the cloud
– Distributed FPGA accelerator
• Within form-factor and cost constraints
– Architected interfaces and compiler
• Compiler for software, not (just) hardware
[Images: datacenter; Catapult diagram from the ISCA 2014 paper]
Memory
– Cheap
– Efficient
– Large
– Fast
– Reliable
•Fast + large → hierarchy
[Figure: 3D-stacked memory with TSVs on a silicon interposer beside the processor, backed by additional off-package memory – (c) Micron, Intel]
•Large + efficient → heterogeneity
[Figure: stacked memory on an interposer combined with DRAM + NVRAM – (c) Micron, Intel]
•Heterogeneous + hierarchical → big mess
•Research just starting
– New mechanisms needed
– Very interesting tradeoffs with reliability
•Fast + efficient → control hierarchy + parallelism
[Figures: DRAM access, step by step – access a cell; access many cells in parallel; drive many pins in parallel; then many pins in parallel over multiple cycles to access even more cells]
•Hierarchy + parallelism → potential waste
•Is fine-grained access better?
– Wasteful when all data is needed
[Figure: coarse-grained (CG) access amortizes one ECC word over data words D0–D7; fine-grained (FG) access pairs each data word Di with its own ECC word Ei]
•Dynamically adapt granularity → best of both worlds
– Co-scheduling CG/FG with split CG
– Sector cache (8 8B sectors per line)
– Sub-ranked DRAM with 2X ABUS: support both CG and FG
– Locality predictor (SLP) – see the sketch below
[Figure: processor with cores, L1 I/D, L2, last-level cache; memory controller with the locality predictor in front of sub-ranked DRAM]
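A toy sketch of the decision the predictor makes (my own simplification, not the actual hardware; class and field names are invented): per-region counters track how much of each fetched line was actually consumed, and each miss is issued as a coarse- or fine-grained DRAM access accordingly.

#include <cstdint>
#include <unordered_map>

enum class Granularity { Coarse, Fine };

// Tracks, per memory region, how much of each fetched line was actually used.
// Illustrative software structure; real predictors are small SRAM tables.
class LocalityPredictor {
    struct Stats { uint32_t sectors_used = 0; uint32_t lines_fetched = 0; };
    std::unordered_map<uint64_t, Stats> regions_;
    static constexpr uint32_t kSectorsPerLine = 8;  // 8 x 8B sectors per 64B line

public:
    // On eviction, record how many of the line's 8 sectors were touched.
    void train(uint64_t region, uint32_t sectors_touched) {
        auto& s = regions_[region];
        s.sectors_used += sectors_touched;
        s.lines_fetched += 1;
    }
    // Predict CG when most sectors of past lines were used, FG otherwise.
    Granularity predict(uint64_t region) const {
        auto it = regions_.find(region);
        if (it == regions_.end()) return Granularity::Coarse;  // default
        const auto& s = it->second;
        double avg = double(s.sectors_used) / (s.lines_fetched * kSectorsPerLine);
        return avg > 0.5 ? Granularity::Coarse : Granularity::Fine;
    }
};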
•Dynamically adapt granularity
[Chart: system throughput (0–8) for mcf, omnetpp, mst, em3d, canneal, SSCA2, linked list, and gups (x8 each) plus MIX-A through MIX-D, comparing CG+ECC, FG+ECC, and AG+ECC]
•Cost + … → poor reliability
•Efficiency + … → poor reliability
•New + … → poor reliability
•Compensate with error protection
– ECC
•Compensate with proportional protection
•Adaptive reliability
– Adapt the mechanism
– Adapt the error rate
• ↓ precision (stochastic precision)
•Adaptive reliability
– By type
• Application guided
– By location
• Machine-state guided
•Example: Virtualized ECC
– Adapt the mechanism
[Figure: virtual pages x, y, z (protection levels low, medium, high) map to physical page frames x, y, z; physical memory stores data plus ECC]
[Figure: a second mapping from physical frames to ECC pages; pages requiring stronger ECC map to dedicated ECC pages y and z in physical memory]
[Figure: data stored with a tier-one code (T1EC); tier-two codes (T2EC) for chipkill and double chipkill placed in ECC pages according to each page’s protection level]
Two-tiered ECC
– Write: T1EC/T2EC encoding; data + T1EC stored together, T2EC stored separately
– Read: only T1EC decoding on the common path
[Figure: same flow with virtualized T2EC – a T2EC mapping places the second-tier codes in cacheable memory]
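A schematic of the write path in software terms (illustrative only; the class, codec functions, and mapping structure are my inventions, not the Virtualized ECC hardware): the first tier is computed and stored inline with the data, while the second tier is written through a per-frame mapping into ordinary, cacheable memory.

#include <cstdint>
#include <unordered_map>

// Illustrative stand-ins for the real error-correcting codecs.
uint16_t encode_t1ec(uint64_t data) { return uint16_t(data ^ (data >> 16)); }
uint32_t encode_t2ec(uint64_t data) { return uint32_t(data * 0x9E3779B9u); }

struct MemoryLine { uint64_t data; uint16_t t1ec; };

class VirtualizedECC {
    std::unordered_map<uint64_t, MemoryLine> memory_;     // data + inline T1EC
    std::unordered_map<uint64_t, uint64_t> t2ec_map_;     // frame -> T2EC page
    std::unordered_map<uint64_t, uint32_t> t2ec_storage_; // cacheable T2EC space

public:
    void map_frame(uint64_t frame, uint64_t t2ec_page) {
        t2ec_map_[frame] = t2ec_page;                     // set per-frame mapping
    }
    void write(uint64_t addr, uint64_t data) {
        memory_[addr] = {data, encode_t1ec(data)};        // tier 1: inline
        uint64_t t2_base = t2ec_map_[addr >> 12];         // per-frame redirection
        t2ec_storage_[t2_base + (addr & 0xFFF)] = encode_t2ec(data); // tier 2
    }
    // Common-case read touches only the inline T1EC; the T2EC is fetched
    // (through the cache hierarchy) only when T1EC flags an error.
    uint64_t read(uint64_t addr, bool* t2_needed) {
        const MemoryLine& line = memory_.at(addr);
        *t2_needed = (encode_t1ec(line.data) != line.t1ec);
        return line.data;
    }
};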
•Caching works great
[Chart: normalized execution time, 0.98–1.02, for chipkill detect, chipkill correct, and 2× chipkill correct]
•Bonus: flexibility of ECC
[Chart: normalized energy-delay product, 0.75–1.00, for chipkill detect, chipkill correct, and 2× chipkill correct]
•Can we do even better?
– Hide second-tier
FrugalECC = Compression + VECC
Adapt resilience scheme or adapt storage?
Adapt compression scheme too! Coverage-oriented Compression (CoC)
Experiments promising:
– Match or exceed reliability
– Improve power
– Minimal impact on performance
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•Architecture so far:
– Proportionality
• Adaptivity
• SW-HW co-tuning
• Heterogeneity
– Locality
– Parallelism
– Hierarchy
•The reliability challenge
[Chart: projected MTTI (hours, 1–1000, log scale) vs. year and expected performance – 2013 (15PF), 2015 (40PF), 2017 (130PF), 2019 (450PF), 2021 (1500PF) – broken out by hard, soft, intermittent, and SW errors, plus overall]
Detect → Contain → Repair → Recover
•Today’s techniques not good enough
– Hardware support expensive
– Checkpoint-restart doesn’t scale
•Exascale resilience must be:
– Proportional
– Parallel
– Adaptive
– Hierarchical
– Analyzable
•Containment domains
– Embed resilience within application
– Match system hierarchy and characteristics
– Analyzable abstraction consistent across layers
[Figure: root CD with child CDs; CDs don’t communicate erroneous data; CDs have means to recover]
Machine → hardware abstraction layer → microkernel → runtime library interface
[Figure: stack combining an efficiency-oriented programming model (Phalanx C++ program), a resiliency model (CD annotations), a resiliency interface (CD API with CD control and persistence), and an error-reporting architecture (ECC, status)]

Phalanx C++ program:

int main(int argc, char **argv) {
  main_task here = phalanx::initialize(argc, argv);
  … Create test arrays here …
  // Launch kernel on default CPU (“host”)
  openmp_event e1 = async(here, here.processor(), n)
                         (host_saxpy, 2.0f, host_x, host_y);
  // Launch kernel on default GPU (“device”)
  cuda_event e2 = async(here, here.cuda_gpu(0), n)
                       (device_saxpy, 2.0f, device_x, device_y);
  wait(here, e1 & e2);
  return 0;
}
Mapping example: SpMV

void task<inner> SpMV(in M, in Vi, out Ri) {
  forall(…) reduce(…)
    SpMV(M[…], Vi[…], Ri[…]);
}

void task<leaf> SpMV(…) {
  for r = 0..N
    for c = rowS[r]..rowS[r+1] {
      resi[r] += data[c] * Vi[cIdx[c]];
      prevC = c;
    }
}

[Figure: matrix M and vector V]
Mapping example: SpMV
[Figure: M partitioned into M00, M01, M10, M11 and V into V0, V1; blocks distributed to 4 nodes, with V0 and V1 replicated across the nodes that need them]
•Mapping example: SpMV
[Figure: parent CD preserves M and V, detects, and recovers; four child CDs – (M00,V0), (M10,V0), (M01,V1), (M11,V1) – each with its own preserve, detect, and recover]
void task<inner> SpMV(in M, in Vi, out Ri) {
  cd = begin(parentCD);
  preserve_via_copy(cd, matrix, …);
  forall(…) reduce(…)
    SpMV(M[…], Vi[…], Ri[…]);
  complete(cd);
}

void task<leaf> SpMV(…) {
  cd = begin(parentCD);
  preserve_via_copy(cd, matrix, …);
  preserve_via_parent(cd, veci, …);
  for r = 0..N
    for c = rowS[r]..rowS[r+1] {
      resi[r] += data[c] * Vi[cIdx[c]];
      check {fault<fail>(c > prevC);}
      prevC = c;
    }
  complete(cd);
}

API: begin, configure, preserve, check, complete
•SpMV preservation tuning
– Natural redundancy
– Hierarchy
[Figure: partitioned matrix M and vector V distributed to 4 nodes; V0 and V1 naturally replicated across nodes]
•Concise abstraction for complex behavior
– Preservation alternatives behind one call: local copy or regen, sibling, parent (unchanged)
Recovery: concurrent, hierarchical, adaptive
[Figure: timeline of compute, preserve, and detect phases showing re-execution overhead after a fault]
•Analyzable
– Application model (tree)
– Overhead model
– Fault model
[Figure: execution-time model]
•$q[0,m,n] = (1-p_c)^{mn}$
– Probability that all child CDs experienced no failures
•$q[x,m,n] = \left(\sum_{i=0}^{x}\binom{i+m-1}{i}\,p_c^{\,i}\,(1-p_c)^{m}\right)^{n}$
– Probability that all child CDs experienced at most x failures in the m serial CDs
•$d[0,m,n] = q[0,m,n]$
– Probability that all child CDs experienced exactly 0 failures
•$d[x,m,n] = q[x,m,n] - q[x-1,m,n]$
– Probability that the sibling with the most failures experiences exactly x failures
•$T_{parent}[m,n] = \sum_{i=0}^{\infty}\,(i+m)\,T_{cd}\,d[i,m,n]$
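To make the model concrete, here is a small numerical evaluation of these formulas (my own code, based on the reconstruction above; the infinite sum is truncated once the tail probability is negligible):

#include <cmath>
#include <cstdio>

// q[x,m,n]: probability that every one of n sibling chains of m serial CDs
// completes with at most x failures (failure probability p_c per CD run).
double q(int x, int m, int n, double pc) {
    double sum = 0.0, binom = 1.0;              // C(m-1,0) = 1
    for (int i = 0; i <= x; ++i) {
        sum += binom * std::pow(pc, i) * std::pow(1.0 - pc, m);
        binom *= double(i + m) / (i + 1);       // C(i+m,i+1) from C(i+m-1,i)
    }
    return std::pow(sum, n);
}

// Expected parent time: each of the i extra failures re-executes one CD.
double t_parent(int m, int n, double pc, double t_cd) {
    double expected = 0.0;
    for (int i = 0;; ++i) {
        double d = q(i, m, n, pc) - (i ? q(i - 1, m, n, pc) : 0.0);
        expected += (i + m) * t_cd * d;
        if (i > m && d < 1e-12) break;          // truncate negligible tail
    }
    return expected;
}

int main() {
    // Illustrative parameters: 16 siblings, 10 serial CDs, 1% failure rate.
    std::printf("E[T_parent] = %.3f (in units of T_cd)\n",
                t_parent(10, 16, 0.01, 1.0));
    return 0;
}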
•Heterogeneous CDs
•Asymmetric preservation and restoration
[Chart: execution and re-execution time across parallel domains – analyzable]
•Analyzable?
[Chart: execution time]
•CDs are scalable
[Charts: performance efficiency vs. peak system performance (2.5PF–2.5EF) for NT (CDs vs. h-CPR/g-CPR at 80%), SpMV (at 50%), and HPCCG (at 10%)]
•CDs are proportional and efficient
[Charts: energy overhead (0–20%) vs. peak system performance (2.5PF–2.5EF) for NT (h-CPR/g-CPR at 80%), SpMV (at 50%), and HPCCG (at 10%)]
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•It’s all about the algorithm
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•Architecture so far:
– Proportionality
• Adaptivity
• SW-HW co-tuning + analyzability
• Heterogeneity
– Locality
– Parallelism
– Hierarchy
Arithmetic · Control · Memory · Reliability · Programming
•Exascale computers are lunacy
•Exascale computers are lunacy science
•Harbingers of things to come
•Discovery awaits
•Enable and promote open innovation
•Math + Systems + Engineering + Technology