122
Computing on the Lunatic Fringe: Exascale Computers and Why You Should Care Mattan Erez The University of Texas at Austin 1

Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Computing on the Lunatic Fringe: Exascale Computers and Why You Should Care

Mattan Erez

The University of Texas at Austin

1

Page 2: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Arch-focused whole-system approach

•Efficiency requirements require crossing layers

•Algorithms are key– Compute less, move less, store less

•Proportional systems– Minimize waste

•Utilize and improve emerging technologies

•Explore (and act) up and down– Programming model to circuits

•Preferably implement at micro-arch/system

(C) Mattan Erez 2

Page 3: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Big problems and emerging platforms

•Memory systems– Capacity, bandwidth, efficiency – impossible to balance

– Adaptive and dynamic management helps

– New technologies to the rescue?

– Opportunities for in-memory computing

•GPUs, supercomputers, clouds, and more– Throughput oriented designs are a must

– Centralization trend is interesting

•Reliability and resilience– More, smaller devices – danger of poor reliability

– Trimmed margins – less room for error

– Hard constraints – efficiency is a must

– Scale exacerbates reliability and resilience concerns

(C) Mattan Erez 3

Page 4: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Lots of interesting multi-level projects

Resilience/Reliability

GPU/CPUMemory

(C) Mattan Erez 4

Page 5: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•A superscale computer is a high-performance system for solving big problems

(c) Mattan Erez 5

Page 6: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•A supercomputer is a high-performance system for solving big cohesive problems

(c) Mattan Erez 6

Page 7: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•A supercomputer is a high-performance system for solving big cohesive problems

(c) Mattan Erez 7

Page 8: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Simulate

• Analyze

• Predict

(c) Mattan Erez 8

Page 9: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Simulate

• Analyze

• Predict

(c) Mattan Erez 9

Page 10: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Supercomputers are general

– Not domain specific

•Cloud computers are general

– Algorithms keep changing

(c) Mattan Erez 10

Page 11: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Superscale computers must balance

– Compute

– Storage

– Communication

– Constraints

• OpEx and CapEx

11(c) Mattan Erez

Page 12: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Super is never super enough

(c) Mattan Erez 12

Page 13: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

1E+8

1E+9

1E+10

1E+11

1E+12

1E+13

1E+14

1E+15

1E+16

1E+17

1E+18

Su

sta

ine

d F

LOP

/s

Num. 1

Num. 10

Num.500

Idealized VLSI

(c) Mattan Erez 13

Page 14: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•What’s keeping us from exascale?

(c) Mattan Erez 14

Page 15: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•The power problem

1

10

100

1000

10000

100000

1000000 Power

Efficiency

Tota

l P

ow

er

[kW

]

Eff

icie

nc

y [

GFLO

PS/k

W]

(c) Mattan Erez 15

Page 16: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Solution: build more efficient systems

(c) Mattan Erez 16

Page 17: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Exascale computers are a lunacy:

– Crazy big (1M processors, >10MW)

– Need efficiency of embedded devices

– Effective for many domains

– Fully programmable

17(c) Mattan Erez

Page 18: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Why should you care?

•Harbingers of things to come

•Discovery awaits

•Enable and promote open innovation

18(c) Mattan Erez

Page 19: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Where does the power go?

(c) Mattan Erez 19

Page 20: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Cooling and infrastructure

– Amazing mechanical and electrical systems

– Getting close to optimal efficiency

17 PFLOP/s

18,688 nodes

>8 MW

~200 cabinets

~400 m2 floorspace

(c) Mattan Erez 20

Page 21: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Actual processing

+

×√ x2

Arithmetic…100101001

0100…

I/O

MemoryControl

Comm.

Idle/margin

How much of each component?

(c) Mattan Erez 21

Page 22: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•The budget: Power ≤ 20MW

(c) Mattan Erez 22

Page 23: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

• 20MW / 1 exa-FLOP/sEnergy ≤

20pJ/op

• 50 GFLOPs/W sustained

(c) Mattan Erez 23

Page 24: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

• 20MW / 1 exa-FLOP/sEnergy ≤

20pJ/op

• 50 GFLOPs/W sustained

• Best supercomputer today: ~300pJ/op

(c) Mattan Erez 24

Page 25: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Arithmetic

64-bit floating-point operation

Rough estimated numbers

+

×√ x2

40

2015

5

0

10

20

30

40

50

Today Scaled Researchy

Arith. Headroom

(c) Mattan Erez 25

260

Page 26: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Enough headroom?

(c) Mattan Erez 26

Page 27: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Unfortunately, hard tradeoffs

20mm

64-bit DP

15pJ

15 pJ 80 pJ

100 pJ

500 pJEfficient

off-chip

link

10nm

256-bit

buses

2 nJDRAM

Rd/Wr

256-bit access

8 kB SRAM

15 pJ

(c) Mattan Erez 27

Page 28: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Need more headroom

– Minimize waste

+

×√ x2

(c) Mattan Erez 28

Page 29: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Do we care about single-unit performance?

•Must all results be equally precise?

•Must all results be correct?

•Lunacy?

+

×√ x2

(c) Mattan Erez 29

Page 30: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Relaxed reliability and precision– Some lunacy

(rare easy-to-detect errors + parallelism)

– Lunatic fringe: bounded imprecision

– Lunacy: live with real unpredictable errors

40

2015 12 8

2

5 8 1218

0

10

20

30

40

50

Today Scaled Researchy Some

lunacy

Lunatic

fringe

Lunacy

Arith. Headroom

Rough estimated numbers for illustration purposes

+

×√ x2

(c) Mattan Erez 30

Page 31: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Bounded Approximate Duplication +

×√ x2

(c) Mattan Erez 31

Page 32: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Bounded Approximate Duplication +

×√ x2

(c) Mattan Erez 32

Page 33: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Supercomputers are general

• Dynamic adaptivity and tuning

– Some programmers crazier than others

– Large effort in error-tolerant algorithms

+

×√ x2

(c) Mattan Erez 33

Page 34: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Proportionality and overprovisioning

(c) Mattan Erez 34

Arithmetic

Control

Memory

Margin

Dense

Sparse

Latency

Page 35: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Architecture goals:

– Balance possible and practical

– Enable generality

– Don’t hurt the common case

•It’s all about the algorithm

35(c) Mattan Erez

Page 36: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Architecture goals:

– Balance possible and practical

– Enable generality

– Don’t hurt the common case

•Architecture concepts so far:

– Proportionality

• Adaptivity

• SW-HW co-tuning

– Locality

– Parallelism

36(c) Mattan Erez

Page 37: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•How to control a billion arithmetic units?

(c) Mattan Erez 37

Page 38: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Hierarchy minimizes control cost

– Amortize control decisions

Program

Teams

Processes

Threads

Instructions

Machine

Cabinets

/modules

Nodes

Cores

HW units

(c) Mattan Erez 38

Page 39: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Hierarchy minimizes control cost

– Amortize control decisions

(c) Mattan Erez 39

Program

Teams

Processes

Threads

Instructions

Page 40: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•What does a HW instruction do?

++++

xxxx

+ + + +

x x x x

x x

+

+

- +

Amortize/specialize more

CPU

GPU

Embedded

(c) Mattan Erez 40

Page 41: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Single-thread or bulk-thread

latency

orientedthroughput

oriented

CPU GPU

(c) Mattan Erez 41

Page 42: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Heterogeneity is a necessity disciplined specialization

(c) Mattan Erez 42

Page 43: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Must balance generality and control cost

•Over-specialization stifles innovation

– Algorithms more important than hardware

(c) Mattan Erez 43

Page 44: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Heterogeneity is lunacy

– Programing and tuning extremely tricky

– Diverse and rapidly evolving algorithms

(c) Mattan Erez 44

Page 45: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Disciplined heterogeneityScarcity of choice

– Throughput oriented

– Latency oriented

– Disciplined reconfigurable accelerators

(c) Mattan Erez 45

Page 46: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Heterogeneous computing already here

– “GPU” for throughput

– CPU for latency

– Abstracted FPGA accelerators

46(c) Mattan Erez

Page 47: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Heterogeneous computing already here

Titan node (Cray XK7)

NVIDIA GPU + AMD CPU

Stampede node

Intel Xeon CPU + Xeon Phi

(c) AMD. Cray, Intel, NVIDIA, xSede 47

Copyrighted

picture of

an XK7 node

Copyrighted

picture of

Kepler die

Copyrighted

picture of

Opteron die

Copyrighted

picture of

Xeon Phi card

Copyrighted

picture of

Xeon Phi die

Copyrighted

picture of

Sany Bridge die

Copyrighted

picture of

Stampede

node

Page 48: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Choose most efficient core

48(c) Mattan Erez

Page 49: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Choose most efficient core

49(c) Mattan Erez

Arithmetic

Control

Memory

Margin

Page 50: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

+

+

+

+

+ + + +=?

•Generalize the throughput core

50(c) Mattan Erez

52

Page 51: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Generalize the throughput core

51(c) Mattan Erez

Page 52: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Generalize the throughput core

52(c) Mattan Erez

Page 53: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Synchronization may impair execution

– Predict when necessary robust optimization

– Adaptive compaction

: Compaction barrier

(c) Mattan Erez 53

Page 54: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Architecture goals:

– Balance possible and practical

– Enable generality

– Don’t hurt the common case

•Architecture so far:

– Proportionality

• Adaptivity

• Heterogeneity

• SW-HW co-tuning

– Locality

– Parallelism

– Hierarchy

54(c) Mattan Erez

Page 55: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Heterogeneity is lunacy

– Programing and tuning extremely tricky

– Diverse and rapidly evolving algorithms

•How can we program these machines

(c) Mattan Erez 55

Page 56: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Hierarchical languages restore some sanity

– Abstract key tuning knobs

• Parallelism

• Locality

• Hierarchy

• Throughput

(c) Mattan Erez 56

Page 57: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Hierarchical languages restore some sanity

– CUDA and OpenCL

– Sequoia (Stanford)

– Habanero (Rice)

– Phalanx (NVIDIA)

– Legion (Stanford)

– Working into X10, Chapel

– …

•Domain specific languages even better

57(c) Mattan Erez

Page 58: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•A counter example:DE Shaw Research Anton

58(c) DE Shaw Research

Copyrighted

picture of

Anton package

(Google

image search)

Copyrighted

figure demonstrating

molecular dynamics

(Google

image search)

Copyrighted

figure of

Anton system

(Google

image search)

Page 59: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Another counter example?

•Microsoft Catapult

– Disciplined reconfigurability for the cloud

– Distributed FPGA accelerator

• Within form-factor and cost constraints

– Architected interfaces and compiler

• Compiler for software, not (just) hardeare

59(c) Mattan Erez

Copyrighted

figure of

datacenter

(Google image search)

Copyrighted

figure of

Catapult diagram

(from ISCA 2014 paper)

Page 60: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Memory

Cheap

Efficient

LargeFast

Reliable

(c) Mattan Erez 60

Page 61: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Fast + large hierarchy

(c) Mattan Erez 61

Page 62: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

TSVSilicon

Interposer

•Fast + large hierarchical

(c) Micron, Intel, Mattan Erez 62

mem

mem

Page 63: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Large + efficient heterogeneity

(c) Mattan Erez 63

Page 64: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

TSVSilicon

Interposer

•Large + efficient heterogeneous

(c) Micron, Intel, Mattan Erez 64

DRAM +

NVRAM

Page 65: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Heterogeneous + hierarchical big mess

•Research just starting

– New mechanisms needed

– Very interesting tradeoffs with reliability

(c) Mattan Erez 65

Page 66: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Fast + efficient control hierarchy

+ parallelism

(c) Mattan Erez 66

Page 67: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Fast + efficient hierarchy + parallelism

Access cell

(c) Mattan Erez 67

Page 68: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Fast + efficient hierarchy + parallelism

Access many

cells in parallel

Many pins

in parallel

(c) Mattan Erez 68

Page 69: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Fast + efficient hierarchy + parallelism

Access even more

cells in parallel

Many pins

in parallel over

multiple cycles

(c) Mattan Erez 69

Page 70: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Fast + efficient hierarchy + parallelism

Access even more

cells in parallel

Many pins

in parallel over

multiple cycles

(c) Mattan Erez 70

Page 71: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Hierarchy + parallelism potential waste

(c) Mattan Erez 71

Page 72: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Is fine-grained access better?

– Wasteful when all data is needed

D0 E0

DataC

C0

C1

D1 E1 D2 E2 D3 E3C2

C3

C4

D4 E4

CG

FGC5

C6

C7

D5 E5 D6 E6 D7 E7

ECC

D0 E0

DataC

C0

CG

FG

ECC

(c) Mattan Erez 72

72

Page 73: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Dynamically adapt granularity best of both worlds

73

ProcessorCore

$I$D

L2

Last Level Cache

MC

DRAM

Core0

Core1

CoreN-1

SLP

Co-scheduling CG/FG w/ split CG

Sector cache (8 8B sectors per line)

Sub-Ranked DRAM w/ 2X ABUS Support both CG and FG

Locality Predictor

(c) Mattan Erez

Page 74: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

mcf x8 omnetpp

x8

mst x8 em3d x8 canneal

x8

SSCA2 x8 linked list

x8

gups x8 MIX-A MIX-B MIX-C MIX-D

Sy

ste

m T

hro

ug

hp

ut

CG+ECC

FG+ECC

AG+ECC

•Dynamically adapt granularity

(c) Mattan Erez 74

Page 75: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Cost + … poor reliability

•Efficiency + … poor reliability

•New + … poor reliability

(c) Mattan Erez 75

Page 76: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Compensate with error protection

– ECC

(c) Mattan Erez 76

Page 77: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Compensate with proportional protection

(c) Mattan Erez 77

Page 78: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Adaptive reliability

– Adapt the mechanism

– Adapt the error rate

• precision (stochastic precision)

(c) Mattan Erez 78

Page 79: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Adaptive reliability

– By type

• Application guided

– By location

• Machine-state guided

(c) Mattan Erez 79

Page 80: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Example: Virtualized ECC

– Adapt the mechanism

(c) Mattan Erez 80

Page 81: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Physical Memory

Data ECC

Virtual Address space

LowVirtual Page to Physical Frame mapping

Page frame – x

Page frame – y

Page frame – z

Virtual page – x

Virtual page – y

Virtual page – z

Protection Level

Medium

High

(c) Mattan Erez 81

Page 82: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Physical Memory

Data

Virtual Address space

LowVirtual Page to Physical Frame mapping

Physical Frame to ECC Page mapping

ECC

StrongerECC

Page frame – x

Page frame – y

Page frame – z

ECC page – y

ECC page – z

Virtual page – x

Virtual page – y

Virtual page – z

Protection Level

Medium

High

(c) Mattan Erez 82

Page 83: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Physical Memory

Data T1EC

Virtual Address space

LowVirtual Page to Physical Frame mapping

Physical Frame to ECC Page mapping

T2EC for Chipkill

T2EC for DoubleChipkill

Page frame – x

Page frame – y

Page frame – z

ECC page – y

ECC page – z

Virtual page – x

Virtual page – y

Virtual page – z

Protection Level

Medium

High

(c) Mattan Erez 83

Page 84: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Two-Tiered ECC

Write Read

T1ECDecoding

Data T1EC

T1EC/T2ECEncoding

T2EC

(c) Mattan Erez 84

Page 85: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

85(c) Mattan Erez

Write Read

T1ECDecoding

Data T1EC

Virtualized

T2EC

T1EC/T2ECEncoding

T2ECMapping

Page 86: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Caching work great

0.98

0.99

1.00

1.01

1.02

Chipkill Detect ChipkillCorrect

2 ChipkillCorrect

Normalized Execution Time

(c) Mattan Erez 86

Page 87: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Bonus: flexibility of ECC

0.75

0.80

0.85

0.90

0.95

1.00

Chipkill Detect Chipkill Correct 2 ChipkillCorrect

Normalized Energy-Delay Product

(c) Mattan Erez 87

Page 88: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Can we do even better?

– Hide second-tier

88(C) Mattan Erez

Page 89: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

FrugalECC = Compression + VECC

89(C) Mattan Erez

Page 90: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Adapt resilience scheme or adapt storage?

90(C) Mattan Erez

Page 91: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Adapt compression scheme too!Coverage-oriented-Compression (CoC)

91(C) Mattan Erez

Page 92: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

92(C) Mattan Erez

Page 93: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

93(C) Mattan Erez

Page 94: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Experiments promising:

Match or exceed reliability

Improve power

Minimal impact on performance

94(C) Mattan Erez

Page 95: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

95(C) Mattan Erez

Page 96: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

96(C) Mattan Erez

Page 97: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Architecture goals:

– Balance possible and practical

– Enable generality

– Don’t hurt the common case

•Architecture so far:

– Proportionality

• Adaptivity

• SW-HW co-tuning

• Heterogeneity

– Locality

– Parallelism

– Hierarchy

97(c) Mattan Erez

Page 98: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•The reliability challenge

(c) Mattan Erez 98

1

10

100

1000

2013

(15PF)

2015

(40PF)

2017

(130PF)

2019

(450PF)

2021

(1500PF)

Pro

ject

ed

MT

TI

[ho

urs

]

Year (Expected performance in PetaFlops)

Hard

Soft

Intermittent SW

Overall

Page 99: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Detect Contain Repair Recover

(c) Mattan Erez 99

Page 100: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Today’s techniques not good enough

– Hardware support expensive

– Checkpoint-restart doesn’t scale

(c) Mattan Erez 100

Page 101: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Exascale resilience must be:

– Proportional

– Parallel

– Adaptive

– Hierarchical

– Analyzable

(c) Mattan Erez 101

Page 102: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Containment domains

– Embed resilience within application

– Match system hierarchy and characteristics

– Analyzable abstraction consistent across layersRoot CD

Child CD

CDs don’t communicate

erroneous data

CDs have

means to recover

(c) Mattan Erez 102

Page 103: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Hardware Abstraction Layer

Microkernel

Runtime Library Interface

Machine

efficiency-oriented

programming model

int main(int argc, char **argv){

main_task here = phalanx::initialize(argc, argv);

… Create test arrays here …

// Launch kernel on default CPU (“host”)openmp_event e1 = async(here, here.processor(), n)

(host_saxpy, 2.0f, host_x, host_y);

// Launch kernel on default GPU (“device”)cuda_event e2 = async(here, here.cuda_gpu(0), n)

(device_saxpy, 2.0f, device_x, device_y);

wait(here, e1&e2);return 0;

}

Phalanx C++ program

CD Annotations

resiliency model

Error

Reporting

Architecture

ECC, status

CD control and persistence

CD API

resiliency interface

(c) Mattan Erez 103

101, 106

Page 104: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Mapping example: SpMV

104(c) Mattan Erez

void task<inner> SpMV( in M, in Vi, out Ri){

forall(…) reduce(…)SpMV(M[…],Vi[…],Ri[…]);

}

void task<leaf> SpMV(…){

for r=0..N

for c=rowS[r]..rowS[r+1] {resi[r]+=data[c]*Vi[cIdx[c]];

prevC=c;

}

}

𝑴

Matrix M

𝑽

Vector V

Page 105: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Mapping example: SpMV

105(c) Mattan Erez

void task<inner> SpMV( in M, in Vi, out Ri){

forall(…) reduce(…)SpMV(M[…],Vi[…],Ri[…]);

}

void task<leaf> SpMV(…){

for r=0..N

for c=rowS[r]..rowS[r+1] {resi[r]+=data[c]*Vi[cIdx[c]];

prevC=c;

}

}

𝑴𝟎𝟎 𝑴𝟎𝟏

𝑴𝟏𝟎 𝑴𝟏𝟏

Matrix M

𝑽𝟎

Vector V

𝑽𝟏

Page 106: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Mapping example: SpMV

106(c) Mattan Erez

𝑴𝟎𝟎 𝑴𝟎𝟏𝑴𝟏𝟎 𝑴𝟏𝟏𝑽𝟎 𝑽𝟏𝑽𝟎 𝑽𝟏

𝑴𝟎𝟎 𝑴𝟎𝟏

𝑴𝟏𝟎 𝑴𝟏𝟏

Matrix M

𝑽𝟎

Vector V

𝑽𝟏

Distributed to 4 nodes

void task<inner> SpMV( in M, in Vi, out Ri){

forall(…) reduce(…)SpMV(M[…],Vi[…],Ri[…]);

}

void task<leaf> SpMV(…){

for r=0..N

for c=rowS[r]..rowS[r+1] {resi[r]+=data[c]*Vi[cIdx[c]];

prevC=c;

}

}

Page 107: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Mapping example: SpMV

107(c) Mattan Erez

𝑴𝟎𝟎 𝑴𝟎𝟏

𝑴𝟏𝟎 𝑴𝟏𝟏

Matrix M

𝑽𝟎

Vector V

𝑽𝟏

void task<inner> SpMV( in M, in Vi, out Ri){

forall(…) reduce(…)SpMV(M[…],Vi[…],Ri[…]);

}

void task<leaf> SpMV(…){

for r=0..N

for c=rowS[r]..rowS[r+1] {resi[r]+=data[c]*Vi[cIdx[c]];

prevC=c;

}

}

Distributed to 4 nodes

Page 108: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Mapping example: SpMV

108(c) Mattan Erez

𝑴𝟎𝟎 𝑽𝟎

Preserve

DetectRecover

𝑴𝟏𝟎 𝑽𝟎

Preserve

DetectRecover

𝑴𝟎𝟏 𝑽𝟏

Preserve

DetectRecover

𝑴𝟏𝟏 𝑽𝟏

Preserve

DetectRecover

Preserve

DetectRecover

M V

Parent CD

Child CD

Preserve (Parent)

Detect (Parent)

Recover (Parent)

Child

DetectRecover

Child

DetectRecover

Child

DetectRecover

Child

DetectRecover

Page 109: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

void task<inner> SpMV(in M, in Vi, out Ri) {

cd = begin(parentCD);

preserve_via_copy(cd, matrix, …);

forall(…) reduce(…)SpMV(M[…],Vi[…],Ri[…]);

complete(cd);

}

void task<leaf> SpMV(…) {

cd = begin(parentCD);

preserve_via_copy(cd, matrix, …);

preserve_via_parent(cd, veci, …);

for r=0..N

for c=rowS[r]..rowS[r+1] {resi[r]+=data[c]*Vi[cIdx[c]];

check {fault<fail>(c > prevC);}

prevC=c;

}

complete(cd);

}

109(c) Mattan Erez

API

begin

configure

preserve

check

complete

Page 110: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•SpMV preservation tuning

110(c) Mattan Erez

Natural redundancy

void task<leaf> SpMV(…) {cd = begin(parentCD);preserve_via_copy(cd, matrix, …);preserve_via_parent(cd, veci, …);for r=0..N for c=rowS[r]..rowS[r+1] {

resi[r]+=data[c]*Vi[cIdx[c]];check {fault<fail>(c > prevC);}prevC=c;

}complete(cd);}

𝑴𝟎𝟎 𝑴𝟎𝟏

𝑴𝟏𝟎 𝑴𝟏𝟏

Matrix M

𝑽𝟎

Vector V

𝑽𝟏

Hierarchy

𝑴𝟎𝟎 𝑴𝟎𝟏𝑴𝟏𝟎 𝑴𝟏𝟏𝑽𝟎 𝑽𝟏𝑽𝟎 𝑽𝟏

Page 111: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Concise abstraction for complex behavior

111(c) Mattan Erez

void task<leaf> SpMV(…) {

cd = begin(parentCD);

preserve_via_copy(cd, matrix, …);

preserve_via_parent(cd, veci, …);

for r=0..N

for c=rowS[r]..rowS[r+1] {

resi[r]+=data[c]*Vi[cIdx[c]];

check {fault<fail>(c > prevC);}

prevC=c;

}

complete(cd);}

Local copy or regen Sibling Parent (unchanged)

Page 112: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

Recovery: concurrent, hierarchical, adaptive

112(c) Mattan Erez

Compute

Preserve

Detect

Re-execution overhead

Tim

e

Page 113: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Analyzable

– Application model (tree)

– Overhead model

– Fault model

113(c) Mattan Erez

Ex

ec

utio

n t

ime

Page 114: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•𝑞 0,𝑚, 𝑛 = 1 − 𝑝𝑐𝑚∙𝑛

– Probability that all child CDs experienced no failures

•𝑞 𝑥,𝑚, 𝑛 = 𝑖=0𝑥 𝑖+𝑚−1

𝑖𝑝𝑐𝑖 1 − 𝑝𝑐

𝑚𝑛

– Probability that all child CDs experienced at most x failures in the m serial CDs

•𝑑 0,𝑚, 𝑛 = 𝑞[0,𝑚, 𝑛]– Probability that all child CDs experienced exactly 0

failures

•𝑑 𝑥,𝑚, 𝑛 = 𝑞 𝑥,𝑚, 𝑛 − 𝑞 𝑥 − 1,𝑚, 𝑛– Probability that sibling with the most failure

experiences exactly x failures

•𝑇𝑝𝑎𝑟𝑒𝑛𝑡 𝑥,𝑚, 𝑛 = 𝑖=0∞ 𝑖 + 𝑚 𝑇𝑐𝑑[𝑖,𝑚, 𝑛]

•Heterogeneous CDs

•Asymmetric preservation and restoration

114

Re

-exe

cu

tio

n t

ime

Parallel domains

Execution

Re-execution

Analyzable

(c) Mattan Erez

Page 115: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Analyzable?

115(c) Mattan Erez

Ex

ec

utio

n t

ime

108

Page 116: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•CDs are scalable

116(c) Mattan Erez

Peak System Performance

NT

SpMV

HPCCG

0%20%40%60%80%

100%

2.5PF 10PF 40PF 160PF 640PF 1.2EF 2.5EF

Pe

rfo

rma

nc

e

Effic

ien

cy CDs, NT

h-CPR, 80%

g-CPR, 80%

0%20%40%60%80%

100%

2.5PF 10PF 40PF 160PF 640PF 1.2EF 2.5EF

Pe

rfo

rma

nc

e

Effic

ien

cy CDs, SpMV

h-CPR, 50%

g-CPR, 50%

0%20%40%60%80%

100%

2.5PF 10PF 40PF 160PF 640PF 1.2EF 2.5EF

Pe

rfo

rma

nc

e

Effic

ien

cy CDs, HPCCG

h-CPR, 10%

g-CPR, 10%

Page 117: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•CDs are proportional and efficient

117(c) Mattan Erez

0%

10%

20%

2.5PF 10PF 40PF 160PF 640PF 1.2EF 2.5EF

En

erg

y

Ov

erh

ea

d

CDs, SpMV

h-CPR, 50%

g-CPR, 50%

0%

10%

20%

2.5PF 10PF 40PF 160PF 640PF 1.2EF 2.5EF

En

erg

y

Ov

erh

ea

d

CDs, NT

h-CPR, 80%

gCPR, 80%

Peak System Performance

0%

10%

20%

2.5PF 10PF 40PF 160PF 640PF 1.2EF 2.5EF

En

erg

y

Ov

erh

ea

d

CDs, HPCCG

h-CPR, 10%

g-CPR, 10%

NT

SpMV

HPCCG

Page 118: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Architecture goals:

– Balance possible and practical

– Enable generality

– Don’t hurt the common case

•It’s all about the algorithm

118(c) Mattan Erez

Page 119: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Architecture goals:

– Balance possible and practical

– Enable generality

– Don’t hurt the common case

•Architecture so far:

– Proportionality

• Adaptivity

• SW-HW co-tuning + analyzability

• Heterogeneity

– Locality

– Parallelism

– Hierarchy

119(c) Mattan Erez

Arithmetic

Control

Memory

Reliability

Programming

Page 120: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Exascale computers are lunacy

120(c) Mattan Erez

Page 121: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Exascale computers are lunacyscience

121(c) Mattan Erez

Page 122: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688

•Harbingers of things to come

•Discovery awaits

•Enable and promote open innovation

•Math + Systems +

Engineering +

• Technology

122(c) Mattan Erez