Computing on the Lunatic Fringe: Exascale Computers and Why You Should Care
Mattan Erez
The University of Texas at Austin
Arch-focused whole-system approach
•Efficiency requirements require crossing layers
•Algorithms are key – compute less, move less, store less
•Proportional systems – minimize waste
•Utilize and improve emerging technologies
•Explore (and act) up and down – programming model to circuits
•Preferably implement at micro-arch/system
Big problems and emerging platforms
•Memory systems
– Capacity, bandwidth, efficiency – impossible to balance
– Adaptive and dynamic management helps
– New technologies to the rescue?
– Opportunities for in-memory computing
•GPUs, supercomputers, clouds, and more
– Throughput-oriented designs are a must
– Centralization trend is interesting
•Reliability and resilience
– More, smaller devices – danger of poor reliability
– Trimmed margins – less room for error
– Hard constraints – efficiency is a must
– Scale exacerbates reliability and resilience concerns
Lots of interesting multi-level projects
•Resilience/reliability
•GPU/CPU
•Memory
•A superscale computer is a high-performance system for solving big problems
•A supercomputer is a high-performance system for solving big cohesive problems
•Simulate
•Analyze
•Predict
•Supercomputers are general
– Not domain specific
•Cloud computers are general
– Algorithms keep changing
•Superscale computers must balance
– Compute
– Storage
– Communication
– Constraints
• OpEx and CapEx
•Super is never super enough
[Chart: sustained FLOP/s over time, 1E+8 to 1E+18, for the #1, #10, and #500 ranked systems vs. idealized VLSI scaling]
•What’s keeping us from exascale?
•The power problem
[Chart: total power (kW) and efficiency (GFLOPS/kW) of top systems, log scale 1 to 1,000,000]
•Solution: build more efficient systems
•Exascale computers are lunacy:
– Crazy big (1M processors, >10MW)
– Need efficiency of embedded devices
– Effective for many domains
– Fully programmable
•Why should you care?
•Harbingers of things to come
•Discovery awaits
•Enable and promote open innovation
•Where does the power go?
•Cooling and infrastructure
– Amazing mechanical and electrical systems
– Getting close to optimal efficiency
17 PFLOP/s, 18,688 nodes, >8 MW, ~200 cabinets, ~400 m² floorspace
•Actual processing
– Arithmetic
– I/O
– Memory
– Control
– Comm.
– Idle/margin
How much of each component?
•The budget: Power ≤ 20MW
•Energy ≤ 20MW / 1 exa-FLOP/s = 20pJ/op
•50 GFLOPS/W sustained
•Best supercomputer today: ~300pJ/op
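Spelling out the budget arithmetic (a worked check of the slide's numbers):

$$E_{op} \le \frac{20\,\mathrm{MW}}{10^{18}\,\mathrm{FLOP/s}} = 2\times 10^{-11}\,\mathrm{J} = 20\,\mathrm{pJ/op} \quad\Longleftrightarrow\quad \frac{10^{18}\,\mathrm{FLOP/s}}{20\times 10^{6}\,\mathrm{W}} = 50\,\mathrm{GFLOPS/W}$$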
Arithmetic: 64-bit floating-point operation (rough estimated numbers)
[Bar chart, 0–50 pJ/op: arithmetic energy and remaining headroom for Today (~40), Scaled (~20), and Researchy (~15) technology]
•Enough headroom?
•Unfortunately, hard tradeoffs (rough numbers, 10nm-era, 20mm die):
– 64-bit DP FLOP: 15 pJ
– 8 kB SRAM access (256-bit): 15 pJ
– 256-bit on-chip buses: 15 pJ to 80–100 pJ depending on distance
– Efficient off-chip link: 500 pJ
– DRAM Rd/Wr (256-bit access): 2 nJ
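Using the rough numbers above, the imbalance is easy to quantify: a 256-bit DRAM access costs more than two orders of magnitude more energy than the arithmetic it feeds:

$$\frac{E_{\mathrm{DRAM\ Rd/Wr}}}{E_{\mathrm{DP\ FLOP}}} \approx \frac{2\,\mathrm{nJ}}{15\,\mathrm{pJ}} \approx 133\times$$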
•Need more headroom
– Minimize waste
Do we care about single-unit performance?
•Must all results be equally precise?
•Must all results be correct?
•Lunacy?
•Relaxed reliability and precision
– Some lunacy: rare easy-to-detect errors + parallelism
– Lunatic fringe: bounded imprecision
– Lunacy: live with real unpredictable errors
[Bar chart, 0–50 pJ/op, rough estimated numbers for illustration purposes: arithmetic energy and headroom for Today (40), Scaled (20), Researchy (15), Some lunacy (12), Lunatic fringe (8), and Lunacy (2); headroom grows from 5 to 8, 12, and 18 pJ as more lunacy is tolerated]
•Bounded Approximate Duplication
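A minimal sketch of the idea (my illustration, not the actual implementation): re-execute each operation at reduced precision and flag a fault only when the cheap duplicate diverges from the full-precision result by more than an application-supplied bound. Names like checked_fma and report_fault are hypothetical.

#include <cmath>
#include <cstdio>

// Hypothetical fault handler; a real system would trigger recovery instead.
void report_fault(double full, float approx) {
    std::fprintf(stderr, "divergence detected: %f vs %f\n", full, approx);
}

// Full-precision multiply-add, checked by a bounded approximate duplicate.
double checked_fma(double a, double b, double c, double rel_bound) {
    double full = a * b + c;                       // primary (64-bit) result
    float approx = (float)a * (float)b + (float)c; // cheap (32-bit) duplicate
    // Errors that flip high-order bits push the two results apart by far
    // more than the precision gap; bounded divergence is treated as noise.
    if (std::fabs(full - (double)approx) > rel_bound * std::fabs(full))
        report_fault(full, approx);
    return full;
}

int main() {
    double r = checked_fma(1.5, 2.0, 0.25, 1e-5);
    std::printf("%f\n", r);
    return 0;
}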
•Supercomputers are general
•Dynamic adaptivity and tuning
– Some programmers crazier than others
– Large effort in error-tolerant algorithms
•Proportionality and overprovisioning
[Chart: energy breakdown – arithmetic, control, memory, margin – for dense, sparse, and latency-bound workloads]
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•It’s all about the algorithm
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•Architecture concepts so far:
– Proportionality
• Adaptivity
• SW-HW co-tuning
– Locality
– Parallelism
•How to control a billion arithmetic units?
•Hierarchy minimizes control cost
– Amortize control decisions
Program → Machine
Teams → Cabinets/modules
Processes → Nodes
Threads → Cores
Instructions → HW units
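A toy sketch of the amortization argument (entirely my illustration; the levels mirror the mapping above): one control decision per level fans out over many children, so decisions grow far more slowly than the number of HW units controlled.

#include <cstdio>

// One scheduling decision per node of the control tree; each decision is
// amortized over `fanout` children (machine -> cabinets -> nodes -> cores -> HW units).
long decisions = 0;

void control(int level, int depth, int fanout) {
    ++decisions;                       // a single decision at this level
    if (level == depth) return;        // leaf: the HW unit executes
    for (int i = 0; i < fanout; ++i)
        control(level + 1, depth, fanout);
}

int main() {
    int depth = 4, fanout = 8;         // 5 levels, 8-way fan-out
    control(0, depth, fanout);
    long units = 1;
    for (int i = 0; i < depth; ++i) units *= fanout;
    // 4681 decisions control 4096 leaf units (~1.14 per unit), and the
    // expensive non-leaf decisions number just 1 + 8 + 64 + 512 = 585.
    std::printf("%ld decisions for %ld HW units\n", decisions, units);
    return 0;
}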
•What does a HW instruction do?
[Figure: instruction granularity – from individual scalar ops, to wide vectors of ops, to fused sequences; amortize/specialize more going from CPU to GPU to embedded]
•Single-thread or bulk-thread
– CPU: latency oriented
– GPU: throughput oriented
Heterogeneity is a necessity → disciplined specialization
•Must balance generality and control cost
•Over-specialization stifles innovation
– Algorithms more important than hardware
Heterogeneity is lunacy
– Programming and tuning extremely tricky
– Diverse and rapidly evolving algorithms
•Disciplined heterogeneity → scarcity of choice
– Throughput oriented
– Latency oriented
– Disciplined reconfigurable accelerators
•Heterogeneous computing already here
– “GPU” for throughput
– CPU for latency
– Abstracted FPGA accelerators
•Heterogeneous computing already here
– Titan node (Cray XK7): NVIDIA GPU + AMD CPU
– Stampede node: Intel Xeon CPU + Xeon Phi
[Images: XK7 node, Kepler die, Opteron die, Xeon Phi card and die, Sandy Bridge die, Stampede node – (c) AMD, Cray, Intel, NVIDIA, XSEDE]
•Choose most efficient core
[Chart: energy breakdown – arithmetic, control, memory, margin – per core type]
•Generalize the throughput core
•Synchronization may impair execution
– Predict when necessary → robust optimization
– Adaptive compaction (compaction barrier)
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•Architecture so far:
– Proportionality
• Adaptivity
• Heterogeneity
• SW-HW co-tuning
– Locality
– Parallelism
– Hierarchy
Heterogeneity is lunacy
– Programming and tuning extremely tricky
– Diverse and rapidly evolving algorithms
•How can we program these machines?
Hierarchical languages restore some sanity
– Abstract key tuning knobs
• Parallelism
• Locality
• Hierarchy
• Throughput
Hierarchical languages restore some sanity
– CUDA and OpenCL
– Sequoia (Stanford)
– Habanero (Rice)
– Phalanx (NVIDIA)
– Legion (Stanford)
– Working into X10, Chapel
– …
•Domain specific languages even better
•A counter example: D. E. Shaw Research Anton
[Images: Anton package, a molecular-dynamics figure, and the Anton system – (c) D. E. Shaw Research]
•Another counter example?
•Microsoft Catapult
– Disciplined reconfigurability for the cloud
– Distributed FPGA accelerator
• Within form-factor and cost constraints
– Architected interfaces and compiler
• Compiler for software, not (just) hardware
[Images: datacenter; Catapult diagram from the ISCA 2014 paper]
Memory
– Cheap
– Efficient
– Large
– Fast
– Reliable
•Fast + large → hierarchy
[Figure: 3D-stacked memory with TSVs on a silicon interposer beside the processor, backed by additional off-package memory – (c) Micron, Intel]
•Large + efficient → heterogeneity
[Figure: stacked memory on an interposer combined with DRAM + NVRAM – (c) Micron, Intel]
•Heterogeneous + hierarchical → big mess
•Research just starting
– New mechanisms needed
– Very interesting tradeoffs with reliability
•Fast + efficient → control hierarchy + parallelism
[Figures: DRAM access, step by step – access a cell; access many cells in parallel; drive many pins in parallel; then many pins in parallel over multiple cycles to access even more cells]
•Hierarchy + parallelism → potential waste
•Is fine-grained access better?
– Wasteful when all data is needed
[Figure: coarse-grained (CG) access amortizes one ECC word over data words D0–D7; fine-grained (FG) access pairs each data word Di with its own ECC word Ei]
•Dynamically adapt granularity → best of both worlds
– Co-scheduling CG/FG with split CG
– Sector cache (8 8B sectors per line)
– Sub-ranked DRAM with 2X ABUS: support both CG and FG
– Locality predictor (SLP) – see the sketch below
[Figure: processor with cores, L1 I/D, L2, last-level cache; memory controller with the locality predictor in front of sub-ranked DRAM]
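A toy sketch of the decision the predictor makes (my own simplification, not the actual hardware; class and field names are invented): per-region counters track how much of each fetched line was actually consumed, and each miss is issued as a coarse- or fine-grained DRAM access accordingly.

#include <cstdint>
#include <unordered_map>

enum class Granularity { Coarse, Fine };

// Tracks, per memory region, how much of each fetched line was actually used.
// Illustrative software structure; real predictors are small SRAM tables.
class LocalityPredictor {
    struct Stats { uint32_t sectors_used = 0; uint32_t lines_fetched = 0; };
    std::unordered_map<uint64_t, Stats> regions_;
    static constexpr uint32_t kSectorsPerLine = 8;  // 8 x 8B sectors per 64B line

public:
    // On eviction, record how many of the line's 8 sectors were touched.
    void train(uint64_t region, uint32_t sectors_touched) {
        auto& s = regions_[region];
        s.sectors_used += sectors_touched;
        s.lines_fetched += 1;
    }
    // Predict CG when most sectors of past lines were used, FG otherwise.
    Granularity predict(uint64_t region) const {
        auto it = regions_.find(region);
        if (it == regions_.end()) return Granularity::Coarse;  // default
        const auto& s = it->second;
        double avg = double(s.sectors_used) / (s.lines_fetched * kSectorsPerLine);
        return avg > 0.5 ? Granularity::Coarse : Granularity::Fine;
    }
};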
•Dynamically adapt granularity
[Chart: system throughput (0–8) for mcf, omnetpp, mst, em3d, canneal, SSCA2, linked list, and gups (x8 each) plus MIX-A through MIX-D, comparing CG+ECC, FG+ECC, and AG+ECC]
•Cost + … → poor reliability
•Efficiency + … → poor reliability
•New + … → poor reliability
•Compensate with error protection
– ECC
•Compensate with proportional protection
•Adaptive reliability
– Adapt the mechanism
– Adapt the error rate
• ↓ precision (stochastic precision)
•Adaptive reliability
– By type
• Application guided
– By location
• Machine-state guided
•Example: Virtualized ECC
– Adapt the mechanism
[Figure: virtual pages x, y, z (protection levels low, medium, high) map to physical page frames x, y, z; physical memory stores data plus ECC]
[Figure: a second mapping from physical frames to ECC pages; pages requiring stronger ECC map to dedicated ECC pages y and z in physical memory]
[Figure: data stored with a tier-one code (T1EC); tier-two codes (T2EC) for chipkill and double chipkill placed in ECC pages according to each page’s protection level]
Two-tiered ECC
– Write: T1EC/T2EC encoding; data + T1EC stored together, T2EC stored separately
– Read: only T1EC decoding on the common path
[Figure: same flow with virtualized T2EC – a T2EC mapping places the second-tier codes in cacheable memory]
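A schematic of the write path in software terms (illustrative only; the class, codec functions, and mapping structure are my inventions, not the Virtualized ECC hardware): the first tier is computed and stored inline with the data, while the second tier is written through a per-frame mapping into ordinary, cacheable memory.

#include <cstdint>
#include <unordered_map>

// Illustrative stand-ins for the real error-correcting codecs.
uint16_t encode_t1ec(uint64_t data) { return uint16_t(data ^ (data >> 16)); }
uint32_t encode_t2ec(uint64_t data) { return uint32_t(data * 0x9E3779B9u); }

struct MemoryLine { uint64_t data; uint16_t t1ec; };

class VirtualizedECC {
    std::unordered_map<uint64_t, MemoryLine> memory_;     // data + inline T1EC
    std::unordered_map<uint64_t, uint64_t> t2ec_map_;     // frame -> T2EC page
    std::unordered_map<uint64_t, uint32_t> t2ec_storage_; // cacheable T2EC space

public:
    void map_frame(uint64_t frame, uint64_t t2ec_page) {
        t2ec_map_[frame] = t2ec_page;                     // set per-frame mapping
    }
    void write(uint64_t addr, uint64_t data) {
        memory_[addr] = {data, encode_t1ec(data)};        // tier 1: inline
        uint64_t t2_base = t2ec_map_[addr >> 12];         // per-frame redirection
        t2ec_storage_[t2_base + (addr & 0xFFF)] = encode_t2ec(data); // tier 2
    }
    // Common-case read touches only the inline T1EC; the T2EC is fetched
    // (through the cache hierarchy) only when T1EC flags an error.
    uint64_t read(uint64_t addr, bool* t2_needed) {
        const MemoryLine& line = memory_.at(addr);
        *t2_needed = (encode_t1ec(line.data) != line.t1ec);
        return line.data;
    }
};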
•Caching works great
[Chart: normalized execution time, 0.98–1.02, for chipkill detect, chipkill correct, and 2× chipkill correct]
•Bonus: flexibility of ECC
[Chart: normalized energy-delay product, 0.75–1.00, for chipkill detect, chipkill correct, and 2× chipkill correct]
•Can we do even better?
– Hide second-tier
FrugalECC = Compression + VECC
Adapt resilience scheme or adapt storage?
Adapt compression scheme too! Coverage-oriented Compression (CoC)
Experiments promising:
– Match or exceed reliability
– Improve power
– Minimal impact on performance
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•Architecture so far:
– Proportionality
• Adaptivity
• SW-HW co-tuning
• Heterogeneity
– Locality
– Parallelism
– Hierarchy
•The reliability challenge
[Chart: projected MTTI (hours, 1–1000, log scale) vs. year and expected performance – 2013 (15PF), 2015 (40PF), 2017 (130PF), 2019 (450PF), 2021 (1500PF) – broken out by hard, soft, intermittent, and SW errors, plus overall]
Detect → Contain → Repair → Recover
•Today’s techniques not good enough
– Hardware support expensive
– Checkpoint-restart doesn’t scale
•Exascale resilience must be:
– Proportional
– Parallel
– Adaptive
– Hierarchical
– Analyzable
•Containment domains
– Embed resilience within application
– Match system hierarchy and characteristics
– Analyzable abstraction consistent across layers
[Figure: root CD with child CDs; CDs don’t communicate erroneous data; CDs have means to recover]
Machine → hardware abstraction layer → microkernel → runtime library interface
[Figure: stack combining an efficiency-oriented programming model (Phalanx C++ program), a resiliency model (CD annotations), a resiliency interface (CD API with CD control and persistence), and an error-reporting architecture (ECC, status)]

Phalanx C++ program:

int main(int argc, char **argv) {
  main_task here = phalanx::initialize(argc, argv);
  … Create test arrays here …
  // Launch kernel on default CPU (“host”)
  openmp_event e1 = async(here, here.processor(), n)
                         (host_saxpy, 2.0f, host_x, host_y);
  // Launch kernel on default GPU (“device”)
  cuda_event e2 = async(here, here.cuda_gpu(0), n)
                       (device_saxpy, 2.0f, device_x, device_y);
  wait(here, e1 & e2);
  return 0;
}
Mapping example: SpMV

void task<inner> SpMV(in M, in Vi, out Ri) {
  forall(…) reduce(…)
    SpMV(M[…], Vi[…], Ri[…]);
}

void task<leaf> SpMV(…) {
  for r = 0..N
    for c = rowS[r]..rowS[r+1] {
      resi[r] += data[c] * Vi[cIdx[c]];
      prevC = c;
    }
}

[Figure: matrix M and vector V]
Mapping example: SpMV
[Figure: M partitioned into M00, M01, M10, M11 and V into V0, V1; blocks distributed to 4 nodes, with V0 and V1 replicated across the nodes that need them]
•Mapping example: SpMV
[Figure: parent CD preserves M and V, detects, and recovers; four child CDs – (M00,V0), (M10,V0), (M01,V1), (M11,V1) – each with its own preserve, detect, and recover]
void task<inner> SpMV(in M, in Vi, out Ri) {
  cd = begin(parentCD);
  preserve_via_copy(cd, matrix, …);
  forall(…) reduce(…)
    SpMV(M[…], Vi[…], Ri[…]);
  complete(cd);
}

void task<leaf> SpMV(…) {
  cd = begin(parentCD);
  preserve_via_copy(cd, matrix, …);
  preserve_via_parent(cd, veci, …);
  for r = 0..N
    for c = rowS[r]..rowS[r+1] {
      resi[r] += data[c] * Vi[cIdx[c]];
      check {fault<fail>(c > prevC);}
      prevC = c;
    }
  complete(cd);
}

API: begin, configure, preserve, check, complete
•SpMV preservation tuning
– Natural redundancy
– Hierarchy
[Figure: partitioned matrix M and vector V distributed to 4 nodes; V0 and V1 naturally replicated across nodes]
•Concise abstraction for complex behavior
– Preservation alternatives behind one call: local copy or regen, sibling, parent (unchanged)
Recovery: concurrent, hierarchical, adaptive
[Figure: timeline of compute, preserve, and detect phases showing re-execution overhead after a fault]
•Analyzable
– Application model (tree)
– Overhead model
– Fault model
[Figure: execution-time model]
•$q[0,m,n] = (1-p_c)^{mn}$
– Probability that all child CDs experienced no failures
•$q[x,m,n] = \left(\sum_{i=0}^{x}\binom{i+m-1}{i}\,p_c^{\,i}\,(1-p_c)^{m}\right)^{n}$
– Probability that all child CDs experienced at most x failures in the m serial CDs
•$d[0,m,n] = q[0,m,n]$
– Probability that all child CDs experienced exactly 0 failures
•$d[x,m,n] = q[x,m,n] - q[x-1,m,n]$
– Probability that the sibling with the most failures experiences exactly x failures
•$T_{parent}[m,n] = \sum_{i=0}^{\infty}\,(i+m)\,T_{cd}\,d[i,m,n]$
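To make the model concrete, here is a small numerical evaluation of these formulas (my own code, based on the reconstruction above; the infinite sum is truncated once the tail probability is negligible):

#include <cmath>
#include <cstdio>

// q[x,m,n]: probability that every one of n sibling chains of m serial CDs
// completes with at most x failures (failure probability p_c per CD run).
double q(int x, int m, int n, double pc) {
    double sum = 0.0, binom = 1.0;              // C(m-1,0) = 1
    for (int i = 0; i <= x; ++i) {
        sum += binom * std::pow(pc, i) * std::pow(1.0 - pc, m);
        binom *= double(i + m) / (i + 1);       // C(i+m,i+1) from C(i+m-1,i)
    }
    return std::pow(sum, n);
}

// Expected parent time: each of the i extra failures re-executes one CD.
double t_parent(int m, int n, double pc, double t_cd) {
    double expected = 0.0;
    for (int i = 0;; ++i) {
        double d = q(i, m, n, pc) - (i ? q(i - 1, m, n, pc) : 0.0);
        expected += (i + m) * t_cd * d;
        if (i > m && d < 1e-12) break;          // truncate negligible tail
    }
    return expected;
}

int main() {
    // Illustrative parameters: 16 siblings, 10 serial CDs, 1% failure rate.
    std::printf("E[T_parent] = %.3f (in units of T_cd)\n",
                t_parent(10, 16, 0.01, 1.0));
    return 0;
}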
•Heterogeneous CDs
•Asymmetric preservation and restoration
[Chart: execution and re-execution time across parallel domains – analyzable]
•Analyzable?
[Chart: execution time]
•CDs are scalable
[Charts: performance efficiency vs. peak system performance (2.5PF–2.5EF) for NT (CDs vs. h-CPR/g-CPR at 80%), SpMV (at 50%), and HPCCG (at 10%)]
•CDs are proportional and efficient
[Charts: energy overhead (0–20%) vs. peak system performance (2.5PF–2.5EF) for NT (h-CPR/g-CPR at 80%), SpMV (at 50%), and HPCCG (at 10%)]
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•It’s all about the algorithm
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•Architecture so far:
– Proportionality
• Adaptivity
• SW-HW co-tuning + analyzability
• Heterogeneity
– Locality
– Parallelism
– Hierarchy
Arithmetic · Control · Memory · Reliability · Programming
•Exascale computers are lunacy
•Exascale computers are lunacy science
•Harbingers of things to come
•Discovery awaits
•Enable and promote open innovation
•Math + Systems + Engineering + Technology