Celsius Lecture 2/14/13 1

Celsius Lecture 2/14/13 1 - Uppsala University · 2019. 9. 9. · Celsius Lecture 2/14/13 2 Exascale Computing Will Enable Transformational Science. Supercomputers are scientific


Exascale Computing Will Enable Transformational Science

Presenter notes: Supercomputers are scientific instruments, like telescopes, microscopes, and particle accelerators, but they advance all fields of science. Let me tell you about my quest to build the next generation of instruments.

Climate

Comprehensive Earth System Model at 1 km scale, enabling modeling of cloud convection and ocean eddies.


Combustion

First-principles simulation of combustion for new high-efficiency, low-emission engines.


Biology

Coupled simulation of entire cells at molecular, genetic, chemical and biological levels.


Astrophysics

Predictive calculations for thermonuclear and core-collapse supernovae, allowing confirmation of theoretical models.


Exascale Computing Will Enable Transformational Science

High-Performance Computers are Scientific Instruments


Titan: World’s #1 Open Science Supercomputer

18,688 NVIDIA Tesla K20X GPUs

27 petaflops peak (90% of performance from GPUs)

17.59 petaflops sustained performance on Linpack

Presenter notes: Currently the best available instrument.

Titan & Kepler

18,688 NVIDIA Kepler GK110
27 PF peak (90% from GPUs)
17.6 PF HP Linpack
2.12 GF/W

GK110 is 7 GF/W

Presenter notes: Titan is powered by Kepler, a GPU. It costs $1B to develop each generation of GPU; we can't afford to do this for HPC alone. Graphics is very aligned in its requirements.

The Road to Exascale

2012 (you are here): 20 PF, 18,000 GPUs, 10 MW, 2 GFLOPs/W, ~10^7 threads

2020: 1000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPs/W (25x), ~10^10 threads (1000x)


Technical Challenges on The Road to Exascale

2012: 20 PF, 18,000 GPUs, 10 MW, 2 GFLOPs/W, ~10^7 threads

2020: 1000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPs/W (25x), ~10^10 threads (1000x)

1. Energy Efficiency
2. Parallel Programmability
3. Resilience
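
As a quick check of the energy-efficiency target, the GFLOPs/W figures on this slide follow from dividing performance by power. The sketch below only redoes that arithmetic; the function and variable names are illustrative and not from the lecture.

```python
# Figures copied from the slide; names are illustrative.
PF = 1e15  # floating-point operations per second in one petaflop
MW = 1e6   # watts in one megawatt


def gflops_per_watt(perf_pf, power_mw):
    """Return efficiency in GFLOPS per watt for a given performance and power."""
    return (perf_pf * PF) / (power_mw * MW) / 1e9


eff_2012 = gflops_per_watt(20, 10)    # 20 PF at 10 MW
eff_2020 = gflops_per_watt(1000, 20)  # 1000 PF at 20 MW
improvement = eff_2020 / eff_2012     # required efficiency gain

print(eff_2012, eff_2020, improvement)
```

Performance must grow 50x while power may only double, so efficiency has to carry a factor of 25.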


50x performance in 8 years, Moore’s Law will take care of that, right?

Wrong!


Moore’s Law gives us transistors, which we used to turn into scalar performance

Moore, Electronics 38(8) April 19, 1965

Presenter notes: For many years we have counted on the scaling of scalar processing elements to add performance that enabled new features and capabilities. Moore’s law gave us more transistors. Architects turned these transistors into more scalar performance. Applications turned this performance into value. This historic scaling is at an end for two reasons: ILP is mined out, and power scaling has changed, so we can no longer afford high-overhead means to get scalar performance.


But ILP was ‘mined out’ in 2000

[Figure: Perf (ps/Inst) on a log scale from 1e-4 to 1e+7, plotted from 1980 to 2020 with a linear fit. Annotated growth rates: 52%/year, 74%/year, 19%/year. Annotated gaps: 30:1, 1,000:1, 30,000:1.]

Dally et al. “The Last Classical Computer”, ISAT Study, 2001

Presenter notes: Note the scale. Transistors were turned into performance via faster devices, deeper pipelines, and more ILP. Parallel computing makes sense now. While microprocessors have sustained performance improvements of 52%/year, fabrication technology has actually provided a much higher growth rate in potential capability. When accounting for increased transistor counts and faster transistor switching speeds, the capability of microprocessor-scale integrated circuits has been improving at 74%/year. Until now, the differential between the 74% and 52% rates has resulted in only a factor of 30 of untapped performance potential. However, with only 19% per year projected in the future, the differential is expected to increase to a factor of 30,000 by 2020. This quantity represents a tremendous opportunity for novel architectures to help bridge the performance gap and to enable future computer systems to solve increasingly complex and important problems.
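
The gap described in these notes comes from two annual growth rates compounding at different speeds. A minimal sketch of that compounding follows; the function name and the 10-year span are assumptions chosen for illustration, and the chart's exact start years are not recoverable from the text, so this does not try to reproduce the 30:1 or 30,000:1 figures.

```python
def capability_gap(capability_rate, delivered_rate, years):
    """Ratio of potential to delivered performance after `years` of compounding."""
    return ((1 + capability_rate) / (1 + delivered_rate)) ** years


# Rates quoted in the notes; the 10-year span is an arbitrary illustration.
gap_historic = capability_gap(0.74, 0.52, 10)   # delivered growth of 52%/year
gap_projected = capability_gap(0.74, 0.19, 10)  # delivered growth of 19%/year

print(f"{gap_historic:.1f}x vs {gap_projected:.1f}x over 10 years")
```

The point is qualitative: once delivered scalar performance drops to 19%/year, the untapped potential grows far faster than it did at 52%/year.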

And L^3 energy scaling ended in 2005

Gordon Moore, ISSCC Keynote, 2003

Presenter notes: The end of Dennard (constant-field) scaling. Another semi-log chart due to Moore, showing the end of power scaling. Now energy goes as L rather than L^3.

Result: The End of Historic Scaling

C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

Presenter notes: With the mining out of ILP, individual processors aren’t getting faster. And we can’t put more on a chip and stay within power budgets.

Historic scaling is at an end!

To continue performance scaling of all sizes of computer systems requires addressing two challenges:

Power and Parallelism

Much of the economy depends on this


The Power Challenge


In the past we had constant-field scaling

L’ = L/2        V’ = V/2

E’ = CV^2 = E/8        f’ = 2f

D’ = 1/L^2 = 4D        P’ = P

Halve L and get 8x the capability for the same power


Now voltage is held nearly constant

L’ = L/2        V’ = V

E’ = CV^2 = E/2        f’ = 2f*

D’ = 1/L^2 = 4D        P’ = 4P

Halve L and get 2x the capability for the same power in ¼ the area

*f is no longer scaling as 1/L, but it doesn’t matter, we couldn’t power it if it did
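
The two scaling regimes can be compared with a few lines of arithmetic. The sketch below assumes, as the slides do, that capacitance scales with L, frequency with 1/L, and device density with 1/L^2; the function name and dictionary keys are illustrative.

```python
def scale(shrink, voltage_scale):
    """Scaling factors when feature size L is divided by `shrink`.

    Assumes C scales as 1/shrink, f as shrink, and density as shrink**2.
    """
    energy = (1 / shrink) * voltage_scale ** 2  # E = C * V^2 per operation
    freq = shrink
    density = shrink ** 2
    power = energy * freq * density             # power in the same area
    return {
        "energy": energy,
        "freq": freq,
        "density": density,
        "power": power,
        "ops_per_joule": 1 / energy,            # capability for the same power
    }


constant_field = scale(2, 0.5)    # V' = V/2: E/8, P' = P, 8x capability
constant_voltage = scale(2, 1.0)  # V' = V:   E/2, P' = 4P, 2x capability

print(constant_field)
print(constant_voltage)
```

With voltage fixed, running the denser chip at full speed would need 4x the power, which is why only 2x the capability is available in the same power budget.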


Performance = Efficiency

Efficiency = Locality


Locality


The High Cost of Data Movement

Fetching operands costs more than computing on them

[Figure: energy costs on a 28 nm chip, 20 mm across]

64-bit DP operation: 20 pJ
256-bit buses: 26 pJ, 256 pJ, 1 nJ
256-bit access to 8 kB SRAM: 50 pJ
Efficient off-chip link: 500 pJ
DRAM Rd/Wr: 16 nJ

Presenter notes: Architects are artists, and the canvas is the CMOS chip. This is what things cost on a modern CMOS chip. Efficiency = Locality, because most of the cost is in data movement.
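
To make "fetching operands costs more than computing on them" concrete, the sketch below expresses several of the slide's energy figures as multiples of one 64-bit double-precision operation. The dictionary keys paraphrase the slide's labels; nothing here is measured.

```python
PJ = 1.0          # work in picojoules
NJ = 1000.0 * PJ  # one nanojoule in picojoules

# Energy figures as labeled on the slide.
costs_pj = {
    "64-bit DP operation": 20 * PJ,
    "256-bit access to 8 kB SRAM": 50 * PJ,
    "efficient off-chip link": 500 * PJ,
    "DRAM read/write": 16 * NJ,
}

flop = costs_pj["64-bit DP operation"]
ratios = {name: cost / flop for name, cost in costs_pj.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:g}x the energy of one DP operation")
```

A DRAM access costs hundreds of times more than the arithmetic it feeds, which is the basis for the claim that efficiency equals locality.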

Scaling makes locality even more important


It's not about the FLOPS

It's about data movement

Algorithms should be designed to perform more work per unit data movement.

Programming systems should further optimize this data movement.

Architectures should facilitate this by providing an exposed hierarchy and efficient communication.


Move Bits More Efficiently


Move Fewer Bits

forall cells in set {
    compute_x_flux(cell) ;
}
forall cells in set {
    compute_y_flux(cell) ;
}
forall cells in set {
    compute_z_flux(cell) ;
}
forall cells in set {
    compute_p(cell) ;
}


Move Fewer Bits

forall cells in set {
    compute_x_flux(cell) ;
    compute_y_flux(cell) ;
    compute_z_flux(cell) ;
    compute_p(cell) ;
}


Move Fewer Bits

forall blocks in set { // hierarchically
    localize(block)
    forall cells in block {
        compute_x_flux(cell) ;
        compute_y_flux(cell) ;
        compute_z_flux(cell) ;
        compute_p(cell) ;
    }
}
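
The effect of fusing the four loops can be sketched by counting how often each cell has to be fetched. This is a toy model under loud assumptions: the kernels below are hypothetical stand-ins that read only their own cell (real flux kernels usually read neighbors, which constrains fusion), and one "fetch" per cell per pass stands in for memory traffic.

```python
def make_cells(n):
    """Create n toy cells, each holding one value."""
    return [{"v": float(i)} for i in range(n)]


# Hypothetical per-cell kernels; each touches only its own cell.
def compute_x_flux(cell):
    cell["fx"] = 2.0 * cell["v"]


def compute_y_flux(cell):
    cell["fy"] = 3.0 * cell["v"]


def compute_z_flux(cell):
    cell["fz"] = 4.0 * cell["v"]


def compute_p(cell):
    cell["p"] = cell["fx"] + cell["fy"] + cell["fz"]


KERNELS = [compute_x_flux, compute_y_flux, compute_z_flux, compute_p]


def run_unfused(cells):
    """One full pass over all cells per kernel; returns the fetch count."""
    fetches = 0
    for kernel in KERNELS:
        for cell in cells:
            fetches += 1  # the cell is brought in again on every pass
            kernel(cell)
    return fetches


def run_fused(cells):
    """All kernels applied while a cell is resident; returns the fetch count."""
    fetches = 0
    for cell in cells:
        fetches += 1  # the cell is brought in once
        for kernel in KERNELS:
            kernel(cell)
    return fetches


a, b = make_cells(8), make_cells(8)
unfused_fetches = run_unfused(a)
fused_fetches = run_fused(b)
print(unfused_fetches, fused_fetches, a == b)
```

Both versions compute the same fields, but the fused loop moves each cell once instead of four times. The blocked version on the slide applies the same idea one level up the hierarchy, which this simple count does not model.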


System Sketch


Echelon Chip Floorplan

[Figure: chip floorplan. Labels: SM (shown with 8 lanes), NOC, LOC, L2 Banks, XBAR, DRAM I/O, NW I/O. Dimensions: 17 mm, 10 nm process, 290 mm^2.]


Overhead


4/11/11 Milad Mohammadi 36

An Out-of-Order Core Spends 2 nJ to schedule a 25 pJ FMUL (or a 0.5 pJ integer add)
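
Put as ratios, the scheduling overhead dwarfs the useful work. The sketch below only divides the figures quoted above; the variable names are illustrative.

```python
NJ_IN_PJ = 1000.0              # picojoules per nanojoule

schedule_pj = 2 * NJ_IN_PJ     # energy to schedule one instruction
fmul_pj = 25.0                 # energy of the FMUL itself
int_add_pj = 0.5               # energy of an integer add

fmul_overhead = schedule_pj / fmul_pj        # overhead relative to an FMUL
int_add_overhead = schedule_pj / int_add_pj  # overhead relative to an add

print(fmul_overhead, int_add_overhead)
```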

Presenter notes: May want to make a better image if time permits.

SM Lane Architecture

[Figure: SM lane block diagram. Data path labels: ORF, FP/Int, LS/BR, L0 Addr, L1 Addr, Net, RF, LM Bank 0 to LM Bank 3, To LD/ST. Control path labels: L0 I$, Thread PCs, Active PCs, Inst, Scheduler.]

64 threads
4 active threads
2 DFMAs (4 FLOPS/clock)
ORF bank: 16 entries (128 Bytes)
L0 I$: 64 instructions (1 KByte)
LM bank: 8 KB (32 KB total)


Solving the Power Challenge – 1, 2, 3


Solving the ExaScale Power Problem


Parallelism


Parallel programming is not inherently any more difficult than serial programming

However, we can make it a lot more difficult


A simple parallel program

forall molecule in set { // launch a thread array
    forall neighbor in molecule.neighbors { // nested
        forall force in forces { // doubly nested
            molecule.force = reduce_sum(force(molecule, neighbor))
        }
    }
}


Why is this easy?

forall molecule in set { // launch a thread array
    forall neighbor in molecule.neighbors { // nested
        forall force in forces { // doubly nested
            molecule.force = reduce_sum(force(molecule, neighbor))
        }
    }
}

No machine details
All parallelism is expressed
Synchronization is semantic (in reduction)
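
For readers who want something runnable, here is a rough Python analogue of the forall program. It is a sketch under assumptions, not the lecture's programming system: `pair_force` is a made-up 1-D force, the innermost loop over force types is collapsed into it, and a thread pool stands in for the thread array (in CPython this shows the structure, not a speedup).

```python
from concurrent.futures import ThreadPoolExecutor


def pair_force(a, b):
    """Hypothetical 1-D pair force: pulls a toward b in proportion to distance."""
    return b - a


def total_force(i, positions, neighbors):
    """Reduction over one molecule's neighbors (the reduce_sum on the slide)."""
    return sum(pair_force(positions[i], positions[j]) for j in neighbors[i])


def compute_forces(positions, neighbors):
    """Outer forall: one independent task per molecule, no locks needed."""
    with ThreadPoolExecutor() as pool:
        return list(
            pool.map(lambda i: total_force(i, positions, neighbors),
                     range(len(positions)))
        )


positions = [0.0, 1.0, 3.0]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
forces = compute_forces(positions, neighbors)
print(forces)
```

As on the slide, the only synchronization is the one implied by the reduction; the code says nothing about which worker runs which molecule.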


We could make it hard

pid = fork() ; // explicitly managing threads

lock(struct.lock) ; // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock) ;

code = send(pid, tag, &msg) ; // partition across nodes


Programmers, tools, and architecture need to play their positions

Programmer
    Algorithm
    All of the parallelism
    Abstract locality

Architecture
    Fast mechanisms
    Exposed costs

Tools
    Combinatorial optimization
    Mapping
    Selection of mechanisms


Programmers, tools, and architecture need to play their positions

Programmer

forall molecule in set { // launch a thread array
    forall neighbor in molecule.neighbors { //
        forall force in forces { // doubly nested
            molecule.force = reduce_sum(force(molecule, neighbor))
        }
    }
}

Tools
    Map foralls in time and space
    Map molecules across memories
    Stage data up/down hierarchy
    Select mechanisms

Architecture
    Exposed storage hierarchy
    Fast comm/sync/thread mechanisms


Abstract description of Locality – not mapping

compute_forces::inner(molecules, forces) {
    tunable N ;
    set part_molecules[N] ;
    part_molecules = subdivide(molecules, N) ;

    forall(i in 0:N-1) {
        compute_forces(part_molecules[i]) ;
    }
}


Autotuner picks number and size of partitions - recursively

No need to worry about “ghost molecules”: with a global address space, it just works
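
The idea of a tunable partition count can be sketched as a small timing search. Everything here is an assumption made for illustration: `subdivide`, `run`, and `autotune` are hypothetical helpers, the search is a flat sweep over candidate values of N rather than the recursive tuning the slide describes, and the leaf task is a plain sum.

```python
import time


def subdivide(items, n):
    """Split items into n nearly equal contiguous parts."""
    size, extra = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def run(items, n, leaf):
    """Apply the leaf task to each of n partitions and combine the results."""
    return sum(leaf(part) for part in subdivide(items, n))


def autotune(items, candidates, leaf):
    """Return the candidate N with the lowest measured run time."""
    best_n, best_time = None, float("inf")
    for n in candidates:
        t0 = time.perf_counter()
        run(items, n, leaf)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_n, best_time = n, elapsed
    return best_n


data = list(range(10_000))
candidates = [1, 4, 16, 64]
best_n = autotune(data, candidates, sum)
print("chosen N:", best_n)
```

The program text stays the same for every N; only the tuner's choice changes, which is the sense in which locality is described abstractly rather than mapped by hand.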


Autotuning Search Spaces

T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O'Boyle. Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation. In IEEE PACT, pages 237-248, 2000.

[Figure: Execution Time of Matrix Multiplication for Unrolling and Tiling]

Architecture enables simple and effective autotuning


Performance of Auto-tuner

                   Conv2D   SGEMM   FFT3D   SUmb
Cell
    Auto             96.4     129      57   10.5
    Hand             85       119      54   -
Cluster
    Auto             26.7    91.3     5.5   1.65
    Hand             24        90     5.5   -
Cluster of PS3s
    Auto             19.5    32.4    0.55   0.49
    Hand             19        30    0.23   -

Measured raw performance of benchmarks: auto-tuner vs. hand-tuned version, in GFLOPS.

For FFT3D, performance is with fusion of leaf tasks.

SUmb is too complicated to be hand-tuned.


Fundamental and Incidental Obstacles to Programmability

Fundamental
    Expressing 10^9-way parallelism
    Expressing locality to deal with >100:1 global:local energy
    Balancing load across 10^9 cores

Incidental
    Dealing with multiple address spaces
    Partitioning data across nodes
    Aggregating data to amortize message overhead


The fundamental problems are hard enough. We must eliminate the incidental ones.


Execution Model

[Figure: execution model. Labels: Thread, Object, A, B, Load/Store, Active Message, Bulk Xfer, Abstract Memory Hierarchy, Global Address Space.]


Thread array creation, messages, block transfers, collective operations – at the “speed of light”


Kepler

Hardware thread-array creation

Fast __syncthreads() ;

Shared memory


Scalar ISAs don’t matter

Parallel ISAs – the mechanisms for threads, communication, and synchronization make a huge difference.


A Prescription


Research

Need a research vehicle (experimental system)
    Co-design architecture, programming system, applications

Productive parallel programming
    Express all the parallelism and locality
    Compiler and run-time map to the target machine
    Leverage an existing eco-system

Mechanisms for threads, comm, sync
    Eliminate ‘incidental’ programming issues
    Enable fine-grain execution

Power
    Locality: exposed memory hierarchy and software to use it
    Overhead: move scheduling to compiler

Others are investing; if we don’t invest we will be left behind.


Education

We need parallel programmers
    But we are training serial programmers and serial thinkers

Parallelism throughout the CS curriculum
    Programming
    Algorithms
        Parallel algorithms
        Analysis focused on communications, not counting ops
    Systems
        Models need to include locality


A Bright Future from Supercomputers to Cellphones

Eliminate overhead and exploit locality to get 100x power efficiency

Easy parallelism with a coordinated team

Programmer, Tools, Architecture

[Figure: mobile SoC block diagram. Labels: HD Video Decoder, HD Video Encoder, Audio, ISP, GPU, MEM I/O, HDMI, Security Engine, Display, Core 0, Core 1, Core 2, Core 3, Core 4.]


More Fundamentally

Both

are power limited

get performance from parallelism

need 100x performance increase in 10 years


Granularity


#Threads increasing faster than problem size.


Number of Threads increasing faster than problem size

[Figure labels: Weak Scaling, Strong Scaling]


Smaller sub-problem per thread

More frequent comm, sync, and thread operations
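
A small calculation shows why thread counts outpacing problem size forces finer grain. The thread counts below are the ~10^7 and ~10^10 figures quoted on the Road to Exascale slide; the problem size of 10^12 work items is a made-up number for illustration.

```python
def work_per_thread(problem_size, threads):
    """Average number of work items each thread receives."""
    return problem_size / threads


problem = 10**12       # hypothetical number of work items
threads_2012 = 10**7   # ~10^7 threads
threads_2020 = 10**10  # ~10^10 threads

# Strong scaling: same problem, 1000x the threads.
strong_before = work_per_thread(problem, threads_2012)
strong_after = work_per_thread(problem, threads_2020)

# Weak scaling: the problem grows by the same 1000x as the threads.
weak_after = work_per_thread(problem * 1000, threads_2020)

print(strong_before, strong_after, weak_after)
```

Under strong scaling each thread's share shrinks 1000x, so communication, synchronization, and thread operations come around that much more often per unit of work.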


This fine-grain parallelism is multi-level and irregular


To support this requires fast mechanisms for

Thread arrays: create, terminate, suspend, resume
    Hardware allocation of resources to a thread array
        threads, registers, shared memory
    With locality

Communication
    Data movement up and down the hierarchy
    Fast active messages (message-driven computing)

Synchronization
    Collective operations (e.g., barrier, reduce)
    Pairwise (producer-consumer)
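
The two synchronization styles named above can be illustrated in software. This is only an analogy using Python's standard `threading` and `queue` modules, not the hardware mechanisms the lecture argues for; the function names are illustrative.

```python
import queue
import threading


def barrier_reduce(values):
    """Collective: each thread contributes a square, then all read the sum."""
    n = len(values)
    partial = [0] * n
    total = [0]
    results = [None] * n

    def finish():
        # Runs once, in one thread, when every thread has reached the barrier.
        total[0] = sum(partial)

    barrier = threading.Barrier(n, action=finish)

    def worker(i):
        partial[i] = values[i] * values[i]
        barrier.wait()          # nobody proceeds until all contributions exist
        results[i] = total[0]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def produce_consume(items):
    """Pairwise: a bounded queue hands items from a producer to a consumer."""
    q = queue.Queue(maxsize=2)
    done = object()             # sentinel marking the end of the stream
    out = []

    def producer():
        for x in items:
            q.put(x)            # blocks when the consumer falls behind
        q.put(done)

    def consumer():
        while True:
            x = q.get()         # blocks until the producer supplies an item
            if x is done:
                break
            out.append(x * 2)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


print(barrier_reduce([1, 2, 3]))
print(produce_consume([1, 2, 3, 4]))
```

In software these operations cost far more than the work done between them at fine grain, which is the motivation for making them fast hardware mechanisms.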



J-Machine Speedup with Strong Scaling

Noakes et al. “The J-Machine Multicomputer: an Architectural Evaluation”, ISCA, 1993, pp.224-235


2 characters per node