
Page 1: Cores, cores, everywhere

Cores, cores, everywhere

Based on joint work with Martín Abadi, Andrew Baumann, Paul Barham, Richard Black, Vladimir Gajinov, Orion Hodson, Rebecca Isaacs, Ross McIlroy, Simon Peter, Vijayan Prabhakaran, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania

Page 2: Cores, cores, everywhere

• Two hardware trends
• Barrelfish operating system
• Message-passing software
• Managing parallel work

Page 3: Cores, cores, everywhere

Amdahl’s law

“Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with 128 cores, how many cores do you need to use to get a 4x speed-up on the overall program?”
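Working the example through Amdahl's law: with parallel fraction f and c cores,

  speedup(c) = 1 / ((1 - f) + f/c)

With f = 70%, even as c → ∞ the speedup is capped at 1/(1 - f) = 1/0.3 ≈ 3.33x, so a 4x overall speedup is unattainable no matter how many of the 128 cores are used, as the chart on the next slide shows.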

Page 4: Cores, cores, everywhere

Amdahl’s law, f=70%

[Chart: speedup vs #cores (1-16). The speedup achieved with perfect scaling on the 70% fraction never reaches the desired 4x speedup; limit as c→∞ = 1/(1-f) = 3.33.]

Page 5: Cores, cores, everywhere

Amdahl’s law, f=10%

[Chart: speedup vs #cores (1-16). Speedup achieved with perfect scaling on the 10% fraction; Amdahl's law limit is just 1.11x.]

Page 6: Cores, cores, everywhere

Amdahl’s law, f=98%

[Chart: speedup vs #cores (1-128), with perfect scaling on the 98% fraction.]

Page 7: Cores, cores, everywhere

Amdahl’s law & multi-core

Suppose that the same h/w budget (space or power) can make us:
• 16 small cores, or
• 1 big core, or
• 4 medium cores

(analysis from Hill & Marty “Amdahl’s law in the multicore era”)

Page 8: Cores, cores, everywhere

Perf of big & small cores

[Chart: core perf (relative to 1 big core) vs resources dedicated to the core (1/16 to 1).]

Assumption: perf = α √resource
Total perf, 16 small cores: 16 * 1/4 = 4
Total perf, 1 big core: 1 * 1 = 1

(analysis from Hill & Marty “Amdahl’s law in the multicore era”)
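A sketch of how the curves on the next two slides are computed, reconstructed from the slide's √resource assumption (the standard Hill & Marty symmetric-chip model): a core given a fraction ρ of the big core's resources runs at √ρ of its speed, so each of the 16 small cores runs at 1/4 and each of the 4 medium cores at 1/2 of the big core's speed. If a chip's cores have relative speed p and the program uses c of them, the sequential fraction runs on one core and the parallel fraction on all c:

  perf relative to 1 big core = 1 / ((1 - f)/p + f/(c · p))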

Page 9: Cores, cores, everywhere

Amdahl’s law, f=98%

[Chart: perf (relative to 1 big core) vs #cores in use (1-16), for chips of 1 big, 4 medium, or 16 small cores.]

(analysis from Hill & Marty “Amdahl’s law in the multicore era”)

Page 10: Cores, cores, everywhere

Amdahl’s law, f=75%

[Chart: perf (relative to 1 big core) vs #cores in use (1-16), for chips of 1 big, 4 medium, or 16 small cores.]

(analysis from Hill & Marty “Amdahl’s law in the multicore era”)

Page 11: Cores, cores, everywhere

Asymmetric chips

[Diagram: an asymmetric chip with 1 big core (occupying the space of 4 small cores) plus 12 small cores.]

Page 12: Cores, cores, everywhere

Amdahl’s law, f=75%

[Chart: perf (relative to 1 big core) vs #cores in use (1-16), comparing 1 big, 4 medium, 16 small, and the asymmetric 1+12 chip.]

(analysis from Hill & Marty “Amdahl’s law in the multicore era”)
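For the 1+12 chip the usual Hill & Marty asymmetric reading applies (an interpretation, following their paper): the sequential fraction runs on the big core at full speed, while the parallel fraction can use the big core plus the 12 small cores, a combined throughput of 1 + 12 · 1/4 = 4 relative to the big core. The asymmetric design keeps sequential speed without giving up much parallel throughput, which is why it looks attractive at f = 75%.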

Page 13: Cores, cores, everywhere

Two hardware trends

Traditional multi-processor machines

Asymmetric performance and/or instruction sets

Page 14: Cores, cores, everywhere

Cache-coherent multicore

AMD Istanbul: 6 cores, per-core L2, per-package L3

[Diagram: 4 packages, each with 6 cores (per-core L2) sharing an L3, with RAM banks attached across the packages.]

Page 15: Cores, cores, everywhere

Single-chip cloud computer (SCC)

• 24 * 2-core tiles
• On-chip mesh n/w
• Non-coherent caches
• Hardware-supported messaging

[Diagram: each tile holds 2 cores with private L2s, a router and a message-passing buffer (MPB); the tile mesh connects to a VRC, four memory controllers with attached RAM, and a system interface.]

Page 16: Cores, cores, everywhere

MSR Beehive

• Ring interconnect
• Message passing in h/w
• No cache coherence
• Split-phase memory access

[Diagram: RISC cores (Module RISCN) on a pipelined ring interconnect carrying messages and locks; a MemMux module and DDR controller provide split-phase reads and writes to RAM, with the display controller sharing the same path.]

Page 17: Cores, cores, everywhere

Two hardware trends

Traditional multi-processor machines

Asymmetric performance and/or instruction sets

Non-cache-coherent access to memory

Page 18: Cores, cores, everywhere

• Two hardware trends
• Barrelfish operating system
• Message-passing software
• Managing parallel work

Page 19: Cores, cores, everywhere

Messaging vs shared data as default

• Fundamental model is message based
• “It’s better to have shared memory and not need it than to need shared memory and not have it”

[Spectrum: shared state with one big lock → fine-grained locking → clustered objects / partitioning → distributed state with replica maintenance. Traditional operating systems sit at the shared-state end; the Barrelfish multikernel sits at the distributed-state end.]

Page 20: Cores, cores, everywhere

The Barrelfish multi-kernel OS

[Diagram: apps running over heterogeneous cores (x64, x64, ARM, accelerator core); one OS node per core, each holding a state replica; OS nodes communicate by message passing over the hardware interconnect.]

Page 21: Cores, cores, everywhere

The Barrelfish multi-kernel OS


System runs on heterogeneous hardware, currently supporting ARM, Beehive, SCC, x86 & x64

Page 22: Cores, cores, everywhere

The Barrelfish multi-kernel OS


System components, each local to a specific core, and using message passing


Page 23: Cores, cores, everywhere

The Barrelfish multi-kernel OS


User-mode programs: several models supported, including conventional shared-memory OpenMP & pthreads


Page 24: Cores, cores, everywhere

• Two hardware trends
• Barrelfish operating system
• Message-passing software
• Managing parallel work

Page 25: Cores, cores, everywhere

Shared Resource Database Consensus

bool updatePermissions(page_t page, flags_t flags) {
  bool ok = true;
  for (core in cores)
    ok &= permUpdateRequest_rpc(core, page, flags);
  if (ok) {
    localUpdatePermissions(page, flags);
    for (core in cores)
      permUpdateCommit_send(core, page, flags);
  } else {
    for (core in cores)
      permUpdateAbort_send(core, page, flags);
  }
  return ok;
}

Two-Phase Commit

Voting Phase

Commit Phase

Blocking RPC before sending to next core

~400 cycles, assuming the process is scheduled on the other core!
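To put that in perspective (a rough estimate from the slide's figures): with, say, 16 cores and one blocking RPC per core, the voting phase alone serialises roughly 16 × 400 ≈ 6,400 cycles of latency before any commit or abort message can be sent, and much more if a remote process is not already scheduled.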

Page 26: Cores, cores, everywhere

Shared Resource Database Consensus

bool updatePermissions(page_t page, flags_t flags) {
  state_t *st = malloc(sizeof(state_t));
  st->ok = true; st->page = page; st->flags = flags; st->count = 0;
  for (core in cores) {
    permUpdateRequest_send(core, page, flags, st);
    st->count++;
  }
}

void recvReply(state_t *st, bool ok) {
  st->ok &= ok;
  if (--st->count == 0) {
    if (st->ok) {
      localUpdatePermissions(st->page, st->flags);
      for (core in cores)
        permUpdateCommit_send(core, st->page, st->flags);
    } else {
      for (core in cores)
        permUpdateAbort_send(core, st->page, st->flags);
    }
    free(st);
  }
}

Stack-ripped: the send can fail to complete immediately (e.g., due to a full channel), so the code needs to be stack-ripped here, and here, and here…

Page 27: Cores, cores, everywhere

AC: Asynchronous C

Synchronous: easy to program, poor performance
Event-driven: difficult to program, good performance
AC: similar programming model to synchronous, similar performance to event-driven

Page 28: Cores, cores, everywhere

Shared Resource Database Consensus

bool updatePermissions(page_t page, flags_t flags) {
  bool ok = true;
  do {
    for (core in cores)
      async { ok &= permUpdateRequest_AC(core, page, flags); }
  } finish;
  if (ok) {
    localUpdatePermissions(page, flags);
    for (core in cores)
      permUpdateCommit_send(core, page, flags);
  } else {
    for (core in cores)
      permUpdateAbort_send(core, page, flags);
  }
  return ok;
}

Identify code that can block – execution can continue after async

AC versions of message RPCs

Don’t pass finish until all async work created in the do {} finish block has completed

Page 29: Cores, cores, everywhere

[Chart: time per operation (cycles, 0-50,000) vs # cores (2-16) for the Shared Resource Database Consensus benchmark; series: Synchronous, Event-Driven, AC.]

Page 30: Cores, cores, everywhere

Performance

Ping-pong test, minimum-sized messages
AMD 4 * 4-core machine, using cores sharing an L3 cache

  Configuration                                   Ping-pong latency (cycles)
  Using UMP channel directly                      931
  Using event-based stubs                         1134
  Synchronous model (client only)                 1266
  Synchronous model (client and server)           1405
  MPI (Visual Studio 2008 + HPC-Pack 2008 SDK)    2780

Page 31: Cores, cores, everywhere

Performance

  Call                                Function call latency (cycles)
  Direct (normal function call)       8
  async foo() (foo does not block)    12
  async foo() (foo blocks)            1692

• “Do not fear async” – think about correctness: if the callee doesn’t block then perf is basically unchanged

Page 32: Cores, cores, everywhere

• Two hardware trends
• Barrelfish operating system
• Message-passing software
• Managing parallel work

Page 33: Cores, cores, everywhere

Adding Parallelism

do {
  async msg_send(core_1, “Computing Forces”);
  par fluidAnimate (computeForces, cells, range);
} finish;

Spawn a bunch of parallel tasks that can be run across multiple cores
Wait for parallel and async tasks to complete before continuing

Page 34: Cores, cores, everywhere

FluidAnimate

• for each frame
  – move particles to correct cell
  – calculate cell density
  – calculate particle forces
  – calculate particle positions
  – render frame

Page 35: Cores, cores, everywhere

Static Partitioning

• for each frame
  – move particles to correct cell
  – calculate cell density
  – calculate particle forces
  – calculate particle positions
  – render frame


Page 37: Cores, cores, everywhere

Static Partitioning

• for each frame
  – move particles to correct cell
  – calculate cell density
  – calculate particle forces
  – calculate particle positions
  – render frame

Problem: Uneven workload

Page 38: Cores, cores, everywhere

Static Partitioning

• for each frame
  – move particles to correct cell
  – calculate cell density
  – calculate particle forces
  – calculate particle positions
  – render frame

Problem: Barrier Synchronization

Page 39: Cores, cores, everywhere

Static Partitioning

• for each frame
  – move particles to correct cell
  – calculate cell density
  – calculate particle forces
  – calculate particle positions
  – render frame

Problem: Thread Preemption

Approach taken by (e.g.) OpenMP and Intel Parallel Building Blocks

They assume you own the machine and know your workload

Page 40: Cores, cores, everywhere

Dynamic Partitioning (Work-Stealing)

• for each frame
  – move particles to correct cell
  – calculate cell density
  – calculate particle forces
  – calculate particle positions
  – render frame


Page 46: Cores, cores, everywhere

Dynamic Partitioning (Work-Stealing)

• for each frame
  – move particles to correct cell
  – calculate cell density
  – calculate particle forces
  – calculate particle positions
  – render frame

Problem: Spawn / Sync Overhead

Cilk-5: 218 cycles per task

Wool (old version): 97 cycles per task

Density calculation task: ~10 cycles per particle
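A rough reading of those numbers: at ~10 cycles per particle, a stolen task has to cover on the order of tens of particles before the 97-218 cycle spawn/sync cost stops dominating, so per-particle (or even per-small-cell) tasks spend much of their time in the scheduler rather than computing densities.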

Page 47: Cores, cores, everywhere

Dynamic Partitioning (Work-Stealing)

• for each frame
  – move particles to correct cell
  – calculate cell density
  – calculate particle forces
  – calculate particle positions
  – render frame

Problem: Cache Locality


Page 52: Cores, cores, everywhere

Dynamic Partitioning (Work-Stealing)

• for each frame
  – move particles to correct cell
  – calculate cell density
  – calculate particle forces
  – calculate particle positions
  – render frame

Problem: Data Synchronization

Page 53: Cores, cores, everywhere

Space-Time Continuum

• Controlled partitioning programming model
  – Flexible enough to enable movement on this spectrum
  – Runtime system controls re-partitioning
  – Application controls how
    • Parameterise how data is partitioned
    • Decide whether data-synchronisation is necessary

[Diagram: a spectrum from dynamic partitioning to static partitioning; the same two workloads sit at different points on the spectrum on a 64-core server versus a 4-core laptop.]

Page 54: Cores, cores, everywhere

Controlled Partitioning


Page 60: Cores, cores, everywhere

FluidAnimate

void computeForces(cell_t [][][] cells, dimentions_t d) {
  range_t range = { .x_start=0, .x_curr=0, .x_end=d.x_len, ... };
  do {
    par fluidAnimate (computeForces, cells, range);
  } finish;
}

par_task fluidAnimate {
  task computeForces(cell_t cell) {
    for (particle in cell) {
      struct cell_t [] ncells = getNeighbours(cell);
      particle.force = calcForce(particle, ncells);
    }
  }
  range_t [] subdivide(range_t curr_cells, int num) {
    // subdivide curr into num equal cubes, and add to new
  }
  cell_t getNext(cells_t [][][] cells, range_t range) {
    // return next cell in cells, or NULL if finished
  }
}

Page 61: Cores, cores, everywhere

FluidAnimate

void __computeForces_task(range_t my_range, cells_t [][][] cells) {
  cell_t cell = __fluidAnimate_getNext(cells, my_range);
  do {
    for (particle in cell) {
      struct cell_t [] ncells = getNeighbours(cell);
      particle.force = calcForce(particle, ncells);
    }
    if ((int num = calico_should_subdivide()) > 0) {
      range_t[] new_ranges = __fluidAnimate_subdivide(my_range, num);
      calico_schedule_par(__computeForces_task, new_ranges, cells);
      return;
    }
  } while ((cell = __fluidAnimate_getNext(cells, my_range)) != NULL);
}

range_t[] __fluidAnimate_subdivide(range_t curr_cells, int num) {
  // subdivide curr into num equal cubes, and add to new
}

cell_t __fluidAnimate_getNext(cells_t [][][] cells, range_t range) {
  // return next cell in cells, or NULL if finished
}

Aggregation of multiple task iterations

Automatic Repartitioning when necessary

Page 62: Cores, cores, everywhere

FluidAnimate

par_task fluidAnimate {
  task moveParticles(cell_t cell) { ... }
  task computeDensities(cell_t cell) { ... }
  task computeForces(cell_t cell) { ... }
  task renderCell(cell_t cell) { ... }

  range_t [] subdivide(range_t curr_cells, int num) {
    // subdivide curr into num equal cubes, and add to new
  }
  cell_t getNext(cells_t [][][] cells, range_t range) {
    // return next cell in cells, or NULL if finished
  }
  bool calcOnDifferentCore(cell_t cell, range_t range) {
    // return true if cell is not within range
  }
}

Page 63: Cores, cores, everywhere

FluidAnimate

par_task fluidAnimate {
  task moveParticles(cell_t cell) {
    for (particle in cell) {
      cell_t new_cell = calculateParticlesCell(particle);
      if (new_cell == cell) continue;
      if (onDifferentCore(new_cell)) {
        lockAndUpdate(new_cell, particle);
      } else {
        updateNoLock(new_cell, particle);
      }
    }
  }
  ...
  bool calcOnDifferentCore(cell_t cell, range_t range) {
    // return true if cell is not within range
  }
  ...
}

Page 64: Cores, cores, everywhere

FluidAnimate

(Same code as the previous slide; the highlighted membership test is calcOnDifferentCore(new_cell, my_range).)

Page 65: Cores, cores, everywhere

FluidAnimate Results

[Chart: wall-clock execution time (normalised to sequential; lower is better) vs number of cores (1-8), comparing Parsec Native and Calico, with no competition for CPU time.]

Page 66: Cores, cores, everywhere

FluidAnimate Results

[Chart: wall-clock execution time (normalised to sequential; lower is better) vs number of cores (1-8), comparing Parsec Native and Calico, with competition for CPU time.]

Page 67: Cores, cores, everywhere

• Two hardware trends
• Barrelfish operating system
• Message-passing software
• Managing parallel work

http://www.barrelfish.org

Page 68: Cores, cores, everywhere

©2010 Microsoft Corporation. All rights reserved.This material is provided for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft is a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries.