Cores, cores, everywhere
Based on joint work with Martín Abadi, Andrew Baumann, Paul Barham, Richard Black, Vladimir Gajinov, Orion Hodson,
Rebecca Isaacs, Ross McIlroy, Simon Peter, Vijayan Prabhakaran, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania
Outline:
Two hardware trends
Barrelfish operating system
Message-passing software
Managing parallel work
Amdahl’s law
“Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with 128 cores, how many cores do you need to use to get a 4x speed-up on the overall program?”
Amdahl’s law, f=70%
[Chart: speedup vs. #cores (1–16), with perfect scaling on the 70% parallel fraction. The achieved speedup approaches the Amdahl limit of 1/(1−f) = 3.33 as c→∞, so the desired 4x speedup is never reached.]
Amdahl’s law, f=10%
[Chart: speedup vs. #cores (1–16), with perfect scaling on the 10% parallel fraction. The Amdahl limit is just 1/(1−f) = 1.11x.]
Amdahl’s law, f=98%
[Chart: speedup vs. #cores (1–127), with perfect scaling on the 98% parallel fraction. Speedup climbs steadily, towards an Amdahl limit of 1/(1−f) = 50x.]
Amdahl's law & multi-core
Suppose that the same h/w budget (space or power) can make us:
[Diagram: three chip designs of equal area — 16 small cores, 4 medium cores, or 1 big core.]
(analysis from Hill & Marty “Amdahl’s law in the multicore era”)
Perf of big & small cores
[Chart: core perf (relative to 1 big core) vs. resources dedicated to the core (1/16 … 1).
Assumption: perf = α √resource.
16 small cores: total perf = 16 * 1/4 = 4.  1 big core: total perf = 1 * 1 = 1.]
(analysis from Hill & Marty “Amdahl’s law in the multicore era”)
Amdahl's law, f=98%
[Chart: perf (relative to 1 big core) vs. #cores (1–16) for three designs — 1 big, 4 medium, 16 small. At f=98%, the 16 small cores come out ahead.]
(analysis from Hill & Marty “Amdahl’s law in the multicore era”)
Amdahl's law, f=75%
[Chart: perf (relative to 1 big core) vs. #cores for 1 big, 4 medium, 16 small. At f=75%, the 4 medium cores win and the 16 small cores do worse than 1 big.]
(analysis from Hill & Marty “Amdahl’s law in the multicore era”)
Asymmetric chips
[Diagram: a chip combining 1 larger core with 12 small cores ("1+12").]
Amdahl's law, f=75%
[Chart: perf (relative to 1 big core) vs. #cores. At f=75%, the asymmetric 1+12 design beats 1 big, 4 medium, and 16 small.]
(analysis from Hill & Marty “Amdahl’s law in the multicore era”)
Two hardware trends
Traditional multi-processor machines
Asymmetric performance and/or instruction sets
Cache-coherent multicore
AMD Istanbul: 6 cores, per-core L2, per-package L3
[Diagram: four packages, each with six cores (core + per-core L2) sharing a per-package L3; each package attaches to its own banks of RAM.]
Single-chip cloud computer (SCC)
24 * 2-core tiles, on-chip mesh n/w
Non-coherent caches, hardware-supported messaging
[Diagram: each tile pairs two cores (per-core L2) with a message-passing buffer (MPB) on a mesh router; four memory controllers (MC-0, MC-1, MC-3, MC-4) connect to off-chip RAM, alongside the VRC and system interface.]
MSR Beehive
Ring interconnect, message passing in h/w
No cache coherence, split-phase memory access
[Diagram: RISC cores (Module RISCN) on a pipelined ring carrying messages and locks; a MemMux module queues read/write addresses and write data to the DDR controller, returning 32-bit read data to the cores and 128-bit read data to the display controller.]
Two hardware trends
Traditional multi-processor machines
Asymmetric performance and/or instruction sets
Non-cache-coherent access to memory
Outline:
Two hardware trends
Barrelfish operating system
Message-passing software
Managing parallel work
Messaging vs shared data as default
• Fundamental model is message based
• "It's better to have shared memory and not need it than to need shared memory and not have it"
[Spectrum, from traditional operating systems to the Barrelfish multikernel: shared state, one-big-lock → fine-grained locking → clustered objects, partitioning → distributed state, replica maintenance.]
The Barrelfish multi-kernel OS
[Diagram: one OS node per core (x64, x64, ARM, accelerator core), each holding its own state replica; nodes communicate only by message passing over the hardware interconnect, with apps running above, some spanning multiple cores.]
System runs on heterogeneous hardware, currently supporting ARM, Beehive, SCC, x86 & x64.
System components, each local to a specific core, and using message passing.
User-mode programs: several models supported, including conventional shared-memory OpenMP & pthreads.
Outline:
Two hardware trends
Barrelfish operating system
Message-passing software
Managing parallel work
Shared Resource Database Consensus

bool updatePermissions(page_t page, flags_t flags) {
  bool ok = true;
  for (core in cores)
    ok &= permUpdateRequest_rpc(core, page, flags);
  if (ok) {
    localUpdatePermissions(page, flags);
    for (core in cores)
      permUpdateCommit_send(core, page, flags);
  } else {
    for (core in cores)
      permUpdateAbort_send(core, page, flags);
  }
  return ok;
}
Two-Phase Commit: a voting phase followed by a commit phase.
Problem: a blocking RPC before sending to the next core costs ~400 cycles, and that assumes the process is scheduled on the other core!
Shared Resource Database Consensus

void updatePermissions(page_t page, flags_t flags) {
  state_t *st = malloc(sizeof(state_t));
  st->ok = true; st->page = page; st->flags = flags; st->count = 0;
  for (core in cores) {
    permUpdateRequest_send(core, page, flags, st);
    st->count++;
  }
}

void recvReply(state_t *st, bool ok) {
  st->ok &= ok;
  if (--st->count == 0) {
    if (st->ok) {
      localUpdatePermissions(st->page, st->flags);
      for (core in cores)
        permUpdateCommit_send(core, st->page, st->flags);
    } else {
      for (core in cores)
        permUpdateAbort_send(core, st->page, st->flags);
    }
    free(st);
  }
}
Stack-ripped: each send can fail to complete immediately (e.g., due to a full channel), so the code must be stack-ripped into further callbacks at every send point.
AC: Asynchronous C
Synchronous: easy to program, poor performance.
Event-driven: difficult to program, good performance.
AC: similar programming model to synchronous, similar performance to event-driven.
Shared Resource Database Consensus

bool updatePermissions(page_t page, flags_t flags) {
  bool ok = true;
  do {
    for (core in cores)
      async { ok &= permUpdateRequest_AC(core, page, flags); }
  } finish;
  if (ok) {
    localUpdatePermissions(page, flags);
    for (core in cores)
      permUpdateCommit_send(core, page, flags);
  } else {
    for (core in cores)
      permUpdateAbort_send(core, page, flags);
  }
  return ok;
}
async identifies code that can block; execution can continue after the async block.
permUpdateRequest_AC is the AC version of the message RPC.
Execution doesn't pass finish until all async work created in the do {} finish block has completed.
[Chart: time per operation (cycles) vs. #cores (2–16) for the Shared Resource Database Consensus benchmark, comparing the event-driven, synchronous, and AC implementations. The synchronous version scales worst; AC stays close to the event-driven version.]
Performance
Ping-pong test, minimum-sized messages
AMD 4 * 4-core machine, using cores sharing an L3 cache

Ping-pong latency (cycles):
  Using UMP channel directly                     931
  Using event-based stubs                       1134
  Synchronous model (client only)               1266
  Synchronous model (client and server)         1405
  MPI (Visual Studio 2008 + HPC Pack 2008 SDK)  2780
Performance
Function call latency (cycles):
  Direct (normal function call)        8
  async foo() (foo does not block)    12
  async foo() (foo blocks)          1692

• "Do not fear async": think about correctness; if the callee doesn't block then perf is basically unchanged.
Outline:
Two hardware trends
Barrelfish operating system
Message-passing software
Managing parallel work
Adding Parallelism

do {
  async msg_send(core_1, "Computing Forces");
  par fluidAnimate(computeForces, cells, range);
} finish;

par spawns a bunch of parallel tasks that can be run across multiple cores; finish waits for the parallel and async tasks to complete before continuing.
FluidAnimate
• for each frame
  – move particles to correct cell
  – calculate cell density
  – calculate particle forces
  – calculate particle positions
  – render frame
Static Partitioning
• for each frame
  – move particles to correct cell
  – calculate cell density
  – calculate particle forces
  – calculate particle positions
  – render frame

Problem: uneven workload
Problem: barrier synchronization
Problem: thread preemption

Approach taken by (e.g.) OpenMP and Intel Parallel Building Blocks; they assume you own the machine and know your workload.
Dynamic Partitioning (Work-Stealing)
• for each frame
  – move particles to correct cell
  – calculate cell density
  – calculate particle forces
  – calculate particle positions
  – render frame

Problem: spawn / sync overhead
  Cilk-5: 218 cycles per task
  Wool (old version): 97 cycles per task
  Density calculation task: ~10 cycles per particle
Problem: cache locality
Problem: data synchronization
Space-Time Continuum
• Controlled partitioning programming model
  – Flexible enough to enable movement on this spectrum
  – Runtime system controls re-partitioning
  – Application controls how
• Parameterise how data is partitioned
• Decide whether data-synchronisation is necessary

[Spectrum from dynamic partitioning to static partitioning: where a workload sits depends on both the workload and the machine, e.g. a 64-core server vs. a 4-core laptop.]
Controlled Partitioning
FluidAnimate

void computeForces(cell_t [][][] cells, dimentions_t d) {
  range_t range = { .x_start=0, .x_curr=0, .x_end=d.x_len, ... };
  do {
    par fluidAnimate(computeForces, cells, range);
  } finish;
}

par_task fluidAnimate {
  task computeForces(cell_t cell) {
    for (particle in cell) {
      struct cell_t [] ncells = getNeighbours(cell);
      particle.force = calcForce(particle, ncells);
    }
  }
  range_t [] subdivide(range_t curr_cells, int num) {
    // subdivide curr into num equal cubes, and add to new
  }
  cell_t getNext(cells_t [][][] cells, range_t range) {
    // return next cell in cells, or NULL if finished
  }
}
FluidAnimate

void __computeForces_task(range_t my_range, cells_t [][][] cells) {
  cell_t cell = __fluidAnimate_getNext(cells, my_range);
  do {
    for (particle in cell) {
      struct cell_t [] ncells = getNeighbours(cell);
      particle.force = calcForce(particle, ncells);
    }
    int num = calico_should_subdivide();
    if (num > 0) {
      range_t [] new_ranges = __fluidAnimate_subdivide(my_range, num);
      calico_schedule_par(__computeForces_task, new_ranges, cells);
      return;
    }
  } while ((cell = __fluidAnimate_getNext(cells, my_range)) != NULL);
}

range_t [] __fluidAnimate_subdivide(range_t curr_cells, int num) {
  // subdivide curr into num equal cubes, and add to new
}
cell_t __fluidAnimate_getNext(cells_t [][][] cells, range_t range) {
  // return next cell in cells, or NULL if finished
}
Aggregation of multiple task iterations
Automatic Repartitioning when necessary
FluidAnimate

par_task fluidAnimate {
  task moveParticles(cell_t cell) { ... }
  task computeDensities(cell_t cell) { ... }
  task computeForces(cell_t cell) { ... }
  task renderCell(cell_t cell) { ... }

  range_t [] subdivide(range_t curr_cells, int num) {
    // subdivide curr into num equal cubes, and add to new
  }
  cell_t getNext(cells_t [][][] cells, range_t range) {
    // return next cell in cells, or NULL if finished
  }
  bool calcOnDifferentCore(cell_t cell, range_t range) {
    // return true if cell is not within range
  }
}
FluidAnimate

par_task fluidAnimate {
  task moveParticles(cell_t cell) {
    for (particle in cell) {
      cell_t new_cell = calculateParticlesCell(particle);
      if (new_cell == cell) continue;
      if (calcOnDifferentCore(new_cell, my_range)) {
        lockAndUpdate(new_cell, particle);
      } else {
        updateNoLock(new_cell, particle);
      }
    }
  }
  ...
  bool calcOnDifferentCore(cell_t cell, range_t range) {
    // return true if cell is not within range
  }
  ...
}
FluidAnimate Results
[Chart: wall-clock execution time, normalised to sequential (lower is better), vs. number of cores (1–8), comparing Parsec Native and Calico, with no competition for CPU time.]
FluidAnimate Results
[Chart: wall-clock execution time, normalised to sequential (lower is better), vs. number of cores (1–8), comparing Parsec Native and Calico, with competition for CPU time.]
Outline:
Two hardware trends
Barrelfish operating system
Message-passing software
Managing parallel work
http://www.barrelfish.org
©2010 Microsoft Corporation. All rights reserved. This material is provided for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft is a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries.