Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
Memory Servers for Multicore Systems
Rodolfo Pellizzoni and Heechul Yun
/ 28
Outline
• Delay analysis and compositionality
• Solution: Memory Server
• Core-level analysis
• System-level analysis
• Evaluation
• Conclusions
1
/ 28
Outline
• Delay analysis and compositionality
• Solution: Memory Server
• Core-level analysis
• System-level analysis
• Evaluation
• Conclusions
1
/ 28
Resource Contention
2
Core Core Core
Shared Physical Resource
INTERFERENCE!!!
t t t
We need a resource delay analysis to bound such effects
/ 28
Analysis Taxonomy
• Request-driven bound: – Compute maximum per-Request Delay – : # requests task under analysis – Delay = – Computed delay can be very pessimistic!
• Job-driven bound: – : bound on # requests by other cores – Delay = – Non compositional – requires knowledge of tasks
running on all cores! 3
RD
H
RD ⇥H
Hf(H,H)
/ 28
Example: DRAM Write Buffering • Reads are prioritized over writes; writes are buffered and
transmitted in non-preemptive batches (ARM A15 – 18 req.) • If a write batch begins right before a read under analysis
arrives, the request is delayed by 18 write requests…
4
Memory Controller
13
Read request buffer
Write request buffer
Bank 1 scheduler
Channel scheduler
Bank 2 scheduler
Bank N scheduler
DRAM chip
Memory requests from cores
Writes are buffered and processed opportunistically
/ 28
COTS Arbiters are Unfair
• The main issue: unfair arbitration – cores – A request can be delayed by other requests
• Many other examples… – First-Ready FCFS DRAM arbitration – Non-blocking caches – Wormhole NoC arbitration
5
M
� M � 1
/ 28
How bad can it be?
6
Results with SPEC2006 Benchmarks
• Main source of pessimism: – The pathological case of write (LLC write-backs) processing
25
Even with just 4 cores, measured execution time is ~8x larger
Provable bound is much higher due to write buffering
/ 28
Summary: The Problem • Compositional analysis is highly desirable • Develop each application (core) independently • But bounds are bad due to request-driven analysis
7
Core#1–Applica/on#1
Shared Physical Resource
Core#3–Applica/on#3
Core#2–Applica/on#2
/ 28
Summary: The Problem • Compositional analysis is highly desirable • Develop each application (core) independently • But bounds are bad due to request-driven analysis
7
Shared Physical Resource
CoreCoret
t
t
/ 28
Summary: The Problem • Or forget about compositionality to use job-driven bounds... • ... but then any change to other applications forces me to
redo timing analysis
8
Shared Physical Resource
t
t
t t
t
t
t
t
/ 28
Summary: The Problem • Or forget about compositionality to use job-driven bounds... • ... but then any change to other applications forces me to
redo timing analysis
8
Shared Physical Resource
t
t
t t
t
t
t
t
Let me change a compiler flag here…
/ 28
Summary: The Problem • Or forget about compositionality to use job-driven bounds... • ... but then any change to other applications forces me to
redo timing analysis
8
Shared Physical Resource
t
t
t t
t
t
t
t
Let me change a compiler flag here… OMG!!!
/ 28
Summary: The Problem • Or forget about compositionality to use job-driven bounds... • ... but then any change to other applications forces me to
redo timing analysis
8
Shared Physical Resource
t
t
t t
t
t
t
t
Let me change a compiler flag here… OMG!!!
We need a way to keep compositionality but use job-driven bounds!
/ 28
Outline
• Delay analysis and compositionality
• Solution: Memory Server
• Core-level analysis
• System-level analysis
• Evaluation
• Conclusions
1
/ 28
Memory Server • How do we solve the problem?
1. Add a server with budget to each core p
2. Server constraints the core to make no more than accesses to the shared resource every time units
3. Now we know the maximum resource accesses by other cores, so we can use job-driven analysis!
9
Qp
Qp
P
/ 28
Server Implementation
10
• Reserve per-core memory bandwidth via the OS scheduler – Use h/w PMC to monitor memory request rate
1ms 2ms 0
Suspend the RT task
Budget
Core activity
2 1
computation memory fetch
Qp
Regulation period P
Heechul Yun et al: MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms, RTAS’13
/ 28
Memory vs Hierarchical Server • Isn’t this similar to a hierarchical server?
11
t
t
t t
t
Application#1 Application#2
Sched. Interface Sched. Interface
Yes
• Multiple applications on the same core
/ 28
Memory vs Hierarchical Server • Isn’t this similar to a hierarchical server?
11
t
t
t t
t
Application#1 Application#2
Sched. Interface Sched. Interface
• Step 1: perform local analysis to derive sched. interface
Yes
/ 28
Memory vs Hierarchical Server • Isn’t this similar to a hierarchical server? Yes
11
t
t
t t
t
Application#1 Application#2
Sched. Interface Sched. Interface
System-level Analysis
• Step 2: perform system-level analysis to check sched. – Ex:
XQp/P Ub
/ 28
Memory vs Hierarchical Server • For memory server:
12
t
t
t t
t
Application#1 Application#2
Sched. Interface Sched. Interface
• Multiple application on different cores
/ 28
Memory vs Hierarchical Server • For memory server:
12
t
t
t t
t
Application#1 Application#2
Sched. Interface Sched. Interface
• Step 1: perform core-level analysis to derive interface
/ 28
Memory vs Hierarchical Server • For memory server:
12
t
t
t t
t
Application#1 Application#2
Sched. Interface Sched. Interface
System-level Analysis
• Step 2: perform system-level analysis to check sched.
/ 28
Interfaces and Compositionality • In both cases, the interface represents a contract
– Each application’s developer guarantees that he will respect the contract
– Check schedulability based on the contracts for the other applications, not the tasks inside
13
I want to add a new task to my application...
Do what you want, just respect the contract
/ 28
Memory vs Hierarchical Server • Key difference: interfaces • For fixed , the schedulability of an application in a
hierarchical server depends only on the assigned budget
14
PQp
0 2 4 6 8 Qp
/ 28
2
4
6
8
10
QBp
Memory vs Hierarchical Server • Key difference: interfaces • … but the delay (hence schedulability) for memory server also
depends on the total budget assigned to other cores!
14 0 2 4 6 8
0 Qp
Key intuition: there is no global resource utilization bound – it depends on the tasks within the application
QBp
/ 28
Detailed Contribution • Core-level analysis computes 2D feasibility region for a core
– Based on state-of-the-art DRAM delay analysis; but any request-driven / job-driven analysis could be used
• System-level analysis determines feasibility – Is there a budget assignment that satisfies all interfaces?
• Both analyses have reasonable complexity
15
/ 28
Outline
• Delay analysis and compositionality
• Solution: Memory Server
• Core-level analysis
• System-level analysis
• Evaluation
• Conclusions
1
/ 28
Core-level Analysis • Fixed priority scheduling
16
t
t
t
Bini & Buttazzo: Schedulability analysis of periodic fixed priority systems, TC’04
For each task…
• Compute set of time instants • Iff for any time instant : computation demand( ) ,
task is schedulable t̄ [0, t̄) t̄
t̄
/ 28
Core-level Analysis
16
t
t
t
• Fixed priority scheduling Bini & Buttazzo: Schedulability analysis of periodic fixed priority systems, TC’04
For each task…
• With resource interference: computation demand( ) + resource delay( ) [0, t̄) [0, t̄) t̄
t̄
/ 28
Resource Delay: Regulation • Regulation: tasks in the busy interval are delayed because
they exhaust the budget
• Regulation delay decreases with larger budget values
17
resource delay( ) computation demand( ) [0, t̄) [0, t̄) t̄�
Qp
QBp
Qp
Qp
/ 28
Resource Delay: Contention • Contention: tasks in the busy interval are delayed due to
requests of other cores
• Contention delay decreases with smaller budget values
18
QBp
Qp
QBp
QBp
resource delay( ) computation demand( ) [0, t̄) [0, t̄) t̄�
/ 28
Memory Delay: One Time Instant • Key obs.: resource delay is the max of regulation, contention
– Independent of analysis; true bc of how regulation works • Task is schedulable at if it meets both regulation and
contention constraints
19
t̄
Qp
QBp
Qp
QBp
/ 28
Interface: One Task • Task is schedulable if it meets constraints at any time
instant
20
t̄
Qp
QBp
Qp
QBp
t̄1Region for Region for t̄2
/ 28
Interface: One Task
20
Qp
QBp
Region for ⌧1
• Task is schedulable if it meets constraints at any time instant t̄
/ 28
Interface: One Core
21
• Task set is schedulable if all tasks are schedulable
Qp
QBp
Qp
QBp
Region for ⌧1Region for ⌧2
/ 28
Interface: One Core
21
• Task set is schedulable if all tasks are schedulable
Qp
QBp
Region for the core
/ 28
Outline
• Delay analysis and compositionality
• Solution: Memory Server
• Core-level analysis
• System-level analysis
• Evaluation
• Conclusions
1
/ 28
System-level analysis
22
0 2 4 6 8
0
2
4
6
8
Qp
QBp
0 2 4 6 8
0
2
4
6
8
Qp
QBp
4
4
3
3
• Start with smallest budget values • Determine budget of other cores • If not feasible, move to next budget value
Interface for Core#1 Interface for Core#2
Qp
QBp
/ 28
System-level analysis
22
0 2 4 6 8
0
2
4
6
8
Qp
0 2 4 6 8
0
2
4
6
8
Qp
6
6
5
5
Interface for Core#1 Interface for Core#2
QBp QBp
• Start with smallest budget values • Determine budget of other cores • If not feasible, move to next budget value
Qp
QBp
/ 28
System-level analysis
22
0 2 4 6 8
0
2
4
6
8
Qp
0 2 4 6 8
0
2
4
6
8
Qp
6
6
7
7
Interface for Core#1 Interface for Core#2
QBp QBp
• Start with smallest budget values • Determine budget of other cores • If not feasible, move to next budget value
Qp
QBp
OK!
/ 28
Result Summary • System-level analysis is optimal (based on computed
core-level interfaces) – If there is a feasible budget assignment, it will find it
• Computation complexity – cores, tasks per core, time instances per task – Core-level analysis is – System-level analysis is
• Note: is , but generally much smaller
23
M N K
O�N ·K · log(N ·K)
�
O�M2 ·N ·K)
K O�N ·max(deadline)/min(period)
�
/ 28
Outline
• Delay analysis and compositionality
• Solution: Memory Server
• Core-level analysis
• System-level analysis
• Evaluation
• Conclusions
1
/ 28
Evaluation
• Randomly generated task sets
• Only shared resource is DRAM
• FR-FCFS arbitration, private banks, write buffering
• ARM A15, LPDDR2@533Mhz, = 1ms
• Random task bandwidth in – : fraction of maximum memory bandwidth
24
Heechul Yun et al: Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems, ECRTS’15
P
[0, Bmax
]
Bmax
/ 28
Analyses • non-comp
– Non-compositional: all tasks on all cores are known – No regulation, but can use job-driven analysis
• rw-reg • read-reg
– Memory server. PMCs measure either read+write requests, or read only
• legacy – All servers have the same fixed budget
• unreg – No regulation, compositional, request-driven only
25
/ 28
M = 4, N = 10, = 1%
26
Bmax
Per-Core Utilization U0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Perc
enta
ge S
ched
ulab
le T
ask
Sets
0
10
20
30
40
50
60
70
80
90
100
non-comprw-regread-reglegacyunreg
This is the price of compositionality with our approach
/ 28
M = 4, N = 10, = 1%
26
Bmax
Per-Core Utilization U0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Perc
enta
ge S
ched
ulab
le T
ask
Sets
0
10
20
30
40
50
60
70
80
90
100
non-comprw-regread-reglegacyunreg
This is the price of compositionality without regulation
/ 28
M = 4, N = 10, = 1%
26
Bmax
Per-Core Utilization U0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Perc
enta
ge S
ched
ulab
le T
ask
Sets
0
10
20
30
40
50
60
70
80
90
100
non-comprw-regread-reglegacyunreg
This is the improvement over previous regulation approach
Renato Mancuso et al: WCET(m) Estimation in Multi-Core Systems using Single Core Equivalence, ECRTS’15
/ 28
M = 4, N = 10,
27
Bmax
2 [0%, 10%]
Maximum Task Bandwidth Bmax in %0 1 2 3 4 5 6 7 8 9 10
Sche
dula
ble
Util
izat
ion
for 5
0% T
hres
hold
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
non-comprw-regread-reglegacyunreg
/ 28
Outline
• Delay analysis and compositionality
• Solution: Memory Server
• Core-level analysis
• System-level analysis
• Evaluation
• Conclusions
1
/ 28
Conclusions • Resource regulation leads to a “better” compositionality by
establishing contracts (interface) on resource usage
• The approach works if the cost of regulation is less than the improvement over request-driven analysis
• Downside: interface is more complex compared to hierarchical scheduling – Note: this paper computes a tight interface – it might be
easier to use a (linearized?) subset of feasibility region
• Future work: – Multiple applications per core – Multiple resources – More complex resources (NoC) 28
Thank You!
Questions?