Memory Servers for Multicore Systems - RTAS 20162016.rtas.org/wp-content/uploads/2016/06/81.pdfHeechul Yun et al: MemGuard: Memory Bandwidth Reservation System for Efficient Performance

Memory Servers for Multicore Systems

Rodolfo Pellizzoni and Heechul Yun

/ 28

Outline

•  Delay analysis and compositionality

•  Solution: Memory Server

•  Core-level analysis

•  System-level analysis

•  Evaluation

•  Conclusions

1

/ 28

Outline





•  Evaluation

•  Conclusions

1

/ 28

Resource Contention

2

Core Core Core

Shared Physical Resource

INTERFERENCE!!!

t t t

We need a resource delay analysis to bound such effects

/ 28

Analysis Taxonomy

•  Request-driven bound: –  Compute maximum per-Request Delay –  : # requests task under analysis –  Delay = –  Computed delay can be very pessimistic!

•  Job-driven bound: –  : bound on # requests by other cores –  Delay = –  Non compositional – requires knowledge of tasks

running on all cores! 3

RD

H

RD ⇥H

Hf(H,H)

/ 28

Example: DRAM Write Buffering •  Reads are prioritized over writes; writes are buffered and

transmitted in non-preemptive batches (ARM A15 – 18 req.) •  If a write batch begins right before a read under analysis

arrives, the request is delayed by 18 write requests…

4

Memory Controller

13

Read request buffer

Write request buffer

Bank 1 scheduler

Channel scheduler

Bank 2 scheduler

Bank N scheduler

DRAM chip

Memory requests from cores

Writes are buffered and processed opportunistically

/ 28

COTS Arbiters are Unfair

•  The main issue: unfair arbitration –  cores –  A request can be delayed by other requests

•  Many other examples… –  First-Ready FCFS DRAM arbitration –  Non-blocking caches –  Wormhole NoC arbitration

5

M

� M � 1

/ 28

How bad can it be?

6

Results with SPEC2006 Benchmarks

• Main source of pessimism: – The pathological case of write (LLC write-backs) processing

25

Even with just 4 cores, measured execution time is ~8x larger

Provable bound is much higher due to write buffering

/ 28

Summary: The Problem •  Compositional analysis is highly desirable •  Develop each application (core) independently •  But bounds are bad due to request-driven analysis

7

Core#1–Applica/on#1




/ 28

Summary: The Problem •  Compositional analysis is highly desirable •  Develop each application (core) independently •  But bounds are bad due to request-driven analysis

7


CoreCoret

t

t

/ 28

Summary: The Problem •  Or forget about compositionality to use job-driven bounds... •  ... but then any change to other applications forces me to

redo timing analysis

8


t

t

t t

t

t

t

t

/ 28



8


t

t

t t

t

t

t

t

Let me change a compiler flag here…

/ 28



8


t

t

t t

t

t

t

t

Let me change a compiler flag here… OMG!!!

/ 28



8


t

t

t t

t

t

t

t

Let me change a compiler flag here… OMG!!!

We need a way to keep compositionality but use job-driven bounds!

/ 28

Outline





•  Evaluation

•  Conclusions

1

/ 28

Memory Server •  How do we solve the problem?

1.  Add a server with budget to each core p

2.  Server constraints the core to make no more than accesses to the shared resource every time units

3.  Now we know the maximum resource accesses by other cores, so we can use job-driven analysis!

9

Qp

Qp

P

/ 28

Server Implementation

10

•  Reserve per-core memory bandwidth via the OS scheduler –  Use h/w PMC to monitor memory request rate

1ms 2ms 0

Suspend the RT task

Budget

Core activity

2 1

computation memory fetch

Qp

Regulation period P

Heechul Yun et al: MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms, RTAS’13

/ 28

Memory vs Hierarchical Server •  Isn’t this similar to a hierarchical server?

11

t

t

t t

t

Application#1 Application#2

Sched. Interface Sched. Interface

Yes

•  Multiple applications on the same core

/ 28

Memory vs Hierarchical Server •  Isn’t this similar to a hierarchical server?

11

t

t

t t

t



•  Step 1: perform local analysis to derive sched. interface

Yes

/ 28

Memory vs Hierarchical Server •  Isn’t this similar to a hierarchical server? Yes

11

t

t

t t

t



System-level Analysis

•  Step 2: perform system-level analysis to check sched. –  Ex:

XQp/P Ub

/ 28

Memory vs Hierarchical Server •  For memory server:

12

t

t

t t

t



•  Multiple application on different cores

/ 28


12

t

t

t t

t



•  Step 1: perform core-level analysis to derive interface

/ 28


12

t

t

t t

t



System-level Analysis

•  Step 2: perform system-level analysis to check sched.

/ 28

Interfaces and Compositionality •  In both cases, the interface represents a contract

–  Each application’s developer guarantees that he will respect the contract

–  Check schedulability based on the contracts for the other applications, not the tasks inside

13

I want to add a new task to my application...

Do what you want, just respect the contract

/ 28

Memory vs Hierarchical Server •  Key difference: interfaces •  For fixed , the schedulability of an application in a

hierarchical server depends only on the assigned budget

14

PQp

0 2 4 6 8 Qp

/ 28

2

4

6

8

10

QBp

Memory vs Hierarchical Server •  Key difference: interfaces •  … but the delay (hence schedulability) for memory server also

depends on the total budget assigned to other cores!

14 0 2 4 6 8

0 Qp

Key intuition: there is no global resource utilization bound – it depends on the tasks within the application

QBp

/ 28

Detailed Contribution •  Core-level analysis computes 2D feasibility region for a core

–  Based on state-of-the-art DRAM delay analysis; but any request-driven / job-driven analysis could be used

•  System-level analysis determines feasibility –  Is there a budget assignment that satisfies all interfaces?

•  Both analyses have reasonable complexity

15

/ 28

Outline





•  Evaluation

•  Conclusions

1

/ 28

Core-level Analysis •  Fixed priority scheduling

16

t

t

t

Bini & Buttazzo: Schedulability analysis of periodic fixed priority systems, TC’04

For each task…

•  Compute set of time instants •  Iff for any time instant : computation demand( ) ,

task is schedulable t̄ [0, t̄) t̄

t̄

/ 28

Core-level Analysis

16

t

t

t

•  Fixed priority scheduling Bini & Buttazzo: Schedulability analysis of periodic fixed priority systems, TC’04

For each task…

•  With resource interference: computation demand( ) + resource delay( ) [0, t̄) [0, t̄) t̄

t̄

/ 28

Resource Delay: Regulation •  Regulation: tasks in the busy interval are delayed because

they exhaust the budget

•  Regulation delay decreases with larger budget values

17

resource delay( ) computation demand( ) [0, t̄) [0, t̄) t̄�

Qp

QBp

Qp

Qp

/ 28

Resource Delay: Contention •  Contention: tasks in the busy interval are delayed due to

requests of other cores

•  Contention delay decreases with smaller budget values

18

QBp

Qp

QBp

QBp

resource delay( ) computation demand( ) [0, t̄) [0, t̄) t̄�

/ 28

Memory Delay: One Time Instant •  Key obs.: resource delay is the max of regulation, contention

–  Independent of analysis; true bc of how regulation works •  Task is schedulable at if it meets both regulation and

contention constraints

19

t̄

Qp

QBp

Qp

QBp

/ 28

Interface: One Task •  Task is schedulable if it meets constraints at any time

instant

20

t̄

Qp

QBp

Qp

QBp

t̄1Region for Region for t̄2

/ 28

Interface: One Task

20

Qp

QBp

Region for ⌧1

•  Task is schedulable if it meets constraints at any time instant t̄

/ 28

Interface: One Core

21

•  Task set is schedulable if all tasks are schedulable

Qp

QBp

Qp

QBp

Region for ⌧1Region for ⌧2

/ 28

Interface: One Core

21

•  Task set is schedulable if all tasks are schedulable

Qp

QBp

Region for the core

/ 28

Outline





•  Evaluation

•  Conclusions

1

/ 28

System-level analysis

22

0 2 4 6 8

0

2

4

6

8

Qp

QBp

0 2 4 6 8

0

2

4

6

8

Qp

QBp

4

4

3

3

•  Start with smallest budget values •  Determine budget of other cores •  If not feasible, move to next budget value

Interface for Core#1 Interface for Core#2

Qp

QBp

/ 28


22

0 2 4 6 8

0

2

4

6

8

Qp

0 2 4 6 8

0

2

4

6

8

Qp

6

6

5

5


QBp QBp


Qp

QBp

/ 28


22

0 2 4 6 8

0

2

4

6

8

Qp

0 2 4 6 8

0

2

4

6

8

Qp

6

6

7

7


QBp QBp


Qp

QBp

OK!

/ 28

Result Summary •  System-level analysis is optimal (based on computed

core-level interfaces) –  If there is a feasible budget assignment, it will find it

•  Computation complexity –  cores, tasks per core, time instances per task –  Core-level analysis is –  System-level analysis is

•  Note: is , but generally much smaller

23

M N K

O�N ·K · log(N ·K)

�

O�M2 ·N ·K)

K O�N ·max(deadline)/min(period)

�

/ 28

Outline





•  Evaluation

•  Conclusions

1

/ 28

Evaluation

•  Randomly generated task sets

•  Only shared resource is DRAM

•  FR-FCFS arbitration, private banks, write buffering

•  ARM A15, LPDDR2@533Mhz, = 1ms

•  Random task bandwidth in –  : fraction of maximum memory bandwidth

24

Heechul Yun et al: Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems, ECRTS’15

P

[0, Bmax

]

Bmax

/ 28

Analyses •  non-comp

–  Non-compositional: all tasks on all cores are known –  No regulation, but can use job-driven analysis

•  rw-reg •  read-reg

–  Memory server. PMCs measure either read+write requests, or read only

•  legacy –  All servers have the same fixed budget

•  unreg –  No regulation, compositional, request-driven only

25

/ 28

M = 4, N = 10, = 1%

26

Bmax

Per-Core Utilization U0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Perc

enta

ge S

ched

ulab

le T

ask

Sets

0

10

20

30

40

50

60

70

80

90

100

non-comprw-regread-reglegacyunreg

This is the price of compositionality with our approach

/ 28

M = 4, N = 10, = 1%

26

Bmax


Perc

enta

ge S

ched

ulab

le T

ask

Sets

0

10

20

30

40

50

60

70

80

90

100


This is the price of compositionality without regulation

/ 28

M = 4, N = 10, = 1%

26

Bmax


Perc

enta

ge S

ched

ulab

le T

ask

Sets

0

10

20

30

40

50

60

70

80

90

100


This is the improvement over previous regulation approach

Renato Mancuso et al: WCET(m) Estimation in Multi-Core Systems using Single Core Equivalence, ECRTS’15

/ 28

M = 4, N = 10,

27

Bmax

2 [0%, 10%]

Maximum Task Bandwidth Bmax in %0 1 2 3 4 5 6 7 8 9 10

Sche

dula

ble

Util

izat

ion

for 5

0% T

hres

hold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9


/ 28

Outline





•  Evaluation

•  Conclusions

1

/ 28

Conclusions •  Resource regulation leads to a “better” compositionality by

establishing contracts (interface) on resource usage

•  The approach works if the cost of regulation is less than the improvement over request-driven analysis

•  Downside: interface is more complex compared to hierarchical scheduling –  Note: this paper computes a tight interface – it might be

easier to use a (linearized?) subset of feasibility region

•  Future work: –  Multiple applications per core –  Multiple resources –  More complex resources (NoC) 28

Thank You!

Questions?

Documents

Memory Servers for Multicore Systems - RTAS 20162016.rtas.org/wp-content/uploads/2016/06/81.pdfHeechul Yun et al: MemGuard: Memory Bandwidth Reservation System for Efficient Performance