Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. Lee, April 5, 2006

MIT CSAIL 1/23

METERG: Measurement-Based End-to-End Performance Estimation Technique in QoS-Capable Multiprocessors

Jae W. Lee and Krste Asanovic

{leejw, krste}@mit.edu

MIT Computer Science and Artificial Intelligence Lab

April 5, 2006


Multiprocessors Are Ubiquitous

[Figure: multiprocessors in servers, desktops, and handheld/embedded devices; an embedded CMP (e.g. ARM MPCore) with processors P1 to Pn, memory, and I/O on a shared bus]

• Our focus: Running soft real-time apps (e.g. media codecs, web services) on general-purpose MPs

– Assuming multiprogrammed workloads: real-time + best-effort apps

• Problem: Contention for shared resources causes wider variation of a task’s execution time.


Execution Time Variation in MPs

• For a fixed task with a fixed input, execution time varies with the activity of the other processors because the memory system is shared.

[Figure: processors 1 to n sharing a bus and memory (the shared resource); plot of execution time versus instance number of Task X (i to i+4), varying between no contention and severe contention]


Impact of Resource Contention

• Execution time increases by 85% when the number of background processes increases from 0 to 7.

• How can we limit the impact of resource contention? → QoS support

[Figure: processors 1 to 8 on a shared bus with shared memory (the shared resource); execution time is measured on Processor 1 while a variable number of background processes run on the other processors. Bar chart: normalized execution time of memread on P1 (a fixed input) versus number of background processes (0, 1, 3, 7); y-axis from 0 to 2; bar labels 1.00, 1.02, 1.05, 1.85]

Memory QoS Reduces MP Execution Time Variation

• Memory QoS guarantees per-processor memory bandwidth and latency.

– It guarantees minimum performance and distributes unclaimed bandwidth to sharers in order to maximize throughput.

• Challenge: How can we find the minimal QoS parameters that meet an RT task’s given performance goal?

– Minimal reservation improves total system throughput.

[Figure: range of execution time versus instance number of Task X (i to i+4), without QoS and with QoS]

Translating Performance Goal in User Metric into QoS Parameters

• User’s performance goal (user metric) should be translated into QoS parameters (system metric).

– User metric: Execution time, Transactions per Second (TPS), Frames per Second (FPS)

– System metric: Memory bandwidth, Latency

• Example: MP server virtualization

– Question: What guarantees can you make to users?

– Users prefer user metrics (e.g. TPS) to system metrics (e.g. memory bandwidth).

[Figure: partitions 1 to n running on a multiprocessor system substrate]

Minimal QoS parameters are important to maximize system throughput.

Finding Minimal Resource Reservation (1): Analysis

• Static timing analyses: Tools for WCET analyses

– Consist of hardware modeling + program analysis

– [+] Can find minimal resource reservation for all possible inputs

– [-] Analysis cost is prohibitive.

» Becoming harder with increasing hardware/software complexity in MPs

» Possibly overkill for soft RT apps:

Our primary concern is performance degradation from resource contention (not from input data).

We can tolerate a small number of deadline violations to maximize overall system throughput.

Finding Minimal Resource Reservation (2): Measurement

• Measurement-based Approach

– Measures task’s execution time for a certain input set

– Motto: “The best model for a system is the system itself.” [Colin-RTSS ’03]

– Goal: To find execution time under worst-case contention (=TMAX(x, i)) for given QoS parameter (x) and instruction sequence (i)

[Figure: probability distribution of execution time for a fixed instruction sequence i as the degree of contention goes from low to high (x: QoS parameter); the range of variation ends at TMAX(x, i)]

Finding Minimal Resource Reservation (2): Measurement

• Measurement-based Approach (Cont.)

– [+] Easy

– [-] Not absolutely safe

Can be practically useful for soft real-time apps

Can be useful if we are given a “near-worst” input set

– [-] Measured time varies with the degree of contention at runtime

We will address this problem in the rest of this talk!

[Figure: probability distribution of execution time as the degree of contention goes from low to high (x: QoS parameter); the range of variation ends at TMAX(x, i)]



Our Approach: METERG

• METERG QoS System: Improving measurement-based approach

– Estimates TMAX(x, i) easily by measurement

– Provides hardware support for safe and tight estimation of TMAX(x, i)

– Introduces two QoS modes

» Deployment mode (operation mode)

» Enforcement mode (measurement mode for estimation)

[Figure: probability distribution of execution time; the range of variation in deployment mode extends up to TMAX(x, i), while the range of variation in enforcement mode is:

1. Always greater than TMAX(x, i) (Safe)

2. Very narrow and close to TMAX(x, i) (Tight)]

Modifying Shared Blocks to Support Two QoS Modes

• Now each QoS block supports two operation modes:

1) Deployment mode: Conventional QoS mode (Operation)

– QoS parameters are treated as a lower bound of received resources at runtime.

2) Enforcement mode: New QoS mode (Estimation)

– QoS parameters are treated as an upper bound of received resources at runtime.

[Figure: received resources relative to the QoS parameter in 1) Deployment Mode and 2) Enforcement Mode]
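The lower-bound versus upper-bound semantics of the two modes can be sketched in a few lines of Python. This is an illustrative model only, not the hardware described in the talk: `Mode`, `slots_received`, and the per-frame slot-counting arguments are names assumed here for the example.

```python
from enum import Enum


class Mode(Enum):
    DEPLOYMENT = "DEP"    # reservation acts as a lower bound
    ENFORCEMENT = "ENF"   # reservation acts as an upper bound


def slots_received(mode, reserved, requested, unclaimed_available):
    """Slots one processor receives in one frame (hypothetical model).

    reserved            -- slots reserved by its QoS parameter
    requested           -- slots the processor wants in this frame
    unclaimed_available -- spare slots it could win from other sharers
    """
    # The reserved share is available in both modes.
    guaranteed = min(requested, reserved)
    if mode is Mode.ENFORCEMENT:
        # Never more than the reservation, even if the bus is idle.
        return guaranteed
    # Deployment: spare slots may be used on top of the reservation.
    extra = min(requested - guaranteed, unclaimed_available)
    return guaranteed + extra
```

With 2 reserved slots, 5 requested, and 3 spare slots, this model grants 5 slots in deployment mode but only 2 in enforcement mode.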


Assumptions

• Single-threaded apps; no shared objects.

• No contention within a node (as would arise with, e.g., multithreaded processors).

• No preemption of a scheduled task.

• Negligible impact of initial states on execution time.


Baseline MP System

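[Figure: processors 1 to n connected through a shared bus to shared memory; the QoS-capable shared block takes one QoS parameter per processor (x1 ... xn)]

• xi: Fraction of time slots reserved for Processor i

• Simple fixed frame-based scheduling for memory access

– Example (x1=0.2, xn=0.3)

– xi determines

» Minimum bandwidth

» Worst-case latency

[Figure: one frame (size = 10 time slots) with slots reserved for processors 1 and n; a reserved slot is taken by its owner (e.g. P1), and a slot that is not taken is given to a requester in round-robin fashion]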
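A minimal Python sketch of such a fixed frame-based scheduler follows. It is a toy model under stated assumptions, not the simulated bus from the talk: the names `build_frame` and `schedule_frame` are hypothetical, reserved slots are placed contiguously at the start of the frame (the slide interleaves them), and each processor is described only by a count of pending requests.

```python
FRAME_SIZE = 10  # time slots per frame, as in the slide's example


def build_frame(reservations, frame_size=FRAME_SIZE):
    """Return a list with the owning processor (or None) for each slot.

    reservations maps processor id -> fraction of slots reserved (x_i).
    Placement at the front of the frame is an assumption of this sketch.
    """
    counts = {p: round(x * frame_size) for p, x in reservations.items()}
    if sum(counts.values()) > frame_size:
        raise ValueError("reservations oversubscribe the frame")
    frame = []
    for proc, n in counts.items():
        frame.extend([proc] * n)
    frame.extend([None] * (frame_size - len(frame)))
    return frame


def schedule_frame(frame, wants):
    """Grant each slot of one frame; wants maps processor id -> pending requests.

    A reserved slot goes to its owner if the owner has a pending request.
    Any other slot is handed to a requester in round-robin order.
    """
    procs = sorted(wants)
    pending = dict(wants)
    rr = 0  # index of the next round-robin candidate
    grants = []
    for owner in frame:
        winner = None
        if owner is not None and pending.get(owner, 0) > 0:
            winner = owner
        else:
            for k in range(len(procs)):
                cand = procs[(rr + k) % len(procs)]
                if pending[cand] > 0:
                    winner = cand
                    rr = (rr + k + 1) % len(procs)
                    break
        if winner is not None:
            pending[winner] -= 1
        grants.append(winner)
    return grants
```

For example, `schedule_frame(build_frame({1: 0.2, 8: 0.3}), {1: 10, 8: 1})` lets processor 1 use its own two slots plus every slot that processor 8 leaves unclaimed, which is the work-conserving behavior the slide describes.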

Runtime Usage Example in METERG System

• Example (x1=0.2)

[Figure: a frame of 10 time slots with 2 slots reserved for Processor 1 (x1 = 0.2), shown as the reservation and as runtime usage in enforcement mode and in deployment mode. Legend: reserved and used; reserved but unused; not allowed to use (the 8 slots marked X); not reserved but used. The system diagram (processors 1 to n, shared bus, shared memory, QoS-capable shared block with x1 ... xn) is repeated.]

Obtain Minimal Resource Reservation Using METERG

How to find minimal resource reservation with METERG:

In measurement phase:

• Step 1: Measure performance in enforcement mode for a QoS parameter.

• Step 2: Repeat Step 1 to find the minimal QoS parameter that still meets the performance goal.

• Step 3: Store the minimal QoS parameter.

In operation phase:

• Step 4: Execute the program in deployment mode with the stored QoS parameter for guaranteed performance.
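The measurement phase (Steps 1 to 3) can be sketched as a simple search over candidate QoS parameters. This is one possible reading of the steps, not the authors' tool: `find_minimal_qos` and `measure_enforcement` are hypothetical names, and the timing table is made up for illustration rather than taken from the evaluation.

```python
def find_minimal_qos(measure_enforcement, goal_time, candidates):
    """Return the smallest candidate x whose enforcement-mode time meets the goal.

    measure_enforcement(x) -- runs the task in enforcement mode with QoS
                              parameter x and returns its execution time
    goal_time              -- performance goal (maximum acceptable time)
    candidates             -- QoS parameters to try
    Returns None if no candidate meets the goal.
    """
    # Scan in ascending order so the first hit is the minimal reservation.
    for x in sorted(candidates):
        if measure_enforcement(x) <= goal_time:
            return x
    return None


# Hypothetical measurements (made-up numbers, not results from the talk):
# normalized enforcement-mode execution time for each QoS parameter.
FAKE_ENF_TIMES = {0.1: 2.4, 0.25: 1.5, 0.5: 1.2, 0.75: 1.1}

minimal_x = find_minimal_qos(
    FAKE_ENF_TIMES.__getitem__, goal_time=1.3, candidates=FAKE_ENF_TIMES
)
```

The value stored in `minimal_x` is what Step 3 would save and Step 4 would pass to deployment mode. A linear scan is used here because it does not rely on execution time decreasing monotonically with the QoS parameter; if that property holds, a binary search would need fewer measurements.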

Bandwidth Guarantees Are Not Enough: An Example

• Shared bus example again (x1=0.2)

• The bandwidth inequality is guaranteed, but the latency inequality is not.

Bandwidth_DEP(xi) ≥ Bandwidth_ENF(xi) (O: holds)

MAX[Latency_DEP(xi)] ≤ MIN[Latency_ENF(xi)] (X: does not hold)

– May cause end-to-end performance inversion.

[Figure: frame timelines in enforcement mode (slots marked X: not allowed to use) and in deployment mode (slots used by other processors), each with a memory request; the annotated waits are 2 time slots and 4 time slots]

Two Enforcement Modes: Safety-Tightness Tradeoff

• Relaxed enforcement (R-ENF) mode: Bandwidth-only guarantees

Bandwidth_DEP(xi) ≥ Bandwidth_R-ENF(xi)

– There is a (small) chance that the observed execution time in enforcement mode is smaller than TMAX(x, i). (Not Safe)

• Strict enforcement (S-ENF) mode: Bandwidth and latency guarantees

Bandwidth_DEP(xi) ≥ Bandwidth_S-ENF(xi) &&

MAX[Latency_DEP(xi)] ≤ MIN[Latency_S-ENF(xi)]

– Observed execution time in enforcement mode is always greater than TMAX(x, i), as long as there is no timing anomaly in the processor. (Safe)

– Estimation of TMAX(x, i) is looser.
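The two conditions can be written as predicates over observed bandwidths and per-request latencies. The sketch below only restates the inequalities; the function names and the sample numbers are hypothetical and are not measurements from the talk.

```python
def relaxed_condition(bw_dep, bw_enf):
    """R-ENF: deployment bandwidth is at least the enforcement bandwidth."""
    return bw_dep >= bw_enf


def strict_condition(bw_dep, bw_enf, lat_dep, lat_enf):
    """S-ENF: the bandwidth condition holds, and every deployment-mode latency
    is no larger than every enforcement-mode latency.

    lat_dep, lat_enf -- non-empty sequences of per-request latencies.
    """
    return relaxed_condition(bw_dep, bw_enf) and max(lat_dep) <= min(lat_enf)


# Hypothetical samples: the same bandwidth in both modes, but one
# deployment-mode request waited longer than the enforcement-mode requests.
inverted = strict_condition(0.2, 0.2, lat_dep=[1, 4], lat_enf=[2, 2])
```

Here `inverted` is `False` even though `relaxed_condition(0.2, 0.2)` is `True`, which is the situation in which relaxed enforcement can underestimate TMAX(x, i).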

Memory System with Strict Enforcement Mode: An Implementation

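[Figure: processors 1 to n send Mem_Req to, and receive Mem_Reply from, the METERG QoS memory block. The reply path to a processor passes through a delay queue (bypassed in DEP mode), selected by that processor's OpMode setting (OpMode1 for Processor 1). Each queue entry has a valid bit (V), data, and a timer; the entries shown hold data1 with timer TMAX(DEP)(x1) - Tactual1 and data2 with timer TMAX(DEP)(x1) - Tactual2.]

• Delay queue

– In Deployment mode: Bypassed

– In Enforcement mode: Defers delivery of “lucky” messages to the processor

• TMAX(DEP)(x1): Worst-case latency in Deployment mode for given parameter x1

• Tactual1: Actual time taken to service request 1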
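The timer value attached to each delay-queue entry can be sketched as follows. This is a simplified model of the idea, not the hardware: `hold_time` and `observed_latency` are hypothetical names, latencies are plain numbers in arbitrary time units, and the clamp at zero is a defensive assumption of this sketch.

```python
def hold_time(t_max_dep, t_actual):
    """Time a reply waits in the delay queue: TMAX(DEP)(x) - Tactual.

    Clamped at zero so a reply that was already as slow as the worst case
    is delivered immediately (an assumption of this sketch).
    """
    return max(0, t_max_dep - t_actual)


def observed_latency(enforcement, t_max_dep, t_actual):
    """Latency seen by the processor for one memory request."""
    if not enforcement:
        # Deployment mode: the delay queue is bypassed.
        return t_actual
    # Enforcement mode: "lucky" replies are held until the worst-case latency.
    return t_actual + hold_time(t_max_dep, t_actual)
```

With a worst-case deployment latency of 10 and an actual service time of 4, the processor sees a latency of 4 in deployment mode and 10 in enforcement mode, so enforcement-mode latencies never fall below the deployment-mode worst case.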


Evaluation: Simulation Setup

[Figure: processors 1 to 8, each with a network interface (NI) carrying requests and replies (labeled E and D) between the processor and the shared bus and shared memory; the METERG QoS block has a per-processor mode setting (OpMode1 ... OpMode8)]

• Simulation setup

– 8-way SMP running Linux

– In-order core running with 5x faster clock than the system bus

– Detailed memory contention model with a simple fixed-frame TDM bus

– 32 KB unified blocking L1 cache / No L2 cache

• Synthetic benchmark: memread

– Infinite loop accessing a large chunk of memory sequentially

– Performance bottlenecked by memory system performance

Evaluation: Varying Number of Processes

• Execution time degradation as the number of background processes increases from 0 to 7:

– without QoS (Best-Effort): 85%

– with QoS (Deployment): 10%

• Estimated execution time upper bound for given QoS parameter (x1=0.25): 45%

[Figure: normalized execution time of P1 (x1 = 0.25) versus number of background processes (0, 1, 3, 7); labeled values: 1.85 (BE), 1.45 (ENF), 1.10 (DEP). Setup: P1 is measured with x1 = 0.25 while a variable number of BE processes run on the other processors of the 8-processor shared-bus system.]

Evaluation: Varying QoS Parameters

• Performance estimation becomes tighter as QoS parameter (x1) increases.

– Because the worst-case latency in accessing memory decreases accordingly

[Figure: normalized execution time of P1 versus QoS parameter x1 (0.1, 0.25, 0.5, 0.75); labeled values: 2.35 (ENF), 1.51 (DEP), 1.09, 1.01. Setup: P1 is measured with variable x1 while 7 BE background processes run on the other processors of the 8-processor shared-bus system.]

Evaluation: Varying Number of QoS Processes

• Performance impact on a QoS process by other QoS processes is negligible (<2%) as long as the system is not oversubscribed.

[Figure: normalized IPC* of processes P1 to P8 for four configurations, with QoS parameters (x1, ..., x8):

1 QoS + 7 BE: (0.25, 0, 0, 0, 0, 0, 0, 0)

2 QoS + 6 BE: (0.25, 0.25, 0, 0, 0, 0, 0, 0)

3 QoS + 5 BE: (0.25, 0.25, 0.25, 0, 0, 0, 0, 0)

4 QoS + 4 BE: (0.25, 0.25, 0.25, 0.20, 0, 0, 0, 0)

Reference lines: estimated lower bound for xi=0.25, and the 8-BE case; parenthesized labels (0.54) and (0.69). Bar labels as extracted: 0.91, 0.45, 0.91, 0.90, 0.37, 0.89, 0.27, 0.91, 0.90, 0.14, 0.81.

(* Higher number = better performance)]


Conclusion

• In general-purpose MPs, shared resource contention causes large variation of execution time of a task.

(~2x increase observed in 8-processor case)

• METERG QoS System provides an easy way (by measurement) to estimate execution time under worst-case resource contention for a given QoS parameter.

– Introducing a new QoS mode (Enforcement Mode) for performance estimation

• Using the METERG QoS System, the minimal QoS parameter can be easily found for a given instruction sequence and performance goal.
