23
J. L April 5, 20 MIT CSAIL 1/ METERG: Measurement-Based End-to-End Performance Estimation Technique in QoS-Capable Multiprocessors Jae W. Lee and Krste Asanovic {leejw, krste}@mit.edu IT Computer Science and Artificial Intelligence Lab April 5, 2006

Jae W. Lee and Krste Asanovic { leejw, krste } @mit

  • Upload
    thai

  • View
    44

  • Download
    1

Embed Size (px)

DESCRIPTION

METERG: Measurement-Based End-to-End Performance Estimation Technique in QoS-Capable Multiprocessors. Jae W. Lee and Krste Asanovic { leejw, krste } @mit.edu MIT Computer Science and Artificial Intelligence Lab April 5, 2006. - PowerPoint PPT Presentation

Citation preview

Page 1: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 1/23

METERG: Measurement-Based End-to-End Performance Estimation

Techniquein QoS-Capable Multiprocessors

Jae W. Lee and Krste Asanovic{leejw, krste}@mit.edu

MIT Computer Science and Artificial Intelligence LabApril 5, 2006

Page 2: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 2/23

Multiprocessors Are Ubiquitous

Embedded CMP(e.g. ARM MPCore)

P1

MEM I/O

Pn

Shared bus

Servers

Desktops

Handheld/Embedded

• Our focus: Running soft real-time apps (e.g. media codecs, web services) on general-purpose MPs

– Assuming multiprogrammed workloads: real-time + best-effort apps

• Problem: Contention for shared resources causes wider variation of a task’s execution time.

Page 3: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 3/23

Execution Time Variation in MPs

• For fixed input to fixed task, execution time varies depending on the activities of other processors because of shared memory system.

Processor 1

Shared Bus

SharedMemory

Processor n

Shared Resource

Instance # of TaskX

Executiontime

i i+4i+2

SevereContention

NoContention

i+1 i+3

Page 4: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 4/23

Impact of Resource Contention

• Execution time increases by 85 % when number of background processes increases from 0 to 7.

• How can we limit the impact of resource contention?

QoS Support

0

0.5

1

1.5

2

0 1 3 7

1.021.00

1.85

1.05

Normalized Execution Time of memread on P1 (a fixed input)

MeasureExecution

Time

Processor 1

Shared Bus

SharedMemory

Processor 8

Shared Resource

Processor 2

Variable Number of Background Processes

Number of Background Processes

Page 5: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 5/23Memory QoS Reduces MP Execution Time Variation

• Memory QoS guarantees per-processor memory bandwidth and latency.

– It guarantees minimum performance and distributes unclaimed bandwidth to sharers in order to maximize throughput.

• Challenge: How can we find the minimal QoS parameters that meet a RT task’s given performance goal?

– Minimal reservation improves total system throughput.

Instance # of TaskX

i i+2i+1

Range ofExecution Time

i+3 i+4

Without QoS

With QoS

Page 6: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 6/23Translating Performance Goal in User Metric into QoS Parameters

• User’s performance goal (user metric) should be translated into QoS parameters (system metric).

– User metric: Execution time, Transactions per Second (TPS), Frames per Second (FPS)

– System metric: Memory bandwidth, Latency

• Example: MP server virtualization– Question: What guarantees can you make to users?

– Users prefer user metric (e.g. TPS) to system metric (e.g. Memory bandwidth).

Partition

1

Partition

2

Partition

n

Multiprocessor System Substrate

Minimal QoS parameters are important

to maximize system throughput.

Page 7: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 7/23Finding Minimal Resource Reservation (1): Analysis

• Static timing analyses: Tools for WCET analyses – Consist of Hardware modeling + program analysis

– [+] Can find minimal resource reservation for all possible inputs

– [-] Analysis cost is prohibitive.

» Becoming harder for increasing hardware/software complexity in MPs

» Possibly overkill for soft RT apps:

Our primary concern is performance degradation from resource contention (not from input data).

We can tolerate a small number of deadline violations to maximize overall system throughput.

Page 8: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 8/23Finding Minimal Resource Reservation (2): Measurement

• Measurement-based Approach– Measures task’s execution time for a certain input set

– Motto: “The best model for a system is the system itself.” [Colin-RTSS ’03]

– Goal: To find execution time under worst-case contention (=TMAX(x, i)) for given QoS parameter (x) and instruction sequence (i)

Execution Timefor fixed inst seq i

Probabilitydistribution

Range of variation

TMAX(x, i)

Degree of Contention

(x: QoS parameter)

Low ← → High

Page 9: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 9/23Finding Minimal Resource Reservation (2): Measurement

• Measurement-based Approach (Cont)– [+] Easy

– [-] Not absolutely safe

Can be practically useful for soft real-time apps

Can be useful if we are given “near-worst” input set

– [-] Measured time varies by degree of contention in runtime

We will address this problem in the rest of this talk!Probabilitydistribution

Range of variation

TMAX(x, i)

Degree of Contention

(x: QoS parameter)

Low ← → High

Execution Timefor fixed inst seq i

Page 10: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 10/23

Our Approach: METERG

• METERG QoS System: Improving measurement-based approach

– Estimates TMAX(x, i) easily by measurement

– Provides hardware support for safe and tight estimation for TMAX(x, i)

– Introduces two QoS modes

» Deployment mode (operation mode)

» Enforcement mode (measurement mode for estimation)

TMAX(x, i)

Probabilitydistribution

Range of variation(Deployment)

Execution Timefor fixed inst seq i

Range of variation(Enforcement) 2. Very narrow and

close to TMAX(x, i) (Tight)

1. Always greater than TMAX(x, i) (Safe)

Page 11: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 11/23Modifying Shared Blocks to Support Two QoS Modes

• Now each QoS block supports two operation modes:

1) Deployment mode: Conventional QoS mode (Operation)– QoS parameters are treated as a lower bound of received resources

in runtime.

2) Enforcement mode: New QoS Mode (Estimation)– QoS parameters are treated as an upper bound of received

resources in runtime.

QoSParameter

5 0 5 01) Deployment Mode 2) Enforcement Mode

Page 12: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 12/23

Assumptions

• Single-threaded apps; no shared objects.

• No contentions within a node.

(e.g. multithreaded processors)

• No preemption of a scheduled task.

• Negligible impact of initial states on execution time.

Page 13: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 13/23

Baseline MP System

Processor 1

Shared Bus

SharedMemory

Processor n

QoS-capableShared block

x1 xn

• Simple fixed frame-based scheduling for memory access– Example (x1=0.2, xn=0.3)

– xi determines

» Minimum bandwidth

» Worst-case latency

1 n n 1 n

TakenBy P1

Not taken:Given to a requesterby round-robin fashion

Frame (size = 10 Timeslots)

xi: Fraction of time slots reserved for Processor i

Page 14: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 14/23Runtime Usage Example in METERG System

• Example (x1=0.2)

Reservation:

1 1

Frame (size = 10 Time Slots)

Runtime usage (Enforcement):

X X X X X X X X

Reserved &used

Reservedbut unused

Runtime usage (Deployment):

Reserved &used

Not allowed to use

Not reservedbut used

0.2

0.2

Processor 1

Shared Bus

SharedMemory

Processor n

QoS-capableShared block

x1 xn

Page 15: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 15/23Obtain Minimal Resource Reservation Using METERG

How to find minimal resource reservation with METERG:

In measurement phase:

• Step 1: Measure performance in enforcement mode for a QoS parameter.

• Step 2: Iterate Step 1 to find a minimal QoS parameter (yet meeting the performance goal).

• Step 3: Store the minimal QoS parameter.

In operation phase:

• Step 4: Execute the program in deployment mode with the stored QoS parameter for guaranteed performance.

Page 16: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 16/23Bandwidth Guarantees Are Not Enough: An Example

• Shared bus example again (x1=0.2)

• Bandwidth inequality is guaranteed but not latency.BandwidthDEP (xi) ≥ BandwidthENF (xi) (O)

MAX [LatencyDEP (xi)] ≤ MIN [LatencyENF (xi)] (X)

May cause end-to-end performance inversion.

Enforcement Mode

X X X X X X X X

Deployment Mode

Not allowedto use

Memory Request

2 Timeslots4 Timeslots

Memory Request

Used by otherprocessors

Page 17: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 17/23Two Enforcement Modes: Safety-Tightness Tradeoff

• Relaxed enforcement (R-ENF) mode: Bandwidth-only guarantees

BandwidthDEP (xi) ≥ BandwidthR-ENF (xi)

– There is a (small) chance for observed execution time in enforcement mode to be smaller than TMAX(x, i). (Not Safe)

• Strict enforcement (S-ENF) mode: Bandwidth and latency guarantees

BandwidthDEP (xi) ≥ BandwidthS-ENF (xi) &&

MAX[LatencyDEP (xi)] ≤ MIN[LatencyS-ENF (xi)]

– Observed execution time is always greater than TMAX(x, i) in enforcement mode as long as there is no timing anomaly in processor. (Safe)

– Estimation of TMAX(x, i) is looser.

Page 18: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 18/23Memory System with Strict Enforcement Mode: An Implementation

Processor 1

Processor n

METERGQoS

Memoryblock

Mem_Reply

Mem_Req

Delay Queue(bypassed in DEP mode)

1 data1 TMAX(DEP)(x1)-Tactual1

1 data20

V Data Timer

TMAX(DEP)(x1)-Tactual2

OpMode1

• Delay queue– In Deployment mode: Bypassed

– In Enforcement mode: Deferring delivery to processor for “lucky” messages

TMAX(DEP)(x1): Worst-Case latencyin Deployment Modefor given param x1

Tactual1:Actual time taken toservice request 1

Page 19: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 19/23

Evaluation: Simulation Setup

Processor 1

Shared Bus

E D

RequestReply

NetworkInterface

(NI)

E D

ReplyRequest

SharedMemory

OpMode1OpMode8

Processor 8

METERG QoS block

• Simulation Setup– 8-way SMP running Linux– In-order core running with 5x

faster clock than the system bus– Detailed memory contention

model with a simple fixed-frame TDM bus

– 32 KB unified blocking L1 cache / No L2 cache

• Synthetic benchmark: Memread– Infinite loop accessing a large

chunk of memory sequentially– Performance bottlenecked by

memory system performance

Page 20: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 20/23Evaluation: Varying Number of Processes

0.8

1

1.2

1.4

1.6

1.8

2

0 1 3 7

1.85 (BE)

1.45 (ENF)

1.10 (DEP)

• Execution time degradation as number of background processes increases from 0 to 7without QoS (Best-Effort): 85 %

with QoS (Deployment): 10 %

• Estimated execution time upper bound for given QoS parameter (x1=0.25): 45 %

0 1 3 7# of Background Processes

Normalized Execution Time

(x1 =0.25)

Processor 1

Shared Bus

SharedMemory

Processor 8

Shared Resource

Processor 2

Variable # of BE Processes

Measure P1(x1 = 0.25)

Page 21: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 21/23Evaluation: Varying QoS Parameters

0.9

1.2

1.5

1.8

2.1

2.4

2.7

0.1 0.25 0.5 0.75

2.35 (ENF)

1.51 (DEP)

Normalized Execution Time

• Performance estimation becomes tighter as QoS parameter (x1) increases.– Because the worst-case latency in accessing memory decreases

accordingly

QoS Parameter (x1)

1.011.09

Processor 1

Shared Bus

SharedMemory

Processor 8

Shared Resource

Processor 2

7 BE Background Processes

Measure P1(Variable x1)

Page 22: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 22/23Evaluation: Varying Number of QoS Processes

0

0.2

0.4

0.6

0.8

1

1 QoS + 7 BE 2 QoS + 6 BE 3 QoS + 5 BE 4 QoS + 4 BE

P1

P2

P3

P4

P5

P6

P7

P8

(0.25, 0, 0, 0, 0, 0, 0, 0)

(0.25, 0.25, 0.25, 0.20, 0, 0, 0, 0)

(0.25, 0.25, 0, 0, 0, 0, 0, 0)

(0.25, 0.25, 0.25, 0, 0, 0, 0, 0)

0.91

(0.54)

(0.69)

0.45

0.91 0.90

0.37

0.89

0.27

Estimatedlower boundfor xi=0.25

8-BECase

0.91 0.90

0.14

Normalized IPC*

• Performance impact on a QoS process by other QoS processes is negligible (<2%) as long as system is not oversubscribed.

1 QoS + 7 BE 2 QoS + 6 BE 3 QoS + 5 BE 4 QoS + 4 BE

0.81

QoS Parameters:

(* Higher Number = Better Performance)

Page 23: Jae W. Lee and Krste Asanovic { leejw, krste } @mit

J. LeeApril 5, 2006

MIT CSAIL 23/23

Conclusion

• In general-purpose MPs, shared resource contention causes large variation of execution time of a task.

(~2x increase observed in 8-processor case)

• METERG QoS System provides an easy way (by measurement) to estimate execution time under worst-case resource contention for a given QoS parameter.

– Introducing a new QoS mode (Enforcement Mode) for performance estimation

• Using METERG QoS System, minimal QoS parameter can be easily found for given instruction sequence and performance goal.