
Page 1:

Metrics and Techniques for Evaluating the Performability of Internet Services

Pete Broadwell, pbwell@cs.berkeley.edu

Page 2:

Outline

1. Introduction to performability
2. Performability metrics for Internet services
   • Throughput-based metrics (Rutgers)
   • Latency-based metrics (ROC)
3. Analysis and future directions

Page 3:

Motivation

• Goal of ROC project: develop metrics to evaluate new recovery techniques
• Problem: the concept of availability assumes a system is either “up” or “down” at a given time
• Availability doesn’t capture a system’s capacity to support degraded service:
  – degraded performance during failures
  – reduced data quality during high load

Page 4:

What is “performability”?

• Combination of performance and dependability measures
• Classical definition: a probabilistic (model-based) measure of a system’s “ability to perform” in the presence of faults¹
  – Concept from the traditional fault-tolerant systems community, ca. 1978
  – Has since been applied to other areas, but is still not in widespread use

¹ J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994

Page 5:

Performability Example

Discrete-time Markov chain (DTMC) model of a RAID-5 disk array¹:

• States: normal operation (p0); 1 disk failed, repair necessary (p1); failure with data loss (p2)
• pi(t) = probability that the system is in state i at time t
• λ = failure rate of a single disk drive; μ = disk repair rate; D = number of data disks
• Transitions: normal → 1 disk failed at rate (D+1)λ; 1 disk failed → normal at rate μ; 1 disk failed → data loss at rate Dλ
• wi(t) = reward for state i (disk I/O operations/sec): w0(t), w1(t), w2(t)

¹ Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
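
To make the reward structure concrete, here is a minimal Python sketch of this three-state chain. The failure rate, repair rate, disk count, and per-state rewards are illustrative assumptions, not values from Kari's thesis; each loop iteration advances the chain one time step, and the expected reward is the sum over i of pi(t)·wi(t).

```python
import numpy as np

# Minimal sketch of the three-state RAID-5 DTMC above. All numbers are
# illustrative placeholders; one loop iteration = one small time step,
# so a rate times the step length is treated as a transition probability.
lam = 1e-5   # λ: failure rate of a single disk drive (per time step)
mu = 1e-2    # μ: disk repair rate (per time step)
D = 4        # number of data disks

# States: 0 = normal operation, 1 = one disk failed, 2 = data loss.
P = np.array([
    [1 - (D + 1) * lam, (D + 1) * lam, 0.0],  # normal -> degraded
    [mu, 1 - mu - D * lam, D * lam],          # repaired, stay, or lost
    [0.0, 0.0, 1.0],                          # data loss is absorbing
])

w = np.array([300.0, 150.0, 0.0])  # w_i: disk I/O operations/sec per state

p = np.array([1.0, 0.0, 0.0])      # p_i(0): start in normal operation
for _ in range(100_000):
    p = p @ P                      # p(t+1) = p(t) P

print("state probabilities p_i(t):", p)
print("expected reward (I/O ops/sec):", p @ w)
```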

Page 6:

Performability for Online Services: Rutgers Study

• Rich Martin (UCB alum) et al. wanted to quantify tradeoffs between web server designs, using a single metric for both performance and availability
• Approach:
  – Performed fault injection on PRESS, a locality-aware, cluster-based web server
  – Measured throughput of the cluster during simulated faults and normal operation

Page 7:

Degraded Service During a PRESS Component Fault

[Figure: throughput (requests/sec) vs. time, showing the phases following a fault: FAILURE, DETECT, STABILIZE, RECOVER, REPAIR (human operator), and RESET (optional).]

Page 8:

Calculation of Average Throughput, Given Faults

[Figure: throughput (requests/sec) vs. time, showing normal throughput, degraded throughput during faults, and the resulting average throughput.]
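
As a hypothetical worked example of the averaging in this figure (the throughput and fault-duration values below are assumptions, not Rutgers measurements), average throughput is the time-weighted mean over one failure cycle:

```python
# Hypothetical values, not from the Rutgers data set: one failure cycle
# is MTTF seconds of normal service followed by MTTR seconds of degraded
# service while the fault is detected, recovered, and repaired.
normal_tput = 6000.0     # requests/sec during normal operation
degraded_tput = 2500.0   # requests/sec while a component is down
mttf = 100_000.0         # seconds of normal operation between failures
mttr = 300.0             # seconds of degraded service per failure

avg_tput = (normal_tput * mttf + degraded_tput * mttr) / (mttf + mttr)
print(f"average throughput: {avg_tput:.1f} requests/sec")  # ~5989.5
```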

Page 9:

Behavior of a Performability Metric

[Figure: performability vs. performance during faults, showing the effect of improving degraded performance.]

Page 10:

Behavior of a Performability Metric

[Figure: performability vs. MTTR and MTTF, showing the effect of improving component availability (shorter MTTR, longer MTTF).]

Availability = MTTF / (MTTF + MTTR)

Page 11:

Behavior of a Performability Metric

[Figure: performability vs. overall performance (including normal operation), showing the effect of improving overall performance.]

Most performability metrics scale linearly as component availability, degraded performance, and overall performance increase.

Page 12:

Results of Rutgers Study: Design Comparisons

[Bar chart: performability (0–90) by web server version (I-PRESS, TCP-PRESS, ReTCP-PRESS, VIA-PRESS), comparing the original system against variants with reduced human monitoring and a RAID storage subsystem.]

Page 13:

An Alternative Metric: Response Latency

• Originally, performability metrics were meant to capture the end-user experience¹
• Latency better describes the experience of an end user of a web site
  – response time >8 sec = site abandonment = lost income $$²
• Throughput describes the raw processing ability of a service
  – best used to quantify expenses

¹ J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
² Zona Research and Keynote Systems, The Need for Speed II, 2001

Page 14:

Effect of Component Failure on Response Latency

[Figure: response latency (sec) vs. time between FAILURE and REPAIR, with an abandonment region above the 8 s threshold and a possible “annoyance region” below it.]
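
One way to read this figure is as a reward function over latency. The sketch below is purely illustrative: the 8-second abandonment cutoff comes from the Zona/Keynote study cited above, while the 1-second onset of the annoyance region and the linear decay are assumptions made here, not part of the original slides.

```python
def latency_reward(latency_s: float) -> float:
    """Map response latency to a user-experience reward in [0, 1].

    The 8 s abandonment threshold follows the cited Zona/Keynote study;
    the 1 s annoyance boundary and the linear decay are illustrative
    assumptions, not values from the slides.
    """
    if latency_s <= 1.0:
        return 1.0                    # fast enough: full reward
    if latency_s >= 8.0:
        return 0.0                    # abandonment region: request lost
    return (8.0 - latency_s) / 7.0    # annoyance region: decaying reward
```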

Page 15:

Issues With Latency As a Performability Metric

• Modeling concerns:
  – Human element: retries and abandonment
  – Queuing issues: buffering and timeouts
  – Unavailability of the load balancer due to faults
  – Burstiness of workload
• Latency is more accurately modeled at the service than end-to-end¹
• Alternate approach: evaluate an existing system

¹ M. Merzbacher and D. Patterson, Measuring End-User Availability on the Web: Practical Experience, 2002

Page 16:

Analysis

• Queuing behavior may have a significant effect on latency-based performability evaluation (see the sketch after this list):
  – Long component MTTRs = longer waits, lower latency-based score
  – High performance in the normal case = faster queue reduction after repair, higher latency-based score
• More study is needed!
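
A toy fluid-queue calculation, using hypothetical rates rather than measured ones, shows both effects: the backlog grows in proportion to MTTR, and the post-repair drain time shrinks as normal-case capacity rises.

```python
# Hypothetical rates for illustration only. While capacity is degraded,
# the queue grows at (arrival - degraded capacity); after repair it
# drains at (normal capacity - arrival).
arrival = 5000.0        # requests/sec offered load
normal_cap = 6000.0     # requests/sec capacity when healthy
degraded_cap = 2500.0   # requests/sec capacity during the fault
mttr = 300.0            # seconds spent in the degraded state

backlog = max(0.0, arrival - degraded_cap) * mttr      # requests queued
drain_time = backlog / (normal_cap - arrival)          # seconds to drain
print(f"backlog at repair time: {backlog:.0f} requests")  # 750000
print(f"drain time after repair: {drain_time:.0f} s")     # 750
```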

Page 17:

Future Work

• Further collaboration with Rutgers on collecting new measurements for latency-based performability analysis
• Development of more realistic fault and workload models, plus other performability factors such as data quality
• Research into methods for conducting automated performability evaluations of web services

Page 18:

Metrics and Techniques for Evaluating the Performability of Internet Services

Pete Broadwell, pbwell@cs.berkeley.edu

Page 19:

Back-of-the-Envelope Latency Calculations

• Attempted to infer average request latency for PRESS servers from the Rutgers data set
  – Required many simplifying assumptions, relying upon knowledge of the PRESS server design
  – Hoped to expose areas in which throughput- and latency-based performability evaluations differ
• Assumptions (a sketch of this style of estimate follows the list):
  – FIFO queuing with no timeouts or overflows
  – Independent faults, constant workload (also the case for the throughput-based model)
• Current models do not capture the “completeness” of data returned to the user
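
Under those assumptions, the simplest closed form is the M/M/1 mean time in system. The sketch below uses hypothetical arrival and service rates, not the actual PRESS measurements.

```python
def mm1_mean_latency(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 (FIFO, no timeouts) queue:
    W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)

# Hypothetical rates, not PRESS measurements:
print(mm1_mean_latency(arrival_rate=5000.0, service_rate=6000.0))  # 0.001 s
```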

Page 20:

Comparison of Performability Metrics

[Bar chart: performability (0–35,000) by web server version (I-PRESS, TCP-PRESS, ReTCP-PRESS, VIA-PRESS), comparing latency-based and throughput-based performability.]

Page 21:

Rutgers Calculations for Long-Term Performability

Goal: a metric that scales linearly with both
- performance (throughput), and
- availability [MTTF / (MTTF + MTTR)]

Tn = normal throughput for the server
AI = ideal availability (0.99999)
Average throughput (AT) = Tn during normal operation + per-component throughput during failure
Average availability (AA) = AT / Tn

Performability = Tn × [log(AI) / log(AA)]
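
Translating these definitions directly into code (the throughput and fault numbers below are hypothetical placeholders, not values from the study):

```python
import math

# Hypothetical inputs, not data from the Rutgers study.
T_n = 6000.0    # Tn: normal throughput (requests/sec)
A_I = 0.99999   # AI: ideal availability

# Average throughput AT over one failure cycle (normal + degraded time).
mttf, mttr = 100_000.0, 300.0    # seconds
degraded_tput = 2500.0           # requests/sec while degraded
AT = (T_n * mttf + degraded_tput * mttr) / (mttf + mttr)

AA = AT / T_n                                    # average availability
performability = T_n * math.log(A_I) / math.log(AA)
print(f"performability: {performability:.1f}")   # ~34.4
```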

Page 22:

Results of Rutgers Study: Performance Comparison

[Bar chart: throughput (0–7,000 requests/sec) by PRESS version (I-PRESS, TCP-PRESS, ReTCP-PRESS, VIA-PRESS).]

Page 23:

Results of Rutgers Study: Availability Comparison

[Bar chart: Unavailability by Component. % unavailability (0–0.005) by PRESS version, broken down by fault type: application crash, node freeze, node crash, SCSI hang, SCSI timeout, internal switch, internal link.]

Page 24:

Results of Rutgers Study: Performability Comparison

[Bar chart: performability, measured as throughput × scaled availability (0–60), by PRESS version (I-PRESS, TCP-PRESS, ReTCP-PRESS, VIA-PRESS).]