70
© Copyright Azul Systems 2015 © Copyright Azul Systems 2015 @azulsystems How NOT to Measure Latency Matt Schuetze Product Management Director, Azul Systems 4/19/2016 1 South Bay (LA) Java User Group El Segundo, California

How NOT to Measure Latency - Azul Systems, Inc. · Run your measurement method against artificial systems that create hypothetical pauses scenarios. See if your reported results agree

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

© Copyright Azul Systems 2015

© Copyright Azul Systems 2015

@azulsystems

How NOT to Measure Latency

Matt Schuetze

Product Management Director, Azul Systems

4/19/2016 1

South Bay (LA) Java User Group

El Segundo, California

© Copyright Azul Systems 2015

© Copyright Azul Systems 2015

@azulsystems

Understanding Latency and Application Responsiveness

Matt Schuetze

Product Management Director, Azul Systems

4/19/2016 2

South Bay (LA) Java User Group

El Segundo, California

© Copyright Azul Systems 2015

© Copyright Azul Systems 2015

@azulsystems

The Oh $@%T! talk.

Matt Schuetze

Product Management Director, Azul Systems

4/19/2016 3

South Bay (LA) Java User Group

El Segundo, California

© Copyright Azul Systems 2015

About me: Matt Schuetze

Product Management Director at Azul Systems

Translate Voice of Customer into Zing and Zulu requirements and work items

Sing the praises of Azul efforts through product launches

Azul alternate on JCP exec committee, co-lead Detroit Java User Group

Stand on the shoulders of giants and admit it

4/19/2016 4

© Copyright Azul Systems 2015

Philosophy and motivation

What do we actually care about. And why?

4/19/2016 5

© Copyright Azul Systems 2015

Latency Behavior

Latency: The time it took one operation to happen

Each operation occurrence has its own latency

What we care about is how latency behaves

Behavior is a lot more than “the common case was X”

4/19/2016 6

© Copyright Azul Systems 2015 ©2013 Azul Systems, Inc.

95%’ile

The “We only want to show good things” chart

We like to look at charts

4/19/2016 7

© Copyright Azul Systems 2015

What do you care about?

Do you :

Care about latency in your system?

Care about the worst case?

Care about the 99.99%’ile?

Only care about the fastest thing in the day?

Only care about the best 50%

Only need 90% of operations to meet requirements?

4/19/2016 8

© Copyright Azul Systems 2015 ©2013 Azul Systems, Inc.

We like to rant about latency

4/19/2016 9

© Copyright Azul Systems 2015

99%‘ile is ~60 usec. (but mean is ~210usec)

“outliers”, “averages” and other nonsense

We nicknamed these spikes “hiccups”

4/19/2016 10

© Copyright Azul Systems 2015

Dispelling standard deviation

4/19/2016 11

© Copyright Azul Systems 2015

Mean = 0.06 msec Std. Deviation (𝞂) = 0.21msec

99.999% = 38.66msec

In a normal distribution,

These are NOT normal distributions

~184 σ (!!!) away from the mean

the 99.999%’ile falls within 4.5 σ

Dispelling standard deviation

4/19/2016 12

© Copyright Azul Systems 2015

Is the 99%’ile “rare”?

4/19/2016 13

© Copyright Azul Systems 2015

What are the chances of a single web page view experiencing the 99%’ile latency of:

- A single search engine node?

- A single Key/Value store node?

- A single Database node?

- A single CDN request?

Cumulative probability…

4/19/2016 14

© Copyright Azul Systems 2015 4/19/2016 15

© Copyright Azul Systems 2015

Which HTTP response time metric is more “representative” of user

experience?

The 95%’ile or the 99.9%’ile

4/19/2016 16

© Copyright Azul Systems 2015

Example: A typical user session involves 5 page loads, averaging 40 resources per page.

- How many of our users will NOT experience something

worse than the 95%’ile?

Answer: ~0.003%

- How many of our users will experience at least one response that is longer than the 99.9%’ile?

Answer: ~18%

Gauging user experience

4/19/2016 17

© Copyright Azul Systems 2015

Classic look at response time behavior

Response time as a function of load

source: IBM CICS server documentation, “understanding response times”

Average? Max?

Median? 90%? 99.9%

4/19/2016 18

© Copyright Azul Systems 2015

Hiccups are strongly multi-modal

They don’t look anything like a normal distribution

A complete shift from one mode/behavior to another

Mode A: “good”

Mode B: “somewhat bad”

Mode C: “terrible”, ...

The real world is not a gentle, smooth curve

Mode transitions are “phase changes”

4/19/2016 19

© Copyright Azul Systems 2015

Proven ways to deal with hiccups

Actually characterizing latency

Requirements

Response Time Percentile plot

line

4/19/2016 20

© Copyright Azul Systems 2015

Different throughputs, configurations, or other parameters on one graph

Comparing Behavior

4/19/2016 21

© Copyright Azul Systems 2015

Shameless Bragging

4/19/2016 22

© Copyright Azul Systems 2015

Comparing Behaviors - Actual Latency sensitive messaging distribution application: HotSpot vs. Zing

4/19/2016 23

© Copyright Azul Systems 2015

Zing

A standards-compliant JVM for Linux/x86 servers

Eliminates Garbage Collection as a concern for enterprise applications in Java, Scala, or any JVM language

Very wide operating range: Used in both low latency and large scale enterprise application spaces

Decouples scale metrics from response time concerns

Transaction rate, data set size, concurrent users, heap size, allocation rate, mutation rate, etc.

Leverages elastic memory for resilient operation

4/19/2016 24

© Copyright Azul Systems 2015

What is Zing good for?

If you have a server-based Java application

And you are running on Linux

And you use using more than ~300MB of memory, perhaps 8-16 GB, on up to as high as 2TB memory,

Then Zing will likely deliver superior behavior metrics

4/19/2016 25

© Copyright Azul Systems 2015

Where Zing shines

Low latency Eliminate behavior blips down to the sub-millisecond-units level

Machine-to-machine “stuff” Support higher *sustainable* throughput (one that meets SLAs)

Messaging, queues, market data feeds, fraud detection, analytics

Human response times Eliminate user-annoying response time blips. Multi-second and even fraction-of-a-second blips will be completely gone.

Support larger memory JVMs *if needed* (e.g. larger virtual user counts, or larger cache, in-memory state, or consolidating multiple instances)

“Large” data and in-memory analytics Make batch stuff “business real time”. Gain super-efficiencies.

Cassandra, Spark, Solr/Lucene, any large dataset in fast motion 4/19/2016 26

© Copyright Azul Systems 2015

An accidental conspiracy...

The coordinated omission problem

4/19/2016 27

© Copyright Azul Systems 2015

The coordinated omission problem

Common load testing example:

– each “client” issues requests at a certain rate

– measure/log response time for each request

So what’s wrong with that?

– works only if ALL responses fit within interval

– implicit “automatic back off” coordination

Begin audience participation exercise now…

4/19/2016 28

© Copyright Azul Systems 2015

It is MUCH more common than you may think...

Is coordinated omission rare?

4/19/2016 29

© Copyright Azul Systems 2015

Before Correction

After Correcting

for Omission

JMeter makes this mistake... And so do other tools!

4/19/2016 30

© Copyright Azul Systems 2015

Before Correction

After Correction

Wrong by 7x

Real World Coordinated Omission effects

4/19/2016 31

© Copyright Azul Systems 2015

Uncorrected Data

Real World Coordinated Omission effects

4/19/2016 32

© Copyright Azul Systems 2015

Uncorrected Data

Corrected for Coordinated

Omission

Real World Coordinated Omission effects

4/19/2016 33

© Copyright Azul Systems 2015

A ~2500x difference in

reported percentile levels for the problem

that Zing eliminates

Zing

“other” JVM

Real World Coordinated Omission effects Why I care

4/19/2016 34

© Copyright Azul Systems 2015

Response Time vs. Service Time

© Copyright Azul Systems 2015

Service Time vs. Response Time

© Copyright Azul Systems 2015

Coordinated Omission

Usually

makes something that you think is a Response Time metric only represent the Service Time component

© Copyright Azul Systems 2015

Response Time vs. Service Time @2K/sec

© Copyright Azul Systems 2015

Response Time vs. Service Time @20K/sec

© Copyright Azul Systems 2015

Response Time vs. Service Time @60K/sec

© Copyright Azul Systems 2015

Response Time vs. Service Time @80K/sec

© Copyright Azul Systems 2015

Response Time vs. Service Time @90K/sec

© Copyright Azul Systems 2015

How “real” people react

4/19/2016 43

© Copyright Azul Systems 2015

Suggestions

Whatever your measurement technique is, test it.

Run your measurement method against artificial systems that create hypothetical pauses scenarios. See if your reported results agree with how you would describe that system behavior

Don’t waste time analyzing until you establish sanity

Don’t ever use or derive from standard deviation

Always measure Max time. Consider what it means...

Be suspicious.

Measure %‘iles. Lots of them.

4/19/2016 44

© Copyright Azul Systems 2015

HdrHistogram

4/19/2016 45

© Copyright Azul Systems 2015

Then you need both good dynamic range and good resolution

HdrHistogram

If you want to be able to produce charts like this...

4/19/2016 46

© Copyright Azul Systems 2015 4/19/2016 47

© Copyright Azul Systems 2015 4/19/2016 48

© Copyright Azul Systems 2015

Shape of Constant latency

10K fixed line latency 4/19/2016 49

© Copyright Azul Systems 2015

Shape of Gaussian latency

10K fixed line latency with added Gaussian noise (std dev. = 5K)

4/19/2016 50

© Copyright Azul Systems 2015

Shape of Random latency

10K fixed line latency with added Gaussian (std dev. = 5K) vs. random (+5K)

4/19/2016 51

© Copyright Azul Systems 2015

Shape of Stalling latency

10K fixed base, stall magnitude of 50K stall likelihood = 0.00005 (interval = 100)

4/19/2016 52

© Copyright Azul Systems 2015

Shape of Queuing latency

10K fixed base, occasional bursts of 500 msgs handling time = 100, burst likelihood = 0.00005

4/19/2016 53

© Copyright Azul Systems 2015

Shape of Multi Modal latency

10K mode0 70K mode1 (likelihood 0.01) 180K mode2 (likelihood 0.00001)

4/19/2016 54

© Copyright Azul Systems 2015

And this what the real(?) world sometimes looks like…

4/19/2016 55

© Copyright Azul Systems 2015

Real world “deductive reasoning”

4/19/2016 56

© Copyright Azul Systems 2015

http://www.jhiccup.org

4/19/2016 57

© Copyright Azul Systems 2015

jHiccup

4/19/2016 58

© Copyright Azul Systems 2015

Discontinuity in Java execution

4/19/2016 59

© Copyright Azul Systems 2015

Examples

4/19/2016 60

© Copyright Azul Systems 2015

Oracle HotSpot ParallelGC Oracle HotSpot G1

1GB live set in 8GB heap, same app, same HotSpot, different GC

4/19/2016 61

© Copyright Azul Systems 2015

Oracle HotSpot CMS Zing Pauseless GC

1GB live set in 8GB heap, same app, different JVM/GC

4/19/2016 62

© Copyright Azul Systems 2015

Oracle HotSpot CMS Zing Pauseless GC

1GB live set in 8GB heap, same app, different JVM/GC- drawn to scale

4/19/2016 63

© Copyright Azul Systems 2015

Zing

Low latency trading application

Oracle HotSpot (pure NewGen) Zing Pauseless GC

4/19/2016 64

© Copyright Azul Systems 2015

Oracle HotSpot (pure newgen) Zing Oracle HotSpot (pure newgen)

Low latency trading application

Oracle HotSpot (pure NewGen) Zing Pauseless GC

4/19/2016 65

© Copyright Azul Systems 2015

Oracle HotSpot (pure newgen) Zing

Low latency trading application – drawn to scale

Oracle HotSpot (pure NewGen) Zing Pauseless GC

4/19/2016 66

© Copyright Azul Systems 2015

jhiccup.com

hdrhistogram.com

azul.com/zing

4/19/2016 67

Key Latency Utils and Zing

© Copyright Azul Systems 2015

Compulsory Marketing Pitch

4/19/2016 68

© Copyright Azul Systems 2015

Azul Hot Topics

4/19/2016 69

Zing 16.01 available

2 TB Xmx

Docker

JMX

Performance

Zing for Cloud

Amazon AMIs

Mesos in R&D

Cloud Foundry on deck

Zing for Big Data

Cassandra offering

Hazelcast cert

Akka/Spark in Zing open source program

Zulu

Azure and AWS

JSE Embedded

ARM32 evals

Intel Edison evals

© Copyright Azul Systems 2015

Q&A and In Closing…

Go get some Zing today!

At very least download jHiccup.

Which is better dive bar The Office or Purple Orchid?

azul.com

4/19/2016 70

@schuetzematt