© Copyright Azul Systems 2015
@azulsystems
How NOT to Measure Latency
Matt Schuetze
Product Management Director, Azul Systems
4/19/2016 1
South Bay (LA) Java User Group
El Segundo, California
Understanding Latency and Application Responsiveness
The Oh $@%T! talk.
About me: Matt Schuetze
Product Management Director at Azul Systems
Translate Voice of Customer into Zing and Zulu requirements and work items
Sing the praises of Azul efforts through product launches
Azul alternate on JCP exec committee, co-lead Detroit Java User Group
Stand on the shoulders of giants and admit it
Philosophy and motivation
What do we actually care about, and why?
Latency Behavior
Latency: The time it took one operation to happen
Each operation occurrence has its own latency
What we care about is how latency behaves
Behavior is a lot more than “the common case was X”
We like to look at charts
The “We only want to show good things” chart
[Chart: the 95%’ile line]
What do you care about?
Do you:
Care about latency in your system?
Care about the worst case?
Care about the 99.99%’ile?
Only care about the fastest thing in the day?
Only care about the best 50%?
Only need 90% of operations to meet requirements?
We like to rant about latency
“Outliers”, “averages” and other nonsense
We nicknamed these spikes “hiccups”
[Chart: 99%‘ile is ~60 usec, but the mean is ~210 usec]
Dispelling standard deviation
Mean = 0.06 msec; Std. Deviation (σ) = 0.21 msec
99.999%’ile = 38.66 msec
In a normal distribution, the 99.999%’ile falls within 4.5 σ
Here it is ~184 σ (!!!) away from the mean
These are NOT normal distributions
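The point can be sketched in a few lines of Java. All numbers here are hypothetical: a mostly-flat latency stream (0.06 msec) with rare 38.66 msec stalls puts the 99.999%’ile hundreds of σ from the mean, nowhere near the ~4.5 σ a normal distribution would predict.

```java
import java.util.Arrays;

// Hypothetical hiccup-style sample: almost every operation is fast, a rare few
// stall. The 99.999%'ile then sits hundreds of standard deviations from the
// mean, instead of the ~4.5 sigma a normal distribution would predict.
public class SigmaVsPercentile {
    // 1,000,000 samples at 0.06 msec, one 38.66 msec hiccup every 50,000 ops
    static double[] sample() {
        double[] a = new double[1_000_000];
        Arrays.fill(a, 0.06);
        for (int i = 0; i < a.length; i += 50_000) a[i] = 38.66;
        return a;
    }

    static double mean(double[] a) {
        double s = 0;
        for (double v : a) s += v;
        return s / a.length;
    }

    static double stdDev(double[] a) {
        double m = mean(a), s = 0;
        for (double v : a) s += (v - m) * (v - m);
        return Math.sqrt(s / a.length);
    }

    // a must be sorted ascending
    static double percentile(double[] a, double p) {
        return a[(int) Math.ceil(p / 100.0 * a.length) - 1];
    }

    public static void main(String[] args) {
        double[] a = sample();
        Arrays.sort(a);
        double m = mean(a), sd = stdDev(a), p = percentile(a, 99.999);
        System.out.printf("mean=%.4f msec  sigma=%.4f msec  99.999%%'ile=%.2f msec%n", m, sd, p);
        System.out.printf("the 99.999%%'ile is ~%.0f sigma from the mean%n", (p - m) / sd);
    }
}
```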
Is the 99%’ile “rare”?
Cumulative probability…
What are the chances of a single web page view experiencing the 99%’ile latency of:
- A single search engine node?
- A single Key/Value store node?
- A single Database node?
- A single CDN request?
Which HTTP response time metric is more “representative” of user experience: the 95%’ile or the 99.9%’ile?
Gauging user experience
Example: A typical user session involves 5 page loads, averaging 40 resources per page.
- How many of our users will NOT experience something worse than the 95%’ile? Answer: ~0.003%
- How many of our users will experience at least one response that is longer than the 99.9%’ile? Answer: ~18%
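The arithmetic behind those answers can be sketched as follows, assuming the 200 requests in a session (5 pages × 40 resources) hit their percentiles independently:

```java
// Arithmetic behind the session example: 5 page loads x 40 resources = 200
// requests. Assuming each request independently beats the Nth percentile with
// probability N/100:
public class PercentileOdds {
    static double pAllBelow(double percentile, int requests) {
        return Math.pow(percentile / 100.0, requests);
    }

    public static void main(String[] args) {
        int requests = 5 * 40;
        // chance an entire session beats the 95%'ile on every single request
        System.out.printf("all 200 beat the 95%%'ile:    ~%.4f%%%n",
                100.0 * pAllBelow(95.0, requests));          // ~0.0035%
        // chance a session sees at least one response worse than the 99.9%'ile
        System.out.printf("at least one over 99.9%%'ile: ~%.1f%%%n",
                100.0 * (1.0 - pAllBelow(99.9, requests)));  // ~18.1%
    }
}
```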
Classic look at response time behavior
Response time as a function of load
(source: IBM CICS server documentation, “understanding response times”)
[Chart annotations: Average? Max? Median? 90%? 99.9%?]
Hiccups are strongly multi-modal
They don’t look anything like a normal distribution
A complete shift from one mode/behavior to another
Mode A: “good”
Mode B: “somewhat bad”
Mode C: “terrible”, ...
The real world is not a gentle, smooth curve
Mode transitions are “phase changes”
Proven ways to deal with hiccups
Actually characterizing latency
[Chart: Response Time Percentile plot line vs. Requirements line]
Comparing Behavior
Different throughputs, configurations, or other parameters on one graph
Shameless Bragging
Comparing Behaviors: an actual latency-sensitive messaging distribution application, HotSpot vs. Zing
Zing
A standards-compliant JVM for Linux/x86 servers
Eliminates Garbage Collection as a concern for enterprise applications in Java, Scala, or any JVM language
Very wide operating range: Used in both low latency and large scale enterprise application spaces
Decouples scale metrics from response time concerns
Transaction rate, data set size, concurrent users, heap size, allocation rate, mutation rate, etc.
Leverages elastic memory for resilient operation
What is Zing good for?
If you have a server-based Java application
And you are running on Linux
And you are using more than ~300MB of memory, perhaps 8-16 GB, up to as much as 2 TB,
Then Zing will likely deliver superior behavior metrics
Where Zing shines
Low latency: eliminate behavior blips down to the sub-millisecond level
Machine-to-machine “stuff”: support higher *sustainable* throughput (one that meets SLAs)
Messaging, queues, market data feeds, fraud detection, analytics
Human response times: eliminate user-annoying response time blips. Multi-second and even fraction-of-a-second blips will be completely gone.
Support larger memory JVMs *if needed* (e.g. larger virtual user counts, larger cache, in-memory state, or consolidating multiple instances)
“Large” data and in-memory analytics: make batch stuff “business real time”. Gain super-efficiencies.
Cassandra, Spark, Solr/Lucene, any large dataset in fast motion
An accidental conspiracy...
The coordinated omission problem
The coordinated omission problem
Common load testing example:
– each “client” issues requests at a certain rate
– measure/log response time for each request
So what’s wrong with that?
– works only if ALL responses fit within interval
– implicit “automatic back off” coordination
Begin audience participation exercise now…
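The implicit back-off can be simulated in a few lines. This is a toy sketch with made-up numbers, not any particular tool: a hypothetical service answers in 1 ms but freezes once for 1,000 ms, while a single-threaded generator tries to issue one request every 10 ms. Recording only the in-request time hides every request that was silently delayed behind the freeze; measuring from each request's *intended* start time exposes them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy simulation of coordinated omission. A hypothetical service answers in
// 1 ms but freezes once for 1,000 ms. The "naive" generator records only the
// time spent inside each request; the corrected measurement is taken from each
// request's intended start time, exposing the requests queued behind the freeze.
public class CoordinatedOmissionDemo {
    static final long INTERVAL_MS = 10;   // intended rate: one request per 10 ms
    static final int REQUESTS = 1000;

    static long serviceTime(int i) { return i == 500 ? 1000 : 1; } // one big stall

    static List<Long> simulate(boolean correctForOmission) {
        List<Long> samples = new ArrayList<>();
        long clock = 0;                                   // simulated time, ms
        for (int i = 0; i < REQUESTS; i++) {
            long intendedStart = i * INTERVAL_MS;
            if (clock < intendedStart) clock = intendedStart; // generator waits for its tick
            long st = serviceTime(i);
            clock += st;
            // naive: service time only; corrected: time since the intended start
            samples.add(correctForOmission ? clock - intendedStart : st);
        }
        return samples;
    }

    public static void main(String[] args) {
        List<Long> naive = simulate(false), corrected = simulate(true);
        Collections.sort(naive);
        Collections.sort(corrected);
        int p99 = (int) Math.ceil(0.99 * REQUESTS) - 1;
        System.out.println("naive 99%'ile:     " + naive.get(p99) + " ms");
        System.out.println("corrected 99%'ile: " + corrected.get(p99) + " ms");
    }
}
```

The naive 99%’ile comes out as 1 ms; the corrected one is in the hundreds of milliseconds, because over a hundred requests were stuck behind the single freeze.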
Is coordinated omission rare?
It is MUCH more common than you may think...
JMeter makes this mistake... And so do other tools!
[Charts: before correction vs. after correcting for omission]
Real World Coordinated Omission effects
[Charts: before correction vs. after correction. Wrong by 7x]
Real World Coordinated Omission effects
[Charts: uncorrected data vs. data corrected for Coordinated Omission]
Real World Coordinated Omission effects: why I care
A ~2500x difference in reported percentile levels for the problem that Zing eliminates
[Chart: Zing vs. “other” JVM]
Response Time vs. Service Time
Coordinated Omission usually makes something that you think is a Response Time metric represent only the Service Time component.
[Charts: Response Time vs. Service Time at 2K/sec, 20K/sec, 60K/sec, 80K/sec, and 90K/sec]
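Why response time pulls away from service time as throughput climbs can be sketched with the textbook M/M/1 queueing formula R = S / (1 - rho). This is not from the talk, just a standard model, and all numbers below are made up for illustration:

```java
// Illustrative only: the M/M/1 formula R = S / (1 - rho) relates response
// time R to service time S and utilization rho. Service time stays flat while
// response time blows up as the arrival rate nears capacity.
public class ResponseVsService {
    // service time per request in msec, arrival rate in requests/sec
    static double responseTimeMs(double serviceMs, double arrivalsPerSec) {
        double capacityPerSec = 1000.0 / serviceMs;
        double rho = arrivalsPerSec / capacityPerSec;      // utilization
        if (rho >= 1.0) return Double.POSITIVE_INFINITY;   // saturated: queue grows forever
        return serviceMs / (1.0 - rho);
    }

    public static void main(String[] args) {
        double serviceMs = 0.01; // hypothetical 10 usec service => 100K/sec capacity
        for (double load : new double[]{2_000, 20_000, 60_000, 80_000, 90_000}) {
            System.out.printf("@%.0f/sec: response time = %.4f ms%n",
                    load, responseTimeMs(serviceMs, load));
        }
    }
}
```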
How “real” people react
Suggestions
Whatever your measurement technique is, test it.
Run your measurement method against artificial systems that create hypothetical pause scenarios, and see if your reported results agree with how you would describe that system's behavior.
Don’t waste time analyzing until you establish sanity
Don’t ever use or derive from standard deviation
Always measure Max time. Consider what it means...
Be suspicious.
Measure %‘iles. Lots of them.
HdrHistogram
HdrHistogram
If you want to be able to produce charts like this...
Then you need both good dynamic range and good resolution
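A minimal sketch of what using the HdrHistogram library (hdrhistogram.com, Maven artifact org.hdrhistogram:HdrHistogram) looks like. The recorded latency values here are fabricated; only the API calls are the point:

```java
import org.HdrHistogram.Histogram;

// Minimal HdrHistogram usage sketch with fabricated latency values.
public class HistogramSketch {
    public static void main(String[] args) {
        // track 1 ns .. 1 hour with 3 significant decimal digits of resolution
        Histogram h = new Histogram(3_600_000_000_000L, 3);
        for (int i = 0; i < 100_000; i++) {
            long latencyNs = 100_000 + (i % 50_000) * 10;  // fake measurements
            // the *WithExpectedInterval variant back-fills the samples that a
            // coordinated load generator would have omitted during a stall
            h.recordValueWithExpectedInterval(latencyNs, 1_000_000L);
        }
        System.out.println("max    = " + h.getMaxValue() + " ns");
        System.out.println("99.9%  = " + h.getValueAtPercentile(99.9) + " ns");
        // dumps the full percentile distribution (the data behind such charts)
        h.outputPercentileDistribution(System.out, 1000.0);
    }
}
```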
Shape of Constant latency
10K fixed line latency
Shape of Gaussian latency
10K fixed line latency with added Gaussian noise (std dev. = 5K)
Shape of Random latency
10K fixed line latency with added Gaussian (std dev. = 5K) vs. random (+5K)
Shape of Stalling latency
10K fixed base, stall magnitude of 50K, stall likelihood = 0.00005 (interval = 100)
Shape of Queuing latency
10K fixed base, occasional bursts of 500 msgs, handling time = 100, burst likelihood = 0.00005
Shape of Multi Modal latency
10K mode0; 70K mode1 (likelihood 0.01); 180K mode2 (likelihood 0.00001)
And this what the real(?) world sometimes looks like…
Real world “deductive reasoning”
http://www.jhiccup.org
jHiccup
Discontinuity in Java execution
Examples
Oracle HotSpot ParallelGC vs. Oracle HotSpot G1
1GB live set in 8GB heap, same app, same HotSpot, different GC
Oracle HotSpot CMS vs. Zing Pauseless GC
1GB live set in 8GB heap, same app, different JVM/GC
Oracle HotSpot CMS vs. Zing Pauseless GC
1GB live set in 8GB heap, same app, different JVM/GC, drawn to scale
Low latency trading application
Oracle HotSpot (pure NewGen) vs. Zing Pauseless GC
Low latency trading application – drawn to scale
Oracle HotSpot (pure NewGen) vs. Zing Pauseless GC
Key Latency Utils and Zing
jhiccup.com
hdrhistogram.com
azul.com/zing
Compulsory Marketing Pitch
Azul Hot Topics
Zing 16.01 available: 2 TB Xmx, Docker, JMX, Performance
Zing for Cloud: Amazon AMIs, Mesos in R&D, Cloud Foundry on deck
Zing for Big Data: Cassandra offering, Hazelcast cert, Akka/Spark in Zing open source program
Zulu: Azure and AWS, JSE Embedded, ARM32 evals, Intel Edison evals
Q&A and In Closing…
Go get some Zing today!
At very least download jHiccup.
Which is the better dive bar: The Office or Purple Orchid?
azul.com
@schuetzematt