28
© 2014 MapR Technologies 1 © 2014 MapR Technologies How t-digest works and why Ted Dunning June 1, 2015

How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill Twitter @ApacheDrill

Embed Size (px)

Citation preview

Page 1: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 1© 2014 MapR Technologies

How t-digest works and why

Ted Dunning

June 1, 2015

Page 2: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 2

T-digest

Ted Dunning, Chief Applications Architect MapR Technologies

Email [email protected] [email protected]

Twitter @Ted_Dunning

Page 3: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 3

e-book available courtesy of MapR

http://bit.ly/1jQ9QuL

A New Look at Anomaly Detectionby Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)

Page 4: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 4

Last October: Time Series Databasesby Ted Dunning and Ellen Friedman © Oct 2014 (published by O’Reilly)

Page 5: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 5

Available Now: Real World Hadoopby Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)

Page 6: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 6

Practical Machine Learning series (O’Reilly)

• Machine learning is becoming mainstream• Need pragmatic approaches that take into account real world

business settings:– Time to value– Limited resources– Availability of data– Expertise and cost of team to develop and to maintain system

• Look for approaches with big benefits for the effort expended

Page 7: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 7

Agenda

• Why should we estimate quantiles?• How t-digest works• How can you get it?• Questions

Page 8: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 8

Why on­line algorithms?

Page 9: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 9

Page 10: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 10

Why Quantiles (percentiles)

Page 11: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 11

Suppose You Have …

• 100 M users, 1K sites touched each day• What is 99.9% latency for each user/site combination?

for each user?

for each site?

for users in Kansas?

for users who complained?

for users who complained, but before they complained?

Page 12: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 12

Or Suppose …

• 1000 nodes, each with 24 disks, 100 unique RPC calls• Want latencies for all disks, all RPC calls between all nodes

– 50 %-ile, 99%-ile, 99.9%-ile

• <100ns overhead per measurement• <10MB overhead per node• No logs except for exceptionally slow cases• Summary at any time

Page 13: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 13

What about accuracy?

Page 14: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 14

What Accuracy Required?

• 50%-ile ± 0.5%

• 99.99%-ile ± 0.5%

• 99.99%-ile ± 0.001%

• 50%-ile ± 0.001%

Page 15: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 15

What Accuracy Required?

• 50%-ile ± 0.5%

• 99.99%-ile ± 0.5%

• 99.99%-ile ± 0.001%

• 50%-ile ± 0.001%

Often just fine

Nonsense

By definition

Over-kill

Page 16: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 16

The internals

Page 17: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 17

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

q

Clu

ster

 Siz

e

Variable Cluster Size for Constant Relative Accuracy

Small clusters give high accuracy

Large clusters give coarse accuracy

Page 18: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 18

Second-Order Accuracy via Interpolation

Cumulative distribution

Centroids are spaced widely near q = 0.5

and tightly nearq = 0 or q = 1

−4 −2 0 2 4

0.0

0.2

0.4

0.6

0.8

1.0

x

q

Page 19: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 19

0.0 0.2 0.4 0.6 0.8 1.0

020

4060

8010

0

q

k

Translation Between Quantile and Cluster #

Centroid size can be controlled using translation to centroid scale

k1 k2 1

k sin1(2q1) / 2

Page 20: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 20

The Algorithm

• Static Buffers– n1 new points– n2 existing centroids– n2 merge space

• Algorithm– Collect new points until full– Sort new points– Merge with existing centroids

k2 – k1 < 1 criterion for merging

– Swap centroids and merge space

Page 21: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 21

The Algorithm

• Static Buffers– n1 new points– n2 existing centroids– n2 merge space

• Algorithm– Collect new points until full– Sort new points– Merge with existing centroids

k2 – k1 < 1 criterion for merging

– Swap centroids and merge space

• Can be implemented with in-place merge

• Can use approximate q-k mapping for speed

• Completely static memory

Page 22: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 22

Using t­digest

Page 23: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 23

Available As

• An aggregator in Elastic Search• In stream-lib• As a UDF for Apache Drill (soon!)• In Apache Mahout• From Maven Central

<dependency>    <groupId>com.tdunning</groupId>    <artifactId>t­digest</artifactId>    <version>3.1</version></dependency>

Page 24: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 24

The Upshot

• Streaming approximations are important• Accurate quantiles are important

• The t-digest algorithm is simple and very accurate

• You can use it almost anywhere

Page 25: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 25

Special Thanks To

• Otmar Ertl (k2-k1 idea)• Adrien Grand (best tree implementation)• Hossman (API improvements)• Cam Davidson-Pilon (great descriptive blog)

Page 26: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 26

Special Thanks To

• Otmar Ertl (k2-k1 idea)• Adrien Grand (best tree implementation)• Hoss (API improvements)• Cam Davidson-Pilon (great descriptive blog)

• _______ ________ (your name here)

Page 27: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 27

Who I am

Ted Dunning, Chief Applications Architect, MapR Technologies

Email [email protected]@apache.org

Twitter @Ted_Dunning

Apache Mahout https://mahout.apache.org/

Twitter @ApacheMahout

Apache Drill http://incubator.apache.org/drill/

Twitter @ApacheDrill

Page 28: How digest works and why - Berlin Buzzwords · How t-digest works and why Ted Dunning June 1, 2015 © 2014 MapR Technologies 2 T-digest ... Apache Drill  Twitter @ApacheDrill

© 2014 MapR Technologies 28

Q & A

@mapr maprtech

[email protected]

Engage with us!

MapR

maprtech

mapr-technologies