ML Workshop 2: Machine Learning Model Comparison & Evaluation

Page 1: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Machine Learning Comparison and Evaluation

Page 2: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Contact Information

Ted Dunning, PhD

Chief Application Architect, MapR Technologies

Board Member, Apache Software Foundation

O’Reilly author

Email: [email protected], [email protected]

Twitter @ted_dunning

Page 3: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Machine Learning Everywhere

Image courtesy Mtell, used with permission. Images © Ellen Friedman.

Page 4: ML Workshop 2: Machine Learning Model Comparison & Evaluation


[Architecture diagram: raw input is turned into features / profiles and sent to models m1, m2, and m3, which produce scores; a decoy model writes the raw inputs to an archive.]

Page 5: ML Workshop 2: Machine Learning Model Comparison & Evaluation


[Architecture diagram: the same pipeline with a rendezvous component added; it collects the scores from m1, m2, and m3 and returns results.]

Page 6: ML Workshop 2: Machine Learning Model Comparison & Evaluation


[Architecture diagram: the full pipeline, now also emitting metrics alongside the rendezvous, scores, decoy, and archive components.]

Page 7: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Let’s talk about how the rendezvous architecture makes evaluation easier

Page 8: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Decoy Model in the Rendezvous Architecture

[Diagram: input fans out to the decoy, Model 2, and Model 3; the models return scores while the decoy writes the raw input to an archive.]

• Looks like a server, but it just archives inputs (see the sketch below)

• Safe in a good streaming environment, less safe without good isolation
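
A minimal sketch of the decoy idea, assuming a Kafka-compatible stream environment (which MapR Streams provides); the topic names, group id, and broker address below are placeholders, not anything from the workshop. The decoy subscribes to the same input topic as the real models but does no scoring at all; it only copies each request into an archive topic.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DecoyModel {
    public static void main(String[] args) {
        // One Properties object for brevity; keys a client does not use are ignored with a warning
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "decoy");                      // the decoy gets its own consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> in = new KafkaConsumer<>(props);
             KafkaProducer<String, String> archive = new KafkaProducer<>(props)) {
            in.subscribe(List.of("model-input"));            // same topic the real models read
            while (true) {
                ConsumerRecords<String, String> batch = in.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> r : batch) {
                    // No scoring at all: copy the request unchanged to the archive topic
                    // (the record key is assumed to carry the request id)
                    archive.send(new ProducerRecord<>("input-archive", r.key(), r.value()));
                }
            }
        }
    }
}

Because it shares the input stream but never answers, the decoy yields a complete input archive without touching the production scoring path.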

Page 9: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Other Data Collected in Rendezvous

• Request ID + Input data

• All output scores

• Evaluation latency

• Round trip latency

• Rendezvous choices (which model’s score was actually returned; see the record sketch below)
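
One convenient way to keep these fields together is a small per-request record; the sketch below is illustrative only, and the field names are assumptions rather than the workshop’s schema.

import java.util.Map;

/** One archived evaluation event; field names are illustrative only. */
public record RendezvousRecord(
        String requestId,                  // request ID
        String inputData,                  // the raw input that was scored
        Map<String, Double> scores,        // all output scores, keyed by model name
        Map<String, Long> evalLatencyMs,   // per-model evaluation latency
        long roundTripLatencyMs,           // end-to-end latency seen by the caller
        String chosenModel) {              // the rendezvous choice actually returned
}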

Page 10: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Direct Model Comparison

• Don’t need ground truth to compare models at a gross level

• For uncalibrated models, score quantiles are useful (see the Q-Q sketch below)

• For mature models, most results will be very similar

– Large differences from known good models cannot be good

• Ultimately, ground truth is important

– But only for cases where scores differ significantly
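
A minimal sketch of the quantile comparison, assuming you simply have two arrays of recent scores, one from a known-good model and one from a challenger; nothing here comes from the workshop code. It pairs matching quantiles (the numeric form of a Q-Q plot), so radically different score scales do not matter.

import java.util.Arrays;

public class QuantileCompare {
    /** Empirical quantile of a sorted array by linear interpolation. */
    static double quantile(double[] sorted, double q) {
        double pos = q * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        double frac = pos - lo;
        return sorted[lo] * (1 - frac) + sorted[hi] * frac;
    }

    /** Q-Q pairs: matching quantiles of two score samples, suitable for plotting. */
    static double[][] qqPairs(double[] x, double[] y, int steps) {
        double[] a = x.clone(), b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        double[][] pairs = new double[steps + 1][2];
        for (int i = 0; i <= steps; i++) {
            double q = (double) i / steps;
            pairs[i][0] = quantile(a, q);
            pairs[i][1] = quantile(b, q);
        }
        return pairs;
    }

    public static void main(String[] args) {
        double[] m1 = {0.10, 0.22, 0.35, 0.51, 0.64, 0.80, 0.93};   // known-good model
        double[] m2 = {1.2, 2.0, 3.9, 5.0, 6.3, 8.1, 9.4};          // challenger on a different scale
        for (double[] p : qqPairs(m1, m2, 10)) {
            System.out.printf("%.3f  %.3f%n", p[0], p[1]);
        }
    }
}

Plotting the pairs gives the Q-Q view on the next slides; large deviations from a smooth monotone curve are where the models genuinely disagree and where ground truth is worth collecting.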

Page 11: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Direct Model Differencing

[Plots: raw score distributions for two models, and the corresponding Q-Q plot.]

Page 12: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Direct Model Differencing

[Plots: raw score distributions and Q-Q plot. Annotation: the raw score scales may differ radically.]

Page 13: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Direct Model Differencing

[Plots: raw score distributions and Q-Q plot. Annotations: the raw score scales may differ radically; quantiles correct the scaling.]

Page 14: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Direct Model Differencing

[Plots: raw score distributions and Q-Q plot. Annotations: the raw score scales may differ radically; quantiles correct the scaling; there is a near-perfect match on high scores.]

Page 15: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Reject Inferencing

• Today’s model selects tomorrow’s training data

• Safe decisions often prevent data collection

– Fraud flag prevents the transaction

– Recommendation ranking has the same effect

• The model winds up confirming what it already knows

• Model comparison has the same problem

– Champion says reject, challenger says retain

Page 16: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Reject Inferencing Solution

• We must balance EXPLORATION

– Calling a bluff to look at ground truth

• Versus EXPLOITATION

– Doing what we think is right

• Exploration costs us because we make worse decisions

– But it can help make better decisions later

• Exploitation costs us because we don’t learn better answers

– But it is the best we know now

Page 17: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Multi-Armed Bandits

• Classic formulation for explore/exploit trade-offs

• Thompson sampling is a very good option (see the sketch below)

• Simple dithering may be good enough

• The key intuition is that we don’t need to perfectly characterize losers … once we know they are losers, we don’t care

• A variant for ranking is also good for model evaluation

– It is also used to rank reddit comments
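
A minimal Beta-Bernoulli Thompson sampling sketch; it assumes each model’s acted-on answers can eventually be scored as success or failure once ground truth arrives, which is an assumption about the application rather than anything specified in the workshop. On each request, every model gets a draw from its current Beta posterior and the highest draw wins, so clear losers are picked rarely (but never starved completely) while the leaders get most of the traffic.

import java.util.Random;

public class ThompsonSampler {
    private final long[] successes;
    private final long[] failures;
    private final Random rand = new Random();

    public ThompsonSampler(int numModels) {
        successes = new long[numModels];
        failures = new long[numModels];
    }

    /** Pick which model's answer to act on for the next request. */
    public int choose() {
        int best = 0;
        double bestDraw = -1;
        for (int m = 0; m < successes.length; m++) {
            // Sample from Beta(successes + 1, failures + 1), the posterior under a uniform prior
            double draw = sampleBeta(successes[m] + 1, failures[m] + 1);
            if (draw > bestDraw) {
                bestDraw = draw;
                best = m;
            }
        }
        return best;
    }

    /** Record ground truth for a model whose answer we actually acted on. */
    public void update(int model, boolean success) {
        if (success) successes[model]++; else failures[model]++;
    }

    // Beta(a, b) sampled as Gamma(a) / (Gamma(a) + Gamma(b))
    private double sampleBeta(double a, double b) {
        double x = sampleGamma(a);
        double y = sampleGamma(b);
        return x / (x + y);
    }

    // Marsaglia-Tsang gamma sampler; valid for shape >= 1, which always holds here
    private double sampleGamma(double shape) {
        double d = shape - 1.0 / 3.0;
        double c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double z = rand.nextGaussian();
            double v = Math.pow(1 + c * z, 3);
            if (v <= 0) continue;
            double u = rand.nextDouble();
            if (Math.log(u) < 0.5 * z * z + d - d * v + d * Math.log(v)) {
                return d * v;
            }
        }
    }
}

Simple dithering, mentioned above, amounts to replacing the posterior draw with “mostly pick the champion, occasionally pick someone else at random.”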

Page 18: ML Workshop 2: Machine Learning Model Comparison & Evaluation

Page 19: ML Workshop 2: Machine Learning Model Comparison & Evaluation

Page 20: ML Workshop 2: Machine Learning Model Comparison & Evaluation

Page 21: ML Workshop 2: Machine Learning Model Comparison & Evaluation

Page 22: ML Workshop 2: Machine Learning Model Comparison & Evaluation

Page 23: ML Workshop 2: Machine Learning Model Comparison & Evaluation

Page 24: ML Workshop 2: Machine Learning Model Comparison & Evaluation

Page 25: ML Workshop 2: Machine Learning Model Comparison & Evaluation

Page 26: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Some Warnings

• Bad models can be good explorers

• That can make other models look better

• Offline evaluation is fine, but you don’t know what would have happened … real innovation has high error bars

• Where models all agree, we learn nothing

• In the end, it is differences that matter the most

Page 27: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Having complete and precise history is golden for offline comparisons

Page 28: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Allowing the rendezvous server to do Thompson sampling is even better

Page 29: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Change Detection

• Model comparison is all fine and good until the world changes

• And the world will change

• One of the most sensitive indicators is the score distribution for a good model

– T-digest is very effective for sketching distributions, especially in the tails

– Compare the current vs. historical distribution using a Q-Q comparison or the Kolmogorov–Smirnov (KS) statistic (see the sketch below)
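
A minimal sketch of the distribution check on two batches of scores, a historical window and a current one; in a streaming setting you would more likely keep t-digest sketches and compare their quantiles, but the plain two-sample KS statistic below shows the idea. The alarm threshold is an arbitrary placeholder that would need calibration on real traffic.

import java.util.Arrays;

public class ScoreDriftCheck {
    /**
     * Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
     * empirical CDFs of the historical and current score samples.
     */
    static double ksStatistic(double[] historical, double[] current) {
        double[] h = historical.clone();
        double[] c = current.clone();
        Arrays.sort(h);
        Arrays.sort(c);
        int i = 0, j = 0;
        double maxGap = 0;
        while (i < h.length && j < c.length) {
            double x = Math.min(h[i], c[j]);
            while (i < h.length && h[i] <= x) i++;   // advance both samples past the value x
            while (j < c.length && c[j] <= x) j++;
            double gap = Math.abs((double) i / h.length - (double) j / c.length);
            maxGap = Math.max(maxGap, gap);
        }
        return maxGap;
    }

    public static void main(String[] args) {
        double[] lastWeek = {0.12, 0.25, 0.31, 0.44, 0.52, 0.61, 0.70, 0.83};
        double[] today    = {0.30, 0.41, 0.47, 0.55, 0.62, 0.71, 0.80, 0.91};
        double ks = ksStatistic(lastWeek, today);
        System.out.println("KS = " + ks);
        if (ks > 0.3) {   // placeholder threshold; calibrate on real traffic
            System.out.println("score distribution has shifted; investigate");
        }
    }
}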

Page 30: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Analyzing latencies

Page 31: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Hotel Room Latencies

• These are ping latencies from my hotel

• Looks pretty good, right?

• But what about longer term?

Sample ping times (ms): 208.302, 198.571, 185.099, 191.258, 201.392, 214.738, 197.389, 187.749, 201.693, 186.762, 185.296, 186.390, 183.960, 188.060, 190.763

> mean(y$t[i])
[1] 198.6047
> sd(y$t[i])
[1] 71.43965

Page 32: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Not So Fast …

Page 33: ML Workshop 2: Machine Learning Model Comparison & Evaluation


This is long-tailed land

Page 34: ML Workshop 2: Machine Learning Model Comparison & Evaluation


This is long-tailed land

You have to know the distribution of values

Page 35: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Page 36: ML Workshop 2: Machine Learning Model Comparison & Evaluation


A single number is simply not enough

Page 37: ML Workshop 2: Machine Learning Model Comparison & Evaluation


And this histogram is hard to read

Page 38: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Idea – Exponential Bins

• Suppose we want relative accuracy in measurement space

• Latencies are positive and only matter within a few percent

– 1.1 ms versus 1.0 ms

– 1100 ms versus 1000 ms

• We can cheat by using floating point representations

– Compute bin using magic

– Adjust bins slightly using more magic

– Count

Page 39: ML Workshop 2: Machine Learning Model Comparison & Evaluation


FloatHistogram

• Assume all measurements are in a known range [min, max]

• Divide this range into power-of-2 sub-ranges

• Sub-divide each sub-range evenly into a fixed number of steps (the step count sets the relative accuracy)

• Relative error is bounded in measurement space

Page 40: ML Workshop 2: Machine Learning Model Comparison & Evaluation


FloatHistogram

• Assume all measurements are in a known range [min, max]

• Divide this range into power-of-2 sub-ranges

• Sub-divide each sub-range evenly into a fixed number of steps (the step count sets the relative accuracy)

• Relative error is bounded in measurement space

• The bin index can be computed directly from the floating-point representation! (see the sketch below)
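
A minimal sketch of the floating-point trick, not the actual FloatHistogram code: for positive doubles, the exponent bits identify the power-of-2 sub-range and the top few mantissa bits identify the step inside it, so the bin index is a shift and a mask away. The choice of 16 steps per sub-range is an arbitrary illustration, and handling of the range minimum and maximum is omitted.

public class ExponentialBins {
    // Steps per power-of-2 sub-range; 16 steps means 4 mantissa bits, at most ~6% relative bin width
    private static final int SUB_BITS = 4;

    /**
     * Bin index for a positive measurement: the unbiased binary exponent selects the
     * power-of-2 sub-range, the top SUB_BITS mantissa bits select the step inside it.
     */
    static int binIndex(double x) {
        long bits = Double.doubleToLongBits(x);
        int exponent = (int) ((bits >>> 52) & 0x7ff) - 1023;                 // unbiased exponent
        int step = (int) ((bits >>> (52 - SUB_BITS)) & ((1 << SUB_BITS) - 1));
        return (exponent << SUB_BITS) | step;                                 // exponent * 16 + step
    }

    public static void main(String[] args) {
        // Nearby latencies land in the same bin; values ~10% apart usually do not
        System.out.println(binIndex(1.00) + " " + binIndex(1.05) + " " + binIndex(1.10));
        System.out.println(binIndex(1000.0) + " " + binIndex(1100.0));
    }
}

Because bin width grows in proportion to the value, relative error stays bounded across the whole latency range, which is exactly what uniform bins fail to do in the plots that follow.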

Page 41: ML Workshop 2: Machine Learning Model Comparison & Evaluation


What about visualization?

Page 42: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Can’t see small count bars

Page 43: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Good Results

Page 44: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Bad Results – 1% of measurements are 3x bigger

Page 45: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Bad Results – 1% of measurements are 3x bigger

Page 46: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Uniform Bins

Page 47: ML Workshop 2: Machine Learning Model Comparison & Evaluation


FloatHistogram Bins

Page 48: ML Workshop 2: Machine Learning Model Comparison & Evaluation


With FloatHistogram

Page 49: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Sign Up for the Next Workshop in the MLL (Machine Learning Logistics) Series

by Ted Dunning, Chief Applications Architect at MapR:

Machine Learning in the Enterprise:

How to do model management in production

http://bit.ly/mapr-machine-learning-logistics-series

Page 50: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Additional Resources

O’Reilly report by Ted Dunning & Ellen Friedman © March 2017

Read free courtesy of MapR:

https://mapr.com/geo-distribution-big-data-and-analytics/

O’Reilly book by Ted Dunning & Ellen Friedman

© March 2016

Read free courtesy of MapR:

https://mapr.com/streaming-architecture-using-apache-kafka-mapr-streams/

Page 51: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Additional Resources

O’Reilly book by Ted Dunning & Ellen Friedman

© June 2014

Read free courtesy of MapR:

https://mapr.com/practical-machine-learning-new-look-anomaly-detection/

O’Reilly book by Ellen Friedman & Ted Dunning

© February 2014

Read free courtesy of MapR:

https://mapr.com/practical-machine-learning/

Page 52: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Additional Resources

Blog post by Ellen Friedman, 8 Aug 2017, on the MapR blog:

https://mapr.com/blog/tensorflow-mxnet-caffe-h2o-which-ml-best/

Interview by Thor Olavsrud in CIO:

https://www.cio.com.au/article/630299/what-dataops-collaborative-cross-functional-analytics/?fp=16&fpid=1

Page 53: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Read more in new book on model management:

New O’Reilly book by Ted Dunning & Ellen Friedman, © September 2017

Download free pdf courtesy of MapR:

https://mapr.com/ebook/machine-learning-logistics/

Page 54: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Please support women in tech – help build girls’ dreams of what they can accomplish

© Ellen Friedman 2015 #womenintech #datawomen

Page 55: ML Workshop 2: Machine Learning Model Comparison & Evaluation


Q&A

@mapr

Maprtechnologies

[email protected]

ENGAGE WITH US

@ted_dunning