ML Workshop 2: Machine Learning Model Comparison & Evaluation

© 2017 MapR Technologies

Machine Learning

Comparison and Evaluation


Contact Information

Ted Dunning, PhD

Chief Application Architect, MapR Technologies

Board Member, Apache Software Foundation

O’Reilly author

Twitter @ted_dunning


Machine Learning Everywhere

Image courtesy Mtell, used with permission. Images © Ellen Friedman.


[Figure, built up over three slides: rendezvous architecture diagram with elements labeled Raw, Features / profiles, Input, m1, m2, m3, Decoy, Archive, Scores, Rendezvous, Results, and Metrics.]


Let’s talk about how the rendezvous architecture makes evaluation easier


Decoy Model in the Rendezvous Architecture

[Figure: diagram with elements labeled Input, Decoy, Model 2, Model 3, Scores, and Archive.]

• Looks like a server, but it just archives inputs

• Safe in a good streaming environment, less safe without good isolation
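A minimal sketch of the decoy idea, assuming a plain Python list stands in for the archive (a file or stream in practice). The names `DecoyModel` and `handle` are hypothetical and are not taken from any MapR API.

```python
import json
import time


class DecoyModel:
    """Presents the same interface as a model but only archives its inputs."""

    def __init__(self, archive):
        # Any object with append(); assumed here to be a list.
        self.archive = archive

    def handle(self, request_id, features):
        # Record exactly what a real model would have seen, then return no score.
        record = {"request_id": request_id, "features": features, "ts": time.time()}
        self.archive.append(json.dumps(record, sort_keys=True))
        return None


archive = []
decoy = DecoyModel(archive)
decoy.handle("req-1", {"amount": 12.5})
decoy.handle("req-2", {"amount": 990.0})
```

Because the decoy sees the same input stream as the real models, the archive can later be replayed against a new model without guessing what production inputs looked like.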


Other Data Collected in Rendezvous

• Request ID + Input data

• All output scores

• Evaluation latency

• Round trip latency

• Rendezvous choices
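One way to picture the items above is as a single record per request. This is only a sketch: the class `RendezvousRecord` and its field names are assumptions made for illustration, not a schema from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RendezvousRecord:
    request_id: str
    input_data: dict
    # Every model's output score, keyed by model name.
    scores: Dict[str, float] = field(default_factory=dict)
    # Per-model evaluation latency in milliseconds.
    eval_latency_ms: Dict[str, float] = field(default_factory=dict)
    # Time from request arrival to returned result.
    round_trip_ms: Optional[float] = None
    # Which model's answer the rendezvous returned.
    chosen_model: Optional[str] = None

    def add_score(self, model, score, latency_ms):
        self.scores[model] = score
        self.eval_latency_ms[model] = latency_ms


rec = RendezvousRecord("req-1", {"amount": 12.5})
rec.add_score("m1", 0.91, latency_ms=3.2)
rec.add_score("m2", 0.88, latency_ms=7.5)
rec.chosen_model = "m1"
rec.round_trip_ms = 9.0
```

Keeping all scores, not just the chosen one, is what makes the later model-to-model comparisons possible.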


Direct Model Comparison

• Don’t need ground truth to compare models at a gross level

• For uncalibrated models, score quantiles are useful

• For mature models, most results will be very similar

– Large differences from known good models cannot be good

• Ultimately, ground truth is important

– But only for cases where scores differ significantly
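A rough sketch of comparing two models without ground truth: convert each model's raw scores to empirical quantiles so that scale differences drop out, then list the requests where the two models disagree strongly. The function names and the 0.25 threshold are illustrative assumptions.

```python
import bisect


def to_quantiles(scores):
    """Map each raw score to its empirical quantile in (0, 1]."""
    ordered = sorted(scores)
    n = len(ordered)
    # bisect_right counts how many scores are <= s.
    return [bisect.bisect_right(ordered, s) / n for s in scores]


def disagreements(scores_a, scores_b, threshold=0.25):
    """Indexes of requests whose quantile scores differ by more than threshold."""
    qa, qb = to_quantiles(scores_a), to_quantiles(scores_b)
    return [i for i, (a, b) in enumerate(zip(qa, qb)) if abs(a - b) > threshold]


champion = [0.1, 0.2, 0.3, 0.4, 0.9]
same_order = [-2.0, -1.0, 0.0, 1.0, 4.0]   # different scale, same ranking
swapped = [4.0, -1.0, 0.0, 1.0, -2.0]      # first and last requests swapped

agree = disagreements(champion, same_order)
differ = disagreements(champion, swapped)
```

Only the requests returned by `disagreements` would need ground truth to decide which model is right.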


Direct Model Differencing

[Figure, repeated over four slides with annotations added one at a time: two panels titled "Raw Scores" and "Q-Q plot".]

• Scales may differ radically

• Quantiles correct scaling

• Perfect match on high scores


Reject Inferencing

• Today’s model selects tomorrow’s training data

• Safe decisions often prevent data collection

– Fraud flag prevents the transaction

– Recommendation ranking has the same effect

• The model winds up confirming what it already knows

• Model comparison has the same problem

– Champion says reject, challenger says retain


Reject Inferencing Solution

• We must balance EXPLORATION

– Calling a bluff to look at ground truth

• Versus EXPLOITATION

– Doing what we think is right

• Exploration costs us because we make worse decisions

– But it can help make better decisions later

• Exploitation costs us because we don’t learn better answers

– But it is the best we know now


Multi-Armed Bandits

• Classic formulation for explore/exploit trade-offs

• Thompson sampling is a very good option

• Simple dithering may be good enough

• Key intuition is that we don’t need to perfectly characterize losers … once we know they are losers, we don’t care

• Variant for ranking also good for model evaluation

– Also used to rank reddit comments
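A small sketch of Thompson sampling for success/failure outcomes, assuming a uniform Beta(1, 1) prior for each arm. The class name, the simulated success rates, and the seed are illustrative assumptions, not values from the talk.

```python
import random


class BetaBandit:
    """Thompson sampling over arms with binary (success/failure) rewards."""

    def __init__(self, n_arms, rng=None):
        self.wins = [0] * n_arms
        self.losses = [0] * n_arms
        self.rng = rng or random.Random()

    def choose(self):
        # Draw a plausible success rate for each arm from its Beta posterior
        # and play the arm with the largest draw.
        samples = [
            self.rng.betavariate(w + 1, l + 1)
            for w, l in zip(self.wins, self.losses)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, success):
        if success:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1


def simulate(true_rates, rounds, seed=0):
    """Run the bandit against simulated arms; return pulls per arm."""
    rng = random.Random(seed)
    bandit = BetaBandit(len(true_rates), rng)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        arm = bandit.choose()
        pulls[arm] += 1
        bandit.update(arm, rng.random() < true_rates[arm])
    return pulls


pulls = simulate([0.05, 0.10, 0.20], rounds=5000)
```

The losing arms end up with few pulls and therefore wide posteriors, which is the point of the key intuition above: effort goes into separating winners, not into measuring losers precisely.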










Some Warnings

• Bad models can be good explorers

• That can make other models look better

• Offline evaluation is fine, but you don’t know what would have happened … real innovation has high error bars

• Where models all agree, we learn nothing

• In the end, it is differences that matter the most


Having complete and precise history is golden for offline comparisons


Allowing the rendezvous server to do Thompson sampling is even better


Change Detection

• Model comparison is all fine and good until the world changes

• And the world will change

• One of the most sensitive indicators is the score distribution of a good model

– T-digest is very effective for sketching distributions, especially in tails

– Compare current vs historical distribution using q-q or KS
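A sketch of the KS comparison, assuming raw score samples are kept for both the historical and the current window. In practice a sketch such as a t-digest could stand in for the raw samples; that substitution is not shown here, and the function name is hypothetical.

```python
import bisect


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


historical = [i / 100 for i in range(100)]   # illustrative scores spread over [0, 1)
unchanged = list(historical)
shifted = [s + 0.3 for s in historical]      # the world changed: scores moved up

ks_same = ks_statistic(historical, unchanged)
ks_shift = ks_statistic(historical, shifted)
```

Here `ks_same` is zero and `ks_shift` is about 0.3. The threshold at which to raise an alert depends on sample size and is left as a choice for the operator.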


Analyzing latencies


Hotel Room Latencies

• These are ping latencies from my hotel

• Looks pretty good, right?

• But what about longer term?

208.302 198.571 185.099 191.258 201.392 214.738 197.389 187.749 201.693 186.762 185.296 186.390 183.960 188.060 190.763

> mean(y$t[i])
[1] 198.6047
> sd(y$t[i])
[1] 71.43965


Not So Fast …


This is long-tailed land

You have to know the distribution of values



A single number

is simply not enough
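A small illustration of why a mean or standard deviation alone hides the tail. The latencies below are synthetic and made up for this sketch; they are not the hotel measurements.

```python
import statistics

# Synthetic latencies in ms: 990 ordinary pings plus 10 slow outliers.
latencies = [185.0 + (i % 30) for i in range(990)]
latencies += [1500.0 + 100.0 * i for i in range(10)]

mean = statistics.mean(latencies)
sd = statistics.stdev(latencies)

# 999 cut points; cuts[k - 1] is the k/1000 quantile.
cuts = statistics.quantiles(latencies, n=1000, method="inclusive")
median, p99, p999 = cuts[499], cuts[989], cuts[998]

print(f"mean={mean:.1f} sd={sd:.1f}")
print(f"median={median:.1f} p99={p99:.1f} p99.9={p999:.1f}")
```

The mean sits close to the median, while the 99.9th percentile is more than ten times larger. Reporting a few quantiles says far more about a long-tailed distribution than any single number.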


And this histogram is hard to read


Idea – Exponential Bins

• Suppose we want relative accuracy in measurement space

• Latencies are positive and only matter within a few percent

– 1.1 ms versus 1.0 ms

– 1100 ms versus 1000 ms

• We can cheat by using floating point representations

– Compute bin using magic

– Adjust bins slightly using more magic

– Count


FloatHistogram

• Assume all measurements are in a known range

• Divide this range into power-of-2 sub-ranges

• Sub-divide each sub-range evenly with a fixed number of steps

• Relative error is bounded in measurement space

• Bin index can be computed using FP representation!
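A sketch of how a bin index can fall out of the floating-point representation, assuming positive IEEE 754 doubles. The exponent bits select the power-of-2 sub-range and the top mantissa bits select the even step inside it. This is not the FloatHistogram implementation from the t-digest library; the class name and the choice of 16 steps per sub-range are illustrative assumptions.

```python
import struct


def float_bits(x):
    """Raw 64-bit pattern of a double, as an integer."""
    return struct.unpack(">q", struct.pack(">d", x))[0]


class FloatHistogramSketch:
    def __init__(self, lo, hi, mantissa_bits=4):
        if not 0 < lo < hi:
            raise ValueError("need 0 < lo < hi")
        self.lo, self.hi = lo, hi
        # Keep the exponent plus the top mantissa bits: 2**mantissa_bits
        # evenly spaced steps inside each power-of-2 sub-range.
        self.shift = 52 - mantissa_bits
        self.base = float_bits(lo) >> self.shift
        self.counts = [0] * ((float_bits(hi) >> self.shift) - self.base + 1)

    def bin_index(self, x):
        x = min(max(x, self.lo), self.hi)  # clamp into the assumed range
        return (float_bits(x) >> self.shift) - self.base

    def add(self, x):
        self.counts[self.bin_index(x)] += 1

    def lower_bound(self, i):
        """Smallest value that lands in bin i."""
        bits = (i + self.base) << self.shift
        return struct.unpack(">d", struct.pack(">q", bits))[0]


h = FloatHistogramSketch(1.0, 4096.0)
for latency in (1.0, 1.1, 1000.0, 1100.0):
    h.add(latency)
```

With 16 steps per power of 2, each bin is at most 1/16 wider than its lower bound, so 1.0 ms versus 1.1 ms and 1000 ms versus 1100 ms are both resolved into different bins by the same histogram.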


What about visualization?


Can’t see small count bars


Good Results


Bad Results – 1% of measurements are 3x bigger


Uniform Bins


FloatHistogram Bins


With FloatHistogram


Sign Up for Next Workshop in the MLL Series

by Ted Dunning, Chief Application Architect at MapR:

Machine Learning in the Enterprise:

How to do model management in production

http://bit.ly/mapr-machine-learning-logistics-series


Additional Resources

O’Reilly report by Ted Dunning & Ellen Friedman © March 2017

Read free courtesy of MapR:

https://mapr.com/geo-distribution-big-data-and-analytics/

O’Reilly book by Ted Dunning & Ellen Friedman © March 2016

https://mapr.com/streaming-architecture-using-apache-kafka-mapr-streams/



O’Reilly book by Ted Dunning & Ellen Friedman © June 2014

https://mapr.com/practical-machine-learning-new-look-anomaly-detection/

O’Reilly book by Ellen Friedman & Ted Dunning © February 2014

https://mapr.com/practical-machine-learning/



by Ellen Friedman 8 Aug 2017 on MapR blog:

https://mapr.com/blog/tensorflow-mxnet-caffe-h2o-which-ml-best/

Interview by Thor Olavsrud in CIO:

https://www.cio.com.au/article/630299/what-dataops-collaborative-cross-functional-analytics/?fp=16&fpid=1


Read more in new book on model management:

New O’Reilly book by Ted Dunning & Ellen Friedman © September 2017

Download free pdf courtesy of MapR:

https://mapr.com/ebook/machine-learning-logistics/


Please support women in tech – help build girls’ dreams of what they can accomplish

© Ellen Friedman 2015 #womenintech #datawomen


Q&A

@mapr

Maprtechnologies

ENGAGE WITH US

@ted_dunning

Data & Analytics
