
How to Tell Which Algorithms Really Matter

Ted Dunning, MapR Technologies

© 2014 MapR Technologies

Which Algorithms are Important? (and how can you know?)

Ted Dunning, Chief Application Architect, MapR Technologies


00:01

1.65 TB WITH 298 SERVERS


129K RECOMMENDATIONS

00:02


Advertising Automation Cloud

Sellers Cloud

Buyers Cloud

63M AD AUCTIONS

00:03


00:04

422.2K GENETIC SEQUENCES


Largest Biometric Database

00:05

4.73M AUTHENTICATIONS


But How is This Done?

What really matters?


Topic For Today

• What is important? What is not?

• Why?

• What is the difference from academic research?

• Some examples

What is Important?

• Deployable
  – Clever prototypes don’t count if they can’t be standardized

• Robust
  – Mishandling is common

• Transparent
  – Will degradation be obvious?

• Skillset and mindset matched?
  – How long will your fancy data scientist enjoy doing standard ops tasks?

• Proportionate
  – Where is the highest value per minute of effort?


Academic Goals vs Pragmatics

• Academic goals
  – Reproducible
  – Isolate theoretically important aspects
  – Work on novel problems

• Pragmatics
  – Highest net value
  – Available data is constantly changing
  – Diligence and consistency have larger impact than cleverness
  – Many systems feed themselves, so exploration and exploitation are both important
  – Engineering constraints on budget and schedule


Example 1: Making Recommendations Better


Recommendation Advances

• What are the most important algorithmic advances in recommendations over the last 10 years?

• Cooccurrence analysis?

• Matrix completion via factorization?

• Latent factor log-linear models?

• Temporal dynamics?


The Winner – None of the Above

• What are the most important algorithmic advances in recommendations over the last 10 years?

1. Result dithering (random noise)

2. Anti-flood (don’t repeat yourself)


The Real Issues

• Exploration

• Diversity

• Speed

• Not the last fraction of a percent


Result Dithering

• Dithering is used to re-order recommendation results
  – Re-ordering is done randomly

• Dithering is guaranteed to make off-line performance worse

• Dithering also has a near perfect record of making actual performance much better


“Made more difference than any other change”


Example … ε = 0.5


Example … ε = log 2 = 0.69
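The slides do not spell out the dithering formula. One simple way to get this kind of random re-ordering, sketched below as an assumption, is to score each result by the log of its original rank plus Gaussian noise with standard deviation ε, then re-sort by that score. The function name `dither` and its parameters are illustrative. With this scheme the top results mostly stay near the top, while deeper results get occasional exposure.

```python
import math
import random


def dither(ranked_items, epsilon, rng=None):
    """Randomly re-order a ranked list (hypothetical helper, not from the slides).

    Assumed scheme: synthetic score = log(rank) + Gaussian noise with
    standard deviation epsilon; a lower score means an earlier position.
    """
    rng = rng or random.Random()
    scored = []
    for rank, item in enumerate(ranked_items, start=1):
        score = math.log(rank) + rng.gauss(0.0, epsilon)
        # Keep the original rank as a tie-breaker so epsilon = 0 is a no-op.
        scored.append((score, rank, item))
    scored.sort(key=lambda entry: (entry[0], entry[1]))
    return [item for _, _, item in scored]


if __name__ == "__main__":
    results = [f"item{i}" for i in range(1, 21)]
    for eps in (0.5, math.log(2)):
        print(eps, dither(results, eps, random.Random(1))[:8])
```

Because the noise is added on a log scale, a result at rank 2 swaps places with rank 1 far more readily than a result at rank 200 does, which keeps the first page mostly intact while still sampling from later pages.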


Exploring The Second Page


Lesson 1: Exploration is good


Example 2: Bayesian Bandits


Bayesian Bandits

• Based on Thompson sampling

• Very general sequential test

• Near optimal regret

• Trade-off exploration and exploitation

• Possibly best known solution for exploration/exploitation

• Incredibly simple
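To illustrate how simple Thompson sampling can be, here is a minimal sketch for the Bernoulli (click / no click) case with a uniform Beta(1, 1) prior on each arm. The class name `BetaBandit`, the `simulate` helper, and the conversion rates are illustrative assumptions, not taken from the slides.

```python
import random


class BetaBandit:
    """Thompson sampling for Bernoulli rewards with Beta(1, 1) priors."""

    def __init__(self, n_arms, rng=None):
        self.wins = [0] * n_arms
        self.losses = [0] * n_arms
        self.rng = rng or random.Random()

    def choose(self):
        # Draw one sample from each arm's posterior and play the best draw.
        samples = [
            self.rng.betavariate(w + 1, l + 1)
            for w, l in zip(self.wins, self.losses)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        if reward:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1


def simulate(true_rates, steps, seed=0):
    """Run the bandit against hypothetical conversion rates; return pulls per arm."""
    rng = random.Random(seed)
    bandit = BetaBandit(len(true_rates), rng)
    pulls = [0] * len(true_rates)
    for _ in range(steps):
        arm = bandit.choose()
        bandit.update(arm, rng.random() < true_rates[arm])
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(simulate([0.10, 0.12, 0.20], steps=5000))
```

Exploration falls out of the posterior sampling itself: arms with little data have wide posteriors and occasionally produce the largest draw, while arms that are clearly worse are played less and less.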


Fast Convergence


Thompson Sampling on Ads

An Empirical Evaluation of Thompson Sampling - Chapelle and Li, 2011


Bayesian Bandits versus Result Dithering

• Many useful systems are difficult to frame in fully Bayesian form

• Thompson sampling cannot be applied without posterior sampling

• Can still do useful exploration with dithering

• But better to use Thompson sampling if possible


Lesson 2: Exploration is easy to do and pays big benefits.


Example 3: On-line Clustering


The Problem

• K-means clustering is useful for feature extraction or compression

• At scale and at high dimension, the desirable number of clusters increases

• Very large number of clusters may require more passes through the data

• Super-linear scaling is generally infeasible


The Solution

• Sketch-based algorithms produce a sketch of the data

• Streaming k-means uses adaptive dp-means to produce this sketch in the form of many weighted centroids which approximate the original distribution

• The size of the sketch grows very slowly with increasing data size

• Many operations such as clustering are well behaved on sketches

Fast and Accurate k-means For Large Datasets. Michael Shindler, Alex Wong, Adam Meyerson.

Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis, Michael Jordan.


An Example



Streaming k-means Ideas

• By using a sketch with lots (k log N) of centroids, we avoid pathological cases

• We still get a very good result if the sketch is created
  – in one pass
  – with approximate search

• In fact, adaptive dp-means works just fine

• In the end, the sketch can be used for clustering or …
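The one-pass sketch idea can be illustrated with a simplified, deterministic dp-means-style pass. This is a sketch under stated assumptions, not the published streaming k-means algorithm: a point farther than a distance threshold from every centroid starts a new centroid, otherwise it is merged into the nearest one, and when the sketch outgrows its budget the threshold is raised and the centroids are re-clustered among themselves. The names `streaming_sketch`, `max_centroids`, `threshold`, and `growth` are illustrative, and the nearest-centroid search is a brute-force scan rather than an approximate search.

```python
import math
import random


def _add(centroids, point, weight, threshold):
    """Merge a weighted point into the nearest centroid, or start a new one."""
    best, best_dist = None, float("inf")
    for c in centroids:
        d = math.dist(c[0], point)
        if d < best_dist:
            best, best_dist = c, d
    if best is None or best_dist > threshold:
        centroids.append([list(point), weight])
    else:
        total = best[1] + weight
        best[0] = [(a * best[1] + b * weight) / total
                   for a, b in zip(best[0], point)]
        best[1] = total


def streaming_sketch(points, max_centroids, threshold=1.0, growth=1.5):
    """Single pass over points; returns a list of [centroid, weight] pairs."""
    if max_centroids < 1 or threshold <= 0 or growth <= 1:
        raise ValueError("need max_centroids >= 1, threshold > 0, growth > 1")
    centroids = []
    for p in points:
        _add(centroids, list(p), 1.0, threshold)
        # Too many centroids: loosen the threshold and collapse the sketch.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            for vec, w in old:
                _add(centroids, vec, w, threshold)
    return centroids


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in [(0, 0), (20, 20), (40, 0)] for _ in range(500)]
    sketch = streaming_sketch(data, max_centroids=30)
    print(len(sketch), sum(w for _, w in sketch))
```

The weights always sum to the number of points seen, so a weighted k-means run on the sketch works on a few dozen centroids instead of the full data set.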


Lesson 3: Sketches make big data small.


Example 4: Search Abuse


Recommendations

Alice got an apple and a puppy

Bob got an apple

Charles got a bicycle

What else would Bob like?


Log Files

[Figure: log file entries recording the actions of Alice, Bob, and Charles]


History Matrix: Users by Items

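[Figure: history matrix with one row each for Alice, Bob, and Charles, and a ✔ in the column of every item that user interacted with]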


Co-occurrence Matrix: Items by Items

[Figure: items-by-items matrix of co-occurrence counts]

How do you tell which co-occurrences are useful?
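The history and co-occurrence matrices can be built directly from logged (user, item) events. A minimal sketch, using the Alice, Bob, and Charles example from the Recommendations slide; the function name `cooccurrence` and the event tuple layout are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(events):
    """Build a user history and item-item co-occurrence counts from log events.

    events: iterable of (user, item) pairs (assumed log format).
    Returns (history, counts), where history maps user -> set of items and
    counts maps (item_a, item_b) -> number of users who have both.
    """
    history = defaultdict(set)
    for user, item in events:
        history[user].add(item)

    counts = defaultdict(int)
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # keep the matrix symmetric
    return dict(history), dict(counts)


if __name__ == "__main__":
    log = [
        ("Alice", "apple"), ("Alice", "puppy"),
        ("Bob", "apple"),
        ("Charles", "bicycle"),
    ]
    history, counts = cooccurrence(log)
    print(history)
    print(counts)
```

Raw counts like these are dominated by popular items, which is why the next step filters for co-occurrences that are anomalous rather than merely frequent.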


Indicator Matrix: Anomalous Co-Occurrence

[Figure: indicator matrix with the anomalous co-occurrences marked ✔]

Result: The marked row will be added to the indicator field in the item document…
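The slide does not say which test marks a co-occurrence as anomalous. One widely used choice, assumed here, is the log-likelihood ratio (G²) test on a 2×2 contingency table for a pair of items A and B. The function names and the example counts below are illustrative.

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table.

    k11: users with both A and B     k12: users with A but not B
    k21: users with B but not A      k22: users with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


if __name__ == "__main__":
    # Hypothetical counts: A and B appear together far more often than chance.
    print(llr(k11=20, k12=80, k21=30, k22=9870))
    # Hypothetical counts: A and B are independent.
    print(llr(k11=10, k12=10, k21=10, k22=10))
```

Item pairs whose score exceeds a chosen cutoff would be marked in the indicator matrix; the cutoff itself is a tuning decision that the slides do not specify.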


Indicator Matrix

id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)

That one row from the indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine.

Note: data for the indicator field is added directly to the metadata for a document in the Solr index. You don’t need to create a separate index for the indicators.
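With indicators stored as an ordinary field, a recommendation request becomes a search whose query terms are the items in the user's recent history. The sketch below only builds a query string in standard Lucene syntax; the helper name `indicator_query` is hypothetical, item IDs are assumed to be simple tokens such as `t1` that need no escaping, and sending the query to Solr is left out.

```python
def indicator_query(recent_item_ids, field="indicators"):
    """Build a Lucene-style query matching documents whose indicator
    field contains any of the user's recent item IDs.

    Returns None when there is no history to query with.
    """
    # Drop repeats while keeping the order of first appearance.
    ids = list(dict.fromkeys(recent_item_ids))
    if not ids:
        return None
    return f"{field}:({' OR '.join(ids)})"


if __name__ == "__main__":
    # A user whose history contains item t1 would match the puppy document (t4).
    print(indicator_query(["t1"]))
    print(indicator_query(["t1", "t3", "t1"]))
```

The search engine's relevance ranking then orders the matching items, so no separate recommendation server is needed at query time.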


Internals of the Recommender Engine


Real-life example


Lesson 4: Recursive search abuse pays

Search can implement recs, which can implement search


How Does This Apply?


How Can I Start?


Q & A

@ted_dunning @mapr maprtech

[email protected]

Engage with us!

MapR

maprtech

mapr-technologies
