Super-Fast Clustering: Report from MapR workshop

London data science

DESCRIPTION

Talk given to the London Data Science meetup about recent work on k-nearest neighbor search and fast clustering.

Page 1: London data science

©MapR Technologies - Confidential

Super-Fast Clustering: Report from MapR workshop

Page 2: London data science

For book discount: @ellen_friedman

Contact:
– [email protected]
– @ted_dunning

Twitter for this talk
– #mapr_uk

Slides and such:
– http://info.mapr.com/ted-uk-05-2012

Page 3: London data science

Company Background

MapR provides the industry’s best Hadoop distribution
– Combines the best of the Hadoop community contributions with significant internally financed infrastructure development

Background of team
– Deep management bench with extensive analytic, storage, virtualization, and open source experience
– Google, EMC, Cisco, VMWare, Network Appliance, IBM, Microsoft, Apache Foundation, Aster Data, Brio, ParAccel

Proven
– MapR used across industries (Financial Services, Media, Telecom, Health Care, Internet Services, Government)
– Strategic OEM relationship with EMC and Cisco
– Over 1,000 installs

Page 4: London data science

We Also Do …

Open source development
– Zookeeper
– Hadoop
– Mahout
– Stuff

Partner workshops
– Machine learning
– Information architecture
– Cluster design

Page 6: London data science

The Problem

A certain bank
– had lots of customers
– had lots of prospective customers
– had a non-trivial number of fraudulent customers
– had a non-trivial number of fraudulent merchants

They also
– collected data
– built models
– collected more data
– built more models

Page 7: London data science

But …

These models were arduous to build

And hard to test

So people suggested something simpler

Like k-nearest neighbor

Page 8: London data science

What’s that?

Find the k nearest training examples

Use the average value of the target variable from them

This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results

Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
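
To make the method and its cost concrete, here is a minimal brute-force sketch in Python. It is not the prototype from the talk; the function name `knn_predict` and the toy data are hypothetical. Every query is compared against every training example, so the work grows as queries x examples: 3K x 200K is 6 x 10^8 distance computations, while 20M x 25M is 5 x 10^14, roughly 830,000 times more.

```python
import heapq
import math


def knn_predict(query, examples, k):
    """Average the target over the k training examples nearest to `query`.

    `examples` is a list of (features, target) pairs. This is a plain
    brute-force scan: one distance computation per training example.
    """
    if not examples or k < 1:
        raise ValueError("need at least one example and k >= 1")
    # Keep only the k examples with the smallest Euclidean distance.
    nearest = heapq.nsmallest(k, examples, key=lambda ex: math.dist(query, ex[0]))
    return sum(target for _, target in nearest) / len(nearest)


# Toy training set (hypothetical): two groups with targets 0.0 and 1.0.
train = [((0.0, 0.0), 0.0), ((0.0, 1.0), 0.0), ((5.0, 5.0), 1.0), ((5.0, 6.0), 1.0)]
score = knn_predict((0.0, 0.2), train, k=2)  # both neighbors have target 0.0
```

Nothing here needs tuning beyond k, which is the "easy" part; the scan over all examples for every query is the "hard" part.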

Page 9: London data science

What We Did

Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid

Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute

Super-fast clustering
– Kmeans, StreamingKmeans

Page 10: London data science

Projection Search
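
The slide's figure did not survive, so the following is only a sketch of the general idea behind searching by random projection, under stated assumptions: project every vector onto a few random directions, keep each projection sorted, look near the query's position in each sorted list to collect candidates, then rank the candidates by true distance. The class name, parameters, and defaults are hypothetical and are not Mahout's API.

```python
import bisect
import math
import random


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


class ProjectionSearchSketch:
    """Approximate nearest-neighbor search via sorted random projections."""

    def __init__(self, vectors, num_projections=4, search_size=10, seed=0):
        rng = random.Random(seed)
        dim = len(vectors[0])
        self.vectors = [tuple(v) for v in vectors]
        self.search_size = search_size
        # Random Gaussian directions (an assumption; any random directions would do).
        self.directions = [[rng.gauss(0, 1) for _ in range(dim)]
                           for _ in range(num_projections)]
        # One table per direction: (projected value, vector index), sorted by value.
        self.tables = [sorted((dot(d, v), i) for i, v in enumerate(self.vectors))
                       for d in self.directions]

    def search(self, query, k):
        """Return indices of up to k candidates, closest first by true distance."""
        candidates = set()
        for d, table in zip(self.directions, self.tables):
            pos = bisect.bisect_left(table, (dot(d, query),))
            lo = max(0, pos - self.search_size)
            hi = min(len(table), pos + self.search_size)
            candidates.update(i for _, i in table[lo:hi])
        # Only the candidates get a full distance computation.
        return sorted(candidates, key=lambda i: math.dist(query, self.vectors[i]))[:k]
```

Each query costs a few binary searches plus a small number of full distance computations instead of a scan of every vector; the price is that a true neighbor can be missed when it is not close to the query in any of the projections.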

Page 11: London data science

K-means Search

Page 12: London data science

But These Require k-means!

Need a new k-means algorithm to get speed

Streaming k-means is
– One pass (through the original data)
– Very fast (20 µs per data point with threads)
– Very parallelizable

Page 13: London data science

How It Works

For each point
– Find approximately nearest centroid (distance = d)
– If d > threshold, new centroid
– Else possibly new cluster
– Else add to nearest centroid

If the number of centroids exceeds K ~ C log N
– Recursively cluster centroids with higher threshold

Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering the original data
– or we can just use the result directly
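
The steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the Mahout code: the slide does not say how "possibly" is decided, so the sketch assumes a new centroid is created with probability d / threshold; the growth factor of 1.5, the brute-force nearest-centroid scan, and the function names are likewise hypothetical. Collapsing is done by re-inserting the existing centroids, with their weights, under the larger threshold.

```python
import math
import random


def insert(centroids, vec, weight, threshold, rng):
    """Add one weighted point to the centroid list, following the slide's cases."""
    if not centroids:
        centroids.append([list(vec), weight])
        return
    # Brute-force nearest centroid; the talk uses an approximate searcher here.
    nearest = min(centroids, key=lambda c: math.dist(c[0], vec))
    d = math.dist(nearest[0], vec)
    if d > threshold or rng.random() < d / threshold:
        # Far away, or picked at random (assumed probability d / threshold).
        centroids.append([list(vec), weight])
    else:
        # Close: fold the point into the nearest centroid as a weighted mean.
        total = nearest[1] + weight
        nearest[0] = [(c * nearest[1] + v * weight) / total
                      for c, v in zip(nearest[0], vec)]
        nearest[1] = total


def streaming_kmeans(points, max_centroids, threshold, growth=1.5, seed=0):
    """One pass over `points`; returns a list of [centroid vector, weight]."""
    if threshold <= 0 or max_centroids < 1:
        raise ValueError("threshold must be > 0 and max_centroids >= 1")
    rng = random.Random(seed)
    centroids = []
    for p in points:
        insert(centroids, p, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and cluster the centroids.
            threshold *= growth
            old = centroids
            rng.shuffle(old)
            centroids = []
            for vec, w in old:
                insert(centroids, vec, w, threshold, rng)
    return centroids
```

A caller would pick `max_centroids` along the lines of `int(C * math.log(N))` for some constant C. The returned weights always sum to the number of input points, which is what lets the centroid set stand in for the original distribution.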

Page 14: London data science

Parallel Speedup?

Page 16: London data science

Warning, Recursive Descent

Inner loop requires finding nearest centroid

With lots of centroids, this is slow

But wait, we have classes to accelerate that!

(Let’s not use k-means searcher, though)

Page 17: London data science

Contact:
– [email protected]
– @ted_dunning

Slides and such:
– http://info.mapr.com/ted-uk-05-2012

Page 18: London data science

Thank You