London data science

© MapR Technologies - Confidential

Super-Fast Clustering

Report from MapR workshop


For Book Discount: @ellen_friedman

Contact:
– [email protected]
– @ted_dunning

Twitter for this talk
– #mapr_uk

Slides and such:
– http://info.mapr.com/ted-uk-05-2012

http://info.mapr.com/ted-uk-2012-05


Company Background

MapR provides the industry’s best Hadoop Distribution
– Combines the best of the Hadoop community contributions with significant internally financed infrastructure development

Background of Team
– Deep management bench with extensive analytic, storage, virtualization, and open source experience
– Google, EMC, Cisco, VMware, Network Appliance, IBM, Microsoft, Apache Foundation, Aster Data, Brio, ParAccel

Proven
– MapR used across industries (Financial Services, Media, Telecom, Health Care, Internet Services, Government)
– Strategic OEM relationship with EMC and Cisco
– Over 1,000 installs


We Also Do …

Open source development
– Zookeeper
– Hadoop
– Mahout
– Stuff

Partner workshops
– Machine learning
– Information architecture
– Cluster design


The Problem

A certain bank
– had lots of customers
– had lots of prospective customers
– had a non-trivial number of fraudulent customers
– had a non-trivial number of fraudulent merchants

They also
– collected data
– built models
– collected more data
– built more models


But …

These models were arduous to build

And hard to test

So people suggested something simpler

Like k-nearest neighbor


What’s that?

Find the k nearest training examples

Use the average value of the target variable from them

This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results

Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
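
The two steps above (find the k nearest training examples, average their target values) can be sketched in a few lines. This is a minimal brute-force illustration, not the prototype from the talk: the function name `knn_predict`, the Euclidean distance, and the toy data are all assumptions made for the example.

```python
import heapq
import math


def knn_predict(query, examples, targets, k):
    """Brute-force k-nearest-neighbor regression (illustrative sketch).

    Scans every training example, keeps the k closest to `query`
    by Euclidean distance, and returns the mean of their targets.
    """
    distances = (
        (math.dist(query, x), y) for x, y in zip(examples, targets)
    )
    # nsmallest keeps only the k best pairs while scanning all examples.
    nearest = heapq.nsmallest(k, distances, key=lambda pair: pair[0])
    return sum(y for _, y in nearest) / len(nearest)


# Toy data (hypothetical): three points near the origin, two far away.
examples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
targets = [0.0, 0.0, 1.0, 1.0, 1.0]
prediction = knn_predict((0.2, 0.1), examples, targets, k=3)
```

Brute force costs one distance computation per (query, example) pair. That is 3K x 200K = 6 x 10^8 distances for the prototype workload and 20M x 25M = 5 x 10^14 for the target workload, roughly 830,000 times more work in the same time.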


What We Did

Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid

Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute

Super-fast clustering
– Kmeans, StreamingKmeans


Projection Search


K-means Search


But These Require k-means!

Need a new k-means algorithm to get speed

Streaming k-means is
– One pass (through the original data)
– Very fast (20 µs per data point with threads)
– Very parallelizable


How It Works

For each point
– Find approximately nearest centroid (distance = d)
– If d > threshold, new centroid
– Else possibly new cluster
– Else add to nearest centroid

If centroids > K ~ C log N
– Recursively cluster centroids with higher threshold

Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
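
The loop above can be sketched as follows. This is a rough illustration under stated assumptions, not Mahout's StreamingKmeans code: the names `streaming_kmeans`, `insert`, `nearest`, `max_centroids`, and `growth` are hypothetical, the "possibly new cluster" step is assumed to fire with probability d / threshold, `max_centroids` stands in for the K ~ C log N limit, and the nearest-centroid lookup is exact brute force rather than approximate.

```python
import math
import random


def nearest(centroids, point):
    """Return (index, distance) of the closest centroid (brute force)."""
    best = min(range(len(centroids)),
               key=lambda i: math.dist(centroids[i][0], point))
    return best, math.dist(centroids[best][0], point)


def insert(centroids, point, weight, threshold, rng):
    """Add one weighted point: open a new centroid or merge into the nearest."""
    if not centroids:
        centroids.append([list(point), weight])
        return
    i, d = nearest(centroids, point)
    # Always split when d > threshold; otherwise split with probability
    # d / threshold (an assumed rule for "possibly new cluster").
    if d > threshold or rng.random() < d / threshold:
        centroids.append([list(point), weight])
    else:
        mean, w = centroids[i]
        total = w + weight
        merged = [(m * w + p * weight) / total for m, p in zip(mean, point)]
        centroids[i] = [merged, total]


def streaming_kmeans(points, max_centroids, threshold, rng, growth=1.5):
    """One pass over `points`; returns a list of [mean, weight] centroids."""
    centroids = []
    for point in points:
        insert(centroids, point, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster the
            # centroids themselves, treating each as a weighted point.
            threshold *= growth
            old, centroids = centroids, []
            rng.shuffle(old)
            for mean, weight in old:
                insert(centroids, mean, weight, threshold, rng)
    return centroids


# Hypothetical data: two well-separated Gaussian blobs of 500 points each.
rng = random.Random(42)
points = [(rng.gauss(cx, 0.1), rng.gauss(cy, 0.1))
          for cx, cy in [(0.0, 0.0), (10.0, 10.0)] for _ in range(500)]
rng.shuffle(points)
sketch = streaming_kmeans(points, max_centroids=20, threshold=0.05, rng=rng)
```

Each centroid carries a weight, so the total weight always equals the number of points seen and the sketch can be fed to an ordinary weighted k-means afterwards or used directly.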


Parallel Speedup?

✓


Warning, Recursive Descent

Inner loop requires finding nearest centroid

With lots of centroids, this is slow

But wait, we have classes to accelerate that!


(Let’s not use k-means searcher, though)



Thank You
