Similarity at scale

DESCRIPTION

This is a presentation I gave at Hadoop Summit San Jose 2014, on doing fuzzy matching at large scale using combinations of Hadoop & Solr-based techniques.

Page 1: Similarity at scale

Copyright (c) 2014 Scale Unlimited.

Similarity at Scale
Fuzzy matching and recommendations using Hadoop, Solr, and heuristics

Ken Krugler
Scale Unlimited

Page 2: Similarity at scale

The Twitter Pitch

Wide class of problems that rely on "good" similarity
  Fast
  Accurate
  Scalable

Benefit from my mistakes
  Scale Unlimited - consulting & training
  Talking about solutions to real problems

Page 3: Similarity at scale

What are similarity problems?

Clustering
  Grouping similar advertisers

Deduplication
  Joining noisy sets of POI data

Recommendations
  Suggesting pages to users

Entity resolution
  Fuzzy matching of people and companies

Page 4: Similarity at scale

What is "Similarity"?

Exact matching is easy(er)
  Accuracy is a given
  Fast and scalable can still be hard
  Lots of key/value systems like Cassandra, HBase, etc.

Fuzzy matching is harder
  Two "things" aren't exactly the same
  Similarity is based on comparing features

Page 5: Similarity at scale

Between two articles?

Features could be a bag of words
Are these two articles the same?

Bosnia is the largest geographic region of the modern state with a moderate continental climate, marked by hot summers and cold, snowy winters.

The inland is a geographically larger region and has a moderate continental climate, bookended by hot summers and cold and snowy winters.

Page 6: Similarity at scale

What about now?

Easy to create challenging situations for a person
Which is an impossible problem for a computer
Need to distinguish between "conceptually similar" and "derived from"

Bosnia is the largest geographic region of the modern state with a moderate continental climate, marked by hot summers and cold, snowy winters.

Bosnia has a warm European climate, though the summers can be hot and the winters are often cold and wet.

Page 7: Similarity at scale

Between two records?

Features could be field values
Are these two people the same?

Name      Bob Bogus        Robert Bogus
Address   220 3rd Avenue   220 3rd Avenue
City      Seattle          Seattle
State     WA               WA
Zip       98104-2608       98104

Page 8: Similarity at scale

What about now?

Need to get rid of false differences caused by abbreviations
How does a computer know what's a "significant" difference?

Name      Bob Bogus               Robert H. Bogus
Address   Apt 102, 3220 3rd Ave   220 3rd Avenue South
City      Seattle                 Seattle
State     Washington              WA

Zip 98104

Page 9: Similarity at scale

Between two users?

Features could be...
  Items a user has bought
Are these two users the same?

[Figure: User 1 vs. User 2]

Page 10: Similarity at scale

What about now?

Need more generic features
  E.g. product categories

[Figure: User 1 vs. User 2]

Page 11: Similarity at scale

How to measure similarity?

Assuming you have some features for two "things"
  How does a program determine their degree of similarity?

You want a number that represents their "closeness"
  Typically 1.0 means exactly the same
  And 0.0 means completely different

Page 12: Similarity at scale

Jaccard Coefficient

Ratio of number of items in common / total number of items
  Where "items" typically means unique values (sets of things)
  So 1.0 is exactly the same, and 0.0 is completely different
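
As a minimal sketch of the ratio described above, the following treats each "thing" as a set of unique items and divides the size of the intersection by the size of the union. The function name `jaccard` and the choice to score two empty sets as 1.0 are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections, treated as sets of unique items."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty sets count as identical
    # items in common / total number of distinct items
    return len(set_a & set_b) / len(set_a | set_b)


doc1 = "hot summers and cold snowy winters".split()
doc2 = "hot summers and cold and snowy winters".split()
doc3 = "warm european climate".split()

same_words = jaccard(doc1, doc2)      # same unique words, repeated "and" is ignored
no_words_shared = jaccard(doc1, doc3)  # nothing in common
```

Because only unique values count, the repeated "and" in the second phrase does not change the score.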

Page 13: Similarity at scale

Cosine Similarity

Assume a document only has three unique words
  cat, dog, goldfish
  Set x = frequency of cat
  Set y = frequency of dog
  Set z = frequency of goldfish

The result is a "term vector" with 3 dimensions
Calculate cosine of angle between term vectors
  This is their "cosine similarity"
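
A small sketch of the calculation above, assuming raw term frequencies as the vector components (real search engines usually add further weighting). The function name `cosine_similarity` is illustrative only.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two term-frequency vectors."""
    tf_a, tf_b = Counter(words_a), Counter(words_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_a * norm_b)


doc_a = ["cat", "cat", "dog"]   # vector (2, 1, 0)
doc_b = ["cat", "dog", "dog"]   # vector (1, 2, 0)
doc_c = ["goldfish"]            # vector (0, 0, 1)
```

Here `doc_a` and `doc_b` share terms but in different proportions, so their score falls between 0 and 1, while `doc_c` is orthogonal to both.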

Page 14: Similarity at scale

Why is scalability hard?

Assume you have 8.5 million businesses in the US
  There are ≈ N^2/2 pairs to evaluate
  That's 36 trillion comparisons

Sometimes you can quickly trim this problem
  E.g. if you assume the ZIP code exists, and must match
  Then this becomes about 4 billion comparisons

But often you don't have a "magic" field
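
The pair counts above can be checked with a few lines of arithmetic. This sketch counts all unordered pairs for N records, and then counts pairs when records are only compared within groups that share a key such as ZIP code. The toy ZIP list is made up for illustration; the actual reduction depends on how records are distributed across keys.

```python
from collections import Counter


def total_pairs(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocked_pairs(block_keys):
    """Pairs to compare when only records sharing a key are compared."""
    sizes = Counter(block_keys)
    return sum(total_pairs(size) for size in sizes.values())


all_businesses = total_pairs(8_500_000)  # 36,124,995,750,000, i.e. about 36 trillion

# Toy data: six records spread over three ZIP codes.
toy_zips = ["98104", "98104", "98104", "95014", "95014", "10001"]
without_blocking = total_pairs(len(toy_zips))
with_blocking = blocked_pairs(toy_zips)
```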

Page 15: Similarity at scale

Copyright (c) 2011-2014 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.

DataStax Web Site Page Recommender

Page 16: Similarity at scale

How to recommend pages?

Besides manually adding a bunch of links...
  Which is tedious, doesn't scale well, and gets busy

Page 17: Similarity at scale

Can we exploit other users?

Classic shopping cart analysis
  "Users who bought X also bought Y"
  Based on actual activity, versus (noisy, skewed) ratings

Page 18: Similarity at scale

What's the general approach?

We have web logs with IP addresses, time, path to page
  157.55.33.39 - - [18/Mar/2014:00:01:00 -0500] "GET /solutions/nosql HTTP/1.1"

A browsing session is a series of requests from one IP address
  With some maximum time gap between requests

Find sessions "similar to" the current user's session
  Recommend pages from these similar sessions
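
One way to turn log lines into sessions is sketched below. It assumes log lines shaped like the example above, a 30-minute maximum gap (the slides do not give a value), and in-memory processing rather than a Hadoop job. Apart from the first one, the sample paths and IP address are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Captures IP address, timestamp, and requested path from a log line.
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+) [^"]*"')


def parse_line(line):
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # skip lines that are not GET requests in this format
    ip, timestamp, path = match.groups()
    return ip, datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z"), path


def sessionize(lines, max_gap=timedelta(minutes=30)):
    """Group requests into (ip, [paths]) sessions, split on long gaps."""
    requests = sorted(r for r in map(parse_line, lines) if r is not None)
    sessions = []
    prev_ip, prev_time = None, None
    for ip, when, path in requests:
        # A new IP address or a gap longer than max_gap starts a new session.
        if ip != prev_ip or when - prev_time > max_gap:
            sessions.append((ip, []))
        sessions[-1][1].append(path)
        prev_ip, prev_time = ip, when
    return sessions


sample_lines = [
    '157.55.33.39 - - [18/Mar/2014:00:01:00 -0500] "GET /solutions/nosql HTTP/1.1"',
    '157.55.33.39 - - [18/Mar/2014:00:05:00 -0500] "GET /products HTTP/1.1"',
    '157.55.33.39 - - [18/Mar/2014:02:00:00 -0500] "GET /company HTTP/1.1"',
    '10.0.0.7 - - [18/Mar/2014:00:02:00 -0500] "GET /download HTTP/1.1"',
]
sample_sessions = sessionize(sample_lines)
```

The third request from 157.55.33.39 arrives almost two hours after the second, so it starts a separate session.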

Page 19: Similarity at scale

How to find similar sessions?

Create a Lucene search index with one document per session
  Each indexed document contains the page paths for one session

  session-1   /path/to/page1, /path/to/page2, /path/to/page3
  session-2   /path/to/pageX, /path/to/pageY

Search for paths from the current user's session

Page 20: Similarity at scale

Why is this a search issue?

Solr (search in general) is all about similarity
  Find documents similar to the words in my query

Cosine similarity is used to calculate similarity
  Between the term vector for my query
  and the term vector of each document

Page 21: Similarity at scale

What's the algorithm?

Find sessions similar to the target (current user's) session
Calculate similarity between these sessions and the target session
Aggregate similarity scores for all paths from these sessions
Remove paths that are already in the target session
Recommend the highest scoring path(s)
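
The steps above can be sketched in memory as follows. This is an illustration only: it uses the Jaccard coefficient over page paths as the session similarity, whereas the talk relies on a Lucene/Solr search for that step. The function names, the `max_similar` cutoff, and the sample paths are all assumptions.

```python
from collections import defaultdict


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target_paths, sessions, top_n=3, max_similar=10):
    target = set(target_paths)
    # Steps 1-2: score every session against the target, keep the most similar.
    scored = [(jaccard(target, set(paths)), paths) for paths in sessions]
    similar = sorted((s for s in scored if s[0] > 0),
                     key=lambda s: s[0], reverse=True)[:max_similar]
    # Step 3: sum similarity scores per path across the similar sessions.
    totals = defaultdict(float)
    for similarity, paths in similar:
        # Step 4: skip paths the target session has already visited.
        for path in set(paths) - target:
            totals[path] += similarity
    # Step 5: highest total score first (ties broken by path name).
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [path for path, _ in ranked[:top_n]]


past_sessions = [["/a", "/b", "/c"], ["/a", "/b", "/d"], ["/x", "/y"], ["/a", "/c"]]
suggestions = recommend(["/a", "/b"], past_sessions)
```

In this toy data "/c" outranks "/d" because it appears in two similar sessions, so its summed score is higher.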

Page 22: Similarity at scale

Why do you sum similarities?

Give more weight to pages from sessions that are more similar
  Pages from more similar sessions are assumed to be more interesting

Page 23: Similarity at scale

What are some problems?

The classic problem is that we recommend "common" pages
  E.g. if you haven't viewed the top-level page in your session
  But this page is very common in most of the other sessions
  So then it becomes one of the top recommended pages
  But that generally stinks as a recommendation

Page 24: Similarity at scale

Can RowSimilarityJob help?

Part of the Mahout open source project
  Takes as input a table of users (one per row) with lists of items
  Generates an item-item co-occurrence matrix

Values are weights calculated using log-likelihood ratio (LLR)
  Unsurprising (common) items get low weights

If we run it on our data, where users = sessions and items = pages
  We get page-page co-occurrence matrix

          Page 1   Page 2   Page 3
Page 1    -        2.1      0.8
Page 2    2.1      -        4.5
Page 3    0.8      4.5      -
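
For readers unfamiliar with LLR, here is a sketch of the log-likelihood ratio for a 2x2 table of counts, written in its entropy form. It is not taken from Mahout's source. The parameter names are assumptions: `k11` is the number of sessions containing both pages, `k12` and `k21` the sessions containing only one of them, and `k22` the sessions containing neither.

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy: N*ln(N) minus the sum of k*ln(k)."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G-squared) for a 2x2 contingency table."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


independent = llr(10, 10, 10, 10)   # co-occurrence is exactly what chance predicts
associated = llr(100, 5, 5, 1000)   # the two pages nearly always appear together
```

When two pages co-occur no more often than chance predicts, the score is close to zero, which is why unsurprising pairs end up with low weights.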

Page 25: Similarity at scale

How to use co-occurrence?

Convert the matrix into an index
  Each row is one Lucene document
  Drop any low-scoring entries
  Create list of "related" pages

Search in Related Pages field
  Using pages from current session
  So Page 2 recommends Page 1 & 3

          Page 1   Page 2   Page 3
Page 1    -        2.1      0.8
Page 2    2.1      -        4.5
Page 3    0.8      4.5      -

          Related Pages
Page 1    Page 2
Page 2    Page 1, Page 3
Page 3    Page 2
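
The conversion above can be mimicked with plain dictionaries standing in for the Lucene index. The cutoff of 1.0 is an assumption chosen so that the 0.8 entries are dropped, as in the slide's example; the function names are illustrative.

```python
from collections import Counter

cooccurrence = {
    "Page 1": {"Page 2": 2.1, "Page 3": 0.8},
    "Page 2": {"Page 1": 2.1, "Page 3": 4.5},
    "Page 3": {"Page 1": 0.8, "Page 2": 4.5},
}


def build_related_index(matrix, min_score=1.0):
    """One 'document' per page, listing pages whose weight passes the cutoff."""
    return {page: sorted(other for other, score in row.items() if score >= min_score)
            for page, row in matrix.items()}


def recommend(session_pages, index):
    """Find documents whose related-pages field mentions pages from the session."""
    seen = set(session_pages)
    hits = Counter()
    for page, related in index.items():
        overlap = seen & set(related)
        if overlap and page not in seen:
            hits[page] += len(overlap)
    return [page for page, _ in hits.most_common()]


related_index = build_related_index(cooccurrence)
for_page_2 = recommend(["Page 2"], related_index)
```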

Page 26: Similarity at scale

EWS Entity Resolution

Page 27: Similarity at scale

What is Early Warning?

Early Warning helps banks fight fraud
  It's owned by the top 5 US banks
  And gets data from 800+ financial institutions
  So they have details on most US bank accounts

When somebody signs up for an account
  They need to quickly match the person to "known entities"
  And derive a risk score based on related account details

Page 28: Similarity at scale

Why do they need similarity?

Assume you have information on 100s of millions of entities
  Name(s), address(es), phone number(s), etc.
  And often a unique ID (Social Security Number, EIN, etc.)

Why is this a similarity problem?
  Data is noisy - typos, abbreviations, partial data
  People lie - much fraud starts with opening an account using bad data

Page 29: Similarity at scale

How does search help?

We can quickly build a list of candidate entities, using search
  Query contains field data provided by the client bank
  Significantly less than 1 second for 30 candidate entities

Then do more precise, sophisticated and CPU-intensive scoring
  The end result is a ranked list of entities with similarity scores
  Which then is used to look up account status, fraud cases, etc.

Page 30: Similarity at scale

What's the data pipeline?

Incoming data is cleaned up/normalized in Hadoop
  Simple things like space stripping
  Also phone number formatting
  ZIP+4 expansion into just ZIP plus full
  Other normalization happens inside of Solr

This gets loaded into Cassandra tables
  And automatically indexed by Solr, via DataStax Enterprise

ZIP+4        Terms
95014-2127   95014, 2127

Phone        Terms
4805551212   480, 5551212
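
The two term tables above suggest splitting each value into pieces so that partial matches still work. A minimal sketch of that splitting is shown below; the function names, the handling of a leading "1" on phone numbers, and the decision to return an empty list for unparseable input are assumptions, not details from the talk.

```python
import re


def zip_terms(zip_code):
    """Split a ZIP or ZIP+4 into index terms, e.g. 95014-2127 -> 95014, 2127."""
    digits = re.sub(r"\D", "", zip_code)
    if len(digits) == 9:
        return [digits[:5], digits[5:]]
    if len(digits) == 5:
        return [digits]
    return []  # assumption: anything else is treated as unusable


def phone_terms(phone):
    """Split a 10-digit US phone number into area code and local number."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # assumption: drop a leading country code
    if len(digits) != 10:
        return []
    return [digits[:3], digits[3:]]
```

With terms split this way, a record holding only the 5-digit ZIP can still match one that holds the full ZIP+4.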

Page 31: Similarity at scale

What's the Solr setup?

Each field in the index has very specific analysis
  Simple things like normalization
  Synonym expansion for names, abbreviations
  Split up fields so partial matches work

At query time we can weight the importance of each field
  Which helps order the top N candidates similarly to their real match scores
  E.g. an SSN matching means much more than a first name matching
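
The idea of per-field weighting can be illustrated without Solr. The sketch below is not Solr's scoring: it simply adds a weight for every field that matches exactly and keeps the top candidates. The field names and weight values are hypothetical; the default of 30 candidates echoes the figure quoted earlier in the talk.

```python
# Hypothetical weights: a matching SSN counts far more than a matching first name.
FIELD_WEIGHTS = {"ssn": 10.0, "last_name": 3.0, "zip": 2.0, "first_name": 1.0}


def weighted_score(query, candidate, weights=FIELD_WEIGHTS):
    """Sum the weights of all fields where query and candidate agree exactly."""
    score = 0.0
    for field, weight in weights.items():
        value = query.get(field)
        if value is not None and value == candidate.get(field):
            score += weight
    return score


def top_candidates(query, candidates, n=30):
    """Return the n best-scoring candidates, best first."""
    return sorted(candidates, key=lambda c: weighted_score(query, c), reverse=True)[:n]


query = {"first_name": "bob", "last_name": "bogus", "ssn": "123-45-6789"}
first_name_only = {"first_name": "bob", "last_name": "smith", "ssn": "000-00-0000"}
ssn_and_surname = {"first_name": "robert", "last_name": "bogus", "ssn": "123-45-6789"}
ranked = top_candidates(query, [first_name_only, ssn_and_surname])
```

The candidate that shares the SSN and last name ranks first even though its first name differs, which is the ordering the weighting is meant to produce.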

Page 32: Similarity at scale

Batch Similarity

Page 33: Similarity at scale

Can we do batch similarity?

Search works well for real-time similarity
  But batch processing at scale maxes out the search system

We can use two different techniques with Hadoop for batch
  SimHash - good for text document similarity
  Parallel Set-Similarity Joins - good for record similarity

Page 34: Similarity at scale

What is SimHash?

Assume a document is a set of (unique) words
Calculate a hash for each word
Probability that the minimum hash is the same for two documents...
...is magically equal to the Jaccard Coefficient

Term         Hash
bosnia       78954874223
is           53466156768
the          5064199193
largest      3193621783
geographic   -5718349925

Page 35: Similarity at scale

What is a SimHash workflow?

Calculate N hash values
  Easy way is to use the N smallest hash values

Calculate number of matching hash values between doc pairs (M)
  Then the Jaccard Coefficient is ≈ M/N

Only works if N is much smaller than # of unique words in docs

Implementation of this in cascading.utils open source project
  https://github.com/ScaleUnlimited/cascading.utils
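
The workflow above can be sketched as follows. This is not the cascading.utils implementation: the hash function (the first 8 bytes of MD5) and all names are assumptions, and M/N is only a rough estimate of the Jaccard coefficient. Elsewhere, the minimum-hash property this relies on is often described under the name MinHash.

```python
import hashlib


def word_hash(word):
    """Deterministic 64-bit hash of a word (assumption: first 8 bytes of MD5)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(words, n):
    """The n smallest hash values among the document's unique words."""
    return set(sorted({word_hash(w) for w in words})[:n])


def estimate_jaccard(sig_a, sig_b, n):
    """Rough estimate M/N, where M is the number of shared signature hashes."""
    return len(sig_a & sig_b) / n


N = 10
doc_a = [f"word{i}" for i in range(100)]
doc_b = [f"other{i}" for i in range(100)]   # shares no words with doc_a
sig_a, sig_b = signature(doc_a, N), signature(doc_b, N)
```

Only the N-value signatures need to be passed around and compared, so the cost per document pair no longer depends on document length.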

Page 36: Similarity at scale

What is Set-Similarity Join?

Joining records in two sets that are "close enough"
  aka "fuzzy join"

Requires generation of "tokens" from record field(s)
  Typically words from text

Simple implementation has three phases
  First calculate counts for each unique token value
  Then output <token, record> for N most common tokens of each record
  Group by token, compare records in each group
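
A single-machine sketch of the three phases is given below, standing in for what would be separate Hadoop jobs. Whitespace tokenization, Jaccard as the comparison, the 0.5 threshold, and all names are assumptions; the sample records are loosely based on the address example earlier in the talk, plus one made-up record.

```python
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(text):
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fuzzy_join(records, n_tokens=2, threshold=0.5):
    tokens = {rid: tokenize(text) for rid, text in records.items()}
    # Phase 1: count how often each unique token value occurs.
    counts = Counter(t for toks in tokens.values() for t in toks)
    # Phase 2: emit <token, record> for each record's n most common tokens.
    groups = defaultdict(set)
    for rid, toks in tokens.items():
        top = sorted(toks, key=lambda t: (-counts[t], t))[:n_tokens]
        for token in top:
            groups[token].add(rid)
    # Phase 3: group by token and compare the records within each group.
    scores = {}
    for rids in groups.values():
        for a, b in combinations(sorted(rids), 2):
            if (a, b) not in scores:  # avoid comparing the same pair twice
                scores[(a, b)] = jaccard(tokens[a], tokens[b])
    return {pair: s for pair, s in scores.items() if s >= threshold}


records = {
    1: "Bob Bogus 220 3rd Avenue Seattle",
    2: "Robert Bogus 220 3rd Avenue Seattle",
    3: "Alice Smith 9 Main Street Portland",
}
matches = fuzzy_join(records)
```

Records 1 and 2 land in the same token groups and are compared once; record 3 shares no selected token with them, so it is never compared at all.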

Page 37: Similarity at scale

How does fuzzy join work?

For two records to be "similar enough"...
  They need to share one of their common tokens
  Generalization of the ZIP code "magic field" approach

Basic implementation has a number of issues
  Passing around copies of full record is inefficient
  Too-common tokens create huge groups for comparison
  Two records compared multiple times

Page 38: Similarity at scale

Summary

Page 39: Similarity at scale

The Net-Net

Similarity is a common requirement for many applications
  Recommendations
  Entity matching

Combining Hadoop with search is a powerful combination
  Scalability
  Performance
  Flexibility

Page 40: Similarity at scale

Questions?

Feel free to contact me
  http://www.scaleunlimited.com/contact/

Take a look at Pat Ferrel's Hadoop + Solr recommender
  http://github.com/pferrel/solr-recommender

Check out Mahout
  http://mahout.apache.org

Read paper & code for fuzzyjoin project
  http://asterix.ics.uci.edu/fuzzyjoin/