
Locality Sensitive Hashing By Spark


Page 1: Locality Sensitive Hashing By Spark

Alain Rodriguez, Fraud Platform, Uber
Kelvin Chu, Hadoop Platform, Uber

Locality Sensitive Hashing by Spark

June 08, 2016

Page 2: Locality Sensitive Hashing By Spark

Overlapping Routes
Finding similar trips in a city

Page 3: Locality Sensitive Hashing By Spark

The problem
Detect trips with a high degree of overlap

We are interested in detecting trips that have various degrees of overlap.

• Large number of trips

• Noisy, inconsistent GPS data

• Not looking for exact matches

• Directionality is important

Page 4: Locality Sensitive Hashing By Spark

Input Data
Millions of trips scattered over time and space

GPS traces are represented as an ordered list of (latitude,longitude,time) tuples.

• Coordinates are real-valued and noisy

• Traces can be dense or sparse, yet overlapping

• Large time and geographic search space

[
  {
    "latitude": 25.7613453844,
    "epoch": 1446577692,
    "longitude": -80.197244976
  },
  {
    "latitude": 25.7613489535,
    "epoch": 1446577693,
    "longitude": -80.1972450862
  },
  …
]

Page 5: Locality Sensitive Hashing By Spark

Google S2 Cells
Efficient geo hashing

Divides the world into consistently sized regions. Area segments can be chosen at different sizes (S2 cell levels).

Page 6: Locality Sensitive Hashing By Spark

Jaccard index
Set similarity coefficient

The Jaccard index can be used as a measure of set similarity A = {a, b, c}, B = {b, c, d}, C = {c, d, e}

J(A, A) = 1.0
J(A, B) = 0.5
J(A, C) = 0.2
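As a quick check of the figures above, a minimal Python sketch of the Jaccard index, using the slide's example sets:

```python
def jaccard(a, b):
    """Jaccard index: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

# The slide's example sets
A, B, C = {"a", "b", "c"}, {"b", "c", "d"}, {"c", "d", "e"}
print(jaccard(A, A))  # 1.0
print(jaccard(A, B))  # 0.5
print(jaccard(A, C))  # 0.2
```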

Page 7: Locality Sensitive Hashing By Spark

Heuristic
Densify sparse traces

Sparse and dense traces should be matched: ensure points are at most X distance apart.

Different devices generate varying data densities. Two segments that start and end at the same location should be detected as overlapping.

Densification ensures that contiguous segments overlap independently of sampling density.

[Diagram: a sparse trace A and a dense trace B over the same route, before and after densification]

Page 8: Locality Sensitive Hashing By Spark

Heuristic
Remove contiguous duplicates

Remove noise resulting from a vehicle stopped at a light or a very chatty device.

Discretize route segments

Break down routes into equal-size area segments; this eliminates route noise. Segment size determines matching sensitivity.
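The duplicate-removal heuristic is a one-liner in practice; a sketch, with made-up cell IDs for illustration:

```python
from itertools import groupby

def remove_contiguous_duplicates(cells):
    """Collapse runs of identical consecutive cell IDs, e.g. the repeats
    emitted by a vehicle stopped at a light or a very chatty device."""
    return [cell for cell, _run in groupby(cells)]

trace = ["c1", "c1", "c1", "c2", "c3", "c3", "c1"]
print(remove_contiguous_duplicates(trace))  # ['c1', 'c2', 'c3', 'c1']
```

Note that only contiguous repeats are dropped: the trailing "c1" survives because revisiting a cell later is real signal, not noise.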

Page 9: Locality Sensitive Hashing By Spark

Heuristic
Shingle contiguous area segments

Directionality matters: shingling captures directionality.

Two overlapping trips with opposite directions should not be matched.

Combining contiguous segments captures the sequence of moves from one segment to another.

Trip A (forward): segments 1 2 3 4 5 6 7 8 → shingles 1->2, 2->3, 3->4, 4->5, 5->6, 6->7, 7->8

Trip B (reverse): the same segments traversed in the opposite direction → shingles 2->1, 3->2, 4->3, 5->4, 6->5, 7->6, 8->7

The two shingle sets share no elements, so the opposite-direction trips do not match.
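The shingling step above can be sketched as (segment IDs as in the figure):

```python
def shingles(segments):
    """Pair each area segment with its successor; the ordered pair
    captures the direction of travel."""
    return {f"{a}->{b}" for a, b in zip(segments, segments[1:])}

forward = shingles([1, 2, 3, 4])   # {'1->2', '2->3', '3->4'}
reverse = shingles([4, 3, 2, 1])   # {'4->3', '3->2', '2->1'}
print(forward & reverse)           # set(): opposite directions share nothing
```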

Page 10: Locality Sensitive Hashing By Spark

Set overlap problem
Find traces that have the desired level of common shingles

Shingles: 1->2, 2->3, 3->4, 4->5, 5->6, 6->7, 7->8, 8->9, 9->10

Page 11: Locality Sensitive Hashing By Spark

N^2 takes forever
LSH to the rescue

● Sifting through a month’s worth of trips for a city takes forever with the N^2 approach

● Locality-Sensitive Hashing allows us to find most matches quickly. Spark provides the perfect engine.

Page 12: Locality Sensitive Hashing By Spark

Locality-Sensitive Hashing (LSH)
Quick Introduction

Page 13: Locality Sensitive Hashing By Spark

Problem - Near Neighbors Search

Set of Points P

Distance Function D

Query Point Q

Page 14: Locality Sensitive Hashing By Spark

Problem - Clustering

Set of Points P

Distance Function D

Page 15: Locality Sensitive Hashing By Spark

Curse of Dimensionality

1-Dimension e.g. single integer

Q: 7 Distance: 3

A Solution: Binary Tree e.g. Return 9, 4, 8, ...

2-Dimension e.g. GPS point

Q: (12.73, 61.45) Distance: 10

A Solution: Quadtree, R-tree, etc

Page 16: Locality Sensitive Hashing By Spark

Curse of Dimensionality

How about very high dimension?

1->2 2->3 3->4 4->5 5->6 6->7 7->8

Very hard problem

A trip often has thousands of shingles (a vector with ~3k dimensions)

Page 17: Locality Sensitive Hashing By Spark

Approximate Solution

Trip T1 & Trip T2 are similar: D(T1, T2) is small.

With high probability, T1 and T2 are hashed into the same bucket: h(T1) and h(T2) both land in Bucket1.

Page 18: Locality Sensitive Hashing By Spark

Approximate Solution

Trip T1 & Trip T2 are not similar: D(T1, T2) is large.

With high probability, T1 and T2 are hashed into different buckets: h(T1) lands in Bucket1 and h(T2) in Bucket2.

Page 19: Locality Sensitive Hashing By Spark

Some distance functions have good companion hash functions.

For Jaccard distance, it is the MinHash function.

Page 20: Locality Sensitive Hashing By Spark

MinHash(S) = min { h(x) for all x in the set S }

h(x) is a hash function such as (ax + b) % m, where a and b are well-chosen constants and m is the number of hash bins

Example:
S = {26, 88, 109}
h(x) = (2x + 7) % 8
MinHash(S) = min {3, 7, 1} = 1
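The slide's example, directly in Python:

```python
def minhash(s, a, b, m):
    """MinHash(S) = min{(a*x + b) % m for x in S}."""
    return min((a * x + b) % m for x in s)

S = {26, 88, 109}
print(minhash(S, a=2, b=7, m=8))  # 1, the min of {3, 7, 1}
```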

Page 21: Locality Sensitive Hashing By Spark

Some Other Examples

Distance  →  Hash Function
Jaccard   →  MinHash
Hamming   →  i-th value of vector x
Cosine    →  Sign of the dot product of x and a random vector

Page 22: Locality Sensitive Hashing By Spark

How to increase and control the probability?

It turns out the solution is very intuitive.

Page 23: Locality Sensitive Hashing By Spark

Use Multiple Hashes

[Diagram: T1 and T2 are hashed twice, by h1 and by h2, into buckets; a similar pair only needs to collide under one of the hashes]

Both h1 and h2 are MinHash, but with different parameters (e.g. a & b)

Page 24: Locality Sensitive Hashing By Spark

Our Approach of LSH on Spark

Page 25: Locality Sensitive Hashing By Spark

Shuffle Keys

● RDD[Trip]
● The hash values are shuffle keys
● h1 and h2 have non-overlapping key ranges
● groupByKey()

[Diagram: the key range is partitioned into an h1 range, an h2 range, and ranges for the other hashes; h1(T1), h1(T2), h2(T1), h2(T2) fall into their respective ranges]
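A pure-Python sketch of this flow, with a dict of buckets standing in for Spark's groupByKey(); the (a, b) hash parameters and trips are made up for illustration:

```python
from collections import defaultdict
from itertools import combinations

HASHES = [(2, 7), (5, 3), (11, 13)]   # hypothetical (a, b) per MinHash
M = 104729                            # number of hash bins (arbitrary prime)

def minhash(shingle_set, a, b):
    return min((a * hash(s) + b) % M for s in shingle_set)

def candidate_pairs(trips):
    """trips: {trip_id: shingle set}. The shuffle key is (hash index, value),
    which gives every hash a non-overlapping key range as on the slide; the
    dict of buckets plays the role of groupByKey()."""
    buckets = defaultdict(list)
    for trip_id, shingle_set in trips.items():
        for i, (a, b) in enumerate(HASHES):
            buckets[(i, minhash(shingle_set, a, b))].append(trip_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

trips = {
    "T1": {"1->2", "2->3", "3->4"},
    "T2": {"1->2", "2->3", "3->4"},   # identical to T1: collides under every hash
    "T3": {"9->8", "8->7", "7->6"},
}
print(candidate_pairs(trips))
```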

Page 26: Locality Sensitive Hashing By Spark

Post Processing

Bucket1 T1, T2

● If T1 and T2 are hashed into the same bucket, it’s likely that they are similar.

● Compute the Jaccard distance.

Page 27: Locality Sensitive Hashing By Spark

Approach 2

● The same pair of trips is matched in both the h1 and h2 buckets
● Use one more shuffle to dedup
● Trade-off: network usage vs. distance computation

[Diagram: h1(T1), h1(T2), h2(T1), h2(T2) mapped into the h1, h2, and other hash key ranges]

Page 28: Locality Sensitive Hashing By Spark

Approach 3

● Don’t send the actual trip vector in the LSH and Dedup shuffles

● Send only the trip ID

● After dedup, join back with the trip objects with one more shuffle

○ Then compute the Jaccard distance of each pair of matched trips.

● When the trip object is large, Approach 3 saves a lot of network usage.
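A minimal sketch of Approach 3's dedup-then-join, with plain Python collections standing in for the RDD shuffles; the trip IDs and shingle sets are made up:

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

# LSH emits candidate pairs of trip IDs only; the same pair can appear
# once per bucket it collides in.
candidates = [("T1", "T2"), ("T2", "T1"), ("T1", "T2")]

# Dedup shuffle: normalize pair order, then keep unique pairs.
unique_pairs = {tuple(sorted(p)) for p in candidates}

# Join back with the trip objects (a dict stands in for the RDD join),
# then score each surviving pair.
trips = {"T1": {"1->2", "2->3"}, "T2": {"1->2", "2->3", "3->4"}}
scored = {(x, y): jaccard(trips[x], trips[y]) for x, y in unique_pairs}
print(scored)
```

Because only IDs flow through the LSH and dedup shuffles, the full trip objects cross the network once, in the final join.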

Page 29: Locality Sensitive Hashing By Spark

How to Generate Thousands of Hash Functions

● Naive approach

○ Generate thousands of (a, b, m) tuples

● Cache friendly approach - CPU register/L1/L2

○ Generate only two hash functions

○ h1(x) = (a1x + b1) % m1

○ h2(x) = (a2x + b2) % m2

hi(x) = h1(x) + i * h2(x), for i from 1 to the number of hash functions
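A sketch of the trick; the constants here are arbitrary illustrations, not Uber's:

```python
M1, M2 = 4294967291, 4294967279   # two large primes (arbitrary choices)
A1, B1, A2, B2 = 1103, 12345, 2053, 6789

def h1(x): return (A1 * x + B1) % M1
def h2(x): return (A2 * x + B2) % M2

def hi(x, i):
    """i-th derived hash; only h1 and h2 are ever evaluated, so their
    parameters stay hot in CPU registers / L1 / L2."""
    return h1(x) + i * h2(x)

# 1000 MinHash values for one set, from just two base hash functions
signature = [min(hi(x, i) for x in {26, 88, 109}) for i in range(1, 1001)]
print(len(signature))  # 1000
```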

Page 30: Locality Sensitive Hashing By Spark

Other Features

● Amplification

○ Improve the probabilities
○ Reduce computation, memory and network used in final post-processing
○ More hashing (usually insignificant compared to the cost of final post-processing)

● Near Neighbors Search

○ Used in information retrieval and instance-based machine learning
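The deck doesn't spell out the amplification math; the standard banding analysis (b bands of r MinHashes each, with parameters chosen here purely for illustration) shows how the match probability sharpens around a threshold:

```python
def match_probability(s, r, b):
    """Probability that two items with Jaccard similarity s become
    candidates when using b bands of r MinHashes each: 1 - (1 - s^r)^b."""
    return 1 - (1 - s ** r) ** b

# A similar pair (s = 0.8) is almost always caught, while a dissimilar
# pair (s = 0.3) rarely survives to the final post-processing step.
print(match_probability(0.8, r=5, b=20))
print(match_probability(0.3, r=5, b=20))
```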

Page 31: Locality Sensitive Hashing By Spark

Other Applications of LSH

● Search for top K similar items

○ Documents, images, time-series, etc

● Cluster similar documents

○ Similar news articles, mirror web pages, etc

● Products recommendation

○ Collaborative filtering

Page 32: Locality Sensitive Hashing By Spark

Future Work

● Migrate to Spark ML API

○ DataFrame as a first-class citizen
○ Integrate it into Spark

● Low latency inserts with Spark Streaming

○ Avoid re-hashing when new objects are streaming in

Page 33: Locality Sensitive Hashing By Spark

Thank you

Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.