Approximate nearest neighbors & vector models

Approximate nearest neighbor methods and vector models – NYC ML meetup


Page 1: Approximate nearest neighbor methods and vector models – NYC ML meetup

Approximate nearest neighbors & vector models

Page 2:

I’m Erik

• @fulhack

• Author of Annoy, Luigi

• Currently CTO of Better

• Previously 5 years at Spotify

Page 3:

What’s nearest neighbor(s)

• Let’s say you have a bunch of points

Page 4:

Grab a bunch of points

Page 5:

5 nearest neighbors

Page 6:

20 nearest neighbors

Page 7:

100 nearest neighbors

Page 8:

…But what’s the point?

• Vector models are everywhere

• Lots of applications (language processing, recommender systems, computer vision)

Page 9:

MNIST example

• 28x28 = 784-dimensional dataset

• Define distance in terms of pixels:
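In pixel space, the distance between two digits is just the distance between their flattened 784-dimensional vectors; a minimal sketch, with random arrays standing in for MNIST images:

```python
import numpy as np

# Random 28x28 grayscale images standing in for two MNIST digits.
a = np.random.rand(28, 28).reshape(-1)  # flatten to a 784-dim vector
b = np.random.rand(28, 28).reshape(-1)

# Distance "in terms of pixels": plain Euclidean distance between the vectors.
dist = np.linalg.norm(a - b)
```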

Page 10:

MNIST neighbors

Page 11:

…Much better approach

1. Start with high dimensional data

2. Run dimensionality reduction to 10-1000 dims

3. Do stuff in the low-dimensional space

Page 12:

Deep learning for food

• Deep model trained on a GPU on 6M random pics downloaded from Yelp

[Figure: CNN architecture: 3x3 convolutions with 2x2 maxpooling (156x156x32 down to 2x2x512), fully connected layers with dropout (2048, 2048), a 128-dimensional bottleneck layer, and a 1244-unit output]

Page 13:

Distance in smaller space

1. Run image through the network

2. Use the 128-dimensional bottleneck layer as an item vector

3. Use cosine distance in the reduced space
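The three steps above amount to extracting a vector and comparing directions; a minimal numpy sketch (`cosine_distance` is an illustrative helper, not Annoy's API):

```python
import numpy as np

def cosine_distance(u, v):
    # 1 minus cosine similarity; 0 means the vectors point the same way.
    return 1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Stand-ins for the 128-dim bottleneck activations of two food pics.
x = np.random.rand(128)
y = np.random.rand(128)
d = cosine_distance(x, y)
```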

Page 14:

Nearest food pics

Page 15:

Vector methods for text

• TF-IDF (old) – no dimensionality reduction

• Latent Semantic Analysis (1988)

• Probabilistic Latent Semantic Analysis (2000)

• Semantic Hashing (2007)

• word2vec (2013), RNN, LSTM, …

Page 16:

Represent documents and/or words as f-dimensional vectors

[Figure: words as points in a 2D space of latent factors 1 and 2; banana and apple close together, boat far away]

Page 17:

Vector methods for collaborative filtering

• Supervised methods: See everything from the Netflix Prize

• Unsupervised: Use NLP methods

Page 18:

CF vectors – examples

IPMF item-item:

$P(i \to j) = \exp(b_j^T b_i)/Z_i = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}$

Vectors: $p_{ui} = a_u^T b_i$

$\mathrm{sim}_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i| |b_j|}$, computed in $O(f)$

i                       j                         sim_ij
2pac                    2pac                      1.0
2pac                    Notorious B.I.G.          0.91
2pac                    Dr. Dre                   0.87
2pac                    Florence + the Machine    0.26
Florence + the Machine  Lana Del Rey              0.81

IPMF item-item MDS:

$P(i \to j) = \exp(-|b_j - b_i|^2)/Z_i = \frac{\exp(-|b_j - b_i|^2)}{\sum_k \exp(-|b_k - b_i|^2)}$

$\mathrm{sim}_{ij} = -|b_j - b_i|^2$

Input data: (u, i, count) triples

Page 19:

Geospatial indexing

• Ping the world: https://github.com/erikbern/ping

• k-NN regression using Annoy

Page 20:

Nearest neighbors the brute force way

• We can always do an exhaustive search to find the nearest neighbors

• Imagine MySQL doing a linear scan for every query…
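A brute-force search really is just a linear scan over every point; a minimal numpy sketch (`brute_force_knn` is an illustrative name):

```python
import numpy as np

def brute_force_knn(query, points, k):
    # Exhaustive search: compute the distance to every point, O(n * f) per query.
    dists = np.linalg.norm(points - query, axis=1)
    return np.argsort(dists)[:k]

points = np.random.rand(10000, 128)
nearest = brute_force_knn(points[0], points, 5)
# points[0] is trivially its own nearest neighbor
```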

Page 21:

Using word2vec’s brute force search

$ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin

Qiantang_River    0.597229
Yangtse           0.587990
Yangtze_River     0.576738
lake              0.567611
rivers            0.567264
creek             0.567135
Mekong_river      0.550916
Xiangjiang_River  0.550451
Beas_river        0.549198
Minjiang_River    0.548721

real  2m34.346s
user  1m36.235s
sys   0m16.362s

Page 22:

Introducing Annoy

• https://github.com/spotify/annoy

• mmap-based ANN library

• Written in C++, with Python and R bindings

• 585 stars on GitHub

Page 23:

Using Annoy’s search

$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000

Yangtse           0.907756
Yangtze_River     0.920067
rivers            0.930308
creek             0.930447
Mekong_river      0.947718
Huangpu_River     0.951850
Ganges            0.959261
Thu_Bon           0.960545
Yangtze           0.966199
Yangtze_river     0.978978

real  0m0.470s
user  0m0.285s
sys   0m0.162s

Page 24:

Using Annoy’s search

$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000

Qiantang_River    0.897519
Yangtse           0.907756
Yangtze_River     0.920067
lake              0.929934
rivers            0.930308
creek             0.930447
Mekong_river      0.947718
Xiangjiang_River  0.948208
Beas_river        0.949528
Minjiang_River    0.950031

real  0m2.013s
user  0m1.386s
sys   0m0.614s

Page 25:

(performance)

Page 26:

1. Building an Annoy index

Page 27:

Start with the point set

Page 28:

Split it in two halves
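A split like this can be sketched as the hyperplane equidistant between two randomly sampled points, with the point difference as its normal vector; a toy illustration (the helper names are mine, not Annoy's):

```python
import numpy as np

# Two randomly sampled points standing in for points from the dataset.
p, q = np.random.rand(64), np.random.rand(64)

# The splitting hyperplane is equidistant between p and q:
# normal vector p - q, passing through their midpoint.
normal = p - q
midpoint = (p + q) / 2.0

def side(x):
    # True if x falls on p's side of the hyperplane.
    return (x - midpoint) @ normal > 0
```

By construction `side(p)` is True and `side(q)` is False, so the two pivots always land in different halves.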

Page 29:

Split again

Page 30:

Again…

Page 31:

…more iterations later

Page 32:

Side note: making trees small

• Split until K items in each leaf (K~100)

• Takes O(n/K) memory instead of O(n)

Page 33:

Binary tree

Page 34:

2. Searching

Page 35:

Nearest neighbors

Page 36:

Searching the tree

Page 37:
Page 38:

Problemo

• The point that’s the closest isn’t necessarily in the same leaf of the binary tree

• Two points that are really close may end up on different sides of a split

• Solution: go to both sides of a split if it’s close

Page 39:
Page 40:
Page 41:
Page 42:

Trick 1: Priority queue

• Traverse the tree using a priority queue

• sort by min(margin) for the path from the root

Page 43:

Trick 2: many trees

• Construct trees randomly many times

• Use the same priority queue to search all of them at the same time

Page 44:
Page 45:

heap + forest = best

• Since we use a priority queue, we dive down the best splits (largest margins) first

• More trees always helps!

• The only constraint is that more trees require more RAM

Page 46:

Annoy query structure

1. Use priority queue to search all trees until we’ve found k items

2. Take union and remove duplicates (a lot)

3. Compute distance for remaining items

4. Return the nearest n items
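The four steps above can be sketched end to end; this is a toy, self-contained implementation assuming Euclidean distance and random two-point splits, not Annoy's actual code:

```python
import heapq
import itertools
import numpy as np

K = 50                       # max items per leaf
counter = itertools.count()  # tie-breaker so the heap never compares nodes

def build(points, ids):
    # Recursively split by the hyperplane between two random points.
    if len(ids) <= K:
        return ('leaf', ids)
    i, j = np.random.choice(ids, 2, replace=False)
    normal, mid = points[i] - points[j], (points[i] + points[j]) / 2.0
    side = (points[ids] - mid) @ normal > 0
    if side.all() or not side.any():          # degenerate split: just make a leaf
        return ('leaf', ids)
    return ('node', normal, mid, build(points, ids[side]), build(points, ids[~side]))

def query(trees, points, q, search_k, n):
    # 1. One shared priority queue over all trees, keyed by the minimum
    #    signed margin along the path from the root (max-heap via negation).
    heap = [(-np.inf, next(counter), t) for t in trees]
    heapq.heapify(heap)
    candidates = set()
    while heap and len(candidates) < search_k:
        neg_pri, _, node = heapq.heappop(heap)
        pri = -neg_pri
        if node[0] == 'leaf':
            candidates.update(node[1].tolist())
        else:
            _, normal, mid, left, right = node
            d = (q - mid) @ normal            # signed margin of the query
            heapq.heappush(heap, (-min(pri, d), next(counter), left))
            heapq.heappush(heap, (-min(pri, -d), next(counter), right))
    # 2./3. Union of leaves (the set removes duplicates), then exact distances.
    cand = np.array(sorted(candidates))
    order = np.argsort(np.linalg.norm(points[cand] - q, axis=1))
    # 4. Return the nearest n items.
    return cand[order][:n]

pts = np.random.rand(2000, 32)
forest = [build(pts, np.arange(2000)) for _ in range(10)]
nn = query(forest, pts, pts[0], search_k=200, n=10)
# pts[0] should come back as its own nearest neighbor
```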

Page 47:

Find candidates

Page 48:

Take union of all leaves

Page 49:

Compute distances

Page 50:

Return nearest neighbors

Page 51:

“Curse of dimensionality”

Page 52:
Page 53:
Page 54:

Are we screwed?

• Would be nice if the data had a much smaller “intrinsic dimension”!

Page 55:
Page 56:

Improving the algorithm

[Plot: queries per second vs. 1-NN accuracy; up is faster, right is more accurate]

Page 57:

ann-benchmarks

• https://github.com/erikbern/ann-benchmarks

Page 58:

perf/accuracy tradeoffs

[Plot: queries per second vs. 1-NN accuracy; searching more nodes trades speed for accuracy, more trees shifts the curve]

Page 59:

Things that work

• Smarter plane splitting

• Priority queue heuristics

• Search more nodes than number of results

• Align nodes closer together in memory

Page 60:

Things that don’t work

• Use lower-precision arithmetic

• Priority queue by other heuristics (number of trees)

• Precompute vector norms

Page 61:

Things for the future

• Use an optimization scheme for tree building

• Add more distance functions (e.g. edit distance)

• Use a proper KV store as a backend (e.g. LMDB) to support incremental adds, out-of-core storage, arbitrary keys: https://github.com/Houzz/annoy2

Page 62:

Thanks!

• https://github.com/spotify/annoy

• https://github.com/erikbern/ann-benchmarks

• https://github.com/erikbern/ann-presentation

• erikbern.com

• @fulhack