Approximate nearest neighbor methods and vector models – NYC ML meetup


Approximate nearest neighbors & vector models

I’m Erik

• @fulhack

• Author of Annoy, Luigi

• Currently CTO of Better

• Previously 5 years at Spotify

What are nearest neighbors?

• Let’s say you have a bunch of points

Grab a bunch of points

5 nearest neighbors

20 nearest neighbors

100 nearest neighbors

…But what’s the point?

• vector models are everywhere

• lots of applications (language processing, recommender systems, computer vision)

MNIST example

• 28x28 = 784-dimensional dataset

• Define distance in terms of pixels:
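In code, that pixel distance is just Euclidean distance over the 784 raw pixel values; a minimal stdlib sketch:

```python
import math

def pixel_distance(a, b):
    # Euclidean distance between two images flattened to 784-long vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two toy "images": identical except for one pixel differing by 255.
img_a = [0.0] * 784
img_b = [0.0] * 783 + [255.0]
print(pixel_distance(img_a, img_b))  # 255.0
```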

MNIST neighbors

…Much better approach

1. Start with high dimensional data

2. Run dimensionality reduction to 10-1000 dims

3. Do stuff in the low-dimensional space

Deep learning for food

• Deep model trained on a GPU on 6M random pics downloaded from Yelp

[Architecture diagram: stacked 3x3 convolutions with 2x2 maxpooling, feature maps shrinking from 156x156x32 down to 2x2x512, followed by fully connected layers with dropout (2048, 2048), a 128-dimensional bottleneck layer, and a 1244-way output]
Distance in smaller space

1. Run image through the network

2. Use the 128-dimensional bottleneck layer as an item vector

3. Use cosine distance in the reduced space
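A minimal cosine-distance sketch (3-d toy vectors standing in for the real 128-d bottleneck activations):

```python
import math

def cosine_distance(u, v):
    # Cosine distance = 1 - cos(angle between u and v).
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / norm

print(cosine_distance([1, 0, 0], [0, 1, 0]))  # 1.0 (orthogonal)
print(cosine_distance([1, 2, 3], [2, 4, 6]))  # ~0.0 (same direction)
```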

Nearest food pics

Vector methods for text

• TF-IDF (old) – no dimensionality reduction

• Latent Semantic Analysis (1988)

• Probabilistic Latent Semantic Analysis (2000)

• Semantic Hashing (2007)

• word2vec (2013), RNN, LSTM, …

Represent documents and/or words as f-dimensional vectors

[2D plot: words such as banana, apple, and boat positioned by latent factor 1 and latent factor 2; related words like banana and apple sit close together]

Vector methods for collaborative filtering

• Supervised methods: See everything from the Netflix Prize

• Unsupervised: Use NLP methods

CF vectors – examples

IPMF item-item:

P(i → j) = exp(bⱼᵀbᵢ)/Zᵢ = exp(bⱼᵀbᵢ) / Σₖ exp(bₖᵀbᵢ)

Vectors: pᵤᵢ = aᵤᵀbᵢ

simᵢⱼ = cos(bᵢ, bⱼ) = bᵢᵀbⱼ / (|bᵢ||bⱼ|), an O(f) computation

i                      | j                      | simᵢⱼ
2pac                   | 2pac                   | 1.0
2pac                   | Notorious B.I.G.       | 0.91
2pac                   | Dr. Dre                | 0.87
2pac                   | Florence + the Machine | 0.26
Florence + the Machine | Lana Del Rey           | 0.81

IPMF item-item MDS:

P(i → j) = exp(−|bⱼ − bᵢ|²)/Zᵢ = exp(−|bⱼ − bᵢ|²) / Σₖ exp(−|bₖ − bᵢ|²)

simᵢⱼ = −|bⱼ − bᵢ|²

Trained on (u, i, count) triples via gradients ∂L/∂aᵤ
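The item-item softmax and the O(f) cosine similarity can be sketched like this (the 2-d item vectors are made up for illustration, not real model output):

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def p_follow(i, j, vectors):
    # P(i -> j) = exp(b_j . b_i) / sum_k exp(b_k . b_i): a softmax over items.
    logits = {k: dot(b, vectors[i]) for k, b in vectors.items()}
    z = sum(math.exp(l) for l in logits.values())  # the partition function Z_i
    return math.exp(logits[j]) / z

def cos_sim(u, v):
    # O(f) similarity between two item vectors.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up 2-d item vectors, purely for illustration.
vectors = {
    "2pac": [1.0, 0.1],
    "Dr. Dre": [0.9, 0.2],
    "Florence + the Machine": [0.1, 1.0],
}
print({j: round(p_follow("2pac", j, vectors), 3) for j in vectors})
print(round(cos_sim(vectors["2pac"], vectors["Dr. Dre"]), 2))  # close to 1
```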

Geospatial indexing

• Ping the world: https://github.com/erikbern/ping

• k-NN regression using Annoy
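k-NN regression is just "average the values of the k nearest points"; a brute-force stdlib sketch (made-up coordinates and latencies; the real project uses Annoy for the neighbor lookup):

```python
import heapq
import math

def knn_regress(query, points, values, k=3):
    # Predict at `query` by averaging the values of the k nearest points.
    dists = [(math.dist(query, p), v) for p, v in zip(points, values)]
    return sum(v for _, v in heapq.nsmallest(k, dists)) / k

# Toy stand-in for the ping data: 2-d coordinates -> measured latency.
points = [(0, 0), (1, 0), (0, 1), (5, 5)]
latency = [10, 20, 30, 500]
print(knn_regress((0.1, 0.1), points, latency, k=3))  # 20.0
```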

Nearest neighbors the brute force way

• we can always do an exhaustive search to find the nearest neighbors

• imagine MySQL doing a linear scan for every query…
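The exhaustive approach is a linear scan over every vector for every query, which is exactly why it gets slow; a stdlib sketch:

```python
import heapq
import math

def brute_force_nns(query, vectors, n=10):
    # Exhaustive scan: distance to every vector, keep the n best.
    # O(n_items * f) per query.
    return heapq.nsmallest(
        n, ((math.dist(query, v), key) for key, v in vectors.items())
    )

vectors = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [3.0, 4.0]}
print(brute_force_nns([0.1, 0.0], vectors, n=2))  # [(0.1, 'a'), (0.9, 'b')]
```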

Using word2vec’s brute force search

$ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin
Qiantang_River    0.597229
Yangtse           0.587990
Yangtze_River     0.576738
lake              0.567611
rivers            0.567264
creek             0.567135
Mekong_river      0.550916
Xiangjiang_River  0.550451
Beas_river        0.549198
Minjiang_River    0.548721

real    2m34.346s
user    1m36.235s
sys     0m16.362s

Introducing Annoy

• https://github.com/spotify/annoy

• mmap-based ANN library

• Written in C++, with Python and R bindings

• 585 stars on Github

Using Annoy’s search

$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000
Yangtse        0.907756
Yangtze_River  0.920067
rivers         0.930308
creek          0.930447
Mekong_river   0.947718
Huangpu_River  0.951850
Ganges         0.959261
Thu_Bon        0.960545
Yangtze        0.966199
Yangtze_river  0.978978

real    0m0.470s
user    0m0.285s
sys     0m0.162s

Using Annoy’s search

$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000
Qiantang_River    0.897519
Yangtse           0.907756
Yangtze_River     0.920067
lake              0.929934
rivers            0.930308
creek             0.930447
Mekong_river      0.947718
Xiangjiang_River  0.948208
Beas_river        0.949528
Minjiang_River    0.950031

real    0m2.013s
user    0m1.386s
sys     0m0.614s

(performance)

1. Building an Annoy index

Start with the point set

Split it in two halves

Split again

Again…

…more iterations later

Side note: making trees small

• Split until K items in each leaf (K~100)

• Takes O(n/K) memory instead of O(n)

Binary tree
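The splitting procedure can be sketched in plain Python. This toy version picks two random points and splits on the hyperplane equidistant from them, one plausible splitting rule; Annoy’s real implementation is C++ and differs in the details:

```python
import random

K = 2  # max items per leaf (Annoy uses K ~ 100)

def build_tree(ids, vectors):
    # Leaf: few enough items to scan directly.
    if len(ids) <= K:
        return {"leaf": ids}
    # Pick two random points; split on the hyperplane equidistant from them.
    a, b = (vectors[i] for i in random.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = sum(n * (x + y) / 2 for n, x, y in zip(normal, a, b))
    side = lambda v: sum(n * x for n, x in zip(normal, v)) - offset
    left = [i for i in ids if side(vectors[i]) <= 0]
    right = [i for i in ids if side(vectors[i]) > 0]
    if not left or not right:  # degenerate split: give up, make a leaf
        return {"leaf": ids}
    return {"normal": normal, "offset": offset,
            "left": build_tree(left, vectors),
            "right": build_tree(right, vectors)}

def leaf_items(node):
    # Sanity helper: every item ends up in exactly one leaf.
    if "leaf" in node:
        return node["leaf"]
    return leaf_items(node["left"]) + leaf_items(node["right"])

random.seed(0)
vectors = {i: [random.gauss(0, 1), random.gauss(0, 1)] for i in range(20)}
tree = build_tree(list(vectors), vectors)
```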

2. Searching

Nearest neighbors

Searching the tree

Problemo

• The point that’s the closest isn’t necessarily in the same leaf of the binary tree

• Two points that are really close may end up on different sides of a split

• Solution: go down both sides of a split if the query is close to the splitting plane

Trick 1: Priority queue

• Traverse the tree using a priority queue

• Sort by min(margin) along the path from the root

Trick 2: many trees

• Construct trees randomly many times

• Use the same priority queue to search all of them at the same time

heap + forest = best

• Since we use a priority queue, we dive down the best splits first (the ones with the biggest distance to the splitting plane)

• More trees always helps!

• The only constraint is that more trees require more RAM

Annoy query structure

1. Use a priority queue to search all trees until we’ve found k items

2. Take union and remove duplicates (a lot)

3. Compute distance for remaining items

4. Return the nearest n items

Find candidates

Take union of all leaves

Compute distances

Return nearest neighbors
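Putting the four steps together, a self-contained toy version of the whole query path (random-projection trees plus one shared priority queue; an illustration of the idea, not Annoy’s actual code):

```python
import heapq
import math
import random

K = 4  # max items per leaf

def margin(normal, offset, v):
    # Signed side of v relative to a splitting hyperplane (scaled distance).
    return sum(n * x for n, x in zip(normal, v)) - offset

def build_tree(ids, vectors):
    # Random-projection tree: hyperplane equidistant from two random points.
    if len(ids) <= K:
        return {"leaf": ids}
    a, b = (vectors[i] for i in random.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = sum(n * (x + y) / 2 for n, x, y in zip(normal, a, b))
    left = [i for i in ids if margin(normal, offset, vectors[i]) <= 0]
    right = [i for i in ids if margin(normal, offset, vectors[i]) > 0]
    if not left or not right:  # degenerate split, stop here
        return {"leaf": ids}
    return {"normal": normal, "offset": offset,
            "left": build_tree(left, vectors), "right": build_tree(right, vectors)}

def search_forest(query, trees, vectors, n, search_k):
    # 1. One priority queue over ALL trees, keyed by min(margin) on the path.
    heap = [(-math.inf, i, t) for i, t in enumerate(trees)]
    heapq.heapify(heap)
    tiebreak = len(trees)
    candidates = set()
    while heap and len(candidates) < search_k:
        neg_p, _, node = heapq.heappop(heap)
        p = -neg_p
        if "leaf" in node:
            candidates.update(node["leaf"])  # 2. union of leaves, deduplicated
            continue
        m = margin(node["normal"], node["offset"], query)
        # The child on the query's side keeps a positive priority; the far
        # side is penalized by how far the query is from the plane.
        heapq.heappush(heap, (-min(p, m), tiebreak, node["right"]))
        heapq.heappush(heap, (-min(p, -m), tiebreak + 1, node["left"]))
        tiebreak += 2
    # 3. Exact distances for the candidates; 4. return the nearest n.
    return heapq.nsmallest(n, ((math.dist(query, vectors[i]), i) for i in candidates))

random.seed(1)
vectors = {i: [random.gauss(0, 1) for _ in range(3)] for i in range(200)}
trees = [build_tree(list(vectors), vectors) for _ in range(10)]
print(search_forest(vectors[0], trees, vectors, n=5, search_k=50))
```

Querying with an item’s own vector should return that item first at distance 0, with its true neighbors close behind.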

“Curse of dimensionality”

Are we screwed?

• Would be nice if the data has a much smaller “intrinsic dimension”!

Improving the algorithm

[Plot: queries per second (vertical axis, faster is up) vs. 1-NN accuracy (horizontal axis, more accurate is right)]

• https://github.com/erikbern/ann-benchmarks

ann-benchmarks

perf/accuracy tradeoffs

[Plot: queries per second vs. 1-NN accuracy, annotated with the two tuning knobs: search more nodes, build more trees]

Things that work

• Smarter plane splitting

• Priority queue heuristics

• Search more nodes than number of results

• Align nodes closer together

Things that don’t work

• Use lower-precision arithmetic

• Priority queue by other heuristics (number of trees)

• Precompute vector norms

Things for the future

• Use an optimization scheme for tree building

• Add more distance functions (e.g. edit distance)

• Use a proper KV store as a backend (e.g. LMDB) to support incremental adds, out-of-core operation, arbitrary keys: https://github.com/Houzz/annoy2

Thanks!

• https://github.com/spotify/annoy

• https://github.com/erikbern/ann-benchmarks

• https://github.com/erikbern/ann-presentation

• erikbern.com

• @fulhack
