kNN & Naïve Bayes
Hongning Wang (CS@UVa), CS 6501: Text Mining
Today’s lecture
• Instance-based classifiers
  – k nearest neighbors
  – A non-parametric learning algorithm
• Model-based classifiers
  – Naïve Bayes classifier
    • A generative model
  – A parametric learning algorithm
How to classify this document?
[Figure: documents by vector space representation, with labeled classes Sports, Politics, Finance and one unlabeled document marked "?"]
Let’s check the nearest neighbor
[Figure: the unlabeled document "?" and its single nearest neighbor among the Sports, Politics, Finance documents]
Are you confident about this?
Let’s check more nearest neighbors
• Ask the k nearest neighbors
  – Let them vote (a minimal voting sketch follows the figure below)
[Figure: the unlabeled document "?" and its k nearest neighbors among the Sports, Politics, Finance documents]
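To make the voting step concrete, here is a minimal brute-force kNN sketch in Python; the toy vectors, labels, and function name are illustrative and not part of the slides.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Distance from the query to every training instance (Euclidean here)
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training instances
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy 2-d "documents": two Sports-like vectors and one Politics-like vector
X_train = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]])
y_train = np.array(["Sports", "Sports", "Politics"])
print(knn_predict(X_train, y_train, np.array([0.8, 0.3]), k=3))  # -> Sports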
Probabilistic interpretation of kNN
• Approximate Bayes decision rule in a subset of data around the testing point
• Let $V$ be the volume of the $d$-dimensional ball around $x$ that contains the $k$ nearest neighbors of $x$. We then have

  $p(x)\,V = \frac{k}{N} \;\Rightarrow\; p(x) = \frac{k}{NV}$, $\quad p(x \mid y=1) = \frac{k_1}{N_1 V}$, $\quad p(y=1) = \frac{N_1}{N}$

  where $N$ is the total number of instances, $N_1$ the total number of instances in class 1, and $k_1$ the number of nearest neighbors from class 1.

• With Bayes rule:

  $p(y=1 \mid x) = \frac{p(x \mid y=1)\, p(y=1)}{p(x)} = \frac{\frac{k_1}{N_1 V} \cdot \frac{N_1}{N}}{\frac{k}{NV}} = \frac{k_1}{k}$

  i.e., classifying $x$ amounts to counting the nearest neighbors from class 1.
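For example (illustrative numbers, not from the slides): if $k = 5$ neighbors are retrieved and $k_1 = 4$ of them belong to class 1, the kNN estimate of the posterior is $p(y=1 \mid x) = k_1/k = 4/5$, independent of the volume $V$.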
kNN is close to optimal
• Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes error rate
• Decision boundary
  – 1NN: Voronoi tessellation
A non-parametric estimate of the posterior distribution
Components in kNN
• A distance metric
  – Euclidean distance / cosine similarity (see the sketch below)
• How many nearby neighbors to look at
  – k
• Instance look-up
  – Efficiently search for nearby points
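As a quick illustration of how the two metrics behave (toy vectors of my own, not from the slides): Euclidean distance is sensitive to document length, while cosine similarity compares direction only.

import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 0.0, 1.0])
b = np.array([6.0, 0.0, 2.0])   # same direction as a, twice the length
print(euclidean(a, b))   # ~3.16: penalizes the length difference
print(cosine_sim(a, b))  # 1.0: identical direction, length ignored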
Effect of k
• Choice of k influences the “smoothness” of the resulting classifier
[Figure: decision regions for k=1]
[Figure: decision regions for k=5]
Effect of k
• Large k -> smoother decision boundary
• Small k -> more complicated decision boundary
[Figure: training error and testing error versus model complexity; larger k means lower model complexity, smaller k means higher model complexity]
Efficient instance look-up
• Recall MP1
  – In the Yelp_small data set, there are 629K reviews for training and 174K reviews for testing
  – Assume we have a vocabulary of 15K
  – Complexity of kNN: O(N_train × N_test × V), where N_train is the training corpus size, N_test the testing corpus size, and V the feature size
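As a rough back-of-the-envelope check of that complexity (assuming naive dense distance computations): comparing each of the 174K testing reviews against all 629K training reviews over 15K features costs about $174{,}000 \times 629{,}000 \times 15{,}000 \approx 1.6 \times 10^{15}$ basic operations, which is why brute-force kNN needs faster instance look-up.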
Efficient instance look-up
• Exact solutions
  – Build an inverted index over the documents (a minimal sketch follows the postings table below)
    • Special mapping: word -> document list
    • Speed-up is limited when the average document length is large
Dictionary      Postings
information  -> Doc1, Doc2
retrieval    -> Doc1
retrieved    -> Doc2
is           -> Doc1, Doc2
helpful      -> Doc1, Doc2
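A minimal sketch of such an index in Python; the two toy documents mirror the Doc1/Doc2 postings above, and the function names are my own.

from collections import defaultdict

def build_inverted_index(docs):
    # word -> set of ids of the documents that contain it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {"Doc1": "information retrieval is helpful",
        "Doc2": "retrieved information is helpful"}
index = build_inverted_index(docs)

# Candidate neighbors for a query = union of the postings of its words
query = "information retrieval"
candidates = set().union(*(index[w] for w in query.split() if w in index))
print(candidates)  # {'Doc1', 'Doc2'}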
Efficient instance look-up
• Exact solutions (continued)
  – Parallelize the computation with Map-Reduce
    • Map training/testing data onto different reducers
    • Merge the nearest k neighbors from the reducers (see the sketch below)
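A toy single-machine sketch of this map/reduce split; the function names and data are illustrative, and a real Map-Reduce job would run knn_shard on separate workers over partitions of the training corpus.

import heapq
import numpy as np

def knn_shard(shard, x_query, k):
    # "Map": the k nearest neighbors within one partition of the training data
    scored = [(float(np.linalg.norm(x - x_query)), label) for x, label in shard]
    return heapq.nsmallest(k, scored, key=lambda p: p[0])

def merge_shards(per_shard_results, k):
    # "Reduce": merge the per-shard candidate lists into the global k nearest
    all_pairs = (p for result in per_shard_results for p in result)
    return heapq.nsmallest(k, all_pairs, key=lambda p: p[0])

shards = [[(np.array([1.0, 0.0]), "Sports"), (np.array([0.0, 1.0]), "Politics")],
          [(np.array([0.9, 0.1]), "Sports"), (np.array([0.2, 0.9]), "Finance")]]
query = np.array([0.8, 0.2])
print(merge_shards([knn_shard(s, query, 3) for s in shards], k=3))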
Efficient instance look-up
• Approximate solutions
  – Locality sensitive hashing
    • Similar documents -> (likely) the same hash values
[Figure: documents mapped to hash buckets by h(x); similar documents land in the same bucket]
Efficient instance look-up (continued)
• Approximate solutions
  – Locality sensitive hashing
    • Construct the hash function such that similar items map to the same "buckets" with high probability
      – Learning-based: learn the hash function from annotated examples, e.g., must-link / cannot-link constraints
      – Random projection
Random projection
• Approximate the cosine similarity between vectors
  – $h_r(D) = \mathrm{sign}(r^\top D)$, where $r$ is a random unit vector
  – Each $r$ defines one hash function, i.e., one bit in the hash value
[Figure: documents $D_x$ and $D_y$ separated by angle $\theta$; random unit vectors $r_1$, $r_2$, $r_3$ each split the space and contribute one bit of the hash signature]
Random projection (continued)
• Approximate the cosine similarity between vectors
  – Provable approximation error: a random hyperplane separates $D_x$ and $D_y$ with probability $\theta/\pi$, so the fraction of agreeing signature bits estimates $1 - \theta/\pi$ (a sketch follows below)
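A minimal sketch of the signature construction in Python; the dimensionality and number of random vectors are arbitrary choices for illustration. Each random unit vector contributes one bit, two signatures agree on a bit with probability $1 - \theta/\pi$, and inverting the observed agreement rate recovers an estimate of the cosine similarity.

import numpy as np

rng = np.random.default_rng(0)

def signature(x, R):
    # One bit per random vector: 1 if x falls on its positive side
    return (R @ x >= 0).astype(int)

def approx_cosine(sig_x, sig_y):
    # P(bit agrees) = 1 - theta/pi, so invert the agreement rate to get cos(theta)
    agreement = np.mean(sig_x == sig_y)
    return np.cos(np.pi * (1.0 - agreement))

dim, n_bits = 50, 256                            # more bits -> better approximation
R = rng.standard_normal((n_bits, dim))
R /= np.linalg.norm(R, axis=1, keepdims=True)    # random unit vectors

d_x = rng.standard_normal(dim)
d_y = d_x + 0.3 * rng.standard_normal(dim)       # a similar document
true_cos = d_x @ d_y / (np.linalg.norm(d_x) * np.linalg.norm(d_y))
print(true_cos, approx_cosine(signature(d_x, R), signature(d_y, R)))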
Efficient instance look-up
• Effectiveness of random projection
  – 1.2M images with 1000 dimensions
  – About a 1000x speed-up
Weight the nearby instances
• When the data distribution is highly skewed, frequent classes might dominate the majority vote
  – They occur more often among the k nearest neighbors just because they have large volume
• Solution
  – Weight the neighbors in voting, e.g., by inverse distance to the query or by similarity to the query (a weighted-vote sketch follows below)
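A minimal weighted-vote sketch in Python; inverse-distance weights are one common choice, since the exact weighting formulas on the slide are not recoverable from this copy. The data and function name are illustrative.

import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x_query, k=5):
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        # Closer neighbors get a larger say; the epsilon guards against zero distance
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-9)
    return max(votes, key=votes.get)

X_train = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.8, 0.3]])
y_train = np.array(["Sports", "Sports", "Politics", "Politics", "Finance"])
print(weighted_knn_predict(X_train, y_train, np.array([0.85, 0.2]), k=3))  # -> Sports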
Summary of kNN
• Instance-based learning
  – No training phase
  – Assign a label to a testing case by its nearest neighbors
  – Non-parametric
  – Approximates the Bayes decision boundary in a local region
• Efficient computation
  – Locality sensitive hashing
    • Random projection
Recall optimal Bayes decision boundary
$f(X) = \arg\max_y P(y \mid X)$

[Figure: the class-conditional joint densities $p(X \mid y=1)\,p(y=1)$ and $p(X \mid y=0)\,p(y=0)$ plotted over $X$; the optimal Bayes decision boundary lies where the two curves cross, predicting $\hat{y}=1$ where the $y=1$ curve is higher and $\hat{y}=0$ otherwise, with false positives and false negatives in the overlap region]
Estimating the optimal classifier
$f(X) = \arg\max_y P(y \mid X) = \arg\max_y P(X \mid y)\, P(y)$

where $P(X \mid y)$ is the class conditional density and $P(y)$ the class prior.

With $V$ binary features:

      text  information  identify  mining  mined  is  useful  to  from  apple  delicious | Y
D1     1         1          1        1       0     1     1     1    0      0       0     | 1
D2     1         1          0        0       1     1     1     0    1      0       0     | 1
D3     0         0          0        0       0     1     0     0    0      1       1     | 0

#parameters: $|Y| - 1$ for the class prior and $|Y| \times (2^V - 1)$ for the class conditional density
Requirement: $|D| \gg |Y| \times (2^V - 1)$
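As a rough sanity check with the MP1 numbers from the kNN slides: with $V = 15{,}000$ binary features and $|Y| = 2$ classes, the full joint model would need $|Y| \times (2^V - 1) \approx 2 \times 2^{15000} \approx 10^{4516}$ parameters, astronomically more than any training corpus could support.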
We need to simplify this
• Assume features are conditionally independent given the class label:
  $p(x_1, x_2 \mid y) = p(x_1 \mid y)\, p(x_2 \mid y)$
  – E.g., $p(\text{'white house'}, \text{'obama'} \mid y) = p(\text{'white house'} \mid y)\, p(\text{'obama'} \mid y)$

This does not mean 'white house' is independent of 'obama'!
Conditional vs. marginal independence
• Features are not necessarily marginally independent of each other
• However, once we know the class label, features become independent of each other
  – Knowing it is already political news, observing 'obama' tells us little extra about the occurrence of 'white house'
Naïve Bayes classifier
$f(X) = \arg\max_y P(X \mid y)\, P(y) = \arg\max_y P(y) \prod_{i=1}^{|d|} P(x_i \mid y)$

where $P(x_i \mid y)$ is the class conditional density and $P(y)$ the class prior.

#parameters: $|Y| - 1$ for the class prior and $|Y| \times V$ for the class conditional density, vs. $|Y| \times (2^V - 1)$ without the independence assumption: computationally feasible
Naïve Bayes classifier
$f(X) = \arg\max_y P(y \mid X) = \arg\max_y P(X \mid y)\, P(y)$ (by Bayes rule) $= \arg\max_y P(y) \prod_{i=1}^{|d|} P(x_i \mid y)$ (by the independence assumption)

[Figure: graphical model with the class variable $y$ as the parent of the features $x_1, x_2, x_3, \dots, x_v$]
Estimating parameters
• Maximum likelihood estimator
  – Estimate $P(y)$ and $P(x_i \mid y)$ by the relative frequencies observed in the training data

      text  information  identify  mining  mined  is  useful  to  from  apple  delicious | Y
D1     1         1          1        1       0     1     1     1    0      0       0     | 1
D2     1         1          0        0       1     1     1     0    1      0       0     | 1
D3     0         0          0        0       0     1     0     0    0      1       1     | 0
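For instance, reading the maximum likelihood estimates off the toy table above: $P(y=1) = 2/3$, $P(\text{'text'}=1 \mid y=1) = 2/2 = 1$, $P(\text{'apple'}=1 \mid y=1) = 0/2 = 0$, and $P(\text{'apple'}=1 \mid y=0) = 1/1 = 1$. The zero estimate already hints at the sparseness problem addressed a few slides below.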
Enhancing Naïve Bayes for text classification I
• The frequency of words in a document matters
  – In log space:
    $f(d) = \arg\max_y \log P(y) + \sum_{i=1}^{|d|} c(x_i, d) \log P(x_i \mid y)$
    where $\log P(y)$ is the class bias, $\log P(x_i \mid y)$ are the model parameters, and the counts $c(x_i, d)$ form the feature vector

Essentially, we are estimating different language models!
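Putting the estimation and the log-space scoring rule together, here is a minimal multinomial Naïve Bayes sketch in Python with Laplace smoothing (smoothing is covered a few slides below); the toy documents, class names, and function names are illustrative, not from the slides.

import numpy as np
from collections import Counter

def train_nb(docs, labels, vocab, alpha=1.0):
    # docs: list of token lists; labels: list of class labels
    classes = sorted(set(labels))
    log_prior = {y: np.log(labels.count(y) / len(labels)) for y in classes}
    log_cond = {}
    for y in classes:
        counts = Counter(w for d, l in zip(docs, labels) if l == y for w in d)
        total = sum(counts[w] for w in vocab) + alpha * len(vocab)
        # Additive (Laplace) smoothing keeps unseen words from zeroing the score
        log_cond[y] = {w: np.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_cond

def predict_nb(doc, log_prior, log_cond):
    # argmax_y log P(y) + sum_i c(x_i, d) log P(x_i | y), computed in log space
    scores = {y: log_prior[y] + sum(log_cond[y][w] for w in doc if w in log_cond[y])
              for y in log_prior}
    return max(scores, key=scores.get)

docs = [["apple", "is", "delicious"], ["text", "mining", "is", "useful"]]
labels = ["food", "tech"]
vocab = {w for d in docs for w in d}
log_prior, log_cond = train_nb(docs, labels, vocab)
print(predict_nb(["text", "mining"], log_prior, log_cond))  # -> tech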
Enhancing Naïve Bayes for text classification
• For the binary case:
  $f(d) = \mathrm{sgn}\!\left(\log \frac{P(y=1)}{P(y=0)} + \sum_{i=1}^{|d|} c(x_i, d) \log \frac{P(x_i \mid y=1)}{P(x_i \mid y=0)}\right) = \mathrm{sgn}(w^\top x)$
  where
  $w = \left(\log \frac{P(y=1)}{P(y=0)},\ \log \frac{P(x_1 \mid y=1)}{P(x_1 \mid y=0)},\ \dots,\ \log \frac{P(x_v \mid y=1)}{P(x_v \mid y=0)}\right)$
  $x = (1,\ c(x_1, d),\ \dots,\ c(x_v, d))$

A linear model with vector space representation? We will come back to this topic later.
Enhancing Naïve Bayes for text classification II
• Usually, features are not conditionally independent
• Enhance Naïve Bayes by replacing the word-level conditional independence assumption with N-gram language models, i.e., condition each word on its preceding words as well as the class
Enhancing Naïve Bayes for text classification III
• Sparse observation
  – If a feature value never co-occurs with a class in training, its maximum likelihood estimate is zero, e.g., $P(\text{'apple'} \mid y=1) = 0$ in the table above
  – Then, no matter what values the other features take, the product $\prod_i P(x_i \mid y)$ is zero and that class can never be predicted for such a document
• Smoothing the class conditional density
  – All smoothing techniques we have discussed for language models are applicable here
Maximum a Posteriori estimator
• Adding pseudo instances
  – Priors over the class distribution and the class conditional density (e.g., $q(y)$ and $q(x \mid y)$), encoded through $m$ pseudo instances
  – MAP estimator for Naïve Bayes: combine the observed counts with the pseudo counts, e.g., $P(x_i \mid y) \propto c(x_i, y) + m\, q(x_i \mid y)$
  – $m$: the number of pseudo instances; the prior can be estimated from a related corpus or manually tuned
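As a concrete special case (my choice of prior, not necessarily the one on the slide): with a uniform prior $q(x \mid y) = 1/V$ and $m = V$ pseudo instances, the MAP estimate reduces to add-one (Laplace) smoothing, $P(x_i \mid y) = \frac{c(x_i, y) + 1}{\sum_x c(x, y) + V}$, so a word that never co-occurs with class $y$ in training receives a small but non-zero probability instead of exactly zero.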
Summary of Naïve Bayes
• Optimal Bayes classifier
  – Naïve Bayes with conditional independence assumptions
• Parameter estimation in Naïve Bayes
  – Maximum likelihood estimator
  – Smoothing is necessary
Today’s reading
• Introduction to Information Retrieval
  – Chapter 13: Text classification and Naive Bayes
    • 13.2 Naive Bayes text classification
    • 13.4 Properties of Naive Bayes
  – Chapter 14: Vector space classification
    • 14.3 k nearest neighbor
    • 14.4 Linear versus nonlinear classifiers