kNN & Naïve Bayes
Hongning Wang (CS@UVa), CS 6501: Text Mining
Today’s lecture
• Instance-based classifiers
  – k nearest neighbors
  – A non-parametric learning algorithm
• Model-based classifiers
  – Naïve Bayes classifier
    • A generative model
  – A parametric learning algorithm
How to classify this document?
[Figure: documents by vector space representation, with labeled classes Sports, Politics, Finance and one unlabeled document marked "?"]
Let’s check the nearest neighbor
[Figure: the unlabeled document "?" and its single nearest neighbor among the Sports, Politics, Finance documents]
Are you confident about this?
Let’s check more nearest neighbors
• Ask the k nearest neighbors
  – Let them vote (a minimal voting sketch follows the figure below)
[Figure: the unlabeled document "?" and its k nearest neighbors among the Sports, Politics, Finance documents]
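To make the voting step concrete, here is a minimal brute-force kNN sketch in Python; the toy vectors, labels, and function name are illustrative and not part of the slides.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Distance from the query to every training instance (Euclidean here)
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training instances
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy 2-d "documents": two Sports-like vectors and one Politics-like vector
X_train = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]])
y_train = np.array(["Sports", "Sports", "Politics"])
print(knn_predict(X_train, y_train, np.array([0.8, 0.3]), k=3))  # -> Sports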
Probabilistic interpretation of kNN
• Approximate Bayes decision rule in a subset of data around the testing point
• Let $V$ be the volume of the $d$-dimensional ball around $x$ that contains the $k$ nearest neighbors of $x$. We then have

  $p(x)\,V = \frac{k}{N} \;\Rightarrow\; p(x) = \frac{k}{NV}$, $\quad p(x \mid y=1) = \frac{k_1}{N_1 V}$, $\quad p(y=1) = \frac{N_1}{N}$

  where $N$ is the total number of instances, $N_1$ the total number of instances in class 1, and $k_1$ the number of nearest neighbors from class 1.

• With Bayes rule:

  $p(y=1 \mid x) = \frac{p(x \mid y=1)\, p(y=1)}{p(x)} = \frac{\frac{k_1}{N_1 V} \cdot \frac{N_1}{N}}{\frac{k}{NV}} = \frac{k_1}{k}$

  i.e., classifying $x$ amounts to counting the nearest neighbors from class 1.
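For example (illustrative numbers, not from the slides): if $k = 5$ neighbors are retrieved and $k_1 = 4$ of them belong to class 1, the kNN estimate of the posterior is $p(y=1 \mid x) = k_1/k = 4/5$, independent of the volume $V$.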
kNN is close to optimal
• Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes error rate
• Decision boundary
  – 1NN: Voronoi tessellation
A non-parametric estimate of the posterior distribution
Components in kNN
• A distance metric
  – Euclidean distance / cosine similarity (see the sketch below)
• How many nearby neighbors to look at
  – k
• Instance look-up
  – Efficiently search for nearby points
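As a quick illustration of how the two metrics behave (toy vectors of my own, not from the slides): Euclidean distance is sensitive to document length, while cosine similarity compares direction only.

import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 0.0, 1.0])
b = np.array([6.0, 0.0, 2.0])   # same direction as a, twice the length
print(euclidean(a, b))   # ~3.16: penalizes the length difference
print(cosine_sim(a, b))  # 1.0: identical direction, length ignored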
Effect of k
• Choice of k influences the “smoothness” of the resulting classifier
[Figure: decision regions for k=1]
[Figure: decision regions for k=5]
Effect of k
• Large k -> smoother decision boundary
• Small k -> more complicated decision boundary
[Figure: training error and testing error versus model complexity; larger k means lower model complexity, smaller k means higher model complexity]
Efficient instance look-up
• Recall MP1
  – In the Yelp_small data set, there are 629K reviews for training and 174K reviews for testing
  – Assume we have a vocabulary of 15K
  – Complexity of kNN: O(N_train × N_test × V), where N_train is the training corpus size, N_test the testing corpus size, and V the feature size
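As a rough back-of-the-envelope check of that complexity (assuming naive dense distance computations): comparing each of the 174K testing reviews against all 629K training reviews over 15K features costs about $174{,}000 \times 629{,}000 \times 15{,}000 \approx 1.6 \times 10^{15}$ basic operations, which is why brute-force kNN needs faster instance look-up.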
Efficient instance look-up
• Exact solutions
  – Build an inverted index over the documents (a minimal sketch follows the postings table below)
    • Special mapping: word -> document list
    • Speed-up is limited when the average document length is large
Dictionary      Postings
information  -> Doc1, Doc2
retrieval    -> Doc1
retrieved    -> Doc2
is           -> Doc1, Doc2
helpful      -> Doc1, Doc2
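A minimal sketch of such an index in Python; the two toy documents mirror the Doc1/Doc2 postings above, and the function names are my own.

from collections import defaultdict

def build_inverted_index(docs):
    # word -> set of ids of the documents that contain it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {"Doc1": "information retrieval is helpful",
        "Doc2": "retrieved information is helpful"}
index = build_inverted_index(docs)

# Candidate neighbors for a query = union of the postings of its words
query = "information retrieval"
candidates = set().union(*(index[w] for w in query.split() if w in index))
print(candidates)  # {'Doc1', 'Doc2'}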
Efficient instance look-up
• Exact solutions (continued)
  – Parallelize the computation with Map-Reduce
    • Map training/testing data onto different reducers
    • Merge the nearest k neighbors from the reducers (see the sketch below)
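A toy single-machine sketch of this map/reduce split; the function names and data are illustrative, and a real Map-Reduce job would run knn_shard on separate workers over partitions of the training corpus.

import heapq
import numpy as np

def knn_shard(shard, x_query, k):
    # "Map": the k nearest neighbors within one partition of the training data
    scored = [(float(np.linalg.norm(x - x_query)), label) for x, label in shard]
    return heapq.nsmallest(k, scored, key=lambda p: p[0])

def merge_shards(per_shard_results, k):
    # "Reduce": merge the per-shard candidate lists into the global k nearest
    all_pairs = (p for result in per_shard_results for p in result)
    return heapq.nsmallest(k, all_pairs, key=lambda p: p[0])

shards = [[(np.array([1.0, 0.0]), "Sports"), (np.array([0.0, 1.0]), "Politics")],
          [(np.array([0.9, 0.1]), "Sports"), (np.array([0.2, 0.9]), "Finance")]]
query = np.array([0.8, 0.2])
print(merge_shards([knn_shard(s, query, 3) for s in shards], k=3))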
Efficient instance look-up
• Approximate solutions
  – Locality sensitive hashing
    • Similar documents -> (likely) the same hash values
[Figure: documents mapped to hash buckets by h(x); similar documents land in the same bucket]
Efficient instance look-up (continued)
• Approximate solutions
  – Locality sensitive hashing
    • Construct the hash function such that similar items map to the same "buckets" with high probability
      – Learning-based: learn the hash function from annotated examples, e.g., must-link / cannot-link constraints
      – Random projection
Random projection
• Approximate the cosine similarity between vectors
  – $h_r(D) = \mathrm{sign}(r^\top D)$, where $r$ is a random unit vector
  – Each $r$ defines one hash function, i.e., one bit in the hash value
[Figure: documents $D_x$ and $D_y$ separated by angle $\theta$; random unit vectors $r_1$, $r_2$, $r_3$ each split the space and contribute one bit of the hash signature]
Random projection (continued)
• Approximate the cosine similarity between vectors
  – Provable approximation error: a random hyperplane separates $D_x$ and $D_y$ with probability $\theta/\pi$, so the fraction of agreeing signature bits estimates $1 - \theta/\pi$ (a sketch follows below)
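A minimal sketch of the signature construction in Python; the dimensionality and number of random vectors are arbitrary choices for illustration. Each random unit vector contributes one bit, two signatures agree on a bit with probability $1 - \theta/\pi$, and inverting the observed agreement rate recovers an estimate of the cosine similarity.

import numpy as np

rng = np.random.default_rng(0)

def signature(x, R):
    # One bit per random vector: 1 if x falls on its positive side
    return (R @ x >= 0).astype(int)

def approx_cosine(sig_x, sig_y):
    # P(bit agrees) = 1 - theta/pi, so invert the agreement rate to get cos(theta)
    agreement = np.mean(sig_x == sig_y)
    return np.cos(np.pi * (1.0 - agreement))

dim, n_bits = 50, 256                            # more bits -> better approximation
R = rng.standard_normal((n_bits, dim))
R /= np.linalg.norm(R, axis=1, keepdims=True)    # random unit vectors

d_x = rng.standard_normal(dim)
d_y = d_x + 0.3 * rng.standard_normal(dim)       # a similar document
true_cos = d_x @ d_y / (np.linalg.norm(d_x) * np.linalg.norm(d_y))
print(true_cos, approx_cosine(signature(d_x, R), signature(d_y, R)))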
Efficient instance look-up
• Effectiveness of random projection
  – 1.2M images with 1000 dimensions
  – About a 1000x speed-up
Weight the nearby instances
• When the data distribution is highly skewed, frequent classes might dominate the majority vote
  – They occur more often among the k nearest neighbors just because they have large volume
• Solution
  – Weight the neighbors in voting, e.g., by inverse distance to the query or by similarity to the query (a weighted-vote sketch follows below)
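A minimal weighted-vote sketch in Python; inverse-distance weights are one common choice, since the exact weighting formulas on the slide are not recoverable from this copy. The data and function name are illustrative.

import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x_query, k=5):
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        # Closer neighbors get a larger say; the epsilon guards against zero distance
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-9)
    return max(votes, key=votes.get)

X_train = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.8, 0.3]])
y_train = np.array(["Sports", "Sports", "Politics", "Politics", "Finance"])
print(weighted_knn_predict(X_train, y_train, np.array([0.85, 0.2]), k=3))  # -> Sports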
Summary of kNN
• Instance-based learning
  – No training phase
  – Assign a label to a testing case by its nearest neighbors
  – Non-parametric
  – Approximates the Bayes decision boundary in a local region
• Efficient computation
  – Locality sensitive hashing
    • Random projection
Recall optimal Bayes decision boundary
$f(X) = \arg\max_y P(y \mid X)$

[Figure: the class-conditional joint densities $p(X \mid y=1)\,p(y=1)$ and $p(X \mid y=0)\,p(y=0)$ plotted over $X$; the optimal Bayes decision boundary lies where the two curves cross, predicting $\hat{y}=1$ where the $y=1$ curve is higher and $\hat{y}=0$ otherwise, with false positives and false negatives in the overlap region]
Estimating the optimal classifier
$f(X) = \arg\max_y P(y \mid X) = \arg\max_y P(X \mid y)\, P(y)$

where $P(X \mid y)$ is the class conditional density and $P(y)$ the class prior.

With $V$ binary features:

      text  information  identify  mining  mined  is  useful  to  from  apple  delicious | Y
D1     1         1          1        1       0     1     1     1    0      0       0     | 1
D2     1         1          0        0       1     1     1     0    1      0       0     | 1
D3     0         0          0        0       0     1     0     0    0      1       1     | 0

#parameters: $|Y| - 1$ for the class prior and $|Y| \times (2^V - 1)$ for the class conditional density
Requirement: $|D| \gg |Y| \times (2^V - 1)$
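As a rough sanity check with the MP1 numbers from the kNN slides: with $V = 15{,}000$ binary features and $|Y| = 2$ classes, the full joint model would need $|Y| \times (2^V - 1) \approx 2 \times 2^{15000} \approx 10^{4516}$ parameters, astronomically more than any training corpus could support.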
We need to simplify this
• Assume features are conditionally independent given the class label:
  $p(x_1, x_2 \mid y) = p(x_1 \mid y)\, p(x_2 \mid y)$
  – E.g., $p(\text{'white house'}, \text{'obama'} \mid y) = p(\text{'white house'} \mid y)\, p(\text{'obama'} \mid y)$

This does not mean 'white house' is independent of 'obama'!
Conditional vs. marginal independence
• Features are not necessarily marginally independent of each other
• However, once we know the class label, features become independent of each other
  – Knowing it is already political news, observing 'obama' tells us little extra about the occurrence of 'white house'
Naïve Bayes classifier
$f(X) = \arg\max_y P(X \mid y)\, P(y) = \arg\max_y P(y) \prod_{i=1}^{|d|} P(x_i \mid y)$

where $P(x_i \mid y)$ is the class conditional density and $P(y)$ the class prior.

#parameters: $|Y| - 1$ for the class prior and $|Y| \times V$ for the class conditional density, vs. $|Y| \times (2^V - 1)$ without the independence assumption: computationally feasible
Naïve Bayes classifier
$f(X) = \arg\max_y P(y \mid X) = \arg\max_y P(X \mid y)\, P(y)$ (by Bayes rule) $= \arg\max_y P(y) \prod_{i=1}^{|d|} P(x_i \mid y)$ (by the independence assumption)

[Figure: graphical model with the class variable $y$ as the parent of the features $x_1, x_2, x_3, \dots, x_v$]
Estimating parameters
• Maximum likelihood estimator
  – Estimate $P(y)$ and $P(x_i \mid y)$ by the relative frequencies observed in the training data

      text  information  identify  mining  mined  is  useful  to  from  apple  delicious | Y
D1     1         1          1        1       0     1     1     1    0      0       0     | 1
D2     1         1          0        0       1     1     1     0    1      0       0     | 1
D3     0         0          0        0       0     1     0     0    0      1       1     | 0
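For instance, reading the maximum likelihood estimates off the toy table above: $P(y=1) = 2/3$, $P(\text{'text'}=1 \mid y=1) = 2/2 = 1$, $P(\text{'apple'}=1 \mid y=1) = 0/2 = 0$, and $P(\text{'apple'}=1 \mid y=0) = 1/1 = 1$. The zero estimate already hints at the sparseness problem addressed a few slides below.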
Enhancing Naïve Bayes for text classification I
• The frequency of words in a document matters
  – In log space:
    $f(d) = \arg\max_y \log P(y) + \sum_{i=1}^{|d|} c(x_i, d) \log P(x_i \mid y)$
    where $\log P(y)$ is the class bias, $\log P(x_i \mid y)$ are the model parameters, and the counts $c(x_i, d)$ form the feature vector

Essentially, we are estimating different language models!
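Putting the estimation and the log-space scoring rule together, here is a minimal multinomial Naïve Bayes sketch in Python with Laplace smoothing (smoothing is covered a few slides below); the toy documents, class names, and function names are illustrative, not from the slides.

import numpy as np
from collections import Counter

def train_nb(docs, labels, vocab, alpha=1.0):
    # docs: list of token lists; labels: list of class labels
    classes = sorted(set(labels))
    log_prior = {y: np.log(labels.count(y) / len(labels)) for y in classes}
    log_cond = {}
    for y in classes:
        counts = Counter(w for d, l in zip(docs, labels) if l == y for w in d)
        total = sum(counts[w] for w in vocab) + alpha * len(vocab)
        # Additive (Laplace) smoothing keeps unseen words from zeroing the score
        log_cond[y] = {w: np.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_cond

def predict_nb(doc, log_prior, log_cond):
    # argmax_y log P(y) + sum_i c(x_i, d) log P(x_i | y), computed in log space
    scores = {y: log_prior[y] + sum(log_cond[y][w] for w in doc if w in log_cond[y])
              for y in log_prior}
    return max(scores, key=scores.get)

docs = [["apple", "is", "delicious"], ["text", "mining", "is", "useful"]]
labels = ["food", "tech"]
vocab = {w for d in docs for w in d}
log_prior, log_cond = train_nb(docs, labels, vocab)
print(predict_nb(["text", "mining"], log_prior, log_cond))  # -> tech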
Enhancing Naïve Bayes for text classification
• For the binary case:
  $f(d) = \mathrm{sgn}\!\left(\log \frac{P(y=1)}{P(y=0)} + \sum_{i=1}^{|d|} c(x_i, d) \log \frac{P(x_i \mid y=1)}{P(x_i \mid y=0)}\right) = \mathrm{sgn}(w^\top x)$
  where
  $w = \left(\log \frac{P(y=1)}{P(y=0)},\ \log \frac{P(x_1 \mid y=1)}{P(x_1 \mid y=0)},\ \dots,\ \log \frac{P(x_v \mid y=1)}{P(x_v \mid y=0)}\right)$
  $x = (1,\ c(x_1, d),\ \dots,\ c(x_v, d))$

A linear model with vector space representation? We will come back to this topic later.
Enhancing Naïve Bayes for text classification II
• Usually, features are not conditionally independent
• Enhance Naïve Bayes by replacing the word-level conditional independence assumption with N-gram language models, i.e., condition each word on its preceding words as well as the class
Enhancing Naïve Bayes for text classification III
• Sparse observation
  – If a feature value never co-occurs with a class in training, its maximum likelihood estimate is zero, e.g., $P(\text{'apple'} \mid y=1) = 0$ in the table above
  – Then, no matter what values the other features take, the product $\prod_i P(x_i \mid y)$ is zero and that class can never be predicted for such a document
• Smoothing the class conditional density
  – All smoothing techniques we have discussed for language models are applicable here
Maximum a Posteriori estimator
• Adding pseudo instances
  – Priors over the class distribution and the class conditional density (e.g., $q(y)$ and $q(x \mid y)$), encoded through $m$ pseudo instances
  – MAP estimator for Naïve Bayes: combine the observed counts with the pseudo counts, e.g., $P(x_i \mid y) \propto c(x_i, y) + m\, q(x_i \mid y)$
  – $m$: the number of pseudo instances; the prior can be estimated from a related corpus or manually tuned
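As a concrete special case (my choice of prior, not necessarily the one on the slide): with a uniform prior $q(x \mid y) = 1/V$ and $m = V$ pseudo instances, the MAP estimate reduces to add-one (Laplace) smoothing, $P(x_i \mid y) = \frac{c(x_i, y) + 1}{\sum_x c(x, y) + V}$, so a word that never co-occurs with class $y$ in training receives a small but non-zero probability instead of exactly zero.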
Summary of Naïve Bayes
• Optimal Bayes classifier
  – Naïve Bayes with conditional independence assumptions
• Parameter estimation in Naïve Bayes
  – Maximum likelihood estimator
  – Smoothing is necessary
Today’s reading
• Introduction to Information Retrieval
  – Chapter 13: Text classification and Naive Bayes
    • 13.2 Naive Bayes text classification
    • 13.4 Properties of Naive Bayes
  – Chapter 14: Vector space classification
    • 14.3 k nearest neighbor
    • 14.4 Linear versus nonlinear classifiers