Roadmap
- Naïve Bayes
  - Multivariate Bernoulli event model (recap)
  - Multinomial event model
  - Analysis
- HW#3

Naïve Bayes Models in Detail (McCallum & Nigam, 1998)
Alternate models for Naïve Bayes text classification:
- Multivariate Bernoulli event model: a binary independence model; features are treated as binary, so counts are ignored
- Multinomial event model: a unigram language model

Multivariate Bernoulli Event Text Model
- Each document is the result of |V| independent Bernoulli trials: for each word in the vocabulary, does the word appear in the document?
- From the general Naïve Bayes perspective, each word corresponds to two values, wt and ¬wt (the word appears or it does not). In each document, either wt or ¬wt appears, so a document always has |V| elements.

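To make the |V|-trials view concrete, here is a minimal sketch of the document likelihood under this model (not from the slides; the vocabulary and probability values are illustrative placeholders): every vocabulary word contributes a factor, whether or not it appears in the document.

```python
# Sketch of the multivariate Bernoulli document likelihood.
# Vocabulary and probabilities are illustrative placeholders.

vocab = ["a", "b", "c"]
p_word_given_c = {"a": 0.5, "b": 0.3, "c": 0.2}  # P(w_t | c), one Bernoulli per word

def bernoulli_doc_likelihood(doc_words, p_word):
    """P(d | c): every vocabulary word contributes p if present, (1 - p) if absent."""
    present = set(doc_words)
    likelihood = 1.0
    for w in vocab:
        likelihood *= p_word[w] if w in present else (1.0 - p_word[w])
    return likelihood

print(bernoulli_doc_likelihood(["a", "b"], p_word_given_c))  # 0.5 * 0.3 * (1 - 0.2) = 0.12
```
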
Multinomial Distribution
- Trial: select a word according to its probability; possible outcomes: {w1, w2, …, w|V|}
- A document is viewed as the result of one trial for each position
- P(word = wi) = pi, with Σi pi = 1
- Letting Xi be the number of trials that select wi, with n trials in total:
  P(X1 = x1, X2 = x2, …, X|V| = x|V|) = n! / (x1! · … · x|V|!) · p1^x1 · … · p|V|^x|V|

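As a sanity check on the formula, a minimal sketch (the probability values are illustrative):

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(X1=x1, ..., X|V|=x|V|) = n!/(x1! ... x|V|!) * p1^x1 * ... * p|V|^x|V|."""
    n = sum(counts)
    coeff = factorial(n)
    for x in counts:
        coeff //= factorial(x)
    return coeff * prod(p ** x for p, x in zip(probs, counts))

# Three outcomes with illustrative probabilities; one w1 and one w2 in n = 2 trials:
print(multinomial_pmf([1, 1, 0], [0.5, 0.3, 0.2]))  # 2 * 0.5 * 0.3 = 0.3
```
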
Example (due to F. Xia)
- Consider a vocabulary V with only three words: a, b, c
- Document di contains only 2 word instances
- For each position: P(w=a) = p1, P(w=b) = p2, P(w=c) = p3
- What is the probability that we see 'a' once and 'b' once in di?

Example (cont'd)
- How many possible sequences? 3^2 = 9: aa, ab, ac, ba, bb, bc, ca, cb, cc
- How many sequences with one 'a' and one 'b'? n!/(x1! · … · x|V|!) = 2!/(1! · 1! · 0!) = 2
- Probability of the sequence 'ab': p1 · p2
- Probability of the sequence 'ba': p1 · p2
- So the probability of seeing 'a' once and 'b' once is 2 · p1 · p2

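The counting argument can be verified by brute force; a minimal sketch that enumerates all nine sequences (the probability values are illustrative):

```python
from itertools import product

p = {"a": 0.5, "b": 0.3, "c": 0.2}  # illustrative values for p1, p2, p3

# Sum the probabilities of the length-2 sequences containing one 'a' and one 'b'
# (exactly 'ab' and 'ba' out of the 3^2 = 9 possibilities).
total = sum(
    p[w1] * p[w2]
    for w1, w2 in product("abc", repeat=2)
    if (w1, w2).count("a") == 1 and (w1, w2).count("b") == 1
)
print(total)  # 2 * p1 * p2 = 0.3
```
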
Multinomial Event Model
- A document is a sequence of word events drawn from vocabulary V
- Assume document length is independent of class
- Assume (Naïve Bayes) that words are independent of context
- Define Nit = number of occurrences of wt in document di
- Then, under the multinomial event model:
  P(di | cj) = P(|di|) · |di|! · Πt ( P(wt | cj)^Nit / Nit! )

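A minimal sketch of this likelihood (the class-conditional probabilities are illustrative placeholders; the class-independent length prior P(|di|) is dropped):

```python
from collections import Counter
from math import factorial

def multinomial_doc_likelihood(doc_words, p_word_given_c):
    """P(d_i | c_j) is proportional to |d_i|! * prod_t P(w_t | c_j)^N_it / N_it!"""
    counts = Counter(doc_words)              # N_it for each word type w_t
    likelihood = factorial(len(doc_words))   # |d_i|!
    for w, n in counts.items():
        likelihood *= p_word_given_c[w] ** n / factorial(n)
    return likelihood

# Illustrative: same three-word vocabulary as the running example.
print(multinomial_doc_likelihood(["a", "b"], {"a": 0.5, "b": 0.3, "c": 0.2}))  # 0.3
```
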
Training
- P(cj | di) = 1 if document di is of class cj, and 0 otherwise
- So the word probabilities are estimated from class-specific counts (with add-one smoothing):
  P(wt | cj) = (1 + Σi Nit · P(cj | di)) / (|V| + Σs Σi Nis · P(cj | di))
- Contrast this with the multivariate Bernoulli model, which counts documents containing wt rather than occurrences of wt

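A minimal sketch of this training step, with the add-one smoothing from the estimate above (the corpus and labels are illustrative):

```python
from collections import Counter

def train_multinomial(docs, labels, vocab):
    """P(w_t | c_j) = (1 + # occurrences of w_t in class j) / (|V| + # words in class j)."""
    p = {}
    for c in set(labels):
        counts = Counter(w for d, y in zip(docs, labels) if y == c for w in d)
        total = sum(counts.values())
        p[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return p

docs = [["a", "b"], ["b", "c"], ["a", "a"]]   # illustrative toy corpus
labels = ["pos", "neg", "pos"]
print(train_multinomial(docs, labels, ["a", "b", "c"])["pos"]["a"])  # (1 + 3) / (3 + 4)
```
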
Two Naïve Bayes Models
- Multivariate Bernoulli event model: models binary presence/absence of word features
- Multinomial event model: models counts of word features (unigram model)
- In experiments on a range of text classification corpora, the multinomial model usually outperforms the multivariate Bernoulli model (McCallum & Nigam, 1998)

Thinking about Performance
- Naïve Bayes rests on a conditional independence assumption that is clearly unrealistic, yet performance is often good. Why? Classification is based on the sign of the score, not its magnitude, and the direction of classification is usually right.
- Multivariate Bernoulli vs. multinomial: why does the multinomial model perform better? It captures additional information: presence/absence plus frequency.
- What if we wanted to include other types of features? In the multivariate model, a new feature is just another Bernoulli trial; the multinomial model can't mix distributions.

Model Comparison
- Features: multivariate Bernoulli uses binary presence/absence; multinomial event uses # of occurrences
- Trial: each word in the vocabulary vs. each position in the document
- P(c): fraction of training documents in class c (same estimate for both models)
- P(w|c): fraction of class-c documents that contain w vs. fraction of word occurrences in class-c documents that are w
- Testing: both choose the class maximizing P(c) · Π P(w|c); the Bernoulli product runs over the whole vocabulary (absent words contribute 1 − P(w|c)), while the multinomial product runs over document positions

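To make the Testing row concrete, a minimal sketch of the two decision rules side by side (the priors and probability tables are illustrative placeholders, not trained values):

```python
# Sketch: the two testing rules from the comparison above, side by side.
vocab = ["a", "b", "c"]
prior = {"pos": 0.5, "neg": 0.5}
p_bern = {"pos": {"a": 0.6, "b": 0.4, "c": 0.1}, "neg": {"a": 0.2, "b": 0.3, "c": 0.7}}
p_mult = {"pos": {"a": 0.5, "b": 0.4, "c": 0.1}, "neg": {"a": 0.2, "b": 0.2, "c": 0.6}}

def classify_bernoulli(doc):
    # Product runs over ALL vocabulary words; absent words contribute (1 - p).
    def score(c):
        s = prior[c]
        for w in vocab:
            s *= p_bern[c][w] if w in doc else 1 - p_bern[c][w]
        return s
    return max(prior, key=score)

def classify_multinomial(doc):
    # Product runs over document positions only.
    def score(c):
        s = prior[c]
        for w in doc:
            s *= p_mult[c][w]
        return s
    return max(prior, key=score)

print(classify_bernoulli({"a", "b"}), classify_multinomial(["a", "b", "a"]))
```
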
Naïve Bayes: Strengths
- Conceptual simplicity
- Training efficiency
- Testing efficiency
- Scales fairly well to large data
- Performs multiclass classification
- Can provide n-best outputs

Naïve Bayes: Weaknesses
- Weak theoretical foundation: a ragingly inaccurate independence assumption
- Decent accuracy, but outperformed by more sophisticated classifiers

HW#3: Naïve Bayes Classification
- Experiment with the Mallet Naïve Bayes learner
- Implement the multivariate Bernoulli event model
- Implement the multinomial event model and compare with binary variables
- Analyze the results

Notes
- Use add-delta smoothing (vs. add-one)
- Beware numerical underflow: log probabilities are your friend, and logs also convert exponents into multipliers
- Look out for repeated computation: precompute normalization denominators, e.g. for the multinomial P(w|c), compute the denominator once for each class c (see the sketch below)

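A minimal sketch that combines these notes: add-delta smoothing, log probabilities, and a normalization denominator computed once per class (the names and the delta value are illustrative):

```python
from collections import Counter
from math import log

def train_log_probs(docs, labels, vocab, delta=0.5):
    """Precompute log P(w|c) with add-delta smoothing, one denominator per class."""
    log_p = {}
    for c in set(labels):
        counts = Counter(w for d, y in zip(docs, labels) if y == c for w in d)
        denom = sum(counts.values()) + delta * len(vocab)  # computed once per class c
        log_p[c] = {w: log((counts[w] + delta) / denom) for w in vocab}
    return log_p

def score(doc, log_p_c, log_prior):
    # Summing logs avoids underflow, and exponents (repeated words)
    # become integer multipliers on the log probabilities.
    counts = Counter(doc)
    return log_prior + sum(n * log_p_c[w] for w, n in counts.items())
```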