
Text Classifier Induction: Naive Bayes Classifiers

ML for NLP
Lecturer: Kevin Koidl. Assistant Lecturer: Alfredo Maldonado

https://www.cs.tcd.ie/kevin.koidl/cs4062/

[email protected], [email protected]

2017


Defining a CSV function

- Inductive construction of a text categorization module consists of defining a Categorization Status Value (CSV) function

- CSV for ranking and hard classifiers:

  - Ranking classifiers: for each category c_i ∈ C, define a function CSV_i with the following signature:

        CSV_i : D → [0, 1]                                            (1)

  - Hard classifiers: one can either define CSV_i as above and define a threshold τ_i above which a document is said to belong to c_i, or constrain CSV_i to range over {T, F} directly.


Category membership thresholds

- The hard classifier status value, CSV_i^h : D → {T, F}, can then be defined as follows:

        CSV_i^h(d) = T if CSV_i(d) ≥ τ_i, F otherwise.                (2)

- Thresholds can be determined analytically or experimentally.

- Analytically derived thresholds are typical of TC systems that output probability estimates of the membership of documents in categories

  - τ_i is then determined by decision-theoretic measures (e.g. utility)
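As an illustration of equation (2), a minimal sketch of my own (not code from the slides; the category names and thresholds are hypothetical): hard classification reduces to comparing each ranking score against its threshold τ_i.

    def hard_classify(csv_scores, tau):
        """Turn ranking scores CSV_i(d) in [0, 1] into hard T/F decisions.

        csv_scores: dict mapping category -> CSV_i(d)
        tau:        dict mapping category -> threshold tau_i
        Returns a dict mapping category -> True (assign) / False (reject).
        """
        return {c: score >= tau[c] for c, score in csv_scores.items()}

    # One document scored against three hypothetical categories
    print(hard_classify({"sports": 0.82, "politics": 0.35, "tech": 0.57},
                        {"sports": 0.5, "politics": 0.5, "tech": 0.6}))
    # {'sports': True, 'politics': False, 'tech': False}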


Experimental thresholds

- CSV thresholding, or SCut: SCut stands for optimal thresholding on the confidence scores of category candidates:

  - Vary τ_i on the validation set Tv and choose the value that maximises effectiveness

- Proportional thresholding: choose τ_i such that the generality measure g_Tr(c_i) is closest to g_Tv(c_i).

- RCut, or fixed thresholding: stipulate that a fixed number of categories are to be assigned to each document.

- See [Yang, 2001] for a survey of thresholding strategies.
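A minimal SCut sketch (my own illustration; the slides do not fix an effectiveness measure, so F1 is assumed here): sweep candidate thresholds over the validation scores for one category and keep the best one.

    def f1(tp, fp, fn):
        return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    def scut_threshold(val_scores, val_labels):
        """Pick tau_i maximising F1 on a validation set Tv.

        val_scores: CSV_i(d) values for the validation documents
        val_labels: booleans, True if the document belongs to c_i
        """
        best_tau, best_f1 = 0.5, -1.0
        for tau in sorted(set(val_scores)):               # candidate thresholds
            tp = sum(s >= tau and y for s, y in zip(val_scores, val_labels))
            fp = sum(s >= tau and not y for s, y in zip(val_scores, val_labels))
            fn = sum(s < tau and y for s, y in zip(val_scores, val_labels))
            if f1(tp, fp, fn) > best_f1:
                best_tau, best_f1 = tau, f1(tp, fp, fn)
        return best_tau

    print(scut_threshold([0.9, 0.7, 0.4, 0.2], [True, True, False, False]))  # 0.7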


ML methods for learning CSV functions

- Symbolic, numeric and meta-classification methods.

- Numeric methods implement classification indirectly: the classification function f outputs a numerical score, and hard classification is obtained via thresholding

  - e.g.: probabilistic classifiers, regression methods, ...

- Symbolic methods usually implement hard classification directly

  - e.g.: decision trees, decision rules, ...

- Meta-classification methods combine results from independent classifiers

  - e.g.: classifier ensembles, committees, ...


Probabilistic classifiers

- The CSV() of probabilistic classifiers produces an estimate of the conditional probability P(c|~d) = f(d, c) that an instance represented as ~d should be classified as c.

- Components of ~d are regarded as random variables T_i (1 ≤ i ≤ |T|)

- Need to estimate probabilities for all possible representations, i.e. P(c|T_1, ..., T_n).

- Too costly in practice: for the discrete case with m possible nominal values per attribute, that is O(m^|T|)

- Independence assumptions help...


Conditional independence assumption

- Using Bayes' rule we get

        P(c|~d_j) = P(c) P(~d_j|c) / P(~d_j)                          (3)

- Naive Bayes classifiers: assume T_1, ..., T_n are independent of each other given the target category:

        P(~d|c) = ∏_{k=1}^{|T|} P(t_k|c)                              (4)

- Maximum a posteriori hypothesis: choose the c that maximises (3)

- Maximum likelihood hypothesis: choose the c that maximises P(~d_j|c) (i.e. assume all c's are equally likely)
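To make the two hypotheses concrete, a small sketch (the priors and conditionals are hypothetical values of my own, not estimates from the slides); working in log space turns the product in (4) into a sum.

    import math

    def map_class(priors, conditionals, doc_terms):
        """argmax_c P(c) * prod_k P(t_k|c), computed in log space."""
        def log_posterior(c):
            return math.log(priors[c]) + sum(math.log(conditionals[c][t]) for t in doc_terms)
        return max(priors, key=log_posterior)

    def ml_class(priors, conditionals, doc_terms):
        """Maximum likelihood hypothesis: same as MAP with uniform priors."""
        uniform = {c: 1.0 / len(priors) for c in priors}
        return map_class(uniform, conditionals, doc_terms)

    priors = {"spam": 0.3, "ham": 0.7}                      # hypothetical values
    conditionals = {"spam": {"free": 0.30, "meeting": 0.01},
                    "ham":  {"free": 0.02, "meeting": 0.20}}
    print(map_class(priors, conditionals, ["free", "free"]))  # 'spam'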


Variants of Naive Bayes classifiers

- Multi-variate Bernoulli models, in which features are modelled as Boolean random variables, and

- multinomial models, where the variables represent count data [McCallum and Nigam, 1998]

- Continuous models, which use a numeric data representation: attributes are represented by continuous probability distributions

  - Using Gaussian distributions, the conditionals can be estimated as

        P(T_i = t|c) = (1 / (σ √(2π))) e^(−(t − µ)² / (2σ²))         (5)

- Non-parametric kernel density estimation has also been proposed [John and Langley, 1995]


Some Uses of NB in NLP

- Information retrieval [Robertson and Jones, 1988]

- Text categorisation (see [Sebastiani, 2002] for a survey)

- Spam filters

- Word sense disambiguation [Gale et al., 1992]


CSV for multi-variate Bernoulli models

- Starting from the independence assumption

        P(~d|c) = ∏_{k=1}^{|T|} P(t_k|c)

- and Bayes' rule

        P(c|~d_j) = P(c) P(~d_j|c) / P(~d_j)

- derive a monotonically increasing function of P(c|~d):

        f(d, c) = Σ_{i=1}^{|T|} t_i log [ P(t_i|c) (1 − P(t_i|c̄)) ] / [ P(t_i|c̄) (1 − P(t_i|c)) ]      (6)

  (where c̄ denotes the complement of category c)

- Need to estimate 2|T| rather than 2^|T| parameters.


Estimating the parameters

- For each term t_i ∈ T:

  - n_c ← the number of ~d such that f(~d, c) = 1
  - n_i ← the number of ~d for which t_i = 1 and f(~d, c) = 1

        P(t_i|c) ← (n_i + 1) / (n_c + 2)                              (7)

  (the added terms in numerator and denominator are for smoothing; see next slides)

  - n_c ← the number of ~d such that f(~d, c) = 0
  - n_i ← the number of ~d for which t_i = 1 and f(~d, c) = 0

        P(t_i|c̄) ← (n_i + 1) / (n_c + 2)                             (8)
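The sketch below (my own illustration; the Boolean document vectors and variable names are hypothetical) estimates P(t_i|c) and P(t_i|c̄) as in (7)-(8) and then evaluates the log-odds CSV of (6).

    import math

    def estimate_bernoulli(docs, labels):
        """docs: Boolean vectors (t_i = 1 if term i occurs); labels: 1 for c, 0 otherwise.
        Returns (p_pos, p_neg): smoothed estimates of P(t_i|c) and P(t_i|c-bar)."""
        n_terms = len(docs[0])
        n_c = sum(labels)                       # documents in c
        n_cbar = len(docs) - n_c                # documents not in c
        p_pos, p_neg = [], []
        for i in range(n_terms):
            n_i_pos = sum(d[i] for d, y in zip(docs, labels) if y == 1)
            n_i_neg = sum(d[i] for d, y in zip(docs, labels) if y == 0)
            p_pos.append((n_i_pos + 1) / (n_c + 2))      # eq. (7)
            p_neg.append((n_i_neg + 1) / (n_cbar + 2))   # eq. (8)
        return p_pos, p_neg

    def csv_bernoulli(doc, p_pos, p_neg):
        """Log-odds score of eq. (6), summed over terms present in the document."""
        return sum(t * math.log((p * (1 - q)) / (q * (1 - p)))
                   for t, p, q in zip(doc, p_pos, p_neg))

    docs = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
    labels = [1, 1, 0, 0]
    p_pos, p_neg = estimate_bernoulli(docs, labels)
    print(csv_bernoulli([1, 0, 0], p_pos, p_neg) > 0)    # True: term 0 favours c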


An Alternative: multinomial models

- An alternative implementation of the Naive Bayes classifier is described in [Mitchell, 1997].

- In this approach, words appear as values rather than names of attributes

- A document representation for this slide would look like this:

        ~d = ⟨a_1 = "an", a_2 = "alternative", a_3 = "implementation", ...⟩

- Problem: each attribute's value would range over the entire vocabulary. Many values would be missing for a typical document.


Dealing with missing values

- What if none of the training instances with target category c_j have attribute value a_i?

        P(a_i|c_j) = 0, and...

        P(c_j) ∏_i P(a_i|c_j) = 0

- What to do?

- Smoothing: make a Bayesian estimate for P(a_i|c_j)

        P(a_i|c_j) ← (n_c + m·p) / (n + m)

  where:
  - n is the number of training examples for which C = c_j
  - n_c is the number of examples for which C = c_j and A_i = a_i
  - p is the prior estimate for P(a_i|c_j)
  - m is the weight given to the prior (i.e. the number of "virtual" examples)
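A one-function sketch of the m-estimate above (the counts and the uniform prior p = 1/1000 are hypothetical choices of mine, not values from the slide).

    def m_estimate(n_c, n, p, m):
        """Smoothed estimate of P(a_i|c_j) = (n_c + m*p) / (n + m).

        n_c: examples with C = c_j and A_i = a_i
        n:   examples with C = c_j
        p:   prior estimate for P(a_i|c_j)
        m:   equivalent sample size ("virtual" examples)
        """
        return (n_c + m * p) / (n + m)

    # A value never seen with class c_j no longer gets probability zero:
    print(m_estimate(n_c=0, n=50, p=1/1000, m=1000))   # ~0.00095 instead of 0.0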


Learning in multinomial models

    NB_Learn(Tr, C)
      /* collect all tokens that occur in Tr */
      T ← all distinct words and other tokens in Tr
      /* calculate P(c_j) and P(t_k|c_j) */
      for each target value c_j in C do
        Tr_j ← subset of Tr for which the target value is c_j
        P(c_j) ← |Tr_j| / |Tr|
        Text_j ← concatenation of all texts in Tr_j
        n ← total number of tokens in Text_j
        for each word t_k in T do
          n_k ← number of times word t_k occurs in Text_j
          P(t_k|c_j) ← (n_k + 1) / (n + |T|)
        done
      done

Note an additional assumption: position is irrelevant, i.e.:

        P(a_i = t_k|c_j) = P(a_m = t_k|c_j)   ∀ i, m
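A runnable rendering of NB_Learn in Python, as a sketch (whitespace tokenisation, the toy training pairs and the returned data structures are my own choices, not prescribed by the slide).

    from collections import Counter

    def nb_learn(Tr, C):
        """Tr: list of (text, label) pairs; C: set of target values.
        Returns (vocabulary, priors P(c_j), conditionals P(t_k|c_j))."""
        vocab = {tok for text, _ in Tr for tok in text.split()}
        priors, cond = {}, {}
        for c_j in C:
            texts_j = [text for text, y in Tr if y == c_j]
            priors[c_j] = len(texts_j) / len(Tr)
            counts = Counter(tok for text in texts_j for tok in text.split())
            n = sum(counts.values())                       # tokens in Text_j
            cond[c_j] = {t_k: (counts[t_k] + 1) / (n + len(vocab)) for t_k in vocab}
        return vocab, priors, cond

    # Tiny hypothetical training set
    Tr = [("free money now", "spam"), ("meeting at noon", "ham"), ("free lunch meeting", "ham")]
    vocab, priors, cond = nb_learn(Tr, {"spam", "ham"})
    print(priors["ham"], round(cond["spam"]["free"], 3))   # 0.666... 0.2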


Sample Classification Algorithm

- Could calculate posterior probabilities for soft classification

        f(d) = P(c) ∏_{k=1}^{n} P(t_k|c)

  (where n is the number of tokens in d that occur in T) and use thresholding as before

- Or, for single-label text categorisation (SLTC), implement hard categorisation directly:

        positions ← all word positions in d that contain tokens found in T
        Return c_NB, where
          c_NB = argmax_{c_i ∈ C} P(c_i) ∏_{k ∈ positions} P(t_k|c_i)
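Continuing the nb_learn sketch above (again an illustration, not the official course code), the hard classifier skips tokens outside the vocabulary and takes the argmax in log space.

    import math

    def nb_classify(doc, vocab, priors, cond):
        """Return c_NB = argmax_{c_i} P(c_i) * prod_{k in positions} P(t_k|c_i)."""
        positions = [tok for tok in doc.split() if tok in vocab]
        def log_score(c_i):
            return math.log(priors[c_i]) + sum(math.log(cond[c_i][t_k]) for t_k in positions)
        return max(priors, key=log_score)

    print(nb_classify("free money", vocab, priors, cond))   # 'spam' on the toy data above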


Classification Performance

[Mitchell, 1997]: Given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from:

    comp.graphics             misc.forsale
    comp.os.ms-windows.misc   rec.autos
    comp.sys.ibm.pc.hardware  rec.motorcycles
    comp.sys.mac.hardware     rec.sport.baseball
    comp.windows.x            rec.sport.hockey
    alt.atheism               sci.space
    soc.religion.christian    sci.crypt
    talk.religion.misc        sci.electronics
    talk.politics.mideast     sci.med
    talk.politics.misc        talk.politics.guns

Naive Bayes: 89% classification accuracy.


Learning performance

- Learning curve for 20 Newsgroups: [figure omitted in this text version]

  Note: TFIDF and PRTFIDF are non-Bayesian probabilistic methods we will see later in the course. See [Joachims, 1996] for details.


NB and continuous variables

- Another model: suppose we want our document vectors to represent, say, the TF-IDF scores of each term in the document:

        ~d = ⟨a_1 = tfidf(t_1), ..., a_n = tfidf(t_n)⟩                (9)

- How would we estimate P(c|~d)?

- A: assuming an underlying (e.g. normal) distribution:

        P(c|~d) ∝ ∏_{i=1}^{n} P(a_i|c),
        where P(a_i|c) = (1 / (σ_c √(2π))) e^(−(a_i − µ_c)² / (2σ_c²))    (10)

  µ_c and σ_c² are the mean and variance of the values taken by the attribute for positive instances.
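A sketch of this continuous case (the TF-IDF vectors and the equal priors are hypothetical values of my own): fit a Gaussian per attribute and per class, then score a document with the log of the prior times the product of densities.

    import math

    def gaussian_pdf(x, mu, sigma):
        """Density of eq. (5)/(10): N(x; mu, sigma^2)."""
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    def fit_gaussians(vectors):
        """Per-attribute mean and standard deviation over one class's training vectors."""
        cols = list(zip(*vectors))
        mus = [sum(col) / len(col) for col in cols]
        sigmas = [max(1e-6, math.sqrt(sum((x - mu) ** 2 for x in col) / len(col)))
                  for col, mu in zip(cols, mus)]
        return mus, sigmas

    def log_score(vec, mus, sigmas, prior):
        return math.log(prior) + sum(math.log(gaussian_pdf(x, m, s))
                                     for x, m, s in zip(vec, mus, sigmas))

    # Hypothetical TF-IDF vectors for two classes
    pos = [[0.9, 0.1], [0.8, 0.2]]
    neg = [[0.1, 0.7], [0.2, 0.9]]
    mu_p, sd_p = fit_gaussians(pos)
    mu_n, sd_n = fit_gaussians(neg)
    d = [0.85, 0.15]
    print(log_score(d, mu_p, sd_p, 0.5) > log_score(d, mu_n, sd_n, 0.5))   # True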


Combining variables

- NB also allows you to combine different types of variables.

- The result would be a Bayesian network with continuous and discrete nodes. For instance:

        [diagram: class node C with attribute children a_1, a_2, ..., a_k, ..., a_n]

- See [Luz, 2012, Luz and Su, 2010] for examples of the use of such combined models in a different categorisation task.
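As a rough illustration of such a combined model (entirely my own toy example, not drawn from [Luz, 2012]): under the NB factorisation the class-conditional log-likelihood is just a sum of per-attribute terms, so a Boolean attribute (Bernoulli) and a continuous attribute (Gaussian) can be mixed by adding their log-probabilities.

    import math

    def log_bernoulli(x, p):           # x in {0, 1}, with P(x = 1|c) = p
        return math.log(p if x == 1 else 1 - p)

    def log_gaussian(x, mu, sigma):    # continuous attribute
        return -((x - mu) ** 2) / (2 * sigma ** 2) - math.log(sigma * math.sqrt(2 * math.pi))

    def combined_log_score(x_bool, x_cont, prior, p, mu, sigma):
        """log P(c) + log P(x_bool|c) + log P(x_cont|c) for one class."""
        return math.log(prior) + log_bernoulli(x_bool, p) + log_gaussian(x_cont, mu, sigma)

    # Hypothetical parameters for two classes of a mixed discrete/continuous NB model
    score_c1 = combined_log_score(1, 2.7, prior=0.5, p=0.8, mu=3.0, sigma=0.5)
    score_c2 = combined_log_score(1, 2.7, prior=0.5, p=0.3, mu=1.0, sigma=0.5)
    print("c1" if score_c1 > score_c2 else "c2")   # c1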


Naive but subtle

- The conditional independence assumption is clearly false:

        P(a_1, a_2, ..., a_n|v_j) = ∏_i P(a_i|v_j)

- ...but NB works well anyway. Why?

- The posteriors P(v_j|x) don't need to be correct; we need only that:

        argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i|v_j) = argmax_{v_j ∈ V} P(v_j) P(a_1, ..., a_n|v_j)

  In other words, NB classification error is measured under a zero-one loss, so the classifier is often correct even if its posterior estimates are unrealistically close to 1 or 0 [Domingos and Pazzani, 1996]. Performance can be optimal if dependencies are evenly distributed over classes, or if they cancel each other out [Zhang, 2004].


Other Probabilistic Classifiers

- Alternative approaches to probabilistic classifiers attempt to improve effectiveness by:

  - adopting weighted document vectors, rather than binary-valued ones

  - introducing document length normalisation, in order to correct distortions in CSV_i introduced by long documents

  - relaxing the independence assumption (the least adopted variant, since it appears that the binary independence assumption seldom affects effectiveness)

- But see, for instance, Hidden Naive Bayes [Zhang et al., 2005]...


References I

Domingos, P. and Pazzani, M. J. (1996). Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In International Conference on Machine Learning, pages 105–112.

Gale, W., Church, K., and Yarowsky, D. (1992). A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439.

Joachims, T. (1996). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical Report CMU-CS-96-118, CMU.

John, G. H. and Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. In Besnard, P. and Hanks, S., editors, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI'95), pages 338–345, San Francisco, CA, USA. Morgan Kaufmann Publishers.

Luz, S. (2012). The non-verbal structure of patient case discussions in multidisciplinary medical team meetings. ACM Transactions on Information Systems, 30(3):17:1–17:24.

Luz, S. and Su, J. (2010). Assessing the effectiveness of conversational features for dialogue segmentation in medical team meetings and in the AMI corpus. In Proceedings of the SIGDIAL 2010 Conference, pages 332–339, Tokyo. Association for Computational Linguistics.


References II

McCallum, A. and Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pages 41–48. AAAI Press.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.

Robertson, S. E. and Jones, K. S. (1988). Relevance weighting of search terms. In Document Retrieval Systems, pages 143–160. Taylor Graham Publishing, London.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.

Yang, Y. (2001). A study on thresholding strategies for text categorization. In Croft, W. B., Harper, D. J., Kraft, D. H., and Zobel, J., editors, Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-01), pages 137–145, New York. ACM Press.

Zhang, H. (2004). The optimality of Naive Bayes. In Proceedings of the 7th International Florida Artificial Intelligence Research Society Conference. AAAI Press.


References III

Zhang, H., Jiang, L., and Su, J. (2005). Hidden naive Bayes. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 919. AAAI Press / MIT Press.
