Text Classifier Induction: Naive Bayes Classifiers
ML for NLP. Lecturer: Kevin Koidl. Assistant Lecturer: Alfredo Maldonado
https://www.cs.tcd.ie/kevin.koidl/cs4062/
[email protected], [email protected]
2017
Defining a CSV function
- Inductive construction of a text categorization module consists of defining a Categorization Status Value (CSV) function.
- CSV for ranking and hard classifiers:
  - Ranking classifiers: for each category c_i ∈ C, define a function CSV_i with the following signature:

    CSV_i : D → [0, 1]    (1)

  - Hard classifiers: one can either define CSV_i as above and define a threshold τ_i above which a document is said to belong to c_i, or constrain CSV_i to range over {T, F} directly.
Category membership thresholds
- The hard classifier status value, CSV_i^h : D → {T, F}, can then be defined as follows:

  CSV_i^h(d) = T if CSV_i(d) ≥ τ_i, F otherwise.    (2)

- Thresholds can be determined analytically or experimentally.
- Analytically derived thresholds are typical of TC systems that output probability estimates of membership of documents to categories.
- τ_i is then determined by decision-theoretic measures (e.g. utility).
Experimental thresholds
- CSV thresholding or SCut: SCut stands for optimal thresholding on the confidence scores of category candidates:
  - vary τ_i on a validation set Tv and choose the value that maximises effectiveness.
- Proportional thresholding: choose τ_i such that the generality measure g_Tr(c_i) is closest to g_Tv(c_i).
- RCut or fixed thresholding: stipulate that a fixed number of categories are to be assigned to each document.
- See [Yang, 2001] for a survey of thresholding strategies.
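The SCut strategy can be sketched in a few lines. This is a minimal illustration, not the method of any particular system: the validation scores and gold labels are made-up values, and effectiveness is measured here by F1 (an assumption; any effectiveness measure could be plugged in).

```python
# SCut sketch: pick the threshold tau_i that maximises F1 on validation data.

def f1(gold, pred):
    """F1 measure over parallel lists of gold and predicted booleans."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def scut(scores, gold):
    """Try each observed score as a candidate threshold; return the best."""
    best_tau, best_f1 = 0.0, -1.0
    for tau in sorted(set(scores)):
        pred = [s >= tau for s in scores]
        score = f1(gold, pred)
        if score > best_f1:
            best_tau, best_f1 = tau, score
    return best_tau

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]            # CSV_i(d) on validation docs (made-up)
gold = [True, True, False, True, False, False]     # gold category membership (made-up)
tau = scut(scores, gold)
```

Trying every observed score as a candidate threshold is sufficient here, since F1 only changes at those points.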
ML methods for learning CSV functions
- Symbolic, numeric and meta-classification methods.
- Numeric methods implement classification indirectly: the classification function f outputs a numerical score, and hard classification is obtained via thresholding.
  - e.g.: probabilistic classifiers, regression methods, ...
- Symbolic methods usually implement hard classification directly.
  - e.g.: decision trees, decision rules, ...
- Meta-classification methods combine results from independent classifiers.
  - e.g.: classifier ensembles, committees, ...
Probabilistic classifiers
- The CSV() of probabilistic classifiers produces an estimate of the conditional probability P(c|~d) = f(d, c) that an instance represented as ~d should be classified as c.
- The components of ~d are regarded as random variables T_i (1 ≤ i ≤ |T|).
- One would need to estimate probabilities for all possible representations, i.e. P(c|T_1, ..., T_n).
- Too costly in practice: for the discrete case with m possible nominal values per attribute, that is O(m^|T|).
- Independence assumptions help...
Conditional independence assumption
- Using Bayes' rule we get:

  P(c|~d_j) = P(c) P(~d_j|c) / P(~d_j)    (3)

- Naïve Bayes classifiers assume T_1, ..., T_n are independent of each other given the target category:

  P(~d|c) = ∏_{k=1}^{|T|} P(t_k|c)    (4)

- Maximum a posteriori hypothesis: choose the c that maximises (3).
- Maximum likelihood hypothesis: choose the c that maximises P(~d_j|c) (i.e. assume all c's are equally likely).
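A small numeric sketch of equations (3) and (4): the priors and per-term conditionals for the two hypothetical categories below are made-up illustrative numbers, not estimates from any dataset.

```python
# Naive Bayes MAP decision for a document with binary term features.
priors = {"sports": 0.5, "politics": 0.5}
cond = {  # P(t_k = 1 | c) for three vocabulary terms (made-up values)
    "sports":   [0.8, 0.1, 0.3],
    "politics": [0.2, 0.7, 0.4],
}

def unnorm_posterior(d, c):
    """P(c) * prod_k P(t_k|c): use p_k if term present, (1 - p_k) otherwise."""
    p = priors[c]
    for t, pk in zip(d, cond[c]):
        p *= pk if t else (1 - pk)
    return p

d = [1, 0, 1]  # document contains the first and third term
scores = {c: unnorm_posterior(d, c) for c in priors}
c_map = max(scores, key=scores.get)  # maximum a posteriori category
```

Note that the denominator P(~d_j) of (3) is the same for every category, so it can be dropped when only the argmax is needed.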
Variants of Naive Bayes classifiers
- Multi-variate Bernoulli models, in which features are modelled as Boolean random variables, and
- multinomial models, in which the variables represent count data [McCallum and Nigam, 1998].
- Continuous models, which use a numeric data representation: attributes are represented by continuous probability distributions.
- Using Gaussian distributions, the conditionals can be estimated as:

  P(T_i = t|c) = (1 / (σ√(2π))) e^{−(t−µ)² / (2σ²)}    (5)

- Non-parametric kernel density estimation has also been proposed [John and Langley, 1995].
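Equation (5) can be sketched directly; the mean and variance below are fitted to a handful of made-up attribute values for one class (illustrative numbers, not from any dataset).

```python
import math

def gaussian_conditional(t, mu, sigma):
    """P(T_i = t | c) under equation (5): a normal density."""
    return math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Fit mu and sigma from attribute values observed in class c (made-up data).
values = [0.2, 0.4, 0.6, 0.8]
mu = sum(values) / len(values)
var = sum((v - mu) ** 2 for v in values) / len(values)
sigma = math.sqrt(var)

density = gaussian_conditional(0.5, mu, sigma)
```

Since this is a density rather than a probability, it can exceed 1; that is fine, as it is only ever compared across categories.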
Some Uses of NB in NLP
- Information retrieval [Robertson and Jones, 1988]
- Text categorisation (see [Sebastiani, 2002] for a survey)
- Spam filters
- Word sense disambiguation [Gale et al., 1992]
CSV for multi-variate Bernoulli models
- Starting from the independence assumption

  P(~d|c) = ∏_{k=1}^{|T|} P(t_k|c)

- and Bayes' rule

  P(c|~d_j) = P(c) P(~d_j|c) / P(~d_j)

- derive a monotonically increasing function of P(c|~d):

  f(d, c) = ∑_{i=1}^{|T|} t_i log [ P(t_i|c)(1 − P(t_i|c̄)) / ( P(t_i|c̄)(1 − P(t_i|c)) ) ]    (6)

  where c̄ denotes the complement of category c.
- Need to estimate 2|T| rather than 2^|T| parameters.
Estimating the parameters
- For each term t_i ∈ T, let:
  - n_c ← the number of documents ~d such that f(~d, c) = 1
  - n_i ← the number of documents for which t_i = 1 and f(~d, c) = 1

    P(t_i|c) ← (n_i + 1) / (n_c + 2)    (7)

  (the added terms in numerator and denominator are for smoothing; see next slides)
  - n_c̄ ← the number of documents ~d such that f(~d, c) = 0
  - n_ī ← the number of documents for which t_i = 1 and f(~d, c) = 0

    P(t_i|c̄) ← (n_ī + 1) / (n_c̄ + 2)    (8)
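Equations (7) and (8) amount to counting documents and smoothing. A minimal sketch over made-up binary document vectors (the tiny corpus below is an illustrative assumption):

```python
# Estimate P(t_i|c) and P(t_i|c-bar) as in equations (7) and (8).
# docs: binary term vectors; labels: 1 if the document belongs to c.
docs = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
labels = [1, 1, 0, 0]

def estimate(docs, labels, i, positive=True):
    """(n_i + 1) / (n_c + 2) over documents in class c (or its complement)."""
    flag = 1 if positive else 0
    n_c = sum(1 for l in labels if l == flag)
    n_i = sum(1 for d, l in zip(docs, labels) if l == flag and d[i] == 1)
    return (n_i + 1) / (n_c + 2)

p_t0_given_c = estimate(docs, labels, 0, positive=True)      # eq. (7)
p_t0_given_cbar = estimate(docs, labels, 0, positive=False)  # eq. (8)
```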
An Alternative: multinomial models
- An alternative implementation of the Naïve Bayes classifier is described in [Mitchell, 1997].
- In this approach, words appear as values rather than as names of attributes.
- A document representation for this slide would look like this:

  ~d = ⟨a_1 = "an", a_2 = "alternative", a_3 = "implementation", ...⟩

- Problem: each attribute's value would range over the entire vocabulary. Many values would be missing for a typical document.
Dealing with missing values
- What if none of the training instances with target category c_j have attribute value a_i? Then P(a_i|c_j) = 0, and...

  P(c_j) ∏_i P(a_i|c_j) = 0

- What to do?
- Smoothing: make a Bayesian estimate for P(a_i|c_j):

  P(a_i|c_j) ← (n_c + m p) / (n + m)

  where:
  - n is the number of training examples for which C = c_j,
  - n_c is the number of examples for which C = c_j and A = a_i,
  - p is a prior estimate for P(a_i|c_j),
  - m is the weight given to the prior (i.e. the number of "virtual" examples).
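The m-estimate above in a few lines; the counts, prior p and weight m are made-up illustrative values (a uniform prior p = 1/|vocabulary| is a common assumption):

```python
def m_estimate(n_c, n, p, m):
    """Bayesian (m-)estimate of P(a_i|c_j): (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# A term never seen with class c_j: the raw estimate would be 0/50.
n, n_c = 50, 0
p = 1 / 1000     # uniform prior over a hypothetical 1000-word vocabulary
m = 1            # one "virtual" example
smoothed = m_estimate(n_c, n, p, m)  # small but non-zero
```

With m = 1 and a uniform prior this is a mild smoothing; larger m pulls the estimate further towards the prior.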
Learning in multinomial models
NB_Learn(Tr, C)
  /* collect all tokens that occur in Tr */
  T ← all distinct words and other tokens in Tr
  /* calculate P(c_j) and P(t_k|c_j) */
  for each target value c_j in C do
    Tr_j ← subset of Tr for which the target value is c_j
    P(c_j) ← |Tr_j| / |Tr|
    Text_j ← concatenation of all texts in Tr_j
    n ← total number of tokens in Text_j
    for each word t_k in T do
      n_k ← number of times word t_k occurs in Text_j
      P(t_k|c_j) ← (n_k + 1) / (n + |T|)
    done
  done

Note an additional assumption: position is irrelevant, i.e.:

  P(a_i = t_k|c_j) = P(a_m = t_k|c_j)  ∀ i, m
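The pseudocode above translates almost line-for-line into Python; the two-document corpus is a made-up illustration:

```python
from collections import Counter

def nb_learn(corpus):
    """corpus: list of (tokens, category) pairs. Returns the priors P(c)
    and add-one-smoothed conditionals P(t|c), as in the pseudocode."""
    vocab = {t for tokens, _ in corpus for t in tokens}
    cats = {c for _, c in corpus}
    prior, cond = {}, {}
    for c in cats:
        texts = [tokens for tokens, cat in corpus if cat == c]
        prior[c] = len(texts) / len(corpus)
        counts = Counter(t for tokens in texts for t in tokens)
        n = sum(counts.values())  # total number of tokens in Text_j
        cond[c] = {t: (counts[t] + 1) / (n + len(vocab)) for t in vocab}
    return prior, cond

corpus = [(["win", "match", "goal"], "sports"),
          (["vote", "match", "law"], "politics")]
prior, cond = nb_learn(corpus)
```

Thanks to the add-one smoothing, the conditionals for each category sum to 1 over the vocabulary even for unseen words.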
Sample Classification Algorithm
- Could calculate posterior probabilities for soft classification:

  f(d) = P(c) ∏_{k=1}^{n} P(t_k|c)

  (where n is the number of tokens in d that occur in T) and use thresholding as before.
- Or, for single-label text categorisation (SLTC), implement hard categorisation directly:

  positions ← all word positions in d that contain tokens found in T
  Return c_NB, where

  c_NB = argmax_{c_i ∈ C} P(c_i) ∏_{k ∈ positions} P(t_k|c_i)
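The hard categorisation step can be sketched as follows. Log probabilities are used to avoid floating-point underflow on long documents (a standard implementation choice, not part of the slide's formula), and the small model parameters are made-up numbers:

```python
import math

# Made-up trained parameters: priors and P(t|c) for a tiny vocabulary.
prior = {"sports": 0.5, "politics": 0.5}
cond = {
    "sports":   {"win": 0.4, "match": 0.3, "vote": 0.1, "law": 0.2},
    "politics": {"win": 0.1, "match": 0.2, "vote": 0.4, "law": 0.3},
}

def nb_classify(tokens):
    """Return argmax_c P(c) * prod_k P(t_k|c), computed in log space.
    Tokens outside the vocabulary are skipped, as in the pseudocode."""
    best, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c])
        for t in tokens:
            if t in cond[c]:
                score += math.log(cond[c][t])
        if score > best_score:
            best, best_score = c, score
    return best

label = nb_classify(["win", "match", "banana"])
```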
Classification Performance

[Mitchell, 1997]: given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from:

  comp.graphics              misc.forsale
  comp.os.ms-windows.misc    rec.autos
  comp.sys.ibm.pc.hardware   rec.motorcycles
  comp.sys.mac.hardware      rec.sport.baseball
  comp.windows.x             rec.sport.hockey
  alt.atheism                sci.space
  soc.religion.christian     sci.crypt
  talk.religion.misc         sci.electronics
  talk.politics.mideast      sci.med
  talk.politics.misc         talk.politics.guns

Naive Bayes: 89% classification accuracy.
Learning performance
- Learning curve for 20 Newsgroups:

  [Figure: learning curve omitted]

- NB: TFIDF and PRTFIDF are non-Bayesian probabilistic methods we will see later in the course. See [Joachims, 1996] for details.
NB and continuous variables
- Another model: suppose we want our document vectors to represent, say, the TF-IDF scores of each term in the document:

  ~d = ⟨a_1 = tfidf(t_1), ..., a_n = tfidf(t_n)⟩    (9)

- How would we estimate P(c|~d)?
- A: by assuming an underlying (e.g. normal) distribution:

  P(c|~d) ∝ ∏_{i=1}^{n} P(a_i|c), with P(a_i|c) = (1 / (σ_c√(2π))) e^{−(x−µ_c)² / (2σ_c²)}    (10)

  where µ_c and σ_c² are the mean and variance of the values taken by the attribute for positive instances of c.
Combining variables
- NB also allows you to combine different types of variables.
- The result would be a Bayesian network with continuous and discrete nodes. For instance:

  [Figure: a Bayesian network with class node C and attribute nodes a_1, a_2, ..., a_k, ..., a_n]

- See [Luz, 2012, Luz and Su, 2010] for examples of the use of such combined models in a different categorisation task.
Naive but subtle
- The conditional independence assumption is clearly false:

  P(a_1, a_2, ..., a_n|v_j) = ∏_i P(a_i|v_j)

- ...but NB works well anyway. Why?
- The posteriors P(v_j|x) don't need to be correct; we need only that:

  argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i|v_j) = argmax_{v_j ∈ V} P(v_j) P(a_1, ..., a_n|v_j)

In other words, error in NB classification is a zero-one loss function: classification is often correct even if the posteriors are unrealistically close to 1 or 0 [Domingos and Pazzani, 1996]. Performance can be optimal if dependencies are evenly distributed over classes, or if they cancel each other out [Zhang, 2004].
Other Probabilistic Classifiers
- Alternative approaches to probabilistic classifiers attempt to improve effectiveness by:
  - adopting weighted document vectors, rather than binary-valued ones;
  - introducing document length normalisation, in order to correct distortions in CSV_i introduced by long documents;
  - relaxing the independence assumption (the least adopted variant, since it appears that the binary independence assumption seldom affects effectiveness).
- But see, for instance, Hidden Naive Bayes [Zhang et al., 2005]...
References I

Domingos, P. and Pazzani, M. J. (1996). Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In International Conference on Machine Learning, pages 105–112.

Gale, W., Church, K., and Yarowsky, D. (1992). A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439.

Joachims, T. (1996). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical Report CMU-CS-96-118, CMU.

John, G. H. and Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. In Besnard, P. and Hanks, S., editors, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI'95), pages 338–345, San Francisco, CA, USA. Morgan Kaufmann Publishers.

Luz, S. (2012). The non-verbal structure of patient case discussions in multidisciplinary medical team meetings. ACM Transactions on Information Systems, 30(3):17:1–17:24.

Luz, S. and Su, J. (2010). Assessing the effectiveness of conversational features for dialogue segmentation in medical team meetings and in the AMI corpus. In Proceedings of the SIGDIAL 2010 Conference, pages 332–339, Tokyo. Association for Computational Linguistics.
References II

McCallum, A. and Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pages 41–48. AAAI Press.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.

Robertson, S. E. and Jones, K. S. (1988). Relevance weighting of search terms. In Document Retrieval Systems, pages 143–160. Taylor Graham Publishing, London.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.

Yang, Y. (2001). A study on thresholding strategies for text categorization. In Croft, W. B., Harper, D. J., Kraft, D. H., and Zobel, J., editors, Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-01), pages 137–145, New York. ACM Press.

Zhang, H. (2004). The optimality of Naive Bayes. In Proceedings of the 7th International Florida Artificial Intelligence Research Society Conference. AAAI Press.
References III

Zhang, H., Jiang, L., and Su, J. (2005). Hidden naive Bayes. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 919. AAAI Press / MIT Press.