
Page 1: Text Classification from Labeled and Unlabeled Documents using EM

Text Classification from Labeled and Unlabeled Documents using EM

- Kamal Nigam
- Andrew Kachites McCallum
- Sebastian Thrun
- Tom Mitchell

Presented by

Yuan Fang, Fengyuan Hu and Sandhya Prabhakaran

Page 2: Text Classification from Labeled and Unlabeled Documents using EM

Job Hunting?

Page 3: Text Classification from Labeled and Unlabeled Documents using EM

Roadmap

Part 1 – Text Classification

Part 2 – Incorporating Unlabeled data with EM

Part 3 – Results and Recap

Page 4: Text Classification from Labeled and Unlabeled Documents using EM

Part I Text Classification

Page 5: Text Classification from Labeled and Unlabeled Documents using EM

Text Classification – the Definition

“Text classification systems categorize documents into one (or several) of a set of pre-defined topics of interest”

Page 6: Text Classification from Labeled and Unlabeled Documents using EM

How Are Automatic Text Classifiers Created?

Before: manual construction of rule sets (painful and time-consuming)

Present: supervised learning to construct a classifier (efficient and successful)

Page 7: Text Classification from Labeled and Unlabeled Documents using EM

What To Provide

Provide an algorithm with an example set of documents for each class, and allow it to find a representation or decision rule for classifying future documents automatically.

This approach will:
- give high-accuracy classifiers
- be significantly less expensive

Page 8: Text Classification from Labeled and Unlabeled Documents using EM

What Data is Available

Key difficulty: a large number of labeled training examples is required to learn accurately – what we need but don't have.

One would obviously prefer algorithms that can provide accurate classifications after hand labeling only a dozen articles, rather than thousands

What other sources of information can reduce the need for labeled data?

Page 9: Text Classification from Labeled and Unlabeled Documents using EM

Unlabeled data

How unlabeled data can be used to increase classification accuracy, especially when labeled data are scarce

An intuitive example

Page 10: Text Classification from Labeled and Unlabeled Documents using EM

Goal And Merit

The goal – to demonstrate that supervised learning algorithms can use a small number of labeled examples together with a large number of unlabeled examples to create high-accuracy text classifiers.

The merit – unlabeled examples are much less expensive and easily available.

Page 11: Text Classification from Labeled and Unlabeled Documents using EM

Parametric Generative Model Overview

Assumption : a statistical process generates the documents (words and class labels)

This statistical process is modeled as a parametric generative model.

Page 12: Text Classification from Labeled and Unlabeled Documents using EM

Incorporating Unlabeled Data withGenerative Models

Using EM to find high-probability parameters of the model given a combination of labeled and unlabeled data

Experimental evidence shows that using unlabeled data with EM can increase classification accuracy

Page 13: Text Classification from Labeled and Unlabeled Documents using EM

Assumptions In the Model

(1) Documents are generated by a mixture of multinomials model, where each mixture component corresponds to a class (1 class to 1 component)

(2) The mixture components are multinomial distributions of individual words - the words are produced independently of each other given the class

Page 14: Text Classification from Labeled and Unlabeled Documents using EM

Two Multisided Dice

Let there be |C| classes and a vocabulary of size |V|; each document d has |d| words in it.

First, we roll a biased |C|-sided die to determine the class of our document.

We roll the biased |V|-sided die that corresponds to the chosen class |d| times and write down the indicated words. These words form the generated document.
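To make the two-dice picture concrete, here is a minimal sketch of the generative process; the class priors, word distributions, and vocabulary are invented toy values, not from the paper:

```python
import random

# Hypothetical toy parameters: |C| = 2 classes, |V| = 4 words.
class_priors = [0.6, 0.4]          # the biased |C|-sided die
word_dists = [
    [0.5, 0.3, 0.1, 0.1],          # the |V|-sided die for class 0
    [0.1, 0.1, 0.4, 0.4],          # the |V|-sided die for class 1
]
vocab = ["ball", "team", "gene", "cell"]

def generate_document(length=5):
    # Roll the |C|-sided die once to pick the class ...
    c = random.choices(range(len(class_priors)), weights=class_priors)[0]
    # ... then roll that class's |V|-sided die |d| times.
    words = random.choices(vocab, weights=word_dists[c], k=length)
    return c, words

print(generate_document())
```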

Page 15: Text Classification from Labeled and Unlabeled Documents using EM

Parametric Generative Model

The probability of a document is a mixture over the classes:

$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)$   (Equation 1)

- $\theta$ – parameters for the mixture model
- $c_j$ – mixture components
- $P(c_j \mid \theta)$ – mixture weights or class probabilities
- $P(d_i \mid c_j; \theta)$ – document distribution of the selected class

Page 16: Text Classification from Labeled and Unlabeled Documents using EM

Notation

- $c_j$ – the jth mixture component, as well as the jth class.
- $y_i$ – the class label for a particular document $d_i$ ($y_i = c_j$ when $d_i$ belongs to class $c_j$).
- A document $d_i$ is considered to be an ordered list of word events; we write $w_{d_i,k}$ for the word in position $k$ of $d_i$, where $w_t$ is a word in the vocabulary $V$.
- $|d_i|$ – document length, chosen independently of the component, with its own probability distribution.

Page 17: Text Classification from Labeled and Unlabeled Documents using EM

Parametric Generative Model

Expanding Equation (1) with the document length and the words in the document:

$P(d_i \mid c_j; \theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta, w_{d_i,q}, q<k)$   (Equation 2)

The words of a document are generated independently of context:

$P(w_{d_i,k} \mid c_j; \theta, w_{d_i,q}, q<k) = P(w_{d_i,k} \mid c_j; \theta)$   (Equation 3)

Combining these last two equations gives the naive Bayes expression for the probability of a document given its class:

$P(d_i \mid c_j; \theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta)$   (Equation 4)

Page 18: Text Classification from Labeled and Unlabeled Documents using EM

Model Parameters

Collection of word probabilities, each written $\theta_{w_t \mid c_j} = P(w_t \mid c_j; \theta)$.

Document length is identically distributed across classes, so it need not be parameterized for classification.

The mixture weights (class probabilities) are denoted $\theta_{c_j} = P(c_j \mid \theta)$.

The complete collection of model parameters: $\theta = \{\theta_{w_t \mid c_j}, \theta_{c_j}\}$.

Page 19: Text Classification from Labeled and Unlabeled Documents using EM

Naive Bayes Text Classification

Using a collection of labeled documents for training.

Finding the most probable parameters for the statistical model introduced.

Page 20: Text Classification from Labeled and Unlabeled Documents using EM

Training A Naive Bayes Classifier With Labeled Data

Estimating the parameters of the generative model by using a set of labeled training data (the estimate of the parameters is written $\hat\theta$).

Finding the MAP estimate: the value of $\theta$ that is most probable given the evidence of the training data and a prior, $\hat\theta = \operatorname{argmax}_\theta P(\theta \mid D)$.

Page 21: Text Classification from Labeled and Unlabeled Documents using EM

Training A Naive Bayes Classifier With Labeled Data

The word probability estimates are given by

$\hat\theta_{w_t \mid c_j} = P(w_t \mid c_j; \hat\theta) = \dfrac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\, P(y_i = c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\, P(y_i = c_j \mid d_i)}$   (Equation 6)

where $N(w_t, d_i)$ is the count of word $w_t$ in document $d_i$.

Class probabilities:

$\hat\theta_{c_j} = P(c_j \mid \hat\theta) = \dfrac{1 + \sum_{i=1}^{|D|} P(y_i = c_j \mid d_i)}{|C| + |D|}$   (Equation 7)
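A minimal sketch of these MAP estimates on toy data (the documents, labels, and vocabulary are invented for illustration; the +1 terms are the Laplace smoothing of Equations 6 and 7):

```python
import numpy as np

# Hypothetical toy corpus: 3 labeled documents over a 4-word vocabulary.
vocab = ["ball", "team", "gene", "cell"]
docs = [["ball", "team", "ball"], ["gene", "cell"], ["team", "ball"]]
labels = [0, 0, 1]            # hard labels, so P(y_i = c_j | d_i) is 0 or 1
n_classes, V = 2, len(vocab)

# N[t, j] = total count of word w_t over the documents of class c_j
N = np.zeros((V, n_classes))
for doc, y in zip(docs, labels):
    for w in doc:
        N[vocab.index(w), y] += 1

# Equation (6): Laplace-smoothed word probabilities theta_{w_t|c_j}
theta_w = (1 + N) / (V + N.sum(axis=0))

# Equation (7): class probabilities theta_{c_j}
theta_c = (1 + np.bincount(labels, minlength=n_classes)) / (n_classes + len(docs))
```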

Page 22: Text Classification from Labeled and Unlabeled Documents using EM

Classifying New Documents with Naive Bayes

$P(y_i = c_j \mid d_i; \hat\theta) = \dfrac{P(c_j \mid \hat\theta)\, P(d_i \mid c_j; \hat\theta)}{P(d_i \mid \hat\theta)} = \dfrac{\hat\theta_{c_j} \prod_{k=1}^{|d_i|} \hat\theta_{w_{d_i,k} \mid c_j}}{\sum_{r=1}^{|C|} \hat\theta_{c_r} \prod_{k=1}^{|d_i|} \hat\theta_{w_{d_i,k} \mid c_r}}$   (Equation 8)

If the task is to classify a test document into a single class, the class with the highest posterior probability is selected.
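A sketch of Equation (8) computed in log space to avoid underflow on long documents, continuing the toy parameters from the previous sketch; an illustration, not the paper's code:

```python
import numpy as np

def classify(doc, theta_c, theta_w, vocab):
    # log P(c_j) + sum_k log P(w_{d_i,k} | c_j), for every class j at once
    log_post = np.log(theta_c).copy()
    for w in doc:
        log_post += np.log(theta_w[vocab.index(w)])
    # normalize with log-sum-exp to obtain P(y_i = c_j | d_i; theta)
    log_post -= np.logaddexp.reduce(log_post)
    return np.exp(log_post)

# e.g. classify(["gene", "cell"], theta_c, theta_w, vocab)
# returns one posterior per class; argmax gives the predicted class.
```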

Page 23: Text Classification from Labeled and Unlabeled Documents using EM

Part II Incorporating Unlabeled Data with EM

Page 24: Text Classification from Labeled and Unlabeled Documents using EM

The Problem

The case with only labeled data has already been explained:
- MAP – maximize the posterior probability of the parameters.
- Naive Bayes – classify using the labeled data.

Now the case is that both labeled and unlabeled data are given. Searching for a solution? – Here it is!

Page 25: Text Classification from Labeled and Unlabeled Documents using EM

Revision of EM

Recall the EM material from PMR – might be painful, but helpful.

Mixture model: a hidden variable $z$ activates the components.

Page 26: Text Classification from Labeled and Unlabeled Documents using EM

Revision of EM

EM applied to the Gaussian mixture model: maximum likelihood estimation of the parameters $\mu$ and $\Sigma$.

E step: evaluate the responsibilities using the current parameter estimates.

M step: re-estimate the parameters using the current responsibilities.

Run the demo.
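Since the Netlab demo may not be to hand, here is a compact numpy sketch of EM for a one-dimensional Gaussian mixture; the data and the choice of two components are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])  # toy data

K = 2
pi, mu, var = np.full(K, 1 / K), np.array([-1.0, 1.0]), np.ones(K)

for _ in range(50):
    # E step: responsibilities r[n, k] proportional to pi_k * N(x_n | mu_k, var_k)
    r = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    r /= r.sum(axis=1, keepdims=True)
    # M step: re-estimate pi, mu and var from the responsibilities
    Nk = r.sum(axis=0)
    pi, mu = Nk / len(x), (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
```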

Page 27: Text Classification from Labeled and Unlabeled Documents using EM

Back to the paper

Page 28: Text Classification from Labeled and Unlabeled Documents using EM

Back to the paper

Collection of labeled and unlabeled documents: $D = D^l \cup D^u$.

MAP: try to maximize $P(\theta \mid D)$. By Bayes' rule, $P(\theta \mid D) \propto P(\theta)\, P(D \mid \theta)$.

Page 29: Text Classification from Labeled and Unlabeled Documents using EM

Back to the paper

Log likelihood of the parameters given both the labeled and the unlabeled data; the class labels of the unlabeled documents are missing, so this is an incomplete-data problem:

$l(\theta \mid D) = \log P(\theta) + \sum_{d_i \in D^u} \log \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) + \sum_{d_i \in D^l} \log \big( P(y_i \mid \theta)\, P(d_i \mid y_i; \theta) \big)$

Page 30: Text Classification from Labeled and Unlabeled Documents using EM

Back to the paper

$z_{ij}$ – binary indicator variables, set to 1 if $y_i = c_j$, else 0.

With the indicators $z$ given, the problem of the incomplete log likelihood transfers to the complete log likelihood of the parameters:

$l_c(\theta \mid D; z) = \log P(\theta) + \sum_{d_i \in D} \sum_{j=1}^{|C|} z_{ij} \log \big( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \big)$

Page 31: Text Classification from Labeled and Unlabeled Documents using EM

Back to the paper

Methods used in the paper:
- Basic EM
- Augmented EM:

(1) Weighting the unlabeled data

(2) Multiple mixture components per class

Page 32: Text Classification from Labeled and Unlabeled Documents using EM

Basic EM

Initialize the NB classifier using MAP parameter estimation from the labeled set only.

E step: estimate the component membership of each unlabeled document by calculating its expected value, $z^{k+1} = E[z \mid D; \hat\theta^k]$, i.e. the class posteriors $P(c_j \mid d_i; \hat\theta^k)$, from the unlabeled data only.

M step: re-estimate the classifier over the whole data set using MAP, $\hat\theta^{k+1} = \operatorname{argmax}_\theta P(\theta \mid D; z^{k+1})$, then loop from the E step.

Look at the complete log likelihood $l_c(\theta \mid D; z)$ to measure the improvement of the parameters and decide when to stop the loop.
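A schematic sketch of basic EM, assuming bag-of-words count matrices; train_map and posteriors are hypothetical helpers standing in for Equations (6)-(8), and the labeled memberships stay fixed at their 0/1 labels. An illustration under those assumptions, not the authors' implementation:

```python
import numpy as np

def train_map(z, counts, V):
    """MAP estimates (Equations 6 and 7) from soft memberships.
    counts[i, t] = N(w_t, d_i); z[i, j] = P(y_i = c_j | d_i)."""
    Nc = counts.T @ z                                  # expected word counts per class
    theta_w = (1 + Nc) / (V + Nc.sum(axis=0))
    theta_c = (1 + z.sum(axis=0)) / (z.shape[1] + len(counts))
    return theta_c, theta_w

def posteriors(counts, theta_c, theta_w):
    """E step: P(c_j | d_i; theta) via Equation (8), computed in log space."""
    log_p = np.log(theta_c) + counts @ np.log(theta_w)
    log_p -= np.logaddexp.reduce(log_p, axis=1, keepdims=True)
    return np.exp(log_p)

def basic_em(counts_l, labels, counts_u, n_classes, n_iter=10):
    V = counts_l.shape[1]
    z_l = np.eye(n_classes)[labels]                    # labeled memberships stay fixed
    theta_c, theta_w = train_map(z_l, counts_l, V)     # initialize from labeled data only
    for _ in range(n_iter):                            # or loop until l_c stops improving
        z_u = posteriors(counts_u, theta_c, theta_w)   # E step on the unlabeled documents
        theta_c, theta_w = train_map(np.vstack([z_l, z_u]),
                                     np.vstack([counts_l, counts_u]), V)  # M step, all data
    return theta_c, theta_w
```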

Page 33: Text Classification from Labeled and Unlabeled Documents using EM

Restrictions of Basic EM

Assumptions/Restrictions:

- Large unlabeled data set, small labeled data set → if not true, unlabeled data will hurt the accuracy.
- One-to-one correspondence of components and classes → not so accurate, because subtopics exist.

Page 34: Text Classification from Labeled and Unlabeled Documents using EM

Augmented EM – weighting unlabeled data

Method: weaken the contribution of the unlabeled data when the labeled set is already good enough for classification.

Equation:

$l_c(\theta \mid D; z) = \log P(\theta) + \sum_{d_i \in D^l} \sum_{j=1}^{|C|} z_{ij} \log \big( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \big) + \lambda \left( \sum_{d_i \in D^u} \sum_{j=1}^{|C|} z_{ij} \log \big( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \big) \right)$

Page 35: Text Classification from Labeled and Unlabeled Documents using EM

Augmented EM – weighting unlabeled data

$\lambda$ is decided by leave-one-out cross-validation. A weighting function $\Lambda(i)$ is defined to tell whether document $d_i$ is labeled or unlabeled: $\Lambda(i) = \lambda$ if $d_i \in D^u$, and $\Lambda(i) = 1$ if $d_i \in D^l$.

Modified MAP parameters:

$\hat\theta_{w_t \mid c_j} = \dfrac{1 + \sum_{i=1}^{|D|} \Lambda(i)\, N(w_t, d_i)\, P(y_i = c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} \Lambda(i)\, N(w_s, d_i)\, P(y_i = c_j \mid d_i)}$
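In code, the $\Lambda(i)$ weighting amounts to scaling each unlabeled document's soft membership before the counts are pooled; a rough sketch reusing the hypothetical train_map, z_l, z_u, and count matrices from the basic EM sketch:

```python
import numpy as np

# M step of EM-lambda: scale each unlabeled document's soft membership by lam,
# so it contributes with weight Lambda(i) = lam (labeled documents keep weight 1).
lam = 0.3                                 # hypothetical value; set by leave-one-out CV
z_all = np.vstack([z_l, lam * z_u])
counts_all = np.vstack([counts_l, counts_u])
theta_c, theta_w = train_map(z_all, counts_all, V)
# A faithful version would also replace |D| in the class-prior denominator
# with the weighted document count sum_i Lambda(i).
```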

Page 36: Text Classification from Labeled and Unlabeled Documents using EM

Augmented EM – multiple mixture components per class

Method: relax the assumption of a one-to-one correspondence between components and classes.

Allow a many-to-one relationship between components and classes.

Page 37: Text Classification from Labeled and Unlabeled Documents using EM

Augmented EM – multiple mixture components per class

How?
- Decide the number of components per class, again by cross-validation.
- Define the mapping from components to classes: $P(t_a \mid c_j) \in \{0, 1\}$, a deterministic many-to-one assignment of components $c_j$ to class labels $t_a$ (see the sketch below).
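With several components per class, the class posterior is simply the sum of the posteriors of the components mapped to that class; a small sketch in which the two-components-per-class mapping is invented:

```python
import numpy as np

# Hypothetical mapping: components 0, 1 -> class 0; components 2, 3 -> class 1.
component_to_class = np.array([0, 0, 1, 1])

def class_posterior(component_post, n_labels=2):
    """Sum the component posteriors P(c_j | d) over each class t_a."""
    out = np.zeros(n_labels)
    np.add.at(out, component_to_class, component_post)
    return out

# e.g. class_posterior(np.array([0.1, 0.2, 0.3, 0.4])) -> array([0.3, 0.7])
```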

Page 38: Text Classification from Labeled and Unlabeled Documents using EM

The complete algorithm

Given collections of labeled and unlabeled documents, $D = D^l \cup D^u$:

- Set $\lambda$ by cross-validation.
- Set the number of components per class.
- Randomly assign documents to the mixture components.
- Initialize the parameters $\theta$ of the NB classifier using MAP.
- Loop until the complete log likelihood of the labeled and unlabeled data is satisfactory:
  - E step: estimate the component membership $P(c_j \mid d_i; \hat\theta)$ of each document using $\theta$.
  - M step: re-estimate $\theta$ given the memberships, still by MAP.

Page 39: Text Classification from Labeled and Unlabeled Documents using EM

Comparison

Basic EM: performs well compared with the naive Bayes classifier alone, given a large unlabeled data set and a small set of labeled data.

EM-λ: can clearly improve the accuracy when the assumption above does not hold.

Multiple components: dramatically outperforms basic EM.

Page 40: Text Classification from Labeled and Unlabeled Documents using EM

Part III Results and Recap

Page 41: Text Classification from Labeled and Unlabeled Documents using EM

Experimental Results

Empirical evidence that combining labeled with unlabeled data using EM outperforms naive Bayes alone.

20 Newsgroups, WebKB, Reuters

Improvements in accuracy due to unlabeled data are dramatic, especially when the number of labeled documents is low.

Augmented EM can increase performance even in cases where basic EM performs poorly because of the large amount of unlabeled data.

Page 42: Text Classification from Labeled and Unlabeled Documents using EM

Data sets and Protocols

- 20 Newsgroups

20,017 articles divided evenly among 20 different UseNet discussion groups.

Task - to classify an article into the one newsgroup to which it was posted.

Many categories fall into confusable clusters. Stop words are removed, leaving 62,258 unique words. Word counts are normalized and scaled so that each document has constant length.

Page 43: Text Classification from Labeled and Unlabeled Documents using EM

Data sets and Protocols

- WebKB

8,145 Web pages gathered from university computer science departments.

4,199 pages are chosen, covering four categories: student, faculty, course and project.

Task - to classify a web page into one of the four categories.

Stemming and a stoplist are not used. The vocabulary is limited to the 300 most informative words, selected using leave-one-out cross-validation.

Page 44: Text Classification from Labeled and Unlabeled Documents using EM

Data sets and Protocols

- Reuters

12,902 articles and 90 topic categories. Task – to build a binary classifier for each of the ten most populous classes, identifying the news topic.

Words inside the <TEXT> tags are used – the REUTERS and &# tokens are not. A stoplist is used, but no stemming. The metrics are recall and precision instead of accuracy.

Page 45: Text Classification from Labeled and Unlabeled Documents using EM

Precision-Recall breakeven point

• Standard information retrieval measure

• Recall = (number of correct positive predictions) / (number of positive examples)

• Precision = (number of correct positive predictions) / (number of positive predictions)
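A small sketch of locating the breakeven point by sweeping the decision threshold over the classifier's scores; the scores and labels are toy values:

```python
import numpy as np

def breakeven(scores, y_true):
    """Precision-recall breakeven: sweep the decision threshold down the
    sorted scores and return the point where precision is closest to recall."""
    order = np.argsort(-scores)
    tp = np.cumsum(y_true[order])              # true positives at each cutoff
    precision = tp / np.arange(1, len(scores) + 1)
    recall = tp / y_true.sum()
    i = np.argmin(np.abs(precision - recall))
    return (precision[i] + recall[i]) / 2

scores = np.array([0.9, 0.8, 0.4, 0.3])        # toy classifier scores
labels = np.array([1, 0, 1, 0])                # toy binary labels
print(breakeven(scores, labels))               # 0.5 for this toy data
```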

Page 46: Text Classification from Labeled and Unlabeled Documents using EM

Wall-clock timing

EM usually converges after 10 iterations

Less than 1 minute for the WebKB

Less than 15 minutes for 20 Newsgroups – huge vocabulary and more documents

Page 47: Text Classification from Labeled and Unlabeled Documents using EM

EM with unlabeled data increases Accuracy

Figure 1:- Accuracy versus # of Labeled Documents. (20 Newsgroups)

Page 48: Text Classification from Labeled and Unlabeled Documents using EM

Effect of varying the # of unlabeled documents

Figure 2:- Accuracy versus # of unlabeled documents. (20 Newsgroups)

Page 49: Text Classification from Labeled and Unlabeled Documents using EM

EM algorithm in action

Figure 3:- ‘Course’ class for WebKB dataset

Page 50: Text Classification from Labeled and Unlabeled Documents using EM

EM performance degradation

Figure 4:- When many labeled documents are available, adding more unlabeled data lowers classifier accuracy, showing the importance of the weighting factor λ. (WebKB)

Page 51: Text Classification from Labeled and Unlabeled Documents using EM

Effects of different EM

Figure 5:- Comparison between EM, CV EM-λ and EM-λ (WebKB)

Page 52: Text Classification from Labeled and Unlabeled Documents using EM

Performance of EM on different # of mixture components

Figure 6:- Too few or too many mixture components result in poor performance. Unlabeled data is used. (Reuters)

Page 53: Text Classification from Labeled and Unlabeled Documents using EM

Precision-Recall breakeven points

Figure 7:- Comparison between NB and EM on Reuters dataset

Page 54: Text Classification from Labeled and Unlabeled Documents using EM

Related Work

EM is a well-known family of algorithms that works by treating unclassified data as incomplete.

Miller et al. applied EM to non-textual tasks using mixtures of Gaussians, assuming the unlabeled data to be sufficient for estimating the parameter values.

Castelli and Cover - unlabeled data does not improve the classification results in the absence of labeled data.

EM can be combined with active learning to improve performance – then only slightly more than half of the labeled data was enough!

EM can also be combined with other machine learning algorithms, such as SVM and kNN.

Page 55: Text Classification from Labeled and Unlabeled Documents using EM

Punchwords

• Text classification

• Naive Bayes

• Expectation Maximisation Algorithm

• EM-λ

• Multiple Mixture models for subclass

• Leave-one-out cross validation

• Stemming and stoplist words

• Accuracy, Precision, Recall

Page 56: Text Classification from Labeled and Unlabeled Documents using EM

Recap

A family of algorithms has been presented to address text classification using voluminous unlabeled data and scarce labeled data.

When data is consistent with the assumptions - Basic EM performs well.

When data is not consistent – two extensions help:

- EM-λ: controlling the contribution of unlabeled data.

- Multiple Mixture Components per Class: “many-to-one” constraint.

Page 57: Text Classification from Labeled and Unlabeled Documents using EM

References

Using Unlabeled Data to Improve Text Classification, Kamal Nigam, May 2001 – www.kamalnigam.com/papers/thesis-nigam.pdf

Netlab toolkit – www.ncrg.aston.ac.uk/netlab/

Validation lecture – Intelligent Sensor Systems, Ricardo Gutierrez-Osuna, Wright State University

Page 58: Text Classification from Labeled and Unlabeled Documents using EM

Question Time!!

Route further questions to ...

Ryan - 0789317

Neo - 0785401

Sandhya - 0671562