

Combining labeled and unlabeled data for text categorization with a large number of categories

Rayid Ghani

KDD Lab Project

Supervised Learning with Labeled Data

Labeled data is required in large quantities and can be very expensive to collect.

Why use Unlabeled data?

Very cheap in the case of text: web pages, newsgroups, email messages

May not be as useful as labeled data, but it is available in enormous quantities

Goal

Make learning more efficient and easier by reducing the amount of labeled data required for text classification with a large number of categories

• ECOC: very accurate and efficient for text categorization with a large number of classes

• Co-Training: useful for combining labeled and unlabeled data with a small number of classes

Related research with unlabeled data

Using EM in a generative model (Nigam et al. 1999)

Transductive SVMs (Joachims 1999)

Co-Training type algorithms (Blum & Mitchell 1998, Collins & Singer 1999, Nigam & Ghani 2000)

What is ECOC?

ECOC (Error-Correcting Output Codes) solves multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)

Use a learner to learn the binary problems

Training ECOC

Class codewords, one row per class and one binary function per column:

         f1  f2  f3  f4  f5
    A     0   0   1   1   0
    B     1   0   1   0   0
    C     0   1   1   1   0
    D     0   1   0   0   1

Testing ECOC

Classify a test document with each binary function and assign the class whose codeword is nearest in Hamming distance. For example, a document X with predicted bits 1 1 1 1 0 is closest to C's codeword 0 1 1 1 0 (distance 1), so X is labeled C.
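A minimal sketch of the training and decoding steps, assuming bag-of-words count features and scikit-learn's MultinomialNB as the binary base learner; the code matrix is the one above, and the function and variable names are illustrative rather than from the talk:

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    # Class codewords from the slide: one row per class, one column per binary function f1..f5.
    classes = ["A", "B", "C", "D"]
    code = np.array([[0, 0, 1, 1, 0],
                     [1, 0, 1, 0, 0],
                     [0, 1, 1, 1, 0],
                     [0, 1, 0, 0, 1]])

    def train_ecoc(X, y):
        """Train one binary naive Bayes classifier per code column."""
        learners = []
        for j in range(code.shape[1]):
            y_binary = [code[classes.index(label), j] for label in y]  # relabel for column j
            learners.append(MultinomialNB().fit(X, y_binary))
        return learners

    def predict_ecoc(learners, X):
        """Predict one bit per column, then pick the class with the nearest codeword."""
        bits = np.column_stack([clf.predict(X) for clf in learners])   # n_docs x n_columns
        hamming = (bits[:, None, :] != code[None, :, :]).sum(axis=2)   # n_docs x n_classes
        return [classes[i] for i in hamming.argmin(axis=1)]

In practice the codes are longer (the experiments later in the talk use 15-, 31-, and 63-bit codes) and are chosen so that the codewords are well separated in Hamming distance.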

The Co-training algorithm

Loop (while unlabeled documents remain):
    Build classifiers A and B (naïve Bayes) from the labeled data
    Classify the unlabeled documents with A and B
    Add the most confident A predictions and the most confident B predictions to the labeled training examples

[Blum & Mitchell 1998]
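Read as code, the loop might look like the following sketch (two bag-of-words views XA/XB, scikit-learn naive Bayes; unlike Blum & Mitchell's version, which adds a fixed number of positive and negative examples per round, this simply adds the most confident predictions overall, and all names are illustrative):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def co_train(XA_lab, XB_lab, y_lab, XA_unl, XB_unl, per_round=5):
        """Co-training sketch: classifiers over views A and B teach each other."""
        XA_lab, XB_lab, y_lab = list(XA_lab), list(XB_lab), list(y_lab)
        unlabeled = list(range(len(XA_unl)))
        while unlabeled:
            clf_a = MultinomialNB().fit(XA_lab, y_lab)          # build classifier A
            clf_b = MultinomialNB().fit(XB_lab, y_lab)          # build classifier B
            for clf, X_view in ((clf_a, XA_unl), (clf_b, XB_unl)):
                if not unlabeled:
                    break
                probs = clf.predict_proba([X_view[i] for i in unlabeled])
                best = np.argsort(probs.max(axis=1))[-per_round:]   # most confident predictions
                for b in sorted(best, reverse=True):
                    i = unlabeled.pop(b)
                    XA_lab.append(XA_unl[i])
                    XB_lab.append(XB_unl[i])
                    y_lab.append(clf.classes_[probs[b].argmax()])   # use clf's own prediction as the label
        # final classifiers trained on the enlarged labeled set
        return MultinomialNB().fit(XA_lab, y_lab), MultinomialNB().fit(XB_lab, y_lab)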

The Co-training Algorithm

[Diagram: a naïve Bayes classifier on view A and one on view B each learn from the labeled data, estimate labels for the unlabeled documents, and their most confident predictions are added to the labeled data.]

[Blum & Mitchell, 1998]

One Intuition behind co-training

A and B are redundant: either view suffices to classify
A features are independent of B features given the class
Co-training is then like learning with random classification noise: the most confident A predictions look like randomly drawn examples to B, labeled with A's small misclassification error

ECOC + CoTraining = ECoTrain

ECOC decomposes multiclass problems into binary problems

Co-Training works great with binary problems

ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training (see the sketch below)

[Diagram: example categories — Sports, Science, Arts, Health, Politics, Law]
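A sketch of the combination, reusing the code matrix, classes list, and co_train function from the sketches above (illustrative only):

    def ecotrain(XA_lab, XB_lab, y_lab, XA_unl, XB_unl):
        """For each ECOC column, relabel the multiclass data into a binary
        problem and solve that binary problem with co-training over the two views."""
        column_classifiers = []
        for j in range(code.shape[1]):
            yj = [code[classes.index(label), j] for label in y_lab]   # 0/1 labels for column j
            column_classifiers.append(co_train(XA_lab, XB_lab, yj, XA_unl, XB_unl))
        # At test time, take one bit per column (e.g. from classifier A of each pair) and
        # assign the class whose codeword is nearest in Hamming distance, as in predict_ecoc.
        return column_classifiers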

What happens with sparse data?

[Chart: percent decrease in error vs. labeled training set size, for 15-bit, 31-bit, and 63-bit codes.]

Datasets

Hoovers-255: a collection of 4285 corporate websites; each company is classified into one of 255 categories; baseline 2%

Jobs-65 (from WhizBang): job postings with two feature sets (title and description); 65 categories; baseline 11%

[Chart: class distribution, percentage of documents per class.]

Results

Dataset       Naïve Bayes (no unlabeled data)   ECOC (no unlabeled data)   EM            Co-Training   ECOC + Co-Training
              10% labeled   100% labeled        10% labeled  100% labeled  10% labeled   10% labeled   10% labeled
Jobs-65       50.1          68.2                59.3         71.2          58.2          54.1          64.5
Hoovers-255   15.2          32.0                24.8         36.5           9.1          10.2          27.6

Results

[Chart: precision vs. recall for ECOC + Co-Training, naïve Bayes, and EM.]

What Next?

Use an improved version of co-training (gradient descent): less prone to random fluctuations, and uses all unlabeled data at every iteration

Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training

Summary

The Co-training setting

[Figure: classifying personal web pages with two views. One view is the anchor text of hyperlinks pointing at a page ("…My advisor…", "…Professor Blum…", "…My grandson…"); the other is the text on the page itself (Tom Mitchell: "Fredkin Professor of AI…"; Avrim Blum: "My research interests are…"; Johnny: "I like horsies!"). Classifier A and Classifier B each use one of the two views.]

Learning from Labeled and Unlabeled Data: Using Feature Splits

Co-training [Blum & Mitchell 98]
Meta-bootstrapping [Riloff & Jones 99]
coBoost [Collins & Singer 99]
Unsupervised WSD [Yarowsky 95]

Consider this the co-training setting

Learning from Labeled and Unlabeled Data: Extending supervised learning

MaxEnt Discrimination [Jaakkola et al. 99]

Expectation Maximization [Nigam et al. 98]

Transductive SVMs [Joachims 99]

Using Unlabeled Data with EM

Estimate labels of the unlabeled documents
Use all documents to build a new naïve Bayes classifier
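A sketch of this loop with a multinomial naive Bayes, keeping the probabilistic labels by feeding each unlabeled document to every class with a sample weight equal to its estimated class probability (dense count matrices assumed; names and the iteration count are illustrative):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def em_naive_bayes(X_lab, y_lab, X_unl, iterations=10):
        """EM sketch: estimate labels of unlabeled docs, rebuild naive Bayes from all docs."""
        clf = MultinomialNB().fit(X_lab, y_lab)               # start from the labeled data only
        for _ in range(iterations):
            probs = clf.predict_proba(X_unl)                  # E-step: probabilistic labels
            # M-step: rebuild from all documents; each unlabeled document counts
            # toward every class, weighted by its estimated class probability.
            X_all = np.vstack([X_lab] + [X_unl] * len(clf.classes_))
            y_all = np.concatenate([np.asarray(y_lab)] +
                                   [np.full(len(X_unl), c) for c in clf.classes_])
            w_all = np.concatenate([np.ones(len(y_lab))] +
                                   [probs[:, k] for k in range(len(clf.classes_))])
            clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
        return clf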

Co-training vs. EM

                 Co-training        EM
Feature split    uses it            ignores it
Labeling         incremental        iterative
Labels           hard               probabilistic

Which differences matter?

Hybrids of Co-training and EM

                      Uses Feature Split?
Labeling              Yes                No
Incremental           co-training        self-training
Iterative             co-EM              EM

[Diagrams: co-EM uses a naïve Bayes per view (A and B), labels all unlabeled documents, and learns from all of them; self-training and EM use a single naïve Bayes over the A & B features together, with self-training adding only the best predictions and EM labeling all documents and learning from all of them.]

Learning from Unlabeled Data using Feature Splits

coBoost [Collins & Singer 99]
Meta-bootstrapping [Riloff & Jones 99]
Unsupervised WSD [Yarowsky 95]
Co-training [Blum & Mitchell 98]


Using Unlabeled Data with EM

Estimate labels of the unlabeled documents
Use all documents to rebuild the naïve Bayes classifier
Initially, learn from the labeled data only

[Nigam, McCallum, Thrun & Mitchell, 1998]

Co-EM

[Diagram: a naïve Bayes on view A and a naïve Bayes on view B alternately estimate labels for the unlabeled documents and rebuild a naïve Bayes classifier from all of the data.]
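A co-EM sketch with hard labels for brevity (the algorithm as described keeps EM-style probabilistic labels); XA/XB are the two feature views and all names are illustrative:

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def co_em(XA_lab, XB_lab, y_lab, XA_unl, XB_unl, iterations=10):
        """Co-EM sketch: each view labels every unlabeled document, and the
        other view's classifier is rebuilt from all of the data."""
        clf_a = MultinomialNB().fit(XA_lab, y_lab)            # initialize from the labeled data only
        for _ in range(iterations):
            y_from_a = clf_a.predict(XA_unl)                  # A estimates labels for all unlabeled docs
            clf_b = MultinomialNB().fit(np.vstack([XB_lab, XB_unl]),
                                        np.concatenate([y_lab, y_from_a]))
            y_from_b = clf_b.predict(XB_unl)                  # B estimates labels for all unlabeled docs
            clf_a = MultinomialNB().fit(np.vstack([XA_lab, XA_unl]),
                                        np.concatenate([y_lab, y_from_b]))
        return clf_a, clf_b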

[Chart: algorithms arranged by whether they use the feature split (co-EM and co-training: yes; EM: no) and whether they label all unlabeled documents (EM, co-EM) or only a few (co-training); all are initialized with the labeled data.]
