Efficient Text Categorization with a Large Number of Categories Rayid Ghani KDD Project Proposal


Efficient Text Categorization with a Large Number of Categories

Rayid Ghani

KDD Project Proposal

Text Categorization

Numerous Applications
  • Search Engines/Portals
  • Customer Service
  • …

Domains:
  • Topics
  • Genres
  • Languages

$$$ Making

How do people deal with a large number of classes?

Use fast multiclass algorithms (Naïve Bayes), which build one model per class

Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems

What happens with a 1000-class problem? Can we do better?

ECOC to the Rescue!

An n-class problem can be solved by solving as few as ⌈log2 n⌉ binary problems

More efficient than one-per-class

Does it actually perform better?
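To make the comparison concrete, the sketch below counts the binary problems needed by a one-per-class decomposition and by the shortest possible code. The function names are illustrative assumptions, not part of the proposal. The lower bound only guarantees that every class gets a distinct codeword; practical error-correcting codes usually add redundant bits on top of it.

```python
def one_per_class(n_classes: int) -> int:
    """One binary problem per class."""
    return n_classes


def min_code_length(n_classes: int) -> int:
    """Smallest number of bits giving every class a distinct codeword,
    i.e. ceil(log2(n_classes)), computed with integer arithmetic."""
    if n_classes < 2:
        raise ValueError("need at least two classes")
    return (n_classes - 1).bit_length()


for n in (4, 20, 1000):
    print(n, one_per_class(n), min_code_length(n))
```

For the 1000-class problem raised above, this gives 1000 binary problems for one-per-class against a minimum of 10 for a binary code.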

What is ECOC?

Solve multiclass problems by decomposing them into multiple binary problems

Use a learner to learn the binary problems

Training ECOC

      f1  f2  f3  f4  f5
  A    0   0   1   1   0
  B    1   0   1   0   0
  C    0   1   1   1   0
  D    0   1   0   0   1

Testing ECOC

      f1  f2  f3  f4  f5
  X    0   0   1   1   1
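The training step can be sketched as a relabeling: each column of the code matrix turns the multiclass labels into one set of binary targets, and one binary learner is fit per column. The sketch uses the four-class, five-bit matrix from the slide. The helper names and the `fit_binary` callable are hypothetical, and the stand-in "learner" below simply memorizes its targets so that the relabeling is visible.

```python
# Code matrix from the slide: one 5-bit codeword per class (columns f1..f5).
CODE = {
    "A": [0, 0, 1, 1, 0],
    "B": [1, 0, 1, 0, 0],
    "C": [0, 1, 1, 1, 0],
    "D": [0, 1, 0, 0, 1],
}


def binary_targets(class_labels, code, bit):
    """Relabel multiclass labels with the given bit of each class's codeword."""
    return [code[c][bit] for c in class_labels]


def train_ecoc(examples, class_labels, code, fit_binary):
    """Fit one binary model per code column; fit_binary is any binary learner."""
    n_bits = len(next(iter(code.values())))
    return [
        fit_binary(examples, binary_targets(class_labels, code, j))
        for j in range(n_bits)
    ]


# Hypothetical toy data: five documents with their class labels.
docs = ["doc1", "doc2", "doc3", "doc4", "doc5"]
labels = ["A", "B", "C", "D", "C"]

# Stand-in learner (an assumption for illustration): it just returns its targets.
models = train_ecoc(docs, labels, CODE, fit_binary=lambda X, y: list(y))
print(models[0])  # targets for f1
print(models[1])  # targets for f2
```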

ECOC - Picture

      f1  f2  f3  f4  f5
  A    0   0   1   1   0
  B    1   0   1   0   0
  C    0   1   1   1   0
  D    0   1   0   0   1

[Figure: classes A, B, C, D and the binary classifiers f1 ... f5]


X 1 1 1 1 0
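Testing can be sketched as nearest-codeword decoding: the five binary predictions for X form a bit string, and X is assigned the class whose codeword is closest in Hamming distance. Hamming decoding is one common choice and is an assumption here, since the slides do not spell out the decoding rule. The function names are illustrative.

```python
# Code matrix from the slides (columns f1..f5).
CODE = {
    "A": [0, 0, 1, 1, 0],
    "B": [1, 0, 1, 0, 0],
    "C": [0, 1, 1, 1, 0],
    "D": [0, 1, 0, 0, 1],
}


def hamming(a, b):
    """Number of positions in which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(word, code):
    """Return the class with the nearest codeword and all distances.
    Ties are broken by the order of the classes in `code`."""
    distances = {c: hamming(bits, word) for c, bits in code.items()}
    return min(distances, key=distances.get), distances


print(decode([1, 1, 1, 1, 0], CODE))
print(decode([0, 0, 1, 1, 1], CODE))
```

With this matrix, X = 1 1 1 1 0 is one bit away from C's codeword and is assigned to C, while X = 0 0 1 1 1 is one bit away from A's codeword and is assigned to A.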

[Figure: classification performance versus efficiency for NB and ECOC]

Preliminary Results

This Proposal

ECOC reduces the error of the Naïve Bayes Classifier by 66% with no increase in computational cost

Proposed Solutions

Design codewords that minimize cost and maximize “performance”

Investigate the assignment of codewords to classes

Learn the decoding function

Incorporate unlabeled data into ECOC
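One simple way to quantify codeword "performance" is the minimum Hamming distance d between any two codewords: nearest-codeword decoding can then tolerate up to floor((d - 1) / 2) wrong binary predictions. The sketch below applies this to the toy matrix from the slides and, for contrast, to a 7-bit exhaustive code for four classes. The second code and all function names are illustrative assumptions, not taken from the proposal.

```python
from itertools import combinations

# Toy matrix from the slides.
SLIDE_CODE = {
    "A": [0, 0, 1, 1, 0],
    "B": [1, 0, 1, 0, 0],
    "C": [0, 1, 1, 1, 0],
    "D": [0, 1, 0, 0, 1],
}

# Illustrative alternative (assumption): 7-bit exhaustive code for 4 classes.
EXHAUSTIVE_CODE = {
    "A": [1, 1, 1, 1, 1, 1, 1],
    "B": [0, 0, 0, 0, 1, 1, 1],
    "C": [0, 0, 1, 1, 0, 0, 1],
    "D": [0, 1, 0, 1, 0, 1, 0],
}


def hamming(a, b):
    """Number of positions in which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def min_row_distance(code):
    """Smallest Hamming distance between any two codewords."""
    return min(hamming(code[a], code[b]) for a, b in combinations(code, 2))


def correctable_errors(code):
    """Wrong bits that nearest-codeword decoding is guaranteed to survive."""
    return (min_row_distance(code) - 1) // 2


for name, code in (("slide", SLIDE_CODE), ("exhaustive", EXHAUSTIVE_CODE)):
    print(name, min_row_distance(code), correctable_errors(code))
```

The slide matrix has minimum distance 1 (A and C differ only in f2), so it is compact but cannot correct any errors, whereas the longer code has minimum distance 4 and can correct one. This is the cost versus performance trade-off that codeword design has to balance.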

Use unlabeled data

Current learning algorithms using unlabeled data (EM, Co-Training) don’t work well with a large number of categories

ECOC works great with a large number of classes but there is no framework for using unlabeled data

Use Unlabeled Data

ECOC decomposes multiclass problems into binary problems

Co-Training works great with binary problems

ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training (and variants of Co-Training such as Co-EM)
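The combination can be sketched as follows, under strong simplifying assumptions: each example has two feature views, the binary learner is a 1-nearest-neighbor rule (a stand-in, not the learner used in the proposal), and each co-training step lets one view label the unlabeled example it is most confident about. All names and the toy data are hypothetical.

```python
import math

# Code matrix from the slides (columns f1..f5).
CODE = {"A": [0, 0, 1, 1, 0], "B": [1, 0, 1, 0, 0],
        "C": [0, 1, 1, 1, 0], "D": [0, 1, 0, 0, 1]}


def nn_score(examples, x):
    """1-NN over (features, bit) pairs: return (predicted bit, confidence)."""
    nearest = {0: math.inf, 1: math.inf}
    for feats, bit in examples:
        nearest[bit] = min(nearest[bit], math.dist(feats, x))
    bit = 0 if nearest[0] <= nearest[1] else 1
    return bit, abs(nearest[0] - nearest[1])


def co_train_bit(lab1, lab2, bits, unl1, unl2, rounds=10):
    """Co-train one binary problem on two views of the same examples."""
    ex1, ex2 = list(zip(lab1, bits)), list(zip(lab2, bits))
    pool = list(range(len(unl1)))
    for _ in range(rounds):
        for teacher, view in ((ex1, unl1), (ex2, unl2)):
            if not pool:
                return ex1, ex2
            # The teacher labels the unlabeled example it is most sure about,
            # and that example joins the labeled set of both views.
            best = max(pool, key=lambda i: nn_score(teacher, view[i])[1])
            bit, _ = nn_score(teacher, view[best])
            ex1.append((unl1[best], bit))
            ex2.append((unl2[best], bit))
            pool.remove(best)
    return ex1, ex2


def train_ecoc_cotrain(lab1, lab2, classes, unl1, unl2):
    """One co-trained pair of binary models per code column."""
    n_bits = len(next(iter(CODE.values())))
    return [co_train_bit(lab1, lab2, [CODE[c][j] for c in classes], unl1, unl2)
            for j in range(n_bits)]


def predict(models, x1, x2):
    """Predict each bit from the more confident view, then Hamming-decode."""
    word = []
    for ex1, ex2 in models:
        b1, c1 = nn_score(ex1, x1)
        b2, c2 = nn_score(ex2, x2)
        word.append(b1 if c1 >= c2 else b2)
    return min(CODE, key=lambda c: sum(a != b for a, b in zip(CODE[c], word)))


# Hypothetical toy data: one labeled and two unlabeled examples per class.
centers1 = {"A": (0, 0), "B": (0, 10), "C": (10, 0), "D": (10, 10)}
centers2 = {"A": (-20, 0), "B": (0, 20), "C": (20, 0), "D": (0, -20)}
classes = list("ABCD")
offsets = [(0.5, -0.5), (-1.0, 0.5)]
lab1 = [centers1[c] for c in classes]
lab2 = [centers2[c] for c in classes]
unl1 = [(centers1[c][0] + dx, centers1[c][1] + dy) for c in classes for dx, dy in offsets]
unl2 = [(centers2[c][0] + dx, centers2[c][1] + dy) for c in classes for dx, dy in offsets]

models = train_ecoc_cotrain(lab1, lab2, classes, unl1, unl2)
print(predict(models, (9.5, 0.8), (19.2, -0.4)))
```

The point of the design is that co-training never sees the multiclass problem: it only has to work on each two-way split defined by a code column, and the code matrix recombines the binary answers.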

Summary