
Lecture 6: Supervised Learning for Text (Chap 5, Chakrabarti)

Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering, National Cheng Kung University

2004/11/4

Organizing knowledge

Systematic knowledge structures: ontologies
• The Dewey decimal system, the Library of Congress catalog, the AMS Mathematics Subject Classification, and the US Patent subject classification

Web catalogs
• Yahoo! & Dmoz

Problem: manual maintenance

Medical Subject Heading (MeSH)


Topic Tagging

Finding similar documents, guiding queries
Naïve approach:
• Syntactic similarity between documents
Better approach:
• Topic tagging

Topic Tagging

Advantages
• Increased vocabulary of classes
• Hierarchical visualization and browsing aids

Applications
• Email/bookmark organization
• News tracking
• Tracking authors of anonymous texts (e.g., the Flesch-Kincaid index)
• Classifying the purpose of hyperlinks

Supervised learning

Learning to assign objects to classes given examples
Learner (classifier)

A typical supervised text learning scenario.

Differences with text

ML (machine learning) classification techniques were developed for structured data; text is different:
• Lots of features and lots of noise
• No fixed number of columns
• No categorical attribute values
• Data scarcity
• Larger number of class labels
• Hierarchical relationships between classes, less systematic than in structured data

Techniques

Nearest neighbor classifier
• Lazy learner: remember all training instances
• Decision on a test document: based on the distribution of labels on the training documents most similar to it
• Assigns large weights to rare terms

Feature selection
• Removes terms in the training documents which are statistically uncorrelated with the class labels

Bayesian classifier
• Fit a generative term distribution Pr(d|c) to each class c of documents {d}
• Testing: the distribution most likely to have generated a test document is used to label it

Other Classifiers

Maximum entropy classifier:
• Estimate a distribution Pr(c|d) directly, from the term space to the probability of various classes

Support vector machines:
• Represent classes by numbers
• Construct a direct function from the term space to the class variable

Rule induction:
• Induce rules for classification over diverse features
• E.g.: information from ordinary terms, the structure of the HTML tag tree in which terms are embedded, link neighbors, citations

Other Issues

Tokenization
• E.g.: replacing monetary amounts by a special token

Evaluating text classifiers
• Accuracy
• Training speed and scalability
• Simplicity, speed, and scalability for document modifications
• Ease of diagnosis, interpretation of results, and adding human judgment and feedback
• (The latter criteria are subjective.)

Benchmarks for accuracy

Reuters
• 10,700 labeled documents
• 10% of documents with multiple class labels

OHSUMED
• 348,566 abstracts from medical journals

20NG
• 18,800 labeled USENET postings
• 20 leaf classes, 5 root-level classes

WebKB
• 8,300 documents in 7 academic categories

Industry
• 10,000 home pages of companies from 105 industry sectors
• Shallow hierarchies of sector names

Measures of accuracy

Assumptions (two settings)
• Each document is associated with exactly one class, or
• Each document is associated with a subset of classes

Confusion matrix (M)
• For more than 2 classes
• M[i, j]: number of test documents belonging to class i which were assigned to class j
• Perfect classifier: only the diagonal elements M[i, i] would be nonzero

Evaluating classifier accuracy

Two-way ensemble
• To avoid searching over the power-set of class labels in the subset scenario
• Create a positive and a negative class for each label (e.g., "Sports" and "Not sports", the latter containing all remaining documents)

Recall and precision
• A 2x2 contingency matrix M_{d,c} per (d, c) pair, where C(d) is the set of classes to which d truly belongs:
  M_{d,c}[0,0] = 1 if c ∈ C(d) and the classifier outputs c, else 0
  M_{d,c}[0,1] = 1 if c ∈ C(d) and the classifier does not output c, else 0
  M_{d,c}[1,0] = 1 if c ∉ C(d) and the classifier outputs c, else 0
  M_{d,c}[1,1] = 1 if c ∉ C(d) and the classifier does not output c, else 0

Evaluating classifier accuracy (contd.)

• Micro-averaged contingency matrix: sum the per-(d, c) matrices over all documents and classes,
  M_micro = Σ_c Σ_d M_{d,c}
• Micro-averaged precision and recall (equal importance for each document):
  precision(M_micro) = M_micro[0,0] / (M_micro[0,0] + M_micro[1,0])
  recall(M_micro) = M_micro[0,0] / (M_micro[0,0] + M_micro[0,1])
• Macro-averaged precision and recall (equal importance for each class): for each class c form
  M_c = Σ_d M_{d,c},
  compute precision(M_c) = M_c[0,0] / (M_c[0,0] + M_c[1,0]) and recall(M_c) = M_c[0,0] / (M_c[0,0] + M_c[0,1]),
  and average these over the |C| classes.

Evaluating classifier accuracy (contd.)

• Precision-recall tradeoff
  • Plot precision vs. recall: a better classifier's curve stays higher over the range of recall
  • Harmonic mean: discard classifiers that sacrifice one measure for the other,
    F1 = 2 * precision * recall / (precision + recall)
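
A minimal sketch (not from the slides) of how the micro-averaged measures above can be computed. The 2x2 layout follows the convention used here: M[0][0] counts correct outputs, M[0][1] missed labels, M[1][0] spurious labels.

    def micro_precision_recall_f1(pair_matrices):
        """pair_matrices: iterable of 2x2 count lists, one per (document, class) pair."""
        M = [[0, 0], [0, 0]]
        for m in pair_matrices:               # micro-average: sum all per-pair matrices
            for i in range(2):
                for j in range(2):
                    M[i][j] += m[i][j]
        # Assumes at least one output label and one true label, so denominators are nonzero.
        precision = M[0][0] / (M[0][0] + M[1][0])   # correct outputs / all outputs
        recall = M[0][0] / (M[0][0] + M[0][1])      # correct outputs / all true labels
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1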

Nearest neighbor classifiers

Intuition
• Similar documents are expected to be assigned the same class label
• Vector space model + cosine similarity

Training:
• Index each document and remember its class label

Testing:
• Fetch the k most similar documents to the given document
  - Majority class wins
  - Alternative: weighted counts, i.e., counts of classes weighted by the corresponding similarity measure
  - Alternative: a per-class offset b_c, which is tuned by testing the classifier on a portion of training data held out for this purpose
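
A hedged sketch of the decision rule just described: cosine-similarity k-NN with a similarity-weighted vote and optional per-class offsets b_c. Function and variable names are illustrative, not from the slides.

    import numpy as np

    def knn_predict(doc_vecs, labels, query_vec, k=5, class_offsets=None):
        """doc_vecs: (n_docs, n_terms) TF-IDF matrix; labels: length-n_docs class ids."""
        # Cosine similarity = dot product divided by the product of vector lengths.
        norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
        sims = doc_vecs @ query_vec / np.maximum(norms, 1e-12)
        top = np.argsort(-sims)[:k]                 # indices of the k most similar documents
        scores = {}
        for i in top:                               # similarity-weighted vote per class
            scores[labels[i]] = scores.get(labels[i], 0.0) + sims[i]
        if class_offsets:                           # optional tuned per-class offset b_c
            for c, b in class_offsets.items():
                scores[c] = scores.get(c, 0.0) + b
        return max(scores, key=scores.get)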


Nearest neighbor classification

Pros

• Easy availability and reuse of an inverted index
• Collection updates are trivial
• Accuracy comparable to the best known classifiers

Cons

Iceberg category questions: classifying a test document d_q
• involves as many inverted index lookups as there are distinct terms in d_q,
• scoring the (possibly large number of) candidate documents which overlap with d_q in at least one word,
• sorting by overall similarity,
• picking the best k documents

Space overhead and redundancy
• Data stored at the level of individual documents
• No distillation

Workarounds

To reduce space requirements and speed up classification
• Find clusters in the data
• Store only a few statistical parameters per cluster
• Compare with documents in only the most promising clusters

But again…
• Ad-hoc choices for the number and size of clusters and parameters
• k is corpus-sensitive

TF-IDF

• TF-IDF is computed over the whole corpus
• Interclass correlations and term frequencies are left unaccounted for
• Terms which occur relatively frequently in some classes compared to others should have higher importance
• Overall rarity in the corpus is not as important

Feature selection

Data sparsity
• The term distribution could be estimated if the training set were large enough
• Not the case, however…
• A joint model over vocabulary W has on the order of 2^|W| possible term combinations, far more than the number of documents
• For Reuters, only about 10,300 documents are available

Over-fitting problem
• The joint distribution may fit the training instances…
• …but may not fit unforeseen test data that well

Marginals rather than joint

• Estimate the marginal distribution of each term in each class
• Empirical distributions may still not reflect actual distributions if data is sparse
• Therefore, feature selection
  Purposes:
  • Improve accuracy by avoiding overfitting
  • Maintain accuracy while discarding as many features as possible, to save a great deal of space for storing statistics
  Approaches: heuristic, guided by linguistic and domain knowledge, or statistical

Feature selection

Perfect feature selection
• Goal-directed: pick all possible subsets of features; for each subset, train and test a classifier; retain the subset which results in the highest accuracy
• COMPUTATIONALLY INFEASIBLE

Simple heuristics
• Stop words like "a", "an", "the", etc.
• Empirically chosen thresholds (task- and corpus-sensitive) for ignoring "too frequent" or "too rare" terms
• Discard "too frequent" and "too rare" terms

Larger and more complex data sets
• Confusion with stop words
• Especially for topic hierarchies

Greedy inclusion (bottom-up) vs. truncation (top-down)

Greedy inclusion algorithm

Most commonly used for text. Algorithm:
1. Compute, for each term, a measure of discrimination among classes.
2. Arrange the terms in decreasing order of this measure.
3. Retain a number of the best terms or features for use by the classifier.

• Greedy because the measure of discrimination of a term is computed independently of other terms
• Over-inclusion: mild effects on accuracy

Measure of discrimination

• The choice depends on
  • the model of documents
  • the desired speed of training
  • the ease of updates to documents and class assignments
• Observation
  • the feature sets needed for acceptable accuracy tend to have large overlap across measures

The χ² test

• Similar to the likelihood ratio test
• Build a 2 x 2 contingency matrix per class-term pair:
  k_{i,1} = number of documents in class i containing term t
  k_{i,0} = number of documents in class i not containing term t
• Under the independence hypothesis, χ² aggregates the deviations of observed values from expected values:
  χ² = Σ_{i,j} (O_{ij} − E_{ij})² / E_{ij}
• The larger the value of χ², the lower is our belief that the independence assumption is upheld by the observed data.

The χ² test (contd.)

For a single term t (indicator I_t) and a two-way class variable C, the table of counts k_{l,m} is

              I_t = 0    I_t = 1
    C = 0      k_00       k_01
    C = 1      k_10       k_11

with n = k_00 + k_01 + k_10 + k_11 and expected counts E_{l,m} = n Pr(C = l) Pr(I_t = m) under independence. Then

    χ² = Σ_{l,m} (k_{l,m} − E_{l,m})² / E_{l,m}
       = n (k_11 k_00 − k_10 k_01)² / ((k_11 + k_10)(k_01 + k_00)(k_11 + k_01)(k_10 + k_00))

• Feature selection process
  • Sort terms in decreasing order of their χ² values
  • Train several classifiers with a varying number of features
  • Stop at the point of maximum accuracy
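
A small sketch of χ²-based term ranking using the closed form above; the counts dictionary and function names are illustrative assumptions.

    def chi_squared(k00, k01, k10, k11):
        """Closed-form chi-squared score for one (term, class) contingency table.
        Rows are class values (0/1), columns are term absence/presence."""
        n = k00 + k01 + k10 + k11
        num = n * (k11 * k00 - k10 * k01) ** 2
        den = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
        return num / den if den else 0.0

    def top_terms(counts, num_features):
        """counts[t] is assumed to hold the (k00, k01, k10, k11) tuple for term t."""
        ranked = sorted(counts, key=lambda t: chi_squared(*counts[t]), reverse=True)
        return ranked[:num_features]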

Mutual information

• Useful when the multinomial document model is used
• X and Y are discrete random variables taking values x, y
• Mutual information (MI) between them is defined as
  MI(X, Y) = Σ_x Σ_y Pr(x, y) log ( Pr(x, y) / (Pr(x) Pr(y)) )
• A measure of the extent of dependence between random variables:
  • the extent to which the joint deviates from the product of the marginals,
  • weighted by the distribution mass at (x, y)

Mutual Information

Advantages
• To the extent MI(X, Y) is large, X and Y are dependent
• Deviations from independence at rare values of (x, y) are played down

Interpretations
• Reduction in entropy: MI(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
• KL (Kullback-Leibler) distance between the no-independence hypothesis and the independence hypothesis
• The KL distance gives the average number of bits wasted by encoding events from the 'correct' distribution using a code based on a not-quite-right distribution

Feature selection with MI

• Fix a term t and let I_t be an event associated with that term
  • e.g., for the binary model, I_t = 0/1
• Pr(I_t) = the empirical fraction of documents in the training set in which event I_t occurred
• Pr(I_t, c) = the empirical fraction of training documents which are in class c and in which I_t occurred
• Pr(c) = fraction of training documents belonging to class c
• In terms of the counts k_{l,m} from the χ² contingency table (n documents in total):
  MI(I_t, C) = Σ_{l,m} (k_{l,m} / n) log ( (k_{l,m} / n) / ((k_{l,0} + k_{l,1})(k_{0,m} + k_{1,m}) / n²) )
• Problem: document lengths are not normalized
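
A minimal sketch of the MI score above for one term; k is assumed to be the 2x2 count table indexed as k[class value][term indicator].

    import math

    def mutual_information(k, n):
        """MI(I_t, C) from a 2x2 count table k[l][m]; n is the total number of documents."""
        mi = 0.0
        for l in (0, 1):
            for m in (0, 1):
                if k[l][m] == 0:
                    continue                          # 0 * log(...) is taken as 0
                p_lm = k[l][m] / n
                p_l = (k[l][0] + k[l][1]) / n         # marginal over the class value
                p_m = (k[0][m] + k[1][m]) / n         # marginal over the term indicator
                mi += p_lm * math.log(p_lm / (p_l * p_m))
        return mi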

Fisher's discrimination index

• Useful when documents are scaled to constant length
• Term occurrences are regarded as fractional real numbers
• E.g., the two-class case:
  • Let X and Y be the sets of length-normalized document vectors corresponding to the two classes
  • Centroids: μ_X = (1/|X|) Σ_{x∈X} x and μ_Y = (1/|Y|) Σ_{y∈Y} y
  • Covariance matrices:
    S_X = (1/|X|) Σ_{x∈X} (x − μ_X)(x − μ_X)^T
    S_Y = (1/|Y|) Σ_{y∈Y} (y − μ_Y)(y − μ_Y)^T

Fisher's discrimination index (contd.)

• Goal: find a projection of the data sets X and Y onto a line such that the two projected centroids are far apart compared to the spread of the point sets projected onto the same line.
• Find a column vector α such that the ratio of
  - the square of the difference in projected means, (α^T(μ_X − μ_Y))²,
  - to the average projected variance, (1/2) α^T (S_X + S_Y) α,
  is maximized:
    α* = argmax_α (α^T(μ_X − μ_Y))² / (α^T (S_X + S_Y) α)

Fisher's discrimination index (contd.)

• Suppose X and Y, for both the training and test data, are generated from multivariate Gaussian distributions, and let S = (S_X + S_Y)/2.
• Then α = S^{-1}(μ_X − μ_Y) induces the optimal (minimum-error) classifier by suitable thresholding on α^T q for a test point q.
• Problems
  • Inverting S would be unacceptably slow for tens of thousands of dimensions
  • Linear transformations would destroy already existing sparsity

Solution

• Recall: the goal was to eliminate terms from consideration, not to arrive at linear projections involving multiple terms
• So regard each term t as providing a candidate direction which is parallel to the corresponding axis of the vector space model
• Compute the Fisher index of each term t separately

FI: Solution (contd.)

• For the two-class case, restricting α to the axis of term t gives
  FI(t) = (μ_{X,t} − μ_{Y,t})² / ( (1/|X|) Σ_{x∈X} (x_t − μ_{X,t})² + (1/|Y|) Σ_{y∈Y} (y_t − μ_{Y,t})² )
• This can be generalized to a set {c} of more than two classes:
  FI(t) = Σ_{c1,c2} (μ_{c1,t} − μ_{c2,t})² / Σ_c (1/|D_c|) Σ_{d∈D_c} (x_{d,t} − μ_{c,t})²
  where D_c is the set of training documents in class c
• Feature selection
  • Terms are sorted in decreasing order of FI(t)
  • The best ones are chosen as features
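
A hedged sketch of the per-term Fisher index for the two-class case above; array names and the numerical safeguard are assumptions.

    import numpy as np

    def fisher_index(X, Y):
        """X, Y: (n_docs, n_terms) arrays of length-normalized document vectors.
        Returns an array of FI(t) values, one per term (column)."""
        mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
        var_x = ((X - mu_x) ** 2).mean(axis=0)      # per-term spread within class X
        var_y = ((Y - mu_y) ** 2).mean(axis=0)      # per-term spread within class Y
        return (mu_x - mu_y) ** 2 / np.maximum(var_x + var_y, 1e-12)

    # Terms are then ranked by FI and the best ones kept as features:
    # best = np.argsort(-fisher_index(X, Y))[:num_features]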

Validation

• How to decide a cut-off rank?
• Validation approach
  • A portion of the training documents is held out
  • The rest is used to do the term ranking
  • The held-out set is used as a test set
  • Various cut-off ranks can be tested using the same held-out set
• Leave-one-out cross-validation / partitioning the data into two
  • An aggregate accuracy is computed over all trials
• A wrapper searches for the number of features, considered in decreasing order of discriminative power, that yields the highest accuracy

Validation (contd.)

• Simple search heuristic
  • Keep adding one feature at every step until the classifier's accuracy ceases to improve

A general illustration of wrapping for feature selection.

Validation (contd.)

• For naive Bayes-like classifiers
  • Evaluation on many choices of feature sets can be done at once
• For maximum entropy / support vector machines
  • Essentially involves training a classifier from scratch for each choice of the cut-off rank
  • Therefore inefficient

Validation: observations

• A Bayesian classifier cannot overfit much

Effect of feature selection on Bayesian classifiers

Truncation algorithms

• Start from the complete set of terms T
  1. Keep selecting terms to drop
  2. Until you end up with a feature subset F
  3. Question: when should you stop truncating?
• Two objectives
  • Minimize the size of the selected feature set F
  • Keep the distorted distribution Pr(C|F) as similar as possible to the original Pr(C|T)

Truncation Algorithms: Example

• Kullback-Leibler (KL) divergence
  • Measures the similarity or distance between two distributions
• Markov blanket
  • Let X be a feature in T, and let M ⊆ T \ {X}
  • Technically, M is called a Markov blanket for X if X is conditionally independent of (T ∪ C) \ (M ∪ {X}) given M
  • The presence of M then renders the presence of X unnecessary as a feature
  • Eliminating a variable because it has a Markov blanket contained in the other existing features does not increase the KL distance between Pr(C|T) and Pr(C|F)

Finding Markov Blankets

• Exact Markov blankets are usually absent in practice, so find approximate Markov blankets
• To cut down computational complexity
  • Restrict the search to blankets M with at most k features
  • For a given feature X, restrict the candidate members of M to those features which are most strongly correlated with X (using tests similar to the χ² or MI tests)
• Example: for the Reuters dataset, over two-thirds of T could be discarded while increasing classification accuracy

Feature Truncation algorithm

1. while the truncated Pr(C|F) is reasonably close to the original Pr(C|T) do
2.   for each remaining feature X do
3.     Identify a candidate Markov blanket M:
4.       For some tuned constant k, find the set M of k variables in F \ {X} that are most strongly correlated with X
5.     Estimate how good a blanket M is:
6.       Estimate Σ_{x_M, x_X} Pr(M = x_M, X = x_X) · KL( Pr(C | M = x_M, X = x_X) || Pr(C | M = x_M) )
7.   end for
8.   Eliminate the feature having the best surviving Markov blanket
9. end while

General observations on feature selection

• The issue of document length should be addressed properly
• The choice of association measure does not make a dramatic difference
• Greedy inclusion algorithms scale nearly linearly with the number of features
• The Markov blanket technique takes time proportional to at least |T|^k
• Advantage of the Markov blanket algorithm over greedy inclusion
  • A greedy algorithm may include features with high individual correlations even though one subsumes the other
  • Features that are individually uncorrelated could be jointly more correlated with the class
  • This rarely happens in practice (e.g., phrases)
• The binary feature selection view may not be the only view to subscribe to
  • Suggestion: combine features into fewer, simpler ones
  • E.g., project the document vectors to a lower-dimensional space

Bayesian Learner

• A very practical text classifier
• Assumptions
  1. A document can belong to exactly one of a set of classes or topics
  2. Each class c has an associated prior probability Pr(c), with Σ_c Pr(c) = 1
  3. There is a class-conditional document distribution Pr(d|c) for each class
• Posterior probability, obtained using Bayes rule:
  Pr(c|d) = Pr(c, d) / Pr(d) = Pr(c) Pr(d|c) / Pr(d), where Pr(d) = Σ_{c'} Pr(c') Pr(d|c')
• The parameter set Θ consists of all the parameters needed to specify the Pr(d|c) distributions
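
A tiny sketch of the Bayes-rule posterior above; the priors and likelihood values in the usage comment are made-up numbers purely for illustration.

    def posterior(priors, likelihoods):
        """priors[c] = Pr(c); likelihoods[c] = Pr(d|c) for one fixed document d."""
        evidence = sum(priors[c] * likelihoods[c] for c in priors)   # Pr(d)
        return {c: priors[c] * likelihoods[c] / evidence for c in priors}

    # e.g. posterior({'sports': 0.3, 'politics': 0.7}, {'sports': 1e-9, 'politics': 4e-9})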

Parameter Estimation for the Bayesian Learner

• The estimate of Θ is based on two sources of information:
  1. Prior knowledge about the parameter set before seeing any training documents
  2. Terms in the training documents D
• Bayes optimal classifier
  • Take the expectation of each parameter over the posterior Pr(Θ|D):
    Pr(c|d, D) = Σ_Θ Pr(c|d, Θ) Pr(Θ|D)
  • Computationally infeasible
• Maximum likelihood estimate
  • Replace the sum above with the single summand Pr(c|d, Θ) evaluated at Θ = argmax_Θ Pr(Θ|D)
  • Works poorly for text classification

Naïve Bayes Classifier

• "Naïve"
  • assumption of independence between terms
  • the joint term distribution is the product of the marginals
• Widely used owing to
  • simplicity and speed of training, applying, and updating
• Two kinds of widely used marginals for text
  • Binary model
  • Multinomial model

Naïve Bayes Models

• Binary model
  • Each parameter θ_{c,t} indicates the probability that a document in class c will mention term t at least once:
    Pr(d|c) = Π_{t∈d} θ_{c,t} · Π_{t∈W, t∉d} (1 − θ_{c,t})
• Multinomial model
  • Each class has an associated die with |W| faces; parameter θ_{c,t} denotes the probability of face t turning up on tossing the die
  • Term t occurs n(d, t) times in document d; the document length l_d = Σ_t n(d, t) is a random variable denoted L:
    Pr(d|c) = Pr(L = l_d | c) Pr(d | l_d, c) = Pr(L = l_d | c) · ( l_d choose {n(d, t)} ) · Π_t θ_{c,t}^{n(d,t)}

Analysis of Naïve Bayes Models

1. Multiplying together a large number of small probabilities
   • Result: extremely tiny probabilities as answers
   • Solution: store all numbers as logarithms
2. The class which comes out at the top wins by a huge margin
   • Sanitize scores using the likelihood ratio (also called the logit when taken in log form):
     LR(d) = Pr(C = 1 | d) / Pr(C = −1 | d),
     mapped to the score 1 / (1 + e^{−log LR(d)}), which lies between 0 and 1

Parameter smoothing

• What if a test document d_q contains a term t that never occurred in any training document in class c?
  • Answer: Pr(c | d_q) will be zero, even if many other terms clearly hint at a high likelihood of class c generating the document
• Bayesian estimation: estimating a probability from insufficient data
  • If you toss a coin n times and it always comes up heads, what is the probability that the (n + 1)th toss will also come up heads?
  • Posit a prior distribution π(θ) on the head probability θ, e.g., the uniform distribution on [0, 1]
  • Having observed k heads in n tosses, the resulting posterior distribution is
    π(θ | k, n) = π(θ) Pr(k, n | θ) / ∫_0^1 π(p) Pr(k, n | p) dp, where Pr(k, n | θ) ∝ θ^k (1 − θ)^{n−k}

Laplace Smoothing

• Based on Bayesian estimation
• Laplace's law of succession
  • Pick a loss function (penalty) L(θ, θ̃) for choosing a smoothed value θ̃ as against the 'true' value θ
  • E.g., the square-error loss (θ − θ̃)²
  • For this choice of loss, the best choice of the smoothed parameter is simply the expectation of the posterior distribution having observed the data:
    θ̃ = E( π(θ | k, n) ) = (k + 1) / (n + 2)

Laplace Smoothing (contd.)

• Heuristic alternatives
  • Lidstone's law of succession: θ̃ = (k + λ) / (n + 2λ)
• Derivation for the multinomial model
  • There are |W| possible events, where W is the vocabulary, so the smoothed estimate becomes
    θ̃_{c,t} = (1 + Σ_{d∈D_c} n(d, t)) / (|W| + Σ_{d∈D_c} Σ_τ n(d, τ)),
    where D_c is the set of training documents in class c
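
A hedged sketch tying together the multinomial model, Laplace smoothing, and log-space scoring from the preceding slides; data structures and names are illustrative assumptions, not the book's implementation.

    import math
    from collections import Counter, defaultdict

    def train_multinomial_nb(docs, labels):
        """docs: list of token lists; labels: list of class names."""
        class_counts = Counter(labels)
        term_counts = defaultdict(Counter)      # term_counts[c][t] = sum of n(d,t) over class c
        vocab = set()
        for tokens, c in zip(docs, labels):
            term_counts[c].update(tokens)
            vocab.update(tokens)
        priors = {c: class_counts[c] / len(docs) for c in class_counts}
        return priors, term_counts, vocab

    def predict(priors, term_counts, vocab, tokens):
        best_class, best_score = None, float("-inf")
        for c in priors:
            total = sum(term_counts[c].values())
            score = math.log(priors[c])         # work in log space to avoid underflow
            for t in tokens:
                if t not in vocab:
                    continue                    # ignore terms never seen in training
                theta = (1 + term_counts[c][t]) / (len(vocab) + total)   # Laplace smoothing
                score += math.log(theta)
            if score > best_score:
                best_class, best_score = c, score
        return best_class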

Performance analysis

• The multinomial naive Bayes classifier generally outperforms the binary variant
• k-NN may outperform naïve Bayes
• Naïve Bayes is faster and more compact
• Decision boundaries: regions of potential confusion

NB: Decision boundaries

• A Bayesian classifier partitions the multidimensional term space into regions
  • Within each region, the probability of one class is higher than the others
  • On the boundaries, the probabilities of two or more classes are exactly equal
• NB is a linear classifier
  • It makes a decision between c = 1 and c = −1 by thresholding the value of α_NB · d + b for a suitable vector α_NB (with b derived from the priors)

Pitfalls

• Strong bias
  • NB fixes the policy that α_NB(t), the t-th component of the linear discriminant, depends only on the statistics of term t in the corpus
  • Therefore it cannot pick from the entire set of possible linear discriminants

Bayesian Networks

• Attempt to capture statistical dependencies between the terms themselves
• Approximations to the joint distribution over terms
  • The probability of a term occurring depends on observations about other terms as well as on the class variable
• A directed acyclic graph
  • All random variables (classes and terms) are nodes
  • Dependency edges are drawn from c to t for each t (parent-child edges)
  • To represent additional dependencies between terms, further parent-child edges are drawn between term nodes

Bayesian networks. For the naive Bayes assumption, the only edges are from the class variable to individual terms. Towards better approximations to the joint distribution over terms: the probability of a term occurring may now depend on observations about other terms as well as the class variable.

Bayesian Belief Network (BBN)

• A DAG
• Parents Pa(X)
  • Nodes that are connected by directed edges into a node X
  • Fixing the values of the parent variables completely determines the conditional distribution of X
• Conditional probability tables (CPTs)
  • For discrete variables, the distribution data for X can be stored in the obvious way as a table, with each row showing a set of values of the parents, the value of X, and a conditional probability
• Unlike naïve Bayes, Pr(d|c) is not a simple product over all terms; the joint factors as
  Pr(x) = Π_X Pr(x_X | x_{Pa(X)})

BBN: difficulties

• Getting a good network structure
• At least quadratic time: enumeration of all pairs of features
• Exploited only for the binary model
  • For the multinomial model, CPT sizes are prohibitive

Exploiting hierarchy among topics

• An ordering between the class labels
  • In data warehousing, e.g., high, medium, or low cancer-risk patients
• Text class labels: a taxonomy
  • A large and complex class hierarchy that relates the class labels
• Tree structure
  • The simplest form of taxonomy
  • Widely used in directory browsing; often the output of clustering algorithms
  • Inheritance: if class c0 is the parent of class c1, any training document which belongs to c1 also belongs to c0

Topic Hierarchies: Feature selection

• The discriminating ability of a term is sensitive to the node (or class) in the hierarchy
• The measure of discrimination of a term can be evaluated with respect to internal nodes of the hierarchy only
  • 'can' may be a noisy word at the root node of Yahoo!, yet help in classifying documents under the subtree of /Science/Environment/Recycling

Topic Hierarchies: Enhanced parameter estimation

• Uniform priors are not good
• Idea
  • If a parameter estimate is shaky at a node with few training documents, perhaps we can impose a strong prior from a well-trained parent to repair the estimates
• Shrinkage
  • Seeks to improve estimates at descendants using data from ancestors

Shrinkage

• Assume the multinomial model
• Introduce a dummy class c0 as the parent of the root c1, in which all terms are equally likely
• For a specific path c0, c1, …, cn, the 'shrunk' estimate θ̃_{c_n,t} is determined by a convex linear interpolation of the MLE parameters at the ancestor nodes up through c0
• Estimation of the mixing weights
  • A simple form of the EM algorithm
  • Determined empirically, by iteratively maximizing the probability of a held-out portion H_n of the training set for node c_n

Shrinkage: Observations

• Improves accuracy beyond hierarchical naïve Bayes
• The improvement is largest when data is sparse
• Capable of utilizing many more features than naïve Bayes

Topic search in a hierarchy

• By definition
  • All documents are relevant to the root 'topic', so Pr(root|d) = 1
• Given a test document d:
  • Find one or more of the most likely leaf nodes in the hierarchy
  • A document cannot belong to more than one path, so for an internal node c0 with children {ci},
    Pr(c0|d) = Σ_i Pr(ci|d)

Topic search in a hierarchy: Greedy search strategy

• Search starts at the root
• Decisions are made greedily
  • At each internal node, pick the highest-probability child class
  • Continue downward
• Drawback
  • Early errors have a compounding effect

Topic search in a hierarchy: Best-first search strategy

• For finding the m most probable leaf classes
• Find the weighted shortest path from the root to a leaf
  • Edge (c0, ci) is assigned the (non-negative) edge weight −log Pr(ci | c0, d), so that along a path
    log Pr(ci|d) = log Pr(c0|d) + log Pr(ci | c0, d)
• To make best-first search different from greedy search
  • Rescale/smoothen the probabilities
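
A hedged sketch of the best-first strategy above as a shortest-path search with a priority queue of accumulated −log probabilities. The tree representation (children dict) and the per-node classifier callback cond_prob are assumptions made for illustration.

    import heapq, math

    def best_first_leaves(root, children, cond_prob, d, m=1):
        """Return the m most probable leaf classes for document d.
        children[c]: list of child classes of c (empty for leaves).
        cond_prob(ci, c0, d): Pr(ci | c0, d) from a per-node classifier."""
        frontier = [(0.0, root)]                    # cost = cumulative -log Pr(c | d); root has Pr = 1
        leaves = []
        while frontier and len(leaves) < m:
            cost, c = heapq.heappop(frontier)       # cheapest path so far = most probable node
            if not children[c]:                     # reached a leaf
                leaves.append((c, math.exp(-cost))) # recover Pr(c | d)
                continue
            for ci in children[c]:
                p = cond_prob(ci, c, d)
                if p > 0:
                    heapq.heappush(frontier, (cost - math.log(p), ci))
        return leaves

Because edge weights are non-negative, the leaves are popped in decreasing order of probability, so the first one found is the most probable leaf.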

Using best-first search on a hierarchy can improve both accuracy and speed. Because the hierarchy has four internal nodes, the second column shows the number of features for each. These were tuned so that the total number of features for both the flat and the best-first classifiers is roughly the same (so that the model complexity is comparable). Because each document belonged to exactly one leaf node, recall equals precision in this case and is called 'accuracy'.

The semantics of hierarchical classification

• Asymmetry
  • A training document can be associated with any node
  • A test document must be routed to a leaf
• When should a test document be left at an internal node?
  • None of the children matches the document
  • Many children match the document
  • The chance of making a mistake while pushing the test document down one more level may be too high
• A research issue

Maximum entropy learners: Motivation

• A Bayesian learner
  • first models Pr(d|c) at training time,
  • then applies Bayes rule at test time
• Two problems with Bayesian learners
  • d is represented in a high-dimensional term space, so Pr(d|c) cannot be estimated accurately from a training set of limited size (sparse data)
  • There is no systematic way of adding synthetic features: highly correlated features may "crowd out" useful features

Maximum entropy learners

• Assume that each document has only one class label and no duplicates
• Indicator functions f_j(d, c) ("features")
  • Flag the j-th condition relating class c to document d
• The expectation of indicator f_j is
  E(f_j) = Σ_{d,c} Pr(d, c) f_j(d, c) = Σ_d Pr(d) Σ_c Pr(c|d) f_j(d, c)
• Approximate Pr(d, c) and Pr(d) with their empirical estimates P̃r(d, c) and P̃r(d) (uniform probability 1/n over the n training instances)
• Constraint on the model Pr(c|d), for every j:
  Σ_i (1/n) f_j(d_i, c_i) = Σ_i (1/n) Σ_c Pr(c|d_i) f_j(d_i, c)

Principle of Maximum Entropy

• The constraints do not determine Pr(c|d) uniquely
• Principle of maximum entropy: prefer the simplest model that explains the observed data
  • Choose the Pr(c|d) that maximizes the entropy of Pr(c|d)
  • In the event of an empty training set we should consider all classes to be equally likely
• Constrained optimization
  • Maximize the entropy of the model distribution Pr(c|d) while obeying the constraints for all j
  • Optimize by the method of Lagrange multipliers, using an objective of the form
    G(Pr(c|d), Λ) = Σ_d Pr(d) Σ_c Pr(c|d) log Pr(c|d) + Σ_j λ_j ( Σ_i Σ_c Pr(c|d_i) f_j(d_i, c) − Σ_i f_j(d_i, c_i) )

Principle of Maximum Entropy (contd.)

• Solving the constrained optimization above gives the parametric form
  Pr(c|d) = (1 / Z_Λ(d)) exp( Σ_j λ_j f_j(d, c) ),
  where Z_Λ(d) is a scale factor chosen so that Σ_c Pr(c|d) = 1

Maximum Entropy solution

• Fitting the distribution to the data involves two steps:
  1. Identify a set of indicator functions derived from the data
  2. Iteratively arrive at values for the parameters λ_j that satisfy the constraints while maximizing the entropy of the distribution being modeled
• An equivalent optimization problem: maximize the log-likelihood of the training data,
  max_Λ Σ_{d∈D} log Pr(c_d | d, Λ)

Text Classification using the Maximum Entropy Model

• Example: pick an indicator for each (class, term) combination
  • For the binary document model,
    f_{c',t}(d, c) = 1 if c = c' and t ∈ d, and 0 otherwise
  • For the multinomial document model,
    f_{c',t}(d, c) = n(d, t) / n(d) if c = c', and 0 otherwise
    (where n(d) is the length of document d)
• What we gain with maximum entropy over naïve Bayes
  • It does not suffer from the independence assumptions
  • E.g., if the terms t1 = "machine" and t2 = "learning" are often found together in class c, the parameters λ_{c,t1} and λ_{c,t2} would be suitably discounted
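
With one indicator per (class, term) pair as above, the maximum entropy model has the same log-linear form as multinomial logistic regression, so an off-the-shelf logistic regression over term counts is a close stand-in. The scikit-learn components and the two tiny training strings below are assumptions for illustration, not from the slides.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Term-count features play the role of the (class, term) indicators f_{c,t};
    # the fitted coefficients correspond to the lambda parameters.
    train_texts = ["the match went into extra time", "parliament passed the budget bill"]
    train_labels = ["sports", "politics"]
    maxent = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    maxent.fit(train_texts, train_labels)
    print(maxent.predict(["the budget debate ran into extra time"]))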

Performance of the Maximum Entropy Classifier

• Outperforms naive Bayes in accuracy, but not consistently
• Can deal with a large number of synthesized, possibly redundant features

Discriminative classification

• Naïve Bayes and maximum entropy classifiers "induce" linear decision boundaries between classes in the feature space; e.g., for maximum entropy,
  log Pr(c|d) = −log Z_Λ(d) + Σ_j λ_j f_j(d, c),
  which is linear in the features
• Discriminative classifiers
  • Directly map the feature space to class labels
  • Class labels are encoded as numbers, e.g., +1 and −1 for a two-class problem
• Two examples
  • Linear least-squares regression
  • Support vector machines

Linear least-squares regression

• There is no inherent reason for going through the modeling step, as in the Bayesian or maximum entropy classifier, to get a linear discriminant
• Linear regression problem
  • Look for some arbitrary vector α (and offset b) such that α · d_i + b directly predicts the label c_i of document d_i
  • Minimize the square error between the observed and predicted class variable:
    Σ_i (α · d_i + b − c_i)²
• Optimization: use gradient-descent methods, e.g., the Widrow-Hoff (WH) update rule
    α^{(i+1)} = α^{(i)} + 2η (c_i − α^{(i)} · d_i) d_i,   η: learning rate
  • α is finally scaled to norm 1
• Two equivalent interpretations
  • The classifier is a hyperplane
  • Documents are projected onto a direction
• Performance
  • Comparable to naïve Bayes and maximum entropy
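
A minimal sketch of the Widrow-Hoff update rule above; the offset b is folded into α by appending a constant feature, an assumption made for brevity.

    import numpy as np

    def widrow_hoff(docs, labels, eta=0.01, epochs=5):
        """docs: (n, m) array of document vectors; labels: array of +1/-1.
        Returns a unit-norm weight vector; add a constant feature column for the bias."""
        alpha = np.zeros(docs.shape[1])
        for _ in range(epochs):
            for d, c in zip(docs, labels):
                alpha += 2 * eta * (c - alpha @ d) * d   # step toward reducing (alpha.d - c)^2
        return alpha / np.linalg.norm(alpha)

    # Classify a new document q by the sign of alpha @ q.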

Support vector machines

• Assumption: the training and test populations are drawn from the same distribution
• Hypothesis
  • A hyperplane that is close to many training data points has a greater chance of misclassifying test instances
  • A hyperplane which passes through a "no-man's land" has lower chances of misclassification
• Make a decision by thresholding α_SVM · d + b
• Seek an α_SVM which maximizes the distance of any training point from the hyperplane:
  Minimize (1/2) α · α  (= (1/2) ||α||²)
  subject to c_i (α · d_i + b) ≥ 1, i = 1, …, n
  (d_i: document vector; c_i: class label in {+1, −1})

Mining the Web Chakrabarti & Ramakrishnan 85

1 bd 1 bd

1 bd

Illustration of the SVM optimization problem

1 bd 0 bd

1 bd

Support vector machines (contd.)

• The optimal separator
  • Is orthogonal to the shortest line connecting the convex hulls of the two classes
  • Intersects this shortest line halfway
• The distance of any training point from the optimized hyperplane (the margin) is at least 1/||α||

SVMs: non-separable classes

• Classes in the training data are not always separable (a single hyperplane may not be enough)
• Introduce slack ("fudge") variables ξ_i:
  Minimize (1/2) α · α + C Σ_i ξ_i
  subject to c_i (α · d_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, i = 1, …, n
• Equivalent dual:
  Maximize Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j c_i c_j (d_i · d_j)
  subject to Σ_i c_i λ_i = 0 and 0 ≤ λ_i ≤ C, i = 1, …, n
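
A hedged sketch of a linear soft-margin SVM text classifier of the kind described above; the scikit-learn component names and the tiny inline training data are assumptions for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    # TF-IDF vectors + a linear soft-margin SVM; C is the slack penalty in the primal above.
    train_texts = ["goal scored in the final minute", "central bank raises interest rates"]
    train_labels = ["sports", "finance"]
    svm = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
    svm.fit(train_texts, train_labels)
    print(svm.predict(["rates rise after the final vote"]))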

SVMs: Complexity

• A quadratic optimization problem
  • Working set: refine a few λ_i at a time, holding the others fixed
• Large memory requirement
  • On-demand computation of inner products
• Time: for n documents, training time grows super-linearly, roughly as n^a for some empirically observed exponent a > 1
• Recent SVM packages
  • Near-linear time by clever selection of working sets


SVM training time variation as the training set size is increased, with and without sufficient memory to hold the training set. In the latter case, the memory is set to about a quarter of that needed by the training set.

Performance

• Comparison with other classifiers
  • Among the most accurate classifiers for text
  • Better accuracy than naive Bayes and decision-tree classifiers
• An interesting revelation
  • Linear SVMs suffice
  • Standard text classification tasks have classes that are almost separable using a hyperplane in feature space
• Research issues
  • Non-linear SVMs

Comparison of LSVM with previous classifiers on the Reuters data set (data taken from Dumais). (The naive Bayes classifier used binary features, so its accuracy can be improved.)


Comparison of accuracy across three classifiers: Naive Bayes, Maximum Entropy and Linear SVM, using three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB.


Comparison between several classifiers using the Reuters collection.

Hypertext classification

• Techniques to address hypertextual features
• Document Object Model (DOM)
  • A well-formed HTML document is a properly nested hierarchy of regions, represented as a tree
  • In the DOM tree, internal nodes are elements (e.g., UL, LI)
  • Some of the leaf nodes are segments of text
  • Other nodes are hyperlinks to other Web pages

Representing hypertext for supervised learning

• Paying special attention to tags can help with learning
• Keyword-based search
  • Assign heuristic weights to terms that occur in specific HTML tags, e.g., TITLE, H1, …
• Example (next slide)

Prefixing with tags

• Distinguishing between the two occurrences of the word "surfing":
  • Prefix each term by the sequence of tags that we need to follow from the DOM root to get to the term
• A repeated term in different sections should reinforce belief in a class label
  • Use a maximum entropy classifier to accumulate evidence from different features
  • Maintain both forms of a term: plain text and prefixed text (all path prefixes)

Example

<resume>
  <publication> <title>Web-surfing models</title> </publication>
  <hobbies> <item>Wind-surfing</item> </hobbies>
</resume>

Prefixed terms:
• resume.publication.title.surfing
• resume.hobbies.item.surfing

Experiments

• 10,705 patents from the US Patent Office
  • 70% error with a plain-text classifier
  • 24% error with path-tagged terms
  • 17% error with path prefixes
• 1,700 resumes (with a naive Bayes classifier)
  • 53% error with flattened HTML
  • 40% error with prefix-tagged terms
• Simple tricks suffice to boost accuracy

Limitations

• Prefix representations are
  • ad hoc
  • inflexible
• Generalizability:
  • How to incorporate additional features, e.g., features derived from hyperlinks?
• Relations
  • A uniform way to codify hypertextual features
  • Example:
    classified(A, facultyPage) :- contains-text(A, professor), contains-text(A, phd), links(B, A), contains-text(B, faculty).

Rule Induction for relational learning

• Inductive classifiers
  • Goal: discover a set of predicate rules from a collection of relations
• Consider the two-class setting
  • Positive examples: D+
  • Negative examples: D−
  • Test instance: apply the predicate rules
    - True => positive instance
    - Else => negative instance

Rule induction with First Order Inductive Logic (FOIL)

• A well-known rule learner
• Start with an empty rule set; until no positive instances are left:
  1. Learn a new rule.
  2. Add conjunctive literals to the new rule (specialize) until no negative example is covered by the new rule.
  3. Pick a literal which increases the ratio of surviving positive to negative bindings rapidly.
  4. Remove positive examples covered by any rule generated thus far.

Types of Literals Explored

• Q(X_1, …, X_k), where Q is a relation and the X_i are variables, at least one of which must already be bound
• X_i = X_j, X_i ≠ X_j, X_i = c, X_i ≠ c (and similar comparisons), where X_i, X_j are variables and c is a constant
• not(L), where L is a literal of the above forms

Advantages of Relational Learning

• Can learn class labels for individual pages
• Can learn relationships between labels
  • member(homePage, department)
  • teaches(homePage, coursePage)
  • advises(homePage, homePage)
  • writes(homePage, paper)
• Hybrid approaches
  • A statistical classifier (naïve Bayes) enables a more complex search for literals
  • Inductive learning then compares the estimated probabilities of the various classes

Advantages of Relational Learning (contd.)

• Recursively labeling relations
  • Relate a page's label to the labels of neighboring pages:
    classified(A, facultyPage) :- links-to(A, B), classified(B, studentPage), links-to(A, C), classified(C, coursePage), links-to(A, D), classified(D, publicationsPage).