Machine learning for the Web: Applications and challenges Soumen Chakrabarti Center for Intelligent Internet Research Computer Science and Engineering IIT Bombay www.cse.iitb.ernet.in/~soumen


Machine learning for the Web: Applications and challenges

Soumen Chakrabarti

Center for Intelligent Internet Research
Computer Science and Engineering

IIT Bombay

www.cse.iitb.ernet.in/~soumen


Traditional supervised learning

Independent variables $x$: mostly continuous, maybe categorical
Predicted variable $y$: discrete (classification) or continuous (regression)

[Figure: a training instance $(x_1, x_2, \ldots, x_n;\ y)$ goes to the learner, which produces statistical models, inference rules, or separators; a test instance $(x_1, x_2, \ldots, x_n)$ goes to the learner, which outputs the prediction $y$]


Traditional unsupervised learning

No training / testing phases
Input is a collection of records $(x_1, x_2, \ldots, x_n)$ with independent attributes alone
Measure of similarity
Partition or cover instances using clusters with large “self-similarity” and small “cross-similarity”
Hierarchical partitions

[Figure: clusters with large self-similarity and small cross-similarity]


Learning hypertext models

Entities are pages, sites, paragraphs, links, people, bookmarks, clickstreams…
Transformed into simple models and relations
• Vector space/bag-of-words
• Hyperlink graph
• Topic directories
• Discrete time series

occurs(term, page, cnt)
cites(page, page)
is-a(topic, topic)
example(topic, page)
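The four relations listed on this slide could be held in plain Python structures. This is only a minimal sketch: the page names, terms, and topics are toy values made up for illustration, and `term_count` is a hypothetical helper, not something from the talk.

```python
from collections import Counter, defaultdict

# occurs(term, page, cnt): bag-of-words counts per page (toy data)
occurs = defaultdict(Counter)
occurs["p1"].update("machine learning for the web".split())
occurs["p2"].update("web graph learning learning".split())

# cites(page, page): hyperlink graph as adjacency sets
cites = {"p1": {"p2"}, "p2": set()}

# is-a(topic, topic) and example(topic, page): a tiny topic directory
is_a = {"machine-learning": "computer-science"}
example = {"machine-learning": {"p1", "p2"}}


def term_count(term, page):
    """Return cnt for occurs(term, page, cnt), or 0 if the term is absent."""
    return occurs[page][term]
```

Keeping each relation in its own structure mirrors the slide's point that rich hypertext is flattened into a few simple relations before learning.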


Challenges

Large feature space in raw data
• Structured data sets: 10s to 100s
• Text (Web): 50 to 100 thousand
Most features not completely useless
• Feature elimination / selection not perfect
• Beyond linear transformations?
Models used today are simplistic
• Good accuracy on simple labeling tasks
• Lose a lot of detail present in hypertext to fit known learning techniques


Challenges

Complex, interrelated objects
• Not a structured tuple-like entity
• Explicit and implicit connections
  • Document markup sub-structure
  • Site boundaries and hyperlinks
  • Placement in popular directories like Yahoo!
Traditional distance measures are noisy
• How to combine diverse features? (Or, a link is worth a ? words)
• Unreliable clustering results


This session

Semi-supervised clustering (Rich Caruana)
• Enhanced clustering via user feedback
Kernel methods (Nello Cristianini)
• Modular learning systems for text and hypertext
Reference matching (Andrew McCallum)
• Recovering and cleaning implicit citation graphs from unstructured data


This talk: Two examples

Learning topics of hypertext documents
• Semi-supervised learning scenario
• Unified model of text and hyperlinks
• Enhanced accuracy of topic labeling
Segmenting hierarchical tagged pages
• Topic distillation (hubs and authorities)
• Minimum description length segmentation
• Better focused topic distillation
• Extract relevant fragments from pages


Classifying interconnected entities

Early examples:
• Some diseases have complex lineage dependency
• Robust edge detection in images
How are topics interconnected in hypertext?
Maximum likelihood graph labeling with many classes

[Figure: finding edge pixels in a differentiated image; a graph of unlabeled (“?”) nodes, annotated with .3 red, .7 blue and the values 0.6, 0.4, 0.3, 0.7]


Naïve Bayes classifiers

Decide topic; topic $c$ is picked with prior probability $\pi(c)$; $\sum_c \pi(c) = 1$
Each $c$ has parameters $\theta(c,t)$ for terms $t$
Coin with face probabilities $\sum_t \theta(c,t) = 1$
Fix document length $n(d)$ and toss coin
Naïve yet effective; can use other algos
Given $c$, probability of document is

$$\Pr[d \mid n(d), c] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$$
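The multinomial document model on this slide can be sketched in a few lines. This is a rough illustration rather than the speaker's implementation: `prior` stands for the class prior, `theta` for the per-class term probabilities, and both are filled with toy values invented for the example. The sketch assumes every term in the document has a nonzero probability under every class, so it does no smoothing.

```python
import math
from collections import Counter


def log_multinomial_coeff(counts):
    """log of n(d)! / prod_t n(d,t)!, computed with lgamma."""
    n = sum(counts.values())
    return math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts.values())


def log_prob_doc(counts, theta_c):
    """log Pr[d | n(d), c] under the multinomial model for one class c."""
    return log_multinomial_coeff(counts) + sum(
        k * math.log(theta_c[t]) for t, k in counts.items()
    )


def classify(counts, prior, theta):
    """Pick the class maximizing log prior(c) + log Pr[d | n(d), c]."""
    return max(
        prior,
        key=lambda c: math.log(prior[c]) + log_prob_doc(counts, theta[c]),
    )


# Toy parameters (hypothetical): two classes over a three-term vocabulary.
prior = {"sports": 0.5, "tech": 0.5}
theta = {
    "sports": {"ball": 0.6, "web": 0.1, "score": 0.3},
    "tech": {"ball": 0.1, "web": 0.7, "score": 0.2},
}
doc = Counter(["web", "web", "score"])
best = classify(doc, prior, theta)
```

Working in log space avoids underflow on long documents; the multinomial coefficient is the same for every class, so it could be dropped when only the arg max is needed.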


Enhanced models for hypertext

c = class, d = text, N = neighbors
Text-only model: Pr(d | c)
Using neighbors’ text to judge my topic: Pr(d, d(N) | c)
Better recursive model: Pr(d, c(N) | c)
Relaxation labeling over Markov random fields
Or, EM formulation
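One way to picture the recursive model is a simple relaxation loop in which each page's class belief is recomputed from its own text evidence and its neighbors' current beliefs. The sketch below is an illustrative simplification and not the procedure from the talk: `text_score`, `compat` (an assumed class-to-class link compatibility table), the simultaneous update rule, and the toy graph are all assumptions.

```python
def normalize(scores):
    """Scale a dict of nonnegative scores so they sum to 1."""
    z = sum(scores.values())
    return {k: x / z for k, x in scores.items()}


def relax(text_score, neighbors, compat, iters=10):
    """Iteratively re-estimate class beliefs for linked pages.

    text_score[v][c]: prior times text likelihood for page v and class c.
    neighbors[v]: pages adjacent to v.
    compat[c][c2]: assumed probability that a class-c page links to class c2.
    """
    classes = list(compat)
    belief = {v: normalize(s) for v, s in text_score.items()}  # text-only start
    for _ in range(iters):
        new = {}
        for v in text_score:
            scores = {}
            for c in classes:
                s = text_score[v][c]
                for u in neighbors[v]:
                    # Expected compatibility with neighbor u's current belief
                    s *= sum(compat[c][c2] * belief[u][c2] for c2 in classes)
                scores[c] = s
            new[v] = normalize(scores)
        belief = new
    return belief


# Toy example: page "b" has uninformative text but links to a clearly red page.
text_score = {"a": {"red": 0.9, "blue": 0.1}, "b": {"red": 0.5, "blue": 0.5}}
neighbors = {"a": ["b"], "b": ["a"]}
compat = {"red": {"red": 0.8, "blue": 0.2}, "blue": {"red": 0.2, "blue": 0.8}}
belief = relax(text_score, neighbors, compat)
```

In this toy run the ambiguous page is pulled toward its neighbor's class, which is the effect the slide attributes to using neighbors' classes rather than their raw text.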


Hyperlink modeling boosts accuracy

9600 patents from 12 classes marked by USPTO
Patents have text and prior art links
Expand test patent to include neighborhood
‘Forget’ and re-estimate fraction of neighbors’ classes
(Even better for Yahoo)

[Figure: %Error (0 to 40) versus %Neighborhood known (0 to 100) for the Text, Link, and Text+Link classifiers]


Hyperlink Induced Topic Search

$a = E^T h$
$h = E a$
‘Hubs’ and ‘authorities’

[Figure: a query goes to a keyword search engine; the response is grown into a radius-1 expanded graph whose nodes are labeled h (hubs) and a (authorities)]
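The hub and authority updates on this slide amount to repeated propagation along links followed by normalization. A minimal sketch under stated assumptions: the graph is given as an edge list, scores are L2-normalized each round, and the function name `hits`, the iteration count, and the toy graph are all invented for illustration.

```python
import math


def hits(edges, iters=50):
    """edges: list of (source, target) links. Returns (hub, authority) dicts."""
    nodes = {u for e in edges for u in e}
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iters):
        # a = E^T h: authority is the sum of hub scores of pages linking in
        auth = {v: 0.0 for v in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # h = E a: hub score is the sum of authority scores of pages linked to
        hub = {v: 0.0 for v in nodes}
        for u, v in edges:
            hub[u] += auth[v]
        # L2-normalize both vectors so the scores stay bounded
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return hub, auth


# Toy graph: h1 links to both authorities, h2 links to only one.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
hub, auth = hits(edges)
```

In the toy graph the page cited by both hubs ends up with the higher authority score, and the hub that cites both authorities ends up with the higher hub score.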


“Topic drift” and various fixes

Some hubs have ‘mixed’ content
Authority ‘leaks’ through mixed hubs from good to bad pages
Clever: match query with anchor text to favor some edges
B&H: eliminate outlier documents

[Figure, Clever: query term, activation window, ‘thick’ links. Figure, B&H: vector-space document model, centroid, cut-off radius]


Document object model (DOM)

Hierarchical graph model for semi-structured data
Can extract reasonable DOM from HTML
A fine-grained view of the Web
Valuable because page boundaries are less meaningful now

<html><head><title>Portals</title></head><body><ul><li><a href="…">Yahoo</a></li><li><a href="…">Lycos</a></li></ul></body></html>

html
  head
    title
  body
    ul
      li
        a
      li
        a
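The tag tree for the example page can be recovered with Python's standard `html.parser` module. This is a minimal sketch that assumes properly nested tags with no void elements, as in the slide's example; `TreeBuilder` and `outline` are hypothetical names, and the elided `href` values are kept as they appear on the slide.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds a nested (tag, children) tree from well-nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = ("document", [])  # synthetic root above <html>
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes every end tag closes the most recently opened element.
        if len(self.stack) > 1:
            self.stack.pop()


def outline(node, depth=0):
    """Return the tree as indented lines, one tag per line."""
    tag, children = node
    lines = ["  " * depth + tag]
    for child in children:
        lines.extend(outline(child, depth + 1))
    return lines


page = (
    "<html><head><title>Portals</title></head><body><ul>"
    '<li><a href="…">Yahoo</a></li><li><a href="…">Lycos</a></li>'
    "</ul></body></html>"
)
builder = TreeBuilder()
builder.feed(page)
builder.close()
tree_lines = outline(builder.root[1][0])  # start at the <html> node
```

Real Web pages are rarely this clean, so a production system would need a tolerant parser; the sketch only shows the shape of the resulting tree.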


A model for hub generation

Global hub score distribution $\Theta_0$ w.r.t. given query
Authors use DOM nodes to specialize $\Theta_0$ into local $\Theta_I$
At a certain ‘cut’ in the DOM tree, local distribution directly generates hub scores

[Figure: global distribution, progressive ‘distortion’, model frontier, other pages]


Optimizing a cost measure

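Reference distribution $\Theta_0$
Data encoding cost is roughly $-\sum_{h \in H_v} \log \Pr(h \mid \Theta_v)$
Distribution distortion cost is the KL divergence between $\Theta_0$ and $\Theta_v$ (evaluated for the Poisson distribution)

[Figure: DOM node $v$ with hub scores $H_v$]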
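Both cost terms are easy to compute when the distributions are Poisson. The sketch below is an illustration under stated assumptions, not the talk's code: hub scores are treated as nonnegative integer counts, costs are measured in nats, and the argument order of the divergence is left to the caller because the slide's exact expression is not reproduced here. The function names are hypothetical.

```python
import math


def poisson_log_pmf(k, mu):
    """log Pr(k) under a Poisson distribution with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def data_cost(hub_counts, mu_v):
    """Negative log-likelihood of the observed counts under Poisson(mu_v)."""
    return -sum(poisson_log_pmf(h, mu_v) for h in hub_counts)


def kl_poisson(mu_p, mu_q):
    """KL(Poisson(mu_p) || Poisson(mu_q)) in closed form."""
    return mu_p * math.log(mu_p / mu_q) + mu_q - mu_p


def kl_numeric(mu_p, mu_q, kmax=200):
    """The same divergence by direct summation, as a sanity check."""
    return sum(
        math.exp(poisson_log_pmf(k, mu_p))
        * (poisson_log_pmf(k, mu_p) - poisson_log_pmf(k, mu_q))
        for k in range(kmax)
    )
```

A segmentation step in the minimum description length spirit would compare the sum of these two costs for keeping a local distribution at a node against the cost of reusing the reference distribution there.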


Modified topic distillation algorithm

Will this (non-linear) system converge?
Will segmentation help in reducing drift?

Initialize DOM graph
Let only root set authority scores be 1
Repeat until reasonable convergence:
  Authority-to-hub score propagation
  MDL-based hub score smoothing
  Hub-to-authority score propagation
  Normalization of authority scores
Segment and rank micro-hubs
Present annotated results


Convergence

28 queries used in Clever and by B&H
366k macro-pages, 10M micro-links
Rank converges within 15 iterations

[Figure: mean authority score change (log scale, 1.00E-07 to 1.00E-01) versus iterations (0 to 10)]


Effect of micro-hub segmentation

‘Expanded’ implies authority diffusion arrested
As nodes outside root set start participating in the distillation…
• #Expanded increases
• #Pruned decreases
Prevents authority leaks via mixed hubs

[Figure: smoothing statistics (0 to 2500) versus iterations (1 to 9) for the Expanded and Pruned series]


Rank correlation with B&H

Positively correlated
Some negative deviations
Pseudo-authorities downgraded by our algorithm
These were earlier favored by mixed hubs

[Figure: our authority score (0 to 0.025) versus authority score B&H (0 to 0.015); axes not to same scale]
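The slide does not say which rank correlation statistic was used. As one common choice, Spearman's coefficient can be computed from two score lists as follows; this is a sketch of that assumed statistic, with hypothetical function names, and it is not taken from the talk.

```python
def ranks(values):
    """Rank positions (1 = smallest); ties share the average position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because only ranks matter, the statistic is unaffected by the two score axes being on different scales, which suits a comparison like the one plotted here.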


Conclusion

Hypertext and the Web pose new modeling and algorithmic challenges

Locality exists in many guises
Diverse sources of information: text, links, markup, usage
Unifying models needed
Anecdotes suggest that synergy can be exploited