Web Information Extraction Learning based on Probabilistic Graphical Models Wai Lam Joint work with Tak-Lam Wong The Chinese University of Hong Kong


Page 1: Web Information Extraction Learning based on Probabilistic Graphical Models

Web Information Extraction Learning based on Probabilistic Graphical Models

Wai Lam

Joint work with Tak-Lam Wong

The Chinese University of Hong Kong


Sept 5, 2008 The Chinese University of Hong Kong 2

Introduction

Building advanced Web mining applications requires precise text information extraction from a large number of different Web sites.

Substantial human effort is needed for the information extraction task due to:
• diverse layout formats
• content variation


Wrapper Adaptation Problem (1)


Wrapper Adaptation Problem (2)

Learned wrapper

Wrapper learning


Product Attribute Extraction and Resolution Problem (1)

The Web contains a huge number of online stores selling millions of different kinds of products.


Product Attribute Extraction and Resolution Problem (2)

Traditional search engines typically treat every term in a Web document in a uniform fashion. Consider the digital camera domain: suppose a user supplies the query "auto white balance", trying to find cameras related to the product attribute "white balance".

A possible (incorrect) result: "auto ISO", which is about "light sensitivity", a product attribute different from "white balance".


Product Attribute Extraction and Resolution Problem (3)

Another related desirable task is to resolve the extracted data according to their semantics.

This can improve indexing of product Web pages and support intelligent tasks such as product search or product matching.


Our Approach

We have investigated learning frameworks for solving each of the Web information extraction tasks just presented. Probabilistic graphical models provide a principled paradigm for handling the uncertainty that arises during the learning process.

• A graphical model capturing information extraction knowledge for solving wrapper adaptation (ACM TOIT 2007).

• A graphical model for unsupervised learning to extract and resolve product attributes (SIGIR 2008).


Motivating Example

(Source: http://www.superwarehouse.com)

(Source: http://www.crayeon3.com)


Product Attribute Extraction

To extract product attributes: in the beginning, only the attribute "resolution" is known, matching "Effective sensor resolution" on the page.

• Layout format then helps extract: white balance, shutter speed

• Mutual cooperation then helps extract: light sensitivity


Product Attribute Resolution

Samples of extracted text fragments from a page: "cloudy", "daylight", etc. What do they refer to?

A text fragment extracted from another page: "white balance: auto, daylight, cloudy, tungsten, …"

Product attribute resolution: cluster text fragments referring to the same attribute into the same group.
• Better indexing for product search
• Easier understanding and interpretation


Existing Works (Supervised Learning)

Supervised wrapper learning (Chang et al., IEEE TKDE 2006):
• Needs training examples.
• The wrapper learned from one Web site cannot be applied to other sites.

Template-independent extraction (Zhu et al., SIGKDD 2007):
• Cannot handle previously unseen attributes.


Existing Works (Unsupervised Learning)

Template-based methods handle Web pages generated from the same template (Crescenzi et al., VLDB 2001). Extracted data may not be synchronized:

• "Aug 1993 $16.38" extracted from one page

• "Paperback Feb 1985 $6.95" extracted from another page

Synchronized data extraction (Chuang et al., VLDB 2007):
• Requires a field model (an HMM) for each field, as well as manually prepared training examples.
• Can only be applied to Web pages that contain multiple records.


Our Framework

1. Unsupervised learning framework for jointly extracting and resolving product attributes from different Web sites (SIGIR 2008).

2. Our framework consists of a graphical model which considers page-independent content information and page-dependent layout information.

3. Can extract an unlimited number of product attributes (Dirichlet process prior).

4. The resolved product attributes can be used for other intelligent tasks such as product search (AAAI 2008).


Problem Definition (1)

• A product domain, e.g., the digital camera domain.

• A set of reference attributes, e.g., "resolution", "white balance", etc., including a special element representing "not-an-attribute".

• A collection of Web pages from any Web sites, each of which contains a single product.

• Consider any text fragment from a Web page.


Problem Definition (2)

<TR> <TD> <P> <SPAN> White balance </SPAN> </P> </TD> <TD> <P> <SPAN> Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom </SPAN> </P> </TD> </TR>

Line separators divide the HTML into text fragments.


Problem Definition (3)

Each text fragment is associated with four kinds of information. For the fragment "White balance: Auto, daylight, …":

• Content information: the text itself
• Layout information: boldface, in-table
• Target information: 1 (related to an attribute)
• Attribute information: white balance


Problem Definition (4)

For the fragment "View larger image":

• Content information: the text itself
• Layout information: boldface, underline
• Target information: 0 (irrelevant)
• Attribute information: not-an-attribute


Problem Definition (5)

• Attribute extraction: infer the target information of each text fragment.

• Attribute resolution: infer the attribute information of each text fragment.

• Joint attribute extraction and resolution: infer both simultaneously from the content and layout information.


Graphical Models (1)

A graphical model is a family of probability distributions defined in terms of a directed or undirected graph.
• Nodes: random variables.
• Joint distribution: a product of functions defined on connected nodes.
• Provides general algorithms for computing marginal and conditional probabilities of interest.
• Provides control over the computational complexity associated with these operations.


Graphical Models (2)

One kind of graphical model is the directed graphical model. Let G = (V, E) be a directed acyclic graph, where V are the nodes and E are the edges.

Denote by pa(v) the parents of node v, and by X_V the collection of random variables indexed by the nodes. The joint probability distribution is expressed as:

p(x) = ∏_{v ∈ V} p(x_v | x_pa(v))
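This factorization can be made concrete on a toy chain-structured model. A minimal sketch (the three binary nodes and their conditional probability tables are invented for illustration):

```python
# Joint probability of a directed graphical model:
# p(x) = prod_v p(x_v | x_pa(v)), illustrated on a chain A -> B -> C.

# Conditional probability tables (hypothetical numbers).
p_a = {0: 0.6, 1: 0.4}                       # p(A)
p_b_given_a = {0: {0: 0.7, 1: 0.3},          # p(B | A)
               1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1},          # p(C | B)
               1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    """p(a, b, c) as the product over nodes of p(node | parents)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# A valid joint distribution must sum to 1 over all assignments.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 10))  # 1.0
```

The product needs only local tables, which is what makes marginal and conditional computations tractable for sparse graphs.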


Graphical Models (3)

E.g., a model in which a parameter node θ is the parent of Z1, Z2, Z3, …, ZN.

This model asserts that the variables Z1, …, ZN are conditionally independent and identically distributed given θ.


Graphical Models (4)

A plate is used to show the repetition of variables (node Zn inside a plate of size N, with parent θ); it thus expresses factorial and nested structures.


Graphical Models (5)

Finite Mixture Model

A generative approach to clustering:
• pick one of K clusters from a distribution π
• generate a data point from a cluster-specific probability distribution.

This yields a finite mixture model:

p(x | π, θ) = Σ_{k=1}^{K} π_k f(x | θ_k)

where π and θ are the parameters, and each cluster has the same parameterized family f.

Data are assumed to be generated conditionally IID from this mixture.
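The two-step generative story can be sketched directly. A minimal example (the mixing proportions and Gaussian cluster parameters are invented):

```python
import random

random.seed(0)

# Finite mixture: p(x) = sum_k pi_k * f(x | theta_k)
pi = [0.3, 0.7]                    # mixing proportions (hypothetical)
theta = [(-2.0, 0.5), (3.0, 1.0)]  # per-cluster (mean, std) for a Gaussian f

def sample():
    # Step 1: pick a cluster from pi.
    k = random.choices(range(len(pi)), weights=pi)[0]
    # Step 2: draw a point from the cluster-specific distribution.
    mu, sigma = theta[k]
    return k, random.gauss(mu, sigma)

data = [sample() for _ in range(10000)]
frac_k0 = sum(1 for k, _ in data if k == 0) / len(data)
print(round(frac_k0, 2))  # close to pi[0] = 0.3
```

Each point carries a single cluster indicator k, matching the single-component assumption stated on the next slide.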


Graphical Models (6)

Mixture models make the assumption that each data point arises from a single mixture component: the k-th cluster is, by definition, the set of data points arising from the k-th mixture component.

Finite Mixture Model


Graphical Models (7)

Another way to express this: define an underlying measure

G = Σ_{k=1}^{K} π_k δ_{ψ_k}

where δ_{ψ_k} is an atom at ψ_k. Then define the process of obtaining a sample from a finite mixture model as follows, for i = 1, …, N:

θ_i ~ G,  x_i ~ f(· | θ_i)

Note that each θ_i is equal to one of the underlying ψ_k; indeed, the subset of {θ_i} that maps to ψ_k is exactly the k-th cluster.

Finite Mixture Model


Graphical Models (8)

(Graphical model: G → θ_i → x_i, with i = 1, …, N inside a plate.)

Finite Mixture Model


Graphical Models (9)

Dirichlet Process Mixture

Define a countably infinite mixture model by taking K to infinity, where the mixing proportions π_k are generated by a stick-breaking construction governed by a concentration parameter α, and the atoms ψ_k are drawn from a base distribution G_0.

(Graphical model: α → π_k; G_0 → ψ_k; component indicator Z_i and observation x_i inside a plate of size N.)
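The stick-breaking construction makes the infinite mixture concrete: the mixing proportions are produced by repeatedly breaking off Beta(1, α)-distributed fractions of a unit-length stick. A minimal sketch (the value of α and the number of sticks drawn are arbitrary choices):

```python
import random

random.seed(1)

def stick_breaking(alpha, n_sticks):
    """GEM(alpha): v_k ~ Beta(1, alpha); pi_k = v_k * prod_{j<k} (1 - v_j)."""
    weights, remaining = [], 1.0
    for _ in range(n_sticks):
        v = random.betavariate(1.0, alpha)   # fraction of the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v                 # what is left for later components
    return weights

pi = stick_breaking(alpha=2.0, n_sticks=50)
print(round(sum(pi), 6))  # just below 1; the residual mass shrinks geometrically
```

A small α concentrates mass on a few components; a large α spreads it over many, which is how the model accommodates an unbounded number of attributes.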


Our Model (1)

Our graphical model can be regarded as an extension of the Dirichlet process mixture model.

• Each mixture component refers to a reference attribute and consists of two distributions, characterizing the content information and the target information.

• A Dirichlet process prior is employed, so the model can handle an unlimited number of reference attributes.


Our Model (2)

• Attribute extraction: infer the target information of each text fragment.

• Attribute resolution: infer the attribute information of each text fragment.

• Joint attribute extraction and resolution: infer both simultaneously from the content and layout information.


Our Model (3)

(Overall graphical model: a Dirichlet process prior, i.e., an infinite mixture model, over N text fragments drawn from S different Web sites.)


Our Model (4)

(For each of the N text fragments: target information, layout information, and content information. For each component k under the Dirichlet process prior, i.e., the infinite mixture model: the proportion of the k-th component in the mixture, the content information parameter of the k-th component, and the target information parameter of the k-th component.)


Our Model (5)

(For each of the S different Web sites: a site-dependent layout format parameter.)


Our Model (6)

(Hyperparameters of the Dirichlet process prior, i.e., the infinite mixture model: the concentration parameter for the DP, the base distribution for the content information, and the base distribution for the target information.)


Generation Process (1)


Generation Process (2)

The joint probability for generating a particular text fragment factorizes according to the graphical model, given the model parameters.

Inference requires the posterior of the unobservable variables given the observable variables and the model parameters. Computing this posterior exactly is intractable.


Variational Method (1)

The inference problem is transformed into an optimization problem.

The resulting variational optimization problems admit principled approximate solutions.

The solution to variational problems is often given in terms of fixed point equations that capture necessary conditions for optimality.

In contrast to other approximation methods such as MCMC, variational methods are deterministic.


Variational Method (2)

Finding the true posterior is intractable. Our goal: transform the inference problem into an optimization problem over a variational distribution q:

log p(observables) = L(q) + D(q || true posterior)

where D denotes the KL-divergence; the KL-divergence must be non-negative.


Variational Method (3)

The KL-divergence is zero if and only if the variational distribution q equals the true posterior. Maximizing with respect to q, we obtain a lower bound on the desired log-marginal probability:

log p(observables) ≥ E_q[log p(observables, unobservables)] − E_q[log q(unobservables)]

The left-hand side is the log-likelihood of the observable variables.
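The bound can be checked numerically on a toy model with one binary hidden variable: every distribution q lower-bounds the log evidence, and the true posterior attains it. The probabilities below are invented for illustration:

```python
import math

# Tiny model: hidden H in {0, 1}, one fixed observation x.
p_h = [0.5, 0.5]           # prior p(H)
p_x_given_h = [0.2, 0.9]   # likelihood of the observed x under each H

log_evidence = math.log(sum(p_h[h] * p_x_given_h[h] for h in (0, 1)))

def elbo(q):
    """E_q[log p(x, h)] - E_q[log q(h)] for a distribution q over H."""
    val = 0.0
    for h in (0, 1):
        if q[h] > 0:
            val += q[h] * (math.log(p_h[h] * p_x_given_h[h]) - math.log(q[h]))
    return val

# Any q lower-bounds the log evidence; the true posterior attains it.
posterior = [p_h[h] * p_x_given_h[h] / math.exp(log_evidence) for h in (0, 1)]
print(elbo([0.5, 0.5]) <= log_evidence)       # True: strict lower bound
print(abs(elbo(posterior) - log_evidence))    # ~0: gap is the KL-divergence
```

The gap between the log evidence and the bound is exactly the KL-divergence, which is why maximizing the bound drives q toward the true posterior.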


Variational Method (4)

The problem becomes maximizing this lower bound.


Variational Method (5)

Truncated stick-breaking process (Ishwaran and James, 2001): replace the infinite number of components with a finite truncation level K.
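Under truncation, the last stick proportion is forced to 1 so the K retained weights sum exactly to one. A minimal sketch (the stick proportions below are arbitrary example values):

```python
# Truncated stick-breaking: keep K sticks and force v_K = 1 so the
# truncated weights form a proper distribution (Ishwaran and James, 2001).

def truncated_weights(vs):
    """Given stick proportions v_1..v_K, force v_K = 1 and return the weights."""
    vs = list(vs[:-1]) + [1.0]     # last stick absorbs all remaining mass
    weights, remaining = [], 1.0
    for v in vs:
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights

pi = truncated_weights([0.5, 0.3, 0.2, 0.7])
print(round(sum(pi), 10))  # 1.0
```

With the expected residual mass decaying geometrically in K, a modest truncation level already captures almost all of the prior mass.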


Variational Method (6)

• Content information: modeled as a mixture of tokens.
• Target information: binary.
• Layout information: a set of binary features.
Conjugate priors are employed for these distributions.


Variational Method (7)

Solve by a coordinate ascent algorithm. One important variational parameter answers: how likely does a text fragment come from the k-th component? This corresponds to attribute resolution.


Variational Method (8)

Another important variational parameter answers: how likely should a text fragment be extracted? This corresponds to attribute extraction.


Variational Method (9)

Other variational parameters:


Initialization

What should be extracted? Make use of a very small amount of prior information about the domain:

• Only a few terms about the product attributes, e.g., "resolution", "light sensitivity".

• These can be easily obtained, for example, by just highlighting the attributes on one single Web page.


EM Algorithm for Layout Parameters

Our framework can exploit the page-dependent layout format of text fragments to enhance extraction.

However, the layout format of an unseen Web site is unknown, so the layout parameters cannot be predefined or estimated in advance.

E-step: apply the coordinate ascent algorithm until convergence, achieving the optimality conditions for all variational parameters.

M-step: re-estimate the site-dependent layout parameters.
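The E/M alternation can be sketched on a drastically simplified version of the layout model, where each component has a single Bernoulli layout feature (e.g., whether a fragment is boldface). This illustrates only the loop structure, not the paper's actual update equations:

```python
# Simplified EM sketch: two components, each with one Bernoulli parameter
# theta[k] = P(layout feature on | component k). Illustration only.

def em(features, n_iters=20):
    """features: list of 0/1 layout observations across text fragments."""
    theta = [0.25, 0.75]          # hypothetical initialization
    pi = [0.5, 0.5]               # mixing proportions
    for _ in range(n_iters):
        # E-step: responsibilities r[k] for each observation x
        resp = []
        for x in features:
            w = [pi[k] * (theta[k] if x else 1 - theta[k]) for k in (0, 1)]
            s = sum(w)
            resp.append([v / s for v in w])
        # M-step: re-estimate mixing proportions and Bernoulli parameters
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(features)
            theta[k] = sum(r[k] * x for r, x in zip(resp, features)) / nk
    return pi, theta

pi, theta = em([1, 1, 1, 0, 0, 1, 0, 1])
print(round(pi[0] + pi[1], 6))  # 1.0
```

In the actual framework, the E-step is the full coordinate ascent over variational parameters and the M-step re-estimates the site-dependent layout parameters.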


Experiments

We have conducted experiments on four different domains:
• Digital camera: 85 Web pages from 41 different sites
• MP3 player: 96 Web pages from 62 different sites
• Camcorder: 111 Web pages from 61 different sites
• Restaurant: 29 Web pages from the LA-Weekly Restaurant Guide

In each domain, we conducted 10 runs of experiments. In each run, we randomly selected a Web page and picked a few terms inside it for initialization.


Evaluation on Attribute Resolution

Baseline approach (Bilenko & Mooney, SIGKDD 2003):
• Agglomerative clustering using edit distance between text fragments

Evaluation metrics:
• Pairwise recall (R)
• Pairwise precision (P)
• Pairwise F1-measure (F)
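Pairwise metrics treat every pair of fragments as a binary decision: does the pair share a cluster? A minimal sketch (the example fragments and labels are invented):

```python
from itertools import combinations

def pairwise_prf(predicted, gold):
    """Pairwise precision/recall/F1 between two clusterings.

    predicted, gold: dicts mapping each item to its cluster label.
    A pair counts as positive when both items share a cluster.
    """
    items = list(predicted)
    pred_pairs = {p for p in combinations(items, 2)
                  if predicted[p[0]] == predicted[p[1]]}
    gold_pairs = {p for p in combinations(items, 2)
                  if gold[p[0]] == gold[p[1]]}
    tp = len(pred_pairs & gold_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {"auto wb": "white balance", "daylight": "white balance",
        "iso 100": "light sensitivity"}
pred = {"auto wb": 0, "daylight": 0, "iso 100": 0}  # everything in one cluster
print(pairwise_prf(pred, gold))  # (1/3, 1.0, 0.5)
```

Lumping everything together yields perfect pairwise recall but poor precision, which is why the F1-measure is the headline number.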


Results of Attribute Resolution


Visualize the Resolved Attributes

The top five weighted terms in the ten largest resolved attributes in the digital camera domain:


Evaluation on Attribute Extraction

Surprisingly, in the restaurant domain, our framework achieves a performance (0.95 F1-measure) comparable to that of the supervised method (Muslea et al., 2001).


Conclusions

We investigated learning frameworks for automating and adapting the extraction task based on probabilistic graphical models, which provide a principled paradigm for handling uncertainty during the learning process.

We have developed a graphical model, which employs a Dirichlet process prior, to model the generation of text fragments in Web pages for solving the tasks of product attribute extraction and resolution across different Web sites.

An unsupervised inference algorithm based on the variational method has been designed.

We formally show that content and layout information can collaborate and improve both extraction and resolution performance under our model.


Questions and Answers


Variational Method (1)

Finding the true posterior is intractable. Our goal: transform the inference problem into an optimization problem.

Since the KL-divergence must be non-negative, we obtain a lower bound on the left-hand side, the log-likelihood of the observable variables.


Variational Method (2)

KL divergence:

The problem becomes maximizing the resulting lower bound.


Variational Method (3)

Truncated stick-breaking process (Ishwaran and James, 2001): replace the infinite number of components with a finite truncation level K.


Variational Method (4)

• Content information: modeled as a mixture of tokens.
• Target information: binary.
• Layout information: a set of binary features.
Conjugate priors are employed for these distributions.


Variational Method (5)

After applying the truncated stick-breaking process:


Variational Method (6)

Solve by coordinate ascent: differentiate the objective with respect to each variational parameter and set the derivative to zero.


Variational Method (7)

One important variational parameter answers: how likely does a text fragment come from the k-th component? This corresponds to attribute resolution.


Variational Method (8)

Another important variational parameter answers: how likely should a text fragment be extracted? This corresponds to attribute extraction.


Unsupervised Approach

We make use of prior knowledge in the form of a list of a few terms related to product attributes.

The terms are not required to be categorized into different attributes. For the i-th term in the list, we select the i-th component in our model and set a higher value of the component's content parameter for that term (10 in particular), and zero otherwise. Next, for these components, we set the target-information parameters so that 6 out of 10 text fragments in the component are expected to be related to attribute values; the other components keep their default settings.
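The seeding step might look as follows in code. This is a hypothetical sketch: the function and variable names are invented, and only the "weight 10 for the seed term, zero otherwise" rule from the text is reproduced:

```python
# Hypothetical sketch of the seeding step: for each seed term, the i-th
# mixture component's content-parameter weight for that term is boosted
# (to 10, as in the text) and all other vocabulary weights start at zero.
# Names here are illustrative, not the paper's notation.

vocab = ["resolution", "light", "sensitivity", "balance", "auto"]
seed_terms = ["resolution", "sensitivity"]      # the prior-knowledge list

def init_content_params(vocab, seed_terms, boost=10.0):
    params = []
    for term in seed_terms:                     # i-th term -> i-th component
        weights = {w: (boost if w == term else 0.0) for w in vocab}
        params.append(weights)
    return params

params = init_content_params(vocab, seed_terms)
print(params[0]["resolution"], params[0]["balance"])  # 10.0 0.0
```

The boosted weights bias the corresponding components toward the seed terms at the start of inference; coordinate ascent then refines all parameters from the data.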