Bootstrapping Information Extraction from Semi-Structured Web Pages

Andy Carlson (Machine Learning Department, Carnegie Mellon)

Charles Schafer (Google Pittsburgh)

ECML/PKDD 2008

Semi-Structured Web Pages: Vacation Rentals

Semi-Structured Web Pages: Nobel Prize Winners

Semi-Structured Web Pages: Museum Collections

Structured Data

Structured data enables better search interfaces

Supervised Information Extraction

Supervised IE allows a user to annotate pages and train a ‘wrapper’ for the site.

Bootstrapping IE from Semi-Structured Web Pages

Assume that we have wrappers for a number of sites in a domain and thus many records from those sites.

Can we use what we’ve learned to automatically wrap a new site in the same domain?

From unlabeled pages to DOM trees

[Figure: each unlabeled page from the new site (text nodes under <html> and <body>) is parsed into its own DOM tree.]
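The parsing step on this slide can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the authors' implementation: the `Node` class, the `VOID_TAGS` set, and the `parse_dom` and `text_paths` helpers are hypothetical names introduced here.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (a partial, assumed list).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

class Node:
    def __init__(self, tag, text=None):
        self.tag = tag        # element name, or "#text" for text nodes
        self.text = text      # only set for text nodes
        self.children = []

class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]   # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))

def parse_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def text_paths(node, path=""):
    """Yield (tag path, text) for every text node, in document order."""
    here = f"{path}/{node.tag}"
    if node.tag == "#text":
        yield here, node.text
    for child in node.children:
        yield from text_paths(child, here)
```

Calling `list(text_paths(parse_dom(page)))` on each unlabeled page gives one list of text nodes per page, each with the tag path that locates it in the tree.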

From DOM trees to template tree

[Figure: the DOM trees of the individual pages are combined by tree alignment into a single template tree.]
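A heavily simplified sketch of how aligned DOM trees can yield a template tree follows. It assumes all pages have identical tag structure and aligns children purely by position; real tree alignment must also handle inserted, missing, and repeated nodes, which this sketch does not attempt. The tuple representation and the `#field` marker are assumptions made for illustration.

```python
def align(trees):
    """Merge same-shaped trees into one template tree.

    Each tree is either ("#text", string) or (tag, [children]).
    Text that is identical on every page stays in the template as "#text";
    text that varies across pages becomes a "#field" holding all its values.
    """
    tag = trees[0][0]
    if any(t[0] != tag for t in trees):
        raise ValueError("tags differ; a real aligner would handle this case")

    if tag == "#text":
        values = [t[1] for t in trees]
        if len(set(values)) == 1:
            return ("#text", values[0])   # constant text: part of the template
        return ("#field", values)         # varying text: candidate data field

    child_lists = [t[1] for t in trees]
    if len({len(children) for children in child_lists}) != 1:
        raise ValueError("child counts differ; positional alignment fails")

    # Align the i-th child of every page with each other.
    merged = [align(list(group)) for group in zip(*child_lists)]
    return (tag, merged)


page1 = ("html", [("body", [("b", [("#text", "Bedrooms:")]), ("#text", "3")])])
page2 = ("html", [("body", [("b", [("#text", "Bedrooms:")]), ("#text", "2")])])
template = align([page1, page2])
```

In the resulting `template`, the constant label "Bedrooms:" remains template text, while the varying bedroom counts are collected into a single `#field` node that can later be labeled.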

Supervised setting: Labels from user annotations

Learn labels from user annotations

[Figure: user annotations turn the generalized template into a generalized extraction template.]

Bootstrapping setting: Labels from classifiers

Label data fields with classifiers

[Figure: classifiers label the data fields of the generalized template, producing a generalized extraction template. Example field contents shown: the context "Bedrooms:" and the values Boston, Las Vegas, New York, Miami, Palm Springs, New York.]

Framing the classification problem

[Figure: fields from the new site, such as city names (Boston, Las Vegas, New York, Miami, Palm Springs) and amenities (Grill, DVD Player, Heated Pool, Deck, Gas Grill), are compared against fields from training Sites A, B, and C. The training fields include city names, dates, numeric values, phone numbers, and prices, with contexts such as "Amenities:", "Bedrooms:" (or "Bedroom:"), and "Description:".]

Comparing fields: Feature types

Content:

Tokens

- Split on tokens because many data types have some vocabulary, but token order is not important.

Character 3-grams

- Useful for matching “fulltime” and “full-time”

Token types (all digits, all caps, etc.)

- Helpful for addresses, unique IDs, other fields with a mix of token types

Context:

Precontext character 3-grams

- Sites vary their wordings, but often use variants of the same words
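The four feature types listed above can be sketched as count distributions over one field. This is an illustrative sketch only: the tokenizer, the token-type categories, and the function names are assumptions, not the feature definitions used in the talk.

```python
import re
from collections import Counter

def tokenize(text):
    # Words and single punctuation marks (an assumed, simple tokenizer).
    return re.findall(r"\w+|[^\w\s]", text)

def char_trigrams(text):
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]

def token_type(token):
    # A hypothetical set of coarse token classes.
    if token.isdigit():
        return "ALL_DIGITS"
    if token.isalpha() and token.isupper():
        return "ALL_CAPS"
    if token.isalpha() and token.istitle():
        return "CAPITALIZED"
    if token.isalpha():
        return "ALPHA"
    if token.isalnum():
        return "ALPHANUMERIC"
    return "OTHER"

def field_features(values, precontexts):
    """Count features over all values (and preceding contexts) of one field."""
    feats = {
        "tokens": Counter(),
        "char3": Counter(),
        "token_types": Counter(),
        "precontext_char3": Counter(),
    }
    for value in values:
        tokens = tokenize(value)
        feats["tokens"].update(t.lower() for t in tokens)
        feats["char3"].update(char_trigrams(value))
        feats["token_types"].update(token_type(t) for t in tokens)
    for context in precontexts:
        feats["precontext_char3"].update(char_trigrams(context))
    return feats
```

Character 3-grams give "fulltime" and "full-time" several shared features ("ful", "ull", "tim", "ime") even though they share no whole token, which is the matching behavior the slide describes.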

Naïve classification attempt

Logistic Regression:

• Each data field from training sites is a labeled instance for each schema column

• Use features we just described

Problems:

• Tens of training instances

• Tens of thousands of features

• Serious overfitting

Coarser Features: Distributional similarity

Treat each field as a distribution of values

Compute distributional similarity for each feature type:

Smooth and normalize to Skew Similarity
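One way to turn two fields' feature counts into a similarity score is the skew divergence, a smoothed form of KL divergence. The sketch below assumes that definition, KL(p || alpha*q + (1-alpha)*p), and assumes one particular normalization (dividing by the largest possible divergence, -log(1-alpha)); the formula on the original slide is not preserved in this text, so the smoothing constant and normalization here may differ from the authors' choices.

```python
import math
from collections import Counter

def normalize(counts):
    """Turn non-empty positive counts into a probability distribution."""
    total = sum(counts.values())
    return {event: c / total for event, c in counts.items()}

def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha*q + (1-alpha)*p).

    Mixing a little of p into q keeps the result finite even when q
    assigns zero probability to something that p has seen.
    """
    total = 0.0
    for event, p_x in p.items():
        mixed = alpha * q.get(event, 0.0) + (1 - alpha) * p_x
        total += p_x * math.log(p_x / mixed)
    return total

def skew_similarity(counts_a, counts_b, alpha=0.99):
    """Map skew divergence to [0, 1]; 1 means identical distributions.

    Assumed normalization: -log(1 - alpha) is the divergence reached
    when the two distributions share no events at all.
    """
    p, q = normalize(counts_a), normalize(counts_b)
    worst = -math.log(1 - alpha)
    return 1.0 - skew_divergence(p, q, alpha) / worst


cities_new = Counter(["boston", "las vegas", "new york", "miami", "new york"])
cities_old = Counter(["boston", "houston", "atlanta", "las vegas", "new york"])
amenities = Counter(["grill", "dvd player", "heated pool", "deck"])
```

With these toy counts, `skew_similarity(cities_new, cities_old)` is higher than `skew_similarity(cities_new, amenities)`, because the two city fields overlap while the amenities field shares no values with them.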

Smarter classification attempt

Stacked Skews model:

• Each field from each training site is a labeled instance

• Features are distributional similarities, one per feature type

• Train a linear regression model

• Inspired by database schema matching work of [Madhavan et al., 2005]

Now:

• Tens of training instances

• One feature per feature type – just a handful

• Appropriately sized learning problem
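The stacking idea can be sketched as a small linear regression over one similarity score per feature type. This is a toy illustration under stated assumptions: the similarity values below are made up, the target encoding (1.0 for a field that belongs to the schema column, 0.0 otherwise) is an assumption, and plain batch gradient descent stands in for whatever solver the authors used.

```python
def predict(model, x):
    weights, bias = model
    return bias + sum(w * xi for w, xi in zip(weights, x))

def train_linear_regression(rows, targets, epochs=2000, lr=0.1):
    """Fit weights and bias by batch gradient descent on squared error."""
    n, n_features = len(rows), len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, targets):
            err = predict((weights, bias), x) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical training data: one row per training-site field, with columns
# [token sim, char 3-gram sim, token-type sim, precontext 3-gram sim]
# measured against a single schema column such as "Bedrooms".
rows = [
    [0.9, 0.8, 0.9, 0.7],
    [0.7, 0.9, 0.8, 0.9],
    [0.8, 0.7, 0.9, 0.8],
    [0.1, 0.2, 0.6, 0.1],
    [0.2, 0.1, 0.3, 0.2],
    [0.1, 0.3, 0.2, 0.1],
]
targets = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

model = train_linear_regression(rows, targets)
```

To label a new site, each of its fields would be scored with `predict(model, similarities)` for every schema column, and the best-scoring assignment kept. With only four inputs, a few dozen training instances are enough to fit the model, which is the point of the "appropriately sized learning problem" bullet.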

Related work

Unsupervised wrapper induction typically doesn’t label data fields

- e.g. [Chang & Kuo, 2004] [Zhai & Liu, 2005]

DeLa system of [Wang & Lochovsky, 2003]

- Heuristic rule-based mapping of fields to labels

- Requires explicit prompts of extracted fields

[Golgher et al., 2001]

- Finds exact matches of data values and looks for consistent context

Evaluation: Vacation rentals

Schema: Title, Bedrooms, Bathrooms, Sleeps, Property Type, Description, Address

Evaluation: Job listings

Schema: Title, Company, Location, Date Posted, Job Type, ID

Results

Accuracy by schema column

• Significantly outperforms logistic regression baseline.

• With a small, fixed investment of human effort, we can create wrappers for hundreds of sites in a domain.

Thank You

Results by Schema Column

Results by Web Site

Feature Type Ablation Study Results
