Learning to Extract Form Labels
Nguyen et al.

The Challenge

We want to retrieve and integrate online databases
Most online databases are accessed through forms
The better we can understand the forms, the better we know the databases

Web Forms

Most forms on the web are very different

The Solution

Introducing … LABELEX
A learning-based approach for automatically parsing and extracting element labels of forms used by humans

Overview

Basic Definitions

Forms contain elements and labels
Elements are textboxes, lists, etc.
Labels represent attributes or fields
Elements are associated with labels
Element domain is the range of values an element accepts
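
The definitions above can be pictured as a small data model. This is only a sketch: the class and field names (`Label`, `Element`, `Form`, `internal_name`, `domain`) are hypothetical and chosen for illustration, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Label:
    """Visible text that names an attribute or field of the form."""
    text: str


@dataclass
class Element:
    """An input control such as a textbox or a selection list."""
    kind: str                                         # e.g. "textbox", "select"
    internal_name: str                                # the control's name in the form markup
    domain: List[str] = field(default_factory=list)   # values the element can take
    label: Optional[Label] = None                     # set once a label is associated


@dataclass
class Form:
    """A form is a collection of elements; labels hang off the elements."""
    elements: List[Element] = field(default_factory=list)


# Hypothetical flight-search form: one element is labeled, one is not yet.
form = Form(elements=[
    Element(kind="textbox", internal_name="dep_city", label=Label("From")),
    Element(kind="select", internal_name="cabin", domain=["Economy", "Business"]),
])
unlabeled = [e.internal_name for e in form.elements if e.label is None]
```

Label extraction then amounts to filling in the `label` field for every element where it is still empty.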

Algorithm Description

Generating candidate mappings
Extracting features
Learning to identify mappings
Using prior knowledge to discover new labels

Generating Mapping Candidates

Mappings between labels and elements are generated
We consider only text close to the element
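
One way to read "only text close to the element" is a distance cutoff on rendered positions. The sketch below assumes each text fragment and element is reduced to a center point, and that closeness means Euclidean distance under a threshold. `Box`, `candidate_mappings`, and `max_dist` are hypothetical names, and the slides do not say how the cutoff is actually defined.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """A rendered item (text fragment or form element) located by its center."""
    name: str
    x: float
    y: float


def candidate_mappings(texts: List[Box], elements: List[Box],
                       max_dist: float) -> List[Tuple[str, str]]:
    """Pair every element with each text fragment within max_dist of it."""
    pairs = []
    for el in elements:
        for t in texts:
            if hypot(t.x - el.x, t.y - el.y) <= max_dist:
                pairs.append((t.name, el.name))
    return pairs


# Hypothetical layout: two labels next to two textboxes, plus a distant banner.
texts = [Box("From", 10, 10), Box("To", 10, 40), Box("Save $220", 300, 200)]
elements = [Box("dep_city", 60, 10), Box("arr_city", 60, 40)]
pairs = candidate_mappings(texts, elements, max_dist=60.0)
```

With this layout the banner text never becomes a candidate, while both "From" and "To" remain candidates for both textboxes; choosing between them is left to the later classification steps.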

Generating Mapping Candidates

Example

Extracting Features

Form Elements and Labels
  Elements: Type
  Labels: Font and Size

Label-Element Similarity
  Uses internal name and default value (LCS)

Spatial Feature
  Topological features: top, bottom, left, etc.
  Label-element distance (normalized)
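
The similarity and spatial features could be computed roughly as follows. The sketch assumes that LCS means longest common subsequence, that similarity is scaled by the shorter string, and that distance is normalized by the form diagonal. The slides confirm none of these choices, and all function names are hypothetical.

```python
from math import hypot


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(label: str, internal_name: str) -> float:
    """LCS length scaled to [0, 1] by the shorter string (assumed normalization)."""
    a, b = label.lower(), internal_name.lower()
    shortest = min(len(a), len(b))
    return lcs_length(a, b) / shortest if shortest else 0.0


def position(label_xy, element_xy) -> str:
    """Coarse topological feature: where the label sits relative to the element."""
    dx = label_xy[0] - element_xy[0]
    dy = label_xy[1] - element_xy[1]   # screen coordinates: y grows downward
    if abs(dx) >= abs(dy):
        return "left" if dx < 0 else "right"
    return "top" if dy < 0 else "bottom"


def normalized_distance(label_xy, element_xy, form_width, form_height) -> float:
    """Label-element distance divided by the form diagonal (assumed normalization)."""
    d = hypot(label_xy[0] - element_xy[0], label_xy[1] - element_xy[1])
    return d / hypot(form_width, form_height)
```

For example, `similarity("From", "from_city")` is 1.0 because every character of the label appears in order inside the internal name, and `position((10, 10), (60, 10))` reports the label as being to the left of the element.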

Extracting Features

Identifying Mappings
  We need to prune first
  We choose a classifier to prune mappings

Learning Mappings

We choose a classifier for selecting correct mappings
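
The prune-then-select idea can be sketched as a two-stage cascade. The slides do not name the classifiers, so the sketch takes both as plain scoring functions. The thresholds, the feature names, and the rule of keeping one best label per element are all simplifying assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, str, Dict[str, float]]   # (label text, element name, features)
Scorer = Callable[[Dict[str, float]], float]    # returns a confidence in [0, 1]


def select_mappings(candidates: List[Candidate], prune_score: Scorer,
                    select_score: Scorer, prune_below: float = 0.2,
                    accept_at: float = 0.5) -> Dict[str, str]:
    """Return {element: label} after pruning and then selecting candidates."""
    # Stage 1: discard candidates the pruning model considers clearly wrong.
    survivors = [c for c in candidates if prune_score(c[2]) >= prune_below]
    # Stage 2: for each element keep the best-scoring surviving label.
    best: Dict[str, Tuple[str, float]] = {}
    for label, element, feats in survivors:
        score = select_score(feats)
        if score >= accept_at and score > best.get(element, ("", 0.0))[1]:
            best[element] = (label, score)
    return {element: label for element, (label, _) in best.items()}


# Stand-in scorers (hypothetical); trained classifiers would take their place.
def prune(f: Dict[str, float]) -> float:
    return 1.0 - f["distance"]


def select(f: Dict[str, float]) -> float:
    return 0.5 * f["similarity"] + 0.5 * (1.0 - f["distance"])


candidates = [
    ("From", "dep_city", {"similarity": 0.2, "distance": 0.1}),
    ("To", "dep_city", {"similarity": 0.0, "distance": 0.3}),
    ("Save $220", "dep_city", {"similarity": 0.0, "distance": 0.9}),
    ("To", "arr_city", {"similarity": 0.4, "distance": 0.05}),
]
mappings = select_mappings(candidates, prune, select)
```

Pruning first keeps the second stage from seeing the many obviously wrong pairs that candidate generation produces, which is presumably why the slides separate the two steps.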

The Reconciliation Process

A vocabulary is created to reconcile ambiguous mappings
Terms with high frequency might be labels
Ex: “Save $220” and “From”
Two tables: one for single terms and one for multi-term phrases
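
The vocabulary idea can be sketched with two frequency tables built from labels that were already extracted with confidence. The scoring rule below (exact phrase count plus mean word frequency) is an assumption made for illustration, and `build_vocabulary` and `reconcile` are hypothetical names.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def build_vocabulary(known_labels: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count label terms: one table for single words, one for multi-word labels."""
    single, multi = Counter(), Counter()
    for label in known_labels:
        words = label.lower().split()
        single.update(words)
        if len(words) > 1:
            multi[" ".join(words)] += 1
    return single, multi


def reconcile(candidates: List[str], single: Counter,
              multi: Counter) -> Optional[str]:
    """Pick the candidate text whose terms are most frequent in the vocabulary."""
    def score(text: str) -> float:
        words = text.lower().split()
        if not words:
            return 0.0
        phrase = multi[" ".join(words)] if len(words) > 1 else 0
        return phrase + sum(single[w] for w in words) / len(words)

    best = max(candidates, key=score, default=None)
    return best if best is not None and score(best) > 0 else None


# Hypothetical labels already extracted from other forms in the same domain.
known = ["From", "To", "From", "Departure date", "Return date", "From"]
single, multi = build_vocabulary(known)
choice = reconcile(["Save $220", "From"], single, multi)
```

Here "From" wins because it occurs often among known labels, while "Save $220" never does; when no candidate has any vocabulary support, the function returns `None` rather than guessing.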

Experimental Evaluation

Datasets

Results

Best configuration

Results

Domain specific (DSCE)

Results

DSCE vs Generic

Results

Comparison to state of the art: HSP, IEXP

Strengths

Lots of experiments
Good charts
Well explained

Weaknesses

One typo
Their approach is layout dependent

Future Work

Handle N:M mappings
Go beyond the naïve approach
Consider other features for classification

Questions?