The Challenge
We want to retrieve and integrate online databases
Most online databases are accessed through forms
The better we understand the forms, the better we know the databases
The Solution
Introducing … LABELEX
A learning-based approach for automatically parsing and extracting element labels of forms used by humans
Basic Definitions
Forms contain elements and labels
Elements are textboxes, lists, etc.
Labels represent attributes or fields
Elements are associated with labels
An element's domain is the range of values it accepts
Algorithm Description
Generating candidate mappings
Extracting features
Learning to identify mappings
Using prior knowledge to discover new labels
Generating Mapping Candidates
Mappings between labels and elements are generated
We consider only text close to the element
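
A minimal sketch of proximity-based candidate generation. The `Box` type, the pixel coordinates, and the `max_distance` threshold are assumptions made for illustration; the slides do not say how "close" is measured.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """A rendered piece of the form: a text snippet or an input element (hypothetical type)."""
    name: str
    x: float  # center x in pixels
    y: float  # center y in pixels


def candidate_mappings(texts, elements, max_distance=150.0):
    """Pair each element with every text snippet within max_distance pixels.

    max_distance is an assumed cutoff, not a value from the slides.
    """
    candidates = []
    for element in elements:
        for text in texts:
            if hypot(text.x - element.x, text.y - element.y) <= max_distance:
                candidates.append((text.name, element.name))
    return candidates


texts = [Box("From", 40, 100), Box("To", 40, 140), Box("Save $220", 600, 20)]
elements = [Box("origin", 120, 100), Box("dest", 120, 140)]
pairs = candidate_mappings(texts, elements)
```

Here the distant "Save $220" banner never becomes a candidate, while "From" and "To" are both paired with each nearby element and left for the later steps to sort out.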
Extracting Features
Form Elements and Labels
  Elements: Type
  Labels: Font and Size
Label-Element Similarity
  Uses internal name and default value (LCS)
Spatial Feature
  Topological features: top, bottom, left, etc.
  Label-element distance (normalized)
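
The similarity and spatial features above could be computed roughly as follows. This sketch assumes LCS means longest common subsequence, normalizes similarity by the longer string, and normalizes distance by the form diagonal; all three choices, and the function names, are assumptions rather than details taken from the slides.

```python
from math import hypot


def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(label, internal_name):
    """LCS length divided by the longer string's length (assumed normalization)."""
    a, b = label.lower(), internal_name.lower()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def spatial_features(label_xy, element_xy, form_width, form_height):
    """Where the label sits relative to the element, plus a normalized distance.

    Uses screen coordinates, so a smaller y means higher on the page.
    """
    dx = label_xy[0] - element_xy[0]
    dy = label_xy[1] - element_xy[1]
    if abs(dx) >= abs(dy):
        position = "left" if dx < 0 else "right"
    else:
        position = "top" if dy < 0 else "bottom"
    distance = hypot(dx, dy) / hypot(form_width, form_height)
    return {"position": position, "distance": distance}


similarity = lcs_similarity("Departure City", "dep_city")
features = spatial_features((40, 100), (120, 100), 300, 400)
```

The same `lcs_similarity` call can be applied to an element's default value in place of its internal name.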
Identifying Mappings
We need to prune first
We choose a classifier to prune mappings
Learning Mappings
We choose a classifier for selecting correct mappings
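
The prune-then-select flow can be sketched as a two-stage pipeline. The slides do not name the classifiers here, so the two stages are passed in as plain callables, and the threshold rules used in the example are placeholders standing in for trained models. The dictionary keys are also hypothetical.

```python
def identify_mappings(candidates, keep, score):
    """Prune candidates with `keep`, then pick the best-scoring label per element.

    keep:  stand-in for the pruning classifier (candidate -> bool)
    score: stand-in for the selection classifier (candidate -> float)
    """
    best = {}
    for cand in candidates:
        if not keep(cand):
            continue  # stage 1: discard unlikely mappings
        s = score(cand)
        element = cand["element"]
        # stage 2: remember the highest-scoring surviving label for each element
        if element not in best or s > best[element][0]:
            best[element] = (s, cand["label"])
    return {element: label for element, (_, label) in best.items()}


candidates = [
    {"label": "From", "element": "origin", "distance": 0.16, "similarity": 0.0},
    {"label": "To", "element": "origin", "distance": 0.18, "similarity": 0.0},
    {"label": "Save $220", "element": "origin", "distance": 0.90, "similarity": 0.0},
    {"label": "To", "element": "dest", "distance": 0.16, "similarity": 0.0},
    {"label": "From", "element": "dest", "distance": 0.18, "similarity": 0.0},
]

# Placeholder rules, not learned models.
mappings = identify_mappings(
    candidates,
    keep=lambda c: c["distance"] < 0.5,
    score=lambda c: c["similarity"] - c["distance"],
)
```

Keeping the stages as separate callables lets a cheap model discard most candidates before a second model makes the final choice.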
The Reconciliation Process
A vocabulary is created to reconcile ambiguous mappings
Terms with high frequency might be labels
Ex: “Save $220” and “From”
Two tables: one for single terms and one for multi-term labels
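
One way to picture the vocabulary is as two frequency tables built from labels that were already extracted with confidence. The function names and the scoring rule for unseen multi-word text (the frequency of its rarest word) are assumptions made for this sketch.

```python
from collections import Counter


def build_vocabulary(known_labels):
    """Count single terms and whole multi-term labels in two separate tables."""
    single_terms = Counter()
    multi_terms = Counter()
    for label in known_labels:
        words = label.lower().split()
        single_terms.update(words)
        if len(words) > 1:
            multi_terms[" ".join(words)] += 1
    return single_terms, multi_terms


def reconcile(candidates, single_terms, multi_terms):
    """Pick the candidate text that looks most like a label seen before."""
    def frequency(text):
        words = text.lower().split()
        phrase = " ".join(words)
        if len(words) > 1 and phrase in multi_terms:
            return multi_terms[phrase]
        # Assumed rule: unseen text is only as label-like as its rarest word.
        return min((single_terms[w] for w in words), default=0)

    return max(candidates, key=frequency)


single, multi = build_vocabulary(
    ["From", "To", "From", "Departure date", "Return date"]
)
chosen = reconcile(["Save $220", "From"], single, multi)
```

With this vocabulary, "From" wins over "Save $220" because it recurs across forms while the promotional text does not.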
Strengths
Lots of experiments
Good charts
Well explained
Weaknesses
One typo
Their approach is layout dependent
Future Work
Handle N:M mappings
Go beyond the naïve approach
Consider other features for classification