Table Interpretation by Sibling Page Comparison Cui Tao & David W. Embley Data Extraction Group Department of Computer Science Brigham Young University Supported by NSF


Table Interpretation

by Sibling Page Comparison

Cui Tao & David W. Embley

Data Extraction Group Department of Computer Science

Brigham Young University

Supported by NSF

Table Interpretation (in context)

Context: Table Understanding
• Table Recognition
• Table Interpretation
• Table Conceptualization
• Table Understanding

Applications
• Not only “understanding” with respect to community knowledge
• But also creation or augmentation of community knowledge

Challenging Conceptual-Modeling Work

Table Interpretation (in context)

Context: Table Understanding
• Table Recognition
• Table Interpretation with Sibling Pages: TISP
• Table Conceptualization
• Table Understanding

Applications
• Not only “understanding” with respect to community knowledge
• But also creation or augmentation of community knowledge

Challenging Conceptual-Modeling Work

TISP: Table Recognition and Interpretation

• Recognize tables (discard non-tables)
• Locate table labels
• Locate table values
• Find label/value associations

Recognize Tables

[Figure labels: Data Table; Nested Data Tables; Layout Tables (discard)]

Locate Table Labels

Examples:
• Identification.Gene model(s).Protein
• Identification.Gene model(s).Gene Model
• Identification.Gene model(s).2

Locate Table Values

[Figure label: Value]

Find Label/Value Associations

Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918

Conceptual Table Interpretation

Wang Notation [Wang96]: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
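One way to picture a Wang-notation entry in code is as a mapping from a tuple of label paths to a value. The sketch below is only an illustration under that assumption; the names `parse_label_path` and `add_entry` are hypothetical and are not part of TISP.

```python
from typing import Dict, List, Tuple

LabelPath = Tuple[str, ...]            # e.g. ("Identification", "Gene model(s)", "Protein")
Entry = Tuple[LabelPath, ...]          # one label path per table dimension


def parse_label_path(text: str) -> LabelPath:
    """Split a dotted label path into its component labels (assumes labels contain no dots)."""
    return tuple(text.split("."))


def add_entry(table: Dict[Entry, str], label_paths: List[str], value: str) -> None:
    """Record one label/value association under its tuple of label paths."""
    key = tuple(parse_label_path(p) for p in label_paths)
    table[key] = value


interpreted: Dict[Entry, str] = {}
add_entry(
    interpreted,
    ["Identification.Gene model(s).Protein", "Identification.Gene model(s).2"],
    "WP:CE28918",
)
```

Looking up the same pair of label paths returns the value from the slide's example, which is all the notation asserts.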

Table Ontology

Interpretation Technique: Sibling Page Comparison

[Figure labels across the sibling-page comparisons: Same; Almost Same; Different]

Technique Details

• Unnest tables
• Match tables in sibling pages
  • “Perfect” match (table for layout: discard)
  • “Reasonable” match (sibling table)
• Determine/Use Table-Structure Pattern
  • Discover pattern
  • Pattern usage
  • Dynamic pattern adjustment

Table Unnesting

Match Based on DOM Tree

• Simple Tree Matching Algorithm [Yang91]
• Match Score Categorization: Exact/Near-Exact, Sibling-Table, False

[Figure labels: Labels; Values]
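The simple tree matching algorithm counts the largest set of node pairs that can be matched between two ordered trees while preserving parent/child and sibling order. Below is a minimal Python sketch of the standard dynamic-programming formulation. The `Node` class, the normalization by average tree size, and the threshold values in `categorize` are assumptions made for illustration; the slides do not state how TISP normalizes scores or where it draws the category boundaries.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A DOM-like node; `tag` can hold an element name or a text-node marker."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of matched node pairs between ordered trees a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # score[i][j]: best matching between the first i children of a and first j of b
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i][j - 1],
                score[i - 1][j],
                score[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return score[m][n] + 1  # +1 for the matched roots


def match_ratio(a: Node, b: Node) -> float:
    """Matched pairs divided by the average tree size (assumed normalization)."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


def categorize(ratio: float, exact: float = 0.95, sibling: float = 0.5) -> str:
    """Map a ratio to a category; both thresholds are hypothetical."""
    if ratio >= exact:
        return "exact/near-exact"
    if ratio >= sibling:
        return "sibling-table"
    return "false"


def row(cells: int) -> Node:
    return Node("tr", [Node("td") for _ in range(cells)])


table_a = Node("table", [row(2), row(2)])
table_b = Node("table", [row(2), row(1)])
print(simple_tree_matching(table_a, table_b), categorize(match_ratio(table_a, table_b)))
```

In this toy case the two tables differ by one cell, so the ratio falls just below the assumed exact threshold and the pair is categorized as a sibling table.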

Table Structure Patterns

Regularity Expectations:

• (<tr><(td|th)> {L} <(td|th)> {V})^n

• <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+

• …

Pattern combinations are also possible.

Pattern Usage

(Location.Genetic Position) = X:12.69 +/- 0.000 cM [mapping data]
(Location.Genomic Position) = X:13518823..13515773 bp
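Once a structure pattern is known, extraction is mostly mechanical. The following Python sketch applies the two regularity patterns to tables that have already been reduced to rows of cell strings. The function names and the `prefix` parameter are hypothetical, the assumption that "Location" is a parent label of both rows is inferred from the example above, and the Make/Year rows are invented purely for illustration.

```python
from typing import Dict, List

Row = List[str]


def extract_row_pairs(rows: List[Row], prefix: str = "") -> Dict[str, str]:
    """Pattern (<tr> {L} {V})^n: every row holds one label followed by one value."""
    result: Dict[str, str] = {}
    for cells in rows:
        if len(cells) != 2:
            raise ValueError(f"expected a label/value pair, got {len(cells)} cells")
        label, value = cells
        result[f"{prefix}.{label}" if prefix else label] = value
    return result


def extract_header_rows(rows: List[Row]) -> List[Dict[str, str]]:
    """Pattern <tr>({L})^n (<tr>({V})^n)+: one label row, then one or more value rows."""
    if len(rows) < 2:
        raise ValueError("need a label row and at least one value row")
    labels, *value_rows = rows
    records = []
    for cells in value_rows:
        if len(cells) != len(labels):
            raise ValueError("value row width differs from label row width")
        records.append(dict(zip(labels, cells)))
    return records


location = extract_row_pairs(
    [
        ["Genetic Position", "X:12.69 +/- 0.000 cM [mapping data]"],
        ["Genomic Position", "X:13518823..13515773 bp"],
    ],
    prefix="Location",
)

# Hypothetical column-wise table, not taken from the slides.
cars = extract_header_rows([["Make", "Year"], ["Honda", "2004"], ["Ford", "1999"]])
```

Raising on a width mismatch, rather than silently padding, makes pattern variations visible so that they can be handled explicitly.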

Dynamic Pattern Adjustment

<tr>(<(td|th)> {L})^5 (<tr>(<(td|th)> {V})^5)+

<tr>(<(td|th)> {L})^5 (<tr>(<(td|th)> {V})^5)+ | <tr>(<(td|th)> {L})^6 (<tr>(<(td|th)> {V})^6)+
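The adjustment shown here, where a five-column pattern gains a six-column alternative, can be sketched as a pattern object that keeps a set of accepted widths. This is a simplified illustration that covers only width variation; the class name `HeaderRowPattern` and its methods are hypothetical, and the actual TISP adjustments may work differently.

```python
from typing import List, Set

Row = List[str]


class HeaderRowPattern:
    """A label row of n cells followed by one or more value rows of n cells."""

    def __init__(self, width: int) -> None:
        self.widths: Set[int] = {width}

    @staticmethod
    def _is_regular(rows: List[Row]) -> bool:
        """True if there are at least two rows and all rows have the same width."""
        return len(rows) >= 2 and all(len(r) == len(rows[0]) for r in rows)

    def matches(self, rows: List[Row]) -> bool:
        return self._is_regular(rows) and len(rows[0]) in self.widths

    def adjust(self, rows: List[Row]) -> bool:
        """Add the table's width as an alternative if the table is otherwise regular."""
        if not self._is_regular(rows):
            return False
        self.widths.add(len(rows[0]))
        return True

    def __str__(self) -> str:
        # ^w marks the repetition count, one alternative per accepted width
        return " | ".join(
            f"<tr>(<(td|th)> {{L}})^{w} (<tr>(<(td|th)> {{V}})^{w})+"
            for w in sorted(self.widths)
        )


pattern = HeaderRowPattern(5)
five_columns = [["L"] * 5, ["V"] * 5]
six_columns = [["L"] * 6, ["V"] * 6, ["V"] * 6]

if not pattern.matches(six_columns):
    pattern.adjust(six_columns)
print(pattern)
```

After the adjustment both the original and the wider sibling tables match, which mirrors the alternation in the adjusted pattern.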

TISP Evaluation

Applications
• Commercial: car ads
• Scientific: molecular biology
• Geopolitical: US states and countries

Data: > 2,000 tables, 275 sibling tables, 35 web sites

Evaluation
• Initial two sibling pages
  • Correct separation of data tables from layout tables?
  • Correct pattern recognition?
• Remaining tables in site
  • Information properly extracted?
  • Able to detect and adjust for pattern variations?

Experimental Results

Table recognition: correctly discarded 157 of 158 layout tables

Pattern recognition: correctly found 69 of 72 structure patterns

Extraction and adjustments: 5 path adjustments and 34 label adjustments all correct
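For a quick sense of scale, the counts reported above can be turned into percentages. The helper below is plain arithmetic on those counts; the name `rate` is hypothetical, and these per-task rates are not the same quantity as an F-measure.

```python
def rate(correct: int, total: int) -> float:
    """Percentage of correct outcomes among `total` cases."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


layout_discard_rate = rate(157, 158)   # layout tables correctly discarded
pattern_rate = rate(69, 72)            # structure patterns correctly found

print(f"layout tables discarded: {layout_discard_rate:.1f}%")  # 99.4%
print(f"patterns recognized:     {pattern_rate:.1f}%")          # 95.8%
```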

Discovered Difficulties

• Abundance of null entries
• Multiple tables as a single table
  • Recognize and group
  • Use box model [Gatterbauer07]
• Factored labels

Table Understanding

Table Recognition
• Data table vs. table for layout
• Adjust (group table components, defactor labels, …)

Table Interpretation
• Populate table ontology
• Additional table-ontology elements (title, footnotes, …)

Table Conceptualization
• Capture table semantics
• Reverse engineer as a conceptual model

Table Understanding
• Embed within a community ontology
• Alternatively, augment community knowledge

fleck velter    gonsity (ld/gg)    hepth (gd)
burlam          1.2                120
falder          2.3                230
multon          2.5                400

repeat:
  1. recognize table
  2. interpret table
  3. conceptualize table
  4. merge
  5. adjust
until ontology developed

Knowledge Generation

[Diagram labels: velter, hepth, gonosity, fleck; “has” relationships with participation constraints 1 and 1:*]

TANGO (Table Analysis for Generating Ontologies) repeatedly turns raw tables into conceptual mini-ontologies and integrates them into a growing ontology.

Growing Ontology

Conclusions and Future Opportunities

Conclusions
• Table Interpretation: overall F-measure of 94.5%
• Can successfully apply sibling-page technique

Future Opportunities
• Table understanding
• Knowledge generation
• Challenging conceptual-modeling work

www.deg.byu.edu