From Tessellations to Table Interpretation Ramana C. Jandhyala DocLab, RPI

From Tessellations to Table Interpretation

Ramana C. Jandhyala

DocLab, RPI

Introduction

• Novel aspects of our work
– Focus on computer-constructed web tables
– Using commercial software
– Describing tables using XY trees
– Extracting relationship of headers to content cells

• Formalizes the 200-table experiment conducted by Raghav. These tables were imported from 10 websites into Excel and manually edited into a form that can be processed algorithmically.

• Average editing time – 104 sec.
• Average table size – 587 cells.
• Augmentations not considered!

Rectangular Tessellations

• Rectangular Tiling/Discrete Rectangular Tessellation
– Partition of an isothetic rectangle into rectangles
– Geometry uniquely defined by the locations and types of junction points
– Number Nall(m) increases exponentially with table size.

• XY Tessellations
– Special case of rectangular tessellations
– Obtained by successive horizontal and vertical cuts
– Number of XY tilings Nxy(m) decreases rapidly relative to Nall(m) (Klarner-Magliveras), i.e.
lim_{m→∞} Nxy(m) / Nall(m) = 0
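
A small sketch of what "obtained by successive horizontal and vertical cuts" means operationally. It assumes a tiling is given as a list of `(x0, y0, x1, y1)` rectangles with no gaps or overlaps; the name `is_xy` and this representation are illustrative assumptions, not part of the original work.

```python
def is_xy(rects):
    """Return True if the tiling can be produced by successive full cuts.

    Each rectangle is (x0, y0, x1, y1). The input is assumed to be a valid
    tiling of its bounding box (no gaps, no overlaps).
    """
    if len(rects) <= 1:
        return True
    for axis in (0, 1):  # 0: try vertical cuts (at x), 1: horizontal cuts (at y)
        lowest = min(r[axis] for r in rects)
        for cut in sorted({r[axis] for r in rects} - {lowest}):
            # A cut is only valid if no rectangle straddles the cut line.
            if any(r[axis] < cut < r[axis + 2] for r in rects):
                continue
            first = [r for r in rects if r[axis + 2] <= cut]
            second = [r for r in rects if r[axis] >= cut]
            if is_xy(first) and is_xy(second):
                return True
    return False


# A table-like layout: a full-width header above a stub column and three cells.
table_like = [(0, 0, 3, 1), (0, 1, 1, 3), (1, 1, 3, 2), (1, 2, 2, 3), (2, 2, 3, 3)]

# The "pinwheel": four rectangles spiralling around a centre square.
pinwheel = [(0, 0, 2, 1), (2, 0, 3, 2), (1, 2, 3, 3), (0, 1, 1, 3), (1, 1, 2, 2)]

print(is_xy(table_like))  # True
print(is_xy(pinwheel))    # False
```

The pinwheel is the smallest rectangular tessellation with no through-cut in either direction, which is why it is not an XY tessellation.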

Taxonomy of web tables

• All tables have a stub, row headings, column headings and data cells.

• Some common layouts – admissible tessellations

Taxonomy of web tables (contd.)

• Human-understandable tables – their number NT,S,xy(m) is unknown and cannot be defined mathematically

• Convert them to a smaller set of admissible tables – NA,S,xy(m)

• Layout-equivalent tables are enough for algorithmic analysis.

Taxonomy of web tables (contd.)

• Number of different layout-equivalent admissible candidates - NL,S,xy(m)

• For now, NL,S,xy(m) < NA,S,xy(m)

• Context-free grammars – characterize entire families of layout-equivalent tables

Logical Structure of Tables

• XY trees only capture physical layout
• To understand a table – need to analyse logical structure, i.e. relationship between header cells and content cells [Wang].

• Wang notation – consists of category trees (headings) and delta cells (content).
– Number of category trees – dimensionality of the table
– Cartesian product of category trees leads to delta cells.
– Size of table – product of number of rows and columns of delta cells

Logical Structure of Tables (contd.)

• Well-formed tables – Labeled table candidates for which Wang Notation exists

• Most tables not well-formed, but easily convertible into well-formed format using virtual headers.

• Analyzing logical structure not sufficient for table understanding!

• Our project – front end for creating narrow-domain ontologies by combining information from web tables

• Our work is based on the following inequalities:
NL,S,xy(m) < NA,S,xy(m) < NT,S,xy(m) << NS,xy(m) << Nxy(m) << Nall(m)

• Examples of each class shown in next slide.

Tessellations to XY trees

• Horizontally and vertically ordered lists of junction points – not sufficient for reconstructing XY tree!

• Such lists do not capture the adjacency topology.

• Need coordinates and junction types (NE-corner, T-junction, crossing etc.)

Table to XY tree – EX2XY

• Applicable to any tessellation for which an XY tree exists.
• Input – Excel Table
• Output – XY tree (parenthesized notation)
• Algorithm:

– CutV(R) – cuts a rectangle R vertically and returns leftmost sub-rectangle.

– CutH(R) – cuts R horizontally and returns topmost sub-rectangle.

– Both used in a pair of procedures P1 and P2, which call each other recursively.

– P1 cuts the given rectangle vertically and submits the first sub-rectangle to P2 for horizontal cuts. P2 does the same with horizontal cuts, submitting its first sub-rectangle to P1.

– Main procedure calls P1 for vertical cuts, and P2 for horizontal cuts.
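
The mutual recursion of P1 and P2 can be sketched as follows. This is not the EX2XY implementation: it assumes the Excel table has already been read into a grid of unique cell ids in which a merged cell repeats its id over every position it spans, and it finds all valid cuts of a rectangle at once instead of peeling off one sub-rectangle per call. The bracket convention (`[ ]` around side-by-side parts, `{ }` around stacked parts) is inferred from the sample output on a later slide and should be treated as an assumption, as should all function names.

```python
def ex2xy(grid):
    """Return the XY tree of a rectangular grid in parenthesized notation.

    grid[r][c] holds a unique id per cell; a merged cell repeats its id
    over every position it spans.
    """

    def is_leaf(r0, r1, c0, c1):
        first = grid[r0][c0]
        return all(grid[r][c] == first
                   for r in range(r0, r1) for c in range(c0, c1))

    def v_cuts(r0, r1, c0, c1):
        # A column boundary is a valid cut if no merged cell crosses it.
        return [c for c in range(c0 + 1, c1)
                if all(grid[r][c - 1] != grid[r][c] for r in range(r0, r1))]

    def h_cuts(r0, r1, c0, c1):
        # A row boundary is a valid cut if no merged cell crosses it.
        return [r for r in range(r0 + 1, r1)
                if all(grid[r - 1][c] != grid[r][c] for c in range(c0, c1))]

    def p1(r0, r1, c0, c1, stuck=False):
        """Cut vertically and hand each sub-rectangle to p2."""
        if is_leaf(r0, r1, c0, c1):
            return str(grid[r0][c0])
        cuts = v_cuts(r0, r1, c0, c1)
        if not cuts:
            if stuck:  # p2 already failed on this same rectangle
                raise ValueError("no XY tree exists for this tessellation")
            return p2(r0, r1, c0, c1, stuck=True)
        xs = [c0, *cuts, c1]
        return "[" + " ".join(p2(r0, r1, a, b) for a, b in zip(xs, xs[1:])) + "]"

    def p2(r0, r1, c0, c1, stuck=False):
        """Cut horizontally and hand each sub-rectangle to p1."""
        if is_leaf(r0, r1, c0, c1):
            return str(grid[r0][c0])
        cuts = h_cuts(r0, r1, c0, c1)
        if not cuts:
            if stuck:  # p1 already failed on this same rectangle
                raise ValueError("no XY tree exists for this tessellation")
            return p1(r0, r1, c0, c1, stuck=True)
        ys = [r0, *cuts, r1]
        return "{" + " ".join(p1(a, b, c0, c1) for a, b in zip(ys, ys[1:])) + "}"

    return p1(0, len(grid), 0, len(grid[0]))


# Hypothetical 4x3 table: a 2-row stub, a spanning "Year" header, two data rows.
table = [
    ["stub",   "Year", "Year"],
    ["stub",   "2004", "2005"],
    ["Canada", "d11",  "d12"],
    ["Quebec", "d21",  "d22"],
]
print(ex2xy(table))
# [{stub Canada Quebec} {Year [2004 2005] [d11 d12] [d21 d22]}]
```

A tessellation with no through-cut in either direction, such as a pinwheel, raises `ValueError`, which matches the restriction to tessellations for which an XY tree exists.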

Example – Original HTML table

Example (contd.) – After import into Excel

Example – After Editing

A snippet of the output (both parenthetical and XML outputs)

Parenthetical version of the output

([{::15,2:15,2::16,2:16,2Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars)::17,2:30,2}{::15,3:15,3::16,3:16,3Canada::17,3:17,3Newfoundland and Labrador::18,3:18,3Prince Edward Island::19,3:19,3Nova Scotia::20,3:20,3New Brunswick::21,3:21,3Quebec::22,3:22,3Ontario::23,3:23,3Manitoba::24,3:24,3Saskatchewan::25,3:25,3Alberta::26,3:26,3British Columbia::27,3:27,3Yukon::28,3:28,3Northwest Territories::29,3:29,3Nunavut::30,3:30,3}{Year::15,4:15,8[2004::16,4:16,42005::16,5:16,52006::16,6:16,62007::16,7:16,72008::16,8:16,8]...

XML version of the output

...
<block id='1.1.2.1' range='17,2:30,2'><content>Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars)</content></block>
<block id='1.1.2.2' range='17,3:30,3'><content></content></block>
<block id='1.2.2.1' range='16,4:16,4'><content>2004</content></block>
<block id='1.2.2.2' range='16,5:16,5'><content>2005</content></block>
<block id='1.2.2.3' range='16,6:16,6'><content>2006</content></block>
<block id='1.2.2.4' range='16,7:16,7'><content>2007</content></block>
...

Grammar for tables

• The grammar uses nested parenthetical notation (P-notation).

• P-notation has 1:1 correspondence with general trees.

• For the table above, the XY tree sentence (neglecting the textual labels) is:
Sxy = {c [c c] c [c {c [c c]} c {c [c c]}]}
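
To illustrate the 1:1 correspondence between P-notation and general trees, here is a minimal sketch that parses a P-notation sentence into a nested tree and serializes it back. The tree representation (`(bracket, children)` tuples with string leaves) and the names `parse_p` and `to_p` are assumptions made for this example.

```python
import re

OPEN = {"{": "}", "[": "]", "(": ")"}


def parse_p(text):
    """Parse P-notation into a list of nodes.

    A leaf is a string; an internal node is (opening_bracket, children).
    """
    tokens = re.findall(r"[{}\[\]()]|[^\s{}\[\]()]+", text)
    stack = [("", [])]  # sentinel root collecting top-level nodes
    for tok in tokens:
        if tok in OPEN:
            stack.append((tok, []))
        elif tok in OPEN.values():
            kind, children = stack.pop()
            if OPEN.get(kind) != tok:
                raise ValueError(f"mismatched bracket {tok!r}")
            stack[-1][1].append((kind, children))
        else:
            stack[-1][1].append(tok)
    if len(stack) != 1:
        raise ValueError("unclosed bracket")
    return stack[0][1]


def to_p(nodes):
    """Serialize a list of nodes back to P-notation."""
    return " ".join(
        n if isinstance(n, str) else f"{n[0]}{to_p(n[1])}{OPEN[n[0]]}"
        for n in nodes
    )


sxy = "{c [c c] c [c {c [c c]} c {c [c c]}]}"
tree = parse_p(sxy)
print(to_p(tree) == sxy)  # True: the round trip reproduces the sentence
```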

Grammar

• Grammar for parsing the column headers of all such layout-equivalent tessellations:
– S := A (Rule 1)
– A := {B} (Rule 2)
– B := c [X] B | c [X] (Rules 3 and 4)
– X := c X | A X | A | c (Rules 5, 6, 7 and 8)

• where
– S – start symbol
– A – nonterminal that generates all admissible strings for column headers
– B – generates >=1 instances of categories in the form c [X]
– Each c becomes a root category and X generates its subcategory tree
– X generates strings of size >=1 with arbitrary occurrences of c and A.

• The derivation for the previous example using an LALR parser is shown on the next slide
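
The grammar is small enough to check by hand. The sketch below is a hand-written recursive-descent recognizer for Rules 1 to 8, not the LALR parser used in the original work; the function name `accepts` is an assumption. It only answers whether a sentence belongs to the language and does not build a derivation.

```python
class ParseError(Exception):
    pass


def accepts(sentence):
    """Return True if the sentence is generated by S := A (Rules 1 to 8)."""
    toks = [ch for ch in sentence if not ch.isspace()]
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def expect(symbol):
        nonlocal pos
        if peek() != symbol:
            raise ParseError(f"expected {symbol!r} at token {pos}")
        pos += 1

    def parse_a():            # A := {B}
        expect("{")
        parse_b()
        expect("}")

    def parse_b():            # B := c [X] B | c [X]
        expect("c")
        expect("[")
        parse_x()
        expect("]")
        if peek() == "c":     # another category follows
            parse_b()

    def parse_x():            # X := c X | A X | A | c  (one or more of c, A)
        count = 0
        while peek() in ("c", "{"):
            if peek() == "c":
                expect("c")
            else:
                parse_a()
            count += 1
        if count == 0:
            raise ParseError(f"empty subcategory list at token {pos}")

    try:
        parse_a()             # S := A
        if pos != len(toks):
            raise ParseError("trailing input")
        return True
    except ParseError:
        return False


print(accepts("{c [c c] c [c {c [c c]} c {c [c c]}]}"))  # True
print(accepts("{c [c] c}"))                               # False
```

The second sentence is rejected because Rules 3 and 4 require every root category c to be followed by a bracketed subcategory list.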

• Example demonstrates both power and limitation of grammars.

• A grammar can recognize broad classes.

• But grammars cannot check that the headings are proper labels for a well-formed table

• Even if a table is accepted by the grammar, additional geometric alignment and lexical checks are needed to verify that a Wang notation exists.

XY tree to Wang Notation

• XY2WANG converts an XY tree generated from a restricted family of admissible tables to Wang Notation.

• Example:

• Uses an indented table-of-contents format as a data structure.

XY2WANG

• Input – XY trees with arbitrary number of categories and arbitrary nesting.

• Output – XML version of Wang Notation
• For a table T = (C, d),
– Category Notation: C = { (A, {(A1, phi), (A2, phi)}), (B, {(B1, phi), (B2, phi), (B3, phi)}) }
– Delta mappings:
δ({A.A1, B.B1}) = d11
δ({A.A1, B.B2}) = d12
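
The category notation and delta mappings above can be mirrored in a few lines. This sketch assumes nested dictionaries for the category trees (an empty dict standing in for phi) and a hypothetical 2x3 grid of content cells; only d11 and d12 appear on the slide, and the remaining cell names are made up for illustration.

```python
from itertools import product


def leaf_paths(name, subtree):
    """List dotted paths from a category root to each of its leaves."""
    if not subtree:
        return [name]
    return [f"{name}.{path}"
            for child, grandchildren in subtree.items()
            for path in leaf_paths(child, grandchildren)]


# Category trees from the slide; an empty dict plays the role of phi.
categories = {
    "A": {"A1": {}, "A2": {}},
    "B": {"B1": {}, "B2": {}, "B3": {}},
}

# Hypothetical content cells: rows follow category A, columns follow category B.
content = [["d11", "d12", "d13"],
           ["d21", "d22", "d23"]]

paths = [leaf_paths(name, sub) for name, sub in categories.items()]
keys = list(product(*paths))  # one key per delta cell

# Each key is an unordered set of one path per category, as in δ({A.A1, B.B2}).
delta = {frozenset((a, b)): content[i][j]
         for i, a in enumerate(paths[0])
         for j, b in enumerate(paths[1])}

print(len(paths))                           # 2 (Wang dimensionality)
print(len(keys))                            # 6 (number of delta cells)
print(delta[frozenset({"A.A1", "B.B2"})])   # d12
```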

XY2WANG: Algorithm

• Algorithm:
– First locate the 4 principal regions – stub, row headers, column headers and content cells.
– Extract Wang labeled domains under the assumption that each spanning cell is the header of the smaller cells either to its right (row headers) or below it (column headers).
– Compute the Cartesian product of category paths and match each key to the content of a delta cell.
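
The second step, for column headers, can be sketched as follows. It is not the XY2WANG code: it assumes the column-header region has already been isolated as a list of rows in which a spanning cell repeats its label over every column it covers, and the name `header_paths` is hypothetical.

```python
def header_paths(header_rows):
    """Build one dotted category path per column from stacked header rows."""
    paths = []
    for column in zip(*header_rows):
        path = []
        for label in column:
            # Skip blanks and the repeats produced by a vertically merged cell.
            if label and (not path or path[-1] != label):
                path.append(label)
        paths.append(".".join(path))
    return paths


# Column-header region of the earlier GDP example: "Year" spans the year columns.
header_rows = [
    ["Year", "Year", "Year"],
    ["2004", "2005", "2006"],
]
print(header_paths(header_rows))  # ['Year.2004', 'Year.2005', 'Year.2006']
```

Row headers could be handled the same way after transposing the row-header region, under the matching assumption that a spanning cell heads the cells to its right.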

XY2WANG: Table-of-contents data structure

• Example of a table and its corresponding table-of-contents data structure is shown

• XY2WANG also handles more complex scenarios like:
– Higher Wang dimensionality
– Deeper nesting of headers
– Repetitive headers
– Detection of not well-formed tables

• These are included in the following pseudocode

Conclusion

• Hierarchical structure of categories and flat structure of data cells is recovered from XY trees.

• Geometric and topological equivalence classes on tessellations and their XY trees are defined.

• Commonly encountered tables are examples of such classes.

• These tables are identified by parsing XY trees with a grammar.

• Assuming the header labels are consistent, Wang category notation is extracted.

Future work

• Account for aggregates – major component of web tables.

• Need to integrate other augmentations (footnotes, units, captions etc.)

• Expand on the grammar: current version accounts only for column headers.

• Automate the conversion from imported web tables to standard formats.

• Semantic interpretation of groups of conceptually overlapping tables based on precise representation of layout-invariant syntax.

Current Work

• Converting web tables to standard formats for ease of processing.
– Internal conventions: A’, A’’, hybrids

• Learning from XY trees using tree edit distance
– Learning from existing manipulations.
– Ex: The user modifies table T1 to a standard format T1’. The steps are all recorded. Now use this information to predict the standard format of a new table T2.

Current work (contd.)

• Relation of tree-edit distance to pre-order and post-order string edit distance
– Some interesting results and conjectures, but still half-baked!
– (Result) Pre- and post-order traversals are enough for reconstructing a general tree.
– (Conjecture) For 2 XY trees, the distances between corresponding pre- and post-order strings are equal, but not for general trees!
– (Conjecture) For 2 XY trees, tree-edit distance is equal to the pre/post-order distances
– Are tables with same content, but different layouts, collinear (in terms of string/tree edit distance)?

• Developing software to calculate tree edit distances; this should clarify many things. (Any suggestions?)
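
As a starting point for experimenting with the string side of these conjectures, the sketch below computes pre-order and post-order label sequences and their Levenshtein distance. It does not compute tree edit distance and it neither confirms nor refutes the conjectures; the tree representation `(label, children)` and the two small general trees are assumptions chosen for illustration.

```python
def preorder(tree):
    """Labels in pre-order: node first, then children left to right."""
    label, children = tree
    return [label] + [x for child in children for x in preorder(child)]


def postorder(tree):
    """Labels in post-order: children left to right, then the node."""
    label, children = tree
    return [x for child in children for x in postorder(child)] + [label]


def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


# Two small general trees that differ in where node "c" is attached.
t1 = ("r", [("a", []), ("b", [("c", [])])])
t2 = ("r", [("a", [("c", [])]), ("b", [])])

print(preorder(t1), preorder(t2))
print(postorder(t1), postorder(t2))
print(edit_distance(preorder(t1), preorder(t2)))
print(edit_distance(postorder(t1), postorder(t2)))
```

Feeding in label sequences derived from pairs of XY trees would be one way to probe the conjectures empirically before a tree edit distance implementation is available.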