TAIPAN: Automatic Property Mapping for Tabular Data

Ivan Ermilov and Axel-Cyrille Ngonga Ngomo

November 22nd, 2016


Web Scale Data Mining from Web Tables

Web Data Commons

Dresden Table Dataset

Other tables

The Web

TAIPAN

● Structured
● Schemaless
● Not using standards*

● SPARQL
● RDFS
● OWL


TAIPAN Approach Overview

Step 1: Identify Subject Column

Step 2: Atomize a Table

Step 3: Identify Property for Each Table

Step 4: Return Mappings
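The atomization step can be illustrated with a small sketch. It assumes that atomizing a table means pairing the subject column with every other column, which yields one two-column table per remaining column. The slide does not define the operation, and the `atomize` function, the `Table` alias, and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Table = Dict[str, List[str]]          # column header -> cell values
AtomicTable = Tuple[str, List[Tuple[str, str]]]  # (header, [(subject, value), ...])


def atomize(table: Table, subject: str) -> List[AtomicTable]:
    """Pair the subject column with each other column (assumed definition)."""
    if subject not in table:
        raise KeyError(f"unknown subject column: {subject}")
    subject_cells = table[subject]
    atomic_tables = []
    for header, cells in table.items():
        if header == subject:
            continue  # the subject column is the left side of every pair
        atomic_tables.append((header, list(zip(subject_cells, cells))))
    return atomic_tables


# Hypothetical sample table with "City" as the subject column.
table = {
    "City": ["Leipzig", "Dresden"],
    "Country": ["Germany", "Germany"],
    "Population": ["560000", "550000"],
}
atomic_tables = atomize(table, "City")
```

Each atomic table can then be handed to the property identification step on its own, so one property is looked up per non-subject column.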


TAIPAN Approach Overview (example)

[Figure: worked example with numbered steps]

The Core of TAIPAN

Subject Column Identification

● Unsupervised ML
● Structural features
● Semantic features
  ○ Support of a column
  ○ Connectivity
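The two semantic features can be sketched as follows. The slide does not define them, so this assumes that support is the share of a column's cells that match a knowledge base entity, and that connectivity is the share of same-row cell pairs linked by some knowledge base property. The toy triples, the function names, and the final `max` selection (a stand-in for the actual model) are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy knowledge base of (subject, property, object) triples.
TRIPLES: Set[Tuple[str, str, str]] = {
    ("Leipzig", "country", "Germany"),
    ("Dresden", "country", "Germany"),
    ("Leipzig", "population", "560000"),
}
ENTITIES: Set[str] = {s for s, _, _ in TRIPLES}
LINKED: Set[Tuple[str, str]] = {(s, o) for s, _, o in TRIPLES}


def support(column: List[str]) -> float:
    """Share of cells that match a knowledge base entity (assumed definition)."""
    if not column:
        return 0.0
    return sum(cell in ENTITIES for cell in column) / len(column)


def connectivity(column: List[str], others: List[List[str]]) -> float:
    """Share of same-row cell pairs linked by some property (assumed definition)."""
    pairs = [(cell, other[i]) for other in others for i, cell in enumerate(column)]
    if not pairs:
        return 0.0
    return sum(pair in LINKED for pair in pairs) / len(pairs)


def semantic_features(table: Dict[str, List[str]]) -> Dict[str, Tuple[float, float]]:
    """Compute (support, connectivity) for every column of the table."""
    features = {}
    for header, cells in table.items():
        others = [c for h, c in table.items() if h != header]
        features[header] = (support(cells), connectivity(cells, others))
    return features


table = {
    "City": ["Leipzig", "Dresden"],
    "Country": ["Germany", "Germany"],
    "Population": ["560000", "550000"],
}
features = semantic_features(table)
# Stand-in for the actual model: pick the column with the highest feature pair.
subject_column = max(features, key=features.get)
```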

Property Mapping

● Retrieve seed entities
● Rank entities
● Return top entity
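A minimal sketch of the retrieve, rank, and return-top flow for one two-column table follows. The slide does not say how candidates are scored, so this assumes a simple count: subject cells are looked up in a knowledge base, each property whose object matches the value cell in the same row gets one vote, and the property with the most votes wins. The toy triples and the `map_property` function are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical toy knowledge base of (subject, property, object) triples.
TRIPLES: List[Tuple[str, str, str]] = [
    ("Leipzig", "country", "Germany"),
    ("Dresden", "country", "Germany"),
    ("Leipzig", "population", "560000"),
]


def map_property(rows: List[Tuple[str, str]]) -> Optional[str]:
    """Return the property linking the most (subject, value) rows, if any."""
    votes: Counter = Counter()
    for subject, value in rows:
        # Retrieve: triples whose subject is the seed entity of this row.
        for s, prop, obj in TRIPLES:
            if s == subject and obj == value:
                votes[prop] += 1  # Rank: one vote per matching row
    if not votes:
        return None  # no candidate property matched any row
    return votes.most_common(1)[0][0]  # Return the top-ranked candidate


country_rows = [("Leipzig", "Germany"), ("Dresden", "Germany")]
top_property = map_property(country_rows)
```

Returning `None` when nothing matches keeps unmapped columns explicit instead of forcing a guess.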


Experimental setup

For T2K: 128 GB RAM, 4 cores, Ubuntu 14.04

For TAIPAN: 16 GB RAM, 4 cores, Ubuntu 14.04

Dataset 1: curated T2D gold standard (T2D)

Dataset 2: DBpedia table dataset (DBD)


Subject Column Identification Experiments

Rule-based approach achieves only 51.72% accuracy

Using support and connectivity increases precision

Observations

Can be further improved using ML techniques


Property Mapping Experiments

TAIPAN achieves better recall, but lower precision than T2K

On the DBD dataset T2K could match only 1 property

Observations

Overall TAIPAN performs better than the state of the art
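For reference, precision and recall over property mappings can be computed as in the sketch below. It assumes each mapping is a (table, column, property) triple compared by exact match against a gold standard. The `precision_recall` function and the sample mappings are hypothetical and are not results from the experiments.

```python
from typing import Set, Tuple

Mapping = Tuple[str, str, str]  # (table id, column header, property)


def precision_recall(predicted: Set[Mapping], gold: Set[Mapping]) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted mappings against gold."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard and system output, for illustration only.
gold = {
    ("t1", "Country", "country"),
    ("t1", "Population", "population"),
    ("t2", "Author", "author"),
    ("t2", "Year", "releaseDate"),
}
predicted = {
    ("t1", "Country", "country"),
    ("t1", "Population", "areaTotal"),  # wrong property
    ("t2", "Author", "author"),
}
precision, recall = precision_recall(predicted, gold)
```

A system that proposes more mappings tends to gain recall at the cost of precision, which is the trade-off the observations describe.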


Conclusions & Future Work

Curated T2D & DBD datasets

Novel TAIPAN approach

Open Table Extraction

Table Extraction Benchmark (HOBBIT)

Integration of TAIPAN into GEISER project

Thank you! Follow us on Twitter :)

Ivan Ermilov <[email protected]>

@hobbit_project

