TANGO: Table ANalysis for Generating Ontologies
Yuri A. Tijerino*, David W. Embley*, Deryle W. Lonsdale* and George Nagy**
* Brigham Young University
** Rensselaer Polytechnic Institute
List of contents
  Motivation
  Applications
  Table understanding
  Concept matching
  Ontology merging/growing
  Example
  Future direction
Motivation
  Semi-automated ontological engineering through Table Analysis for Generating Ontologies (TANGO)
  Keyword or link-analysis search is not enough to search for information in tables
  Structure in tables can lead to domain knowledge, which includes concepts, relationships and constraints (ontologies)
  Tables on the web created for human use can lead to robust domain ontologies
TANGO Applications
  Extraction ontologies (generation)
  Data integration
  Semantic web
  Multiple-source query processing
  Document image analysis for documents that contain tables
Table understanding
  What is a table?
  Why table normalization?
  What is table understanding?
  What is mini-ontology generation?
Table understanding: What is a table?
  “…a two-dimensional assembly of cells used to present information…” (Lopresti and Nagy)
  Normalized tables (row-column format)
  Small paper (using OCR) and/or electronic tables (marked up) intended for human use
Table understanding: What is table normalization?
  Table normalization takes any table and produces a standard row-column table in which every data cell contains an expanded value and type information
Country | GDP/PPP | GDP/PPP Per Capita | Real-Growth Rate | Inflation
Afghanistan | $21,000,000,000 | $800 | ? | ?
Albania | $13,200,000,000 | $3,800 | 7.3% | 3.0%
Algeria | $177,000,000,000 | $5,600 | 3.8% | 3.0%
Andorra | $1,300,000,000 | $19,000 | 3.8% | 4.3%
Angola | $13,300,000,000 | $1,330 | 5.4% | 110.0%
Antigua and Barbuda | $674,000,000 | $10,000 | 3.5% | 0.4%
… | … | … | … | …
[Figure labels: Raw table, Normalized table]
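As a minimal sketch of what "expanded values and type information" could look like for one row of the table above, the Python below converts each raw cell into a (value, type) pair. The function name `normalize_cell` and the type labels are illustrative assumptions, not part of TANGO itself.

```python
def normalize_cell(raw):
    """Return (expanded value, type label) for one raw table cell.

    Hypothetical helper: the type labels are assumptions chosen to
    mirror the kinds of values that appear in the GDP table.
    """
    text = raw.strip()
    if text in {"", "?"}:
        return None, "unknown"              # "?" marks a missing value
    if text.startswith("$"):
        return float(text[1:].replace(",", "")), "dollar amount"
    if text.endswith("%"):
        return float(text[:-1]), "percentage"
    return text, "string"


row = ["Albania", "$13,200,000,000", "$3,800", "7.3%", "3.0%"]
normalized = [normalize_cell(cell) for cell in row]
for raw, (value, kind) in zip(row, normalized):
    print(f"{raw!r:>20} -> {value!r} ({kind})")
```

Keeping the type label next to the value lets later steps compare columns by kind as well as by content.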
Table understanding: What is table normalization?
?? | Population | Population Growth rate | Population Density | Birth Rate | Death Rate | Migration Rate | Life Expectancy Male | Life Expectancy Female | Infant Mortality
Afghanistan | 25,824,882 | 3.95% | 39.88 persons/km2 | 4.19% | 1.70% | 1.46% | 47.82 years | 46.82 years | 14.06%
Albania | 3,364,571 | 1.05% | 122.79 persons/km2 | 2.07% | 0.74% | -0.29% | 65.92 years | 72.33 years | 4.29%
Algeria | 31,133,486 | 2.10% | 13.07 persons/km2 | 2.70% | 0.55% | -0.05% | 68.07 years | 70.46 years | 4.38%
American Samoa | 63,786 | 2.64% | 320.53 persons/km2 | 2.65% | 0.40% | 0.39% | 71.23 years | 79.95 years | 1.02%
Andorra | 65,939 | 2.24% | 146.53 persons/km2 | 1.03% | 0.55% | 1.76% | 80.55 years | 86.55 years | 0.41%
Angola | 11,510 | 2.84% | 8.97 persons/km2 | 4.31% | 1.64% | 0.16% | 46.08 years | 50.82 years | 12.92%
… | … | … | … | … | … | … | … | … | …
Western Sahara | 239,333 | 2.34% | 0.90 persons/km2 | 4.54% | 1.66% | -0.54% | 47.98 years | 50.57 years | 13.67%
World | 5,995,544,836 | 1.30% | 14.42 persons/km2 | 2.20% | 0.90% | ? | 61.00 years | 65.00 years | 5.60%
Yemen | 16,942,230 | 3.34% | 32.09 persons/km2 | 4.33% | 0.99% | 0.00% | 58.17 years | 61.88 years | 6.98%
Zambia | 9,663,535 | 2.12% | 13.05 persons/km2 | 4.45% | 2.26% | 0.08% | 36.72 years | 37.21 years | 9.19%
Zimbabwe | 11,163,160 | 1.02% | 28.87 persons/km2 | 3.06% | 2.04% | ? | 38.77 years | 38.94 years | 6.12%
Table understanding: Information useful for normalization
  Captions: in vicinity of table (above, below, etc.)
  Footnotes: on annotated column labels or data cells
  Embedded information: in rows, columns or cells (e.g., $, %, (1,000), billions, etc.)
  Links to other views of the table, possibly with new information
What is table understanding?
  Normalize table
  Take a table as input and produce standard records in the form of attribute-value pairs as output
  Discover constraints among columns
  Understand the data values
Country | GDP/PPP | GDP/PPP Per Capita | Real-Growth Rate | Inflation
Afghanistan | $21,000,000,000 | $800 | ? | ?
Albania | $13,200,000,000 | $3,800 | 7.3% | 3.0%
Algeria | $177,000,000,000 | $5,600 | 3.8% | 3.0%
Andorra | $1,300,000,000 | $19,000 | 3.8% | 4.3%
Angola | $13,300,000,000 | $1,330 | 5.4% | 110.0%
Antigua and Barbuda | $674,000,000 | $10,000 | 3.5% | 0.4%
… | … | … | … | …
Annotations on the table:
  Country: left-most column, primary key; country names (from data frame)
  GDP/PPP, GDP/PPP Per Capita: dollar amount (from data frame)
  Real-Growth Rate, Inflation: percentage (from data frame)
Relationships:
  {has(Country, GDP/PPP), has(Country, GDP/PPP Per Capita), has(Country, Real-growth rate*), has(Country, Inflation*)}
Output record:
  {<Country: Afghanistan>, <GDP/PPP: $21,000,000,000>, <GDP/PPP per capita: $800>, <Real-growth rate: ?>, <Inflation: ?>}
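The step from a normalized table to attribute-value records and `has(...)` relationships can be sketched in a few lines of Python. This is an illustration under stated assumptions: the helper names are hypothetical, the left-most column is taken as the key (as the slide annotates), and the trailing `*` is read here as "some values are unknown", which is an interpretation of the slide's notation and not something the slides spell out.

```python
def to_records(header, rows):
    """Turn each table row into a dict of attribute-value pairs."""
    return [dict(zip(header, row)) for row in rows]


def has_relationships(header, rows, unknown="?"):
    """List has(key, attribute) relationships for a table.

    Assumption: the first column is the key, and an attribute gets a
    trailing '*' when at least one of its values is unknown.
    """
    key = header[0]
    relationships = []
    for index, attribute in enumerate(header[1:], start=1):
        optional = any(row[index] == unknown for row in rows)
        relationships.append(f"has({key}, {attribute}{'*' if optional else ''})")
    return relationships


header = ["Country", "GDP/PPP", "GDP/PPP Per Capita", "Real-Growth Rate", "Inflation"]
rows = [
    ["Afghanistan", "$21,000,000,000", "$800", "?", "?"],
    ["Albania", "$13,200,000,000", "$3,800", "7.3%", "3.0%"],
]

records = to_records(header, rows)
relationships = has_relationships(header, rows)
print(records[0])
print(relationships)
```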
Example: Creating a domain ontology
[Ontology diagram. Node labels: Geopolitical Entity, Name, Country, City, Location, Longitude, Latitude, Time. Relationship labels: "has names", "has GMT", "Latitude and longitude designates location". Callouts: "Has associated data frames"; "Includes procedural knowledge"; "Distances"; "Duration between time zones".]
Example: Table understanding to mini-ontology generation
Agglomeration | Population | Continent | Country
Tokyo | 31,139,900 | Asia | Japan
New York-Philadelphia | 30,286,900 | The Americas | United States of America
Mexico | 21,233,900 | The Americas | Mexico
Seoul | 19,969,100 | Asia | Korea (South)
Sao Paulo | 18,847,400 | The Americas | Brazil
Jakarta | 17,891,000 | Asia | Indonesia
Osaka-Kobe-Kyoto | 17,621,500 | Asia | Japan
… | … | … | …
Niigata | 503,500 | Asia | Japan
Raurkela | 503,300 | Asia | India
Homjel | 502,200 | Europe | Belarus
Zunyi | 501,900 | Asia | China
Santiago | 501,800 | The Americas | Dominican Republic
Pingdingshan | 501,500 | Asia | China
Fargona | 501,000 | Asia | Uzbekistan
Kirov | 500,200 | Europe | Russia
Newcastle | 500,000 | Australia/Oceania | Australia
[Mini-ontology diagram. Node labels: Agglomeration, Population, Country, Continent.]
Example: Concept matching to ontology merging
[Diagram: the mini-ontology (Agglomeration, Population, Country, Continent) and the domain ontology (Geopolitical Entity, Name, Country, City, Location, Longitude, Latitude, Time; "has names", "has GMT", "Latitude and longitude designates location") are joined under the label "Merge". The part labeled "Results" shows Geopolitical Entity, Name, Location, Longitude, Latitude, Time, Continent, Population, City, Agglomeration and Country.]
Concept matching
  We use exhaustive concept matching techniques to match concepts from different mini-ontologies, including:
    Lexical and Natural Language Processing
    Value Similarity
    Value Features
    Data Frame Comparison
    Constraints
Concept Matching (Lexical & NLP)
  Lexical
    Direct comparisons (substring/superstring)
    WordNet (synonyms, word senses, hypernyms/hyponyms)
  Natural Language Processing
    Phrases in column headers
    Footnotes (for columns, rows, values)
    Explanations of symbols, rows, columns
    Titles and subtitles
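The "direct comparisons (substring/superstring)" item can be illustrated with a small Python sketch that compares two column labels after light normalization. The function names and the normalization rules (lower-casing, treating hyphens and underscores as spaces) are assumptions made for this example; the WordNet lookups are left out.

```python
def normalize_label(label):
    """Lower-case a label and collapse hyphens, underscores and spacing."""
    cleaned = label.lower().replace("-", " ").replace("_", " ")
    return " ".join(cleaned.split())


def lexical_match(label_a, label_b):
    """Classify how label_a relates to label_b by direct string comparison."""
    a, b = normalize_label(label_a), normalize_label(label_b)
    if a == b:
        return "equal"
    if a in b:
        return "substring"      # label_a is contained in label_b
    if b in a:
        return "superstring"    # label_a contains label_b
    return "none"


print(lexical_match("Population", "Population Growth rate"))
print(lexical_match("GDP/PPP Per Capita", "GDP/PPP"))
print(lexical_match("Country", "Continent"))
```

A substring hit is only weak evidence: "Population" and "Population Growth rate" share a word but describe different concepts, which is why the deck combines lexical matching with the value-based techniques.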
Concept Matching (Value Similarity)
  Compute overlap for string values by comparing data sets
  Compute overlap for numeric values by comparing Gaussian probability distributions
  Compute similarity of numeric values using regression
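The slides do not say how regression is applied, so the following Python is one possible reading, offered as a sketch: sample each numeric column at the same quantiles, fit a least-squares line between the two samples, and treat a slope near 1, an intercept near 0 and a high R² as evidence that the columns hold the same kind of value. The quantile pairing, the function names and the choice of five sample points are all assumptions.

```python
def quantiles(values, k):
    """Pick k values spread evenly across the sorted column."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / (k - 1))] for i in range(k)]


def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                       # assumes xs are not all equal
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)   # assumes ys are not all equal
    return slope, intercept, r_squared


def regression_similarity(column_a, column_b, k=5):
    """Fit a line between matching quantiles of two numeric columns."""
    return linear_fit(quantiles(column_a, k), quantiles(column_b, k))


# Two made-up columns on roughly the same scale (illustrative values only).
column_a = [1200.0, 950.0, 700.0, 430.0, 210.0, 90.0]
column_b = [1180.0, 990.0, 820.0, 640.0, 400.0, 250.0, 100.0]
slope, intercept, r_squared = regression_similarity(column_a, column_b)
print(f"slope={slope:.3f} intercept={intercept:.1f} R^2={r_squared:.3f}")
```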
Concept Matching (Value Similarity)
  Column A: Afghanistan, Albania, Algeria, Andorra, …, Yemen, Zambia, Zimbabwe
  Column B: Afghanistan, Albania, Algeria, American Samoa, …, World, Yemen, Zambia, Zimbabwe
  [Diagram labels: "In A not in B", "In B not in A"]
  Real-world example
    Total of 193 cells in A
    Total of 267 cells in B
    77 fields in B not in A
    3 fields in A not in B
    190 total matches
    Proportion of matches with respect to A = 190/193 = 98%
    Proportion of matches with respect to B = 190/267 = 71%
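The overlap computation for string values can be sketched with Python sets. The sample uses only the country names visible on the slide (the elided rows are not filled in), and the function name `overlap` is a hypothetical choice; the last lines re-check the slide's percentages from its own counts.

```python
def overlap(values_a, values_b):
    """Compare two columns of string values as sets."""
    a, b = set(values_a), set(values_b)
    matches = a & b
    return {
        "matches": len(matches),
        "only_in_a": len(a - b),
        "only_in_b": len(b - a),
        "share_of_a": len(matches) / len(a),
        "share_of_b": len(matches) / len(b),
    }


# Only the rows shown on the slide; the "…" rows are left out.
column_a = ["Afghanistan", "Albania", "Algeria", "Andorra",
            "Yemen", "Zambia", "Zimbabwe"]
column_b = ["Afghanistan", "Albania", "Algeria", "American Samoa",
            "World", "Yemen", "Zambia", "Zimbabwe"]

result = overlap(column_a, column_b)
print(result)

# The slide's real-world counts: 190 matches out of 193 and 267 cells.
print(round(100 * 190 / 193), round(100 * 190 / 267))
```

The two proportions differ because B is the larger set: nearly all of A appears in B, while a good share of B has no counterpart in A.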
Concept Matching (Value Similarity)
  Column A: 31,900,600; 30,521,550; 25,335,200; 12,300,555; …; 3,567,203; 2,300,531; 1,400,112
  Column B: 31,500,900; 30,400,111; 25,500,100; 21,000,900; …; 7,000,000; 3,500,050; 2,300,000; 1,500,000
  [Diagram labels: "In A not in B", "In B not in A", "Gaussian PDF"]
  Total of 170 cells in A
  Total of 240 cells in B
  50 fields in B not in A
  2 fields in A not in B
  168 total matches
  Proportion of matches with respect to A = 168/170 = 99%
  Proportion of matches with respect to B = 168/240 = 70%
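The slide names a Gaussian PDF but not the comparison itself, so the Python below is a sketch of one way to compare two numeric columns as Gaussians: fit a mean and variance to each and compute the Bhattacharyya coefficient, which is 1 for identical distributions and approaches 0 as they separate. Using this particular coefficient is an assumption of the example, not the authors' stated method.

```python
import math
import statistics


def gaussian_similarity(column_a, column_b):
    """Bhattacharyya coefficient of two Gaussians fitted to the columns.

    Assumes each column has nonzero variance.
    """
    mean_a, mean_b = statistics.mean(column_a), statistics.mean(column_b)
    var_a = statistics.pvariance(column_a)
    var_b = statistics.pvariance(column_b)
    total = var_a + var_b
    # Penalty for differing spreads, then penalty for differing means.
    spread_term = math.sqrt(2 * math.sqrt(var_a * var_b) / total)
    mean_term = math.exp(-((mean_a - mean_b) ** 2) / (4 * total))
    return spread_term * mean_term


# Only the values visible on the slide; the "…" rows are left out.
column_a = [31900600.0, 30521550.0, 25335200.0, 12300555.0,
            3567203.0, 2300531.0, 1400112.0]
column_b = [31500900.0, 30400111.0, 25500100.0, 21000900.0,
            7000000.0, 3500050.0, 2300000.0, 1500000.0]

print(f"{gaussian_similarity(column_a, column_b):.3f}")
```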
Concept Matching (Value Features)
  We can also compute similarities from value characteristics such as:
    Character/numeric length, ratio
    Numeric values: mean, variance, standard deviation
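A minimal Python sketch of such value features follows. The feature names, and the reading of "ratio" as the share of digit characters, are assumptions made for the example.

```python
import statistics


def value_features(values):
    """Summarize a column by simple characteristics of its values."""
    texts = [str(value) for value in values]
    total_chars = sum(len(text) for text in texts)
    digit_chars = sum(ch.isdigit() for text in texts for ch in text)
    features = {
        "mean_length": statistics.mean([len(text) for text in texts]),
        "digit_ratio": digit_chars / total_chars if total_chars else 0.0,
    }
    numbers = [value for value in values if isinstance(value, (int, float))]
    if len(numbers) >= 2:   # sample variance needs at least two values
        features["mean"] = statistics.mean(numbers)
        features["variance"] = statistics.variance(numbers)
        features["stdev"] = statistics.stdev(numbers)
    return features


print(value_features([31139900, 30286900, 21233900]))
print(value_features(["Japan", "India", "China"]))
```

Two columns whose feature dictionaries are close (similar lengths, digit ratios and numeric statistics) are candidates for the same concept even when their headers differ.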
Concept Matching (Data frames)
  Snippets of real-world knowledge about data (type, length, nearby keywords, patterns [as in regexps], functional, etc.)
  We have used data frames to
    Recognize data types
    Include recognizers for values (dates, times, longitude, latitude, countries, cities, etc.)
    Provide conversion routines
    Match headers, labels, footnotes and values
    Compose or split columns (e.g., addresses)
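To make the recognizer idea concrete, here is a small Python sketch in which each "data frame" is reduced to a name plus a regular expression, and a column is typed by the recognizer that matches most of its known values. The patterns, names and majority-vote rule are assumptions for illustration; real data frames as described above also carry keywords and conversion routines.

```python
import re
from collections import Counter

# Hypothetical recognizers, written for the value formats in the sample tables.
RECOGNIZERS = {
    "dollar amount": re.compile(r"^\$\d{1,3}(,\d{3})*(\.\d+)?$"),
    "percentage": re.compile(r"^-?\d+(\.\d+)?%$"),
    "years": re.compile(r"^\d+(\.\d+)? years$"),
    "density": re.compile(r"^\d+(\.\d+)? persons/km2$"),
}


def recognize(value):
    """Return the name of the first recognizer that matches, else None."""
    text = value.strip()
    for name, pattern in RECOGNIZERS.items():
        if pattern.match(text):
            return name
    return None


def recognize_column(values, unknown="?"):
    """Type a column by majority vote over its known values."""
    known = [value for value in values if value.strip() != unknown]
    counts = Counter(recognize(value) for value in known)
    return counts.most_common(1)[0][0] if counts else None


print(recognize("$21,000,000,000"))
print(recognize_column(["?", "7.3%", "3.8%", "110.0%"]))
```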
Concept Matching (Constraints)
  Keys in tables (as well as nonkeys)
  Functional relationships
  1-1, 1-*, *-1 or *-* correspondences
  Subset/superset of value sets
  Unknown and null values
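Several of these constraints can be checked directly from column values. The Python below is a sketch with hypothetical function names; it assumes that "*-1" means many left values map to one right value (left functionally determines right) and "1-*" the reverse.

```python
def is_key(values):
    """A column is a candidate key if no value repeats."""
    return len(set(values)) == len(values)


def is_functional(pairs):
    """True if every left value maps to exactly one right value."""
    seen = {}
    for left, right in pairs:
        if seen.setdefault(left, right) != right:
            return False
    return True


def correspondence(pairs):
    """Classify a pair of columns as 1-1, *-1, 1-* or *-*."""
    pairs = list(pairs)
    forward = is_functional(pairs)
    backward = is_functional((right, left) for left, right in pairs)
    if forward and backward:
        return "1-1"
    if forward:
        return "*-1"
    if backward:
        return "1-*"
    return "*-*"


# A few rows from the agglomeration table.
rows = [("Tokyo", "Japan"), ("Osaka-Kobe-Kyoto", "Japan"),
        ("Seoul", "Korea (South)"), ("Jakarta", "Indonesia")]

print(is_key([city for city, _ in rows]))
print(correspondence(rows))
print({"Japan", "Indonesia"} <= {country for _, country in rows})
```

On these rows Agglomeration is a key and maps many-to-one onto Country, which is the kind of evidence a merge step can use to decide how two concepts relate.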
Ontology merging/growing
  Direct merge (no conflicts)
    Use results of matching phase to find similar concepts in ontologies (e.g., data value similarities, data frames, NLP, etc.)
  Conflict resolution
    Interactively identify evidence and counter-evidence of functional relationships among mini-ontologies using constraint resolution
  IDS: interaction with human knowledge engineer
    Issues: identify
    Default strategy: apply
    Suggestions: make
Example: Another mini-ontology generation
[Mini-ontology diagram. Node labels: Place, Place Name, Longitude, Latitude, Elevation, USGS Quad, Area, Country, State. The ⊎ symbol appears with Mine, Reservoir, Lake and City/town.]
Example: Another mini-ontology generation
[Diagram: the Place mini-ontology (Place, Place Name, Longitude, Latitude, Elevation, USGS Quad, Area, Country, State; ⊎ with Mine, Reservoir, Lake, City/town) shown under the label "Merge" with the ontology containing Geopolitical Entity, Name, Location, Longitude, Latitude, Time, Continent, Population, City, Agglomeration and Country ("has names", "has GMT", "Latitude and longitude designates location").]
Example: Concept Mapping to Ontology Merging
[Merged ontology diagram. Node labels: Place, Elevation, USGS Quad, Area, Country, State, Location, Longitude, Latitude, Name, Geopolitical Entity, Population, Agglomeration, Continent, Time, City/town, "Geopolitical Entity with population". The ⊎ symbol appears with Mine, Reservoir and Lake. Relationship labels: "has names", "has GMT", "Latitude and longitude designates location".]
Future direction
  Start with multiple tables (or URLs) and generate mini-ontologies
  Identify most suitable mini-ontologies to merge by calculating which tables have most overlap of concepts
  Generate multiple domain ontologies
  Integrate with form-based data extraction tools (smarter Web search engines)
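The second item, choosing which mini-ontologies to merge by concept overlap, can be sketched in Python with a Jaccard score over concept labels. The concept sets are read off the example diagrams in this deck; the dictionary keys, the function names and the use of exact label matching are simplifying assumptions (the deck's own matching relies on values, data frames and constraints as well).

```python
from itertools import combinations


def jaccard(concepts_a, concepts_b):
    """Shared concepts divided by all concepts across both sets."""
    a, b = set(concepts_a), set(concepts_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_merge_pair(ontologies):
    """Return the pair of ontology names with the highest concept overlap."""
    return max(
        combinations(sorted(ontologies), 2),
        key=lambda pair: jaccard(ontologies[pair[0]], ontologies[pair[1]]),
    )


ontologies = {
    "agglomerations": {"Agglomeration", "Population", "Country", "Continent"},
    "geopolitical": {"Geopolitical Entity", "Name", "Location", "Longitude",
                     "Latitude", "Time", "Country", "City"},
    "places": {"Place", "Place Name", "Longitude", "Latitude", "Elevation",
               "USGS Quad", "Area", "Mine", "Reservoir", "Lake",
               "City/town", "Country", "State"},
}

print(best_merge_pair(ontologies))
```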