BIG DATA: Conceptual Modeling to the Rescue

David W. Embley
with special thanks to Stephen W. Liddle and the Data Extraction Research Group at Brigham Young University

ER 2013 Keynote 2

Roadmap
• What is BIG DATA?
• Why should Conceptual Modeling apply?
• Examples to show how Conceptual Modeling can “come to the rescue”
• Summary (and take-home message):
  – Principles that guide the use of Conceptual Modeling in BIG DATA applications
  – Challenges and Research Opportunities


BIG DATA

• Volume: typically exceeding terabytes

• Variety: heterogeneous sources; diverse needs

• Velocity: phenomenal rate of acquisition

• Veracity: trustworthiness & uncertainty

Volume
• Kilobyte (10^3): a paragraph of text
• Megabyte (10^6): a small novel
• Gigabyte (10^9): sound wave of Beethoven’s Fifth Symphony
• Terabyte (10^12): all the X-ray images in a large hospital
• Petabyte (10^15): 10 billion Facebook photos
• Exabyte (10^18): 1/5 of the words ever spoken
• Zettabyte (10^21): grains of sand on all the world’s beaches
• Yottabyte (10^24): atoms in 7,000 human bodies

NSA data site: purportedly designed to store yottabytes of data.

Variety: Heterogeneous Sources & Diverse Needs

Radiology Report

(John Doe, July 19, 12:14 pm)

Velocity

• Astronomers expect to be processing 10 petabytes of data every hour from the Square Kilometer Array (SKA) telescope.

• One minute on the Internet: 640 TB of data transferred, 100k tweets, 204 million e-mails sent.

Veracity: Uncertainty

• An age-old question: “What is truth?”

• Einstein: “The pursuit of truth and beauty is a sphere of activity in which we are permitted to remain children all our lives.”

• Of one thing we can be certain:


Conceptual Modeling & BIG DATA

• Main thrust: organizing data [Chen, TODS’76]
• And that’s one of the challenges of BIG DATA … but
  – Volume: too big
  – Variety: too much
  – Velocity: too fast
  – Veracity: too uncertain

Looking Backward

select PART-NO, QUANTITY-ON-HAND
where …

Looking Forward
• Conceptualization of the Web
  – Semantic search as well as keyword search
  – World-wide knowledge sharing
• Examples:
  – DB-pedia
  – Conceptual Graphs
    • Google’s Knowledge Graph
    • Yahoo!’s Web of Objects
    • Facebook’s Graph Search
    • Microsoft’s/Bing’s Satori Knowledge Base
  – Metaweb
  – FamilySearch
• Conceptual Modeling should apply!

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?name ?description_en ?description_de ?musician
WHERE {
  ?musician <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:German_musicians> .
  ?musician foaf:name ?name .
  OPTIONAL {
    ?musician rdfs:comment ?description_en .
    FILTER (LANG(?description_en) = 'en') .
  }
  OPTIONAL {
    ?musician rdfs:comment ?description_de .
    FILTER (LANG(?description_de) = 'de') .
  }
}

Google’s Knowledge Graph

Yahoo!’s Web of Objects

Yahoo!’s image answer to: What is a food?

Facebook’s Graph Search


Satori Knowledge Base


Metaweb

Boston ?


Don’t forget to take Wendy to Boston’s birthday party at 2:00.


FamilySearch: a free family history web site

• Visitors per day: 85,000+
• Pages viewed per day: 5M+

A service provided by The Church of Jesus Christ of Latter-day Saints. © 2013 by Intellectual Reserve, Inc. All rights reserved.

FamilySearch & the WoK-HD Project
• FamilySearch
  – Volume:
    • 1.8PB+ online (1.2B records along with 900M 2MB jpeg images)
    • 42PB+ offline (1.2B 30–40MB tiff images)
  – Velocity:
    • 500M+ images in 2013
    • 200K+ volunteer indexers
• WoK-HD scanned-book project (within FamilySearch)
  – Volume: 100,000 books (3.5TB expected)
  – Velocity: 25,000 books / year
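The volume figures above can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, as on the Volume slides; the variable names are illustrative only.

```python
# Decimal (SI) byte units, matching the powers of ten on the Volume slides.
MB, TB, PB = 10**6, 10**12, 10**15

# 900M jpeg images at about 2MB each (online collection).
online_jpeg_bytes = 900_000_000 * 2 * MB

# 1.2B tiff images at 30-40MB each (offline collection): low and high bounds.
offline_low_bytes = 1_200_000_000 * 30 * MB
offline_high_bytes = 1_200_000_000 * 40 * MB

# WoK-HD: 3.5TB expected for 100,000 books, scanned at 25,000 books per year.
bytes_per_book = 3.5 * TB / 100_000
years_to_scan = 100_000 / 25_000

print(online_jpeg_bytes / PB, "PB of jpeg images online")
print(offline_low_bytes / PB, "to", offline_high_bytes / PB, "PB of tiff images offline")
print(bytes_per_book / MB, "MB per book on average;", years_to_scan, "years to scan")
```

Under these assumptions the jpeg images alone account for the 1.8PB online figure, and the quoted 42PB offline figure falls inside the 36–48PB range implied by 30–40MB per tiff image.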

WoK-HD (A Web of Knowledge Superimposed over Historical Documents)

[Figure sequence: the query “grandchildren of Mary Ely” posed over pages of historical documents]

WoK-HD Construction

• Mitigating Velocity, Variety, & Volume
  – CM-based information extraction
    • PatternReader (for semi-structured text)
    • OntoSoar (for unstructured text)
  – Automated information harvesting & organization
• Assuring Veracity
  – CM-based query processing (with links and reasoning chains for extracted information)
  – Automated analysis with evidence-based CMs

PatternReader

THE ELY ANCESTRY. 419
SEVENTH GENERATION.
241213. Mary Eliza Warner, b. 1826, dau. of Samuel Selden Warner
and Azubah Tully; m. 1850, Joel M. Gloyd (who was connected with
Chief Justice Waite's family),
24331 1. Abigail Huntington Lathrop (widow), Boonton, N. J., b.
1810, dau. of Mary Ely and Gerard Lathrop ; m. 1835, Donald McKenzie.
West Indies, who was b. 1812, d. 1839.
(The widow is unable to give the names of her husband's parents.)
Their children
1. Mary Ely, b, 1836, d. 1859.
2. Gerard Lathrop, b. 1838.
243312. William Gerard Lathrop, Boonton, N. J., b. 1812, d. 1882,
son of Mary Ely and Gerard Lathrop; m. 1837, Charlotte Brackett
Jennings, New York City, who was b. 1818, dau. of Nathan Tilestone
Jennings and Maria Miller. Their children:
1. Maria Jennings, b. 1838, d. 1840.
2. William Gerard, b. 1840. ) .
3. Donald McKenzie, b. 1840, d. 1843. ]
4. Anna Margaretta, b. 1843.
5. Anna Catherine, b. 1845.
243314. Charles Christopher Lathrop, N. Y. City, b. 1817, d. 1865,
son of Mary Ely and Gerard Lathrop ; m. 1856, Mary Augusta Andruss,
992 Broad St., Newark, N. J., who was b. 1825, dau. of Judge Caleb
Halstead Andruss and Emma Sutherland Goble. Mrs. Lathrop died
at her home, 992 Broad St., Newark, N. J., Friday morning, Nov. 4,
1898. The funeral services were held at her residence on Monday, Nov.
7, 1898, at half-past two o'clock P. M. Their children:
1. Charles Halstead, b. 1857, d. 1861.
2. William Gerard, b. 1858, d. 1861.
3. Theodore Andruss, b. i860.
4. Emma Goble, b. 1862.
Miss Emma Goble Lathrop, official historian of the New York Chapter of the
Daughters of the American Revolution, is one of the youngest members to hold
office, but one whose intelligence and capability qualify her for such distinction.
Miss Lathrop is not without experience; in her present home and native city, Newark,
N. J., she has filled the positions of secretary and treasurer to the Girls'
Friendly Society for nine years, secretary and president of the Woman's Auxiliary
of Trinity Church Parish, treasurer of the St. Catherine's Guild of St. Barnabas
Hospital, and manager of several of Newark's charitable institutions which her
grandparents were instrumental in founding. Miss Lathrop traces her lineage
back through many generations of famous progenitors on both sides. Her maternal
ancestors were among the early settlers of New Jersey, among them John Ogden,
who received patent in 1664 for the purchase of Elizabethtown, and who in 1673 was

…
1. Mary Ely, b, 1836, d. 1859.
2. Gerard Lathrop, b. 1838.
…
1. Maria Jennings, b. 1838, d. 1840.
2. William Gerard, b. 1840. ) .
3. Donald McKenzie, b. 1840, d. 1843. ]
4. Anna Margaretta, b. 1843.
5. Anna Catherine, b. 1845.
…
1. Charles Halstead, b. 1857, d. 1861.
2. William Gerard, b. 1858, d. 1861.
3. Theodore Andruss, b. i860.
4. Emma Goble, b. 1862.

Annotations on the child records: OCR Error; “Twins” (brace lost in OCR).

PatternReader

…
#. Aaaa Aaaa, b, 18##, d. 18##.
#. Aaaa Aaaa, b. 18##.
…
#. Aaaa Aaaa, b. 18##, d. 18##.
#. Aaaa Aaaa, b. 18##. ) .
#. Aaaa AaAa, b. 18##, d. 18##. ]
#. Aaaa Aaaa, b. 18##.
#. Aaaa Aaaa, b. 18##.
…
#. Aaaa Aaaa, b. 18##, d. 18##.
#. Aaaa Aaaa, b. 18##, d. 18##.
#. Aaaa Aaaa, b. i8##.
#. Aaaa Aaaa, b. 18##.

^(\d)\.\s([A-Z][a-z]{3,7})\s([A-Z][a-z]{4,9}),\sb\.\s([i1]8\d\d)\.$

^(\d)\.\s(([A-Z][a-z][A-Z][a-z]{5})|([A-Z][a-z]{3,7}))\s([A-Z][a-z]{4,9}),\sb[.,]\s(18\d\d),\sd\.\s(18\d\d)\.$

Conflate symbols and induce grammar
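A minimal sketch of the symbol-conflation step, to make the idea concrete. It is not PatternReader's actual algorithm: the character classes (digit to "#", uppercase to "A", lowercase run to a single "a") and the function name `conflate` are assumptions chosen for illustration, and they differ slightly from the slide's template, which keeps the literal "18".

```python
import re
from collections import Counter

# Child records copied from the OCR'd page, OCR errors included.
CHILD_LINES = [
    "1. Mary Ely, b, 1836, d. 1859.",
    "2. Gerard Lathrop, b. 1838.",
    "1. Maria Jennings, b. 1838, d. 1840.",
    "4. Anna Margaretta, b. 1843.",
    "5. Anna Catherine, b. 1845.",
    "3. Theodore Andruss, b. i860.",
    "4. Emma Goble, b. 1862.",
]

def conflate(line: str) -> str:
    """Replace each character by a coarse symbol class (hypothetical scheme)."""
    symbols = []
    for ch in line:
        if ch.isdigit():
            symbols.append("#")
        elif ch.isupper():
            symbols.append("A")
        elif ch.islower():
            symbols.append("a")
        else:
            symbols.append(ch)  # punctuation and spaces are kept as-is
    # Collapse runs of lowercase so names of different lengths share a template.
    return re.sub(r"a+", "a", "".join(symbols))

# Lines sharing a template are candidates for one induced grammar rule.
templates = Counter(conflate(line) for line in CHILD_LINES)
for template, count in templates.most_common():
    print(count, template)
```

Records with the same template can then be generalized into one pattern, while rare templates (such as the one produced by "i860") point at OCR errors or at variants that need their own rule.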


Conceptual Modeling—the Backbone


(\d)\.\s([A-Z][a-z]{3,7})\s([A-Z][a-z]{4,9}),\sb\.\s([i1]8\d\d)
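As a rough illustration of how the capture groups of this pattern could feed a conceptual model, the sketch below maps each group to an object set. The object-set names (`ChildNumber`, `GivenName`, `BirthDate`) and the function `extract_child` are assumptions for illustration, as is the repair of the OCR error "i8" to "18".

```python
import re

# The pattern from the slide, unchanged.
CHILD_BORN = re.compile(
    r"(\d)\.\s([A-Z][a-z]{3,7})\s([A-Z][a-z]{4,9}),\sb\.\s([i1]8\d\d)"
)

def extract_child(line: str):
    """Map the capture groups to (hypothetical) object sets of the model."""
    match = CHILD_BORN.search(line)
    if match is None:
        return None
    number, first, second, year = match.groups()
    return {
        "ChildNumber": int(number),
        "GivenName": f"{first} {second}",
        # The character class [i1] tolerates the OCR error "i860"; repair it here.
        "BirthDate": int(year.replace("i", "1")),
    }

print(extract_child("2. Gerard Lathrop, b. 1838."))
print(extract_child("3. Theodore Andruss, b. i860."))
```

A line such as "1. Mary Ely, b, 1836, d. 1859." is not matched by this pattern (the second name is too short and "b," is not "b."), which is why several patterns are induced rather than one.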


Extraction Ontologies: Linguistically Grounded Conceptual Models

Lexical Object-Set Recognizers

BirthDate
  external representation: \b[1][6-9]\d\d\b
  left context: b\.\s
  right context: [.,]
  …

Non-lexical Object-Set Recognizers

Person
  object existence rule: {Name}
  …
Name
  external representation: \b{FirstName}\s{LastName}\b
  …

Relationship-Set Recognizers

Person-BirthDate
  external representation: ^\d{1,3}\.\s{Person},\sb\.\s{BirthDate}[.,]
  …
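The recognizers compose: a relationship-set recognizer refers to object-set recognizers by name in braces. A minimal sketch of that expansion follows. The `FirstName` and `LastName` patterns are hypothetical stand-ins (the slides elide them), and `expand` is an illustrative helper, not the extraction-ontology engine itself.

```python
import re

RECOGNIZERS = {
    "BirthDate": r"\b[1][6-9]\d\d\b",
    "Person": r"{Name}",                       # object existence rule
    "Name": r"\b{FirstName}\s{LastName}\b",
    "FirstName": r"[A-Z][a-z]+",               # hypothetical: not given on the slide
    "LastName": r"[A-Z][a-z]+",                # hypothetical: not given on the slide
    "Person-BirthDate": r"^\d{1,3}\.\s{Person},\sb\.\s{BirthDate}[.,]",
}

# A macro is a recognizer name in braces; quantifiers like {1,3} do not match.
MACRO = re.compile(r"\{([A-Z][A-Za-z-]*)\}")

def expand(name: str, capture=()) -> str:
    """Recursively replace {Name} macros by the named recognizer's pattern."""
    def substitute(match):
        inner_name = match.group(1)
        inner = expand(inner_name, capture)
        if inner_name in capture:
            return f"(?P<{inner_name}>{inner})"
        return f"(?:{inner})"
    return MACRO.sub(substitute, RECOGNIZERS[name])

person_birth = re.compile(expand("Person-BirthDate", capture=("Person", "BirthDate")))
hit = person_birth.search("2. Gerard Lathrop, b. 1838.")
print(hit.group("Person"), hit.group("BirthDate"))
```

Keeping each recognizer small and named means a change to, say, the date pattern propagates to every relationship-set recognizer that mentions `{BirthDate}`.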

Ontology-Snippet Recognizers

ChildRecord
  external representation: ^(\d{1,3})\.\s+([A-Z]\w+\s[A-Z]\w+)(,\sb\.\s([1][6-9]\d\d))?(,\sd\.\s([1][6-9]\d\d))?\.
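A short sketch of how the ChildRecord snippet recognizer could populate several object sets at once. The pattern is the one shown on the slide (with the stray space from the line wrap removed); the dictionary keys and the function name `parse_child_record` are illustrative assumptions.

```python
import re

CHILD_RECORD = re.compile(
    r"^(\d{1,3})\.\s+([A-Z]\w+\s[A-Z]\w+)"
    r"(,\sb\.\s([1][6-9]\d\d))?"   # optional birth clause
    r"(,\sd\.\s([1][6-9]\d\d))?"   # optional death clause
    r"\."
)

def parse_child_record(line: str):
    """Return the fields of one child record, or None if the line does not match."""
    match = CHILD_RECORD.match(line)
    if match is None:
        return None
    return {
        "number": int(match.group(1)),
        "name": match.group(2),
        "birth": int(match.group(4)) if match.group(4) else None,
        "death": int(match.group(6)) if match.group(6) else None,
    }

print(parse_child_record("1. Maria Jennings, b. 1838, d. 1840."))
print(parse_child_record("2. Gerard Lathrop, b. 1838."))
```

Lines damaged by OCR, such as "1. Mary Ely, b, 1836, d. 1859." or "3. Theodore Andruss, b. i860.", are rejected by this strict pattern and would need a more tolerant variant.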

HMM Recognizers

OntoSoar Recognizers

+---------------------------------Xp----------------------------------+
|             +----------Ost----------+         +-----Js------+       |
+--Wd-+--Ss---+          +-----A------+---Mp----+     +--DG---+       |
|     |       |          |            |         |     |       |       |
^   Emma    was.v   official.a   historian.n   of    the   NYCDAR     .

“of”(x1,x2)
“NYCDAR”(x2)
“Emma”(x1)
“historian”(x1)
“official”(x1)

Name(“Emma”)
Officer(“historian”)
Organization(“NYCDAR”)
Person–Name(y1,“Emma”)

OntoESSoar

Person-Officer-Organization(y1,“official historian”,“NYCDAR”)

Beyond Extraction
• Canonicalization
• Reasoning
  – Extraction of implied assertions
  – Generation of implied assertions
  – Object identity resolution
• Free-form query processing
• Form-based advanced query processing

All based on Conceptual Modeling

Canonicalization for Lexical Object Sets

• “Easter 1832” → JulianDate(1832113)
• JulianDate(1832113) → 22 Apr 1832
• “Sam’l” and “Geo.” → “Samuel” and “George”
• “Boonton, N.J.” → “Boonton, NJ, USA”

• Operations:
  – before(Date1, Date2): Boolean
  – probabilityMale(Name): 0.0..1.0
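A minimal sketch of canonical values and one operation over them. It assumes that JulianDate(1832113) encodes a year followed by a three-digit day of the year (1832, day 113), which is consistent with the "22 Apr 1832" shown above; the function names and the abbreviation table are illustrative only.

```python
from datetime import date, timedelta

# Abbreviation expansions taken from the slide.
ABBREVIATIONS = {"Sam'l": "Samuel", "Geo.": "George"}

def julian_to_date(julian: int) -> date:
    """Convert a YYYYDDD value (year, day of year) to a calendar date."""
    year, day_of_year = divmod(julian, 1000)
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)

def before(date1: int, date2: int) -> bool:
    """YYYYDDD integers sort chronologically, so comparison is numeric."""
    return date1 < date2

def canonical_given_name(name: str) -> str:
    """Expand a known abbreviation; unknown names are returned unchanged."""
    return ABBREVIATIONS.get(name.replace("\u2019", "'"), name)

print(julian_to_date(1832113).isoformat())
print(before(1832113, 1836001))
print(canonical_given_name("Geo."))
```

Storing the canonical form once lets every later step (queries, constraints, identity resolution) compare values without re-parsing the surface text.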


Implied Assertions

Author’s View: Maria Jennings … daughter of … William Gerard Lathrop

Desired View:
  Gender: Female
  Name:
    GivenName: Maria Jennings
    Surname: Lathrop

Maria Jennings Lathrop …child of …William Gerard Lathrop …son of …Mary Ely … Female

Mary Ely … grandmother of… Maria Jennings Lathrop
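The grandmother assertion is not stated anywhere in the book; it follows from extracted facts. A small sketch of one such inference rule, using only the facts on this slide. The data structures and the function name are assumptions for illustration.

```python
# Extracted assertions: (child, parent) pairs and known genders.
CHILD_OF = {
    ("Maria Jennings Lathrop", "William Gerard Lathrop"),
    ("William Gerard Lathrop", "Mary Ely"),
}
GENDER = {"Mary Ely": "Female"}

def grandmother_of():
    """Rule: X grandmother of Z if Z child of Y, Y child of X, and X is female."""
    return {
        (grandparent, child)
        for (child, parent) in CHILD_OF
        for (middle, grandparent) in CHILD_OF
        if middle == parent and GENDER.get(grandparent) == "Female"
    }

print(grandmother_of())
```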

Object Identity Resolution

[Figure: candidate person matches with scores 0.032081, 0.032081, and 0.995030]

Free-Form Query Processing

Query: Persons born in 1838

Recognized in the query: “Person(s)?”, “born”, “= 1838”

Person    Name                     BirthDate
Person11  Gerard Lathrop McKenzie  1838
Person18  Maria Jennings Lathrop   1838
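A toy sketch of how a free-form query might be matched against the populated model. The two 1838 rows come from the slide; the third person's identifier is hypothetical, and the single query pattern stands in for the full set of extraction-ontology recognizers.

```python
import re

PERSONS = [
    {"id": "Person11", "Name": "Gerard Lathrop McKenzie", "BirthDate": 1838},
    {"id": "Person18", "Name": "Maria Jennings Lathrop", "BirthDate": 1838},
    # Hypothetical identifier; the person and birth year are from the book page.
    {"id": "PersonX", "Name": "Anna Margaretta Lathrop", "BirthDate": 1843},
]

# "Person(s)?" and "born" select the Person-BirthDate relationship; the year is the constraint.
BORN_QUERY = re.compile(r"\bPersons?\b.*?\bborn\b.*?\b(1[6-9]\d\d)\b", re.IGNORECASE)

def answer(query: str):
    """Return the persons whose BirthDate equals the year named in the query."""
    match = BORN_QUERY.search(query)
    if match is None:
        return []
    year = int(match.group(1))
    return [person for person in PERSONS if person["BirthDate"] == year]

for person in answer("Persons born in 1838"):
    print(person["id"], person["Name"], person["BirthDate"])
```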


“Gerard Lathrop McKenzie” because:
  Person(Person11) has GivenName(“Gerard Lathrop”)
  and Child(Person11) of Person(Person9)
  and Person(Person9) has Gender(“Male”)
  and Person(Person9) has Surname(“McKenzie”)
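The reasoning chain above can be produced alongside the inferred value. The sketch below assumes a simple rule (a child takes the surname of a male parent) and dictionary-based fact stores; both are illustrative, not the system's actual rule base.

```python
# Extracted facts, keyed by person identifier (identifiers as on the slide).
GIVEN_NAME = {"Person11": "Gerard Lathrop"}
CHILD_OF = {"Person11": ["Person9"]}
GENDER = {"Person9": "Male"}
SURNAME = {"Person9": "McKenzie"}

def infer_full_name(person: str):
    """Return (full name, reasoning chain); the chain lists every fact used."""
    chain = [f'Person({person}) has GivenName("{GIVEN_NAME[person]}")']
    for parent in CHILD_OF.get(person, []):
        if GENDER.get(parent) == "Male" and parent in SURNAME:
            chain += [
                f"Child({person}) of Person({parent})",
                f'Person({parent}) has Gender("Male")',
                f'Person({parent}) has Surname("{SURNAME[parent]}")',
            ]
            return f"{GIVEN_NAME[person]} {SURNAME[parent]}", chain
    return GIVEN_NAME[person], chain

name, because = infer_full_name("Person11")
print(name, "because:")
print("\n  and ".join(because))
```

Returning the chain with the answer is what lets a user click through from an inferred name to the page text that supports it.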

Form-Based Advanced Query Processing

Cousins of Donald Lathrop who died before he was born or were born after he died.

Cousin

…
1. Mary Ely, b, 1836, d. 1859.
2. Gerard Lathrop, b. 1838.
…
1. Maria Jennings, b. 1838, d. 1840.
2. William Gerard, b. 1840. ) .
3. Donald McKenzie, b. 1840, d. 1843. ]
4. Anna Margaretta, b. 1843.
5. Anna Catherine, b. 1845.
…
1. Charles Halstead, b. 1857, d. 1861.
2. William Gerard, b. 1858, d. 1861.
3. Theodore Andruss, b. i860.
4. Emma Goble, b. 1862.
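A sketch of how this query could be evaluated once the child records are extracted and canonicalized. The records come from the book page (with "i860" repaired to 1860 and surnames inferred from the fathers or husband); the tuple layout, the shortened parent labels, and the year-level comparison are assumptions for illustration.

```python
# (name, parent, birth year, death year or None); the three parents are siblings,
# all children of Mary Ely and Gerard Lathrop.
CHILDREN = [
    ("Mary Ely McKenzie", "Abigail", 1836, 1859),
    ("Gerard Lathrop McKenzie", "Abigail", 1838, None),
    ("Maria Jennings Lathrop", "William", 1838, 1840),
    ("William Gerard Lathrop", "William", 1840, None),
    ("Donald McKenzie Lathrop", "William", 1840, 1843),
    ("Anna Margaretta Lathrop", "William", 1843, None),
    ("Anna Catherine Lathrop", "William", 1845, None),
    ("Charles Halstead Lathrop", "Charles", 1857, 1861),
    ("William Gerard Lathrop", "Charles", 1858, 1861),
    ("Theodore Andruss Lathrop", "Charles", 1860, None),
    ("Emma Goble Lathrop", "Charles", 1862, None),
]

def cousins_outside_lifetime(target_name: str):
    """Cousins who died before the target was born or were born after the target died."""
    _, target_parent, born, died = next(c for c in CHILDREN if c[0] == target_name)
    result = []
    for name, parent, cousin_born, cousin_died in CHILDREN:
        if parent == target_parent:
            continue  # same parent: a sibling, not a cousin
        died_before = cousin_died is not None and cousin_died < born
        born_after = died is not None and cousin_born > died
        if died_before or born_after:
            result.append(name)
    return result

print(cousins_outside_lifetime("Donald McKenzie Lathrop"))
```

With year-only dates, the comparison is necessarily coarse; the before(Date1, Date2) operation on canonical dates would refine it when full dates are available.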


Veracity

• Knowledge
  – Populated conceptual model
  – Plato: “justified true belief”
• FamilySearch
  – Conceptual model of reality
  – Constraint violation (discovery)
  – Assertion verification (evidence)
• Conceptual modeling for veracity

Mitigating Uncertainty with Conceptual Modeling

Veracity: “Justified True Belief”

Query: Persons born in 1838

Person    Name                     BirthDate
Person11  Gerard Lathrop McKenzie  1838
Person18  Maria Jennings Lathrop   1838

“Gerard Lathrop McKenzie” because:
  Person(Person11) has GivenName(“Gerard Lathrop”)
  and Child(Person11) of Person(Person9)
  and Person(Person9) has Gender(“Male”)
  and Person(Person9) has Surname(“McKenzie”)

FamilySearch: Wiki-like Updates + Uncertain Information

Sources of error:
1. Incorrect person merges
2. Incorrect parent-child relationship assertions

Cyclic Pedigree:
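A cyclic pedigree (someone recorded as their own ancestor) violates a constraint of the conceptual model and can be discovered mechanically. The sketch below is a standard depth-first cycle search over parent links; the person labels are hypothetical, and this is not presented as FamilySearch's own method.

```python
def find_pedigree_cycle(parents_of: dict):
    """Return one cycle as a list of persons (first == last), or None if acyclic."""
    IN_PROGRESS, DONE = 1, 2
    state = {}
    path = []

    def visit(person):
        state[person] = IN_PROGRESS
        path.append(person)
        for parent in parents_of.get(person, ()):
            if state.get(parent) == IN_PROGRESS:
                # The parent is already on the current ancestor path: a cycle.
                return path[path.index(parent):] + [parent]
            if parent not in state:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        state[person] = DONE
        return None

    for person in list(parents_of):
        if person not in state:
            cycle = visit(person)
            if cycle:
                return cycle
    return None

# Hypothetical data: a bad merge makes C's recorded parent A, who descends from C.
bad_pedigree = {"A": ["B"], "B": ["C"], "C": ["A"]}
good_pedigree = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
print(find_pedigree_cycle(bad_pedigree))
print(find_pedigree_cycle(good_pedigree))
```

Reporting the whole cycle, rather than a yes/no answer, shows which parent-child assertions or person merges a reviewer should re-examine.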

FamilySearch: Useful More Expressive Conceptual Model Specifications

[Figure: conceptual model with participation constraints (1:*, 1:2, 1:*), a rejected date assertion (2 Nov 1846) next to 1 Nov 1845, and probabilities p = 0.79 and p = 0.35]

Evidence-Based Conceptual Modeling
(1) Model Reality, (2) Allow/Discover Discrepancies, (3) Add Evidence


Principles that guide the use of Conceptual Modeling in BIG DATA

• Harvest wrt a conceptual model
  – Extraction ontologies
  – And …
• Organize wrt a conceptual model
  – Rich conceptualizations
  – And …
• Analyze wrt a conceptual model
  – Evidence-based reasoning
  – And …

More Examples of Conceptual Modeling in BIG DATA Applications

• Knowledge Bundle Building for Research Studies (KBB)
• Multi-Lingual Query Processing (ML-OntoES)
• Table Understanding (TISP, Table Ontology)
• Automating Ontology Creation (TANGO)
• Automated Reading (OntoSoar)
• Homeland Security
• Twitter Suicide Study
• Human Genome Project

Dream! Think Big! Contribute!

Knowledge Bundle Building (i.e., Construct and Populate CMs)

Example: Bio-Medical Research

• Objective: Study the association of:
  – TP53 polymorphism and
  – Lung cancer
• Task: locate, gather, organize data from:
  – Single Nucleotide Polymorphism database
  – Medical journal articles
  – Medical-record database
  – Radiology images and reports

Form-Based Extraction Ontologies Gather SNP Information from the NCBI dbSNP Repository


Linguistically Grounded Conceptual Models Search PubMed Literature


Reverse-Engineer Human Subject Information from INDIVO into a Conceptual Model


Add Annotated Images into the Conceptual Knowledge Bundle

Radiology Report (John Doe, July 19, 12:14 pm)

Query and Analyze Data in the Knowledge Bundle

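Multi-Lingual Query Processing

Query (French): Honda moins de 8000 en «excellent état» (“Honda under 8000 in ‘excellent condition’”)

[Figure: the query form shown for Korean (한국어) and French (français). Korean labels: 자동차 (car), 색상 (color), 주행거리 (mileage), 제조사 (make), 모델 (model), 등급 (trim), 액세서리 (accessories), 변속기 (transmission), 차종 (car type), 엔진 (engine), 특징 (features), 연식 (model year), 가격 (price), with the price constraint 8000€. French result columns: marque (make) Honda, prix (price) 7826€, mots de clé (keywords) Honda (2).]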


Table Understanding

• Tables on the web
  – 14.1 billion HTML tables [Cafarella et al. 08]
    • Most are tables for layout
    • 154 million high-quality relational tables
  – 50 million spreadsheet tables [Adelfio & Samet 13]
• Web table complexity (sampling statistics) [ibid]
  – Simple relational table: 25% (spreadsheet) 68% (HTML)
  – Multiple header rows: 15% (spreadsheet) 7% (HTML)
  – More complex: 60% (spreadsheet) 25% (HTML)

     A        B         C              D          E
                        Less than 100  100-299 p  300 pupils or more
 1   Schools  2003/04   36.2           39         24.8
 2            2004/05   35.2           39         25.8
 3            2005/06   35.2           39         25.8
 4            2006/07   34.3           40         25.7
 5            2007/08   34             39.6       26.4
 6            2008/09   33.3           40         26.7
 7            2009/10¹  32             40.7       27.3
 8   Pupils   2003/04   8.7            39.3       52
 9            2004/05   8.7            38.3       53
10            2005/06   8.8            38.3       52.9
11            2006/07   8.4            39         52.6
12            2007/08   8.3            38.2       53.5
13            2008/09   8.1            38.2       53.7
14            2009/10¹  7.7            38.2       54.1


Automating Ontology Creation with TANGO

Agglomeration          Population  Continent            Country
Tokyo                  31,139,900  Asia                 Japan
New York-Philadelphia  30,286,900  The Americas         United States of America
Mexico                 21,233,900  The Americas         Mexico
Seoul                  19,969,100  Asia                 Korea (South)
Sao Paulo              18,847,400  The Americas         Brazil
Jakarta                17,891,000  Asia                 Indonesia
Osaka-Kobe-Kyoto       17,621,500  Asia                 Japan
…                      …           …                    …
Niigata                503,500     Asia                 Japan
Raurkela               503,300     Asia                 India
Homjel                 502,200     Europe               Belarus
Zunyi                  501,900     Asia                 China
Santiago               501,800     The Americas         Dominican Republic
Pingdingshan           501,500     Asia                 China
Fargona                501,000     Asia                 Uzbekistan
Kirov                  500,200     Europe               Russia
Newcastle              500,000     Australia / Oceania  Australia

[Figures: a mini-ontology derived from the table (Agglomeration, Population, Country, Continent) is merged with other mini-ontologies covering Location (Longitude, Latitude; “Latitude and longitude designates location”), Time (“Has GMT”), and Geopolitical Entity (Country, City; “has names” Name) to give the merged result.]

Automated Reading: NELL

http://rtw.ml.cmu.edu


Automated Reading: OntoSoar
• Populate conceptual model from text
  – Directly
  – By inference
• Augment conceptual model and populate

Homeland Security: Terrorist Example

[Figure sequence: entities “Abu Aziz”, “?”, and “White House”]

What If!

Twitter Suicide Study

Tweets could warn of a suicide risk, BYU study says

Oct. 10 2013

… Over three months, the computer scientists screened millions of tweets and identified 37,717 that were "genuinely troubling" from 28,088 unique users …

Conceptual Modeling for Studying the Human Genome

Principles that guide the use of Conceptual Modeling in BIG DATA

• Harvest wrt a conceptual model
  – Extraction ontologies
  – And: table understanding, automated reading, …
• Organize wrt a conceptual model
  – Rich conceptualizations
  – And: KBs for research studies, multilingual web, …
• Analyze wrt a conceptual model
  – Evidence-based reasoning
  – And: “what-if”, warning signs search, DNA, …

Summary & Challenge

• Conceptual Modeling Applies to BIG DATA (perhaps more than you might have thought)

• Challenge: find ways to use conceptual modeling to “rescue”—resolve BIG DATA issues

BYU Data Extraction Research Group
www.deg.byu.edu