Crossing the Structure Chasm Alon Halevy University of Washington, Seattle UBC, January 15, 2004


Page 1: Crossing the Structure Chasm

Crossing the Structure Chasm

Alon Halevy, University of Washington, Seattle

UBC, January 15, 2004

Page 2: Crossing the Structure Chasm

The Structure Chasm

  • Authoring: writing text vs. creating a schema
  • Querying: keywords vs. using someone else’s schema
  • Data sharing: easy vs. committees, standards

But we can pose complex queries

Page 3: Crossing the Structure Chasm

Why is This a Problem?

Databases used to be isolated and administered only by experts.

Today’s applications call for large-scale data sharing:
  • Big science (bio-medicine, astrophysics, …)
  • Government agencies
  • Large corporations
  • The web (over 100,000 searchable data sources)

The vision:
  • Content authoring by anyone, anywhere
  • Powerful database-style querying
  • Use relevant data from anywhere to answer the query
  • The Semantic Web

Fundamental problem: reconciling different models of the world.

Page 4: Crossing the Structure Chasm

Outline

Other benefits of structure:
  • (Semantic) email
  • Personal data management

A tour of recent data sharing architectures:
  • Data integration systems
  • Peer-data management systems

The algorithmic problems:
  • Query reformulation
  • Reconciling semantic heterogeneity
  • What can we do with a large corpus of schemas?

Page 5: Crossing the Structure Chasm

Adding Structure to Email

Email is often used for lightweight data management tasks:
  • Organizing a PC meeting + dinner
  • Arranging a ‘balanced’ potluck
  • Giving away opera tickets
  • Announcing an event and associated reminders

Some specialized tools/services: Outlook scheduling, evite.com

Can we delegate some email tasks easily?

Page 6: Crossing the Structure Chasm

Semantic Email Processes

[Figure: Originator, Process, Database, Recipients. The originator sends “Start a potluck process”. The process asks each recipient “What will you bring?”, checks every reply against the constraints, and records accepted replies in the database:

  email      bringing
  jane@cs    Entree
  john@cs    Dessert
  mary@ee    Appetizer
  jayant@u   Dessert

A further “I’ll bring a dessert” reply is stopped with “Too many desserts. Appetizer or entrée?”. The originator receives “Here is what everyone is bringing…”.]
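The potluck process in the figure can be sketched in a few lines of Python. This is only an illustration, not the actual semantic email implementation: the class name, the `max_per_dish` limit of two, and the fifth recipient `alex@cs` are assumptions made for the example.

```python
from collections import Counter

DISHES = {"Appetizer", "Entree", "Dessert"}

class PotluckProcess:
    """Toy semantic email process: stores replies and enforces a balance constraint."""

    def __init__(self, max_per_dish=2):
        # Assumed constraint; the slide only shows that "too many desserts" is rejected.
        self.max_per_dish = max_per_dish
        self.bringing = {}  # email address -> dish category

    def reply(self, sender, dish):
        """Record the reply if the constraint holds, otherwise suggest alternatives."""
        counts = Counter(self.bringing.values())
        if counts[dish] >= self.max_per_dish:
            others = sorted(DISHES - {dish})
            return f"Too many {dish.lower()}s. " + " or ".join(others) + "?"
        self.bringing[sender] = dish
        return "OK"

    def summary(self):
        """What the originator sees: who is bringing what."""
        return dict(self.bringing)

process = PotluckProcess()
responses = [
    process.reply("jane@cs", "Entree"),
    process.reply("john@cs", "Dessert"),
    process.reply("mary@ee", "Appetizer"),
    process.reply("jayant@u", "Dessert"),
    process.reply("alex@cs", "Dessert"),  # hypothetical fifth recipient
]
print(responses[-1])
print(process.summary())
```

The rejected reply is never written to the database, so the summary sent to the originator always satisfies the constraint.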

Page 7: Crossing the Structure Chasm

Semantic Email [Etzioni, McDowell, (Ha)Levy]

Creating the structure? We’ll help with template interfaces.

Incorporating additional knowledge?
  • I always bring desserts
  • I don’t schedule morning meetings
  • Another data sharing challenge.

But it’s free: (and cross platform) www.cs.washington.edu/research/semweb

Page 8: Crossing the Structure Chasm

Personal Data Management

Data is organized by application.

[Figure: data from HTML, mail & calendar, papers, files and presentations is linked into concepts (Event, Message, Document, Web Page, Presentation, Paper, Person) through associations such as Cites, Cached, Softcopy, Sender, Recipients, Organizer, Participants, Author and Homepage.]

[Semex: Sigurdsson, Nemes, H.]

Page 9: Crossing the Structure Chasm

Finding Publications

Person: A. Halevy
Person: Dan Suciu
Person: Maya Rodrig
Person: Steven Gribble
Person: Zachary Ives

Publication: What Can Peer-to-Peer Do for Databases, and Vice Versa

Page 10: Crossing the Structure Chasm

Following Associations (1)

[Screenshot: Publication; Bernstein]

Page 11: Crossing the Structure Chasm

Following Associations (2)

[Screenshot: Publication; Bernstein]

  • “A survey of approaches to automatic schema matching”
  • “Corpus-based schema matching”
  • “Database management for peer-to-peer computing: A vision”
  • “Matching schemas by learning from others”

Page 12: Crossing the Structure Chasm

Following Associations (3)

[Screenshot: Publication; Bernstein; Cited by; Publication; Citations]

Page 13: Crossing the Structure Chasm

Following Associations (4)

[Screenshot: Cited Authors; Bernstein; Publication]

Page 14: Crossing the Structure Chasm

Structure for Personal Data

High-level concepts are given, but later:
  • extend and personalize the concept hierarchy,
  • share (parts of) our data with others,
  • incorporate external data into our view.

Concepts are populated automatically with instances.

Need instance-level reconciliation:

Alon Halevy, A. Halevy, Alon Y. Levy – same guy!
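A toy sketch of instance-level reconciliation on exactly these names. The heuristic (same first initial, and one last name ending with the other) is an assumption chosen so that the slide's three variants fall together; it is not the Semex algorithm and would over-merge on real data.

```python
def name_key(name):
    """Return (first initial, last name), both lower case, for a personal name."""
    parts = name.replace(".", " ").split()
    return parts[0][0].lower(), parts[-1].lower()

def same_person(a, b):
    """Toy heuristic: equal first initials, and one last name ends with the other."""
    (init_a, last_a), (init_b, last_b) = name_key(a), name_key(b)
    return init_a == init_b and (last_a.endswith(last_b) or last_b.endswith(last_a))

def reconcile(names):
    """Greedily group names that the heuristic treats as the same person."""
    groups = []
    for name in names:
        for group in groups:
            if any(same_person(name, member) for member in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

groups = reconcile(["Alon Halevy", "A. Halevy", "Alon Y. Levy", "Dan Suciu"])
print(groups)
```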

Page 15: Crossing the Structure Chasm

Outline

Other benefits of structure:
  • (Semantic) email
  • Personal data management

A tour of recent data sharing architectures:
  • Data integration systems
  • Peer-data management systems

The algorithmic problems:
  • Query reformulation
  • Reconciling semantic heterogeneity
  • What can we do with a large corpus of schemas?

Page 16: Crossing the Structure Chasm

Data Integration

Goal: provide a uniform interface to a set of autonomous data sources. First step towards data sharing.

Many research projects (DB & AI)
  • Mine: Information Manifold, Tukwila, LSD

Recent industry:
  • Startups: Nimble, Enosys, Composite, MetaMatrix
  • Products from big players: BEA, IBM

Page 17: Crossing the Structure Chasm

Relational DBMS Refresher

Schema: the template for data.

Students:
  SSN          Name     Category
  123-45-6789  Charles  undergrad
  234-56-7890  Dan      grad
  …            …

Takes:
  SSN          CID
  123-45-6789  CSE444
  123-45-6789  CSE444
  234-56-7890  CSE142
  …

Courses:
  CID     Name               Quarter
  CSE444  Databases          fall
  CSE541  Operating systems  winter

Queries:

  SELECT C.name
  FROM Students S, Takes T, Courses C
  WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
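To try the refresher query, here is a small sketch using Python's built-in `sqlite3` module. It loads a subset of the sample rows and parameterizes the student name; the helper name `courses_taken_by` is made up for the example. The slide's query asks for “Mary”, who does not appear in the sample rows shown, so the example also looks up Charles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (ssn TEXT, name TEXT, category TEXT);
    CREATE TABLE Takes    (ssn TEXT, cid TEXT);
    CREATE TABLE Courses  (cid TEXT, name TEXT, quarter TEXT);
    INSERT INTO Students VALUES ('123-45-6789', 'Charles', 'undergrad'),
                                ('234-56-7890', 'Dan', 'grad');
    INSERT INTO Takes VALUES ('123-45-6789', 'CSE444'),
                             ('234-56-7890', 'CSE142');
    INSERT INTO Courses VALUES ('CSE444', 'Databases', 'fall'),
                               ('CSE541', 'Operating systems', 'winter');
""")

# Same three-way join as on the slide, with the name as a parameter.
QUERY = """
    SELECT C.name
    FROM Students S, Takes T, Courses C
    WHERE S.name = ? AND S.ssn = T.ssn AND T.cid = C.cid
"""

def courses_taken_by(student_name):
    """Return the names of the courses taken by the given student."""
    return [row[0] for row in conn.execute(QUERY, (student_name,))]

print(courses_taken_by("Charles"))
print(courses_taken_by("Mary"))
```

Dan's course CSE142 has no row in Courses, so the join returns nothing for him either.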

Page 18: Crossing the Structure Chasm

Data Integration: Higher-level Abstraction

[Figure: a query Q is posed over the mediated schema and translated, through semantic mappings, into queries Q1, Q2, Q3 over the underlying sources. Each source is drawn as a copy of the Students / Takes / Courses tables.]

Page 19: Crossing the Structure Chasm

Mediated Schema

[Figure: mediated schema with the concepts Entity, Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence and Microarray Experiment, over the sources OMIM, Swiss-Prot, HUGO, GO, Gene-Clinics, Entrez, Locus-Link and GEO.]

Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?

www.biomediator.org [Tarczy-Hornoch, Mork]

Page 20: Crossing the Structure Chasm

Semantic Mappings

Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  Authors(ISBN, FirstName, LastName)

Differences in:
  • Names in schema
  • Attribute grouping
  • Coverage of databases
  • Granularity and format of attributes

Page 21: Crossing the Structure Chasm

Issues for Semantic Mappings

[Figure: a query Q over the mediated schema is reformulated, through semantic mappings, into queries Q’ over each source. Each source is drawn as a copy of the Students / Takes / Courses tables.]

  • Formalism for mappings
  • Reformulation algorithms
  • How will we create them?

Page 22: Crossing the Structure Chasm

Beyond Data Integration

Mediated schema is a bottleneck for large-scale data sharing

It’s hard to create, maintain, and agree upon.

Page 23: Crossing the Structure Chasm

Peer Data Management Systems

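[Figure: peer network of UW, Stanford, DBLP, UBC, Waterloo, CiteSeer and Toronto. A query Q posed at one peer is reformulated into Q1 to Q6 as it travels across the mappings.]

  • Mappings specified locally
  • Map to most convenient nodes
  • Queries answered by traversing semantic paths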

Piazza: [Tatarinov, H., Ives, Suciu, Mork]
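“Traversing semantic paths” can be pictured as a graph search over the peer mappings. The sketch below is not Piazza's reformulation algorithm; it only finds, by breadth-first search, the shortest chain of mappings from one peer to every reachable peer. The edges in `MAPPINGS` are invented for the example, since the slide does not say which peers are mapped to each other.

```python
from collections import deque

# Hypothetical pairwise mappings between the peers named on the slide.
MAPPINGS = {
    "UBC": ["UW"],
    "UW": ["Stanford", "DBLP"],
    "Stanford": ["Waterloo"],
    "DBLP": ["CiteSeer"],
    "Waterloo": ["Toronto"],
    "CiteSeer": [],
    "Toronto": [],
}

def semantic_paths(start):
    """Breadth-first search: shortest chain of mappings from start to each peer."""
    paths = {start: [start]}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for neighbor in MAPPINGS.get(peer, []):
            if neighbor not in paths:
                paths[neighbor] = paths[peer] + [neighbor]
                queue.append(neighbor)
    return paths

paths = semantic_paths("UBC")
print(paths["Toronto"])
```

A query posed at UBC would be reformulated once per hop along such a path, which is why long chains of mappings make pruning and composition worthwhile.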

Page 24: Crossing the Structure Chasm

PDMS-Related Projects

  • Hyperion (Toronto)
  • PeerDB (Singapore)
  • Local relational models (Trento)
  • Edutella (Hannover, Germany)
  • Semantic Gossiping (EPFL Zurich)
  • Raccoon (UC Irvine)
  • Orchestra (Ives, U. Penn)

Page 25: Crossing the Structure Chasm

A Few Comments about Commerce

Until 5 years ago: data integration = data warehousing.

Since then:
  • A wave of startups: Nimble, MetaMatrix, Calixa, Composite, Enosys
  • Big guys made announcements (IBM, BEA).
  • [Delay] Big guys released products.

Success: analysts have a new buzzword, EII. New addition to acronym soup (with EAI).

Lessons:
  • Performance was fine.
  • Need management tools.

Page 26: Crossing the Structure Chasm

Data Integration: Before

[Figure: a query Q over the mediated schema is reformulated into queries Q’ over five sources.]

Page 27: Crossing the Structure Chasm

Data Integration: After

[Figure: Nimble product architecture. Front-end: user applications, Lens™, File InfoBrowser™, Software Developers Kit, NIMBLE™ APIs and Lens Builder™, exchanging XML queries and XML results with the integration layer. Integration layer: Nimble Integration Engine™ (compiler, executor), metadata server, cache, Integration Builder and Concordance Developer. Management tools: security tools, data administrator. Sources under a common XML view: relational, data warehouse/mart, legacy, flat file, web pages.]

Page 28: Crossing the Structure Chasm

Sound Business Models

  • Explosion of intranet and extranet information
  • 80% of corporate information is unmanaged
  • By 2004, 30X more enterprise data than 1999
  • The average company maintains 49 distinct enterprise applications and spends 35% of total IT budget on integration-related efforts

[Chart: enterprise information, 1995 to 2005.]

Source: Gartner, 1999

Page 29: Crossing the Structure Chasm

Outline

Other benefits of structure:
  • (Semantic) email
  • Personal data management

A tour of recent data sharing architectures:
  • Data integration systems
  • Peer-data management systems

The algorithmic problems:
  • Query reformulation
  • Reconciling semantic heterogeneity
  • What can we do with a large corpus of schemas?

Page 30: Crossing the Structure Chasm

Languages for Schema Mapping

[Figure: a query Q over the mediated schema is reformulated into queries Q’ over five sources.]

Mapping languages: GAV, LAV, GLAV

Page 31: Crossing the Structure Chasm

Local-as-View (LAV)

Mediated schema:
  Book: ISBN, Title, Genre, Year
  Author: ISBN, Name

Sources: R1 R2 R3 R4 R5

  R1(x, y, n) :- Book(x, y, z, t), Author(x, n), t < 1970     (books before 1970)
  R5(x, y) :- Book(x, y, "Humor", t)                          (humor books)

Page 32: Crossing the Structure Chasm

Query Reformulation

Mediated schema:
  Book: ISBN, Title, Genre, Year
  Author: ISBN, Name

Sources: R1 (books before 1970), R2, R3, R4, R5 (humor books)

Query: Find authors of humor books

Plan: R1 Join R5
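A small sketch of what the plan computes. The rows in `BOOK` and `AUTHOR` are invented; in a real integration system the mediated schema holds no data and only the source contents R1 and R5 are visible. Here the sources are materialized from the two LAV view definitions (with Book given all four of its attributes) so that the plan can be compared with the answer over the hidden data.

```python
# Hypothetical mediated-schema contents, used only to populate the sources.
BOOK = [  # (isbn, title, genre, year)
    ("i1", "Old Jokes", "Humor", 1955),
    ("i2", "New Jokes", "Humor", 1995),
    ("i3", "Old Tales", "Drama", 1960),
]
AUTHOR = [("i1", "Ann"), ("i2", "Bob"), ("i3", "Cy")]  # (isbn, name)

# R1(x, y, n) :- Book(x, y, z, t), Author(x, n), t < 1970
R1 = [(x, y, n) for (x, y, z, t) in BOOK for (ax, n) in AUTHOR if ax == x and t < 1970]
# R5(x, y) :- Book(x, y, "Humor", t)
R5 = [(x, y) for (x, y, z, t) in BOOK if z == "Humor"]

def plan_r1_join_r5():
    """Authors of humor books, using only the sources: join R1 and R5 on ISBN."""
    humor_isbns = {x for (x, y) in R5}
    return sorted({n for (x, y, n) in R1 if x in humor_isbns})

def direct_answer():
    """The same query evaluated on the (normally unavailable) mediated relations."""
    humor_isbns = {x for (x, y, z, t) in BOOK if z == "Humor"}
    return sorted({n for (x, n) in AUTHOR if x in humor_isbns})

print(plan_r1_join_r5())
print(direct_answer())
```

With these two sources alone the plan is sound but not complete: it returns only authors of humor books published before 1970, because R1 is the only source shown that exposes author names.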

Page 33: Crossing the Structure Chasm

Query Reformulation

Mediated schema:
  Book: ISBN, Title, Genre, Year
  Author: ISBN, Name

Sources: R1 (ISBN, Title, Name), R2, R3, R4, R5 (ISBN, Title)

Query: Find authors of humor books before 1960

Plan: Can’t do it! (subtle reasons)

Page 34: Crossing the Structure Chasm

Query Reformulation

  • Query is posed on a mediated schema that contains no data.
  • Sources are answers to queries (views).
  • Problem: answering queries using views.
  • (Conceptually) need to invert the query expression.
  • Traditional databases also use this: can you reuse previously cached results?

Page 35: Crossing the Structure Chasm

Answering Queries Using Views

NP-complete for basic queries [LMSS, PODS 95].

Results depend on:
  • Query language used for sources and queries
  • Open-world vs. closed-world assumption
  • Allowable access patterns to the sources

A lot of beautiful theory!

Page 36: Crossing the Structure Chasm

Theory?

A lot of beautiful theory.

“There is in these words the beautiful maneuverability of the abstract, rushing in to replace the intractability of the concrete.”

Milan Kundera, The Book of Laughter and Forgetting

Page 37: Crossing the Structure Chasm

Practical Query Reformulation

A lot of nice theory. But also very practical algorithms:
  • MiniCon [Pottinger and H., 2001]: scales to thousands of sources.
  • Every commercial DBMS implements some version of answering queries using views.

See [Halevy, 2001] for survey.

Page 38: Crossing the Structure Chasm

Reformulation in PDMS

[Figure: peer network of UW, Stanford, DBLP, UBC, Waterloo, CiteSeer and Toronto.]

  • Can’t follow all paths naively: pruning techniques [Tatarinov, H.]
  • Can we pre-compute some paths? Need to compose mappings [Madhavan, H., VLDB-2003]

Page 39: Crossing the Structure Chasm

Open PDMS Research Issues

[Figure: peer network of UW, Stanford, DBLP, UBC, Waterloo, CiteSeer and Toronto.]

  • Managing large networks of mappings: consistency, trust
  • Improving networks: finding additional mappings
  • Indexing: heterogeneous data across the network
  • Caching: where? what?

Page 40: Crossing the Structure Chasm

Outline

Other benefits of structure:
  • (Semantic) email
  • Personal data management

A tour of recent data sharing architectures:
  • Data integration systems
  • Peer-data management systems

The algorithmic problems:
  • Query reformulation
  • Reconciling semantic heterogeneity
  • What can we do with a large corpus of schemas?

Page 41: Crossing the Structure Chasm

Semantic Mappings

Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  Authors(ISBN, FirstName, LastName)

Need mappings in every data sharing architecture

“Standards are great, but there are too many.”

Page 42: Crossing the Structure Chasm

Why is it so Hard?

  • Schemas never fully capture their intended meaning: schema elements are just symbols. We need to leverage any additional information we may have.
  • ‘Theorem’: schema matching is AI-complete. Hence, a human will always be in the loop.
  • Goal is to improve the designer’s productivity.
  • Solution must be extensible.

Page 43: Crossing the Structure Chasm

Matching Heuristics

Multiple sources of evidence in the schemas:
  • Schema element names: BooksAndCDs/Categories ~ BookCategories/Category
  • Descriptions and documentation: “ItemID: unique identifier for a book or a CD”, “ISBN: unique identifier for any book”
  • Data types, data instances: DateTime vs. Integer; addresses have similar formats
  • Schema structure: all books have similar attributes

Use domain knowledge.

All these techniques consider only the two schemas.

In isolation, techniques are incomplete or brittle: need principled combination.
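As a rough illustration of the element-name heuristic and of its brittleness, the sketch below scores name pairs with `difflib.SequenceMatcher` from the Python standard library. The 0.6 threshold and the function names are arbitrary choices for the example, not values from the talk.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """String similarity in [0, 1] between two element names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_matches(source_elems, target_elems, threshold=0.6):
    """For each source element, keep its most similar target if it clears the threshold."""
    matches = {}
    for s in source_elems:
        score, best = max((name_similarity(s, t), t) for t in target_elems)
        if score >= threshold:
            matches[s] = (best, round(score, 2))
    return matches

# Attribute names taken from Inventory Database A (BooksAndMusic) and B (Books).
a_attrs = ["Title", "ItemID", "SuggestedPrice"]
b_attrs = ["Title", "ISBN", "Price", "DiscountPrice", "Edition"]
matches = best_matches(a_attrs, b_attrs)
print(matches)
```

Names alone pair Title with Title, but ItemID and ISBN score far below the threshold although the documentation says both are unique identifiers. That gap is the reason for combining several kinds of evidence.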

Page 44: Crossing the Structure Chasm

Using Past Experience

Matching tasks are often repetitive:
  • Humans improve over time at matching.
  • A matching system should improve too!

LSD: learns to recognize elements of the mediated schema. [Doan, Domingos, H., SIGMOD-01, MLJ-03]

Doan: 2003 ACM Distinguished Dissertation Award.

[Figure: data sources mapped to a mediated schema.]

Page 45: Crossing the Structure Chasm

Example: Matching Real-Estate Sources

Mediated schema: address, price, agent-phone, description

Schema of realestate.com: location, listed-price, phone, comments

realestate.com data:
  location      Miami, FL         Boston, MA       ...
  listed-price  $250,000          $110,000         ...
  phone         (305) 729 0831    (617) 253 1429   ...
  comments      Fantastic house   Great location   ...

Learned hypotheses:
  • If “phone” occurs in the name => agent-phone
  • If “fantastic” & “great” occur frequently in data values => description

homes.com data:
  price          $550,000          $320,000         ...
  contact-phone  (278) 345 7215    (617) 335 2315   ...
  extra-info     Beautiful yard    Great beach      ...

Page 46: Crossing the Structure Chasm

Learning Source Descriptions

  • We learn a classifier for each element of the mediated schema.
  • Training examples are provided by the given mappings.
  • Multi-strategy learning: base learners (name, instance, description), combined using stacking.

Accuracy of 70-90% in experiments.
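A minimal sketch of how base learners might be combined, using the two learned hypotheses from the real-estate example. It is not LSD: the base learners are hard-coded rules rather than trained classifiers, and the equal weights in `WEIGHTS` are hand-set stand-ins for what a stacking meta-learner would learn from training data.

```python
def name_learner(column_name, values):
    """Rule from the example: 'phone' in the column name suggests agent-phone."""
    return {"agent-phone": 1.0 if "phone" in column_name.lower() else 0.0}

def instance_learner(column_name, values):
    """Rule from the example: frequent 'fantastic' or 'great' suggests description."""
    if not values:
        return {"description": 0.0}
    hits = sum(any(w in v.lower() for w in ("fantastic", "great")) for v in values)
    return {"description": hits / len(values)}

# Assumed weights; in stacking these would be learned, not fixed by hand.
WEIGHTS = {name_learner: 0.5, instance_learner: 0.5}

def predict(column_name, values):
    """Combine the weighted base-learner scores and return the best label, if any."""
    combined = {}
    for learner, weight in WEIGHTS.items():
        for label, score in learner(column_name, values).items():
            combined[label] = combined.get(label, 0.0) + weight * score
    label, score = max(combined.items(), key=lambda kv: kv[1])
    return label if score > 0 else None

# Columns of homes.com from the example slide.
print(predict("contact-phone", ["(278) 345 7215", "(617) 335 2315"]))
print(predict("extra-info", ["Beautiful yard", "Great beach"]))
print(predict("price", ["$550,000", "$320,000"]))
```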

Page 47: Crossing the Structure Chasm

Corpus-Based Schema Matching

  • Can we use previous experience to match two new schemas?
  • Can a corpus of schemas and matches be a general purpose resource?
  • Information Retrieval and NLP progressed by using corpora. Can the same be done for structured data?

Page 48: Crossing the Structure Chasm

Corpus-Based Schema Matching

Can we use previous experience to match two new schemas?

[Figure: a corpus of schemas and matches, with elements such as CDs, Categories, Artists, Items, Authors, Books, Music, Information, Literature and Publisher. General purpose knowledge is learned from the corpus: a classifier for every corpus element, built by multi-strategy learning (name, data instances, data type, description and structure learners, combined by a meta learner). The extracted knowledge is reused to match new schemas.]

Page 49: Crossing the Structure Chasm

The Corpus vs. Other Matchers

[Chart, Inventory domain: recall (0 to 1) per schema pair (P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b) for the matchers MKB, BASIC and COMB.]

Page 50: Crossing the Structure Chasm

Exploiting Previous Experience

[Chart, Shipping domain: average number of matches (axis from -15 to 15) per schema pair (P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b), found only by MKB vs. only by BASIC.]

Page 51: Crossing the Structure Chasm

Corpus Challenges

  • What exactly should we learn?
  • Generalizing with few training examples
  • Balancing previous experience with other clues
  • Size and scope of the corpus

Page 52: Crossing the Structure Chasm

Other Corpus Based Tools

Conjecture: a corpus of schemas can be the basis for many useful tools.
  • Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion.
  • Formulating queries on new databases: I ask a query using my terminology, and it gets reformulated appropriately.

Now we can cross the structure chasm.
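The auto-complete idea can be sketched with simple co-occurrence counts. The three-schema `CORPUS` below is invented for illustration, and voting by overlap is just one plausible scoring rule; the talk presents the tool only as a conjecture.

```python
from collections import Counter

# Hypothetical mini-corpus: each schema is a set of attribute names.
CORPUS = [
    {"title", "isbn", "price", "edition"},
    {"title", "isbn", "price", "publisher"},
    {"album", "asin", "price", "studio"},
]

def suggest(partial, k=2):
    """Suggest up to k attributes from corpus schemas that overlap the partial schema."""
    partial = set(partial)
    votes = Counter()
    for schema in CORPUS:
        overlap = len(partial & schema)
        if overlap:
            # Schemas sharing more attributes with the draft get a stronger vote.
            for attr in schema - partial:
                votes[attr] += overlap
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [attr for attr, _ in ranked[:k]]

print(suggest({"title", "isbn"}))
```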

Page 53: Crossing the Structure Chasm

Conclusion

  • Vision: data authoring, querying and sharing by everyone, everywhere.
  • Structure is useful in our daily tasks.
  • Key challenge: reconciling semantic heterogeneity.

[Figure: corpus of schemas; schema mapping]

Page 54: Crossing the Structure Chasm

Some References

  • www.cs.washington.edu/homes/alon
  • Piazza: ICDE03, WWW03, VLDB-03
  • The Structure Chasm: CIDR-03
  • Surveys on schema matching languages: Halevy, VLDB Journal 01; Lenzerini, PODS 2002
  • Semi-automatic schema matching: Rahm and Bernstein, VLDB Journal 01
  • Teaching integration to undergraduates: SIGMOD Record, September, 2003