View
216
Download
0
Tags:
Embed Size (px)
Citation preview
The Structure Chasm
Authoring Creating a schema
Writing text
Querying keywordsUsing someone else’s schema
Data sharing Easy Committees, standardsBut we can pose
complex queries
Why is This a Problem?Databases used to be isolated and administered only by experts.Today’s applications call for large-scale data sharing: Big science (bio-medicine, astrophysics, …) Government agencies Large corporations The web (over 100,000 searchable data sources)
The vision: Content authoring by anyone, anywhere Powerful database-style querying Use relevant data from anywhere to answer the query The Semantic Web
Fundamental problem: reconciling different models of the world.
OutlineTwo motivating scenarios: A web of structured data Personal data management
A tour of recent data sharing architectures Data integration systems Peer-data management systems
The algorithmic problems: Query reformulation Reconciling semantic heterogeneity
Reconsidering authoring and querying challenges
Large-Scale Scientific Data Sharing
UW
UW Microbiology
UCLA GeneticsUW Genome Sciences
QuickTime™ and aTIFF (Uncompressed) decompressor
are needed to see this picture.
OMIMHUGO
Swiss-Prot
GeneClinics
Non-urgent Applications
UW
California IRS
Employer Tax Reports
1040 DBIRS
Fidelity
County real-estate DB
B of A
NY IRS
Personal Data Management
HTMLMail &
calendar
Cites
Event
Message
Document
Web Page
Presentation
Cached
SoftcopySoftcopySender,
Recipients
Organizer, Participants
Person
Paper
Author
Homepage
Author
Data is organized by application
[Semex: Sigurdsson, Nemes, H.]
Papers Files Presentations
Finding Publications
Person: A. HalevyPerson: Dan SuciuPerson: Maya RodrigPerson: Steven GribblePerson: Zachary Ives
Publication: What Can Peer-to-Peer Do for Databases, and Vice Versa
“A survey of approaches to automatic schema matching”
“Corpus-based schema matching”
“Database management for peer-to-peer computing: A vision”
“Matching schemas by learning from others”
“A survey of approaches to automatic schema matching”
“Corpus-based schema matching”
“Database management for peer-to-peer computing: A vision”
“Matching schemas by learning from others”
Publication
Bernstein
Following Associations (2)
PIM Data Sharing Challenges
Need to combine data from multiple applications/ sources.
After initial set of concepts are given, extend and personalize concept hierarchy,share (parts) of our data with others, incorporate external data into our view.
Need also Instance level reconciliation: Alon Halevy, A. Halevy, Alon Y. Levy – same guy!
OutlineTwo motivating scenarios: A web of structured data Personal data management
A tour of recent data sharing architectures Data integration systems Peer-data management systems
The algorithmic problems: Query reformulation Reconciling semantic heterogeneity
Reconsidering authoring and querying challenges
Data Integration
Goal: provide a uniform interface to a set of autonomous data sources.New abstraction layer over multiple sources. Many research projects (DB & AI) Mine: Information Manifold, Tukwila, BioMediator Cal: Garlic (IBM), Ariadne (USC), XMAS (UCSD),…
Recent “Enterprise Information Integration” industry: Startups: Nimble, Enosys, Composite, MetaMatrix Products from big players: BEA, IBM
Relational Abstraction LayerSchema: the template for data.
Queries:
SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … …
SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 …
Students: Takes:
CID Name Quarter CSE444 Databases fall CSE541 Operating systems winter
Courses:
SELECT C.nameFROM Students S, Takes T, Courses CWHERE S.name=“Mary” and S.ssn = T.ssn and T.cid = C.cid
SELECT C.nameFROM Students S, Takes T, Courses CWHERE S.name=“Mary” and S.ssn = T.ssn and T.cid = C.cid
Data Integration: Higher-level Abstraction
Mediated Schema
Q
Q1 Q2 Q3SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … …
SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 …
CID Name Quarter CSE444 Databases fall CSE541 Operating systems winter
SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … …
SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 …
CID Name Quarter CSE444 Databases fall CSE541 Operating systems winter
SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … …
SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 …
CID Name Quarter CSE444 Databases fall CSE541 Operating systems winter
… …
Semantic mappings
Mediated Schema
OMIMSwiss-Prot
HUGO GO
Gene-Clinics
EntrezLocus-Link
GEO
Entity
Sequenceable Entity
GenePhenotypeStructured Vocabulary
Experiment
ProteinNucleotide Sequence
Microarray Experiment
Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?
www.biomediator.orgTarczy-Hornoch, MorkTarczy-Hornoch, Mork
Semantic Mappings
BooksAndMusicTitleAuthorPublisherItemIDItemTypeSuggestedPriceCategoriesKeywords
Books TitleISBNPriceDiscountPriceEdition
CDs AlbumASINPriceDiscountPriceStudio
BookCategoriesISBNCategory
CDCategoriesASINCategory
ArtistsASINArtistNameGroupName
AuthorsISBNFirstNameLastName
Inventory Database A
Inventory Database B
Differences in: Names in schema Attribute grouping
Coverage of databases Granularity and format of attributes
Key Issues
Mediated Schema
Q
Q’ Q’ Q’SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … …
SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 …
CID Name Quarter CSE444 Databases fall CSE541 Operating systems winter
SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … …
SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 …
CID Name Quarter CSE444 Databases fall CSE541 Operating systems winter
SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … …
SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 …
CID Name Quarter CSE444 Databases fall CSE541 Operating systems winter
… …
Formalism for mappings Reformulation algorithms
How will we create them?
Beyond Data Integration
Mediated schema is a bottleneck for large-scale data sharing
It’s hard to create, maintain, and agree upon.
Peer Data Management Systems
UW
Stanford
DBLP
UCLA UCSD
CiteSeer
UC BerkeleyQ
Q1
Q2Q6
Q5
Q4
Q3Mappings specified locally
Map to most convenient nodes
Queries answered by traversing semantic paths.
Piazza: [Tatarinov, H., Ives, Suciu, Mork]
PDMS-Related Projects
Hyperion (Toronto)
PeerDB (Singapore)
Local relational models (Trento)
Edutella (Hannover, Germany)
Semantic Gossiping (EPFL Zurich)
Raccoon (UC Irvine)
Orchestra (U. Penn)
A Few Comments about Commerce
Until 5 years ago: Data integration = Data warehousing.
Since then: A wave of startups:
Nimble, MetaMatrix, Calixa, Composite, Enosys Big guys made announcements (IBM, BEA). [Delay] Big guys released products.
Success: analysts have new buzzword – EII New addition to acronym soup (with EAI).
Lessons: Performance was fine. Need management tools.
XML Query
User Applications
Lens™ File InfoBrowser™Software
Developers Kit
NIMBLE™ APIs
Front-End
XML
Lens Builder™Lens Builder™
Management Tools
Management Tools
Integration Builder
Integration Builder
Security T
ools
Data Administrator
Data Administrator
Data Integration: After
Concordance Developer
Concordance Developer
Integration
Layer
Nimble Integration Engine™
Compiler Executor
MetadataServerCache
Relational Data Warehouse/ Mart
Legacy Flat File Web Pages
Common XML View
Sound Business Models
Explosion of intranet and extranet information80% of corporate information is unmanagedBy 2004 30X more enterprise data than 1999The average company: maintains 49 distinct
enterprise applications spends 35% of total IT
budget on integration-related efforts
1995 1997 1999 2001 2003 2005
Enterprise Information
Source: Gartner, 1999
OutlineTwo motivating scenarios: A web of structured data Personal data management
A tour of recent data sharing architectures Data integration systems Peer-data management systems
The algorithmic problems Query reformulation Reconciling semantic heterogeneity
Reconsidering authoring and querying challenges
Languages for Schema Mapping
Mediated Schema
SourceSource Source Source Source
Q
Q’ Q’ Q’ Q’ Q’
GAV LAV GLAV
GLAV Mappings
Book: ISBN, Title, Genre, Year
R1aR1b R2 R3 R4
Author: ISBN, Name
R1a(isbn, title,n), R1b(isbn, genre,n) Book(isbn, title, genre, year), Author(isbn, n), year < 1970
Books before 1970
R5
Query Reformulation
Book: ISBN, Title, Genre, Year
R1 R2 R3 R4 R5
Author: ISBN, Name
Books before 1970 Humor books
Query: Find authors of humor books
Plan: R1 Join R5
R5(x,y) :- Book(x,y,”Humor”)
Answering Queries Using Views
Formal Problem: can we use previously answered queries to answer a new query? Challenge: need to invert query expression.
Results depend on: Query language used for sources and queries, Open-world vs. Closed-world assumption Allowable access patterns to the sources MiniCon [Pottinger and H., 2001]: scales to
thousands of sources.
Every commercial DBMS implements some version of answering queries using views.
Some Open Research Issues Managing large networks of mappings:
• Consistency• Trust
Improving networks: finding additional mappings
Indexing:Heterogeneous data across the networkCaching:Where? What?
UW
Stanford
DBLP
UCLA UCSD
CiteSeer
UC Berkeley
OutlineTwo motivating scenarios: A web of structured data Personal data management
A tour of recent data sharing architectures Data integration systems Peer-data management systems
The algorithmic problems Query reformulation Reconciling semantic heterogeneity
Reconsidering authoring and querying challenges
Semantic Mappings
BooksAndMusicTitleAuthorPublisherItemIDItemTypeSuggestedPriceCategoriesKeywords
Books TitleISBNPriceDiscountPriceEdition
CDs AlbumASINPriceDiscountPriceStudio
BookCategoriesISBNCategory
CDCategoriesASINCategory
ArtistsASINArtistNameGroupName
AuthorsISBNFirstNameLastName
Inventory Database A
Inventory Database B
Need mappings in every data sharing architecture
“Standards are great, but there are too many.”
Why is it so Hard?Schemas never fully capture their intended meaning:We need to leverage any additional information
we may have.
A human will always be in the loop.Goal is to improve designer’s productivity.Solution must be extensible.
Two cases for schema matching:Find a map to a common mediated schema.Find a direct mapping between two schemas.
Typical Matching HeuristicsWe build a model for every element from multiple sources of evidences in the schemas Schema element names
BooksAndCDs/Categories ~ BookCategories/Category Descriptions and documentation
ItemID: unique identifier for a book or a CD ISBN: unique identifier for any book
Data types, data instances DateTime Integer, addresses have similar formats
Schema structure All books have similar attributes
Models consider only the two schemas.
In isolation, techniques are incomplete or brittle:Need principled combination.
Using Past ExperienceMatching tasks are often repetitive Humans improve over time at matching. A matching system should improve too!
LSD: Learns to recognize elements of mediated schema. [Doan, Domingos, H., SIGMOD-01, MLJ-03]
Doan: 2003 ACM Distinguished Dissertation Award.
Mediated Schema
data sources
Mediated Schema
listed-price $250,000 $110,000 ...
address price agent-phone description
Example: Matching Real-Estate Sources
location Miami, FL Boston, MA ...
phone(305) 729 0831(617) 253 1429 ...
commentsFantastic houseGreat location ...
realestate.com
location listed-price phone comments
Schema of realestate.com
If “fantastic” & “great”
occur frequently in data values =>
description
Learned hypotheses
price $550,000 $320,000 ...
contact-phone(278) 345 7215(617) 335 2315 ...
extra-infoBeautiful yardGreat beach ...
homes.com
If “phone” occurs in the name =>
agent-phone
Mediated schema
Learning Source Descriptions
We learn a classifier for each element of the mediated schema.Training examples are provided by the given mappings.Multi-strategy learning:Base learners: name, instance, descriptionCombine using stacking.
Accuracy of 70-90% in experiments.Learning about the mediated schema.
Corpus-Based Schema Matching[Madhavan, Doan, Bernstein, H.]
Can we use previous experience to match two new schemas?
Learn about a domain?
CDs Categories Artists
Items
Artists
Authors Books
Music
Information
Litreture
Publisher
Authors
Corpus of Schemas and MatchesCorpus of Schemas and MatchesReuse extracted knowledgeto match new schemas
Learn general purpose knowledge
Classifier for every corpus element
Exploiting The Corpus
Given an element s S and t T, how do we determine if s and t are similar?
The PIVOT Method: Elements are similar if they are similar to the same
corpus concepts
The AUGMENT Method: Enrich the knowledge about an element by
exploiting similar elements in the corpus.
Pivot: measuring (dis)agreement
Pk= Probability (s ~ ck )
Interpretation I(s) = element s Schema S
Compute interpretations w.r.t. corpus
# concepts in corpus
Similarity(I(s), I(t))
I(s)
I(t)s t
S T
Interpretation captures how similar an element is to each corpus concept Compared using cosine distance.
Augmenting element models
Search similar corpus concepts Pick the most similar ones from the interpretation
Build augmented models Robust since more training data to learn from
Compare elements using the augmented models
s
S
Schema
ElementModel
Name:Instances:Type:…
M’s
Search similar corpus concepts
Build augmented models
s e ff
e
Corpus of known schemas and mappings
Experimental Results
Five domains: Auto and real estate: webforms Invsmall and inventory: relational schemas Nameaddr: real xml schemas
Performance measure: F-Measure:
Precision and recall are measured in terms of the matches predicted.
recallprecision
recallprecisionf
+=
**2
Comparison over domains
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
auto real estate invsmall inventory nameaddr
Average FMeasure
direct augment pivot
Corpus based techniques perform better in all the domains
“Tough” schema pairs
Significant improvement in difficult to match schema pairs
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
auto real estate invsmall inventory nameaddr
Average F-Measure
direct augment pivot
Mixed corpus
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
auto + re + invsmall difficult auto + invsmall
Average F-Measure
direct augment pivot
Corpus with schemas from different domains can also be useful
Other Corpus Based ToolsA corpus of schemas can be the basis for many useful tools: Mirror the success of corpora in IR and NLP?
Back to the structure chasm: Authoring and querying.
Auto-complete: I start creating a schema (or show sample data),
and the tool suggests a completion.
Formulating queries on new databases: I ask a query using my terminology, and it gets
reformulated appropriately.
ConclusionVision: data authoring, querying and sharing by everyone, everywhere.
Need to make it easier to enjoy the benefits of structured data.
Challenge: reconciling semantic heterogeneity
CorpusOf
schemas
schemamapping
Some References
www.cs.washington.edu/homes/alonPiazza: ICDE03, WWW03, VLDB-03The Structure Chasm: CIDR-03Surveys on schema matching languages: Halevy, VLDB Journal 01 Lenzerini, PODS 2002
Semi-automatic schema matching: Rahm and Bernstein, VLDB Journal 01.
Teaching integration to undergraduates: SIGMOD Record, September, 2003.