IS432: Semi-Structured Data
Dr. Azeddine Chikh
1. Semi-Structured Data: Object Exchange Model
Introduction
From a database perspective, the Web has generated an enormous demand for recently developed architectures for database integration, such as data warehouses and mediation systems.
The Web has also led to the development of the semistructured data model, together with languages adapted to this model.
The emergence of XML as a standard for data representation on the Web is expected to greatly facilitate the publication of electronic data by providing a simple syntax for data that is both human- and machine-readable.
Although the document and database viewpoints were, until quite recently, irreconcilable, there is now a convergence in technologies, brought about by the development of XML for data on the Web and the closely related development of semistructured data in the database community.
Unstructured Data
data can be of any type
not necessarily following any format or sequence
does not follow any rules
is not predictable
examples include text, video, sound, images
Structured Data
data is organized in semantic chunks (entities)
similar entities are grouped together (relations or classes)
entities in the same group have the same descriptions (attributes)
descriptions for all entities in a group (schema)
  have the same defined format
  have a predefined length
  are all present
  and follow the same order
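The schema conditions above can be sketched as a conformance check. This is only an illustration: the `PERSON_SCHEMA` relation, its attribute names, and the reading of "predefined length" as a maximum length are assumptions, not part of the slides.

```python
# Hypothetical schema for a "person" group: (attribute, type, maximum length).
PERSON_SCHEMA = [("name", str, 30), ("email", str, 40), ("affiliation", str, 20)]

def conforms(record, schema):
    """True if the record has exactly the schema's attributes, in the same
    order, each with the declared type and within the declared length."""
    # dicts keep insertion order, so this also checks attribute order
    if list(record) != [name for name, _, _ in schema]:
        return False
    return all(isinstance(record[name], typ) and len(record[name]) <= size
               for name, typ, size in schema)
```

A record that omits an attribute, reorders attributes, or exceeds a length is rejected, which is the rigidity that semi-structured data relaxes.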
Semi-Structured Data
idea predates XML but not HTML
data is available electronically in
  database systems
  file systems, e.g., bibliographic data, Web data
  data exchange formats, e.g., EDI, scientific data
attempt to reconcile database and document "worlds"
semi-structured data
  organized in semantic entities
  similar entities are grouped together
  entities in same group may not have same attributes
  order of attributes not necessarily important
  not all attributes may be required
  size of same attributes in a group may differ
  type of same attributes in a group may differ
Example of Semi-Structured Data
name: Azeddine CHIKH
email: [email protected], [email protected]

name:
  first name: Mourad
  last name: Benchikh
email: [email protected]

name: Ashraf Youcef
affiliation: IS Department
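As a rough sketch, the three entities in this example can be held as Python dicts. The helper names `attribute_sets` and `shared_attributes` are made up for illustration, and the email values are the placeholders that appear in the source.

```python
# The three entities from the example; note the irregular structure.
people = [
    # two email values for one entity
    {"name": "Azeddine CHIKH",
     "email": ["[email protected]", "[email protected]"]},
    # "name" is itself structured here, while it is a plain string elsewhere
    {"name": {"first name": "Mourad", "last name": "Benchikh"},
     "email": "[email protected]"},
    # no email at all, but an extra "affiliation" attribute
    {"name": "Ashraf Youcef", "affiliation": "IS Department"},
]

def attribute_sets(records):
    """Set of attribute names used by each record."""
    return [set(record) for record in records]

def shared_attributes(records):
    """Attributes that every record in the group has."""
    return set.intersection(*attribute_sets(records))
```

Only `name` is common to all three entities, and even its type differs between them, which matches the characteristics listed for semi-structured data.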
Semi-Structured Data Models
based on labelled graphs rather than labelled trees
used for data exchange among, and integration of, heterogeneous data sources
schema information is in the edge labels
sometimes called schemaless or self-describing
data stored at the leaves
Graph Terminology (1)
a (directed) graph G = (N,E) consists of a set N of nodes and a set E of edges
each edge in E is an (ordered) pair of nodes (x,y), where x is the source and y is the target
a path from x_1 to x_n is a sequence of edges (x_1, x_2), (x_2, x_3), ..., (x_{n-1}, x_n)
the length of a path is the number of edges in it
a node r is a root for graph G if there is a path from r to every other node in G
a cycle is a path from a node to itself
a graph with no cycles is called acyclic
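These definitions translate fairly directly into code. The sketch below assumes a graph is given as a set of nodes plus a set of (source, target) pairs; the function names are illustrative choices.

```python
def is_path(path, edges):
    """True if path is a non-empty sequence of graph edges in which each
    edge starts at the node where the previous edge ends."""
    return (len(path) > 0
            and all(edge in edges for edge in path)
            and all(a[1] == b[0] for a, b in zip(path, path[1:])))

def reachable(edges, start):
    """Nodes reachable from start by a path of at least one edge."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for x, y in edges:
            if x == node and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def is_root(nodes, edges, r):
    """r is a root if there is a path from r to every other node."""
    return set(nodes) - {r} <= reachable(edges, r)

def is_acyclic(nodes, edges):
    """A cycle is a path from a node back to itself."""
    return all(n not in reachable(edges, n) for n in nodes)
```

The length of a path in this representation is simply `len(path)`, the number of edges it contains.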
Graph Terminology (2)
a graph is rooted if it has a single root
a tree is a rooted graph G in which there is a unique path from the root to every other node in G
a node is a leaf if it is not the source of any edge
graphs can have node labels and/or edge labels
in an edge-labelled graph G = (N,E,F_E), F_E is an edge labelling function that maps each edge to a label
in a node-labelled graph G = (N,E,F_N), F_N is a node labelling function that maps each node to a label
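A small sketch of leaves, trees, and labelling functions, again assuming edges are a set of (source, target) pairs. The tree test uses the in-degree characterization (root has no incoming edge, every other node has exactly one and is reachable from the root) as a stand-in for the unique-path definition; the labelling function F_E is modelled as a plain dict. All names and the sample graph are hypothetical.

```python
def leaves(nodes, edges):
    """Nodes that are not the source of any edge."""
    return set(nodes) - {x for x, _ in edges}

def reachable(edges, start):
    """Nodes reachable from start by a path of at least one edge."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for x, y in edges:
            if x == node and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def is_tree(nodes, edges, root):
    """Unique path from root to every other node, checked via in-degrees."""
    indegree = {n: 0 for n in nodes}
    for _, y in edges:
        indegree[y] += 1
    others = set(nodes) - {root}
    return (indegree[root] == 0
            and all(indegree[n] == 1 for n in others)
            and others <= reachable(edges, root))

# Edge labelling function F_E for a tiny hypothetical graph.
nodes = {"n1", "n2", "n3", "n4"}
edges = {("n1", "n2"), ("n1", "n3"), ("n3", "n4")}
F_E = {("n1", "n2"): "title", ("n1", "n3"): "author", ("n3", "n4"): "lastname"}
```

Adding a second edge into `n4`, for example `("n2", "n4")`, gives two paths from the root to `n4`, so the graph stops being a tree while remaining rooted.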
Object Exchange Model (OEM)
original OEM used only node labels
we use a variant in which the edges are labelled
an OEM data graph is a rooted, labelled, directed graph
  its edge labels map to strings
  only its leaf nodes have labels, which map to data values
  no ordering of edges leaving a node
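One possible in-memory shape for this edge-labelled OEM variant is sketched below. The class name `OEMGraph` and its methods are assumptions made for illustration, not an established API.

```python
from dataclasses import dataclass, field

@dataclass
class OEMGraph:
    root: str
    # node id -> set of (label, target id); a set, because outgoing
    # edges of a node are unordered
    edges: dict = field(default_factory=dict)
    # leaf node id -> atomic data value
    values: dict = field(default_factory=dict)

    def add_edge(self, source, label, target):
        if not isinstance(label, str):
            raise TypeError("edge labels must be strings")
        if source in self.values:
            raise ValueError("a node carrying a data value must stay a leaf")
        self.edges.setdefault(source, set()).add((label, target))

    def set_value(self, node, value):
        if self.edges.get(node):
            raise ValueError("only leaf nodes carry data values")
        self.values[node] = value

    def children(self, node, label):
        """Targets of the edges labelled `label` that leave `node`."""
        return {t for l, t in self.edges.get(node, ()) if l == label}
```

Storing outgoing edges as a set also allows the same label to appear on several edges from one node, as long as the targets differ.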
OEM Syntax
example may be written as
  { book: { author: "Coetzee", title: "Disgrace", year: 1999 } }
simple label-value pairs
labels can be repeated, e.g., for multiple authors
this is a serialization syntax for the graph
what about graphs that are not trees?
introduce object identifiers (oids) for nodes
Example of OEM Data Graph (1)
Example of OEM Data Graph (2)
Example of OEM Syntax
bib: &1
  { paper: &2 { ... },
    book: &3 { ... },
    paper: &4
      { author: &10
          { firstname: &15 "Serge",
            lastname: &16 "Abiteboul" },
        author: &11 { ... },
        title: &12 { ... },
        pages: &13
          { first: &17 122,
            last: &18 133 },
        references: &2,
        references: &3
      }
  }
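A sketch of how such oid-based output could be generated for a graph that is not a tree: a node is written in full the first time it is reached and as a bare oid afterwards. The function name `serialize` and the dict layout are assumptions. The fragment reuses a few oids from the example above; node `&2`, whose contents are elided in the example, is left empty here rather than filled in.

```python
def serialize(oid, edges, values, seen=None):
    """Write the node `oid`; nodes already written appear as a bare oid."""
    seen = set() if seen is None else seen
    if oid in seen:
        return oid
    seen.add(oid)
    if oid in values:
        v = values[oid]
        return f'{oid} "{v}"' if isinstance(v, str) else f"{oid} {v}"
    parts = [f"{label}: {serialize(target, edges, values, seen)}"
             for label, target in edges.get(oid, [])]
    body = ", ".join(parts)
    return f"{oid} {{ {body} }}" if body else f"{oid} {{ }}"

# Lists keep the output deterministic; OEM itself imposes no edge order.
edges = {
    "&1": [("paper", "&2"), ("paper", "&4")],
    "&4": [("pages", "&13"), ("references", "&2")],
    "&13": [("first", "&17"), ("last", "&18")],
}
values = {"&17": 122, "&18": 133}
text = serialize("&1", edges, values)
```

In the resulting text, `&2` is written once as an object and once as a reference, mirroring the `references: &2` entry in the example.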
Characteristics of SSData
structure is irregular: missing or additional attributes (labels)
parts of data lack structure, e.g., images
some may yield little structure, e.g., plain text
a-priori schema vs a-posteriori dataguide
  db: fix the schema, then populate the db
  web: design pages, then design schema to facilitate access
schema is large
schema is often ignored, e.g., information retrieval queries
schema is rapidly evolving
Schema Graphs
given some semi-structured data, might want to extract a schema that describes it
useful for
  browsing the data by types
  optimizing queries by reducing the number of paths searched
  improving storage of data
schema graph specifies what edges are permitted in a data graph
every path in the data graph occurs in the schema graph
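The path condition can be checked approximately by comparing label paths up to a bounded length. In the sketch below both graphs are dicts mapping a node to a set of (label, target) pairs; the graphs `D` and `S`, their node names, and the bound are hypothetical. Passing this check is only a necessary condition on paths up to that length, not a full conformance test.

```python
def label_paths(edges, root, max_len):
    """Label sequences of all paths from root that use 1..max_len edges."""
    paths, frontier = set(), {((), root)}
    for _ in range(max_len):
        next_frontier = set()
        for path, node in frontier:
            for label, target in edges.get(node, ()):
                next_frontier.add((path + (label,), target))
        paths |= {path for path, _ in next_frontier}
        frontier = next_frontier
    return paths

# Hypothetical data graph D and schema graph S.
D = {"d0": {("bib", "d1")},
     "d1": {("paper", "d2"), ("book", "d3")},
     "d2": {("author", "d4"), ("references", "d3")}}
S = {"s0": {("bib", "s1")},
     "s1": {("paper", "s2"), ("book", "s3")},
     "s2": {("author", "s4"), ("title", "s5"),
            ("references", "s2"), ("references", "s3")},
     "s3": {("author", "s4"), ("title", "s5")}}

paths_ok = label_paths(D, "d0", 4) <= label_paths(S, "s0", 4)
```

The bound keeps the enumeration finite even though `S` contains a cycle through its `references` edge.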
Example of a Schema Graph
Data Graph Satisfying a Schema Graph
given data graph D and schema graph S
D is an instance of S (or D satisfies S) if there exists a simulation R from D to S such that (root(D), root(S)) is in R
a simulation is a relation R between nodes:
  if (u,v) is in R and (u,x) labelled l is in D
  then there exists (v,y) labelled l in S such that (x,y) is in R
for our example:
  node &1 in D related under R to node at target of edge labelled bib in S
  &2 and &4 related to node at target of edge labelled paper
  &3 related to node at target of edge labelled book
  note that above two cases need to satisfy requirements of edges labelled references as well
  &10 and &11 related to node at target of edge labelled author
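One way to test satisfaction is to compute the greatest simulation by refinement: start from all node pairs and repeatedly delete pairs that violate the condition. This is a naive sketch, not an optimized algorithm; the graphs are dicts from a node to a set of (label, target) pairs, data values at leaves are ignored, and all names and sample graphs are hypothetical.

```python
def nodes_of(edges, root):
    """All nodes mentioned in the graph, including the root."""
    nodes = {root} | set(edges)
    for out in edges.values():
        nodes |= {target for _, target in out}
    return nodes

def greatest_simulation(d_edges, d_root, s_edges, s_root):
    R = {(u, v) for u in nodes_of(d_edges, d_root)
                for v in nodes_of(s_edges, s_root)}
    changed = True
    while changed:
        changed = False
        for u, v in list(R):
            for label, x in d_edges.get(u, ()):
                # (u,x) labelled `label` needs a matching (v,y) with (x,y) in R
                if not any(l == label and (x, y) in R
                           for l, y in s_edges.get(v, ())):
                    R.discard((u, v))
                    changed = True
                    break
    return R

def satisfies(d_edges, d_root, s_edges, s_root):
    """D satisfies S if the roots are related by some simulation."""
    return (d_root, s_root) in greatest_simulation(d_edges, d_root,
                                                   s_edges, s_root)

D = {"d0": {("bib", "d1")},
     "d1": {("paper", "d2"), ("book", "d3")},
     "d2": {("author", "d4"), ("references", "d3")}}
S = {"s0": {("bib", "s1")},
     "s1": {("paper", "s2"), ("book", "s3")},
     "s2": {("author", "s4"), ("references", "s2"), ("references", "s3")},
     "s3": {("author", "s4")}}
```

Since every simulation is contained in the greatest one, checking the root pair against it is enough to decide whether any suitable R exists.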
A Less Specific Schema Graph
Data Guides
data guide is a concise and accurate summary of a data graph
  accurate: every path in the data occurs in the data guide, and vice versa
  concise: every path in the data guide occurs exactly once
data guide is the most specific schema graph for a given data graph
  i.e., there is a simulation from the data guide to every other schema graph the data graph satisfies
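A data guide can be built with a subset construction, the same idea used to make a nondeterministic automaton deterministic: each guide node stands for the set of data nodes reached by one label path. The sketch below assumes the data graph is a dict from a node to a set of (label, target) pairs; `data_guide` and the sample graph are hypothetical.

```python
def data_guide(edges, root):
    """Return (start, guide), where guide maps each guide node (a frozenset
    of data nodes) to a dict {label: next guide node}."""
    start = frozenset([root])
    guide, todo = {}, [start]
    while todo:
        state = todo.pop()
        if state in guide:
            continue
        by_label = {}
        for node in state:
            for label, target in edges.get(node, ()):
                by_label.setdefault(label, set()).add(target)
        # one outgoing edge per label, so each label path occurs exactly once
        guide[state] = {label: frozenset(targets)
                        for label, targets in by_label.items()}
        todo.extend(guide[state].values())
    return start, guide

data = {"r": {("paper", "p1"), ("paper", "p2"), ("book", "b")},
        "p1": {("author", "a1"), ("title", "t1")},
        "p2": {("author", "a2"), ("author", "a3")}}
start, guide = data_guide(data, "r")
```

Here the two `paper` nodes collapse into a single guide node whose outgoing labels are `author` and `title`, the union of what either paper has.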
Example of a Data Guide (1)
Example of a Data Guide (2)