Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Semistructured Data and Mediation
Prof. Letizia Tanca Dipartimento di Elettronica e Informazione
Politecnico di Milano
SEMISTRUCTURED DATA FOR THESE DATA THERE IS SOME FORM OF STRUCTURE, BUT IT IS NOT AS
– PRESCRIPTIVE – REGULAR – COMPLETE
AS IN TRADITIONAL DBMSs EXAMPLES
– WEB DATA – XML DATA – BUT ALSO DATA DERIVED FROM THE INTEGRATION OF
HETEROGENEOUS DATASOURCES
1
AN EXAMPLE OF SEMISTRUCTURED DATA
a page produced from a database
2
AN EXAMPLE OF SEMISTRUCTURED DATA
3
AN EXAMPLE OF SEMISTRUCTURED DATA
4
GRAPH-BASED REPRESENTATION: THE IRREGULAR DATA STRUCTURE APPEARS MORE CLEARLY
CONCRETE VALUES
ABSTRACT OBJECTS
COURSE
PROFESSOR
DATABASES
37 YEARS AGE
TEACHES
45 KING
NAME NAME
LAST NAME
PROFESSOR
TEACHES AGE
NAME
NAME?
A SIMPLE XML DOCUMENT WITH ITS GRAPH BASED REPRESENTATION
5
<producer> <mn-name>Mercury</mn-name> <year>1999</year> <model> <mo-name>Sable LT</mo-name> <front-rating>3.84</front-rating> <side-rating>2.14</side-rating> <rank>9</rank> </model> ……… </producer>
PRODUCER
MODEL
mo-name
front-rating
year mn-name
Mercury 1999
Sable LT
3.84
side-rating
2.14
9
rank
…..
INFORMATION SEARCH IN SEMISTRUCTURED DATABASES
• WE WOULD LIKE TO: – INTEGRATE – QUERY – COMPARE
DATA WITH DIFFERENT STRUCTURES ALSO WITH SEMISTRUCTURED DATA, JUST AS IF THEY WERE ALL STRUCTURED
6
DYNAMIC INTEGRATION OF SEMISTRUCTURED DATABASES
AN OVERALL DATA REPRESENTATION SHOULD BE PROGRESSIVELY BUILT, AS WE DISCOVER AND EXPLORE NEW INFORMATION SOURCES
7
SEMISTRUCTURED DATA MODELS
• BASED ON – TEXT – TREES – GRAPHS
• LABELED NODES • LABELED ARCS • BOTH
• THEY ARE ALL DIFFERENT AND DO NOT LEND THEMSELVES TO EASY INTEGRATION
8
Recall: MEDIATORS
The term mediation includes: • the processing needed to make
the interfaces work • the knowledge structures that
drive the transformations needed to transform data to information
• any intermediate storage that is needed (Wiederhold)
9
TSIMMIS
10
DATASOURCE 1 (RDBMS)
DATASOURCE 2 (LOTUS NOTES)
DATASOURCE 3 (WWW)
WRAPPER 1 WRAPPER 2 WRAPPER 3
MEDIATOR 1 MEDIATOR 2
APPLICATION 1 APPLICATION 2
(OEM)
(LOREL) (LOREL)
(OEM)
Mediator-based approach
IN TSIMMIS: • UNIQUE, GRAPH-‐BASED DATA MODEL • DATA MODEL MANAGED BY THE MEDIATOR • WRAPPERS FOR THE MODEL-‐TO-‐MODEL TRANSLATIONS
11
OEM (Object Exchange Model) (TSIMMIS)
• Graph-based • Does not represent the schema • Directly represents data : self-descriptive
<temp-in-farenheit,int,80>
12
Nested structure
13
<Object-id,label,type,value>
Object structure
OEM (Object Exchange Model) (TSIMMIS)
14
Typical complications when integrating
• Each mediator is specialized into a certain domain (e.g. weather forecast), thus
• Each mediator must know domain metadata , which convey the data semanWcs
• On-‐line duplicate recogniWon and removal (no designer to solve conflicts at design Wme here)
15
Query formulation “Find books authored by Aho”
select library.book.title where library.book.author = “Aho” from library (if more than one root is available) OK, but if this query must be produced at run-time, how
does the user (or the system, if a transformation has to be applied) know that: – A node library exists, which contains nodes book, which in turn
contain fields author and title – TSIMMIS uses the Dataguide: a-posteriori schema,
progressively built while exploring the data sources
16
TSIMMIS’s language is LOREL
• Lightweight Object REpository Language • Object-based • Similar to OQL, with some modifications
appropriate for semistructured data
select library.book.title where library.book.author = “Aho” from library
17
WRAPPERS (translators)
• Convert queries into queries/commands which are understandable for the specific data source
– they can extend the query possibilities of a data source
• Convert query results from the source’s format to a format which is understandable for the application
18
WRAPPERS
19
… S. Abiteboul, P.Buneman, D.Suciu
Data on the Web
... R. Elmasri, S. Navathe
Database Systems
... A. Tannenbaum Computer Networks
... J. Graham The HTML Sourcebook
Editor Author BookTitle
database table(s) (or XML docs) HTML page
Wrapper
Example: estraction of information from
HTML docs • Information extraction
– Source Format: plain text with HTML tags (no semantics)
– Target Format: relational table (possibly nested, NF2) or XML (we add structure, i.e. semantics )
• Wrapper – Software module which performs an extraction step – Intuition: use extraction rules which exploit the marking
tags
20
A complex extraction process
21
20-30KB IN HTML
>10 attributes with nesting
Problems
• Web sites change very frequently • A layout change may affect the extracWon rules • Human-‐based maintenance of an ad-‐hoc wrapper is very expensive
• Be]er: automa&c wrapper genera&on
22
Automa'c wrapper genera'on...
• We can only use them when pages are regular to some extent
• OK when: – Many pages sharing the same structure – e.g. pages are dynamically generated from a DB
à data intensive web sites
23
Online library
24
… …
Paul Jones #a2
John Smith #a1 Name Id
Authors
… … … …
… Java Script #a2 #b5
… HTML and Script #a2 #b4
… XML at Work #a2 #b3
… Computer Systems #a1 #b2
… Database Primer #a1 #b1
Description Title Author Id Books
… … … … … null 50$ 2000 #b5 #e7 Second null First … First … Second … First … Details
45$ 1999 #b4 #e6 30$ 1993 #b4 #e5 30$ 1999 #b3 #e4 40$ 1995 #b2 #e3 30$ 2000 #b1 #e2 20$ 1998 #b1 #e1 Price Year Book Id
Editions
… … … … … …
50$ 2000 null A must in … JavaScripts
45$ 1999 Second Edition, Hard Cover
30$ 1993 null
An useful HTML … HTML and Scripts
30$ 1999 First Edition, Paperback A comprehensive … XML at Work
Paul Jones
40$ 1995 First Edition, Paperback An undergraduate … Computer Systems
30$ 2000 Second Edition, Hard Cover
20$ 1998 First Edition, Paperback This book … Database Primer
John Smith
Price Year Details Editions
Description Title
Books Name
Source Dataset
HTML Pages
The ROAD RUNNER project
• Page Class – It is the collec'on of all pages generated by the same script from a common dataset
• Schema DerivaWon – Given a set of HTML sample pages, belonging to the same class, find the underlying dataset structure (database schema)
à SoluWon: Wrapper Generator • Underlying dataset structure • ExtracWon rules
25
26
Wrapper Output
… … … … … …
50$ 2000 null A must in … JavaScripts
45$ 1999 Second Edition, …
30$ 1993 null An useful HTML …
HTML and Scripts
30$ 1999 First Edition, …
A comprehensive …
XML at Work
Paul Jones
40$ 1995 First Edition, …
An undergraduate …
Computer Systems
30$ 2000 Second Edition, …
20$ 1998 First Edition, … This book … Database
Primer John Smith
B A E D C H G F
<HTML><BODY><TABLE> <TR> <TD><FONT>books.com<FONT></TD> <TD><A>John Smith</A></TD> </TR> <TR> <TD><FONT>Database Primer<FONT></TD> <TD><B>First Edition, Paperback</B></TD>, …
<HTML><BODY><TABLE> <TR> <TD><FONT>books.com<FONT></TD> <TD><A>Paul Jones</A></TD> </TR> <TR> <TD><FONT>XML at Work<FONT></TD> <TD><B>First Edition, Paperback</B></TD>, …
Input HTML Sources
<HTML><BODY><TABLE> <TR> <TD><FONT>books.com<FONT></TD> <TD><A> #PCDATA</A></TD></TR> (<TR> ( <TD><FONT> #PCDATA <FONT></TD> ( <TD><B> #PCDATA </B> )? </TD> ( <TR><TD><FONT> #PCDATA <FONT> <B> #PCDATA </B> </TD></TR> )+ )+…
Wrapper = Grammar
SET ( TUPLE (A : #PCDATA; B : SET ( TUPLE ( C : #PCDATA; D : #PCDATA; E : SET ( TUPLE ( F: #PCDATA; G: #PCDATA; H: #PCDATA)))
Target Schema
Model Management approach (Atzeni, Bernstein, others…)
• Given two different data models (e.g. OO and relational, or XML and OO, etc.) (when datasources are at least semistructured)
• Define general mappings from one model into another, which allow to – Map SQL schema to XML schema – Map data source to data warehouse – Map OO classes to data source tables, …
• To this end, one possibility is to use a metamodel
OBSERVATION: constructs in the various models are similar
• Only few categories of basic constructs (see Hull & King 1986): – Lexical: set of printable values (domain) – Abstract (entity, class, …) – Aggregation: a construction based on (subsets of) – Cartesian products (relationship, table) – Function (attribute, property) – Hierarchies – …
• A metamodel uses these basic constructs to create the constructs of the model we need to represent
28
METAMODEL
• A METAMODEL IS AN ABSTRACT MODEL FOR THE SPECIFICATION OF CONCRETE MODELS
• TWO TYPES OF METAMODELS: 1. GENERAL ENTITIES WHOSE SPECIALIZATIONS BECOME OBJECTS
IN THE TARGET MODEL, E.G. GSMM
2. ENTITIES DESCRIBING THE OBJECTS OF THE TARGET MODEL, E.G.GEOGRAPHIC DATA FILES (GDF)
29
Using a metamodel for integrated data representation
• TRANSLATION OF DIFFERENT MODELS INTO A
UNIQUE FORMALISM • EASY A-PRIORI COMPARISON BETWEEN THE
DIFFERENT MODEL’S FEATURES • AUTOMATIC TRANSLATION DICTATED BY THE
REPRESENTATION RULES OF THE CONCRETE MODEL INTO THE METAMODEL
30
Data IntegraWon based on a Meta-‐model
31
Q (In the metamodel language) Meta-‐model-‐based schema
DS1 DS2 DS3
DSn
…
Q1 (In DS1 language) Q2
(In DS2 language)
Q3 (In DS3 language)
Qn (In DSn language)
GSMM (General Semistructured Meta-‐Model)
• GRAPH-‐BASED • DOES NOT REPRESENT THE SCHEMA: SELF-‐DESCRIPTIVE
AS TSIMMIS • GSSM REPLACES THE CONCEPT OF SCHEMA WITH THAT
OF CONSTRAINT • INTRODUCTION OF FLEXIBILITY IN DATA REPRESENTATION
32
An XML document…
33
<producers> <producer> <mn-name>Mercury</mn-name> <year>1999</year> <model> <mo-name>Sable LT</mo-name> <front-rating>3.84</front-
rating> <side-rating>2.14</side-rating> <rank>9</rank> </model> …… </producer> … </producers>
PRODUCER
MODEL
mo-name
front-rating
year mn-name
Mercury 1999
Sable LT
3.84
side-rating
2.14
9
rank
…..
PRODUCERS …..
Its representation in GSMM
34
<PRODUCER,element,1, ⊥>
<MODEL,element,3, ⊥>
…..
….. <PRODUCERS,root,0,⊥>
<mn-name,element,1, Mercury>
<year,element,2, 1999>
<mo-name,element,1, SableLT>
<front-rating,element,2, 3.84>
<side-rating,element,3, 2.14>
<rank,element,4, 9>
sub-element of
sub-element of sub-element of
sub-element of
sub-element of
sub-element of sub-element of
sub-element of
APPLICATION OF THE GSMM METAMODEL
• THE METAMODEL ALLOWS THE CONSTRUCTION OF A GENERIC GRAPH WHICH REPRESENTS AN INSTANCE OF A CONCRETE MODEL
• PLUS CONSTRAINTS (also represented by means of graphs): – HIGH LEVEL, OR META- CONSTRAINTS – these dictate the
sintactic rules of the object data model – LOW LEVEL CONSTRAINTS – as usual, these dictate the
application domain semantics
• IT IS A METAMODEL OF THE FIRST KIND
35
ModelGen (Atzeni et al.)
A model management operator: • given two data models M1 and M2, and a
schema S1 of M1 (the source schema and model)
• generate a schema S2 of M2 (the target schema and model), corresponding to S1
• and, for each database D1 over S1, generate an equivalent database D2 over S2
36
Geographic Data Files (GDF)
• GDF is a standard, used for describing roadmaps – CITY STREET REGISTER – STREET SIGN ARCHIVE – TRAFFIC CONTROL SYSTEMS – CAR NAVIGATOR SYSTEMS (GPS) – …..
• IT IS A METAMODEL OF THE SECOND KIND
37
GDF STRUCTURE • DATASET OF 82 ASCII CHARACTERS RECORDS
– ENTITIES – ATTRIBUTES – RELATIONS
• 11 INTEREST THEMES – ROADS AND FERRIES – BRIDGES AND TUNNELS – RAILROADS – RIVERS – PUBLIC TRANSPORTATION – ADMINISTRATION AREAS – ……..
• 3 DESCRIPTION LEVELS FOR EACH THEME
38
GDF • LEVEL 0 – TOPOLOGY
– A PLANAR GRAPH: • POINT • ARC • NODE • POLYGON
– EACH ELEMENT IS UNIQUELY IDENTIFIED BY AN ID
– TOPOLOGICAL RULES GOVERN INTERNAL STRUCTURE AND RELATIONSHIPS AMONG ELEMENTS
39
GDF LEVELS
• LEVEL 1 – ELEMENTARY ENTITIES THIS IS THE BASIS FOR THE CITY STREET
REGISTER – ROAD ELEMENT – JUNCTION – TRAFFIC AREA
• LEVEL 2 – COMPLEX ENTITIES THIS IS THE BASIS FOR THE GEOGRAPHIC
INFORMATION SYSTEMS – STREET – INTERSECTION
40
GDF LEVELS
41
CL203 CS102 CS103
CL
234
CL
235
L321
L322
L52
5
L52
4
S867
S869 S868
S870
C621
C622
C623
C624
C625 C631
C627
C629
C628
C626
C630 L811
L812 L814
L813
L816
L815 L81
7
L82
0
L81
8
L81
9
L82
1
CITY MAP
LEVEL 2
LEVEL 1
LEVEL 0
VIALE ABRUZZI
Letizia Tanca DB design and integration 42
SUMMARY
Bibliography on Data IntegraWon
43