Semi-structured data
Patrick Lambrix
Department of Computer and Information Science
Linköpings universitet

Semi-structured data

- Data is not just text, but is not as well-structured as data in databases
- Occurs often in web databanks
- Occurs often in integration of databanks

Semi-structured data - properties

- irregular structure
- implicit structure
- partial structure
- a posteriori 'data guide' versus a priori schema
- large data guides

Semi-structured data - properties

- It should be possible to ignore the data guide upon querying
- Data guide changes fast
- object can change type/class
- difference between data guide and data is blurred

Semi-structured data - model

- network of nodes
- object model (oid)
- query: path search in the network

OEM (Object Exchange Model)

- Graph
- Nodes: objects
  - oid
  - atomic or complex
    - atoms: integer, string, gif, html, …
    - value of a complex object is a set of object references (label, oid)
- Edges have labels
- OEM is used by a number of systems (e.g. Lorel)
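
The model above can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionary layout, the oids, and the helper names `is_complex` and `path_search` are hypothetical and are not taken from Lorel or any OEM system. An atomic object maps to its value, and a complex object maps to a list of (label, oid) references.

```python
# OEM sketch: every object has an oid; its value is either an atom or a
# list of (label, oid) references. All oids and values here are made up.
oem = {
    1: [("restaurant", 2), ("cafe", 3)],
    2: [("name", 4), ("address", 5)],
    3: [("name", 6), ("address", 7)],
    4: "Chef Chu",
    5: [("street", 8), ("city", 9)],
    6: "Saigon",
    7: "Menlo Park",  # irregular structure: this address is a plain string
    8: "El Camino Real",
    9: "Palo Alto",
}


def is_complex(oid):
    """A complex object holds references; an atomic one holds a value."""
    return isinstance(oem[oid], list)


def path_search(root, labels):
    """Follow the given edge labels from root and return the oids reached."""
    current = [root]
    for label in labels:
        current = [child
                   for oid in current if is_complex(oid)
                   for lab, child in oem[oid] if lab == label]
    return current


names = [oem[oid] for oid in path_search(1, ["restaurant", "name"])]
cities = [oem[oid] for oid in path_search(1, ["restaurant", "address", "city"])]
cafe_address = [oem[oid] for oid in path_search(1, ["cafe", "address"])]
print(names, cities, cafe_address)
```

The two address objects have different shapes (one complex, one atomic), which is the kind of irregularity the model is meant to tolerate.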

OEM example

[Figure: OEM graph of the Restaurant Guide. The root object, named Guide, has outgoing edges labeled restaurant, restaurant, and cafe. Further edge labels: category, name, address, street, city, zipcode, price, nearby. Atomic values include gourmet, Chef Chu, El Camino Real, Palo Alto, 92310, Vietnamese, Saigon, Mountain View, Menlo Park, cheap, fast food, Sandra.]

Lorel query language

1. Find all places to eat Vietnamese food

   select P
   from RestaurantGuide.% P
   where P.category grep "ietnamese"

2. Find the names and streets of all restaurants in Palo Alto

   select R.name, A.street
   from RestaurantGuide.restaurant{R}.address A
   where A.city = "Palo Alto"

Lorel query language

3. Find all restaurants with zipcode 92310

   select RestaurantGuide.restaurant
   where RestaurantGuide.restaurant(.address)?.zipcode = 92310

Wildcards and variables

Wildcards:
   ? - 0 or 1 path
   + - 1 or more paths
   * - 0 or more paths
   # - any path
   % - 0 or more chars

Variables:
- object variables
   select P from Guide.% P
   select A from #.address{A}
- path variables
   select Guide.#@P.name
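
To make two of the wildcards concrete, here is a rough Python sketch of how `%` and `#` could be evaluated over a small graph. It is not Lorel's implementation: the graph, the function names, and the reading of `#` as "any path of zero or more edges" are assumptions, and `?`, `+`, `*`, grep, and variables are not covered.

```python
import re

# Hypothetical graph: oid -> list of (label, oid), or an atomic value.
graph = {
    1: [("restaurant", 2), ("cafe", 3)],
    2: [("name", 4), ("address", 5), ("nearby", 3)],
    3: [("name", 6), ("address", 7), ("nearby", 2)],
    4: "Chef Chu",
    5: [("street", 8), ("city", 9)],
    6: "Saigon",
    7: "Menlo Park",
    8: "El Camino Real",
    9: "Palo Alto",
}


def children(oid):
    """(label, oid) references of a complex object; none for an atom."""
    value = graph[oid]
    return value if isinstance(value, list) else []


def label_matches(pattern, label):
    """'%' in a pattern stands for zero or more characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, label) is not None


def reachable(oids):
    """Objects reachable by any path of zero or more edges ('#')."""
    seen, stack = set(oids), list(oids)
    while stack:
        for _, child in children(stack.pop()):
            if child not in seen:  # the 'nearby' edges form a cycle
                seen.add(child)
                stack.append(child)
    return seen


def evaluate(root, path):
    """Evaluate a dot-separated path expression and return the oids reached."""
    current = {root}
    for step in path.split("."):
        if step == "#":
            current = reachable(current)
        else:
            current = {c for o in current
                       for lab, c in children(o) if label_matches(step, lab)}
    return current


print(evaluate(1, "%"), evaluate(1, "#.address"), evaluate(1, "#.n%"))
```

The `seen` set is what keeps `#` from looping forever on cyclic data such as the nearby edges.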

Data Guides

- A structural summary over a data source that is used as a dynamic schema
- Is used in query formulation and optimization
- Is often created a posteriori
- Properties:
  - concise
  - accurate
  - convenient

Data Guides - definitions

Label path: sequence of labels L1.L2. … .Ln

Data path: alternating sequence of labels and oids L1.o1.L2.o2. … .Ln.on

Data path d is an instance of label path l if the sequences of labels are identical in l and d.
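
These definitions translate almost directly into code. The sketch below is illustrative only: the graph is a small hypothetical example, a data path is represented as a tuple alternating labels and oids, and the function names `is_instance` and `instances` are made up for this note.

```python
# Hypothetical source graph: oid -> list of (label, oid) references.
graph = {
    1: [("A", 2), ("B", 3), ("B", 4)],
    2: [("C", 5)], 3: [("C", 6)], 4: [("C", 7)],
    5: [], 6: [], 7: [],
}


def is_instance(data_path, label_path):
    """A data path (L1, o1, L2, o2, ...) is an instance of a label path
    if its labels, taken in order, are exactly the label path."""
    return tuple(data_path[0::2]) == tuple(label_path)


def instances(graph, root, label_path):
    """All data paths starting at root that are instances of label_path."""
    paths = [((), root)]  # (data path so far, oid reached)
    for label in label_path:
        paths = [(dp + (label, child), child)
                 for dp, oid in paths
                 for lab, child in graph[oid] if lab == label]
    return [dp for dp, _ in paths]


bc_instances = instances(graph, 1, ["B", "C"])
print(bc_instances)
```

Here the label path B.C has two data path instances, because two B edges leave the root.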

Data Guides - definitions

A data guide for object s is an object d such that every label path of s has exactly one data path instance in d, and each label path in d is a label path of s.
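
For small acyclic graphs the definition can be checked mechanically. The following is a sketch under assumptions: both graphs are hypothetical, the helper names are invented here, and the enumeration of label paths would not terminate on cyclic data.

```python
# Hypothetical source graph and a candidate guide, both oid -> [(label, oid)].
source = {
    1: [("A", 2), ("B", 3), ("B", 4)],
    2: [("C", 5)], 3: [("C", 6)], 4: [("C", 7)],
    5: [], 6: [], 7: [],
}
guide = {18: [("A", 19), ("B", 19)], 19: [("C", 20)], 20: []}


def label_paths(graph, root):
    """All non-empty label paths from root (assumes an acyclic graph)."""
    found = set()

    def walk(oid, prefix):
        for label, child in graph[oid]:
            found.add(prefix + (label,))
            walk(child, prefix + (label,))

    walk(root, ())
    return found


def count_instances(graph, root, label_path):
    """Number of data paths from root that are instances of label_path."""
    oids = [root]
    for label in label_path:
        oids = [c for o in oids for lab, c in graph[o] if lab == label]
    return len(oids)


def is_data_guide(source, s, guide, d):
    """Same label paths in both, and exactly one instance of each in d."""
    paths = label_paths(source, s)
    return (label_paths(guide, d) == paths
            and all(count_instances(guide, d, p) == 1 for p in paths))


print(is_data_guide(source, 1, guide, 18), is_data_guide(source, 1, source, 1))
```

The source is not a data guide for itself in this example, since the label path B has two data path instances in it.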

Data Guides

A data source can have several data guides

Minimal data guides: the smallest data guides

Data Guides - example

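[Figure: (a) Data model: object 1 has an A edge to 2 and B edges to 3 and 4; objects 2, 3, 4 have C edges to 5, 6, 7; objects 5, 6, 7 have D edges to 8, 9, 10. (c) Minimal Data Guide: object 18 has A and B edges to 19, object 19 has a C edge to 20, and object 20 has a D edge to 21.]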

Minimal Data Guides

- Concise
- May be hard to maintain
  - Example: adding a child node with label E to object 10

Strong Data Guides

Intuitively:

"label paths that reach the same set of objects in the data model = label paths that reach the same objects in the data guide"

Strong Data Guides - definitions

An object o can be reached from s via l if there is a data path of s that is an instance of l and that has o as last oid

(L1.o1.L2.o2. … Ln.o)

The target set for label path l in object s is the set of objects that can be reached from s via l. Notation: T(s,l)

L(s,l): set of label paths of s that have the same target set in s as l.
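
A minimal Python sketch of T(s,l) and L(s,l), assuming small acyclic graphs stored as dictionaries. The two graphs and the function names are hypothetical and only serve to illustrate the definitions.

```python
# Hypothetical graphs: oid -> list of (label, oid) references.
source = {
    1: [("A", 2), ("B", 3), ("B", 4)],
    2: [("C", 5)], 3: [("C", 6)], 4: [("C", 7)],
    5: [], 6: [], 7: [],
}
shared = {18: [("A", 19), ("B", 19)], 19: [("C", 20)], 20: []}


def target_set(graph, s, label_path):
    """T(s, l): the objects reachable from s via label path l."""
    current = {s}
    for label in label_path:
        current = {c for o in current for lab, c in graph[o] if lab == label}
    return frozenset(current)


def label_paths(graph, s):
    """All non-empty label paths of s (assumes an acyclic graph)."""
    found = set()

    def walk(oid, prefix):
        for label, child in graph[oid]:
            found.add(prefix + (label,))
            walk(child, prefix + (label,))

    walk(s, ())
    return found


def same_target_paths(graph, s, label_path):
    """L(s, l): the label paths of s with the same target set as l."""
    wanted = target_set(graph, s, label_path)
    return {p for p in label_paths(graph, s)
            if target_set(graph, s, p) == wanted}


print(target_set(source, 1, ("B", "C")), same_target_paths(shared, 18, ("A",)))
```

In the second graph A and B reach the same object, so they fall into the same set L.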

Strong Data Guides - definitions

Definition: d is a strong data guide for s if for all label paths l of s it holds that L(s,l) = L(d,l)

There is a 1-1-mapping between target sets in the data model and nodes in a strong data guide.
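
The condition L(s,l) = L(d,l) can be tested directly on small examples. The sketch below assumes acyclic graphs and assumes that d is already known to be a data guide for s. The three graphs are hypothetical (a source, a guide that keeps A and B apart, and a guide that merges them), and all function names are made up.

```python
source = {
    1: [("A", 2), ("B", 3), ("B", 4)],
    2: [("C", 5)], 3: [("C", 6)], 4: [("C", 7)],
    5: [], 6: [], 7: [],
}
separate = {11: [("A", 12), ("B", 13)], 12: [("C", 14)], 13: [("C", 15)],
            14: [], 15: []}
merged = {18: [("A", 19), ("B", 19)], 19: [("C", 20)], 20: []}


def label_paths(graph, root):
    """All non-empty label paths from root (assumes an acyclic graph)."""
    found = set()

    def walk(oid, prefix):
        for label, child in graph[oid]:
            found.add(prefix + (label,))
            walk(child, prefix + (label,))

    walk(root, ())
    return found


def target_set(graph, root, path):
    """T(root, path)."""
    current = {root}
    for label in path:
        current = {c for o in current for lab, c in graph[o] if lab == label}
    return frozenset(current)


def same_target_paths(graph, root, path):
    """L(root, path)."""
    wanted = target_set(graph, root, path)
    return {p for p in label_paths(graph, root)
            if target_set(graph, root, p) == wanted}


def is_strong(source, s, guide, d):
    """Check L(s, l) == L(d, l) for every label path l of s."""
    return all(same_target_paths(source, s, l) == same_target_paths(guide, d, l)
               for l in label_paths(source, s))


print(is_strong(source, 1, separate, 11), is_strong(source, 1, merged, 18))
```

The merged guide fails the test because A and B share a target set in the guide but not in the source.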

Data Guides - example

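[Figure: (a) Data model: object 1 has an A edge to 2 and B edges to 3 and 4; objects 2, 3, 4 have C edges to 5, 6, 7; objects 5, 6, 7 have D edges to 8, 9, 10. (b) Strong Data Guide: object 11 has an A edge to 12 and a B edge to 13; objects 12 and 13 have C edges to 14 and 15; objects 14 and 15 have D edges to 16 and 17. (c) Minimal Data Guide: object 18 has A and B edges to 19, object 19 has a C edge to 20, and object 20 has a D edge to 21.]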

Strong Data Guides - algorithm

Implementation:
- Traverse the data model depth-first.
- Each time you find a new target set for label path l, create a new object in the data guide.
- If the target set is already represented in the data guide, do not create a new object, but link to the existing object.
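
The steps above can be sketched in Python as follows. This is one possible reading of the algorithm, not a reference implementation: the source graph is hypothetical, guide objects are simply numbered in creation order, and the name `build_strong_data_guide` is invented for this note.

```python
# Hypothetical source graph: oid -> list of (label, oid) references.
source = {
    1: [("A", 2), ("B", 3), ("B", 4)],
    2: [("C", 5)], 3: [("C", 6)], 4: [("C", 7)],
    5: [("D", 8)], 6: [("D", 9)], 7: [("D", 10)],
    8: [], 9: [], 10: [],
}


def build_strong_data_guide(source, root):
    """Depth-first construction with one guide object per target set."""
    guide = {}     # guide oid -> list of (label, guide oid)
    node_for = {}  # target set (frozenset of source oids) -> guide oid

    def visit(targets):
        node = len(node_for)      # new target set: create a new guide object
        node_for[targets] = node  # registered before recursing (cycle-safe)
        guide[node] = []
        by_label = {}             # label -> source objects reached via it
        for oid in targets:
            for label, child in source[oid]:
                by_label.setdefault(label, set()).add(child)
        for label, kids in sorted(by_label.items()):
            key = frozenset(kids)
            if key not in node_for:
                visit(key)
            # known target set: link to the existing guide object
            guide[node].append((label, node_for[key]))
        return node

    return guide, visit(frozenset([root]))


guide, start = build_strong_data_guide(source, 1)
print(start, guide)
```

Keeping the map from target sets to guide objects is what gives the 1-1 correspondence between target sets and guide nodes. It is also what lets a later change to the source be applied by looking up the affected target set.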

Strong Data Guides - use

- Easier to maintain
- Used as path index for query optimization

Semi-structured data - exercises

Exercise 1

Represent the relations below using the OEM data model.

Restaurants

r_id  name
r1    Hamlet
r2    Normandie
r3    Mc Donald's

Cities

c_id  name
c1    Linkoping
c2    Norkoping

Restaurants & Cities

r_id  c_id  street
r1    c1    Storgatan
r2    c1    St.Larsgatan
r3    c2    Kungsgatan

Exercise 2

Using the data model from the previous question, formulate the following queries using Lorel:

- find all the restaurants that are located in Linkoping
- find the address (city and street) of the "Hamlet" restaurant
- list the restaurants by city (equivalent of GROUP BY)

Exercise 3

Draw the strong Data Guide for the restaurant guide data model below.

[Figure: OEM graph "RestaurantGuide". The root object (Guide) has outgoing edges labeled restaurant, restaurant, and cafe. Edge labels used: category, name, address, street, city, zipcode, contact, manager, reservation, phone, nearby. Atomic values include gourmet, Chef Chu, El Camino Real, Palo Alto, 92310, Saigon, Menlo Park, fast food, Sandra, Rydsvagen, Linkoping, 58435, and the phone numbers 71-72-73, 11-12-13, 31-32-33, 34-35-36.]