Part One XML and Databases Soumen Chakrabarti CSE, IIT Bombay


Part One: XML and Databases

Soumen Chakrabarti

CSE, IIT Bombay


Form and content

• The Web today
  – HTML generated by hand, WYSIWYG editors, ‘webified’ databases
  – HTML specifies rendering for human reading
  – Screen scraping required to consolidate data

• The Web in the future
  – Common interchange format (XML)
  – Concentrate on content, not form
  – Represent data class broader than relations


Role of databases

• Contribute
  – Data storage and indexing
  – Query processing and optimization
  – Views, transformations, integration

• Adopt
  – Search modalities
  – Content-based approximate search
  – Linguistic analysis


Features of semi-structured data

• No explicit schema, or volatile schema

• Schema size comparable to data size

• Structure changes without notice

• Heterogeneous, deeply nested, irregular

• Has nature of documents rather than tables


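Semi-structured data model example: Object Exchange Model (OEM)

[Figure: OEM graph. The root object &o1 is entered by an edge labeled Bib and has paper, book, and paper edges to the complex objects &o12, &o24, and &o29, which are linked to one another by references edges. Other edge labels shown: author, title, year, http, publisher, page, firstname, lastname, first, last. Deeper objects shown include &o43, &96, &243, &206, and &25. Atomic objects hold values such as “Serge”, “Abiteboul”, 1997, “Victor”, “Vianu”, 122, and 133. Legend: complex object, atomic object.]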


Syntax

{ paper: { author: "Abiteboul",
           author: { firstname: "Victor",
                     lastname: "Vianu" },
           title: "Regular path queries …",
           page: { first: 122, last: 133 }
         }
}
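As a rough illustration of the syntax above, an OEM object can be mirrored in Python as a list of (label, value) pairs rather than a dictionary, so that a label such as author may repeat and may point to either an atomic value or a nested object. The helper names `children` and `is_atomic` are assumptions made for this sketch, not part of OEM or Lore.

```python
# An OEM complex object is modeled as a list of (label, value) pairs.
# A value is either atomic (str, int) or another list of pairs.
paper = [
    ("author", "Abiteboul"),                      # atomic author
    ("author", [("firstname", "Victor"),          # complex author
                ("lastname", "Vianu")]),
    ("title", "Regular path queries …"),          # title is elided in the slide
    ("page", [("first", 122), ("last", 133)]),
]
bib = [("paper", paper)]


def is_atomic(value):
    """Atomic objects carry a value; complex objects carry labeled edges."""
    return not isinstance(value, list)


def children(obj, label):
    """Return every child reachable from obj over an edge with this label."""
    if is_atomic(obj):
        return []
    return [value for (lab, value) in obj if lab == label]


authors = children(paper, "author")
last_page = children(children(paper, "page")[0], "last")
```

Using pairs instead of a dictionary keeps both author edges, which is the point of the "multiple attributes" and "different types in different objects" observations.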


Some observations

• Missing or additional attributes

• Multiple attributes

• Different types in different objects

• Heterogeneous collections


Object ID’s and references

<person id="o555"><name>Jane</name></person>

<person id="o456"><name>Mary</name><children idref="o123 o555"/></person>

<person id="o123" mother="o456"><name>John</name></person>

[Figure: graph with nodes o555, o456, and o123; o456 has two children edges (to o123 and o555), and o123 has a mother edge to o456.]
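The references in the fragment above can be followed with a small lookup table from id to element. This is only a sketch using Python's standard `xml.etree.ElementTree`; the wrapping `<people>` root element and the helper names are assumptions added so that the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# The three person elements from the slide, wrapped in an assumed root.
DOC = """<people>
<person id="o555"><name>Jane</name></person>
<person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
<person id="o123" mother="o456"><name>John</name></person>
</people>"""

root = ET.fromstring(DOC)
# Map each object id to its element so references can be dereferenced.
by_id = {p.get("id"): p for p in root.iter("person")}


def name_of(oid):
    return by_id[oid].findtext("name").strip()


def children_of(oid):
    """Follow the space-separated idref list on each children element."""
    names = []
    for child in by_id[oid].findall("children"):
        names.extend(name_of(ref) for ref in child.get("idref", "").split())
    return names


def mother_of(oid):
    """Follow the mother attribute, if the person has one."""
    ref = by_id[oid].get("mother")
    return name_of(ref) if ref is not None else None
```

With references resolved this way the document behaves as a graph rather than a tree, matching the three-node figure.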


Names and acronyms

• OEM (Object Exchange Model): a semi-structured data model from Stanford, 1995

• Lore: a system for storing data adhering to the OEM

• Lorel: a query language for Lore

• XML (eXtensible Markup Language): a simplification of SGML and a generalization of HTML

• XML-QL: Query language for XML


Lorel query examples

select Bib.paper.title
from Bib.paper
where Bib.paper.year > 1995

select X.title
from Bib.paper X, Bib.(paper|book) Y
where Y.author.lastname? = "Ullman" and Y.reference+ X

• (paper|book): alternative
• lastname?: navigating partially known structures
• reference+: transitive closure
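The three path-expression features in the second query can be sketched over a toy edge-labeled graph. Everything here is hypothetical: the node names, titles, and the helper functions `step`, `optional`, and `plus` are illustrations, not Lore's implementation, and `?` is read as "zero or one occurrence of the edge".

```python
# Toy OEM-style graph: node -> list of (label, target). All data is made up.
GRAPH = {
    "bib": [("paper", "p1"), ("book", "b1"), ("paper", "p2")],
    "b1": [("author", "a1"), ("reference", "p1"), ("title", "t0")],
    "p1": [("title", "t1"), ("reference", "p2")],
    "p2": [("title", "t2")],
    "a1": [("lastname", "l1")],
}
VALUES = {"t0": "Book Zero", "t1": "Paper One", "t2": "Paper Two", "l1": "Ullman"}


def step(nodes, *labels):
    """Follow one edge whose label is any of the alternatives, e.g. (paper|book)."""
    return {dst for n in nodes for (lab, dst) in GRAPH.get(n, []) if lab in labels}


def optional(nodes, label):
    """label? : stay on the node or follow the edge once."""
    return set(nodes) | step(nodes, label)


def plus(nodes, label):
    """label+ : follow the edge one or more times (transitive closure)."""
    reached, frontier = set(), step(nodes, label)
    while frontier:
        reached |= frontier
        frontier = step(frontier, label) - reached
    return reached


def titles_referenced_by(lastname):
    """Titles of papers X reachable by reference+ from a paper or book Y by lastname."""
    papers = step({"bib"}, "paper")
    titles = set()
    for y in step({"bib"}, "paper", "book"):
        names = optional(step({y}, "author"), "lastname")
        if any(VALUES.get(n) == lastname for n in names):
            for x in plus({y}, "reference") & papers:
                titles |= {VALUES[t] for t in step({x}, "title")}
    return titles
```

Because `plus` tracks the set of nodes already reached, it terminates even when the reference edges form a cycle.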


XML-QL query examples

where <book language="french">
        <publisher><name>Morgan Kaufmann</name></publisher>
        <author> $a </author>
      </book> in "www.a.b.c/bib.xml"
construct $a

where <book language=$l>
        <author> $a </>
      </> in "www.a.b.c/bib.xml"
construct <result><author>$a</><lang>$l</></>
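For readers without an XML-QL engine, the two queries above can be approximated with Python's standard `xml.etree.ElementTree`. The contents of the bibliography below are invented for illustration (the slide only names the file), and the function names are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of bib.xml.
BIB = """<bib>
  <book language="french">
    <publisher><name>Morgan Kaufmann</name></publisher>
    <author>A. Author</author>
  </book>
  <book language="english">
    <publisher><name>Other Press</name></publisher>
    <author>B. Writer</author>
  </book>
</bib>"""

root = ET.fromstring(BIB)


def french_mk_authors(root):
    """First query: bind $a for French books published by Morgan Kaufmann."""
    authors = []
    for book in root.iter("book"):
        if book.get("language") != "french":
            continue
        publishers = [n.text for n in book.findall("publisher/name")]
        if "Morgan Kaufmann" in publishers:
            authors.extend(a.text.strip() for a in book.findall("author"))
    return authors


def author_language_pairs(root):
    """Second query: bind $a and $l together and construct (author, lang) results."""
    return [
        (a.text.strip(), book.get("language"))
        for book in root.iter("book")
        if book.get("language") is not None
        for a in book.findall("author")
    ]
```

The where clause corresponds to the pattern matching inside the loops, and the construct clause corresponds to the values collected into the returned lists.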


XML storage in ternary relation

[Figure: OEM graph. &o1 has a paper edge to &o2; &o2 has a title edge to &o3 ("The Calculus"), author edges to &o4 ("…") and &o5 ("…"), and a year edge to &o6 ("1986").]

Ref
Source  Label   Dest
&o1     paper   &o2
&o2     title   &o3
&o2     author  &o4
&o2     author  &o5
&o2     year    &o6

Val
Node  Value
&o3   The Calculus
&o4   …
&o5   …
&o6   1986

• Too many joins

• Label name storage redundant
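The "too many joins" point can be seen by loading the Ref and Val tables into SQLite and asking for the title of every paper from 1986. This is a sketch using Python's standard `sqlite3` module; the column names are lowercased versions of the slide's headers, and the author values stay elided as in the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Ref (source TEXT, label TEXT, dest TEXT);
CREATE TABLE Val (node TEXT, value TEXT);
""")
conn.executemany("INSERT INTO Ref VALUES (?, ?, ?)", [
    ("&o1", "paper", "&o2"),
    ("&o2", "title", "&o3"),
    ("&o2", "author", "&o4"),
    ("&o2", "author", "&o5"),
    ("&o2", "year", "&o6"),
])
conn.executemany("INSERT INTO Val VALUES (?, ?)", [
    ("&o3", "The Calculus"),
    ("&o4", "…"),   # elided in the slide
    ("&o5", "…"),   # elided in the slide
    ("&o6", "1986"),
])

# One path step costs one self-join on Ref; each value test adds a join to Val.
QUERY = """
SELECT vt.value
FROM Ref p
JOIN Ref t  ON t.source = p.dest AND t.label = 'title'
JOIN Val vt ON vt.node = t.dest
JOIN Ref y  ON y.source = p.dest AND y.label = 'year'
JOIN Val vy ON vy.node = y.dest
WHERE p.label = 'paper' AND vy.value = '1986'
"""
titles = [row[0] for row in conn.execute(QUERY)]
```

Even this two-attribute lookup needs four joins, and every row of Ref repeats a label string, which is the redundancy the second bullet refers to.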


Storage optimization through mining

[Figure: four paper objects with author, title, and year children; four of the five author objects have fn and ln children.]

Paper1
author  title
X       X

Paper2
fn1  ln1  fn2  ln2  title  year
X    X    X    X    X      -
X    X    -    -    X      X
X    X    -    -    X      -

• Inline common cases

• Tolerate a few nulls


Schema extraction

• Schema: a template for type/semantics specification
• Conformance
  – Does the data conform to a given schema?
• Classification
  – If so, which objects belong to what classes/types?
• Applications
  – Storage and query optimization


Graph simulation

Given two edge-labeled graphs G1 and G2, a simulation is a relation R between nodes such that if (x1, x2) is in R, and (x1, a, y1) is in G1, then there exists (x2, a, y2) in G2 (same label) such that (y1, y2) is in R.

[Figure: x1 in G1 and x2 in G2 are related by R; edges labeled a lead from x1 to y1 and from x2 to y2, and y1 and y2 are also related by R.]
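One straightforward way to compute the largest simulation that satisfies this definition is to start from all node pairs and repeatedly discard pairs that violate the condition. The sketch below does this naively; the function name and the two small graphs (a data graph and a schema graph) are assumptions made for illustration, and faster algorithms exist.

```python
def maximal_simulation(g1, g2):
    """g1, g2: sets of (source, label, dest) edges. Returns the largest R."""
    nodes1 = {n for (s, _, d) in g1 for n in (s, d)}
    nodes2 = {n for (s, _, d) in g2 for n in (s, d)}
    rel = {(x1, x2) for x1 in nodes1 for x2 in nodes2}
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            for (s1, a, y1) in g1:
                if s1 != x1:
                    continue
                # x2 must have an edge with the same label to some y2 with (y1, y2) in R.
                matched = any(
                    s2 == x2 and b == a and (y1, y2) in rel
                    for (s2, b, y2) in g2
                )
                if not matched:
                    rel.discard((x1, x2))
                    changed = True
                    break
    return rel


# Hypothetical data graph and schema graph.
DATA = {
    ("r", "employee", "p1"), ("r", "employee", "p2"),
    ("p1", "manages", "p2"), ("p2", "managedby", "p1"),
}
SCHEMA = {
    ("Root", "employee", "Boss"), ("Root", "employee", "Regular"),
    ("Boss", "manages", "Regular"), ("Regular", "managedby", "Boss"),
}
R = maximal_simulation(DATA, SCHEMA)
```

Here R relates each data node to the schema node that can mimic all of its outgoing edges, so checking whether a pair such as ("p1", "Boss") is in R classifies p1.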


Upper and lower bound schema

• Lower bound schema
  – Conformance: find simulation R from S to D
  – Classification: check if (c, x) in R
  – Used in storage optimization
• Upper bound schema (data guides)
  – Conformance: find simulation R from D to S
  – Classification: check if (x, c) in R
  – Used in path index generation and query optimization


Sample data

[Figure: sample data graph. The root &r has a company edge to &c and employee edges to &p1 through &p8; worksfor edges lead from the employees to &c; manages and managedby edges link employees to one another.]


Lower bound schema

[Figure: schema graph with four classes]
• Root: &r
• Bosses: &p1, &p4, &p6
• Regulars: &p2, &p3, &p5, &p7, &p8
• Company: &c

Edge labels shown: company, employee, manages, managedby, worksfor


Storage using lower bound schema

[Figure: lower-bound schema with nodes Root, Company, Employee, and string; edge labels company, person, works-for, c.e.o., address, name, managed-by.]

Employee
oid  name  managed-by  works-for
…    …     …           …
…    …     …           …

• Store rest in overflow graph


Upper bound schema (DataGuides)

[Figure: DataGuide graph with five classes]
• Root: &r
• Employees: &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8
• Bosses: &p1, &p4, &p6
• Regulars: &p2, &p3, &p5, &p7, &p8
• Company: &c

Edge labels shown: company, employee, manages, managedby, worksfor


Query optimization issues

Select x from A.B x where exists y in x.C: y=5

[Figure: three example trees rooted at A, with child edges labeled B and D, grandchild edges labeled C, and leaf values 4 and 5.]


What makes the problem difficult

• Selectivity estimation

• Index selection

• Access cost models

• Clustering choices


Part Two: Information Retrieval and Databases

Soumen Chakrabarti

CSE, IIT Bombay


Information retrieval (IR)

• Search
  – ‘Inverted’ index
  – Boolean match
  – Relevance ranking
• Classification
  – Learn topics from examples
• Clustering
  – Discover topics from a document collection
• Never done inside a relational database

[Figure: inverted index with positional postings]
cat → D5: 3, 37, 50; D7: 9, 20
dog → D7: 7, 90, 400; D20: 22, 533
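A positional inverted index of the kind pictured above maps each word to the documents containing it and the word offsets inside each document. The sketch below builds one and answers a Boolean AND query; the document texts are made up (only the IDs D5, D7, and D20 come from the slide), and the function names are assumptions.

```python
from collections import defaultdict


def build_index(docs):
    """Map word -> {doc_id: [positions]} for whitespace-tokenized text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, *terms):
    """Return the documents that contain every query term."""
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical document collection.
DOCS = {"D5": "the cat sat", "D7": "cat chases dog", "D20": "dog sleeps"}
INDEX = build_index(DOCS)
```

Relevance ranking would add a score per matching document on top of this Boolean match, for example based on term frequency.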


Current style of loose integration

• RDBMS provides hooks

• Declare some columns as textual with keyword index

• Inserts, updates, and deletes trigger external program, e.g., Verity search engine

• Search engine maintains separate indices

• Simple query rewriting to combine relational and text-match where-clauses


Reasons

• Space
  – BLOB vs. pure relational representation
  – Average English word is only 5 bytes
• Time
  – Most text engines are resigned to flexible (i.e., no) model for data consistency
  – Much faster read-only access than relational database lookups


New features desired

• Operations that are more complex than keyword search can benefit from tighter coupling with RDBMS
• Approximate search is essential (Anand Rajaraman, Amazon.com, SIGMOD 99)
  – Misspelling book title, author name common
  – Variant of OEM edge label (author/writer/poet)
• Similarity extends to structure as well (‘Travolta’ NEAR ‘Cage’ = ‘Face/Off’)


Case study: generalized ‘like’

• SQL has limited string matching constructs
  – like ‘%x’, ‘x%’, ‘%x%’
  – x must be exact match
• Need more lenient match
  – Applications: LDAP, IR
• String edit distance is not suitable
  – “Given query, order strings in database in increasing order of edit distance and pick top 5”


Sliding-window matching

• Given a query, scan to get a set of 3-grams
• Similarity of string in database to query = number of shared 3-grams

[Figure: 3-grams of the example strings]
nascent: nas asc sce cen ent
pascal: pas asc sca cal
rascal: ras asc sca cal
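The similarity measure described above can be sketched in a few lines: slide a window of width 3 over each string, collect the distinct 3-grams, and count how many the query shares with each stored string. The function names and the ranking helper are assumptions of this sketch; a database implementation would keep the 3-grams in an index rather than recompute them per query.

```python
def trigrams(s, n=3):
    """Return the set of distinct n-grams obtained by sliding a window over s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def similarity(a, b):
    """Number of 3-grams shared by the two strings."""
    return len(trigrams(a) & trigrams(b))


def rank(query, strings):
    """Order stored strings by decreasing number of shared 3-grams with the query."""
    scored = [(s, similarity(query, s)) for s in strings]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank("rascal", ["nascent", "pascal"])
```

For the query rascal, pascal shares asc, sca, and cal while nascent shares only asc, so pascal ranks first.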


Issues

• Minimally disruptive architecture

• Low storage overheads

• Fast query processing

• Good selectivity estimates

• Combining with other predicates for ranking

• Efficiently handling updates