58
Lore: A Database Management System for Semistructured Data

Lore: A Database Management System for Semistructured Data

  • View
    217

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Lore: A Database Management System for Semistructured Data

Lore: A Database Management System for Semistructured Data

Page 2: Lore: A Database Management System for Semistructured Data

LORE

• Lore : Lightweight Object Repository

• http://WWW-DB.Stanford.EDU/lore/demo/

Page 3: Lore: A Database Management System for Semistructured Data

Lore - motivation• The data may be irregular and thus not conform

to a rigid schema.

• Relational data model has null values, and OO models have inheritance and complex objects. Both have difficulties in designing schemas to incorporate irregular data.

• It may be difficult to decide in advance on a single, correct schema, The structure of the data may evolve rapidly, data elements may change types, or data not conforming to previous structure may be added.

Page 4: Lore: A Database Management System for Semistructured Data

• Thus, there is a need for management of

semi-structured data!• Lore system manages semi-structured data. The

data managed by Lore is not confined to a schema and it may be irregular or incomplete.

• OEM is the Lore’s data model. OEM - object Exchange Model - graph based self-describing object instance model where nodes are objects and edges are labeled with attribute names and leaf nodes have atomic values

• Lore is light weight object repository and Lorel is Lore’s query language.

Page 5: Lore: A Database Management System for Semistructured Data

• Path expression queries

• Automatic type coercion

• Use of Data guides – Structural summary of the database

Page 6: Lore: A Database Management System for Semistructured Data

• Objects with Oid• Atomic objects – no outgoing edge and contain a

value from one of the basic atomic types such as Integer, real, string, gif, java, audio etc.

• Complex objects – may have outgoing edges• Names – Sp. Labels that serves as aliases for

objects and as entry points to the database.

Page 7: Lore: A Database Management System for Semistructured Data

Object Exchange Model - OEM• Motivation - information exchange and extraction

•Each value exchanged is given an explicit label.

Object temp-in-Fahrenheit, integer, 80 - “temp-in-Fahrenheit” is the label. Each object is self-describing, with a label, type and value.

set-of-temps, set, {cmpnt1, cmpnt2} cmpnt1 is temp-in-Fahrenheit, integer, 80 cmpnt2 is temp-in-Celsius, integer, 20

Page 8: Lore: A Database Management System for Semistructured Data

Labels• Plays two roles

– identifying an object (component)– identifying the meaning of an object (component)

• Person-name both identifies cmpnt1 and coveys its meaning.

person-record, set, {cmpnt1, cmpnt2, cmpnt3} cmpnt1 is person-name, string, ``Fred’’ cmpnt2 is office-num-in-bldg-5, integer, 333 cmpnt3 is department, string, ``toy’’

• In relational data this corresponds to ….

Page 9: Lore: A Database Management System for Semistructured Data

Labels - Issues• What does the label mean?

– Database of labels – Ontology of labels - within each source

•Labels are relative (more specific) to the source of the data object. •Similar labels from different sources need to be resolved.

• Labels provide the flexibility in representing

object structure

Page 10: Lore: A Database Management System for Semistructured Data

Self-describing data models• Have been in existence for a long time? Why

additional interest now?

•Use the ``nature’’ of self-describing data model for information exchange, and to extend the model to include object nesting.

•To provide an appropriate object request language (query facility)

Page 11: Lore: A Database Management System for Semistructured Data

OEM - Specification• Each object in OEM has the following structure:

– Label: A variable character string describing what the object represents.

– Type: The data type of the object’s value. Each is either an atom type, or type set.

– Value: A variable-length value of the object.– Object-ID: A unique variable-length identifier

for the object or null.

Label Type Value Object-ID

Page 12: Lore: A Database Management System for Semistructured Data

OEM - Summary• OEM is an information exchange model. It does not specify how

objects are stored at source.

• OEM does specify how objects are received at a client, but after objects are received they can be stored in any way the client likes.

• Note the schema-less nature of OEM is particularly useful when a client does not know in advance the labels or structure of OEM objects.

• Each source has a distinguished object with lexical identifier ``root’’.

Page 13: Lore: A Database Management System for Semistructured Data

Example• <biblio,set,{doc1,doc2,…,docn}>• doc1 is <doc, set, {auths1, topic1, call-

no1}>• auths1 is <auth-set,set {auth11}>• auth11 is <auth-ln, string, ``Ullman’’>• topic1 is <topic, string,``Databases’’>• call-no1 is <internal-call-no, integer, 25>• doc2 is <doc, set, {auths2, topic2, call-

no2}>• auths2 is <auth-set,set {auth21, auth22,

auth23}>• auth21 is <auth-ln, string, ``Aho’’> • auth22 is <auth-ln, string,

``Hopcroft’’> • auth23 is <auth-ln, string, ``Ullman’’> •

• topic2 is <topic, string,``Algorithms’’>

• call-no1 is <dewey-decimal, string, ``BR273’’>

• docn is <doc, set, {authsn, topicn, call-non}>

• authsn is <auth,string, ``Crichton’’>

• topic1 is <topic, string,``Dinosaurs’’>

• call-no1 is <fictional-call-no, integer, 95>

• biblio is the root object.

Page 14: Lore: A Database Management System for Semistructured Data

OEM - QLSELECT Fetch-expression

FROM Object

WHERE Condition

• The result of this query is itself an object, with special label ``answer’’:

answer, set, {obj1, obj2, …, objn}

• Each returned obji is a component of object specified in the From clause of the query, where the component is located by the Fetch-expression and satisfies the Condition.

Page 15: Lore: A Database Management System for Semistructured Data

Path• The notion of path is used in both Fetch-Expression in

the Select clause and the condition in the Where clause.

• Path describes traversals through an object using subobject structure and labels.

• Example: ``biblio.doc.auth’’

• Paths are used in Fetch-Expression to specify which components are are returned in the answer object.

• Paths are used in the condition to qualify the fetched objects or other (related) components in the same object structure.

Page 16: Lore: A Database Management System for Semistructured Data

Queries - Simple• Retrieve the topic of each document for which

``Ullman’’ is one of the authors:SELECT biblio.doc.topic

FROM root

WHERE biblio.doc.auth-set.auth-ln = ``Ullman’’

• Intuitively, the query’s where clause finds all paths through subobject structure with the sequence of labels [biblio,doc,auth-set,auth-ln] such that the object at the end of the path has value ``Ullman.’’

<answer, set, {obj1, obj2}>

obj1 is <topic, string, ``Databases’’>

obj2 is <topic, string, “Algorithms”>

Page 17: Lore: A Database Management System for Semistructured Data

Queries - ``wild-cards’’• Retrieve all documents with internal call number:

SELECT biblio.?.topic

FROM root

WHERE biblio.?.internal-call-no

• ``?’’ label matches any label. For this query, the doc labels can be replaced by any other strings and query would produce the same result. By convention, two occurrences of ? In the same query must match the same label unless variables are used.

<answer, set, {obj1}>

obj1 is <topic, string, ``Databases’’>

Page 18: Lore: A Database Management System for Semistructured Data

Queries - ``wild-paths’’• Retrieve all documents with internal call number:

SELECT *.topic

FROM root

WHERE *.internal-call-no

• Symbol ``*’’ matches any path of length one or more. The use of * followed by a single label is a convenient and common way to locate objects with a certain label in complex structure. Similar to ?, two occurrences of * in the same query must match the same sequence of labels, unless variables are used.

<answer, set, {obj1}>

obj1 is <topic, string, ``Databases’’>

Page 19: Lore: A Database Management System for Semistructured Data

Queries - variables• Retrieve each document for which both ``Hopcroft’’

and ``Aho’’ are co-authors:SELECT biblio.doc

FROM root

WHERE biblio.doc.auth-set.auth-ln(a1)=``Aho’’ and

biblio.doc.auth-set.auth-ln(a1)=``Hopcroft’’

• Here, the query finds all the paths with structure [biblio, doc, auth-set], and with two distinct path completions with label auth with values ``Aho’’ and ``Hopcroft’’

<answer, set, {obj1}>

obj1 is the complete doc2

Page 20: Lore: A Database Management System for Semistructured Data

OEM (Cont.)

• Examples:– Object &3 is complex, and its subobjects are

&8, &9, &10, and &11.– Object &7 is atomic and has value “Clark”.

• DBGroup is a name that denotes object &1.(Names are entry points into the database).

Page 21: Lore: A Database Management System for Semistructured Data

DBGroup

&1&1

&2&2 &3&3 &4&4 &5&5 &6&6

&11&11&8&8 &9&9 &10&10 &12&12 &13&13 &14&14&7&7 &15&15 &16&16

&17&17 &18&18 &19&19 &20&20

Member

Member

MemberMember

OfficeAge

Name Project

ProjectProject

Project

Name

Name AgeOffice

Office

RoomBuilding Building Room

“Clark” “Smith”

46 “Gates 252” “Lore” “Tsimmis”

“Jones” 28

An OEM Database

“CIS” “411” “CIS” 252

Page 22: Lore: A Database Management System for Semistructured Data

Lorel Queries - Simple Path Expression

•Retrieve the offices of members with age greater than 30 years:

Query SELECT DBGroup.Member.OfficeWHERE DBGroup.Member.Age > 30

Result Office “Gates 252”

Office

Building “CIS”

Room “411”

Page 23: Lore: A Database Management System for Semistructured Data

Lorel Query Rewrite

• Previous query rewritten to:– select O

from DBGroup.Member M, M.Office Owhere exists y in M.Age : y < 30

• Comparison on age transformed to existential condition.– Since all properties are set-valued in OEM.– A user can ask DBGroup.Member.Age < 30

regardless of whether Age is single valued, set valued, or unknown.

Page 24: Lore: A Database Management System for Semistructured Data

Lorel Query Rewrite

• Why?– Breaking query into simple path expressions

necessary for query optimization.– Need to explicitly handle coercion.

• Atomic objects and values. 0.5 < “0.9” should return true

• Comparing objects and sets of objects. DBGroup.Member.Age is a set of objects.

Page 25: Lore: A Database Management System for Semistructured Data

Queries - General Path Expression

Query SELECT DBGroup.Member.NameWHERE DBGroup.Member.Office(.Room

%|.Cubicle)?

Like “%252”

Result Name “Jones”

Name “Smith”

•Room% matches all labels starting from Room, like Room68. “|” stands for disjunction. “?” indicates that the label pattern is optional. “like %252” specifies that the data value should end with string “252”.

Page 26: Lore: A Database Management System for Semistructured Data

Queries - SubQueriesRetrieve Lore project members who work on other

projects

Query SELECT M.Name,

( SELECT M.Project.Title

WHERE M.Project.Title != “Lore”)

FROM DBGroup.Member M

WHERE M.Project.Title = “Lore”

Result Member

Name “Jones”

Title “Tsimmis”

Page 27: Lore: A Database Management System for Semistructured Data

Data Guides• A DataGuide is a concise and accurate summary

of the structure of an OEM database (stored as OEM database itself, kind of like the system catalog).

• Why?– No explicit schema, how do we formulate meaningful

queries?

– Large databases (can’t just view graph structure).

– What if a path expression doesn’t exist (waste).

• Each possible path expression is encoded once.

Page 28: Lore: A Database Management System for Semistructured Data

{9, 13}

Page 29: Lore: A Database Management System for Semistructured Data

DataGuides As Histograms

• Each object in the dataguide can have a link to its corresponding target set.– A target set is a set of oids reachable by that

path.• TS of DBGroup.Member.Age is {9, 13}.

– This is a path index. Can find set of objects reachable by a particular path.

– Can store statistics in DataGuide • For example, the # of atomic objects of each type

reachable by p.

Page 30: Lore: A Database Management System for Semistructured Data

Conclusions• Takes advantage of the structure where it exists.• Handles lack of structure well (data type coercion,

general path expressions).• Query language allows users to get and update

data from semistructured sources.– DataGuide allows users to determine what paths exist,

and gives useful statistical information

• Lore does facilitate query and updates on semi-structural databases

Page 31: Lore: A Database Management System for Semistructured Data

OEM vs. XML

• OEM’s objects correspond to elements in XML• Sub-elements in XML are inherently ordered.• XML elements may optionally include a list of

attribute value pairs.• Graph structure for multiple incoming edges

specified in XML with references (ID, IDREF attributes). i.e. the Project attribute.

Page 32: Lore: A Database Management System for Semistructured Data

OEM to XML• Example:

– <Member project=“&5 &6”><name>Jones</name><age>46</age><office>

<building>gates</building><room>252</room>

</office></member>

• This corresponds to rightmost member in the example OEM, where project is an attribute.

Page 33: Lore: A Database Management System for Semistructured Data

Query Optimization for Semistructured Data

Page 34: Lore: A Database Management System for Semistructured Data
Page 35: Lore: A Database Management System for Semistructured Data

External Data Manager

• Enables retrieval of information from other data sources, transparent to the user.

• An external object in Lore is a “placeholder” for the external data and specifies how lore interacts with an external data source.

• The spec for an external object includes:– Location of a wrapper program to fetch and convert

data to OEM, time interval until fetched information becomes stale, and a set of arguments used to limit info fetched from external source.

Page 36: Lore: A Database Management System for Semistructured Data

Layer 0

• Access to Lore – API, Applications

• Parser – textual query as input and parse tree as output

• Preprocessor – input parse tree to OQL like query

• Query plan and query optimizer (index etc)

Page 37: Lore: A Database Management System for Semistructured Data

Layer 1

• Object manager – translation layer between OEM to lower level files, compare objects,

• Query operator – execute query plan, perform simple coercion, iterating over subobjects of a complex object

Page 38: Lore: A Database Management System for Semistructured Data

QUERY select M.Name, (select M.Project.Title where M.Project.Title != "Lore" ) from DBGroup.Member M where M.Project.Title = "Lore"

RESULT Member

Name "Jones" Title "Tsimmis"

Page 39: Lore: A Database Management System for Semistructured Data

update P.Member += ( select DBGroup.Member where DBGroup.Member.Name = "Clark" )

from DBGroup.Project P where P.Title = "Lore" or

P.Title = "Tsimmis"

P.member += specifies to add member edges between P and every object returned by subquery.

Page 40: Lore: A Database Management System for Semistructured Data

Query Execution Plan

• Iterative approach in query processing

• Execution begins at the top of the query plan, with each node in the plan requesting a tuple at a time from its children and performing some operation on the tuple.

• After a node completes its operation, it passes a resulting tuple to its parent.

Page 41: Lore: A Database Management System for Semistructured Data

• Object assignment – OA is a simple data structures containing slots corresponding to range variables in the query and some additional slots

• Each slot within an OA will hold the oid of a vertex on a data path currently being considered by the query engine

• Example – OA1 holds oid for member “smith”, then OA2 and OA3 can hold the oids for one of smith’s office subobjects and one of his age subobjects resp.

Page 42: Lore: A Database Management System for Semistructured Data

Query Operators• Each operator takes a number of arguments with

the last argument being OA slot that will contain the result of the operation.

• Select , project has no target slot.• Scan returns all oids that are subobjects of a given

object• Scan (starting OA slot, path exp., Target OA slot)• Scan until no subobjects that satisfies path

expression• Scan (OA1, “office”, OA2) : place into slot OA2

one at a time all office subobjects appearing in slot OA1.

Page 43: Lore: A Database Management System for Semistructured Data

• Join/select/project – nearly identical to RDBMS• Project is to limit which objects should be

returned• Select applied predicate to the object identified by

the oid in the OA slot specified• Aggregate – implements quantification and

aggregation node calls its child exhaustively , storing the results temp. or computing aggregation

• A new object is created when no more valid OAs•

Page 44: Lore: A Database Management System for Semistructured Data

select O from DBGroup.Member M, M.Office Owhere exists A in M.Age : A > 30

Page 45: Lore: A Database Management System for Semistructured Data
Page 46: Lore: A Database Management System for Semistructured Data

Primary Operators• Setop, ArithOp, Create Set and Groupby• Setop handles union, intersect,except• Arithop – addition, multiplication • Createset – is to package the results of an arbitrary

subquery before proceeding: its called its child exhaustively , storing each oid returned as part of a newly created complex object.

• It stores oid for the new set of objects within target slot

• Group by handles subquery that includes a groupby expression

Page 47: Lore: A Database Management System for Semistructured Data

select M.Name, count(M.Publication) from DBGroup.Member M where M.Dept = "CS"

select (select N from M.Name N), count(select P

from M.Publication P) from DBGroup.Member M where exists D in M.Dept : D = "CS"

Page 48: Lore: A Database Management System for Semistructured Data
Page 49: Lore: A Database Management System for Semistructured Data

Indexing

• Value Index – vindex

• Lindex – a link (edge) index

• Lindex – (oid, label)

and returns the oid’s of all parents via the specified label (provides parent pointers)

• Vindex (label,operator,value) – returns all atomic objects having an incoming edge with the specified label and a value satisfying the specified operator (<)

Page 50: Lore: A Database Management System for Semistructured Data

Index Query Plan

– Bottem – Up

– Locate all objects with desired values and appropriately labeled incoming edges via vindex

– Using Lindex then traverse up from these objects to match the path exp

Page 51: Lore: A Database Management System for Semistructured Data

New Query Operators

1. Vindex

2. Lindex

3. Once

4. Named-obj

Page 52: Lore: A Database Management System for Semistructured Data

Indexes

• Vindex(l, op, value, x) places into x all atomic objects that satisfy the “op value” condition with an incoming edge labeled l.– Vindex(“Age”, <, 30,y) places into y objects with

age < 30.

• Lindex(x, l, y) places into x all objects that are parents of y via edge labeled l.– Lindex(x, “Age”, y) places into x all parents of y

via label “Age”.

Page 53: Lore: A Database Management System for Semistructured Data

Indexes (cont.)

• Bindex(l, x, y) finds all parent-child object pairs connected by a label l.– Bindex(“Age”, x, y) locates all parent-child pairs

with label Age.

• Pindex(PathExpression, x) placed into x all objects reachable via the path expression.– Pindex(“A.B x, x.C y”, y) places into y all

objects reachable by going from A to B to C.– Uses DataGuide.

Page 54: Lore: A Database Management System for Semistructured Data

• Vindex – as B+ tress

• Lindex – Linear Hashing

• String Vindex – index entries for all string based atomic values

• Real Vindex – index entries for all numeric based atomic values

• String-coerced-to real VindexContains all strings values that can be coerced

into an integer or real

Page 55: Lore: A Database Management System for Semistructured Data

• If the value is of type string then– Lookup in string index– If the value can be coerced to a real then lookup

in the coerced value in real vindex

• If the value is of type real (integer) then – Lookup in real vindex– Also lookup in the string-coerced to real vindex

Page 56: Lore: A Database Management System for Semistructured Data
Page 57: Lore: A Database Management System for Semistructured Data

Simple Query

• select Ofrom DBGroup.Member M, M.Office Owhere exists y in M.Age : y < 30

• Possible plans:– Top-down (similar to pointer-chasing, nested-loops

join)– Use Vindex to check y < 30, traverse backwards from

child to parent using Lindex(bottom-up).

– Hybrid, both top down and bottom up. Meet in middle.

Page 58: Lore: A Database Management System for Semistructured Data

Coercion for Basic op.

Arg2 String Real Int

Arg1

String - Stirng Real both Real

Real Stirng Real - Int Real

Int both Real Int Real -