IS432 Semi-Structured Data Lecture 1: SSD & XML Dr. Gamal Al-Shorbagy


In this lecture

• What semi-structured data is
– Why we need it
– How it is represented and processed
– Related technologies

• What XML is
– XML syntax
– XML Query data model
– Comparison of XML with semi-structured data

Papers:
– "XML, Java, and the Future of the Web" by Jon Bosak, Sun Microsystems.
– "W3C XML Query Data Model" by Mary Fernandez and Jonathan Robie.

The Data

A document/page for a common user
• Type of data? Difficult to identify.
• Is there any order? No particular format or sequence.
• Does it follow any rules? Can we predict anything about the data?

Management and Representation
• Unmanageable by nature
• Often found as text, video, sound, images

Query and Search
• Brute force: finding a needle in a haystack.

The data

A table for organizations
• Data follows a certain model, e.g. relational: entities, same attributes, order and relations
• Schema/data separation: first the schema, then the data
• Data elements are strongly "typed" and "ordered"
• Corporate ownership

Management and Representation
• Specialized DBMS engine: management, storage, query formulation
• Represented as entity/tuples, class/objects

Query and Search
• Optimized via indexes, trees, …

[Figure: a group of tables]

The data

• name: Some Body
• email: [email protected], [email protected]
------------------------------------------------------------------
• name:
  • first name: Ceaser
  • last name: The Great
• email: [email protected]
------------------------------------------------------------------
• name: Ranjeet Singh
• affiliation: Punjab
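The three records above can be sketched as plain Python dictionaries to show what "irregular" means in practice: name is sometimes a string and sometimes nested, email is single-valued, multi-valued, or missing. The e-mail values and the helper emails are hypothetical placeholders, not taken from the slide.

```python
# Three "person" records with irregular structure, mirroring the slide.
# The e-mail addresses are hypothetical placeholders.
people = [
    {"name": "Some Body", "email": ["a@example.org", "b@example.org"]},
    {"name": {"first name": "Ceaser", "last name": "The Great"},
     "email": "c@example.org"},
    {"name": "Ranjeet Singh", "affiliation": "Punjab"},
]


def emails(person):
    """Return the e-mail addresses as a list, whatever shape the record has."""
    value = person.get("email", [])                        # attribute may be missing
    return value if isinstance(value, list) else [value]   # single- or multi-valued


email_counts = [len(emails(p)) for p in people]
print(email_counts)
```

No schema had to be declared up front; the price is that every query has to cope with missing and differently shaped attributes.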

the Data

A graph/web for advanced users
• Structure and data (mixture)
• Schema-less, self-describing (prescription vs. description)
• Schema may evolve over time
• Schema may be larger than the data itself
• Irregular, incomplete, evolving structure
• Entities may have different or missing attributes (example: Person)
• Ownership is often shared among organizations

Management and Representation
• Data representation and exchange on the WWW
• Labeled directed graph representation

Query and Search
• Getting better …

Kinds of Data

• Structured

Title   Author FN   Author LN   Publisher   Page
A       D           Edd         IEEE        233
B       Ted         Hee         ACM         553

• Unstructured

• Semi-structured

[Figure: labeled directed graph rooted at Bib (&o1). Edge labels include paper, book, references, author, title, year, http, publisher, page, firstname, lastname, first, last. Leaf values include "Serge", "Abiteboul", 1997, "Victor", "Vianu", 122, 133.]

Why Is Semi-structured Data Important?

Scenario: Organization A publishes movie data on its web pages (HTML), generated from a DBMS. A second organization B wants some movie information but can access only the web data.

[Figure: A's DBMS generates HTML, which B accesses]

When we want to treat Web sources like a database, but can't constrain these sources with a schema.

Why Is Semi-structured Data Important?

Scenario: Electronic Data Interchange (EDI) standard
• Computer-to-computer interchange of strictly formatted messages
• http://www.itl.nist.gov/fipspubs/fip161-2.htm

When we want a flexible format for data exchange between disparate systems/databases.

Electronic Data Interchange ISO Standard

Semi-structured Data (Pros & Cons)

Advantages
• No need to update the schema continuously
• Easy to discover new data and load it
• Easy to integrate heterogeneous data
• Easy to query without knowing data types

Disadvantages
• Loss of type information
• Harder storage, query optimization, and management

Managing Semi-structured Data

• How do we model it? (directed labeled graphs)
• How do we query it? (many proposals, all include regular path expressions)
• Optimize queries? (beginning to understand)
• Store the data? (looking for patterns)
• Integrity constraints, views, updates, …

Semi-structured Data: OEM

Object Exchange Model
• Data in OEM is schema-less and self-describing. It can be thought of as a labeled directed graph whose nodes are objects, each consisting of:
  – a unique object identifier (for example, &7),
  – a descriptive textual label (street),
  – a type (string),
  – a value ("22 Deer Rd").

Objects are atomic or complex:
• An atomic object contains a value of a base type (e.g., integer or string) and has no outgoing edges in the diagram.
• All other objects are complex objects, whose type is a set of object identifiers.

Lore: OEM-conforming data storage system, http://infolab.stanford.edu/lore/
Lorel: Lore query language
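A minimal sketch of OEM objects in Python, assuming a simple in-memory store keyed by object identifier. The class name OEMObject, the parent oid "&1", and the choice to keep labels on the parent's edges are illustrative assumptions; only "&7", street and "22 Deer Rd" come from the slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Atomic = Union[int, float, str]


@dataclass
class OEMObject:
    oid: str                                      # unique object identifier, e.g. "&7"
    value: Union[Atomic, List[Tuple[str, str]]]   # atomic value, or (label, child oid) pairs

    @property
    def is_atomic(self) -> bool:
        # Atomic objects have no outgoing edges.
        return not isinstance(self.value, list)


# "&1" is a hypothetical complex object that points at the slide's "&7".
store: Dict[str, OEMObject] = {
    "&1": OEMObject("&1", [("street", "&7")]),
    "&7": OEMObject("&7", "22 Deer Rd"),
}

street_oid = dict(store["&1"].value)["street"]
print(store[street_oid].value)
```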

Semi-structured Data Model Example

[Figure: the Bib graph in the Object Exchange Model (OEM): a labeled directed graph rooted at Bib (&o1), with complex objects as internal nodes and atomic objects as leaves, e.g. "Serge", "Abiteboul", 1997, "Victor", "Vianu", 122, 133. Edge labels include paper, book, references, author, title, year, http, publisher, page, firstname, lastname.]

Nodes are objects; labels on the arcs are attribute names.

Querying Semi-structured Data

Important features:
• Ability to navigate the data (regular path expressions)
• Querying the attribute names (arc variables)
• Creating new structures
• Type coercion

Languages:
• Lorel (Stanford), http://infolab.stanford.edu/pub/papers/lorel96.ps
• UnQL (U. Penn), http://www.unqlspec.org/display/UnQL/Home
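Navigation by path expression can be sketched over a labeled directed graph as follows. This is a simplification, assuming only dotted paths with a "*" wildcard rather than full regular path expressions; the oids and edges are an illustrative fragment, not the exact figure, and follow and evaluate are hypothetical helper names.

```python
# Labeled directed graph: oid -> list of (label, target). Leaves are plain values.
graph = {
    "&o1": [("paper", "&o12"), ("book", "&o24")],
    "&o12": [("author", "&o43"), ("year", 1997)],
    "&o24": [("author", "&o96")],
    "&o43": [("firstname", "Serge"), ("lastname", "Abiteboul")],
    "&o96": [("firstname", "Victor"), ("lastname", "Vianu")],
}


def follow(nodes, label):
    """Follow every arc carrying the given label ('*' matches any label)."""
    result = []
    for node in nodes:
        for arc_label, target in graph.get(node, []):
            if label == "*" or arc_label == label:
                result.append(target)
    return result


def evaluate(root, path):
    """Evaluate a dotted path such as '*.author.lastname' starting at root."""
    nodes = [root]
    for label in path.split("."):
        nodes = follow(nodes, label)
    return nodes


lastnames = evaluate("&o1", "*.author.lastname")
print(lastnames)
```

The wildcard step is what lets a query reach authors without knowing whether they hang off a paper or a book.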

17.2 Semistructured data

Lore and Lorel

Lore (Lightweight Object Repository)
• A DBMS
• Has an external data manager

Lorel (Lore language):
• Returns meaningful results even when some data is absent
• Operates uniformly over single-valued and set-valued data
• Accepts data with different types
• Can return heterogeneous objects
• Allows the object structure to be only partially known

Example: Find all properties with an annual rent.
SELECT DreamHome.PropertyForRent FROM DreamHome.PropertyForRent.annualRent

Answer:
PropertyForRent &6
  street &14 "18 Dale Rd"
  type &15 1
  annualRent &16 7200
  OverseenBy &4

Data Models Timeline

• Network Data Models (1964)

• Hierarchical Data Models (1968)

• Relational Data Models (1970)

• Object-oriented Data Models (~ 1985)

• Object-relational Data Models (~ 1990)

• Semi-structured Data Models (XML 1.0) (~1998)


XML

• a W3C standard to complement HTML

• origins: structured text (SGML)

• motivation:
  – HTML describes presentation
  – XML describes content

• http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, Second Edition, 10/2000)

[Figure: SGML, XML, HTML 4.0]

XML – An Embodiment of Semi-structured Data

Meta-language
• A de facto language to represent semi-structured data
• Used to create new languages (WAP, VoiceXML, MathML)

Extensibility
• Create new elements
• Create new languages (WML, WAP)

Markup
• Text markup
• Element = data + markup
• Document = nested elements

<note>

<to>Rana </to>

<from>Tunga </from>

<heading>Hello </heading>

<body>What’s up ! </body>

</note>


From HTML to XML

HTML describes the presentation

HTML

<h1> Bibliography </h1>

<p> <i> Foundations of Databases </i>

Abiteboul, Hull, Vianu

<br> Addison Wesley, 1995

<p> <i> Data on the Web </i>

Abiteboul, Buneman, Suciu

<br> Morgan Kaufmann, 1999

XML

<bibliography>

<book> <title> Foundations… </title>

<author> Abiteboul </author>

<author> Hull </author>

<author> Vianu </author>

<publisher> Addison Wesley </publisher>

<year> 1995 </year>

</book>

…

</bibliography>

XML describes the content
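Because the XML version names its content, a program can ask for authors or the year directly instead of guessing from layout. A minimal sketch using Python's standard xml.etree.ElementTree, with the slide's elided entries left out:

```python
import xml.etree.ElementTree as ET

# The bibliography fragment from the slide (elided entries omitted).
doc = """
<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>
"""

root = ET.fromstring(doc)
book = root.find("book")
authors = [a.text.strip() for a in book.findall("author")]   # content selected by tag name
year = int(book.findtext("year"))
print(authors, year)
```

Doing the same against the HTML version would mean relying on the position of <i> and <br>, which say nothing about what the text means.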

XML VS. HTML

XML and HTML were designed with different goals:

XML was designed to describe data and to focus on what data is.

HTML was designed to display data and to focus on how data looks.

It is important to understand that XML is not a replacement for HTML.

XML Data Model

Several competing models:
• Document Object Model (DOM):
  – http://www.w3.org/TR/2001/WD-DOM-Level-3-CMLS-20010209/ (2/2001)
  – class hierarchy (node, element, attribute, …)
  – objects have behavior
  – defines API to inspect/modify the document
• XSL data model
• Infoset
  – PSV (post schema validation)
• XML Query data model (next)

Why XML

• Portability
• Language neutrality
• Platform independence
• Program-data decoupling

• Logic and notation
• Data and metadata
• Information and structure
• Content and form

Why XML

Data evolution: schema update not required

Integration: a priori knowledge of the schema is not necessary

Sharing between incompatible formats

Interoperability without rebuilding the systems

Report Concrete Examples

How Computers Understand XML

Parsers: software to understand XML
• Remove markup and retrieve data

Document Object Model (DOM)
• Models a document as a tree

Simple API for XML (SAX)
• Sequential access

What XML is not

A little hard to understand, but XML does not DO anything. XML is created to structure, store and send information.

<note>

<to>Rana </to>

<from>Tunga </from>

<heading>Hello </heading>

<body>What’s up ! </body>

</note>

The note has a header, a message body, and sender and receiver information. But still, this XML document does not DO anything.

Just information wrapped in XML tags. Someone must write a piece of software to send, receive or display it.
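As an illustration of "someone must write a piece of software", here is a small sketch that reads the note and displays it. The output format is an arbitrary choice made for this example.

```python
import xml.etree.ElementTree as ET

note_xml = (
    "<note><to>Rana </to><from>Tunga </from>"
    "<heading>Hello </heading><body>What's up ! </body></note>"
)

note = ET.fromstring(note_xml)
# Map each child tag to its text; the XML itself only carries the information.
fields = {child.tag: child.text.strip() for child in note}
message = f"To {fields['to']} from {fields['from']}: {fields['heading']} - {fields['body']}"
print(message)
```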


XML Terminology

• tags: book, title, author, …
• start tag: <book>, end tag: </book>
• elements: <book>…</book>, <author>…</author>
• elements are nested
• empty element: <red></red>, abbreviated <red/>
• an XML document: single root element
• well-formed XML document: if it has matching tags

More XML: Attributes

<book price = "55" currency = "USD">

<title> Foundations of Databases </title>


…

<year> 1995 </year>

</book>

attributes are alternative ways to represent data
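A short sketch of how the two representations look to a program, using Python's xml.etree.ElementTree. The elided part of the slide's book element is omitted.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    '<book price="55" currency="USD">'
    "<title> Foundations of Databases </title>"
    "<year> 1995 </year>"
    "</book>"
)

price = book.get("price")               # attribute: a name/value pair on the start tag
currency = book.attrib["currency"]      # attributes are also available as a dict
year = book.findtext("year").strip()    # element: a nested node with text content
print(price, currency, year)
```

Either way the value arrives as a string; the choice between attribute and element is a modeling decision.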

Parsers and Well-formed XML Documents

• XML parser
  – Processes XML document
    • Reads XML document
    • Checks syntax
    • Reports errors (if any)
    • Allows programmatic access to document's contents

Parsers and Well-formed XML Documents (cont.)

• XML document syntax
  – Considered well formed if syntactically correct
    • Single root element
    • Each element has start tag and end tag
    • Tags properly nested
    • Attribute values in quotes
    • Proper capitalization
      – Case sensitive
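Each rule above can be checked by handing a small document to a parser and seeing whether it is rejected. A minimal sketch with Python's standard parser; is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


checks = {
    "ok": is_well_formed('<a x="1"><b/></a>'),
    "two roots": is_well_formed("<a/><b/>"),
    "bad nesting": is_well_formed("<a><b></a></b>"),
    "unquoted attribute": is_well_formed("<a x=1/>"),
    "case mismatch": is_well_formed("<a></A>"),
}
print(checks)
```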

Parsers and Well-formed XML Documents (cont.)

• XML parsers support
  – Document Object Model (DOM)
    • Builds tree structure containing document data in memory
  – Simple API for XML (SAX)
    • Generates events when tags, comments, etc. are encountered
      – (Events are notifications to the application)
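The two styles can be compared side by side with Python's standard library (xml.dom.minidom for DOM, xml.sax for SAX). The document and the handler class TagCollector are made up for this sketch.

```python
import xml.sax
from xml.dom import minidom

doc = "<note><to>Rana</to><from>Tunga</from></note>"

# DOM: the whole document is built as a tree in memory, then inspected.
dom = minidom.parseString(doc)
dom_tags = [node.tagName for node in dom.documentElement.childNodes]


# SAX: the parser notifies the application as it reads the input.
class TagCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


handler = TagCollector()
xml.sax.parseString(doc.encode("utf-8"), handler)
print(dom_tags)
print(handler.events)
```

DOM is convenient when the document fits in memory and must be revisited; SAX keeps no tree, so it suits large documents read once.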

Parsing an XML Document with msxml

• XML document
  – Contains data
  – Does not contain formatting information
  – Load XML document into Internet Explorer 5.0
    • Document is parsed by msxml.
    • Places plus (+) or minus (-) signs next to container elements
      – Plus sign indicates that all child elements are hidden
      – Clicking plus sign expands container element
        » Displays children
      – Minus sign indicates that all child elements are visible
      – Clicking minus sign collapses container element
        » Hides children
    • Error generated, if document is not well formed

[Figure: XML document shown in IE5.]

Characters

• Character set
  – Characters that may be represented in XML document
    • e.g., ASCII character set
      – Letters of English alphabet
      – Digits (0-9)
      – Punctuation characters, such as !, - and ?

Character Set

• XML documents may contain
  – Carriage returns
  – Line feeds
  – Unicode characters (Section 5.5.4)
    • Enables computers to process characters for several languages

Characters vs. Markup

• XML must differentiate between
  – Markup text
    • Enclosed in angle brackets (< and >)
      – e.g., child elements
  – Character data
    • Text between start tag and end tag
      – e.g., Fig. 5.1, line 7: Welcome to XML!

White Space, Entity References and Built-in Entities

• Whitespace characters
  – Spaces, tabs, line feeds and carriage returns
    • Significant (preserved by application)
    • Insignificant (not preserved by application)
      – Normalization
        » Whitespace collapsed into single whitespace character
        » Sometimes whitespace removed entirely

<markup>This   is   character   data</markup>

after normalization, becomes

<markup>This is character data</markup>
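Normalization as described above can be sketched in one line of Python. The function name normalize_whitespace is hypothetical, and this version also removes leading and trailing whitespace entirely, which is one of the two behaviours the slide mentions.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse each run of spaces, tabs, line feeds and carriage returns to one space."""
    # str.split() with no argument splits on any whitespace run and drops empty pieces.
    return " ".join(text.split())


normalized = normalize_whitespace("This   is \t character\r\n data")
print(normalized)
```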

White Space, Entity References and Built-in Entities (cont.)

• XML-reserved characters
  – Ampersand (&)
  – Left-angle bracket (<)
  – Right-angle bracket (>)
  – Apostrophe (')
  – Double quote (")

• Entity references
  – Allow the use of XML-reserved characters
    • Begin with ampersand (&) and end with semicolon (;)
  – Prevent character data from being misinterpreted as markup

White Space, Entity References and Built-in Entities (cont.)

• Built-in entities
  – Ampersand (&amp;)
  – Left-angle bracket (&lt;)
  – Right-angle bracket (&gt;)
  – Apostrophe (&apos;)
  – Quotation mark (&quot;)
  – Mark up characters "<>&" in element message

<message>&lt;&gt;&amp;</message>
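The round trip between reserved characters and built-in entities can be sketched with Python's standard xml.sax.saxutils helpers; the parser decodes the entities again when it reads the element.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "<>&"
escaped = escape(raw)                                      # replaces &, < and > with entities
element = ET.fromstring(f"<message>{escaped}</message>")   # parser decodes them back
print(escaped, element.text, unescape(escaped))
```

By default escape handles only &, < and >; quotes matter inside attribute values, for which the same module offers quoteattr.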

More XML: Oids and References

<person id="o555"> <name> Jane </name> </person>

<person id="o456"> <name> Mary </name>
  <children idref="o123 o555"/>
</person>

<person id="o123" mother="o456"><name>John</name>
</person>

oids and references in XML are just syntax
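"Just syntax" means the parser does not follow the references; the application has to. A minimal sketch that builds its own oid index, assuming the three fragments are wrapped in a hypothetical <people> root element:

```python
import xml.etree.ElementTree as ET

# The slide's fragments inside a hypothetical <people> root.
doc = """
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""

root = ET.fromstring(doc)
by_id = {p.get("id"): p for p in root.iter("person")}   # oid -> element

mary = by_id["o456"]
refs = mary.find("children").get("idref").split()        # idref holds a space-separated list
child_names = [by_id[ref].findtext("name") for ref in refs]
mother_name = by_id[by_id["o123"].get("mother")].findtext("name")
print(child_names, mother_name)
```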

More XML: CDATA Section

• Syntax: <![CDATA[ .....any text here...]]>

• Example:

<example> <![CDATA[ some text here </notAtag> <>]]>

</example>

[Figure: Using a CDATA section]

More XML: Entity References

• Syntax: &entityname;

• Example: <element> this is less than &lt; </element>

• Some entities:
  &lt;     <
  &gt;     >
  &amp;    &
  &apos;   '
  &quot;   "
  &#38;    Unicode char

More XML: Processing Instructions

• Syntax: <?target argument?>
• Example:

<product>
  <name> Alarm Clock </name>
  <?ringBell 20?>
  <price> 19.99 </price>
</product>

• What do they mean?

More XML: Comments

• Syntax: <!-- comment text -->

• Yes, they are part of the data model!

XML Namespaces

• http://www.w3.org/TR/REC-xml-names (1/99)

• name ::= [prefix:]localpart

<book xmlns:isbn="www.isbn-org.org/def">
  <title> … </title>
  <number> 15 </number>
  <isbn:number> …. </isbn:number>
</book>

XML Namespaces

• syntactic: <number>, <isbn:number>

• semantic: provide URL for schema

<tag xmlns:mystyle="http://…">
  …
  <mystyle:title> … </mystyle:title>
  <mystyle:number> …
</tag>

[Figure annotation: "defined here"]

XML Namespaces

• Naming collisions
  – Two different elements have same name

<subject>Math</subject>
<subject>Thrombosis</subject>

• Namespaces
  – Differentiate elements that have same name

<school:subject>Math</school:subject>
<medical:subject>Thrombosis</medical:subject>

• school and medical are namespace prefixes
  – Prepended to element and attribute names
  – Tied to uniform resource identifier (URI)
    » Series of characters for differentiating names

XML Namespaces

• Creating namespaces
  – Use xmlns keyword

xmlns:text = "urn:deitel:textInfo"
xmlns:image = "urn:deitel:imageInfo"

    • Creates two namespace prefixes, text and image
    • urn:deitel:textInfo is URI for prefix text
    • urn:deitel:imageInfo is URI for prefix image
  – Default namespaces
    • Child elements of this namespace do not need prefix

xmlns = "urn:deitel:textInfo"

<?xml version = "1.0"?>

<directory xmlns:text = "urn:deitel:textInfo"
   xmlns:image = "urn:deitel:imageInfo">

   <text:file filename = "book.xml">
      <text:description>A book list</text:description>
   </text:file>

   <image:file filename = "funny.jpg">
      <image:description>A funny picture</image:description>
      <image:size width = "200" height = "100"/>
   </image:file>

</directory>

• Element directory contains two namespace prefixes
• Use prefix text to describe elements file and description
• Apply prefix image to describe elements file, description and size

<?xml version = "1.0"?>

<directory xmlns = "urn:deitel:textInfo"
   xmlns:image = "urn:deitel:imageInfo">

   <file filename = "book.xml">
      <description>A book list</description>
   </file>

   <image:file filename = "funny.jpg">
      <image:description>A funny picture</image:description>
      <image:size width = "200" height = "100"/>
   </image:file>

</directory>

• urn:deitel:textInfo is default namespace
• Element file is in default namespace
• Specify namespace
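To a namespace-aware parser, the default-namespace listing yields names qualified by URI, not by prefix. A sketch with Python's xml.etree.ElementTree; the query prefixes t and img are arbitrary choices for this example and need not match the document's prefixes.

```python
import xml.etree.ElementTree as ET

doc = """
<directory xmlns="urn:deitel:textInfo" xmlns:image="urn:deitel:imageInfo">
  <file filename="book.xml"><description>A book list</description></file>
  <image:file filename="funny.jpg">
    <image:description>A funny picture</image:description>
    <image:size width="200" height="100"/>
  </image:file>
</directory>
"""

root = ET.fromstring(doc)
# Prefixes here are local to the query; only the URIs identify the namespace.
ns = {"t": "urn:deitel:textInfo", "img": "urn:deitel:imageInfo"}

text_file = root.find("t:file", ns).get("filename")
image_file = root.find("img:file", ns).get("filename")
print(root.tag, text_file, image_file)
```

The two file elements share a local name but are distinct to the parser because their URIs differ.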

XML Stylesheet

Extensible Stylesheet Language (XSL)
• Language for document transformation

Transformation
• Converting XML to another form

Formatting objects
• Layout of XML document

Defined by W3C

http://www.codeproject.com/Articles/294380/Applying-XSLT-Stylesheet-to-an-XML-File-at-Runtime

XML Path (XPath)

WHY
• To access particular parts of an XML document
• To navigate within an XML document

WHAT
• Analogous to the SELECT statement in SQL

HOW
• It views an XML document as a tree
• The root of the tree is a node that does not correspond to anything in the document
• Internal nodes are elements
• Leaves are either attributes, text nodes, or comments

[Figure: XML path]
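A few path selections can be tried with the limited XPath subset that Python's xml.etree.ElementTree supports (not a full XPath engine). The document reuses the bibliography titles from earlier slides; the second price value is a hypothetical addition.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<bibliography>"
    '<book price="55"><title>Foundations of Databases</title><year>1995</year></book>'
    '<book price="40"><title>Data on the Web</title><year>1999</year></book>'
    "</bibliography>"
)

titles = [t.text for t in root.findall("./book/title")]      # child steps
by_year = root.findtext("./book[year='1999']/title")         # predicate on child text
by_attr = root.findtext(".//book[@price='55']/title")        # predicate on an attribute
print(titles, by_year, by_attr)
```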

XML Query

WHAT
• XQuery can be used to:
  – Extract information to use in a Web Service
  – Generate summary reports
  – Transform XML data to XHTML
  – Search Web documents for relevant information

WHY
• Need to extract parts of XML documents (database)
• Need to transform documents into different forms:
  – Another XML form
  – HTML (to display on a Web browser)
  – Other (e.g. bibtex)
• Need to relate (join) parts of the same or different documents

HOW
• The XML-QL language
• XQuery: W3C standard
  – Very powerful, fairly intuitive, SQL-style

XML Query Data Model

• http://www.w3.org/TR/query-datamodel/ (2/2001)

• Describes XML as a tree, specialized nodes

• Uses a functional-style notation (think ML)

• Node ::= DocNode | ElemNode | ValueNode | AttrNode | NSNode | PINode | CommentNode | InfoItemNode | RefNode

Element node (simplified definition):

• elemNode : (QNameValue, {AttrNode}, [ElemNode | ValueNode]) → ElemNode

• QNameValue = means "a tag name"
• {...} = means "set of ..."
• [...] = means "list of ..."

• Reads: "give me a tag, a set of attributes, a list of elements/values, and I will return an element"


Example

<book price = "55" currency = "USD">
  <title> Foundations … </title>
  …
  <year> 1995 </year>
</book>

book1 = elemNode(book, {price2, currency3}, [title4, author5, author6, author7, year8])

price2 = attrNode(…) /* next */
currency3 = attrNode(…)
title4 = elemNode(title, string9)
…


Attribute node:

• attrNode : (QNameValue, ValueNode) → AttrNode


Example

<book price = "55" currency = "USD">
  …
  <year> 1995 </year>
</book>

price2 = attrNode(price, string10)
string10 = valueNode(…) /* next */
currency3 = attrNode(currency, string11)
string11 = valueNode(…)


Value node:

• ValueNode = StringValue | BoolValue | FloatValue …

• stringValue : string → StringValue
• boolValue : boolean → BoolValue
• floatValue : float → FloatValue


Example

<book price = "55" currency = "USD">
  <title> Foundations … </title>
  …
  <year> 1995 </year>
</book>

price2 = attrNode(price, string10)
string10 = valueNode(stringValue("55"))
currency3 = attrNode(currency, string11)
string11 = valueNode(stringValue("USD"))

title4 = elemNode(title, string9)
string9 = valueNode(stringValue("Foundations…"))
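The constructor notation above can be mirrored by small immutable Python classes. This is a sketch of the simplified definitions on these slides, not of the W3C specification; the class layout is an assumption, and the author elements are left out because they are elided on the slide.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class ValueNode:
    value: Union[str, bool, float]


@dataclass(frozen=True)
class AttrNode:
    name: str            # QNameValue
    value: ValueNode


@dataclass(frozen=True)
class ElemNode:
    tag: str                                              # QNameValue
    attrs: FrozenSet[AttrNode]                            # {AttrNode}: an unordered set
    children: Tuple[Union["ElemNode", ValueNode], ...]    # [...]: an ordered list


price2 = AttrNode("price", ValueNode("55"))
currency3 = AttrNode("currency", ValueNode("USD"))
title4 = ElemNode("title", frozenset(), (ValueNode("Foundations…"),))
year8 = ElemNode("year", frozenset(), (ValueNode("1995"),))
book1 = ElemNode("book", frozenset({price2, currency3}), (title4, year8))
print(book1.tag, len(book1.attrs), [c.tag for c in book1.children])
```

Using a set for attributes and a tuple for children encodes the point that attribute order is irrelevant while element order is part of the data.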

Semi-structured Data vs. XML

• both described best by a graph

• both are schema-less, self-describing

• Attributes ---> tags

• objects ---> elements

• atomic values ---> CDATA (characters)

• Order? Assumed in XML.

• XML attributes (fixable)

• References in XML.


Similarities and Differences

<person id="o123">
  <name> Alan </name>
  <age> 42 </age>
  <email> ab@com </email>
</person>

{ person: &o123
    { name: "Alan",
      age: 42,
      email: "ab@com" }
}

[Figure: both forms drawn as trees: person with children name, age, email and values Alan, 42, ab@com]

<person father="o123"> … </person>

{ person: { father: &o123 … } }

[Figure: the two father references drawn as edges labeled father]

similar on trees, different on graphs

More Differences

• XML is ordered, ssd is not

• XML can mix text and elements:

<talk> Making Java easier to type and easier to type

<speaker> Phil Wadler </speaker>

</talk>

• XML has lots of other stuff: entities, processing instructions, comments

Very important: these differences make XML data management harder
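Mixed content is one place where the extra difficulty shows up in code: the text of the talk element and its speaker child are siblings in document order, so a plain attribute-to-value mapping no longer fits. A sketch with Python's xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

talk = ET.fromstring(
    "<talk> Making Java easier to type and easier to type"
    "<speaker> Phil Wadler </speaker></talk>"
)

leading_text = talk.text.strip()              # text before the first child element
speaker = talk.find("speaker").text.strip()   # text inside the nested element
tags_in_order = [child.tag for child in talk] # children keep document order
print(leading_text, "|", speaker, "|", tags_in_order)
```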