IS432 Semi-Structured Data
Lecture 1:
SSD & XML
Dr. Gamal Al-Shorbagy
In this lecture
• What semi-structured data is
– Why we need it
– How it is represented and processed
– Related technologies
• What XML is
– XML syntax
– XML Query data model
– Comparison of XML with semi-structured data
Papers:
– XML, Java, and the Future of the Web, by Jon Bosak, Sun Microsystems.
– W3C XML Query Data Model, by Mary Fernandez and Jonathan Robie.
The Data
A document/page for a common user
Type of data? Difficult to identify.
Is there any order? No particular format or sequence.
Does it follow any rules? Can we predict anything about the data?
Management and representation: unmanageable by nature; often found as text, video, sound or images.
Query and search: brute force, like finding a needle in a haystack.
The data
A table for organizations
Data follows a certain model, e.g. relational: entities, same attributes, order and relations.
Schema and data are separated: first the schema, then the data.
Data elements are strongly “typed” and “ordered”.
Corporate ownership.
Management and representation: a specialized DBMS engine handles management, storage and query formulation. Data is represented as entities (tuples) or classes (objects).
Query and search: optimized via indexes, trees, …
[Figure: a group of tables]
The data
• name: Some Body
• email: [email protected], [email protected]
------------------------------------------------------------------
• name:
  • first name: Ceaser
  • last name: The Great
• email: [email protected]
------------------------------------------------------------------
• name: Ranjeet Singh
• affiliation: Punjab
the Data
A graph/web for advanced users
Structure and data are mixed: schema-less and self-describing (prescription vs. description).
The schema may evolve over time.
The schema may be larger than the data itself.
Structure is irregular, incomplete and evolving.
Entities may have different or missing attributes (example: Person).
Ownership is often shared among organizations.
Management and representation: data representation and exchange on the WWW; labeled directed graph representation.
Query and search: getting better …
Kinds of Data
Structured
Title   Author FN   Author LN   Publisher   Page
A       D           Edd         IEEE        233
B       Ted         Hee         ACM         553
Unstructured
• Semi Structured
[Figure: a bibliography ("Bib") drawn as a labeled directed graph. Object nodes (&o1, &o12, &o24, &o29, …) are connected by arcs labeled paper, book, references, author, title, year, http, publisher, page, firstname and lastname, ending in atomic values such as “Serge”, “Abiteboul”, 1997, “Victor” and “Vianu”.]
Why is semi-structured data important?
Scenario: organization A publishes movie data on its web pages (HTML), generated from a DBMS. A second organization B wants some movie information but can access only the web data.
[Figure: A's DBMS generates HTML pages, which B reads]
When we want to treat Web sources like a database, but can’t constrain these sources with a schema.
Why is semi-structured data important?
Scenario: Electronic Data Interchange (EDI), a standard for the computer-to-computer interchange of strictly formatted messages. http://www.itl.nist.gov/fipspubs/fip161-2.htm
When we want a flexible format for data exchange between disparate systems/databases.
Electronic Data Interchange ISO Standard
Semi Structured Data (Pros & Cons)
Advantages
• No need to update the schema continuously
• Easy to discover new data and load it
• Easy to integrate heterogeneous data
• Easy to query without knowing data types
Disadvantages
• Type information is lost
• Storage, query optimization and management are harder
Managing Semi structured Data
• How do we model it? (directed labeled graphs)
• How do we query it? (many proposals, all include regular path expressions)
• Optimize queries? (beginning to understand)
• Store the data? (looking for patterns)
• Integrity constraints, views, updates, …
Semi Structured Data: OEM
Object Exchange Model
Data in OEM is schema-less and self-describing. It can be thought of as a labeled directed graph where nodes are objects, each consisting of: a unique object identifier (for example, &7), a descriptive textual label (street), a type (string), and a value (“22 Deer Rd”).
Objects are atomic or complex:
• An atomic object contains a value of a base type (e.g., integer or string) and has no outgoing edges in the diagram.
• All other objects are complex objects, whose type is a set of object identifiers.
Lore: an OEM-conforming data storage system. http://infolab.stanford.edu/lore/
Lorel: the Lore query language.
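The OEM description above can be sketched in a few lines of Python. This is only an illustration, not Lore's storage format: the class name `OEMObject`, the `store` dictionary and the identifiers &4 and &8 are assumptions made for the example, and labels are placed on the arcs rather than stored inside each object.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Atomic = Union[str, int, float]
Edges = List[Tuple[str, str]]          # (arc label, target object identifier)


@dataclass
class OEMObject:
    oid: str                           # unique object identifier, e.g. "&7"
    value: Union[Atomic, Edges]        # atomic value, or labeled arcs to other objects

    @property
    def is_atomic(self) -> bool:
        # Atomic objects hold a base-type value and have no outgoing edges.
        return not isinstance(self.value, list)


# Hypothetical store: &7 and its value come from the slide, &4 and &8 are made up.
store: Dict[str, OEMObject] = {
    obj.oid: obj
    for obj in [
        OEMObject("&7", "22 Deer Rd"),
        OEMObject("&8", 7200),
        OEMObject("&4", [("street", "&7"), ("annualRent", "&8")]),
    ]
}


def children(oid: str, label: str) -> List[OEMObject]:
    """Follow every arc named `label` out of object `oid`."""
    obj = store[oid]
    if obj.is_atomic:
        return []
    return [store[target] for arc, target in obj.value if arc == label]


print(children("&4", "street")[0].value)
```

Because nothing declares which labels an object may carry, two complex objects in the same store can have entirely different arcs, which is the schema-less property the slide describes.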
Semi-structured data model example
Object Exchange Model (OEM)
[Figure: the "Bib" labeled directed graph. Object nodes (&o1, &o12, &o24, &o29, …) are connected by arcs labeled paper, book, references, author, title, year, http, publisher, page, firstname and lastname, ending in atomic values such as “Serge”, “Abiteboul”, 1997, “Victor” and “Vianu”. Callouts mark a complex object and an atomic object.]
Nodes are objects; labels on the arcs are attribute names.
Querying Semi structured Data
Important features:
• Ability to navigate the data (regular path expressions)
• Querying the attribute names (arc variables)
• Creating new structures
• Type coercion
Languages:
• Lorel (Stanford) http://infolab.stanford.edu/pub/papers/lorel96.ps
• UnQL (U. Penn) http://www.unqlspec.org/display/UnQL/Home
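To make the idea of navigating by path concrete, here is a small Python sketch. It is not Lorel or UnQL: it supports only a fixed sequence of labels plus a `*` wildcard that matches any single arc, which is far less than full regular path expressions. The graph contents and the function name `follow` are assumptions for illustration.

```python
# Hypothetical graph: each object id maps to a list of (label, target) arcs,
# or to an atomic value for leaf objects.
graph = {
    "&o1": [("paper", "&o12"), ("book", "&o24")],
    "&o12": [("author", "&a1"), ("year", "&y1")],
    "&o24": [("author", "&a2")],
    "&a1": "Abiteboul",
    "&y1": 1997,
    "&a2": "Vianu",
}


def follow(graph, start, path):
    """Return the object ids reached from `start` by following `path`.

    Each step is an arc label, or "*" to match any label.
    """
    frontier = [start]
    for step in path:
        reached = []
        for oid in frontier:
            arcs = graph[oid]
            if isinstance(arcs, list):          # atomic objects have no arcs
                reached.extend(
                    target for label, target in arcs
                    if step == "*" or label == step
                )
        frontier = reached
    return frontier


# Authors of anything directly under the root, whether paper or book.
authors = [graph[oid] for oid in follow(graph, "&o1", ["*", "author"])]
print(authors)
```

The wildcard step is what lets a query ignore whether an entry is a paper or a book, which matters when the structure is not known in advance.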
17.2 Semistructured data
Lore and Lorel
Lore (Lightweight Object Repository)
• A DBMS
• Has an external data manager
Lorel (the Lore language):
• Returns meaningful results even when some data is absent
• Operates uniformly over single-valued and set-valued data
• Accepts data with different types
• Can return heterogeneous objects
• Allows the object structure to be only partially known
Example: find all properties with an annual rent.
SELECT DreamHome.PropertyForRent
FROM DreamHome.PropertyForRent.annualRent
Answer: PropertyForRent &6, street &14 “18 Dale Rd”, type &15 1, annualRent &16 7200, OverseenBy &4
Data Models Timeline
• Network Data Models (1964)
• Hierarchical Data Models (1968)
• Relational Data Models (1970)
• Object-oriented Data Models (~ 1985)
• Object-relational Data Models (~ 1990)
• Semi-structured Data Models (XML 1.0) (~1998)
XML
• a W3C standard to complement HTML
• origins: structured text (SGML)
• motivation:
– HTML describes presentation
– XML describes content
• http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, second edition, 10/2000)
[Figure: SGML, XML and HTML 4.0]
XML – An Embodiment of Semi structured Data
Meta-language
• A de facto language to represent semi-structured data
• Used to create new languages (WAP, VoiceXML, MathML)
Extensibility
• Create new elements
• Create new languages (WML, WAP)
Markup
• Text markup
• Element = data + markup
• Document = nested elements
<note>
<to>Rana </to>
<from>Tunga </from>
<heading>Hello </heading>
<body>What’s up ! </body>
</note>
From HTML to XML
HTML describes the presentation
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML describes the content
XML VS. HTML
XML and HTML were designed with different goals:
XML was designed to describe data and to focus on what data is.
HTML was designed to display data and to focus on how data looks.
It is important to understand that XML is not a replacement for HTML.
XML Data Model
Several competing models:
• Document Object Model (DOM):
– http://www.w3.org/TR/2001/WD-DOM-Level-3-CMLS-20010209/ (2/2001)
– class hierarchy (node, element, attribute, …)
– objects have behavior
– defines API to inspect/modify the document
• XSL data model
• Infoset
– PSV (post schema validation)
• XML Query data model (next)
Why XML
• Portability
• Language neutrality
• Platform independence
• Program-data decoupling
– Logic and notation
– Data and metadata
– Information and structure
– Content and form
Why XML
Data evolution: schema update not required
Integration: a priori knowledge of the schema is not necessary
Sharing between incompatible formats
Interoperability without rebuilding the systems
Report concrete examples
How computers understand XML
Parsers: software to understand XML
• Remove markup and retrieve data
Document Object Model (DOM)
• Models a document as a tree
Simple API for XML (SAX)
• Sequential access
What XML is not
A little hard to understand, but XML does not DO anything. XML is created to structure, store and send information.
<note>
<to>Rana </to>
<from>Tunga </from>
<heading>Hello </heading>
<body>What’s up ! </body>
</note>
The note has a header, a message body, and sender and receiver information. But still, this XML document does not DO anything.
Just information wrapped in XML tags. Someone must write a piece of software to send, receive or display it.
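As a hedged illustration of "someone must write a piece of software", the following Python sketch reads the note with the standard library's `xml.etree.ElementTree` and formats it for display. The `display` function and its output layout are assumptions for the example; whitespace inside the elements has been trimmed.

```python
import xml.etree.ElementTree as ET

NOTE_XML = """<note>
  <to>Rana</to>
  <from>Tunga</from>
  <heading>Hello</heading>
  <body>What's up!</body>
</note>"""


def display(note):
    """Turn the parsed note into a short text message."""
    return (
        f"To: {note.findtext('to')}\n"
        f"From: {note.findtext('from')}\n"
        f"{note.findtext('heading')}: {note.findtext('body')}"
    )


note = ET.fromstring(NOTE_XML)   # the parser strips markup and exposes the data
message = display(note)
print(message)
```

The XML only carries the information; the decision to print it, send it or store it lives entirely in the program.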
XML Terminology
• tags: book, title, author, …
• start tag: <book>, end tag: </book>
• elements: <book>…</book>, <author>…</author>
• elements are nested
• empty element: <red></red>, abbreviated <red/>
• an XML document: single root element
• well-formed XML document: one whose tags match and are properly nested
More XML: Attributes
<book price = "55" currency = "USD">
<title> Foundations of Databases </title>
<author> Abiteboul </author>
…
<year> 1995 </year>
</book>
attributes are alternative ways to represent data
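A minimal Python sketch of how a program sees the two representations, using `xml.etree.ElementTree` from the standard library. The elided part of the slide's example ("…") is simply left out here, and straight quotes are used because XML requires them.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book price="55" currency="USD">
  <title>Foundations of Databases</title>
  <author>Abiteboul</author>
  <year>1995</year>
</book>"""

book = ET.fromstring(BOOK_XML)

price = book.get("price")            # attribute value, always a string
currency = book.attrib["currency"]   # the same data via the attribute dict
year = book.findtext("year")         # data stored in a child element instead

print(price, currency, year)
```

Note that the price comes back as the string "55", not a number; converting it is up to the application.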
Parsers and Well-formed XML Documents
• XML parser
– Processes XML document
• Reads XML document
• Checks syntax
• Reports errors (if any)
• Allows programmatic access to document’s contents
Parsers and Well-formed XML Documents (cont.)
• XML document syntax
– Considered well formed if syntactically correct
• Single root element
• Each element has start tag and end tag
• Tags properly nested
• Attribute values in quotes
• Proper capitalization
– Case sensitive
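The rules above can be checked with any conforming parser. The sketch below uses Python's `xml.etree.ElementTree`, which raises `ParseError` on input that is not well formed; the helper name `is_well_formed` and the sample strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "<a><b></b></a>": "properly nested",
    "<a><b></a></b>": "tags overlap",
    "<a></A>": "case mismatch between start and end tag",
    "<a x=1/>": "attribute value not quoted",
    "<a/><b/>": "two root elements",
}

for text, why in samples.items():
    print(f"{text!r:20} {is_well_formed(text)!s:6} {why}")
```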
Parsers and Well-formed XML Documents (cont.)
• XML parsers support
– Document Object Model (DOM)
• Builds tree structure containing document data in memory
– Simple API for XML (SAX)
• Generates events when tags, comments, etc. are encountered
– (Events are notifications to the application)
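The two parsing styles can be compared side by side with Python's standard library (`xml.dom.minidom` for DOM, `xml.sax` for SAX). This is a sketch on a made-up two-element document; the handler class name `TagLogger` is an assumption.

```python
import xml.sax
from xml.dom import minidom

DOC = "<note><to>Rana</to><from>Tunga</from></note>"

# DOM: the whole document is built as a tree in memory, then navigated.
dom = minidom.parseString(DOC)
to_text = dom.getElementsByTagName("to")[0].firstChild.data


# SAX: the parser calls back into the application as it meets each tag.
class TagLogger(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


handler = TagLogger()
xml.sax.parseString(DOC.encode("utf-8"), handler)

print(to_text)
print(handler.events)
```

DOM is convenient when the program needs random access to the document; SAX keeps memory use low because nothing is retained unless the handler stores it.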
Parsing an XML Document with msxml
• XML document
– Contains data
– Does not contain formatting information
– Load XML document into Internet Explorer 5.0
• Document is parsed by msxml.
• Places plus (+) or minus (-) signs next to container elements
– Plus sign indicates that all child elements are hidden
– Clicking plus sign expands container element
» Displays children
– Minus sign indicates that all child elements are visible
– Clicking minus sign collapses container element
» Hides children
• Error generated, if document is not well formed
[Figure: XML document shown in IE5.]
Characters
• Character set
– Characters that may be represented in XML document
• e.g., ASCII character set
– Letters of English alphabet
– Digits (0-9)
– Punctuation characters, such as !, - and ?
Character Set
• XML documents may contain
– Carriage returns
– Line feeds
– Unicode characters (Section 5.5.4)
• Enables computers to process characters for several languages
Characters vs. Markup
• XML must differentiate between
– Markup text
• Enclosed in angle brackets (< and >)
– e.g., child elements
– Character data
• Text between start tag and end tag
– e.g., Fig. 5.1, line 7: Welcome to XML!
White Space, Entity References and Built-in Entities
• Whitespace characters
– Spaces, tabs, line feeds and carriage returns
• Significant (preserved by application)
• Insignificant (not preserved by application)
– Normalization
» Whitespace collapsed into single whitespace character
» Sometimes whitespace removed entirely
<markup>This     is     character     data</markup>
after normalization, becomes
<markup>This is character data</markup>
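Whitespace normalization of character data can be sketched in Python as follows. This is an application-level helper, not something the XML parser does to element content on its own; the function name and the sample string are assumptions.

```python
def normalize_whitespace(text):
    """Collapse each run of spaces, tabs and line breaks into one space.

    Leading and trailing whitespace is removed entirely.
    """
    return " ".join(text.split())


raw = "This     is\n\tcharacter     data  "
print(normalize_whitespace(raw))
```

An application would apply this only where whitespace is insignificant; in preformatted text or source code listings it must be left alone.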
White Space, Entity References and Built-in Entities (cont.)
• XML-reserved characters
– Ampersand (&)
– Left-angle bracket (<)
– Right-angle bracket (>)
– Apostrophe (')
– Double quote (")
• Entity references
– Allow XML-reserved characters to be used in character data
• Begin with ampersand (&) and end with semicolon (;)
– Prevent character data from being misinterpreted as markup
White Space, Entity References and Built-in Entities (cont.)
• Built-in entities
– Ampersand (&amp;)
– Left-angle bracket (&lt;)
– Right-angle bracket (&gt;)
– Apostrophe (&apos;)
– Quotation mark (&quot;)
– Marking up the characters “<>&” in element message:
<message>&lt;&gt;&amp;</message>
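In practice the escaping is usually done by a library rather than by hand. The sketch below uses `escape` from Python's `xml.sax.saxutils` to produce the entity references and then parses the result back to show that the original characters are recovered; the variable names are assumptions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "<>&"

# escape() replaces &, < and > with their built-in entity references.
escaped = escape(raw)
document = f"<message>{escaped}</message>"

# Quotes are only escaped on request, via the optional entities mapping.
quoted = escape('say "hi"', {'"': "&quot;"})

# Parsing turns the entity references back into the characters they stand for.
recovered = ET.fromstring(document).text

print(document)
print(quoted)
print(recovered)
```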
More XML: Oids and References
<person id="o555"> <name> Jane </name> </person>
<person id="o456"> <name> Mary </name>
<children idref="o123 o555"/>
</person>
<person id="o123" mother="o456"><name>John</name>
</person>
oids and references in XML are just syntax
More XML: CDATA Section
• Syntax: <![CDATA[ .....any text here...]]>
• Example:
<example> <![CDATA[ some text here </notAtag> <>]]>
</example>
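A short Python check of what a parser does with the CDATA example, using `xml.etree.ElementTree`. Everything inside the section is delivered as plain character data, including the text that looks like a tag.

```python
import xml.etree.ElementTree as ET

DOC = "<example><![CDATA[ some text here </notAtag> <>]]></example>"

example = ET.fromstring(DOC)
text = example.text          # the CDATA wrapper is gone, the content is intact
child_count = len(example)   # "</notAtag>" was never treated as markup

print(repr(text), child_count)
```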
[Figure: using a CDATA section]
More XML: Entity References
• Syntax: &entityname;
• Example: <element> this is less than &lt; </element>
• Some entities:
&lt; <
&gt; >
&amp; &
&apos; '
&quot; "
&#38; Unicode char
More XML: Processing Instructions
• Syntax: <?target argument?>
• Example:
<product> <name> Alarm Clock </name> <?ringBell 20?> <price> 19.99 </price> </product>
• What do they mean ?
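One way to approach the question is to look at what a parser hands to the application. In this Python sketch `xml.dom.minidom` exposes the processing instruction as a node with a target and a data string; what to do with `ringBell 20` is left entirely to the application. The list name `instructions` is an assumption.

```python
from xml.dom import Node, minidom

DOC = (
    "<product><name>Alarm Clock</name>"
    "<?ringBell 20?>"
    "<price>19.99</price></product>"
)

dom = minidom.parseString(DOC)

# Collect (target, argument) pairs for every processing instruction
# found directly under the root element.
instructions = [
    (node.target, node.data)
    for node in dom.documentElement.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```

The parser attaches no meaning to the instruction; it only passes the target and argument through.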
More XML: Comments
• Syntax <!-- .... Comment text... -->
• Yes, they are part of the data model !!!
XML Namespaces
• http://www.w3.org/TR/REC-xml-names (1/99)
• name ::= [prefix:]localpart
<book xmlns:isbn="www.isbn-org.org/def">
<title> … </title>
<number> 15 </number>
<isbn:number> …. </isbn:number>
</book>
XML Namespaces
• syntactic: <number> , <isbn:number>
• semantic: provide URL for schema
<tag xmlns:mystyle = "http://…">    (defined here)
…
<mystyle:title> … </mystyle:title>
<mystyle:number> …
</tag>
XML Namespaces
• Naming collisions
– Two different elements have same name
<subject>Math</subject>
<subject>Thrombosis</subject>
• Namespaces
– Differentiate elements that have same name
<school:subject>Math</school:subject>
<medical:subject>Thrombosis</medical:subject>
• school and medical are namespace prefixes
– Prepended to element and attribute names
– Tied to uniform resource identifier (URI)
» Series of characters for differentiating names
XML Namespaces
• Creating namespaces
– Use xmlns keyword
xmlns:text = "urn:deitel:textInfo"
xmlns:image = "urn:deitel:imageInfo"
• Creates two namespace prefixes text and image
• urn:deitel:textInfo is URI for prefix text
• urn:deitel:imageInfo is URI for prefix image
– Default namespaces
• Child elements of this namespace do not need prefix
xmlns = "urn:deitel:textInfo"
<?xml version = "1.0"?>

<!-- Fig. 5.8 : namespace.xml -->
<!-- Namespaces -->

<directory xmlns:text = "urn:deitel:textInfo"
   xmlns:image = "urn:deitel:imageInfo">

   <text:file filename = "book.xml">
      <text:description>A book list</text:description>
   </text:file>

   <image:file filename = "funny.jpg">
      <image:description>A funny picture</image:description>
      <image:size width = "200" height = "100"/>
   </image:file>

</directory>
Element directory declares two namespace prefixes.
Prefix text is used to describe elements file and description.
Prefix image is used to describe elements file, description and size.
<?xml version = "1.0"?>

<!-- Fig. 5.9 : defaultnamespace.xml -->
<!-- Using Default Namespaces -->

<directory xmlns = "urn:deitel:textInfo"
   xmlns:image = "urn:deitel:imageInfo">

   <file filename = "book.xml">
      <description>A book list</description>
   </file>

   <image:file filename = "funny.jpg">
      <image:description>A funny picture</image:description>
      <image:size width = "200" height = "100"/>
   </image:file>

</directory>
urn:deitel:textInfo is default namespace
Element file is in default namespace
Specify namespace
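The effect of the default namespace can be observed from a program. In this Python sketch `xml.etree.ElementTree` expands every element name to `{URI}localname`, so the unprefixed `file` and the prefixed `image:file` remain distinct. The prefix-to-URI mapping `ns` is chosen by the querying code and does not have to match the prefixes used in the document.

```python
import xml.etree.ElementTree as ET

DOC = """<directory xmlns="urn:deitel:textInfo"
           xmlns:image="urn:deitel:imageInfo">
  <file filename="book.xml">
    <description>A book list</description>
  </file>
  <image:file filename="funny.jpg">
    <image:description>A funny picture</image:description>
    <image:size width="200" height="100"/>
  </image:file>
</directory>"""

root = ET.fromstring(DOC)

# Prefixes used for searching; only the URIs have to match the document.
ns = {"text": "urn:deitel:textInfo", "image": "urn:deitel:imageInfo"}

text_file = root.find("text:file", ns)     # element in the default namespace
image_file = root.find("image:file", ns)   # element in the image namespace
width = image_file.find("image:size", ns).get("width")

print(root.tag)
print(text_file.get("filename"), image_file.get("filename"), width)
```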
XML Stylesheet
Extensible Stylesheet Language (XSL)
• Language for document transformation
Transformation
• Converting XML to another form
Formatting objects
• Layout of XML document
Defined by W3C
http://www.codeproject.com/Articles/294380/Applying-XSLT-Stylesheet-to-an-XML-File-at-Runtime
XML Path (XPath)
WHY
• To access particular parts of an XML document
• To navigate within an XML document
WHAT
• Analogous to the SELECT statement in SQL
HOW
• It views an XML document as a tree
• The root of the tree is a node which doesn’t correspond to anything in the document
• Internal nodes are elements
• Leaves are either attributes, text nodes or comments
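A few path expressions in action, as a Python sketch. `xml.etree.ElementTree` implements only a subset of XPath, which is enough to show child steps, the descendant axis and simple predicates. The bibliography below reuses the two books from the earlier HTML example; the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

BIB = """<bibliography>
  <book price="55">
    <title>Foundations of Databases</title>
    <author>Abiteboul</author><author>Hull</author><author>Vianu</author>
    <year>1995</year>
  </book>
  <book>
    <title>Data on the Web</title>
    <author>Abiteboul</author><author>Buneman</author><author>Suciu</author>
    <year>1999</year>
  </book>
</bibliography>"""

root = ET.fromstring(BIB)

# Child steps: the title of every book directly under the root.
titles = [t.text for t in root.findall("book/title")]

# Descendant axis: every author anywhere in the tree.
authors = [a.text for a in root.findall(".//author")]

# Predicates: books that carry a price attribute, and books from 1999.
priced = [b.findtext("title") for b in root.findall("book[@price]")]
from_1999 = [b.findtext("title") for b in root.findall("book[year='1999']")]

print(titles, len(authors), priced, from_1999)
```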
Xml path
XML Query
WHAT
XQuery can be used to:
• Extract information to use in a Web Service
• Generate summary reports
• Transform XML data to XHTML
• Search Web documents for relevant information
WHY
• Need to extract parts of XML documents (database)
• Need to transform documents into different forms
– Another XML form
– HTML (to display on a Web browser)
– Other (e.g. bibtex)
• Need to relate (join) parts of the same or different documents
HOW
• The XML-QL language
• XQuery, a W3C standard
• Very powerful, fairly intuitive, SQL-style
XML Query Data Model
• http://www.w3.org/TR/query-datamodel/ (2/2001)
• Describes XML as a tree, specialized nodes
• Uses a functional-style notation (think ML)
XML Query Data Model
• Node ::= DocNode | ElemNode | ValueNode | AttrNode | NSNode | PINode | CommentNode | InfoItemNode | RefNode
XML Query Data Model
Element node (simplified definition):
• elemNode : (QNameValue, {AttrNode}, [ElemNode | ValueNode]) → ElemNode
• QNameValue means “a tag name”
• {...} means “set of ...”
• [...] means “list of ...”
XML Query Data Model
• Reads: “give me a tag, a set of attributes, a list of elements/values, and I will return an element”
XML Query Data Model
Example
<book price = "55"
currency = "USD">
<title> Foundations … </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<year> 1995 </year>
</book>
book1 = elemNode(book, {price2, currency3}, [title4, author5, author6, author7, year8])
price2 = attrNode(…) /* next */
currency3 = attrNode(…)
title4 = elemNode(title, string9)
…
XML Query Data Model
Attribute node:
• attrNode : (QNameValue, ValueNode) → AttrNode
XML Query Data Model
Example
<book price = "55"
currency = "USD">
<title> Foundations … </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<year> 1995 </year>
</book>
price2 = attrNode(price, string10)
string10 = valueNode(…) /* next */
currency3 = attrNode(currency, string11)
string11 = valueNode(…)
XML Query Data Model
Value node:
• ValueNode = StringValue | BoolValue | FloatValue …
• stringValue : string → StringValue
• boolValue : boolean → BoolValue
• floatValue : float → FloatValue
XML Query Data Model
Example
<book price = "55"
currency = "USD">
<title> Foundations … </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<year> 1995 </year>
</book>
price2 = attrNode(price, string10)
string10 = valueNode(stringValue("55"))
currency3 = attrNode(currency, string11)
string11 = valueNode(stringValue("USD"))
title4 = elemNode(title, string9)
string9 = valueNode(stringValue("Foundations…"))
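The constructor-style notation maps naturally onto immutable record types. The Python sketch below mirrors `elemNode`, `attrNode` and `valueNode` with frozen dataclasses; it is an illustration of the simplified definitions on these slides, not an implementation of the W3C data model. The helper `text_elem` is an assumption, and the separate StringValue/BoolValue/FloatValue wrappers are collapsed into a plain Python value.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class ValueNode:
    value: Union[str, bool, float]


@dataclass(frozen=True)
class AttrNode:
    name: str                     # stands in for QNameValue
    value: ValueNode


@dataclass(frozen=True)
class ElemNode:
    name: str                                            # the tag name
    attributes: FrozenSet[AttrNode]                      # {...}: a set, unordered
    children: Tuple[Union["ElemNode", ValueNode], ...]   # [...]: a list, ordered


def valueNode(value):
    return ValueNode(value)


def attrNode(name, value):
    return AttrNode(name, value)


def elemNode(name, attrs, children):
    return ElemNode(name, frozenset(attrs), tuple(children))


def text_elem(name, text):
    """Hypothetical helper: an element with no attributes and one text value."""
    return elemNode(name, [], [valueNode(text)])


book1 = elemNode(
    "book",
    {attrNode("price", valueNode("55")), attrNode("currency", valueNode("USD"))},
    [
        text_elem("title", "Foundations …"),
        text_elem("author", "Abiteboul"),
        text_elem("author", "Hull"),
        text_elem("author", "Vianu"),
        text_elem("year", "1995"),
    ],
)

print([child.name for child in book1.children])
```

Using a set for attributes and a tuple for children captures the distinction the slides draw: attribute order carries no meaning, child order does.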
Semi-structured Data vs. XML
• both described best by a graph
• both are schema-less, self-describing
• Attributes ---> tags
• objects ---> elements
• atomic values ---> CDATA (characters)
• Order? Assumed in XML.
• XML attributes (fixable)
• References in XML.
Similarities and Differences
<person id="o123">
<name> Alan </name>
<age> 42 </age>
<email> ab@com </email>
</person>
{ person: &o123
{ name: "Alan",
age: 42,
email: "ab@com" }
}
[Figure: both forms drawn as a tree: person with children name, age and email holding Alan, 42 and ab@com; a second pair of drawings adds a father edge]
<person father="o123"> …</person>
{ person: { father: &o123 …}}
similar on trees, different on graphs
More Differences
• XML is ordered, ssd is not
• XML can mix text and elements:
<talk> Making Java easier to type and easier to type
<speaker> Phil Wadler </speaker>
</talk>
• XML has lots of other stuff: entities, processing instructions, comments
Very important: these differences make XML data management harder
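Mixed content is one place where the extra difficulty shows up in code. In this Python sketch `xml.etree.ElementTree` parses the talk example: the element has both text of its own and a child element, so a program has to look at both. Whitespace inside the example has been trimmed.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<talk>Making Java easier to type and easier to type"
    "<speaker>Phil Wadler</speaker></talk>"
)

talk = ET.fromstring(DOC)

own_text = talk.text                      # text that precedes the first child
speaker = talk.find("speaker").text       # text inside the nested element
all_text = " | ".join(talk.itertext())    # document order is preserved

print(own_text)
print(speaker)
print(all_text)
```

A semi-structured object would hold either an atomic value or a set of labeled children; here the talk element holds both at once, and their relative order matters.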