Copyright 2001, Ronald Bourret, http://www.rpbourret.com
Native XML Databases
  Ronald Bourret
  [email protected]
  http://www.rpbourret.com
Slide 2
Overview
  What is a native XML database?
  Native XML database architectures
  When should I use a native XML database?
  Normalization, referential integrity, scalability, and performance
  Native XML database features
Slide 3
What is a Native XML Database?
Slide 4
Blame Software AG
  Software AG coined the term native XML database...
  ... and used it to market Tamino...
  ... without ever defining it
  For a long time
    Everybody knew Tamino was a native XML database
    Nobody knew what Tamino did or how it worked
Slide 5
What is a native XML database?
  A database that stores XML documents as XML
  Defines a (logical) model for an XML document
  Fundamental unit of (logical) storage is a document
  Can have any physical storage
Slide 6
Example: Storing a sales order
  [Figure: one sales order (number 1234, Gallagher Industries, 29.10.00; items A-10, 12, 10.95 and B-43, 600, 3.99) stored three ways]
    Store data: rows in the tables Orders, Items, Customers, and Parts
    Store documents as text: the XML document itself
    Store documents as DOM objects: a tree of Element, Attr, and Text nodes
Slide 7
Logical model of XML document
  Must include elements, attributes, PCDATA, and document order
  Examples are XPath data model, XML Infoset, DOM, and model implied by SAX 1.0
  Documents stored and retrieved according to the model
Slide 8
Fundamental unit of storage
  Fundamental unit of (logical) storage is a document
  Equivalent structure in a relational database is a row
  Document usually contains single set of data
  In future, unit of storage could be a fragment
Slide 9
Physical storage
  Can have any physical storage
  For example, can be built on a relational, hierarchical, or object-oriented database...
  ... or use a proprietary storage format such as indexed, compressed files
Slide 10
Native XML Database Architectures
Slide 11
Text-based storage
  Stores documents as text
  Can use file system, BLOB, proprietary storage, etc.
  XML-aware text engine in RDBMS is a native XML database
  Uses indexes heavily
Slide 12
Text-based storage
  [Figure: an address document (123 Main St., Chicago, IL 60609, USA) stored as text]
Slide 13
Text-based databases
  Indexed files
    TextML
  Proprietary
    GoXML DB
Slide 14
Model-based storage
  Stores documents according to a specific model
  For example, maps DOM to relational database
  Underlying storage can be relational, object-oriented, hierarchical, or proprietary
Slide 15
Model-based storage
  [Figure: the address document (123 Main St., Chicago, IL 60609, USA) stored as a tree of Element nodes, each leaf Element holding a Text node]
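A minimal sketch of the "maps DOM to relational database" idea, using Python's standard sqlite3 and xml.etree.ElementTree modules. The element names (Address, Street, and so on), the nodes table, and the function names are assumptions made for illustration, since the tags are not recoverable from the slide. The sketch stores elements and their text only; attributes and mixed content are left out, so it is not a complete model.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical markup: the slide shows only the values, not the tag names.
ADDRESS_XML = (
    "<Address><Street>123 Main St.</Street><City>Chicago</City>"
    "<State>IL</State><PostCode>60609</PostCode><Country>USA</Country></Address>"
)

def store_document(conn, doc_id, xml_text):
    """Store each element as one row; node_id is assigned in document order."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nodes ("
        "doc_id TEXT, node_id INTEGER, parent_id INTEGER, tag TEXT, text TEXT)"
    )
    counter = 0

    def walk(elem, parent_id):
        nonlocal counter
        counter += 1
        node_id = counter
        conn.execute(
            "INSERT INTO nodes VALUES (?, ?, ?, ?, ?)",
            (doc_id, node_id, parent_id, elem.tag, (elem.text or "").strip()),
        )
        for child in elem:
            walk(child, node_id)

    walk(ET.fromstring(xml_text), None)

def load_document(conn, doc_id):
    """Rebuild the tree from the rows; ordering by node_id restores document order."""
    rows = conn.execute(
        "SELECT node_id, parent_id, tag, text FROM nodes "
        "WHERE doc_id = ? ORDER BY node_id",
        (doc_id,),
    ).fetchall()
    elems, root = {}, None
    for node_id, parent_id, tag, text in rows:
        elem = ET.Element(tag)
        elem.text = text or None
        elems[node_id] = elem
        if parent_id is None:
            root = elem
        else:
            elems[parent_id].append(elem)
    return root

conn = sqlite3.connect(":memory:")
store_document(conn, "addr-1", ADDRESS_XML)
restored = ET.tostring(load_document(conn, "addr-1"), encoding="unicode")
```

The document goes in and comes out as a document, even though the physical storage is a relational table. That is the point of the model-based architecture.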
When Should I Use a Native XML Database?
Slide 18
Storing document-centric documents
  Saves physical info (entity references, CDATA, etc.)
  Stores document ID / name
  Supports document-centric queries
    Retrieve the first section containing a list in the third chapter
    Retrieve the headings of all chapters that contain hyperlinks
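The two document-centric queries above can be sketched with Python's xml.etree.ElementTree. The book markup (book, chapter, section, title, list, link) is invented for illustration, and a real native XML database would answer these through its own query language rather than application code.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are assumptions for this sketch.
BOOK_XML = """
<book>
  <chapter><title>Intro</title>
    <section><para>Plain text.</para></section>
  </chapter>
  <chapter><title>Storage</title>
    <section><para>See <link href="http://www.rpbourret.com">this</link>.</para></section>
  </chapter>
  <chapter><title>Queries</title>
    <section><para>No list here.</para></section>
    <section><title>Languages</title><list><item>XPath</item></list></section>
  </chapter>
</book>
"""

root = ET.fromstring(BOOK_XML)
chapters = root.findall("chapter")

# Headings of all chapters that contain hyperlinks at any depth.
headings = [ch.findtext("title") for ch in chapters if ch.find(".//link") is not None]

# First section containing a list in the third chapter.
third_chapter = chapters[2]
first_with_list = next(
    s for s in third_chapter.findall("section") if s.find(".//list") is not None
)
```

Both queries depend on document order and nesting ("third chapter", "first section"), which is why they are awkward to express against data that has been shredded into tables.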
Slide 19
Natural format is XML
  XHTML, DocBook, etc.
  Data stored temporarily as XML
    For example, in a message queue
  Common format of many documents is XML
    For example, Web search engine database
Slide 20
Retrieval speed is critical
  One hierarchical view must predominate
    Happens today: 15 billion gigabytes of data in IMS
    Relational queries are hierarchy-neutral
  Speed depends on:
    Query
    Underlying storage engine
    Output format (DOM, SAX, string)
Slide 21
Semi-structured data
  Structure is present, but not regular like tabular data
    For example, genealogical records or patient records
  Difficult to store in a relational database
    Choice is many tables or many nulls
  Structure might not be known at design time
Slide 22
Well-formed documents
  No known schema
  Best example is documents stored by Web search engine
  Storing the data in such documents in a relational database is very inefficient
    Tables and mappings must be created at run-time
Slide 23
Normalization, Referential Integrity, Scalability, and Performance
Slide 24
Normalization
  Means that a given piece of data appears only once
    Reduces disk usage
    Reduces potential update errors
  Fundamental concept of relational databases
Slide 25
Normalization and native XML databases
  Concept same as in relational database
  Only difference is database model
    Relational tables are flat, can only store single values
    XML documents are hierarchical, can store multiple values
  Not required
Slide 26
Example: Sales order
  Requires two tables in RDBMS
  Can store in a single document in native XML database
  Both are normalized
  [Figure: relational database vs. XML document]
    Relational database, Orders table: (1234, 29.10.00, Gallagher Industries)
    Relational database, Items table: (1234, 1, A-10, 12, 10.95), (1234, 2, B-43, 600, 3.99)
    XML document: order 1234, Gallagher Industries, 29.10.00, with items (A-10, 12, 10.95) and (B-43, 600, 3.99)
Slide 27
Problem: Real sales order
  Real world not that simple
  Sales order probably contains customer information
    ID, name, bill-to address, ship-to address, etc.
  [Figure: XML sales order 1234 with embedded customer information (020962, Gallagher Industries, ...), date 29.10.00, and items (A-10, 12, 10.95) and (B-43, 600, 3.99)]
Slide 28
Solutions: Real sales order
  Normal: Store customer info in separate file
    Use XLinks or joins
    XLinks not widely supported (will be in future?)
    If normalized and flat, might as well use relational database
  Non-normal: Store customer info in each sales order
    Gains retrieval speed at the cost of query flexibility and more complex updates
    Real-world relational databases often not normal
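A minimal sketch of the "normal" solution, where customer information lives in a separate document and is joined to the sales order by customer ID. The element and attribute names (Customers, Customer, ID, CustomerID) are assumptions for illustration. The values come from the sales order example, and the join is done in application code with Python's xml.etree.ElementTree rather than with XLinks or a database query language.

```python
import xml.etree.ElementTree as ET

# Customer information stored once, in its own document (hypothetical markup).
customers = ET.fromstring(
    '<Customers><Customer ID="020962">'
    "<Name>Gallagher Industries</Name></Customer></Customers>"
)

# The sales order carries only the customer ID, not the customer details.
order = ET.fromstring(
    '<SalesOrder Number="1234"><CustomerID>020962</CustomerID>'
    "<Date>29.10.00</Date></SalesOrder>"
)

# Index customers by ID so the join is a dictionary lookup.
customer_by_id = {c.get("ID"): c for c in customers.iter("Customer")}

def customer_name(order_elem):
    """Join a sales order to its customer document via the customer ID."""
    return customer_by_id[order_elem.findtext("CustomerID")].findtext("Name")

name = customer_name(order)
```

The non-normal alternative would copy the Customer element into every sales order, which removes the lookup but means a change of customer name must be applied to every order.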
Slide 29
Normalization and document-centric documents
  Often not worth doing
    For example, in a collection of user manuals
    Each contains copyright, company logo, company address
    Duplicate information not worth normalizing
  Matters only when there is significant overlap
    Procedures common to many models of same product
    List of worldwide customer support contacts
    ...
Slide 30
Referential integrity
  Refers to validity of pointers to other data
    For example, PartNumber in Items points to valid row in Parts
  Applies to XLinks and external entity references
    XLinks generally not supported => not an issue
    Probably not enforced for external entity references
  Needs support in the future
Slide 31
Scalability and performance
  Outside my area of expertise
  Native XML databases appear to scale / perform
    Much better than relational databases when retrieving whole documents or fragments
    Much worse than relational databases when retrieving unindexed data
    Slower(?) than relational databases when retrieving views of indexed data that don't follow the storage hierarchy
  Benchmark data not yet available
Slide 32
Whole documents or fragments
  Text-based databases are very fast
    Data is contiguous on disk
    Retrieval requires index lookup and single disk read
  [Figure: 1. Index lookup 2. Position disk head 3. Read to here]
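The three retrieval steps can be sketched as follows. TextStore is a hypothetical class, not the design of any particular product, and an in-memory io.BytesIO stands in for a file on disk. The sketch only illustrates why contiguous text plus an offset index makes whole-document retrieval cheap.

```python
import io

class TextStore:
    """Append-only store: each document is contiguous bytes plus an index entry."""

    def __init__(self):
        self._data = io.BytesIO()  # stands in for a file on disk (assumption)
        self._index = {}           # document ID -> (offset, length)

    def put(self, doc_id, xml_text):
        raw = xml_text.encode("utf-8")
        offset = self._data.seek(0, io.SEEK_END)  # append at the end
        self._data.write(raw)
        self._index[doc_id] = (offset, len(raw))

    def get(self, doc_id):
        offset, length = self._index[doc_id]            # 1. index lookup
        self._data.seek(offset)                         # 2. position the "disk head"
        return self._data.read(length).decode("utf-8")  # 3. one contiguous read

store = TextStore()
store.put("order-1234", "<Order><Date>29.10.00</Date></Order>")
store.put("address-1", "<Address><City>Chicago</City></Address>")
order_text = store.get("order-1234")
```

No parsing happens on retrieval, which is why text-based storage is fast when the caller wants the document back as a string.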
Slide 33
Whole documents or fragments (cont.)
  Model-based databases with proprietary storage are fast
    Generally use physical pointers between nodes
  Model-based databases built on other DBs may be fast
    Depends on underlying database and implementation strategy
  [Figure: tree of nodes. 1. Index lookup 2. Position disk head 3. Follow pointers to here]
Slide 34
Views not following storage hierarchy
  Slower than hierarchical views?
    May require many index lookups or linear searches
    Pointers to parent nodes should help in model-based databases
  Relational databases are query neutral
  [Figure: sales order 1234 (Gallagher Industries, 29.10.00; items A-10, 12, 10.95 and B-43, 600, 3.99)]
    Get the dates of all sales orders for part A-10
    1. Index lookup for part A-10
    2. Follow pointers to Order?
    3. Search children for Date?
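The three steps for "get the dates of all sales orders for part A-10" can be sketched as below. This assumes a model-based store that keeps parent pointers; xml.etree.ElementTree does not, so the sketch builds a parent map by hand. The element names and the second order (1235, dated 30.10.00) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Order 1234 follows the slide; order 1235 and all tag names are hypothetical.
ORDERS_XML = """
<Orders>
  <SalesOrder Number="1234">
    <Customer>Gallagher Industries</Customer>
    <Date>29.10.00</Date>
    <Item><Part>A-10</Part><Quantity>12</Quantity><Price>10.95</Price></Item>
    <Item><Part>B-43</Part><Quantity>600</Quantity><Price>3.99</Price></Item>
  </SalesOrder>
  <SalesOrder Number="1235">
    <Customer>Gallagher Industries</Customer>
    <Date>30.10.00</Date>
    <Item><Part>B-43</Part><Quantity>5</Quantity><Price>3.99</Price></Item>
  </SalesOrder>
</Orders>
"""

root = ET.fromstring(ORDERS_XML)

# Parent pointers: child element -> parent element.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Value index on Part elements: part number -> list of Part elements.
part_index = {}
for part in root.iter("Part"):
    part_index.setdefault(part.text, []).append(part)

def order_dates_for_part(part_number):
    dates = []
    for node in part_index.get(part_number, []):  # 1. index lookup for the part
        while node.tag != "SalesOrder":           # 2. follow pointers up to the order
            node = parent_of[node]
        dates.append(node.findtext("Date"))       # 3. search children for Date
    return dates

a10_dates = order_dates_for_part("A-10")
```

The query runs against the grain of the storage hierarchy (from part up to order, then down to date), so it costs an upward walk and a child search per hit instead of a single read.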
Slide 35
Indexed data
  Native XML databases use indexes heavily
  Index lookup speed same as any database, but...
  ... more index lookups may be required than by RDBMS
  Update times slower due to index updates
Slide 36
Unindexed data
  Slow for model-based databases
    Must read all elements, not just elements of a particular type
    Comparisons slower due to converting text
  Very slow for text-based databases
    Must parse document as well as comparing values
  [Figure: Find date 29.10.00]
    Relational database:
      1. Search this column (the date column of the Orders table: 1234, 29.10.00, Gallagher Industries)
    Model-based native XML database:
      1. Search all elements for Date elements
      2. Search text for all Date elements
Slide 37
Query return types
  String, DOM tree, SAX events
  Text-based databases
    Very fast returning strings
    Slow returning DOM trees or SAX events due to parsing
  Model-based databases
    Probably similar speed to relational databases for all types
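The three return types can be illustrated with Python's standard library, using xml.dom.minidom for the DOM tree and xml.sax for SAX events. The Order document and the Recorder class are made up for this sketch. It shows only what a caller receives in each case, not how any database produces it.

```python
import xml.dom.minidom
import xml.sax

DOC = "<Order><Date>29.10.00</Date></Order>"  # hypothetical stored document

# String: a text-based store can hand back the stored text without parsing.
as_string = DOC

# DOM tree: the text must be parsed into a tree of nodes first.
dom = xml.dom.minidom.parseString(DOC)
date_from_dom = dom.getElementsByTagName("Date")[0].firstChild.data

# SAX events: the parser reports start, text, and end events in document order.
class Recorder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))

recorder = Recorder()
xml.sax.parseString(DOC.encode("utf-8"), recorder)
```

For a text-based store, the DOM and SAX cases both pay for a parse on every retrieval, which is the slowdown the slide describes.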
Slide 38
Native XML Database Features
Slide 39
Document Collections
  Contain related documents
  Similar to
    Catalog/schema in relational database
    Directory in file system
  Some databases allow nested collections
Slide 40
Indexes
  All databases use indexes
  Some databases index everything
  Other databases allow user to specify what to index
Slide 41
Query Languages
  XPath and XQL are most common
    Usually include extensions for multi-document queries
  Many databases have proprietary languages
  XQuery will probably be standard in the future
Slide 42
Updates
  Many databases simply replace existing document
  Some databases allow updates through live DOM
  Other databases have fragment update language
  Best way to do updates still unclear
Slide 43
Transactions, Locking, and Concurrency
  Most databases support transactions
  Locking often at document (not fragment) level
  Whether this is an issue depends on
    What is stored in a single document
    Number of concurrent users
  Fragment locking probably more common in future
Slide 44
APIs
  Most databases have proprietary APIs
  XML:DB is database-neutral API
    Standard API (XML:DB or other) likely in future
  APIs similar to ODBC
    Query language is separate from API
    Methods to connect, execute queries, retrieve results, commit transactions
  Results returned as single document or set of documents
  Documents returned as string, DOM tree, or SAX events
  Most databases support HTTP
Slide 45
Round-tripping
  All native XML databases can round-trip documents
  Round-trip level depends on database
    Text-based databases usually do exact round-tripping
    Model-based databases round-trip at level of model
    Minimum is elements, attributes, PCDATA, and document order
    May be less than canonical XML (comments and processing instructions discarded)
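A small sketch of the difference between exact and model-level round-tripping. Python's xml.etree.ElementTree stands in for a model-based store here because its default parser keeps elements, attributes, and text but drops comments. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document containing a comment.
original = '<Order Number="1234"><!-- rush --><Date>29.10.00</Date></Order>'

# Text-based store: the stored text is returned unchanged.
text_round_trip = original

# Model-based store: only what the model keeps comes back out.
# ElementTree's default parser discards comments, so the comment is lost.
model_round_trip = ET.tostring(ET.fromstring(original), encoding="unicode")

comment_survived = "rush" in model_round_trip
```

Whether this loss matters depends on the documents. For data-centric orders it usually does not, while for document-centric content a comment or processing instruction may carry meaning.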
Slide 46
External data
  Some databases can merge data from external databases, such as with ODBC, OLE DB, JDBC
  Whether data is live depends on database
  In the future, most databases will probably support live external data
Slide 47
External entity storage
  Not clear whether to store entity or URI
    Storing entity value is incorrect if URI points to live data
    Storing URI may be incorrect if entity meant as a snapshot
  Not sure how databases handle this problem
  Correct answer is probably to let user decide
Slide 48
Resources
Slide 49
Resources
  Ronald Bourret's Papers Page
    http://www.rpbourret.com/xml/index.htm
  XML:DB.org's Resources Page
    http://www.xmldb.org/resources.html
  XML:DB Mailing List
    http://www.xmldb.org/projects.html
Slide 50
Questions?
  Ronald Bourret
  [email protected]
  http://www.rpbourret.com