Getting Data out of XML Documents

Bálint Joó

School of Physics, University of Edinburgh

May 02, 2003

Contents

In search of a simple API for accessing DOM

The multiple tag problem

What is it?

Is it a problem for us?

How can we get around it?

XPath

What is easy to parse?

Software: XPathReader package

Conclusions

Motivation (Starting Points)

Lack of free data-binding tools for C/C++

Desire to read ILDG Metadata documents, marshal application data

=> Have to write our own tools

Would like simple API to get at document data

Would like same API to cope with ILDG metadata AND application data.

We got as far as reading into a DOM.

Start With Simple Idea

Consider simple API with functions

push(tagname) -- select tag with name tagname

pop() -- move up a level

getType( tagname , result )

Type = string | float | double | int | bool;

Equivalent API: directory like structure with no absolute paths:

cd(tagname) = push(tagname) , cd(..) = pop()

Simple data: no attributes, no namespaces, no empty elements.

Example

<?xml version="1.0"?>
<foo>
  <bar>String</bar>
  <fred>5.0</fred>
</foo>

open("file.xml");
push("foo");
string bar;
getString("bar", bar);
double fred;
getDouble("fred", fred);
pop();

So far so good: nice and simple

Current UKQCD Schema has no attributes/namespaces

Empty tags serve no purpose except as placeholders

BUT Soon we encounter...

The Multiple Tag Problem

Consider the following snippet:

<size>
  <axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension> 2 </dimension>
    <length>16 </length>
  </axis>
</size>

Let's try our API: push("size");

But what does push("axis"); do?

Multiple Tag Problem (cont'd)

push("axis") could select in document order

We could add an index to push("axis")

push("axis", 1) push("axis", 2)

We could add an index attribute to <axis>

<axis index="1"> <axis index="2">

But then we'd need a mechanism to match index attribute

We could change the names of axis:

<axis1> <axis2>

We could put the different <axis> into different namespaces -- effectively same as adding attribute

We could try and match the <dimension> tag.

The consequences

Changing tagnames for simplicity of parsing just seems wrong

Matching the <dimension> tag is not possible without first selecting an <axis> in our scheme (locality)

Adding attributes/namespaces complicates API.

This use of different namespaces would be philosophically wrong.

Adding an order-of-occurrence index to the API is cleanest

No need to change Schema, Instance documents etc.

Document ordering removes random access capability
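The indexed push() can be sketched as follows. This is only an illustration under stated assumptions, not the UKQCD code: Element, LocalReader and makeSizeExample are hypothetical names, the tree is built by hand instead of being parsed from XML, and the cursor starts on the <size> element.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory element: a name, text content and child elements.
struct Element {
    std::string name;
    std::string text;
    std::vector<std::unique_ptr<Element>> children;

    Element* add(const std::string& childName, const std::string& childText = "") {
        children.push_back(std::make_unique<Element>());
        children.back()->name = childName;
        children.back()->text = childText;
        return children.back().get();
    }
};

// Hypothetical local cursor: push/pop/getString with a 1-based occurrence index.
class LocalReader {
public:
    explicit LocalReader(const Element& root) : current_(&root) {}

    // Descend into the occurrence-th child (document order) called name.
    void push(const std::string& name, std::size_t occurrence = 1) {
        const Element& child = find(name, occurrence);
        stack_.push_back(current_);
        current_ = &child;
    }

    // Move back up one level.
    void pop() {
        assert(!stack_.empty() && "pop() called at the top level");
        current_ = stack_.back();
        stack_.pop_back();
    }

    // Read the text of the occurrence-th child called name.
    void getString(const std::string& name, std::string& result,
                   std::size_t occurrence = 1) const {
        result = find(name, occurrence).text;
    }

private:
    const Element& find(const std::string& name, std::size_t occurrence) const {
        std::size_t seen = 0;
        for (const auto& child : current_->children) {
            if (child->name == name && ++seen == occurrence) return *child;
        }
        throw std::out_of_range("no such occurrence of <" + name + ">");
    }

    const Element* current_;
    std::vector<const Element*> stack_;
};

// Builds the two-axis <size> example by hand.
Element makeSizeExample() {
    Element size;
    size.name = "size";
    for (int d = 1; d <= 2; ++d) {
        Element* axis = size.add("axis");
        axis->add("dimension", std::to_string(d));
        axis->add("length", "16");
    }
    return size;
}
```

With this sketch, push("axis", 2) followed by getString("dimension", dim) reads the second axis. The occurrence index still depends on document order, which is the limitation discussed next.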

In General

For less simple (more general) XML documents duplicate tags can be distinguished by:

Occurrence order

Name

Attributes

Content

Namespace

An ideal, simple API should allow matching on all of these to interrogate any XML document.

What about Locality?

push(namespace, tagname, attributes, occurrence)

getType(ns, tagname, attributes, occurrence, result)

But NO local parser can match on element content:

need to open a tag based on the value of its content

BUT can't get to the content without opening the tag.

<size>
  <num_dimensions>2</num_dimensions>
  <axis>
    <dimension>2</dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension>1</dimension>
    <length>16</length>
  </axis>
</size>

Document order may not help here: the axes appear in reverse order, yet the Schema is still satisfied.

Would like to match on the <dimension> tag

Need to abandon locality

Lesson

In order to avoid ambiguity we must

Restrict the form of markup we deal with

Force decisions onto our Schema writers

OR complicate our API

rely on tag ordering (either implicitly or explicitly)

introduce attributes (forcing decision on Schema writers)

give up locality in the API

Global Queries: XPath

Would like a nice way to encode

tag name

attributes

order of occurrence

attribute/content matching predicates

Can this be done?

YES! Using XPath

XPath Axes

Context node: .

Parent axis: ..

Attribute axis: @

Child axis: ./name, or simply name (child is the default axis)

Following-sibling axis (no abbreviated syntax)

Preceding-sibling axis (no abbreviated syntax)

XPath axes specify coordinates for the DOM.

Some axes can include more than one node:

ancestor: the parent and all of its ancestors

XPath Selectors

tagname selects all children of current node called tagname

* selects all children of node

@name selects all attribute nodes called name

@* selects all attribute nodes of the current node

name[i] selects the i-th occurrence (counting from 1) of a child node called name

.. selects the parent of the current node

//name selects every element called name, whatever its ancestors

XPath Examples

<?xml version="1.0"?>
<size>
  <axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension> 2 </dimension>
    <length>16 </length>
  </axis>
</size>

XPath Query: /

Selection: the document root (the node that contains <size>)

XPath Examples

XPath Query: /size

Selection: the <size> element (same document as on the previous slide)

XPath Examples

XPath Query: /size/axis OR /size/* OR //axis

Selection: both <axis> elements (same document as on the previous slides)

XPath Examples

<?xml version="1.0"?>
<size>
  <axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension>2</dimension>
    <length>16 </length>
  </axis>
</size>

XPath Query: /size/axis[2] (query on order of occurrence)

OR

XPath Query: /size/axis[dimension="2"] (query on element content)

Selection: the second <axis> element

XPath Examples

<?xml version="1.0"?>
<size xmlns:bj="http://fred.org">
  <bj:axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </bj:axis>
  <axis index="2">
    <dimension> 2 </dimension>
    <length>16 </length>
  </axis>
</size>

XPath Query: /size/bj:axis

Selection: the <bj:axis> element

Supports namespaces

XPath Examples

XPath Query: /size/axis[@index="2"] (attribute matching)

Selection: the <axis index="2"> element (same document as on the previous slide)

Visit http://www.zvon.org/xxl/XPathTutorial for more ...

XPath Notes

Can return sets of nodes - not just unique node

Has more features:

Functions to turn query results into strings, numbers, booleans

Encodes all features we need

C/C++ linkable XPath Processors exist

Xerces, Xalan, libxml

Solves all our reader API problems in nice way.

XPath Based Reader API

Basic Functions:

open(file/stream);

getType(xpath_string, result);

getAttributeType(xpath_string, attributeName, result);

Semantics: The xpath_string must identify a unique node.
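The "unique node" rule can be illustrated without a real XPath engine. In the sketch below, convertText and getUnique are hypothetical helpers, and the vector of strings stands in for the text of every node a query matched. A real reader would obtain those matches from its XPath processor.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Converts node text to T with stream extraction; leading whitespace is skipped.
template <typename T>
T convertText(const std::string& text) {
    std::istringstream in(text);
    T value{};
    if (!(in >> value)) throw std::runtime_error("cannot convert '" + text + "'");
    return value;
}

// Strings are returned whole rather than cut at the first space.
template <>
inline std::string convertText<std::string>(const std::string& text) {
    return text;
}

// Hypothetical getType core: accept the query result only if exactly one node matched.
template <typename T>
void getUnique(const std::vector<std::string>& matches, T& result) {
    if (matches.size() != 1) {
        throw std::runtime_error("query matched " + std::to_string(matches.size()) +
                                 " nodes, expected exactly 1");
    }
    result = convertText<T>(matches.front());
}

// Usage: one match succeeds, two matches are rejected.
inline void exampleUsage() {
    const std::vector<std::string> one = {"16 "};        // e.g. /size/axis[2]/length
    const std::vector<std::string> two = {"16", "16 "};  // e.g. /size/axis/length
    int length = 0;
    getUnique(one, length);
    assert(length == 16);
    bool rejected = false;
    try {
        getUnique(two, length);
    } catch (const std::runtime_error&) {
        rejected = true;
    }
    assert(rejected);
}
```

Rejecting ambiguous queries, instead of silently taking the first match, keeps the result independent of document order.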

What is Easy to Parse?

Stylistic discussion on Metadata Mailing list.

One particular question:

"How should we mark up things?"

Chris' Way:

<size>
  <dimensions>4</dimensions>
  <axis>
    <name>X</name>
    <length>16</length>
  </axis>
  <axis>
    <name>Y</name>
    <length>16 </length>
  </axis>
</size>

Tomoteru's Way:

<size>
  <x value="16"/>
  <y value="16"/>
  <z value="16"/>
  <t value="32"/>
</size>

Known as the "Element vs. Attribute" debate in the XML world.

What is Easy to Parse?

One statement is that the attribute way is perhaps easier to parse?

With XPath, both ways are easy to parse.

To get the length of the x dimension:

Chris' Way:

number(//size/axis[normalize-space(string(name))="X"]/length)

getInt("//size/axis[normalize-space(string(name))=\"X\"]/length", intValue);

Tomoteru's Way:

number(//size/x/@value)

getIntAttribute("//size/x", "value", intValue);

Chris' Way has more complex query. But equally simple API Call.

Element vs. Attribute Debate (aside)

Looked on Web

Tomoteru's way is preferred in general by object modellers (eg. database people)

Mark up most "atomic" data as attributes

Use tags to indicate "table structure"

Chris' way is perhaps preferred by archivists or librarians (Go Kim!)

Decide for yourself, a discussion is available at:

http://www.oasis-open.org/cover/elementsAndAttrs.html

Found no universally accepted best practice.

Software: XPathReader

Wrote software to implement XPath Reader API in C++

Wraps around free libxml2 (C) library

Uses overloading and templating

Two Classes:

BasicXPathReader:

Use XPath to get at basic C++ types (ints, std::strings, etc)

XPathReader

Allows reading of Complex Numbers and Arrays.

XPathReader Class Public Members

open/close functions:

void open(istream& is);
void close(void);

count results of XPath Query:

int countXPath(const string& xpath_query);

get value of attribute from node identified by XPath:

template <typename T>
void getXPathAttribute(const string& xpath_to_node, const string& attribute_name, T& result);

get value of node identified by XPath:

template <typename T>
void getXPath(const string& xpath, T& result);

Complex Numbers and Arrays

XPathReader Library provides Classes for Complex Numbers and Arrays:

template<typename T> class TComplex { ... };

template<typename T> class Array { ... };

Can have Complex numbers of arrays

Eg for storing real/imaginary parts of arrays:

TComplex< Array< double > >

Can also have Complex-es templated on string-s

Mathematically not sensible...

Complex Number Markup & Marshal

Invented simple mark up:

<foo>
  <cmpx>
    <re>real part</re>
    <im>imag part</im>
  </cmpx>
</foo>

Can maintain API through C++ function overloading and recursion:

template <typename T>
void getXPath(const string& path, TComplex<T>& result) {
  getXPath(path + "/cmpx/re", result.real());
  getXPath(path + "/cmpx/im", result.imag());
}

similar but slightly more involved for Array.
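The overloading trick can be tried in isolation. The sketch below is an assumption-laden stand-in, not the XPathReader source: MockReader is a hypothetical class that looks up exact path strings in a map instead of evaluating XPath, and this TComplex is a minimal substitute whose real() and imag() return references.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Minimal stand-in for TComplex: real()/imag() return references to be filled in place.
template <typename T>
class TComplex {
public:
    T& real() { return re_; }
    T& imag() { return im_; }

private:
    T re_{};
    T im_{};
};

// Hypothetical mock reader: maps an exact path string to node text.
class MockReader {
public:
    explicit MockReader(std::map<std::string, std::string> nodes)
        : nodes_(std::move(nodes)) {}

    // Basic types: look the path up and parse the text with stream extraction.
    template <typename T>
    void getXPath(const std::string& path, T& result) const {
        const auto it = nodes_.find(path);
        if (it == nodes_.end()) throw std::runtime_error("no node at " + path);
        std::istringstream in(it->second);
        if (!(in >> result)) throw std::runtime_error("bad value at " + path);
    }

    // Complex types: same name, picked by overload resolution; recurses into both parts.
    template <typename T>
    void getXPath(const std::string& path, TComplex<T>& result) const {
        getXPath(path + "/cmpx/re", result.real());
        getXPath(path + "/cmpx/im", result.imag());
    }

private:
    std::map<std::string, std::string> nodes_;
};

// Usage: the mock stands for <foo><cmpx><re>1.5</re><im>-2</im></cmpx></foo>.
inline void exampleUsage() {
    std::map<std::string, std::string> nodes;
    nodes["/foo/cmpx/re"] = "1.5";
    nodes["/foo/cmpx/im"] = "-2";
    const MockReader reader(nodes);
    TComplex<double> z;
    reader.getXPath("/foo", z);
    assert(z.real() == 1.5 && z.imag() == -2.0);
}
```

The caller writes the same getXPath call for a double and for a TComplex<double>. The compiler picks the more specialised overload, which is what keeps the API unchanged.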

Array Markup

Arrays were marked up as follows:

<foo>
  <array sizeName="size" elemName="el" indexName="idx" indexStart="x">
    <size>N</size>
    <el idx="x"> element[0] </el>
    <el idx="x+1"> element[1] </el>
    ...
    <el idx="x+N-1"> element[N-1] </el>
  </array>
</foo>

This is a general mark up -- suitable for local parsers too
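One way a reader might walk this markup is sketched below. The assumptions are loud ones: MockDocument, getValue and getArray are hypothetical names, the "document" is a map from exact path strings to text rather than a parsed XML tree, indexStart is taken to be an integer, and std::vector stands in for the library's Array class.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical mock: exact path string -> text. A real reader would evaluate the XPath.
using MockDocument = std::map<std::string, std::string>;

template <typename T>
T getValue(const MockDocument& doc, const std::string& path) {
    const auto it = doc.find(path);
    if (it == doc.end()) throw std::runtime_error("no node at " + path);
    std::istringstream in(it->second);
    T value{};
    if (!(in >> value)) throw std::runtime_error("bad value at " + path);
    return value;
}

// Reads the <array> child of path, using its attributes to locate size and elements.
template <typename T>
std::vector<T> getArray(const MockDocument& doc, const std::string& path) {
    const std::string array = path + "/array";
    const auto sizeName = getValue<std::string>(doc, array + "/@sizeName");
    const auto elemName = getValue<std::string>(doc, array + "/@elemName");
    const auto indexName = getValue<std::string>(doc, array + "/@indexName");
    const long indexStart = getValue<long>(doc, array + "/@indexStart");
    const long size = getValue<long>(doc, array + "/" + sizeName);
    if (size < 0) throw std::runtime_error("negative array size at " + array);

    std::vector<T> result;
    for (long i = 0; i < size; ++i) {
        // Select each element by its index attribute, not by document order.
        const std::string index = std::to_string(indexStart + i);
        result.push_back(getValue<T>(
            doc, array + "/" + elemName + "[@" + indexName + "=\"" + index + "\"]"));
    }
    return result;
}

// Usage: a two-element array with indices starting at 1.
inline void exampleUsage() {
    MockDocument doc;
    doc["/foo/array/@sizeName"] = "size";
    doc["/foo/array/@elemName"] = "el";
    doc["/foo/array/@indexName"] = "idx";
    doc["/foo/array/@indexStart"] = "1";
    doc["/foo/array/size"] = "2";
    doc["/foo/array/el[@idx=\"1\"]"] = "16";
    doc["/foo/array/el[@idx=\"2\"]"] = "32";
    const std::vector<int> values = getArray<int>(doc, "/foo");
    assert(values.size() == 2 && values[0] == 16 && values[1] == 32);
}
```

Because every element is fetched through its index attribute, the elements may appear in any order in the document.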

Array Mark - Up Example

<size>
  <array sizeName="num_dimensions" elemName="axis" indexName="dimension" indexStart="1">
    <num_dimensions>4</num_dimensions>
    <axis dimension="1"> ... </axis>
    <axis dimension="2"> ... </axis>
    ...
  </array>
</size>

Minimally invasive

Insert <array> </array> tags

Copy <dimension> tag to attribute

Easy to implement with XSL transformation

Working group needn't amend current metadata schema for it.

Conclusions

Discussed API Issues for Parsing XML without full "data binding" tools.

Discussed Repeated Tag problem

Concluded that XPath is simple and elegant way to solve problem - hopefully convinced you too.

Discussed C++ Implementation of an XPathReader API

Discussed how to parse compound data types

Described markup for Complex Numbers and Arrays

Suggest Complex and Array markup be standardised by Metadata Working Group (but not necessarily that it be used in metadata documents) - to assist sharing of data.

References/Links

XML, DOM, XPath: http://www.w3.org

Tutorials (XPath/XSLT): http://www.zvon.org

libxml2: http://www.xmlsoft.org

Elements vs. Attributes (and other discussions):

http://www.oasis-open.org/cover/elementsAndAttrs.html

XPathReader software

send email to me: bj@ph.ed.ac.uk

SciDAC CVS repository at JLAB (xpath_reader)

SciDAC: http://www.lqcd.org
