Getting Data out of XML Documents Bálint Joó School of Physics University of Edinburgh May 02, 2003

Getting Data out of XML Documents

Bálint Joó

School of Physics, University of Edinburgh

May 02, 2003

Contents

In search of a simple API for accessing DOM

The multiple tag problem

What is it?

Is it a problem for us?

How can we get around it?

XPath

What is easy to parse?

Software: XPathReader package

Conclusions

Motivation (Starting Points)

Lack of free data-binding tools for C/C++

Desire to read ILDG Metadata documents, marshal application data

=> Have to write our own tools

Would like simple API to get at document data

Would like same API to cope with ILDG metadata AND application data.

We got as far as reading into a DOM.

Start With Simple Idea

Consider simple API with functions

push(tagname) -- select tag with name tagname

pop() -- move up a level

getType( tagname , result )

Type = string | float | double | int | bool;

Equivalent API: directory like structure with no absolute paths:

cd(tagname) = push(tagname) , cd(..) = pop()

Simple Data: No Attributes, No Namespaces, No Empty Elements.

Example

<?xml version="1.0"?>
<foo>
  <bar>String</bar>
  <fred>5.0</fred>
</foo>

open("file.xml");
push("foo");
string bar;
getString("bar", bar);
double fred;
getDouble("fred", fred);
pop();

So far so good - nice and simple

Current UKQCD Schema has no attributes/namespaces

Empty tags serve no purpose except as placeholders

BUT soon we encounter...

The Multiple Tag Problem

Consider the following snippet:

<size>
  <axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension> 2 </dimension>
    <length>16 </length>
  </axis>
</size>

Let's try our API: push("size");

But what does push("axis"); do?

Multiple Tag Problem (cont'd)

push("axis") could select in document order

We could add an index to push("axis"):

push("axis", 1) push("axis", 2)

We could add an index attribute to <axis>:

<axis index="1"> <axis index="2">

But then we'd need a mechanism to match the index attribute

We could change the names of the axis tags:

<axis1> <axis2>

We could put the different <axis> tags into different namespaces -- effectively the same as adding an attribute

We could try to match the <dimension> tag.

The consequences

Changing tagnames for simplicity of parsing just seems wrong

Matching the <dimension> tag is not possible without first selecting an <axis> in our scheme (locality)

Adding attributes/namespaces complicates API.

This use of different namespaces would be philosophically wrong.

Adding an order of occurrence index to the API is cleanest

No need to change Schema, Instance documents etc.

Document ordering removes random access capability
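
The order-of-occurrence option can be sketched in C++ roughly as follows. This is only an illustration under stated assumptions, not the author's implementation: Element, LocalReader, makeSizeDocument and secondAxisDimension are hypothetical names invented here, and a small in-memory tree stands in for a real DOM.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory element: tag name, text content, child elements.
struct Element {
    std::string name;
    std::string text;
    std::vector<std::unique_ptr<Element>> children;

    Element* add(const std::string& childName, const std::string& childText = "") {
        children.push_back(std::make_unique<Element>());
        children.back()->name = childName;
        children.back()->text = childText;
        return children.back().get();
    }
};

// Hypothetical local reader: push/pop with a 1-based occurrence index.
class LocalReader {
public:
    explicit LocalReader(const Element& document) { stack_.push_back(&document); }

    // Select the occurrence-th child (document order) called tagname.
    void push(const std::string& tagname, std::size_t occurrence = 1) {
        stack_.push_back(&find(tagname, occurrence));
    }

    // Move up one level; the document node itself cannot be popped.
    void pop() {
        if (stack_.size() <= 1) throw std::runtime_error("pop() at document level");
        stack_.pop_back();
    }

    // Read the text of the occurrence-th child called tagname.
    void getString(const std::string& tagname, std::string& result,
                   std::size_t occurrence = 1) const {
        result = find(tagname, occurrence).text;
    }

private:
    const Element& find(const std::string& tagname, std::size_t occurrence) const {
        assert(!stack_.empty());
        std::size_t seen = 0;
        for (const auto& child : stack_.back()->children) {
            if (child->name == tagname && ++seen == occurrence) return *child;
        }
        throw std::runtime_error("no such element: " + tagname);
    }

    std::vector<const Element*> stack_;
};

// Builds the <size> snippet from the slides under a virtual document node.
Element makeSizeDocument() {
    Element doc;
    Element* size = doc.add("size");
    Element* first = size->add("axis");
    first->add("dimension", "1");
    first->add("length", "16");
    Element* second = size->add("axis");
    second->add("dimension", "2");
    second->add("length", "16");
    return doc;
}

// Usage: select the second <axis> by occurrence and read its <dimension>.
std::string secondAxisDimension() {
    const Element doc = makeSizeDocument();
    LocalReader reader(doc);
    reader.push("size");
    reader.push("axis", 2);
    std::string dimension;
    reader.getString("dimension", dimension);
    return dimension;
}
```

Note that the reader can only say "the second <axis>"; it has no way to ask for "the <axis> whose <dimension> is 2" without first opening each candidate.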

In General

For less simple (more general) XML documents duplicate tags can be distinguished by:

Occurrence Order

Name

Attributes

Content

Namespace

An ideal, simple API should allow matching on all of these to interrogate any XML document.

What about Locality?

push(namespace, tagname, attributes, occurrence)

getType(ns, tagname, attributes, occurrence, result)

But NO local parser can match on element content:

need to open a tag based on the value of its content

BUT can't get to the content without opening the tag.

<size>
  <num_dimensions>2</num_dimensions>
  <axis>
    <dimension>2</dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension>1</dimension>
    <length>16</length>
  </axis>
</size>

Document order may not help here: the Schema document is still satisfied.

Would like to match on the <dimension> tag

Need to abandon locality
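
To make the locality issue concrete, here is a small C++ sketch of matching on element content. It is an assumption-laden illustration rather than anything from the XPathReader package: Element, trim, findByChildContent, makeReorderedSize and demoContentMatch are hypothetical names, and the padding around one <dimension> value is added here only to exercise whitespace trimming. The point is that the search has to look inside every candidate <axis> before it can select one.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical in-memory element: tag name, text content, child elements.
struct Element {
    std::string name;
    std::string text;
    std::vector<Element> children;
};

// Strip leading and trailing whitespace from element text.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Return the first child called tagname that has a child called childTag
// whose trimmed text equals value, or nullptr if there is none.
// Every candidate must be opened and inspected: this is not a local operation.
const Element* findByChildContent(const Element& parent,
                                  const std::string& tagname,
                                  const std::string& childTag,
                                  const std::string& value) {
    for (const Element& candidate : parent.children) {
        if (candidate.name != tagname) continue;
        for (const Element& inner : candidate.children) {
            if (inner.name == childTag && trim(inner.text) == value) {
                return &candidate;
            }
        }
    }
    return nullptr;
}

// The <size> document from the slide, with dimension 2 listed before dimension 1.
Element makeReorderedSize() {
    Element size{"size", "", {}};
    size.children.push_back(Element{"num_dimensions", "2", {}});
    size.children.push_back(Element{"axis", "", {Element{"dimension", " 2 ", {}},
                                                 Element{"length", "16", {}}}});
    size.children.push_back(Element{"axis", "", {Element{"dimension", "1", {}},
                                                 Element{"length", "16", {}}}});
    return size;
}

// Usage: dimension 1 is found in the last <axis>, regardless of document order.
void demoContentMatch() {
    const Element size = makeReorderedSize();
    const Element* axis = findByChildContent(size, "axis", "dimension", "1");
    assert(axis != nullptr && axis == &size.children[2]);
}
```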

Lesson

In order to avoid ambiguity we must

Restrict the form of markup we deal with

Force decisions onto our Schema writers

OR complicate our API

rely on tag ordering (either implicitly or explicitly)

introduce attributes (forcing decision on Schema writers)

give up locality in the API

Global Queries: XPath

Would like a nice way to encode

tag name

attributes

order of occurrence

attribute/content matching predicates

Can this be done?

YES! Using XPath

XPath Axes

Axes relative to the current node (labels from the slide diagram):

Parent axis: ..

Attribute axis: @

Child axis: ./

Following sibling axis (no compact selector)

Preceding sibling axis (no compact selector)

XPath Axes specify coordinates for DOM.

Some Axes can include more than one node:

ancestors: parent and all its ancestors

XPath Selectors

tagname selects all children of current node called tagname

* selects all children of node

@name selects all attribute nodes called name

@* selects all attribute nodes of current node.

name[i] selects the i-th occurrence of child node called name

.. selects parent of current node

//name selects name with any set of ancestors

XPath Examples

<?xml version="1.0"?>
<size>
  <axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension> 2 </dimension>
    <length>16 </length>
  </axis>
</size>

XPath Query:

/

Selection: the document root (the node containing <size>)

XPath Examples

<?xml version="1.0"?>
<size>
  <axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension> 2 </dimension>
    <length>16 </length>
  </axis>
</size>

XPath Query:

/size

Selection: the <size> element

XPath Examples

<?xml version="1.0"?>
<size>
  <axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension> 2 </dimension>
    <length>16 </length>
  </axis>
</size>

XPath Query:

/size/axis

OR

/size/*

OR

//axis

Selection: both <axis> elements

XPath Examples

<?xml version="1.0"?>
<size>
  <axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension>2</dimension>
    <length>16 </length>
  </axis>
</size>

XPath Query:

/size/axis[2] (query on order of occurrence)

OR

/size/axis[dimension="2"] (query on element content)

Selection: the second <axis> element

XPath Examples

<?xml version="1.0"?>
<size xmlns:bj="http://fred.org">
  <bj:axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </bj:axis>
  <axis index="2">
    <dimension> 2 </dimension>
    <length>16 </length>
  </axis>
</size>

XPath Query:

/size/bj:axis

Selection: the <bj:axis> element

Supports namespaces

XPath Examples

<?xml version="1.0"?>
<size xmlns:bj="http://fred.org">
  <bj:axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </bj:axis>
  <axis index="2">
    <dimension> 2 </dimension>
    <length>16 </length>
  </axis>
</size>

XPath Query:

/size/axis[@index="2"]

Selection: the <axis index="2"> element

Attribute matching

Visit http://www.zvon.org/xxl/XPathTutorial for more ...

XPath Notes

Can return sets of nodes - not just unique node

Has more features:

Functions to turn query results into strings, numbers, booleans

Encodes all features we need

C/C++ linkable XPath Processors exist

Xerces, Xalan, libxml

Solves all our reader API problems in nice way.

XPath Based Reader API

Basic Functions:

open(file/stream);
getType(xpath_string, result);
getAttributeType(xpath_string, attributeName, result);

Semantics:

The xpath_string must identify a unique node.

What is Easy to Parse?

Stylistic discussion on Metadata Mailing list.

One particular question:

"How should we mark up things?"

Chris' Way:

<size>
  <dimensions>4</dimensions>
  <axis>
    <name>X</name>
    <length>16</length>
  </axis>
  <axis>
    <name>Y</name>
    <length>16 </length>
  </axis>
</size>

Tomoteru's Way:

<size>
  <x value="16"/>
  <y value="16"/>
  <z value="16"/>
  <t value="32"/>
</size>

Known as the "Element vs. Attribute" debate in the XML world.

What is Easy to Parse?

One statement is that the attribute way is perhaps easier to parse.

With XPath, both ways are easy to parse.

To get the length of the x dimension:

Chris' Way:

number(//size/axis[normalize-space(string(name))="X"]/length)

getInt("//size/axis[normalize-space(string(name))=\"X\"]/length", intValue);

Tomoteru's Way:

number(//size/x/@value)

getIntAttribute("//size/x", "value", intValue);

Chris' way needs a more complex query, but the API call is equally simple.

Element vs. Attribute Debate (aside)

Looked on Web

Tomoteru's way is preferred in general by object modellers (e.g. database people)

Mark up most "atomic" data as attributes

Use tags to indicate "table structure"

Chris' way is perhaps preferred by archivists or librarians (Go Kim!)

Decide for yourself, a discussion is available at:

http://www.oasis-open.org/cover/elementsAndAttrs.html

Found no universally accepted best practice.

Software: XPathReader

Wrote software to implement XPath Reader API in C++

Wraps around free libxml2 (C) library

Uses overloading and templating

Two Classes:

BasicXPathReader:

Use XPath to get at basic C++ types (ints, std::strings, etc)

XPathReader

Allows reading of Complex Numbers and Arrays.

XPathReader Class Public Members

open/close functions:

void open(istream& is);
void close(void);

get value of attribute from node identified by XPath:

template <typename T>
void getXPathAttribute(const string& xpath_to_node, const string& attribute_name, T& result);

get value of node identified by XPath:

template <typename T>
void getXPath(const string& xpath, T& result);

count results of XPath query:

int countXPath(const string& xpath_query);

Complex Numbers and Arrays

XPathReader Library provides Classes for Complex Numbers and Arrays:

template<typename T> class TComplex { ... };

template<typename T> class Array { ... };

Can have Complex numbers of arrays

Eg for storing real/imaginary parts of arrays:

TComplex< Array< double > >

Can also have Complex-es templated on string-s

Mathematically not sensible...

Complex Number Markup & Marshal

Invented simple mark up:

<foo>
  <cmpx>
    <re>real part</re>
    <im>imag part</im>
  </cmpx>
</foo>

Can maintain API through C++ function overloading and recursion:

template <typename T>
void getXPath(const string& path, TComplex<T>& result) {
  getXPath(path + "/cmpx/re", result.real());
  getXPath(path + "/cmpx/im", result.imag());
}

Similar but slightly more involved for Array.
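
The overload-and-recursion idea can be tried without libxml2 using the self-contained C++ sketch below. It makes several assumptions that are not in the slides: FakeDocument, SketchReader and demoComplexRead are hypothetical names, a std::map from path to text stands in for a real XPath engine, and this TComplex is a minimal stand-in whose real() and imag() return references (which the overload needs in order to fill the parts in place).

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Minimal stand-in for TComplex: real()/imag() return references.
template <typename T>
class TComplex {
public:
    T& real() { return re_; }
    T& imag() { return im_; }

private:
    T re_{};
    T im_{};
};

// Hypothetical stand-in for an XPath engine: each key is a path that
// identifies a unique node, each value is that node's text content.
using FakeDocument = std::map<std::string, std::string>;

class SketchReader {
public:
    explicit SketchReader(FakeDocument doc) : doc_(std::move(doc)) {}

    // Basic types: look up the node text and convert it by stream extraction.
    template <typename T>
    void getXPath(const std::string& path, T& result) const {
        const auto it = doc_.find(path);
        if (it == doc_.end()) throw std::runtime_error("no unique node for " + path);
        std::istringstream in(it->second);
        if (!(in >> result)) throw std::runtime_error("bad value at " + path);
    }

    // Compound type: this more specialized overload is chosen for TComplex<T>
    // and recurses into the <re> and <im> parts with the same function name.
    template <typename T>
    void getXPath(const std::string& path, TComplex<T>& result) const {
        getXPath(path + "/cmpx/re", result.real());
        getXPath(path + "/cmpx/im", result.imag());
    }

private:
    FakeDocument doc_;
};

// Usage: read the complex number 1.5 - 2i stored under /foo.
void demoComplexRead() {
    FakeDocument doc;
    doc["/foo/cmpx/re"] = "1.5";
    doc["/foo/cmpx/im"] = "-2";
    const SketchReader reader(doc);
    TComplex<double> z;
    reader.getXPath("/foo", z);
    assert(z.real() == 1.5 && z.imag() == -2.0);
}
```

Because the recursion goes through the same overloaded name, the caller writes getXPath(path, value) whether value is an int, a string, or a TComplex.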

Array Markup

Arrays were marked up as follows:

<foo>
  <array sizeName="size" elemName="el" indexName="idx" indexStart="x">
    <size>N</size>
    <el idx="x"> element[0] </el>
    <el idx="x+1"> element[1] </el>
    ...
    <el idx="x+N-1"> element[N-1] </el>
  </array>
</foo>

This is a general mark up -- suitable for local parsers too
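
One way an XPath-based reader might walk this markup is to turn the <array> attributes into one query per element. The C++ sketch below only builds the query strings. It is a guess at the approach, not the XPathReader implementation: ArrayLayout, sizeQuery, elementQueries and demoArrayQueries are hypothetical names, and indexStart is assumed to be an integer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical holder for the attributes of the <array> tag.
struct ArrayLayout {
    std::string sizeName;   // tag holding the element count
    std::string elemName;   // tag used for each element
    std::string indexName;  // attribute holding each element's index
    long indexStart;        // index of the first element (assumed integer)
};

// XPath for the element-count node, e.g. /foo/array/size.
std::string sizeQuery(const std::string& path, const ArrayLayout& layout) {
    return path + "/array/" + layout.sizeName;
}

// One XPath query per element, matching on the index attribute so that
// the result does not depend on document order.
std::vector<std::string> elementQueries(const std::string& path,
                                        const ArrayLayout& layout,
                                        std::size_t count) {
    std::vector<std::string> queries;
    queries.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const long idx = layout.indexStart + static_cast<long>(i);
        queries.push_back(path + "/array/" + layout.elemName + "[@" +
                          layout.indexName + "=\"" + std::to_string(idx) + "\"]");
    }
    return queries;
}

// Usage: a four-dimensional <size> array whose index attribute starts at 1.
void demoArrayQueries() {
    const ArrayLayout layout{"num_dimensions", "axis", "dimension", 1};
    const auto queries = elementQueries("/size", layout, 4);
    assert(sizeQuery("/size", layout) == "/size/array/num_dimensions");
    assert(queries.front() == "/size/array/axis[@dimension=\"1\"]");
    assert(queries.back() == "/size/array/axis[@dimension=\"4\"]");
}
```

A reader would first evaluate the size query to obtain the count, then evaluate each element query in turn.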

Array Markup Example

<size>
  <array sizeName="num_dimensions" elemName="axis" indexName="dimension" indexStart="1">
    <num_dimensions>4</num_dimensions>
    <axis dimension="1"> ... </axis>
    <axis dimension="2"> ... </axis>
    ...
  </array>
</size>

Minimally invasive

Insert <array> </array> tags

Copy <dimension> tag to attribute

Easy to implement with XSL transformation

Working group needn't amend current metadata schema for it.

Conclusions

Discussed API Issues for Parsing XML without full "data binding" tools.

Discussed Repeated Tag problem

Concluded that XPath is simple and elegant way to solve problem - hopefully convinced you too.

Discussed C++ Implementation of an XPathReader API

Discussed how to parse compound data types

Described markup for Complex Numbers and Arrays

Suggest Complex and Array markup be standardised by Metadata Working Group (but not necessarily that it be used in metadata documents) - to assist sharing of data.

References/Links

XML, DOM, XPath: http://www.w3.org

Tutorials (XPath/XSLT): http://www.zvon.org

libxml2: http://www.xmlsoft.org

Elements vs. Attributes (and other discussions):

http://www.oasis-open.org/cover/elementsAndAttrs.html

XPathReader software

send email to me: [email protected]

SciDAC CVS repository at JLAB (xpath_reader)

SciDAC: http://www.lqcd.org