    WS 2004-2005 / Jurish / XML in der CL: XML Parsing 12.01.2005

    XML in der CL: XML Parsing

    Bryan Jurish

    1 Introduction: What is XML Parsing?

    XML: an increasingly popular markup language used for data storage, transfer, and interchange, and the main focus of this course.

    parsing: the act of testing a serial string of characters for a condition of well-formedness, thereby constructing an internal structure or representation of the information represented by the input string.

    character: an atomic symbol (letter) belonging to some finite character set, usually represented by an underlying binary code, or character encoding.

    character set: a finite alphabet of potential characters, alternatively the domain of some character encoding.

    character encoding: a convention by which each possible character is uniquely identified by the numeric value of some binary sequence.

    Common fixed-width encodings: ASCII (7 bits), ISO-8859-1 (8 bits), Unicode in its UCS-2 form (16 bits).

    Common variable-width encodings: UTF-8 (at least 8 bits per character).

    well-formed: conforming to some well-defined conventional specification regarding allowable objects, e.g. well-formed strings with respect to a given grammar are those strings which belong to the language generated by the grammar.

    Well-formed XML documents are (technically speaking) those which conform to the XML specification: those which contain no mismatched opening- and closing-element tags, etc.

    Contrast with valid XML documents: in XML, validity is defined with respect to a DTD, and can only be confirmed (or refuted, in the well-formed case) by reference to the DTD in question.

    internal structure: above, refers to data in some application-specific form which is to be extracted from an XML document and made palatable to the host (parsing) application.
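As a rough illustration of the well-formedness test defined above, the sketch below hands a string to a parser and reports whether the parse succeeded. It uses Python's standard `xml.parsers.expat` binding rather than the Perl modules shown later in this handout. The function name `is_well_formed` and the sample strings are assumptions made for the example. Validity against a DTD is not checked here.

```python
import xml.parsers.expat


def is_well_formed(document):
    """Return True if `document` parses as well-formed XML, else False."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True marks the final chunk of input
    except xml.parsers.expat.ExpatError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed("<foo><bar>text</bar></foo>"))  # matched tags
    print(is_well_formed("<foo><bar>text</foo></bar>"))  # mismatched tags
```

A parser object is created per call because an expat parser can only be used for a single document.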

    2 Motivation: Why do we care?

    2.1 Pros

    Data Exchange

    Many libraries, programs and program suites use XML as an exchange format between various subsystems which need to communicate with one another; if your program happens to be on the receiving end of such a link, you'll most likely need to do some parsing.

    Encoding Independence

    A veritable plethora of potential character encodings exist and are currently in active use: the XML specification allows (at least in theory) programs a great deal of independence from the actual character encoding used in the data, relegating the handling of character sets to an abstract parser layer (many abstract XML parsers, expat among them, return all data to the application encoded in UTF-8).

    Readability

    XML may not be pretty, but it can be read and understood by human beings. For many applications, it is often possible to construct test XML documents by hand, as well as to diagnose program I/O errors simply by looking at the relevant sections of the XML input (or output, respectively).

    Compatibility

    Low-level XML parsing libraries exist for nearly every major programming and scripting language (C, C++, Perl, Python, Java, ...); thus XML can be a useful anchor format shared between different programs, or between different implementations of the same program (e.g. prototype in Perl, release in C).
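The encoding independence described above can be sketched as follows, assuming Python's standard `xml.parsers.expat` binding. The same document is serialized once as ISO-8859-1 and once as UTF-8, and the application receives identical text from both. The helper `text_content`, the element name `word`, and the sample text are assumptions made for the example. Note that the Python binding hands the application decoded strings rather than raw UTF-8 bytes.

```python
import xml.parsers.expat

# One document template; only the declared encoding differs between versions.
TEMPLATE = '<?xml version="1.0" encoding="{enc}"?><word>Gr\u00fc\u00dfe</word>'

latin1_bytes = TEMPLATE.format(enc="ISO-8859-1").encode("iso-8859-1")
utf8_bytes = TEMPLATE.format(enc="UTF-8").encode("utf-8")


def text_content(data):
    """Parse the bytes in `data` and return all character data as one str."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


if __name__ == "__main__":
    # The byte sequences differ, but the parser layer hides the difference.
    print(latin1_bytes == utf8_bytes)
    print(text_content(latin1_bytes) == text_content(utf8_bytes))
```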

    2.2 Cons

    Learning Curve

    As is the case with most programming tasks, before you can enjoy the benefits of XML, you need first to choose a parsing model and an implementation of that model, and to familiarize yourself with the peculiarities of the model and implementation you have chosen: this takes time, effort, and more work than you might want to invest. Often, it is easier in the short run to perform I/O in some native (application-specific) format.

    Space

    XML-encoded documents are usually much larger than would be strictly necessary for the amount of raw information they contain. If you are encoding large amounts of information in XML, you're wasting a lot of bytes.

    Speed

    Parsing of XML documents is SLOW. There are many reasons for this, among them the added size of XML documents, but the main reason is the complexity of the algorithms used by the underlying abstract parser layer to implement all of the potential benefits that XML offers. If you don't need these benefits, a native (application-specific) data format is likely to be parseable up to twice as fast as a corresponding XML format.


    3 Variants: When does parsing happen?

    3.1 Overview

    There are two main paradigms for abstract XML parsers, which differ in their answer to the question of when the application programmer gets access to the data encoded by the XML document, and in what form this data appears when it becomes accessible.

    1. Callback-Based Document Parsing (CB)

    Often referred to by the name of the most popular instance specification, SAX, callback-based parsers provide the application programmer with a serial stream of events corresponding to the various components of the XML document being parsed, in document order.

    2. In-Memory Parse Tree Construction (TC)

    Also commonly referred to by the name of the most popular instance specification, DOM (Document Object Model), tree-constructing parsers build an implementation-specific in-memory representation of the XML document's tree-like structure. After a successful parse, they provide the application programmer with an API for searching, traversing, manipulating, and saving the internal tree-like document representation. TC parsers are usually themselves built on CB parsers; in the general case they are slower than their CB cousins and use more memory.

    The following sections illustrate the differences between the two parsing models based on the following XML document:

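    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE foo SYSTEM "foo.dtd">
    <foo>
      <bar baz="bonk" blip="boffo">This is some text.</bar>
      <bar blip="bleep">This too.</bar>
    </foo>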

    3.2 Callback Parsers

    The example XML document given above in Section 3.1 would generate an event stream such as the following to be interpreted by the application:


    Event Type            Details
    XML Declaration       XML-Version=1.0, Encoding=ISO-8859-1
    DOCTYPE Declaration   Root=foo, Type=SYSTEM, URI=foo.dtd
    Open Element          Name=foo, Attributes=(none)
    Open Element          Name=bar, Attributes=(baz=bonk, blip=boffo)
    Text                  This is some text.
    Close Element         Name=bar
    Open Element          Name=bar, Attributes=(blip=bleep)
    Text                  This too.
    Close Element         Name=bar
    Close Element         Name=foo
    End of Document
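An event stream of this kind can be reproduced with a short sketch, assuming Python's standard `xml.parsers.expat` binding in place of the Perl modules used later in this handout. The function name `event_stream`, the event labels, and the layout of the example document (tags and attributes as implied by the event list above) are assumptions of the sketch. Whitespace-only text between elements is skipped so that the result lines up with the event list.

```python
import xml.parsers.expat

DOC = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE foo SYSTEM "foo.dtd">
<foo>
  <bar baz="bonk" blip="boffo">This is some text.</bar>
  <bar blip="bleep">This too.</bar>
</foo>
"""


def event_stream(data):
    """Parse `data` and return a list of (event type, details) tuples."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver each run of text as a single event

    def xml_decl(version, encoding, standalone):
        events.append(("XML Declaration", {"version": version, "encoding": encoding}))

    def doctype(name, system_id, public_id, has_internal_subset):
        events.append(("DOCTYPE Declaration", {"root": name, "uri": system_id}))

    def start(name, attrs):
        events.append(("Open Element", {"name": name, "attributes": attrs}))

    def end(name):
        events.append(("Close Element", {"name": name}))

    def text(chunk):
        if chunk.strip():  # skip whitespace-only text between elements
            events.append(("Text", chunk))

    parser.XmlDeclHandler = xml_decl
    parser.StartDoctypeDeclHandler = doctype
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    parser.Parse(data, True)
    events.append(("End of Document", None))  # added by the sketch, not by expat
    return events


if __name__ == "__main__":
    for kind, details in event_stream(DOC):
        print(kind, details)
```

The external DTD `foo.dtd` is only named, never loaded: with these settings expat reports the DOCTYPE declaration and moves on.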

    3.3 Tree-Construction Parsers

    The example XML document given above in Section 3.1 passed through a tree-constructing parser would produce a data structure such as the following:

    Document (Version=1.0, Encoding=ISO-8859-1)
      DTD (URI=foo.dtd)
      Element (Name=foo, Attributes=(none))
        Element (Name=bar, Attributes=(baz=bonk, blip=boffo))
          Text: This is some text.
        Element (Name=bar, Attributes=(blip=bleep))
          Text: This too.
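A comparable tree can be built and walked with the following sketch, which assumes Python's standard `xml.dom.minidom` as the tree-constructing parser (not libxml2, which this handout covers later). The function name `outline` and the output format are assumptions of the sketch, and whitespace-only text nodes are left out so that the outline resembles the structure above.

```python
import xml.dom.minidom

DOC = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE foo SYSTEM "foo.dtd">
<foo>
  <bar baz="bonk" blip="boffo">This is some text.</bar>
  <bar blip="bleep">This too.</bar>
</foo>
"""


def outline(node, depth=0):
    """Return indented lines describing `node` and all of its descendants."""
    pad = "  " * depth
    lines = []
    if node.nodeType == node.DOCUMENT_NODE:
        lines.append(pad + "Document")
    elif node.nodeType == node.DOCUMENT_TYPE_NODE:
        lines.append(f"{pad}DTD URI={node.systemId}")
    elif node.nodeType == node.ELEMENT_NODE:
        attrs = " ".join(f"{k}={v}" for k, v in sorted(node.attributes.items()))
        lines.append(f"{pad}Element Name={node.tagName} Attributes={attrs or '(none)'}")
    elif node.nodeType == node.TEXT_NODE:
        if not node.data.strip():
            return lines  # ignore whitespace-only text used for indentation
        lines.append(f"{pad}Text {node.data}")
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    document = xml.dom.minidom.parseString(DOC)  # the whole tree is built here
    print("\n".join(outline(document)))
```

Unlike the callback model, nothing is reported while parsing is under way: the application only sees the finished tree and then decides how to traverse it.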


    4 Implementations: How can we parse?

    4.1 Do-It-Yourself

    Don't even think about it.

    Really, I mean it.

    Trust me on this one.

    4.2 Expat

    The Big Idea

    Expat is a C library originally by James Clark which implements an abstract callback-based XML parser. It is just about the fastest XML parser freely available. The application programmer is responsible for setting up callbacks (function pointers) as handlers for various event types, and for providing the expat routines with successive chunks of data to be parsed (buffer-filling). The user-defined callbacks receive all event-relevant data and are responsible for constructing the application-internal representation however the programmer sees fit to do so.

    Friends and Relations

    Expat wrappers exist for C++ (programmer defines a derived class; callbacks appear as virtual methods of that class), Perl (in the guise of the XML::Parser module), and other programming languages. In fact, the Perl XML::Parser API even offers package-, callback-by-element-, and tree-construction parsing modes, but these extensions exist chiefly on the Perl side, and you should not expect to find them in other expat derivatives.

    Buglets

    Expat has built-in support for only a few encodings; currently, these are UTF-8, ISO-8859-1, UTF-16, and US-ASCII. Other encodings may additionally be supported by various expat wrappers such as XML::Parser: check your implementation's documentation for details.
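The division of labour described under "The Big Idea" (the application installs callbacks and keeps feeding the parser chunks of input) might look roughly like this through Python's standard `xml.parsers.expat` wrapper. The function name `count_elements`, the deliberately small chunk size, and the sample input are assumptions of the sketch; the C API and the Perl wrapper differ in detail.

```python
import io
import xml.parsers.expat


def count_elements(stream, chunk_size=16):
    """Feed `stream` to expat chunk by chunk; return element counts by name."""
    counts = {}

    def start(name, attrs):
        # Callback: builds the application-internal representation (a dict).
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)  # False: more data may follow
    parser.Parse(b"", True)         # True: signal the end of the document
    return counts


if __name__ == "__main__":
    source = io.BytesIO(b"<foo><bar>one</bar><bar>two</bar></foo>")
    print(count_elements(source))
```

Chunk boundaries may fall in the middle of a tag; the parser keeps the necessary state between calls, so the application does not need to align chunks with the markup.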

    4.2.1 XML::Parser

    Preparation

    use XML::Parser;

    Handler Setup: Element open-tag handler

    ##-- called for each opening tag, with the element name and its attributes
    sub handle_start {
      my ($parser, $elt, %attrs) = @_;
      print "ELEMENT: $elt\n";
      foreach my $name (sort(keys(%attrs))) {
        print " ATTRIBUTE: $name = $attrs{$name}\n";
      }
    }


    Handler Setup: Element close-tag handler

    ##-- called for each closing tag
    sub handle_end {
      my ($parser, $elt) = @_;
      print "/ELEMENT: $elt\n";
    }

    Handler Setup: Text handler

    ##-- called for each run of character data
    sub handle_char {
      my ($parser, $string) = @_;
      print "TEXT: >>", $string, "


    Buglets

    libxml2 is not entirely conformant to the DOM standard, although there are extensions based on libxml2 (notably the Gnome DOM Engine gdome2) which provide DOM conformance.

    4.3.1 XML::LibXML

    Preparation

    use XML::LibXML;

    Parser Creation

    $tcp = XML::LibXML->new();

    Parsing an in-memory string buffer

    $doc = $tcp->parse_string('<doc>hello tree</doc>'); ##-- element name "doc" is a placeholder

    Parsing a local file

    $doc = $tcp->parse_file("foo.xml");

    Parsing a Perl Filehandle

    open(FOO, "


    @attrs = $node->attributes();

    Get node text content (recursively)

    $string = $node->textContent();

    Save document to a named file

    $doc->toFile("out.xml", 1); ##-- 1 = pretty-print
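For comparison, the same kinds of operations (parsing a string, reading a node's attributes, collecting text content recursively, saving the document) can be sketched with Python's standard `xml.dom.minidom`. This is an analogue under stated assumptions, not the XML::LibXML API: minidom has no `textContent` method, so the helper `text_content` is written out by hand, and the names `demo` and `SOURCE` and the sample markup are hypothetical.

```python
import os
import tempfile
import xml.dom.minidom

SOURCE = '<foo><bar blip="bleep">hello <b>tree</b></bar></foo>'


def text_content(node):
    """Concatenate the text of `node` and all of its descendants."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    return "".join(text_content(child) for child in node.childNodes)


def demo(directory):
    """Parse SOURCE, inspect one node, and save the tree under `directory`."""
    doc = xml.dom.minidom.parseString(SOURCE)   # parse an in-memory string
    node = doc.getElementsByTagName("bar")[0]   # pick the first <bar> element
    attrs = dict(node.attributes.items())       # attribute names and values
    string = text_content(node)                 # recursive text content
    out_path = os.path.join(directory, "out.xml")
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(doc.toprettyxml(indent="  "))  # pretty-printed output
    return attrs, string, out_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        print(demo(directory))
```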

    5 Further Details: Where can we read more?

    libxml2

    sources, documentation, pre-compiled binaries, etc.: http://xmlsoft.org

    Perl wrapper: http://cpan.uwinnipeg.ca/module/XML::LibXML

    expat

    sources & other links: http://expat.sourceforge.net

    introductory article at xml.com: http://www.xml.com/pub/1999/09/expat/index.html

    Perl wrapper: http://cpan.uwinnipeg.ca/module/XML::Parser

    DOM Specification: http://www.w3.org/DOM/

    SAX Specification: http://www.saxproject.org/

    Apache XML Project (WARNING: not for the weak of heart!): http://xml.apache.org/
