Parallel XSLT Processing of Large XML Documents - XML Prague 2015

Parallel XSLT Processing of Large Documents

Jakub Maly, Barclays, @j_maly, jakub@maly.cz, XML Prague 2015

Reminder on streaming…

Can now process huge documents in bounded memory

A whole new area where XSLT is now applicable

With trade-offs: the stylesheet must follow streamability rules

limited XPath

XSLT 3.0 only, only in commercial products

Large documents take a long time to process

processing time is dominated by the time required to parse the input

Motivation

Simple input XML structure, 700MB in size

Simple XSLT

Takes 35s to process…

<ProteinEntry id="CCMQR">
  <header>
    <uid>CCMQR</uid>
    <accession>A00003</accession>
    <created_date>17-Mar-1987</created_date>
    <seq-rev_date>17-Mar-1987</seq-rev_date>
    <txt-rev_date>03-Mar-2000</txt-rev_date>
  </header>
  <protein>
    <name>cytochrome c</name>
  </protein>
  ...
</ProteinEntry>

Why so long?

I/O is not a problem (SSDs are fast enough)

We are using streaming, so memory consumption is constant (bounded)

The processor runs at 100%, but on just one of the cores…

Space for optimization?

Multi-core machines are ubiquitous

XSLT processor should use all cores if possible

Parse and process in multiple threads, then merge the outputs

Results

Trade-offs

One processor thread can’t see data processed by other threads

The document has to consist of fairly independent “records” that can be processed separately

As in streaming, we can’t “go back”

and crutches like accumulators won’t work

And sometimes can’t even “go up” (out of the record)

Requirements #1 (input)

The document has a well-defined structure (schema)

A major part of the content is in a sequence of nodes of certain types (we will call these core types)

Core types and their ancestors are not recursive.

Contents of core types are reasonably independent.

We expect that processing of each record takes similar amount of time

Input can be read by multiple threads from random positions
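A minimal Python sketch of what "readable from random positions" can mean in practice: seek to evenly spaced byte offsets and scan forward to the next record start. The function names, the `<ProteinEntry` start tag and the chunk size are illustrative assumptions, not part of the described implementation, and the scan is deliberately not XML-aware.

```python
import os


def find_from(f, pos, needle, chunk_size=1 << 16):
    """Return the first file offset >= pos where `needle` occurs, or None."""
    f.seek(pos)
    tail = b""   # last bytes of the previous chunk, kept to catch split matches
    base = pos   # file offset of the chunk about to be read
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return None
        buf = tail + chunk
        idx = buf.find(needle)
        if idx != -1:
            return base - len(tail) + idx
        tail = buf[-(len(needle) - 1):] if len(needle) > 1 else b""
        base += len(chunk)


def portion_offsets(path, parts, start_tag=b"<ProteinEntry"):
    """Offsets splitting the file into at most `parts` portions.

    Each offset points at a record start tag (assumed name), so every
    portion begins with a complete record.
    """
    size = os.path.getsize(path)
    offsets = []
    with open(path, "rb") as f:
        for i in range(parts):
            guess = size * i // parts          # evenly spaced guess
            pos = find_from(f, guess, start_tag)
            if pos is not None and (not offsets or pos > offsets[-1]):
                offsets.append(pos)
    return offsets
```

Each worker can then open the file on its own and read from its offset up to the next one. Portions come out roughly equal in bytes, which matches the assumption that records take a similar amount of time to process.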

Requirements #2 (stylesheet)

Streamable

Explicitly marked templates for core nodes

Paths in those templates are absolute and use only child axis and element names

alternatively: provide schema

Only the core node and its subtree can be accessed by XPath

match="/ProteinDatabase/ProteinEntry"

pxsl:core="yes"
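One reason for requiring absolute, child-axis-only paths is that the ancestors of a core node can then be read straight off the match pattern. The Python sketch below derives a wrapping prefix and suffix from such a path; `wrapper_for` is a hypothetical helper, and it assumes the ancestors carry no attributes or namespaces that the stylesheet depends on.

```python
import re

# Accepts only paths such as /ProteinDatabase/ProteinEntry:
# absolute, child axis, plain element names (no predicates, no //, no prefixes).
SIMPLE_PATH = re.compile(r"^(/[A-Za-z_][\w.-]*)+$")


def wrapper_for(core_path):
    """Return (prefix, suffix) that re-create the ancestors of a core node."""
    if not SIMPLE_PATH.match(core_path):
        raise ValueError("not a simple absolute child-axis path: %r" % core_path)
    steps = core_path.strip("/").split("/")
    ancestors = steps[:-1]                       # everything above the core node
    prefix = "".join("<%s>" % name for name in ancestors)
    suffix = "".join("</%s>" % name for name in reversed(ancestors))
    return prefix, suffix
```

For `/ProteinDatabase/ProteinEntry` this yields `<ProteinDatabase>` and `</ProteinDatabase>`, so a portion wrapped this way still matches the original absolute pattern.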

Special cases

If we know more about the structure, we can access more data safely, e.g.

If all core nodes are children of one node

We can read from „intro“ in all threads

Special cases #2

If not all core nodes are children of one node

Maybe we could choose a different layer of nodes as core nodes

Parsing problems

Possible issues when splitting the document: comments, PIs, CDATA

Solutions

report error

preprocessing

with „fast“ XML parser

non XML-aware
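The "report error" option can be sketched as a fast, non-XML-aware scan in Python that refuses to split when the input contains constructs that could hide a fake record start tag. The names `RISKY` and `check_splittable` are assumptions for illustration; the check is conservative and rejects harmless comments too.

```python
import re

# Comments, CDATA sections and processing instructions may contain text that
# looks like a record start tag. The XML declaration (<?xml ...?>) is allowed.
RISKY = re.compile(rb"<!--|<!\[CDATA\[|<\?(?!xml[\s?])")


def check_splittable(data):
    """Raise ValueError at the first construct that makes naive splitting unsafe."""
    match = RISKY.search(data)
    if match:
        raise ValueError(
            "cannot split safely: found %r at byte %d"
            % (match.group().decode("ascii"), match.start())
        )
```

A preprocessing pass could use the same pattern to locate such constructs and skip over them instead of rejecting the document.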

Side-effect problem

Parallelization can produce unexpected results

Side-effects defined by the language, e.g. xsl:message

Could be buffered/concatenated

Others

Vendor-specific extensions

User extensions
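The "buffered/concatenated" idea for xsl:message can be sketched in Python as follows: each worker collects its messages in a private list, and the lists are concatenated in portion order after all workers finish. `process_portion` is a hypothetical stand-in for a real transformation, with upper-casing as a placeholder for the stylesheet.

```python
from concurrent.futures import ThreadPoolExecutor


def process_portion(index, records):
    """Stand-in for one transformation; messages go to a private buffer."""
    messages = []
    output = []
    for record in records:
        messages.append("portion %d: processing %s" % (index, record))
        output.append(record.upper())      # placeholder for the real transform
    return output, messages


def run(portions, workers=4):
    """Process portions in parallel; merge outputs and messages in portion order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, whatever the completion order.
        results = list(pool.map(process_portion, range(len(portions)), portions))
    outputs = [item for out, _ in results for item in out]
    messages = [msg for _, msgs in results for msg in msgs]
    return outputs, messages
```

This makes the message order deterministic, but it does not help with vendor or user extensions whose side effects happen outside the processor's control.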

Solutions?

Experimental implementation

Thin wrapper around Saxon EE 9.6, written in Java

1. Split the document into portions of roughly the same size

2. Turn each portion into a well-formed XML (by adding a small prefix/suffix)

3. Run an instance of Saxon on each portion

4. Merge the results when all threads finish
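The four steps above can be sketched end to end in Python. This is an illustration of the structure only, not the actual Java wrapper: the transformation is a stand-in that extracts `uid` values instead of running Saxon, the element names come from the protein example, and CPython threads will not give a real speed-up for CPU-bound parsing.

```python
import bisect
import re
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

RECORD_RE = re.compile(r"<ProteinEntry[\s>]")   # assumed core node start
RECORD_END = "</ProteinEntry>"
PREFIX, SUFFIX = "<ProteinDatabase>", "</ProteinDatabase>"


def split_document(doc, parts):
    """Step 1: cut the record sequence into at most `parts` similar-sized portions."""
    starts = [m.start() for m in RECORD_RE.finditer(doc)]
    first = starts[0]
    end = doc.rindex(RECORD_END) + len(RECORD_END)
    cuts = [first]
    for i in range(1, parts):
        target = first + (end - first) * i // parts
        k = bisect.bisect_left(starts, target)   # next record start at or after target
        if k < len(starts) and starts[k] > cuts[-1]:
            cuts.append(starts[k])
    cuts.append(end)
    return [doc[a:b] for a, b in zip(cuts, cuts[1:])]


def transform(portion):
    """Steps 2 and 3: wrap the portion so it is well-formed, then process it."""
    root = ET.fromstring(PREFIX + portion + SUFFIX)
    return [entry.findtext("header/uid") for entry in root.findall("ProteinEntry")]


def parallel_transform(doc, parts):
    """Step 4: run all portions and merge the results in document order."""
    portions = split_document(doc, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = list(pool.map(transform, portions))
    return [uid for result in results for uid in result]
```

Because the results are merged in portion order, the output is the same as for a single-threaded run over the whole document, provided the records really are independent.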

https://github.com/j-maly/pXSLT

Use Case

RUIAN = DB of geographical, municipal information, XML Prague = 614 MB of data

Simple format

Records for streets, buildings, …

Task: split the large file into individual records (each in one XML file)

Takes 42 minutes in Saxon EE

Conclusion

Processing in multiple threads provides measurable speed-up

Imposes additional limitations on the stylesheet and input

The described approach makes sense only for large documents (for documents that fit into memory, other solutions are already available, e.g. saxon:threads)

https://github.com/j-maly/pXSLT
