Parallel XSLT Processing of Large Documents
Jakub Maly, Barclays, @j_maly, jakub@maly.cz, XML Prague 2015
Reminder on streaming…
Can now process huge documents in bounded memory
A whole new area where XSLT is now applicable
With trade-offs: the stylesheet must follow streamability rules
limited XPath
XSLT 3.0 only, only in commercial products
Large documents take a long time to process
Processing time is dominated by the time required to parse the input
Motivation
Simple input XML structure, 700MB in size
Simple XSLT
Takes 35s to process…
<ProteinEntry id="CCMQR">
  <header>
    <uid>CCMQR</uid>
    <accession>A00003</accession>
    <created_date>17-Mar-1987</created_date>
    <seq-rev_date>17-Mar-1987</seq-rev_date>
    <txt-rev_date>03-Mar-2000</txt-rev_date>
  </header>
  <protein>
    <name>cytochrome c</name>
  </protein>
  ...
</ProteinEntry>
Why so long?
I/O is not a problem (SSDs are fast enough)
We are using streaming, so memory consumption is constant (bounded)
The processor runs at 100%, but on just one of the cores…
Space for optimization?
Multi-core machines are ubiquitous
XSLT processor should use all cores if possible
Parsing + processing in multiple threads and then merge the outputs
Trade-offs
One processor thread can’t see data processed by other threads
The document has to consist of fairly independent “records” that can be processed separately
As in streaming, we can’t “go back”
and crutches like accumulators won’t work
And sometimes can’t even “go up” (out of the record)
Requirements #1 (input)
The document has a well-defined structure (schema)
A major part of the content is in a sequence of nodes of certain types (we will call these core types)
Core types and their ancestors are not recursive.
Contents of core types are reasonably independent.
We expect that processing each record takes a similar amount of time
Input can be read by multiple threads from random positions
Requirements #2 (stylesheet)
Streamable
Explicitly marked templates for core nodes
Paths in those templates are absolute and use only child axis and element names
alternatively: provide schema
Only the core node and its subtree can be accessed by XPath
match="/ProteinDatabase/ProteinEntry"
pxsl:core="yes"
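The rule that core-template paths are absolute and use only the child axis and element names can be checked mechanically. The following is a minimal sketch, not code from pXSLT: the class `CorePathCheck` and its method are hypothetical names, and element names are simplified to ASCII with an optional prefix rather than the full XML name grammar.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: accepts only absolute paths made of child steps and element names. */
final class CorePathCheck {
    // One or more "/name" steps. Assumption: names are ASCII, optionally prefixed ("p:name").
    private static final Pattern SIMPLE_ABSOLUTE_PATH = Pattern.compile(
            "(/([A-Za-z_][\\w.-]*:)?[A-Za-z_][\\w.-]*)+");

    static boolean isSimpleAbsolutePath(String match) {
        return SIMPLE_ABSOLUTE_PATH.matcher(match.trim()).matches();
    }

    public static void main(String[] args) {
        // Absolute path with child steps only.
        System.out.println(isSimpleAbsolutePath("/ProteinDatabase/ProteinEntry"));
        // Descendant axis: not allowed for a core template.
        System.out.println(isSimpleAbsolutePath("//ProteinEntry"));
        // Predicate: not allowed for a core template.
        System.out.println(isSimpleAbsolutePath("/ProteinDatabase/ProteinEntry[1]"));
    }
}
```

With such a restricted path, a splitter can tell from the element nesting alone where each core node begins, without evaluating XPath.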
Special cases
If we know more about the structure, we can access more data safely, e.g.
If all core nodes are children of one node
We can read from “intro” in all threads
Special cases #2
If the core nodes are not all children of one node
Maybe we could choose a different layer of nodes as core nodes
Parsing problems
Possible issues when splitting the document: comments, PIs, CDATA
Solutions
report error
preprocessing
with a “fast” XML parser
non-XML-aware
?
<ProteinEntry>...<!--</ProteinEntry><ProteinEntry>...--></ProteinEntry>
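The example above shows why a naive text search for record tags is unsafe. One possible preprocessing step is a lightweight scanner that is not a full XML parser but still skips comments, CDATA sections and PIs when looking for record starts. This is a sketch under assumptions: `RecordBoundaryScanner` is a hypothetical name, the input is held in memory as a string, and DOCTYPE internal subsets are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical scanner: finds record start tags outside comments, CDATA sections and PIs. */
final class RecordBoundaryScanner {
    /** Returns the offsets of every "<name" start tag that is real markup. */
    static List<Integer> recordStarts(String xml, String name) {
        List<Integer> starts = new ArrayList<>();
        String open = "<" + name;
        int i = 0;
        while (i < xml.length()) {
            if (xml.startsWith("<!--", i)) {
                i = skipPast(xml, "-->", i + 4);          // comment
            } else if (xml.startsWith("<![CDATA[", i)) {
                i = skipPast(xml, "]]>", i + 9);          // CDATA section
            } else if (xml.startsWith("<?", i)) {
                i = skipPast(xml, "?>", i + 2);           // processing instruction
            } else if (xml.startsWith(open, i) && endsName(xml, i + open.length())) {
                starts.add(i);
                i += open.length();
            } else {
                i++;
            }
        }
        return starts;
    }

    // Moves to just after the terminator, or to the end of input if it is missing.
    private static int skipPast(String xml, String terminator, int from) {
        int end = xml.indexOf(terminator, from);
        return end < 0 ? xml.length() : end + terminator.length();
    }

    // True when the tag name ends here, so "<ProteinEntryX" does not count as "<ProteinEntry".
    private static boolean endsName(String xml, int pos) {
        if (pos >= xml.length()) return false;
        char c = xml.charAt(pos);
        return c == '>' || c == '/' || Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        String xml = "<ProteinEntry>...<!--</ProteinEntry><ProteinEntry>...--></ProteinEntry>";
        System.out.println(recordStarts(xml, "ProteinEntry")); // prints [0]
    }
}
```

The commented-out `<ProteinEntry>` is ignored, so only the real record at offset 0 is reported.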
Side-effect problem
Parallelization can produce unexpected results
Side-effects defined by the language, e.g. xsl:message
Could be buffered/concatenated
Others
Vendor-specific extensions
User extensions
Solutions?
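For language-defined side effects such as xsl:message, the buffering idea could look like the sketch below: each task writes messages into its own buffer, and the buffers are concatenated in portion order once all tasks finish. `BufferedMessages` is a hypothetical name, and the message text is a stand-in for real transformation output. It does not address vendor or user extensions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: per-task message buffers, concatenated in portion order. */
final class BufferedMessages {
    static List<String> runAndCollect(List<String> portions, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (int p = 0; p < portions.size(); p++) {
                final int index = p;
                final String portion = portions.get(p);
                pending.add(pool.submit(() -> {
                    // Private buffer: no other thread writes to it.
                    List<String> buffer = new ArrayList<>();
                    // Stand-in for a transformation that emits messages.
                    buffer.add("portion " + index + ": processed " + portion.length() + " chars");
                    return buffer;
                }));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : pending) {
                all.addAll(f.get()); // submission order, not completion order
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        for (String message : runAndCollect(List.of("ab", "cde"), 2)) {
            System.out.println(message);
        }
    }
}
```

Because the buffers are read in submission order, the messages come out in the same order as in a single-threaded run, regardless of which thread finishes first.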
Experimental implementation
Thin wrapper around Saxon EE 9.6, written in Java
1. Split the document into portions of roughly the same size
2. Turn each portion into a well-formed XML document (by adding a small prefix/suffix)
3. Run an instance of Saxon on each portion
4. Merge the results when all threads finish
https://github.com/j-maly/pXSLT
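The four steps above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the pXSLT code: it uses the JDK's built-in XSLT 1.0 engine (`javax.xml.transform`) instead of Saxon EE, takes already separated records as input, groups them by count rather than by byte size, and keeps everything in memory. The class name `ParallelTransformSketch` and its parameters are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch of split, wrap, transform in parallel, merge. Requires threads >= 1. */
final class ParallelTransformSketch {
    static String transform(String xslt, String prefix, List<String> records,
                            String suffix, int threads) throws Exception {
        // A compiled stylesheet (Templates) can be shared; each task creates its own Transformer.
        Templates templates = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xslt)));

        // Step 1: group records into at most `threads` portions of similar size.
        int perPortion = Math.max(1, (records.size() + threads - 1) / threads);
        List<String> portions = new ArrayList<>();
        for (int i = 0; i < records.size(); i += perPortion) {
            List<String> slice = records.subList(i, Math.min(i + perPortion, records.size()));
            // Step 2: make the portion well-formed by adding the prefix and suffix.
            portions.add(prefix + String.join("", slice) + suffix);
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Step 3: run one transformation per portion.
            List<Future<String>> pending = new ArrayList<>();
            for (String portion : portions) {
                pending.add(pool.submit(() -> {
                    StringWriter out = new StringWriter();
                    templates.newTransformer().transform(
                            new StreamSource(new StringReader(portion)), new StreamResult(out));
                    return out.toString();
                }));
            }
            // Step 4: merge the results in document order once every task has finished.
            StringBuilder merged = new StringBuilder();
            for (Future<String> f : pending) {
                merged.append(f.get());
            }
            return merged.toString();
        } finally {
            pool.shutdown();
        }
    }
}
```

Here `prefix` and `suffix` would be the opening and closing tags of the core nodes' ancestors, for example `<ProteinDatabase>` and `</ProteinDatabase>`. Plain concatenation in step 4 is only adequate when the per-portion outputs can simply be appended, as with text output.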
Use Case
RUIAN = DB of geographical, municipal information, XML
Prague = 614 MB of data
Simple format
Records for streets, buildings, …
Task: split the large file into individual records (each in one XML file)
Takes 42 minutes in Saxon EE
Conclusion
Processing in multiple threads provides measurable speed-up
Imposes additional limitations on the stylesheet and input
The described approach makes sense only for large documents (for documents that fit into memory, other solutions are already available, e.g. saxon:threads)
https://github.com/j-maly/pXSLT