Accelerating DITA with OmniMark
A Scalable Solution for Demanding Production Environments
Copyright © Stilo International 2008
Darwin Information Typing Architecture (DITA)
An OASIS standard for content
    Reuse
    Repurpose
    Reduce
Features
    Transclusion
    Topic-level
    Maps
    Specialization
    Metadata-based filtering
Tools
    DITA Open Toolkit
    Editors
    CMSes
Established foundations and best practices
    HTML
    CALS tables
    HyTime
    Topic types
    XML
    SGML
DITA Publishing
Depends on efficient assembly, interpretation, filtering and formatting of content components
The DITA Open Toolkit Factor
The Open Toolkit has been a big part of DITA's success
    Open source
    Active development community
    Thorough implementation of DITA
    Out-of-the-box support for multiple output formats
    Modular architecture
    Easily customized
Components of the Open Toolkit are replaceable
    Users have a choice of XSLT and FO processor components
Many commercial products bundle the Open Toolkit
    As a result, DITA is closely identified with the Open Toolkit
DITA Editors incorporating Open Toolkit
    Adobe FrameMaker 8
    Information Mapping Content Mapper
    Inmedius DITA Storm
    In.vision DITA Studio
    Justsystems XMetaL Author Enterprise 5.1
    PTC Arbortext 5.3
    SyncRO Soft <oXygen/> 9.1
    Syntext Serna 3.5
    XMLmind XML Editor 3.6
Source: DITA Tools from A to Z, Bob Doyle, STC Intercom, April 2008 (DITA issue), http://www.ditanews.com/tools/STC_Intercom
DITA CMS Integration with Open Toolkit
    Astoria On Demand
    Author-it
    Bluestream XDocs
    DITA Exchange
    DocZone
    Inmedius Horizon
    IXIASOFT DITA CMS Framework
    PTC Arbortext Content Manager
    SiberLogic SiberSafe
    Trisoft Infoshare
    Vasont
    X-Hive Docato
    XyEnterprise Content@
Source: DITA Tools from A to Z, Bob Doyle, STC Intercom, April 2008 (DITA issue), http://www.ditanews.com/tools/STC_Intercom
Exploiting DITA
As DITA evolves, it will be applied to ever more demanding situations
    Many industries publish huge volumes of data
        Aerospace, automotive, oil services, legal publishing
Aspects of DITA can be used for their own sake
    DITA specialization may spin off into its own standard
    Transclusion can allow reuse even among monolithic documents
    Metadata-based filtering can provide general-purpose effectivity support
    DITA is a very modular specification
Some of these scenarios will have very demanding requirements
    Very large "topics"
    Large numbers of topics
The DITA Continuum at Stilo
Content | Details | Drivers
Pure DITA
    Semi-conductor Datasheets | FrameMaker source; PDF publishing | Authoring costs; Consistency; Customized Pubs
    Legal Procedures | e-Learning; Word and HTML source | Adaptable; Simplified authoring; Integration with existing XML tools
Semi-DITA
    Aerospace Standards, 2 projects | Monolithic; SGML, Interleaf, Word source; publish to ATA, S1000D, new web services | Many legacy formats; Multi-target; access to sub-contractors; S1000D support
    Aircraft Maint. Manuals | Monolithic; ATA source; E-manuals | Efficient update; Targeting; Costs; Regulatory compliance
Non-DITA
    Automotive | Monolithic; Multiple sources; SGML | Efficient update; Targeting; Costs
    Software Docs | Topics; SGML; RDBMS storage | Authoring costs; Multi-target; Reuse
Pushing the Boundaries
How well does the Toolkit cope with these situations?
The Toolkit has a modular architecture
It can be used as a base for partial DITA applications
Some coding tricks are required
    XSLT rules must be implemented carefully to preserve support for specialization
Most importantly, XSLT is not known as a fast processing technology
Can the Toolkit cope with high volumes of data?
We can test this
Building a DITA Stress Test
Sample input is the DITA language reference
    200+ topics
    1468 conref references
    741 targets referenced by conref
    1.06 MB
    Average file size 5 kB
The DITA language reference was inflated in two ways
    Topic sizes were increased up to a factor of 100 (to 500 kB per file)
    Number of files was increased up to a factor of 100 (to 20,000 files)
To increase topic sizes
    The body of each topic was replicated
    A random prefix was added to each word to create unique content
    The number of links increased proportionately
To increase the number of files
    The whole topic was replicated
    A random prefix was added to each word, each id, and each idref
    The number of links and link targets increased proportionately
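The topic-size inflation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used for the tests: it works on plain text rather than DITA XML, the helper names (`random_prefix`, `uniquify_words`, `inflate_body`) and the four-letter prefix format are hypothetical, and it assumes the first copy of the body is left unchanged.

```python
import random
import re
import string

def random_prefix(rng, length=4):
    """Return a short random lowercase prefix (the format is an assumption)."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def uniquify_words(text, rng):
    """Prepend a random prefix to every word so the copy becomes unique content."""
    return re.sub(r"[A-Za-z]+", lambda m: random_prefix(rng) + m.group(0), text)

def inflate_body(body_text, factor, rng):
    """Replicate a topic body `factor` times; each extra copy is uniquified."""
    copies = [body_text]
    for _ in range(factor - 1):
        copies.append(uniquify_words(body_text, rng))
    return "\n".join(copies)

rng = random.Random(2008)  # fixed seed so a run is repeatable
original = "The conref attribute pulls content from another topic."
inflated = inflate_body(original, 3, rng)
```

Inflating the number of files would follow the same pattern, with the prefix also applied to each id and idref so that links stay consistent within every copy.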
Open Toolkit performance (1)
Processing Time vs. Average File Size: from 5 kB to 50 kB
    AVG SIZE (kB)    Open Toolkit TIME (s)
    5                80
    10               156
    21               428
    31               852
    41               1422
    51               2160
[Chart: Open Toolkit processing time (seconds and hours) against average file size (kB)]
Open Toolkit performance (2)
Processing Time vs. Average File Size: from 5 kB to 500 kB
    AVG SIZE (kB)    Open Toolkit TIME (s)    Open Toolkit TIME (hr)
    5                80                       0.02
    10               156                      0.04
    21               428                      0.12
    31               852                      0.24
    41               1422                     0.40
    51               2160                     0.60
    103              8160                     2.27
    206              33660                    9.35
    309              OUT OF MEMORY
    412              OUT OF MEMORY
    515              OUT OF MEMORY
[Chart: Open Toolkit processing time (seconds and hours) against average file size (kB)]
Open Toolkit performance (3)
Processing Time vs. Number of Files: from 200 to 2,000
    Number of Files    Open Toolkit TIME (s)
    206                80
    412                144
    824                286
    1236               415
    1648               557
    2060               699
[Chart: Open Toolkit processing time (s) against number of files]
Open Toolkit performance (4)
Processing Time vs. Number of Files: from 200 to 20,000
    Number of Files    Open Toolkit TIME (s)
    206                80
    412                144
    824                286
    1236               415
    1648               557
    2060               699
    4120               1429
    8240               OUT OF MEMORY
    12360              OUT OF MEMORY
    16480              OUT OF MEMORY
    20600              OUT OF MEMORY
[Chart: Open Toolkit processing time (s) against number of files]
Accelerating DITA for Production
An alternative to the Toolkit is required
Production-level quality
    No limits on large volumes of content
    Consistently high throughput speed as volume increases
    Robust and maintainable
Rapid development architecture
    Out-of-the-box rendering for standard DITA schemas/DTDs
    Easily customized
DITA-aware
    Built-in support for DITA concepts
        Transclusion
        Specialization
        Filtering
    No programming tricks required
OmniMark DITA Accelerator
DITA Accelerator implements HTML publishing
    Implements all functionality required for the language reference
    HTML support still requires completion
    PDF to be implemented in the future
Behavior is modeled on the Toolkit
    Automated tests were written to ensure that the output is almost identical
The output of the DITA Accelerator is nearly identical to the Open Toolkit
    [Screenshots: index.html from the Open Toolkit; index.html from the DITA Accelerator]
Some small differences remain
    Table cell borders are inconsistent in some cases
    Some errors in the DITA Toolkit are corrected in the Accelerator
High performance is achieved with streaming technology
    Leverages OmniMark's built-in support for streaming
    Makes heavy use of referents
A DITA-aware library has been implemented
    Programmers do not have to employ coding tricks
Gentlemen, start your engines
DITA language reference
    206 files
    1414 elements with ids (potential link or conref targets)
    1468 conref references
    741 targets referenced by conref
    1.06 MB
    Average file size 5 kB
Initial results are promising
    DITA Open Toolkit: 1 minute, 21 seconds
    DITA Accelerator: 18 seconds
    Speed improvement: 4X
What about larger input sets?
Comparing DITA Accelerator and Open Toolkit (1)
Processing Time vs. Average File Size: from 5 kB to 50 kB
    AVG SIZE (kB)    Open Toolkit TIME (s)    DITA Accelerator TIME (s)
    5                80                       18
    10               156                      20
    21               428                      35
    31               852                      41
    41               1422                     46
    51               2160                     57
[Chart: processing time (seconds and hours) against average file size (kB), DITA Accelerator vs. Open Toolkit]
Comparing DITA Accelerator and Open Toolkit (2)
Processing Time vs. Average File Size: from 5 kB to 500 kB
    AVG SIZE (kB)    Open Toolkit TIME (s)    DITA Accelerator TIME (s)
    5                80                       18
    10               156                      20
    21               428                      35
    31               852                      41
    41               1422                     46
    51               2160                     57
    103              8160                     86
    206              33660                    150
    309              OUT OF MEMORY            217
    412              OUT OF MEMORY            292
    515              OUT OF MEMORY            369
[Chart: processing time (seconds and hours) against average file size (kB), DITA Accelerator vs. Open Toolkit; callouts: "= 9 hours" and "= 6 minutes"]
Processing Throughput as File Size Increases
Comparing throughput rate as sizes increase
    AVG SIZE (kB)    Open Toolkit THROUGHPUT (kB/s)    DITA Accelerator THROUGHPUT (kB/s)
    5                13.3                              62
    10               13.6                              104
    21               9.9                               120
    31               7.5                               155
    41               6.0                               184
    51               4.9                               186
    103              2.6                               247
    206              1.3                               283
    309              OUT OF MEMORY                     293
    412              OUT OF MEMORY                     290
    515              OUT OF MEMORY                     287
[Chart: throughput rate (kB/s) against average file size (kB), DITA Accelerator vs. Open Toolkit]
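Throughput here appears to be the total input size divided by the processing time. A minimal Python sketch of that arithmetic for the baseline run follows; it assumes the 1.06 MB language reference counts as 1060 kB, and the name `throughput_kb_per_s` is hypothetical. The slides do not state the exact totals used, so computed figures need not match the table to the last digit.

```python
BASELINE_KB = 1060  # 1.06 MB language reference, taken as 1060 kB (assumption)

def throughput_kb_per_s(total_kb, seconds):
    """Throughput is the amount of input processed per second."""
    return total_kb / seconds

# Baseline timings from the 5 kB row of the timing tables.
open_toolkit = throughput_kb_per_s(BASELINE_KB, 80)
accelerator = throughput_kb_per_s(BASELINE_KB, 18)
```

Under these assumptions the Open Toolkit figure comes out close to the 13.3 kB/s reported for the 5 kB row.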
Comparing DITA Accelerator and Open Toolkit (3)
Processing Time vs. Number of Files: from 200 to 20,000
    Number of Files    Open Toolkit TIME (s)    DITA Accelerator TIME (s)
    206                80                       18
    412                144                      49.5
    824                286                      100.2
    1236               415                      142.4
    1648               557                      193.8
    2060               699                      247.1
    4120               1429                     491.2
    8240               OUT OF MEMORY            1055.2
    12360              OUT OF MEMORY            1601.5
    16480              OUT OF MEMORY            2143.4
    20600              OUT OF MEMORY            2788.5
[Chart: processing time (s) against number of files, DITA Accelerator vs. Open Toolkit]
Throughput rate as number of files increases
    Number of Files    Open Toolkit THROUGHPUT (kB/s)    DITA Accelerator THROUGHPUT (kB/s)
    206                13.3                              35
    412                14.7                              35
    824                14.8                              42
    1236               15.3                              47
    1648               15.2                              47
    2060               15.2                              45
    4120               14.8                              43
    8240               OUT OF MEMORY                     42
    12360              OUT OF MEMORY                     42
    16480              OUT OF MEMORY                     41
    20600              OUT OF MEMORY                     39
[Chart: throughput rate (kB/s) against number of files, DITA Accelerator vs. DITA Open Toolkit]
Interpretation of timing statistics
DITA Open Toolkit is best for light duty
    Performance degrades rapidly as file sizes increase
    Performance is fairly flat as the number of files increases
    In both sets of tests, the toolkit eventually fails when it runs out of memory
    A great starting point
OmniMark DITA Accelerator is robust and scales well
    Does not run out of memory
    Throughput rate is fairly flat in both types of testing
DITA can play in demanding production environments
    Because DITA is a standard, technology can be changed without changing the information architecture
Ongoing analysis
Tests used DITA Toolkit "out-of-the-box"
    Different XSLT processors may improve performance
    Forum discussions suggest a workaround for memory exhaustion
        Reload XSLT stylesheet on every transformation
        Currently requires toolkit modification (may be configurable in 1.5)
        Expect slower performance on smaller topics
Even with improvements, best performance will still be quadratic for increasing file sizes
    [Chart: linear vs. quadratic growth]
There will be room for improvement for the foreseeable future
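The quadratic claim can be checked against the reported timings. The Python sketch below estimates the exponent k in time ~ size^k from two measurements; the function name `scaling_exponent` is hypothetical, and the inputs are the 103 kB and 206 kB rows of the file-size comparison. It is a rough two-point estimate, not a fitted model.

```python
import math

def scaling_exponent(size_a, time_a, size_b, time_b):
    """Estimate k in time ~ size**k from two (size, time) measurements."""
    return math.log(time_b / time_a) / math.log(size_b / size_a)

# Average file size doubles from 103 kB to 206 kB in both cases.
toolkit_k = scaling_exponent(103, 8160, 206, 33660)   # Open Toolkit timings (s)
accelerator_k = scaling_exponent(103, 86, 206, 150)   # DITA Accelerator timings (s)
```

An exponent near 2 means that doubling the file size roughly quadruples the processing time, while an exponent at or below 1 indicates growth that is linear or better over that range.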
Role of OmniMark
Most of the performance is due to engineering "behind the scenes"
    Native efficiency of OmniMark
    Streaming architecture reduces memory requirements
    Record shelves can be used to implement high-speed lookup for DITA processing rules
OmniMark referents simplify support for transclusion
    Referents are a streaming mechanism for reordering content
    Eliminate complex book-keeping
OmniMark language is easily extended
    Macros
    Modules (functions and data types)
Bonus: SGML support included
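As a loose analogy for how referents let a streaming program emit a placeholder first and supply its value later, consider the Python sketch below. It is not OmniMark code and does not reflect how OmniMark implements referents; the class `ReferentOutput` and its methods are hypothetical names invented for this illustration.

```python
class ReferentOutput:
    """Toy stand-in for referent-style output: placeholders are resolved late."""

    def __init__(self):
        self._parts = []   # literal strings or ("ref", name) placeholders
        self._values = {}  # placeholder name -> final text

    def write(self, text):
        self._parts.append(text)

    def write_referent(self, name):
        # Emit a placeholder for content that has not been seen yet.
        self._parts.append(("ref", name))

    def set_referent(self, name, value):
        self._values[name] = value

    def resolve(self):
        # Substitute every placeholder once all values are known.
        out = []
        for part in self._parts:
            out.append(self._values[part[1]] if isinstance(part, tuple) else part)
        return "".join(out)

out = ReferentOutput()
out.write("<p>See: ")
out.write_referent("note-1")  # conref target not yet encountered in the stream
out.write("</p>")
out.set_referent("note-1", "<b>Back up your data first.</b>")
result = out.resolve()
```

The point of the analogy is that the writer never has to buffer and revisit earlier output by hand, which is the book-keeping the slide says referents eliminate.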
Usability
XSLT supports DITA reluctantly
XSLT rule selection mechanism is not DITA-aware
    Two templates that match the element "u":
        <xsl:template match="*[contains(@class,' hi-d/u ')]">
        <xsl:template match="*[contains(@class,' topic/ph ')]">
    Both have equal priority
    Programmer must use tricks to ensure that the "hi-d/u" rule takes precedence over the "topic/ph" rule
    Extra conditions on the "topic/ph" rule can invert the hierarchy!
The spaces around the class names are required
    And no more than one on each side
    XSLT does not enforce this
    Programmer must code carefully to avoid inexplicable behavior
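The role of the delimiting spaces can be illustrated outside XSLT. The Python sketch below mimics the `contains(@class, ' token ')` test as a plain substring check; the function names are hypothetical, and the class values shown are the usual DITA defaults for `<u>` and `<pre>`.

```python
def has_class_padded(class_attr, token):
    """Mimic contains(@class, ' token '): the padding spaces are part of the test."""
    return f" {token} " in class_attr

def has_class_unpadded(class_attr, token):
    """The careless variant: a bare substring test with no delimiting spaces."""
    return token in class_attr

u_class = "+ topic/ph hi-d/u "   # class attribute of <u>
pre_class = "- topic/pre "       # class attribute of <pre>

both_match = (has_class_padded(u_class, "hi-d/u"),
              has_class_padded(u_class, "topic/ph"))
false_positive = has_class_unpadded(pre_class, "topic/p")
correct_rejection = has_class_padded(pre_class, "topic/p")
```

Both padded tests succeed for `<u>`, which is why the two templates compete at equal priority, and dropping the spaces makes a rule for "topic/p" fire on `<pre>` as well.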
OmniMark extensions provide DITA support
DITA Accelerator augments OmniMark with "DITA rules"
    Automatically prioritized according to the specialization hierarchy
    Rule selection is optimized so that performance stays consistent as more rules are added
    DITA rules can be grouped into sets, like OmniMark rules
    DITA rules can be supplied as OmniMark modules
    Local DITA rules take precedence over imported rules for the same DITA class
Module supplies support functions that understand DITA class specialization
DITA Accelerator specialization support
The syntax of DITA rules is based on OmniMark element rules
    Element rules specify element names
        element "u"
            output "<u>" || "%c" || "</u>"
    DITA rules specify classes instead of elements
        declare dita-rule hi-d-u-rule
            class "hi-d/u"
            output "<u>" || dita.process-content || "</u>"
        Callout on class "hi-d/u": selection by DITA class (understands specialization)
        Callout on dita.process-content: processes content, like "%c" in element rules
DITA rule for "hi-d/u" will take precedence over "topic/ph"
    Based on the class specialization in the DTD
Currently implemented by macros
    Allows access to full OmniMark language
DITA module also provides utility functions
    DITA class-based queries for current and ancestor elements
    Mimics the element tests built into OmniMark
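The precedence of "hi-d/u" over "topic/ph" follows from the order of tokens in the DITA class attribute, where the most specialized class comes last. The Python sketch below shows one way such a lookup could work; it is an illustration only, not the Accelerator's implementation, and `select_rule` and the `rules` table are hypothetical.

```python
def select_rule(class_attr, rules):
    """Pick the rule registered for the most specialized class that has one."""
    tokens = class_attr.split()[1:]   # drop the leading "-" or "+" marker
    for token in reversed(tokens):    # the most specialized class is listed last
        if token in rules:
            return rules[token]
    return None

rules = {
    "topic/ph": lambda content: content,                 # generic phrase: pass through
    "hi-d/u": lambda content: "<u>" + content + "</u>",  # underline specialization
}

rendered = select_rule("+ topic/ph hi-d/u ", rules)("underlined")
fallback = select_rule("+ topic/ph hi-d/b ", rules)("bold text")
```

An element whose specialized class has no rule of its own, such as "hi-d/b" here, falls back to the rule for its ancestor class "topic/ph", which is the behavior specialization-aware rule selection is meant to guarantee.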
Conclusions
The OmniMark-based DITA Accelerator provides scalability
    Robust
    Consistent throughput as volumes increase
    No catastrophic failures
The OmniMark language can be easily extended to provide a natural DITA programming environment
    Programmers can "think in DITA", rather than trying to align a pre-existing programming model with the DITA semantics
Standards are about choice of tools
    DITA Toolkit is a good choice for
        Learning DITA
        Prototyping
        Less demanding production uses
    OmniMark DITA Accelerator
        Demanding production environments
Most importantly, tool choice must be governed by the unique characteristics of your environment