XML Compression
Aslam Tajwala
Kalyan Chakravorty
Overview
• Motivation for XML Compression
• Techniques for achieving XML compression
• XMill
• XMill Architecture
Why Compress XML?
• Structured nature of XML makes it understandable to humans
• Downside: XML is verbose
– Each non-empty element tag must end with a matching closing tag: <tag>data</tag>
– Ordering of tags is often repeated in a document (e.g. multiple records)
Why Compress XML?: 2
• XML documents are text-based: well-known compression schemes such as Huffman and LZ can be easily applied
• Compression can yield significant savings because XML is highly structured and repetitive
• XML is being used more frequently in real-time applications (e.g. web service-based e-commerce applications); increasing interest in finding ways to reduce overall size of XML documents
Using Huffman/LZ
• Usually some degree of repetition in an XML document (multiple occurrences of tags, attribute or data values)
• Compression schemes like Huffman and LZ can use this repetition to achieve some degree of compression
Using Huffman/LZ: 2
• Many existing (and efficient) implementations of these algorithms are readily available (e.g. gzip)
• Downside is that these techniques aren’t fully capable of exploiting the structure of XML to achieve greater compression
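As a rough illustration of the points above, the sketch below runs a general-purpose compressor over a repetitive XML document. It uses Python's standard zlib module (DEFLATE, which combines LZ77 matching with Huffman coding). The sample document is made up for this example, and the ratio will differ for real data.

```python
import zlib

# Hypothetical sample: 200 records that repeat the same tag sequence.
records = "".join(
    f'<Employee id="{i}"><Name>Employee {i}</Name></Employee>\n'
    for i in range(200)
)
document = f"<Employees>\n{records}</Employees>\n".encode("utf-8")

# Level 9 asks zlib for its strongest (slowest) compression setting.
compressed = zlib.compress(document, 9)

print(len(document), "bytes before,", len(compressed), "bytes after")

# Decompression restores the original bytes exactly (lossless).
restored = zlib.decompress(compressed)
```

The compressor sees only a byte stream here: it has no idea which bytes are tags and which are data, which is the limitation the last bullet refers to.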
Huffman Encoding Example
• Input text: ACDABA
• Since these are 6 characters, this text is 6 bytes or 48 bits long
• A tree is built that replaces the symbols with shorter bit sequences. In this particular case, the algorithm would use the following substitution table: A=0, B=10, C=110, D=111
• Encoded: 01101110100 (ACDABA = 11 bits)
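The example above can be reproduced with a short sketch. The function names are chosen for this illustration only. Ties between equally frequent symbols can be broken in more than one way, so the generated table may differ from the one on the slide while giving the same total length.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each symbol in text to its Huffman bit string."""
    counts = Counter(text)
    # Heap entries: (frequency, unique tie-breaker, {symbol: code so far}).
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new parent node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    # Works because Huffman codes are prefix-free.
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

# Substitution table taken from the slide.
SLIDE_TABLE = {"A": "0", "B": "10", "C": "110", "D": "111"}

codes = huffman_codes("ACDABA")
bits = encode("ACDABA", codes)
slide_bits = encode("ACDABA", SLIDE_TABLE)
```

Both `bits` and `slide_bits` are 11 bits long, against 48 bits for the uncompressed text.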
LZ77 Example (Dictionary-Based Compressors)
• Lempel-Ziv 77 algorithm
• Dictionary is a portion of the previously encoded sequence
• The encoder examines the input sequence through a sliding window
• The window consists of two parts:
– a search buffer that contains a portion of the recently encoded sequence, and
– a look-ahead buffer that contains the next portion of the sequence to be encoded.
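The sliding window can be sketched as follows. This is a minimal textbook-style version that emits (offset, length, next character) triples. The buffer sizes and function names are assumptions for illustration, and the brute-force search is far slower than what production libraries use.

```python
def lz77_encode(data, search_size=16, lookahead_size=8):
    """Encode a string as a list of (offset, length, next_char) triples."""
    i, out = 0, []
    while i < len(data):
        best_offset, best_length = 0, 0
        # The search buffer holds up to search_size already-encoded symbols.
        start = max(0, i - search_size)
        # Leave at least one symbol to serve as next_char.
        max_length = min(lookahead_size, len(data) - i - 1)
        for j in range(start, i):
            length = 0
            # A match may run past position i into the look-ahead buffer.
            while length < max_length and data[j + length] == data[i + length]:
                length += 1
            if length > best_length:
                best_offset, best_length = i - j, length
        out.append((best_offset, best_length, data[i + best_length]))
        i += best_length + 1
    return out

def lz77_decode(triples):
    """Rebuild the string by copying from the already-decoded output."""
    out = []
    for offset, length, next_char in triples:
        for _ in range(length):
            # Copy one symbol at a time so overlapping matches work.
            out.append(out[-offset])
        out.append(next_char)
    return "".join(out)

sample = "<a>1</a><a>2</a><a>3</a>"
triples = lz77_encode(sample)
```

On tag-heavy input such as `sample`, repeated tags are found in the search buffer and replaced by back-references, so fewer triples than input characters are emitted.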
XMill (Liefke and Suciu, 2000)
• Relies heavily on zlib, the compression library used in gzip
• Also defines a few data type specific compressors; user-defined compressors can be added using SCAPI (Semantic Compressor API)
• During compression, each XML tag is examined to see which compression technique(s) should be applied
XML Compression
• View XML as a tree
• Separate the tree structure and what is stored in leaves
• Save the tree structure so that it can be restored
• The compressed file may or may not remember the tree structure
XMill: Compression Strategy
• XMill applies 3 principles during compression:
– Separate structure (element tags and attribute names) from data
– Group related data items in a single container; compress each container separately
– Apply appropriate semantic compressors to each container
XMill – Separating Structure From Content
• Start tags and attribute names are dictionary-encoded (as T1, T2, etc.)
• End tags replaced with ‘/’ token
• Data values replaced with their container number
XMill – Separating Structure From Content 2
<Employees>
<Employee id="1">Homer Simpson</Employee>
<Employee id="2">Frank Grimes</Employee>
</Employees>
Dictionary:
T1 => Employees
T2 => Employee
T3 => @id
Structure container:
T1 T2 T3 C3 / C4 / T2 T3 C3 / C4 / /
Container C3:
1
2
Container C4:
Homer Simpson
Frank Grimes
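The separation shown in this example can be sketched with Python's standard SAX parser. This is not XMill's code: the class, its method names, and the `first_container_id` parameter are hypothetical, and containers are simply keyed by element path. Whitespace-only text is dropped for brevity, which a real compressor could not do.

```python
import xml.sax

class StructureSeparator(xml.sax.ContentHandler):
    """Splits a document into a tag dictionary, a structure stream, and containers."""

    def __init__(self, first_container_id=1):
        super().__init__()
        self.tag_ids = {}        # tag or @attribute name -> dictionary index
        self.container_ids = {}  # path -> container number
        self.containers = {}     # container number -> list of data values
        self.structure = []      # token stream: T<n>, C<n>, "/"
        self.path = []
        self.text = []
        self.next_container = first_container_id

    def _tag(self, name):
        if name not in self.tag_ids:
            self.tag_ids[name] = len(self.tag_ids) + 1
        return "T%d" % self.tag_ids[name]

    def _store(self, path, value):
        key = "/".join(path)
        if key not in self.container_ids:
            self.container_ids[key] = self.next_container
            self.next_container += 1
        cid = self.container_ids[key]
        self.containers.setdefault(cid, []).append(value)
        # The structure stream keeps only the container number.
        self.structure.append("C%d" % cid)

    def _flush_text(self):
        value = "".join(self.text).strip()  # simplification: drops whitespace
        self.text = []
        if value:
            self._store(self.path, value)

    def startElement(self, name, attrs):
        self._flush_text()
        self.structure.append(self._tag(name))
        self.path.append(name)
        for attr in attrs.getNames():
            self.structure.append(self._tag("@" + attr))
            self._store(self.path + ["@" + attr], attrs.getValue(attr))
            self.structure.append("/")

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.path.pop()
        self.structure.append("/")  # every end tag becomes the same token

def separate(xml_text, first_container_id=1):
    handler = StructureSeparator(first_container_id)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler

SAMPLE = """<Employees>
<Employee id="1">Homer Simpson</Employee>
<Employee id="2">Frank Grimes</Employee>
</Employees>"""

# Start numbering at 3 only to line up with the C3/C4 labels on the slide.
result = separate(SAMPLE, first_container_id=3)
```

With the sample document, `result.structure` joins to the same token stream as on the slide, and `result.containers` holds the id values and the names in two separate lists.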
XMill: Container Expressions
• Users can override default settings using the container expression language
– Specify container membership
– Specify which semantic compressor(s) are applied for each container
• E.g. to indicate all ‘Name’ and ‘Location’ tags should be grouped in the same container:
xmill -p //(Name | Location) employees.xml
XMill: Semantic Compressors
Compressor   Description
t            Default text compressor (gzipped only)
u            Compressor for positive integers (binary encoded using 1–4 bytes)
i            Compressor for integers
u8           Compressor for positive integers < 256
di           Differential compressor for integers
XMill: Semantic Compressors 2
Compressor   Description
rl           Run-length encoder (stores a single copy of a sequence, its length, and repetition count)
e            Enumeration encoder (dictionary)
“…”          Constant compressor: outputs nothing; used to check that the current token is a specified constant value
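The ideas behind three of these compressors (di, rl, e) can be sketched in a few lines. These functions illustrate the general techniques only. They are not XMill's implementations or its on-disk formats, the names are made up for this example, and the run-length version handles runs of single values, not repeated sequences.

```python
def differential_encode(values):
    """Keep the first integer, then store each successive difference."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def differential_decode(deltas):
    out = []
    for d in deltas:
        out.append(d if not out else out[-1] + d)
    return out

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def enumeration_encode(values):
    """Replace each distinct value with its index in a dictionary."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

# Hypothetical container contents.
ids = [100, 101, 103, 106]
states = ["WA", "WA", "WA", "OR"]

deltas = differential_encode(ids)
runs = run_length_encode(states)
dictionary, codes = enumeration_encode(states)
```

Each encoder turns a container into smaller or more repetitive values, which the final gzip pass can then compress further.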
XMill: Semantic Compressors 3
• Text compressor is applied to each element by default
• User can add other instructions via command line:
xmill -p //price=>i file.xml
Applies integer compressor to each occurrence of ‘price’ element in file.xml
XMill Architecture (1/3)
XMill Architecture (2/3)
• SAX Parser – sends tokens to the path processor.
• Path Processor – determines how to map data values to containers.
• Semantic Compressors – compress the input and copy it to the container in the memory window.
– E.g. binary encoding of integers, differential compressors.
When the window is filled, all containers are gzipped, stored on disk, and the compression resumes.
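The window-and-flush behaviour described above can be sketched as follows. The class name, the 8 MB default, the zero-byte value separator, and the in-memory `blocks` list standing in for the disk are all assumptions made for this illustration.

```python
import zlib

class ContainerWindow:
    """Buffers container data in memory and flushes when a size limit is hit."""

    def __init__(self, window_bytes=8 * 1024 * 1024):
        self.window_bytes = window_bytes  # hypothetical default
        self.containers = {}              # container id -> bytearray
        self.used = 0
        self.blocks = []                  # stands in for blocks written to disk

    def append(self, container_id, value):
        # Values are separated by a zero byte (an assumption of this sketch).
        data = value.encode("utf-8") + b"\0"
        self.containers.setdefault(container_id, bytearray()).extend(data)
        self.used += len(data)
        if self.used >= self.window_bytes:
            self.flush()

    def flush(self):
        if not self.containers:
            return
        # Each container is compressed on its own, so similar data stays together.
        block = {cid: zlib.compress(bytes(buf))
                 for cid, buf in self.containers.items()}
        self.blocks.append(block)
        self.containers = {}
        self.used = 0

# A tiny window so the flush is easy to observe.
window = ContainerWindow(window_bytes=16)
window.append(3, "1")
window.append(4, "Homer Simpson")   # fills the window and triggers a flush
window.append(3, "2")
window.flush()                      # final flush at end of input
```

After these calls `window.blocks` holds two blocks: the first was written when the window filled, the second by the final explicit flush.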
Performance Evaluation (1/2)
Performance Evaluation (2/2)
References
• XMill: An Efficient Compressor for XML Data
• XGrind: A Query-Friendly Compressor
• www.cs.washington.edu/homes/suciu/COURSES/590DS/19compression.ppt
• Questions?