Version Management for XML Documents: Copy-Based vs Edit-Based Schemes
Shu-Yao Chien
Computer Science Department
University of California, Los Angeles
Vassilis J. Tsotras
Department of Computer Science and Engineering
University of California, Riverside
Carlo Zaniolo
Computer Science Department
University of California, Los Angeles
The Problem
• Managing (storing, querying) multiple versions of documents is important for content providers and cooperative work
• Temporal DBs: transaction time, CAD/OO applications
• Web/XML changes/unifies everything
• Traditional schemes (RCS, SCCS): not optimized for secondary store (no temporal clustering)
• DB-oriented approaches: not optimized for retrieval of complete documents
• Transport level: exchange and processing (browser side) of multiversion documents is also critical; we need to reconcile storage and exchange representations.
Version Management: Approaches
• Time stamping of objects
• Store all Snapshots: fast retrieval, excessive storage
• Edit-Based Schemes store the Deltas. Minimal storage but slow retrieval.
• Traditionally line-oriented DIFF, but semistructured objects in Lorel
• Our Scheme: Usefulness Based Copy Control (UBCC)
- Separate edit scripts from the objects.
- Temporal Clustering of objects using page usefulness.
Example: an Evolving XML Document
VERSION 1
<root>
  <ch A>
    <sec D> ... </sec>
    <sec E> ... </sec>
  </ch>
  <ch B>
    <sec F> ... </sec>
    <sec G> ... </sec>
    <sec H> ... </sec>
  </ch>
</root>
Order: 1 to 8 (root, ch A, sec D, sec E, ch B, sec F, sec G, sec H)
VERSION 2
<root>
  <ch A>
    <sec J> ... </sec>
    <sec E> ... </sec>
  </ch>
  <ch B>
    <sec F> ... </sec>
    <sec G’> ... </sec>
  </ch>
  <ch K>
    <sec L> ... </sec>
  </ch>
</root>
Order: 1 to 9 (root, ch A, sec J, sec E, ch B, sec F, sec G’, ch K, sec L)
Temporal Clustering by Page Usefulness
• Usefulness: percentage of a page occupied by objects from the current version; the rest is occupied by ‘dead’ objects from previous versions
• We set a minimum usefulness requirement, e.g. 50%
• When the usefulness of a page falls below this minimum, we copy its live objects to a new page
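The usefulness test above can be sketched in a few lines of Python. This is only an illustration: the names `StoredObject`, `Page` and `needs_copy` are hypothetical, and it assumes object sizes and page capacity are measured in the same unit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StoredObject:
    name: str
    size: int            # space the object occupies on the page (assumed unit)
    alive: bool = True   # False once a later version deletes or replaces it


@dataclass
class Page:
    capacity: int
    objects: List[StoredObject] = field(default_factory=list)

    def usefulness(self) -> float:
        """Fraction of the page occupied by objects of the current version."""
        live = sum(o.size for o in self.objects if o.alive)
        return live / self.capacity


def needs_copy(page: Page, u_min: float = 0.5) -> bool:
    """True when the page's live objects should be copied to a new page."""
    return page.usefulness() < u_min
```

With four equal-size objects on a page, one deletion leaves the usefulness at 75% and a second one drops it to 50%, which triggers copying for a 70% minimum but not for a 50% minimum.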
Maintaining Page Usefulness above 70% by Copying Alive Objects
VERSION 1: objects O1 to O8 stored on pages P1 and P2
VERSION 2: three objects deleted (DEL)
- U(P1) = 75%
- U(P2) = 50% < Umin = 70%
- P3 (Copied): O5, O6, O9, O10; U(P3) = 100%
Usefulness Based Copy Control (UBCC)
VERSION 1 (pages P1, P2): root, ch A, sec D, sec E, ch B, sec F, sec G, sec H
VERSION 2 edit script: INS(sec J), DEL, INS(sec G’), DEL, DEL, INS(ch K), INS(sec L)
• STEP 1: Determine page usefulness for copying.
- U(P1) = 75%
- U(P2) = 50% < Umin = 70%
• STEP 2: Append new/copied objects into new pages by their logical order.
- P3: sec J, ch B, sec F, sec G’ (ch B and sec F are copied from P2); U(P3) = 100%
- P4: ch K, sec L; U(P4) = 100%
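The two steps can be sketched as follows, using the document from the running example. This is a simplified illustration, not the authors' implementation: `ubcc_update` is a hypothetical name, objects are assumed to have equal size, each page is assumed to hold four objects, and usefulness is only evaluated for pages that already exist.

```python
def ubcc_update(pages, new_version, page_size=4, u_min=0.7):
    """pages: list of pages, each a list of [name, alive] entries (mutated).
    new_version: object names of the new version in document order.
    Returns the newly created pages."""
    current = set(new_version)
    # Objects absent from the new version become dead in place.
    for page in pages:
        for entry in page:
            if entry[1] and entry[0] not in current:
                entry[1] = False

    # STEP 1: find pages whose usefulness dropped below the minimum.
    stored = {}   # name -> index of the page holding the live copy
    low = set()   # indexes of pages below u_min
    for i, page in enumerate(pages):
        live = [e for e in page if e[1]]
        if len(live) / page_size < u_min:
            low.add(i)
        for e in live:
            stored[e[0]] = i

    # STEP 2: append new and copied objects to new pages in logical order.
    new_pages = []
    for name in new_version:
        if name in stored and stored[name] not in low:
            continue  # object stays on its current page
        if not new_pages or len(new_pages[-1]) == page_size:
            new_pages.append([])
        new_pages[-1].append([name, True])
    # Old copies on low-usefulness pages are no longer the current ones.
    for i in low:
        for e in pages[i]:
            e[1] = False
    pages.extend(new_pages)
    return new_pages


p1 = [["root", True], ["ch A", True], ["sec D", True], ["sec E", True]]
p2 = [["ch B", True], ["sec F", True], ["sec G", True], ["sec H", True]]
pages = [p1, p2]
v2 = ["root", "ch A", "sec J", "sec E", "ch B", "sec F", "sec G'", "ch K", "sec L"]
created = ubcc_update(pages, v2)
```

On this input the sketch produces the P3 and P4 layout shown on the slide. A real implementation would also need to treat the last, partially filled page specially, which this sketch does not attempt.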
Document Object Order
P1: root1, ch A2, sec D, sec E4
P2: ch B, sec F, sec G, sec H
P3: sec J3, ch B5, sec F6, sec G’7
P4: ch K8, sec L9
• Version 2 objects are not stored in sequence.
• Hence, we use the edit script.
VERSION 2 = (root1, ch A2, sec J3, sec E4, ch B5, sec F6, sec G’7, ch K8, sec L9)
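One way to picture the role of the edit script is as a list of operations replayed against the previous version's object order. The encoding below (`keep`, `del`, `ins`, `copy` tuples and `(page, name)` references) is a hypothetical one chosen for illustration; the slides do not specify the actual script format.

```python
def apply_edit_script(prev, script):
    """prev: object references of the previous version in document order.
    script: operations read left to right against prev:
      ("keep", n)   keep the next n objects
      ("del", n)    skip the next n objects
      ("ins", ref)  emit a new object
      ("copy", ref) replace the next object by its copy at a new location
    Returns the object references of the new version in document order."""
    out, pos = [], 0
    for op, arg in script:
        if op == "keep":
            out.extend(prev[pos:pos + arg])
            pos += arg
        elif op == "del":
            pos += arg
        elif op == "ins":
            out.append(arg)
        elif op == "copy":
            out.append(arg)
            pos += 1
        else:
            raise ValueError(f"unknown operation: {op}")
    if pos != len(prev):
        raise ValueError("script does not cover the whole previous version")
    return out


v1 = [("P1", "root"), ("P1", "ch A"), ("P1", "sec D"), ("P1", "sec E"),
      ("P2", "ch B"), ("P2", "sec F"), ("P2", "sec G"), ("P2", "sec H")]
script_v2 = [("keep", 2), ("ins", ("P3", "sec J")), ("del", 1), ("keep", 1),
             ("copy", ("P3", "ch B")), ("copy", ("P3", "sec F")),
             ("ins", ("P3", "sec G'")), ("del", 2),
             ("ins", ("P4", "ch K")), ("ins", ("P4", "sec L"))]
v2 = apply_edit_script(v1, script_v2)
```

The result lists the nine Version 2 objects in document order even though they are spread over P1, P3 and P4.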
Beyond Edit-Based Versioning
• The UBCC scheme achieves good storage and retrieval efficiency.
• But it is not suitable at the transport level or for queries on content.
• Thus, we propose a copy-based model which:
– explores shared elements
– needs no edit script
– yields a simple XML representation for the document history
The XML Version Model (XVM)
• XVM is a list of version nodes
• Each version node is an ordered tree consisting of four types of nodes:
– element node
– attribute node
– text node
– copy record node
• Minimal extensions to the XPath data model: the copy record node is actually a link.
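A minimal sketch of this data model in Python follows. The `Node` class and `resolve` function are hypothetical names. The sketch assumes that a tree address such as V1.2.1 means "version 1, second child, then its first child" with 1-based indexes, and that a copy record met along the way is followed like a link.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str                   # "version", "element", "attribute", "text" or "copy"
    label: str = ""             # tag name, attribute name or text content
    ref: Optional[str] = None   # tree address, used only by copy record nodes
    children: List["Node"] = field(default_factory=list)


def resolve(repository: List[Node], address: str) -> Node:
    """Return the node at a tree address such as 'V1.2.1'."""
    parts = address.lstrip("V").split(".")
    node = repository[int(parts[0]) - 1]
    for step in parts[1:]:
        node = node.children[int(step) - 1]
        if node.kind == "copy":   # a copy record is a link: follow it
            node = resolve(repository, node.ref)
    return node


# Tiny illustrative repository (labels are placeholders).
v1 = Node("version", children=[
    Node("element", "ch1"),
    Node("element", "ch2", children=[Node("element", "sec1")]),
])
v2 = Node("version", children=[Node("copy", ref="V1.2")])
repository = [v1, v2]
```

Here `resolve(repository, "V1.2.1")` reaches the node labelled `sec1` directly, and `resolve(repository, "V2.1.1")` reaches the same node through the copy record.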
Copy-Based XML Version Model (XVM)
[Figure: two version trees. Legend: V = version node, E = element node, T = text node, A = attribute node, C = copy record node. Copy record nodes reference other nodes by tree address, e.g. Ref: V1.2.1]
XVM --- Example
V1: chapter “Intro”, chapter “Tutorial”; sections “Scope”, “Concepts”, “Context”
Changes:
1. DELETE chapter “Tutorial”
2. INSERT chapter “Second Ex”
V2: copy record (Ref: V1.1); chapter “Second Ex” with section “Test Data”
Changes:
1. UPDATE the textual content of chapter “Second Ex”
2. COPY the “Concepts” section and insert after section “Test Data”.
V3: copy record (Ref: V2.1); chapter “Second Ex” with two copy records (Refs: V2.2.1, V2.1.2)
[Figure: version trees for V1, V2 and V3; the diagram also shows chapter “Intro” with sections “Scope” and “Concepts”]
XVM Version Retrieval --- Example
[Figure: the version trees V1, V2 and V3 from the previous example, with copy records followed to retrieve a version. Copy record Refs shown: V1.1, V2.1, V2.2.1, V2.1.2]
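Version retrieval by following copy records can be sketched as below. The tuple encoding and the function names are hypothetical. The assignment of sections to chapters in V1 ("Scope" and "Concepts" under "Intro", "Context" under "Tutorial") is an assumption inferred from the Ref V2.1.2, and the sketch also assumes that addresses are 1-based and may pass through copy records.

```python
def lookup(repo, address):
    """Return the node at a tree address such as 'V2.1.2', following copy records."""
    parts = [int(p) for p in address[1:].split(".")]
    node = ("version", repo[parts[0]])
    for i in parts[1:]:
        node = deref(repo, node)[1][i - 1]
    return deref(repo, node)


def deref(repo, node):
    """Follow a copy record to its target; other nodes are returned unchanged."""
    return lookup(repo, node[1]) if node[0] == "COPY" else node


def expand(repo, node):
    """Copy of node with every copy record replaced by the subtree it references."""
    if node[0] == "COPY":
        return expand(repo, lookup(repo, node[1]))
    label, children = node
    return (label, [expand(repo, c) for c in children])


def retrieve(repo, version):
    """Materialize one version as a list of fully expanded top-level nodes."""
    return [expand(repo, n) for n in repo[version]]


# A node is (label, children); a copy record is ("COPY", address).
repo = {
    1: [("chapter Intro", [("section Scope", []), ("section Concepts", [])]),
        ("chapter Tutorial", [("section Context", [])])],
    2: [("COPY", "V1.1"),
        ("chapter Second Ex", [("section Test Data", [])])],
    3: [("COPY", "V2.1"),
        ("chapter Second Ex", [("COPY", "V2.2.1"), ("COPY", "V2.1.2")])],
}
v3 = retrieve(repo, 3)
```

Retrieving V3 follows V2.1 to V1.1 for chapter “Intro”, and resolves the two copy records under chapter “Second Ex” to sections “Test Data” and “Concepts”.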
XVM Benefits
• Transport Level: represent XVM as an XML document, with its DTD automatically generated from the document DTD
• Storage Level: we extended the usefulness-based temporal clustering scheme to XVM
XVM Implementation --- Use XML to Represent XVM
• DTD Transformation:
– Define three new elements: <Repository>, <Version> and <CopyRecord>.
– For each element in the original DTD, add to its content model a CopyRecord as an alternate.
• Example:
Original DTD
  <!ELEMENT volumn (chapter)*>
  <!ELEMENT chapter (title,(sec)*)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT sec (#PCDATA)>
  . . .
Version DTD
  <!ELEMENT Repository (Version)+>
  <!ELEMENT Version (volumn)>
  <!ELEMENT CopyRecord>
  <!ATTLIST CopyRecord Ref IDREF>
  <!ELEMENT volumn (chapter)*>
  <!ELEMENT chapter ((title,(sec)*)|CopyRecord)>
  <!ELEMENT title ((#PCDATA)|CopyRecord)>
  <!ELEMENT sec ((#PCDATA)|CopyRecord)>
  . . .
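The transformation can be sketched as a small text rewrite over the element declarations. The function name `versioned_dtd` is hypothetical, and the regular expression only handles simple `<!ELEMENT name model>` declarations. Following the example on the slide, the root element's content model is left unchanged.

```python
import re


def versioned_dtd(original_dtd: str, root: str) -> str:
    """Generate the version DTD from the element declarations of a document DTD."""
    decls = re.findall(r"<!ELEMENT\s+(\S+)\s+(.*?)>", original_dtd)
    out = [
        "<!ELEMENT Repository (Version)+>",
        f"<!ELEMENT Version ({root})>",
        "<!ELEMENT CopyRecord>",
        "<!ATTLIST CopyRecord Ref IDREF>",
    ]
    for name, model in decls:
        if name == root:
            out.append(f"<!ELEMENT {name} {model}>")
        else:
            # Any non-root element may be replaced by a copy record.
            out.append(f"<!ELEMENT {name} ({model}|CopyRecord)>")
    return "\n".join(out)


original = (
    "<!ELEMENT volumn (chapter)*>"
    "<!ELEMENT chapter (title,(sec)*)>"
    "<!ELEMENT title (#PCDATA)>"
    "<!ELEMENT sec (#PCDATA)>"
)
version_dtd = versioned_dtd(original, "volumn")
```

The output mirrors the schematic notation of the slide. A strictly valid DTD would additionally need a content specification for CopyRecord (such as EMPTY), a default declaration for the Ref attribute, and mixed content written in the `(#PCDATA|CopyRecord)*` form.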
Performance and Storage Cost
[Figure: Storage. x-axis: Total Number of Versions (1 to 97); y-axis: Pages (0 to 12000). Series: RCS, Copy-Based 50%, Edit-Based 50%, Snapshot]
[Figure: Version Retrieval Cost. x-axis: Total Number of Versions (1 to 97); y-axis: Pages (0 to 1400). Series: RCS, Copy-Based 50%, Edit-Based 50%, Snapshot]
Conclusion
• UBCC is efficient at the storage level.
• The copy-based scheme is effective as a storage representation and a transport representation
• Our current research focuses on efficient evaluation of queries on versions:
– content queries
– snapshot queries
– history queries