Version Management for XML Documents: Copy-Based vs Edit-Based Schemes

Shu-Yao Chien
Computer Science Department
University of California, Los Angeles
csy@cs.ucla.edu

Vassilis J. Tsotras
Department of Computer Science and Engineering
University of California, Riverside

Carlo Zaniolo
Computer Science Department
University of California, Los Angeles

The Problem

• Managing (storing, querying) multiple versions of documents is important for content providers and cooperative work

• Temporal DBs: transaction time, CAD/OO applications

• Web/XML changes/unifies everything

• Traditional schemes (RCS, SCCS): not optimized for secondary storage, with no temporal clustering

• DB-oriented approaches: not optimized for retrieval of complete documents

• Transport level: exchange and browser-side processing of multiversion documents are also critical, so the storage and exchange representations need to be reconciled.

Version Management: Approaches

• Time stamping of objects

• Store all Snapshots: fast retrieval, excessive storage

• Edit-Based Schemes store the Deltas. Minimal storage but slow retrieval.

• Deltas are traditionally line-oriented (DIFF), but are defined over semistructured objects in Lorel

• Our Scheme: Usefulness Based Copy Control (UBCC)

- Separate edit scripts from the objects.

- Temporal Clustering of objects using page usefulness.
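
The trade-off between snapshots and deltas can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: versions are lists of lines, the delta format (a list of `(start, end, replacement)` triples) and the helper names `make_delta` and `apply_delta` are hypothetical, and the standard `difflib.SequenceMatcher` stands in for a line-oriented DIFF.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Line-oriented delta: (start, end, replacement lines) for each changed range of `old`."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_delta(old, delta):
    """Rebuild the next version from the previous one and its delta."""
    out, pos = [], 0
    for start, end, lines in delta:
        out.extend(old[pos:start])   # unchanged lines before the edit
        out.extend(lines)            # inserted or replacing lines
        pos = end                    # skip the deleted or replaced lines
    out.extend(old[pos:])
    return out


v1 = ["<ch A>", "<sec D/>", "<sec E/>", "</ch>"]
v2 = ["<ch A>", "<sec J/>", "<sec E/>", "</ch>"]
v3 = ["<ch A>", "<sec J/>", "<sec E/>", "</ch>", "<ch K/>"]

snapshots = [v1, v2, v3]                           # snapshot scheme: every version in full
deltas = [make_delta(v1, v2), make_delta(v2, v3)]  # edit-based scheme: v1 plus deltas


def retrieve(n):
    """Edit-based retrieval of version n (1-based): replay n - 1 deltas on top of v1."""
    version = v1
    for delta in deltas[:n - 1]:
        version = apply_delta(version, delta)
    return version
```

With snapshots, version 3 is a single lookup; with deltas, `retrieve(3)` has to replay every delta since version 1, which is the retrieval cost the bullet list refers to.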

Example: an Evolving XML Document

VERSION 1
<root>
  <ch A>
    <sec D> ... </sec>
    <sec E> ... </sec>
  </ch>
  <ch B>
    <sec F> ... </sec>
    <sec G> ... </sec>
    <sec H> ... </sec>
  </ch>
</root>
Order: 1 2 3 4 5 6 7 8

VERSION 2
<root>
  <ch A>
    <sec J> ... </sec>
    <sec E> ... </sec>
  </ch>
  <ch B>
    <sec F> ... </sec>
    <sec G’> ... </sec>
  </ch>
  <ch K>
    <sec L> ... </sec>
  </ch>
</root>
Order: 1 2 3 4 5 6 7 8 9

Temporal Clustering by Page Usefulness

• Usefulness: the percentage of a page occupied by objects from the current version; the rest is occupied by ‘dead’ objects from previous versions

• We set a minimum usefulness requirement, e.g. 50%

• When the usefulness of a page falls below this minimum, we copy its live objects to a new page
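
The rule can be written down as a short Python sketch. It is a simplification under stated assumptions: all objects have the same size, so usefulness is a ratio of object counts, and the `Page` class and `copy_if_below` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """A page of equal-sized objects (an assumption made for this sketch)."""
    objects: list = field(default_factory=list)   # every object stored in the page
    dead: set = field(default_factory=set)        # objects deleted in later versions

    def live_objects(self):
        return [obj for obj in self.objects if obj not in self.dead]

    def usefulness(self):
        """Share of the stored objects that still belong to the current version."""
        if not self.objects:
            return 1.0
        return len(self.live_objects()) / len(self.objects)


def copy_if_below(page, u_min=0.5):
    """Return a new page with the live objects if usefulness < u_min, else None."""
    if page.usefulness() >= u_min:
        return None
    return Page(objects=page.live_objects())


page = Page(objects=["O5", "O6", "O7", "O8"], dead={"O7", "O8"})
fresh = copy_if_below(page, u_min=0.7)   # usefulness is 0.5, so the live objects are copied
```

The old page is left untouched in this sketch, since it still serves the versions in which its dead objects were alive.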

Maintaining Page Usefulness above 70% by Copying Alive Objects

VERSION 1
  P1: O1 O2 O3 O4
  P2: O5 O6 O7 O8

VERSION 2 (three objects deleted: DEL DEL DEL)
  U(P1) = 75%
  U(P2) = 50% < Umin = 70%
  P3: O5 O6 O9 O10 (O5 and O6 copied from P2), U(P3) = 100%

Usefulness Based Copy Control (UBCC)

VERSION 1
  P1: root, ch A, sec D, sec E
  P2: ch B, sec F, sec G, sec H

VERSION 2 changes: INS(sec J), DEL (sec D), INS(sec G’), DEL (sec G), DEL (sec H), INS(ch K), INS(sec L)

• STEP 1: Determine page usefulness for copying.
  U(P1) = 75%
  U(P2) = 50% < Umin = 70%

• STEP 2: Append new/copied objects into new pages by their logical order.
  P3: sec J, ch B (COPY), sec F (COPY), sec G’; U(P3) = 100%
  P4: ch K, sec L; U(P4) = 100%
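
The two steps can be sketched in Python on the running example. This is a minimal sketch, not the authors' implementation: it assumes equal-sized objects, four objects per page, and that each object of the previous version lives in exactly one page; the names `usefulness` and `ubcc_store_version` are hypothetical.

```python
def usefulness(page, current):
    """Share of the objects stored in `page` that belong to the `current` version."""
    return sum(1 for obj in page if obj in current) / len(page)


def ubcc_store_version(pages, new_version, u_min=0.7, page_size=4):
    """Store `new_version` (objects in logical order) on top of the existing `pages`.

    Returns the usefulness of each existing page and the list of new pages.
    """
    current = set(new_version)

    # STEP 1: determine page usefulness; objects in pages below u_min must be copied.
    scores = [usefulness(page, current) for page in pages]
    kept_in_place = {}
    for page, score in zip(pages, scores):
        for obj in page:
            kept_in_place[obj] = score >= u_min

    # STEP 2: append new and copied objects to new pages in logical order.
    to_write = [obj for obj in new_version if not kept_in_place.get(obj, False)]
    new_pages = [to_write[i:i + page_size] for i in range(0, len(to_write), page_size)]
    return scores, new_pages


p1 = ["root", "ch A", "sec D", "sec E"]
p2 = ["ch B", "sec F", "sec G", "sec H"]
v2 = ["root", "ch A", "sec J", "sec E", "ch B", "sec F", "sec G'", "ch K", "sec L"]

scores, new_pages = ubcc_store_version([p1, p2], v2)
```

On this input the sketch reproduces the figures of the example: P1 stays useful at 75%, P2 drops to 50%, and the live objects of P2 are rewritten next to the new objects in P3 and P4.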

Document Object Order

Pages after VERSION 2 (the number after an object is its position in version 2):
  P1: root1, ch A2, sec D, sec E4
  P2: ch B, sec F, sec G, sec H
  P3: sec J3, ch B5, sec F6, sec G’7
  P4: ch K8, sec L9

• Version 2 objects are not stored in sequence:

  VERSION 2 = (root1, ch A2, sec J3, sec E4, ch B5, sec F6, sec G’7, ch K8, sec L9)

• Hence, we use the edit script to reconstruct the version in document order.
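
Reconstructing a version whose objects are scattered over several pages can be sketched as follows. The script format used here, one `(page, slot)` reference per object in document order, is an assumption made for illustration; the slides do not specify how the edit script is encoded.

```python
# Physical layout after version 2, as in the example.
pages = {
    "P1": ["root", "ch A", "sec D", "sec E"],
    "P2": ["ch B", "sec F", "sec G", "sec H"],
    "P3": ["sec J", "ch B", "sec F", "sec G'"],
    "P4": ["ch K", "sec L"],
}

# Hypothetical script for version 2: where each object lives, in document order.
script_v2 = [("P1", 0), ("P1", 1), ("P3", 0), ("P1", 3),
             ("P3", 1), ("P3", 2), ("P3", 3), ("P4", 0), ("P4", 1)]


def reconstruct(script, pages):
    """Follow the script to emit the objects of one version in document order."""
    return [pages[page][slot] for page, slot in script]


def pages_read(script):
    """Distinct pages that must be fetched to rebuild the version."""
    return sorted({page for page, _ in script})


version_2 = reconstruct(script_v2, pages)
touched = pages_read(script_v2)
```

Because the live objects of P2 were copied into P3, rebuilding version 2 in this sketch never reads P2.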

Beyond Edit-Based Versioning

• The UBCC scheme achieves good storage and retrieval efficiency.

• But it is not suitable at the transport level or for queries on content.

• Thus, we propose a copy-based model which:
  – exploits shared elements
  – needs no edit script
  – yields a simple XML representation for the document history

The XML Version Model (XVM)

• XVM is a list of version nodes

• Each version node is an ordered tree consisting of four types of nodes:
  – element node
  – attribute node
  – text node
  – copy record node

• Minimal extensions to the XPath data model: the copy record node is actually a link.
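
As a rough illustration, the node types could be modeled with Python dataclasses. The class and field names are assumptions made for this sketch, as are the `title` attribute and the dotted address string stored in a copy record; none of them is prescribed by the model.

```python
from dataclasses import dataclass, field


@dataclass
class Text:
    value: str


@dataclass
class Attribute:
    name: str
    value: str


@dataclass
class CopyRecord:
    ref: str  # link to a node of an earlier version (address format assumed)


@dataclass
class Element:
    tag: str
    attributes: list = field(default_factory=list)  # Attribute nodes
    children: list = field(default_factory=list)    # Element, Text or CopyRecord, in order


@dataclass
class Version:
    children: list = field(default_factory=list)    # ordered top-level nodes of this version


# XVM itself is a list of version nodes.
v1 = Version([Element("chapter", [Attribute("title", "Intro")], [Text("...")])])
v2 = Version([CopyRecord("V1.1"),
              Element("chapter", [Attribute("title", "Second Ex")])])
xvm = [v1, v2]
```

Here the second version does not repeat the unchanged chapter; it only holds a link to it.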

Copy-Based XML Version Model (XVM)

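Legend: V = version node, E = element node, T = text node, A = attribute node, C = copy record node

[Figure: two version trees built from these node types; the copy record nodes in the second tree reference nodes by tree address, e.g. Ref: V1.2.1]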

XVM --- Example

V1
  E chapter “Intro”
    E section “Scope”
    E section “Concepts”
  E chapter “Tutorial”
    E section “Context”

Changes:
1. DELETE chapter “Tutorial”
2. INSERT chapter “Second Ex”

V2
  C Ref: V1.1
  E chapter “Second Ex”
    E section “Test Data”

Changes:
1. UPDATE the textual content of chapter “Second Ex”
2. COPY the “Concepts” section and insert it after section “Test Data”.

V3
  C Ref: V2.1 (chapter “Intro” with sections “Scope” and “Concepts”)
  E chapter “Second Ex”
    C Ref: V2.2.1
    C Ref: V2.1.2

XVM Version Retrieval --- Example

[Figure: the version trees V1, V2 and V3 of the XVM example, with each copy record of V3 followed to its target. Ref V2.1 leads to the copy record of V2 (Ref V1.1) and from there to chapter “Intro” in V1; Ref V2.2.1 leads to section “Test Data” in V2; Ref V2.1.2 leads to section “Concepts”.]
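
Version retrieval by following copy records can be sketched in Python. The dictionary-based tree encoding, the helper names `E`, `C`, `lookup`, `resolve` and `retrieve`, and the reading of a tree address as a version name followed by 1-based child positions are all assumptions of this sketch; the tree contents follow the XVM example as read from the figure.

```python
def E(label, *children):
    """Element node with ordered children."""
    return {"label": label, "children": list(children)}


def C(ref):
    """Copy record node holding a tree address."""
    return {"ref": ref}


repository = {
    "V1": [E('chapter "Intro"', E('section "Scope"'), E('section "Concepts"')),
           E('chapter "Tutorial"', E('section "Context"'))],
    "V2": [C("V1.1"), E('chapter "Second Ex"', E('section "Test Data"'))],
    "V3": [C("V2.1"), E('chapter "Second Ex"', C("V2.2.1"), C("V2.1.2"))],
}


def lookup(ref):
    """Return the resolved node at a tree address such as 'V2.1.2'."""
    version, *path = ref.split(".")
    node = resolve(repository[version][int(path[0]) - 1])
    for step in path[1:]:
        node = node["children"][int(step) - 1]
    return node


def resolve(node):
    """Replace copy records by the subtrees they link to, recursively."""
    if "ref" in node:
        return lookup(node["ref"])
    return {"label": node["label"],
            "children": [resolve(child) for child in node["children"]]}


def retrieve(version):
    """Materialize one version as a list of fully resolved top-level nodes."""
    return [resolve(node) for node in repository[version]]


v3 = retrieve("V3")
```

Note that the address V2.1.2 passes through a copy record itself (V2.1 links to V1.1), so resolution has to be recursive.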

XVM Benefits

• Transport level: XVM is represented as an XML document whose DTD is generated automatically from the document DTD

• Storage level: we extended the usefulness-based temporal clustering scheme to XVM

XVM Implementation --- Use XML to Represent XVM

• DTD Transformation:
  – Define three new elements: <Repository>, <Version> and <CopyRecord>.
  – For each element in the original DTD, add a CopyRecord to its content model as an alternative.

• Example:

Original DTD
  <!ELEMENT volumn (chapter)*>
  <!ELEMENT chapter (title,(sec)*)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT sec (#PCDATA)>
  . . .

Version DTD
  <!ELEMENT Repository (Version)+>
  <!ELEMENT Version (volumn)>
  <!ELEMENT CopyRecord EMPTY>
  <!ATTLIST CopyRecord Ref IDREF #REQUIRED>
  <!ELEMENT volumn (chapter)*>
  <!ELEMENT chapter ((title,(sec)*)|CopyRecord)>
  <!ELEMENT title (#PCDATA|CopyRecord)*>
  <!ELEMENT sec (#PCDATA|CopyRecord)*>
  . . .
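
The transformation can be sketched as a small Python function over a table of content models. This is an illustrative sketch, not the authors' generator: the function name `version_dtd` and its dictionary input are hypothetical, and three details are assumptions chosen so that the output is a valid XML 1.0 DTD, namely `EMPTY` for `CopyRecord`, `#REQUIRED` for its `Ref` attribute, and the mixed-content form `(#PCDATA|CopyRecord)*` for text-only elements. As in the example, the root element's content model is left unchanged.

```python
def version_dtd(original, root):
    """Build Version DTD declarations from {element name: content model}."""
    decls = [
        "<!ELEMENT Repository (Version)+>",
        f"<!ELEMENT Version ({root})>",
        "<!ELEMENT CopyRecord EMPTY>",
        "<!ATTLIST CopyRecord Ref IDREF #REQUIRED>",
    ]
    for name, model in original.items():
        if name == root:
            # The example keeps the root element's content model as it is.
            new_model = model
        elif model.strip() == "(#PCDATA)":
            # XML mixed content must be written as (#PCDATA|name)*.
            new_model = "(#PCDATA|CopyRecord)*"
        else:
            # Any other model gets CopyRecord as an alternative.
            new_model = f"({model}|CopyRecord)"
        decls.append(f"<!ELEMENT {name} {new_model}>")
    return decls


original = {
    "volumn": "(chapter)*",
    "chapter": "(title,(sec)*)",
    "title": "(#PCDATA)",
    "sec": "(#PCDATA)",
}

for line in version_dtd(original, root="volumn"):
    print(line)
```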

Performance and Storage Cost

[Figure: Storage. Pages (axis 0 to 12000) against Total Number of Versions (1 to 97) for RCS, Copy-Based 50%, Edit-Based 50% and Snapshot.]

[Figure: Version Retrieval Cost. Pages (axis 0 to 1400) against Total Number of Versions (1 to 97) for RCS, Copy-Based 50%, Edit-Based 50% and Snapshot.]

Conclusion

• UBCC is efficient at the storage level.

• The copy-based scheme is effective as both a storage representation and a transport representation.

• Our current research focuses on efficient evaluation of queries on versions:
  – content queries
  – snapshot queries
  – history queries