Page 1: Node Indexes

Node Indexes

Interval Labeling Schemes
Prefix Labeling Schemes

Konsolaki Konstantina (624), [email protected]
Fafalios Pavlos (623), [email protected]

University of Crete, Department of Computer Science

May 2010

Page 2: Node Indexes

Outline

• Introduction
• Interval Labeling Schemes
• Prefix Labeling Schemes
• Comparison

Page 3: Node Indexes

Node Indexing Schemes

• Hold values that reflect each node’s position within the structure of an XML tree.
• Can answer both simple path and twig path queries.
• Use two types of labeling schemes:
    • Interval labeling schemes
    • Prefix labeling schemes

Page 4: Node Indexes

Labeling Schemes

• The purpose of a labeling scheme is to provide a unique label for each node in the XML tree.
• A good labeling scheme should have the following characteristics:
    • The relationship between two nodes should be determined uniquely and quickly, simply by examining their labels.
    • Updating an XML file should not require re-labeling nodes in the XML tree.
    • Labels should be as small as possible so that they fit in main memory.
    • The scheme should support all kinds of XPath functions.
    • Labels should follow the order of the XML document.

Page 5: Node Indexes

Node Indexes vs. Graph Indexes

• Graph indexes treat each path as a whole during query evaluation.
• Node indexes deal with each node in the path separately.
• In graph indexes, the number of joins during query processing is reduced, and therefore query performance improves.
• In node indexes, each step of query processing performs a structural join between two nodes, starting from one end of the path and finishing at the other.

Page 6: Node Indexes

Node Indexes vs. Sequence Indexes

• Sequence indexes transform XML documents and queries into encoded sequences.
• Node indexes label each node of the XML document.
• In sequence indexes, answering a query requires matching the encoded sequence of the query against that of the data.
    • Simple path and twig queries are evaluated efficiently, without any extra join operations.
• In node indexes, answering a query requires structural joins among the labeled nodes.
    • Query evaluation is less efficient because of the multiple structural joins.

Page 7: Node Indexes

XML Document for our examples

XML Document:

<Bib>
  <book>
    <author>Tim</author>
  </book>
  <paper></paper>
  <paper>
    <author>Sarah</author>
  </paper>
</Bib>

XML Tree:

Bib
  book
    author
      Tim
  paper
  paper
    author
      Sarah


Page 9: Node Indexes

Interval Labeling Schemes

Page 10: Node Indexes

Outline

• Interval Labeling Scheme
    • Beg-End Labeling Scheme
    • Order-Size Labeling Scheme
    • Prime Number Labeling Scheme
    • Nested Tree Structure
• Label Size
• Experimental results
• Conclusion

Page 11: Node Indexes

Interval Labeling Scheme

• Interval-based labeling schemes (also known as containment-based or region-encoded labeling schemes) exploit the properties of tree traversal to maintain document order and to determine various structural relationships between nodes.
• Tree traversal is the process of visiting each node in a tree data structure. Traversals are characterized by the order in which the nodes are visited.


Page 13: Node Indexes

Beg-End Labeling Scheme

• A pair of numbers is assigned to each node in an XML document according to its sequential traversal order.
    • Starting from the root element, each node is given a “Beg” number.
    • When the end of an attribute, an attribute value, or an element’s closing tag is reached, the “End” number is assigned. The “End” number is the next sequential number.
    • If the node is a leaf value, its “Beg” number equals its “End” number.

Page 14: Node Indexes

Example

Bib (1,14)
  book (2,6)
    author (3,5)
      Tim (4,4)
  paper (7,8)
  paper (9,13)
    author (10,12)
      Sarah (11,11)
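The numbering in this example can be reproduced with one sequential counter and a depth-first traversal. The sketch below is only an illustration under stated assumptions: the tuple-based tree encoding and the function name `beg_end_labels` are hypothetical and do not come from the slides.

```python
from itertools import count

def beg_end_labels(tree):
    """Return (name, beg, end) triples in document order.

    Assumed encoding: an element is a (name, children) tuple,
    a text value is a plain string.
    """
    counter = count(1)
    labels = []

    def visit(node):
        if isinstance(node, str):      # leaf value: Beg equals End
            n = next(counter)
            labels.append((node, n, n))
            return
        name, children = node
        slot = len(labels)
        labels.append(None)            # reserve the slot to keep document order
        beg = next(counter)            # "Beg" is taken when the element opens
        for child in children:
            visit(child)
        end = next(counter)            # "End" is taken at the closing tag
        labels[slot] = (name, beg, end)

    visit(tree)
    return labels

bib = ("Bib", [
    ("book", [("author", ["Tim"])]),
    ("paper", []),
    ("paper", [("author", ["Sarah"])]),
])

for name, beg, end in beg_end_labels(bib):
    print(f"{name} ({beg},{end})")
```

Because every label comes from the same counter, inserting a node shifts every number assigned after it, which is why updates are expensive in this scheme.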

Page 15: Node Indexes

Properties [1]

• A “Level” is added to the (Beg,End) label to form a triplet identification label (Beg,End,Level) for each node in the tree, where “Level” is the depth of the element in the tree.

Ancestor-descendant relationship:
In a given data tree, node “x” is an ancestor of node “y” iff x.Beg < y.Beg < x.End (preorder property).

[Figure: the example tree with (Beg,End) labels: Bib (1,14), book (2,6), author (3,5), Tim (4,4), paper (7,8), paper (9,13), author (10,12), Sarah (11,11)]

Page 16: Node Indexes

Properties [2]

Parent-child relationship:
In a given data tree, node “x” is the parent of node “y” iff x.Beg < y.Beg < x.End and y.Level = x.Level + 1.

There is no way to locate the siblings of a given node using only its own index numbers.

[Figure: the example tree with (Beg,End) labels: Bib (1,14), book (2,6), author (3,5), Tim (4,4), paper (7,8), paper (9,13), author (10,12), Sarah (11,11)]
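Both tests reduce to integer comparisons on the (Beg,End,Level) triplet. A minimal sketch, assuming the root is at level 1 (the slides do not fix the starting level for this scheme) and using a hypothetical `Label` type:

```python
from typing import NamedTuple

class Label(NamedTuple):
    beg: int
    end: int
    level: int   # assumption: the root has level 1

def is_ancestor(x: Label, y: Label) -> bool:
    # x contains y when y's Beg falls strictly inside x's interval
    return x.beg < y.beg < x.end

def is_parent(x: Label, y: Label) -> bool:
    # a parent is an ancestor exactly one level above
    return is_ancestor(x, y) and y.level == x.level + 1

bib = Label(1, 14, 1)
book = Label(2, 6, 2)
tim = Label(4, 4, 4)
sarah = Label(11, 11, 4)

print(is_ancestor(bib, sarah), is_ancestor(book, sarah), is_parent(bib, book))
```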

Page 17: Node Indexes

Are updates possible?

Updating the Beg-End labeling (numbering) scheme is costly.

• When a new node is inserted into the tree, all the nodes in the tree, except those in the subtrees of the inserted node’s left siblings, have to be re-labeled.
• On the other hand, when a node is deleted, no re-labeling is needed.

Page 18: Node Indexes

Update example

Inserting a new empty paper node between the two existing paper nodes changes the labels as follows:

Bib (1,14) → (1,16)
  book (2,6)
    author (3,5)
      Tim (4,4)
  paper (7,8)
  paper (9,10)   [inserted]
  paper (9,13) → (11,15)
    author (10,12) → (12,14)
      Sarah (11,11) → (13,13)


Page 20: Node Indexes

<Order-Size> Labeling Scheme

• This labeling scheme uses an extended preorder. Each node is associated with a pair of numbers <order, size> as follows:
    • For a tree node y and its parent x:
        • order(x) < order(y)
        • order(y) + size(y) <= order(x) + size(x)
    • For two sibling nodes x and y, if x is the predecessor of y in preorder traversal:
        • order(x) + size(x) < order(y)

Page 21: Node Indexes

Example

Bib (1,100)
  book (10,30)
    author (11,20)
      Tim (17,10)
  paper (41,10)
  paper (60,30)
    author (62,20)
      Sarah (65,10)

Page 22: Node Indexes

Properties

Ancestor-descendant relationship:
For two given nodes x and y of a tree T, x is an ancestor of y if and only if:

• order(x) < order(y) <= order(x) + size(x)

There is no way to locate the siblings of a given node using only its own index numbers.

[Figure: the example tree with <order, size> labels: Bib (1,100), book (10,30), author (11,20), Tim (17,10), paper (41,10), paper (60,30), author (62,20), Sarah (65,10)]
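The containment test can be checked directly against the example labels. In this sketch the dictionary keys (`paper1`, `paper2`, and so on) are hypothetical names chosen only to tell the two paper nodes apart:

```python
def is_ancestor(x, y):
    """x and y are (order, size) pairs."""
    order_x, size_x = x
    order_y, _ = y
    return order_x < order_y <= order_x + size_x

labels = {
    "Bib": (1, 100),
    "book": (10, 30),
    "Tim": (17, 10),
    "paper1": (41, 10),
    "paper2": (60, 30),
    "Sarah": (65, 10),
}

print(is_ancestor(labels["Bib"], labels["Sarah"]))    # Bib contains Sarah
print(is_ancestor(labels["book"], labels["paper1"]))  # book does not contain paper
```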

Page 23: Node Indexes

Are updates possible?

• For a tree node x, size(x) >= Σ size(y) over all direct children y of x. size(x) can be an arbitrary integer larger than the total number of the current descendants of x.
• Thus the <Order, Size> labeling scheme is more flexible than the Beg-End scheme and can handle dynamic updates of XML data more efficiently: additional space is reserved for future insertions.

Disadvantage:

It is hard to predict the actual space requirements. After several insertions, the inserted data may exceed the reserved space, and in the worst case the whole data tree has to be re-labeled.

Page 24: Node Indexes

Insertion without Re-labeling

Bib (1,100)
  book (10,30)
    author (11,20)
      Tim (17,10)
  paper (41,10)
  paper (53,5)   [inserted]
  paper (60,30)
    author (62,20)
      Sarah (65,10)

No re-labeling is needed, since order(x) + size(x) < order(y) still holds for every pair of consecutive siblings x and y, and size(x) >= Σ size(y) still holds for the children y of each node x.
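Whether an insertion avoids re-labeling comes down to whether the new <order, size> pair fits in the gap between its two neighbouring siblings. A minimal sketch of that check follows; the function name `fits_between` is hypothetical. When the new node sits between two existing siblings, the parent's bound order(y) + size(y) <= order(x) + size(x) follows automatically from the right sibling's bound, so it is not checked separately here.

```python
def fits_between(left, right, new):
    """True if `new` can be placed between consecutive siblings
    `left` and `right`; all arguments are (order, size) pairs."""
    left_order, left_size = left
    right_order, _ = right
    new_order, new_size = new
    return (left_order + left_size < new_order
            and new_order + new_size < right_order)

# The slide's example: (53,5) fits between (41,10) and (60,30).
print(fits_between((41, 10), (60, 30), (53, 5)))
# A subtree needing size 30 does not fit in the same gap.
print(fits_between((41, 10), (60, 30), (53, 30)))
```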

Page 25: Node Indexes

Insertions with Re-labeling

[Figure: inserting a new paper subtree labeled (58,30) forces re-labeling. Labels before the insertion: (1,100), (10,35), (11,20), (17,10), (46,10), (54,35), (62,20), (65,10). Labels assigned by re-labeling: (1,200), (90,35), (95,20), (100,10).]

Re-labeling is needed because the reserved space cannot hold the inserted subtree while keeping both conditions:

• order(x) + size(x) < order(y)
• size(x) >= Σ size(y)


Page 27: Node Indexes

Prime Number Labeling Scheme

Divisibility property: if an integer X has a prime factor* Z that is not a prime factor of another integer Y, then Y is not divisible by X.

EXAMPLE: X = 6, Z = 3 (prime number), Y = 10

• In XML trees, if a node A has a descendant C that is not a descendant of another node B, then A cannot be a descendant of node B.
• Therefore, if the leaf nodes of an XML tree are labeled with prime numbers and each non-leaf node with the product of the labels of its child nodes, the ancestor-descendant relationship can be determined easily by using the divisibility property of prime numbers.

*Prime factor: a prime number that divides the integer exactly.

[Figure: nodes A, B and C illustrating the property]

Page 28: Node Indexes

Bottom-Up

Starting from the leaves, a prime number is assigned to each leaf node. At each subsequent level, a parent’s label is the product of its children’s labels.

Bib 1155 (15*77)
  book 15 (3*5)
    author 3
    author 5
  paper 77 (7*11)
    author 7
    author 11

Page 29: Node Indexes

Properties of Bottom-Up

Ancestor-descendant relationship:
For any nodes x and y in an XML tree, x is an ancestor of y if and only if label(x) mod label(y) = 0.

[Figure: the bottom-up example tree: Bib 1155, book 15, paper 77, author leaves 3, 5, 7 and 11]

There is no way to locate the siblings of a given node using only its own index numbers.
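The bottom-up rule and its divisibility test can be sketched in a few lines. The nested-list encoding of the tree and the function names are assumptions made for illustration. The extra `label_x != label_y` guard is also an addition: without it a node would count as its own ancestor, and a parent with a single child would receive the same label as that child.

```python
from math import prod

def bottom_up_label(node):
    """Assumed encoding: a leaf is its prime number,
    an inner node is a list of child nodes."""
    if isinstance(node, int):
        return node
    return prod(bottom_up_label(child) for child in node)

def is_ancestor(label_x, label_y):
    # x is an ancestor of y when y's label divides x's label
    return label_x != label_y and label_x % label_y == 0

book = [3, 5]
paper = [7, 11]
bib = [book, paper]

print(bottom_up_label(bib), bottom_up_label(book), bottom_up_label(paper))
print(is_ancestor(bottom_up_label(bib), 7), is_ancestor(bottom_up_label(book), 7))
```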

Page 30: Node Indexes

Disadvantages of Bottom-Up

• Labels grow quickly: nodes near the top of the tree can be assigned relatively large numbers.
• Special handling is required for nodes that have only one child.

Page 31: Node Indexes

Top-Down

• Each node is given a unique prime number (the root gets 1), and the label of each node is the product of its parent’s label and this number. Thus each label is a product of two factors: the first, inherited from the label of the parent, is called the “parent-label”; the second, assigned to the node by the labeling scheme, is called the “self-label”.

In the tree below, each label is shown as label (parent-label * self-label):

Bib 1 (1*1)
  book 2 (1*2)
    author 14 (2*7)
      Tim 182 (14*13)
  paper 3 (1*3)
  paper 5 (1*5)
    author 55 (5*11)
      Sarah 935 (55*17)

Page 32: Node Indexes

Properties of Top-Down

Ancestor-descendant relationship:
For any nodes x and y in an XML tree, x is an ancestor of y if and only if label(y) mod label(x) = 0.

[Figure: the top-down example tree: Bib 1, book 2, author 14, Tim 182, paper 3, paper 5, author 55, Sarah 935]

There is no way to locate the siblings of a given node using only its own index numbers.

Page 33: Node Indexes

Are updates possible?

The top-down prime number labeling scheme is good for dynamic updates. When a new node is inserted, it is simply assigned a prime number that has not been used before as its self-label. No re-labeling is required.

[Figure: the top-down example tree with a new paper node inserted under Bib and labeled 19 (1*19); all existing labels are unchanged]
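The top-down labels, the divisibility test and the insertion of the new paper node can be replayed with plain integer arithmetic. This is a sketch with hypothetical helper names. The `label_x != label_y` guard is an addition so that a node is not reported as its own ancestor.

```python
def child_label(parent_label, self_label):
    # label = parent-label * self-label
    return parent_label * self_label

def is_ancestor(label_x, label_y):
    # x is an ancestor of y when x's label divides y's label
    return label_x != label_y and label_y % label_x == 0

bib = 1                               # the root carries label 1
book = child_label(bib, 2)
author1 = child_label(book, 7)
tim = child_label(author1, 13)
paper2 = child_label(bib, 5)
author2 = child_label(paper2, 11)
sarah = child_label(author2, 17)

# Insertion: pick an unused prime (19) as self-label; nothing else changes.
new_paper = child_label(bib, 19)

print(tim, sarah, new_paper)
print(is_ancestor(paper2, sarah), is_ancestor(book, sarah))
```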

Page 34: Node Indexes

Top-Down Disadvantage

• In the prime number labeling scheme each prime number can be used only once.
• Hence, the self-label of a node that is inserted later is always larger than the self-labels of existing nodes.
    • This implies that label sizes increase once the smaller prime numbers are used up.
    • Thus, after a few insertions the space needed for a node label becomes huge.


Page 36: Node Indexes

Nested Tree Structure

Definition: a Nested Tree is a subtree that has an interval-based number as a node of the containing tree and its own interval-based numbering as a tree.

Bib (1,50)
  book (7,20)
    author (11,15)
      Tim (13,13)
  paper (23,27)
  paper (29;1,29;12)   [Nested Tree, labeled 29]
    author (29;5,29;9)
      Sarah (29;7,29;7)
  paper (30,35)

Page 37: Node Indexes

K-Nested Tree

• The 1-Nested Tree is the Nested Tree of the XML data tree that is not included in any other Nested Tree.
• A k-Nested Tree Tk is a Nested Tree that is included in a (k-1)-Nested Tree Tk-1, such that no other Nested Tree both includes Tk and is included in Tk-1.

[Figure: the whole tree rooted at Bib (1,50) is the 1-Nested Tree; the subtree paper (29;1,29;12), author (29;5,29;9), Sarah (29;7,29;7) is the 2-Nested Tree]

Page 38: Node Indexes

StartList-EndList of a Node

The startList of a tree node N is the list s1, ..., sn; sn+1, where si is the label of the i-Nested Tree containing N (i = 1, 2, ..., n) and sn+1 is the start position of N in the n-Nested Tree. The endList of N is defined in the same way, except that the start position is replaced by the end position of N.

Example, for the root of the 2-Nested Tree, paper (29;1,29;12):

StartList = [(1,50), 29;1]
EndList = [(1,50), 29;12]

[Figure: Bib (1,50), book (7,20), author (11,15), Tim (13,13), paper (23,27), paper (30,35), and the Nested Tree labeled 29: paper (29;1,29;12), author (29;5,29;9), Sarah (29;7,29;7)]

Page 39: Node Indexes

Nested Tree’s Label

The label of each node can be represented as the 4-tuple (DocID, sList, eList, Level), where:

• DocID is the identifier of the document.
• sList and eList are the startList and endList of the node, respectively.
• Level is the depth of the node in the data tree.

For example, assuming DocID = 1, the label of the red node (the paper at the root of the 2-Nested Tree) is:

(1, [(1,50),29;1], [(1,50),29;12], 2)

[Figure: Bib (1,50), book (7,20), author (11,15), Tim (13,13), paper (23,27), paper (30,35), and the Nested Tree labeled 29: paper (29;1,29;12), author (29;5,29;9), Sarah (29;7,29;7)]

Page 40: Node Indexes

Ancestor-Descendant Relationship

Node X is an ancestor of node Y if:
• Beg(X) < NestedTreeLabel(Y) < End(X)

Example, assuming DocID = 1:

• The red node (Bib) has the label (1, 1, 50, 1).
• The blue node (the author in the 2-Nested Tree) has the label (1, [(1,50),29;5], [(1,50),29;9], 3).

The red node is an ancestor of the blue node because:
• They have the same DocID
• 1 < 29 < 50

[Figure: Bib (1,50), book (7,20), author (11,15), Tim (13,13), paper (23,27), paper (30,35), and the Nested Tree labeled 29: paper (29;1,29;12), author (29;5,29;9), Sarah (29;7,29;7)]

Page 41: Node Indexes

Parent-Child Relationship

Node X is the parent of node Y if:
• Beg(X) < NestedTreeLabel(Y) < End(X)
• Level(Y) = Level(X) + 1

Example, assuming DocID = 1:

• The red node (Bib) has the label (1, 1, 50, 1).
• The blue node (the paper at the root of the 2-Nested Tree) has the label (1, [(1,50),29;1], [(1,50),29;12], 2).

The red node is the parent of the blue node because:
• They have the same DocID
• 1 < 29 < 50
• Level(blue) = Level(red) + 1

[Figure: Bib (1,50), book (7,20), author (11,15), Tim (13,13), paper (23,27), paper (30,35), and the Nested Tree labeled 29: paper (29;1,29;12), author (29;5,29;9), Sarah (29;7,29;7)]

Page 42: Node Indexes

Insertion of a Node

The space is the range of integers that can be used as new labels for the inserted data, and the size of the space is the number of integers in that range. The size of the space is called SpaceSize, and the size of the inserted data is called InsertSize.

For example, the SpaceSize between the red node and the blue node, paper (23,27) and paper (30,35), is 2 (the integers 28 and 29).

[Figure: Bib (1,50), book (7,20), author (11,15), Tim (13,13), paper (23,27), paper (30,35)]

Page 43: Node Indexes

Insertion of a Node

The insertion of a node falls into one of three cases:

• 1st case: SpaceSize > InsertSize
    • Use the integers in the range of the space as labels for the inserted subtree.
• 2nd case: 0 < SpaceSize <= InsertSize
    • Treat the inserted subtree as a new Nested Tree and label the Nested Tree with an integer in the range of the space.
• 3rd case: SpaceSize = 0
    • Combine the inserted subtree with the subtree rooted at the parent of the inserted subtree, treat the combined subtree as one Nested Tree, and label the Nested Tree with an integer in the space.
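The case analysis can be written as a small decision function. The sketch below assumes that SpaceSize is the number of unused integers strictly between the End of the left neighbour and the Beg of the right neighbour. Both function names are hypothetical.

```python
def space_size(left_end, right_beg):
    # number of free integers strictly between the two neighbours
    return right_beg - left_end - 1

def insertion_case(space, insert_size):
    """Return 1, 2 or 3 following the three cases above."""
    if space > insert_size:
        return 1      # label the subtree with integers from the space
    if space > 0:
        return 2      # wrap the subtree in a new Nested Tree
    return 3          # merge with the parent's subtree into one Nested Tree

# Gap between paper (23,27) and paper (30,35): two free integers, case 2.
print(insertion_case(space_size(27, 30), 5))
```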

Page 44: Node Indexes

Insertion of a Node: 1st Case

The first case needs no new method for processing the insertion, because the SpaceSize is large enough to label the nodes of the newly inserted tree.

SpaceSize = 7, InsertedData = 5

Bib (1,50)
  book (7,20)
    author (11,15)
      Tim (13,13)
  paper (23,27)
  paper (28,32)   [inserted tree]
    author (29,31)
      Sarah (30,30)
  paper (35,40)

Page 45: Node Indexes

Insertion of a Node: 2nd Case

In the second case the inserted subtree is larger than the space. But if the inserted subtree is treated as one Nested Tree, only one integer is needed to label the new Nested Tree. Accordingly, if the size of the space is one or more, the nodes of the original data tree need not be re-labeled.

SpaceSize = 2, InsertedData = 5

Bib (1,50)
  book (7,20)
    author (11,15)
      Tim (13,13)
  paper (23,27)
  paper (29;1,29;5)   [inserted tree, Nested Tree labeled 29]
    author (29;2,29;4)
      Sarah (29;3,29;3)
  paper (30,35)

Page 46: Node Indexes

Insertion of a Node: 3rd Case

In the third case, the scope of the new Nested Tree is extended so that it includes the subtree rooted at the parent of the inserted subtree. In this case, some nodes of the original data tree have to be re-labeled.

SpaceSize = 0, InsertedData = 5

[Figure: there is no free integer between paper (23,27) and paper (28,35), so the subtree rooted at their parent (5,50) becomes a Nested Tree labeled 5. New labels: parent (5;1,5;16), book (5;2,5;6), author (5;3,5;5), Tim (5;4,5;4), paper (5;7,5;8), inserted paper (5;9,5;13), inserted author (5;10,5;12), inserted Sarah (5;11,5;11), paper (5;14,5;15).]

Page 47: Node Indexes

Deletion of a Node

In an interval labeling scheme, deletion requires no processing. However, the more subtree insertions occur, the more Nested Trees are created, and the more Nested Trees are created, the longer the startList and endList of the nodes become.

Deletion falls into two cases:
• Release the last Nested Tree in which the deleted subtree is included.
• Release the following-sibling or preceding-sibling Nested Trees of the deleted subtree.

Page 48: Node Indexes

Deletion of a Node: 1st Case

• PositionSize is the size of the space in which the Nested Tree is included.
• RemainSize is the size of the Nested Tree after the deletion.

PositionSize = 3, RemainSize = 2

[Figure: Bib (1,50), book (7,20), author (11,15), Tim (13,13), paper (23,27), paper (31,35), and the Nested Tree paper (29;1,29;12), author (29;5,29;9), Sarah (29;7,29;7). After the deletion the Nested Tree is released and its remaining node is re-labeled (28,29).]

Page 49: Node Indexes

Deletion of a Node: 2nd Case

PositionSize = 26, RemainSize = 5

[Figure: the example tree with labels (1,50), (7,20), (11,15), (13,13), (23,27), (30,35) and the Nested Tree (29;1,29;12), (29;5,29;9), (29;7,29;7), together with the labels (14,18), (15,17), (16,16) assigned when the Nested Tree is released]


Page 51: Node Indexes

Label Size

Labeling Scheme    Label Size
Beg-End            O(log N)
Order-Size         O(log N)
Prime Numbers      D log(θN)
Nested Tree        O(log N)

where:
• N is the number of nodes of the XML tree
• D is the maximal depth
• θN is the maximal prime number that has been used to label the nodes


Page 53: Node Indexes

Experimental Environment

The experiments were carried out on an Intel Pentium at 1.7 GHz with 1 GB of memory, running Windows XP. All procedures were implemented in Java. All experiments were repeated 10 times independently.

Page 54: Node Indexes

Experimental Data

• Three data sets are used:
    • The XMark data set contains information about auctions.
    • The Shakespeare data set represents Shakespeare’s plays in XML format.
    • The Nasa data set contains astronomical data.

          Shakespeare    XMark        Nasa
Size      7.7 MB         115.7 MB     25.2 MB
Nodes     179,619        1,666,315    476,646
Depth     7              12           8

Page 55: Node Indexes

Insertion Processing

• Measure the processing time of inserting 382 nodes as the size of the original data increases.

[Figure: insertion time (ms, logarithmic scale from 10 to 100000) versus size of original data (MB, 1 to 50) for Beg-End, Prime and Nested Tree]

Page 56: Node Indexes

Conclusions from the diagram

• In the Beg-End labeling scheme, re-labeling nodes in the original data tree is inevitable when new data is inserted, and the number of nodes to be re-labeled increases with the size of the original data.
• In the Prime approach, the label of a node is the product of its self-label and the label of its parent node, so insertion takes longer than in the Nested approach.
• In the Nested approach, insertion is processed by a simple integer assignment to each node, so its performance is the best.


Page 58: Node Indexes

Conclusion

• The Beg-End labeling scheme is unsuitable for updates, because re-labeling nodes is inevitable when a new node is inserted.
• In the Order-Size labeling scheme, it is hard to predict the space requirements, and thus in most cases re-labeling is needed.
• In the Prime Number labeling scheme, after a few insertions the space needed for a node label becomes huge.
• The Nested Tree Structure handles updates efficiently.


Page 60: Node Indexes

Prefix Labeling Schemes

Page 61: Node Indexes

Outline

• Introduction
• Prefix Labeling Schemes
    • Dewey
    • ORDPATH
    • LSDX
    • Persistent
• Evaluation

Page 62: Node Indexes

Prefix Labeling Schemes

• In a prefix labeling scheme, the label of a node in the XML tree often consists of:
    • A prefix, which often represents the labels of all the ancestors of the node.
    • A delimiter, which in most cases is the full stop “.”.
    • A positional identifier, which indicates the position of the node relative to its siblings.

Page 63: Node Indexes

Prefix vs. Interval Labeling Schemes

• Prefix labeling schemes:
    • Handle updates more easily and efficiently than interval labeling schemes
    • Support the sibling relationship
• However:
    • Extra space is required to store paths
    • Storage size increases quickly as the depth and the breadth of the tree increase
    • Inferring the ancestor/descendant relationship is somewhat more costly


Page 65: Node Indexes

Dewey - Structure (Tatarinov et al., 2002)

• Each node is assigned a label that represents the path from the document’s root to the node.
• Each component of the label represents the local order of an ancestor node.
• Nodes with the same number of delimiters (“.”) in their labels are at the same level.

Bib (0)
  book (0.0)
    author (0.0.0)
      Tim (0.0.0.0)
  paper (0.1)
  paper (0.2)
    author (0.2.0)
      Sarah (0.2.0.0)

Page 66: Node Indexes

Dewey – Supported Queries (1/3)

• Ancestors / Descendants
    • Node “X” is an ancestor of node “Y” if the label of node “X” is a proper prefix of the label of node “Y”. (A plain substring test is not enough: “0.0” occurs inside “0.2.0.0”, yet book is not an ancestor of Sarah.)

[Figure: the example tree with Dewey labels: Bib (0), book (0.0), author (0.0.0), Tim (0.0.0.0), paper (0.1), paper (0.2), author (0.2.0), Sarah (0.2.0.0)]

Page 67: Node Indexes

Dewey – Supported Queries (2/3)

• Parent / Child
    • Node “X” is the parent of node “Y” if:
        - The label of node “X” is a prefix of the label of node “Y”, and
        - frags(X) = frags(Y) – 1, where frags(N) is the number of delimiters in the label of node N.

[Figure: the example tree with Dewey labels: Bib (0), book (0.0), author (0.0.0), Tim (0.0.0.0), paper (0.1), paper (0.2), author (0.2.0), Sarah (0.2.0.0)]

Page 68: Node Indexes

Dewey – Supported Queries (3/3)

• Siblings
    • Nodes “X” and “Y” are siblings if:
        - They have the same number of delimiters in their labels, and
        - X.prefix = Y.prefix, where prefix is the label of the node without its positional identifier.

[Figure: the example tree with Dewey labels: Bib (0), book (0.0), author (0.0.0), Tim (0.0.0.0), paper (0.1), paper (0.2), author (0.2.0), Sarah (0.2.0.0)]
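The three Dewey tests can be sketched by comparing label components rather than raw strings. Comparing components avoids false matches such as “0.1” against “0.10”. The function names are hypothetical.

```python
def parse(label):
    """Split a Dewey label such as "0.2.0" into integer components."""
    return tuple(int(part) for part in label.split("."))

def is_ancestor(x, y):
    x, y = parse(x), parse(y)
    # x must be a proper prefix of y, component by component
    return len(x) < len(y) and y[:len(x)] == x

def is_parent(x, y):
    return is_ancestor(x, y) and len(parse(y)) == len(parse(x)) + 1

def are_siblings(x, y):
    x, y = parse(x), parse(y)
    # same depth and same prefix, differing only in the positional identifier
    return x != y and len(x) == len(y) and x[:-1] == y[:-1]

print(is_ancestor("0", "0.2.0.0"), is_ancestor("0.0", "0.2.0.0"))
print(is_parent("0.2", "0.2.0"), are_siblings("0.1", "0.2"))
```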

Page 69: Node Indexes

Dewey – Updates

• Insertion of a new node
    • The labels of the following siblings, and of all nodes in the subtrees rooted at them, need to be updated.
    • O(n) nodes need re-labeling, where n is the number of nodes in the XML file.

Example: inserting a new paper node between the two existing paper nodes:

Bib (0)
  book (0.0)
    author (0.0.0)
      Tim (0.0.0.0)
  paper (0.1)
  paper (0.2)   [inserted]
  paper (0.2) → (0.3)
    author (0.2.0) → (0.3.0)
      Sarah (0.2.0.0) → (0.3.0.0)
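The re-labeling triggered by a Dewey insertion can be sketched as a shift of one label component. The representation (labels as integer tuples) and the function name `insert_child` are assumptions made for illustration:

```python
def insert_child(labels, parent, position):
    """Insert a new child of `parent` at `position`.

    labels: existing Dewey labels as integer tuples.
    Returns (new_label, relabeled), where relabeled maps old to new labels.
    """
    depth = len(parent)
    relabeled = {}
    for lab in labels:
        in_subtree = len(lab) > depth and lab[:depth] == parent
        if in_subtree and lab[depth] >= position:
            # shift the following sibling, and everything below it, by one
            relabeled[lab] = lab[:depth] + (lab[depth] + 1,) + lab[depth + 1:]
    return parent + (position,), relabeled

labels = [(0,), (0, 0), (0, 0, 0), (0, 0, 0, 0),
          (0, 1), (0, 2), (0, 2, 0), (0, 2, 0, 0)]
new_label, changed = insert_child(labels, (0,), 2)
print(new_label, changed)
```

In the worst case (insertion as the first child of the root) every other node is re-labeled, which matches the O(n) bound stated above.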

Page 70: Node Indexes

Dewey - Conclusion

• Not efficient for dynamic XML files with many updates
    • Many nodes need to be re-labeled
• As the depth of the tree increases:
    • The label size of a node increases rapidly
    • Storage size increases rapidly
    • It becomes more costly to evaluate the supported relationships between any two nodes (the string prefix matching becomes longer)
• Overflow problem
    • The fixed number of bits originally assigned to store the size of the label is not enough.


Page 72: Node Indexes

ORDPATHs - Structure (O’Neil et al., 2004)

• Allows updates without re-labeling other nodes
• Assigns only positive, odd integers during the initial labeling
• Even and negative numbers are reserved for later insertions

Bib (1)
  book (1.1)
    author (1.1.1)
      Tim (1.1.1.1)
  paper (1.3)
  paper (1.5)
    author (1.5.1)
      Sarah (1.5.1.1)

Page 73: Node Indexes

ORDPATHs – Supported Queries

• Computes the ancestor / descendant, parent / child and sibling relationships in the same way as Dewey

[Figure: the example tree with ORDPATH labels: Bib (1), book (1.1), author (1.1.1), Tim (1.1.1.1), paper (1.3), paper (1.5), author (1.5.1), Sarah (1.5.1.1)]

Page 74: Node Indexes

ORDPATHs – Updates (1/5)

• Case 1: New node to the right of all existing child nodes
  • Take the label of the immediately preceding sibling and add 2 to its positional identifier

[Figure: example XML tree with ORDPATH labels. Bib (1) has children book (1.1), paper (1.3) and paper (1.5); book (1.1) → author (1.1.1) → Tim (1.1.1.1); paper (1.5) → author (1.5.1) → Sarah (1.5.1.1). A new book added as the last child of Bib gets the label (1.7).]

Page 75: Node Indexes

ORDPATHs – Updates (2/5)

• Case 2: New node to the left of all existing child nodes
  • Take the label of the immediately following sibling and add -2 to its positional identifier

[Figure: example XML tree with ORDPATH labels. Bib (1) has children book (1.1), paper (1.3) and paper (1.5); book (1.1) → author (1.1.1) → Tim (1.1.1.1); paper (1.5) → author (1.5.1) → Sarah (1.5.1.1). A new book inserted in front of author (1.1.1) gets the label (1.1.-1).]

Page 76: Node Indexes

ORDPATHs – Updates (3/5)

• Case 3: New node between two consecutive nodes
  • Assign to the new node the even number that sits between the two odd positional identifiers of its neighboring siblings, and then append a new component consisting of an odd number

[Figure: example XML tree with ORDPATH labels. Bib (1) has children book (1.1), paper (1.3) and paper (1.5); book (1.1) → author (1.1.1) → Tim (1.1.1.1); paper (1.5) → author (1.5.1) → Sarah (1.5.1.1). A new book and a new paper inserted between book (1.1) and paper (1.3) get the labels (1.2.1) and (1.2.3).]
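The three insertion cases can be sketched in Python as follows. The function name `ordpath_insert` and the use of integer tuples for labels are illustrative assumptions, and Case 3 is only handled for the simple situation on the slide, where the two neighbors are original siblings whose odd identifiers differ by 2.

```python
def ordpath_insert(left, right):
    """Return a label for a node inserted between siblings left and right.

    left / right are tuples of ints such as (1, 3), or None when the new
    node becomes the first / last child.
    """
    if left is None and right is None:
        raise ValueError("at least one sibling label is required")
    if right is None:
        # Case 1: rightmost child, add 2 to the previous sibling.
        return left[:-1] + (left[-1] + 2,)
    if left is None:
        # Case 2: leftmost child, add -2 to the next sibling.
        return right[:-1] + (right[-1] - 2,)
    # Case 3 (simple form only): use the even number in between,
    # then append a new odd component.
    same_parent = len(left) == len(right) and left[:-1] == right[:-1]
    if same_parent and right[-1] - left[-1] == 2:
        return left[:-1] + (left[-1] + 1, 1)
    raise NotImplementedError("only the basic Case 3 is sketched here")


last_child = ordpath_insert((1, 5), None)        # (1, 7)
first_child = ordpath_insert(None, (1, 1, 1))    # (1, 1, -1)
in_between = ordpath_insert((1, 1), (1, 3))      # (1, 2, 1)
```

No existing label is modified in any of the three cases, which is the property that distinguishes ORDPATH from Dewey.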

Page 77: Node Indexes

ORDPATHs – Updates (4/5)

• How do we find the parent now?
• Node “X” is the parent of node “Y” if:
  - The label of node “X” is a prefix of the label of node “Y”, and
  - frags(X) = frags(Y) – evenNum(Y) – 1, where frags(X) is the number of delimiters in the label of node “X”, frags(Y) is the number of delimiters in the label of node “Y” and evenNum(Y) is the number of even components in the label of node “Y”

[Figure: example XML tree with ORDPATH labels. Bib (1) has children book (1.1), paper (1.3) and paper (1.5); book (1.1) → author (1.1.1) → Tim (1.1.1.1); paper (1.5) → author (1.5.1) → Sarah (1.5.1.1). Two nodes inserted between book (1.1) and paper (1.3) carry the labels book (1.2.1) and paper (1.2.3).]
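A minimal Python sketch of the parent test follows. The helper names are hypothetical. Instead of counting delimiters, the sketch checks that the part of Y's label that follows X's label consists of even components and exactly one final odd component; for a parent label without even components this should agree with the frags/evenNum condition above, and it is an assumption that this is the intended generalization.

```python
def parse(label):
    """Turn '1.2.1' into (1, 2, 1)."""
    return tuple(int(c) for c in label.split("."))


def is_parent(x, y):
    """True if the node labeled x is the parent of the node labeled y."""
    xs, ys = parse(x), parse(y)
    if len(ys) <= len(xs) or ys[:len(xs)] != xs:
        return False                      # x is not a proper prefix of y
    rest = ys[len(xs):]
    # Even components only mark insertion gaps; the last one must be odd.
    return all(c % 2 == 0 for c in rest[:-1]) and rest[-1] % 2 != 0


checks = {
    ("1", "1.1"): is_parent("1", "1.1"),          # ordinary child
    ("1", "1.2.1"): is_parent("1", "1.2.1"),      # inserted child
    ("1", "1.1.1"): is_parent("1", "1.1.1"),      # grandchild
}
```

With the labels of the slide, Bib (1) is reported as the parent of both book (1.1) and the inserted book (1.2.1), but not of author (1.1.1).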

Page 78: Node Indexes

ORDPATHs – Updates (5/5)

• How do we find the siblings now?
• Nodes “X” and “Y” are siblings if:
  • When the labels of X and Y have the same length, the sibling conditions are the same as before.
  • When the labels of X and Y have different lengths, they are siblings if:
    • the longer label contains an even number in the same position as the positional identifier of the other node, and
    • the prefix of the longer label up to the first even number is the same as the prefix of the other node, and
    • frags(X) = frags(Y) – evenNum(Y)

[Figure: example XML tree with ORDPATH labels. Bib (1) has children book (1.1), paper (1.3) and paper (1.5); book (1.1) → author (1.1.1) → Tim (1.1.1.1); paper (1.5) → author (1.5.1) → Sarah (1.5.1.1). Two nodes inserted between book (1.1) and paper (1.3) carry the labels book (1.2.1) and paper (1.2.3).]
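One way to sketch the sibling test in Python is to derive the parent label of each node and compare the results. This is an alternative formulation rather than a literal transcription of the conditions above, and the function names are hypothetical: the parent label is obtained by dropping the node's own odd component and any even components directly in front of it.

```python
def parent_label(label):
    """Return the label of the parent of an ORDPATH-labeled node."""
    comps = [int(c) for c in label.split(".")]
    comps.pop()                           # the node's own odd component
    while comps and comps[-1] % 2 == 0:
        comps.pop()                       # even components mark insertions
    return ".".join(str(c) for c in comps)


def are_siblings(x, y):
    """True if x and y are different nodes with the same parent."""
    return x != y and parent_label(x) == parent_label(y)


same_length = are_siblings("1.1", "1.3")
different_length = are_siblings("1.1", "1.2.1")
not_siblings = are_siblings("1.1", "1.1.1")
```

The extra loop over even components is the cost the slides refer to when they call sibling comparison between labels of varying length expensive.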

Page 79: Node Indexes

ORDPATHs – Conclusion

• Unlike Dewey, it is efficient for dynamic XML files
  • No need to re-label nodes
• Like Dewey, it is not suitable for very deep trees
  • A node’s label size increases quickly
• Also not suitable for very wide trees
  • Large label size for nodes with many siblings
• Expensive label comparisons between sibling nodes whose labels have varying length
• Half of all numbers are wasted, because only odd numbers are used for the initial labeling
• Overflow problem

Page 80: Node Indexes

Outline

• Introduction
• Prefix Labeling Schemes
  • Dewey
  • ORDPATH
  • LSDX
  • Persistent
• Evaluation

Page 81: Node Indexes

LSDX – Structure
Duong et al. - 2005

• Labeling Scheme for Dynamic XML data
• Allows updates without re-labeling other nodes
• Combines numbers and letters to label each node
• For a node X, its label is:
  level(X)parent(X).positionalIdentifier(X)
  where parent(X) is the label of the parent of node X without its level and without its delimiter character
• The first positional identifier is “b”, in order to save codes for any insert-before operation

[Figure: example XML tree with LSDX labels. Bib (0a) has children book (1a.b), paper (1a.c) and paper (1a.d); book (1a.b) → author (2ab.b) → Tim (3abb.b); paper (1a.d) → author (2ad.b) → Sarah (3adb.b).]
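The label construction can be sketched in a few lines of Python. The function name `lsdx_child_label` is hypothetical, and the sketch assumes lowercase letters and a single "." delimiter, as in the labels shown on the slide.

```python
import re


def lsdx_child_label(parent_label, position_id):
    """Build the LSDX label of a child from its parent's label.

    The child's level is the parent's level plus one; its prefix is the
    parent's label without the level number and without the delimiter.
    """
    match = re.fullmatch(r"(\d+)([a-z.]+)", parent_label)
    if match is None:
        raise ValueError(f"not an LSDX label: {parent_label!r}")
    level = int(match.group(1))
    prefix = match.group(2).replace(".", "")
    return f"{level + 1}{prefix}.{position_id}"


book = lsdx_child_label("0a", "b")        # first child of the root
author = lsdx_child_label(book, "b")
tim = lsdx_child_label(author, "b")
```

Starting from the root label 0a, this reproduces the labels 1a.b, 2ab.b and 3abb.b of the leftmost path in the example tree.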

Page 82: Node Indexes

LSDX – Supported Queries (1/3)

• Ancestors / Descendants
  • Node “X” is an ancestor of node “Y” if the label of node “X” without the level number and without the delimiter character is a substring of the label of node “Y”.

[Figure: example XML tree with LSDX labels. Bib (0a) has children book (1a.b), paper (1a.c) and paper (1a.d); book (1a.b) → author (2ab.b) → Tim (3abb.b); paper (1a.d) → author (2ad.b) → Sarah (3adb.b). Example: “ad”, taken from paper (1a.d), is a substring of 3adb.b.]

Page 83: Node Indexes

LSDX – Supported Queries (2/3)

• Parent / Child
  • Node “X” is the parent of node “Y” if node “X” is an ancestor of node “Y” and level(X) = level(Y) - 1

[Figure: example XML tree with LSDX labels. Bib (0a) has children book (1a.b), paper (1a.c) and paper (1a.d); book (1a.b) → author (2ab.b) → Tim (3abb.b); paper (1a.d) → author (2ad.b) → Sarah (3adb.b).]

Page 84: Node Indexes

LSDX – Supported Queries (3/3)

• Siblings
  • Nodes “X” and “Y” are siblings if X.prefix = Y.prefix, where prefix is the substring before the delimiter of a node’s label.

[Figure: example XML tree with LSDX labels. Bib (0a) has children book (1a.b), paper (1a.c) and paper (1a.d); book (1a.b) → author (2ab.b) → Tim (3abb.b); paper (1a.d) → author (2ad.b) → Sarah (3adb.b).]
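The three LSDX relationship tests can be sketched together in Python. All function names are hypothetical. The ancestor test anchors the match at the start of Y's prefix (the part between the level number and the delimiter) rather than searching for an arbitrary substring; that anchoring is an assumption of this sketch.

```python
import re


def split_label(label):
    """Split an LSDX label such as '3adb.b' into (level, prefix, position)."""
    match = re.fullmatch(r"(\d+)([a-z]+)(?:\.([a-z]+))?", label)
    if match is None:
        raise ValueError(f"not an LSDX label: {label!r}")
    return int(match.group(1)), match.group(2), match.group(3) or ""


def is_ancestor(x, y):
    """X's letters (without level and delimiter) must start Y's prefix."""
    level_x, prefix_x, pos_x = split_label(x)
    level_y, prefix_y, _ = split_label(y)
    return level_x < level_y and prefix_y.startswith(prefix_x + pos_x)


def is_parent(x, y):
    """An ancestor exactly one level above."""
    return is_ancestor(x, y) and split_label(x)[0] == split_label(y)[0] - 1


def are_siblings(x, y):
    """Same text before the delimiter, i.e. same level and same parent."""
    return x != y and x.split(".")[0] == y.split(".")[0]


paper_above_sarah = is_ancestor("1a.d", "3adb.b")
book_above_sarah = is_ancestor("1a.b", "3adb.b")
```

Because the level is stored in the label itself, the parent and sibling tests need no traversal of the tree.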

Page 85: Node Indexes

LSDX – Updates (1/3)

• Insertion of a new node
  1. If there is no node before the position where the new node is to be placed, take the label of the node that follows the new node and insert “a” after the delimiter.
  2. Otherwise, keep counting from the preceding node so that the label of the new node is greater than the label of its previous sibling and less than the label of its next sibling (if there is one), in alphabetical order. If the previous label ends with “z”, attach “b” at the end.

[Figure: example XML tree with LSDX labels. Bib (0a) has children book (1a.b), paper (1a.c) and paper (1a.d); book (1a.b) → author (2ab.b) → Tim (3abb.b); paper (1a.d) → author (2ad.b) → Sarah (3adb.b). A new book inserted in front of book (1a.b) gets the label (1a.ab); new papers appended after paper (1a.d) get (1a.e), ..., (1a.z) and then (1a.zb).]

Page 86: Node Indexes

LSDX – Updates (2/3)

• Insertion of a new node
  1. If there is no node before the position where the new node is to be placed, take the label of the node that follows the new node and insert “a” after the delimiter.
  2. Otherwise, keep counting from the preceding node so that the label of the new node is greater than the label of its previous sibling and less than the label of its next sibling (if there is one), in alphabetical order. If the previous label ends with “z”, attach “b” at the end.

[Figure: example XML tree with LSDX labels. Bib (0a) has children book (1a.b), paper (1a.c) and paper (1a.d); book (1a.b) → author (2ab.b) → Tim (3abb.b); paper (1a.d) → author (2ad.b) → Sarah (3adb.b). New paper nodes inserted between paper (1a.c) and paper (1a.d) receive the labels (1a.cb), (1a.cc) and (1a.cab).]
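The insertion rules can be sketched in Python on the positional identifiers alone (the part after the delimiter). The function name `lsdx_new_position` is hypothetical. The first two candidates follow the two rules above; the third candidate, which inserts “a” in front of the last letter of the next sibling, is an assumption inferred from the label (1a.cab) in the figure and is not stated as a rule on the slide.

```python
def lsdx_new_position(left, right):
    """Positional identifier for a node inserted between left and right.

    left / right are the identifiers of the neighboring siblings
    (for example 'c' and 'd'), or None if there is no such sibling.
    """
    if left is None and right is None:
        return "b"                        # first child ever: start at "b"
    if left is None:
        return "a" + right                # rule 1: insert "a" after the delimiter
    # Rule 2: keep counting from the previous sibling.
    if left[-1] == "z":
        bumped = left + "b"
    else:
        bumped = left[:-1] + chr(ord(left[-1]) + 1)
    if right is None:
        return bumped
    candidates = (bumped, left + "b", right[:-1] + "a" + right[-1])
    for cand in candidates:
        if left < cand < right:           # alphabetical order must hold
            return cand
    raise ValueError("no identifier found by this simple sketch")


before_first = lsdx_new_position(None, "b")
after_last = lsdx_new_position("d", None)
between = lsdx_new_position("c", "d")
```

Python's string comparison is plain lexicographic order, which matches the alphabetical order the rules rely on for lowercase identifiers.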

Page 87: Node Indexes

LSDX – Conclusion

• Like ORDPATH, it is efficient for dynamic XML files
  • No need to re-label nodes
• Like ORDPATH, it is not suitable for deep and wide trees in which nodes have hundreds of siblings
  • A node’s label size increases quickly
• Quick computation of supported queries
  • The level of each node can be read off quickly
  • Unlike ORDPATH, finding siblings is much easier
• Overflow problem
  • However, it is more resistant than ORDPATH and DEWEY

XML Doc (MB)    No of Nodes    Total Size of labels (MB)
1,2             17             0,17
5,6             84             1,63
11,4            167            5,29

Page 88: Node Indexes

Outline

• Introduction
• Prefix Labeling Schemes
  • Dewey
  • ORDPATH
  • LSDX
  • Persistent
• Evaluation

Page 89: Node Indexes

Persistent Labeling Scheme - Structure
Gabillon et al. - 2005

• Allows updates without re-labeling other nodes
• The label of each node has the form (l, [np,dp], [n,d]), where:
  • l is the level of the node in the tree,
  • [np,dp] is the positional identifier of the parent node,
  • [n,d] is the positional identifier of the node (unique for each level)
• Given a level “l”, the positional identifier of a node is “(i,1)”, where “i” is the position of the node at level “l”.

[Figure: example XML tree with persistent labels. Bib (0,[1,1]) has children book (1,[1,1],[1,1]), paper (1,[1,1],[2,1]) and paper (1,[1,1],[3,1]); book → author (2,[1,1],[1,1]) → Tim (3,[1,1],[1,1]); the last paper → author (2,[3,1],[2,1]) → Sarah (3,[2,1],[2,1]).]

Page 90: Node Indexes

Persistent Labeling Scheme – Supported Queries (1/4)

• Ancestors / Descendants
  • We build an ancestor structural summary “s” of a source tree “t”
  • Each node of “t” is represented in the summary tree “s” by the code of its parent
  • Nodes having the same parent are represented in “s” by a single node, which carries their parent’s code
  • The root of “s” represents the nodes of “t” that have the root of “t” as parent
  • The ancestor structural summary “s” is held in memory.

[Figure: source tree “t” with persistent labels. Bib (0,[1,1]) has children book (1,[1,1],[1,1]), paper (1,[1,1],[2,1]) and paper (1,[1,1],[3,1]); book → author (2,[1,1],[1,1]) → Tim (3,[1,1],[1,1]); the last paper → author (2,[3,1],[2,1]) → Sarah (3,[2,1],[2,1]). Ancestor structural summary “s”: root [1,1]; second level [1,1] and [3,1]; third level [1,1] and [2,1].]

Page 91: Node Indexes

Persistent Labeling Scheme – Supported Queries (2/4)

• Ancestors / Descendants
  • For a node “X” of code “(l1, [n1p,d1p], [n1,d1])” and a node “Y” of code “(l2, [n2p,d2p], [n2,d2])”:
    • Node “X” is represented in “s” as the node “u” of level l1-1 and of local code “n1p,d1p”
    • Node “Y” is represented in “s” as the node “v” of level l2-1 and of local code “n2p,d2p”
    • If node “X” is an ancestor of node “Y”, then l1<l2 and we can reach node “u” starting from node “v” in “s” with l2-l1 parent steps; that is, node “X” is the (l2-l1)-ancestor of node “Y”

[Figure: source tree “t” with persistent labels. Bib (0,[1,1]) has children book (1,[1,1],[1,1]), paper (1,[1,1],[2,1]) and paper (1,[1,1],[3,1]); book → author (2,[1,1],[1,1]) → Tim (3,[1,1],[1,1]); the last paper → author (2,[3,1],[2,1]) → Sarah (3,[2,1],[2,1]). Ancestor structural summary “s”: root [1,1]; second level [1,1] and [3,1]; third level [1,1] and [2,1].]
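A possible Python sketch of the summary-based ancestor test is given below. The data layout is an assumption: labels are tuples (level, parent_code, code) with codes as pairs, and the summary is a dictionary from each internal node (level, code) to the code of its parent. The walk in this sketch stops at the summary entry for X itself, one parent step earlier than the node “u” described above, so that siblings of X are not reported as ancestors; that choice is also an assumption of the sketch.

```python
def build_summary(labels):
    """Map every node of t that has children to the code of its parent."""
    parent_of = {(level, code): parent for (level, parent, code) in labels}
    internal = {(level - 1, parent)
                for (level, parent, _code) in labels if level > 0}
    return {key: parent_of[key] for key in internal}


def is_ancestor(x, y, summary):
    """True if x is an ancestor of y, walking parent steps in the summary."""
    (l1, _p1, c1), (l2, p2, _c2) = x, y
    if l1 >= l2:
        return False
    level, code = l2 - 1, p2              # start at y's parent
    while level > l1:
        code = summary[(level, code)]     # one parent step
        level -= 1
    return code == c1


# Source tree t of the slide.
bib = (0, None, (1, 1))
book = (1, (1, 1), (1, 1))
paper2 = (1, (1, 1), (2, 1))
paper3 = (1, (1, 1), (3, 1))
author1 = (2, (1, 1), (1, 1))
author2 = (2, (3, 1), (2, 1))
tim = (3, (1, 1), (1, 1))
sarah = (3, (2, 1), (2, 1))
s = build_summary([bib, book, paper2, paper3, author1, author2, tim, sarah])
```

For this tree the summary has five entries, one per node with children, which matches the five nodes of “s” in the figure.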

Page 92: Node Indexes

Persistent Labeling Scheme – Supported Queries (3/4)

• Parent / Child
  • Let node “X” have label (l1, [n1p,d1p], [n1,d1]) and node “Y” have label (l2, [n2p,d2p], [n2,d2])
  • Node “X” is the parent of node “Y” if l2=l1+1 and [n2p,d2p]=[n1,d1]

[Figure: example XML tree with persistent labels. Bib (0,[1,1]) has children book (1,[1,1],[1,1]), paper (1,[1,1],[2,1]) and paper (1,[1,1],[3,1]); book → author (2,[1,1],[1,1]) → Tim (3,[1,1],[1,1]); the last paper → author (2,[3,1],[2,1]) → Sarah (3,[2,1],[2,1]).]

Page 93: Node Indexes

Persistent Labeling Scheme – Supported Queries (4/4)

• Siblings
  • Let node “X” have label (l1, [n1p,d1p], [n1,d1]) and node “Y” have label (l2, [n2p,d2p], [n2,d2])
  • Nodes “X” and “Y” are siblings if l1=l2 and [n1p,d1p]=[n2p,d2p]

[Figure: example XML tree with persistent labels. Bib (0,[1,1]) has children book (1,[1,1],[1,1]), paper (1,[1,1],[2,1]) and paper (1,[1,1],[3,1]); book → author (2,[1,1],[1,1]) → Tim (3,[1,1],[1,1]); the last paper → author (2,[3,1],[2,1]) → Sarah (3,[2,1],[2,1]).]
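The parent and sibling conditions translate almost directly into Python. The tuple layout (level, parent_code, code) and the function names are assumptions of this sketch; the only addition to the stated conditions is that a node is not treated as its own sibling.

```python
def is_parent(x, y):
    """x is the parent of y if y is one level deeper and y's parent
    code equals x's own code."""
    (l1, _p1, c1), (l2, p2, _c2) = x, y
    return l2 == l1 + 1 and p2 == c1


def are_siblings(x, y):
    """Same level and same parent code (and not the same node)."""
    (l1, p1, c1), (l2, p2, c2) = x, y
    return l1 == l2 and p1 == p2 and c1 != c2


book = (1, (1, 1), (1, 1))
paper = (1, (1, 1), (3, 1))
author_of_paper = (2, (3, 1), (2, 1))

paper_is_parent = is_parent(paper, author_of_paper)
book_is_parent = is_parent(book, author_of_paper)
book_paper_siblings = are_siblings(book, paper)
```

Both tests compare a constant number of fields, independent of the depth of the tree.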

Page 94: Node Indexes

Persistent Labeling Scheme – Updates (1/2)

• Insertion of a new node “X” at level “l”
  • If “X” is the first node to be inserted at level “l”, then its positional identifier is (1,1)
  • If “X” is inserted immediately before the node of positional identifier (i, j) and there is no other node before (i, j), then the positional identifier of “X” is (i-j, j)
  • If “X” is inserted immediately after the node of positional identifier (i, j) and there is no other node after (i, j), then the positional identifier of “X” is (i+j, j)

[Figure: example XML tree with persistent labels. Bib (0,[1,1]) has children book (1,[1,1],[1,1]), paper (1,[1,1],[2,1]) and paper (1,[1,1],[3,1]); book → author (2,[1,1],[1,1]) → Tim (3,[1,1],[1,1]); the last paper → author (2,[3,1],[2,1]) → Sarah (3,[2,1],[2,1]). A new book inserted before the first child of Bib gets (1,[1,1],[0,1]); a new book inserted after the last child of Bib gets (1,[1,1],[4,1]).]

Page 95: Node Indexes

Persistent Labeling Scheme – Updates (2/2)

• Insertion of a new node “X” at level “l”
  • If “X” is inserted immediately before the node of positional identifier (i, j) and immediately after the node of positional identifier (k, h), then the positional identifier of “X” is (a/d, b/d) with:
    • a = i*h + k*j
    • b = 2*h*j
    • d = the highest common factor of a and b

[Figure: example XML tree with persistent labels. Bib (0,[1,1]) has children book (1,[1,1],[1,1]), paper (1,[1,1],[2,1]) and paper (1,[1,1],[3,1]); book → author (2,[1,1],[1,1]) → Tim (3,[1,1],[1,1]); the last paper → author (2,[3,1],[2,1]) → Sarah (3,[2,1],[2,1]). A new paper inserted between the two existing papers gets (1,[1,1],[5,2]).]

Worked example: (i, j) = (3,1), (k, h) = (2,1), a = 3*1 + 2*1 = 5, b = 2*1*1 = 2, d = 1
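The insertion rules of the persistent scheme can be sketched in Python with `math.gcd` for the highest common factor. The function names are hypothetical. Reading a pair (n, d) as the fraction n/d, the in-between rule yields the midpoint of the two neighbors, which is why an identifier between any two existing ones can always be found.

```python
from math import gcd


def insert_before_first(first):
    """New leftmost node before (i, j): (i - j, j)."""
    i, j = first
    return (i - j, j)


def insert_after_last(last):
    """New rightmost node after (i, j): (i + j, j)."""
    i, j = last
    return (i + j, j)


def insert_between(prev, nxt):
    """New node after prev = (k, h) and before nxt = (i, j)."""
    k, h = prev
    i, j = nxt
    a = i * h + k * j
    b = 2 * h * j
    d = gcd(a, b)                         # highest common factor
    return (a // d, b // d)


leftmost = insert_before_first((1, 1))
rightmost = insert_after_last((3, 1))
middle = insert_between((2, 1), (3, 1))
```

The numerators and denominators can grow with repeated insertions at the same spot, which is one way to read the remark that the scheme needs large memory storage.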

Page 96: Node Indexes

Persistent Labeling Scheme - Conclusion

• Like ORDPATH and LSDX, it is efficient for dynamic XML files
  • No need to re-label nodes
• Quick computation of supported queries
  • A bit more complex for the ancestor/descendant relationship
• No overflow problem!
  • However, it needs large memory storage

Page 97: Node Indexes

Outline

• Introduction
• Prefix Labeling Schemes
  • Dewey
  • ORDPATH
  • LSDX
  • Persistent
• Evaluation

Page 98: Node Indexes

Evaluation of Prefix Labeling Schemes
Sans et al. - 2008

• Experimental setup:
  • Java 1.4.2
  • Sun Microsystems SAX parser
  • Pentium IV 1.3G, 1024MB RAM
  • Windows XP OS
• Impact of the depth and breadth of the XML document on:
  • Time for generating labels
  • Space taken by these labels

Page 99: Node Indexes

Evaluation – Time Analysis (1/2): Breadth Influence

[Figure: required time to generate labels at constant depth. x-axis: number of nodes (50000 to 350000); y-axis: time (s), 0 to 4000; series: Persistent, LSDX, ORDPATH, Dewey.]

Page 100: Node Indexes

Evaluation – Time Analysis (2/2): Depth Influence

[Figure: required time to generate labels at constant breadth (50 nodes). x-axis: depth (5, 15, 30, 50); y-axis: time (s), 0 to 0.30; series: Persistent, LSDX, ORDPATH, Dewey.]

Page 101: Node Indexes

Evaluation – Storage Analysis (1/2): Breadth Influence

[Figure: required storage space for labels at constant depth, in three panels: 50 nodes (space in octets*, 0 to 800), 500 nodes (space in octets, 0 to 20000) and 5000 nodes (space in millions of octets, 0 to 1); series: Persistent, LSDX, ORDPATH, Dewey.]

* An octet is a grouping of eight bits

Page 102: Node Indexes

Evaluation – Storage Analysis (2/2): Depth Influence

[Figure: required storage space for labels at constant breadth (50 nodes). x-axis: depth (5, 15, 30, 50); y-axis: space (octets), 0 to 40000; series: Persistent, LSDX, ORDPATH, Dewey.]

Page 103: Node Indexes

Evaluation – Conclusion

• For generating labels, DEWEY and ORDPATH are the quickest techniques for both deep and wide trees
  • LSDX and PERSISTENT follow
  • Since ORDPATH supports updates, it is preferable to DEWEY
• For wide trees, DEWEY, ORDPATH and PERSISTENT require the least space
  • As the breadth of the tree grows, the PERSISTENT technique outperforms the other techniques and LSDX worsens
  • For trees that are not very wide, LSDX needs the least space
• For deep trees, LSDX requires the least space; DEWEY and ORDPATH follow
  • For trees that are not deep, DEWEY and ORDPATH perform best

Page 104: Node Indexes

Comparing with Interval

• Measure the processing time of inserting 382 nodes as the size of the original data is increased.

[Figure: insertion time (ms), axis ticks from 10 to 100000, against size of original data (MB), 1 to 50; series: Beg-End, Prime, Nested Tree, ORDPATH.]

Page 105: Node Indexes

Outline

• Introduction
• Interval Labeling Schemes
• Prefix Labeling Schemes
• Comparison

Page 106: Node Indexes

Comparing Structural Indexes (1/2)

Criterion | Node | Graph | Sequence
Wrong initial answers | No | Yes (non-deterministic with forward and backward bisimilarity) | No | No
Missing initial correct answers | No | No | Yes
Structural joins required (path) | Yes | No | No
Structural joins required (twig) | Yes | Yes | No
How to evaluate a twig query | Break the query into nodes; join the nodes | Break the query into several paths; solve each path; join the results | Process the twig query as a whole

Page 107: Node Indexes

Comparing Structural Indexes (2/2)

Criterion | Node | Graph | Sequence
Hold values | No (values have to be indexed separately) | No (there are some attempts to integrate values into the index) | Yes (values are integrated efficiently into the index)
Main role in answering an XML query | Path joining | Path selection | Complete query evaluation
Update cost for inserting a node/subtree | O(N)* | O(N+M)* | O(b.logN)*

* number of nodes that need to be touched during an update
N = number of nodes, M = number of edges, b = fan-out of the B+ tree

Page 108: Node Indexes


Questions?