Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
XML Design for Relational Storage
Solmaz KolahiUniversity of Toronto
Leonid LibkinUniversity of Edinburgh
16th International World Wide Web Conference (WWW 2007)
Kolahi, Libkin @WWW 2007 – p.1/17
Motivation
Like relational data, XML documents can contain redundantinformation due to functional dependencies.
416
CityToronto
416 416 416
City City CityToronto TorontoToronto
AreaCode AreaCodeAreaCode AreaCode
Functional Dependency: @AreaCode → @City
Kolahi, Libkin @WWW 2007 – p.2/17
Motivation
This redundancy is reflected in the relational storage of XMLdocuments.
AreaCode416
CityToronto
CityAreaCode
XML functional dependency
@AreaCode → @City AreaCode → City
regular functional dependency
416
416
416
416
Toronto
Toronto
Toronto
Toronto
XML Document Relational Storage
Kolahi, Libkin @WWW 2007 – p.3/17
Motivation
This redundancy is reflected in the relational storage of XMLdocuments.
AreaCode
416
416
416
416
City
Toronto
Toronto
Toronto
TorontoAreaCode416
CityToronto
XML functional dependency
XML Document Relational Storage
equality-generating dependency
on
@AreaCode → @City ∀ R1(x, a) ∧ R2(y, c) ∧
R1(x′, a) ∧ R2(y′, c′) → c = c′
Kolahi, Libkin @WWW 2007 – p.3/17
Motivation
This redundancy is reflected in the relational storage of XMLdocuments.
AreaCode
416
416
416
416
City
Toronto
Toronto
Toronto
TorontoAreaCode416
CityToronto
XML functional dependency
XML Document Relational Storage
equality-generating dependency
on
@AreaCode → @City ∀ R1(x, a) ∧ R2(y, c) ∧
R1(x′, a) ∧ R2(y′, c′) → c = c′
The more redundant the data, the more prone to updateanomalies.
Kolahi, Libkin @WWW 2007 – p.3/17
Motivation
Solution: normalizing data to eliminate redundancies.
Normal forms for relational data:• BCNF eliminates all redundancies w.r.t. functional
dependencies but may lose dependencies.• 3NF eliminates some redundancies but preserves
dependencies.
Normal forms for XML documents:• XNF eliminates redundancies w.r.t. XML functional
dependencies.
If nontrivial FD p1, . . . , pn → q.@l holds then p1, . . . , pn → qshould also hold.
Kolahi, Libkin @WWW 2007 – p.4/17
Motivation
Many XML documents are stored and queried using relationaldatabase management systems.
Questions:• How do XML constraints translate to the relational storage?• What is the best XML design to have a redundancy-free
relational storage?• What is a good XML design to have a low-redundancy
relational storage?
We use an information-theoretic technique to measure theredundancy of data.
Kolahi, Libkin @WWW 2007 – p.5/17
Outline
• Overview of the information-theoretic measure.• Storing XML in relations and constraint translation.• Designing XML to achieve low redundancy in relational
storage.• Concluding remarks.
Kolahi, Libkin @WWW 2007 – p.6/17
Measure of Information Content
• Proposed by Arenas & Libkin in PODS’03.• Used to measure the redundancy of a data value in a
database instance with respect to a set of constraints.• Intuitively, RICI(p|Σ) measures the relative information
content of position p in instance I w.r.t. constraints Σ.• Independent of data models and query languages.
Kolahi, Libkin @WWW 2007 – p.7/17
Measure of Information Content
• Proposed by Arenas & Libkin in PODS’03.• Used to measure the redundancy of a data value in a
database instance with respect to a set of constraints.• Intuitively, RICI(p|Σ) measures the relative information
content of position p in instance I w.r.t. constraints Σ.• Independent of data models and query languages.
Σ = {A → C}
RICI(P |Σ)
0.875
A B C D
1 2 3 4
1 2 3 5
Σ = {A → C, B → C}
RICI(P |Σ)
0.781
Kolahi, Libkin @WWW 2007 – p.7/17
Measure of Information Content
• Proposed by Arenas & Libkin in PODS’03.• Used to measure the redundancy of a data value in a
database instance with respect to a set of constraints.• Intuitively, RICI(p|Σ) measures the relative information
content of position p in instance I w.r.t. constraints Σ.• Independent of data models and query languages.
Σ = {A → C}
RICI(P |Σ)
0.875
0.781
A B C D
1 2 3 4
1 2 3 5
1 2 3 6
Σ = {A → C, B → C}
RICI(P |Σ)
0.781
0.629
Kolahi, Libkin @WWW 2007 – p.7/17
Measure of Information Content
• Proposed by Arenas & Libkin in PODS’03.• Used to measure the redundancy of a data value in a
database instance with respect to a set of constraints.• Intuitively, RICI(p|Σ) measures the relative information
content of position p in instance I w.r.t. constraints Σ.• Independent of data models and query languages.
Σ = {A → C}
RICI(P |Σ)
0.875
0.781
0.711
A B C D
1 2 3 4
1 2 3 5
1 2 3 6
1 2 3 7
Σ = {A → C, B → C}
RICI(P |Σ)
0.781
0.629
0.522
Kolahi, Libkin @WWW 2007 – p.7/17
Measure of Information Content
• Proposed by Arenas & Libkin in PODS’03.• Used to measure the redundancy of a data value in a
database instance with respect to a set of constraints.• Intuitively, RICI(p|Σ) measures the relative information
content of position p in instance I w.r.t. constraints Σ.• Independent of data models and query languages.
Σ = {A → C}
RICI(P |Σ)
0.875
0.781
0.711
0.658
A B C D
1 2 3 4
1 2 3 5
1 2 3 6
1 2 3 7
1 2 3 8
Σ = {A → C, B → C}
RICI(P |Σ)
0.781
0.629
0.522
0.446
Kolahi, Libkin @WWW 2007 – p.7/17
Measure of Information Content
R(A,B,C) Σ = {A → B}
A B C
1 2 3
1 2 4
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).For every X ⊆ Pos(I) − {p} compute probability distributionP (a|X) for every a ∈ {1, . . . , k}.
Kolahi, Libkin @WWW 2007 – p.8/17
Measure of Information Content
R(A,B,C) Σ = {A → B}
A B C
2 3
1 2
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).For every X ⊆ Pos(I) − {p} compute probability distributionP (a|X) for every a ∈ {1, . . . , k}.
P (2|X) =
Kolahi, Libkin @WWW 2007 – p.8/17
Measure of Information Content
R(A,B,C) Σ = {A → B}
A B C
1 2 3
1 2 1
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).For every X ⊆ Pos(I) − {p} compute probability distributionP (a|X) for every a ∈ {1, . . . , k}.
P (2|X) =
Kolahi, Libkin @WWW 2007 – p.8/17
Measure of Information Content
R(A,B,C) Σ = {A → B}
A B C
4 2 3
1 2 7
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).For every X ⊆ Pos(I) − {p} compute probability distributionP (a|X) for every a ∈ {1, . . . , k}.
P (2|X) =
Kolahi, Libkin @WWW 2007 – p.8/17
Measure of Information Content
R(A,B,C) Σ = {A → B}
A B C
4 2 3
1 2 7
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).For every X ⊆ Pos(I) − {p} compute probability distributionP (a|X) for every a ∈ {1, . . . , k}.
P (2|X) = 48/
Kolahi, Libkin @WWW 2007 – p.8/17
Measure of Information Content
R(A,B,C) Σ = {A → B}
A B C
a 3
1 2
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).For every X ⊆ Pos(I) − {p} compute probability distributionP (a|X) for every a ∈ {1, . . . , k}.
P (2|X) = 48/(48 + 6 × 42) = 0.16P (a|X) = 42/(48 + 6 × 42) = 0.14 for every a 6= 2
Kolahi, Libkin @WWW 2007 – p.8/17
Measure of Information Content
R(A,B,C) Σ = {A → B}
A B C
1 2 3
1 2 4
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).For every X ⊆ Pos(I) − {p} compute probability distributionP (a|X) for every a ∈ {1, . . . , k}.
P (2|X) = 48/(48 + 6 × 42) = 0.16P (a|X) = 42/(48 + 6 × 42) = 0.14 for every a 6= 2
Conditional entropy : 2.8057
Average over all possible X: RICkI = 2.4558
Kolahi, Libkin @WWW 2007 – p.8/17
Measure of Information Content
R(A,B,C) Σ = {A → B}
A B C
1 2 3
1 2 4
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).For every X ⊆ Pos(I) − {p} compute probability distributionP (a|X) for every a ∈ {1, . . . , k}.
P (2|X) = 48/(48 + 6 × 42) = 0.16P (a|X) = 42/(48 + 6 × 42) = 0.14 for every a 6= 2
Conditional entropy : 2.8057
Average over all possible X: RICkI = 2.4558
RICI(p|Σ) = limk→∞
RICkI (p | Σ)
log k= 0.875
Kolahi, Libkin @WWW 2007 – p.8/17
Measure and Normal Forms
Ideally, we want to achieve a well-designed database by havingthe maximum information content for the entire database.
If this is not achievable, we want to maximize the informationcontent for all positions to the possible extent by enforcing somedesign conditions.
Given a condition C, guaranteed information content (GIC(C)) isthe largest number g ∈ [0, 1] such that for all positions in allinstances of all schemas satisfying C, the information content isnot smaller than g.
instances of (S, Σ)
RICI(p|Σ) ≥ g
Schema satisfying condition C(S, Σ) =⇒
Kolahi, Libkin @WWW 2007 – p.9/17
Measure and Normal Forms
Known results (Arenas&Libkin: PODS’03, Kolahi&Libkin:PODS’06):
• For relational schemas with functional dependencies:◦ BCNF is the only normal form that guarantees
well-design databases: GIC(BCNF) = 1.◦ a good 3NF normalization guarantees a minimum of 1/2
information content: GIC(3NF+) = 1/2.• For XML designs with functional dependencies:
◦ XNF is the only normal form that guaranteeswell-designed XML documents: GIC(XNF) = 1.
How do we design XML to achieve high information content forthe relational storage?
Kolahi, Libkin @WWW 2007 – p.10/17
Storing XML in Relations
Inlining technique: Given a DTD, separate relations are createdfor the root and elements occurring under a Kleene star.
student
db
contact
addressphone
@streetNo@number @city @postalCode
@name
*
**
student(stID, name, conID)
address(addID, conID, postalCode, streetNo, city)
phone(phID, conID, number)
Kolahi, Libkin @WWW 2007 – p.11/17
Storing XML in Relations
Inlining technique: Given a DTD, separate relations are createdfor the root and elements occurring under a Kleene star.
student
db
contact
addressphone
@streetNo@number @city @postalCode
@name
*
**
student(stID, name, conID)
address(addID, conID, postalCode, streetNo, city)
phone(phID, conID, number)
address[conID] ⊆FK student[conID]
phone[conID] ⊆FK student[conID]
Kolahi, Libkin @WWW 2007 – p.11/17
Storing XML in Relations
Inlining technique: Given a DTD, separate relations are createdfor the root and elements occurring under a Kleene star.
student
db
contact
addressphone
@streetNo@number @city @postalCode
@name
*
**
student(stID, name, conID)
address(addID, conID, postalCode, streetNo, city)
phone(phID, conID, number)
student, @postalCode → address
address[conID] ⊆FK student[conID]
phone[conID] ⊆FK student[conID]
∀ student(s, n, c) ∧ address(a, c, pc, st, apt, ct) ∧
student(s, n, c) ∧ address(a′, c, pc, st′, apt′, ct′)
→ a = a′
Kolahi, Libkin @WWW 2007 – p.11/17
Redundancy-Free Design
XML Design
• DTD• XML functional depen-
dencies
Relational Storage
• Relational schema• Keys, foreign keys• EGDs
Kolahi, Libkin @WWW 2007 – p.12/17
Redundancy-Free Design
XML Design
• DTD• XML functional depen-
dencies
Relational Storage
• Relational schema• Keys, foreign keys• EGDs
RICI (P |Σ) =?
Kolahi, Libkin @WWW 2007 – p.12/17
Redundancy-Free Design
XML Design
• DTD• XML functional depen-
dencies
Relational Storage
• Relational schema• Keys, foreign keys• EGDs
RICI (P |Σ) =?
Theorem: XML design in XNF ⇔ Max information content forpositions in the relational storage.
Kolahi, Libkin @WWW 2007 – p.12/17
Non-XNF Designs
If there are non-XNF functional dependencies,• we can do XNF normalization, but• normalizing is not always efficient.
With no restriction on FDs, we can have high redundancy in therelational storage.
• For any ε > 0, the information content can get as close as εto zero.
Can we restrict XML functional dependencies to guarantee areasonable information content?
• like the restriction that 3NF enforces to guarantee 1/2information content.
Kolahi, Libkin @WWW 2007 – p.13/17
Relative vs Absolute Functional Dependencies
student
db
contact
addressphone
@streetNo@number @city @postalCode
@name
*
**
Relative functional dependency:holds within each student element.
student, @postalCode → address
Absolute functional dependency:holds globally.
@postalCode → @city
We are interested in FDs relative to an element that occursunder a Kleene star in the DTD.
Kolahi, Libkin @WWW 2007 – p.14/17
Another Good Design
Theorem: If all XML functional dependencies are either XNF orrelative, then the information content for all data values in therelational storage is not less than 1/2.
Kolahi, Libkin @WWW 2007 – p.15/17
Another Good Design
Theorem: If all XML functional dependencies are either XNF orrelative, then the information content for all data values in therelational storage is not less than 1/2.
RICI (P |Σ) =?
only XNF functional dependencies ⇔ RICI(p|Σ) = 1
XNF or relative functional dependencies ⇒ RICI(p|Σ) ≥ 1/2
Kolahi, Libkin @WWW 2007 – p.15/17
Another Relational Storage for XML
We can treat XML documents as edge-labeled graphs andshred them into relations:
Edge(source, target, label) orValue(vid, val)
Blabel(source, target)
Value(vid, val)
Kolahi, Libkin @WWW 2007 – p.16/17
Another Relational Storage for XML
We can treat XML documents as edge-labeled graphs andshred them into relations:
Edge(source, target, label) orValue(vid, val)
Blabel(source, target)
Value(vid, val)
Our results extend:
only XNF functional dependencies ⇔ RICI(p|Σ) = 1
XNF or relative functional dependencies ⇒ RICI(p|Σ) ≥ 1/2
But enforcing XML functional dependencies requires arbitrarilymany more joins.
Kolahi, Libkin @WWW 2007 – p.16/17
Conclusions
Design tips to have a good relational storage:• try to have an XNF design to ensure a redundancy-free
relational storage.• organize XML elements so that there is no absolute FD to
ensure a low-redundancy relational storage.
Comparing inlining and edge relational representations:• they are equivalent in terms of redundancy.• edge can be significantly worse when enforcing constraints.
Future work:• is it always possible to avoid absolute FDs?
Kolahi, Libkin @WWW 2007 – p.17/17