0360-569 Semantic Web (Winter 2007)
A Report on
DTD vs XML Schema: A Practical Study
By: Bex, G. J., Neven, F., Bussche, J. V.
Presented By: Quazi Rahman, Titas Mutsuddi


Outline
1. Introduction
2. Structural View of DTDs and XSDs
3. Dataset
4. Expressiveness of XSDs
5. Additional Features
6. Regular Expression Characterization
7. Schema and Ambiguity
8. Errors
9. Conclusion
10. References

1. Introduction
- DTDs and XSDs are two widely used schema languages for describing the content of XML documents.
- Although DTDs and XSDs differ syntactically, they are closely related on an abstract level.
- In this paper the authors present a comparative study of DTDs and XSDs. They try to answer two questions:
  - Which of the extra features or expressiveness of XML Schema, not available in DTDs, are actually used in practice?
  - How sophisticated are the structural properties (the nature of the regular expressions) of the two formalisms?

1. Introduction (cont’d): Definition of DTD and XSD
- Both Document Type Definitions (DTDs) and XML Schema Definitions (XSDs) state which tags and attributes are used to describe the elements of an XML document, where each tag is allowed, which tags can appear within other tags, and so on.
- Applications use a document's DTD or XSD to read and display the document's contents properly.
- The format of a document can be changed easily by modifying its DTD or XSD.

1. Introduction (cont’d): Merits and Demerits of DTD and XSD
- Shortcomings of DTDs
  - No support for namespaces
  - Limited support for data types
  - Limited support for cardinality
- Shortcomings of XSDs
  - They are more complex than DTDs
  - There are complaints about performance
- Merits of XSDs
  - XSDs are extensible to future additions
    - Reuse a schema in other schemas
    - Create new data types derived from the standard types
    - Reference multiple schemas in the same document
  - XSDs are richer and more powerful than DTDs

1. Introduction (cont’d): Merits of XSDs
- XSDs are written in XML
  - Don't have to learn a new language
  - Can use an XML editor to edit schema files
  - Can use an XML parser to parse schema files
  - Can transform a schema with XSLT
- XSDs support data types. It is easier to:
  - Describe allowable document content
  - Validate the correctness of data
  - Work with data from a database
  - Define data facets (restrictions on data)
  - Define data patterns (data formats)
  - Convert data between different data types
- XSDs support namespaces

2. Structural View of DTD and XSD
An XML document may be viewed as a finite ordered tree. An example:

<store>
  <dvd>
    <title>Amelie</title>
    <price>17</price>
  </dvd>
  <dvd>
    <title>Good bye, Lenin</title>
    <price>20</price>
    <discount>20%</discount>
  </dvd>
</store>

2. Structural View of DTD and XSD (cont’d)
Corresponding tree structure:

store
  dvd
    title     "Amelie"
    price     "17"
  dvd
    title     "Good bye, Lenin"
    price     "20"
    discount  "20%"
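As a rough illustration of the tree view, the following Python sketch parses the store document with the standard-library xml.etree.ElementTree module and lists one node per line. The helper names outline and child_labels are invented for this example and do not come from the paper.

```python
import xml.etree.ElementTree as ET

DOC = """<store>
  <dvd><title>Amelie</title><price>17</price></dvd>
  <dvd><title>Good bye, Lenin</title><price>20</price><discount>20%</discount></dvd>
</store>"""


def outline(node, depth=0):
    """Return the labelled tree as indented lines, one line per element."""
    text = (node.text or "").strip()
    line = "  " * depth + node.tag + (f' "{text}"' if text else "")
    lines = [line]
    for child in node:  # children are visited in document order
        lines.extend(outline(child, depth + 1))
    return lines


def child_labels(node):
    """Sequence of labels of a node's children, as used in the DTD definitions."""
    return [child.tag for child in node]


root = ET.fromstring(DOC)
print("\n".join(outline(root)))
```

The ordered list returned by child_labels is the sequence that a content model constrains.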

2. Structural View of DTD and XSD (cont’d)
A DTD describing the previous document:

<!ELEMENT store (dvd+)>
<!ELEMENT dvd (title, price, discount?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT discount (#PCDATA)>

For the tree above, assume that every node label is a member of some finite alphabet Σ.

Definition 1. A DTD is a pair (d, s) where d is a function that maps Σ-symbols to regular expressions over Σ, and s ∈ Σ is the start symbol. A tree satisfies the DTD if its root is labeled by s and, for every node u with label a, the sequence a1…an of labels of its children matches the regular expression d(a).
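Definition 1 can be sketched directly in Python. This is a minimal sketch under several assumptions: the names D, START and satisfies are hypothetical, each child label is written followed by a space so that ordinary string regular expressions can stand in for regular expressions over Σ, and #PCDATA is modelled as "no element children" (text content is ignored).

```python
import re
import xml.etree.ElementTree as ET

# d: maps each element name to a regular expression over child labels.
# Every label is followed by one space so that names cannot run together.
D = {
    "store": r"(dvd )+",
    "dvd": r"title price (discount )?",
    "title": r"",      # assumption: #PCDATA means no element children
    "price": r"",
    "discount": r"",
}
START = "store"  # s: the start symbol


def satisfies(root, d=D, start=START):
    """Check the two conditions of Definition 1 on a parsed XML tree."""
    if root.tag != start:  # the root must be labelled by s
        return False

    def ok(node):
        word = "".join(child.tag + " " for child in node)
        rule = d.get(node.tag)
        if rule is None or re.fullmatch(rule, word) is None:
            return False
        return all(ok(child) for child in node)

    return ok(root)


valid = ET.fromstring(
    "<store><dvd><title>Amelie</title><price>17</price></dvd></store>")
invalid = ET.fromstring(
    "<store><dvd><title>Amelie</title></dvd></store>")
print(satisfies(valid), satisfies(invalid))
```

A real validator would also handle attributes, mixed content and entities; the sketch covers only the structural condition in the definition.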

2. Structural View of DTD and XSD (cont’d)
A DTD can be abstracted as a set of rules of the form a → r, where a is an element and r is a regular expression over the alphabet of elements. For example:

store → dvd+
dvd → title price discount?

Definition 2. A specialized DTD (SDTD) is a 4-tuple (Σ, Σ', δ, μ), where Σ' is an alphabet of types, δ is a DTD over Σ' and μ is a mapping from Σ' to Σ. Note that μ can be applied to a Σ'-tree as a relabeling of the nodes, thus yielding a Σ-tree. A Σ-tree t then satisfies the SDTD if t can be written as μ(t'), where t' satisfies the DTD δ.

2. Structural View of DTD and XSD (cont’d)
A simple example of an SDTD:

store → (dvd1 + dvd2)* dvd2 (dvd1 + dvd2)*
dvd1 → title price
dvd2 → title price discount

Here, dvd1 defines ordinary DVDs while dvd2 defines DVDs on sale. The rule for store specifies that there should be at least one of the latter.

Definition 3. A single-type SDTD is an SDTD (Σ, Σ', (d, s), μ) with the property that no regular expression d(a) has occurrences of types of the form b_i and b_j with the same b but different i and j.

The example above is not a single-type SDTD, as both dvd1 and dvd2 occur in the rule for store.

2. Structural View of DTD and XSD (cont’d)
An example of a single-type grammar is given below:

store → regulars discounts
regulars → (dvd1)*
discounts → dvd2 (dvd2)*
dvd1 → title price
dvd2 → title price discount

Although there are still two element definitions, dvd1 and dvd2, they can only occur in different contexts: regulars and discounts, respectively.
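The single-type condition of Definition 3 can be checked mechanically. The sketch below is one possible reading, with assumed names: rules maps each type to its regular expression written as a string, and MU plays the role of μ by mapping each type to its element name. Both representations are choices made for this example only.

```python
import re

# mu: maps each type to the element name it relabels to.
MU = {
    "store": "store", "regulars": "regulars", "discounts": "discounts",
    "dvd1": "dvd", "dvd2": "dvd",
    "title": "title", "price": "price", "discount": "discount",
}

NOT_SINGLE = {
    "store": "(dvd1 + dvd2)* dvd2 (dvd1 + dvd2)*",
    "dvd1": "title price",
    "dvd2": "title price discount",
}

SINGLE = {
    "store": "regulars discounts",
    "regulars": "(dvd1)*",
    "discounts": "dvd2 (dvd2)*",
    "dvd1": "title price",
    "dvd2": "title price discount",
}


def is_single_type(rules, mu=MU):
    """True if no rule mentions two different types of the same element."""
    for expr in rules.values():
        seen = {}  # element name -> the one type used for it in this rule
        for type_name in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", expr):
            if seen.setdefault(mu[type_name], type_name) != type_name:
                return False
    return True


print(is_single_type(NOT_SINGLE), is_single_type(SINGLE))
```

The first grammar fails because the rule for store mentions both dvd1 and dvd2; in the second, each rule mentions at most one type per element name.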

2. Structural View of DTD and XSD (cont’d)
A fragment of an XSD for the store rule of the SDTD example, store → (dvd1 + dvd2)* dvd2 (dvd1 + dvd2)*, may be written as:

<xs:element name="store">
  <xs:complexType>
    <xs:sequence>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="dvd" type="dvd1"/>
        <xs:element name="dvd" type="dvd2"/>
      </xs:choice>
      <xs:element name="dvd" type="dvd2"/>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="dvd" type="dvd1"/>
        <xs:element name="dvd" type="dvd2"/>
      </xs:choice>
    </xs:sequence>
  </xs:complexType>
</xs:element>

3. Dataset
- The authors gathered a representative sample of DTDs and XSDs for this comparative study, mostly from the online source xml.coverpages.org.
- They obtained 109 DTDs and 93 XSDs for this study.

4. Expressiveness of XSDs: Single-Type
- The authors tried to find out whether the expressive power of single-type SDTDs is actually used in real-world XSDs.
- Most XSDs define local tree languages, that is, languages that can already be defined by DTDs.
- Only 5 of the 30 XSDs used in this analysis, or roughly 15%, are true single-type SDTDs.
- All five XSDs were of the form:

p → …a1…
q → …a2…
a1 → expr1
a2 → expr2

  This means: when the parent of an a is p (or q), use the rule for a1 (or a2).

4. Expressiveness of XSDs (cont’d): Derived Types
- XML Schema provides two kinds of types: simple and complex.
- A simple type describes the character data an element can contain (like #PCDATA in DTDs).
- A complex type specifies which elements may occur as children of a given element.
- In XSDs, new types may be derived from existing types using two mechanisms:
  - Extension
  - Restriction

4. Expressiveness of XSDs (cont’d): Derived Types
- A simple type can be extended to a complex type to add attributes to elements.
- A complex type can be extended to add a sequence of additional elements to its content model, or to add attributes.
- A simple type can be restricted to limit the acceptable range of values for that type.
- A complex type can be restricted to limit the set of acceptable sub-trees.

              Simple type (%)   Complex type (%)
Extension           27                37
Restriction         73                 7

Table 1: Relative use of derivation features in XSDs

4. Expressiveness of XSDs (cont’d): Derived Types
Out of the 93 XSDs considered:
- Approximately one fifth (20%) do not construct new types through derivation at all.
- Extension is used to define additional attributes in 58%, and to add new elements to a content model in 42%.
- Restriction of complex types is used in only 7%.
- Only 37% use extension of complex types, which parallels inheritance in OOP.
- Extension of simple types occurs in 27% of XSDs.
- Restriction of simple types is the most heavily used feature (73%), which points to a shortcoming of DTDs, where only the unrestricted #PCDATA is available.

4. Expressiveness of XSDs (cont’d): Derived Types
- 6 XSDs use the feature of finalizing a type definition, that is, an attribute specifying that the type can be neither restricted nor extended.
- 11 XSDs use abstract type definitions, from which new types must be derived before they can be used.
- A derived type can occur anywhere in the content model where the original type is allowed, but this can be prevented by applying the block attribute to the original type. 2 XSDs use this blocking feature.
- The fixed attribute is used to indicate that an element or attribute is restricted to a specific value. Only a single XSD uses this feature.
- With the substitutionGroup feature, an element can be substituted by an element with another name. This feature is used by 10 XSDs.

5. Additional Features
- The &-operator, which specifies that all elements must occur but that their order is not significant, was available in SGML DTDs but was lost in XML DTDs (a1 & a2 & a3 ≡ a1a2a3 | a1a3a2 | … | a3a2a1). XSDs restore this feature through the xsd:all element. Only 4 XSDs use this operator.
- Elements of an XML document can be identified using an ID attribute and referred to by IDREF or IDREFS (also supported by DTDs). IDs are unique throughout the document. Only 6 XSDs use this feature.
- References to elements can also be expressed with key/keyref pairs. A reference to a key implies that an element with the corresponding key must exist in the document. This is used by 4 XSDs.
- One important feature of XSDs is the use of namespaces, which allow the current XSD to use elements and types that are defined elsewhere. Apart from the obvious inclusion of the XML Schema namespace, 20 XSDs use this feature.
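To make the cost of losing the &-operator concrete, the expansion a1 & a2 & a3 into an explicit disjunction can be generated mechanically. The function name expand_all is an assumption for this sketch; it relies only on itertools.permutations from the Python standard library.

```python
from itertools import permutations


def expand_all(*names):
    """Rewrite a1 & ... & an as an explicit disjunction of all orderings."""
    return " | ".join(" ".join(order) for order in permutations(names))


print(expand_all("a1", "a2", "a3"))
```

For n elements the disjunction has n! alternatives, which is why writing the same constraint in an XML DTD quickly becomes impractical.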

6. Regular Expression Characterization
- The second question the authors tried to answer is how sophisticated regular expressions tend to be in real-world DTDs and XSDs.
- For this analysis, the authors had to preprocess the documents:
  - DTD element definitions were converted to a canonical form; for example, <!ELEMENT lib ((book | journal)*)> was converted to (c1 | c2)*, keeping only the structural information of the DTD.
  - XSDs were preprocessed into the same canonical form using XSLT.
- For DTDs, a total of 11802 element definitions were reduced to 750 canonical forms; for XSDs, a total of 1016 element definitions were reduced to 138 canonical forms, for a reported total of 838 across both types of schema.
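The conversion to canonical form can be approximated as follows. This is only a guess at the procedure, not the authors' tool: the function names are hypothetical, element names are renamed to c1, c2, … in order of first appearance, keywords starting with # (such as #PCDATA) are left alone, and one redundant pair of outer parentheses is dropped so that the lib example comes out as on the slide.

```python
import re


def strip_outer(expr):
    """Drop outer parentheses as long as they enclose the whole expression."""
    expr = expr.strip()
    while expr.startswith("(") and expr.endswith(")"):
        depth = 0
        for i, ch in enumerate(expr):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0 and i < len(expr) - 1:
                    return expr  # the first "(" closes early: keep as is
        expr = expr[1:-1].strip()
    return expr


def canonical_form(content_model):
    """Rename element names to c1, c2, ... in order of first appearance."""
    names = {}

    def rename(match):
        name = match.group(0)
        if name.startswith("#"):  # keep keywords such as #PCDATA
            return name
        if name not in names:
            names[name] = f"c{len(names) + 1}"
        return names[name]

    return re.sub(r"#?[A-Za-z_][\w.\-]*", rename, strip_outer(content_model))


print(canonical_form("((book | journal)*)"))
```

Two content models that differ only in their element names map to the same canonical form, which is how many element definitions can collapse into far fewer forms.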

6. Regular Expression Characterization (cont’d)
Definition 4. A base symbol is a regular expression a, a?, or a*, where a ∈ Σ; a factor is of the form e, e?, or e*, where e is a disjunction of base symbols. A simple regular expression is ε, ∅, or a sequence of factors, such as (a* + b*)(a + b)?b*(a + b)*.

The authors introduce a uniform syntax to denote subclasses of simple regular expressions by specifying the allowed factors. They distinguish base symbols extended by ? or *. Further, they distinguish between factors with one disjunct and factors with arbitrarily many disjuncts; the latter are denoted by (+…). Finally, factors can again be extended by * or ?. For example, they write RE((+a)*, a?) for the set of regular expressions e1…en where every ei is either (a1 + … + an)* for some a1, …, an ∈ Σ and n ≥ 1, or a? for some a ∈ Σ.

6. Regular Expression Characterization (cont’d)
The following table lists the possible factors in simple regular expressions and how they are denoted (a, a1, …, an ∈ Σ).

Factor                  Abbr.
a                       a
a*                      a*
a?                      a?
(a1 + … + an)           (+a)
(a1 + … + an)*          (+a)*
(a1 + … + an)?          (+a)?
(a1* + … + an*)         (+a*)
(a1* + … + an*)*        (+a*)*

Table 2

6. Regular Expression Characterization (cont’d)
- The authors analyzed the DTDs and XSDs to characterize their content models according to the subclasses defined above.
- The results are shown in Table 3, which lists the non-overlapping categories of expressions having a significant population (more than 0.5%).
- There are two major differences between DTDs and XSDs:
  - XSDs have more simpleType elements (#PCDATA). This may be because XSD introduces more distinct simple types, which makes it possible to fine-tune the specification of an element's content.
  - XSDs have fewer expressions in the category RE(a, (+a)*). This is most probably due to the nature of the XSDs in the sample, in which schemas describing data are overrepresented relative to those describing meta documents.

6. Regular Expression Characterization (cont’d)

                            DTDs (%)   XSDs (%)
#PCDATA                        34         48
EMPTY                          16         10
ANY                             1          0
RE(a)                           5          5
RE(a, a?)                       2         10
RE(a, a*)                       8         10
RE(a, a?, a*)                   1          4
RE(a, (+a))                     3          3
RE(a, (+a)?)                    0          1
RE(a, (+a)*)                   20          2
RE(a, (+a)?, (+a)*)             0          1
RE(a, (+a*)*)                   0          2
Total simple expressions       92         97
Non-simple expressions          8          3

Table 3: Relative occurrence of various types of regular expressions, given in % of element definitions

6. Regular Expression Characterization (cont’d)
- The authors compared DTDs and XSDs using several measures but did not observe any significant differences between them. More importantly, the comparisons make clear that the vast majority of expressions are simple, both in DTDs (92%) and in XSDs (97%).
- Some of the comparisons they carried out are:
  - Density
  - Width and depth of the canonical form
  - Simple content models
  - Star height

6. Regular Expression Characterization (cont’d)
The density of a schema is defined as the number of elements occurring in the right-hand sides of its rules divided by the number of elements.
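The density measure can be computed in a few lines. The sketch assumes one reading of the definition: every occurrence of an element name on a right-hand side is counted, and the divisor is the number of rules (one per element). The dictionary STORE encodes the store DTD from earlier, with an empty string standing for #PCDATA.

```python
import re


def density(rules):
    """rules: element name -> content model (right-hand side) as a string."""
    occurrences = sum(
        len(re.findall(r"[A-Za-z_]\w*", rhs)) for rhs in rules.values()
    )
    return occurrences / len(rules)


STORE = {
    "store": "dvd+",
    "dvd": "title price discount?",
    "title": "",      # assumption: #PCDATA contributes no element occurrences
    "price": "",
    "discount": "",
}

print(density(STORE))  # 4 occurrences over 5 elements
```

Under this reading, a schema made mostly of text-only elements has a low density, while one with long content models has a high one.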

6. Regular Expression Characterization (cont’d)
The table for this slide relates the fraction of DTDs and XSDs to the fraction of their content models that are simple: the majority of documents have 90% or more simple content models.

6. Regular Expression Characterization (cont’d)
- The star height of a regular expression is the maximum nesting depth of Kleene stars occurring in the expression. Content models with star height larger than 1 are very rare.
- The larger share of expressions with star height 1 in DTDs is due to the abundance of RE(a, (+a)*) expressions in DTDs compared with XSDs.

Star height   DTDs (%)   XSDs (%)
0                61         78
1                38         17
2                 1          4
3                 0          0

Table 4: Star height observed in DTDs and XSDs
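Star height as defined above can be computed with a single left-to-right scan. The following sketch uses an assumed function name and a simplified input syntax: names, parentheses, | and ?, with * as the only iteration operator. The + of DTD syntax is deliberately not handled.

```python
def star_height(expr):
    """Maximum nesting depth of * in a parenthesised regular expression."""
    stack = [0]  # highest star height seen so far in each open group
    last = 0     # star height of the most recently completed item
    i = 0
    while i < len(expr):
        ch = expr[i]
        if ch == "(":
            stack.append(0)
        elif ch == ")":
            last = stack.pop()  # the group is as high as its highest item
            stack[-1] = max(stack[-1], last)
        elif ch == "*":
            last += 1           # a star adds one level to the item before it
            stack[-1] = max(stack[-1], last)
        elif ch.isalnum():
            # consume the rest of the name; a bare name has height 0
            while i + 1 < len(expr) and expr[i + 1].isalnum():
                i += 1
            last = 0
        # '|', '?', ',' and whitespace do not change the star height
        i += 1
    return stack[0]


print(star_height("(c1 | c2)*"), star_height("(c1* | c2)*"))
```

An expression such as (c1 | c2)* falls in the category RE(a, (+a)*) and has star height 1, which matches the explanation given for the DTD column.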

7. Schema and Ambiguity
- The XML 1.0 specification by the W3C requires schema definitions to be deterministic, or one-unambiguous.
- The authors checked whether the DTDs and XSDs in the study respect this requirement, using IBM's XML Schema Quality Checker (SQC).
- They found that almost all of them follow the rule.
- Only 3 of the 93 XSDs have one or more ambiguous content models, which take two canonical forms: c1?(c1|c2)* and (c1c2)|(c1c3).

7. Schema and Ambiguity (cont’d)
- For DTDs, the first exception is a regular expression of the type (… | ci | … | ci | …)*. The authors consider it a typo rather than a design feature.
- The second type of ambiguous regular expression is of the type c1c2?c2?. The designer's intention was clearly to state that c2 may occur zero, one, or two times.
- This illustrates a shortcoming of DTDs that has been addressed in XSDs, as in the following example:

<xsd:sequence>
  <xsd:element name="c1" type="t1"/>
  <xsd:element name="c2" type="t2" minOccurs="0" maxOccurs="2"/>
</xsd:sequence>
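That the ambiguous expression c1c2?c2? and the bounded form with minOccurs="0" and maxOccurs="2" accept the same sequences can be checked by brute force. In this sketch the single letters a and b stand for c1 and c2, Python's re module stands in for content-model matching, and the comparison only covers words up to a fixed length; all of these are simplifications made for illustration.

```python
import re
from itertools import product

DTD_STYLE = re.compile(r"ab?b?")    # c1 c2? c2?  (a = c1, b = c2)
XSD_STYLE = re.compile(r"ab{0,2}")  # c1, then c2 between 0 and 2 times


def same_language(p, q, alphabet="ab", max_len=5):
    """Compare two patterns on every word over the alphabet up to max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if bool(p.fullmatch(word)) != bool(q.fullmatch(word)):
                return False
    return True


accepted = [w for w in ("a", "ab", "abb", "abbb", "b")
            if XSD_STYLE.fullmatch(w)]
print(same_language(DTD_STYLE, XSD_STYLE), accepted)
```

The two forms describe the same content, but only the first forces a parser to guess which c2? a single c2 belongs to; the bounded form states the intent directly.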

8. Errors
The authors found a number of errors in the XSDs they retrieved:
- Only 30 of the 93 XSDs passed a conformance test by SQC, that is, complied with the W3C specifications.
- 19 XSDs were designed according to a version of the specification older than the 2001 specs. Some simple types were omitted or added from one version of the specs to another, causing SQC to report errors.
- Some errors concern violations of the Datatypes part of the specs, such as a regular expression wrongfully restricting xsd:string.
- Some XSDs violate the specs by specifying a type attribute for a complexType element, or by leaving out the name attribute of a top-level complexType element.

9. Conclusion
- Many features defined in the XML Schema specification are not widely used yet, especially those related to object-oriented data modeling, such as derivation of complex types by extension.
- The expressive power of the XSDs under investigation is almost equivalent to that of DTDs, which means that, apart from a few exceptions, these XSDs could just as well have been written as DTDs. This may indicate that the level of sophistication offered by XSDs is not necessary for most applications, at least so far.

9. Conclusion (cont’d)
- The data type part of the XML Schema specs is heavily used, since it alleviates a major shortcoming of DTDs, namely the inability to specify the format and type of the text of an element. In XSDs this is accomplished by restricting a simple type. Example:

<xs:element name="letter">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="[a-z]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

- The content models specified in both DTDs and XSDs tend to be very simple. For XSDs, 97% of all content models can be classified as simple expressions.
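The effect of the pattern facet in the letter example can be imitated with Python's re module. This is only an approximation: XSD regular expressions and Python regular expressions are different dialects, although they agree on the simple class [a-z]. The function name valid_letter is made up for this sketch.

```python
import re

LETTER = re.compile(r"[a-z]")


def valid_letter(text):
    """Accept exactly the values allowed by the pattern [a-z]."""
    # An XSD pattern must match the whole value, so fullmatch is used
    # rather than search or match.
    return LETTER.fullmatch(text) is not None


print([valid_letter(value) for value in ("q", "Q", "ab", "")])
```

A DTD can only declare the element as #PCDATA, so all four values would be accepted there.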

10. References
1. Bex, G. J., Neven, F. and Bussche, J. V., DTDs versus XML Schema: A Practical Study, In Proceedings of the Seventh International Workshop on the Web and Databases, WebDB 2004, pages 79-84, Maison de la Chimie, Paris, France, June 17-18 2004.
2. http://www.webopedia.com/TERM/D/DTD.html
3. http://searchwebservices.techtarget.com/sDefinition/0,,sid26_gci831325,00.html
4. http://en.wikipedia.org/wiki/XML_Schema
5. http://www.w3schools.com/schema/default.asp
6. http://www.w3schools.com/dtd/dtd_intro.asp
7. IBM Corp. XML Schema Quality Checker, 2003, http://www.alphaworks.ibm.com/tech/xmlsqc
8. R. Cover. The cover pages, 2003, http://xml.coverpages.org/
9. P. Biron and A. Malhotra, XML Schema part 2: datatypes. W3C, May 2001, http://www.w3.org/TR/xmlschema-2/
10. http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-01-02/03-01-02.pdf
