2
What is XML
• XML stands for Extensible Markup Language
– the World Wide Web Consortium (W3C) directs the effort
• XML isn't a markup language like HTML, but rather a system for defining other markup languages.
• XML is a common syntax for expressing structure in data, and as a result a way for others to define new tags
– whereas the <H1> tag in HTML specifies text to be presented in a certain typeface and weight, an XML tag explicitly identifies the kind of information it surrounds:
– an <AUTHOR> tag might identify the author of a document
– a <PRICE> tag could contain an item's cost in an inventory list
3
SGML, XML and HTML
• The parent of HTML and XML is Standard Generalized Markup Language (SGML), an ISO standard for electronic document exchange
• SGML competes with other standards, mainly de facto standards, like Adobe PDF (Acrobat), Microsoft RTF (Rich Text Format) and popular word processor file formats like Microsoft Word.
• Both XML and HTML are document formats derived from SGML.
– Thus they all share certain characteristics, such as a similar syntax and the use of bracketed tags.
– But HTML is an application of SGML, whereas XML is a subset of SGML.
• XML documents can be read by any SGML authoring or viewing tool.
• XML is less complex than SGML, and it is designed to work across a limited-bandwidth network such as the Internet.
4
Why Are Developers Excited about XML?
• Domain-Specific Markup Languages
– A DTD precisely describes the format
– DTDs verify that documents adhere to the format
– Ensures interoperability of unrelated tools
• Self-Describing Data
– DTDs explain the format so reverse engineering isn't as necessary
– Comments in DTDs can go even further
<!-- This should be a four digit year like "1999", not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>
• Interchange of Data Among Applications
– E-commerce and syndication
– DTDs make sure that two independent applications speak the same language
– DTDs detect malformed data
– DTDs verify correct data
• Structured and Integrated Data
– Can specify relationships between elements using element declarations
– Can assemble data from multiple sources using external entity references declared in the DTD
5
XML Applications
• Chemical Markup Language (CML)
– Jumbo: the first general-purpose XML browser
– Assigns each XML element to a Java class that knows how to render that element
– http://www.xml-cml.org
• Mathematical Markup Language (MathML)
– The Amaya browser
• Synchronized Multimedia Integration Language (SMIL)
• Scalable Vector Graphics (SVG)
• MusicML
• FoodWebML, GuiML
6
A Song Description in HTML
<dt>Hot Cop
<dd> by Jacques Morali, Henri Belolo, and Victor Willis
<ul>
<li>Producer: Jacques Morali
<li>Publisher: PolyGram Records
<li>Length: 6:20
<li>Written: 1978
<li>Artist: Village People
</ul>
7
A Song Description in XML
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<SONG LENGTH="6:20">
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
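Because each tag names the information it holds, a program can ask for fields by name instead of scraping list items. The sketch below is one way to do that with Python's standard-library xml.etree.ElementTree; Python and the variable names are our choice for illustration, not something the slides prescribe.

```python
import xml.etree.ElementTree as ET

# The song document from the slide (XML declaration omitted for brevity).
SONG_XML = """<SONG LENGTH="6:20">
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>"""

song = ET.fromstring(SONG_XML)

# Fields are looked up by element or attribute name.
title = song.findtext("TITLE")
composers = [c.text for c in song.findall("COMPOSER")]
length = song.get("LENGTH")
year = int(song.findtext("YEAR"))
```

The HTML version on the previous slide offers no comparable handle: "Producer" and "Publisher" are just text inside <li> elements.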
9
Attaching style sheets to documents
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<SONG LENGTH="6:20">
<TITLE>Hot Cop</TITLE>
<COMPOSER>Jacques Morali</COMPOSER>
<COMPOSER>Henri Belolo</COMPOSER>
<COMPOSER>Victor Willis</COMPOSER>
<PRODUCER>Jacques Morali</PRODUCER>
<PUBLISHER>PolyGram Records</PUBLISHER>
<YEAR>1978</YEAR>
<ARTIST>Village People</ARTIST>
</SONG>
Using CSS – simpler, but limited
10
Well-formedness
• All XML documents must be well-formed
• Well-formedness rules:
– Open and close all tags
– Empty tags end with />
– There is a unique root element
– Elements may not overlap
– Attribute values are quoted
– < and & are only used to start tags and entities
• Parsers are required to reject malformed documents.
• This improves compatibility and interoperability.
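To see a parser rejecting malformed input, the following sketch feeds a few small documents to Python's standard-library ElementTree parser (our choice of tool; the helper name is_well_formed and the sample strings are illustrative). It checks well-formedness only, not validity.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One small document per rule; names describe what each one exercises.
checks = {
    "closed tags": "<p>The quick brown fox</p>",
    "unclosed tag": "<p>The quick brown fox",
    "empty tag": "<doc><BR/></doc>",
    "two roots": "<a></a><b></b>",
    "overlap": "<b><i>text</b></i>",
    "quoted attribute": '<DIV ALIGN="CENTER"></DIV>',
    "unquoted attribute": "<DIV ALIGN=CENTER></DIV>",
    "bare ampersand": "<H1>O'Reilly & Associates</H1>",
}
results = {name: is_well_formed(doc) for name, doc in checks.items()}
```

Only "closed tags", "empty tag" and "quoted attribute" come back True; every other entry breaks one of the rules and is rejected.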
11
Well-formedness Rules
• Open and close all tags
• Empty tags end with />
• There is a unique root element
• Elements may not overlap
• Attribute values are quoted
• < and & are only used to start tags and entities
• Only the five predefined entity references are used
12
What is a Document Type Definition
• A Document Type Definition (DTD) is a set of syntax rules for tags. It tells you
– what tags you can use in a document,
– what order they should appear in,
– which tags can appear inside other ones,
– which tags have attributes, and so on.
• Originally developed for use with SGML, a DTD can be part of an XML document, but it's usually a separate document or series of documents.
• Because XML is not a language itself, but rather a system for defining languages, it doesn't have a universal DTD the way HTML does. Instead, each industry or organization that wants to use XML for data exchange can define its own DTDs.
• If an organization uses XML to tag documents for internal use only, it can create its own private DTD.
13
Validity
• To be valid, an XML document must
1. Be well-formed
2. Have a Document Type Definition (DTD)
3. Comply with the constraints specified in the DTD
14
Validity is not always sufficient
• DTDs cannot specify anything about the text content of an element, for example:
– That an element must contain a number
– That an element must contain a date
– That a date must be between 1970 and 2001
– etc.
• Custom validation layers can sit on top of XML validation
• Schemas will add this
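A custom validation layer can be as small as one function that runs after parsing. The sketch below is a hypothetical check_year helper (the name, the default 1970-2001 range taken from the bullet above, and the sample documents are assumptions for illustration) that enforces what the DTD comment on YEAR can only ask for politely.

```python
import xml.etree.ElementTree as ET

def check_year(song, low=1970, high=2001):
    """Return a list of problems with the song's YEAR element (empty if none)."""
    year = song.find("YEAR")
    if year is None:
        # YEAR is optional in the song DTD, so its absence is not an error.
        return []
    text = (year.text or "").strip()
    if not (len(text) == 4 and text.isdigit()):
        return ["YEAR must be a four-digit number, got %r" % text]
    if not low <= int(text) <= high:
        return ["YEAR %s is outside %d-%d" % (text, low, high)]
    return []

good = ET.fromstring("<SONG><YEAR>1978</YEAR></SONG>")
short = ET.fromstring("<SONG><YEAR>78</YEAR></SONG>")
late = ET.fromstring("<SONG><YEAR>2024</YEAR></SONG>")
```

All three documents would pass DTD validation against <!ELEMENT YEAR (#PCDATA)>; only the extra layer tells them apart.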
15
XML Schemas
• an XML-based syntax, or schema, for defining how an XML document is marked up.
• recommended by Microsoft as an alternative to Document Type Definition (DTD)
• DTDs have many drawbacks, including the use of non-XML syntax, no support for data-typing, and non-extensibility.
• XML Schema improves upon DTDs in several ways, including the use of XML syntax, and support for data-typing and namespaces.
• For example, an XML Schema allows you to specify an element as an integer, a float, a boolean, a URL, etc.
• The XML parser in Internet Explorer 5 can validate an XML document with both a DTD and an XML Schema.
16
How to process XML? Java Parsers
• DOM Parser – tree structure
• SAX Parser – event driven approach
• A DOM parser can make use of a SAX parser to parse the input and then create a tree structure from the events
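The slide is about Java parsers, but the same two models exist in other languages. As a rough illustration, the sketch below uses Python's standard-library xml.sax and xml.dom.minidom (a substitution on our part); the TagLogger class and the sample document are made up for the example.

```python
import xml.dom.minidom
import xml.sax

DOC = "<catalog><composer><name>Julie Mandel</name></composer></catalog>"

# SAX: the parser calls handler methods as it meets each piece of markup.
class TagLogger(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)

handler = TagLogger()
xml.sax.parseString(DOC.encode("utf-8"), handler)

# DOM: the whole document is built into a tree that can be navigated.
dom = xml.dom.minidom.parseString(DOC)
name_node = dom.getElementsByTagName("name")[0]
name_text = name_node.firstChild.data
```

SAX keeps nothing in memory unless the handler stores it, which suits large documents; DOM costs more memory but allows random access to any node.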
17
DTDs – Content Definitions
• Content model definitions describe what may be contained in an instance of an element
– names of allowed or forbidden elements
– DTD entities
– document text
• syntax for expressing content is a form of regular expressions:
– (…) delimits a group
– A | B either A or B
– A, B A followed by B
– A & B A and B in any order (SGML only; XML DTDs do not allow the & connector)
– A? A occurs zero or one time
– A* A occurs zero or more times
– A+ A occurs one or more times
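Since content models are a form of regular expression, an element-only model can be translated almost mechanically into an ordinary regex over the names of an element's children. The sketch below does that in Python; model_to_regex and children_match are hypothetical helpers, and the translation deliberately ignores #PCDATA, mixed content and the SGML-only & connector.

```python
import re
import xml.etree.ElementTree as ET

NAME = re.compile(r"[A-Za-z_][\w.\-]*")

def model_to_regex(model):
    """Turn an element-only content model into a regex over child names."""
    # Each child name becomes a group that matches "name;".
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group()) + ";)", model)
    # Commas mean "followed by", which is plain concatenation in a regex.
    return re.sub(r"[,\s]", "", pattern)

def children_match(element, model):
    """Check the element's immediate children against the content model."""
    children = "".join(child.tag + ";" for child in element)
    return re.fullmatch(model_to_regex(model), children) is not None

info = ET.fromstring(
    "<cataloging_info><abstract>a</abstract>"
    "<keyword>k1</keyword><keyword>k2</keyword></cataloging_info>"
)
no_keyword = ET.fromstring("<cataloging_info><abstract>a</abstract></cataloging_info>")
```

Here children_match(info, "(abstract, keyword+)") holds, while the same model rejects no_keyword because + requires at least one keyword. The ?, *, + and | operators pass through unchanged because they mean the same thing in both notations.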
18
Element Declarations
• Each tag must be declared in a <!ELEMENT> declaration.
• A <!ELEMENT> declaration gives the name and content model of the element
• The content model uses a simple regular expression-like grammar to precisely specify what is and isn't allowed in an element
19
Content Specifications
• ANY
– <!ELEMENT catalog ANY>
– A catalog can contain any child element and/or raw text (parsed character data)
• #PCDATA
– Parsed Character Data; i.e. raw text, no markup. For example,
– <year>1984</year>
– <!ELEMENT year (#PCDATA)>
• Sequences
• Choices
• Mixed Content
• Modifiers
• EMPTY
20
#PCDATA
There are a number of elements in the example document that only contain PCDATA:
<!ELEMENT category (#PCDATA)>
<!ELEMENT abstract (#PCDATA)>
<!ELEMENT keyword (#PCDATA)>
<!ELEMENT last_updated (#PCDATA)>
<!ELEMENT copyright (#PCDATA)>
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT middle_name (#PCDATA)>
<!ELEMENT last_name (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT instruments (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT length (#PCDATA)>
21
Comments in DTDs
• DTDs seem fundamentally more obfuscated than C.
• Comments can improve this by giving example elements
• Comments are the same as in HTML; e.g. <!-- Comment -->
<!-- e.g. "1999 New York Women Composers", not "Copyright 1999 New York Women Composers" -->
<!ELEMENT copyright (#PCDATA)>
22
Child Elements
<date><year>1994</year></date>
• To declare that a date element must have a year child:
<!ELEMENT date (year)>
23
Child Elements
• You only have to declare the immediate children
<maintainer email="[email protected]" url="http://www.macfaq.com/personal.html">
  <name>
    <first_name>Elliotte</first_name>
    <middle_name>Rusty</middle_name>
    <last_name>Harold</last_name>
  </name>
</maintainer>
<composer id="c1">
  <name>
    <first_name>Julie</first_name>
    <middle_name></middle_name>
    <last_name>Mandel</last_name>
  </name>
</composer>
• To declare that an element must have exactly one name child:
<!ELEMENT maintainer (name)>
<!ELEMENT composer (name)>
24
Sequences
<name>
  <first_name>Elliotte</first_name>
  <middle_name>Rusty</middle_name>
  <last_name>Harold</last_name>
</name>
• Separate multiple required child elements with commas; e.g.
<!ELEMENT name (first_name, middle_name, last_name)>
• A list of child elements separated by commas is called a sequence
25
More Sequences
• To use a sequence in an ELEMENT declaration:
– The element being described must have only child elements, no mixed content
– You must know the order of the child elements
– You must know the type of each child element
– You must know the number of child elements
– The number can be relaxed with wild cards (the ?, * and + suffixes)
26
One or More Children +
<cataloging_info>
  <abstract>Compositions by the members of New York Women Composers</abstract>
  <keyword>music publishing</keyword>
  <keyword>scores</keyword>
  <keyword>women composers</keyword>
  <keyword>New York</keyword>
</cataloging_info>
• The + suffix indicates that one or more of that element is required at that point
<!ELEMENT cataloging_info (abstract, keyword+)>
27
A DTD for Songs
<!ELEMENT SONG (TITLE, COMPOSER+, PRODUCER*, PUBLISHER*, LENGTH?, YEAR?, ARTIST+)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT COMPOSER (#PCDATA)>
<!ELEMENT PRODUCER (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ELEMENT LENGTH (#PCDATA)>
<!-- This should be a four digit year like "1999", not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>
<!ELEMENT ARTIST (#PCDATA)>
28
Internal DTDs
<?xml version="1.0"?>
<!DOCTYPE GREETING [
  <!ELEMENT GREETING (#PCDATA)>
]>
<GREETING>
Hello XML!
</GREETING>
29
Complete Example – Mail Message
Suppose we describe an email message as consisting of:
• a title;
• a header made of:
– the sender;
– the recipient;
– a subject;
• the body text made of:
– four paragraphs;
– quoted material;
<!--Mail System DTD-->
<!ELEMENT mail - - (head,body)>
<!ELEMENT head - O ((TO & FR) & SH?)>
<!ELEMENT body - O (p*)>
<!ELEMENT TO - O (#PCDATA)>
<!ELEMENT FR - O (#PCDATA)>
<!ELEMENT SH - O (#PCDATA)>
<!ELEMENT p - O ((#PCDATA|cite)*)>
<!ELEMENT cite - - (#PCDATA)>
The tags are <MAIL><HEAD><BODY><TO><FR><SH><P><CITE>
• <!-- starts a comment
• (head,body) is a group with body following head
• TO and FR must both appear, in either order; ? means SH is optional
• P may occur zero or more times
• This DTD uses SGML syntax: the "- -" and "- O" tag-omission flags and the & connector are not allowed in XML DTDs
32
Open and close all tags
• Good:
– <p>The quick brown fox jumped over the lazy dog</p>
– <li>A very <B>important</B> point</li>
– Copyright 1999 Ellis Horowitz<br></br>
• Bad:
– The quick brown fox jumped over the lazy dog<p>
– <li>A very <B>important point
– Copyright 1999 Ellis Horowitz<br>
33
Empty tags end with />
• <BR/>, <HR/>, and <IMG/> instead of <BR>, <HR>, and <IMG>
• Web browsers deal inconsistently with these
• Can use <BR></BR> <HR></HR> <IMG></IMG> instead
34
There is a unique root element
• One element completely contains all other elements of the document
• In HTML files this is the HTML element
• The XML declaration and xml-stylesheet processing instruction are not elements
35
Elements may not overlap
• If an element contains a start tag for an element, it must also contain the corresponding end tag
• Empty elements may appear anywhere
• Every non-root element has a parent element
36
Attribute values are quoted
• Good:
– <A HREF="http://metalab.unc.edu/xml/">
– <DIV ALIGN="CENTER">
– <EMBED SRC="minnesotaswale.aif" hidden="true">
• Bad:
– <A HREF=http://metalab.unc.edu/xml/>
– <DIV ALIGN=CENTER>
– <EMBED SRC=minnesotaswale.aif hidden=true>
– <EMBED SRC="minnesotaswale.aif" hidden>
37
< and & are only used to start tags and entities
• Good:
<H1>O'Reilly &amp; Associates</H1>
• Bad:
<H1>O'Reilly & Associates</H1>
• Good:
<CODE>for (int i = 0; i &lt;= args.length; i++ ) { </CODE>
• Bad:
<CODE>for (int i = 0; i <= args.length; i++ ) { </CODE>
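When text is generated by a program, the escaping is usually delegated to a library rather than done by hand. As a small sketch, Python's standard-library xml.sax.saxutils.escape (our choice of tool) replaces &, < and > in character data with the corresponding entity references:

```python
from xml.sax.saxutils import escape

# escape() rewrites &, < and > so the text is safe inside an element.
heading = "<H1>" + escape("O'Reilly & Associates") + "</H1>"
code = "<CODE>" + escape("for (int i = 0; i <= args.length; i++ ) {") + "</CODE>"
```

The resulting strings contain &amp; and &lt; in place of the raw characters, which is the "Good" form of each pair.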
38
Only the five predefined entity references are used
• Good:
– &amp;
– &lt;
– &gt;
– &quot;
– &apos;
• Bad:
– &copy;
– &reg;
– &tm;
– &alpha;
– &eacute;
– &nbsp;
– etc.
• DTDs loosen this restriction by allowing you to define new entities, even in an invalid document.
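The effect of declaring an entity can be checked with a non-validating parser. In the sketch below (Python's standard-library ElementTree, our choice of tool; the copyright element and the entity value are illustrative), the same &copy; reference is rejected without a declaration and accepted once an internal DTD subset defines it.

```python
import xml.etree.ElementTree as ET

# &copy; is not one of the five predefined entities, so this is rejected.
try:
    ET.fromstring("<copyright>&copy; 1999</copyright>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Declaring the entity in an internal DTD subset makes the reference legal.
declared = ET.fromstring(
    '<!DOCTYPE copyright [<!ENTITY copy "&#169;">]>'
    "<copyright>&copy; 1999</copyright>"
)

# The five predefined references need no declaration at all.
predefined = ET.fromstring("<t>&amp;&lt;&gt;&quot;&apos;</t>").text
```

Note that the second document declares only the entity, not the copyright element, so it is not valid; it is merely well-formed, which is all the parser checks here.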
44
A DTD for Songs
<!ELEMENT SONG (TITLE, COMPOSER+, PRODUCER*, PUBLISHER*, LENGTH?, YEAR?, ARTIST+)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT COMPOSER (#PCDATA)>
<!ELEMENT PRODUCER (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!-- This should be a four digit year like "1999", not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>
<!ELEMENT ARTIST (#PCDATA)>
<!ATTLIST SONG LENGTH CDATA #IMPLIED>
45
A Valid Song Document
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG LENGTH="6:20">
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
46
XSLT - XSL Transformations
XSL (eXtensible Stylesheet Language) consists of two parts: XSL Transformations and XSL Formatting Objects.
• An XSLT stylesheet is an XML document defining a transformation for a class of XML documents.
• A stylesheet separates contents and logical structure from presentation.
• XSLT is not intended as a completely general-purpose XML transformation language; it was designed for use with XSL Formatting Objects. Nevertheless, XSLT is generally useful.
The basic design: XSLT is declarative and based on pattern-matching and templates
51
Processing model
template rule = pattern + template
Construction of result tree fragment:
• the source tree is processed by processing the root
• a single node is processed by
1. finding the template rule with the best matching pattern
2. instantiating its template (creates fragment + continues processing recursively)
• a node list is processed by processing each node in order
current node: the node currently being processed
current node list: the node list currently being processed (used for evaluation context later)
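The recursion described above can be mimicked in a few lines. The following Python sketch is a toy, not XSLT: patterns are reduced to plain tag names, templates are ordinary functions, and the fallback rule only recurses into children (unlike XSLT's built-in rules, it does not copy text through). All names in it are hypothetical.

```python
import xml.etree.ElementTree as ET

def process(node, rules):
    """Process one node: pick its rule, then instantiate the template."""
    # Pattern matching is simplified to "rule keyed by tag name"; nodes
    # without a rule fall back to a default rule that just recurses.
    template = rules.get(node.tag, default_rule)
    return template(node, rules)

def process_list(nodes, rules):
    """Process a node list by processing each node in order."""
    return "".join(process(n, rules) for n in nodes)

def default_rule(node, rules):
    return process_list(list(node), rules)

# Hypothetical template rules that turn a SONG into an HTML fragment.
RULES = {
    "SONG": lambda n, r: "<dl>" + process_list(list(n), r) + "</dl>",
    "TITLE": lambda n, r: "<dt>" + n.text + "</dt>",
    "COMPOSER": lambda n, r: "<dd>" + n.text + "</dd>",
}

source = ET.fromstring(
    "<SONG><TITLE>Hot Cop</TITLE><COMPOSER>Jacques Morali</COMPOSER>"
    "<YEAR>1978</YEAR></SONG>"
)
result = process(source, RULES)
```

YEAR has no rule of its own, so it falls through to the default rule and contributes nothing to the result, while TITLE and COMPOSER are rewritten by their templates.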
61
A Blank Style Sheet
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/css" href="compositions1.css"?>
<catalog> ... </catalog>
62
The Default Rule
• Not every element needs a rule
• The root element should be at least display: block
catalog { font-family: New York, Times New Roman, serif;
font-size: 14pt; background-color: white; color: black; display: block }
63
A style rule for the category element
• Make it look like an H1 heading
category { display: block; font-family: Helvetica, Arial, sans-serif; font-size: 32pt; font-weight: bold; text-align: center}
catalog { font-family: New York, Times New Roman, serif; font-size: 14pt; background-color: white; color: black; display: block }
64
A style rule for the composer element
• Make it look like a level 2 head
• No need to style the first, middle, and last names separately
composer { display: block;
font-family: Helvetica, Arial, sans-serif;
font-size: 24pt; font-weight: bold;
text-align: left}
65
A style rule for the title element
composition title { display: block; font-family: Helvetica, Arial, sans-serif; font-size: 18pt; font-weight: bold; text-align: left}
66
Style Rules for composition children
composition * {display:list-item}
description {display: block}
67
Finished Style Sheet
category { display: block; font-family: Helvetica, Arial, sans-serif; font-size: 32pt; font-weight: bold; text-align: center}
catalog { font-family: New York, Times New Roman, serif; font-size: 14pt; background-color: white; color: black; display: block }
composer { display: block; font-family: Helvetica, Arial, sans-serif; font-size: 24pt; font-weight: bold; text-align: left}
composition title { display: block; font-family: Helvetica, Arial, sans-serif; font-size: 18pt; font-weight: bold; text-align: left}
composition * {display:list-item}
description {display: block}
/* cataloging_info is only for search engines */
cataloging_info { display: none; color: #FFFFFF}
last_updated, copyright, maintainer {display: block; font-size: small}
copyright:before {content: "Copyright " }
last_updated:before {content: "Last Modified " }
last_updated {margin-top: 2ex }