C# and Windows Programming
XML Processing
2
Contents
Markup
XML
DTDs
XML Parsers
DOM
3
Markup
When we write text, it is just text. For example:
  John Smith 123 Main St. Toronto Ontario
We can all read this and understand it
A computer cannot, and needs additional information
4
Markup
Markup is added to documents in the form of tags
A tag consists of text delimited by angle brackets
The name of the tag identifies the tag and the information it conveys
5
Markup
Let's add some semantic markup to our address
  <address>
    <name>John Smith</name>
    <street>123 Main St.</street>
    <city>Toronto</city>
    <province>Ontario</province>
  </address>
This identifies the information in the various parts of the address
6
Markup
You will notice
  Tags occur in pairs
    A start tag
    A matching end tag with a "/" before the tag name
  The text that the tags are describing is enclosed between the start tag and the end tag
  A single pair of tags (the root element) is placed around the entire document
Every start tag having a matching, properly nested end tag is a requirement for the document to be well-formed
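A quick way to see the pairing rule in action is to hand a string to a parser. The following is a minimal sketch, assuming C# 9 or later (top-level statements) and the standard System.Xml classes; the helper name IsWellFormed is made up for this example.

```csharp
using System;
using System.Xml;

// Hypothetical helper: true when the text parses as well-formed XML.
static bool IsWellFormed(string xml)
{
    try
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);            // throws XmlException on mismatched or missing tags
        return true;
    }
    catch (XmlException)
    {
        return false;
    }
}

string good = "<address><name>John Smith</name><city>Toronto</city></address>";
string bad = "<address><name>John Smith</city></address>";   // end tag does not match

Console.WriteLine(IsWellFormed(good));   // True
Console.WriteLine(IsWellFormed(bad));    // False
```

The parser only checks the shape of the document here; it knows nothing about what the tags mean.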
7
XML
XML is the latest in a long line of markup languages
It is the eXtensible Markup Language
Unlike other markup languages, you can define your own tags
Any meaning associated with those tags is imposed by your program
8
Uses of XML
SOAP
  Simple Object Access Protocol: a type of remote procedure call
Configuration files
Web services
Security information
Electronic document exchange
9
Defining Documents
If you can define your own tags, how do you know what should be in a document?
Document Type Definition
  This defines the allowable tags and their order
  It is similar to a BNF grammar
Schema
  Like a DTD, it describes the tags and their order
  It also describes the content which can be placed within the tags
10
XML Structure
Here is a simple XML document
  <?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
  <!DOCTYPE address SYSTEM "address.dtd">
  <address>
    <name>John Smith</name>
    <street>123 Main St.</street>
    <city>Toronto</city>
    <province>Ontario</province>
  </address>
11
Attributes
A tag can also have attributes which provide additional information about the tag
  <city size="large">Toronto</city>
A tag can have zero or more attributes
12
The XML Declaration
The first line is the optional XML declaration
It consists of
  <?xml
    Identifies this as the XML declaration
  version="1.0"
    The version of XML in the document
13
The XML Declaration
  encoding="ISO-8859-1"
    This is the character set used in the document
    Various character sets can be used, including Unicode (UTF-8), an international character set
  standalone="no"
    Determines if the document uses any external entities which are defined in other files
    This will be discussed later in the course
In general, the order of attributes is not important, but it is in the XML declaration
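The three parts of the declaration map directly onto one .NET call. This is a small sketch, assuming C# 9 or later and System.Xml; the values are the ones from the slide.

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();

// version, encoding, standalone: the same three parts, in the same fixed order
XmlDeclaration decl = doc.CreateXmlDeclaration("1.0", "ISO-8859-1", "no");
doc.AppendChild(decl);                          // the declaration must come first
doc.AppendChild(doc.CreateElement("address"));  // then the root element

Console.WriteLine(decl.Version);      // 1.0
Console.WriteLine(decl.Encoding);     // ISO-8859-1
Console.WriteLine(decl.Standalone);   // no
Console.WriteLine(doc.OuterXml);      // the declaration followed by the empty root element
```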
14
The DOCTYPE Declaration
The optional DOCTYPE declaration follows the XML declaration
  <!DOCTYPE address SYSTEM "address.dtd">
This declaration is required only if you want to validate the document against a definition of the tags in the document
15
The Root Element
This is the <address> element which begins the document
It is the first element in the document
It contains all other elements in the document
16
Elements
An element consists of a start tag, character data, and an end tag
  <name>John Smith</name>
A tag name must start with a letter or underscore
A tag name cannot contain spaces; colons are reserved for namespace prefixes
The end tag must match the start tag exactly, including case
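The naming rules can be checked in code. This sketch assumes C# 9 or later and uses XmlConvert.VerifyNCName, which checks a name that has no namespace prefix (so it rejects colons); the helper name IsValidTagName is made up for this example.

```csharp
using System;
using System.Xml;

// Hypothetical helper: true when the string is usable as an unprefixed tag name.
static bool IsValidTagName(string name)
{
    try
    {
        XmlConvert.VerifyNCName(name);   // throws when the name breaks the rules
        return true;
    }
    catch (XmlException)
    {
        return false;                    // bad first character, space, colon, ...
    }
    catch (ArgumentNullException)
    {
        return false;                    // null or empty name
    }
}

foreach (string candidate in new[] { "name", "_name", "1name", "my name", "a:b" })
{
    Console.WriteLine(candidate + " -> " + IsValidTagName(candidate));
}
```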
17
Mixed Content
If an element contains just text, it has simple content
  <name>John Smith</name>
If it contains a mix of text and elements, it is said to have mixed content
  <sentence>these are nested <adverb>correctly</adverb></sentence>
18
Attributes
Attributes are name-value pairs which can be added to elements
Attributes allow you to provide additional information without changing the tag itself
The names for attributes follow the same rules as tag names
Every attribute name within the same tag must be unique
19
Attributes
  <employee name="Jones">accountant</employee>
  <employee name="Smith">sales</employee>
Note that these both contain a name attribute
That is OK since the attributes are in separate elements
Attribute values are placed in either single or double quotes
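The attribute rules can be tried out with a parser. This is a minimal sketch, assuming C# 9 or later and System.Xml; the <staff> wrapper element is added here only so the two employees have a single root.

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml(
    "<staff>" +
    "<employee name=\"Jones\">accountant</employee>" +   // double quotes
    "<employee name='Smith'>sales</employee>" +          // single quotes work too
    "</staff>");

// The same attribute name in separate elements is fine
foreach (XmlElement employee in doc.DocumentElement.ChildNodes)
{
    string name = employee.GetAttribute("name");
    Console.WriteLine(name + ": " + employee.InnerText);
}

// The same attribute name twice in ONE tag is not well-formed
bool duplicateRejected = false;
try
{
    new XmlDocument().LoadXml("<employee name=\"Jones\" name=\"Smith\"/>");
}
catch (XmlException)
{
    duplicateRejected = true;
}
Console.WriteLine(duplicateRejected);   // True
```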
20
Comments
Comments are delimited by special brackets
  <!-- a comment -->
Comments can
  Add explanations
  Temporarily remove XML which is not needed for a while
21
Entities
The less than and greater than signs delimit tags
What if you want to type these symbols in a document and not have them delimit a tag?
Then, enter them as entities
To enter a less than sign
  &lt;
All entities are referenced using
  &
  The entity name
  ;
22
Entities
Entity   Symbol   Description
&lt;     <        Less than
&gt;     >        Greater than
&amp;    &        Ampersand
&quot;   "        Double quote
&apos;   '        Apostrophe
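A parser converts entities in both directions for you. The following sketch assumes C# 9 or later and System.Xml; the <expr> element and its content are invented for illustration.

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();

// Reading: entity references in the source become plain characters
doc.LoadXml("<expr>a &lt; b &amp;&amp; c &gt; d</expr>");
string decoded = doc.DocumentElement.InnerText;
Console.WriteLine(decoded);   // a < b && c > d

// Writing: plain characters are turned back into entity references
doc.DocumentElement.InnerText = "x < y & z";
string encoded = doc.DocumentElement.InnerXml;
Console.WriteLine(encoded);   // x &lt; y &amp; z
```

This is why a program rarely has to type the entities itself; it only matters when the XML is written by hand.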
23
CDATA
Sometimes using entities is not enough since you have many special characters to type
A CDATA section allows you to enter anything without having special characters interpreted
  <![CDATA[ any characters here ]]>
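Here is a short sketch of a CDATA section seen from the parser's side, assuming C# 9 or later and System.Xml; the <code> element and its content are made up for the example.

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<code><![CDATA[if (a < b && b > c) { swap(); }]]></code>");

// The section arrives as its own node type, with the characters untouched
XmlNode first = doc.DocumentElement.FirstChild;
Console.WriteLine(first.NodeType);   // CDATA
Console.WriteLine(first.Value);      // if (a < b && b > c) { swap(); }

// Creating one in code: nothing inside is escaped
XmlCDataSection section = doc.CreateCDataSection("1 < 2 & 3 > 2");
Console.WriteLine(section.OuterXml); // <![CDATA[1 < 2 & 3 > 2]]>
```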
24
Document Type Definitions
The DTD is one way to describe what should be in a valid XML document
There are other ways which we will examine later in the course
A DTD
  Describes each element and the elements which can occur within it
  Describes the attributes for each element
  Describes entities which can be used in the document
25
Person DTD
<!-- The DTD for person -->
<!DOCTYPE person [
  <!ELEMENT person (first, last, gender, employee-id)>
  <!ELEMENT first (#PCDATA)>
  <!ELEMENT last (#PCDATA)>
  <!ELEMENT gender (#PCDATA)>
  <!ELEMENT employee-id (#PCDATA)>
]>
26
Reading the DTD
There is an element person containing the elements
  first
  last
  gender
  employee-id
These elements are described below it
Each of these contains PCDATA, meaning parsed character data
  This means that these elements only contain text, not nested tags
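To see the DTD doing its job, a document can be validated against it. This is a sketch only, assuming C# 9 or later, a DTD placed inside the document (with the DOCTYPE name matching the root element, person), and made-up values for gender and employee-id; the helper name Validate is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper: returns the validation messages (empty list = valid).
static List<string> Validate(string xml)
{
    var errors = new List<string>();
    var settings = new XmlReaderSettings
    {
        DtdProcessing = DtdProcessing.Parse,   // allow the DOCTYPE to be read
        ValidationType = ValidationType.DTD    // and check the document against it
    };
    settings.ValidationEventHandler += (sender, e) => errors.Add(e.Message);

    using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
    {
        while (reader.Read()) { }              // validation happens while reading
    }
    return errors;
}

const string dtd =
    "<!DOCTYPE person [" +
    "<!ELEMENT person (first, last, gender, employee-id)>" +
    "<!ELEMENT first (#PCDATA)>" +
    "<!ELEMENT last (#PCDATA)>" +
    "<!ELEMENT gender (#PCDATA)>" +
    "<!ELEMENT employee-id (#PCDATA)>" +
    "]>";

string valid = dtd +
    "<person><first>John</first><last>Smith</last>" +
    "<gender>M</gender><employee-id>1234</employee-id></person>";

// Wrong order and missing elements: well-formed, but not valid
string invalid = dtd + "<person><last>Smith</last><first>John</first></person>";

Console.WriteLine(Validate(valid).Count);        // 0
Console.WriteLine(Validate(invalid).Count > 0);  // True
```

Note the difference from well-formedness: both documents parse, but only the first matches the DTD.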
27
XML Parsers
There are two types of XML parsers
DOM
  The Document Object Model
  This parses the document into a tree-like structure called a DOM
  The document is parsed all at once
SAX
  Simple API for XML
  This is a sequential parser which executes a callback when each part of the document is recognized
  This is good for very large documents since the entire document does not have to be in memory at once
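The .NET class library does not ship a SAX parser. Its sequential counterpart is XmlReader, which is forward-only like SAX, but your code asks for the next part instead of receiving a callback. The sketch below uses XmlReader as a stand-in for the sequential style, assuming C# 9 or later; it is not SAX itself.

```csharp
using System;
using System.IO;
using System.Xml;

string xml = "<address><name>John Smith</name><city>Toronto</city></address>";
int elementCount = 0;

using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
{
    // Only the current part of the document is held at any time
    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                elementCount++;
                Console.WriteLine("start " + reader.Name);
                break;
            case XmlNodeType.Text:
                Console.WriteLine("text  " + reader.Value);
                break;
            case XmlNodeType.EndElement:
                Console.WriteLine("end   " + reader.Name);
                break;
        }
    }
}

Console.WriteLine(elementCount);   // 3
```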
28
What is DOM?
DOM is an in-memory data structure
It describes an XML document as a tree structure
The nodes in the tree are described by the interface to them
This means that there can be many implementations that implement the interface
29
So, how do we make a document into a tree?
  <?xml version="1.0"?>
  <friend>
    <handle degree="close">
      Harold
    </handle>
  </friend>
The resulting tree (node kinds: Root, Element, Text, Attribute)
  Document (Root)
    friend (Element)
      whitespace (Text)
      handle (Element)
        degree (Attribute)
          close
        Harold (Text)
      whitespace (Text)
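The tree can be printed by walking it recursively. This is a minimal sketch, assuming C# 9 or later and System.Xml; the Walk function is made up for this example and the XML declaration is left out for brevity. One .NET detail: XmlDocument drops whitespace-only nodes unless PreserveWhitespace is set to true, and when kept they are reported with the node type Whitespace.

```csharp
using System;
using System.Xml;

// Hypothetical helper: prints one line per node, indented by depth.
static void Walk(XmlNode node, int depth)
{
    string indent = new string(' ', depth * 2);
    string value = node.Value == null ? "" : " \"" + node.Value.Trim() + "\"";
    Console.WriteLine(indent + node.NodeType + " " + node.Name + value);

    // Attributes are not children, so they are listed separately
    if (node.Attributes != null)
    {
        foreach (XmlAttribute attr in node.Attributes)
        {
            Console.WriteLine(indent + "  Attribute " + attr.Name + " = " + attr.Value);
        }
    }

    foreach (XmlNode child in node.ChildNodes)
    {
        Walk(child, depth + 1);
    }
}

var doc = new XmlDocument { PreserveWhitespace = true };   // keep whitespace nodes
doc.LoadXml("<friend>\n  <handle degree=\"close\">\n    Harold\n  </handle>\n</friend>");
Walk(doc, 0);
```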
30
Nodes
All nodes in a DOM implement the Node interface
All other interfaces in the tree extend the Node interface
This means that every node can be treated as a Node, and maybe more
31
XmlNode
Represents every node in the DOM
Properties
  ParentNode
  Name
  FirstChild
  NextSibling
  PreviousSibling
  Value
32
XmlNode
Methods
  InsertBefore()
  AppendChild()
  RemoveChild()
  Clone()
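The four methods can be tried on a small tree. This is a sketch, assuming C# 9 or later and System.Xml; the address content is borrowed from the earlier slides.

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<address><street>123 Main St.</street></address>");

XmlNode root = doc.DocumentElement;
XmlNode street = root.FirstChild;

// InsertBefore: put <name> in front of <street>
XmlNode name = doc.CreateElement("name");
name.InnerText = "John Smith";
root.InsertBefore(name, street);

// AppendChild: add <city> as the last child
XmlNode city = doc.CreateElement("city");
city.InnerText = "Toronto";
root.AppendChild(city);

// Clone: a copy of the node and its children, not attached to any parent
XmlNode copy = city.Clone();

// RemoveChild: detach <street> from the tree
root.RemoveChild(street);

Console.WriteLine(root.OuterXml);
// <address><name>John Smith</name><city>Toronto</city></address>
Console.WriteLine(copy.ParentNode == null);   // True
```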
33
XmlDocument
The node above the root node of the document
Can be used to represent an empty document
Properties
  DocumentElement
Methods
  CreateElement()
  CreateTextNode()
  GetElementsByTagName()
  Load()
  Save()
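Starting from an empty XmlDocument, a whole document can be built with these methods. This is a sketch only, assuming C# 9 or later; the AddField helper is made up, and a StringWriter stands in for a file so the example needs nothing on disk.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper: appends <tag>text</tag> under parent.
static void AddField(XmlDocument owner, XmlElement parent, string tag, string text)
{
    XmlElement element = owner.CreateElement(tag);
    element.AppendChild(owner.CreateTextNode(text));
    parent.AppendChild(element);
}

var doc = new XmlDocument();                 // an empty document
XmlElement address = doc.CreateElement("address");
doc.AppendChild(address);                    // becomes DocumentElement

AddField(doc, address, "name", "John Smith");
AddField(doc, address, "street", "123 Main St.");
AddField(doc, address, "city", "Toronto");
AddField(doc, address, "province", "Ontario");

// Save writes the tree out as text
var writer = new StringWriter();
doc.Save(writer);
Console.WriteLine(writer.ToString());

// Load reads it back into a new tree
var copy = new XmlDocument();
copy.Load(new StringReader(writer.ToString()));
Console.WriteLine(copy.GetElementsByTagName("city")[0].InnerText);   // Toronto
```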
34
XmlElement
This represents an element
An element can have attributes
Properties
  XmlAttributeCollection Attributes
Methods
  GetElementsByTagName()
  SetAttribute(string name, string value)
  string GetAttribute(string name)
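A short sketch of the XmlElement members in use, assuming C# 9 or later and System.Xml; it reproduces the size="large" example from the Attributes slide.

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<address><city>Toronto</city></address>");

// GetElementsByTagName also works on an element, searching below it
XmlElement city = (XmlElement)doc.DocumentElement.GetElementsByTagName("city")[0];

city.SetAttribute("size", "large");
Console.WriteLine(city.GetAttribute("size"));            // large
Console.WriteLine(city.GetAttribute("missing") == "");   // True: empty string, not null
Console.WriteLine(city.Attributes.Count);                // 1
Console.WriteLine(city.OuterXml);                        // <city size="large">Toronto</city>
```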
35
XmlAttribute
This is an attribute
Can have either Text nodes or EntityReferences as children
Name property gets the name
Value gets the value
36
XmlText
This is the node representing text
The text has no markup
Even whitespace is represented as a text node
37
CDATASection Interface
This is a CDATA section
It is similar to a text node but the content undergoes no interpretation
38
Other Node Subinterfaces
Comment
Notation
Entity
EntityReference
ProcessingInstruction
These are all just the same as in XML
39
Other Node Subinterfaces
DocumentFragment
  Part of a document tree which can be inserted into another tree
DOMImplementation
  Provides capabilities of the implementation
  Has the method for creating a document
40
Other Node Subinterfaces
DOMException
  Something went wrong
NodeList
  A list of nodes which has an iterator
NamedNodeMap
  A map structure holding a collection of nodes
41
Common .NET DOM Classes
XmlNode
XmlDocument
XmlElement
XmlText
XmlAttribute
42
XmlNodeList
A list of nodes
Returned by GetElementsByTagName()
Properties
  Count -- number of nodes in the list
  Indexer -- retrieves a node
Methods
  Item(int n) -- retrieves a node
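A small sketch of XmlNodeList in use, assuming C# 9 or later and System.Xml; the <staff> wrapper is added here so the employee elements from the earlier slide have a root.

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml(
    "<staff>" +
    "<employee name=\"Jones\">accountant</employee>" +
    "<employee name=\"Smith\">sales</employee>" +
    "</staff>");

XmlNodeList employees = doc.GetElementsByTagName("employee");

Console.WriteLine(employees.Count);               // 2
Console.WriteLine(employees[0].InnerText);        // accountant (indexer)
Console.WriteLine(employees.Item(1).InnerText);   // sales (Item method)

// The list can also be used in a foreach loop
foreach (XmlNode employee in employees)
{
    Console.WriteLine(employee.Attributes["name"].Value);
}
```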
43
XmlNamedNodeMap
A map of nodes indexed by name
Superclass of XmlAttributeCollection
Returned by the Attributes property
Properties
  Count
Methods
  Item(int n)
  GetNamedItem(string name)
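The map can be read by position or by name. This is a sketch, assuming C# 9 or later and System.Xml; the province attribute is invented here so that the map holds more than one entry.

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<city size=\"large\" province=\"Ontario\">Toronto</city>");

// Attributes is an XmlAttributeCollection, used here through its superclass
XmlNamedNodeMap attrs = doc.DocumentElement.Attributes;

Console.WriteLine(attrs.Count);                            // 2
Console.WriteLine(attrs.Item(0).Name);                     // size (by position)
Console.WriteLine(attrs.GetNamedItem("province").Value);   // Ontario (by name)
Console.WriteLine(attrs.GetNamedItem("missing") == null);  // True

foreach (XmlNode attr in attrs)
{
    Console.WriteLine(attr.Name + " = " + attr.Value);
}
```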
44
Examples
* see NodeLister
* see DocBuilder