Extensible Markup Language: XML
• HTML: widely supported markup language for formatting data
• XML: widely supported markup language for describing data
• XML is quickly becoming a standard for data exchange between applications
Root element contains all other document elements
Optional XML declaration includes version information parameter (MUST be very first line of file)
Because of the nice <tag>.. </tag> structure, the data can be viewed as organized in a tree:
article
  title
  date
  author
    firstName
    lastName
  summary
  content
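A minimal sketch of the two well-formedness rules above (declaration on the very first line, a single root element), using the standard library's xml.dom.minidom. The helper name is_well_formed and the tiny test strings are made up for illustration.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if the parser accepts text as an XML document."""
    try:
        parseString(text)
        return True
    except ExpatError:
        return False


good = '<?xml version = "1.0"?>\n<article><title>Simple XML</title></article>'
late_declaration = "\n" + good         # declaration is no longer the first line
two_roots = "<article/><article/>"    # no single root element

results = [is_well_formed(t) for t in (good, late_declaration, two_roots)]
print(results)
```

Only the first string should be accepted by the parser.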
<?xml version = "1.0"?>
<!-- I-sequence structured as XML. -->
<SEQUENCEDATA>
<TYPE>dna</TYPE>
<SEQ>
<NAME>Aspergillus awamori</NAME>
<ID>U03518</ID>
<DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggccca
acctcccatccgtgtctattgtaccctgttgcttcgg
cgggcccgccgcttgtcggccgccgggggggcgcctctg
ccccccgggcccgtgcccgccggagaccccaacacgaac
actgtctgaaagcgtgcagtctgagttgattgaatgcaat
cagttaaaactttcaacaatggatctcttggttccggc
</DATA>
</SEQ>
</SEQUENCEDATA>
An I-sequence might be structured as XML like this (the second line is a comment). As a tree:
SEQUENCEDATA
  TYPE
  SEQ
    NAME
    ID
    DATA
XML is standard: Parsers exist already!
Standard browsers can format XML documents nicely!
Each parent element/node can be expanded (plus sign) and collapsed (minus sign)
Python offers a Document Object Model parser!
• A DOM parser returns the whole XML document represented as a tree
• All nodes have name (of tag) and value (data)
• Text (including whitespace) represented in nodes with tag name #text
article
  #text
  title
    #text  Simple XML
  #text
  date
    #text  Dec..2001
  #text
  author
    #text
    firstName
      #text  John
    #text
    lastName
      #text  Doe
    #text
  #text
  summary
    #text  XML..easy.
  #text
  content
    #text  In this..XML.
  #text
deitel_fig16_04revised.py
Parse XML document and load data into variable document
documentElement attribute refers to root node
nodeName refers to element’s tag name
Various node attributes:
firstChild
nextSibling
nodeValue
parentNode
NB: Changes since book!
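The annotations above can be tried out with a sketch like the following. It is not the revised book figure: it assumes the standard library's xml.dom.minidom as the parser and uses a shortened stand-in for the article document (title and author only), parsed from a string instead of a file.

```python
from xml.dom.minidom import parseString

# Shortened stand-in for the article document (assumption, not the book's file)
ARTICLE = """<?xml version = "1.0"?>
<article>
   <title>Simple XML</title>
   <author>
      <firstName>John</firstName>
      <lastName>Doe</lastName>
   </author>
</article>
"""

# Parse XML document and load data into variable document
document = parseString(ARTICLE)

# documentElement refers to the root node; nodeName is the tag name
root = document.documentElement
print("Here is the root element of the document:", root.nodeName)

# Whitespace between tags shows up as #text children
child_names = [node.nodeName for node in root.childNodes]
print("The following are its child elements:", child_names)

first = root.firstChild            # whitespace #text node
title = first.nextSibling          # the title element
title_text = title.firstChild.nodeValue   # text sits in a #text node underneath
parent_name = title.parentNode.nodeName

print("The first child of root element is:", first.nodeName)
print("whose next sibling is:", title.nodeName)
print('Text inside "title" tag is', title_text)
print("Parent node of title is:", parent_name)
```

To read from a file instead, xml.dom.minidom.parse accepts a file name or an open file object.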
Program output
Here is the root element of the document: article
The following are its child elements:
#text
title
#text
date
#text
author
#text
summary
#text
content
#text
The first child of root element is: #text
whose next sibling is: title
Text inside "title" tag is Simple XML
Parent node of title is: article
Parsing XML sequence?
• We have the i2xml filter (exercise); we want xml2i as well
• New XML structure for Isequences: one file can hold more than one sequence
• Algorithm:
  – Open file
  – Use Python parser to obtain the DOM tree
  – Traverse tree to extract sequence information, build Isequence objects
Ignoring whitespace nodes, we have to search a tree like this:
SEQUENCEDATA
  SEQ (type)
    NAME
    ID
    DATA
  SEQ (type)
    NAME
    ID
    DATA
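The parse-and-traverse steps can be sketched as follows, assuming xml.dom.minidom as the parser. The attribute name type, the second sequence, and the helper names element_children and outline are assumptions made for illustration; the DNA strings are shortened.

```python
from xml.dom.minidom import parseString

# New-format document with two sequences; the second one is hypothetical
NEW_FORMAT = """<?xml version = "1.0"?>
<SEQUENCEDATA>
   <SEQ type="dna">
      <NAME>Aspergillus awamori</NAME>
      <ID>U03518</ID>
      <DATA>aacctgcggaaggatcatta</DATA>
   </SEQ>
   <SEQ type="dna">
      <NAME>hypothetical second sequence</NAME>
      <ID>X00001</ID>
      <DATA>ccgagtgcgggtcctttggg</DATA>
   </SEQ>
</SEQUENCEDATA>
"""


def element_children(node):
    """Return the child elements of node, skipping #text and comment nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def outline(node, depth=0):
    """Return the element names below node as indented lines."""
    lines = ["  " * depth + node.nodeName]
    for child in element_children(node):
        lines.extend(outline(child, depth + 1))
    return lines


root = parseString(NEW_FORMAT).documentElement
tree_lines = outline(root)
print("\n".join(tree_lines))
```

With a real file, xml.dom.minidom.parse(filename) would replace parseString here.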
We're still being systematic: Usual name for parse method
Obtain a parse tree with the xml data for free
xml2i.py (part 1)
SEQUENCEDATA
  SEQ (type)
  SEQ (type)
Convert this SEQ subtree to an Isequence object
xml2i.py (part 2)
SEQ (type)
  NAME
  ID
  DATA
Way of getting to all attributes of a node
Way of getting to the value of a specific attribute
Recall: text kept in a #text node underneath
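A sketch of converting one SEQ subtree, not the course's xml2i.py. The Isequence class below is a hypothetical minimal stand-in, the attribute name type is an assumption, and the DATA text is shortened.

```python
from xml.dom.minidom import parseString


class Isequence:
    """Hypothetical minimal stand-in for the course's Isequence class."""

    def __init__(self, name, seq_id, data, seq_type):
        self.name = name
        self.seq_id = seq_id
        self.data = data
        self.seq_type = seq_type


def text_of(element):
    """Concatenate the #text nodes directly underneath element."""
    return "".join(child.nodeValue for child in element.childNodes
                   if child.nodeType == child.TEXT_NODE)


def seq_to_isequence(seq_node):
    """Build an Isequence from a SEQ element and its NAME, ID, DATA children."""
    fields = {}
    for child in seq_node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            fields[child.nodeName] = text_of(child)
    # Value of one specific attribute ("" if the attribute is missing)
    seq_type = seq_node.getAttribute("type")
    # Sequence data may be wrapped over several lines: drop all whitespace
    data = "".join(fields["DATA"].split())
    return Isequence(fields["NAME"].strip(), fields["ID"].strip(), data, seq_type)


SEQ_XML = """<SEQ type="dna">
   <NAME>Aspergillus awamori</NAME>
   <ID>U03518</ID>
   <DATA>aacctgcggaaggatcatta
ccgagtgcgggtcctttggg</DATA>
</SEQ>"""

seq_node = parseString(SEQ_XML).documentElement

# All attributes of a node, as (name, value) pairs
all_attributes = dict(seq_node.attributes.items())

sequence = seq_to_isequence(seq_node)
print(sequence.name, sequence.seq_id, sequence.seq_type, len(sequence.data))
```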
See all the methods and attributes of a DOM tree on pages 537ff
Attribute/Method                    Description
appendChild( newChild )             Appends newChild to the list of child nodes. Returns the appended child node.
attributes                          NamedNodeMap that contains the attribute nodes for the current node.
childNodes                          NodeList that contains the node’s current children.
firstChild                          First child node in the NodeList or None, if the node has no children.
insertBefore( newChild, refChild )  Inserts the newChild node before the refChild node. refChild must be a child node of the current node; otherwise, insertBefore raises a ValueError exception.
isSameNode( other )                 Returns true if other is the current node.
lastChild                           Last child node in the NodeList or None, if the current node has no children.
nextSibling                         The next node in the NodeList, or None, if the node has no next sibling.
nodeName                            Name of the node, or None, if the node does not have a name.
Possible to manipulate the DOM tree using these methods (add new nodes, remove nodes, set attributes etc.)
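A small sketch of such manipulation with xml.dom.minidom. Besides appendChild and insertBefore from the table, it uses createElement, createTextNode, setAttribute and toxml, which are standard minidom calls; the tiny SEQ document is made up for illustration.

```python
from xml.dom.minidom import parseString

document = parseString("<SEQ><ID>U03518</ID></SEQ>")
seq = document.documentElement

# appendChild: add a DATA element (with a #text child) after the last child
data = document.createElement("DATA")
data.appendChild(document.createTextNode("aacctg"))
seq.appendChild(data)

# insertBefore: put a NAME element in front of the existing ID element
name = document.createElement("NAME")
name.appendChild(document.createTextNode("Aspergillus awamori"))
seq.insertBefore(name, seq.firstChild)

# setAttribute: store the sequence type on the SEQ tag itself
seq.setAttribute("type", "dna")

result = seq.toxml()
print(result)
```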
Convert old format XML sequence to new format
Old format: sequence type has its own tag TYPE
SEQUENCEDATA
  TYPE
  SEQ
    NAME
    ID
    DATA
New format: sequence type is attribute of SEQ tag
SEQUENCEDATA
  SEQ (type)
    NAME
    ID
    DATA
old_xml2i.py
Add new method to original xml2i.py and call it after parsing the XML file
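One way such a conversion step might look, as a sketch and not the actual old_xml2i.py: after parsing, move the text of the TYPE element into an attribute on every SEQ element. The function name old_to_new and the attribute name type are assumptions; the DATA text is shortened.

```python
from xml.dom.minidom import parseString

OLD_FORMAT = """<?xml version = "1.0"?>
<SEQUENCEDATA>
   <TYPE>dna</TYPE>
   <SEQ>
      <NAME>Aspergillus awamori</NAME>
      <ID>U03518</ID>
      <DATA>aacctgcggaaggatcatta</DATA>
   </SEQ>
</SEQUENCEDATA>
"""


def old_to_new(document):
    """Turn a TYPE child of the root into a type attribute on each SEQ."""
    root = document.documentElement
    type_nodes = [c for c in root.childNodes if c.nodeName == "TYPE"]
    if not type_nodes:
        return document               # nothing to convert
    type_node = type_nodes[0]
    seq_type = "".join(c.nodeValue for c in type_node.childNodes
                       if c.nodeType == c.TEXT_NODE).strip()
    root.removeChild(type_node)       # TYPE tag disappears from the tree
    for seq in root.getElementsByTagName("SEQ"):
        seq.setAttribute("type", seq_type)
    return document


document = old_to_new(parseString(OLD_FORMAT))
seq = document.getElementsByTagName("SEQ")[0]
remaining_type_tags = len(document.getElementsByTagName("TYPE"))
print("sequence is of type", seq.getAttribute("type"))
```

After this step the tree has the new-format shape, so the rest of the parsing code can stay unchanged.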
old_xml2phylip.py
Import new module
Check that type information is saved in the Isequence (not used in phylip format)
Testing on old format XML sequence
U03518b.xml:
<?xml version = "1.0"?>
<SEQUENCEDATA>
<TYPE>dna</TYPE>
<SEQ>
<NAME>Aspergillus awamori</NAME>
<ID>U03518</ID>
<DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggcccaacctcccatccgtgtctattgtaccctgttgcttcggcgggcccgccgcttgtcggccgccgggggggcgcctctgccccccgggcccgtgcccgccggagaccccaacacgaacactgtctgaaagcgtgcagtctgagttgattgaatgcaatcagttaaaactttcaacaatggatctcttggttccggc</DATA>
</SEQ>
</SEQUENCEDATA>
python old_xml2phylip.py U03518b.xml U03518b
sequence is of type dna
Remark: book uses old version of DOM parser
• XML examples in book won’t work (except the revised fig16.04)
• Look in the presented example programs to see what you have to import
• All the methods and attributes of a DOM tree on pages 537ff are the same
.. on to the exercises