XML for Information Management – Day 3
Airi Salminen
XML for Information Management
University of Erlangen-Nuremberg, Computational Linguistics
Instructor: Professor Airi Salminen, http://users.jyu.fi/~airi/
26.4.-30.4.2010
Outline
1. Structured documents
2. Formal grammars in XML
3. Natural languages in XML documents
4. Adding meaning by markup
5. Text indexing
6. Logical structure of XML documents
1. Structured documents
Structured document
‣ structure, content, and external presentation can be separated from each other and processed separately
‣ structural components have names
‣ structural components can be recognized by software modules
‣ possible to define the structure
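A minimal Python sketch of the points above: because the structural components have names, a software module can recognize them and process the content separately from the structure. The sample document and the helper names component_names and content_of are assumptions made for illustration; the element names follow the rhymecollection DTD shown in section 2.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not one of the lecture files).
RHYMES = """<rhymecollection>
  <title>Nursery rhymes</title>
  <rhyme><line>One, two,</line><line>buckle my shoe.</line></rhyme>
</rhymecollection>"""


def component_names(xml_text):
    """Return the names of all structural components in document order."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter()]


def content_of(xml_text, name):
    """Return the text content of every component with the given name."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(name)]


print(component_names(RHYMES))
print(content_of(RHYMES, "line"))
```

The structure (the list of names) and the content (the text of the lines) are obtained independently of any layout.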
1. Structured documents
Structured document (figure: Structure, Content, Layout)
‣ the document is written in an open language standard, e.g. SGML, XML
‣ structure: different languages for defining the structure, e.g., DTD, XML Schema, RELAX NG for XML
‣ layout: different languages for defining the layout, e.g., CSS and XSL for XML
1. Structured documents
Structured document (figure: Structure, Content, Layout)
Example files:
DTD.txt
rhymes-with-ext-dtd.txt
rhymes-with-ext-dtd.xml
rhymes-style.txt
rhymes-style.css
rhymes-with-style-and-ext-dtd.xml
rhymes-with-style-and-ext-dtd.txt
Management of structured documents
‣ document management
‣ management of the data contained in documents
1. Structured documents
Characteristics in the management of structured documents
‣ Design. Adopting the approach of structured document management in an environment often requires careful planning before the creation of documents. Planning includes schema design and layout design.
‣ Content production. Content can be produced by different types of software, e.g. by a syntax-directed editor. The validity of the content is checked against the schema.
‣ Evolution. Schemas and layouts are versioned as they change.
‣ Operations. The most typical operation is some kind of transformation.
‣ Software. Many kinds of software systems are used.
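As an illustration of the transformation operation, the following Python sketch turns a rhyme collection into a small HTML fragment. The function name rhymes_to_html, the sample document, and the chosen HTML layout are assumptions for illustration; in practice such transformations are often written in a dedicated language such as XSLT.

```python
import xml.etree.ElementTree as ET


def rhymes_to_html(xml_text):
    """Transform a rhymecollection document into a simple HTML fragment."""
    source = ET.fromstring(xml_text)
    body = ET.Element("div")
    title = source.find("title")
    if title is not None:
        ET.SubElement(body, "h1").text = title.text
    for rhyme in source.findall("rhyme"):
        # Each rhyme becomes one paragraph; its lines are joined with " / ".
        paragraph = ET.SubElement(body, "p")
        lines = [line.text or "" for line in rhyme.findall("line")]
        paragraph.text = " / ".join(lines)
    return ET.tostring(body, encoding="unicode")


# Hypothetical sample input.
SAMPLE = (
    "<rhymecollection><title>Rhymes</title>"
    "<rhyme><line>One, two,</line><line>buckle my shoe.</line></rhyme>"
    "</rhymecollection>"
)
print(rhymes_to_html(SAMPLE))
```

The source document is left untouched; the same content could be transformed into other target formats by other functions.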
2. Formal grammars in XML
A formal grammar is a way to describe the syntax of a language. A grammar consists of
‣ terminal symbols (alphabet)
‣ nonterminal symbols
‣ production rules
‣ start symbol
The language defined by a grammar consists of all those strings over the alphabet that can be generated by starting with the start symbol and then applying the production rules until no nonterminal symbols are present.
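The definition can be tried out with a toy grammar. The grammar below (S -> "a" S "b" | empty) and the function name generate are assumptions made for illustration, not part of the lecture; the sketch applies production rules to the start symbol until no nonterminal symbols remain.

```python
# Toy grammar, assumed for illustration:  S -> "a" S "b" | (empty)
TERMINALS = {"a", "b"}
RULES = {"S": [["a", "S", "b"], []]}
START = "S"


def generate(max_steps):
    """Return the terminal strings derivable in at most max_steps rule applications."""
    results = set()
    forms = [[START]]  # sentential forms that still contain a nonterminal
    for _ in range(max_steps):
        next_forms = []
        for form in forms:
            # Rewrite the leftmost nonterminal with every alternative of its rule.
            index = next(i for i, symbol in enumerate(form) if symbol in RULES)
            for right_side in RULES[form[index]]:
                new_form = form[:index] + right_side + form[index + 1:]
                if all(symbol in TERMINALS for symbol in new_form):
                    results.add("".join(new_form))
                else:
                    next_forms.append(new_form)
        forms = next_forms
    return sorted(results, key=len)


print(generate(3))
```

Every generated string consists of a run of a's followed by the same number of b's, which is the language this grammar defines.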
In XML there are two kinds of formal grammars with their own notations:
‣ the grammar defining the XML syntax in the XML specification
‣ DTD
2. Formal grammars in XML
The XML specification uses the EBNF (Extended Backus-Naur Form) notation with metasymbols ?, *, +, |, and ( ).
The syntax of XML 1.0 is described by production rules numbered from [1] to [89]. Some of the rules included in the first edition have been left out in later editions, and some others have been added, for example, [28a], [28b].
The notation of XML syntax is described in Section 6 of the specification: 6. Notation.
2. Formal grammars in XML
A?      A is optional
A | B   A and B are alternatives
A+      A occurs once or more
A*      A may be missing or occurs once or more
A - B   A but not B
A B     B after A
( )     grouping
Example rules in XML 1.0:
document ::= prolog element Misc*
prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
Misc ::= Comment | PI | S
Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
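The Comment rule can be read almost directly as a regular expression. The Python sketch below is a simplification: Char is approximated by "any character", whereas the specification restricts Char to the legal XML characters. The name is_comment is an assumption for illustration.

```python
import re

# Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
# Simplification: (Char - '-') is written as [^-], i.e. any character except '-'.
COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_comment(text):
    """Return True if the whole text matches the Comment production."""
    return COMMENT.fullmatch(text) is not None


print(is_comment("<!-- a comment -->"))
print(is_comment("<!-- a -- b -->"))
```

The rule excludes a double hyphen inside a comment and a comment ending in three hyphens, which is why the second call is rejected.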
Production rules in a DTD:
<!ELEMENT rhymecollection (title?, rhyme+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT rhyme (line+)>
<!ELEMENT line (#PCDATA)>
The element type declarations of a DTD do not describe the concrete syntax of elements, only their hierarchic structure. The details of the concrete syntax (begin-tag, end-tag, etc.) are described in the XML specification.
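Since a content model only constrains the sequence of child elements, it can be checked like a regular expression over child names. The Python sketch below translates the four declarations above by hand; it is not a DTD validator, and the names CONTENT_MODELS, children_match and structure_ok are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regular expressions over "name " tokens, one per declaration.
CONTENT_MODELS = {
    "rhymecollection": r"(title )?(rhyme )+",  # (title?, rhyme+)
    "title": r"",                              # (#PCDATA): no child elements
    "rhyme": r"(line )+",                      # (line+)
    "line": r"",                               # (#PCDATA): no child elements
}


def children_match(element):
    """Check the child element sequence of one element against its model."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False  # undeclared element type
    names = "".join(child.tag + " " for child in element)
    return re.fullmatch(model, names) is not None


def structure_ok(xml_text):
    """Check the hierarchic structure of a whole document."""
    root = ET.fromstring(xml_text)
    return all(children_match(element) for element in root.iter())


print(structure_ok("<rhymecollection><rhyme><line>a</line></rhyme></rhymecollection>"))
```

The concrete syntax (tags, quoting, character escapes) is handled entirely by the XML parser; the check only sees the hierarchy.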
2. Formal grammars in XML
The XML specification defines the concrete syntax of XML documents.
The distinction between the concrete and abstract syntax of XML is not quite clear. W3C has developed four slightly different models to describe the abstract syntax:
• XML Information Set
• DOM model
• XPath 1.0 model
• XQuery 1.0 and XPath 2.0 data model
Analysis of differences in the models: Salminen, A., & Tompa, F.W. (2001). Requirements for XML document database systems. Proc. of the ACM Symposium on Document Engineering (DocEng '01) (pp. 85-94). New York: ACM Press.
3. Natural languages in XML documents
Natural language may occur in XML marked-up text in the
• content of elements
• markup
  • element, attribute, and entity names
  • attribute values
  • comments
3. Natural languages in XML documents
Natural language in the markup is NOT utilized by the XML processor, BUT it can be utilized by
• human individuals in
  • reading the marked-up text
  • information access
  • communicating with other individuals about the schema or marked-up content
• some software applications, for example, text analysis software
4. Adding meaning by markup
It is important that the element and attribute names are meaningful to human readers.
The names in the following example are not useful in information access:
<AAA XXX="5">
  <rki YYY="Hamlet">Where wilt thou lead me? speak; I'll go no further.</rki>
  <rki YYY="ghost">Mark me.</rki>
</AAA>
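The following Python sketch contrasts the opaque names with a renamed version of the same content. The names scene, number, speech and speaker are hypothetical choices made for illustration (the slide does not say what AAA, XXX, rki and YYY stand for), as is the helper name select. For the software both queries work equally well; the difference is whether a human can guess what to ask for.

```python
import xml.etree.ElementTree as ET

OPAQUE = (
    '<AAA XXX="5">'
    '<rki YYY="Hamlet">Where wilt thou lead me? speak; I\'ll go no further.</rki>'
    '<rki YYY="ghost">Mark me.</rki>'
    '</AAA>'
)

# Hypothetical renaming of the same content with meaningful names.
MEANINGFUL = (
    '<scene number="5">'
    '<speech speaker="Hamlet">Where wilt thou lead me? speak; I\'ll go no further.</speech>'
    '<speech speaker="ghost">Mark me.</speech>'
    '</scene>'
)


def select(xml_text, element, attribute, value):
    """Return the text of all elements whose attribute has the given value."""
    root = ET.fromstring(xml_text)
    path = f".//{element}[@{attribute}='{value}']"
    return [found.text for found in root.findall(path)]


print(select(MEANINGFUL, "speech", "speaker", "ghost"))  # readable query
print(select(OPAQUE, "rki", "YYY", "ghost"))             # same result, unreadable query
```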
4. Adding meaning by markup
Natural language in XML documents provides semantic information to human readers and for human communication.
Meaningful markup is useful for human users in information retrieval and in specifying transformations.
Markup may provide rich semantic and linguistic information.
4. Adding meaning by markup
Example of combining structural, semantic and linguistic markup:
She smelled like trees.
<Chapter section='1'>
  <Paragraph id='143' FragmentCode='1.12'>
    <Narration narrator='Benjy'>
      <Subject person='Caddy'>She</Subject> <Senses mode='smell'>smelled</Senses> like <Imagery referent='tree'>trees</Imagery>
    </Narration>
  </Paragraph>
</Chapter>
Example from Smith, J., Deshaye, J., & Stoicheff, P., Callimachus - Avoiding the pitfalls of XML for collaborative text analysis. Literary and Linguistic Computing 21 (2), 2006, 199-218.
4. Adding meaning by markup
Another markup for the same text:
She smelled like trees.
<Chapter section='1'>
  <Narration narrator='Benjy'>
    <Imagery place='tree' mode='simile' sense='smell'>
      <Fragment code='1.12'>
        <Paragraph id='143'>
          <Subject person='Caddy'>She</Subject> smelled like trees.
        </Paragraph>
      </Fragment>
    </Imagery>
  </Narration>
</Chapter>
Example from Smith, J., Deshaye, J., & Stoicheff, P., Callimachus - Avoiding the pitfalls of XML for collaborative text analysis. Literary and Linguistic Computing 21 (2), 2006, 199-218.
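The two markups of the same sentence can be compared with a short Python sketch. Both strings reproduce the examples from the slides; the helper names plain_text and sense are assumptions for illustration. The sketch shows that the underlying text is (almost) the same, while the question "which sense is involved?" has to be asked in a different place in each markup.

```python
import xml.etree.ElementTree as ET

MARKUP_A = (
    "<Chapter section='1'><Paragraph id='143' FragmentCode='1.12'>"
    "<Narration narrator='Benjy'><Subject person='Caddy'>She</Subject> "
    "<Senses mode='smell'>smelled</Senses> like "
    "<Imagery referent='tree'>trees</Imagery> </Narration></Paragraph></Chapter>"
)
MARKUP_B = (
    "<Chapter section='1'><Narration narrator='Benjy'>"
    "<Imagery place='tree' mode='simile' sense='smell'><Fragment code='1.12'>"
    "<Paragraph id='143'><Subject person='Caddy'>She</Subject> smelled like trees."
    "</Paragraph></Fragment></Imagery></Narration></Chapter>"
)


def plain_text(xml_text):
    """Strip the markup and normalize whitespace."""
    root = ET.fromstring(xml_text)
    return " ".join("".join(root.itertext()).split())


def sense(xml_text):
    """Find the sense annotation, wherever this markup happens to store it."""
    root = ET.fromstring(xml_text)
    senses = root.find(".//Senses")
    if senses is not None:
        return senses.get("mode")       # first markup: <Senses mode='...'>
    return root.find(".//Imagery").get("sense")  # second markup: <Imagery sense='...'>


print(plain_text(MARKUP_A), "|", plain_text(MARKUP_B))
print(sense(MARKUP_A), sense(MARKUP_B))
```

Note that the first markup, as printed on the slide, does not contain the final period of the sentence.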
4. Adding meaning by markup
Some other examples:
http://nrrc.mitre.org/NRRC/Docs_Data/MPQA_04/approval_time.htm
http://www.cs.cmu.edu/~awb/festival_demos/sable.html
http://www.etang.umontreal.ca/bwp1800/essays/flanders_encoding4.html
4. Adding meaning by markup
In the Semantic Web, semantic information about the markup vocabulary of documents is available as additional metadata in a formal, standardized form.
The concepts and meanings are defined in formal ontologies.
Software applications can understand the meanings.
5. Text indexing
In information retrieval environments, collections of natural language documents are usually indexed; retrieval is based on the index terms included in the index.
(Figure: documents, index, search engine, query, answer)
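A minimal sketch of index-based retrieval in Python, assuming the simplest possible setup: an inverted index from lower-cased words to document identifiers, and queries answered by intersecting the entries of the query terms. The sample documents and the function names build_index and search are assumptions for illustration; real search engines add stemming, stop words, ranking and more.

```python
import re
from collections import defaultdict


def terms_of(text):
    """Split text into lower-cased index terms (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every index term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in terms_of(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain all terms of the query."""
    terms = terms_of(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


# Hypothetical sample collection.
DOCUMENTS = {
    "d1": "She smelled like trees.",
    "d2": "Mark me.",
    "d3": "The mountain cherry blossoms smelled sweet.",
}
INDEX = build_index(DOCUMENTS)
print(search(INDEX, "smelled"))
print(search(INDEX, "smelled trees"))
```

The query is answered from the index alone; the documents themselves are not scanned at query time.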
6. Logical structure of XML documents
Components of the logical structure
• declarations
• elements
• comments
• processing instructions
6. Logical structure of XML documents
document ::= prolog element Misc*
‣ prolog: declarations, comments, processing instructions
‣ element: elements, comments, processing instructions
‣ Misc*: comments, processing instructions
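The three parts of the document rule can be made visible with Python's xml.dom.minidom, which keeps comments, processing instructions and the document type declaration as children of the document node. The sample document and the name top_level_parts are assumptions for illustration. The XML declaration itself is not represented as a node in the DOM.

```python
from xml.dom import Node, minidom

# Hypothetical sample: prolog (comment, PI, DOCTYPE), root element, trailing Misc.
SAMPLE = """<?xml version="1.0"?>
<!-- prolog comment -->
<?app prolog-instruction?>
<!DOCTYPE text [<!ELEMENT text (#PCDATA)>]>
<text>content<!-- inner comment --></text>
<!-- trailing comment -->
<?app trailing-instruction?>"""

NAMES = {
    Node.ELEMENT_NODE: "element",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    Node.DOCUMENT_TYPE_NODE: "document type declaration",
}


def top_level_parts(xml_text):
    """List the kinds of the top-level components in document order."""
    document = minidom.parseString(xml_text)
    return [NAMES.get(node.nodeType, "other") for node in document.childNodes]


for part in top_level_parts(SAMPLE):
    print(part)
```

Everything before the single top-level element belongs to the prolog, and everything after it to Misc*.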
6. Logical structure of XML documents
Declarations:
‣ XML declaration [23]
‣ document type declaration [28]
‣ markup declaration [29]
  • element type declaration [45] (to constrain the logical structure)
  • attribute list declaration [52] (to constrain the logical structure)
  • entity declaration [70] (to constrain the physical structure)
  • notation declaration [82] (to constrain the physical structure)
‣ encoding declaration [80]
‣ standalone document declaration [32]
‣ text declaration [77]
6. Logical structure of XML documents
Typical element type declarations:
<!ELEMENT product (mfg, model, description, clock?)>    element content defined
<!ELEMENT model (#PCDATA)>                               mixed content defined
<!ELEMENT description (#PCDATA | feature)*>              mixed content defined
<!ELEMENT clock EMPTY>                                   empty element defined
6. Logical structure of XML documents
empty element defined:
<!ELEMENT clock EMPTY>
two forms of the element allowed in a well-formed document:
<clock></clock>
<clock/>
6. Logical structure of XML documents
element content: definition by content models with metasymbols
*    iteration (none or more)
+    iteration (once or more)
|    alternatives
?    optional
,    successive
( )  grouping
#PCDATA is not accepted in an element content model!
Example from XHTML 1.0 Strict DTD:
<!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
6. Logical structure of XML documents
mixed content: definition has basically two forms
(#PCDATA)
(#PCDATA | e1 | … | en)*
#PCDATA is always included in the content specification and comes first in the list of alternatives
examples:
<!ELEMENT text (#PCDATA)>
<!ELEMENT section (#PCDATA | subsection)*>
<!ELEMENT section (#PCDATA | subsection | paragraph)*>
6. Logical structure of XML documents
Attribute list declarations
• to define the set of attributes pertaining to a given element type
• to establish type constraints for these attributes
• to provide default values for attributes
6. Logical structure of XML documents
<!ATTLIST poem author CDATA #REQUIRED>
‣ element type: poem
‣ attribute name: author
‣ attribute type: string (CDATA)
‣ constraint: the attribute must be specified for all elements of type poem (#REQUIRED)
Defining constraints
#REQUIRED: attribute must always be provided in all elements of the given type
#IMPLIED: attribute can be provided in an element; no default value is provided
AttValue: default value is given between single or double quotes
#FIXED AttValue: instances of the attribute must match the given default value
[60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue)
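The effect of the default declarations can be observed with a non-validating parser. The Python sketch below uses xml.etree.ElementTree, whose underlying expat parser reports default values declared in the internal DTD subset; it does not enforce #REQUIRED or #FIXED, since that would need a validating parser. The poems document, the attribute names language, source and format, and the helper name poem_attributes are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with one declaration of each kind.
DOC = """<!DOCTYPE poems [
<!ELEMENT poems (poem*)>
<!ELEMENT poem (#PCDATA)>
<!ATTLIST poem
  author   CDATA #REQUIRED
  language CDATA "en"
  source   CDATA #IMPLIED
  format   CDATA #FIXED "verse">
]>
<poems><poem author="Murasaki Shikibu">This life of ours</poem></poems>"""


def poem_attributes(xml_text):
    """Return the attributes of the first poem as seen by the application."""
    poem = ET.fromstring(xml_text).find("poem")
    return dict(poem.attrib)


print(poem_attributes(DOC))
```

Only author is written in the document instance; language and format are supplied from the declarations, and the #IMPLIED attribute source stays absent.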
6. Logical structure of XML documents
Attribute types
[54] AttType ::= StringType | TokenizedType | EnumeratedType
string type:
• CDATA: string
tokenized types:
• ID: names that uniquely identify elements
• IDREF, IDREFS: references to ID type identifiers
• ENTITY, ENTITIES: entity names
• NMTOKEN, NMTOKENS: text tokens consisting of characters accepted in names
enumerated types:
• NOTATION: identifies notations
• enumeration
6. Logical structure of XML documents
<?xml version="1.0"?>
<!DOCTYPE text [
<!ELEMENT text (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line
  id ID #REQUIRED
  seeline IDREFS #IMPLIED>
]>
<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1">This is the second line, but look at the first too</line>
</text>
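What the ID and IDREFS types require of the document above can be sketched by hand in Python: every id value must be unique, and every name listed in seeline must be the id of some element. A validating parser would perform these checks itself; the function name check_references and the wording of the messages are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<text>'
    '<line id="r1">This is the first line</line>'
    '<line id="r2" seeline="r1">This is the second line, but look at the first too</line>'
    '</text>'
)


def check_references(xml_text):
    """Return a list of ID/IDREFS problems; an empty list means none were found."""
    root = ET.fromstring(xml_text)
    seen = set()
    problems = []
    for line in root.iter("line"):
        identifier = line.get("id")
        if identifier in seen:
            problems.append(f"duplicate id {identifier}")
        seen.add(identifier)
    for line in root.iter("line"):
        # An IDREFS value is a whitespace-separated list of names.
        for target in line.get("seeline", "").split():
            if target not in seen:
                problems.append(f"line {line.get('id')} refers to unknown id {target}")
    return problems


print(check_references(DOC))
```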
6. Logical structure of XML documents
XML-aware web browsers support the visualization of the tree structure. Example:
<Chapter section='1'>
  <Narration narrator='Benjy'>
    <Imagery place='tree' mode='simile' sense='smell'>
      <Fragment code='1.12'>
        <Paragraph id='143'>
          <Subject person='Caddy'>She</Subject> smelled like trees.
        </Paragraph>
      </Fragment>
    </Imagery>
  </Narration>
</Chapter>
6. Logical structure of XML documents
Different abstract models describe the tree in slightly different ways.
<poem author="Murasaki Shikibu" born="974">
<!-- The poem is translated from Japanese by Kenneth Rexroth -->
<line>This life of ours would not cause you sorrow</line>
<line>if you thought of it as like</line>
<line>the mountain cherry blossoms</line>
<line>which bloom and fade in a day. </line>
</poem>
6. Logical structure of XML documents
Node types of XPath 1.0
Root node
  Element node: poem
    Attribute node: author = Murasaki Shikibu
    Attribute node: born = 974
    Comment node: The poem is translated from Japanese by Kenneth Rexroth
    Element node: line
      Text node: This life of ours would not cause you sorrow
    Element node: line
      Text node: if you thought of it as like
    Element node: line
      Text node: the mountain cherry blossoms
    Element node: line
      Text node: which bloom and fade in a day.
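The node inventory of the poem can be approximated in Python with xml.dom.minidom. This is a sketch under stated assumptions: it uses the DOM model rather than the XPath 1.0 model (the DOM document node plays the role of the XPath root node), and the poem is written without whitespace between the tags so that no whitespace-only text nodes appear. The name count_nodes is an assumption for illustration.

```python
from collections import Counter
from xml.dom import Node, minidom

POEM = (
    '<poem author="Murasaki Shikibu" born="974">'
    '<!-- The poem is translated from Japanese by Kenneth Rexroth -->'
    '<line>This life of ours would not cause you sorrow</line>'
    '<line>if you thought of it as like</line>'
    '<line>the mountain cherry blossoms</line>'
    '<line>which bloom and fade in a day. </line>'
    '</poem>'
)


def count_nodes(xml_text):
    """Count the nodes of each kind in the document tree."""
    counts = Counter()

    def visit(node):
        if node.nodeType == Node.DOCUMENT_NODE:
            counts["root"] += 1
        elif node.nodeType == Node.ELEMENT_NODE:
            counts["element"] += 1
            # Attributes are not children in the DOM; count them per element.
            counts["attribute"] += node.attributes.length
        elif node.nodeType == Node.TEXT_NODE:
            counts["text"] += 1
        elif node.nodeType == Node.COMMENT_NODE:
            counts["comment"] += 1
        for child in node.childNodes:
            visit(child)

    visit(minidom.parseString(xml_text))
    return dict(counts)


print(count_nodes(POEM))
```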