Applied Linguistics, Sprach- und Literaturwissenschaften, Fachbereich 10
A brief introduction to the GeM annotation schema for complex document layout
John Bateman, Renate Henschel, Judy Delin
talk given by: Guowen Yang
Taipei, September 2002
Overview of Talk
Orientation:
– describing the approach to annotation of multimodal documents developed in the GeM project
• What is the GeM project?
  – goals, methods, requirements
• Summary of annotation problems raised
• Annotation solutions adopted
The GeM project (‘Genre and Multimodality’)
– supported by the British Economic and Social Research Council (ESRC)
– Cooperation:
  • University of Stirling
  • University of Bremen
  • Enterprise Information Design Unit
– Goal: to put the description of multimodal page-based documents on a sound empirical footing
The problem of data
– there is now much theorizing about how multimodal documents work
– but the empirical basis of this theorizing is often less than strong
Basic GeM hypotheses
– Documents belonging to different kinds of ‘genres’ will exhibit different kinds of multimodal patterns, just as text types exhibit different lexicogrammatical patterns
– It should be possible to map out these patterns for different genres
– There should be a regular relationship between genre-type and the patterns found
Requirement
– An annotated corpus needed to be constructed, containing the extra information that we know or expect to be most useful in establishing descriptions of multimodal documents
– The extra information is then to serve as the basis for generalizations about genre
The problem of data selection and description
– what kinds of documents are we talking about?
– what kinds of annotation do we need?
– what kinds of documents are we talking about?
Any page-based medium which combines information from a variety of modalities in order to get its message across
Initial genres selected for the GeM corpus
– field guides (birds)
– instruction manuals (telephones)
– print newspapers
– electronic web-based versions of newspapers
[Example slides, images not reproduced: field guides; instruction manuals; print newspapers; web-based newspapers]
Motivations for selections
– all contain combinations of graphical, textual, and photographic material
– all use the layout of these elements in complex ways
– for all the documents selected, we were able to obtain feedback and discussion from their designers
Some Relations to Natural Language Processing
– it is our belief that we can approach the design and function of these documents using established linguistic techniques
– the ‘unit of analysis’ is scaled up from the sentence or the text to the page (at least)
– given a formal specification of the motivation and realization of such documents, we can consider their automatic generation
– what kinds of annotation do we need?
The GeM annotation layers
• Content structure
• Rhetorical structure
• Layout structure
• Navigation structure
• Linguistic structure
[Figure: the same list of layers, annotated with the labels ‘genre’ and ‘form’]
Practical information required
– the GeM model also takes seriously the notion that the concrete, practical conditions of production (technology, material, time available, etc.) all contribute substantially to the properties of a genre
• Canvas constraints
• Production constraints
• Consumption constraints
Requirements
– From the GeM perspective, a page-based multimodal document requires analysis at least at these levels, taking into account the sources of constraint identified.
– Only then do we have enough information to consider:
  • motivation of design
  • critique of design and communicative effectiveness
  • repurposing
  • automatic generation
Pointers
– The assumptions made, and the particular layers of analysis adopted, are motivated and introduced at length in:
  • Delin/Bateman/Allen: Information Design Journal
  • Delin/Bateman: Document Design
– Details on the website http://purl.org/net/gem
Summary of annotation problems raised
– form of annotation to select
– criteria for recognising units
– multiple non-isomorphic intersecting hierarchies
– non-linear information
– complex query requirements
Annotation solutions adopted
– form of annotation to select
TEI: Text Encoding Initiative
CES: Corpus Encoding Standard
XCES: XML version of CES
GeM annotation scheme
Annotation solutions adopted
– criteria for recognising units
  • basic vocabulary of the page: images, signs, sentences, numbers, ...
  • layout units: hierarchy determined visually and by considering the degree to which elements ‘belong together’
  • rhetorical structure: traditional analysis according to Mann and Thompson’s Rhetorical Structure Theory (RST)
  • navigation units: elements pointing elsewhere in the document
Annotation solutions adopted
– multiple non-isomorphic intersecting hierarchies
• stand-off annotation...
XML stand-off annotation for encoding the GeM layers
• a single ‘base’ element annotated file
• several ‘stand-off’ layers of annotation
• a Document Type Definition (DTD) for each layer of annotation
• each annotation layer corresponds to a GeM analysis layer
GeM layers: the base file
<unit id="u-21.5">---------------</unit>
<unit id="u-21.6" src="gannet.jpg" alt="gannet-photo"/>
<unit id="u-21.7">Huge (90cm) unmistakable seabird.</unit>
<unit id="u-21.8">Watch for white, cigar-shaped body and long straight, slender, black-tipped wings.</unit>
<unit id="u-21.9">In summer, yellow head of adult inconspicuous.</unit>
<unit id="u-21.10">Plunges spectacularly for fish.</unit>
<unit id="u-21.11">Sexes similar.</unit>
Basic ‘vocabulary’ of the page, segmented and numbered.
Actual ordering and positioning on the page irrelevant at this stage.
Predominantly ‘flat’ structure.
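A flat base file of this kind is easy to load into a lookup table keyed by unit id. The following is a minimal sketch using Python's standard library; the `<base>` wrapper element and the function name `load_base_units` are assumptions made for illustration and are not part of the GeM scheme as shown on the slide.

```python
import xml.etree.ElementTree as ET

# The <base> wrapper is an assumption so that the fragment parses as one
# XML document; the slide shows only the <unit> elements.
BASE_XML = """<base>
  <unit id="u-21.5">---------------</unit>
  <unit id="u-21.6" src="gannet.jpg" alt="gannet-photo"/>
  <unit id="u-21.7">Huge (90cm) unmistakable seabird.</unit>
  <unit id="u-21.8">Watch for white, cigar-shaped body and long straight, slender, black-tipped wings.</unit>
  <unit id="u-21.9">In summer, yellow head of adult inconspicuous.</unit>
  <unit id="u-21.10">Plunges spectacularly for fish.</unit>
  <unit id="u-21.11">Sexes similar.</unit>
</base>"""


def load_base_units(xml_text):
    """Return a dict mapping each unit id to its text or, for images, its src."""
    root = ET.fromstring(xml_text)
    units = {}
    for unit in root.iter("unit"):
        # Image units carry a src attribute instead of text content.
        content = unit.get("src") or (unit.text or "").strip()
        units[unit.get("id")] = content
    return units


units = load_base_units(BASE_XML)
for unit_id, content in units.items():
    print(unit_id, "->", content)
```

Because the base layer carries no ordering or positioning information, a plain dictionary is enough at this stage; the other layers refer back to it by id.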
Distribution of information across layers
[Figure: diagram with the labels base units, layout units, RST segments, navigational elements, Layout, and Semantic Content]
Example: Derivation of layout structure
• Working visually from the page, decompose the objects on the page in terms of their visual unity
• Transform the page decomposition into a hierarchical structure
• Specify presentation information for units: e.g., font size, type, colour, image type, resolution, etc.
Complete layout structure
• provides a place for assigning specific information about the layout units
• contents given by collections of the base units of the page
GeM layers: layout units (1)
Layout unit content is defined by cross-references (xrefs) to base units.
The content shown here is not formally used and may be omitted.
<layout-unit id="lay-flegg-text" xref="u-21.7 u-21.8 u-21.9 u-21.10 u-21.11">
  Huge (90cm) unmistakable seabird. Watch for white, cigar-shaped body and long straight, slender, black-tipped wings. In summer, yellow head of adult inconspicuous. Plunges spectacularly for fish. Sexes similar.
</layout-unit>
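Since the text inside a layout unit is redundant, it can be recovered from the base file by following the xref list. The sketch below illustrates this under some assumptions: the `<base>` wrapper element, the empty `<layout-unit/>` form, and the helper name `resolve_xrefs` are introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

# Assumed wrapper element <base>; the unit contents are taken from the slides.
BASE_XML = """<base>
  <unit id="u-21.7">Huge (90cm) unmistakable seabird.</unit>
  <unit id="u-21.8">Watch for white, cigar-shaped body and long straight, slender, black-tipped wings.</unit>
  <unit id="u-21.9">In summer, yellow head of adult inconspicuous.</unit>
  <unit id="u-21.10">Plunges spectacularly for fish.</unit>
  <unit id="u-21.11">Sexes similar.</unit>
</base>"""

# The layout unit is written without its (optional) text content.
LAYOUT_XML = '<layout-unit id="lay-flegg-text" xref="u-21.7 u-21.8 u-21.9 u-21.10 u-21.11"/>'


def resolve_xrefs(layout_unit, base_root):
    """Return the text of each base unit referenced by the layout unit, in order."""
    by_id = {unit.get("id"): unit for unit in base_root.iter("unit")}
    refs = layout_unit.get("xref").split()
    return [(by_id[ref].text or "").strip() for ref in refs]


base_root = ET.fromstring(BASE_XML)
layout_unit = ET.fromstring(LAYOUT_XML)
sentences = resolve_xrefs(layout_unit, base_root)
print(" ".join(sentences))
```

Keeping the text only in the base file means each stand-off layer can be edited without risk of the copies drifting apart.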
GeM layers: layout units (2)
Layout units contain typographical details common over the unit and its children.
Layout units are again identified via cross-references.
Typographical information is modelled on CSS and XSL-FO.
<text xref="lay-21.12 lay-21.14 lay-21.16 lay-21.18 lay-21.20"
      font-family="sans-serif" font-size="10" font-style="normal"
      font-weight="bold" case="mixed" justification="right" color="black"/>
GeM layers: layout units (3)
<layout-root id="page-21">
  <layout-leaf xref="header-21"/>
  <layout-chunk id="body-21">
    <layout-leaf xref="lay-21.2"/>
    <layout-leaf xref="lay-21.3"/>
  </layout-chunk>
  <layout-leaf xref="page-no-21"/>
</layout-root>
Layout structure is recursive:
page-21
  header-21
  body-21
    lay-21.2
    lay-21.3
  page-no-21
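The recursive structure can be walked with a short recursive function. This is a sketch only: the function name `outline` is hypothetical, and it simply labels each node by its `id` or, for leaves, its `xref`.

```python
import xml.etree.ElementTree as ET

LAYOUT_XML = """<layout-root id="page-21">
  <layout-leaf xref="header-21"/>
  <layout-chunk id="body-21">
    <layout-leaf xref="lay-21.2"/>
    <layout-leaf xref="lay-21.3"/>
  </layout-chunk>
  <layout-leaf xref="page-no-21"/>
</layout-root>"""


def outline(node, depth=0):
    """Return indented labels for node and all its descendants, depth first."""
    # Roots and chunks have an id; leaves point at a layout unit via xref.
    label = node.get("id") or node.get("xref")
    lines = ["  " * depth + label]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(LAYOUT_XML))
print("\n".join(tree_lines))
```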
Annotation solutions adopted
– non-linear information
  • positioning of layout units within a page is specified two-dimensionally with respect to a generalized page model
  • the page model decomposes the page area into a hierarchy of grids
  • specifying the grid for a page is part of the annotation task
Example: Derivation of layout structure
• Working visually from the page, decompose the objects on the page in terms of their visual unity
• Transform the page decomposition into a hierarchical structure
• Specify presentation information for units: e.g., font size, type, colour, image type, resolution, etc.
• Inspect the page for any local or global grid structure
• Relate layout units to grid positions
Complete layout structure + page model
• each sub-tree can additionally be assigned to a position in a hierarchically ordered page grid
GeM layers: area model
Layout units are related to identified elements of a hierarchical grid specified in the area model.
<area-root id="page-frame" cols="1" rows="3" hspacing="100" vspacing="10 85 5" height="16cm" width="14cm">
  <sub-area id="body-frame" location="row-2" cols="2" rows="1" hspacing="50 50" vspacing="100"/>
</area-root>
<layout-root id="page-21">
  <layout-leaf xref="header-21" location="row-1" area-ref="page-frame"/>
  ...
</layout-root>
[Figure: page frame, 14cm wide and 16cm high, with areas labelled 10%, 85%, and 5%]
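As a worked example, the row heights of the page frame can be computed from the area model. This sketch assumes that the `vspacing` values are percentages of the area's `height`, which matches the percentage labels in the figure but is not stated explicitly on the slide; the helper names `parse_cm` and `row_heights_cm` are hypothetical.

```python
import xml.etree.ElementTree as ET

AREA_XML = """<area-root id="page-frame" cols="1" rows="3" hspacing="100"
           vspacing="10 85 5" height="16cm" width="14cm">
  <sub-area id="body-frame" location="row-2" cols="2" rows="1"
            hspacing="50 50" vspacing="100"/>
</area-root>"""


def parse_cm(value):
    """Convert a length such as '16cm' to a float number of centimetres."""
    if not value.endswith("cm"):
        raise ValueError("expected a length in cm, got %r" % value)
    return float(value[:-2])


def row_heights_cm(area):
    """Split the area's height into rows.

    Assumption: each vspacing entry is a percentage of the total height.
    """
    total = parse_cm(area.get("height"))
    return [total * float(share) / 100 for share in area.get("vspacing").split()]


page_frame = ET.fromstring(AREA_XML)
heights = row_heights_cm(page_frame)
for number, height in enumerate(heights, start=1):
    print("row-%d: %.1f cm" % (number, height))
```

Under this reading, the `header-21` leaf placed at `row-1` occupies the top tenth of the 16cm page frame, and the `body-frame` sub-area at `row-2` takes the large middle row.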
Annotation solutions adopted
– complex query requirements
  • XPath queries using standard tools
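To give a flavour of such queries, the sketch below runs two simple path expressions over fragments taken from the earlier slides. It uses Python's `xml.etree.ElementTree`, which supports only a subset of XPath; this is a choice made for the example, not the tooling used in the GeM project, and the `<base>` wrapper element is an assumption.

```python
import xml.etree.ElementTree as ET

# Assumed wrapper element <base>; units are taken from the base-file slide.
BASE_XML = """<base>
  <unit id="u-21.6" src="gannet.jpg" alt="gannet-photo"/>
  <unit id="u-21.7">Huge (90cm) unmistakable seabird.</unit>
  <unit id="u-21.11">Sexes similar.</unit>
</base>"""

LAYOUT_XML = """<layout-root id="page-21">
  <layout-leaf xref="header-21"/>
  <layout-chunk id="body-21">
    <layout-leaf xref="lay-21.2"/>
    <layout-leaf xref="lay-21.3"/>
  </layout-chunk>
  <layout-leaf xref="page-no-21"/>
</layout-root>"""

base_root = ET.fromstring(BASE_XML)
layout_root = ET.fromstring(LAYOUT_XML)

# Query 1: all base units that have a src attribute, i.e. images.
image_ids = [unit.get("id") for unit in base_root.findall(".//unit[@src]")]

# Query 2: all leaves that sit directly inside a layout chunk.
chunk_leaves = [
    leaf.get("xref") for leaf in layout_root.findall(".//layout-chunk/layout-leaf")
]

print("image units:", image_ids)
print("leaves inside chunks:", chunk_leaves)
```

Queries that need the full XPath language, such as those using axes or functions, would require a more complete XPath processor than the one in the standard library.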
Conclusions
• The annotation scheme allows detailed annotation of complex page-based documents
• Regularities can be sought using complex XPath queries
• The system is open-ended and extensible without any redefinition of existing resources
Ongoing Work
• Further collection and ongoing annotation of corpus
  – http://purl.org/net/gem
• Use of results for criticism of document design and for exploring the relation between layout and rhetorical structure
  – Delin/Bateman: Document Design, 2002
• Use of XPath queries within sequences of extensible stylesheet transformations (XSLT) for automatic document generation
  – Henschel/Bateman/Delin: KONVENS 2002, Saarbrücken
Future Work
• Extension of annotation schemes
  – current violations of the grid area model are handled by relative offsets; a more flexible approach is needed
  – non-rectilinear grids for more complex design
  – consideration of dynamic elements, animation, etc.
• Extension of genres considered
  – advertisements
  – scientific documents
• Extension of languages considered
Thank you!