ENG L501 TEXT ENCODING WORKSHOP 16 SEPTEMBER 2010

Looking back…

It is important to remember the introduction of the World Wide Web in the mid-1990s in evaluating the statements of computing humanists, for prior to the Web's arrival, while a great deal was written on the uses of and audiences for electronic text, almost no one foresaw such a powerful tool for the wide distribution of electronic texts, or that wide distribution for a general reading public would become the most successful use made of electronic texts (Willett).

Looking back, cont’d

Given the chronology of the Internet's genesis, the initial 1987 meeting of the TEI and the creation of the Poughkeepsie Principles show sharp foresight on the part of the first individuals and institutions that took part in developing the encoding guidelines.

Still looking back…

Main types of e-text projects:
• Concordance and Text Retrieval Programs
• Literary Analysis
• Linguistic Analysis
• Stylometry and Attribution Studies
• Textual Critical and Electronic Editions
• Dictionaries and Lexical Databases

For consideration…

… scholars developed a vision of the importance of electronic texts for the humanities, and developed the standards by which they are created (Willett).

Willett asks: What is Electronic Text?
• an electronic transcription of a literary text
• an encoded text

Sometimes there is an apparent divide between the ideas behind the creation of text and its use.

Text Encoding Overview: Introduction to TEI

• Motivations for text encoding
• Principles governing text encoding
• Advantages of text encoding
• Challenges with text encoding
• Introduction to the Text Encoding Initiative (TEI)

Motivations for Text Encoding

• Store information
  • Access
  • Preservation
• Share information
  • Searching/Browsing
  • Interoperability & Portability: Harvesting/Repurposing
• Analyze information
  • Linguistic analysis
  • Concordances
• Visualize information
  • Interactive timelines
  • Map-based interfaces

Principles Governing Text Encoding

• Representing the text (a.k.a. descriptive or document-centric markup)
  • Structural
    • Text divisions (chapters, sections, etc.), paragraphs, lists, tables, line groups, lines, etc.
  • Semantic
    • Metadata for the electronic document and for the source document
    • References to people, places, events, organizations, etc. within the text (phrase-level)
  • Stylistic
    • Typographic features like bold, italics, small caps, indentations, etc.

TEI in Action: Indiana Magazine of History

• Indiana Magazine of History
• Swinburne Project

Advantages of Text Encoding

Re-use and flexibility: build once, use many

Presentation and output of text controlled by style sheets (e.g., generate different views of the same text and different formats: PDF, HTML, etc.)

The document and the markup can serve as an object of analysis and increased discoverability
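
To make "build once, use many" concrete, here is a minimal sketch in Python that renders one encoded fragment in two different ways. It is only an illustration: in TEI projects this job is normally done with style sheets (for example XSLT), the fragment omits the TEI namespace for brevity, and the function names `to_html` and `to_plain_text` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment (no namespace, for brevity).
SOURCE = """<div type="chapter">
  <head>Chapter 1: The Manor House</head>
  <p>It was going to be a long week.</p>
</div>"""


def to_html(div):
    """Render the division as a small HTML fragment (no escaping; sketch only)."""
    parts = ["<h1>%s</h1>" % div.findtext("head")]
    parts += ["<p>%s</p>" % p.text for p in div.findall("p")]
    return "\n".join(parts)


def to_plain_text(div):
    """Render the same division as plain text with an underlined heading."""
    head = div.findtext("head")
    lines = [head, "=" * len(head)]
    lines += [p.text for p in div.findall("p")]
    return "\n".join(lines)


div = ET.fromstring(SOURCE)
html_view = to_html(div)
text_view = to_plain_text(div)
print(html_view)
print(text_view)
```

The encoded source is written once; each view is just another function applied to the same structure.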

Challenges with Text Encoding

Presentation is variable (difficult to predict); structure, however, is constant

Text encoding is not necessarily simple data entry/capture; interpretation and/or research are often at play

Text encoding is not neutral or objective (thus the need for specific encoding guidelines to govern encoding projects)

Text encoding is a strategic representation of the text (made more complicated by level of faithfulness to the source text)

Often, there’s more than one way to encode a particular aspect of the text

Introduction to the Text Encoding Initiative (TEI)

• Technically: a standards organization for humanities text encoding
• Organizationally: an international membership consortium
• Socially: a community of people and projects
• For our purposes: a set of guidelines and XML specifications

Quickie Introduction to XML

XML, or eXtensible Markup Language, is a meta language for creating markup languages suited for different tasks, domains, and disciplines.

An XML markup language consists of "tags" used to define the structure and other features of a text.

XHTML:
• <p>(paragraph of text)</p>
• <img src="buffy.jpg" />
• <a href="http://www.indiana.edu">(link text)</a>

TEI:
• <sp who="#rosamond"> (speech)
• <lg> (line group, stanza)
• <p>(paragraph of text)</p>

XML Key Terms

• elements are the basic, named structural units of an XML document (nouns of encoding)
  <title>The Odyssey</title>
• attributes are name/value pairs (name="value") associated with elements (adjectives of encoding)
  <creator type="author">Homer</creator>
  • An element may have multiple attributes
• DTDs and Schema: DTDs (Document Type Definitions) and Schema define the rules that govern a particular type of XML document. They declare elements and attributes and the allowable content for those elements and attributes (grammar rules)
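
The element/attribute vocabulary above can be seen by parsing the slide's own `<creator>` example. This is a small sketch using Python's standard `xml.etree.ElementTree` module; the second attribute, `n="1"`, is an assumption added only to show that an element may carry more than one attribute.

```python
import xml.etree.ElementTree as ET

# The slide's example element, plus a hypothetical second attribute (n="1").
element = ET.fromstring('<creator type="author" n="1">Homer</creator>')

name = element.tag           # the element name (the "noun")
attributes = element.attrib  # name/value pairs (the "adjectives")
content = element.text       # the text inside the element

print(name, attributes, content)
```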

XML: Anatomy of an Element

XML Representation: Boxes

XML Representation: Tree

XML Representation: Markup

XML: Well-formed

A well-formed document follows the basic rules of XML. These rules include:

• Open and close all tags
• Empty-element tags end with /> (e.g. <pb />)
• There is a single root element
• Elements may not overlap
• Attribute values are quoted
• < and & are only used to start tags and entity references
• Only the five predefined entity references are used (unless a DTD declares others): &amp; &lt; &gt; &apos; &quot;
• Plus more…
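
Several of these rules can be tried out with any XML parser, since a parser must reject input that is not well-formed. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it raises a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = {
    "properly nested": "<p>A <hi>short</hi> line<pb/></p>",
    "unclosed tag": "<p>A short line",
    "overlapping elements": "<p><hi>A short</p></hi>",
    "two root elements": "<p>one</p><p>two</p>",
    "unquoted attribute": "<div type=chapter></div>",
    "bare ampersand": "<p>Fay & Dinah</p>",
    "escaped ampersand": "<p>Fay &amp; Dinah</p>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, "->", "well-formed" if ok else "NOT well-formed")
```

Only the first and last samples pass; each of the others breaks one of the rules listed above.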

XML: Validity

A valid document is well-formed and also conforms to the rules of a DTD or Schema, which adds further constraints on the available elements and attributes and on the allowable content of those elements and attributes.

• lexicon or available vocabulary: elements & attributes
• grammar for how the lexicon is used: rules for nesting, sequencing, etc.
  • e.g., a paragraph can be inside a chapter, but a chapter cannot be inside a paragraph
  • e.g., a chapter must begin with a heading followed by at least one paragraph
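
The two example grammar rules can be mimicked with a few lines of Python. This is a hand-written toy check, not a real validator: actual validation is done by a validating parser against a DTD or schema (ElementTree itself does not validate), and the element names `chapter`, `head`, and `p` and the function `follows_grammar` are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET


def follows_grammar(chapter):
    """Toy check of two rules: a chapter starts with <head> then <p>,
    and no <chapter> appears anywhere inside a <p>."""
    children = list(chapter)
    starts_with_head = len(children) >= 1 and children[0].tag == "head"
    then_paragraph = len(children) >= 2 and children[1].tag == "p"
    no_chapter_in_p = all(
        p.find(".//chapter") is None for p in chapter.iter("p")
    )
    return starts_with_head and then_paragraph and no_chapter_in_p


good = ET.fromstring("<chapter><head>One</head><p>Text.</p></chapter>")
no_head = ET.fromstring("<chapter><p>Text.</p></chapter>")
nested = ET.fromstring(
    "<chapter><head>One</head>"
    "<p><chapter><head>Two</head><p>More.</p></chapter></p>"
    "</chapter>"
)

print(follows_grammar(good), follows_grammar(no_head), follows_grammar(nested))
```

All three samples are well-formed XML; only the first would also count as valid under these two rules.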

Introduction to the TEI Guidelines and Tag Set

• TEI Guidelines: Quick Overview
• TEI P5 Guidelines
• TEI Basic Components
• Basic Markup: Prose
• Basic Markup: Verse
• Basic Markup: Drama
• Basic Markup: Letters

TEI Guidelines: Quick Overview

Text Encoding Initiative (TEI) / Guidelines for Electronic Text Encoding and Interchange (TEI)

The TEI Guidelines "are addressed to anyone who works with any text in electronic form. They provide means of representing those features of a text which need to be identified explicitly in order to facilitate processing of the text by computer programs" (Sperberg-McQueen).

TEI provides elements, attributes, and other mechanisms for encoding prose, poetry, drama, dictionaries, critical apparatus, linguistic corpora, and other scholarly and non-scholarly texts.

• Can be applied strictly or loosely
• Can adapt to local conditions
• Designed as a set of modules/mechanisms that can be selected as needed

TEI P5 Guidelines

• P5 Guidelines: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html
  • Prose documentation with examples
• P5 Tag/Element Set: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/REF-ELEMENTS.html
  • “Data dictionary” view of the tag set with examples and relevant links to prose documentation

TEI P5: Basic Components

• <TEI>: The root element of a TEI document
• <teiHeader>: The metadata header for a TEI document. Includes bibliographic, technical, administrative, and other metadata about the digital file and the analog source, if one exists.
• <text>: The text itself, e.g., the title page and chapters of a novel, the acts and scenes of a drama, the books or cantos of a long poem. The <text> element is further subdivided into:
  • <front>: Front matter, e.g., the title page(s), table of contents, potentially a preface or dedication.
  • <body>: The main body of a document, excluding front and back matter.
  • <back>: Back matter, e.g., indices, appendices.
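
Put together, these components give a document skeleton like the one in the sketch below. The titles and placeholder text are invented for illustration; the namespace `http://www.tei-c.org/ns/1.0` is the TEI P5 namespace, and the header shows only the minimal `<fileDesc>` parts. The Python wrapper simply parses the skeleton and lists the pieces in order.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical minimal TEI P5 document; all titles and text are placeholders.
SKELETON = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Hypothetical Novel: an electronic edition</title></titleStmt>
      <publicationStmt><p>Unpublished workshop sample.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration; no print source.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <front>
      <titlePage><docTitle><titlePart>A Hypothetical Novel</titlePart></docTitle></titlePage>
    </front>
    <body>
      <div type="chapter"><head>Chapter 1</head><p>First paragraph.</p></div>
    </body>
    <back>
      <div type="appendix"><head>Appendix</head><p>Supplementary note.</p></div>
    </back>
  </text>
</TEI>"""

root = ET.fromstring(SKELETON)

# Strip the "{namespace}" prefix that ElementTree puts in front of each tag name.
top_level = [child.tag.split("}")[1] for child in root]
text_parts = [child.tag.split("}")[1] for child in root.find("tei:text", TEI_NS)]

print(top_level)
print(text_parts)
```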

TEI P5: Basic Markup: Prose

• <div>: (division) is used for basic structural divisions of a text, e.g., volumes, chapters, sections, cantos, tables of contents, indices, appendices, etc. The @type attribute may be used to designate the type of the division.
  <div type="chapter">…</div>
  <div type="section">…</div>
  <div type="contents">…</div>
  <div type="canto">…</div>
• <head>: (heading) contains any type of heading, for example the title of a section, or the heading of a list, figure, table, etc.
• <p>: (paragraph)
• <pb>: (page break) marks the boundary between one page of a text and the next

TEI P5: Basic Markup: Prose

Chapter 1: The Manor House

Charles hadn’t visited the manor house since Easter, 1955, and now he remembered why. “Hullo”, he called out as he walked up the drive, and then, as if to himself, “To be or not to be?, to walk or not to walk...oh, hang it all!” His meditation on Hamlet was interrupted as he collided with a peacock. “Sacré bleu!” he exclaimed with irritation, his sang-froid completely deserting him. It was going to be a long week. His catalog of irritations included:

1. The weather
2. The peacocks
3. His meager grasp of French

TEI P5: Basic Markup: Prose
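
The encoded version shown on the original slide did not survive extraction, so what follows is only one possible encoding of part of the Manor House passage, wrapped in a short Python check. Beyond `<div>`, `<head>`, `<p>`, and `<pb>`, it assumes the TEI elements `<q>`, `<foreign>`, `<list>`, and `<item>`; the page number `n="1"` and the `type="ordered"` value are illustrative choices, and the passage is abridged.

```python
import xml.etree.ElementTree as ET

# One possible (not authoritative) encoding of an abridged version of the passage.
PROSE = """<div type="chapter">
  <pb n="1"/>
  <head>Chapter 1: The Manor House</head>
  <p>Charles hadn’t visited the manor house since Easter, 1955, and now he
    remembered why. <q>Sacré bleu!</q> he exclaimed with irritation, his
    <foreign xml:lang="fr">sang-froid</foreign> completely deserting him.
    It was going to be a long week. His catalog of irritations included:</p>
  <list type="ordered">
    <item>The weather</item>
    <item>The peacocks</item>
    <item>His meager grasp of French</item>
  </list>
</div>"""

chapter = ET.fromstring(PROSE)
heading = chapter.findtext("head")
items = [item.text for item in chapter.iter("item")]
foreign_words = [f.text for f in chapter.iter("foreign")]

print(heading)
print(items)
print(foreign_words)
```

Note that the list numbers and the quotation marks around the exclamation are not typed into the text: the markup carries that information, and a style sheet can regenerate them on output.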

TEI P5: Basic Markup: Verse/Poetry

<lg>: (line group) contains a group of verse lines functioning as a formal unit, e.g. a stanza, refrain, verse paragraph, etc. The @type and @subtype attributes may be used to classify the type of line group

<l>: (line) contains a line of verse

TEI P5: Basic Markup: Poetry/Verse

HEART-ECHOES FROM OLD SHELBY.

"HEART-ECHOES FROM OLD SHELBY!"
Down the swiftly flying years
Comes a gentle retrospection
That it fills mine eyes with tears,
Bearing with it sainted mem'ries
Of the days departed long,
Thrilling all the halls of being
Like the cadence of a song!

"HEART-ECHOES FROM OLD SHELBY!"
Olden visions bring to me,
And the dear forms rise in rapture
That I've longed so much to see,
When the burdens that I've carried
Have produced a deadened spot,
And the tears of disappointment
Have o'erflooded, blistering hot!

TEI P5: Basic Markup: Poetry/Verse
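
The slide's encoded version is not preserved, so here is a sketch of how the two stanzas might be marked up with `<lg>` and `<l>`, wrapped in a short Python check. The `type="poem"` and `type="stanza"` values and the `n` numbering are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# One possible encoding: each stanza is an <lg>, each verse line an <l>.
POEM = """<div type="poem">
  <head>HEART-ECHOES FROM OLD SHELBY.</head>
  <lg type="stanza" n="1">
    <l>"HEART-ECHOES FROM OLD SHELBY!"</l>
    <l>Down the swiftly flying years</l>
    <l>Comes a gentle retrospection</l>
    <l>That it fills mine eyes with tears,</l>
    <l>Bearing with it sainted mem'ries</l>
    <l>Of the days departed long,</l>
    <l>Thrilling all the halls of being</l>
    <l>Like the cadence of a song!</l>
  </lg>
  <lg type="stanza" n="2">
    <l>"HEART-ECHOES FROM OLD SHELBY!"</l>
    <l>Olden visions bring to me,</l>
    <l>And the dear forms rise in rapture</l>
    <l>That I've longed so much to see,</l>
    <l>When the burdens that I've carried</l>
    <l>Have produced a deadened spot,</l>
    <l>And the tears of disappointment</l>
    <l>Have o'erflooded, blistering hot!</l>
  </lg>
</div>"""

poem = ET.fromstring(POEM)
lines_per_stanza = [len(lg.findall("l")) for lg in poem.findall("lg")]
refrain = [lg.find("l").text for lg in poem.findall("lg")]

print(lines_per_stanza)
print(refrain)
```

Once the stanzas and lines are explicit, questions such as "how long is each stanza?" or "which stanzas open with the refrain?" become simple queries on the markup.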

TEI P5: Basic Markup: Drama

<sp>: (speech) contains individual speech in a performance text, or a passage presented as such in a prose or verse text.

<speaker>: contains a specialized form of heading or label, giving the name of one or more speakers in a dramatic text or fragment.

<stage>: (stage direction) contains any kind of stage direction within a dramatic text or fragment.

TEI P5: Basic Markup: Drama

Scene 1
Enter Fay
Fay: I say, Dinah, has anyone seen my gloves?
Enter Dinah
Dinah: No, miss, perhaps the parakeet has got them again?
Exit Fay and Dinah

TEI P5: Basic Markup: Drama
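
Since the slide's encoded version is missing, the sketch below gives one possible encoding of the short scene with `<sp>`, `<speaker>`, and `<stage>`, wrapped in a Python check. The `who` pointers (`#fay`, `#dinah`) assume character identifiers declared elsewhere in the file, and the `type` values on `<stage>` and `<div>` are illustrative choices.

```python
import xml.etree.ElementTree as ET

# One possible encoding of the scene; the who="#..." targets are hypothetical IDs.
SCENE = """<div type="scene" n="1">
  <head>Scene 1</head>
  <stage type="entrance">Enter Fay</stage>
  <sp who="#fay">
    <speaker>Fay</speaker>
    <p>I say, Dinah, has anyone seen my gloves?</p>
  </sp>
  <stage type="entrance">Enter Dinah</stage>
  <sp who="#dinah">
    <speaker>Dinah</speaker>
    <p>No, miss, perhaps the parakeet has got them again?</p>
  </sp>
  <stage type="exit">Exit Fay and Dinah</stage>
</div>"""

scene = ET.fromstring(SCENE)
speeches = {sp.findtext("speaker"): sp.findtext("p") for sp in scene.iter("sp")}
directions = [stage.text for stage in scene.iter("stage")]

print(speeches)
print(directions)
```

The colon after each speaker's name is left out of the encoding on purpose: it is presentation, which a style sheet can add back.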

TEI P5: Basic Markup: Letters

• <opener>: groups together dateline, byline, salutation, and similar phrases appearing as a preliminary group at the start of a division, especially of a letter.
• <closer>: groups together dateline, byline, salutation, and similar phrases appearing as a final group at the end of a division, especially of a letter.
• <dateline>: contains a brief description of the place, date, time, etc. of production of a letter, prefixed or suffixed to it as a kind of heading or trailer.
• <salute>: (salutation) contains a salutation or greeting at the opening or in the closing of a letter, preface, etc.
• <signed>: (signature) contains the closing salutation, etc., appended to a foreword, dedicatory epistle, or other division of a text.

TEI P5: Basic Markup: Letters
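
No sample letter survives in these slides, so the sketch below uses an invented letter (its wording, date, and names are hypothetical) to show how `<opener>`, `<closer>`, `<dateline>`, `<salute>`, and `<signed>` might fit together. The `<date when="...">` element is an additional TEI element assumed here for the date.

```python
import xml.etree.ElementTree as ET

# Invented letter for illustration only; not taken from the workshop materials.
LETTER = """<div type="letter">
  <opener>
    <dateline>The Manor House, <date when="1955">Easter, 1955</date></dateline>
    <salute>Dear Dinah,</salute>
  </opener>
  <p>The peacocks are worse than ever.</p>
  <closer>
    <salute>Yours in irritation,</salute>
    <signed>Charles</signed>
  </closer>
</div>"""

letter = ET.fromstring(LETTER)
opener_parts = [child.tag for child in letter.find("opener")]
closer_parts = [child.tag for child in letter.find("closer")]
signature = letter.findtext("closer/signed")

print(opener_parts)
print(closer_parts)
print(signature)
```

Here `<salute>` appears twice, once in the opener and once in the closer, while `<signed>` holds only the signature itself.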

Hands-on Exercises: Basic Genres

https://wiki.dlib.indiana.edu/confluence/display/vwwp/Brief+Genre+Exercises

• Open “genre examples” in a new tab or window in your browser
• Launch Oxygen
• Steps are in the wiki, but you can follow me, too.
• Complete exercises one at a time: Prose, Verse, Drama and Letters
• Save file: USB Flash Drive or Oncourse Drop Box