

DIGITAL LIBRARIES: THE SYSTEMS ANALYSIS PERSPECTIVE

Library as virtual abbey

Robert Fox

Notre Dame University Libraries, Notre Dame, Indiana, USA

Abstract

Purpose – The purpose of this paper is to explore the current state of the Text Encoding Initiative (TEI) community and to suggest directions in which that community should strive, based on recommendations from experts in the field.

Design/methodology/approach – Looks at the history, present state and future of TEI.

Findings – This column is simply exploratory, and examines issues regarding the TEI and the TEI consortium.

Practical implications – TEI is a very robust and expressive markup language used in the analysis of literature in the humanities fields. The community is encouraged to take proactive steps to ensure that TEI remains a viable markup language for the next 20 years, at least.

Originality/value – This column examines the enormous contribution that TEI has made to the humanities fields and explores ways in which the usage of TEI, even by non-experts, can be expanded in order to enrich scholarship.

Keywords Markup languages, Digital libraries

Paper type Conceptual paper

Beginning around the ninth century in Europe, and for the following 400 or so years, a renaissance occurred that would have an unprecedented impact on western civilization, the influence of which has been felt in the literary world ever since. It was during this period, the early portion of which is known as the Carolingian Renaissance, that dramatic changes occurred in the way in which important texts were copied and composed. Prior to this era, there were no standards for manuscript copying and editing, which made future transcription very difficult for both scholars and copyists. Following the Carolingian reforms (instituted under the emperor Charlemagne, d. 814), new standards were put into place which dictated a new kind of writing style known as Carolingian minuscule. This style established a new method of writing letters such that each letter was made uniform, with rounded portions, capital letters at appropriate points, and a standardized way of writing words and sentences, which greatly increased the legibility of texts. There may have been political motives behind this reform (after all, it’s important for the emperor’s subjects to be able to read imperial decrees), but the overall effect had an incredible impact in the realm of scholarship both then and now. For the first time, texts could fairly easily be compared for veracity against an autograph (the original text), writing styles could be analyzed, and thoughts could be better organized. Today, the bulk of our knowledge of classical texts is derived from texts written using this minuscule script, and more refined styles that appeared later in history. Modern typefaces have their roots in Carolingian minuscule.

A similar but hardly known or marketed enterprise is taking place in the digital world. The Text Encoding Initiative (TEI) is in many ways the modern equivalent to the standardization of manuscript copying and editing, and in the library and scholarly world, it seems to have even broader implications. Whether the TEI will have as great an impact on the literary and scholarly world is uncertain at this stage, but it is nonetheless one of the many uncelebrated enterprises in the digital library field, due not only to its value as a comprehensive tool for textual analysis but also to the amazing and largely unrecognized amount of work that has been done with the TEI and digital texts. This is very akin to the work done throughout Europe in the Middle Ages to preserve and illuminate (both artistically and in a scholarly fashion) important texts in the western canon, in abbeys and monasteries where monks quietly labored at an extremely vital task.

OCLC Systems & Services: International digital library perspectives, Vol. 24 No. 2, 2008, pp. 80-86. © Emerald Group Publishing Limited, 1065-075X. DOI 10.1108/10650750810875421. The current issue and full text archive of this journal is available at www.emeraldinsight.com/1065-075X.htm

I recently had the pleasure of attending the 20th anniversary Conference of the TEI Consortium at the University of Maryland, and it was very impressive to see so many people from such a wide range of backgrounds and scholarly interests coming together to celebrate and promote the TEI standard. It was a very humbling experience in many ways because of the incredible amount of work that has been done in analyzing texts and applying scholarly principles in a technological field using XML and a very complex, well thought out schema for analyzing digital texts of every variety. As has been discussed in this journal and elsewhere, the TEI standard is an amazingly expressive tool which is used for the analysis of texts of literally every kind, whether those texts include irregular typesetting, illustrations, illuminations, tables, photographs, etc. The standard has the ability to analyze poetic style, text organization, editions of works, linguistic styles and more. While a great quantity of work has been done using this tool, there is so much more that could be done. The work is obviously labor intensive and requires expertise in both history and literary analysis, which is why it is the product of both experts in the markup language field and top-notch scholars. It was interesting to note, though, that there was also at this conference an air of uncertainty about the future of the TEI. While it’s clear that the TEI has a great deal of potential and demonstrated value, what remains uncertain is the level at which it will be adopted by future scholars, librarians and archivists, and this was a common theme at the TEI@20 conference. Part of the uncertainty, for better or worse, seems to stem from the history of the TEI and how it was developed, as well as how it has been maintained over the last 20 years.

History of a standard

The TEI began as a response to the lack of standards when digital documents began to proliferate over twenty years ago. This was new technology which brought along with it its share of apprehension on the part of librarians and archivists. Given the lack of standards, it was doubtful that any serious fruit could be harvested from the enormous diversity of tools and software being used to create and maintain digital texts. It was in 1987 that an NEH-sponsored event was held at Vassar College[1] that eventually culminated in 1990 with the first TEI standard (the “P1” standard). Almost every year following that initial standard a new standard was released, such that by 1994 the TEI was up to version P3. It’s important to realize that at this stage, the TEI was not an XML standard. That did not arrive until June 2002 with the P4 version. In the intervening years, a new TEI Consortium had been formed, which was incorporated in the year 2000. This consortium has been guiding the development of the standard since then up to the present version, P5. With P5, a host of new features has been added to the standard, allowing for the analysis of almost any textual format. It is now at such a level of complexity that it requires training to use the markup in an appropriate manner, or at the very least, to take advantage of the standard’s expressive character.

The great weight of scholarly effort has gone into the P5 standard, and it is now more useful than ever for analyzing historical and highly complex literary texts as well as editions of the same work. While TEI has grown as a valuable asset to the humanities research community, the complexity of the markup has risen to a level where it has become opaque to the new user. It is very rare to find a comprehensive introduction to the use of the TEI in any format, even as strides are being taken to remedy the situation. The keynote speaker on the first day of the TEI@20 Conference, B. Tommie Usdin, delivered a very incisive message to the TEI consortium: become relevant or fade away. Tommie Usdin (as she prefers to be addressed) has a very long and respected history in the markup language world. She was involved in the development of some of the initial tools used to edit and maintain SGML (the precursor to HTML) and has watched closely as the standards for HTML and XML have developed over the last 20-30 years. Her experience far precedes the commercial utility of the web, when HTML/SGML took on a character vastly different from what Timothy Berners-Lee had originally intended (Berners-Lee and Fischetti, 1999). In a sense, the TEI has taken an opposite sort of trajectory: instead of being exploited by commercial enterprise, the TEI consortium has become very insular, according to Ms Usdin. If TEI is to remain relevant, Tommie Usdin suggests, the community must become amenable to the needs of newcomers and must consider the issue of marketing and instruction.

Digital scribes in action

It is beyond doubt that TEI has demonstrated its usefulness within the scope of its original conception. The examples are far too numerous to expound upon, but a few here should be sufficient to demonstrate the point. The depth of analysis and the enhancement to scholarship that can be accomplished when TEI is used to its greatest potential are truly amazing. In the humanities field, TEI is analogous to the use of data mining techniques for a data warehouse, depending upon the granularity to which a text has been marked up. In fact, fascinating and intricate relations between ideas within a text and between texts can emerge with the analytical detail that TEI offers.

The MONK (Metadata Offer New Knowledge) project (http://monkproject.org) grew out of two sister projects, the Nora project (www.noraproject.org/) and the Wordhoard project at Northwestern University (http://wordhoard.northwestern.edu/userman/index.html). It aims to merge these two fascinating initiatives into a combined effort, with the goal of making text mining and analysis more accessible to humanities scholars. The project documentation offers an analogy, describing MONK as an effort to analyze texts comprehensively in the same way that medieval monks created concordances for the Bible. In this way, patterns, themes and relations between words and phrases can be discerned using the power of a computer application with the expertise of a humanities scholar. This is only possible because enormous amounts of effort have gone into parsing out these documents with TEI. While it’s possible that other standards or independent analytic techniques could be utilized, they cannot possibly approach the same level of detail that TEI offers.

One of the goals of the MONK project is to provide scholars with a visual representation of the relations between ideas in texts, which will be more intuitively discernible. This provides the scholar with a “bird’s eye view” of a set of texts which could lead to further analysis and exploration. The MONK project provides, on its wiki, some examples of how this technique could be used. For example, the goal could be an examination of texts to determine how readers identify genre or themes based on the overall syntactical methods employed in a given set of literature. This could involve a multi-layered study with Bayesian statistical analysis, a detailed examination of lexical features, or straightforward numerical studies. Or, a college seminar class may be interested in exploring the rhetorical usage of explicitly political and/or theological themes in a particular set of genres during the sixteenth century. Once a list of keywords or phrases has been assembled, they can use MONK to apply those criteria to a sample of texts in order to analyze patterns of usage. The true intelligence behind such an application, though, is the previous work that has been done in TEI, parsing out documents using scholarly expertise, and it is highly doubtful that an algorithm could be developed which could independently perform as accurate an assessment of a text, due to the complex nature of linguistics and contextual usage. This goes far beyond simple keyword searches within a compiled index. In that sense, the effort of the scholars employing TEI is truly analogous to the work of the monks from the eleventh to the thirteenth centuries.
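The seminar exercise described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MONK's implementation: the sample fragment, the genre labels in the `type` attribute, and the function name `keyword_counts_by_genre` are all hypothetical, and the fragment omits the `teiHeader` that a complete TEI document would carry. It shows how markup supplied by a scholar (here, the genre of each `div`) lets a simple keyword tally be grouped in a way a plain index could not.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# TEI P5 elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample: two short texts whose genre was recorded by an encoder.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="sermon" n="1">
        <p>The prince must obey the law, for grace and law are joined.</p>
      </div>
      <div type="pamphlet" n="2">
        <p>No prince stands above the commonwealth.</p>
      </div>
    </body>
  </text>
</TEI>"""


def keyword_counts_by_genre(tei_xml, keywords):
    """Tally keyword occurrences per genre, using each div's type attribute.

    Assumes the divs are not nested; nested divs would be counted twice.
    """
    root = ET.fromstring(tei_xml)
    wanted = {k.lower() for k in keywords}
    counts = {}
    for div in root.iterfind(".//tei:div", TEI_NS):
        genre = div.get("type", "unknown")
        # Flatten the div to plain text and split it into lowercase words.
        words = re.findall(r"[a-z]+", "".join(div.itertext()).lower())
        tally = counts.setdefault(genre, Counter())
        tally.update(w for w in words if w in wanted)
    return counts


if __name__ == "__main__":
    for genre, tally in keyword_counts_by_genre(SAMPLE, ["prince", "law", "grace"]).items():
        print(genre, dict(tally))
```

The counting itself is trivial; the value comes from the encoded genre labels, which is the point made above about scholarly markup being the real intelligence behind such tools.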

In a similar fashion, the Perseus Digital Library has made great strides with classical texts (www.perseus.tufts.edu). They have even gone a step further by wedding the textual analytics with GIS in order to reveal relations between texts over “time, space, and language”. The project has also made an effort to be as cross-disciplinary as possible, in order for scholars to communicate and share knowledge at a level previously prohibitive. In the area of linguistic study, they have linked all of the available versions of a text with tools such as lexicons, concordances, word frequency charts, and geographic locations. One of the primary mechanisms behind this functionality is TEI markup. The latest version of the Perseus tool set (Perseus 4.0) includes the ability to extract well-formed XML fragments of primary sources in TEI-conformant markup, if desired.

The Perseus Library demonstrates just how broad and deep the applicability of TEI actually is. The project’s work to combine and interlock various kinds of data is now immense, and its success is demonstrated by, for example, one “humble” testbed of TEI-encoded documents comprising over five million words, 10,000 scans of illustrations and 2,400 pictures of London (within the London Bolles collection) (Crane, 2000). Crane (2000) states:

The techniques that we had developed in our work on Greco-Roman Perseus did, as we had hoped, constitute a solid foundation for this project, while the richness of the data allowed us to experiment with new ways of representing and visualizing people, things, space and time.

In many ways, this project and others like it are just scratching the surface of what could be done in the world of TEI if more institutions could embrace the technology and use it to its full potential.


Barriers to adoption

Another leitmotif of the TEI@20 Conference was a further weakness of the TEI community: the distinct lack of TEI examples for educational purposes. Melissa Terras, an instructor from University College London, gave a very engaging talk on this subject, presenting a plea as someone who is on the front lines of education within the TEI community. Her students are primarily professionals who work in the humanities, librarians or other information specialists who do day-to-day work with increasing amounts of digital material and need tools such as TEI to help them accomplish their work. Her plea is very similar to Tommie Usdin’s: the TEI community needs to become more open to those new to the TEI and markup in general.

A new generation of information workers is entering the field who do not have the extensive background that existing TEI users have built up over the course of the last 20 years, and so there is a dire need for training and education both in the theory behind the TEI and by example. Ms Terras made a very convincing case that there need to be more tutorials at all levels of usage, from beginner to advanced. And, not only is this information required, but it should also be provided by the TEI community itself. Not wasting any time, Ms Terras has been very active in this regard. She has created a site called “TEI by example” (www.kantl.be/ctb/project/2006/tei-ex.htm), which includes a software toolkit as well as documentation of the methodology and workflow of the project. The site also provides PDF versions of the tutorials along with the software toolkit, as well as a downloadable CD-ROM image for course participants. This is a great beginning as a remedy to the problem of education and TEI adoption.

Tommie Usdin and Melissa Terras also chastised the TEI community for not creating a web site that is user friendly. A redesign just before the conference was an attempt to correct this issue, but there is still a distinct lack of officially supported examples and education materials that are accessible from the primary TEI site. The only page on the site that provides such material, and it is not immediately obvious, is a “support” page with links to introductions to XML and one link to tutorials. However, the tutorials page doesn’t include a link to Melissa Terras’ “TEI by example” site. While there is a small amount of information, much of it assumes some prior knowledge, and none of it appears to be a sufficient introduction, for example, to the latest P5 standard. Ms Terras’ point, though, is that without a sufficient quantity of material giving an overview (by example) of how to use TEI, the complexity of the standard will remain a significant barrier to adoption.

The future of TEI

What will the next 20 years hold for TEI? The TEI (as an initiative) fills a gap within the scholarly community that cannot be filled in like manner using another markup standard. The enhancements for scholarly research are a celebrated fact among specialists who know and use the standard. And, even for those who are unaware of the standard, TEI undergirds many projects that have become a gift to the academic community. No one wants to see happen to the TEI what happened to Z39.50 and other standards that expired under their own weight. Given that there is very little, if any, competition in this arena, it is highly unlikely that a replacement for TEI will come into being, but there is a danger that its usage will become marginalized. As Tommie Usdin and Melissa Terras have suggested by word and deed, there are several potential remedies for the situation.

Among the potential solutions is the effort to educate along a broad front. Melissa Terras does this primarily among non-specialists in the field, and it seems necessary that this be expanded. While TEI was developed among humanities experts and developed for those experts, more than just technologists and scholars need to understand how to use TEI. In order for the richness of the markup to be fully exploited, education on how to apply TEI to texts needs to occur with many individuals who are not experts, and who can apply that knowledge to a vast quantity of texts. It may be possible then to create collaborative efforts within and between projects such as MONK and Perseus. There will always be a need for the humanities scholar to create a framework in which the markup of texts can occur, and there will also be a need for editing work. But given the vastness of the texts that could be analyzed, both at a work level and at an edition level, more folks need to be educated in the complexity of the various sub-areas (such as manuscript analysis) in TEI.

The second area, which has already been pointed out by Melissa Terras, is the provision of examples. There is a need for examples in the various genres of literature that can be marked up, such as medieval manuscripts, poetry, prose from various historical periods, literature with illustrations, irregular text blocks, etc. It is not as though there are no examples available, and work that has been accomplished could certainly be used as prototypes, but completed work is usually at such a level of complexity that it is inaccessible to beginners. Therefore, examples which demonstrate technique at a very simple level are needed in order to get people started. As folks progress, it would then be possible for them to consult mentors in the field as the work becomes more complicated.
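To give a sense of what a very simple starting example might look like, the sketch below encodes one stanza of verse and reads it back with Python's standard library. It is an illustration only, not taken from any existing tutorial: the stanza is the opening of William Blake's “The Tyger”, the `lg` (line group) and `l` (verse line) elements and the `type` and `n` attributes are standard TEI, and the function name `verse_lines` is a hypothetical choice.

```python
import xml.etree.ElementTree as ET

# TEI P5 elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A single stanza: <lg> groups the lines, each <l> holds one verse line.
POEM = """<lg xmlns="http://www.tei-c.org/ns/1.0" type="stanza" n="1">
  <l n="1">Tyger Tyger, burning bright,</l>
  <l n="2">In the forests of the night;</l>
  <l n="3">What immortal hand or eye,</l>
  <l n="4">Could frame thy fearful symmetry?</l>
</lg>"""


def verse_lines(tei_fragment):
    """Return (line number, text) pairs for every <l> in the fragment."""
    root = ET.fromstring(tei_fragment)
    return [
        (line.get("n"), "".join(line.itertext()).strip())
        for line in root.iterfind(".//tei:l", TEI_NS)
    ]


if __name__ == "__main__":
    for number, text in verse_lines(POEM):
        print(number, text)
```

Even at this level the markup records something a plain transcription does not, namely where each verse line and stanza begins and ends, which is the kind of structure more advanced encoding builds upon.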

Tommie Usdin also suggests that TEI be “marketed” to a certain extent in order to make scholars and technologists more aware of its potential. For years, collaboration and communication on TEI and the resulting consortium occurred mainly between specialists at participating institutions, and very little communication took place outside of that context. One primary means of communication is, of course, the TEI web site. While this has been recently revised and ongoing work is being done, there are still areas where usability is an issue. Several presenters at the TEI@20 Conference commented on this issue, and the maintainers of the site are aware of it. Scholars also have a responsibility to share the details of their work with colleagues and encourage participation in projects that involve TEI. In that way, familiarity with TEI will dispel potential “technophobia” and also demonstrate what can be done with the standard.

Finally, just as easy-to-use tools have been developed to help in the creation and editing of HTML and other XML documents, new tools which incorporate a bit of intelligence regarding the standard’s syntax also need to come on the scene. This will make adoption by inexperienced users easier, and lower the threshold of complexity. A few presentations were given at the TEI@20 Conference, my own among them, regarding the various tools that are currently in existence for working with TEI, both on the formulation end and on the presentation end. And, while there are tools available that do basic validation and other related tasks, there are none tailored exclusively to TEI which take into account the many versions that may be present in existing TEI documents. It would be very helpful to have tools that could create a dynamic document map, along with error correction and syntax auto-completion. Such tools exist for many programming languages and they certainly exist for HTML, but again, given the nature of the TEI community, it has not yet attracted individuals with both the comprehension of TEI and the requisite programming skills to create such a tool (or set of tools).
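As a rough indication of how small the core of a document map could be, the following Python sketch walks a TEI document and lists its `div` elements with their nesting depth and heading, reporting an error if the input is not even well-formed XML. It is a minimal sketch under assumptions, not an existing TEI tool: the sample document and the function name `document_map` are hypothetical, it checks well-formedness only (not validity against a TEI schema), and it does not attempt the error correction or auto-completion mentioned above.

```python
import xml.etree.ElementTree as ET

# Qualified-name prefix for elements in the TEI P5 namespace.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample with one nested section.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="chapter">
        <head>History of a standard</head>
        <div type="section">
          <head>From P1 to P5</head>
          <p>Versions P1 through P5.</p>
        </div>
      </div>
      <div type="chapter">
        <head>Barriers to adoption</head>
      </div>
    </body>
  </text>
</TEI>"""


def document_map(tei_xml):
    """Return (depth, type, heading) for each <div>, in document order.

    Raises ValueError if the input is not well-formed XML.
    """
    try:
        root = ET.fromstring(tei_xml)
    except ET.ParseError as err:
        raise ValueError(f"not well-formed XML: {err}") from err

    entries = []

    def walk(element, depth):
        for child in element:
            if child.tag == TEI + "div":
                head = child.find(TEI + "head")
                title = "".join(head.itertext()).strip() if head is not None else "(untitled)"
                entries.append((depth, child.get("type", ""), title))
                walk(child, depth + 1)  # nested divs sit one level deeper
            else:
                walk(child, depth)  # other elements do not add a level

    walk(root, 0)
    return entries


if __name__ == "__main__":
    for depth, kind, title in document_map(SAMPLE):
        print("  " * depth + f"{title} [{kind}]")
```

A real tool would need to go much further, for example by validating against the schema for the particular TEI version a document uses, which is where the combined expertise discussed above becomes necessary.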

It is definitely worth the time and effort on the part of the TEI user community and the TEI consortium to invest in these areas. While a great deal of passion has gone into the development of the standard, it is now time to branch outward and embrace a larger audience. This will help to ensure that TEI does not ultimately become a niche standard, and will expand its utility to many more areas of academics and research. If those steps are taken, then there is definitely much to look forward to in the next 20 years of TEI encoding.

Note

1. Please see www.tei-c.org/About/history.xml for a more complete history of the TEI.

References

Berners-Lee, T. and Fischetti, M. (1999), Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by its Inventor, Harper, San Francisco, CA.

Crane, G. (2000), “Designing documents to enhance the performance of digital libraries: time, space, people and a digital library on London”, D-Lib Magazine, Vol. 6 Nos 7/8, available at: www.dlib.org/dlib/july00/crane/07crane.html
