
Transcriber: Development and use of a tool for assisting speech corpora production

Claude Barras a,*, Edouard Geoffrois b, Zhibiao Wu c, Mark Liberman c

a Spoken Language Processing Group, LIMSI-CNRS, BP 133, 91403 Orsay Cedex, France
b DGA/CTA/GIP, 16 bis av. Prieur de la Côte d'Or, 94114 Arcueil Cedex, France

c LDC, 3615 Market Street, Suite 200, Philadelphia, PA 19104-2608, USA

Accepted 2 August 2000

Abstract

We present "Transcriber", a tool for assisting in the creation of speech corpora, and describe some aspects of its development and use. Transcriber was designed for the manual segmentation and transcription of long duration broadcast news recordings, including annotation of speech turns, topics and acoustic conditions. It is highly portable, relying on the scripting language Tcl/Tk with extensions such as Snack for advanced audio functions and tcLex for lexical analysis, and has been tested on various Unix systems and Windows. The data format follows the XML standard with Unicode support for multilingual transcriptions. Distributed as free software in order to encourage the production of corpora, ease their sharing, increase user feedback and motivate software contributions, Transcriber has been in use for over a year in several countries. As a result of this collective experience, new requirements arose to support additional data formats, video control, and a better management of conversational speech. Using the recently formalized annotation graphs framework, adaptation of the tool towards new tasks and support of different data formats will become easier. © 2001 Elsevier Science B.V. All rights reserved.

Résumé

Nous présentons "Transcriber", un outil d'aide à la création de corpus de parole, et nous décrivons des éléments de son développement et de son utilisation. Transcriber a été conçu pour permettre la segmentation manuelle et la transcription d'enregistrements de nouvelles radio-diffusées de longue durée, ainsi que l'annotation des tours de parole, des thèmes et des conditions acoustiques. Cet outil très portable, reposant sur le langage de script Tcl/Tk et des extensions telles que Snack pour les fonctionnalités audio et tcLex pour l'analyse lexicale, a été testé sur différents systèmes Unix et sous Windows. Le format de données respecte le standard XML avec un support d'Unicode pour les transcriptions multilingues. Distribué sous licence libre pour encourager la production de corpus, faciliter leur échange, augmenter le retour d'expérience des utilisateurs et motiver les contributions logicielles extérieures, Transcriber est utilisé depuis plus d'un an dans plusieurs pays. Suite à cette utilisation, de nouveaux besoins sont apparus comme le support de formats de données supplémentaires, de la vidéo, et un meilleur traitement de la parole conversationnelle. En utilisant le modèle des graphes d'annotation formalisé récemment, l'adaptation de l'outil vers de nouvelles tâches et le support de différents formats de données sera facilité. © 2001 Elsevier Science B.V. All rights reserved.

Speech Communication 33 (2001) 5–22
www.elsevier.nl/locate/specom

This work was done while the first author was working with DGA.
* Corresponding author. Tel.: +33-1-69-85-80-61; fax: +33-1-69-85-80-88.

E-mail address: [email protected] (C. Barras).

0167-6393/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved.
PII: S0167-6393(00)00067-4


Keywords: Transcription tool; Speech corpora; Broadcast news; Linguistic annotation formats

1. Introduction

Speech research has long been conducted using small- or medium-sized databases recorded in controlled conditions. Until a few years ago, they often consisted of short duration recordings, and the speech was read by or elicited from a well-identified speaker. For read speech, orthographic transcription was not much of a problem since the content was known in advance. The need to transcribe appeared with spontaneous speech, but for short duration recordings made in a controlled environment transcription was easy, and a classical text editor associated with a simple sound player was generally enough.

With the advent of work on long duration recordings of uncontrolled speech, the situation has changed. Navigation in a long duration recording becomes an issue, as well as time-alignment of the annotations with the signal. Additional information like background conditions, speaker turns or overlapping speech should be indicated along with the orthographic transcription. Further annotations can be needed by new research areas like named entities or topic detection. Therefore, new tools are required. Furthermore, for large quantities of data, productivity becomes a concern and can be increased by ergonomic tools.

In the framework of the Defense Advanced Research Projects Agency (DARPA) programs, the Linguistic Data Consortium (LDC) has produced several hundred hours of manually transcribed Broadcast News data, and developed specific tools and internal know-how for this production. There is now a growing need for producing similar data in other places. For instance, a project for transcription and indexing of multilingual Broadcast News started at the French Délégation Générale pour l'Armement (DGA) in 1997. A software environment was needed for creating the necessary corpora. After examination of existing solutions, it appeared that no available transcription software completely filled the needs, and it was decided to develop a new tool. The development of "Transcriber" started at the DGA in coordination with the LDC in late 1997, and the first release was presented in May 1998 (Barras et al., 1998). Since then, development went on and new features have been added according to the needs, until reaching a stable state. In addition, the experience acquired while using the tool and the desire to address new tasks have raised more scientific issues related to the format and the structure of the annotations. This article describes the current status of the tool, the experience gained and some future directions.

In the next section, we present the major requirements identified for the tool and explain why existing annotation tools could not fulfil our needs. Section 3 describes the main features of Transcriber and the format of the transcriptions, and explains the main implementation choices. Some experience of using the tool is presented in Section 4. Future directions and format evolution are discussed in Section 5.

2. Motivations

2.1. Data characteristics

A tool for the manual transcription of large amounts of radio and television soundtrack recordings was needed in order to create corpora and develop automatic speech recognition systems for indexing and retrieval of Broadcast News in several languages. The DARPA Broadcast News transcription task started in 1995, with the first formal evaluation campaign in 1996 (Stern, 1997), and a project on the same task started at DGA in 1997.

The Broadcast News task was the first wide-scale effort to address speech which has not been produced specifically for research purposes. Recordings can have durations from several minutes to several hours. Annotations have to provide the following information:
· an orthographic transcription along with a precise description of all audible acoustic events, including hesitations, repetitions, vocal non-speech events and external noises;
· a division into speech turns, with an identification of the speaker for each turn;
· a division into larger sections, such as "stories", including a clear separation of advertising and news sections;
· an indication of variations in transmission channel or acoustic background conditions.

Turns, section boundaries and changes of acoustic conditions have to be temporally localized. The orthographic transcription also needs to be precisely and frequently synchronized with the speech signal (breakpoints can be located at pauses, breaths, sentences or any other convenient places), thus defining shorter segments. There are frequent portions of overlapping speech in spontaneous dialogs which need to be addressed. All these features imply some specific requirements for the annotation tool.

2.2. Requirements

The main requirement is to allow the user to manage long duration signals and input the various annotations described in the previous section as efficiently as possible. We also wanted a tool which can be easily installed and used.

2.2.1. User interface

Transcribing audio or video recordings is a very time-consuming task. It is usually done by educated native speakers of the language with no specific skill in computer science. Therefore, a transcription tool should mimic as much as possible the user interfaces of standard office software, so as to reduce training time. Its use should be intuitive, in order to lower the cognitive load and decrease error rates. In particular, it must provide an easy and intuitive association between the time course of the speech signal and the textual representation of the transcription and other annotations. Users should find it easy to navigate within either the audio stream or the textual transcription. Navigation and modification in either domain should automatically translate into appropriate changes in the other domain, and the methods for creating links between text and time must be easy and intuitive. In addition, fast response is crucial: regardless of the interface design, a tool will not be accepted by users unless it responds quickly to user actions (McCandless, 1998).

Two features deserved special attention. First, in order to help navigation into the signal and segmentation, a cursor on the waveform should show the current position in the signal even while listening, i.e., the cursor should move in synchronization during playback. This feature is not straightforward to implement in a portable way. Second, the user should not experience any delay when navigating in long duration signals, i.e., the display of such signals, including scrolling and zooming, should be very fast and responsive. This feature requires specific optimizations.

2.2.2. Multilingual transcriptions

In the framework of a multilingual indexing project, support for multiple languages is needed. Several aspects are involved: keyboard input, character display with specific issues on bi-directional scripts (for languages like Arabic), and internal data encoding with adequate file input/output. The localization of the interface is also useful, though less critical.

2.2.3. Easy deployment

We wanted a tool that would work on inexpensive computers, in order to reduce the cost per workstation. This implies that the interface should remain reactive even with limited computing power. More generally, we wanted a portable tool which could be easily installed on already existing computers and environments, and in particular which works on most Unix systems and on Windows.

To further ease deployment, the tool should not be encumbered by proprietary licence issues, both for ourselves and for potential partners. Of course, using free software also reduces the per-user cost.

2.3. Existing annotation tools

We first considered using existing transcription tools. One of the most well-known tools for signal analysis is Entropic's product ESPS/waves+ (formerly known as Xwaves), which efficiently manages signal and spectrogram displays and allows the user to edit a segmentation of the signal (e.g., at the phonetic level or at word level). However, it is not adapted to the transcription of broadcast news or of spontaneous conversations. For the transcription of multilingual telephone conversations and broadcast news recordings in the framework of the DARPA programs, the LDC developed a tool based on an interface between waves+ and the Emacs text editor. The resulting tool runs on Unix workstations, and requires a significant amount of training and supervision, since users must learn basic Unix skills, basic Emacs skills, and basic waves+ skills. The Entropic "annotator" product has similar characteristics. These solutions were unsatisfactory because of the issues of user training and supervision, and hardware and software expense.¹

¹ In addition, following the acquisition of Entropic by Microsoft, its product line of speech tools has been terminated, so that future availability of software relying on waves+ is compromised.

Another, independent, annotation tool (named TNG) was developed at the LDC in Java a few years ago. However, compared with waves+, the waveform display and its update in response to user requests were relatively sluggish, so that it required a high-end workstation to be usable. It could not display a moving cursor during playback, and the first version of Java could only support 8-bit mu-law audio. Furthermore, the status and the licensing policy of Java and of some libraries needed for the user interface or audio management remained unclear for a long period. This direction was thus not pursued.

Many speech research laboratories have developed software for their own needs and some of them have released these tools publicly (with varying licensing schemes). First versions of the OGI CSLU Toolkit (Schalkwyk et al., 1997) included Lyre, a signal viewer with some segmentation capabilities. SFS tools from University College, London (Huckvale, 1987–1998) are a set of powerful programs for speech processing, including display, but not designed for interactive user interfaces. The Spoken Language Systems Group from MIT has described the architecture of their speech analysis and recognition tool SAPPHIRE (Hetherington and McCandless, 1996), which includes graphical tools; the design of SAPPHIRE seems promising but the tool is not publicly available. The EMU Speech Database System from Macquarie University (Sydney) is a collection of software tools for developing and extracting data from speech databases, including the creation of hierarchical and sequential labels of speech utterances (Cassidy and Harrington, 2001). The CHILDES system developed at Carnegie Mellon University provides tools for studying conversational interactions and for linking transcripts to digitized audio and video (MacWhinney, 2000), and large databases are available in the associated CHAT coding.

Since this first overview in late 1997, new tools have appeared. The problem of synchronization between ethnographic speech data and related annotations has been addressed by the LACITO Archive project (Jacobson et al., 2001); the tool SoundIndex, initially written for the Macintosh platform, is used for time-alignment. The Institute for Signal and Information Processing (ISIP, Mississippi State University) provides several public domain software packages in the field of speech recognition and signal analysis, and the same group released Segmenter, a graphical tool to aid in performing segmentation and transcription of two-channel telephone speech data (Deshmukh et al., 1998); recent versions are also available for the Broadcast News task. The tool TransEdit has been developed at Carnegie Mellon University for the Windows platform (Burger, 1999). It was designed following speech annotators' requests, with flexibility and multimedia support in mind, resulting in a very user-friendly tool. A more complete survey of existing annotation tools is available online (Bird and Liberman, 1999–2000). Some of them have also been evaluated in the framework of the EC-funded MATE project, which started in March 1998 and aims to develop a standard for spoken dialogue corpus annotation, together with a related set of tools (McKelvie et al., 2001).

To summarize, a wide range of tools exist, but no solution adapted to our needs was available at the time of our choice. In particular, none provided really interactive management of long duration signals synchronized with the transcription. We therefore considered adapting existing tools. Solutions relying on commercial products or on software covered by restrictive licences could not be easily modified or redistributed. Among the freely available tools, some had interesting features, but were not designed for Broadcast News transcription. We tried to reuse components of existing tools, but it proved to be a difficult software re-engineering problem, and it soon appeared that it would be more efficient to start the development of a new tool.

2.4. Development and distribution of Transcriber

Development of Transcriber began in late 1997. In May 1998, a first release was made publicly available, presented at the LREC conference (Barras et al., 1998) and put into daily usage at DGA. We chose to distribute the tool as free software, under the GNU General Public License (Free Software Foundation, 1991). We mainly wanted to ease the production of speech corpora and encourage their sharing. We also believe in the efficiency of open source for software development (Stallman, 1998). Having developed a new tool, the additional cost of distributing it and maintaining a Web site is modest, and we expected an increase in user feedback and contributions from external developers.

Transcriber is now used in many places (at the time of writing, more than 60 persons from 17 countries have subscribed to the announcement mailing list), and we regularly receive valuable feedback from users. Since the first release, many new features have been implemented, portability and robustness have been improved, and the data format has been enriched, while always maintaining backward compatibility. The tool has reached a stable state, which we now describe.

3. Description of Transcriber

This section describes the user interface, with emphasis on the features relevant to the structure of speech annotations and specific to Transcriber, then presents the data formats, and explains some implementation choices.

3.1. User interface

The user interface of the tool comprises two main parts (cf. Fig. 1): a text editor in the upper half of the screen, and a signal viewer in the lower half of the screen, along with the temporal segmentation at the different levels. In between, a maskable button bar provides tape-recorder-like icons for signal playback and shows the name of the files currently being edited.

The interface appearance (fonts, colors, localization) and behavior (keyboard shortcuts, playback mode, etc.) are user-configurable. These configuration options can be saved. The file that the user is working on and the cursor positions can also be saved, so that the session configuration is automatically restored when restarting the tool. Users can thus resume their work as if they had not exited from the tool.

3.1.1. Text editor

The text editor allows for creating, displaying and editing the transcription. A transcription consists of plain text and various markers. Standard features of a text editor are provided: cut/copy/paste of the selection, find and replace, spell checking, and a limited undo. Markers are created using the menus or keyboard shortcuts and can be edited by clicking on them to pop up a dialog window.

Two types of markers can be distinguished. Some are used for structuring: the transcription is divided into segments, which are grouped into turns, which are themselves grouped into sections, and a change in acoustic background conditions can appear at any point in time (cf. Section 2.1). These markers bear time-stamps, which correspond to the boundaries in the segmentation displayed under the signal in the lower half of the screen. They are displayed in the text editor in different ways depending on their type (cf. Fig. 1):
· a new section is indicated by a button in the middle of a line with the topic name;
· a new speech turn is indicated by a button at the left of a line with the speaker name;
· the beginning of a segment in the orthographic transcription is indicated by a large dot to the left of a line; the text in the following paragraph belongs to that segment;
· a change in acoustic conditions is indicated by a music icon inside the text.

Fig. 1. Screen shot of the user interface.

Turns and sections have attributes, which can be edited by clicking on the button. The speaker associated with the turn can be chosen from a list of all existing speakers, or a new speaker can be created. Speakers' identities can be searched for in the transcriptions, and can also be imported from other transcriptions. A specific mechanism is provided for the annotation and transcription of overlapping speech involving two speakers. Similar functions are provided for the topics associated with the sections. Background conditions (appearance or disappearance of background conversations, music, electric noise or any other kind of noise) can also be edited by clicking on the icon.

Other markers can be inserted in the text for any non-speech event, short noise, lexical annotation, language change or free comment. An open list of predefined descriptions for each kind of event is proposed to the transcriber. The event descriptions are task-specific but can be modified. These markers bear a flag indicating the extent of the marker in the text. Some events do not extend over other words, e.g., most of the speakers' vocal non-speech sounds; by default they are displayed between square brackets, e.g., [i] for an inspiration. Other events do, e.g., external noises which often overlap with speech, or language changes; by default they are displayed in a slightly different way, e.g., [n-] ... [-n] for the beginning and end of a generic noise. These markers do not bear a temporal synchronization in the current implementation, but could in the future.

3.1.2. Signal display and playback

The signal is displayed under the text editor. The signal waveform can be interactively scrolled and zoomed, even during playback. A portion of the signal can be selected for zooming or for restricting playback to the selected region. Two views of the signal at different scales can be simultaneously displayed, which is useful for having a global view of the context in addition to a more precise, local view. When the audio file contains several channels, the waveforms are displayed in parallel.

Playback is controlled by tape-recorder-like buttons or by keyboard shortcuts. Various playback modes are provided, to suit the different stages of the transcription: continuous playback is useful for segmenting the signal, playback of the current segment for transcribing it, and continuous playback with a short pause at each segment boundary for verification. During playback, the cursor in the signal moves continuously in synchrony with the sound. This allows users to associate the location on the waveform with what they hear, and eases signal segmentation.

All functions remain available during playback. Users can thus annotate continuously. As playback can be controlled by keyboard shortcuts, they can also almost always keep the focus in the text editor. One exception is moving a boundary, which requires mouse dragging in the segmentation display in the lower half of the screen.

3.1.3. Signal segmentation

The temporal segmentations at the different levels (orthographic transcription, speech turn, topic change, acoustic conditions) are drawn under the signal and are synchronized with it during scrolling or zooming operations. The information associated with each segment is displayed entirely or partially according to the zoom level. Each segmentation level in each view can be independently masked at user option.

The segment boundaries can be edited by dragging them with the mouse. A new boundary can be inserted at the current cursor position using the menu or a keyboard shortcut (by default the return key, as a new line is created in the text editor). Since this is possible during playback, a rough segmentation can be quickly created by hitting a key at the desired segmentation points while listening. A more precise positioning of the boundaries can be achieved in a second phase, using the mouse to drag them to the correct positions.


A new speech turn or section can be inserted at any previously created boundary. Changes in acoustic background conditions can be inserted at any position, using specific commands. When a boundary is shared across levels, dragging it at one level automatically moves it at the other levels. Sequentiality of the time marks is always ensured. A boundary normally cannot be moved past its neighbors, but it can be forced to move further and push its neighbors accordingly.

3.1.4. Synchronization between text and signal

The text editor and the temporal segmentation under the signal can be considered as two different views of the same transcription object. Any change in the text editor is immediately displayed in the temporal segmentation. Two cursors are simultaneously active, one in the text editor (where text can be inserted in the transcription) and one in the signal viewer (where playback will start). Both cursors are synchronized and constrained to be always consistent, i.e., they always have to stay within the same temporal segment: as soon as one cursor moves to another segment, the other cursor automatically moves to the same segment, and the windows are automatically scrolled when needed. The current segment is highlighted both in the text editor and in the signal segmentation display. During playback, the text of the segment currently being played can thus be easily followed in the text editor. If the cursor is moved to another segment while listening, playback is interrupted and restarts at the beginning of the new segment.

3.2. Data format

The set of annotations includes not only the orthographic transcription, but also all the information about turns, speakers, sections, acoustic condition changes, and other events. These data need to be stored in a file, processed in various ways, and exchanged easily. The data format thus needs to be chosen carefully. It should as far as possible follow existing standards, or at least be easily converted to some of them.

3.2.1. File format

Obviously, Unicode, which is the most standard multilingual character encoding (The Unicode Consortium, 2000), should be supported. Unicode provides a unique encoding for every character in almost all existing languages, and thus allows texts in several languages to appear within a single document.

Besides, transcriptions are complex objects, and a structured machine-readable format is needed. We considered the Standard Generalized Markup Language (SGML) and its more recent subset, the Extensible Markup Language (XML) (Bray et al., 1998). Both allow a document to be structured as a tree. Each node of the tree contains a set of attributes with a value. The syntax used in the document can be specified in a document type definition (DTD). Tools exist for automatically ensuring the well-formedness and validity of a document, that is, that it correctly follows the SGML or XML syntax as well as its specific DTD. More importantly, SGML and XML are widespread standards, which helps sharing documents. In addition, they support Unicode character codes. Automatic processing of XML documents is much easier than for SGML, and thus XML was adopted.

3.2.2. DTD design

The format was designed to be backward compatible with a previous format used at the LDC for the DARPA Broadcast News evaluations. The transcriptions have three hierarchically embedded layers of segmentation (orthographic transcription, speaker turns, sections), plus a fourth level of segmentation (acoustic background conditions) which is independent of the other three (cf. Fig. 2). A global list of speakers along with their attributes is also managed inside a transcription, as is a list of topics. Fig. 3 shows a manually indented sample of a transcription file corresponding to the screen shot of Fig. 1.
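For orientation, here is a minimal sketch in the spirit of Fig. 3; the element and attribute names reflect the format described in this section, but the values and the transcribed text are invented for illustration:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE Trans SYSTEM "trans.dtd">
    <Trans audio_filename="demo" version="1">
      <Speakers>
        <Speaker id="spk1" name="presenter" type="male"/>
      </Speakers>
      <Episode>
        <Section type="report" startTime="0.0" endTime="12.3">
          <Turn speaker="spk1" startTime="0.0" endTime="12.3">
            <Sync time="0.0"/> first segment of the transcription
            <Sync time="6.1"/> second segment
          </Turn>
        </Section>
      </Episode>
    </Trans>

The three hierarchical layers appear as nested Section, Turn and Sync elements, while acoustic background conditions would be marked by additional elements independent of this hierarchy.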

In our case, the validation of a document is not enough to ensure its logical consistency; indeed, some properties – e.g., the fact that the "startTime" and "endTime" attributes must bear numerical values which are in increasing order, or that each of the four types of segmentation is constrained to be a partition of the whole signal – exceed the capabilities of a DTD and have to be verified afterwards in the application. Some of these issues could be addressed using Cascading Style Sheets (CSS) and the Extensible Stylesheet Language (XSL), which aim to provide more complex manipulations of XML files (Clark, 1999).
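As a minimal sketch of this kind of application-level check (written here in Tcl, the tool's implementation language; the procedure name is ours):

    # Verify that a list of boundary times is numeric and non-decreasing,
    # a constraint which cannot be expressed in a DTD.
    proc checkBoundaries {times} {
        set prev -1
        foreach t $times {
            if {![string is double -strict $t] || $t < $prev} {
                return 0    ;# inconsistent segmentation
            }
            set prev $t
        }
        return 1
    }
    puts [checkBoundaries {0.0 7.5 15.2}]    ;# prints 1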

Fig. 3. Sample of a transcription file.

Fig. 2. The four segmentation levels of a transcription.


The default event description provided with the tool is currently specific to the task and to the transcriber's language. Agreement could be reached on an international set of non-speech events or other annotations. This would ease the international exchange of produced corpora. However, deciding which annotations are language-independent is not straightforward, and the transcriber should remain able to add his or her own annotations.

In 1998, NIST designed a Universal Transcription Format (UTF) based on previous LDC formats for the production of the Hub-4 Broadcast News and Hub-5 Conversational speech corpora (NIST, 1998). Conversions between our format and UTF are partially lossy in both directions because of slightly different orientations (our format supports improved speaker characteristics but not yet the named entities optionally present in UTF). A version of Transcriber has been produced that can read, edit and write transcripts in the CHILDES format (MacWhinney, 2000). This involves a very different DTD, expressing a different (and considerably more elaborate) set of annotation categories. We aim to address the problem of making it easy to adapt Transcriber for use with a nearly unlimited variety of different annotation frameworks.

3.3. Implementation issues

This section presents the main development choices which were made, in line with the requirements.

3.3.1. Programming language and development mode

We were confronted with the choice of a language for the development. Over the last few years, there has been a growing interest in various scripting languages (Ousterhout, 1998). One of the most open and successful ones is Tcl/Tk, a multiplatform script language available for several Unix systems, Macintosh and Windows (Ousterhout, 1994). The syntax of the Tcl language is rather simple, but a complex user interface can be written in a few lines using the Tk graphical library. The absence of compilation significantly speeds up the development process, and computers have nowadays become powerful enough to provide rapid reactions even with interpreted applications. The need for C or C++ development is reduced to the critical or system-dependent parts, which can easily be interfaced with the Tcl script. Tcl/Tk was therefore chosen for the development of Transcriber. At the time of writing, Transcriber runs under several Unix systems (Linux, Solaris, SGI) and Windows, and a port to the Macintosh is under way.
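As an illustration of this compactness, a few lines of Tk suffice for the skeleton of a two-pane interface like the one described in Section 3.1 (the widget names and layout are ours, not the tool's actual code):

    package require Tk
    text .editor -wrap word -height 15            ;# transcription editor (upper half)
    frame .bar                                    ;# button bar in between
    button .bar.play -text Play
    canvas .signal -height 120 -background white  ;# waveform display (lower half)
    pack .editor -side top -fill both -expand 1
    pack .bar -side top -fill x
    pack .bar.play -side left
    pack .signal -side bottom -fill x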

Combined with the free distribution, the use of a scripting language allowed rapid prototyping with quick user feedback on the tool. Numerous functions were modified or added according to user requests. For example, the management of overlapping speech was changed several times in order to provide a more intuitive user interface. This development mode lasted over a year, with monthly updates.

3.3.2. Multilingual text editor

The standard Tk text widget was chosen for editing the transcription. Multilingual transcriptions are possible, since recent Tk versions manage the display of Unicode characters. We also considered the Emacs text editor, which is a free, powerful text editor and supports multilinguality; however, it would have become harder to provide an integrated tool with a consistent user interface.

Unicode characters are managed internally in Tcl, and can be easily re-mapped to various alternative encodings. However, we have not experimented widely with non-roman scripts. The main limitation on script choice at present is the Tk text widget, which cannot yet handle bi-directional text or general rendering of composite Unicode characters (e.g., with diacritics). However, we hope that these capabilities will be added, since multilinguality and Unicode support are high on the list of priorities for the developers of Tcl/Tk.

No generic architecture for input methods is currently available in Tcl/Tk. Keyboard configuration can often be handled at the operating system level; but if needed, it is easy to configure the tool to bind any keyboard combination to a given Unicode character, as illustrated below.
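For example, assuming the transcription editor is a Tk text widget named .editor, a binding of the following kind associates a key combination with a Unicode character:

    # Insert U+00E9 (é) when Control-e is typed in the editor.
    bind .editor <Control-e> {%W insert insert "\u00e9"; break}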


3.3.3. Interactive display of long duration waveforms

Since providing interactive display and playback of long duration signals was a high priority, scrolling and zooming of the waveform had to be achieved without freezing the interface, even on a low-cost computer.

A specific waveform display module has been developed for Transcriber. This time-critical part is written in C, and is optimized for interactive zooming and scrolling of the sound files without interrupting real-time output. The sound file is never loaded into memory, since a single hour of signal could easily exceed the available memory. The first time a long sound file is accessed, a low resolution temporal envelope of the waveform (minimal and maximal sample values for each 10 ms segment) can optionally be computed and stored on disk in order to speed up later display. In this case the display is computed using only the pre-computed envelope instead of the much bigger sound file. If the pre-computation of the envelope is disabled, the low-resolution display is disabled as well, to avoid any sluggish display. During scrolling, only the required part of the waveform is computed, not the whole display. Signal segmentation display has also been designed for efficiency. All these optimizations dramatically increase the interactivity of zooming and scrolling.
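The envelope computation itself is straightforward; here is a sketch in Tcl using the Snack extension (the production code is in C, and the file name is hypothetical):

    package require snack
    snack::sound snd
    snd read broadcast.wav
    set block [expr {[snd cget -rate] / 100}]    ;# samples per 10 ms
    set env {}
    for {set i 0} {$i < [snd length]} {incr i $block} {
        set end [expr {$i + $block - 1}]
        if {$end >= [snd length]} {set end [expr {[snd length] - 1}]}
        # store the min/max pair describing this 10 ms block
        lappend env [snd min -start $i -end $end] [snd max -start $i -end $end]
    }
    # $env now holds the low resolution envelope, ready to be cached on disk.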

As an option, remote sound file access is provided through a server controlled with sockets and specifically optimized for the tool, thus being more efficient than standard network file access. For signal display, the waveform is computed on the server and transmitted over the network, instead of accessing the whole signal through the network. This feature makes it possible to centralize all recordings on a server, allowing interactive remote access without duplication of resources. It is mainly intended for the consultation of remote archives.

3.3.4. Audio management with Snack

Synchronization of the cursor during playback usually requires low-level access to the audio driver, which can limit portability. Much time was spent during early development on reliable sound control, especially because of hardware or low-level operating system problems. The Snack audio extension provided a good solution to these multiplatform audio difficulties.

Snack is an extension for the Tcl/Tk scripting language which provides multiplatform audio management. It was developed by K. Sjölander at the KTH speech laboratory (Sjölander, 1997–2000; Sjölander et al., 1998). Most commonly used sound file formats are supported, playback is efficiently supported for Windows and several Unix systems including Linux, and it runs in the background while staying under the control of the application. These excellent technical characteristics, and the fact that it is distributed as free software, made Snack the obvious choice for multiplatform audio management, and it was thus adopted for use within Transcriber.
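A minimal illustration of the Snack API (the file name is hypothetical): a sound object is created, a file is loaded, and playback runs in the background while the Tcl event loop, and hence the user interface, stays responsive:

    package require snack
    snack::sound s
    s read broadcast.wav                        ;# load a sound file
    s play -command {puts "playback finished"}  ;# asynchronous playback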

3.3.5. Implementation of the parser

An XML parser was needed to make the interface between the application and the data, ensuring that any well-formed XML file will be correctly read or written. Furthermore, the production of valid documents according to their DTD is important for their automatic exploitation, and we therefore needed a validating parser. At the time of development, no free validating XML parser was available for Tcl/Tk. A parser was therefore designed using tcLex, a lexical analyzer generator extension to Tcl, distributed as free software (Bonnet, 1998–1999). Unicode encoding is supported and automatically detected upon reading. The internal representation of the transcription was chosen to consist mainly of the XML data structure, which as a result is always kept in memory and dynamically updated according to transcription modifications. Saving the transcription only requires a dump of the existing data. When a DTD is active, each modification of the XML data structure in memory is immediately validated, which ensures that saving the current XML image to a file will produce a valid XML file.

4. Experience

Transcriber has been used for the DGA project on Broadcast News for over a year. It has also been used by the French company VECSYS for several months in the framework of the European Language Engineering project OLIVE (de Jong et al., 2000). In this section, we report on the practical use in these two places, and on some of the experience gained. We describe the material which was transcribed, the working conditions and the productivity, and the transcription guidelines which were provided to the transcribers.

4.1. Material transcribed

The reference material for the DGA project consists of 20 h of morning news programs (7–9 a.m.) recorded in December 1998 (10 weekdays from 2 consecutive weeks) from the national French radio station "France-Inter". This choice was motivated by the fact that the distribution rights for this data could be obtained, and by the news-oriented but varied content. A typical 2-h program contains 3 news bulletins (for a total of about 50 min), specialized news (20 min), various chronicles (10 min), reviews of the French press and of the European press (15 min), interviews and live questions from listeners (20 min), and weather reports (5 min). The review of the European press was done by a non-native speaker, and contained, of course, a lot of foreign names and expressions.

The material transcribed by VECSYS included 15 h of radio recordings from the French programs "France-Inter" and "France-Info", and 65 h of television soundtracks from various channels in French and German (23 h of "Arte" programs in French, 30 h of "Arte" programs in German, and 12 h from the French channels "France 3", "France 2" or "TF1"). "Arte" programs consisted mostly of news bulletins and documentaries on social or political issues.

4.2. Working conditions

Two half-time transcribers were hired for the DGA project. They were educated, native French speakers. Both were given a PC (Pentium Pro 200 MHz) under Linux, with headphones and loudspeakers. Each one had to transcribe a set of 10 one-hour sound files copied to their hard disks. They worked in the same room and could share their experiences. They had dictionaries and lists of journalists' names at their disposal. They went to great lengths to find the correct spelling of proper names, despite the fact that a specific marking was available for uncertain orthography. They were informed in advance of the recording sessions that they would have to transcribe, and decided to get newspapers from the corresponding days. The European press review proved to be a difficult challenge, since foreign newspapers were more difficult to get. When they had completed a one-hour sound file, an additional verification was done in the presence of a speech researcher, in order to discuss the specific problems which arose. Further checking and normalizations were performed on the whole set of transcriptions, and the transcribers were given feedback about the errors.

Eight half-time native speakers of French and German produced the transcriptions for VECSYS. They started with a 15-day training period in the company; they were then provided with a PC running Linux, a modem and the sound files on a CD-ROM, and worked at home. They were also given lists of journalists' names, and paper drafts when available for the Arte programs; otherwise they relied on their own resources – for example, some did name spell checking via the Internet. The produced transcriptions were sent to the company by e-mail. They were verified and corrected by a person specializing in this task, who had use of all the necessary dictionaries.

4.3. Productivity

A monitoring function was added to the tool in order to be able to analyze the production of transcriptions and estimate the amount of work needed for the transcription of 1 h of material. This also answered a request from the users, who were interested in monitoring their own daily progress. Time spent using the tool was measured and recorded, along with various measures of the transcription task (number of temporal breakpoints, of speech turns, of words, etc.).

The total time needed for the production of 1 h of transcribed material, including careful verification of the transcription, amounted to around 50 h for both DGA transcribers. Interestingly, they did not follow the same strategy: the first one chose to segment and annotate the whole signal first, performing the orthographic transcription in a second pass; the second one did segmentation, annotation and transcription in parallel. The superiority of one strategy over the other could not be demonstrated. However, getting accurate segmentations took a lot of time. This was an indication that a good automatic segmentation of the signal into short segments might speed up the overall transcription work. We therefore gave the transcribers an automatically computed pre-segmentation into breath groups, produced by the LIMSI speech partitioning system (Gauvain et al., 1998), which they could modify as necessary, and they found it useful. Indications of speaker changes were also provided, but the transcribers found them more confusing than helpful. These are subjective appreciations from the transcribers, and further investigation is necessary before drawing conclusions.

Mean transcription time for the VECSYS experience also amounted to around 50 times real time, with a large disparity depending on the program. Radio news programs were easier, and television debates were much harder, due to frequent overlapping speech and the difficulty of speaker identification from the soundtrack only.

4.4. Transcription guidelines

Transcribers were provided with a written document describing the transcription guidelines, i.e., explanations about what should be annotated and how to annotate it. Initial guidelines were written by LIMSI. They were intentionally kept simple (and thus predictably incomplete) in their first version, and were augmented as necessary when specific questions arose.

The transcription guidelines covered the following topics:
· What should be annotated: orthographic transcription of the foreground; non-speech events and background noise conditions; speech turns with a precise identification of the speaker (name, gender, accent in the case of foreign speakers) and topics.
· What should not be annotated, such as transcription of commercials.
· How to add punctuation to increase readability without interfering with automatic processing.
· How to deal with numbers, spelled letters, unknown words, etc.
· How to mark pronunciation errors, truncated words, overlapping speech, noises, etc.
· How to mark utterances in foreign languages, or isolated foreign words or expressions.

Designing good guidelines proved to be far from straightforward. They have to meet several, sometimes conflicting, requirements: they must ensure usability for several types of automatic processing, while taking into account the readability of the transcriptions by humans; they must help the transcribers in ambiguous situations and standardize the expected annotations, without bothering them with too many conventions which might be difficult to remember, or causing time to be lost on fine details; they must cover most cases without becoming inconsistent. To summarize, they have to keep a good balance between completeness and simplicity.

In practice, the initial transcription guidelines have evolved to deal with the problems encountered during the sessions and the transcribers' questions. These concerned the use of capitalization, the spelling of acronyms, the marking of foreign words, etc. The tool itself also evolved accordingly, a good example being the management of overlapping speech.

4.5. Management of overlapping speech

Our priority was the transcription of single-channel broadcast news recordings for the training of speech recognition systems, and within this framework overlapping speech segments are currently discarded from further automatic exploitation. However, future tasks may use them. They make the transcription more complete, and it was judged less frustrating for the transcriber to be able to transcribe overlapping speech, whether this data will be used or not. Different situations were identified in the broadcast news task:
1. clear foreground speech with background speech – e.g., translation with the original foreign voice in the background: in this case, only the foreground voice had to be transcribed, with an acoustic condition marker indicating background speech;
2. limited interjections from other speakers (e.g., hum, yes...): they were indicated as instantaneous noises inside the main speaker's transcription;
3. a dialog between two speakers with frequent overlapping at the boundaries: when feasible, it could be transcribed using the specific mechanism for simultaneous speech described later;
4. more than two overlapping speakers: the transcribers were requested not to annotate these.

It proved to be difficult to provide an ergonomic user interface for overlapping speech. In a first implementation, the constraint that the segmentations should be a strict partition of the signal was relaxed, and the last speech segment of one turn could overlap with the first speech segment of the next turn (solution 1 in Fig. 4). The overlapping segments could be drawn in the temporal segmentation under the signal, but the resulting display in the text editor was confusing, because the two overlapping speech segments belonged to two separate speech turns and their simultaneity did not appear clearly enough. Several interfaces were tried and changed at the user's request before eventually choosing another representation (solution 2 in Fig. 4). The overlapping part is clearly marked as a speech turn with two speakers. Despite the creation of this artificial speech turn, this led to a more acceptable solution in the interface. In the text editor, the parallelism between the two utterances appears clearly (Fig. 1).

Fig. 4. Two solutions tested for the representation of overlapping speech.

In conversational speech, overlapping is often so common that this approach becomes problematic both for the transcriber and for the eventual user. In the case of telephone speech recordings, two simultaneous speakers are often well enough separated on the separate channels for automated processing to go forward without special source-separation algorithms. In this case, it is much easier for the transcriber to segment and transcribe each channel as an independent stream, and the result is also more easily assimilated by training or testing programs as well as by human users. This approach to the transcription of heavily overlapped speech, with a separate audio channel for each speaker (which is essentially the one that the LDC has been using), requires a different user interface as well as a different transcription specification. Providing such a solution in Transcriber is one of our goals for the future. Meanwhile, we understand that one user has solved the problem temporarily by running two simultaneous invocations of Transcriber, one for each channel! The resulting files are then merged (or split) automatically later on. A better solution will be to integrate the parallel streams of transcription under simultaneous program control.

4.6. Relevance of implementation choices

Looking back at the choices made, we feel that the use of a scripting language considerably sped up the development. The choice of Tcl is not mandatory; the Tk widget set has also been interfaced with the Perl scripting language, and for a development restricted to the Windows platform, Visual Basic would bring similar advantages. In a multiplatform framework, however, the availability of the Snack extension for audio management in Tcl would currently be a decisive argument for choosing Tcl again.

A validating XML parser has been developed for the tool in Tcl using the tcLex library. However, XML parsing and validation in an interpreted language proved to be rather slow, especially with Unicode support. The current version of the parser would not be suitable for reading a long annotation file with word-level synchronizations or even phonetic annotations. We are considering using another XML parser in the future, especially with the development of standard programming interfaces for the manipulation of XML documents (e.g., the Document Object Model or DOM, Wood et al., 1998). This would also reduce the maintenance workload for this part of Transcriber.
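For instance, with the tDOM extension (named here as one possible candidate, not a choice made for the tool), DOM-style access to a transcription like the one in Section 3.2 is compact:

    package require tdom
    set doc [dom parse [tDOM::xmlReadFile demo.trs]]   ;# file name hypothetical
    foreach turn [[$doc documentElement] selectNodes //Turn] {
        puts "speaker=[$turn getAttribute speaker] start=[$turn getAttribute startTime]"
    }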

Other limitations remain. The ``undo'' function should be improved to allow an unlimited number of undo steps. Right-to-left writing and bi-directional support, which are needed for some languages, seem difficult to implement correctly with the current version of the Tk text widget. Display of transcription files for material exceeding 1 h becomes slow in our configurations, mostly because of the numerous embedded buttons and images inside the text widget. Added to the parsing time, this can make launching the tool take several tens of seconds, and scrolling in the text editor also becomes somewhat less responsive. On the other hand, signal display remains perfectly responsive for signals of up to several hours. This second feature, combined with the permanent and fluid synchronization with the text editor, currently seems to be the most appreciated feature of the tool.
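
For the undo issue, one simple scheme (a sketch of a possible approach, not the current implementation) is to push an inverse script for every edit onto a stack of unbounded depth:

    set undoStack {}

    # Record the script that reverses an edit.
    proc recordEdit {inverseScript} {
        global undoStack
        lappend undoStack $inverseScript
    }

    # Undo the most recent edit, if any, by evaluating its
    # inverse script at the global level.
    proc undo {} {
        global undoStack
        if {[llength $undoStack] == 0} return
        set script [lindex $undoStack end]
        set undoStack [lrange $undoStack 0 end-1]
        uplevel #0 $script
    }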

5. Future directions

Though Transcriber has reached a stable state, its dissemination has prompted new needs. Users would benefit from further help such as automatic consistency checking, automatic alignment of transcriptions with the signal, video display, or variable-speed playback. New application domains call for increased flexibility in sound file management and annotation formats. This section presents these possible extensions.

5.1. Consistency checking

More tools are clearly needed for ensuring the consistency of the transcriptions. Help should be provided for checking the consistency of proper names across the various transcriptions. A user-defined glossary and editable shortcuts have been introduced in the tool at the request of users; however, this is not yet completely satisfactory. A mechanism of automatic completion using names previously written in all existing transcriptions (compiled by hand or even automatically) seems an interesting solution and remains to be implemented. Online dictionaries, encyclopedias, or even maps for place names should also be made easily available to the transcriber. The LDC uses external databases of names, accessed via client–server connections, and it will be useful for some applications to provide support for this feature.
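
A minimal sketch of such a completion mechanism, with hypothetical helper names and a global array standing in for the compiled name list, might look as follows:

    # Names harvested from existing transcriptions, by hand or
    # automatically, are stored as keys of a global array.
    proc addName {name} {
        global knownNames
        set knownNames($name) 1
    }

    # Return the known names starting with the typed prefix.
    proc completeName {prefix} {
        global knownNames
        set out {}
        foreach n [lsort [array names knownNames]] {
            if {[string match ${prefix}* $n]} { lappend out $n }
        }
        return $out
    }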

5.2. Automatic speech processing

Creating a pre-segmentation (cf. Section 4.3) or checking the transcription by aligning it automatically with the signal is currently done by researchers using independent, elaborate tools (a speech recognition engine, acoustic models and, for the alignment, a lexicon). It might be interesting to integrate such tools with Transcriber, for example to display the segments where poor alignment was detected. This might be useful for researchers, but could also, if the interface is user-friendly enough, be used directly by transcribers to check their transcriptions.



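For illustration, assuming the external aligner returns a confidence score for each segment, a small helper along the following lines (a sketch; the segment and score representation is ours, not Transcriber's) could list the stretches worth re-checking:

    # Each segment is a {startTime endTime score} triple; return
    # the intervals whose alignment score falls below a threshold.
    proc poorSegments {segments threshold} {
        set out {}
        foreach seg $segments {
            foreach {start end score} $seg break
            if {$score < $threshold} {
                lappend out [list $start $end]
            }
        }
        return $out
    }

    puts [poorSegments {{0.0 2.5 0.93} {2.5 4.1 0.42}} 0.6]
    # prints: {2.5 4.1}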

5.3. Multimedia

Speaker identification on television soundtracks is very difficult, because speakers are not introduced by the presenter in the same way as on the radio, their visual appearance generally being sufficient. In the short term, watching the video during the verification phase is an alternative (as has been the practice at the LDC). But the best solution to this problem would be to provide the complete video recording, not only the audio track. This would also ease the whole transcription process in the case of background noise. With the current development of video capabilities on standard computers, it can be hoped that easy technical solutions for interfacing the tool with a video player will be available in the near future. Such an interface will also be useful for other applications in which video recordings are to be transcribed or annotated, such as the study of gesture in communicative interaction.

5.4. Sound ®les management

Multiple sound files could be managed in a single transcription file. Specific functions should be available for multichannel sound files (e.g., telephone conversations as in the Switchboard task), for instance playback of one channel at a time. It might then become useful to extend the interface to manage multiple windows. Additionally, variable-speed playback (as is commonly available in analog tape-based transcription systems, and in some software systems) will help productivity by permitting faster ``proof-listening''.

5.5. Format evolution

In the project, most effort was initially devoted to the user interface. The format choice was rather conservative and derived from existing LDC formats which had already proved suitable for the broadcast news task. We also kept the single tree structure, which brought serious limitations for further extensions. The tool is also very sensitive to modifications of the DTD. This limitation is not due to the XML paradigm, which can be used for virtually any kind of data structure, but to the current implementation.

However, most of the user interface concepts in the tool which proved attractive to users are not specific to the broadcast news task, and it quickly appeared useful to open the tool to other formats. A large number of other formats are currently used in the field of speech research. As an attempt to better coordinate existing efforts, the Text Encoding Initiative (TEI) provided recommendations in 1994 for the transcription of written and also spoken materials in SGML (Sperberg-McQueen and Burnard, 1994); current efforts aim at adapting TEI to XML and expanding its coverage. The MATE project is also trying to provide a standard format for spoken dialogue annotation (McKelvie et al., 2001). Various existing annotation formats are referenced online (Bird and Liberman, 1999–2000).

As a first step, the tool was adapted to the CHAT coding used in the CHILDES system (MacWhinney, 2000). A large amount of transcribed conversational speech is available in this format, and some researchers studying language acquisition would be interested in a version of Transcriber devoted to their needs, as an alternative to existing tools. The DTD was extended with new tags and the source code had to be slightly modified for this task. But a more generic solution would be preferable, e.g., simply reading the DTD or any suitable formal description of the format and having the interface of the tool adapt automatically to the chosen format.

Current developments are based upon Bird and Liberman's work on annotation graphs (Bird and Liberman, 2001). They show that virtually any existing annotation can be viewed as a labeled acyclic graph, in which some nodes bear ordered time values, and they develop a complete formalism for annotation graphs. Within this framework, all segments of the transcriptions are stored as an unordered set of typed arcs between identified nodes.
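
As an illustration of this encoding (our sketch of the idea, not Transcriber's actual internal structures), a graph fragment can be held as a table of node times together with a flat list of typed arcs:

    # Nodes optionally anchored to times; n2 is unanchored here.
    array set nodeTime {n1 12.34 n2 {} n3 15.80}

    # Unordered set of typed arcs: {type label startNode endNode}.
    set arcs {
        {speaker {F. Dupont} n1 n3}
        {word hello n1 n2}
        {word world n2 n3}
    }

    # Retrieve every arc of a given type.
    proc arcsOfType {arcs type} {
        set out {}
        foreach arc $arcs {
            if {[string equal [lindex $arc 0] $type]} {
                lappend out $arc
            }
        }
        return $out
    }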

Switching to this framework for internal data management and for the reference transcription file format will lead to a much more generic tool, and conversion to other formats will become easier (Geoffrois et al., 2000). This does not preclude alternative formats, with time-ordered segments or in a human-readable form. For example, the internal format will no longer constrain new sections to impose a new turn, though such constraints can remain in the interface of the tool for a specific task.




6. Conclusions

We have presented Transcriber, a tool for assisting in the creation of speech corpora. It provides an intuitive and interactive interface for transcribing and annotating long duration signals.

Interface prototyping in a scripting language was shown to be an effective development approach when robust libraries are available. Being distributed as free software, our project has been followed by numerous speech scientists and engineers who gave valuable hints for further developments that made the tool much more portable and usable. A web site has been designed for the distribution of the tool, and an announcement list and a developer mailing list are in use. Our aim is to develop future versions together with potential co-developers in a modular fashion and through interactive dialog, taking full advantage of the open source development framework.

After more than one year of testing the system, we feel that Transcriber is suitable for large-scale production of speech resources. It is now used by several research and development teams in various countries. Our initial target was narrowly focused on broadcast news transcription, but the interest in the tool showed that other areas also need interactive tools that are easy to use. Future developments will use a generic data structure based on annotation graphs and provide multimedia extensions. This will lead to a much more user-configurable and task-configurable tool.

Acknowledgements

The design of the initial transcription conventions for French at LIMSI was coordinated by Martine Adda-Decker. Martine Garnier-Rizet coordinated the use of Transcriber within VECSYS and gave us valuable feedback on this. Snack's developer Kåre Sjölander was very helpful, always taking into account the changes which were needed for Transcriber. We are also glad to thank here all users and testers who gave us reports about their experience and their problems, since this helped us very much in improving the tool. We give many thanks to the transcribers for their patience in using a program under development. Finally, the reviewers and the editors gave numerous and always judicious comments on the structure and the style of this paper.

References

Barras, C., Geoffrois, E., Wu, Z., Liberman, M., 1998. Transcriber: a free tool for segmenting, labeling and transcribing speech. In: Proceedings of the First International Conference on Language Resources and Evaluation (LREC 98), Granada, Spain, 28–30 May 1998, pp. 1373–1376 (http://www.etca.fr/CTA/gip/Projets/Transcriber/).

Bird, S., Liberman, M., 1999–2000. Linguistic Annotation (http://www.ldc.upenn.edu/annotation/).

Bird, S., Liberman, M., 2001. A formal framework for linguistic annotation. Speech Communication 33 (1–2), 23–60.

Bonnet, F., 1998–1999. tcLex: a lexical analyzer generator for Tcl (http://www.multimania.com/fbonnet/Tcl/tcLex/index.en.htm).

Bray, T., Paoli, J., Sperberg-McQueen, C.M. (Eds.), 1998. Extensible Markup Language (XML) 1.0. W3C Recommendation. See also http://www.w3.org/XML/.

Burger, S., 1999. TransEdit – a transcription tool from transcribers for transcribers. In: Presentation at the Ninth International COCOSDA Workshop, Budapest, Hungary, 10 September 1999 (for information contact sburger@cs.cmu.edu).

Cassidy, S., Harrington, J., 2001. Multi-level annotation in the Emu speech database management system. Speech Communication 33 (1–2), 61–77. See also http://www.shlrc.mq.edu.au/emu/.

Clark, J. (Ed.), 1999. XSL Transformations (XSLT) Version 1.0. W3C Recommendation. See also http://www.w3.org/Style/.

de Jong, F., Gauvain, J.-L., Hiemstra, D., Netter, K., 2000. Language-based multimedia information retrieval. In: Proceedings of the Sixth RIAO Conference, Paris, France, 12–14 April 2000. See also http://twentyone.tpd.tno.nl/olive/.

Deshmukh, N., Ganapathiraju, A., Gleeson, A., Hamaker, J., Picone, J., 1998. Resegmentation of Switchboard. In: Proceedings of the Fifth International Conference on Spoken Language Processing (ICSLP'98), Sydney, Australia, 30 November–4 December 1998, pp. 1543–1546. See also http://www.isip.msstate.edu/projects/speech/software/.

Free Software Foundation, 1991. GNU General Public License (http://www.gnu.org/copyleft/gpl.html).

Gauvain, J.-L., Lamel, L., Adda, G., 1998. Partitioning and transcription of Broadcast News data. In: Proceedings of the Fifth International Conference on Spoken Language Processing (ICSLP'98), Sydney, Australia, 30 November–4 December 1998, pp. 1335–1338.

Geoffrois, E., Barras, C., Bird, S., Wu, Z., 2000. Transcribing with annotation graphs. In: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, 31 May–2 June 2000, pp. 1517–1521.

Hetherington, L., McCandless, M., 1996. SAPPHIRE: an extensible speech analysis and recognition tool based on Tcl/Tk. In: Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP'96), Philadelphia, USA, 3–6 October 1996, pp. 1942–1945.

Huckvale, M., 1987–1998. SFS Speech Filing System (http://www.phon.ucl.ac.uk/resource/sfs.html).

Jacobson, M., Michailovsky, B., Lowe, J.B., 2001. Linguistic documents synchronizing sound and text. Speech Communication 33 (1–2), 79–96. See also http://lacito.vjf.cnrs.fr/ARCHIVAG/ENGLISH.htm.

McCandless, M.K., 1998. A model for interactive computation: applications to speech research. Ph.D. Thesis, Massachusetts Institute of Technology.

McKelvie, D., Isard, A., Mengel, A., Møller, M.B., Grosse, M., Klein, M., 2001. The MATE workbench – an annotation tool for XML coded speech corpora. Speech Communication 33 (1–2), 97–112. See also http://mate.nis.sdu.dk/.

MacWhinney, B., 2000. The CHILDES Project: Tools for Analyzing Talk, third edition. Lawrence Erlbaum Associates, Mahwah, NJ. See also http://childes.psy.cmu.edu/.

NIST, 1998. A universal transcription format (UTF) annotation specification for evaluation of spoken language technology corpora (http://www.nist.gov/speech/hub4_98/hub4_98.htm).

Ousterhout, J.K., 1994. Tcl and the Tk Toolkit. Addison-Wesley, ISBN 3-89319-793-1. See also http://www.scriptics.com/.

Ousterhout, J.K., 1998. Scripting: higher-level programming for the 21st century. IEEE Computer Magazine 31 (3).

Schalkwyk, J., de Villiers, J., van Vuuren, S., Vermeulen, P., 1997. CSLUsh: an extensible research environment. In: Proceedings of the Fifth European Conference on Speech Communication and Technology (Eurospeech 97), Rhodes, Greece, 22–25 September 1997, pp. 689–692. See also http://cslu.cse.ogi.edu/toolkit/.

Sjölander, K., 1997–2000. The Snack Sound Visualization Module (http://www.speech.kth.se/snack/).

Sjölander, K., Beskow, J., Gustafson, J., Lewin, E., Carlson, R., Granström, B., 1998. Web-based educational tools for speech technology. In: Proceedings of the Fifth International Conference on Spoken Language Processing (ICSLP'98), Sydney, Australia, 30 November–4 December 1998, pp. 3217–3220.

Sperberg-McQueen, C.M., Burnard, L. (Eds.), 1994. TEI Guidelines for Electronic Text Encoding and Interchange (P3) (http://etext.lib.virginia.edu/TEI.html). See also http://www.tei-c.org/.

Stallman, R., 1998. The GNU project (http://www.gnu.org/philosophy/). In: DiBona, C., Ockman, S., Stone, M. (Eds.), 1999. Open Sources: Voices from the Open Source Revolution. O'Reilly, ISBN 1-56592-582-3.

Stern, R., 1997. Specification of the 1996 Hub-4 Broadcast News evaluation. In: Proceedings of the DARPA Speech Recognition Workshop, Chantilly, USA, 2–5 February 1997, pp. 7–14.

The Unicode Consortium, 2000. The Unicode Standard, Version 3.0. Addison-Wesley Longman, ISBN 0-201-61633-5. See also http://www.unicode.org/.

Wood, L., et al. (Eds.), 1998. Document Object Model (DOM) Level 1 Specification. W3C Recommendation. Available from http://www.w3.org/DOM/.
