31

Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

Diagnostic Evaluation in

Linguistic Word Recognition

Julie Carson�Berndsen and

Martina Pampel

Universit�at Bielefeld

Report ��August ����

Page 2: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

August ����

Julie Carson�Berndsen andMartina Pampel

Universit�at Bielefeld �UBI�Fakult�at f�ur Linguistik und Literaturwissenschaft

Universit�atsstr �Postfach �� �� �

��� Bielefeld

Tel� ����� ��� � ������e�mail� fberndsen� martinag�asl�uni�bielefeld�de

Geh�ort zum Antragsabschnitt� ��� Interaktive Phonologische Interpre�tation

Das diesemBericht zugrundeliegende Forschungsvorhaben wurde mit Mittelndes Bundesministers f�ur Forschung und Technologie unter dem F�orderkenn�zeichen �� IV ��� B gef�ordert Die Verantwortung f�ur den Inhalt dieserArbeit liegt bei dem Autor

Page 3: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

Contents

� Introduction �

� Problems of Evaluation in Linguistic Word Recognition �

� Diagnostic Evaluation �

� Logical Evaluation using a Data Model � � � � � � � � � � � � � �� Empirical Evaluation and Linguistic Word Recognition � � � � �

Bielefeld Extended Evaluation Toolkit for Lattices of Events �

�� Reference File Generation � � � � � � � � � � � � � � � � � � � � ��� Lexicon Generation � � � � � � � � � � � � � � � � � � � � � � � � ��� Lexicon Consistency Test � � � � � � � � � � � � � � � � � � � � � ���� Top�down Event Generation � � � � � � � � � � � � � � � � � � � ���� Linguistic Word Recognition � � � � � � � � � � � � � � � � � � � ���� Braunschweig Evaluation � � � � � � � � � � � � � � � � � � � � � ��� BELLE Evaluation � � � � � � � � � � � � � � � � � � � � � � � � �

� Open Issues �

Bibliography ��

Page 4: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

Abstract

This report is concerned with a new method of evaluation for theLinguistic Word Recognition component of the Verbmobil�Project�Architektur� A two stage model of diagnostic evaluation is presentedconsisting of logical and empirical evaluation steps� Logical evalua�tion is carried out according to a data model which acts as optimalinput in order that each component participating in the evaluationprocess can be tested for soundness and completeness� Inconsistenciescan thus be remedied before empirical evaluation of the model is un�dertaken using real data� The diagnostic evaluation method has beenoperationalised within the Bielefeld Extended Evaluation Toolkit forLattices of Events �BEETLE��

Page 5: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

� Introduction

This report is concerned with a new approach to evaluation which has beendeveloped in connection with Linguistic Word Recognition in the VerbmobilProject ���� Interactive Phonological Interpretation In the sections below�it is demonstrated how current evaluation procedures are not su�cient forcatering for the area of Linguistic Word Recognition Instead a new diagnos�tic approach to evaluation of individual components within a spoken languagerecognition system is presented which takes soundness and completeness is�sues into account The standard evaluation procedure allows for evaluationat the word and sentence level However� it is claimed here that evaluationis necessary at all levels of recognition and that although current Verbmobilevaluation software �� � �termed Braunschweig evaluation in the rest of thisreport� may be adapted to cater for evaluation at the syllable and phonemelevel� the constraint that the output lattices of the component must containat least one connected path imposes a restriction upon the components ofLinguistic Word Recognition which has negative e�ects and which stands inopposition to the aims of Interactive Phonological Interpretation

In Verbmobil Project ��� Interactive Phonological Interpretation� wordrecognition is performed by applying linguistic knowledge below the wordlevel Based on recent developments in the areas of phonology and morphol�ogy� a more �exible approach to word recognition is followed which is basedon the notion of events ��� � �� ��� ��� The motivation for the applicationof the event concept in the area of linguistic word recognition concerns theprojection problem at the phonetics�phonology interface ��� In Project ���temporal relations between phonological events form the basis of a grammarwhich allows a projection of a �nite set of actual structures �ie in the corpus�onto an in�nite set of potential structures allowing for the treatment of newsyllables and words A nonconcatenative� compositional approach to phonol�ogy and morphology based on autosegmental tiers of events avoids a rigidsegmentation at the phonetics�phonology interface and thus coarticulatione�ects �overlap of properties� can be described A decision on segmentationinto phonological or morphological units can be postponed by underspecify�ing the autosegments both temporally and in terms of their features untilsu�cient information is available The Linguistic Word Recognition com�ponent supplies linguistic knowledge constraints which have been used foroptimising stochastic systems ���

The aim of Linguistic Word Recognition is� by integrating stochastic

Page 6: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

and linguistic�symbolic approaches with �ne granularity� to achieve corpus�independent and speaker�independent speech recognition BELLEx is partof an experimental development environment in which components candemonstrate di�ering interaction strategies and parameter settings for dif�ferent modes of analysis The free parameterisation of the system allows forlinguistically adequate parameters to be chosen which de�ne a compromisebetween maximal word recognition rates and minimal analysis overhead

The Linguistic Word Recognition Component in Project ��� consists oftwo components� a syllable parser �SILPA� and a morphoprosodic parser�MORPROPA� which together with an acoustic event recogniser �HEAP�Universit�at Hamburg� form the components of the BELLEx word recog�nition system Each component must be evaluated individually in order toassess the value of the system as a whole This report is concerned with theevaluation of the syllable parser and the morphoprosodic parser These twocomponents produce three output lattices which are relevant for evaluation�a phoneme lattice� a syllable lattice and a word lattice The phoneme latticecan be regarded as a side�e�ect of syllable recognition as it is possible toderive the phoneme lattice top�down from the phonemes which occur in therecognised syllables It is important to note� however� that it is not sylla�ble or phoneme hypotheses which are passed from SILPA to MORPROPAbut rather underspeci�ed subsyllable events and that syllable and phonemelattices are generated soley for the purposes of syllable evaluation

The output of the Linguistic Word Recogintion component as a whole isa word hypothesis lattice which di�ers from a connected word graph to theextent that the connectedness condition is not a necessary requirement Theoutput lattice allows the existence of overlapping hypotheses and gaps be�tween hypotheses Although overlap relations and gaps between hypothesescan be interpreted as precedence relations �ie with the help of an absoluteoverlap parameter or value relative to which the degree of overlap is de�ned��the formal criterion for connectedness in the sense of ���� is not ful�lled

Page 7: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

� Problems of Evaluation in Linguistic Word

Recognition

As mentioned in the previous section� the output of Linguistic Word Recog�nition is a word hypothesis lattice which di�ers substantially from the con�nected word graph in that the connectedness condition is not a necessarycondition Clearly it is possible to construct a connected word graph on thebasis of the word lattice output by arti�cally splitting phonological eventsinto phonemic units using temporal statistics and mapping the lattice to achart in the sense of ��� However� the method preferred in Linguistic WordRecognition corresponds to a delayed level�speci�c segmentation which al�lows overlaps and gaps between hypotheses Example � provides an exampleof an overlap of word hypotheses which must be arbitrarily segmented if aconnected word graph is to be constructed

I m O m E n t

Figure �� Example of overlapping word hypotheses

The phonogical event �m� belongs to two word hypotheses although itstemporal duration does not correspond to the realisation of two separatephonemic segments �mm� An arbitrary splitting into two phonological events�m� and �m� based on temporal statistics is therefore unreliable This case�in particular� shows the argument against an arbitrary segmentation of tem�porally overlaping word hypotheses to be convincing

In connection with evaluation as practiced in Verbmobil� the evaluationprocedure cannot be directly applied to the output of the components ofLinguistic Word Recognition due to the fact that the existence of at leastone connected path through the output lattice is not guaranteed In orderto be able to apply the evaluation procedure� the components of LinguisticWord Recognition have two possible strategies for interpreting overlappinghypotheses �cf �gure � for the construction of a connected graph� eitherhypothesis a can be interpreted as preceding hypothesis b �ie option �i� in�gure � or only hypothesis c can be chosen �ie option �ii� in �gure �

Page 8: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

ab

c

Figure � Overlapping Hypotheses

ba

c

(i)

(ii)

Figure � Mapping to Chart

Stochastic word recognition models provide the n�best hypotheses withrespect to some particular threshold in the form of a connected word graphHowever� due to the fact that the aim of project ��� is to examine to whatextent linguistic knowledge can be applied in word recognition� this type ofoutput is not desirable� since this condition imposes a non�linguistic restric�tion on the output format Gaps may occur in the output for the followingreasons�

� hesitations� pauses� silence� incomplete words and sentences

� underspeci�cation in the input data due to the fact that informationis missing in the signal� although it is not possible to complete thisinformation bottom�up� it may be possible to do so top�down

� incorrect input data from the point of view of the linguistic component�constraint relaxation may be employed here

� the linguistic knowledge base may be incomplete

The Linguistic Word Recognition component caters for gaps between hy�potheses by allowing underspeci�ed event and feature structures in the out�put In�ectional endings� for example� are not easily recognised and therefore

Page 9: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

congruence features can be left underspeci�ed until a de�nite decision canbe made based on further bottom�up or top�down information

In this paper we make the claim that an evaluation procedure must takesignal endpoints into consideration when de�ning the notion of connected�ness A certain �maximal� amount of overlap between units must be allowedand they may still be considered to be connected Reference �les must there�fore contain signal annotations for the utterances� and not merely the utter�ance as a string A disadvantage of our approach is that reference �les mustbe generated for all utterences to be evaluated �ie regardless of whetherthe same utterance is spoken by di�erent speakers or several utterances arespoken by the same speaker�

The Braunschweig evaluation procedure� as employed in Verbmobil hasbeen described in �� � This evaluation procedure has been developed on thebasis of experience with conventional word recognition systems which pro�duce connected word graphs as output and for this task� it is very suitableInput to the evaluation procedure is a word lattice �which is in fact a con�nected word graph� and evaluation is carried out with respect to the bestpath through the lattice However� in addition to the connectedness condi�tion� the evaluation procedure requires a lexicon which covers the relevantcorpus If the recognised units in the lattice are not contained in the lexi�con� a message to this e�ect is produced Since Linguistic Word Recognitionclaims to be able to cater for new words� it clearly does not make sense touse an evaluation procedure which stipulates that the recognised units mustbe contained in the accompanying lexicon Linguistic Word Recognition� onthe other hand� distinguishes internally between actual units which occur inthe relevant corpus and potential units which are wellformed according to thegrammar For the evaluation of the components of Linguistic Word Recog�nition� it was also found useful to have a visualisation tool for the outputlattices This is described in more detail below

The �rst set of data which was evaluated in Linguistic Word Recogni�tion �all utterances from CD�ROM PhonDat Vol II ����� �Zugauskunft� bythe speaker SATD� referred to below as the ASL�Scenario� was that agreedupon by Verbmobil� Architecture for the �rst INTARC demonstrator in April���� After repeatedly performing evaluation� correcting inconsistencies inthe knowledge bases and the parsers� it was discovered that the standardevaluation procedure �� � led to a recognition rate that was lower than out�put which had been evaluated manually For this reason a new diagnostic

Page 10: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

evaluation procedure has been developed for Linguistic Word Recognitionwhich takes overlaps and gaps in word lattices into account The evalua�tion procedure can be applied at all levels of Linguistic Word Recognition�ie can be used for evaluation of phoneme output� of syllable output and ofword output

Page 11: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

� Diagnostic Evaluation

Hirschman � Thompson ��� distinguish between types of evaluation in speechand natural language processing Adequacy evaluation de�nes the �tnessof a system to the task required This they term evaluation proper Diag�nostic evaluation is the production of a system performance pro�le withrespect to a taxonomisation of the space of possible inputs Performance

evaluationmeasures system performance in one or more speci�c areas Thistype of performance serves as the basis for assessing the progress of a systemThis report is concerned with diagnostic evaluation in the sense describedabove� with the addition that we consider diagnostic evaluation to be a moregeneral term which covers both adequacy evaluation and performance eval�uation We de�ne diagnostic evaluation to consist of two evaluation stages�logical evaluation and empirical evaluation Logical evaluation is undertakenwith respect to a data model The data model de�nes one possible inputspace for the system� namely the optimal input data� and is generated top�down on the basis of labelled speech �les Optimal input data is the inputa system would hope for in the ideal case This concept is relevant onlyfor levels of processing which explicity use structured linguistic knowledgesince only here is it possible to de�ne what optimal input would be Atthe level of sentence syntax� for example� it is possible to de�ne what theoptimal input for a sentence parser would be� a single utterance which isgrammatically correct Clearly� this will not resemble real input in a spokenlanguage recognition system but evaluating the sentence syntax level withoptimal input is equivalent to veri�ng that all components which participatein the logical evaluation are internally consistent Linguistic components ofspeech recognition systems are often criticised for assuming a near optimalinput which leads to problems when the parser is coupled with an acous�tic component However� if optimal input is used for evaluation in order todevelop a consistent linguistic component which has been designed also todeal with suboptimal input� then an evaluation with real data can be carriedout without internal inconsistencies leading to failure As mentioned above�logical evaluation can be applied at all linguistic levels In particular� thisreport is concerned with the evaluation of the components of Linguistic WordRecognition �phoneme� syllable and word recognition�

In some ways logical evaluation is similar to testing a stochastic modelwith training data rather than with test data� such a procedure would notallow for participation in a competitive evaluation of the performance of sev�

Page 12: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

eral systems� but it does indicate the performance levels which can hopedto be attained on real data after tuning has taken place Logical evaluationof linguistic word recognition components di�ers from the above in that arecognition rate of ���� can be achieved if the all components which partic�ipate in the evaluation are sound and complete The stochastic model relieson a certain statistical generalisation which makes a ���� recognition rateon training data more di�cult

Empirical evaluation is de�ned here as evaluation on real input data It isthe recognition rate achieved by empirical evaluation which can be comparedwith current evaluation results in the area of speech and natural languageprocessing

��� Logical Evaluation using a Data Model

As mentioned above� logical evaluation of a component involves a test forsoundness and completeness with respect to a data model This notion was�rst presented in connection with Verbmobil Project ��� by ���

In order to perform a logical evaluation the following steps are necessary�

� Task� Test all the entries in the lexicon with the grammar of thecomponent Consequence� If the grammar does not permit analysisof all lexicon entries� then either a correction of the lexicon or a revisionof the grammar is required

� Task� Generate optimal input data for the component using eitherautomatically phonemically labelled data �as generated by HTK� forexample� or manually corrected label �les and test the componentConsequence� The component must at least be able to analyse whatit considers to be optimal data Otherwise the processing of the com�ponent is incorrect according to the optimal data model

� Task� Generate automatically the reference �les for the evaluation soft�ware using either automatically phonemically labelled data �as gener�ated by HTK� or manually corrected label �les This must correspondto the format chosen in connection with the generation of optimal in�put Consequence� These �les serve as the basis for the evaluationand if inconsistencies are found then evaluation will not be correct

��

Page 13: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

� Task� Test the evaluation software for inconsistencies Consequence�If the evaluation procedure is inconsistent� then a new procedure mustbe drafted

� Task� Visualise the output lattices of the component Consequence�The user can see the extent of overlap and gaps in the output latticeof the component

� Task� Visualise the output of the evaluation software Consequence�The user can compare this visualisation with that of the output latticesto see whether any information has been lost

In addition� in order to avoid inconsistencies� the lexicon for the compo�nent should also be generated automatically However� this is a knowledgeacquisition task rather than a step in the logical evaluation procedure As willbe seen below� these two tasks are interrelated and important for a successfuldiagnostic evaluation

As was mentioned above� the assumption is made here that in linguisticprocessing� a recognition rate of ���� can be achieved by the logical evalu�ation Only a component which achieves this rate is sound and complete forthis data model It is not until this recognition rate has been achieved� thatan empirical evaluation should be carried out

After an iterative logical evaluation had been performed for the sylla�ble recognition module on the ASL scenario ��� utterances of the speakerSATD�� a logical recognition rate of ���� was achieved using the Braun�schweig software The error rate of ��� is due to the fact that no phono�logically and morphologically relevant temporal statistics were calculated forthis scenario and therefore long segments were not divided into two sepa�rate segments and therefore two overlapping syllables were generated �cf�� However� the visualisation indicated that both syllables had been foundbut that they did not stand in a connectedness relationship and thereforea substitution was assumed by the Braunschweig software Since this typeof phenonemon is likely to occur even more frequently in real data� it wasdecided to develop an evaluation procedure which caters for the needs of theLinguistic Word Recognition component This evaluation procedure is termBELLE evaluation

The BELLE evaluation procedure has been implemented and is describedin section �� in detail below It is based on the notion that a reference�le which only de�nes the connectedness relationship between units is not

��

Page 14: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

su�ucient for evaluation purposes since signal time also plays a role Therecognised units must also correspond to a subsection of the signal Tempo�ral annotations for the units of reference �les can be de�ned top�down fromlabel �les Since it is unlikely that a recogniser of these units will recogniseprecisely these units at precisely these signal time points� a deviation param�eter must be de�ned which speci�es by how much the recognised unit maydi�er from the temporal annotations in the reference �le

The deviation parameter varies according to the size of the unit A syl�labic unit� for example� could be permitted to deviate in its endpoints up to��� ms from the temporal annotations Phonemic units� on the other hand�would only be allowed to deviate by say less than � ms However� sincethe correct deviation factors for the respective levels depends on empiricaltesting� it must be possible to parameterise the evaluation software in orderthat all values can be tested The deviation parameter which produces thebest results is then the most suitable for this unit

The new evaluation software produced a logical recognition rate of ����for the syllable parser Since we knew in advance that only one�segmentoverlaps occurred� it was possible to set the deviation parameter immediatelyto �ms This will always be the upper bound for the deviation parameterwith logical evaluation using a data model However� as will be seen in thenext section� the setting of the deviation parameter plays an important rolein empirical evaluation

The software described in section � below is currently being used to per�form logical and empirical evaluation on the Verbmobil�Scenario data

��� Empirical Evaluation and Linguistic Word Recog�

nition

Empirical evaluation also involvesmost of the steps de�ned under logical eval�uation although the majority of the steps will have been taken in connectionwith logical evaluation and thus will not have to be repeated Empirical eval�uation di�ers to the extent that evaluation of the component is undertakenwith real data rather than with optimal data which is generated top�downfrom labelled data In addition to the evaluation� it is desirable to have avisualisation of both output and evaluation results However� it is obviousthat empirical evaluation is only meaningful when a complete logical eval�uation using a data model with manually corrected labelled data has been

Page 15: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

performed and a recognition rate approximating ���� has been achievedHere also� the �rst set of data which was evaluated in Linguistic Word

Recognition �all utterances from CD�ROM PhonDat Vol II ����� �Zu�gauskunft� by the speaker SATD� was that agreed upon for the �rst INTARCdemonstrator �April ����� Before discussing the results of the empirical eval�uation� a brief description of the parameterisation of the linguistic compo�nents is presented This issue is discussed in more detail by Carson�Berndsenand Drexel in connection with the syllable recognition module in a forthcom�ing report

The BELLE parsing components� the syllable parser �SILPA� and themorphoprosodic parser �MORPROPA� use the notion of parameterised lin�guistic models Both parsers are parameterisable to allow for constraintrelaxation and constraint enhancement That is to say� it is possible to setthe parameters so that in the case of underspeci�ed �or unreliable� input thephonological and morphological constraints can be relaxed and phonologicaland morphological information an be added based on whichever informationis reliable in the input By testing the complete parameter space for each ofthe possible it is possible to de�ne which units �acoustic events� phonologicalevents� are unreliably recognised by the responsible components

Using constraint relaxation and enhancement� an empirical evaluation ofthe syllable parser SILPA produced a phoneme recognition rate of �� anda syllable recognition rate of �� As mentioned above� the phoneme recog�nition rate is a side�e�ect of syllable recognition since the phonemic units arecalculated top�down from the recognised syllables Word recognition ratewas ���

This report is primarily concerned with the method of diagnostic eval�uation for the components of Linguistic Word Recognition More detailedresults of logical and empirical evaluation of the system components will bediscussed in a separate report when diagnostic evaluation has been completedfor the Verbmobil�scenario data for the INTARC I demonstration in April����

In the remaining chapters of this report� a front�end for diagnostic eval�uation of Linguistic Word Recognition is presented which is being developedat the University of Bielefeld by Verbmobil AP ��� The system is calledBEETLE �Bielefeld Extended Evaluation Toolkit for Lattices of Events�

Page 16: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

� Bielefeld Extended Evaluation Toolkit for

Lattices of Events

BEETLE is a toolkit for diagnostic evaluation in Linguistic Word Recogni�tion which has been developed at the University of Bielefeld in connectionwith Verbmobil Project ��� BEETLE allows both logical and empiricalevaluation to be done automatically given phonemically labelled data � in aprede�ned format� and a linguistic analysis component �currently a syllablerecognition component or a word recognition component� �

Figure � shows the main BEETLE window with the possible stages ofdiagnostic evaluation In addition to the steps de�ned in connection withlogical evaluation in Section �� an additional step has been incorportatedwhich concerns the acquisition of linguistic knowledge The lexicon for eachlinguistic component of word recognition is generated automatically on thebasis of the phonemically labelled speech data

Figure �� Beetle User Interface

In the following subsections� the individual steps in diagnostic evaluation

�The term phonemically labelled data as used in this report refers to either automati�

cally phonemically labelled data or to manually corrected labelled data�The main implementation of BEETLE has been undertaken by Julie Carson�Berndsen

and Frederek Altho� at the University of Bielefeld� Thanks is due also to Guido Drexel�

Katrin Kirchho�� Martina Pampel� Christoph Schillo and Markus Vogt for implementation

of individual tools�

��

Page 17: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

are described in detail

��� Reference File Generation

This stage of diagnostic evaluation concerns the generation of reference �lesfor the components output unit �ie in this case phonemes� syllables orwords� The unit is parameterised and can be set by the user The input isa script �le listing the phonemic label �les in the following format�

Start�Time End�Time Label Con�dence�Valuewhereby Start�Time and End�Time are given in ns ������� to give ms�As was discussed in Section �� it was necessary to de�ne a new evalua�

tion procedure for Linguistic Word Recognition and therefore� two di�erentreference �les are generated� the reference �le for the Braunschweig eval�uation software and the reference �le for the Bielefeld evaluation software�BELLE Evaluation�

Figure �� Reference File Generation

Reference �le generation works in the following way The phonemic label�les are converted into an internal format Then for the complete utteranceall possible substrings are calculated and are presented to the user who marksthe required structures using the cursor and return keys �cf �gure �� Itwould be possible to insert a parser which generates only the relevant stringsfor the unit but this tool has been kept more general in order that othersubstructures such as demisyllables or bigrammes can also be marked Anexample of a section of such a �le is provided in �gure � where only therelevant syllables are marked When this marked �le is stored by the user�it is converted into the label format de�ned above providing� in this case� asyllable label �le On the basis of this second label �le� two reference �lesare generated

��

Page 18: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

Figure �� Marking Syllable Units for Reference Files

The �rst reference �le is that required by the Braunschweig evaluationsoftware An example showing the �le format is provided in �gure � Thereference �le consists of an indication of the utterance number and then alist of the correct items in a de�ned order The utterance number is providedat the beginning of the line due to the fact that the Braunschweig softwarein addition to allowing evaluation of a single utterance� also allows manyutterances to be evaluated together However� for the purposes of diagnos�tic evaluation� in particular in connection with visualisation of results it isnecessary to be easily able to access the reference for a single utterance Forthis reason� the �lename also indicates the utterance number When manyutterances are to be evaluated as is described in Section �� below� then thereference �les may be concatenated at the time of evaluation

530 IC m9C t@ fOn mYn Cn y b@ nY6n bE6k nax ham bU6k fa: r@n

531 IC bIn In k9ln Un m9Ct In 6 na hal bm StU n nax mYn Cn fa6n

533 IC mYs t@ Y b6 dY sl d)6f na ham bU6k fa:n

534 IC b@ n2 tI g@ aI n@ tsuk f6 bIn dUN fOn mYn C na:x a: xn

535 gu dn mO6 N IC m9C t@ hOY d@ tsvI Sn axt Und Elf u6 a: bnts In han bU6k zaIn

532 IC braU x@ hOY t@ aI n@ f6 bI nUN ap k9ln

Figure �� Example of Reference File in Braunschweig Format

��

Page 19: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

The second �le format is that required by the BELLE evaluation softwareIn this case� the reference �le consists of a set of PROLOG clauses whichde�ned the correct unit together with its temporal annotations �start andend points� based on the original phonemic label �le The �le name indicatesthe utterance number and therefore this is not provided in the �le itself

In addition to the generation of reference �les� ESPS label �les can begenerated for all marked substructures of the utterance on the basis of theformat de�ned above A direct alignment of units to the signal is thereforepossible

��� Lexicon Generation

Lexicon generation for the components output unit �ie in this case againphonemes� syllables� words� can be performed according to the unit selectedby the user using the intermediate notation of the reference �le generationThis is the default setting Lexicon generation expects as input a scriptlisting �les from which a lexicon is to be generated It is possible to use �lesfrom another source rather than the default setting These �les must be inthe format de�ned in the previous section� however Figure � shows the userwindow for lexicon generation

Figure �� Lexicon Generation

The output of lexicon generation is currently component�speci�c Thatis to say� in the context of Verbmobil Project ���� the syllable and wordrecognition components have their own internal lexicon formats It is theseformats which are generated here However� the infomation contained inthese lexica is feature�event�based and includes information on frequency ofoccurrence and average temporal duration of the units and the phonemictranscription which could clearly be output in another more general formatif required

��

Page 20: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

In addition to the generation of component�speci�c lexica� this moduleis also responsible for generation the unit�speci�c lexicon required by theBraunschweig software As was mentioned in Section � this latter lexiconis not meaningful in the context of Linguistic Word Recognition However�it must be generated in order for the Braunschweig evaluation to functioncorrectly

��� Lexicon Consistency Test

Figure � shows the user window for the lexicon consistency test

Figure �� Lexicon Consistency Test

Lexicon consistency is component�speci�c and involves testing the gen�erated lexicon �in phonemic format� with the grammar of the syllable andmorphoprosodic parsers and verifying that all lexicon entries may be anal�ysed by the grammar The output of the lexicon consistency test is theset of lexicon entries which may not be analysed by the grammar If thisset is empty� then the lexicon consistency test is successful As mentionedin section �� this is a possible source of inconsistency of the system Ifthe grammar does not permit analysis of all lexicon entries� than either acorrection of the lexicon or a revision of the grammar is necessary

��� Top�down Event Generation

The tool used for top�down event generation has been described in detail in�� and ��� Here event structures are derived top�down from phonemicallylabelled data These structures correspond to optimal input data and formthe basis for the data model Figure �� shows the user window for top�down

��

Page 21: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

event generation Input is a script of phonemic label �les in the formatde�ned in Section �� The default setting is the original input script tothe reference �le generation The output type is selected by the user �iein this case either as input for the syllable parser SILPA or as input forthe morphoprosodic parser MORPROPA� For each label in the input �les�lookup is performed which maps the phoneme labels to the relevant events orfeatures and the obligatory contour principle �smoothing� is performed �cf�� and ��� for further details�

Figure ��� Top�down Event Generation

The output of top�down event generation is an optimal input �le for therelevant parser which has a direct alignment to the signal

��� Linguistic Word Recognition

This is the section which is concerned with the parser call which producesthe output lattices for the evaluation The main window is shown in �gure�� A selection is made by the user as to whether syllable parsing �SILPA�or word parsing �MORPROPA� is to be performed Each of the parsers isparameterised �cf section � An additional window allows the user toalter the system con�guration The default settings are shown in the �gures

Figure ��� Linguistic Word Recognition

��

Page 22: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

The parameter window for the con�guration of the syllable parser SILPAis shown in �gure � The parameters are discussed by Carson�Berndsen andDrexel in a forthcoming report

Figure �� SILPA Con�guration

Each parser component produces output lattices for their own units �iesyllable parser produces a syllable lattice and in addition a phoneme latticewhich is derived top�down from the recognised syllables� The lattice has thefollowing format�

Node�� Node� Label Con�dence Start�Time End�TimeThe standard output produced by SILPA is in the following tuple nota�

tion�

Page 23: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

h Syllable�Phoneme� Start�Time� End�Time� Lex�Key iLex�Key refers to the type of syllable recognised with respect to the lex�

icon lab referes to those syllables which are labelled with respect to thecorpus� act refers to those syllables which are current in the German lan�guage and pot refers to those syllables which are wellformed with respect tothe phonotactic constraints of German �ie new syllables�

The con�guration window for the morphoprosodic parser MORPROPAis shown in �gure � The parameters can be set similarily to those for thesyllable recogniser to allow for constraint relaxation and enhancement

Figure � � MORPROPA Con�guration

In the next sections� two evaluation procedures for word recognition aredescribed The Braunschweig evaluation procedure has been described in�� � The BELLE evaluation procedure was introduced in section � aboveand is discussed in more detail here

��� Braunschweig Evaluation

Within BEETLE� the Braunschweig evaluation procedure is called to performevaluation on the output lattices of SILPA and MORPROPA Before this can

Page 24: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

be done� it is necessary to perform a lattice�to�chart�mapping analogously to��� in order to guarantee at least one connected path through the lattice Inorder for the Braunschweig evaluation to function within the context of Lin�guistic Word Recognition� a framework has been implemented in BEETLEwhich maps the output lattice to a chart allowing for a certain amount ofoverlap �as de�ned by the parameter prepeval�� which calls the Braunschweigevaluation software and which optionally provides a visualisation of both theinput lattice and the output chart

Figure ��� Braunschweig Evaluation

The framework also includes additional information on date of evalua�tion and on the reference �les used Examples of the text output of theBraunschweig evaluation software is given in �gures �� and ��

Page 25: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

Figure ��� Evaluation phonemes

Figure ��� Evaluation syllables

The visualisation which has been incorporated into the framework aroundthe Braunschweig evaluation was developed as a general tool for displayinghypotheses and showing their alignment to the signal The visualisation tool�GraphHypo �� displays the hypotheses with respect to either the signaltimes or the logical nodes of the chart An example of the visualisation ispresented in �gure �� in connection with the BELLE evaluation

�implemented by Frederek Altho�� ����

Page 26: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

�� BELLE Evaluation

The BELLE evaluation procedure was introduced in section � above It wasdeveloped for the evaluation of the parsing components of Linguistic WordRecognition in BELLEx It caters for the notion of overlaps and gaps inthe hypothesis lattice and does not assume that a connected path throughthe lattice exists The reference �les used for the evaluation procedure takethe temporal annotations of the reference units into account and a unit isregarded as recognised if� in the output lattice� there is a corresponding unitwith temporal annotations which deviate from the reference unit by not morethat the speci�ed deviation parameter

Figure ��� BELLE Evaluation

Figure �� shows the BEELE evaluation con�guration window Inputto the evaluation procedure is a script de�ning the �les to be evaluatedThe default is the script given to Reference File Generation The deviationparameter may be set by the user� eg ���ms for syllable units As mentionedabove� the deviation parameter varies according to the size of the unit Sincethe correct deviation factors for the respective units depends on empiricaltesting� the evaluation software has been parameterised in order that allvalues can be tested The deviation parameter which produces the bestresults is then the most suitable value for this unit

Page 27: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

Figure �� shows a visualisation of the output of the BELLE evaluationprocedure for a syllable lattice The shaded area represents hypotheses foundin the output lattice which correspond to the reference �le for this utter�ance The nonshaded hypotheses represent the reference path as generatedby Reference File Generation Hypotheses which do not correspond to thereference path are not shown� for reasons of clarity As can be seen fromthe �gure� there is an overlap of nasality in the combination �fOnmYn� invon M�unchen These syllables would not be regarded as recognised usingthe Braunschweig software without an arbitrary splitting into two nasal seg�ments However� since the place of articulation is underspeci�ed in the inputto the syllable parser� an arbitrary splitting of such a nasal segment into twofurther segments is not justi�ed

The BELLE evaluation procedure provides a test output similar to that ofthe Braunschweig evaluation except it contains no reference to substitutionsand insertions

Both the Braunschweig and the BELLE approaches allow for the eval�uation of single and multiple utterrances Each of the steps of BEETLEdiagnostic evaluation as described in the previous subsections can be per�formed individually

Page 28: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

Figure ��� Visualisation

Page 29: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

� Open Issues

An issue which has not yet been considered in connection with evaluationof the components of Linguistic Word Recognition is the notion of phono�logical proximity Standard evaluation selects the n�best paths through theoutput lattice with respect to the con�dence values provided for each unitSince Linguistic Word Recognition utilises underspeci�ed phonological eventstructures� it implicitly de�nes the notion of phonological similarity betweenunits However� it would seem more suitable� instead of selecting the best hy�pothesis �wrt some con�dence value� at a particular point in time� to selectthe hypothesis which is the nearest� phonologically speaking� to the referenceunit As was mentioned explicitly in the introductory section� it is not syl�lable or phoneme hypotheses which are passed from SILPA to MORPROPAbut rather underspeci�ed subsyllable events and that syllable and phonemelattices are generated soley for the purposes of syllable evaluation Since thesyllable and phoneme hypotheses are generated by multiplying out the un�derspeci�ed phonological event structures� the corresponding fully speci�edevent structures are clearly more closely related phonologically than otherfully speci�ed event structures which are not subsumed by this

In this report� there has been no discussion on evaluation criteria for Lin�guistic Word Recognition This is clearly an issue which must be consideredAlthough it is possible for the Linguistic Word Recogniser �BELLEx � as awhole to take part in the standard evaluation of word recognition� it has beenshown in this report that such an evaluation procedure imposes restrictionswhich stand in opposition to the aims of interactive phonological interpre�tation as followed by Verbmobil AP ��� The aim of this report has beento show the shortcomings of the standard evaluation procedure with respectto new linguistic approaches to word recognition and to o�er an alternativeevaluation procedure which is in line with the notion of delayed� nonrigidsegmentation Evaluation criteria for Linguistic Word Recognition can nowbe drawn up on the basis of the diagnostic approach to evaluation as de�nedabove

Page 30: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

Bibliography

��� Julie Carson�Berndsen Ereignisstrukturen f�ur phonologisches ParsenASL TR � ��UBI� ���� University of Bielefeld

�� Julie Carson�Berndsen Computational tools for the developmentof event phonologies In Konvens �� Konferenz �Verarbeitungnat�urlicher Sprache� pages �� ��� N�urnberg�� ���

� � Julie Carson�Berndsen An event based phonotactics for German ASL TR � ��UBI� ��� University of Bielefeld

��� Julie Carson�Berndsen Tools for the development of event phonologiesASL TR � ��UBI� ��� University of Bielefeld

��� Julie Carson�Berndsen Time Map Phonology and the Projection Prob�lem in Spoken Language Recognition PhD thesis� University of Bielefeld�April ���

��� Julie Carson�Berndsen and Dafydd Gibbon Event relations at the pho�netics�phonology interface In Proceedings of the ��th InternationalConference on Computational Linguistics �COLING � pages ��� �� � Nantes� ���

��� L�F Chien� KJ Chen� and L�S Lee An augmented chart data struc�ture with e�cient word lattice parsing scheme in speech recognitionapplications In Proceedings of the ��th International Conference onComputational Linguistics �COLING � � pages � �� ��� Helsinki�����

��� Dafydd Gibbon Architekturkonzepte f�ur Worterkennung in multilin�gualen Dialogen Talk presented at the Verbmobil�Workshop Architek�tur� Erlangen� April ����� ���� University of Bielefeld

��� Lynette Hirschman and Henry Thompson Overview of evaluation inspeech and natural language processing Unpublished Draft� ����

���� Kai H�ubener and Julie Carson�Berndsen Phoneme recognition usingacoustic events In Proceedings of the �rd International Conference onSpoken Languauge Processing� vol �� pages ���� ��� Yokohama�Japan� ����

Page 31: Diagnostic Evaluation - COnnecting REpositories · 2017. 4. 27. · amp el Univ ersit at Bielefeld UBI F akult at f ur Linguistik und Literaturwissensc haft Univ ersit atsstr P ostfac

���� Kai H�ubener and Julie Carson�Berndsen Phoneme Recognition UsingAcoustic Events Verbmobil Technical Report No ��� ���� Universityof Hamburg and University of Bielefeld

��� A Jusek� H Rautenstrauch� GA Fink� F Kummert� G Sagerer�J Carson�Berndsen� and D Gibbon Dektektion unbekannter w�ortermit hilfe phonotaktischer modelle In Mustererkennung �� ��� DAGM�Symposium Wien� Berlin� ���� Springer Verlag erscheint

�� � Michael Lehning Ein programmsystem zur evaluierung der signalnahenspracherkennung Verbmobil Technischer Bericht Nr �� ���� TechnischeUniversit�at Braunschweig

���� Elmar N�oth and Bernd Plannerer Schnittstellende�nition f�ur den Wor�thypothesengraphen Verbmobil Memo � ���� Universit�at Erlangen�N�urnberg� Technische Universit�at M�unchen