
Speech Communication 54 (2012) 715–731

Syntactic language modeling with formal grammars

Tobias Kaufmann, Beat Pfister

Speech Processing Group, Computer Engineering and Networks Laboratory, ETH Zurich, Gloriastrasse 35, 8092 Zurich, Switzerland

Received 4 March 2010; received in revised form 29 December 2011; accepted 3 January 2012; available online 12 January 2012

Abstract

It has repeatedly been demonstrated that automatic speech recognition can benefit from syntactic information. However, virtually all syntactic language models for large-vocabulary continuous speech recognition are based on statistical parsers. In this paper, we investigate the use of a formal grammar as a source of syntactic information. We describe a novel approach to integrating formal grammars into speech recognition and evaluate it in a series of experiments. For a German broadcast news transcription task, the approach was found to reduce the word error rate by 9.7% (relative) compared to a competitive baseline speech recognizer. We provide an extensive discussion of various aspects of the approach, including the contribution of different kinds of information, the development of a precise formal grammar and the acquisition of lexical information.

Keywords: Large-vocabulary continuous speech recognition; Language modeling; Formal grammar; Discriminative reranking

1. Introduction

Acoustic models in automatic speech recognition have difficulty in distinguishing between alternative transcriptions that are phonetically similar. The role of a language model is to complement the acoustic model with prior information about the word sequences that occur in a particular language.

The most widely used class of language models, the so-called n-grams, is based on the assumption that the probability of a word only depends on a small number of preceding words. These models can easily be trained on large amounts of unannotated text, and they perform extremely well despite their simplicity.

However, n-grams fail to capture many dependencies that are present in natural language. In our experimental data, the n-gram language model prefers the incorrect transcription "einer, der sich für die Umwelt einsetzen" (translation of the correct transcription: someone who stands up for the environment). This transcription is ungrammatical because the number of the relative pronoun der (singular) does not agree with the number of the finite verb einsetzen (plural). In order to capture this instance of subject–verb agreement, an n-gram would have to consider at least five words preceding the verb. Moreover, the training corpus would have to contain the correct word sequence "der sich für die Umwelt einsetzt" (who stands up for the environment), which is quite unlikely. We do not claim that n-grams and variants such as skip n-grams or class-based n-grams cannot possibly deal with this particular example. But it is easy to find many similar examples for any language model that does not properly capture the dependencies of natural language.

In the last decade, progress in the field of statistical parsing has led to the development of more powerful language models that also incorporate syntactic structure. These language models have been shown to increase recognition accuracy for broad-domain speech recognition tasks.

Unlike statistical parsers, formal grammars are designed to accurately distinguish between grammatical and ungrammatical sentences. Formal grammars have been successfully applied to narrow-domain natural language understanding tasks, but not to broad-domain speech recognition. This may partly be due to the difficulty of scaling such approaches to broad domains. Precise broad-domain grammars are notorious for their lack of coverage and at the same time fail to exclude many incorrect but nevertheless grammatical transcriptions. This greatly diminishes the value of grammaticality information. In addition, the development of grammars and lexica for broad domains is a difficult and laborious task. All these effects can be controlled much more easily for restricted domains.

This situation is somewhat unsatisfying, as formal grammars do have properties that are interesting for language modeling. They appear to be the most adequate means to formalize hard linguistic constraints. If they are complemented with statistical models for disambiguation, these models typically require relatively little syntactically annotated training data compared to statistical parsers. And finally, formal grammars encode linguistic constraints that are – at least to some degree – orthogonal to word-based statistical models, including n-grams and statistical parsing models.

In this paper, we present a novel approach to integrating formal grammars into a statistical speech recognition framework. We show that this approach can significantly improve the accuracy of a competitive speech recognizer on a broad-domain task. We further report on a series of experiments which reveal different characteristics of the proposed approach and give an idea of how much syntactic information can contribute to speech recognition. In addition to these experiments, we will describe the underlying linguistic resources and explain the formalisms and methods that were used to develop them.

This article is a condensed version of the main author's Ph.D. thesis (Kaufmann, 2009). Some of the experimental results have previously been published in (Kaufmann et al., 2009).

1.1. Statistical parsers as language models

Virtually all syntactic language models for large-vocabulary continuous speech recognition are based on statistical parsers.¹ A statistical parser basically models the relation between a word sequence W and a parse tree T as a probability distribution P(T, W) or P(T|W). The parameters of this distribution are estimated on large syntactically annotated text corpora, so-called treebanks.

A common way of integrating statistical parsers into speech recognition is to use them directly as a generative language model P(W) = Σ_T P(T, W). This kind of approach was pursued by Chelba and Jelinek (2000), Roark (2001b), Charniak (2001), Xu et al. (2002) and Wang and Harper (2003). More recently, Collins et al. (2005) introduced a discriminative rather than generative approach to syntactic language modeling. In their setup, a statistical parser determines the most likely parse tree T_W = argmax_T P(T, W) = argmax_T P(T|W) for each recognition hypothesis W. A discriminative statistical model then chooses the most likely recognition hypothesis based on a set of features that were extracted from the word sequences and their respective parse trees. A similar approach was followed by McNeill et al. (2006).

¹ A notable exception is the SuperARV language model by Wang and Harper (2002), which is essentially a class-based language model.

Syntactic language models based on statistical parsers offer several advantages. They can consider the dependencies between actual word forms and thus are able to capture semantics and fixed expressions in a similar way to n-grams. For example, they may prefer "he gave a talk" to "he gave a walk" if the words gave and talk occur in verb–object relation more frequently than gave and walk. Further, most statistical parsers employ a sufficiently productive (often infinite) set of grammar rules that allows them to produce a parse tree for any word sequence W. This robustness property is important for dealing with hypotheses that are ungrammatical or whose syntax is not covered by the treebank.

Statistical parsers have been shown to improve speech recognition on broad-domain tasks such as transcribing spontaneous telephone conversations (Chelba and Jelinek, 2000; Wang et al., 2004; Collins et al., 2005; McNeill et al., 2006), broadcast news shows (Chelba and Jelinek, 2000) or read newspaper articles (Chelba and Jelinek, 2000; Roark, 2001b; Wang and Harper, 2002; Xu et al., 2002; Hall and Johnson, 2003).

1.2. Formal grammars

It might be appropriate to first explain which particular kinds of formal grammar we are concerned with. A prototypical formal grammar is intended to discriminate between those word sequences that are grammatical (i.e. that can be generated by a sequence of rule applications) and those which are not. In natural language processing and computational linguistics, a grammar is typically used to approximate a natural language rather than define an artificial language. In this context, we informally call a grammar precise if it accurately predicts whether sentences are grammatical with respect to some natural language or not.

Predicting the grammaticality of a sentence is generally not sufficient for applications of natural language processing. Thus, we make a few additional assumptions:

• The grammar allows one to determine the possible syntactic structures of a grammatical sentence. Based on these structures, statistical models can compute probabilities of derivations or word sequences.
• Grammaticality and syntactic structure can be determined for linguistically motivated units other than sentences, e.g. noun phrases. This allows linguistic information to be extracted even in the face of ungrammatical sentences.

Note that the combination of such a formal grammar with a statistical model is quite different from a statistical parsing model. A typical statistical parser is completely guided by a statistical model and basically allows for any derivation that is structurally possible. A parser using a formal grammar is restricted by the hard linguistic constraints encoded in the grammar, though these constraints are often selectively relaxed to increase robustness. Thus, the statistical model only needs to "fill the gaps" with quantitative information.

1.3. Formal grammars as language models

Formal grammars have been used as language models since the beginnings of automatic speech recognition (see Erman et al. (1980) for an early example). In spite of that, formal grammars have almost exclusively been applied to natural language understanding tasks in rather restricted domains. Examples of such domains are naval resource management (Price et al., 1988), urban exploration and navigation (Zue et al., 1990), air travel information (Price, 1990), public transport information (Nederhof et al., 1997) and appointment scheduling and travel planning (Wahlster, 2000).

The system by Beutler et al. (2005) is exceptional in that it is exclusively targeted at speech recognition. It was evaluated for an artificial task, namely the transcription of dictation texts for pupils. This task was relatively easy due to the simple language, the small recognizer vocabulary (which completely covered the test utterances) and the good acoustic conditions.

Some studies reported reductions of the word error rate compared to a bigram (Kita and Ward, 1991; Goddeau, 1992; Jurafsky et al., 1995), a trigram (van Noord, 2001) or a 4-gram language model (Beutler et al., 2005). However, the only statistically significant improvements relative to an acknowledged state-of-the-art baseline system were published by Moore et al. (1995) for an air travel information domain. For the Verbmobil system (Wahlster, 2000), effects on the speech recognition performance were not reported.

1.4. Formal grammars for broad domains

There are several reasons why formal grammars are attractive for small-domain speech understanding tasks. First, if parsing is necessary for extracting the semantic interpretation of an utterance, there is no significant overhead in applying the same resources and processing to speech recognition. Further, small domains allow for very restrictive grammars that constrain both the syntax and the semantics of the accepted utterances. In fact, Rayner et al. (2006) created natural language understanding systems by compiling a general grammar into more restricted, domain-specific grammars. For small domains, restrictive grammars can still have a reasonably good coverage, and in applications of human–computer interaction, the user can even be expected to adapt to the grammar with increasing experience. Finally, grammar-based language models are particularly interesting if there is not enough domain-specific data for the training of statistical language models.

It is much more difficult to exploit the strengths of formal grammars in broad-domain large-vocabulary speech recognition. The syntax of broad domains is very productive, and thus many incorrect hypotheses have to be considered grammatical. In our experimental data, the utterance "arafat räumte fehler seiner regierung ein" (arafat admitted mistakes of his government) gives rise to 37 recognition hypotheses, 25% of which are grammatical. A few examples are shown below:

(1) arafat räumte fehler seiner regierung allen
    arafat cleared mistakes his government to all
    'arafat cleared mistakes of his government for everyone'

(2) arafat und der fehler seiner regierung allein
    arafat and the mistake his government only
    'only arafat and the mistake of his government'

(3) arafat wollte fehler seiner regierung einen
    arafat wanted mistakes his government to unify
    'arafat wanted to unify mistakes of his government'

Some of these phrases are clearly marked, but all of them are possible in appropriate contexts and with a semantically plausible choice of words. This example suggests that simply choosing the acoustically best grammatical hypothesis will not work for broad domains. Rather, there should be a component that captures syntactic preferences and allows different grammatical hypotheses to be compared.

If information on grammatical correctness has a reduced discriminative power for broad domains, it appears to be even more important to model linguistic constraints as precisely as possible. In Chapter 2 of Sag et al. (2003) it is argued that context-free grammars are not adequate for developing precise grammars with broad coverage. Rather, more sophisticated grammar formalisms are required (see Section 3.1).

However, formal grammars that aim at precision in broad domains are inevitably incomplete. At one end of the spectrum, there are general grammatical constructions that are notoriously difficult to formalize, for example ellipses, coordinations and parentheses. At the other end, there is a wealth of idiosyncratic phenomena such as determiner omission in temporal expressions like "next week". But even if standard language were completely covered by the grammar, the best available hypotheses may still be ungrammatical due to out-of-vocabulary words, bad acoustic conditions, incorrect utterance boundaries or ungrammatical utterances. This necessitates a robust approach in the sense that syntactic cues should also be exploited for hypotheses that are not accepted by the grammar.

In summary, we argue that a grammar-based approach to broad-domain speech recognition requires a general but precise grammar (and hence a grammar formalism which facilitates the development of such a grammar), a strong model for syntactic preferences and a robust way of integrating these components into speech recognition.

None of the previous approaches meets all of these requirements. Goodine et al. (1991), Goddeau (1992), Jurafsky et al. (1995) and Rayner et al. (2006) used statistical models for syntactic preferences. However, they either considered only grammatical hypotheses or relied on some kind of fallback mechanism to handle ungrammatical hypotheses. On the other hand, Moore et al. (1995), Kiefer et al. (2000), van Noord (2001) and Beutler et al. (2005) achieved robustness by means of partial parsing (see Section 2.3.2) but did not consider syntactic preferences. Finally, only the Verbmobil grammar by Müller and Kasper (2000) and the grammar used in (Beutler et al., 2005) attempted to model general language. These two grammars were based on a linguistically motivated grammar formalism, whereas most other grammars were context-free or even LR(1).

1.5. Outline

In Section 2, we will discuss our approach to integrating formal grammars into speech recognition. Our way of modeling and processing natural language will be presented in Section 3. This section also aims at pointing out some of the difficulties that have to be faced when working with precise models of general language. Section 4 is dedicated to the acquisition of linguistic resources in general and lexical information in particular. Finally, Section 5 reports on our experiments and discusses the results.

2. Integration of syntactic information

2.1. Architecture

The basic architecture is shown in Fig. 1. In a first stage, a speech signal is processed by a baseline speech recognizer and the resulting word lattice is automatically segmented into sub-lattices that roughly represent sentence-like units.

For each sub-lattice, the second stage extracts the N best hypotheses with respect to the baseline recognizer score. In our particular case, the baseline recognizer score is a weighted sum of a (logarithmic) acoustic likelihood, an n-gram language model score and a word insertion penalty.

For each hypothesis, a parser determines a unique parse tree and its associated disambiguation score. This score reflects the plausibility of the parse tree: highly plausible trees receive large positive scores and highly implausible trees receive large negative scores.

Finally, a discriminative reranking component chooses the most promising hypothesis for a given sub-lattice. Reranking is based on various features that are extracted from the individual hypotheses, including the baseline recognizer score, the disambiguation score and different properties of the parse tree. These features are used to compute a final score for each hypothesis, and the hypothesis with the maximum score is chosen as the recognition result. Table 1 shows the relevant scores for an example from our experimental data.
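To make the flow of Fig. 1 concrete, the following sketch outlines the reranking pipeline as plain Python. It is only an illustration of the architecture described above: the stage components (segmentation, N-best extraction, robust parsing, reranking) are passed in as callables, and all names and signatures are hypothetical rather than the authors' implementation.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[List[str], float]          # (word sequence, baseline score)

def transcribe(lattice,
               segment: Callable,             # lattice -> iterable of sub-lattices
               n_best: Callable,              # (sub-lattice, n) -> list of Hypothesis
               robust_parse: Callable,        # words -> (parse tree, disambiguation score)
               rerank_score: Callable,        # (words, s_bas, tree, s_dis) -> final score
               n: int = 100) -> List[List[str]]:
    """Return one word sequence per sentence-like sub-lattice (cf. Fig. 1)."""
    transcription = []
    for sub_lattice in segment(lattice):                      # sentence-like units
        candidates = []
        for words, s_bas in n_best(sub_lattice, n):           # N-best extraction
            tree, s_dis = robust_parse(words)                 # unique tree + plausibility
            candidates.append((rerank_score(words, s_bas, tree, s_dis), words))
        transcription.append(max(candidates)[1])              # highest final score wins
    return transcription
```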

Note that the benefit of the syntactic information can easily be assessed by comparing the word error rate of the baseline speech recognizer (i.e. of the first-best hypotheses) to that of the full system.

Also note that the baseline speech recognizer already employs statistical language models, namely n-grams. In the present approach, these models are complemented with a discriminative language model which considers syntactic information. The interaction between these models is twofold. First, the discriminative reranking takes the n-gram language model scores of the speech recognition hypotheses into account. Second, n-gram language models are involved in choosing the N best hypotheses in the first place: the pruning during speech decoding and the N-best extraction both rely on n-grams.

The idea of discriminative reranking with syntactic features is not new. A similar setup was used by Collins et al. (2005) and McNeill et al. (2006) to integrate statistical parsers into speech recognition. An actual contribution of this work is our approach of using formal grammars within this framework. This approach includes a way of determining a parse tree and a plausibility score for arbitrary (even ungrammatical) word sequences, as well as novel features for robust parsing and hypothesis reranking.

The remainder of this section will discuss the different building blocks of this approach. Section 2.2 provides a short introduction to discriminative reranking with log-linear models. This technique is used in robust parsing (Section 2.3) as well as hypothesis reranking (Section 2.4). Finally, related work is discussed in Section 2.5. Details about the segmentation component can be found in (Kaufmann, 2009).

2.2. Discriminative reranking with log-linear models

Reranking applies to situations where an observation and several competing candidate interpretations are given. The goal of reranking is to find the most likely interpretation for a given observation. Johnson et al. (1999) proposed a reranking approach based on conditional log-linear models.

Fig. 1. Basic architecture. An utterance is processed by the baseline speech recognizer and the resulting word lattice is automatically split at potential sentence boundaries. For each sub-lattice, the N best hypotheses are extracted and the most promising hypothesis is chosen based on syntactic information and the scores assigned by the baseline speech recognizer.

Table 1. The five best hypotheses for the utterance "als Bush im Auto den Reichstag verliess, konnte er sicher sein" (when Bush left the Reichstag by car, he could be sure). Each hypothesis is annotated with the baseline speech recognizer score s_bas, the disambiguation score s_dis and the final recognition score s_rec (the maximum of each score is marked with an asterisk). Note that the correct hypothesis on the fifth rank receives the highest final score. The computation of s_dis and s_rec is explained in Sections 2.3 and 2.4, respectively.

1. als burschen autor den reichstag verliess konnte er sicher sein
   s_bas = -614265.7*   s_dis = -0.09    s_rec = -41418.00
2. als burschen auto den reichstag verliess konnte er sicher sein
   s_bas = -614275.3    s_dis = -0.11    s_rec = -41418.50
3. als puschen autor den reichstag verliess konnte er sicher sein
   s_bas = -614290.1    s_dis = -1.72    s_rec = -41420.06
4. als wuschen autor den reichstag verliess konnte er sicher sein
   s_bas = -614290.5    s_dis = -2.51    s_rec = -41420.46
5. als bush im auto den reichstag verliess konnte er sicher sein
   s_bas = -614291.3    s_dis = +4.16*   s_rec = -41416.90*


In the notation of Charniak and Johnson (2005), the corresponding probability distribution is as follows:

P_\theta(y \mid Y) = \frac{1}{Z_\theta(Y)} \exp\Big( \sum_j \theta_j f_j(y) \Big)        (4)

Z_\theta(Y) = \sum_{y' \in Y} \exp\Big( \sum_j \theta_j f_j(y') \Big)        (5)

In the above equations, the observation is implicitly represented by the set Y of all candidate interpretations that are consistent with the observation. An individual candidate y ∈ Y is described by real-valued feature functions f_j(y). An important property of log-linear models is that these feature functions are not assumed to be statistically independent. Given a set of feature weights θ_j, the most likely candidate y* can be obtained as follows:

y^{*} = \arg\max_y P_\theta(y \mid Y) = \arg\max_y \sum_j \theta_j f_j(y)        (6)

Thus, a candidate y maximizes the conditional probability P_θ(y | Y) if and only if it maximizes the term Σ_j θ_j f_j(y). This term will subsequently be called the score of candidate y.

The model parameters θ = (θ_1, ..., θ_n) are chosen to maximize the (regularized) log-likelihood of the training data:

L(\theta) = \log \prod_s P_\theta(y^{(s)} \mid Y^{(s)}) - \sum_j \frac{\theta_j^2}{2\sigma_j^2}        (7)

The regularization term Σ_j θ_j²/(2σ_j²) was introduced by Chen and Rosenfeld (1999). It amounts to a Gaussian prior distribution on the feature weights and prevents overfitting by penalizing large weights. We follow Charniak and Johnson (2005) in using a single regularization constant c instead of the different 1/(2σ_j²), which results in the regularization term c Σ_j θ_j². The constant c has to be optimized on held-out data. An efficient method for maximizing L(θ) has been proposed by Malouf (2002), and an implementation was presented in (Charniak and Johnson, 2005).
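For concreteness, here is a small numerical sketch of Eqs. (4)–(7) in plain Python. The feature vectors, weights and the regularization constant are toy values chosen only to illustrate the computations; they do not correspond to the models trained in this work.

```python
import math

def score(theta, f):
    """Score of a candidate: sum_j theta_j * f_j(y)."""
    return sum(t * x for t, x in zip(theta, f))

def conditional_prob(theta, candidates, i):
    """P_theta(y_i | Y) for the i-th candidate in the set Y (Eqs. 4 and 5)."""
    z = sum(math.exp(score(theta, f)) for f in candidates)            # Z_theta(Y)
    return math.exp(score(theta, candidates[i])) / z

def regularized_log_likelihood(theta, training_data, c):
    """Eq. (7) with a single regularization constant c: log-likelihood - c * sum_j theta_j^2."""
    ll = sum(math.log(conditional_prob(theta, cands, correct))
             for cands, correct in training_data)
    return ll - c * sum(t * t for t in theta)

# Two candidates described by two features each; candidate 0 is the correct one.
candidates = [[1.0, 3.0], [2.0, 1.0]]
theta = [0.5, -0.2]
print(conditional_prob(theta, candidates, 0))                         # approx. 0.29
print(regularized_log_likelihood(theta, [(candidates, 0)], c=0.1))
```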

2.3. Robust parsing

The goal of robust parsing is to determine a unique parse tree for any word sequence. Thus, an approach to robust parsing has to resolve ambiguities by making a good choice among alternative parse trees, and it has to offer a method for dealing with ungrammatical word sequences. These two problems will be discussed in the following. Section 2.3.1 will give an introduction to the predominant approach to disambiguation, which is also adopted in this work. In Section 2.3.2, we describe our approach to robustness. The types of features that we use for robust parsing are outlined in Section 2.3.3.

2.3.1. Parse disambiguation

Parse disambiguation can be regarded as a reranking task where the observation is a word sequence and the candidates are the parse trees that are consistent with this word sequence. A feature function f_j(y) is typically assumed to have the following form:

f_j(y) = \sum_{n \in \mathrm{nodes}(y)} e_j(n)        (8)

In the above equation, y denotes a parse tree and nodes(y) is the set of nodes in that parse tree. The indicator function e_j(n) is 1 if a certain "linguistic event" is associated with node n, and 0 otherwise. For example, e_j(n) may be 1 if and only if the node n represents a pronoun. In this case, the feature f_j(y) simply counts the number of pronouns that occur in the parse tree.

Fig. 2. An artificial parse tree for the (ungrammatical) first hypothesis of Table 1. It consists of five partial parse trees: the prepositional phrase "als burschen" (as busters), the noun phrase "den reichstag" (the Reichstag), the sentence "konnte er sicher sein" (could he be sure?) and the single words autor (author) and verliess (left).
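As a toy illustration of a feature of the form in Eq. (8), the sketch below defines an indicator e_j(n) that fires on pronoun nodes and sums it over all nodes of a tree. The tree representation is invented for the example and is not the parser's actual data structure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str                       # e.g. "S", "NP", "pronoun"
    children: List["Node"] = field(default_factory=list)

def nodes(tree: Node):
    """All nodes of a parse tree, root included."""
    yield tree
    for child in tree.children:
        yield from nodes(child)

def e_pronoun(n: Node) -> int:
    """Indicator function e_j(n): 1 iff the node represents a pronoun."""
    return 1 if n.label == "pronoun" else 0

def f_pronoun_count(tree: Node) -> int:
    """Feature value f_j(y) = sum over nodes of e_j(n): the number of pronouns."""
    return sum(e_pronoun(n) for n in nodes(tree))

tree = Node("S", [Node("NP", [Node("pronoun")]), Node("VP", [Node("verb")])])
print(f_pronoun_count(tree))   # -> 1
```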

The set of parse trees derived by the parser is usually represented as a packed parse forest (Tomita, 1991). A packed parse forest describes an ambiguous structure by means of disjunctive nodes which represent alternative subtrees. Enumerating all unambiguous parse trees that are implicitly represented in a packed parse forest is typically infeasible. However, there exist algorithms for ambiguity unpacking that extract the most likely parse trees from a packed parse forest without enumerating all possible trees. Some of these algorithms impose certain restrictions upon the indicator functions e_j.

For this work, we adopted the selective unpacking algorithm by Carroll and Oepen (2005). This algorithm incrementally extracts parse trees in decreasing order of probability. It assumes that the indicator functions e_j(n) only consider subtrees of n with a fixed maximum height. Increasing the maximum height allows for a greater flexibility in feature design but increases the computational cost. We chose the maximum height to be 3, i.e. four levels of nodes could be accessed. An alternative approach would have been the beam search algorithm by Malouf and van Noord (2004). This algorithm allows for arbitrary indicator functions but does not guarantee to find the optimal solution.

Discriminative reranking has emerged as the standard approach to parse disambiguation, in particular for grammars that are not context-free. It has been applied to precise formal grammars by Riezler et al. (2002), Malouf and van Noord (2004) and Toutanova et al. (2005). Collins and Koo (2005) and Charniak and Johnson (2005) used the same technique to improve the output of statistical parsers.

2.3.2. Robustness

Our approach to robustness is based on partial parsing. In partial parsing, the parser identifies and analyzes all phrases that can be found somewhere in the given word sequence W. Thus, parsing results in a set of partial parse trees spanning different subsequences of the given word sequence.

Robustness can be achieved by allowing a word sequence to be analyzed as an arbitrary sequence of partial parse trees. Note that such a sequence is guaranteed to exist if a single word is regarded as a degenerate partial parse tree. In the following, a sequence of partial parse trees will be thought of as an artificial parse tree that is created by attaching the partial parse trees to a common root node. An example of an artificial parse tree is shown in Fig. 2.

The key idea of our approach is to let the disambiguation component determine the most likely artificial parse tree (and thus the most likely sequence of partial parse trees). To this end, the reranking model is extended with a set of dedicated feature functions that are sensitive to properties such as the number and types of partial parse trees.

2.3.2.1. Disambiguation with artificial parse trees. Extracting the most likely artificial parse tree from a packed parse forest is straightforward if the indicator functions e_j(n) can only access subtrees of n and if e_j(n_root) = 0 for the root node n_root. For an artificial parse tree ȳ with partial parse trees y_m, it then holds that score(ȳ) = Σ_j θ_j f_j(ȳ) = Σ_m score(y_m). Thus, it is sufficient to find a sequence of partial parse trees which maximizes the sum of the individual scores. This can be done by means of dynamic programming. If L(i, j) denotes the maximum score of all partial parse trees spanning the subsequence (w_i, ..., w_j) of W = (w_1, ..., w_N), the score of the most likely artificial parse tree can be computed as follows:

M(l) = \begin{cases} 0 & \text{if } l = 0 \\ \max_{0 \le k < l} \big( M(k) + L(k+1, l) \big) & \text{if } 1 \le l \le N \end{cases}

The score of the most likely artificial parse tree is s_dis = M(N). In order to recover the actual parse tree, it is necessary to memorize the arguments of all maximizations that are performed when computing L and M.
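The recursion for M(l) translates directly into a short dynamic program. The sketch below assumes that the partial-parse scores L(i, j) are given as a dictionary over 1-based word spans (spans that the parser cannot analyze are simply absent) and uses backpointers to recover the optimal segmentation; it illustrates the algorithm above, not the original implementation.

```python
def best_artificial_parse(L, N):
    """Return (s_dis, segmentation), where segmentation is a list of (i, j) spans."""
    NEG_INF = float("-inf")
    M = [0.0] + [NEG_INF] * N          # M[0] = 0, M[l] as defined above
    back = [0] * (N + 1)
    for l in range(1, N + 1):
        for k in range(0, l):
            span_score = L.get((k + 1, l), NEG_INF)
            if M[k] + span_score > M[l]:
                M[l] = M[k] + span_score
                back[l] = k            # memorize the maximizing argument
    # recover the spans of the selected partial parse trees
    spans, l = [], N
    while l > 0:
        spans.append((back[l] + 1, l))
        l = back[l]
    return M[N], list(reversed(spans))

# Toy example with N = 3 words: single words score -1.0, the span (2, 3) scores 0.5.
L = {(1, 1): -1.0, (2, 2): -1.0, (3, 3): -1.0, (2, 3): 0.5}
print(best_artificial_parse(L, 3))     # -> (-0.5, [(1, 1), (2, 3)])
```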

This algorithm already allows certain features that characterize the artificial parse tree to be included. For example, an indicator function e_j(n) can be defined to be 1 if and only if n is the root of a partial parse tree. The respective feature f_j(y) then counts the number of partial parse trees, and the feature weight θ_j represents a penalty for introducing an additional partial parse tree. Other features may count partial parse trees with a specific syntactic category, e.g. noun phrases or adverbials. In (Kaufmann, 2009), we showed that this approach can also be extended to less restricted indicator functions, e.g. functions that depend on the actual number of partial parse trees. Such a function could, for example, indicate that the artificial parse tree contains at least three partial parse trees.

Table 2. The features used for hypothesis reranking.

Features provided by the baseline speech recognizer:
- Acoustic log-likelihood
- Weighted language model score + word insertion penalty
- Baseline recognizer score (the sum of the former two)

Features introducing syntactic information:
- Disambiguation score of the best artificial parse tree
- All features used for disambiguation (see Section 2.3.3)

Features combining syntax and recognizer information:
- Partial parse tree counts for non-first-best hypotheses
- Prosodic partial parse tree features

2.3.2.2. Training with artificial parse trees. The training of the discriminative reranking model is complicated by the fact that the number of candidates (i.e. the number of possible artificial parse trees) can be prohibitively large. We adopted an approach that was originally proposed by Osborne (2000) and Malouf and van Noord (2004): we created a manageable candidate set by subsampling the full set of candidates. We did not employ uniform random sampling because this would have led to impoverished sets in which candidates with few partial parse trees are underrepresented. Instead, a training example was constructed by randomly selecting 10 artificial parse trees for each possible number of partial parse trees. The best possible artificial parse tree was added to the candidate set if it was not already part of it.
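A minimal sketch of this subsampling scheme is given below, assuming each candidate is a (score, tree) pair and that num_partial(tree) returns its number of partial parse trees. The representation and the helper names are illustrative only.

```python
import random
from collections import defaultdict

def build_candidate_set(candidates, num_partial, per_group=10, seed=0):
    """Sample up to `per_group` candidates for each possible number of partial
    parse trees, then add the best-scoring candidate if it is not yet included."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for cand in candidates:                         # cand = (score, tree)
        groups[num_partial(cand[1])].append(cand)
    sample = []
    for group in groups.values():
        sample.extend(rng.sample(group, min(per_group, len(group))))
    best = max(candidates, key=lambda c: c[0])      # best possible artificial parse tree
    if best not in sample:
        sample.append(best)
    return sample

# Toy usage: trees are just strings, each assumed to consist of one partial parse tree.
cands = [(-2.0, "t1"), (-1.0, "t2"), (0.5, "t3")]
print(build_candidate_set(cands, num_partial=lambda t: 1, per_group=2))
```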

2.3.2.3. Related work and discussion. Common approaches to robust parsing use simple heuristics to break an ungrammatical word sequence into grammatical parts, completely ignoring the plausibility of partial parse trees (van Noord, 2001; Riezler et al., 2002; Kaplan et al., 2004; Beutler et al., 2005; van Noord et al., 2006). For example, the fewest-chunks heuristic minimizes the number of parse trees, even if this means choosing a highly unlikely partial parse tree. Prins and van Noord (2003) address this problem by essentially filtering out unlikely parse trees by means of a statistical part-of-speech tagger. However, tagging cannot resolve all ambiguities and it may introduce errors from which the parser cannot recover. In our approach, the quality of a tree sequence naturally includes the quality of the individual trees.

Zhang et al. (2007) used a dedicated model for the probability of breaking a word sequence into a specific sequence of chunks. However, this model is again based on a heuristic, and the authors did not provide an exact and efficient algorithm for finding the optimal sequence of partial parse trees.

2.3.3. Features

This section outlines the most important feature classes that were used for robust parsing. A description of the complete feature set (which consists of roughly 12,000 features) is beyond the scope of this paper and can be found in (Kaufmann, 2009).

2.3.3.1. Partial parse tree features. The partial parse tree features have already been introduced as a key component of our approach to robustness. One feature simply counts the number of partial parse trees. Other features are restricted to single words, certain complete categories (e.g. noun phrases, main clauses or adverbials), or to incomplete categories such as noun phrases with a missing determiner. The latter were included to gather syntactic information in the face of elliptical speech and other out-of-grammar utterances. Other features are 1 if and only if there are at least m partial parse trees. The feature for m = 2 is particularly important as it indicates whether there is a single complete parse tree or not.
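The sketch below illustrates these feature types under the simplifying assumption that an artificial parse tree is given as a list of (category, span) pairs for its partial parse trees; the category labels are invented.

```python
def count_partial_trees(partials):
    """Feature: total number of partial parse trees."""
    return len(partials)

def count_category(partials, category):
    """Feature: number of partial parse trees with a given category, e.g. 'NP'."""
    return sum(1 for cat, _ in partials if cat == category)

def at_least_m(partials, m):
    """Indicator feature: 1 iff there are at least m partial parse trees.
    For m = 2 this signals that the hypothesis is not covered by a single parse."""
    return 1 if len(partials) >= m else 0

partials = [("PP", (1, 2)), ("word", (3, 3)), ("NP", (4, 5)), ("word", (6, 6)), ("S", (7, 10))]
print(count_partial_trees(partials), count_category(partials, "NP"), at_least_m(partials, 2))
```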

2.3.3.2. Part-of-speech mismatch features. In many incorrect syntactic analyses, words are assigned the wrong parts of speech. In order to improve disambiguation, we use the output of a statistical part-of-speech tagger. This information is integrated by means of features that count mismatches between the tagger output and the parts of speech that are assumed in the parse tree. Apart from a general mismatch feature, there are features that only consider words with a particular assumed part of speech, e.g. proper noun or pronoun. Prins and van Noord (2003) used tagging information to exclude unlikely lexicon entries from parsing. This stands in contrast to our approach, where tagging information can be overruled by other evidence.

2.3.3.3. Generic rule features. The generic rule features aim at covering a large space of linguistic events that are potentially relevant for disambiguation. For each grammar rule, there is a feature that counts how often that rule has been applied in the course of the derivation. More specific features describe the event that a given rule was applied to phrases that belong to specific classes, e.g. that a particular binary rule was applied to a verb and a noun phrase without a determiner. Other features count how often a phrase is created by consecutively applying the rules r_1, r_2, ..., r_m to a particular element of the phrase, the so-called head element. In our experiments, we used m ∈ {2, 3}.

2.3.3.4. Miscellaneous features. In addition to these rather general features, there is a variety of very specific features that are motivated by linguistic intuition. These features are used to capture preferences with respect to the constituent order within clauses, the difference in length of two conjuncts and the underspecification of case, just to mention a few.

2.4. Hypothesis reranking

The goal of hypothesis reranking is to choose the most promising speech recognition hypothesis for a given utterance. This is done by means of a conditional log-linear model that estimates the conditional probability of each hypothesis. In fact, it is sufficient to compare the scores s_rec = Σ_j θ_j f_j(y) of the different hypotheses y, as was pointed out in Section 2.2. The features of the log-linear model are shown in Table 2.

Even though the acoustic log-likelihood and the language model score are already contained in the baseline recognizer score, these two components are represented by dedicated features. This enables the model to change the relative influence of the statistical language model, which is now complemented with syntactic information.

As for the syntactic features, it should be noted that the disambiguation model is trained to discriminate between alternative parse trees for the same word sequence. This suggests that the disambiguation score is only partly adequate for comparing parse trees of different hypotheses, even though it turns out to be useful for this purpose (see Section 5.6). For this reason, the disambiguation score is complemented with the full set of features that are used for disambiguation. These features make it possible to capture syntactic patterns that are characteristic of correct or incorrect hypotheses.

Some of the features combine information about partial parse trees with information that is specific to speech recognition. For example, there is an additional set of features that indicate whether there are at least m partial parse trees. In contrast to the original set, these features are defined to be zero for the first-best hypothesis. They enable the model to selectively penalize non-first-best hypotheses and thus to control the tendency to choose a hypothesis other than the first-best one. The prosodic partial parse tree features were intended to reward partial parse trees that are adjacent to prosodic phrase boundaries. As the benefit of these features was only modest, they are not discussed here.
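A minimal sketch of the final score s_rec = Σ_j θ_j f_j(y) over the feature groups of Table 2 is shown below. The feature names, values and weights are invented for illustration and are not the trained model's parameters.

```python
def s_rec(features, weights):
    """Final recognition score of a hypothesis: dot product of features and weights."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# One hypothesis, described by a few features from the groups of Table 2 (toy values).
features = {
    "acoustic_log_likelihood": -1023.4,       # baseline recognizer features
    "lm_score_plus_penalty": -87.2,
    "baseline_score": -1110.6,
    "disambiguation_score": 4.16,             # syntactic features
    "num_partial_trees": 1.0,
    "at_least_2_partials_nonfirstbest": 0.0,  # syntax + recognizer features
}
weights = {"baseline_score": 0.067, "disambiguation_score": 0.5,
           "num_partial_trees": -0.3, "at_least_2_partials_nonfirstbest": -1.0}
print(s_rec(features, weights))
```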

2.5. Discussion and related work

In the previous sections, we have proposed an approach to integrating formal grammars into automatic speech recognition. The approach offers ways to deal with the limited coverage of precise formal grammars and the fact that incorrect recognition hypotheses are often grammatical. Both problems are tackled by means of partial parsing and by considering the plausibility of parse trees.

One novel aspect of this work is the proposed approach to robustness, which allows a parse tree and a "plausibility score" to be determined for word sequences that are not covered by the grammar. Another novel aspect is our particular way of integrating syntactic information into hypothesis reranking. This includes the use of the disambiguation score and the partial parse tree features. The latter indicate, among other things, whether a hypothesis is covered by the grammar or not. This information is already reflected in the disambiguation score, but it can also explicitly be taken into account for hypothesis reranking.

Our approach is based on N-best reranking rather than lattice parsing or an even tighter integration into the speech recognition decoder. One reason for this is that parsers for precise broad-coverage grammars are not suited for such processing. They tend to consume too much memory for storing a large number of partial results. Further, efficient search techniques such as A* are not applicable because such parsers typically do not operate from left to right. However, there is evidence that N-best rescoring is sufficient for investigating the benefit of syntactic information. Collins et al. (2004) observed that the best results for speech recognition with statistical parsing models were obtained by means of N-best parsing. Several authors reported that lattice parsing led to gains in processing speed, but to a reduced or only marginally improved accuracy (Roark, 2001a; Hall and Johnson, 2003; Collins et al., 2004).

The most closely related work is that of Collins et al. (2005) and McNeill et al. (2006). They used statistical parsers to derive parse trees for the N best speech recognition hypotheses. The hypotheses were reranked by means of syntactic features that were extracted from the parse trees. As statistical parsers cover arbitrary word sequences by design, robustness was not an issue.

The features of Collins et al. (2005) represented sequences of words and part-of-speech tags, context-free rule applications, head word annotations of non-terminal symbols and dependencies between two head words. Their reported improvements were mostly due to features that are based on part-of-speech tags and did not involve parse trees at all.

McNeill et al. (2006) considered the 20 best parse trees of each hypothesis. This makes the approach less sensitive to errors of the statistical parser and allows more reliable estimates of the feature values to be computed. Their feature set was richer than that of Collins et al. (2005) and included features for specific syntactic constructions. They further considered the (generative) probability of a parse tree. This probability roughly corresponds to our disambiguation score. However, the latter originates from a discriminatively trained model and thus represents a different kind of information.

3. Formalizing natural language

Although our approach to speech recognition does not rely on a particular grammar formalism, this section will give a very brief introduction to the formalism which was employed in the present work. The reason for this is that the choice of the grammar formalism does have an impact on the practicability of a grammar-based approach. This is particularly true if one attempts to model a large fragment of a natural language by means of restrictive grammar rules.

We argue that with an adequate grammar formalism, the number and complexity of grammar rules is manageable even for precision grammars with a large coverage of linguistic phenomena. By way of illustration, we will provide a short outline of our German grammar. A more in-depth discussion of both the grammar and the grammar formalism can be found in (Kaufmann, 2009).

3.1. The grammar formalism

In our experiments, we used a Head-driven Phrase Structure Grammar (HPSG). Good introductory textbooks on this formalism are Sag et al. (2003) and – for German HPSG – Müller (2007). The foundations of HPSG were published as Pollard and Sag (1987) and further developed in (Pollard and Sag, 1994).

HPSG is based on context-free phrase structure rules which describe how words or phrases can be combined to form larger linguistic units. But unlike context-free grammars, HPSG represents linguistic units as hierarchical feature structures rather than atomic pre-terminal symbols. Feature structures can describe a linguistic unit on a high level of abstraction. For example, they can specify with which kind of unit some word or phrase can combine and in what way.

A consequence of this high level of abstraction is the lexicalization of the grammar: the lexical representation of a word (i.e. its feature structure) bears precise information on how the word interacts with other linguistic units. Thus, a considerable part of the grammar can be thought of as being represented in the lexicon. Consequently, many syntactic peculiarities can be accounted for by simply introducing specific lexicon entries rather than changing the grammar rules.

If linguistic information is adequately organized in feature structures, it is possible to state grammar rules which capture general linguistic concepts. For example, a single rule might account for the way an adjective attaches to a noun or an adverb to a verb. Such a rule would require the respective lexicon entries to specify which kind of linguistic unit they can attach to. The interaction of a few general rules can account for a large number of syntactic constructions which otherwise would have to be described individually. However, it is still possible to use construction-specific phrase structure rules where necessary.

In summary, HPSG facilitates the development of large-coverage precision grammars by means of lexicalization and generalization. Lexicalization allows many kinds of exceptions to be handled in a simple way, and the ability to capture general linguistic concepts leads to a relatively small number of grammar rules. An example of such a grammar will be outlined in the next section.

3.2. The grammar

Our German grammar is based on diverse literature on German HPSG, most notably Müller (1999), Crysmann (2003, 2005) and Müller (2007). Since a detailed description is beyond the scope of this paper, we will only give a broad overview of the grammar. In particular, we will state the number of grammar rules that are dedicated to a specific phenomenon or construction.

Twenty-two grammar rules are used to model the basic linguistic concepts subcategorization, modification, extraction and extraposition. Some of these rules are largely language-independent, whereas others are motivated by more language-specific phenomena such as the relatively free word order in German or the so-called fronting of partial verb phrases. The formation of the German verbal complex is covered by 3 additional rules. Coordinations (e.g. "neither the man nor the woman") are modeled by means of 4 grammar rules. There is one grammar rule for relative clauses and another one for interrogative clauses. Finally, 8 grammar rules are involved in the construction of noun phrases. They include rules for nominalized adjectives ("der Kleine", engl. the little one), prenominal genitives ("des Pudels Kern", engl. the poodle's core), post-nominal genitives ("die Stunde der Wahrheit", engl. the moment of truth) and appositions ("Präsident Bush", engl. President Bush).

The 39 rules that were outlined above represent the grammar for general language use. This grammar is complemented with domain-specific subgrammars which amount to 49 additional rules. These subgrammars describe certain expressions of date and time, as well as forms of address. Further, they account for the fact that the speech recognizer may split certain compound words, e.g. spelled-out numbers ("ein und zwanzig"), compound nouns ("zeitungs bericht") and acronyms ("b. m. w.").

Some expressions follow restricted and very specific variation patterns that are rather cumbersome to handle in HPSG but can easily be expressed by means of regular expressions. An example is "bis Ende August 2009" (until the end of August 2009). Such expressions are specified in a regular grammar. Each regular expression is mapped to a feature structure which represents the interface to the Head-driven Phrase Structure Grammar.

4. Development of lexical resources

4.1. Introduction

Precise formal grammars require precise lexical information. A noun, for example, can be countable or uncountable, it can denote a title, a personal or geographical name, a quantity or a temporal unit, and it can have four types of sentences as a complement. For verbs, we distinguish about 880 subcategorization frames (i.e. possible sets of complements). Most verbs can be used with several subcategorization frames. For example, the verb gehen (to go/walk) is assigned 16 different frames in our lexicon.

The following sections address the problem of acquiring such lexical information, given a list of word forms that should be covered by the lexicon. In some cases, it is a major problem to identify the lexical units in the first place. Examples are German particle verbs or multiword expressions such as "nach wie vor" in German or "by and large" in English.


In general, dictionaries or reference books do not directly provide the specific, fine-grained syntactic categories that are required for our purposes. They also tend to represent the prototypical usages of a word and neglect non-standard language use and the numerous variations that can be observed in corpus data. Thus, we decided to rely on data-driven, semi-supervised approaches whenever possible. We did not use treebanks or the output of statistical parsers in order not to weaken the argument that the proposed approach requires relatively little syntactically annotated data. Instead, we mainly relied on a text corpus of 220 million words that was tagged with the TnT part-of-speech tagger by Brants (2000). The corpus consists of news reports from the Frankfurter Rundschau that were published between 2nd January 1997 and 13th April 2002.

The methods that will be described next were used to create a lexicon that covers the 7000 word forms occurring in our experimental data. The lexicon contains about 600 adjectives, 4000 nouns and 3300 verbs. The verbs were annotated with more than 11,000 subcategorization frames. The number of multiword lexemes is about 1200, not including the 1400 acronym entries that were generated automatically. The whole set of lexicon entries gives rise to about 120,000 inflected forms.

Our decision to acquire only the words that occur in the experimental data calls for an explanation: technically, restricting the vocabulary like this gives us information about the test data. We chose to do so because it would have been too laborious to manually acquire the full speech recognizer vocabulary of about 300,000 words. Also, research in (sufficiently effective) methods of automatic lexical acquisition was clearly outside the scope of our work. However, we think that our results are not significantly affected by this decision. First, the acquisition is based on an alphabetically sorted list of about 7000 words, about 30% of which actually appear in the reference transcription. This makes it difficult to guess whether a particular word occurs in the transcription and in which particular syntactic contexts. Second, our semi-automatic acquisition process heavily relies on information which is completely unrelated to the test data, for example morphological databases, dictionaries and automatically extracted corpus examples.

4.2. Identification of open-class words

Before determining the syntactic categories, it is necessary to identify the lexical units (lexemes) that account for the given set of word forms. This was achieved by sending queries to the canoo.net morphological database. For each word form, the set of matching lexemes was retrieved (with kind permission of Canoo Engineering AG), where a lexeme was represented by its part of speech and its inflected forms. The inflectional class and the stem information of each lexeme were automatically inferred from the inflected word forms.

A special treatment was required for particle verbs such as untergehen (to sink): as the particle (unter) and the verb (gehen) can occur as two separate words in certain situations, the actual particle verb may not be present in the given set of word forms. In order to check whether a prefix and a verb from the set may form a particle verb, the text corpus was searched for relatively unambiguous non-finite forms such as unterzugehen and untergegangen.

The identification of personal names and geographical names is difficult for two reasons. First, they are not well covered by dictionaries and linguistic databases. Second, many of them are homonyms of nouns, verbs or adjectives. For each word form, it was manually decided whether it can be used as a given type of proper name (i.e. first name, last name or geographical name). The decision-making was supported by two pieces of information: the output of a classifier that was trained to solve the same task, and the 10 "most informative" corpus examples, i.e. the corpus examples in which the word in question is most likely to be used as a proper name of the given type. Our approach of training such classifiers and determining the most informative examples will be outlined in the next section.

4.3. Acquisition of open-class words

Once all nouns, verbs and adjectives were identified, the next step was to determine their syntactic properties. This was done manually with the help of information that was extracted from the tagged text corpus. For each lexeme, the lexicon developer had to decide whether a given usage (e.g. the uncountable use of a given noun) is possible or not. To help with the decision, the lexicon developer was presented with a number of corpus examples in which the given word potentially occurs in that specific usage. This approach is presumably more efficient (and more accurate) than browsing through dictionaries, and the corpus information can be expected to be more reliable than the intuition of the lexicon developer.

A straightforward approach to selecting corpus examples is to specify a regular expression for contexts that are characteristic of the given usage. The symbols of the regular expression can be defined in terms of word forms and/or part-of-speech tags. This approach was found to work reasonably well for certain syntactic properties. For other properties, the value of a corpus example seemed to depend on many different cues whose relative importance is not intuitively clear. For example, the absence of a determiner is not a good indicator for a mass noun, as determiners of count nouns can also be omitted in appropriate contexts such as headlines. The preceding word seems to be more informative. In particular, words like viel (much) are strong indicators for uncountable use. Our second approach attempts to integrate and weight such cues.

The basic idea is to train a log-linear classifier (i.e. a maximum entropy model) to predict whether a given usage is possible for a given word. Each feature of this classifier essentially searches the corpus and counts how often a specific cue occurs in the context of the word to be classified. The classifier can be trained by means of an initial lexicon, which results in a set of feature weights. These weights represent the relative importance of their respective cues. Summing up the weights of all cues that are present in a corpus example provides a score which measures the "informativeness" of that example.
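The following minimal sketch illustrates this scheme for the mass-noun example discussed above. The cue definitions and weights are hypothetical placeholders (a real system would train the weights on an initial lexicon with a maximum entropy toolkit, including a bias term); only the general recipe is taken from the description above: count cue occurrences around the target word, weight them with the trained model, and score each corpus example by summing the weights of the cues it contains.

# Sketch of the cue-based acquisition scoring described above.
# Cue patterns and weights are hypothetical, not the actual implementation.
import math

# Each cue is a predicate over (tokens of an example, position of the target word).
CUES = {
    "preceded_by_viel": lambda toks, i: i > 0 and toks[i - 1].lower() == "viel",
    "no_determiner":    lambda toks, i: i == 0 or toks[i - 1].lower() not in {"der", "die", "das", "ein", "eine"},
}

# Hypothetical weights, e.g. obtained from a log-linear model trained on an initial lexicon.
WEIGHTS = {"preceded_by_viel": 2.1, "no_determiner": 0.3}

def features(word, examples):
    """Count how often each cue fires in the corpus examples of `word`."""
    counts = {name: 0 for name in CUES}
    for toks in examples:
        for i, tok in enumerate(toks):
            if tok.lower() == word:
                for name, cue in CUES.items():
                    if cue(toks, i):
                        counts[name] += 1
    return counts

def usage_probability(word, examples):
    """Log-linear score mapped to a probability that the usage (e.g. mass noun) is possible."""
    score = sum(WEIGHTS[name] * count for name, count in features(word, examples).items())
    return 1.0 / (1.0 + math.exp(-score))

def informativeness(toks, i):
    """Sum of the weights of all cues present in one corpus example."""
    return sum(WEIGHTS[name] for name, cue in CUES.items() if cue(toks, i))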

This way of acquiring syntactic information turned out to work very well for nouns and adjectives. For verbs, however, the approach was of rather limited use. Most subcategorization information was manually obtained from a dictionary (Duden, 1999). Corpus information was only used for detecting certain types of complements such as prepositional objects or sentential complements.

We did not attempt to acquire lexical information fully automatically, as was done by Baldwin (2005a, 2005b) and Nicholson et al. (2008), among others. The main reason is that we wanted the lexical resources to be as accurate as possible. However, it was not investigated how much our approach to speech recognition relies on an accurate lexicon.

4.4. Acquisition of multiword lexemes

Using the terminology of Sag et al. (2002), we are concerned with lexicalized phrases, i.e. phrases whose syntactic or semantic properties cannot be derived from their parts. More precisely, we are only interested in syntactic compositionality. For example, the meaning of "den Löffel abgeben" (to kick the bucket, literally: to hand in the spoon) cannot be derived from its parts. From a syntactic point of view, however, this expression is completely regular. This is in contrast to the expression "Schlange stehen" (to queue up, literally: to stand snake). Here, the noun Schlange lacks a determiner and the verb stehen is used transitively, neither of which is possible in general.

We distinguished between 11 types of multiword expressions, which ranged from completely fixed expressions (e.g. "ab und zu", now and then) to expressions that undergo internal variation (e.g. "auf diese/unterschiedliche Weise", in this/a different way) or word order variation (e.g. "etwas in Kauf nehmen", to put up with something).

Potential multiword expressions were extracted in the standard way: suffix arrays (Manber and Myers, 1990; Yamamoto and Church, 2001) were used to identify all tagged word sequences that occurred at least 5 times in the corpus and that were composed of at most 10 words. For each sequence, it was determined how strongly the word/tag pairs were associated with each other. Pointwise mutual information (Fano, 1961; Church and Hanks, 1990) was chosen as the association measure for word pairs. For sequences of more than two words, the approximation by Schone and Jurafsky (2001) was employed.
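As an illustration of the association measure, the sketch below ranks adjacent word/tag pairs by pointwise mutual information. The toy counting with plain dictionaries stands in for the suffix-array machinery used here; only the PMI ranking step itself is shown.

# Sketch: rank adjacent (word, tag) pairs by pointwise mutual information (PMI).
# Plain Counter-based counting replaces the suffix-array extraction; illustrative only.
import math
from collections import Counter

def pmi_ranking(sentences, min_count=5):
    """sentences: lists of (word, tag) pairs; returns (PMI, pair_a, pair_b) sorted high to low."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    ranked = []
    for (a, b), c_ab in bigrams.items():
        if c_ab < min_count:
            continue
        p_ab = c_ab / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        ranked.append((math.log2(p_ab / (p_a * p_b)), a, b))
    return sorted(ranked, reverse=True)   # inspect from the top-ranked pair down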

The tagged word sequences were sorted in decreasing order of association strength and manually inspected starting from the top-ranked sequence. Each word sequence that was considered to be a relevant multiword expression was marked and later added to the lexicon. The manual inspection was stopped when relevant multiword expressions became too rare. This general procedure was varied by filtering certain word sequences, e.g. sequences that contain words which are not covered by the lexicon, or sequences with a particular part-of-speech pattern.

Our second approach to identifying multiword lexemes exploits the fact that the prefield of German main clauses is typically occupied by a single constituent. The prefield generally covers the part to the left of the finite verb. We extracted many instances of the prefield from the text corpus and parsed each of them. The unparsable word sequences were sorted by association strength and inspected manually. Apart from multiword expressions, this also revealed flaws in the grammar and missing syntactic constructions. Examples are postnominal modifiers as in "er selbst" (he himself) or various expressions of date that were later incorporated in the regular grammar, e.g. "im September letzten Jahres" (in September of last year). This approach is related to the error mining technique which was introduced by van Noord (2004).

5. Experiments and results

5.1. Baseline system and data

Our experiments are based on word lattice output of the 300k LIMSI German broadcast news transcription system which is described in (McTait and Adda-Decker, 2003; Gauvain et al., 2002). The speech recognizer is a multi-pass system that employs continuous density hidden Markov models with Gaussian mixtures for acoustic modeling, 4-gram backoff language models and class-based trigrams. The vocabulary size is 300,000 words.

The lattices cover 6 broadcasts of the German news show Tagesschau. The data of three news shows was used in preliminary experiments and for the training of the segmentation component (see Fig. 1). The experiments reported in this work are based on the remaining three news shows (broadcast on 29th April, 15th May and 23rd May 2002). We will subsequently refer to the latter three news shows, unless otherwise noted. These three news shows amount to 603 sentences, 7477 words, or a signal length of 52 minutes. The average sentence length is 12.4 words. The word error rate of the first-best transcriptions is 13.27%. The first-best transcription of one of the word lattices is given below:

Doch auf der Internet Seite wird zur besonderen Vorsicht gemahnt <s/> Insbesondere bei Menschen Ansammlungen [silence] {breath} <s/> Auch der Betroffene Reiseveranstalter [silence] zieht Konsequenzen <s/>


Table 3
The word error rate after reranking with syntactic information compared to the first-best word error rate of the baseline system and the 100-best oracle error rate.

Approach                    Word error rate
Baseline system             13.27%
+ Syntactic information     11.98% (−9.7% relative)
100-best oracle             6.32%


Besides the tags indicating silence, breath, filler words and potential sentence boundaries, the speech recognition output has certain peculiarities that have to be taken into account by the grammar. In particular, the speech recognizer may split words such as compound nouns (e.g. "Internet Seite" instead of "Internetseite"), acronyms (e.g. "B. M. W." instead of "BMW"), numbers (e.g. "ein und zwanzig" instead of "einundzwanzig") and prefix verbs (e.g. "aus handelten" instead of "aushandelten"). This allows the speech recognizer to deal with the potentially infinitely many compound words.

Split compounds are not counted as errors in the evaluation scheme by McTait and Adda-Decker (2003). The evaluation is based on rewrite rules that normalize both the reference transcriptions and the speech recognition hypotheses. Even though syntactic analysis could help to recover the original word forms, we adopted the original evaluation scheme in our experiments.

5.2. Preprocessing

The three news shows were automatically segmented into 609 sentence-like units. From each resulting word lattice, the best hypotheses were extracted incrementally. As many hypotheses only differ in the tags and in the capitalization of words, this information was removed in a normalization step. The extraction of the best hypotheses was stopped as soon as 100 different normalized hypotheses were found.
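The extraction step can be sketched as follows. The lattice interface (an iterator over hypothesis strings in decreasing score order) and the exact tag inventory are assumptions; the stopping criterion of 100 distinct normalized hypotheses follows the description above.

# Sketch of the N-best extraction with normalization-based deduplication.
# `hypotheses_by_score` is an assumed interface yielding hypothesis strings
# in order of decreasing recognizer score; the tag set is illustrative.
import re

TAGS = re.compile(r"<s/>|\[silence\]|\{breath\}")

def normalize(hypothesis):
    """Remove tags and capitalization so that trivially different hypotheses collapse."""
    return " ".join(TAGS.sub(" ", hypothesis).lower().split())

def extract_nbest(hypotheses_by_score, n=100):
    seen, nbest = set(), []
    for hyp in hypotheses_by_score:
        key = normalize(hyp)
        if key in seen:
            continue
        seen.add(key)
        nbest.append(hyp)
        if len(nbest) == n:
            break
    return nbest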

Lexical acquisition and grammar development were entirely based on the roughly 7000 word forms that occurred in the speech recognition hypotheses, i.e. the language engineer had no other information about the test data. After the development of the linguistic resources, the reference transcriptions of the 609 speech segments were determined and annotated with syntactic structure.

All hypotheses were parsed and the resulting parse forests were stored on disk. For three utterances, some hypotheses could not be parsed exhaustively due to memory limitations (parsing was aborted as soon as more than 5,000,000 feature structure nodes were created). In these cases, we considered only the hypotheses that were scored better than the first unparsable hypothesis. On average, parsing one hypothesis took about 1.5 s with Java 5.0 on a 64 bit Linux workstation with AMD Opteron 2218 CPUs and 8 GB RAM.

The part-of-speech tagger for extracting the part-of-speech features was trained on a transformed version of the TIGER corpus (Brants et al., 2002). The tagger and the training data are described in (Kaufmann, 2009).

5.3. Experimental setup

In order to make the best use of the available test data, we followed a 10-fold cross-validation scheme, i.e. the hypothesis reranking model was evaluated on each fold after being trained on the 9 remaining folds. Note that the training of the hypothesis reranking model relies on the output of the disambiguation model, namely the most likely parse trees and the respective disambiguation scores. This output is produced by means of an embedded cross-validation cycle: the disambiguation model produces output for one fold after being trained on the remaining 8 folds (the fold for evaluating the hypothesis reranking model has to be excluded from the training).
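The embedded cycle can be written down schematically. The training and evaluation functions and the `annotate` interface are placeholders, not the actual implementation; only the fold bookkeeping (train the disambiguation model on 8 folds, produce scores for the 9th, evaluate the reranker on the 10th) mirrors the scheme described above.

# Schematic nested cross-validation; model-training callables are placeholders.
def nested_cross_validation(folds, train_disambiguation, train_reranker, evaluate):
    results = []
    for test_idx, test_fold in enumerate(folds):
        train_folds = [f for i, f in enumerate(folds) if i != test_idx]
        # Inner cycle: disambiguation output for each training fold comes from a
        # model trained on the remaining 8 folds (the test fold is excluded).
        scored_train = []
        for inner_idx, inner_fold in enumerate(train_folds):
            inner_train = [f for j, f in enumerate(train_folds) if j != inner_idx]
            disamb = train_disambiguation(inner_train)
            scored_train.append(disamb.annotate(inner_fold))
        # Outer cycle: train the reranker on the disambiguation output and
        # evaluate it on the held-out fold.
        disamb_full = train_disambiguation(train_folds)
        reranker = train_reranker(scored_train)
        results.append(evaluate(reranker, disamb_full.annotate(test_fold)))
    return results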

The utterances were randomly assigned to the 10 folds. Cross-validation with random assignment is problematic for tasks in which similar observations tend to occur in close proximity: the similar observations are likely to be assigned to different folds and thus can be learned by the model. This would in fact be a problem if our models considered actual word forms (which tend to be clustered within segments of the same topic). However, as the models are based on syntactic categories rather than word forms, such clustering effects can be regarded as negligible.

The training data for the disambiguation model was complemented with 447 syntactically annotated broadcast news transcripts from earlier experiments. There is a slight mismatch between this additional training data and the data from our experiments: the transcripts have correct sentence boundaries and represent only the sections with low word error rates. Overall, the training data for the disambiguation model amounts to about 990 sentences.

The conditional log-linear models were trained with the reranking software by Charniak and Johnson (2005). The regularization parameter c was set to 13 for disambiguation and 30 for hypothesis reranking. These values were found to work well in earlier experiments with similar data and feature sets.

5.4. Broadcast news transcription results

Table 3 shows that automatic segmentation and 100-best reranking reduced the word error rate by 9.7% relative. This improvement is statistically significant at a level of less than 0.1% according to both the Matched Pairs Sentence-Segment Word Error test (MAPSSWE) and the McNemar test on sentence level (Gillick and Cox, 1989).
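For reference, a sentence-level McNemar statistic can be computed from the counts of sentences that exactly one of the two systems transcribes correctly. The sketch below uses the standard chi-square approximation with continuity correction; the exact variant used in the evaluation is not specified here, so this is only illustrative.

# Sketch of a sentence-level McNemar test between two systems.
import math

def mcnemar(ref_sentences, hyp_a, hyp_b):
    """Return an approximate p-value; inputs are parallel lists of sentences."""
    b = sum(1 for r, a, h in zip(ref_sentences, hyp_a, hyp_b) if a == r and h != r)
    c = sum(1 for r, a, h in zip(ref_sentences, hyp_a, hyp_b) if a != r and h == r)
    if b + c == 0:
        return 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)       # continuity-corrected statistic
    # Tail probability of a chi-square distribution with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))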

It is sometimes suspected that syntactic information mainly helps in getting inflectional endings right, particularly for highly inflected languages such as German. In order to check this assumption, we compared the output of the baseline system to the output of the reranking component and counted how often an error was corrected by hypothesizing a different inflectional ending, discounting the errors that were introduced due to the same reason. We found that only 27% of the net error corrections involved changing inflectional endings. The remaining corrections were due to introducing a stem or a function word that was not present in the first-best hypothesis.

Fig. 3. The word error rates that result from reranking the N best speech recognition hypotheses, 1 ≤ N ≤ 100.

Table 4
The relative improvements for two artificially simplified speech recognition tasks. The first experiment assumes correct segmentation and is based on manually determined sentence boundaries. The second experiment is performed on a subset of the data (anchor speaker, correspondent and weather report sections).

Approach                        Word error rate
Baseline                        13.27%
+ Syntactic information         11.98% (−9.7% relative)
+ Correct segmentation          11.67% (−12.2% relative)

Baseline (selected sections)    10.91%
+ Syntactic information         9.41% (−13.7% relative)

Table 5
Word error rates for different sets of features.

Model                           Word error rate
1  No statistics                12.70% (−4.3% rel.)
2  Score features               12.24% (−7.7% rel.)
3  + Partial parse trees        12.05% (−9.2% rel.)
   Full model                   11.98% (−9.7% rel.)
4  No disambiguation score      12.39% (−6.6% rel.)

Another common notion is that formal grammars improve speech recognition essentially by choosing a grammatical hypothesis and that they fail to identify good hypotheses if these hypotheses are ungrammatical. Again, our experimental data suggests that this is not the case. We counted the net error corrections of those utterances in which hypothesis reranking chose a hypothesis with two or more partial parse trees. It was observed that these utterances account for about 37% of the net error corrections.

In order to assess the impact of the number of hypotheses N on the word error rate, we repeated the experiment for all N from 1 to 100. We did not try higher values of N as the lexicon is only guaranteed to cover the words of the 100 best hypotheses. The results are shown in Fig. 3. The 10 best hypotheses alone account for an improvement of about 1% absolute (7.8% relative). For higher values of N, the word error rate starts to level off. However, there still seems to be room for further improvement beyond N = 100.
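The N-best sweep behind Fig. 3 amounts to the following loop. The `rerank` callable is an assumed interface that returns the chosen hypothesis for a given N-best list; the word error rate is computed with a standard word-level Levenshtein alignment.

# Sketch of the N-best sweep: rerank only the top-N hypotheses and recompute
# the word error rate for each N. The reranker interface is an assumption.
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def wer_curve(utterances, rerank, max_n=100):
    """utterances: list of (reference_words, nbest_lists); returns the WER for each N."""
    curve = []
    for n in range(1, max_n + 1):
        errors = sum(edit_distance(ref, rerank(nbest[:n])) for ref, nbest in utterances)
        words = sum(len(ref) for ref, _ in utterances)
        curve.append(errors / words)
    return curve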

5.5. The influence of the recognition task

In order to measure the influence of different characteristics of the recognition task, we performed experiments for variants of the original task. The first experiment was identical to the original one with the difference that the sentence boundaries were determined manually. This can be thought of as assuming perfect sentence boundary detection, or as approaching a dictation scenario where sentence boundaries are clearly marked. As shown in Table 4, manual segmentation led to a relative improvement of 12.2% over the baseline. The contribution of manual segmentation was only weakly significant (4.8% for the MAPSSWE test). However, our results are in line with McNeill et al. (2006), who observed that speech recognition with syntactic features benefited from manual segmentation.

The goal of the second experiment was to assess the benefit of syntactic information for a lower baseline word error rate. This was achieved by restricting the data to anchor speaker sections, correspondent speech and weather reports, which amounts to a total of 506 utterances. The sentence boundaries were determined automatically. Table 4 shows that the lower baseline word error rate led to a larger relative improvement. This confirms the intuition that a language model which makes extensive use of non-local information is particularly good at correcting errors within contexts that are largely correct. Beutler (2007) made a similar observation when varying the out-of-vocabulary rate of his baseline speech recognizer.

In order to get an idea of how much improvement can be expected for spontaneous speech, we performed an experiment on the 92 segments that appear to have been uttered spontaneously. It was found that syntactic information reduced the word error rate by 5.6% relative, compared to a baseline of 22.03%. However, these results are by no means conclusive. First, the set of utterances is very small, both for training and for evaluation. Second, the situations with spontaneous speech are often difficult for the baseline speech recognizer because of background noise or insufficient speaker adaptation. The improvements are likely to be larger under more controlled conditions.

5.6. The influence of syntactic information

In this section, it is investigated how the choice of features and the amount of syntactic information influence the improvement relative to the baseline system. The results are shown in Table 5.

Model 1 involved no statistical information on syntactic preferences. The artificial parse trees were extracted by means of the fewest-chunks heuristics, and the feature set for hypothesis reranking was restricted to the baseline recognizer scores and the features based on partial parse tree counts. It can be seen that the lack of statistical information substantially reduces the benefit of our approach. This reduction is statistically significant on a level of <0.1% according to the McNemar test and the MAPSSWE test.

Model 2 employs the complete feature set for disambiguation, but lets hypothesis reranking only consider the disambiguation score and the baseline recognizer scores. This model significantly improves on model 1 (<1.4% according to the MAPSSWE test and <0.1% according to the McNemar test). Model 3 additionally includes the complete set of partial parse tree features in hypothesis reranking (see Sections 2.3.3 and 2.4). The word error rate of this model is already very close to that of the full model.

It is remarkable that model 2 achieves a relatively large improvement even though its sole source of syntactic information is the disambiguation score. The importance of this feature is confirmed by the fact that removing the disambiguation score from the full model (model 4) significantly increases the word error rate (<0.3% according to the MAPSSWE test). Note that in model 4, hypothesis reranking employs all features that are used to compute the disambiguation score. However, the respective feature weights are trained to discriminate between good and bad hypotheses. This seems to be a less effective way of learning syntactic preferences.

It was further investigated to what extent the improvement depends on the precision of the grammar. The original grammar was modified by disabling all rules that participate in the formation of verbal complexes, verb phrases and adjective phrases. The resulting grammar still enforced the agreement between determiners, adjectives and nouns, as well as case agreement between prepositions and their complement noun phrases. The full model in combination with this chunk grammar achieved a word error rate of 12.78%, which corresponds to a relative reduction of 3.7%. This reduction is still significant (0.4% for the MAPSSWE test and 0.1% for the McNemar test).

Fig. 4. The effect of the amount of syntactically annotated training data on the relative reduction of the word error rate. Note that the set of 896 sentences represents the full amount of training data. This set consists of the 447 sentences from earlier experiments and 90% of the 609 current sentences (only 90% because of the cross-validation scheme).

We also varied the amount of syntactically annotated data for the training of the disambiguation model by means of random subsampling. The influence of this parameter on the word error rate is shown in Fig. 4. It can be seen that the number of training sentences can be halved without significantly impairing the speech recognition performance. This confirms our claim that a relatively small amount of syntactically annotated data is sufficient for the training of our model. Note that the hypothesis reranking model also employs syntactic features and thus can to some extent compensate for the lack of training data.
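The learning curve in Fig. 4 corresponds to repeatedly subsampling the annotated sentences before training. A minimal sketch, with placeholder training and scoring functions, could look like this; the repetition count and averaging are illustrative choices, not taken from the experiments.

# Sketch of the subsampling experiment behind Fig. 4: train on random subsets of
# increasing size and record the resulting score (e.g. relative WER reduction).
# The `train_and_evaluate` callable is a placeholder.
import random

def learning_curve(sentences, sizes, train_and_evaluate, repetitions=5, seed=0):
    rng = random.Random(seed)
    curve = {}
    for size in sizes:                      # each size must not exceed len(sentences)
        scores = [train_and_evaluate(rng.sample(sentences, size)) for _ in range(repetitions)]
        curve[size] = sum(scores) / len(scores)
    return curve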

Finally, we attempted to evaluate the contribution of our approach to robust parsing. To this end, we performed an experiment in which the disambiguation component chose the partial parse trees according to the fewest-chunks method. Note that the fewest-chunks method essentially splits a sentence into segments, each of which corresponds to a partial parse tree. For a given segment, we still choose the most likely parse tree according to our model. We observed a relative improvement of 9.01%, which is only slightly (and not significantly) lower than the improvement with the original approach (9.70%).

However, it must be noted that even for the fewest-chunks method, we still treat the selected partial parse trees as an artificial parse tree and apply our disambiguation model to that tree. Thus, the main contribution of our robustness approach is not the (possibly) better choice of chunks, but the fact that it produces a disambiguation score even for ungrammatical hypotheses. As we have seen in Table 5, this score contributes significantly to the overall improvement.

5.7. Coverage of grammar and lexicon

In order to assess the coverage of the grammar and the lexicon, the reference transcriptions of the manually determined sentence segments were parsed. As the words occurring in the reference transcriptions are not necessarily covered by the lexicon, missing words were included beforehand. For each sentence, the parse tree that best matched the reference syntax tree was extracted and examined. A correct complete parse tree was found for 61% of the sentences. Another 7.3% of the sentences were ellipses for which all parts could be correctly analyzed by the parser. Only 3% of the sentences were considered to be clearly ungrammatical in the sense of incorrect language use.

It was also attempted to determine the different types of errors that prevented sentences from being analyzed correctly. It was found that lexicon errors account for about 30% of the total number of errors. More than half of the lexicon errors were due to missing verb valence information. Another common type of lexicon error was missing multiword lexemes such as "kirch media", as well as lexicalized phrases like "aus dem stand heraus" (straight away). The remaining lexicon errors were due to incorrect or missing single-word lexicon entries, mostly for function words. None of the lexicon-related errors involved an incorrect noun entry. This confirms our impression that our method for acquiring nouns works quite well. On the other hand, the acquisition of verb valence information appears to be a major problem which might benefit from additional data-driven techniques.

Only 5% of the errors could be attributed to cases where the grammar failed to account for a construction that was supposed to be covered by the grammar. For example, the phrase "das depot, das umgeben von wohngebieten ist" (the depot which is surrounded by residential areas) cannot be analyzed because the grammar prohibits the interruption of verbal complexes (in this example umgeben ist) by complements.

About 41% of the errors were due to general constructions that are not covered by the grammar. Some of these constructions could be incorporated into the grammar with moderate effort, e.g. certain types of appositions, free relative clauses and sentential relative clauses. Other constructions are notoriously difficult to formalize or would give rise to a large amount of ambiguity. These constructions include coordinations (asyndetic coordinations, asymmetric coordinations and coordinations of constituents with different categories), parentheses and ellipses. Another important phenomenon of this type is the omission of mandatory determiners such as auslöser (trigger) in "auslöser war eine herabstufung des unternehmens" (the downgrading of the company was the trigger).

On the other hand, idiosyncratic or very rare constructions accounted for about 24% of the errors. An example is the construction "X für X" (X by X) as in the utterance "in hebron durchkämmten die soldaten haus für haus" (in hebron the soldiers combed house by house). The main problem with such constructions is that there are many of them and they each require a dedicated grammar rule or a special lexicon entry. In fact, it is not trivial to discover such constructions in the first place.

6. Conclusions

6.1. Discussion

In this paper, we have described and evaluated an approach to integrating a precise formal grammar into statistical speech recognition. Even though formal grammars represent natural language in the form of hard linguistic constraints, these constraints are applied in a "soft" way: syntactic information is extracted from arbitrary hypotheses (including hypotheses that are not covered by the grammar) and weighted by means of statistical models. The approach was implemented and applied to a German broadcast news transcription task. Our experiments confirmed that a precise formal grammar can significantly improve large-vocabulary speech recognition, even for a broad domain and a competitive baseline system.

The automatic transcription of broadcast news is subject to out-of-vocabulary words, unknown sentence boundaries and – to some extent – bad acoustic conditions. Thus, the chosen task was by no means simple. Nevertheless, syntactic information may be particularly beneficial for broadcast news transcription as this task involves mostly prepared, grammatical speech. We do not provide a conclusive evaluation on spontaneous speech. Yet we believe that our approach would still yield improvements on such tasks: even ungrammatical spontaneous utterances are likely to contain phrases that provide useful syntactic information. It might also be possible to adapt to the nature of spontaneous speech by extending the grammar or by using preprocessing components that deal with self-repairs and related phenomena (see Moore et al. (1995) for an example).

We have investigated different properties of our approach. In particular, it was observed that grammaticality information alone led to a relatively small (though still statistically significant) improvement. However, the improvement was more than doubled by adding statistical information, even if the statistical models were trained with a very small amount of data (about 450 sentences or 5200 words of syntactically annotated text and 550 N-best lists). This effect may be explained by the precision of the grammar and the lexicon: the grammar still accepts many implausible sentences, but in order to do so, it has to resort to certain constructions that can easily be penalized by the statistical models. Examples of such constructions are prenominal genitives and nominalized adjective phrases, which indeed are strongly penalized by the log-linear models. These results suggest that the qualitative information expressed in the grammar and the lexicon greatly reduces the need for quantitative information, i.e. treebank data.

The small amount of treebank data is a distinctive characteristic of the proposed approach in comparison to the related work of Collins et al. (2005) and McNeill et al. (2006). These approaches are based on statistical parsers that were trained on about one million words of syntactically annotated text. It is difficult to compare the effort of creating a treebank to that of writing a precision grammar and acquiring a large lexicon. Both tasks are very laborious and require expert knowledge. However, it is likely that qualitative information can be transferred to new domains more easily.

Another characteristic is that our approach does not consider head word information at all. Thus, semantic aspects like common combinations of a verb and the head noun of its subject are not captured. This is again in contrast to the work of Collins et al. (2005) and McNeill et al. (2006), who made extensive use of this sort of information. We conclude that the proposed approach is, to some degree, orthogonal to statistical methods that consider head-to-head dependencies. How large the actual gain from combining such statistical methods with rule-based approaches would be, and how head word information could be incorporated into our approach, are open questions.



6.2. Outlook

The primary objective of this work was to investigate how formal grammars can be used to improve large-vocabulary continuous speech recognition, and how much improvement can be expected from this kind of information.

Although our experiments are valid, it is unclear whether the proposed approach could be used in an operative speech recognition system. The problem is twofold. First, the parsing of the recognition hypotheses would have to be much more efficient. Second, the manual creation of a precise lexicon would be infeasible for a speech recognizer vocabulary with hundreds of thousands of word forms.

One solution could be to improve the current methods, e.g. to develop a more efficient parser and a sufficiently accurate technique for automatic lexical acquisition. Another solution could be to use a less precise grammar and complement it with a stronger model for syntactic preferences. For example, the lexicon might specify only those syntactic properties that can reliably be extracted from text corpora, whereas the other properties remain underspecified. The statistical models could represent preferences for these properties, possibly falling back on less reliable cues present in the text corpora.

Even though we have placed a strong emphasis on precise linguistic knowledge, we are aware that this need not be the optimal way for exploiting syntactic information in language modeling. It thus seems reasonable to explore further the continuum between fully statistical language models and precise formal grammars.

Acknowledgements

This work was supported by the Swiss National Science Foundation. We cordially thank Jean-Luc Gauvain for providing us with word lattices from the LIMSI German broadcast news transcription system. We further thank Canoo Engineering AG for granting us access to their morphological database.

References

Baldwin, T., 2005a. Bootstrapping deep lexical resources: resources for courses. In: Proc. ACL-SIGLEX Workshop on Deep Lexical Acquisition. Ann Arbor, USA, pp. 67–76.

Baldwin, T., 2005b. General-purpose lexical acquisition: procedures, questions and results. In: Proc. 6th Meeting of the Pacific Association for Computational Linguistics, pp. 23–32.

Beutler, R., 2007. Improving speech recognition through linguistic knowledge. Ph.D. thesis, ETH Zurich.

Beutler, R., Kaufmann, T., Pfister, B., 2005. Integrating a non-probabilistic grammar into large vocabulary continuous speech recognition. In: Proc. IEEE ASRU Workshop. San Juan, Puerto Rico, pp. 104–109.

Brants, S., Dipper, S., Hansen, S., Lezius, W., Smith, G., 2002. The TIGER treebank. In: Proc. Workshop on Treebanks and Linguistic Theories. Sozopol, Bulgaria.

Brants, T., 2000. TnT: a statistical part-of-speech tagger. In: Proc. 6th Conference on Applied Natural Language Processing. Seattle, Washington, pp. 224–231.

Carrol, J., Oepen, S., 2005. High efficiency realization for a wide-coverage unification grammar. In: Proc. International Joint Conference on Natural Language Processing. Jeju Island, Korea, pp. 165–176.

Charniak, E., 2001. Immediate-head parsing for language models. In: Proc. ACL'01. Toulouse, France, pp. 124–131.

Charniak, E., Johnson, M., 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In: Proc. ACL'05, pp. 173–180.

Chelba, C., Jelinek, F., 2000. Structured language modeling. Comput. Speech Lang. 14 (4), 283–332.

Chen, S.F., Rosenfeld, R., 1999. A Gaussian prior for smoothing maximum entropy models. Tech. rep., Carnegie Mellon University, Pittsburgh, PA.

Church, K.W., Hanks, P., 1990. Word association norms, mutual information, and lexicography. Comput. Linguist. 16 (1), 22–29.

Collins, C., Carpenter, B., Penn, G., 2004. Head-driven parsing for word lattices. In: Proc. ACL.

Collins, M., Koo, T., 2005. Discriminative reranking for natural language parsing. Comput. Linguist. 31 (1), 25–69.

Collins, M., Roark, B., Saraclar, M., 2005. Discriminative syntactic language modeling for speech recognition. In: Proc. ACL'05. Ann Arbor, Michigan, pp. 507–514.

Crysmann, B., 2003. On the efficient implementation of German verb placement in HPSG. In: Proc. RANLP.

Crysmann, B., 2005. Relative clause extraposition in German: an efficient and portable implementation. Res. Lang. Comput. 3 (1), 61–82.

Duden, 1999. Das grosse Wörterbuch der deutschen Sprache in 10 Bänden, 3. Auflage. Dudenverlag, Mannheim, Leipzig, Wien, Zürich.

Erman, L.D., Hayes-Roth, F., Lesser, V.R., Reddy, D.R., 1980. The Hearsay-II speech-understanding system: integrating knowledge to resolve uncertainty. ACM Comput. Surv. 12 (2), 213–253.

Fano, R.M., 1961. Transmission of Information. MIT Press.

Gauvain, J.L., Lamel, L., Adda, G., 2002. The LIMSI broadcast news transcription system. Speech Comm. 37 (1–2), 89–108.

Gillick, L., Cox, S., 1989. Some statistical issues in the comparison of speech recognition algorithms. In: Proc. ICASSP'89, pp. 532–535.

Goddeau, D., 1992. Using probabilistic shift-reduce parsing in speech recognition systems. In: Proc. ICSLP.

Goodine, D., Seneff, S., Hirschman, L., Phillips, M., 1991. Full integration of speech and language understanding in the MIT spoken language system. In: Proc. 2nd European Conf. on Speech Communication and Technology.

Hall, K., Johnson, M., 2003. Language modeling using efficient best-first bottom-up parsing. In: Proc. IEEE Automatic Speech Recognition and Understanding Workshop, pp. 507–512.

Johnson, M., Geman, S., Canon, S., Chi, Z., Riezler, S., 1999. Estimators for stochastic "unification-based" grammars. In: Proc. ACL'99. College Park, Maryland, pp. 535–541.

Jurafsky, D., Wooters, C., Segal, J., Stolcke, A., Fosler, E., Tajchman, G., Morgan, N., 1995. Using a stochastic context-free grammar as a language model for speech recognition. In: Proc. ICASSP'95, Vol. 1, pp. 189–189.

Kaplan, R.M., Riezler, S., King, T.H., Maxwell III, J.T., Vasserman, A., Crouch, R.S., 2004. Speed and accuracy in shallow and deep stochastic parsing. In: Proc. HLT-NAACL'04, pp. 97–104.

Kaufmann, T., 2009. A rule-based language model for speech recognition. Ph.D. thesis, ETH Zurich.

Kaufmann, T., Ewender, T., Pfister, B., 2009. Improving broadcast news transcription with a precision grammar and discriminative reranking. In: Proc. Interspeech'09. Brighton, England.

Kiefer, B., Krieger, H.-U., Nederhof, M.-J., 2000. Efficient and robust parsing of word hypotheses graphs. In: Wahlster, W. (Ed.), Verbmobil: Foundations of Speech-to-Speech Translation. Springer, Berlin, pp. 280–295.

Kita, K., Ward, W.H., 1991. Incorporating LR parsing into SPHINX. In: Proc. ICASSP'91, pp. 269–272.



Malouf, R., 2002. A comparison of algorithms for maximum entropy parameter estimation. In: Proc. Internat. Conf. on Computational Linguistics, pp. 1–7.

Malouf, R., van Noord, G., 2004. Wide coverage parsing with stochastic attribute value grammars. In: Proc. IJCNLP-04 Workshop: Beyond Shallow Analyses – Formalisms and Statistical Modeling for Deep Analyses.

Manber, U., Myers, G., 1990. Suffix arrays: a new method for on-line string searches. In: Proc. 1st Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 319–327.

McNeill, W.P., Kahn, J.G., Hillard, D.L., Ostendorf, M., 2006. Parse structure and segmentation for improving speech recognition. In: Proc. IEEE Spoken Language Technology Workshop, pp. 90–93.

McTait, K., Adda-Decker, M., 2003. The 300k LIMSI German broadcast news transcription system. In: Proc. Eurospeech'03. Geneva, Switzerland.

Moore, R., Appelt, D., Dowding, J., Gawron, J.M., Moran, D., 1995. Combining linguistic and statistical knowledge sources in natural-language processing for ATIS. In: Proc. ARPA Spoken Language Systems Technology Workshop.

Müller, S., 1999. Deutsche Syntax deklarativ. Head-Driven Phrase Structure Grammar für das Deutsche. No. 394 in Linguistische Arbeiten. Max Niemeyer Verlag, Tübingen.

Müller, S., 2007. Head-Driven Phrase Structure Grammar: Eine Einführung. Stauffenburg Einführungen, Nr. 17. Stauffenburg Verlag, Tübingen.

Müller, S., Kasper, W., 2000. HPSG analysis of German. In: Wahlster, W. (Ed.), Verbmobil: Foundations of Speech-to-Speech Translation. Artificial Intelligence. Springer-Verlag, Berlin, Heidelberg, New York, pp. 238–253.

Nederhof, M.J., Bouma, G., Koeling, R., van Noord, G., 1997. Grammatical analysis in the OVIS spoken-dialogue system. In: Proc. ACL/EACL Workshop on Spoken Dialog Systems, pp. 66–73.

Nicholson, J., Kordoni, V., Zhang, Y., Baldwin, T., Dridan, R., 2008. Evaluating and extending the coverage of HPSG grammars. In: Proc. 6th International Conference on Language Resources and Evaluation. Marrakech, Morocco.

Osborne, M., 2000. Estimation of stochastic attribute-value grammars using an informative sample. In: Proc. 18th Internat. Conf. on Computational Linguistics, pp. 586–592.

Pollard, C.J., Sag, I.A., 1987. Information-Based Syntax and Semantics, Vol. 1. CSLI Lecture Notes No. 13. CSLI Publications, Stanford University.

Pollard, C.J., Sag, I.A., 1994. Head-Driven Phrase Structure Grammar. University of Chicago Press.

Price, P., 1990. Evaluation of spoken language systems: the ATIS domain. In: Proc. Workshop on Speech and Natural Language, pp. 91–95.

Price, P., Fisher, W.M., Bernstein, J., Pallett, D.S., 1988. The DARPA 1000-word resource management database for continuous speech recognition. In: Proc. ICASSP'88, Vol. 1, pp. 651–654.

Prins, R., van Noord, G., 2003. Reinforcing parser preferences through tagging. Traitement Automatique des Langues 44 (3), 121–139.

Rayner, M., Hockey, B.A., Bouillon, P., et al., 2006. Putting Linguistics into Speech Recognition: The Regulus Grammar Compiler. MIT Press.

Riezler, S., King, T.H., Kaplan, R.M., Crouch, R., Maxwell III, J.T., Johnson, M., 2002. Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In: Proc. ACL'02, pp. 271–278.

Roark, B., 2001a. Markov parsing: lattice rescoring with a statistical parser. In: Proc. ACL'01. Philadelphia, USA, pp. 287–294.

Roark, B., 2001b. Probabilistic top-down parsing and language modeling. Comput. Linguist. 27 (2), 249–276.

Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D., 2002. Multiword expressions: a pain in the neck for NLP. Lect. Notes Comput. Sci., 1–15.

Sag, I.A., Wasow, T., Bender, E.M., 2003. Syntactic Theory: A Formal Introduction, second ed. CSLI Publications, Stanford.

Schone, P., Jurafsky, D., 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In: Proc. 6th Conf. on Empirical Methods in Natural Language Processing, pp. 100–108.

Tomita, M., 1991. Generalized LR Parsing. Springer.

Toutanova, K., Manning, C.D., Flickinger, D., Oepen, S., 2005. Stochastic HPSG parse disambiguation using the Redwoods Corpus. Res. Lang. Comput. 3 (1), 83–105.

van Noord, G., 2001. Robust parsing of word graphs. In: Junqua, J.-C., van Noord, G. (Eds.), Robustness in Language and Speech Technology. Kluwer Academic Publishers, Dordrecht.

van Noord, G., 2004. Error mining for wide-coverage grammar engineering. In: Proc. ACL.

van Noord, G., et al., 2006. At last parsing is now operational. In: Actes de la 13e conférence sur le traitement automatique des langues naturelles, pp. 20–42.

Wahlster, W. (Ed.), 2000. Verbmobil: Foundations of Speech-to-Speech Translation. Springer, Berlin.

Wang, W., Harper, M.P., 2002. The SuperARV language model: investigating the effectiveness of tightly integrating multiple knowledge sources. In: Proc. Conf. on Empirical Methods in Natural Language Processing, Vol. 10, pp. 238–247.

Wang, W., Harper, M.P., 2003. Language modeling using a statistical dependency grammar parser. In: Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 519–524.

Wang, W., Stolcke, A., Harper, M.P., 2004. The use of a linguistically motivated language model in conversational speech recognition. In: Proc. ICASSP'04, pp. 261–264.

Xu, P., Chelba, C., Jelinek, F., 2002. A study on richer syntactic dependencies for structured language modeling. In: Proc. ACL'02. Philadelphia, USA, pp. 191–198.

Yamamoto, M., Church, K.W., 2001. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Comput. Linguist. 27 (1), 1–30.

Zhang, Y., Kordoni, V., Fitzgerald, E., 2007. Partial parse selection for robust deep processing. In: Proc. Workshop on Deep Linguistic Processing, pp. 128–135.

Zue, V., Glass, J., Goodine, D., Leung, H., Phillips, M., Polifroni, J., Seneff, S., 1990. The VOYAGER speech understanding system: preliminary development and evaluation. In: Proc. ICASSP'90, pp. 73–76.