



Transcribing Mandarin Broadcast Speech Using Multi-Layer Perceptron Acoustic Features

Fabio Valente, Member, IEEE, Mathew Magimai Doss, Member, IEEE, Christian Plahl, Suman Ravuri, and Wen Wang, Member, IEEE

Abstract—Recently, several multi-layer perceptron (MLP)-based front-ends have been developed and used for Mandarin speech recognition, often showing significant complementary properties to conventional spectral features. Although widely used in multiple Mandarin systems, no systematic comparison of all the different approaches and of their scalability has been proposed. The novelty of this correspondence is mainly experimental. In this work, all the MLP front-ends recently developed at multiple sites are described and compared in a systematic manner on a 100-hour setup. The study covers the two main directions along which the MLP features have evolved: the use of different input representations to the MLP and the use of more complex MLP architectures beyond the three-layer perceptron. The results are analyzed in terms of confusion matrices, and the paper discusses a number of novel findings that the comparison reveals. Furthermore, the two best front-ends used in the GALE 2008 evaluation, referred to as MLP1 and MLP2, are studied in a more complex LVCSR system in order to investigate their scalability in terms of the amount of training data (from 100 hours to 1600 hours) and the parametric system complexity (maximum likelihood versus discriminative training, speaker adaptive training, lattice-level combination). Results on 5 hours of evaluation data from the GALE project reveal that the MLP features consistently produce improvements in the range of 15%–23% relative at the different steps of a multipass system when compared to mel-frequency cepstral coefficient (MFCC) and PLP features, suggesting that the improvements scale with the amount of data and with the complexity of the system. The integration of those features into the GALE 2008 evaluation system provides very competitive performance compared to other Mandarin systems.

Index Terms—Automatic speech recognition (ASR), broadcast data, GALE project, multi-layer perceptron (MLP), multi-stream, TANDEM features.

Manuscript received July 16, 2010; revised December 20, 2010; accepted March 20, 2011. Date of publication April 21, 2011; date of current version September 16, 2011. This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Dimitra Vergyri.

F. Valente and M. M. Doss are with the Idiap Research Institute, 1920 Martigny, Switzerland (e-mail: [email protected]; [email protected]).

C. Plahl is with the Computer Science Department, RWTH Aachen University, 52056 Aachen, Germany (e-mail: [email protected]).

S. Ravuri is with the International Computer Science Institute, Berkeley, CA 94704 USA (e-mail: [email protected]).

W. Wang is with the Speech Technology and Research Laboratory, SRI International, Menlo Park, CA 94025 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASL.2011.2139206

I. INTRODUCTION

RECENTLY, a growing number of large-vocabulary continuous speech recognition (LVCSR) systems make use of multi-layer perceptron (MLP) features. MLP features were originally introduced by Hermansky and his colleagues in [1], where the output of an MLP classifier is used as the acoustic front-end for conventional speech recognition systems based on hidden Markov models/Gaussian mixture models (HMMs/GMMs).

A large number of studies have proposed different types of MLP-based front-ends (see [2]–[5]) and investigated their use for transcribing English (see [6] and [7]). The most common application is in concatenation with mel-frequency cepstral coefficient (MFCC) or perceptual linear predictive (PLP) features, where MLP features show considerable complementarity properties. In recent years, in the framework of the GALE¹ program, MLP features have been extensively used in ASR systems for the Mandarin and Arabic languages (see [5] and [8]–[11]). Since the original work [1], MLP front-ends have progressed along two main directions:

1) the use of different input representations to the MLP;
2) the use of complex MLP architectures beyond the conventional three-layer perceptron.

The first category includes speech representations that aim at using long time spans of the speech signal, which could capture long-term phenomena (such as co-articulation) and are complementary to MFCC or PLP features [7]. Because of the large dimension of the signal time spans, a number of techniques for efficiently encoding this information have been proposed, like MRASTA [4], DCT-TRAPS [12], and wLP-TRAPS [13]. The second category includes a heterogeneous number of techniques that aim at overcoming the pitfalls of the single MLP classifier. They are based on the probabilistic combination of MLP outputs obtained using different input representations. Those combinations can happen in a parallel fashion, like in the multi-stream approach [2], [14], or in a hierarchical fashion [15]. Furthermore, the probabilistic features generated by three-layer MLPs have recently also been replaced by the bottleneck features extracted by four-layer and five-layer MLPs [16].

While previous works, e.g., [9], have discussed the development of Mandarin LVCSR systems that use those features, no exhaustive comparison and analysis of the different front-ends has been presented in the literature.

¹ http://www.darpa.mil/ipto/programs/gale/gale.asp



Without such a side-by-side comparison, it is not possible to assess which of the recent advances actually produced improvements in the final system. This correspondence focuses on those recent advances in training, scaling, and integrating MLP front-ends for Mandarin transcription. The novelty of this work is mainly experimental, and the correspondence provides two contributions.

First, the various MLP-based front-ends recently developed at multiple sites are described and compared on a common experimental setup in a systematic way. The comparison covers all the MLP features used in GALE and is done using the same phoneme set, the same speech-silence segmentation, the same amount of training data, and the same number of free parameters. The study is done using a simplified version of the system described in [9] trained on 100 hours of Mandarin broadcast news and conversation recordings. The investigation covers MLP acoustic front-ends as stand-alone features and in concatenation with conventional MFCC features. To the best of our knowledge, this is the most exhaustive comparison of MLP front-ends for Mandarin speech recognition. The comparison reveals a number of novel facts about the different features and about their use in LVCSR systems.

The second contribution is a study of how the performances scale with the amount of training data (from 100 hours to 1600 hours of broadcast audio) and with the parametric model complexity of the system (including speaker adaptive training, lattice-level combination, and discriminative training). As before, the contrastive experiments are run with and without the MLP features to assess the maximum relative improvement that can be obtained.

The remainder of the paper is organized as follows. Section II describes features obtained using three-layer MLPs with various input representations, and Section III describes features obtained using modifications to the three-layer architecture. Section IV experiments with those features in a system trained on 100 hours and analyzes and discusses the results of the comparison. Section V experiments with a large-scale multipass evaluation system, and finally the paper is concluded in Section VI.

II. INPUT REPRESENTATION FOR THREE-LAYER MLP FEATURES

The simplest MLP feature extraction is based on the following steps. At first, a three-layer MLP classifier is trained in order to minimize the cross-entropy between its output and a set of phonetic labels. Such a classifier produces phoneme posterior probabilities conditioned on the input representation at a given time instant [17].

In order to use this representation in HMM/GMM models, the phoneme posterior probabilities are first Gaussianized by applying a logarithm and then decorrelated using a principal component analysis (PCA) transform. After PCA, a dimensionality reduction accounting for 95% of the total variability is applied. The resulting feature vectors are used as conventional acoustic features in ASR systems. This framework is also known as TANDEM [1].
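To make the pipeline concrete, here is a minimal numpy sketch of the log/PCA step (our own illustration; the function name and the small flooring constant are assumptions, not code from the paper):

```python
import numpy as np

def tandem_transform(posteriors, var_kept=0.95):
    """Log + PCA post-processing of MLP phoneme posteriors (TANDEM).

    posteriors: (n_frames, n_phones) matrix of MLP outputs.
    Returns decorrelated features keeping `var_kept` of the variance.
    """
    x = np.log(posteriors + 1e-10)            # Gaussianize with a logarithm
    x = x - x.mean(axis=0)                    # center before PCA
    eigval, eigvec = np.linalg.eigh(np.cov(x, rowvar=False))
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # sort descending
    # keep the components accounting for 95% of the total variability
    k = int(np.searchsorted(np.cumsum(eigval) / eigval.sum(), var_kept)) + 1
    return x @ eigvec[:, :k]
```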

The input to the MLP classifier can be conventional short-term features like PLP/MFCC or long-term features which aim at capturing the dynamic characteristics of the speech signal over large time spans. Let us briefly describe four different MLP inputs proposed and used for the transcription of Mandarin broadcasts:

A. TANDEM-PLP

In TANDEM-PLP features, the input to the MLP is represented by nine consecutive frames of PLP cepstral features. Mandarin is a tonal language; thus, the PLP vector is augmented with the smoothed log-pitch estimate plus its first- and second-order temporal derivatives as described in [18]. PLP features undergo vocal tract length normalization and speaker-level mean and variance normalization. The final dimension of this vector is 42; thus, the input to the MLP is a vector of size 9 × 42 = 378.
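The nine-frame context window can be built by stacking shifted copies of the per-frame feature matrix; the sketch below is our own illustration (edge padding by frame repetition is an assumption) and yields the 378-dimensional input:

```python
import numpy as np

def stack_frames(feats, context=4):
    """Build the 9-frame MLP input by stacking +/-4 neighboring frames.

    feats: (n_frames, 42) normalized PLP+pitch vectors.
    Returns (n_frames, 9 * 42) = (n_frames, 378); edge frames are
    repeated to pad the borders of the utterance.
    """
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)]
                      for i in range(2 * context + 1)])
```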

TANDEM-PLP was the first MLP-based feature to be proposed and aims at using a few consecutive frames of short-term spectral features. On the other hand, the input to the MLP can also be represented by critical-band temporal trajectories (up to half a second) aiming at modeling long time patterns of the speech signal (also known as Temporal Patterns or TRAPS [19]). The dimensionality of TRAPS is quite large; considering, for instance, 500-ms trajectories in a 19 critical-band spectrogram would produce a vector of dimension 9500. Several methods have been considered for efficiently encoding this information while reducing the dimension, and they will be briefly reviewed in the following.

B. Multiple RASTA

Multiple RASTA (MRASTA) filtering [4] is an extension of RASTA filtering and aims at using long signal time spans at the input of the MLP. The model is consistent with studies on human perception of modulation frequencies modeled using a bank of filters equally spaced on a logarithmic scale [20]. This bank of filters subdivides the available modulation frequency range into separate channels with a decreasing resolution moving from slow to fast modulations.

Feature extraction is composed of the following parts: a 19 critical-band auditory spectrum is extracted from the short-time Fourier transform of the signal every 10 ms. A 600-ms long temporal trajectory in each critical band is filtered with a bank of bandpass filters. Those filters represent first derivatives (1) and second derivatives (2) of Gaussian functions with variance varying in the range 8–60 ms:

G1_{\sigma_i}(t) = -\frac{t}{\sigma_i^2}\, e^{-t^2/(2\sigma_i^2)}   (1)

G2_{\sigma_i}(t) = \left(\frac{t^2}{\sigma_i^4} - \frac{1}{\sigma_i^2}\right) e^{-t^2/(2\sigma_i^2)}, with \sigma_i \in [8, 60] ms   (2)

In effect, the MRASTA filters are multi-resolution bandpass filters on modulation frequency, dividing the available modulation frequency range into its individual sub-bands.² In the modulation frequency domain, they correspond to a filter-bank with equally spaced filters on a logarithmic scale. Identical filters are used for all critical bands. Thus, they provide a multiple-resolution representation of the time–frequency plane.

After MRASTA filtering, frequency derivatives across three consecutive critical bands are introduced. The total number of features used as input for a three-layer MLP is 432.

² Unlike in [4], filter-banks G1 and G2 are composed of six filters rather than eight, leaving out the two filters with the longest impulse responses.
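The following numpy sketch illustrates the pipeline under stated assumptions: the six σ values are illustrative, roughly log-spaced picks in the 8–60-ms range, and the helper names are ours. Twelve Gaussian-derivative filters per band plus frequency derivatives across three consecutive bands give 19 × 12 + 17 × 12 = 432 features per frame:

```python
import numpy as np

def mrasta_filters(sigmas_ms=(8, 12, 19, 30, 38, 60), dt_ms=10, half_len=30):
    """Bank of Gaussian-derivative filters as in (1)-(2), one sample per 10 ms.

    Returns a (12, 61) array: six first-derivative (G1) and six
    second-derivative (G2) filters over a 600-ms support.
    """
    t = np.arange(-half_len, half_len + 1) * dt_ms
    bank = []
    for s in sigmas_ms:
        g = np.exp(-t**2 / (2 * s**2))
        bank.append(-t / s**2 * g)                  # G1: first derivative
        bank.append((t**2 / s**4 - 1 / s**2) * g)   # G2: second derivative
    return np.array([f - f.mean() for f in bank])   # enforce zero-mean filters

def mrasta_features(spec):
    """spec: (n_frames, 19) log critical-band energies at a 10-ms rate.

    Per frame: 19 bands x 12 filters = 228 temporal features, plus
    frequency derivatives across 3 consecutive bands (17 x 12 = 204),
    giving the 432-dimensional MLP input described above.
    """
    bank = mrasta_filters()
    filt = np.stack([[np.convolve(spec[:, b], h, mode="same") for h in bank]
                     for b in range(spec.shape[1])])   # (19, 12, n_frames)
    filt = filt.transpose(2, 0, 1)                     # (n_frames, 19, 12)
    freq_der = filt[:, 2:, :] - filt[:, :-2, :]        # (n_frames, 17, 12)
    n = len(spec)
    return np.concatenate([filt.reshape(n, -1),
                           freq_der.reshape(n, -1)], axis=1)
```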


TABLE I: DIFFERENCES BETWEEN THE THREE INPUT REPRESENTATIONS THAT USE LONG TEMPORAL TIME SPANS

C. DCT-TRAPS

The DCT-TRAPS aims at reducing the dimension of the trajectories using a discrete cosine transform (DCT). As described in [12], the results obtained using a DCT basis are very similar to those obtained using a principal component analysis. A critical-band auditory spectrum is extracted from the short-time Fourier transform of the signal every 10 ms. Then 500-ms long energy trajectories are extracted for each of the 19 critical bands that compose the spectrogram. Those are projected on the first 16 coefficients of a DCT transform, resulting in a vector of size 16 × 19 = 304 used as input to the MLP. Contrary to MRASTA, they do not emulate any sensitivity of human hearing to the different modulation frequencies.
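A minimal sketch of this encoding, assuming a 51-frame (500-ms, 10-ms rate) analysis window; names and padding are our own choices:

```python
import numpy as np
from scipy.fft import dct

def dct_traps(spec, win=51, n_coef=16):
    """DCT-TRAPS: encode 500-ms band trajectories with 16 DCT coefficients.

    spec: (n_frames, 19) log critical-band energies at a 10-ms rate.
    Returns (n_frames, 19 * 16) = (n_frames, 304) MLP inputs.
    """
    half = win // 2
    padded = np.pad(spec, ((half, half), (0, 0)), mode="edge")
    out = np.empty((len(spec), spec.shape[1] * n_coef))
    for i in range(len(spec)):
        traj = padded[i:i + win]                          # (51, 19) window
        coefs = dct(traj, axis=0, norm="ortho")[:n_coef]  # first 16 per band
        out[i] = coefs.T.ravel()
    return out
```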

D. wLP-TRAPS

A third alternative for extracting information from long signal time spans is represented by the wLP-TRAPS [13]. Contrary to the previous front-ends, the process does not use the short-term spectrum and thus potentially provides more complementarity to the MFCC features. Those features are obtained by warping the temporal axis after LP-TRAP feature calculation [21]. The feature extraction is composed of the following steps: at first, linear prediction is used to model the Hilbert envelopes of pre-warped 500-ms long energy trajectories in auditory-like frequency sub-bands. The warping ensures that more emphasis is given to the center of the trajectories compared to the borders [13], thus again emulating human perception. 25 LPC coefficients in 19 frequency bands are then used as input to the MLP, producing a feature vector of dimension 25 × 19 = 475.
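The sketch below illustrates the idea for a single band; the warping exponent and the autocorrelation-method LPC solver are our own assumptions, not the exact recipe of [13]. Stacking the 19 bands yields the 475-dimensional input:

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import hilbert

def lpc(x, order=25):
    """LPC coefficients via the autocorrelation (Levinson) method."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    return solve_toeplitz((r[:-1], r[:-1]), r[1:])

def wlp_trap(band_traj, order=25, warp_exp=1.5):
    """One band's wLP-TRAP: LPC model of a warped 500-ms Hilbert envelope."""
    env = np.abs(hilbert(band_traj))                # Hilbert envelope
    n = len(env)
    u = np.linspace(-1.0, 1.0, n)
    # Warped sampling positions cluster near the trajectory center,
    # emphasizing it over the borders.
    pos = (n - 1) * 0.5 * (1 + np.sign(u) * np.abs(u) ** warp_exp)
    return lpc(np.interp(pos, np.arange(n), env), order)
```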

All three representations described in Sections II-B–II-D aim at using long temporal time spans; however, they differ from each other in a number of implementation issues, like the use of the short-time power spectrum, the use of zero-mean filters, and the warping of the time axis. Those differences are summarized in Table I.

As Mandarin is a tonal language, those representations can be augmented with the smoothed log-pitch estimate obtained as described in [18] and with the value of the critical-band energy (19 features per frame). In the following, we will refer to them as Augmented features.

III. MLP ARCHITECTURES

The second direction along which the front-ends have evolved is the use of more complex architectures to overcome limitations of the three-layer MLP in different ways. Most of them are based on the combination of several MLP outputs trained using different input representations. This combination can happen in a parallel or hierarchical fashion. Again, no side-by-side comparisons of these architectures have been presented in the literature. The following paragraphs briefly describe these front-ends used for LVCSR systems.

A. Hidden Activation TRAPS (HATS)

HATS feature extraction is based on observations on human speech recognition [22], which conjecture that humans recognize speech independently in each critical band and that a final decision is obtained by recombining those estimates. HATS aims at using information extracted from long time spans of critical-band energies, which are fed into a set of independent classifiers instead of a single MLP classifier. At first, a 19 critical-band auditory spectrum is extracted from the short-time Fourier transform of the signal every 10 ms. After that, HATS [2] feature extraction is composed of two steps.

1) In the first stage, an independent MLP for each of the 19 critical bands is trained to classify phonemes. The input to each MLP is a 500-ms-long log critical-band energy trajectory (i.e., a 51-dimensional input). The input undergoes an utterance-level mean and variance normalization.

2) In the second stage, a merger MLP is trained using the hidden activations obtained from the 19 MLPs of the first stage. The merger classifier aims at obtaining a single phoneme posterior estimate out of the independent estimates coming from each critical band. Phoneme posteriors obtained from the merger MLP are then transformed and used as features.

The rationale behind this architecture is that corruption of particular critical bands should affect the final recognition results less.
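A compact PyTorch sketch of the two-stage architecture (our own illustration; the hidden-layer sizes are hypothetical, not those used in the paper):

```python
import torch
import torch.nn as nn

class HATS(nn.Module):
    """Two-stage HATS: one MLP per critical band, then a merger MLP."""

    def __init__(self, n_bands=19, traj_len=51, hidden=60, n_phones=72):
        super().__init__()
        # Stage 1: an independent net per band, fed 51 frames (500 ms)
        # of that band's normalized log energy trajectory.
        self.band_nets = nn.ModuleList([
            nn.Sequential(nn.Linear(traj_len, hidden), nn.Sigmoid())
            for _ in range(n_bands)])
        # Stage 2: merger over the concatenated hidden activations.
        self.merger = nn.Sequential(
            nn.Linear(n_bands * hidden, 1000), nn.Sigmoid(),
            nn.Linear(1000, n_phones))   # softmax is applied in the loss

    def forward(self, x):                # x: (batch, 19, 51)
        acts = [net(x[:, b]) for b, net in enumerate(self.band_nets)]
        return self.merger(torch.cat(acts, dim=1))
```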

B. Multi-Stream

The output of MLPs are posterior probabilities of phonetictargets that can be combined into a single estimate using proba-bilistic rules. This approach is typically referred as multi-streamand has been introduced in [14]. The rationale behind it consistsin the fact that MLPs trained using different input representa-tions will perform differently in multiple conditions. To takeadvantage of both representations, the combination rule shouldbe able to dynamically select the best posterior stream. Typicalcombination rules weight the posterior probabilities using afunction of the output entropy (see [23] and [24]). Posteriorsobtained from TANDEM-PLP (short signal time spans) andHATS (long signal time spans) are combined using the Demp-ster–Shafer method [24] and used as features after a log/PCAtransform. Multi-stream comes at the obvious cost of doublingthe total number of parameters in the system.
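As a simple stand-in for such entropy-based rules (not the Dempster–Shafer method actually used here), frame-wise inverse-entropy weighting can be sketched as follows; the more confident (lower-entropy) stream dominates the combination:

```python
import numpy as np

def inverse_entropy_combine(p1, p2, eps=1e-10):
    """Frame-wise combination of two posterior streams by inverse entropy.

    p1, p2: (n_frames, n_phones) posteriors from the two MLPs; the
    lower-entropy (more confident) stream gets the larger weight.
    """
    def inv_entropy(p):
        h = -(p * np.log(p + eps)).sum(axis=1, keepdims=True)
        return 1.0 / (h + eps)

    w1, w2 = inv_entropy(p1), inv_entropy(p2)
    combined = (w1 * p1 + w2 * p2) / (w1 + w2)
    return combined / combined.sum(axis=1, keepdims=True)
```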

C. Hierarchical Processing

While multi-stream approaches combine MLP outputs in parallel, studies on English and Mandarin data [15], [25] showed that the most effective way of combining classifiers trained on separate ranges of modulation frequencies, i.e., on different temporal spans, is based on hierarchical (sequential) processing. The hierarchical processing is based on the following steps.

MRASTA filters cover the whole range of modulation frequencies. The filter-banks G1 and G2 (six filters each) are split into two separate filter-banks, G1-High and G2-High and G1-Low and G2-Low, which filter fast and slow modulation frequencies, respectively.


Fig. 1. Proposed scheme for the MLP-based feature extraction as used in the GALE 2008 evaluation. The auditory spectrum is filtered with a set of multiple-resolution filters that extract fast modulation frequencies. The resulting vector is concatenated with short-term critical-band energy and pitch estimates and is used as input to the first MLP that estimates phoneme posterior distributions. The output of the first MLP is then concatenated with features obtained using slow modulation frequencies, short-term critical-band energy, and pitch estimates and is used as input to the second MLP.

G-High and G-Low are defined as follows:

G-High = {G1-High, G2-High}, with σ_i in the shorter half of the 8–60-ms range   (3)

G-Low = {G1-Low, G2-Low}, with σ_i in the longer half of the 8–60-ms range   (4)

Filters G1-High and G2-High are short filters, and they process high modulation frequencies. Filters G1-Low and G2-Low are long filters, and they process low modulation frequencies. The cutoff frequency for both filter-banks G-High and G-Low is approximately 10 Hz.

The output of the MRASTA filtering is processed according to a hierarchy of MLPs progressively moving from high to low modulation frequencies (i.e., from short to long temporal contexts). The rationale behind this processing is that the errors produced by the first MLP can be corrected by a second one using the estimates from the first MLP together with the evidence from another range of modulation frequencies.

The first MLP is trained on the first feature stream, represented by the output of the filter-banks G-High that extract high modulation frequencies. This MLP estimates the first set of phoneme posterior probabilities. These posteriors are modified according to a log/PCA transform and then concatenated with the second feature stream, thus forming the input to the second phoneme-posterior-estimating MLP. In such a way, phoneme estimates from the first MLP are modified by the second net using evidence from a different feature stream. This process is depicted in Fig. 1.
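A PyTorch sketch of this two-stage hierarchy (our reading of Fig. 1; the layer sizes are hypothetical, and a trainable linear projection stands in for the log/PCA transform):

```python
import torch
import torch.nn as nn

class HierarchicalMRASTA(nn.Module):
    """Two-stage hierarchy of Fig. 1: a G-High MLP feeds a G-Low MLP."""

    def __init__(self, d_high=216, d_low=216, d_proj=35,
                 n_phones=72, hidden=2000):
        super().__init__()
        self.mlp1 = nn.Sequential(nn.Linear(d_high, hidden), nn.Sigmoid(),
                                  nn.Linear(hidden, n_phones))
        # Trainable linear projection standing in for the log/PCA transform.
        self.proj = nn.Linear(n_phones, d_proj, bias=False)
        self.mlp2 = nn.Sequential(nn.Linear(d_proj + d_low, hidden),
                                  nn.Sigmoid(), nn.Linear(hidden, n_phones))

    def forward(self, x_high, x_low):
        post1 = torch.softmax(self.mlp1(x_high), dim=1)   # first estimates
        z = self.proj(torch.log(post1 + 1e-10))           # log, then project
        # The second MLP corrects the estimates with slow-modulation evidence.
        return self.mlp2(torch.cat([z, x_low], dim=1))
```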

D. Bottleneck Features

Bottleneck features are recently introduced non-probabilistic MLP features [16]. The conventional three-layer MLP is replaced with a four- or five-layer MLP where the first layer is the input features and the last layer is the phonetic targets. As discussed in [26], the five-layer architecture provides slightly better performance than the four-layer one. The size of the second layer is large, to provide enough modeling power; the size of the third one is small, typically equal to the desired feature dimension; while the size of the fourth one is approximately half the second layer [26]. Instead of using the output of the MLP, features are obtained from the linear activation of the third layer. Bottleneck features do not require a dimensionality reduction, as the desired dimension can be obtained by fixing the size of the bottleneck layer. Furthermore, the linear activations are already Gaussian distributed, so they do not require any log transform. The most common inputs to the non-probabilistic bottleneck features are long-term features such as the DCT-TRAPS and wLP-TRAPS described in Sections II-C and II-D.
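Such a five-layer bottleneck MLP can be sketched in PyTorch as follows (our own illustration; the hidden sizes are hypothetical, while the 35-dimensional bottleneck matches the feature dimension used in Section IV):

```python
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    """Five-layer bottleneck MLP; features come from the linear third layer."""

    def __init__(self, d_in=304, d_bn=35, n_phones=72, d_hidden=2000):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Sigmoid())
        self.bottleneck = nn.Linear(d_hidden, d_bn)   # small linear layer
        self.post = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(d_bn, d_hidden // 2), nn.Sigmoid(),  # ~half the 2nd layer
            nn.Linear(d_hidden // 2, n_phones))

    def forward(self, x):                 # used only for training
        return self.post(self.bottleneck(self.pre(x)))

    def features(self, x):                # linear activations = the features
        return self.bottleneck(self.pre(x))
```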

IV. SMALL SCALE EXPERIMENTS

The following preliminary experiments are based on the large-vocabulary ASR system for transcribing Mandarin broadcasts described in [9], developed by SRI/UW/ICSI for the GALE project. The recognition is performed using the SRI Decipher recognizer, and results are reported in terms of character error rate (CER). The training is done using approximately 100 hours of broadcast news and conversation data manually transcribed, including speaker labels. Results are reported on the GALE 2006 evaluation data, simply referred to as eval06 in the following.

The baseline system uses 13 standard MFCCs plus first- and second-order temporal derivatives. Vocal tract length normalization (VTLN) and speaker-level mean-variance normalization are applied. Mandarin is a tonal language; thus, the MFCC vector is augmented with the smoothed log-pitch estimate plus its first- and second-order temporal derivatives as described in [18], resulting in a feature vector of dimension 42. In the following, we will refer to this system simply as the MFCC baseline.

The training is based on conventional maximum-likelihood estimation. The acoustic models are composed of within-word triphone HMM models, and a 32-component diagonal-covariance GMM is used for modeling acoustic emission probabilities. Parameters are shared across different triphones according to a phonetic decision tree. Recognition networks are compiled from trigram language models trained on over one billion words, with a 60 K vocabulary lexicon [9]. The decoding phase consists of two decoding passes, a speaker-independent (SI) decoding followed by a speaker-adapted (SA) decoding.


TABLE II: BASELINE SYSTEM PERFORMANCE ON THE eval06 DATA

TABLE III: TANDEM-9FRAMESPLP PERFORMANCE ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES

Speaker adaptation is done using a one-class constrained maximum-likelihood linear regression (CMLLR) followed by a three-class MLLR. Performance of this baseline system on the eval06 data is reported in Table II for both speaker-independent (SI) and speaker-adapted (SA) models.

In this set of experiments, three-layer MLPs are trained on all the available 100-hour acoustic model training data. The Mandarin toneme set is composed of 72 elements. The training is done using the ICSI QuickNet software.³

A. MLP Features

This section discusses experiments with features obtained using three-layer MLP architectures with different input representations. Unless explicitly mentioned otherwise, the total number of parameters in the different MLP architectures is equalized to approximately one million in order to ensure a fair comparison between the different approaches. The size of the input layer equals the feature dimension, the size of the output layer equals the number of phonetic targets (72), and the size of the hidden layer is modified so that the total number of parameters equals one million. After PCA, a dimensionality reduction accounting for 95% of the total variability is applied. The resulting feature vectors have dimension 35 for all the different MLP features.
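The hidden-layer size that equalizes the parameter count follows directly from the layer dimensions; a small sketch of the arithmetic (the helper name is ours):

```python
def equalized_hidden_size(d_in, n_out=72, target_params=1_000_000):
    """Hidden-layer size giving roughly `target_params` weights in a
    three-layer MLP: h * (d_in + 1) + n_out * (h + 1) ~= target."""
    return target_params // (d_in + n_out + 1)

# E.g., the 378-dim TANDEM-PLP input gets ~2217 hidden units,
# while the 432-dim MRASTA input gets ~1980.
print(equalized_hidden_size(378), equalized_hidden_size(432))
```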

The investigation was carried out with MLP features as a stand-alone front-end and in concatenation with spectral features, i.e., MFCC. Results are reported in terms of character error rate (CER) on the eval06 data. Let us first consider the TANDEM-PLP features described in Section II-A. Performances of those features are reported in Table III, as well as the relative improvements with respect to the MFCC baseline with and without speaker adaptation. When used as stand-alone features, TANDEM-PLP does not outperform the baseline, whereas a relative improvement of 16% is obtained when they are used in concatenation with MFCC. After speaker adaptation, the relative improvement drops slightly by 2%, still a 14% relative improvement over the MFCC baseline.

Let us now consider the use of MLP features obtained using long time spans of the speech signal as described in Sections II-B–II-D. Table IV shows that these features perform quite poorly as stand-alone features, whereas they can provide improvements of around 10% relative in concatenation with the MFCC features.

³ http://www.icsi.berkeley.edu/Speech/qn.html

TABLE IV: MLP FEATURES MAKING USE OF LONG TIME SPANS OF THE SIGNAL AS INPUT. PERFORMANCE IS REPORTED ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES AS STAND-ALONE FEATURES AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES

TABLE V: MLP FEATURES MAKING USE OF LONG TIME SPANS OF THE SIGNAL AS INPUT, AUGMENTED WITH CRITICAL-BAND ENERGY AND LOG-PITCH. PERFORMANCE IS REPORTED ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES AS STAND-ALONE FEATURES AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES

As a stand-alone front-end, the wLP-TRAPS outperforms the other two, whereas in concatenation with spectral features and after adaptation the three representations are comparable. Their performances are, however, inferior to the conventional TANDEM-9framesPLP. The performances of these features augmented with the values of the critical-band energy (19 features per frame) and the smoothed log-pitch estimates are reported in Table V. Augmenting the long-term features produces consistent improvements in all cases and brings the performances of these front-ends to the same level as TANDEM-PLP when tested in concatenation with MFCC. As before, the relative improvements are always reduced after speaker adaptation. In concatenation with spectral features, the three input representations have similar performances.

In summary, MLP front-ends obtained using a three-layer MLP with different input representations do not outperform the conventional MFCC as stand-alone features. On the other hand, they produce relative improvements in the range of 10%–14% when used in concatenation with spectral features. The TANDEM-PLP front-end outperforms the long-term features. The various coding schemes, MRASTA, DCT-TRAPS, and wLP-TRAPS, give similarly poor results as stand-alone features and similar improvements (approximately 11%) when used in concatenation with spectral features. Augmenting the long-term input with a vector of short-term energy and pitch brings the performances close to those of the TANDEM-PLP features.

The relative improvements after speaker adaptation are generally reduced by 2% with respect to the speaker-independent systems. This is consistent with what has already been verified in English ASR experiments [27].


TABLE VI: HATS PERFORMANCE ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES

TABLE VII: MULTI-STREAM MLP FEATURE PERFORMANCE ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES

B. MLP Architectures

This section discusses experiments with the different MLP architectures; the input signal representations are similar to those used in the previous section, while the information is exploited differently when changing the MLP architectures. The results obtained using these methods are compared with their counterparts based on three-layer MLPs.

1) Hidden Activation TRAPS: HATS aims at using information extracted from long time spans of critical-band energies, but the recognition is done independently in each critical band using 19 independent MLPs. The final posterior estimates are obtained by merging all these estimates (see Section III-A). Results with HATS features are reported in Table VI. As stand-alone features, HATS performs significantly worse than MFCC, whereas a relative improvement is obtained when used in concatenation with MFCC. Comparing Tables IV and VI, it is noticeable that this approach is marginally better than those that feed long-term features into a single MLP.

2) Multi-Stream MLP Features: Table VII reports the performance of the multi-stream front-end that combines information from TANDEM-PLP (short time spans of signal) and HATS (long time spans of signal). These features outperform the MFCC by 10% relative when used stand-alone and by 16% relative in concatenation with MFCC.

Those numbers must be compared to the performances of the individual streams of TANDEM-PLP (Table III) and HATS (Table VI). The combination provides a large improvement in the case of stand-alone features (TANDEM-PLP 25.5%, HATS 29.1%, Multi-stream 23.1%); however, the improvements are smaller when used in concatenation with MFCC (TANDEM-PLP 22.1%, HATS 22.7%, Multi-stream 21.7%). This can be easily explained considering that, when used in concatenation with the MFCC, the feature vector contains the spectral information twice, through the MFCC and through the TANDEM features.

3) Hierarchical Processing: Next, we discuss experiments with the hierarchical processing described in Section III-C. Results are reported in Table VIII for both MRASTA and Augmented MRASTA inputs (the processing is depicted in Fig. 1).

TABLE VIII: HIERARCHICAL FEATURE PERFORMANCE ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES

TABLE IX: BOTTLENECK FEATURE PERFORMANCE ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES

TABLE X: AUGMENTED BOTTLENECK FEATURE PERFORMANCE ON THE eval06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES

Comparing Table VIII with Tables IV and V, it is noticeable that the hierarchical approach produces considerable improvements with respect to the single-classifier approach, both with and without MFCC features. It is important to notice that the total number of parameters is kept constant; thus, the improvements are produced by the sequential architecture, where short signal time spans are used first and then integrated with the longer ones.

4) Bottleneck Features: Tables IX and X report the performances of the bottleneck features obtained using different long-term inputs (MRASTA, DCT-TRAPS, and wLP-TRAPS) and their augmented versions. The dimension of the bottleneck is fixed to 35 in order to compare with the other probabilistic MLP features. Results reveal that bottleneck features always outperform their probabilistic counterparts obtained using the three-layer MLP. This is verified for all the different input features and their augmented versions.

For comparison purposes, Table XI also reports the performance of bottleneck features when the input to the MLP is the 9frames-PLP features augmented with pitch.


Fig. 2. RWTH evaluation system composed of two subsystems trained on MFCC and PLP features. The two subsystems consist of ML training followed by SAT/CMLLR training. The lattice outputs from the subsystems are combined in the end.

TABLE XI: BOTTLENECK FEATURE PERFORMANCE ON eval06 DATA WHEN THE 9frames-PLP AND PITCH INPUT IS USED. RESULTS ARE REPORTED WITH MLP FEATURES ALONE. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES

In summary, replacing the three-layer MLP with a more complex MLP structure (while keeping the total number of parameters constant) produces a reduction in the error both with and without concatenation of spectral features. The multi-stream approach that combines in parallel MLPs trained on long and short temporal speech features produces the lowest CER as a stand-alone front-end (16% relative CER reduction compared to the MFCC). On the other hand, hierarchical and bottleneck structures that go beyond the three layers appear to produce the highest complementarity to MFCC, producing an improvement of 17%–18% relative when used in concatenation. The reasons for these effects are investigated in the next section, where the front-ends are compared in terms of phonetic confusions.

C. Analysis of Results

In order to understand the differences between the various MLP front-ends, let us now analyze the errors they produce in terms of phonetic targets. Table XII reports the phonetic set composed of 72 tonemes used for training the MLP. The set is sub-divided into six broad phonetic classes for analysis purposes. The numbers beside the vowels represent the tonal accents. The frame-level accuracy of a three-layer MLP trained using 9frames-PLP features in classifying the phonetic targets is 69.8%. Fig. 3 plots the per-class accuracy. Let us now consider the accuracies of the three-layer MLPs trained using long-term input representations, i.e., MRASTA, DCT-TRAPS, and wLP-TRAPS. They are, respectively, 64%, 62.9%, and 65.2%, which are worse than the accuracy of the 9frames-PLP. The HATS features, which are based on long-term critical-band trajectories, have a similar frame-level accuracy, i.e., 65.7%.

While the overall performance of the MLP trained on spectral features is superior to MLPs trained on long time spans of the speech signal, the latter appear to perform better on some phonetic classes. Fig. 3 plots the accuracy of recognizing each of the phonetic classes for HATS. It is noticeable that, in spite of an overall inferior performance, the HATS outperforms the TANDEM-PLP on almost all the stop consonants "p," "t," "k," "b," "d," and the affricative "ch."

Fig. 3. Phonetic-class accuracy obtained by the TANDEM-9framesPLP and HATS. The former outperforms the latter on most of the classes, apart from stops and affricatives.

TABLE XII: PHONETIC SET USED TO TRAIN THE DIFFERENT MLPS, DIVIDED INTO BROAD PHONETIC CLASSES. AS MANDARIN IS A TONAL LANGUAGE, THE NUMBER BESIDE THE VOWELS DESIGNATES THE ACCENT OF THE DIFFERENT TONEMES

Stop consonants are short sounds characterized by a burst of acoustic energy following a short period of silence and are known to be prone to strong co-articulation from the following vowel. Studies like [28] have shown that stop-consonant recognition can be largely improved by considering information from the following vowel; this explains why using longer speech time spans produces higher recognition performance compared to conventional short-term spectral features. Also, the affricative "ch" (composed of a plosive and a fricative) is confused with the fricatives "zh" and "s" by the short-term features, while this confusion is significantly reduced by the long-term features. Vowels and other consonants are still better recognized by the short-term features.

Those facts are verified for all the MLP front-ends that use long temporal inputs (MRASTA, DCT-TRAPS, and wLP-TRAPS) as well as for HATS. In summary, training MLPs using short-term spectral input outperforms training using long-term temporal input on most of the phonetic classes, apart from a few of them, including the plosives and affricatives.

Let us now consider the multi-stream approach, which dynamically weights the posterior estimates from the 9frames-PLP and HATS according to the confidence of the MLP.


The frame accuracy becomes 73%, and the phoneme-level confusion shows that performances are never inferior to the best of the two streams that compose the combination. In other words, the combined distribution appears to perform as the HATS on the stop consonants and affricatives and as the 9frames-PLP on the remaining phonemes. Those results translate into a significant reduction in the CER, never worse than those obtained using the individual MLP features (see the experiments in Section IV-B2).

The hierarchical approach described in Section III-C is based on a completely different idea. This method uses an initial MLP trained using energy trajectories filtered with short temporal filters (G1-High and G2-High). This provides an initial posterior estimate, which is then fed into the second MLP concatenated with energy trajectories filtered with long temporal filters (G1-Low, G2-Low). The second MLP re-estimates the phonetic posteriors obtained from the first MLP using information from a longer temporal context. The hierarchical framework achieves a frame accuracy of 72%. Interestingly, this is done without using any spectral feature (MFCC or PLP) and keeping the number of parameters constant; only the architecture of the MLP is changed, where the temporal context of the input features is increased sequentially. In other words, the first MLP, trained on a short temporal context, is effective on most of the phonetic classes apart from stops and affricatives. Those estimates are then corrected by the second MLP using the information from a longer temporal context. Fig. 4 plots the phonetic-class accuracy obtained by the three-layer MLP trained using the MRASTA input and by the hierarchical approach. It is noticeable that the hierarchical approach outperforms training using the MRASTA on all the targets. Recognition results show that the hierarchical approach (where the processing moves from short to long temporal contexts) reduces the CER with respect to the single MLP features (where the different time spans are processed using the same MLP). Augmenting the input with pitch estimates and energy further reduces the CER.

Another interesting finding is that, as stand-alone features, the multi-stream approach has the lowest CER, while in concatenation with MFCC the augmented hierarchical approach produces the largest CER reduction (compare Tables VIII and VII). This effect can be explained by the fact that the multi-stream approach makes use of spectral information (through the 9frames-PLP). This information produces a frame accuracy of 73% but does not appear complementary to the MFCC features, as they both represent spectral information. On the other hand, the hierarchical approach achieves a frame accuracy of 72% without the use of any spectral features and appears more complementary when used in concatenation with the MFCC.

Results from the bottleneck features cannot be analyzed in a similar way, as these are non-probabilistic features without any explicit mapping to a phonetic target. However, recognition results in Tables IX and X show that replacing the three-layer MLP with the bottleneck architectures reduces the CER for all the different input representations (MRASTA, DCT-TRAPS, wLP-TRAPS). Bottleneck and hierarchical approaches produce similar improvements in concatenation with MFCC features.

Fig. 4. Phonetic-class accuracy obtained by the MRASTA and the hierarchical MRASTA. The latter improves the performance on all the phonetic targets without the use of any spectral information.

TABLE XIII: ACOUSTIC DATA FOR TRAINING AND TESTING

TABLE XIV: PERFORMANCES OF BASELINE SYSTEMS USING MFCC OR PLP FEATURES

V. LARGE SCALE EXPERIMENTS

Contrastive experiments in the literature are typically reported with small setups like the one presented so far. However, the GALE evaluation systems are trained on a much larger amount of data, make use of multipass training, and are composed of a number of individual sub-systems. In order to study how the previous results generalize to more complex LVCSR systems and a large amount of training data, the experiments are extended using a highly accurate automatic speech recognizer for continuous Mandarin speech trained on 1600 hours of data collected by LDC (GALE releases P1R1-4, P2R1-2, P3R1-2, P4R1). The training transcripts were preprocessed, and the audio data were segmented into waveforms based on sentence boundaries defined in the manual transcripts. Both were provided by UW-SRI as described in [9].

This comparison will cover the multi-stream approach and the hierarchical MRASTA front-end, which will simply be referred to as MLP1 and MLP2 in the remainder of this paper. These two features have been used in the GALE 2008 Mandarin evaluation. The 1600 hours of data are used for training the HMM/GMM systems as well as the MLP front-ends. The evaluation is done on the GALE 2007 development test set (dev07), which is used for tuning hyper-parameters, the GALE 2008 development test set (dev08), and the sequestered data of the GALE 2007 evaluation (eval07-seq), for a total amount of 5 hours of data. Statistics of the different test sets are summarized in Table XIII.


TABLE XV: SUMMARY OF FEATURE PERFORMANCES ON THE GALE dev07/dev08/seq-eval07 TEST SETS. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC OR PLP. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE MFCC AND PLP BASELINES IS REPORTED IN PARENTHESES

The number of parameters in the MLP architectures is increased to five million for the large-scale setup. The training of the MLP1 and MLP2 networks took approximately five weeks on an eight-core machine (AMD Opteron Dual Core 2192 MHz, 2 × 4-core CPUs). The MLP1 networks were trained at ICSI and the MLP2 networks at IDIAP. On the other hand, the generation of the features is quite fast, approximately 0.09xRT on a single CPU.

The RWTH evaluation system is composed of two subsystems which differ only in their acoustic front-ends. The acoustic front-ends of the subsystems consist of conventional MFCCs and PLPs augmented with the log-pitch estimates [18]. The filter banks underlying the MFCC and PLP feature extraction undergo VTLN. After that, the features are mean and variance normalized, and they are fed into a sliding window of length nine. All feature vectors within the sliding window are concatenated and projected to a 45-dimensional feature space using linear discriminant analysis (LDA). The system uses a word-based pronunciation dictionary described in [9] that maps words to phoneme sequences, where each phoneme carries the tone information and is usually referred to as a toneme. The acoustic models for all systems are based on triphones with cross-word context, modelled by a three-state left-to-right HMM. A decision-tree-based state tying is applied, resulting in a total of 4500 generalized triphone states. The acoustic models consist of Gaussian mixture distributions with a globally pooled diagonal covariance matrix.

The first pass consists of maximum-likelihood training. We will refer to this system as the SI system. The second pass consists of speaker adaptive training (SAT). Furthermore, during decoding, maximum likelihood linear regression is applied to the means for performing speaker adaptation. We will refer to this system as the SA system. Finally, the outputs of the different subsystems are combined at the lattice level using the min.fWER combination method described in [29]. The min.fWER method has been shown to outperform other lattice combination methods such as ROVER or Confusion Network Combination (CNC) [29]. Fig. 2 schematically depicts the RWTH evaluation system.

The language model (LM) used in this work was kindly provided by SRI and UW. The vocabulary size is 60 K.

TABLE XVI: SYSTEM COMBINATION OF THE MFCC AND PLP SUBSYSTEMS (DESIGNATED WITH ⊕). THE RELATIVE IMPROVEMENT WITH RESPECT TO THE MFCC ⊕ PLP BASELINE IS REPORTED IN PARENTHESES

Experimental results with the full LM are reported only for the system combination, while a pruned version is applied in all other recognition steps.

Table XIV reports the CER for the speaker-independent and the speaker-adapted subsystems trained using MFCC and PLP features only. The error rate is in the range of 12.5%–14.5% for the different test sets.

Let us now consider the integration of the MLP1 and MLP2 front-ends. Table XV reports the performance of the subsystems when they are trained using MLP1 and MLP2 features only and when MFCC and PLP are concatenated with MLP1 and MLP2. The results show trends similar to the 100-hour system. In other words, the MLP feature performance scales with the amount of training data. In particular, the MLP1 and MLP2 front-ends outperform the spectral features and produce a relative improvement in the range of 15%–25% when used in concatenation with MFCC or PLP, reducing the CER to the range 10.1%–12.2% for the different datasets. The improvements are verified on all three test sets. The relative improvements after SAT are generally reduced with respect to the speaker-independent system. After SAT, the MLP2 features (based on a hierarchical approach) yield the best performance in concatenation with both MFCC and PLP.

The lattice combination results of the MFCC and PLP sub-systems are reported in Table XVI (first row). For investigation purposes, corresponding sub-systems trained using the MLP1 and MLP2 front-ends are combined in the same way, and their performance is reported in Table XVI (second row). Their performance is superior to the MFCC/PLP system by 9%–14% relative, showing that the improvements hold after the lattice-level combination.

In order to increase the complementarity of the sub-systems, features MLP1 and MLP2 were then concatenated with PLP and MFCC, respectively.


TABLE XVII: EFFECT OF DISCRIMINATIVE TRAINING ON THE DIFFERENT SUBSYSTEMS AND THEIR COMBINATION (DESIGNATED WITH ⊕)

The performance of the lattice-level combination of those two sub-systems is reported in Table XVI (third row). The results show that using the two MLP front-ends in concatenation with MFCC/PLP features produces an additional relative improvement, resulting in the range of 18%–23% after system combination.

For the GALE 2008 evaluation, discriminative training was further applied to the two subsystems before the lattice-level combination. Discriminative training is based on a modified minimum phone error (MPE) criterion described in [30]. Table XVII reports the CER obtained after discriminative training. Results are reported for the PLP+MLP1 system, the MFCC+MLP2 system, and their lattice-level combination.

In all three cases, discriminative training reduced the CER in the range of 6%–13% relative, showing that it is also effective when used together with different MLP front-ends. For computational reasons, fully contrastive results with and without discriminative training are not available for the 1600-hour system.

This system, including the two most recent MLP-based front-ends, proved very competitive with current Mandarin LVCSR systems evaluated on the same test sets [31], [32].

VI. DISCUSSION AND CONCLUSION

During the GALE evaluation campaigns, several MLP-based front-ends have been used in different LVCSR systems, although no exhaustive and systematic study of their performances has been reported in the literature. Without such a comparison, it is not possible to verify which of the modifications to the original MLP features produced improvements in the final system.

This correspondence describes and compares in a systematic manner all the MLP front-ends developed recently at multiple sites and used during the GALE project for Mandarin transcription. The initial investigation is carried out on a small-scale experimental setup (100 hours) and investigates the two directions along which the MLP features have recently evolved: the use of different inputs to the conventional three-layer MLP and the use of complex MLP architectures. The experimentation is done both using MLP front-ends as stand-alone features and in concatenation with MFCC.

Three-layer MLPs are trained using conventional spectral features (9frames-PLP) and features extracted from long time spans of the signal (MRASTA, DCT-TRAPS, wLP-TRAPS, and their augmented versions). Results reveal that, as stand-alone features, none of them outperforms the conventional MFCC features. The performances of the MLPs trained on long time spans of the speech signal (MRASTA, DCT-TRAPS, wLP-TRAPS) are quite poor compared to those obtained from training on short-term spectral features (9frames-PLP). The latter is superior on most of the phonetic targets, apart from a few phonetic classes like plosives and affricatives.

Features based on the three-layer MLP produce relative improvements in the range of 10%–14% when used in concatenation with the MFCC. Even when their performances are poor as stand-alone front-ends, they always appear to provide complementary information to the MFCC.

After concatenation with MFCC, the various representations (MRASTA, DCT-TRAPS, wLP-TRAPS) produce comparable performances.

Over time, several alternative architectures have been proposed to replace the three-layer MLP, with different motivations. This work experiments with the multi-stream, hierarchical, and bottleneck approaches. Results using those architectures reveal the following novel findings.

• The Multi-stream framework that combines MLPs trained on long and short time spans outperforms the MFCC by approximately 10% relative as a stand-alone feature. Furthermore, it reduces the CER by 16% relative in concatenation with MFCC (one possible realization of this stream combination is sketched after this list).

• The Hierarchical approach that sequentially increases the time context through a hierarchy of MLPs outperforms the MFCC by approximately 6% relative as a stand-alone feature and reduces the CER by 18% relative in concatenation with MFCC. Results obtained using the Bottleneck approach (five-layer MLP) show a similar trend.

• The MLP front-end that provides the lowest CER as a stand-alone feature is different from the front-end that provides the highest complementarity to spectral features. This effect is discussed in Section IV-C.

• MLPs trained using long time spans of the signal at the input become effective only when coupled with architectures that go beyond the three-layer structure, i.e., hierarchies or bottlenecks.
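The multi-stream combination referenced in the first bullet can be realized in several ways; the sketch below shows inverse-entropy weighting in the spirit of [23], one plausible rule, with hypothetical names. The systems in this paper may combine streams differently (e.g., with the Dempster-Shafer rule of [24]).

    import numpy as np

    def inverse_entropy_combination(streams, eps=1e-10):
        """Hypothetical sketch: merge per-frame phone posteriors from several MLPs.

        streams: list of (frames x classes) posterior matrices, e.g., one MLP
        trained on short-term spectra and one on long temporal patterns.
        """
        weights = []
        for post in streams:
            # Per-frame entropy: low entropy means a confident stream on that frame.
            entropy = -np.sum(post * np.log(post + eps), axis=1)
            weights.append(1.0 / (entropy + eps))
        w = np.stack(weights)                 # (streams x frames)
        w = w / w.sum(axis=0, keepdims=True)  # normalize weights per frame
        # Weighted sum of the streams, frame by frame.
        return sum(wi[:, None] * post for wi, post in zip(w, streams))

The intuition is that a stream whose posterior distribution is sharply peaked on a frame is likely to be reliable there, so it receives a larger weight in the combined posterior.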

In summary, the most recent improvements are obtained through the use of architectures that go beyond the three-layer MLP rather than through the various input representations.

These results have been obtained by training the HMM/GMM and MLPs on 100 hours of speech data and testing in a simple LVCSR system. Evaluation systems are typically trained on a much larger amount of data, make use of multipass training, and are composed of a number of individual sub-systems that are combined together to provide the final recognition output. In this paper, MLP features are investigated with a large amount of training data as well as in a state-of-the-art multipass system. The improvements from the small-scale study hold for the large amount of training data on speaker-independent and speaker-adapted systems and after the lattice level combination. This is verified in concatenation with both MFCC and PLP features. When MLP features are used together with spectral features, the gain after lattice combination is in the range of 19%–23% relative for the 5-hour evaluation data sets. The comprehensive contrastive experiment on a multipass evaluation system shows that the improvements obtained on a small setup scale with the amount of training data and the parametric complexity of the system.

To the best of our knowledge, this is the most extensive study on MLP features for Mandarin LVCSR, covering all the front-ends including the most recent ones used in the 2008 GALE evaluation systems. The final evaluation system proved highly competitive with current Mandarin LVCSR systems evaluated on the same test sets [31], [32].


ACKNOWLEDGMENT

The authors would like to thank the colleagues involved in the GALE project and Dr. P. Fousek for their help.

REFERENCES

[1] H. Hermansky et al., "Connectionist feature extraction for conventional HMM systems," in Proc. ICASSP, 2000, pp. 1635–1638.
[2] B. Chen et al., "Learning discriminative temporal patterns in speech: Development of novel TRAPS-like classifiers," in Proc. Eurospeech, 2003, pp. 853–856.
[3] N. Morgan et al., "TRAPping conversational speech: Extending TRAP/Tandem approaches to conversational telephone speech recognition," in Proc. ICASSP, 2004, pp. 537–540.
[4] H. Hermansky and P. Fousek, "Multi-resolution RASTA filtering for TANDEM-based ASR," in Proc. Interspeech, 2005, pp. 361–364.
[5] P. Fousek, L. Lamel, and J.-L. Gauvain, "Transcribing broadcast data using MLP features," in Proc. Interspeech, 2008, pp. 1433–1436.
[6] D. Ellis et al., "Tandem acoustic modeling in large-vocabulary recognition," in Proc. ICASSP, 2001, pp. 517–520.
[7] N. Morgan et al., "Pushing the envelope—aside," IEEE Signal Process. Mag., vol. 22, no. 5, pp. 81–88, Sep. 2005.
[8] D. Vergyri et al., "Development of the SRI/Nightingale Arabic ASR system," in Proc. Interspeech, 2008, pp. 1437–1440.
[9] M.-Y. Hwang et al., "Building a highly accurate Mandarin speech recognizer with language-independent technologies and language-dependent modules," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 7, pp. 1253–1262, Sep. 2009.
[10] C. Plahl et al., "Development of the GALE 2008 Mandarin LVCSR system," in Proc. Interspeech, 2009, pp. 2107–2110.
[11] J. Park et al., "Efficient generation and use of MLP features for Arabic speech recognition," in Proc. Interspeech, Brighton, U.K., Sep. 2009, pp. 236–239.
[12] P. Schwarz, P. Matejka, and J. Cernocky, "Extraction of features for automatic recognition of speech based on spectral dynamics," in Proc. TSD, Brno, Czech Republic, Sep. 2004, pp. 465–472.
[13] P. Fousek, "Extraction of features for automatic recognition of speech based on spectral dynamics," Ph.D. dissertation, Faculty of Elect. Eng., Czech Technical Univ., Prague, Czech Republic, 2007.
[14] H. Hermansky et al., "Towards ASR on partially corrupted speech," in Proc. ICSLP, 1996, pp. 462–465.
[15] F. Valente and H. Hermansky, "Hierarchical and parallel processing of modulation spectrum for ASR applications," in Proc. ICASSP, 2008, pp. 4165–4168.
[16] F. Grezl et al., "Probabilistic and bottle-neck features for LVCSR of meetings," in Proc. ICASSP, Honolulu, HI, 2007, pp. 757–760.
[17] H. Bourlard and N. Morgan, Connectionist Speech Recognition—A Hybrid Approach. Norwell, MA: Kluwer, 1994.
[18] X. Lei et al., "Improved tone modeling for Mandarin broadcast news speech recognition," in Proc. Interspeech, 2006, pp. 1237–1240.
[19] H. Hermansky and S. Sharma, "Temporal patterns (TRAPS) in ASR of noisy speech," in Proc. ICASSP, Phoenix, AZ, 1999, pp. 289–292.
[20] T. Dau et al., "Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers," J. Acoust. Soc. Amer., vol. 102, pp. 2892–2905, 1997.
[21] M. Athineos, H. Hermansky, and D. P. W. Ellis, "LP-TRAP: Linear predictive temporal patterns," in Proc. ICSLP, 2004, pp. 1154–1157.
[22] J. B. Allen, Articulation and Intelligibility. San Rafael, CA: Morgan & Claypool, 2005.
[23] H. Misra, H. Bourlard, and V. Tyagi, "Entropy-based multi-stream combination," in Proc. ICASSP, 2003, pp. 741–744.
[24] F. Valente and H. Hermansky, "Combination of acoustic classifiers based on Dempster-Shafer theory of evidence," in Proc. ICASSP, 2007, pp. 1129–1132.
[25] F. Valente et al., "Hierarchical modulation spectrum for the GALE project," in Proc. Interspeech, 2009, pp. 2963–2967.
[26] F. Grezl and P. Fousek, "Optimizing bottleneck features for LVCSR," in Proc. ICASSP, Las Vegas, NV, 2008, pp. 4729–4732.
[27] Q. Zhu et al., "On using MLP features in LVCSR," in Proc. ICSLP, 2004, pp. 921–924.
[28] A. Suchato, "Classification of stop place of articulation," Ph.D. dissertation, Mass. Inst. of Technol., Cambridge, MA, 2004.
[29] B. Hoffmeister et al., "Frame based system combination and a comparison with weighted ROVER and CNC," in Proc. Interspeech, Pittsburgh, PA, Sep. 2006, pp. 537–540.
[30] G. Heigold et al., "Margin-based discriminative training for string recognition," IEEE J. Sel. Topics Signal Process., vol. 4, no. 6, pp. 917–925, Dec. 2010.
[31] S. M. Chu et al., "Recent advances in the GALE Mandarin transcription system," in Proc. ICASSP, Las Vegas, NV, Apr. 2008, pp. 4329–4333.
[32] T. Ng et al., "Progress in the BBN Mandarin speech to text system," in Proc. ICASSP, Las Vegas, NV, Apr. 2008, pp. 1537–1540.

Fabio Valente (M’05) received the M.Sc. degree(summa cum laude) in communication systems fromPolitecnico di Torino, Turin, Italy, in 2001 and theM.Sc. degree in image processing and the Ph.D.degree in signal processing from the University ofNice, Sophia Antipolis, France, in 2002 and 2005,respectively. His Ph.D. work was on variationalBayesian methods for speaker diarization done at theInstitut Eurecom, France.

In 2001, he worked for the Motorola HIL (Human Interface Lab), Palo Alto, CA. Since 2006, he has been with the Idiap Research Institute, Martigny, Switzerland, involved in several E.U. and U.S. projects on speech and audio processing. His main interests are in machine learning and speech recognition. He is an author/coauthor of several papers in international conferences and journals, with contributions in feature extraction and selection for speech recognition, multi-stream ASR, and Bayesian statistics for speaker diarization.

Mathew Magimai Doss (S'03–M'05) received the B.E. degree in instrumentation and control engineering from the University of Madras, Chennai, India, in 1996, the M.S. degree by research in computer science and engineering from the Indian Institute of Technology, Madras, India, in 1999, and the PreDoctoral diploma and the Docteur ès Sciences (Ph.D.) degree from the École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, in 2000 and 2005, respectively.

From April 2006 to March 2007, he was a Postdoctoral Fellow at the International Computer Science Institute, Berkeley, CA. Since April 2007, he has been working as a Research Scientist at the Idiap Research Institute, Martigny, Switzerland. His research interests include speech processing, automatic speech and speaker recognition, statistical pattern recognition, and artificial neural networks.

Christian Plahl received the diploma degree in computer science from the University of Bielefeld, Bielefeld, Germany, in 2005. He is currently pursuing the Ph.D. degree in the Computer Science Department, RWTH Aachen University, Aachen, Germany.

His research interests cover speech recognition, discriminative training, and signal analysis.

Suman Ravuri is currently pursuing the Ph.D. degree in the Electrical Engineering and Computer Sciences Department, University of California, Berkeley.

He is with the International Computer Science Institute (ICSI), Berkeley, CA, working on automatic speech recognition.


Wen Wang (M’98) received the B.S. degree in elec-trical engineering and the M.S. degree in computerengineering from Shanghai Jiao Tong University,Shanghai, China, in 1996 and 1998, respectively,and the Ph.D. degree in computer engineering fromPurdue University, West Lafayette, IN, in 2003.

She is currently a Research Engineer in the Speech Technology and Research Laboratory, SRI International, Menlo Park, CA. Her research interests are in statistical language modeling, speech recognition, machine translation, natural language processing and understanding, and machine learning. She has authored or coauthored over 50 research papers and served as a reviewer for over 10 journals and conferences.

She is a member of the Association for Computational Linguistics.