
Hierarchical and Parallel Processing of Auditory and Modulation Frequencies for Automatic Speech Recognition

Fabio Valente

IDIAP Research Institute, CH-1920 Martigny, Switzerland
[email protected]

Abstract

This paper investigates, from an automatic speech recognition perspective, the most effective way of combining Multi Layer Perceptron (MLP) classifiers trained on different ranges of auditory and modulation frequencies. Two different combination schemes based on MLPs are considered. The first one operates in parallel fashion and is invariant to the order in which feature streams are introduced. The second one operates in hierarchical fashion and is sensitive to the order in which feature streams are introduced. The study is carried out on a Large Vocabulary Continuous Speech Recognition system for the transcription of meeting data using the TANDEM approach. Results reveal that 1) the combination of MLPs trained on different ranges of auditory frequencies is more effective if performed in parallel fashion; 2) the combination of MLPs trained on different ranges of modulation frequencies is more effective if performed in hierarchical fashion moving from high to low modulations; 3) the improvement obtained from separate processing of two modulation frequency ranges (12% relative WER reduction w.r.t. the single classifier approach) is considerably larger than the improvement obtained from separate processing of two auditory frequency ranges (4% relative WER reduction w.r.t. the single classifier approach). Similar results are also verified on other LVCSR systems and on other languages. Furthermore, the paper extends the discussion to the combination of classifiers trained on separate auditory-modulation frequency channels, showing that the previous conclusions also hold in this scenario.

Key words: Automatic speech recognition (ASR), TANDEM features, Multi Layer Perceptron (MLP), auditory and modulation frequencies.

1. Introduction

Typical Automatic Speech Recognition (ASR) features are obtained through the short-term spectrum of 30 ms segments of the speech signal. This representation extracts the instantaneous frequency components of the signal. The power spectrum is then integrated using a bank of filters equally spaced on an auditory scale (e.g. the Bark scale), thus obtaining the auditory spectrum.

Studies on the recognition of nonsense syllables [1] have shown that humans process speech separately in different auditory frequency channels (known as articulatory bands) and classify a speech sound by merging estimates from the different bands. Later, Allen [2],[3] interpreted this as meaning that the recognition of speech in each articulatory band is done independently and that a correct decision is obtained if the sound is correctly recognized in at least one of the bands. The size of each articulatory band spans approximately two critical bands [3].

Those observations have inspired automatic speech recognition approaches referred to as multi-band ASR. Multi-band ASR [4],[5] uses a set of independent classifiers (e.g. Multi Layer Perceptrons (MLP) or Hidden Markov Models (HMM)) trained on different parts of the auditory spectrum in order to discriminate between phonetic targets. The classifier outputs are then combined to obtain a final decision on the phonetic targets. Typical combination frameworks include both merger classifiers (another MLP [6],[7]) and rule-based combinations (e.g. Inverse Entropy [8] or Dempster-Shafer combination [9]).


Multi-band speech recognition was originally introduced for dealing with noise, the rationale being that if noise affects a particular auditory band, correct phonetic recognition can still be obtained using information coming from the uncorrupted bands.

Later, the multi-band paradigm was generalized into the multi-stream paradigm, where independent classifiers are trained on different representations of the speech signal, including conventional spectral features (PLP) [10], long-time critical band energy trajectories [11],[12] and spectro-temporal modulations [13],[14],[15].

Several multi-stream systems make use of features based on long time windows of the speech signal (e.g. [11],[14],[13],[15],[12],[6]). Conventional Short Term Fourier Transform features do not provide information on the speech dynamics. Those are generally introduced using temporal differentials of the spectral trajectory (also known as delta features) or by processing long segments of spectral energy trajectories, i.e. the modulation spectrum [16]. Several studies have been carried out to evaluate the importance of the different parts of the modulation spectrum for ASR applications [17], and robustness techniques like RASTA filtering are based on emphasizing the modulation spectrum frequencies that are most important for speech recognition [18].

This study is motivated by two main arguments:

• Current multi-band/multi-stream approaches operate in two separate steps: in the first step, a set of independent classifiers (e.g. MLPs) is trained to discriminate between phonetic targets, i.e. to estimate phoneme posterior probabilities; in a second step, all the individual estimates are combined into a single phoneme posterior estimate. The combination happens in parallel fashion, i.e. it is invariant to the order in which the different features are introduced. Alternative methods for combining information based on hierarchies of classifiers have been proposed in the literature [19],[20],[21] and have shown results competitive with the parallel scheme. In contrast to the parallel scheme, hierarchical combinations are sequential, i.e. they assume an ordering in the processing.

• Although proven effective in several small and large vocabulary ASR tasks, the parallel combination scheme is motivated by observations made on the auditory spectrum of the speech signal. Speech temporal modulations represent the dynamics of the signal and are extracted using different time scales. No specific studies have been carried out on the optimal way of combining classifiers trained on this type of information.

This paper aims at investigating the combination of classifiers trained on different ranges of auditory frequencies (as in conventional multi-band approaches) and modulation frequencies. In particular, we study from an ASR perspective whether the combination of information obtained from auditory and modulation frequency channels is more effective in parallel (as in conventional multi-band) or hierarchical (sequential) fashion. In contrast to related works, this study is carried out on a Large Vocabulary Continuous Speech Recognition task using the TANDEM approach.

The paper is organized as follows: section 2 describes the pre-processing techniques that extract different ranges of auditory and modulation frequencies and the joint auditory-modulation channels. We limit the investigation to two auditory channels and two modulation channels, thus four joint auditory-modulation channels, in order to simplify the setup. Section 3 describes two different combination schemes (parallel and hierarchical) based on Multi Layer Perceptron (MLP) classifiers, and section 4 presents the experimental framework based on an LVCSR system for the transcription of meeting data. The combination of classifiers trained on different ranges of auditory frequencies is investigated in section 5, and the combination of classifiers trained on different ranges of modulation frequencies is investigated in section 6. Joint auditory-modulation frequency processing is then presented in section 7, and section 8 describes results on Single Distant Microphone (SDM) data. Section 9 describes the application of those features in other LVCSR systems, and finally section 10 concludes the paper, discussing the results and presenting future directions.

2. Time-frequency processing

This section presents the processing used for extracting evidence from different auditory-modulation frequency sub-bands. Feature extraction is composed of the following parts: the critical band auditory spectrum is extracted from the Short Time Fourier Transform of the signal every 10 ms. In the following study, the power spectrum is integrated using a bank of filters equally spaced on a Bark scale; 15 critical bands are used. This step is common to several conventional feature extraction methods used in ASR, e.g. PLP features.
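As an illustration of this first stage, the following minimal sketch integrates a power spectrum with triangular filters equally spaced on a Bark axis. It is a sketch under stated assumptions: the Bark formula, the triangular filter shape, the sampling rate and all function names are illustrative and not taken from the exact PLP/AMI front-end.

```python
# Illustrative critical-band integration of a power spectrum (a sketch, not the
# exact PLP front-end): triangular filters equally spaced on a Bark axis.
import numpy as np

def hz_to_bark(f):
    # A common Bark approximation; the paper does not state which formula it uses.
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def critical_band_spectrum(power_spec, sample_rate=16000, n_bands=15):
    n_bins = power_spec.shape[-1]
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    bark = hz_to_bark(freqs)
    centers = np.linspace(bark.min(), bark.max(), n_bands + 2)  # equally spaced on Bark
    bands = np.zeros(power_spec.shape[:-1] + (n_bands,))
    for i in range(n_bands):
        lo, c, hi = centers[i], centers[i + 1], centers[i + 2]
        rising = np.clip((bark - lo) / (c - lo), 0.0, 1.0)
        falling = np.clip((hi - bark) / (hi - c), 0.0, 1.0)
        w = np.minimum(rising, falling)          # triangular weight on the Bark axis
        bands[..., i] = power_spec @ w           # integrate the power spectrum
    return bands
```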

Different ranges of modulation frequencies are extracted using MRASTA filtering (see [14] for details). MRASTA is an extension of RASTA filtering and extracts different modulation frequencies using a set of multiple-resolution filters.

A one-second long temporal trajectory in each critical band is filtered with a bank of band-pass filters. Those filters represent first derivatives G1 = [g_{1,\sigma_i}] (equation 1) and second derivatives G2 = [g_{2,\sigma_i}] (equation 2) of Gaussian functions with variance \sigma_i varying in the range 8-60 ms (see figure 1). In effect, the MRASTA filters are multi-resolution band-pass filters on modulation frequency, dividing the available modulation frequency range into individual sub-bands¹.

g_{1,\sigma_i}(x) \propto -\frac{x}{\sigma_i^2} \exp\left(-\frac{x^2}{2\sigma_i^2}\right)    (1)

g_{2,\sigma_i}(x) \propto \left(\frac{x^2}{\sigma_i^4} - \frac{1}{\sigma_i^2}\right) \exp\left(-\frac{x^2}{2\sigma_i^2}\right)    (2)

with \sigma_i = \{0.8, 1.2, 1.8, 2.7, 4, 6\}.

In the modulation frequency domain, they correspond to a filter-bank with filters equally spaced on a logarithmic scale (see figure 2). Identical filters are used for all critical bands. Thus, they provide a multiple-resolution representation of the time-frequency plane.
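The following sketch builds the two filter banks of equations (1)-(2) in NumPy. Only the σ values come from the text; the roughly one-second support at a 10 ms frame step and the peak normalization are illustrative assumptions.

```python
# Sketch of the Gaussian-derivative filter banks of equations (1)-(2).
import numpy as np

SIGMAS = np.array([0.8, 1.2, 1.8, 2.7, 4.0, 6.0])   # in frames of 10 ms

def mrasta_filters(sigmas=SIGMAS, half_len=50):
    t = np.arange(-half_len, half_len + 1).astype(float)   # ~1 s trajectory support
    g1, g2 = [], []
    for s in sigmas:
        env = np.exp(-t**2 / (2.0 * s**2))
        d1 = -(t / s**2) * env                      # first derivative of a Gaussian
        d2 = (t**2 / s**4 - 1.0 / s**2) * env       # second derivative of a Gaussian
        g1.append(d1 / np.max(np.abs(d1)))          # peak-normalize (assumption)
        g2.append(d2 / np.max(np.abs(d2)))
    return np.array(g1), np.array(g2)

G1, G2 = mrasta_filters()   # shapes: (6, 101) each
```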

MRASTA filtering is consistent with studies in [22],[23], where the human perception of modulation frequencies is modeled using a bank of filters equally spaced on a logarithmic scale. This bank of filters subdivides the available modulation frequency range into separate channels with decreasing frequency resolution moving from slow to fast modulations. After MRASTA filtering, the total number of features is 15 × 6 + 15 × 6 = 180. Then frequency derivatives across three consecutive critical bands are introduced [14]. The representation considers all the possible auditory and modulation frequency ranges of the speech signal.

Let us now divide this available information into different auditory and modulation frequency channels. In order to obtain different auditory frequency sub-bands, the available 15 critical bands are split into two ranges of 7 and 8 critical bands, referred to as F-Low and F-High respectively. The investigation is limited to two parts in order to simplify the setup.

Filter-banks G1 and G2 cover the whole range of modulation frequencies. We are interested in processing separately different parts of the modulation spectrum, and again we limit the investigation to two parts in order to simplify the setup. Similarly to what is done with the auditory filter-banks, the filter-banks G1 and G2 (6 filters each) are split into two separate filter banks, G1-Low/G2-Low and G1-High/G2-High, that filter respectively low and high modulation frequencies. We define G-High and G-Low as follows:

G-High = [G1-High, G2-High] = [g_{1,\sigma_i}, g_{2,\sigma_i}] with \sigma_i = \{0.8, 1.2, 1.8\}    (3)

G-Low = [G1-Low, G2-Low] = [g_{1,\sigma_i}, g_{2,\sigma_i}] with \sigma_i = \{2.7, 4, 6\}    (4)

Filters G1-High and G2-High are short filters (figure 1, continuous lines) and they process high modulation frequencies (figure 2, continuous lines). Filters G1-Low and G2-Low are long filters (figure 1, dashed lines) and they process low modulation frequencies (figure 2, dashed lines). The effect of this filtering is depicted in figure 3. The left picture plots the auditory spectrum of a speech signal, while figure 3 (center and right) plots the auditory spectrum after filtering with g_{1,1.2} and g_{1,6}. We can notice that high modulations represent the original auditory spectrum in more detail, while low modulations give a coarse representation of the spectrum. The cutoff frequency between the two filter-banks G-High and G-Low is approximately 10 Hz.
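A minimal, self-contained sketch of this split (equations 3-4) applied to a single critical-band energy trajectory follows; the filter length, the 'same' convolution mode and the dummy input are illustrative assumptions.

```python
# Sketch of the G-High / G-Low split applied to one critical-band trajectory.
import numpy as np

def gauss_deriv_bank(sigmas, half_len=50):
    t = np.arange(-half_len, half_len + 1).astype(float)
    g1 = np.array([-(t / s**2) * np.exp(-t**2 / (2 * s**2)) for s in sigmas])
    g2 = np.array([(t**2 / s**4 - 1.0 / s**2) * np.exp(-t**2 / (2 * s**2)) for s in sigmas])
    return g1, g2

g_high = np.vstack(gauss_deriv_bank([0.8, 1.2, 1.8]))   # short filters: fast modulations
g_low  = np.vstack(gauss_deriv_bank([2.7, 4.0, 6.0]))   # long filters: slow modulations

def filter_band(trajectory, bank):
    """Band-pass filter one critical-band trajectory with every filter in a bank."""
    return np.vstack([np.convolve(trajectory, h, mode='same') for h in bank])

band = np.random.randn(300)             # dummy 3 s critical-band energy trajectory
feat_high = filter_band(band, g_high)   # high modulation-frequency features
feat_low = filter_band(band, g_low)     # low modulation-frequency features
```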

¹ Unlike in [14], filter-banks G1 and G2 are composed of six filters rather than eight, leaving out the two filters with the longest impulse responses.


Figure 1: Set of temporal filters obtained by first (G1, left picture) and second (G2, right picture) order derivatives of a Gaussian function. G1 and G2 are successively split into two filter banks, (G1-low and G2-low, dashed lines) and (G1-high and G2-high, continuous lines), that filter respectively low and high modulation frequencies.


Figure 2: Normalized frequency responses of G1 (left picture) and G2 (right picture). G1 and G2 are successively split into two filter banks. G1-low and G2-low (dashed lines) emphasize low modulation frequencies while G1-high and G2-high emphasize high modulation frequencies.


Separate ranges of auditory and modulation frequency channels can be obtained by dividing the initial spectrogram into four channels, (F-Low,G-Low), (F-Low,G-High), (F-High,G-Low), (F-High,G-High), that represent the combinations of low/high auditory and modulation frequencies. This processing is depicted in figure 4.

In the remainder, the investigation focuses on the most effective way of combining classifiers trained on:

1 Separate ranges of auditory frequencies (F-High) and (F-Low) (Section 5).

2 Separate ranges of modulation frequencies (G-High) and (G-Low) (Section 6).

3 Separate ranges of auditory-modulation frequencies (F-Low,G-Low), (F-Low,G-High), (F-High,G-Low), (F-High,G-High) (Section 7).

3. Combination of classifiers

The classifier used for this study is the Multi Layer Perceptron (MLP) described in [24]. The training is done using back-propagation [25], minimizing the cross entropy between MLP outputs and phonetic targets [24].

Figure 3: Auditory spectrum of a speech signal (left picture) and its filtered versions with filter g_{1,1.2}, i.e. extraction of high modulation frequencies (center plot), and g_{1,6}, i.e. extraction of low modulation frequencies (right plot).


Figure 4: Auditory-modulation frequency channel extraction: the auditory spectrum is filtered with a set of Gaussian filters that extract high and low modulation frequencies (G-High and G-Low). After that, auditory frequencies are divided into two channels (F-High and F-Low). This produces four different auditory-modulation channels: (F-Low,G-Low), (F-Low,G-High), (F-High,G-Low), (F-High,G-High).

The MLP outputs can thus be considered as estimates of the phoneme posterior probabilities conditioned on the acoustic observation vector. Phoneme posterior probabilities are then used as conventional features in HMM/GMM systems through the TANDEM approach [10].

Two different combination schemes based on Multi Layer Perceptrons (MLP) are studied: the first one combines two feature streams in parallel fashion while the second combines features in hierarchical (sequential) fashion.

1-Parallel combination: Given two different feature streams, a separate MLP for estimating phoneme posterior probabilities is trained independently on each of them. Phoneme posterior probabilities from the individual MLPs are then concatenated together, forming the input to a third MLP which produces a single phoneme posterior estimate. The process is depicted in figure 5. This architecture is generally used for combining classifiers trained on different auditory frequency bands [4] and has been used as well in many multi-stream systems that use speech temporal modulations (e.g. [13],[12],[15]).

2-Hierarchical processing: an MLP is trained on a first feature stream in order to obtain phoneme posteriors. These posteriors are then concatenated with a second feature stream, thus forming the input to a second phoneme-posterior-estimating MLP. In such a way, phoneme estimates from the first MLP are modified by a second net using evidence from a different feature stream. This process is depicted in figure 6. In contrast to parallel processing, the order in which features are presented does make a difference. Hierarchical processing integrates the information contained in the different frequency channels in sequential fashion, progressively modifying the phoneme posteriors obtained from the first MLP using a different signal representation. A minimal sketch of both schemes is given below.
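The following sketch illustrates the two combination schemes using scikit-learn style MLP classifiers; the hidden-layer size, the classifier library and the function names are illustrative assumptions and do not correspond to the MLP trainer actually used in the paper.

```python
# Minimal sketch of the parallel and hierarchical combination schemes.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_parallel(X1, X2, y, hidden=200):
    """Parallel: one MLP per stream, then a merger MLP on concatenated posteriors."""
    mlp1 = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=200).fit(X1, y)
    mlp2 = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=200).fit(X2, y)
    posteriors = np.hstack([mlp1.predict_proba(X1), mlp2.predict_proba(X2)])
    merger = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=200).fit(posteriors, y)
    return mlp1, mlp2, merger

def train_hierarchical(X_first, X_second, y, hidden=200):
    """Hierarchical: posteriors of the first MLP are appended to the second
    feature stream and re-classified; the order of the two streams matters."""
    mlp1 = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=200).fit(X_first, y)
    augmented = np.hstack([mlp1.predict_proba(X_first), X_second])
    mlp2 = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=200).fit(augmented, y)
    return mlp1, mlp2
```

Swapping `X_first` and `X_second` in the hierarchical function changes the result, whereas the parallel function is symmetric in its two streams; this is exactly the property compared in the following sections.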

In the rest of the paper, the total number of parameters in the various MLP structures is kept constant by modifying the size of the hidden layer. This allows a fair comparison between the different experiments without biasing results towards structures that contain more parameters.
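For a single-hidden-layer MLP this amounts to solving a simple linear equation for the hidden-layer size; the sketch below illustrates the idea with an arbitrary parameter budget and input dimension, which are not the paper's actual values.

```python
# Sketch: choose the hidden-layer size of a single-hidden-layer MLP so that the
# total number of weights and biases roughly matches a fixed budget.
def hidden_size_for_budget(n_in, n_out, n_params):
    # params = (n_in + 1) * h + (h + 1) * n_out  ->  solve for h
    return max(1, int((n_params - n_out) / (n_in + n_out + 1)))

# Illustrative example: 180-dim MRASTA input, 42 phoneme targets, ~500k budget.
print(hidden_size_for_budget(180, 42, 500_000))   # -> 2241 hidden units
```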

4. Experimental setting

In contrast to previous related works on considerably smaller amounts of data, we pursue the investigation on a Large Vocabulary Continuous Speech Recognition (LVCSR) task. The system is a simplified version of the first-pass AMI LVCSR system for meeting transcription described in [26] and uses a pruned trigram language model².


Figure 5: Parallel processing of two feature sets. This combination scheme is invariant to the order in which features are introduced.

Figure 6: Hierarchical processing of two feature sets. This combination scheme is sensitive to the order in which features are introduced.

The training data for this system comprises the individual headset microphone (IHM) data of four meeting corpora: NIST (13 hours), ISL (10 hours), ICSI (73 hours) and a preliminary part of the AMI corpus (16 hours). Acoustic models are phonetically state-tied triphone models trained using standard HTK maximum likelihood training procedures. The recognition experiments are conducted on the NIST RT05s evaluation data [27] (Independent Headset Microphone (IHM) part), which is composed of speech recorded in five different meeting rooms (AMI, CMU, ICSI, NIST, VT). The pronunciation dictionary is the same as the one used in the AMI NIST RT05s system [26]. The Juicer large vocabulary decoder [28] is used for recognition with a pruned trigram language model.

² The first-pass RT05 system does not include VTLN, HLDA or speaker adaptation. Furthermore, decoding is done using a pruned trigram language model. Only the first pass is used, as the paper does not focus on benchmarking the LVCSR system but on comparing the different feature sets.


In order to use phoneme posteriors in a conventional HMM/GMM system, the TANDEM approach is used [10]. The different time-frequency representations are used as input to an MLP which estimates phoneme posterior probabilities. The phonetic targets consist of 42 phonemes. Phoneme posteriors are then transformed according to a Log/KLT transform and used as conventional features for HMM/GMM systems. After the KLT, only the first 25 components are retained, accounting for 95% of the variability.
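A minimal sketch of this post-processing follows, assuming a generic PCA-style KLT estimated on the training posteriors; it is not the exact normalization recipe of the AMI system, and all names are illustrative.

```python
# Sketch of the TANDEM post-processing: log of the phoneme posteriors followed
# by a KLT (PCA-like) projection keeping the leading 25 components.
import numpy as np

def fit_log_klt(posteriors, n_keep=25, eps=1e-10):
    logp = np.log(posteriors + eps)                 # log-posteriors
    mean = logp.mean(axis=0)
    cov = np.cov(logp - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_keep]       # leading KLT bases
    return mean, eigvec[:, order]

def apply_log_klt(posteriors, mean, bases, eps=1e-10):
    return (np.log(posteriors + eps) - mean) @ bases   # TANDEM features

# Usage: post_train is an (n_frames, 42) matrix of MLP phoneme posteriors.
# mean, bases = fit_log_klt(post_train); feats = apply_log_klt(post_train, mean, bases)
```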

Table 1 reports results for the baseline system (PLP plus dynamic features) and the MRASTA-TANDEM system, where all the available ranges of modulation and auditory frequencies are processed using a single MLP. The MRASTA-TANDEM features perform 3.4% absolute worse than the PLP baseline system.

Table 1: RT05 WER for baseline PLP system and MRASTA-TANDEM features

Features   TOT    AMI    CMU    ICSI   NIST   VT
PLP        42.4   42.8   40.5   31.9   51.1   46.8
MRASTA     45.8   47.6   41.9   37.1   53.7   49.7

5. Combination of Auditory frequency channels

In this section, we investigate from an ASR perspective whether classifiers trained on separate auditory frequency ranges should be combined in parallel or hierarchical fashion.

The auditory spectrum is split into two frequency ranges composed of 7 and 8 Bark bands. MRASTA filtering is then applied, resulting in two feature sets of 168 and 192 components each. We refer to those two feature sets as F-Low and F-High; they contain all the available modulation frequencies extracted at low and high auditory frequencies respectively. Table 2 reports the WER for the two MLP features obtained by training on the separate auditory frequency ranges.

Table 2: RT05 WER for TANDEM features obtained training MLPs on high and low auditory frequencies.

Features   TOT    AMI    CMU    ICSI   NIST   VT
F-Low      65.5   68.7   62.1   58.1   71.1   68.3
F-High     60.6   60.2   56.4   50.1   64.6   74.1

Classifiers trained on F-Low and F-High are then combined according to the schemes described in section 3, i.e. in parallel fashion or in hierarchical (sequential) fashion. In contrast to the parallel combination, the hierarchical combination is sensitive to the order in which features are introduced. Thus we consider the cases in which the processing moves from F-Low to F-High and vice-versa. Results are reported in Table 3.

Table 3: RT05 WER for TANDEM features obtained combining MLPs trained on separate ranges of auditory frequencies both in parallel and hierarchical fashion (both from low to high and from high to low frequencies).

Features          TOT    AMI    CMU    ICSI   NIST   VT
F-Low to F-High   45.0   48.1   42.9   36.0   51.0   47.7
F-High to F-Low   44.3   45.7   42.5   35.2   50.5   48.8
Parallel          43.9   46.0   41.0   35.8   50.7   47.1

The parallel processing outperforms the hierarchical processing. Using two separate frequency channels reduces the WER by about 2% absolute (i.e. from 45.8% to 43.9%) w.r.t. the single classifier approach. The variation in WER between the three combination schemes is approximately 1%.

Those results are consistent with findings on human speech recognition [1],[2] and confirm that the parallel scheme (as in conventional multi-band systems) outperforms other combinations in the case of classifiers trained on different auditory frequency channels.


6. Combination of Modulation frequency channels

In this section we investigate from an ASR perspective whether classifiers trained on separate modulation frequency ranges should be combined in parallel or hierarchical fashion.

As before, we limit the splitting to only two modulation frequency ranges. Filter-banks G1 and G2 (6 filters each) are divided into two separate filter banks as described in section 2. We refer to those as high modulations (G-High) and low modulations (G-Low); they contain the entire available range of auditory frequencies extracted at high and low modulations respectively. The performance of MLP features trained on G-High/G-Low is reported in Table 4.

Table 4: RT05 WER for TANDEM features obtained training MLPs on high and low modulation frequencies.

Features   TOT    AMI    CMU    ICSI   NIST   VT
G-High     45.9   48.7   41.9   37.3   53.3   49.2
G-Low      50.0   51.9   47.6   40.7   57.5   53.1

Classifiers trained on G-Low and G-High are now combined in parallel fashion or in hierarchical (sequential) fashion, moving from G-Low to G-High and from G-High to G-Low. Results are reported in Table 5.

Table 5: RT05 WER for TANDEM features obtained combining MLPs trained on separate ranges of modulation frequencies both in parallel and hierarchical fashion (both from low to high and from high to low frequencies).

Features          TOT    AMI    CMU    ICSI   NIST   VT
G-Low to G-High   45.8   48.3   43.5   37.0   52.5   48.5
G-High to G-Low   40.0   40.5   37.3   32.2   47.8   42.9
Parallel          41.4   42.7   38.3   32.5   47.4   47.1

Moving in the hierarchy from low frequencies to high frequencies yields performance similar to the single MLP approach. On the other hand, moving from high to low modulation frequencies produces a significant reduction of 5.8% absolute in the final WER w.r.t. the single classifier approach. The parallel combination performs 1.4% absolute worse than the sequential combination.

In contrast to the auditory frequencies, the hierarchical combination outperforms the parallel combination when the processing moves from high to low modulation frequencies. Furthermore, this approach outperforms the PLP baseline by 2.4% absolute. The variation in WER between the three combination schemes is considerably larger than the variation obtained by splitting the auditory frequencies.

Those findings are consistent with physiological experiments in [29], which show how different levels of speech processing may attend to different rates of the modulation spectrum, with the higher levels emphasizing lower modulation frequency rates.

To verify that the improvements in the previous experiment are produced by the sequential processing of modulation frequencies and not merely by a hierarchy of MLP classifiers, an additional experiment is proposed. Posterior features from the single MRASTA MLP (i.e. all auditory and modulation frequencies simultaneously processed) are presented as input to a second MLP. The second MLP does not use additional input but only re-processes a block of several concatenated posterior features. Table 6 reports the WER on the RT05 data set.

Table 6: RT05 WER for TANDEM features obtained by hierarchically processing the output of an MLP trained on all the available auditory and modulation frequencies.

Features         TOT    AMI    CMU    ICSI   NIST   VT
Hier Posterior   44.2   46.2   41.9   34.6   51.3   48.1

Results show an improvement in performance w.r.t. the single MRASTA classifier of 1.6% absolute, thus significantly worse than the sequential modulation processing, which produces a WER reduction of 5.8%.


Figure 7: Parallel combination of four separate auditory-modulation channels. A separate MLP is trained on each of the frequency channels. Posterior estimates are then combined together using another MLP classifier.

The experiment reveals that the improvements are actually coming from the sequential processing of modulation frequencies rather than from the hierarchy of MLPs.

7. Combination of Auditory-Modulation frequencies

This section aims at investigating whether the conclusions of sections 5 and 6 are also valid when combining four joint auditory-modulation frequency channels. First, four auditory-modulation frequency ranges are extracted as described in section 2. Then a separate MLP classifier is trained on each of them. The WER for each of those individual TANDEM features is reported in Table 7. All four streams have a similarly high WER compared to the full MRASTA filter bank. Posteriors are then combined into a single feature stream by training another MLP that operates as a merger classifier (see figure 7). This approach combines in parallel all four information channels (auditory and modulation). Results are reported in Table 8.

Table 7: RT05 WER for TANDEM features obtained training MLPs on separate auditory-modulation frequency ranges.

Features   G-Low,F-Low   G-High,F-Low   G-Low,F-High   G-High,F-High
WER        65.9          66.8           65.0           67.3

Table 8: RT05 WER for parallel combination of evidence from four auditory-modulation frequency channels (see figure 7).

Features                TOT    AMI    CMU    ICSI   NIST   VT
Parallel (4 channels)   40.7   40.7   38.6   32.6   47.6   44.5

Separate processing of the auditory-modulation frequency channels reduces the WER from 45.8% (single classifier approach) to 40.7%, i.e. approximately 5% absolute.

In order to verify whether the findings of sections 5 and 6 also hold when different ranges of joint auditory-modulation frequencies are considered, the combination scheme of figure 8 is investigated. This scheme processes auditory frequencies in parallel fashion and modulation frequencies in hierarchical fashion.


Figure 8: Proposed combination of separate auditory-modulation channels. This scheme aims at processing in parallel fashion classifiers trained on auditory frequencies and in hierarchical fashion classifiers trained on modulation frequencies. MLPs trained on high and low auditory frequencies extracted at high modulation frequencies are combined in parallel. Later, high and low auditory frequencies extracted at low modulation frequencies are combined in hierarchical fashion.

The proposed scheme combines in parallel MLPs trained on high and low auditory frequencies extracted at high modulation frequencies. Later, high and low auditory frequencies extracted at low modulation frequencies are combined in hierarchical fashion. Results are reported in Table 9.

Table 9: RT05 WER for TANDEM features obtained using the MLP architecture depicted in figure 8.

Features                  TOT    AMI    CMU    ICSI   NIST   VT
Combination of figure 8   39.6   41.9   37.0   31.9   45.1   42.9

The proposed combination scheme produces an improvement of 1.1% absolute in WER with respect to the parallel combination of the four channels. Furthermore, the WER is reduced by 0.4% with respect to the hierarchical modulation spectrum approach. Those results suggest that the conclusions of sections 5 and 6 are also verified when joint auditory-modulation channels are combined.

8. Distant microphone results

In order to measure the performance of the different architectures in case of low SNR and increased reverberation, the features are also evaluated on audio acquired in Single Distant Microphone (SDM) conditions. The system training is the same as in section 4. Acoustic features for evaluation are extracted from the distant microphone. Results are reported in Table 10 and include the parallel and hierarchical combinations of auditory and modulation frequency channels as well as the joint auditory-modulation processing.

On SDM audio, the gap in performance between the PLP baseline and the TANDEM-MRASTA features is only 0.4% absolute. The trend for features generated using the parallel and hierarchical architectures is similar to what is reported on IHM data, i.e. the combination of auditory frequencies is more effective in parallel fashion while the combination of modulation frequencies is more effective in hierarchical fashion. Furthermore, the hierarchical features largely outperform the PLP baseline.


Table 10: RT05 WER for TANDEM features obtained training MLPs on separate auditory and modulation frequency ranges. Results are reported for Single Distant Microphone audio.

Features   PLP    TANDEM-MRASTA
WER        56.9   57.3

Auditory   Parallel   F-Low to F-High   F-High to F-Low
WER        55.4       56.8              56.0

Modulation   Parallel   G-Low to G-High   G-High to G-Low
WER          53.7       57.1              51.2

Features   Four channels (parallel)   Four channels (combination as in figure 8)
WER        51.9                       50.2

9. Application into other LVCSR systems

The hierarchical combination of classifiers trained on separate ranges of modulation frequencies (also referred to as the hierarchical modulation spectrum) has also been tested on other languages and integrated into other LVCSR systems that make use of TANDEM-based feature extraction.

In [30], experiments on an LVCSR system for Mandarin Broadcast speech recognition are presented. The HMM/GMM and MLP training is done using approximately 100 hours of manually transcribed Broadcast data. Those data present cleaner acoustic conditions compared to the meeting recordings. Results are reported on the DARPA GALE 2006 evaluation data. Acoustic models are phonetically state-tied triphone models trained using maximum likelihood training procedures. The Mandarin phonetic set is composed of 71 tonemes. Phoneme posteriors are transformed according to a Log/KLT and only the first 35 components are kept, accounting for 95% of the variability. As before, hierarchical and parallel combinations of modulation frequencies are studied; the number of parameters in the different MLP architectures is kept constant to provide a fair comparison. Results are evaluated in terms of Character Error Rate (CER) and reported in Table 11.

Table 11: CER for the DARPA GALE eval06 data set.

Features   MRASTA   Parallel G-High/G-Low   Hierarchical G-High to G-Low
CER        32.4     28.1                    27.8

Results reveal conclusions similar to those previously presented for the meeting recognition task, i.e. hierarchical processing of modulation frequencies outperforms both the single classifier approach and the parallel processing.

Other large-scale experiments with TANDEM features based on the hierarchical processing of the modulation spectrum are also reported in [31]. The authors experiment with HMM/GMM and MLP systems trained on very large amounts of data (1600 hours) and integrated into the GALE Mandarin LVCSR system. Results show that the proposed approach outperforms the single classifier approach by 20% on several evaluation and development data sets from the GALE project.

10. Conclusions and discussion

In this paper we discuss the most effective way of combining classifiers trained on separate ranges of auditory and modulation frequencies. Two different schemes are considered: the parallel and the hierarchical (sequential) combination.

The parallel combination of classifiers trained on separate ranges of auditory frequencies is a well-known practice in the multi-band framework. Table 12 summarizes the results obtained by dividing the available auditory frequencies into two separate ranges. In brackets, relative improvements w.r.t. the single classifier approach are reported.


Table 12: Summary of WER dividing the available range of auditory frequencies. In brackets, relative improvements w.r.t. the MRASTA baseline are reported.

Two separate auditory channels
Features   MRASTA   Parallel      Low to High   High to Low
WER        45.8     43.9 (+4%)    45.0 (+1%)    44.3 (+3%)

Table 13: Summary of WER dividing the available range of modulation frequencies. In brackets, relative improvements w.r.t. the MRASTA baseline are reported.

Two separate modulation channels
Features   MRASTA   Parallel      Low to High   High to Low
WER        45.8     41.4 (+9%)    45.8 (+0%)    40.0 (+12%)

Similarly, Table 13 summarizes the results obtained by dividing the available modulation frequencies into two separate ranges.

We can conclude that:

• the combination of MLPs trained on different ranges of auditory frequencies is more effective if performed in parallel fashion. This is consistent with studies on human speech recognition [1],[2] and with the conventional multi-band framework.

• the combination of MLPs trained on different ranges of modulation frequencies is more effective if performed in hierarchical fashion moving from fast to slow modulations. This is consistent with studies on spectro-temporal receptive fields [29], which show how different levels of processing may attend to different rates of the modulation spectrum, with the higher levels emphasizing lower modulation frequency rates. Furthermore, it outperforms the conventional PLP baseline by 2.4% absolute.

• When only two frequency channels are used, the improvement coming from separate processing of modulation frequencies is considerably larger than the improvement coming from separate processing of auditory frequencies.

Results on SDM data show an overall similar trend. The relative improvements w.r.t. MRASTA features are reported in Table 14.

Table 14: Summary of WER dividing the available range of auditory and modulation frequencies. In brackets, relative improvements w.r.t. the MRASTA baseline are reported.

Two separate auditory channels
Features   MRASTA   Parallel      Low to High   High to Low
WER        57.3     55.4 (+3%)    56.8 (+1%)    56.0 (+2%)

Two separate modulation channels
Features   MRASTA   Parallel      Low to High    High to Low
WER        57.3     53.7 (+6%)    57.1 (+0.3%)   51.2 (+10%)

Similar experiments carried out on different data (Broadcast recordings) and a different language (Mandarin Chinese) lead to similar conclusions [30],[31].

Those findings are also effective in the case of the combination of joint auditory-modulation frequency channels. In fact, the proposed combination scheme of figure 8 outperforms by 1% the conventional parallel combination of MLPs trained on separate ranges of auditory-modulation frequencies.

The reason for this effect lies in the type of information that auditory and modulation frequencies carry and in the time scales at which they are extracted. Auditory frequencies are extracted from a fixed short-term (30 ms) window and represent the information contained in the instantaneous frequency of the signal. Thus high and low auditory frequencies correspond to the same temporal context. Auditory frequencies do not carry information on the dynamics of the signal.


The extraction of modulation frequencies, i.e. the dynamics of the signal, involves analysis at different time scales done by the multiple-resolution (MRASTA) filters with varying time spans. While short filters (fast modulations) provide a fine representation of the speech dynamics, long filters (slow modulations) provide a coarse representation of the speech dynamics.

MLPs trained on different ranges of auditory frequencies represent phoneme posterior estimates at the same time scale (30 ms). MLPs trained on different ranges of modulation frequencies represent phoneme posterior estimates at different time scales, with fine and coarse representations of the dynamics. Fine and coarse representations are extracted using short and long temporal filters respectively.

The difference between the parallel and hierarchical combinations lies in the fact that the parallel combination assumes that there is no ordering in the process, while the hierarchical combination is a sequential scheme where the order in which features are introduced matters.

Experiments on human speech recognition suggest that there is no ordering in the processing of auditory frequencies, i.e. recognition is carried out independently in each band and then a decision is taken by merging results from the different bands. This is verified in the experimental section, as the parallel architecture (which does not assume any ordering) performs better than the hierarchical sequential architectures. This is furthermore supported by the small difference in performance whenever the processing moves from high to low or from low to high auditory frequencies (in the order of 2% relative).

MLPs trained on different ranges of modulation frequencies produce phoneme posterior estimates at different time scales, i.e. they can be ordered according to their time scales. Combining those estimates assuming that there is no ordering, i.e. in parallel, may be suboptimal. Studies like [29] suggest that the processing of modulation frequencies is sequential, i.e. different levels attend to different rates of the modulation spectrum. Hierarchical combination is a possible way of implementing sequential processing which assumes an ordering and can operate from fine-to-coarse (i.e. fast to slow) or coarse-to-fine (i.e. slow to fast) time scales. Experiments reveal that integrating information from fast modulations (i.e. short temporal windows) to slow modulations (i.e. long temporal windows) is the most effective processing, consistently with [29]. The hypothesis of sequential processing is furthermore supported by the large difference in performance between moving from fast to slow and from slow to fast modulation frequencies (in the order of 10% relative).

In other words, the combination of MLPs trained on high and low modulation frequencies involves the combination of different time contexts. Moving from short to long time spans is similar to progressively increasing the temporal context, as done in a number of other posterior-based ASR systems [32]. Such an effect could not be obtained using the parallel scheme.

On the other hand, the MLPs trained on high and low auditory frequencies have a fixed input temporal context, they do not provide fine/coarse estimations of the input signal, and they do not have a particular ordering; thus the sequential combination is not as effective as the parallel one.

We limited the investigation here to two auditory and two modulation frequency channels, obtained by splitting the tonotopic scales into two equally sized bands, in order to simplify the experimental setup. In the future we plan to consider a larger number of bands and to experiment with a considerably larger number of frequency channels as in [13],[15].

11. Acknowledgments

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023 and by the European Union under the integrated project AMIDA. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). The author thanks colleagues from the AMIDA and GALE projects for their help with the different LVCSR systems and the reviewers for their comments.

References

[1] Fletcher H., Speech and Hearing in Communication, Krieger, New York, 1953.
[2] Allen J.B., "How do humans process and recognize speech?," IEEE Transactions on Speech and Audio Processing, vol. 2, Oct. 1994.
[3] Allen J.B., Articulation and Intelligibility, Morgan and Claypool, 2005.
[4] Hermansky H., Tibrewala S., and Pavel M., "Towards ASR on partially corrupted speech," Proceedings of ICSLP, 1996.
[5] Bourlard H. and Dupont S., "A new ASR approach based on independent processing and re-combination of partial frequency bands," Proceedings of ICSLP, 1996.
[6] Hermansky H., "TRAP-TANDEM: Data-driven extraction of temporal features from speech," Proceedings of ASRU, 2003.
[7] Chen B., Chang S., and Sivadas S., "Learning discriminative temporal patterns in speech: Development of novel TRAPS-like classifiers," Proceedings of Eurospeech, 2003.
[8] Misra H., Bourlard H., and Tyagi V., "Entropy-based multi-stream combination," Proceedings of ICASSP, 2003.
[9] Valente F. and Hermansky H., "Combination of acoustic classifiers based on Dempster-Shafer theory of evidence," Proceedings of ICASSP, 2007.
[10] Hermansky H., Ellis D., and Sharma S., "TANDEM connectionist feature extraction for conventional HMM systems," Proceedings of ICASSP, 2000.
[11] Hermansky H. and Sharma S., "Temporal patterns (TRAPS) in ASR of noisy speech," Proceedings of ICASSP, 1999.
[12] Morgan N., Chen B., Zhu Q., and Stolcke A., "Trapping conversational speech: Extending TRAP/TANDEM approaches to conversational telephone speech recognition," Proceedings of ICASSP, 2004.
[13] Kleinschmidt M., "Methods for capturing spectro-temporal modulations in automatic speech recognition," Acustica united with Acta Acustica, vol. 88(3), pp. 416-422, 2002.
[14] Hermansky H. and Fousek P., "Multi-resolution RASTA filtering for TANDEM-based ASR," Proceedings of Interspeech, 2005.
[15] Zhao S., Ravuri S., and Morgan N., "Multi-stream to many-stream: Using spectro-temporal features for ASR," Proceedings of Interspeech, 2009.
[16] Hermansky H., "Should recognizers have ears?," Speech Communication, vol. 25, pp. 3-27, 1998.
[17] Hermansky H., Kanedera H., Arai T., and Pavel M., "On the importance of various modulation frequencies for speech recognition," Proceedings of Eurospeech, 1997.
[18] Hermansky H. and Morgan N., "RASTA processing of speech," IEEE Transactions on Speech and Audio Processing, vol. 2, 1994.
[19] Sivadas S. and Hermansky H., "Hierarchical tandem feature extraction," Proceedings of ICASSP, 2002.
[20] Valente F., Vepa J., Plahl C., Gollan C., Hermansky H., and Schluter R., "Hierarchical neural networks feature extraction for LVCSR system," Proceedings of Interspeech, 2007.
[21] Valente F. and Hermansky H., "Hierarchical and parallel processing of modulation spectrum for ASR applications," Proceedings of ICASSP, 2008.
[22] Houtgast T., "Frequency selectivity in amplitude modulation detection," Journal of the Acoustical Society of America, vol. 88, 1989.
[23] Dau T., Kollmeier B., and Kohlrausch A., "Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers," Journal of the Acoustical Society of America, no. 102, pp. 2892-2905, 1997.
[24] Bourlard H. and Morgan N., Connectionist Speech Recognition - A Hybrid Approach, Kluwer Academic Publishers, 1994.
[25] Rumelhart D., Hinton G., and Williams R., "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, 1986.
[26] Hain T. et al., "The 2005 AMI system for the transcription of speech in meetings," NIST RT05 Workshop, Edinburgh, UK, 2005.
[27] NIST RT05s evaluation, http://www.nist.gov/speech/tests/rt/rt2005/spring/.
[28] Moore D. et al., "Juicer: A weighted finite state transducer speech decoder," Proceedings of MLMI 2006, Washington DC.
[29] Miller et al., "Spectro-temporal receptive fields in the lemniscal auditory thalamus and cortex," Journal of Neurophysiology, vol. 87(1), 2002.
[30] Valente F., Magimai-Doss M., Plahl C., and Ravuri S., "Hierarchical processing of the modulation spectrum for the GALE Mandarin LVCSR system," Proceedings of Interspeech, 2009.
[31] Plahl C., Hoffmeister B., Heigold G., Loof J., Schluter R., and Ney H., "Development of the GALE 2008 Mandarin LVCSR system," Proceedings of Interspeech, 2009.