

SCIENCE

Nontactile estimation of glottal excitation characteristics of voiced speech

N.P. Brieseman, ME, C.W. Thorpe, BE, and Prof. R.H.T. Bates, DSc(Eng), FEng, FIEE

Indexing terms: Signal processing, Speech synthesis, Biomedical applications, Digital filters

Abstract: A record of voiced speech is LPC-analysed. It is also partitioned into a sequence of short signals, each containing a single glottal pulse. Each short signal is taken to be the convolution of a component varying from one short signal to the next and an invariant component, corrupted by a significant but not overwhelming contamination, i.e. noise plus all other imperfections. The invariant component, which is initially estimated by shift-and-add processing, is the multiple convolution of the invariant responses of the recording apparatus, the speaker's lips and vocal tract (plus nasal tract and soft palate) and the speaker's average glottal excitation. This initial estimate, which is characteristic of the glottal excitation, is iteratively refined by a computational procedure which makes use of the LPC coefficients. The procedure, which checks its own numerical convergence, is illustrated by presenting results for six different speakers and for a single speaker under varying conditions.

1 Introduction

This paper introduces a technique capable of nontactile characterisation of a speaker's average glottal excitation, by computational processing of recorded speech, without the need for a throat microphone or recourse to any species of glottography [1-4]. We do not, as Milenkovic, for instance, in Reference 5, make prior assumptions concerning the form of the 'glottal pulse'. Our procedure is inherently self-checking in that the estimate of the excitation can be iteratively refined.

Section 2 summarises our reasons for carrying out the work reported here and makes clear what its basic limitations are. Certain established results, needed to explain details of the technique introduced in this paper, are summarised in Section 3. The assumptions on which the technique is predicated are introduced in Section 4, which also details the computational strategy. Our protocols for gathering, organising and processing data are described in Section 5. In Section 6 we report and assess our

Paper 5668A (S9), first received 21st May 1986 and in revised form 5th May 1987
The authors are with the Department of Electrical & Electronic Engineering, University of Canterbury, Christchurch 1, New Zealand

results, and in Section 7 we outline useful future possibilities.

Our processing is derived from the shift-and-add principle, which arose originally in the context of optical astronomical imaging [6] and is now incorporated into observational practice [7-9]. It has also been applied to ultrasonic imaging [10, 11] and electrocardiography [12]. It is important to understand that shift-and-add is not merely an 'educated-averaging' technique, such as Friedman [13] employs (but it is worth noting that Friedman's preliminary processing is similar to ours). As explained in detail in Section 3.2, shift-and-add is actually a species of blind deconvolution [14], whereby the smearing or blurring suffered by a signal or image is compensated without prior knowledge of the impulse response or point spread function of whatever apparatus or propagation medium caused the degradation. It is this property of shift-and-add, allied to its proven robustness [7], that makes it particularly suited to estimating the average glottal excitation embedded in a speech record.

2 Motivation and constraints

There are several reasons for wanting to estimate the acoustic waveform of an individual's glottal excitation, without the necessity of affixing any kind of apparatus to, or in the vicinity of, the individual's throat. First, despite the undoubted success of therapeutic methods employing glottographic devices [1, 15], a clinical patient with a voice disorder, or a deaf person being instructed by a therapist, would find it more comfortable and less stressful merely to speak into a microphone. Secondly, even though speaker recognition systems are commercially available [16], their reliability might be improved if a speech waveform could be conveniently broken down into components having physiologically distinct origins. The task of achieving a desired speaker-identification probability would be eased if the number of independent descriptors could be increased. Thirdly, since it seems established that the quality of synthesised speech is affected by how faithfully one can mimic the glottal excitation [17-19], it is reasonable to expect that improved vocal tract parameters would be obtained if an estimate of the glottal excitation were deconvolved out of the speech record before the parameters were abstracted.

It would obviously be desirable to recover the forms of individual glottal pulses using nontactile methods. The technique introduced in this paper only separates out the average glottal excitation during any utterance recorded by an individual. In spite of this, the results reported in



Section 6 suggest the waveforms so obtained will be useful in applications such as the three outlined in the previous paragraph.

Recorded speech consists of unwanted noise added to the convolution of the glottal excitation and the response of what is here called the generalised vocal tract. The latter is composed of the multiple convolution of the responses of the actual vocal tract (as modified by the nasal tract and any absorption by the soft palate), the lips acting as an acoustic radiator, the receiving transducer (e.g. microphone) and the recording device. Since it is the final output which is of interest in all three of the applications mentioned in the first paragraph of this Section, we think it inappropriate to deconvolve the responses of the apparatus and the lips from our estimate of the average glottal excitation. The former is constant for any particular application (in Section 5 we describe our precautions for ensuring that our results are not significantly degraded by any inadequacies of our experimental apparatus), and while there is an accepted universal model for the lip response [20, 21], it cannot characterise all individual speakers equally well. Therefore, we have thought it most useful to generate waveforms which are specific to particular utterances by particular individuals recorded by particular apparatus. Thus the quantity g(t), called the average generalised excitation (defined precisely in Section 4), is that feature of an utterance which persists throughout the recording.

Although we are not able to isolate the acoustic manifestation of any individual 'glottal pulse', we can detect alterations in the glottal excitation, and separate such alterations from those in the vocal tract response.

3 Necessary preliminaries

Certain established theoretical points and algorithmic procedures, which are invoked later in the paper, are summarised in this Section.

3.1 General inconsistency of convolution
Given any two signals, x(t) and y(t), of finite duration, we can never, except in contrived situations, construct a third such signal z(t) so that x(t) is identical to y(t) ⊗ z(t), where ⊗ is the convolution operator. We say that x(t) and y(t) ⊗ z(t) are almost always inconsistent [14]. On the other hand, there are infinitely many signals z(t) and w(t) for which

x(t) = y(t) ⊗ z(t) + w(t) (1)

Further constraints (such as those stated in Section 4) must be introduced to ensure that the signals z(t) and w(t) can be unique.
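As an aside for readers who wish to experiment, the following brief Python fragment (our own illustrative sketch, not part of the original exposition; the signals are arbitrary random examples) demonstrates the inconsistency numerically: the least-squares z leaves a nonzero remainder, which plays the role of w(t) in eqn. 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)   # arbitrary finite-duration signal x(t)
y = rng.standard_normal(16)   # arbitrary finite-duration signal y(t)

# Build the matrix C for which C @ z equals the full convolution y * z,
# choosing the length of z so that y * z has the same length as x.
n_z = len(x) - len(y) + 1
C = np.zeros((len(x), n_z))
for i in range(n_z):
    C[i:i + len(y), i] = y

# Least-squares z minimising the energy of x - y * z; the remainder w
# is almost never zero, illustrating the inconsistency behind eqn. 1.
z, *_ = np.linalg.lstsq(C, x, rcond=None)
w = x - C @ z
print('relative residual energy:', np.sum(w**2) / np.sum(x**2))
```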

3.2 Shift-and-add processing
Before showing in detail how it can be adapted to speech processing, we begin by explaining the shift-and-add principle in descriptive terms.

Any signal can be regarded as a succession of contiguous impulses. When a signal is blurred in such a way that each impulse is identically distorted, the blurring process can be expressed as the convolution of the original signal and a blurring function, called the point spread function (PSF) [14]. Since the PSF can also be regarded as a succession of impulses, the blurred signal can be represented as a distribution of differently delayed versions of the original signal, each with an amplitude equal to the original signal scaled by the amplitude of the appropriate impulse in the PSF. Thus the amplitude of


any impulse in the blurred signal is related in a linear way to the amplitudes of each impulse in the original signal and the PSF. Consequently, there is a finite probability that the largest amplitude impulse in the blurred signal corresponds to the largest amplitude impulse in the largest version of the original signal. This is the basic assumption implicit in the shift-and-add principle [6, 11, 14].

Now consider what one can do when presented with a sequence of blurred signals, each being the convolution of a particular original signal and a PSF, with all the PSFs being (quasi-)randomly different. If each blurred signal is shifted (in time) so that its largest impulse occurs at the origin of the time axis (which can of course be chosen arbitrarily, since we can say that the largest impulse of the original signal also occurs at the origin), then there is a finite probability that the largest version of the original signal present in any particular blurred signal is centred at the origin (assuming that the original signal is centred around its largest impulse, which also can be arranged by appropriately defining the duration of the original signal). When all the shifted blurred signals are added, the centred versions of the original signal reinforce while the remaining versions tend to cancel, because they are (quasi-)randomly shifted with respect to each other. A restored version of the original signal is thereby obtained.

The quality of the restoration depends upon the number of versions of the original signal that are actually centred on the origin, which in turn depends on how dominant the largest impulse in the original signal is over all the other impulses. Even when the largest impulse is only marginally dominant, however, shift-and-add processing can be modified so as to generate a faithful version of the original signal, provided it belongs to a particular (but wide) class of signals [11]. The iterative refinement scheme introduced in Section 5.5 can be viewed as an adaptation of such modifications to the basic shift-and-add procedure described in detail in the remainder of this Section.

Suppose a finite duration signal g(t) is repeatedly applied, at the set {Tm; m = 1, 2, ..., M; Tm+1 > Tm} of instants, to a time-varying filter, with hm(t) being the zero mean time-invariant impulse response best approximating (in a least squares sense) the response of the filter throughout (Tm − T−) ≤ t ≤ (Tm + τm + T+), which we call the mth response interval. We are here defining

g(t) = 0 for t < −T− and t > T+ (2)

|g(0)| ≥ |g(t)| for −T− ≤ t < 0 and 0 < t ≤ T+ (3)

where τm is the greatest time taken for the filter output to finally fall below some preset threshold when a unit impulse is applied to the filter at any instant within the interval (Tm − T−) ≤ t ≤ (Tm + T+). The mth individual output sm(t) of the filter, i.e. the output obtained when g(t) is applied for the mth time, is given throughout the mth response interval by

sm(t) = g(t − Tm) ⊗ hm(t) + cm(t) (4)

where the mth contamination cm(t) accounts for the effects of the threshold, the temporal variation of the filter, any overlap of successive response intervals, and recording noise. If the major peak of g(t) is dominant, in the sense that

|g(0)| > |g(t)| for −T− ≤ t ≤ −1/(2W) and 1/(2W) ≤ t ≤ T+ (5)



where W is the effective bandwidth of the filter, and if the latter's temporal variations are such that the members of the set {hm(t), cm(t); m = 1, 2, ..., M} are effectively statistically independent, and if (Tm+1 − Tm−1)/2 is at least comparable with the duration of the mth response interval (for most values of m), and if M is large enough (how large this is depends on the contamination level, but useful results are often obtained for M > 50), then the shift-and-add signal ssa(t) can be expected to reveal the form of g(t) faithfully [7, 10, 11]:

ssa(t) = ⟨sm(t + tm)⟩m (6)

where ⟨·⟩m denotes an average over m, and tm is the instant at which the magnitude of sm(t) is largest, i.e.

|sm(tm)| = max_t |sm(t)| (7)
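In discrete time, eqns. 6 and 7 reduce to centring each record on its largest-magnitude sample and averaging. The following Python sketch (ours, under the assumption that the records are equal-length numpy arrays) makes this explicit:

```python
import numpy as np

def shift_and_add(records, centre=64):
    """Basic shift-and-add of eqns. 6 and 7: shift each record so that its
    largest-magnitude sample lies at index `centre`, then average."""
    acc = np.zeros_like(records[0], dtype=float)
    for s in records:
        t_m = int(np.argmax(np.abs(s)))   # eqn. 7: instant of largest magnitude
        # eqn. 6: align the peaks before averaging (circular shift used for
        # brevity; zero padding would avoid wrap-around at the ends).
        acc += np.roll(s, centre - t_m)
    return acc / len(records)
```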


4 Shift-and-add for voiced speech

A speech signal can be regarded as the outcome of applying to the vocal tract (which is a time-varying filter) an excitation e(t), issuing from the larynx. During voiced speech, e(t) is a sequence of effectively discrete glottal excitations, the mth being denoted by gm(t − Tem), where Tem is the instant at which the major peak of the mth excitation occurs. Thus, a recording s(t) of a section containing M individual excitations can be expressed as

s(t) = Σ (m = 1 to M) gm(t − Tem) ⊗ vm(t) + c(t) (8)

where, throughout the interval during which the mth excitation persists within the generalised vocal tract (as defined in Section 2), vm(t) mimics the role of hm(t) during the mth response interval (as defined in Section 3). The contamination c(t) accommodates all the effects accounted for by the individual contaminations introduced in Section 3.2.

The processing, by which we generate the results presented in Section 6, is predicated upon the instants Tem being sufficiently separated that s(t) can be partitioned into segments having the character of the individual outputs discussed in Section 3.2. The mth segment can be equated with sm(t), as defined by eqn. 4, with the aid of the notation

gm(t − Tem) = k(t − Tem) ⊗ km(t) + αm(t) (9)

vm(t) = q(t) ⊗ qm(t) + βm(t) (10)

where the quantities appearing on the right hand sides of eqns. 9 and 10 are introduced to take account of the consistency question raised in Section 3.1. We posit that we are able to fix the forms of these quantities (refer to the final sentence of Section 3.1) by, first, requiring km(t) and qm(t) to be of zero mean and, secondly, constraining the energies of k(t) and q(t) to be as large as are consistent with the actual forms of all M of the gm(t) and vm(t). Other constraints could be imposed, but we feel that these accord best with our results. The implication is that k(t) and q(t) represent the parts of the glottal excitations and the generalised vocal tract responses, respectively, which persist throughout the whole utterance. The usefulness of this postulate can only be evaluated experimentally (refer to the results presented in Section 6). Inspection of eqns. 4, 9 and 10 reveals that

g(t) = k(t) ⊗ q(t), hm(t) = km(t + Tem − Tm) ⊗ qm(t) (11)

and that

cm(t) = αm(t) ⊗ [βm(t) + q(t) ⊗ qm(t)] + βm(t) ⊗ k(t) ⊗ km(t − Tem) (12)

It is worth noting the form of the second of eqns. 11, which recognises that each Tem may differ significantly from Tm; this comes about because there can be appreciable differences between k(t) and g(t). For the reasons given in Section 2, we call g(t) the average generalised excitation.

Since the generalised vocal tract (as defined in Section 2) is necessarily a causal filter, its response must be of the 'all-pole' variety during any interval short enough for this filter to be effectively time-invariant. This is of course the conventional rationale for linear predictive coding (LPC) analysis of speech [22]. We abstract standard LPC parameters from each speech record and use them, as explained in Section 5.5, to iteratively refine g(t).
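For readers unfamiliar with the autocorrelation method referred to above, the following sketch (ours; the paper itself relies on the standard formulation of Reference 22) obtains p predictor coefficients from one frame by the Levinson-Durbin recursion:

```python
import numpy as np

def lpc_autocorrelation(frame, p=10):
    """Autocorrelation-method LPC: return the coefficients a[0..p]
    (a[0] = 1) of the prediction-error filter, plus the residual energy."""
    n = len(frame)
    r = np.correlate(frame, frame, mode='full')[n - 1:n + p]   # r[0..p]
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i (Levinson-Durbin recursion).
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err
```

The reflection coefficients generated within the recursion are the quantities conventionally used in the lattice-filter realisation of the all-pole vocal tract model mentioned in Section 5.4.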


5 Experimental method

5.1 Terminology
Here we shall introduce certain terminology. A selected sentence of text, when read aloud by a person (hereafter referred to as a speaker), is called an utterance. After recording (which encompasses prerecording with an audio tapedeck in an anechoic chamber, as described below, sampling, and storing in computer memory) an utterance is termed a speech record. The number of uniformly spaced samples per second is called the sampling rate. The temporal interval during which any sample is recorded is called a sampling instant. A sample is the digital representation of the amplitude of the speech waveform recorded at an instant. A segment is a part of a speech record comprising any number of consecutive samples.

5.2 Data collection
Utterances were prerecorded in an anechoic chamber using an AIWA CM-53 microphone and an AIWA F990 tapedeck. The 3 dB frequency responses of the microphone and the tapedeck (when employing Dolby C noise reduction, and optimised bias and equalisation for the chromium dioxide tape that was used) were at 50 Hz and 13 kHz, and at 20 Hz and 18 kHz, respectively.

In any signal processing application where a waveform shape is to be recovered, such as shift-and-add, it is important that the phase response of the recording apparatus be close to linear. It has been customary in glottal inverse filtering situations either to use an FM tape recorder [20, 23], or to digitise the signal directly into the computer [5, 20]. However, the quality of AM tape recording has improved over the years (Miller's paper [23] was written in 1959), and in tests on our tapedeck we found no perceptible differences in waveform shape between the directly digitised and the prerecorded-then-digitised versions of a speech record. We also measured the phase response of the tapedeck with a Hewlett Packard HP 3561A Dynamic Signal Analyser, and found that, down to 100 Hz, its maximum deviation from linearity was 10°.

Each prerecorded utterance was filtered to prevent aliasing by a KEMO VBF/8 low-pass filter, which was set to a cutoff frequency of 4.5 kHz with a rolloff of 48 dB/octave. The speech was then sampled at a rate of 10 kHz and digitised by the 12 bit A-D converter incorporated into the DEC LPA11-K Laboratory Peripheral Accelerator on our VAX 11/750 computer. The effective



bandwidth of the sampling/digitising circuitry was 6 MHz, implying that the duration of a sampling instant was 165 nsec.

5.3 Shift-and-add
Shift-and-add is implemented for each speech record by invoking the following sequence of steps:

(i) The voiced segments of the speech record are identified with the aid of a software implementation of Knorr's [24] frequency-based algorithm. The unvoiced and silent segments are discarded, and the voiced segments are concatenated to produce a continuous voiced speech record.

(ii) Because we have noticed that the most prominent peaks in any continuous voiced speech record are almost all either mostly positive-going or mostly negative-going, we normalise our speech records by inverting those whose most prominent peaks are negative-going.

(iii) Consecutive pitch periods of the voiced speech record are estimated with the aid of Brieseman's [25] implementation of our extension [26] of the Gold and Rabiner time-domain algorithms [27]. We prefer this approach to the many others that are available, such as those discussed in Section 2.3.2 of [28].

(iv) The voiced speech record is partitioned into a sequence of M overlapping segments, with M being close to the number of glottal pulses in the record. Each segment comprises 128 samples, a number chosen for convenience (128 being a power of 2), and because it was deemed to be appropriately longer (by about 20 samples) than the average pitch period of any of the speech records that we processed. The segmentation proceeds according to the following steps. A preliminary segment is defined, and the sample corresponding to the most prominent positive-going peak (of magnitude pm) is labelled with the integer h(m). The nth sample is then labelled with the integer (64 − h(m) + n). After the samples are labelled in this way (which ensures that the most prominent positive-going peak of the preliminary segment occurs at the sample labelled with the integer 64), the mth segment rm(t) is defined as those samples labelled with integers from 0 to 127. This comprises the 'shift' step of shift-and-add. Denoting by μm the estimate, as obtained from step (iii), of the number of samples in a pitch period during the mth segment, the (m + 1)th preliminary segment is centred on sample number (h(m) + μm) (which is the expected location of the next peak).

We simplified our processing by setting μm = 100 samples for all m, as this is approximately equal to the average pitch period for male speakers, and in preliminary trials it had no perceptible effect on our results.

(v) Since the algorithm referred to in (i) above is not always successful in precisely locating the times at which the voiced/unvoiced transitions occur (implying that some of the segments may be of silence or unvoiced speech), we have found it effective to discard those rm(t) for which pm is less than 1% of the maximum of all M of the pm.

(vi) The M̄ of the M original rm(t) which survive step (v) are normalised by dividing each rm(t) by pm. These M̄ normalised segments, each of whose most prominent positive-going peaks is of unit amplitude, are taken to be the sm(t + tm), as defined in Sections 3.2 and 4.

(vii) The M̄ of the sm(t + tm) constructed in step (vi) are added together (the 'add' of shift-and-add), and the total is divided by M̄, thereby generating ssa(t). Since we are only interested in the signal over a duration of one pitch interval, we discard those samples which are outside the segment, of length equal to the average pitch period, that is centred on the 64th sample (which is the sample corresponding to the peak in ssa(t)). Steps (iv) to (vii) are sketched in code below.
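The following condensed Python sketch is ours: record stands for the continuous voiced speech record and peak_locs for the peak locations h(m) implied by the pitch tracking of step (iii), so the names and the exact bookkeeping are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def shift_and_add_speech(record, peak_locs, seg_len=128, centre=64):
    """Steps (iv)-(vii): cut a 128-sample segment around each prominent
    positive-going peak, discard weak segments, normalise, and average."""
    segments, peaks = [], []
    # Step (iv): place each peak at sample `centre` of its segment.
    for h in peak_locs:
        start = h - centre
        if start < 0 or start + seg_len > len(record):
            continue                      # skip segments overrunning the record
        segments.append(record[start:start + seg_len])
        peaks.append(record[h])           # p_m, the peak magnitude
    # Step (v): discard segments whose peak is below 1% of the largest peak.
    p_max = max(peaks)
    kept = [(s, p) for s, p in zip(segments, peaks) if p >= 0.01 * p_max]
    # Steps (vi) and (vii): normalise to unit peak, add, and divide by M-bar.
    return sum(s / p for s, p in kept) / len(kept)
```

The final truncation to one average pitch period centred on the 64th sample is omitted here for brevity.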

5.4 Ancillary processing
The shift-and-add signal ssa(t) can be regarded as a first estimate of g(t). In order to assess the accuracy of the estimate we generate a synthetic speech record by exciting an LPC model of the voiced speech record with ssa(t). The first step in the generation of this synthetic speech record is to differentiate the voiced speech record, to stabilise the LPC extraction process by flattening the spectrum of the speech [22, 29]. We then abstract sets of 10 LPC coefficients from each contiguous 200 sample long segment of the voiced speech record by the standard autocorrelation method [22]. A standard lattice filter representation of the vocal tract [22, 29] is computed from the LPC parameters. A synthetic version of the voiced speech record, denoted here by a(t), is then generated by repeatedly exciting the lattice filter with ssa(t) at instants separated by the pitch periods computed during step (iii) of Section 5.3. We denote by am(t) the version of sm(t) implicit in a(t).

Shift-and-add is performed on the M̄ segments am(t) in the same way as described in Section 5.3 for the sm(t), and the result of this we denote by asa(t). The error measure

ε = ∫ [ssa(t) − asa(t)]² dt (13)

quantifies the degree of compatibility of ssa(t) with the true average generalised excitation, because ε would vanish if ssa(t) were in fact identical to g(t); here t− ≤ t ≤ t+, the interval over which the integral in eqn. 13 is taken, is the interval centred around the peak of ssa(t) with extent defined by the average pitch period of s(t).
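In sampled form, and assuming the squared-difference reconstruction of eqn. 13 given above, the error measure can be computed as follows (our sketch; t_minus and t_plus are the sample indices bounding the interval around the peak):

```python
import numpy as np

def error_measure(s_sa, a_sa, t_minus, t_plus):
    """Eqn. 13 (as reconstructed): accumulated squared difference between the
    shift-and-add signal and its synthetic counterpart over [t-, t+]."""
    d = s_sa[t_minus:t_plus + 1] - a_sa[t_minus:t_plus + 1]
    return float(np.sum(d ** 2))
```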

5.5 Iterative refinement of g(t)
We have devised the following iterative scheme to generate estimates of g(t) giving smaller values for ε than ssa(t) does.

The current estimate of g(t) is replaced by (g(t) + ssa(t) − asa(t)). A new synthetic version of the voiced speech record is generated using the procedure described in Section 5.4 to generate a(t). Having formed the new versions of am(t), in the same way that the sm(t) are generated, as described in Section 5.3, and having subjected them to shift-and-add to form asa(t), we recompute the error measure ε. A new iteration is started by updating g(t). The first estimate of g(t) is ssa(t), as indicated in Section 5.4. The iterations are stopped either when ε falls below whatever threshold is deemed a priori to be appropriate, or when ε is a minimum. LPC speech analysis is only approximate, though it leads to useful results, implying that, in general, there is no g(t) exactly compatible with a given voiced speech record. A consequence of this is that the iterations can diverge after initially converging: ε initially decreases with N (the number of iterations) and then increases, analogously to the terms in an asymptotic series [30]. Our experience is that the procedure always converges to a useful level before diverging (see results presented in Section 6).
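The scheme can be summarised in Python as below (our sketch: synthesise stands for the LPC resynthesis of Section 5.4, resegment_and_add for the segmentation plus shift-and-add of Section 5.3, and both are assumed here rather than supplied):

```python
import numpy as np

def refine_excitation(s_sa, synthesise, resegment_and_add, n_max=25):
    """Iterative refinement of g(t): start from the shift-and-add signal,
    apply the update g <- g + s_sa - a_sa, and retain the estimate giving
    the smallest error measure (the iterations can eventually diverge)."""
    g = np.array(s_sa, dtype=float)       # first estimate of g(t) is s_sa(t)
    best_g, best_eps = g.copy(), np.inf
    for _ in range(n_max):
        a = synthesise(g)                 # synthetic speech record driven by g(t)
        a_sa = resegment_and_add(a)       # shift-and-add of the synthetic segments
        eps = float(np.sum((s_sa - a_sa) ** 2))   # error measure of eqn. 13
        if eps < best_eps:
            best_g, best_eps = g.copy(), eps
        g = g + (s_sa - a_sa)             # update rule of this Section
    return best_g, best_eps
```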

5.6 Processing of computer-generated speech records
A preliminary assessment of the effectiveness of our signal processing strategy was obtained by applying it to computer-generated voiced speech records. We obtained a trial speech record by applying a computer-generated glottal pulse, having the simple triangular form shown as a dotted curve in Fig. 1, to a lattice filter constructed as indicated in Section 5.4 from LPC parameters derived



from an actual utterance. The solid and dashed curves in Fig. 1 (which apply to the utterance quoted in the caption to Table 1, as recorded by Speaker A) are typical of our results.

The solid curve in Fig. 1 shows ssa(t) for the trial speech record, computed in the same way as described, for an actual speech record, in Section 5.3. Since the same computer-generated glottal pulse persists throughout the trial speech record, we expect ssa(t) to be reasonably close to both the computer-generated glottal pulse and the final version of asa(t), the latter being the dashed curve in Fig. 1. This expectation is substantiated by Fig. 1.

Fig. 1 Computer-generated glottal pulse and reconstruction
Computer-generated glottal pulse (i.e. synthetic pulse) and its reconstruction; the artificial speech was generated by using this synthetic pulse to excite (as described in Section 5.4) a model of Speaker A's voice. The time scale (a 10 ms bar) is indicated below the waveforms. The amplitude scale is arbitrary, but all waveforms in the set (of three) shown in this figure are normalised to have the same peak amplitude. The same conventions of time and amplitude scaling are adopted in subsequent figures. Note that e(10) = 8 × 10⁻⁵.
Curves: original g(t) (dotted); tenth estimate of g(t) (dashed); ssa(t) (solid)

The differences between the dashed and dotted curves in Fig. 1 characterise the part of the vocal tract response of Speaker A which persists throughout the utterance.

6 Experimental results

In this section we illustrate the performance of our shift-and-add approach to glottal excitation estimation by presenting results for: (i) 6 different speakers who recorded a single utterance (see Table 1 and Figs. 2 and 3), (ii) a single speaker who recorded a single utterance several times (Fig. 4), (iii) a single speaker who recorded several different utterances (Fig. 5), and (iv) a speaker who recorded a single utterance in several different ways (Fig. 6). These results typify our experience.

Table 1: Speaker characteristics

Speaker  Sex  Average pitch, Hz  M̄    e(5)
A        m    97                 291  0.7
B        m    110                303  1.0
C        m    91                 274  1.3
D        f    222                457  4.7
E        f    189                395  2.1
F        f    209                481  1.4

Characteristics of six speakers (identified by the upper-case letters in the left-hand column) who recorded the utterance 'When sunlight strikes raindrops in the air, they act like a prism, and form a rainbow'. The notation e(N), which is invoked in the right-hand column, implies the error measure at the Nth iteration. The integer M̄, heading the second column from the right, is defined in step (vi) of Section 5.3.

In each case, the iterative scheme was continued for 25 iterations. We only present waveforms for the first and the fifth iterations, since we found that after this stage the difference between asa(t) and ssa(t) could not be seen on the scale of any of the figures.

Fig. 2 relates to the 6 speakers listed in Table 1. There are two particularly noteworthy features of this figure.

Fig. 2 Shift-and-add and average glottal excitation
Shift-and-add (on left) and average glottal excitation (on right) for the 6 speakers listed, and the utterance quoted, in Table 1. The length of the horizontal axis of each set of curves indicates the average pitch period for the speaker identified by the upper-case letter on the right (which corresponds to the letters in Table 1). The peak of each curve is positioned above the centre of its respective horizontal axis. Note that the same conventions are adopted for the horizontal axes in Figs. 4, 5 and 6.
On left: ssa(t); asa(t), 1st iteration; asa(t), 5th iteration. On right: 5th version of g(t); g(t), 1st iteration


The first is that asa(t) at the 5th iteration is very close to ssa(t) for all 6 speakers, implying that the 5th version of g(t) (the right-hand solid curve in each case) must be close to the true g(t), as is supported by the small values of e(5) listed in Table 1. The second feature is that the forms of ssa(t), and of the 5th versions of g(t), differ noticeably for all 6 speakers. Another point worth remarking is that, while the forms of ssa(t) for Speakers E and F are somewhat similar, the corresponding forms of the fifth version of g(t) are appreciably different.

Fig. 3 shows how the error measure varies with N, the number of iterations, for each of the 6 speakers who recorded the utterance quoted in the caption to Table 1. Note that the general decrease of e(N) with N is maintained until N exceeds 16 iterations for 5 of the speakers. For only a single speaker does e(N) start to diverge as



soon as N exceeds 10. For all 6 speakers, e(5) is as small as we think would be required in any of the applications we can envisage (including those discussed in Section 2).

Fig. 4 illustrates the degree of repeatability of our results. Speaker B recorded a particular utterance 10 times, trying to speak identically each time. The spreads in ssa(t) and the fifth versions of g(t) are indicated by plotting their averages and the curves corresponding to one

Fig. 3 Plot of E = log10 e(N) against the number N of iterations for the 6 speakers listed in Table 1 (where e(N) is defined in the notes below Table 1)
Curves: A; B; C; D; E; F

Fig. 5 Results for 3 different utterances recorded by Speaker A
The sets of curves have the same meaning as the corresponding curves in Fig. 2. Utterances:
a 'When sunlight strikes raindrops in the air, they act like a prism, and form a rainbow'
b 'The time has come, the walrus said, to speak of many things; of shoes and ships and sealing wax, of cabbages and kings'
c 'Stitching London together from one bank of the Thames to the other are 32 bridges; 20 for road traffic, 10 for rail, and two for pedestrians only. Without these bridges, Britain's capital could not function'

Fig. 4 Variation about mean of (a) ssa(t) and (b) g(t) for 10 repetitions by Speaker B of the utterance quoted in the caption to Table 1
The peak of each waveform was normalised to unit amplitude before the averages and spreads were computed.
Curves: average; average minus one standard deviation; average plus one standard deviation

Fig. 6 Results for different ways of speaking, by Speaker B, the utterance quoted in the caption to Table 1
The sets of curves correspond to those in Fig. 5. The ways in which the utterance was spoken were:
a Normal pitch, monotone (average pitch = 105 Hz)
b High pitch, monotone (average pitch = 167 Hz)
c Tense throat muscles (average pitch = 124 Hz)



standard deviation above and below the averages. Comparison of Figs. 2 and 4 implies that there are likely to be greater differences between different speakers at the same time than between the same speaker at different times.

Since g(t) is defined to be the average glottal excitation over a particular utterance, its form can be expected to vary with the utterance. Fig. 5 shows results for 3 different utterances recorded (during a single recording session) by Speaker A. The differences between corresponding curves in Fig. 5 seem again to be significantly less than between those for different speakers (Fig. 2).

The apparent greater variability between different speakers than between a single speaker under different conditions, suggested by Figs. 4 and 5, is confirmed by our experience of other speakers during recording sessions spread over several months. However, the relative invariance of ssa(t) and g(t) for a single speaker does not extend to situations in which the 'state' of the speaker's 'voice' changes, as is illustrated by Fig. 6. It is apparent that one's glottal excitation tends to alter noticeably whenever one tries to speak in any sort of 'funny voice'. This means that utterances must be spoken in a 'normal voice' if ssa(t) and g(t) are to be used as parameters in a speaker identification system.

7 Conclusions

We have presented a straightforward, numerically robust and self-consistent (in the sense that estimates can be objectively refined) approach to estimating the average glottal excitation in a record of voiced speech. The results summarised in Table 1 and Fig. 2 emphasise that numerical convergence is achievable for a variety of speakers. It should be noted that the refinement technique described in Section 5.5 can separate the effects of changes in the vocal tract response from those due to alterations in the glottal excitation.

While neither ssa(t) nor g(t) can be classed as a unique 'voice print', the differences between speakers, and the comparative constancy of individual speakers, illustrated by Figs. 2 to 6 (which typify the other results we have so far obtained), are sufficiently marked to suggest that shift-and-add should prove useful in applications of the kind outlined in Section 2.

We are presently embarking on comprehensive statistical trials to assess the practical significance of our approach in clinical and speaker-recognition applications. An important aspect of these trials will be the effects on an individual speaker's ssa(t) and g(t) when that speaker attempts to record utterances while suffering from a heavy cold or other malady.

Another of our current studies is of computationally and numerically stable means of performing the deconvolution operation discussed in connection with the third application in Section 2. The LPC parameters so derived could be incorporated into our iterative procedure (at the expense of much increased processing time) in the hope of extracting improved source-filter parameters for the speech.

8 Acknowledgments

We are grateful for support for our research in the form of a grant from the New Zealand International Year of the Disabled Telethon Trust. We thank our colleagues Andrew Elder, Richard Fright, Kathryn Garden and Bill Kennedy for many helpful comments and informative discussions. One of us (C.W.T.) acknowledges the support of a New Zealand University Grants Committee Postgraduate Research Scholarship.

9 References

1 LUDLOW, C., and O'CONNELL HART, M. (Eds.): 'Proceedings of the conference on the assessment of vocal pathology', ASHA Reports No. 11 (American Speech-Language-Hearing Association, Rockville, MD 20852, 1981)
2 STEVENS, K., and HIRANO, M. (Eds.): 'Vocal fold physiology' (Tokyo University Press, Tokyo, 1983)
3 KRISHNAMURTHY, A.K., and CHILDERS, D.G.: 'Two-channel speech analysis', IEEE Trans., 1986, ASSP-34, pp. 730-743
4 VEENEMAN, D.E., and BeMENT, S.L.: 'Automatic glottal inverse filtering from speech and electroglottographic signals', ibid., 1985, ASSP-33, pp. 369-377
5 MILENKOVIC, P.: 'Glottal inverse filtering by joint estimation of an AR system with a linear input model', ibid., 1986, ASSP-34, pp. 28-42
6 BATES, R.H.T., and CADY, F.M.: 'Towards true imaging by wideband speckle interferometry', Opt. Commun., 1980, 32, pp. 365-369
7 BATES, R.H.T.: 'Astronomical speckle imaging', Phys. Rep., 1982, 90, pp. 203-297
8 BAGNUOLO, W.G., Jr., and McALISTER, H.A.: 'The true nodal quadrant of Capella', Publ. Astron. Soc. Pac., 1983, 95, pp. 992-995
9 CHRISTOU, J.C., HEGE, E.K., FREEMAN, J.D., and RIBAK, E.: 'Images from astronomical speckle data: weighted shift-and-add analysis', in ARSENAULT, H.H. (Ed.): 'International conference on speckle', Proc. SPIE, 1985, 556, pp. 255-262
10 BATES, R.H.T., and ROBINSON, B.S.: 'Ultrasonic transmission speckle imaging', Ultrason. Imaging, 1981, 3, pp. 378-394
11 MINARD, R.A., ROBINSON, B.S., and BATES, R.H.T.: 'Full-wave computed tomography. Part 3: Coherent shift-and-add imaging', IEE Proc. A, 1985, 132, (1), pp. 50-58
12 BONES, P.J., IKRAM, I., MASLOWSKI, A.H., and BATES, R.H.T.: 'Signals from the ventricular specialised conduction system of the heart', Australas. Phys. & Eng. Sci. Med., 1982, 5, pp. 151-157
13 FRIEDMAN, D.H.: 'Pseudo-maximum-likelihood speech pitch extraction', IEEE Trans., 1977, ASSP-25, pp. 213-221
14 BATES, R.H.T., and McDONNELL, M.J.: 'Image restoration and reconstruction' (Clarendon Press, Oxford, 1986)
15 FOURCIN, A.: 'Laryngographic assessment of phonatory function', ASHA Reports, 1981, 11, pp. 116-127
16 DODDINGTON, G.R.: 'Speaker recognition: identifying people by their voices', Proc. IEEE, 1985, 73, pp. 1657-1664
17 ROSENBERG, A.E.: 'Effect of glottal pulse shape on the quality of natural vowels', J. Acoust. Soc. Am., 1970, 49, pp. 583-590
18 HOLMES, J.N.: 'The influence of glottal waveform on the naturalness of speech from a parallel formant synthesizer', IEEE Trans., 1973, AU-21, pp. 298-305
19 STEIGLITZ, K., and DICKINSON, B.: 'The use of time-domain selection for improved linear prediction', ibid., 1977, ASSP-25, pp. 34-39
20 WONG, D.Y., MARKEL, J.D., and AUGUSTINE, H.G.: 'Least squares glottal inverse filtering from the acoustic speech waveform', ibid., 1979, ASSP-27, pp. 350-355
21 WITTEN, I.H.: 'Principles of computer speech' (Academic Press, London, 1986)
22 MARKEL, J.D., and GRAY, A.H.: 'Linear prediction of speech' (Springer-Verlag, Berlin, 1976)
23 MILLER, R.L.: 'Nature of the vocal cord wave', J. Acoust. Soc. Am., 1959, 31, pp. 667-677
24 KNORR, S.G.: 'Reliable voiced/unvoiced decision', IEEE Trans., 1979, ASSP-27, pp. 263-267
25 BRIESEMAN, N.P.: 'A new algorithm for musical pitch estimation'. ME thesis, University of Canterbury, NZ, 1984
26 TUCKER, W.H., and BATES, R.H.T.: 'A pitch estimation algorithm for speech and music', IEEE Trans., 1978, ASSP-26, pp. 597-604
27 GOLD, B., and RABINER, L.R.: 'Parallel processing techniques for estimating pitch periods of speech in the time domain', J. Acoust. Soc. Am., 1968, 46, pp. 442-448
28 FEIJOO, J., and HERNANDEZ, C.: 'Automatic determination of tone period and evaluation of dysphony in pathological voices', IEE Proc. A, 1986, 133, pp. 99-103
29 RABINER, L.R., and SCHAFER, R.W.: 'Digital processing of speech signals' (Prentice-Hall, NJ, 1978)
30 MORSE, P.M., and FESHBACH, H.: 'Methods of theoretical physics' (McGraw-Hill, NY, 1953), Chapter 4.6
