Speech Signal Processing - Phil Garner

Speech Signal Processing
Milos Cernak

Outline:
  Introduction
  Speech synthesis signal processing: analysis, speech parameter generation, re-synthesis
  Synthesis vocoders
  Speech quality evaluation: subjective listening tests, objective measures



Introduction

The VODER (“Voice Operation DEmonstratoR”) of Homer Dudley, demonstrated at the Bell Laboratories exhibit at the 1939 New York World’s Fair, was controlled using a keyboard and foot pedals.

We can say that these peripherals controlled the parameters of the vocoder behind the VODER, and the operator of the VODER was a “model” that generated the control sequence.

In the case of the VODER, the “model” that synthesized the speech parameters was a human. Current vocoders incorporate the modelling of the parameters themselves. To distinguish them from historical vocoders, we will hereinafter call them synthesis vocoders.


Analysis - MGC features

Mel-generalised cepstral (MGC) features c_{α,γ}(m) are typically used in speech vocoding.

    H(z) = s_\gamma^{-1}\Big( \sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z^{-m} \Big)
         = \begin{cases}
             \big( 1 + \gamma \sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z^{-m} \big)^{1/\gamma}, & -1 \le \gamma < 0 \\
             \exp \sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z^{-m}, & \gamma = 0
           \end{cases}        (1)

where M is the analysis order.
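As a quick sanity check on Eq. (1), the generalised inverse logarithm s_γ^{-1} and the transfer function can be sketched in a few lines of NumPy. This is an illustrative sketch, not analysis code: the function names are ours, and we apply the all-pass substitution of Eq. (2) below for z^{-1}.

```python
import numpy as np

def s_inv(w, gamma):
    """Generalised inverse logarithm s_gamma^{-1} of Eq. (1):
    (1 + gamma*w)**(1/gamma) for gamma != 0, exp(w) for gamma == 0."""
    w = np.asarray(w, dtype=complex)
    if gamma == 0.0:
        return np.exp(w)
    return (1.0 + gamma * w) ** (1.0 / gamma)

def mgc_transfer(c, alpha, gamma, omega):
    """Evaluate H of Eq. (1) on the unit circle at frequencies omega
    (rad/sample), with the warped variable of Eq. (2) substituted
    for z^{-1}.  c[0..M] are the MGC coefficients."""
    z1 = np.exp(-1j * np.asarray(omega, dtype=float))  # z^{-1} on unit circle
    zt = (z1 - alpha) / (1.0 - alpha * z1)             # first-order all-pass
    acc = sum(c[m] * zt**m for m in range(len(c)))     # cepstral sum
    return s_inv(acc, gamma)
```

With all coefficients zero the spectrum is flat (H ≡ 1), which is a convenient check of the two branches.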


Relation of α and γ

The variable z^{-1} can be replaced by the first-order all-pass function

    \tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}        (2)

where α is a warping factor.

For speech sampled at 16 kHz, α = 0.42 gives a good approximation to the mel scale. The parameter γ controls the representation accuracy of poles and zeros.

As the value of γ approaches zero, the accuracy for spectral zeros increases at the expense of formant (pole) accuracy.
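The warping itself is just the phase response of the all-pass in Eq. (2), which has the closed form ω̃ = ω + 2 arctan(α sin ω / (1 − α cos ω)). A small sketch (the sampling rate and test frequencies are illustrative choices):

```python
import numpy as np

def warped_frequency(omega, alpha=0.42):
    """Phase response of the first-order all-pass of Eq. (2): maps linear
    frequency omega (rad/sample) to warped frequency (rad/sample).
    For alpha = 0.42 at 16 kHz this approximates the mel scale."""
    return omega + 2.0 * np.arctan(alpha * np.sin(omega)
                                   / (1.0 - alpha * np.cos(omega)))

# Illustrative check of the map at a 16 kHz sampling rate:
fs = 16000.0
f = np.array([0.0, 1000.0, 4000.0, 8000.0])   # Hz
w = 2.0 * np.pi * f / fs                      # rad/sample
wt = warped_frequency(w)                      # warped rad/sample
```

The map fixes 0 and π and is monotone; with α > 0 low frequencies are stretched (higher resolution), as the mel scale requires.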


Relation of MGC to other analysis methods.

Figure: Relation of MGC to other analysis methods.

For more details and explanation, please see Phil’s root cepstrum notes.


Speech parameter generation

As already mentioned, synthesis vocoders allow their parameters to be modelled. The models are typically Hidden Markov Models (HMMs).

An additional algorithm is then needed to calculate the speech parameters (static cepstra) from continuous-mixture HMMs with dynamic features.

The iterative maximum-likelihood parameter generation (MLPG) algorithm does this. We will not explain it here, as it is a topic for sequential speech processing systems.


Re-synthesis

In the mel-generalised cepstrum, H(z) is modelled by a set of cepstral coefficients c_{α,γ}(m).

For re-synthesis, the parameter γ is fixed to −1/2. This value balances a good representation of both spectral poles and zeros.

Then the synthesis filter is realised as the rational transfer function

    H(z) = \frac{1}{\{ B(z) \}^{2}}        (3)

where

    B(z) = 1 + \gamma \sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z^{-m}.        (4)
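Eqs. (3)–(4) are easy to evaluate numerically. The sketch below is illustrative (our own function name; the warped-variable substitution of Eq. (2) is applied, and the coefficient values in the checks are arbitrary):

```python
import numpy as np

def synthesis_response(c, alpha, omega, gamma=-0.5):
    """|H(e^{jw})| of Eq. (3), H(z) = 1/B(z)^2, with B(z) from Eq. (4)
    evaluated in the warped variable of Eq. (2)."""
    z1 = np.exp(-1j * np.asarray(omega, dtype=float))  # z^{-1} on unit circle
    zt = (z1 - alpha) / (1.0 - alpha * z1)             # all-pass warping
    B = 1.0 + gamma * sum(c[m] * zt**m for m in range(len(c)))
    return np.abs(1.0 / B**2)
```

Note that for γ = −1/2 this agrees with the (1 + γΣ)^{1/γ} branch of Eq. (1), since 1/γ = −2.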


Removing delay-free loops

To remove delay-free loops from B(z), the synthesis filter is re-designed as

    B(z) = 1 + \gamma \sum_{m=0}^{M} b'_{\gamma}(m)\, \Phi_m(z)        (5)

where \Phi_0(z) = 1 and

    \Phi_m(z) = \frac{(1 - \alpha^2)\, z^{-1}}{1 - \alpha z^{-1}}\, z^{-(m-1)}, \quad m \ge 1,        (6)

and the filter coefficients b'_{\gamma}(m) are obtained using the recursive formula

    b'_{\gamma}(m) = \begin{cases}
        c_{\alpha,\gamma}(M), & m = M \\
        c_{\alpha,\gamma}(m) - \alpha\, b'_{\gamma}(m+1), & 0 \le m < M
    \end{cases}        (7)
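The backward recursion of Eq. (7) is a one-liner per step; here is a direct sketch (the function name is ours, and the coefficient values in the checks are arbitrary):

```python
import numpy as np

def mgc_to_b(c, alpha):
    """Recursion of Eq. (7): convert MGC coefficients c(0..M) into the
    filter coefficients b'(0..M) used with the all-pass basis Phi_m(z)."""
    M = len(c) - 1
    b = np.empty(M + 1)
    b[M] = c[M]                       # m = M case
    for m in range(M - 1, -1, -1):    # m = M-1 ... 0
        b[m] = c[m] - alpha * b[m + 1]
    return b
```

With α = 0 the basis Φ_m(z) collapses to plain delays and b' reduces to c, which is a handy sanity check.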


A structure of MGLSA filter

Figure: A structure of the MGLSA filter 1/B(z). (If you’re not familiar with this kind of diagram, the triangles are scalers/attenuators, i.e. multiply-by-constant, the plusses are adders, and the z^{-1} boxes are 1-cycle delays.)

This synthesis filter is known in the literature as the Mel-Generalised Log Spectral Approximation (MGLSA) filter.


Mel-generalized cepstral vocoder (MGC)

The MGC vocoder is based on the analysis/re-synthesis framework introduced in the previous section. The main characteristics are:

Uses a mixture of a pulse train and white Gaussian noise for excitation source modelling.

The pulse/noise model is straightforward.

Produces characteristic “buzzy” sounds due to strong harmonics at higher frequencies.

Typical parameters are α = 0.42 and γ = −1/3.
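The pulse/noise excitation can be sketched directly. This is a minimal illustration, not the implementation of any particular vocoder: the 5 ms frame shift, the √period pulse amplitude, and the phase bookkeeping are our own simplifying assumptions.

```python
import numpy as np

def pulse_noise_excitation(f0, fs=16000, rng=None):
    """Frame-wise pulse-train / white-noise excitation: f0 holds one F0
    value (Hz) per frame, with 0 meaning unvoiced.  Frame shift is fixed
    at 5 ms here (an illustrative choice)."""
    rng = rng or np.random.default_rng(0)
    shift = int(0.005 * fs)               # 5 ms frame shift in samples
    e = np.zeros(len(f0) * shift)
    phase = 0.0                           # samples since the last pulse
    for i, f in enumerate(f0):
        if f <= 0.0:                      # unvoiced: white Gaussian noise
            e[i * shift:(i + 1) * shift] = rng.standard_normal(shift)
        else:                             # voiced: a pulse every fs/f samples
            period = fs / f
            for n in range(shift):
                phase += 1.0
                if phase >= period:
                    phase -= period
                    e[i * shift + n] = np.sqrt(period)  # roughly unit power
    return e
```

Driving the MGLSA filter with this excitation is exactly where the “buzzy” character of the pulse/noise model comes from.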


STRAIGHT-MGC - 1999

Figure: Hideki Kawahara, Professor, Wakayama University

http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html


STRAIGHT: Speech Transformation and Representation based on Adaptive Interpolation of weiGHTed spectrogram

Figure: A block diagram of STRAIGHT vocoder

Extracts the fundamental frequency F0.

F0-adaptive spectral analysis. The aperiodicity measure is defined as the lower envelope (spectral valleys) normalized by the upper envelope (spectral peaks).


STRAIGHT: synthesis

Figure: STRAIGHT synthesis

Aperiodicity is used to weight the harmonic and noise components of the excitation; this removes the effects of fundamental-frequency periodicity on the extraction of the vocal-tract spectral shape.


Glottal vocoder

Uses a library of glottal pulses instead of a pulse train for voiced signals.

The glottal excitation is synthesized through interpolating and concatenating natural glottal flow pulses.

The excitation signal is further modified to reproduce the time-varying changes in the natural voice source.

Analysis of the excitation uses Iterative Adaptive Inverse Filtering.

Energy and the harmonic-to-noise ratio are used for weighting the noise component.

Available at http://www.helsinki.fi/speechsciences/synthesis/glott.html.


Deterministic plus Stochastic vocoder – 2012

Uses MGC analysis/re-synthesis.

Differs in the excitation modelling:

1 Uses GCI-synchronous extraction of LP residuals.
2 The deterministic component at the low frequencies is decomposed using PCA to obtain the first eigen-residual.
3 The stochastic component is made of an energy envelope and an autoregressive model.


Harmonics plus Noise Model based vocoder – 2013

The previously described vocoders were based on source-filter decomposition and modelling.

A completely different approach uses sinusoidal/waveform decomposition.

The harmonic plus noise model (HNM) assumes the speech spectrum to be composed of two frequency bands, harmonic and noise. The bands are separated by the maximum voiced frequency (MVF).


HNM harmonic band analysis 1

The harmonic part, the lower band, is modelled as a sum of harmonics

    s_h(t) = \sum_{k=-L(t)}^{L(t)} A_k(t)\, e^{j k \omega_0(t) t}        (8)

where L(t) denotes the number of harmonics, which depends on the fundamental frequency ω_0(t) and on the MVF F_m(t).
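For fixed amplitudes and F0, Eq. (8) is a straightforward sum of complex exponentials. A minimal sketch (our own function name; the k = 0 DC term is omitted, and realness is obtained by pairing k with −k):

```python
import numpy as np

def harmonic_part(a, f0, fs, n):
    """Eq. (8) with stationary complex amplitudes: a[k-1] is A_k for
    harmonic k = 1..L, and the k < 0 terms of the sum are the conjugates
    A_{-k} = conj(A_k), which makes the signal real."""
    t = np.arange(n) / fs                 # sample times in seconds
    s = np.zeros(n)
    for k, ak in enumerate(a, start=1):
        # A_k e^{jk w0 t} + conj pair = 2 Re{A_k e^{jk w0 t}}
        s += 2.0 * np.real(ak * np.exp(2j * np.pi * k * f0 * t))
    return s
```

A single real amplitude A_1 = 0.5 then gives a plain cosine at F0, periodic with period fs/f0 samples.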


HNM harmonic band analysis 2

The complex amplitudes A_k(t) can take one of the following forms:

    A_k(t) = a_k(t_i)
    A_k(t) = a_k(t_i) + t\, b_k(t_i)        (9)
    A_k(t) = a_k(t_i) + t\, c_k(t_i) + t^2\, d_k(t_i)

where a_k(t_i), b_k(t_i), c_k(t_i) and d_k(t_i) are complex numbers with constant phases, measured at the analysis time instants t_i.

A simple stationary harmonic model using the first form of A_k(t), referred to as HNM1, is capable of generating speech perceptually indistinguishable from the original.


HNM noise band analysis

The modulated noise forms the upper band. Most important is the specification of noise bursts (where the energy is localised). Therefore the noise part s_n(t) is described as a time-varying autoregressive model h(τ, t) modulated by a parametric envelope e(t):

    s_n(t) = e(t)\, [h(\tau, t) * b(t)]        (10)

where b(t) is white Gaussian noise.

Finally, the synthetic speech s(t) is

    s(t) = s_h(t) + s_n(t)        (11)
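Eq. (10) can be sketched with a time-invariant AR model, which is a deliberate simplification of the time-varying h(τ, t); the function name and AR-coefficient convention below are our own assumptions.

```python
import numpy as np

def noise_part(env, ar, rng=None):
    """Eq. (10) with a fixed AR model: white Gaussian noise b(t) is
    filtered by 1/A(z), A(z) = ar[0] + ar[1] z^{-1} + ... (ar[0] = 1),
    then modulated by the parametric energy envelope e(t)."""
    rng = rng or np.random.default_rng(1)
    b = rng.standard_normal(len(env))
    y = np.zeros(len(env))
    for t in range(len(env)):             # direct-form AR recursion
        y[t] = b[t] - sum(ar[k] * y[t - k]
                          for k in range(1, min(len(ar), t + 1)))
    return np.asarray(env) * y

# Eq. (11): the synthetic speech is the sum of the two bands,
# s(t) = s_h(t) + s_n(t).
```

Setting the envelope to zero silences the band entirely, which is how the model localises noise bursts in time.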


HNM based vocoder

The HNM based vocoder thus:

Decomposes the speech frames into a harmonic part and a stochastic part using

1 MGC
2 F0
3 MVF

Voiced frames – the full spectral envelope may be obtained by interpolating amplitudes at the harmonics.

Unvoiced frames – analysed with the fast Fourier transform.

Available at aholab.ehu.es/ahocoder/index.html


Speech quality evaluation

In the context of the last lectures about parametric speech, i.e., speech analysis/re-synthesis methods, one may be interested in evaluating the speech quality degradation that the methods introduce.

We distinguish:

1 Subjective evaluation: asking people to rate the evaluated stimuli. It is costly and time-consuming.
2 Objective evaluation: using computers instead. It is cheaper and faster, but the quality of the result depends on the test.


Classification of speech quality evaluation methods

1 Conversational quality: the quality aspects of the conversation – a rare test.

2 Talking quality: echo, delay and sidetone distortion.

3 Listening quality: typically measures a single quality dimension such as:

  intelligibility
  naturalness
  listening effort


Subjective listening tests

The subjective listening tests differ mainly in whether a reference signal is used.

1 Tests without a reference follow absolute category rating (ACR) procedures.

2 Reference-based tests are called degradation category rating (DCR) tests.

Both of the following tests, MOS and DMOS, are standardised by the ITU-T.


Mean Opinion Score (MOS)

In an ACR test, a group of listeners rates the listening quality of the stimuli (speech examples). The quality is rated on the 5-level scale:

1 Bad,

2 Poor,

3 Fair,

4 Good,

5 Excellent.

The average of all scores is the speech quality metric called the mean opinion score (MOS).
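The arithmetic is simple enough to show in full. A small sketch; the 95% confidence interval is a common way to report listener spread, though it is not part of the MOS definition itself:

```python
import numpy as np

def mean_opinion_score(ratings):
    """Average ACR ratings (1 = Bad ... 5 = Excellent) into a MOS,
    with a 95% normal-approximation confidence half-width."""
    r = np.asarray(ratings, dtype=float)
    mos = float(r.mean())
    ci95 = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return mos, ci95
```

With many listeners the interval shrinks as 1/√n, which is why MOS tests need sizeable listener panels to separate similar systems.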


Degradation Mean Opinion Score (DMOS)

Sometimes the resolution of the MOS is not sufficient. It can be increased by a reference-based DCR test.

Here the listeners first listen to the original (source) speech signal and then rate the degradation of speech quality of the processed (modified) speech signal. The degradation is again rated on a 5-level impairment scale:

1 very annoying,
2 annoying,
3 slightly annoying,
4 audible but not annoying,
5 inaudible.

The average of all scores is the speech quality metric called the degradation mean opinion score (DMOS).


ABX

If one needs to test listeners’ reliability as well, there is the so-called ABX test.

The listeners are provided with three speech examples – A, B, and X – and asked which of A/B is identical to X. As the signal X is a known reference, the ABX test also belongs to the DCR procedures.

The ABX test is suitable for rating small degradations using a continuous impairment scale, and expert (trained) listeners should be used.


Objective

1 Similarly to the subjective listening tests, reference-based tests are called intrusive.

2 Tests without a reference are called non-intrusive.


Spectral distortion

A widely accepted objective measure is a frequency-domain measure – the gain-normalised spectral distortion (SD). The SD measure evaluates autoregressive spectra P_{xy}(n, k),

    P_{xy}(n, k) = \langle R_{xy}(k),\, e^{-j 2\pi n k / N} \rangle        (12)

per frame k as

    d^{k}_{SD}(s, t) = \frac{1}{N} \sum_{n=0}^{N-1} \Big[ 10 \log_{10} \Big( \frac{P^{s}_{xy}(n, k)}{P^{t}_{xy}(n, k)} \Big) \Big]^{2}        (13)

for the source signal s and target signal t. The final measure, the global distortion, is the root-mean SD:

    d_{SD}(s, t) = \frac{1}{K} \sqrt{ \sum_{k=0}^{K-1} d^{k}_{SD}(s, t) }        (14)

where K is the total number of frames.
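Given per-frame power spectra, Eqs. (13)–(14) translate directly into NumPy. This sketch follows the formulas as printed above (the function name and the test spectra are our own):

```python
import numpy as np

def spectral_distortion(P_s, P_t):
    """Gain-normalised spectral distortion of Eqs. (13)-(14).
    P_s, P_t: (K, N) arrays of per-frame power spectra of the source
    and target signals.  Returns the global distortion in dB."""
    diff_db = 10.0 * np.log10(P_s / P_t)      # per-bin log spectral ratio
    d_frame = np.mean(diff_db**2, axis=1)     # Eq. (13), one value per frame
    K = P_s.shape[0]
    return np.sqrt(np.sum(d_frame)) / K       # Eq. (14)
```

Identical spectra give zero distortion, and a constant gain mismatch of 10 dB propagates through the squares exactly as Eq. (14) predicts.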


Psycho-acoustically motivated measures

Many of the intrusive objective measures are psycho-acoustically motivated. The idea is to mimic human speech listening, and so the methods implement two basic modules:

1 Auditory processing – employs a perceptual transform using bark-scale frequency warping and subjective loudness conversion. The output is the auditory (nerve) excitation.

2 Cognitive mapping – extracts key information related to anomalies in the speech signal from the auditory excitation. This area is still not well understood.
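The bark-scale warping step can be illustrated with Zwicker's well-known approximation; note this is one common formula, not necessarily the exact warping any particular measure implements.

```python
import numpy as np

def hz_to_bark(f):
    """Zwicker's approximation of the Bark critical-band scale,
    one common choice for bark-frequency warping in auditory models."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)
```

The map is monotone, places 1 kHz near 8.5 Bark, and compresses high frequencies, mimicking the decreasing frequency resolution of the ear.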


Perceptual Evaluation of Speech Quality (PESQ)

Figure: Mimicking human quality assessment.

The widely used Perceptual Evaluation of Speech Quality (PESQ) computes internal representations, based on the auditory periphery, of both the reference/source signal s and the distorted/target signal y.

The internal representations are compared to predict the speech quality degradation Q.

This mimics the human brain, which probably compares these two entities during speech quality evaluation as well.


Perceptual Objective Listening Quality Assessment (POLQA)

A recent update of the PESQ measure is Perceptual Objective Listening Quality Assessment (POLQA):

PESQ measures one-way distortion; effects related to two-way communication, such as delay and echo, are not reflected in the scores. POLQA handles signals with variable delays.

PESQ was designed for narrow-band signals (3.4 kHz), and although there is a wide-band (7 kHz) extension, POLQA should perform better for wide-band signals.


POLQA enhancements

In addition, POLQA predicts an “idealised” reference signal, modelling listeners’ expectations of an ideal signal.

A reference signal with a low amount of recording noise, together with an identical degraded signal, will therefore not be scored with the maximum score.

When the uncertainty of the subjective scores is taken into account, a statistical metric called epsilon-insensitive RMSE (RMSE*) can be used (ITU-T P.1401 (07/2012)).

Last but not least: PESQ is free, while a binary of POLQA costs 3500 CHF.
