A Comparison Of Speech Coding With Linear Predictive Coding (LPC) And Code-Excited Linear Predictor Coding (CELP) By: Kendall Khodra Instructor: Dr. Kepuska


Introduction

This project develops Linear Predictive Coding (LPC) to process a speech signal. The objective is to mitigate the poor quality of the simple LPC model by using a more complex description of the excitation, Code Excited Linear Prediction (CELP), to process the output of simple LPC.

Background

Linear Predictive Coding (LPC) methods are among the most widely used in speech coding, speech synthesis, speech recognition, speaker recognition and verification, and speech storage.

LPC has been considered one of the most powerful techniques for speech analysis. In fact, this technique is the basis of other more recent and sophisticated algorithms that are used for estimating speech parameters, e.g., pitch, formants, spectra, vocal tract shape, and low bit rate representations of speech.

The basic principle of linear prediction states that speech can be modeled as the output of a linear, time-varying system excited by either periodic pulses or random noise. These two kinds of acoustic sources are called voiced and unvoiced respectively. In this sense, voiced emissions are those generated by the vibration of the vocal cords in the presence of airflow, and unvoiced sounds are those generated when the vocal cords are relaxed.

A. Physical Model:

When you speak:

• Air is pushed from your lungs through your vocal tract, and speech comes out of your mouth.

• For certain voiced sounds, your vocal cords (folds) vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice. Women and young children tend to have high pitch (fast vibration) while adult males tend to have low pitch (slow vibration).

• For certain fricative and plosive (or unvoiced) sounds, your vocal cords do not vibrate but remain open.

• The shape of your vocal tract, which changes as you speak, determines the sound that you make.

• The amount of air coming from your lungs determines the loudness of your voice.

B. Mathematical Model

Figure: Block diagram of the simplified mathematical model for speech production

• The model says that the digital speech signal is the output of a digital filter (called the LPC filter) whose input is either a train of impulses or a white noise sequence.

The relationship between the physical and the mathematical models:

Physical model                  Mathematical model
Vocal tract                     H(z) (LPC filter)
Air                             u(n) (innovation)
Vocal cord vibration            V (voiced)
Vocal cord vibration period     T (pitch period)
Fricatives and plosives         UV (unvoiced)

Vocal tract system function:

$$H(z) = \frac{A}{1 - P(z)}$$

where

$$P(z) = \sum_{k=1}^{p} a_k z^{-k}$$

The LPC Model

The LPC method considers a speech sample s(n) at time n and approximates it as a linear combination of the past p samples:

$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\, u(n) \qquad (1)$$

where G is the gain and u(n) the normalized excitation.

The predictor coefficients (the a_k's) are computed by minimizing the sum of squared differences (over a finite interval) between the actual speech samples and the linearly predicted ones, as described later.

Block diagram of an LPC

In the LPC model the residual (excitation) is approximated during voicing by a quasi-periodic impulse train and during unvoicing by a white noise sequence. The approximated excitation is then passed through the filter 1/A(z).
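The excitation model above can be sketched in a few lines of Matlab. This is a minimal illustration, not the project's code: the sampling rate, pitch period, gain, and the two predictor coefficients in `a` are hypothetical values chosen only so that the filter is stable. The sketch builds a voiced (impulse train) and an unvoiced (white noise) excitation, passes each through the all-pole filter, and compares the voiced output with the difference equation written out sample by sample.

```matlab
% Hypothetical parameters, chosen only for illustration
N = 240;                      % frame length in samples (30 ms at an assumed 8 kHz)
pitchPeriod = 80;             % assumed pitch period in samples
a = [1.3 -0.8];               % assumed predictor coefficients a_1, a_2 (stable)
G = 1;                        % assumed gain

% Voiced excitation: impulse train; unvoiced excitation: white noise
uVoiced = zeros(N, 1);
uVoiced(1:pitchPeriod:N) = 1;
uUnvoiced = randn(N, 1);

% All-pole synthesis: s(n) = sum_k a_k s(n-k) + G u(n)
sVoiced   = filter(G, [1 -a], uVoiced);
sUnvoiced = filter(G, [1 -a], uUnvoiced);

% The same difference equation written out explicitly for the voiced case
sLoop = zeros(N, 1);
for n = 1:N
    acc = G * uVoiced(n);
    for k = 1:length(a)
        if n - k >= 1
            acc = acc + a(k) * sLoop(n - k);
        end
    end
    sLoop(n) = acc;
end
maxDiff = max(abs(sLoop - sVoiced));   % difference between the two computations
```

Writing the denominator as `[1 -a]` is what ties the predictor coefficients of the difference equation to the filter 1/A(z).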

LPC consists of the following steps:

• Pre-emphasis filtering
• Data windowing
• Autocorrelation parameter estimation
• Pitch period and gain estimation
• Quantization
• Decoding and frame interpolation

Pre-emphasis Filtering

• When we speak, the speech signal experiences some spectral roll-off due to the radiation effects of the sound from the mouth.

• As a result, the majority of the spectral energy is concentrated in the lower frequencies.

• To have our model give equal weight to both low and high frequencies, we need to apply a high-pass filter to the original signal.

• This is done with a one-zero filter, called the pre-emphasis filter. The filter has the form:

y[n] = x[n] - a x[n-1]

Most standards use a = 15/16 = 0.9375 (our default).

When we decode the speech, the last thing we do to each frame is to pass it through a de-emphasis filter to undo this effect.

Matlab:

    speech = filter([1 -preemp], 1, data)'; % Pre-emphasize speech
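As a small check of the pre-emphasis and de-emphasis pair, the sketch below applies the one-zero filter and then its inverse (a one-pole filter with the same coefficient) to a placeholder signal. The random `data` vector and the variable names `restored` and `maxErr` are assumptions made for this illustration; only `preemp` and the `filter` call follow the line above.

```matlab
preemp = 15/16;                               % pre-emphasis coefficient a
data = randn(1, 400);                         % placeholder signal, not real speech

% Pre-emphasis: y[n] = x[n] - a*x[n-1]
speech = filter([1 -preemp], 1, data);

% De-emphasis: the inverse filter, x[n] = y[n] + a*x[n-1]
restored = filter(1, [1 -preemp], speech);

maxErr = max(abs(restored - data));           % should be at rounding-error level
secondSample = data(2) - preemp * data(1);    % same value as speech(2)
```

Because the de-emphasis pole sits at a < 1, the inverse filter is stable and the round trip returns the input up to floating-point rounding.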

Data Windowing

Because speech signals vary with time, this process is done on short chunks of the speech signal, which we call frames. Usually 30 to 50 ms frames give intelligible speech with good compression.

• For implementation in this project we use overlapping data frames to avoid discontinuities in the model. We used a frame width of 30 ms and an overlap of 10 ms.

• A Hamming window was used to extract the frames.
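The framing described above might look like the following sketch. The 8 kHz sampling rate and the random placeholder signal are assumptions, and the Hamming window is written out from its formula so that the example does not depend on a toolbox function.

```matlab
fs = 8000;                                  % assumed sampling rate in Hz
frameLen = round(0.030 * fs);               % 30 ms frame width -> 240 samples
overlap  = round(0.010 * fs);               % 10 ms overlap     -> 80 samples
hop = frameLen - overlap;                   % advance between frame starts

x = randn(fs, 1);                           % 1 s of placeholder signal

% Hamming window from its definition
win = 0.54 - 0.46 * cos(2*pi*(0:frameLen-1)' / (frameLen-1));

% Cut the signal into overlapping, windowed frames (one frame per column)
nFrames = floor((length(x) - frameLen) / hop) + 1;
frames = zeros(frameLen, nFrames);
for i = 1:nFrames
    start = (i-1)*hop + 1;
    frames(:, i) = x(start:start+frameLen-1) .* win;
end
```

Each column of `frames` is then analyzed independently; samples left over at the end of the signal that do not fill a whole frame are simply dropped in this sketch.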

Determining Pitch Period

For each frame, we must determine if the speech is voiced or unvoiced. We do this by searching for periodicities in the residual (prediction error) signal.

To determine if the frame is voiced or unvoiced, we apply a threshold to the autocorrelation. Typically, this threshold is set at Rxx(0) * 0.3.

• If no values of the autocorrelation sequence exceed this threshold, then we declare the frame unvoiced.

• If we have periodicities in the data, there should be spikes which exceed the threshold; in this case we declare the frame voiced.

The distance between spikes in the autocorrelation function is equivalent to the pitch period of the original signal.
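A minimal sketch of this voiced/unvoiced decision is given below. The frame is an idealized impulse train standing in for the residual of a voiced frame, and the lag search range (`minLag`, `maxLag`) is an assumption added so that the zero-lag peak is not mistaken for a pitch spike; neither comes from the project code.

```matlab
N = 240;                                  % frame length in samples
truePeriod = 80;                          % period of the placeholder frame
frame = zeros(N, 1);
frame(1:truePeriod:N) = 1;                % stand-in for a voiced residual

minLag = 20;                              % assumed shortest pitch period
maxLag = 160;                             % assumed longest pitch period

% Autocorrelation for lags 0..maxLag (R(1) holds lag 0)
R = zeros(maxLag + 1, 1);
for lag = 0:maxLag
    R(lag + 1) = sum(frame(1:N-lag) .* frame(1+lag:N));
end

threshold = 0.3 * R(1);                   % Rxx(0) * 0.3
[peak, idx] = max(R(minLag+1:maxLag+1));  % largest spike in the search range

if peak > threshold
    voiced = true;
    pitchPeriod = minLag + idx - 1;       % lag of the spike, in samples
else
    voiced = false;
    pitchPeriod = 0;
end
```

For real speech the spikes are broader and noisier than in this idealized frame, so the choice of search range and threshold matters more than it does here.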

LPC analyzes the speech signal by:

• Estimating the formants
• Removing their effects from the speech signal
• Estimating the intensity and frequency of the remaining signal.

The process of removing the formants is called inverse filtering, and the remaining signal is called the residue.

LPC synthesizes the speech signal by reversing the process:

• Use the residue to create a source signal
• Use the formants to create a filter (which represents the tube/tract)
• Run the source through the filter, resulting in speech.
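The analysis and synthesis steps above can be illustrated with a short sketch. The predictor coefficients `a` and the excitation are hypothetical; in a real coder the coefficients come from the LPC analysis and the residue is replaced by a coded approximation, so the reconstruction is not exact as it is here.

```matlab
a = [1.3 -0.8];                          % assumed predictor coefficients
N = 240;
excitation = randn(N, 1);                % placeholder source signal

% A "speech" frame produced by the all-pole model
s = filter(1, [1 -a], excitation);

% Analysis: inverse filtering with A(z) removes the formant structure
residue = filter([1 -a], 1, s);

% Synthesis: run the residue back through 1/A(z)
sRec = filter(1, [1 -a], residue);

residueErr = max(abs(residue - excitation));   % residue matches the source
synthErr   = max(abs(sRec - s));               % synthesis undoes the analysis
```

With the unquantized residue the two filters cancel; the quality loss in LPC comes from replacing the residue with a simple impulse train or noise source.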

Estimating the Formants

The coefficients of the difference equation (the prediction coefficients) characterize the formants. The LPC system needs to estimate these coefficients, which is done by minimizing the mean-square error between the predicted signal and the actual signal.
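To show how the prediction coefficients characterize a formant, the sketch below reads a resonance frequency and bandwidth off the poles of 1/A(z). The coefficients and the 8 kHz sampling rate are hypothetical, and the bandwidth expression is the usual approximation based on the pole radius.

```matlab
fs = 8000;                               % assumed sampling rate in Hz
a = [1.3 -0.8];                          % assumed predictor coefficients

% Poles of the synthesis filter 1/A(z), with A(z) = 1 - a_1 z^-1 - a_2 z^-2
poles = roots([1 -a]);
poles = poles(imag(poles) > 0);          % keep one pole of each conjugate pair

formantHz   = angle(poles) * fs / (2*pi);      % resonance frequency
bandwidthHz = -log(abs(poles)) * fs / pi;      % approximate 3 dB bandwidth
```

For these example coefficients the single resonance falls just under 1 kHz; a realistic model order of 10 or more yields several pole pairs, one per candidate formant.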

CELP (Code Excited Linear Predictor)

A CELP coder does the same LPC modeling but then computes the errors between the original speech and the synthetic model, and transmits both the model parameters and a very compressed representation of the errors. The compressed representation is an index into a codebook shared between coders and decoders, which is why it is called "Code Excited". A CELP coder does much more work than an LPC coder (usually about an order of magnitude more), but the result is much higher quality speech.

Block diagram of the CELP

The perceptual weighting filter is defined as:

$$W(z) = \frac{A(z)}{A(z/r)}, \qquad 0 < r < 1$$

This filter is used to de-emphasize the frequency regions that correspond to the formants as determined by LPC analysis. Noise in the formant regions is partly masked by the speech itself, so the coder can tolerate more error there and reduce the noise at frequencies where it is more perceptibly disturbing.

The de-emphasis is controlled by the factor r.
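A sketch of how such a weighting filter could be built, assuming the common form W(z) = A(z)/A(z/r): replacing z by z/r multiplies the k-th coefficient of A(z) by r^k. The coefficient vector `A`, the value of `r`, and the error signal are hypothetical.

```matlab
A = [1 -1.2 0.8];                        % assumed A(z) coefficients, A(1) = 1
r = 0.8;                                 % assumed weighting factor, 0 < r < 1

% Coefficients of A(z/r): the k-th coefficient is scaled by r^k
Aw = A .* (r .^ (0:length(A)-1));

% Apply W(z) = A(z) / A(z/r) to a placeholder error signal
errSig = randn(100, 1);
weightedErr = filter(A, Aw, errSig);
```

Scaling by r^k pulls the poles of 1/A(z/r) toward the origin, which broadens the formant peaks of the weighting response; values of r closer to 1 make W(z) closer to a flat response.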

After determining the formant synthesis filter 1/A(z), the pitch synthesis filter 1/P(z), and the encoding data rate, we can do an excitation codebook search. The codebook search is performed in the subframes of an LPC frame. The subframe length is usually equal to or shorter than the pitch subframe length.
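The codebook search can be sketched as an exhaustive analysis-by-synthesis loop. This is a simplified illustration under stated assumptions: a random Gaussian codebook, a hypothetical first-order synthesis filter, no pitch filter and no perceptual weighting. The target is constructed from a known codebook entry so that the search has a definite answer.

```matlab
subLen = 40;                             % assumed subframe length in samples
nCodes = 64;                             % assumed codebook size
codebook = randn(subLen, nCodes);        % placeholder stochastic codebook
A = [1 -0.9];                            % assumed synthesis filter denominator

% Build a target from a known entry so the result can be checked
trueIdx = 17;
trueGain = 0.5;
target = trueGain * filter(1, A, codebook(:, trueIdx));

bestErr = inf;
bestIdx = 0;
bestGain = 0;
for c = 1:nCodes
    y = filter(1, A, codebook(:, c));    % synthesize the candidate excitation
    g = (target' * y) / (y' * y);        % gain minimizing the squared error
    e = sum((target - g*y).^2);          % squared error for this entry
    if e < bestErr
        bestErr = e;
        bestIdx = c;
        bestGain = g;
    end
end
```

Only `bestIdx` and a quantized `bestGain` would need to be transmitted, since the decoder holds the same codebook. The cost of filtering every entry for every subframe is where the extra work of CELP over plain LPC comes from.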

The autocorrelation method assumes that the signal is identically zero outside the analysis interval (0 <= m <= N-1). Then it tries to minimize the prediction error wherever it is nonzero, that is, in the interval 0 <= m <= N-1+p, where p is the order of the model used. The error is likely to be large at the beginning and at the end of this interval. This is the reason why the speech segment analyzed is usually tapered by the application of a Hamming window.

Autocorrelation Parameter Estimation

Given the LPC model of equation (1), our goal is to find the predictor coefficients a_k that minimize the squared prediction error over a short segment of speech. The mean short-time prediction error per frame is the sum, over the frame, of the squared differences between each sample and its linear prediction.

To minimize this error we take the derivative with respect to each coefficient and set it to zero. This results in a set of p linear equations in the coefficients.
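The standard least-squares derivation behind this step is sketched below, assuming the notation of the LPC model (frame samples s(m), predictor coefficients a_k, order p) and a windowed frame that is zero outside 0 <= m <= N-1. The symbols e(m), E, and R(i) are introduced here for illustration.

```latex
\begin{aligned}
e(m) &= s(m) - \sum_{k=1}^{p} a_k\, s(m-k) \\
E &= \sum_{m} e^{2}(m) \\
\frac{\partial E}{\partial a_i} = 0 \;&\Rightarrow\;
  \sum_{m} s(m-i)\, s(m) = \sum_{k=1}^{p} a_k \sum_{m} s(m-i)\, s(m-k),
  \qquad i = 1, \dots, p \\
R(i) &= \sum_{m=0}^{N-1-i} s(m)\, s(m+i) \\
\sum_{k=1}^{p} a_k\, R(|i-k|) &= R(i), \qquad i = 1, \dots, p
\end{aligned}
```

In matrix form the last line is Ra = r, where the matrix R has entries R(|i-k|) and is therefore symmetric and Toeplitz.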

Finding the Parameters

Letting R(i) denote the short-time autocorrelation of the frame at lag i, we obtain the linear system Ra = r.

This system is solved using the Levinson-Durbin algorithm, which finds the filter coefficients a_k. The algorithm reduces the cost of the solution from O(n^3) to O(n^2) by exploiting the fact that the matrix R is Toeplitz and Hermitian.

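Matlab:

    % Levinson's method
    err(1) = autoCorVec(1);                 % zeroth-order prediction error, R(0)
    k(1) = 0;
    A = [];                                 % coefficients of A(z) after the leading 1
    for index = 1:L
        numerator = [1 A.'] * autoCorVec(index+1:-1:2);
        denominator = -1 * err(index);
        k(index) = numerator / denominator;             % PARCOR coeffs
        A = [A + k(index)*flipud(A); k(index)];         % order update
        err(index+1) = (1 - k(index)^2) * err(index);   % updated prediction error
    end

    aCoeff(:, nframe) = [1; A];             % A(z) coefficients for this frame
    parcor(:, nframe) = k';                 % reflection coefficients for this frame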
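As a sanity check, the recursion above can be compared with a direct solution of Ra = r. The autocorrelation values below are made up for illustration, and the preallocation of `err` and `k` is added for clarity. Note the sign convention: the recursion returns the coefficients of A(z) = 1 + A_1 z^-1 + ..., so `A` equals minus the predictor coefficients obtained from the direct solve.

```matlab
autoCorVec = [1; 0.5; 0.2; 0.05];        % hypothetical R(0)..R(3), column vector
L = 3;                                   % model order

% Levinson-Durbin recursion, as in the listing above
err = zeros(L+1, 1);
k = zeros(L, 1);
err(1) = autoCorVec(1);
A = [];
for index = 1:L
    numerator = [1 A.'] * autoCorVec(index+1:-1:2);
    k(index) = numerator / (-err(index));
    A = [A + k(index)*flipud(A); k(index)];
    err(index+1) = (1 - k(index)^2) * err(index);
end

% Direct O(n^3) solution of the Toeplitz system R*a = r
R = toeplitz(autoCorVec(1:L));
r = autoCorVec(2:L+1);
aDirect = R \ r;

maxDiff = max(abs(A + aDirect));         % A = -aDirect up to rounding
```

The prediction error `err` shrinks or stays level at each order, and every reflection coefficient in `k` has magnitude below one for a valid autocorrelation sequence.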

Helpful Matlab tools used

    synFrame = filter(1, A', residFrame)

This filters the data in the vector residFrame with the all-pole filter 1/A(z) described by the vector A.

    resid2 = dct(resid);

This returns the discrete cosine transform coefficients of resid. Only the first 50 coefficients are kept, since most of the energy is stored there.

    resid3 = uencode(resid2,4);

This function uniformly quantizes and encodes the data in the vector resid2 into N bits (here N = 4).

    newsignal = udecode(resid,4);

This does the opposite of uencode: it decodes the quantized integer codes back to floating-point values.
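The sketch below strings these calls together on a placeholder residual, assuming the Signal Processing Toolbox functions `dct`, `idct`, `uencode`, and `udecode`. The explicit peak argument is an addition made for this illustration: without it `uencode` treats inputs as lying in [-1, 1] and saturates anything larger. The variable names other than `resid` and `resid2` are hypothetical.

```matlab
resid = filter([1 -0.9], 1, randn(240, 1));   % placeholder residual frame
nKeep = 50;                                   % number of DCT coefficients kept
nBits = 4;                                    % bits per coefficient

resid2 = dct(resid);                          % DCT coefficients of the residual
kept = resid2(1:nKeep);                       % keep only the first 50

peak = max(abs(kept));                        % quantizer range is [-peak, peak]
codes = uencode(kept, nBits, peak);           % integer codes, 2^nBits levels
decoded = udecode(codes, nBits, peak);        % back to floating-point values

step = 2 * peak / 2^nBits;                    % width of one quantizer level
maxQuantErr = max(abs(decoded - kept));

% Rebuild a full-length coefficient vector and invert the transform
resid2hat = zeros(size(resid2));
resid2hat(1:nKeep) = decoded;
residHat = idct(resid2hat);
```

Going from 4 to 8 bits shrinks `step` by a factor of 16, which is the trade-off between bit rate and quantization noise explored in the results.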

Results

It can be seen from the waveforms that the CELP output looks much more like the original signal, and hence CELP is the better method for speech coding. This is confirmed by the log-magnitude spectrum.

The synthesized voice from the linear prediction waveform is peaky and sounds buzzy, since it is based on the autocorrelation method, which loses the absolute phase structure because of its minimum-phase characteristics.

Figure: Male voice. Panels: Original Signal, LPC Signal, CELP Signal.

Figure: Female voice. Panels: Original Signal, LPC Signal, CELP Signal.

Figure: 4-bit encoding and 8-bit encoding.

Drawbacks

The LPC method has inherent errors (quantization) and in most cases does not give an accurate solution.

The tapering effect of the window (a Hamming window was used) also introduces error, since the waveform may not follow the assumed all-pole model. However, the tapering has the advantage that the least-squares error in finding the solution is reduced.

Conclusion

Comparing the original speech against the LPC speech and the CELP speech, in both cases the reconstructed speech has lower quality than the input speech. Both reconstructions sound noisy, with the LPC output being nearly unintelligible: the sound seems to be whispered, with an extensive amount of noise.

The CELP reconstructed speech sounds more spoken and less whispered. In all, the CELP speech sounded closer to the original, though still muffled.

Further investigation

MELP

• The MELP (Mixed-Excitation Linear Predictive) Vocoder is the new 2400 bps Federal Standard speech coder.

• It is robust in difficult background noise environments such as those frequently encountered in commercial and military communication systems.

• It is very efficient in its computational requirements.

• The MELP Vocoder is based on the traditional LPC parametric model, but also includes four additional features. These are mixed-excitation, aperiodic pulses, pulse dispersion, and adaptive spectral enhancement.

The mixed-excitation is implemented using a multi-band mixing model. The primary effect of this multi-band mixed-excitation is to reduce the buzz usually associated with LPC vocoders, especially in broadband acoustic noise.

It requires explicit multi-band decision and source characterization.

