A Comparison of Speech Coding With Linear Predictive Coding (LPC) and Code-Excited Linear Predictor Coding (CELP)
By: Kendall Khodra
Instructor: Dr. Kepuska
Introduction
This project will develop Linear Predictive Coding (LPC) to process a speech signal. The objective is to mitigate the lack of quality of the simple LPC model by using a more complex description of the excitation, Code Excited Linear Prediction (CELP), to process the output of simple LPC.
Background
Linear Predictive Coding (LPC) methods are the most widely used in speech coding, speech synthesis, speech recognition, speaker recognition and verification, and speech storage.
LPC has been considered one of the most powerful techniques for speech analysis. In fact, this technique is the basis of other more recent and sophisticated algorithms that are used for estimating speech parameters, e.g., pitch, formants, spectra, vocal tract and low bit-rate representations of speech.
The basic principle of linear prediction states that speech can be modeled as the output of a linear, time-varying system excited by either periodic pulses or random noise. These two kinds of acoustic sources are called voiced and unvoiced respectively. In this sense, voiced emissions are those generated by the vibration of the vocal cords in the presence of airflow, and unvoiced sounds are those generated when the vocal cords are relaxed.
A. Physical Model
When you speak:
• Air is pushed from your lungs through your vocal tract, and speech comes out of your mouth.
• For voiced sounds, your vocal cords (folds) vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice. Women and young children tend to have high pitch (fast vibration) while adult males tend to have low pitch (slow vibration).
• For certain fricative and plosive (unvoiced) sounds, your vocal cords do not vibrate but remain open.
• The shape of your vocal tract, which changes as you speak, determines the sound that you make.
• The amount of air coming from your lungs determines the loudness of your voice.
B. Mathematical Model
Block diagram of simplified mathematical model for speech production
• The model says that the digital speech signal is the output of a digital filter (called the LPC filter) whose input is either a train of impulses or a white noise sequence.
The relationship between the physical and the mathematical models:
• Vocal tract: H(z) (LPC filter)
• Air: u(n) (innovation)
• Vocal cord vibration: V (voiced)
• Vocal cord vibration period: T (pitch period)
• Fricatives and plosives: UV (unvoiced)
Vocal tract system function:
$$H(z) = \frac{A}{1 - P(z)}$$
where
$$P(z) = \sum_{k=1}^{p} a_k z^{-k}$$
The LPC Model
The LPC method considers a speech sample s(n) at time n and models it as a linear combination of the past p samples plus an excitation term:
$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\, u(n) \qquad (1)$$
where G is the gain and u(n) is the normalized excitation.
The predictor coefficients (the a_k's) are computed by minimizing the sum of squared differences, over a finite interval, between the actual speech samples and the linearly predicted ones (see Autocorrelation Parameter Estimation).
Block diagram of an LPC
In the LPC model the residual (excitation) is approximated during voicing by a quasi-periodic impulse train and during unvoicing by a white noise sequence. This approximation of the residual is then passed through the filter 1/A(z).
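The two excitation types can be sketched in MATLAB as follows. This is only an illustration: the sampling rate, frame length, pitch period T, gain G and the coefficients in A are made-up values, not parameters taken from the project.

```matlab
% Illustrative values (assumptions, not from the project)
fs = 8000;                 % sampling rate in Hz
N  = 240;                  % 30 ms frame at 8 kHz
T  = 80;                   % pitch period in samples (100 Hz pitch)
G  = 1;                    % gain
A  = [1 -1.3 0.8];         % stable example A(z) = 1 - 1.3 z^-1 + 0.8 z^-2

% Voiced excitation: periodic impulse train with period T
voicedExc = zeros(N, 1);
voicedExc(1:T:N) = 1;

% Unvoiced excitation: white noise sequence
unvoicedExc = randn(N, 1);

% Pass either excitation through the all-pole synthesis filter G/A(z)
voicedFrame   = filter(G, A, voicedExc);
unvoicedFrame = filter(G, A, unvoicedExc);
```

A real coder would switch between the two excitations frame by frame, based on the voiced/unvoiced decision.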
LPC consists of the following steps:
• Pre-emphasis Filtering
• Data Windowing
• Autocorrelation Parameter Estimation
• Pitch Period and Gain Estimation
• Quantization
• Decoding and Frame Interpolation
Pre-emphasis Filtering
• When we speak, the speech signal experiences some spectral roll-off due to the radiation effects of the sound from the mouth.
• As a result, the majority of the spectral energy is concentrated in the lower frequencies.
• To have our model give equal weight to both low and high frequencies, we need to apply a high-pass filter to the original signal.
• This is done with a one-zero filter, called the pre-emphasis filter. The filter has the form:
y[n] = x[n] - a x[n-1]
Most standards use a = 15/16 = 0.9375 (our default).
When we decode the speech, the last thing we do to each frame is to pass it through a de-emphasis filter to undo this effect.
Matlab: speech = filter([1 -preemp], 1, data)'; % Preemphasize speech
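As a small check of the pre-emphasis and de-emphasis pair, the sketch below applies the one-zero filter and then its inverse (a one-pole filter) to a few made-up samples. The variable names restored and maxErr are introduced here for illustration only.

```matlab
preemp = 0.9375;                       % a = 15/16, the default quoted above
data   = [1; 2; 3; 2; 1; 0; -1; -2];   % made-up samples for illustration

speech   = filter([1 -preemp], 1, data);   % pre-emphasis: y[n] = x[n] - a*x[n-1]
restored = filter(1, [1 -preemp], speech); % de-emphasis: inverse one-pole filter

maxErr = max(abs(restored - data));    % should be at round-off level
```

Because the de-emphasis filter is the exact inverse of the pre-emphasis filter, the original samples come back up to floating-point round-off.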
Data Windowing
Because speech signals vary with time, the analysis is done on short chunks of the speech signal, which we call frames. Usually, 30 to 50 ms frames give intelligible speech with good compression.
• For implementation in this project we use overlapping data frames to avoid discontinuities in the model. We used a frame width of 30 ms and an overlap of 10 ms.
• A Hamming window was used to extract the frames.
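The framing step might look like the sketch below. The 8 kHz sampling rate and the test tone are assumptions made for the example; the Hamming window is written out explicitly so that no toolbox function is needed.

```matlab
fs       = 8000;                         % assumed sampling rate in Hz
frameLen = round(0.030 * fs);            % 30 ms -> 240 samples
overlap  = round(0.010 * fs);            % 10 ms -> 80 samples
hop      = frameLen - overlap;           % frame advance: 160 samples

speech = sin(2*pi*200*(0:fs-1)'/fs);     % 1 s test tone standing in for speech

% Hamming window written out explicitly
n   = (0:frameLen-1)';
win = 0.54 - 0.46*cos(2*pi*n/(frameLen-1));

% Cut the signal into overlapping, windowed frames (one frame per column)
nFrames = floor((length(speech) - frameLen)/hop) + 1;
frames  = zeros(frameLen, nFrames);
for f = 1:nFrames
    start = (f-1)*hop + 1;
    frames(:, f) = speech(start:start+frameLen-1) .* win;
end
```

With these values each frame shares its last 80 samples with the start of the next one.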
Determining Pitch Period
For each frame, we must determine whether the speech is voiced or unvoiced. We do this by searching for periodicities in the residual (prediction error) signal.
To make the decision, we apply a threshold to the autocorrelation of the residual. Typically, this threshold is set at Rxx(0) * 0.3.
• If no values of the autocorrelation sequence at nonzero lags exceed this threshold, we declare the frame unvoiced.
• If there are periodicities in the data, there should be spikes that exceed the threshold; in this case we declare the frame voiced.
The distance between spikes in the autocorrelation function is equal to the pitch period of the original signal.
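A minimal sketch of this voiced/unvoiced decision is given below, using a synthetic impulse-train residual with a known period. The lag search range (minLag, maxLag) and the choice of the strongest peak as the pitch estimate are assumptions of this example, not details given in the slides.

```matlab
N      = 240;                       % 30 ms frame at an assumed 8 kHz
period = 50;                        % true pitch period of the test signal, in samples
resid  = zeros(N, 1);
resid(1:period:N) = 1;              % synthetic "voiced" residual: impulse train

% Autocorrelation for lags 0..maxLag (R(1) holds lag 0)
maxLag = 160;                       % longest pitch period searched
minLag = 20;                        % shortest pitch period searched
R = zeros(maxLag+1, 1);
for lag = 0:maxLag
    R(lag+1) = sum(resid(1:N-lag) .* resid(1+lag:N));
end

threshold = 0.3 * R(1);             % threshold from the text: 0.3 * Rxx(0)
[peakVal, idx] = max(R(minLag+1:maxLag+1));
if peakVal > threshold
    voiced      = true;
    pitchPeriod = minLag + idx - 1; % lag of the strongest peak, in samples
else
    voiced      = false;
    pitchPeriod = 0;                % convention used here: 0 means unvoiced
end
```

For real speech the peaks are less clean than for an impulse train, so the lag range and the threshold usually need tuning.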
LPC analyzes the speech signal by:
• Estimating the formants
• Removing their effects from the speech signal
• Estimating the intensity and frequency of the remaining signal
The process of removing the formants is called inverse filtering, and the remaining signal is called the residue.
LPC synthesizes the speech signal by reversing the process:
• Use the residue to create a source signal
• Use the formants to create a filter (which represents the tube/tract)
• Run the source through the filter, resulting in speech
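The analysis and synthesis steps above can be illustrated with a toy frame. The coefficients in A and the toy speech frame are assumptions chosen so that the expected residue is known in advance.

```matlab
A = [1 -1.3 0.8];                                % example prediction-error filter A(z) (assumed)
speechFrame = filter(1, A, [1; zeros(99, 1)]);   % toy "speech": impulse response of 1/A(z)

residue  = filter(A, 1, speechFrame);  % analysis: inverse filtering removes the formants
synFrame = filter(1, A, residue);      % synthesis: run the source back through 1/A(z)

maxErr = max(abs(synFrame - speechFrame));       % reconstruction error
```

Here the residue is (up to round-off) a single impulse, and filtering it through 1/A(z) reproduces the frame. In a coder the residue is replaced by a compact approximation, which is where the quality loss comes from.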
Estimating the Formants
The coefficients of the difference equation (the prediction coefficients) characterize the formants. The LPC system needs to estimate these coefficients, which is done by minimizing the mean-square error between the predicted signal and the actual signal.
CELP (Code Excited Linear Predictor)
A CELP coder does the same LPC modeling but then computes the errors between the original speech and the synthetic model, and transmits both the model parameters and a very compressed representation of the errors. The compressed representation is an index into a codebook shared between coders and decoders, which is why it is called "Code Excited". A CELP coder does much more work than an LPC coder (usually about an order of magnitude more), but the result is much higher quality speech.
Block diagram of the CELP
The perceptual weighting filter is defined in terms of a factor r, with 0 < r < 1.
This filter is used to de-emphasize the frequency regions that correspond to the formants as determined by LPC analysis. In this way the noise located in formant regions, which is more perceptibly disturbing, can be reduced.
The amount of de-emphasis is controlled by the factor r.
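A common textbook form of a perceptual weighting filter with a single factor r is shown below. It is given as an assumption, since the text itself does not spell out the definition. A(z) is taken in the convention of the coefficient vector [1; A] used in the Matlab code, i.e. with leading coefficient 1.

```latex
W(z) = \frac{A(z)}{A(z/r)}, \qquad 0 < r < 1,
\qquad A(z) = \sum_{k=0}^{p} a_k z^{-k},\; a_0 = 1,
\qquad A(z/r) = \sum_{k=0}^{p} a_k r^{k} z^{-k}
```

In this form the zeros of W(z) sit at the formant poles of 1/A(z), and its poles are the same roots pulled toward the origin by r, so a smaller r gives stronger de-emphasis of the formant regions.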
After determining the formant synthesis filter 1/A(z), the pitch synthesis filter 1/P(z), and the encoding data rate, we can do an excitation codebook search. The codebook search is performed in the subframes of an LPC frame. The subframe length is usually equal to or shorter than the pitch subframe length.
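The idea of an excitation codebook search can be sketched as follows. This is a heavily simplified illustration: the tiny pulse codebook, the subframe length and the coefficients in A are made up, and the pitch synthesis filter and perceptual weighting are left out entirely.

```matlab
A      = [1 -1.3 0.8];            % example formant filter coefficients (assumed)
subLen = 40;                      % subframe length in samples (assumed)
nCodes = 8;                       % tiny codebook for illustration

% Made-up deterministic codebook: one unit pulse at a different position per entry
codebook = zeros(subLen, nCodes);
for c = 1:nCodes
    codebook(5*c - 4, c) = 1;
end

% Target built from entry 3 with gain 0.5, so the search has a known answer
target = filter(1, A, 0.5 * codebook(:, 3));

bestErr = inf; bestIdx = 0; bestGain = 0;
for c = 1:nCodes
    synth = filter(1, A, codebook(:, c));         % candidate through 1/A(z)
    gain  = (synth' * target) / (synth' * synth); % least-squares optimal gain
    err   = sum((target - gain * synth).^2);      % squared error for this entry
    if err < bestErr
        bestErr = err; bestIdx = c; bestGain = gain;
    end
end
```

Only bestIdx and a quantized version of bestGain would need to be transmitted, since the decoder holds the same codebook.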
The autocorrelation method assumes that the signal is identically zero outside the analysis interval (0 <= m <= N-1). It then tries to minimize the prediction error wherever it is nonzero, that is, in the interval 0 <= m <= N-1+p, where p is the order of the model used. The error is likely to be large at the beginning and at the end of this interval. This is the reason why the speech segment analyzed is usually tapered by the application of a Hamming window.
Autocorrelation Parameter Estimation
Our goal is to find the predictor coefficients a_k that minimize the squared prediction error over a short segment of speech. The mean short-time prediction error is defined per frame.
To minimize this error, we take its derivative with respect to each coefficient and set it to zero.
Finding the Parameters
Expressing the resulting equations in terms of the autocorrelation of the frame gives the linear system Ra = r.
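The minimization can be written out as follows. The notation is the usual textbook one and is an assumption here: s(m) is the windowed frame (zero outside 0 <= m <= N-1), a_k are the predictor coefficients, p is the model order and R(i) is the short-time autocorrelation.

```latex
\begin{aligned}
e(m) &= s(m) - \sum_{k=1}^{p} a_k\, s(m-k) \\
E &= \sum_{m} e^{2}(m) = \sum_{m} \Big[ s(m) - \sum_{k=1}^{p} a_k\, s(m-k) \Big]^{2} \\
\frac{\partial E}{\partial a_i} = 0 \;&\Rightarrow\;
  \sum_{k=1}^{p} a_k \sum_{m} s(m-i)\, s(m-k) = \sum_{m} s(m)\, s(m-i),
  \qquad i = 1, \dots, p \\
R(i) = \sum_{m} s(m)\, s(m+i) \;&\Rightarrow\;
  \sum_{k=1}^{p} a_k\, R(|i-k|) = R(i),
  \qquad i = 1, \dots, p
\end{aligned}
```

The last line is a system of p linear equations whose matrix has the entry R(|i-k|) in row i and column k, so every diagonal is constant.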
This equation is solved using the Levinson-Durbin algorithm.
This algorithm is used to find the filter coefficients a_i from the system Ra = r. The Levinson-Durbin algorithm reduces the cost of solving the problem to O(n^2) instead of O(n^3) by exploiting the fact that the matrix R is Hermitian Toeplitz.
Matlab
% Levinson's method
err(1) = autoCorVec(1);                   % zeroth-order prediction error
k(1) = 0;
A = [];
for index=1:L
    numerator = [1 A.']*autoCorVec(index+1:-1:2);
    denominator = -1*err(index);
    k(index) = numerator/denominator;     % PARCOR coeffs
    A = [A+k(index)*flipud(A); k(index)];
    err(index+1) = (1-k(index)^2)*err(index);
end
aCoeff(:,nframe) = [1; A];
parcor(:,nframe) = k';
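As a sanity check, the recursion can be compared against a direct solution of Ra = r on a small example. The autocorrelation values below are made up, and the comparison assumes the sign convention of the code above, in which A holds the coefficients of A(z) = 1 + A(1) z^-1 + ..., so that the predictor coefficients are a = -A.

```matlab
% Made-up autocorrelation values R(0..L) of a frame (column vector)
autoCorVec = [1.0; 0.8; 0.5; 0.2];
L = length(autoCorVec) - 1;          % predictor order

% Levinson-Durbin recursion, same structure as the code above
err = autoCorVec(1);
A   = [];
k   = zeros(L, 1);
for index = 1:L
    numerator = [1 A.'] * autoCorVec(index+1:-1:2);
    k(index)  = -numerator / err;                 % PARCOR coefficient
    A         = [A + k(index)*flipud(A); k(index)];
    err       = (1 - k(index)^2) * err;           % updated prediction error
end

% Direct O(n^3) solution of R*a = r for comparison
Rmat    = toeplitz(autoCorVec(1:L));
aDirect = Rmat \ autoCorVec(2:L+1);

maxDiff = max(abs(-A - aDirect));    % a = -A under this sign convention
```

Both routes give the same coefficients up to round-off; the recursion simply gets there with fewer operations and yields the PARCOR coefficients as a by-product.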
Helpful Matlab tools used
synFrame = filter(1, A', residFrame)
This filters the data in vector residFrame with the all-pole filter described by vector A.
resid2 = dct(resid);
This returns the discrete cosine transform coefficients of resid. Only the first 50 coefficients are kept, since most of the energy is concentrated there.
resid3 = uencode(resid2,4);
This function uniformly quantizes and encodes the data in the vector resid2 into N bits (here N = 4).
newsignal = udecode(resid,4);
This does the opposite of uencode: it decodes the quantized integers in resid back into sample values.
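The idea behind uniform N-bit quantization can be sketched by hand as below. This is not a reimplementation of uencode and udecode, whose exact level placement may differ. It only illustrates the principle on made-up values assumed to be already scaled to the range [-1, 1].

```matlab
nBits  = 4;                              % bits per coefficient, as in the calls above
levels = 2^nBits;                        % 16 quantization levels
step   = 2 / levels;                     % step size for inputs in [-1, 1]

x = [-1; -0.4; 0; 0.33; 0.9; 1];         % made-up coefficients, scaled to [-1, 1]

% Encode: map each value to an integer index 0..levels-1
idx = min(floor((x + 1) / step), levels - 1);

% Decode: map each index back to the centre of its quantization cell
xq = -1 + (idx + 0.5) * step;

maxErr = max(abs(xq - x));               % bounded by step/2
```

Going from 4 to 8 bits divides the step size, and with it the worst-case quantization error, by 16.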
Results
It can be seen from the waveforms that the CELP output looks much more like the original signal, and hence CELP is the better method for speech coding. The log-magnitude spectrum shows the same thing.
The waveform synthesized by linear prediction is peaky and sounds buzzy, since it is based on the autocorrelation method, which loses the absolute phase structure because of its minimum-phase characteristics.
Male Voice: Original Signal, LPC Signal, CELP Signal (4 bits encoding, 8 bits encoding)
Female Voice: Original Signal, LPC Signal, CELP Signal (4 bits encoding, 8 bits encoding)
Drawbacks
The LPC method has inherent errors (quantization) and in most cases does not give an accurate solution.
The tapering effect of the window (a Hamming window was used) also introduces error, since the waveform may not follow the assumed all-pole model. However, tapering has the advantage of reducing the least-squares error in finding the solution.
Conclusion
Comparing the original speech against the LPC speech and the CELP speech shows that, in both cases, the reconstructed speech has lower quality than the input speech. Both reconstructed signals sound noisy, with the LPC output being nearly unintelligible. The sound seems to be whispered, with an extensive amount of noise.
The CELP reconstructed speech sounds more spoken and less whispered. In all, the CELP speech sounded closer to the original, although still muffled.
Further investigation
MELP
• The MELP (Mixed-Excitation Linear Predictive) vocoder is the new 2400 bps Federal Standard speech coder.
• It is robust in difficult background noise environments such as those frequently encountered in commercial and military communication systems.
• It is very efficient in its computational requirements.
• The MELP vocoder is based on the traditional LPC parametric model, but also includes four additional features: mixed excitation, aperiodic pulses, pulse dispersion, and adaptive spectral enhancement.
• The mixed excitation is implemented using a multi-band mixing model. The primary effect of this multi-band mixed excitation is to reduce the buzz usually associated with LPC vocoders, especially in broadband acoustic noise.
• It requires explicit multi-band decisions and source characterization.
References:
[1] J. L. Flanagan and L. R. Rabiner, Speech Synthesis, Dowden, Hutchinson & Ross, Inc., Stroudsburg, Pennsylvania, 1973.
[2] Z. Li and M. Drew, Fundamentals of Multimedia, Prentice Hall (October 22, 2003).
[3] Atlanta Signal Processors, Inc., The New 2400 bps Federal Standard Speech Coder (http://www.aspi.com/tech/specs/pdfs/melp.pdf)