A New speech compression method
ADAPTIVE ALGORITHM FOR SPEECH COMPRESSION USING COSINE PACKET TRANSFORM

BY
K.RAJITHA REDDY
Department of Electronics & Communications Engineering
SREENIDHI INSTITUTE OF SCIENCE & TECHNOLOGY
Yamnampet, Ghatkesar, Hyderabad-501 301
ABSTRACT
This paper presents a new adaptive algorithm for speech compression, the act of transforming the speech signal into a more compact form that requires less storage space and less bandwidth. To meet this goal, different methods for compression have been designed and developed. The traditional speech compression method is based on linear prediction. The compression method proposed in this paper is instead based on the use of an orthogonal transform, the discrete cosine packet transform.

The proposed algorithm uses packet decomposition, which reduces the computational complexity of the system. One of the most important stages of the proposed compression algorithm is the threshold detector, whose main role is to nullify all coefficients obtained from the Cosine Packet Transform that are smaller than a threshold value. This is, in fact, the compression mechanism.
This paper compares the compression ratios of methods using the Wavelet Transform, the Cosine Transform, the Wavelet Packet Transform and the proposed adaptive algorithm using the Cosine Packet Transform for different speech signal samples. The mean compression ratio is calculated for all the methods and compared. The implementation results show that the proposed compression algorithm gives the best performance for speech signals.
ORGANIZATION OF THE REPORT
1. OVERVIEW
2. SPEECH PRODUCTION MECHANISM
3. ORTHOGONAL TRANSFORMS
4. SELECTION OF BEST PACKETS
5. SPEECH COMPRESSION ALGORITHM
6. RESULTS AND DISCUSSION
7. APPLICATIONS OF SPEECH COMPRESSION
8. CONCLUSION
9. REFERENCES
TABLE OF CONTENTS
ABSTRACT (iv)
LIST OF FIGURES (viii)
LIST OF TABLES (ix)
1. OVERVIEW
1.1 INTRODUCTION……………………………………………………...2
1.1.1 BRIEF HISTORY OF COMPRESSION………………………...2
1.1.2 SPEECH COMPRESSION………………………………………3
1.2 AIM OF THE REPORT………………………………………………..4
1.3 METHODOLOGY……………………………………………………..5
1.4 SIGNIFICANCE……………………………………………………….5
1.5 ORGANISATION OF THE REPORT………………………………...6
2. SPEECH PRODUCTION MECHANISM
2.1 SOUNDS, VOICE AND SPEECH……………………………………8
2.2 CHARACTERISTICS/ FEATURES OF SPEECH…………………...8
2.3 PROCESS OF SPEECH PRODUCTION IN HUMAN BEINGS…….9
2.4 SOURCE FILTER THEORY………………………………………...13
2.5 TIME AND FREQUENCY DOMAIN REPRESENTATION……….17
3. ORTHOGONAL TRANSFORMS
3.1 DISCRETE COSINE TRANSFORM………………………………...22
3.2 DISCRETE WAVELET TRANSFORM……………………………..25
3.3 DISCRETE WAVELET PACKET TRANSFORM………………….25
3.4 DISCRETE COSINE PACKET TRANSFORM……………………..25
4. SELECTING THE BEST TRANSFORM 29
5. SPEECH COMPRESSION ALGORITHM
5.1 PROPOSED ALGORITHM………………………………………….33
5.1.1 FLOW DIAGRAM…………………………………………….34
5.2 ADAPTIVE THRESHOLD DETECTOR……………………………36
5.2.1 FLOW DIAGRAM FOR THRESHOLD DETECTOR………..39
6. RESULTS AND DISCUSSION 41
7. APPLICATIONS OF SPEECH COMPRESSION 47
8. CONCLUSION 49
9. BIBLIOGRAPHY 51
LIST OF FIGURES
FIGURE 1: HUMAN VOCAL ORGANS………………………………...11
FIGURE 2: SCHEMATIC REPRESENTATION OF THE COMPLETE
SPEECH PRODUCTION MECHANISM…………………....12
FIGURE 3: SOURCE FILTER THEORY OF SPEECH…………………..13
FIGURE 4: SOURCE-FILTER MODEL OF SPEECH…………..............15
FIGURE 5: SOURCE FILTER THEORY REPRESENTATION
OF SPEECH PRODUCTION………………………………...16
FIGURE 6: THE TIME AND FREQUENCY DOMAIN
REPRESENTATION OF VOWELS (/a/, /i/, & /u/)………....18
FIGURE 7: THE TIME DOMAIN AND FREQUENCY DOMAIN
REPRESENTATION OF A WORD…………………………20
FIGURE 8: ONE DIMENSIONAL COSINE BASIS FUNCTION (N=8)...24
FIGURE 9: THE WHITENING EFFECT OF THE DWT………………...26
FIGURE 10: THE WHITENING EFFECT OF THE DWPT……………...27
FIGURE 11: SPEECH SIGNAL SAMPLE………………………………..43
LIST OF TABLES
1. COMPARISON OF COMPRESSION RATIO FOR 20 DIFFERENT SPEECH SAMPLES……………………44
2. BAR CHART REPRESENTATION OF COMPRESSION RATIOS FOR SPEECH SAMPLES (1-10)………45
3. BAR CHART REPRESENTATION OF COMPRESSION RATIOS FOR SPEECH SAMPLES (11-20)……45
4. MEAN COMPRESSION RATIOS FOR 20 DIFFERENT SPEECH SAMPLES…………………………….46
1. OVERVIEW
1.1 INTRODUCTION
1.1.1 BRIEF HISTORY OF COMPRESSION
The primary purpose of early computers was computing. With advancements in technology, computers were developed to provide a wide variety of applications. One of these is the representation and storage of non-numeric information in binary code, e.g. audio signals and images. Such applications require longer data representations to preserve quality. This data requires larger storage space and longer access times. Another setback is in data transmission: transmission time is longer since the data being sent is larger. With these disadvantages comes the need for the compression of data.
Data compression refers to the reduction of the number of bits needed to
represent the data used for storage and transmission. The ideal goal of data compression
is to contain the original information in as few bits as possible. Data compression can be
classified as:
Lossless compression
Lossy compression
Lossless compression refers to techniques that guarantee an exact representation of the data after it has gone through the compression-decompression process. This type of compression is well suited for applications where the loss of even minimal information has a tremendous negative effect, such as the storage of text files in databases, where even a minimal loss would have serious consequences.
On the other hand, lossy compression, as the name implies, compresses the data with a certain degree of loss in information content. This technique is better suited for applications where the overall information content is not dramatically diminished by a minimal loss of information. Examples of such applications are image and speech signals. In these applications, the loss of information is evidenced by a loss of signal quality, e.g. loss of sharpness in images and loss of intelligibility in speech.
1.1.2 SPEECH COMPRESSION
One of the most important speech processing methods is compression. This is because the speech signal is highly redundant. With the rapid deployment of speech compression technologies, more and more speech content is stored and transmitted in compressed formats. Speech signals have unique properties that differ from general audio/music signals. First, speech is a more structured signal, and it is band-limited around 4 kHz. These two facts can be exploited through different models and approaches and, in the end, make speech easier to compress.
Today, applications of speech compression involve real time processing in
mobile satellite communications, cellular telephony, internet telephony, audio for
videophones or video teleconferencing systems, among others.
The idea of speech compression is to compress speech signal to take up
less storage space and less bandwidth for transmission. To meet this goal different
methods for compression have been designed and developed. The speech compression is
used in digital telephony, in multimedia and in the security of digital communications.
The speech compression methods are lossy. There are two categories of speech
compression methods:
Methods based on LINEAR PREDICTION
Methods based on ORTHOGONAL TRANSFORMS.
Before the introduction of Packet based transform techniques, audio
coding techniques used DFT (Discrete Fourier transform) and DCT (Discrete Cosine
transform) with window functions such as Rectangular and sine-taper functions.
However, these early coding techniques failed to fulfill the contradictory requirements imposed by high-quality audio coding. For example, with a rectangular window the analysis/synthesis system is critically sampled, i.e., the overall number of transform-domain samples is equal to the number of time-domain samples, but the system suffers from poor frequency resolution and from block effects, which are introduced
after quantization or other manipulation in the frequency domain. Overlapped windows allow better frequency response functions but carry the penalty of additional values in the frequency domain, so the system is no longer critically sampled. The Discrete Cosine Packet Transform is currently the best solution, and it satisfactorily resolves this paradox.
The quality of a compression system can be appreciated with the aid of its rate-distortion function. A compression system is better than another if, at equal distortion, it realizes a higher compression rate. The compression rate can be maximized if a good selection of the orthogonal transform is made.
1.2 AIM OF THE REPORT
To propose a new adaptive algorithm for speech compression using the Cosine Packet Transform.

To show that the proposed compression algorithm gives better performance for speech signals, by comparing the compression ratio of methods using the Wavelet Transform, the Cosine Transform,
the Wavelet Packet Transform and the proposed adaptive algorithm using the Cosine Packet Transform for different speech signal samples.
1.3 METHODOLOGY
The proposed speech compression algorithm makes use of an orthogonal transform, which is used to de-correlate the speech signal. Among the several available orthogonal transforms, the one used in this paper is the DISCRETE COSINE PACKET TRANSFORM.
The proposed algorithm uses packet decomposition, which reduces the computational complexity of the system. One of the most important stages of the proposed compression algorithm is the threshold detector, whose main role is to nullify all coefficients obtained from the Cosine Packet Transform that are smaller than a threshold value. This is, in fact, the compression mechanism.
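As a sketch of this thresholding step (the coefficient values and threshold below are illustrative assumptions, not taken from the report; only the nullification rule comes from the text), coefficients whose magnitude falls below the threshold are set to zero, and only the survivors need to be stored:

```python
import numpy as np

def threshold_coefficients(coeffs, threshold):
    """Nullify every transform coefficient whose magnitude is below the threshold."""
    c = np.asarray(coeffs, dtype=float)
    return np.where(np.abs(c) < threshold, 0.0, c)

# Hypothetical Cosine Packet Transform coefficients, for illustration only.
coeffs = np.array([4.0, 0.2, -3.1, 0.05, 0.8, -0.01, 2.2, 0.3])
kept = threshold_coefficients(coeffs, threshold=0.5)

# Only the non-zero coefficients (plus their positions) must be stored,
# which is the source of the compression.
nonzero = np.count_nonzero(kept)
compression_ratio = coeffs.size / nonzero
```

Raising the threshold zeroes more coefficients and increases the compression ratio, at the cost of more distortion after the inverse transform.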
In order to show that the proposed compression algorithm gives better performance for speech signals, this paper compares the compression ratios of methods using the Wavelet Transform, the Cosine Transform, the Wavelet Packet Transform and the proposed adaptive algorithm using the Cosine Packet Transform for different speech signals.
1.4 SIGNIFICANCE
The compression of speech signals has many practical applications. One
example is in digital cellular technology where many users share the same frequency
bandwidth. Compression allows more users to share the system than otherwise possible.
Another example is in digital voice storage (e.g. answering machines). For a given
memory size, compression allows longer messages to be stored than otherwise.
Historically, digital speech signals are sampled at a rate of 8000 samples/sec, with each sample typically represented by 8 bits (using μ-law). This corresponds to an uncompressed rate of 64 kbps (kbits/sec). With current compression techniques (all of which are lossy), it is possible to reduce the rate to 8 kbps with almost no perceptible loss in quality.
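The bit-rate figures above follow directly from the sampling parameters; a quick check, using only the numbers stated in the text:

```python
# Uncompressed telephone-quality speech: 8000 samples/sec, 8 bits/sample (mu-law).
sample_rate_hz = 8000
bits_per_sample = 8

uncompressed_bps = sample_rate_hz * bits_per_sample  # 64000 bps = 64 kbps
compressed_bps = 8000                                # 8 kbps, per the text

compression_ratio = uncompressed_bps / compressed_bps  # an 8:1 reduction
```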
Although many other speech compression techniques exist, the proposed algorithm makes use of an orthogonal transform (the discrete cosine packet transform), as this achieves a higher compression rate than the alternatives.
Hence, an adaptive algorithm for speech compression using cosine packet
transform achieves higher performance (high compression rate) for speech signals.
1.5 ORGANIZATION OF THE REPORT
This report is organized as follows. The mechanism of speech production, followed by the mathematical model of the speech signal and descriptions of the Discrete Cosine Transform and three other orthogonal transforms (the discrete wavelet transform, the discrete wavelet packet transform and the discrete cosine packet transform), is presented in chapters 1 to 4.
With necessary mathematical modeling, the proposed adaptive algorithm
for speech compression is explained in chapter 5. In chapter 6 the developed algorithm is
tested on various speech signal samples and compared with the Wavelet Transform, the Cosine Transform and the Wavelet Packet Transform. Finally, chapters 7 and 8 conclude the report with some discussion and applications of speech compression.
2. SPEECH PRODUCTION MECHANISM
2.1 SOUNDS, VOICE AND SPEECH
Sound is generally known as the vibrational transmission of mechanical energy that propagates through matter as a wave and can be audibly perceived by a living organism through its sense of hearing. For humans, hearing is limited to frequencies between about 20 Hz and 20,000 Hz. Voice and speech are different facets of it.
Voice (or vocalization) is the sound produced by humans and other vertebrates using the lungs and the vocal folds in the larynx, or voice box. Voice is not always produced as speech, however: infants babble and coo; animals bark, moo, whinny and growl; and adult humans laugh, sing and cry.
Voice is generated by the airflow coming out from lungs as the vocal folds
are brought close together. When the air is pushed past the vocal folds with sufficient
pressure, the vocal folds vibrate, giving out voice. Voice is as unique as a fingerprint. It
helps to define the personality, mood and health.
Humans express thoughts, feelings, and ideas orally to one another
through a series of complex movements that alter and mold the basic tone created by
voice into specific, decodable sounds called speech. Speech is produced by precisely
coordinated muscle actions in the head, neck, chest, and abdomen. Speech development
is a gradual process that requires years of practice. During this process, a child learns how
to regulate these muscles to produce understandable speech. Speech bandwidth is 100 Hz
to 8000 Hz.
2.2 CHARACTERISTICS/ FEATURES OF SPEECH
Sound being a wave, periodicity is important for it to carry information. In
air, the disturbance propagates as the successive compression and decompression (also
called rarefaction) of small regions in the medium. If we generate a pure note and place a
detector (our ear, for example) at a point in the surrounding medium, a distance from the
source, the number of compression-decompression sequences arriving at the detector
during a unit time interval is called the frequency.
The time interval between successive maximal compressions is called the
period. The product of the frequency and the wavelength is the velocity.
Pitch is the perceived fundamental frequency of a sound. While the actual
fundamental frequency can be precisely determined through physical measurement, it
may differ from the perceived pitch because of overtones, or partials, in the sound and
sometimes, the human auditory perception system may also have trouble distinguishing
frequency differences between various types of notes under certain circumstances. Pitch is the auditory attribute of sound according to which sounds can be ordered on a scale from low to high; it can thus be described as the perceived highness or lowness of a sound, determined mainly by the frequency of the speech.
When the highness or lowness of speech is considered, loudness also comes into the picture. Loudness is the perceived volume of the speech signal. Intensity, a related physical measure, is proportional to the square of the amplitude. And finally, the quality of speech refers to the character or distinctive attributes of a sound.
2.3 PROCESS OF SPEECH PRODUCTION IN HUMAN
BEINGS
BIOLOGICAL MECHANISM
Speech is produced by the vocal organs shown in figure 1. The main energy source is the lungs with the diaphragm. When speaking, the airflow is forced through the glottis between the vocal cords and through the larynx to the three main cavities of the vocal tract: the pharynx and the oral and nasal cavities. From the oral and nasal cavities, the airflow exits through the mouth and nose, respectively.
The V-shaped opening between the vocal cords, called the glottis, is the most important sound source in the vocal system. The vocal cords may act in several different ways during speech. Their most important function is to modulate the airflow by rapidly opening and closing, causing a buzzing sound from which vowels and voiced consonants are produced. The fundamental frequency of vibration depends upon mass and tension and is about 110 Hz, 200 Hz and 300 Hz for men, women and children, respectively. The position of the vocal cords differs when rendering different (voiced and unvoiced) consonants.
The pharynx connects the larynx to the oral cavity. It has almost fixed dimensions, but its length may be changed slightly by raising or lowering the larynx at one end and the soft palate at the other. The soft palate also isolates or connects the route from the nasal cavity to the pharynx. At the bottom of the pharynx are the epiglottis and the false vocal cords, which prevent food from reaching the larynx and isolate the esophagus acoustically from the vocal tract. The epiglottis and the false vocal cords are closed during swallowing and open during normal breathing.
FIGURE 1: HUMAN VOCAL ORGANS
FIGURE 2: SCHEMATIC REPRESENTATION OF
THE COMPLETE SPEECH PRODUCTION
MECHANISM
2.4 SOURCE FILTER THEORY
The process of production of speech in human beings that has been
explained in the above section in biological terms can be explained in technical terms
using source filter theory.
FIGURE 3: SOURCE FILTER THEORY OF SPEECH
The source-filter theory describes speech production as a two-stage
process involving the generation of a sound source, with its own spectral shape and
spectral fine structure, which is then shaped or filtered by the resonant properties of the
vocal tract.
Most of the filtering of a source spectrum is carried out by that part of the
vocal tract interior to the sound source. In the case of a glottal source, the filter is the
entire supra-glottal vocal tract. The vocal tract filter always includes some part of the oral
cavity and can also, optionally, include the nasal cavity (depending upon whether the
velum is open or closed).
Sound sources can be either periodic or aperiodic. Glottal sound sources
can be periodic (voiced), aperiodic (whisper and /h/) or mixed (e.g. breathy voice).
Supra-glottal sound sources that are used contrastively in speech are aperiodic (i.e.
random noise) although some trill sounds can resemble periodic sources to some extent.
A voiced glottal source has its own spectrum, which includes spectral fine
structure (harmonics and some noise) and a characteristic spectral slope (sloping
downwards at approximately -12dB/octave).
An aperiodic source (glottal or supra-glottal) has its own spectrum, which
includes spectral fine structure (random spectral components) and a characteristic spectral
slope.
Periodic and aperiodic sources can be generated simultaneously to
produce mixed voiced and aperiodic speech typical of sounds such as voiced fricatives.
In voiced speech the fundamental frequency (perceived as vocal pitch) is a
characteristic of the glottal source acoustics whilst features such as vowel formants are
characteristics of the vocal tract filter (resonances).
The cavities of the supra-glottal vocal tract resonate when excited by
sound energy. The exact resonance characteristics of the vocal tract are a function of
many factors, but the two of the primary importance are the length of the vocal tract and
the shape of the vocal tract, that is, the cross-sectional area profile from the glottis to lips.
Resonance in the vocal tract reinforces the sound energy of the source function at
particular frequencies. These frequencies are variously called resonant frequencies, poles,
or formant frequencies.
In figure 5, we see the effect of passing a model source function with its dense harmonic structure through a model vocal tract producing a vowel. The bottom panel of the figure shows the spectrum of the source function. The middle panel shows the response characteristics of the vocal tract: a relatively smooth, continuous function of frequency with peaks at frequencies corresponding to the resonant frequencies of the vocal tract. The top panel shows the acoustic output of the model, which retains both the fine harmonic structure of the source and the structure introduced by the resonance characteristics of the vocal tract.
FIGURE 4: SOURCE-FILTER MODEL OF SPEECH
FIGURE 5 : SOURCE FILTER THEORY
REPRESENTATION OF SPEECH
PRODUCTION
2.5 TIME AND FREQUENCY DOMAIN REPRESENTATION OF SPEECH SIGNAL
Speech is basically an analog (continuous) signal. It is a set of complicated audio signals, and producing them artificially is rather difficult. Speech is usually classified as voiced or unvoiced, but in some cases it is something between these two.

Voiced speech is characterized as a quasi-periodic deterministic waveform whose general spectral properties are also known. Unvoiced speech is identified when the standardized energy Es of the signal is below a threshold; signals near the threshold are referred to as ambiguous.
The classification of speech signals into voiced and unvoiced segments is determined using a harmonicity measure obtained by evaluating the instantaneous frequency-amplitude spectrum.
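A minimal sketch of an energy-based voiced/unvoiced decision of the kind described above (the frame length, threshold value and test signal are illustrative assumptions, not parameters from the report):

```python
import numpy as np

def classify_frames(signal, frame_len=160, energy_threshold=0.01):
    """Label each frame 'voiced' or 'unvoiced' by comparing its
    normalized (per-sample) short-time energy against a threshold."""
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)
        labels.append("voiced" if energy >= energy_threshold else "unvoiced")
    return labels

# Illustrative test signal: a loud sinusoid (voiced-like) followed by
# low-level noise (unvoiced-like).
t = np.arange(160) / 8000.0
voiced_like = 0.5 * np.sin(2 * np.pi * 120 * t)
rng = np.random.default_rng(0)
unvoiced_like = 0.01 * rng.standard_normal(160)
labels = classify_frames(np.concatenate([voiced_like, unvoiced_like]))
```

A practical classifier would combine this energy test with a harmonicity measure, as the text notes, since low-energy voiced frames exist.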
Voiced sounds consist of a fundamental frequency (F0) and its harmonic components produced by the vocal cords. The vocal tract modifies this excitation signal, introducing formant (pole) and sometimes anti-formant (zero) frequencies. Each formant frequency also has an amplitude and a bandwidth, and it may sometimes be difficult to define these parameters correctly. The fundamental frequency and the formant frequencies are probably the most important concepts in speech processing.
Speech signals of the three vowels (/a/, /i/, and /u/) are presented in time
and frequency domain in the figure.
FIGURE 6: TIME AND FREQUENCY DOMAIN
REPRESENTATION OF VOWELS (/a/, /i/, & /u/).
The fundamental frequency is about 100 Hz in all cases. The formant frequencies F1, F2 and F3 are approximately 600 Hz, 1000 Hz and 2500 Hz for the vowel /a/; 200 Hz, 2300 Hz and 3000 Hz for /i/; and 300 Hz, 600 Hz and 2300 Hz for /u/. The harmonic structure of the excitation can also be perceived from the frequency domain representation.
It can be seen that the first three formants lie inside the normal telephone channel (from 300 Hz to 3400 Hz), so the needed bandwidth is at most about 10 kHz, which leads to a 20 kHz sampling frequency. Even if the fundamental frequency lies outside the telephone channel, the human hearing system is capable of reconstructing it from its harmonics.
Another commonly used method to describe a speech signal is the spectrogram, a time-frequency representation in which amplitude is mapped to intensity. The spectrogram and the time domain representation of a speech signal are presented in figure 7.

In a spectrogram, higher amplitudes are represented by darker levels, so the formant frequencies and their trajectories are easy to perceive. It also helps in understanding the spectral differences between vowels and consonants. Hence the spectrogram is a most powerful tool for the representation of speech signals and for speech research.
FIGURE 7: THE TIME DOMAIN AND FREQUENCY
DOMAIN REPRESENTATION OF A WORD.
3. ORTHOGONAL TRANSFORMS
The role of the orthogonal transform is to de-correlate the speech signal. There are many orthogonal transforms. The transform that realizes the best de-correlation is the Karhunen-Loève transform; unfortunately, there is no fast algorithm for its computation. This is the reason why other orthogonal transforms are used in practice, for example the Fast Fourier Transform or the Discrete Cosine Transform. The following four orthogonal transforms are used in this paper:
DISCRETE COSINE TRANSFORM (DCT)
DISCRETE WAVELET TRANSFORM (DWT)
DISCRETE WAVELET PACKET TRANSFORM (DWPT)
DISCRETE COSINE PACKET TRANSFORM (DCPT)
3.1 DISCRETE COSINE TRANSFORM (DCT)

The Discrete Cosine Transform (DCT) attempts to decorrelate the speech data. After decorrelation, each transform coefficient can be encoded independently without losing compression efficiency.
One-Dimensional DCT
The most common DCT definition of a 1-D sequence f(x) of length N is

    C(u) = α(u) Σ_{x=0}^{N−1} f(x) cos[ π(2x + 1)u / (2N) ]        (1)

for u = 0, 1, 2, …, N − 1. Similarly, the inverse transformation is defined as

    f(x) = Σ_{u=0}^{N−1} α(u) C(u) cos[ π(2x + 1)u / (2N) ]        (2)

for x = 0, 1, 2, …, N − 1. In both equations (1) and (2), α(u) is defined as

    α(u) = √(1/N) for u = 0,   α(u) = √(2/N) for u ≠ 0.            (3)

It is clear from (1) that for u = 0,

    C(0) = √(1/N) Σ_{x=0}^{N−1} f(x).

Thus, the first transform coefficient is (up to a scale factor) the average value of the sample sequence. In the literature, this value is referred to as the DC coefficient; all other transform coefficients are called AC coefficients.

To fix ideas, ignore the f(x) and α(u) components in (1). The plot of

    cos[ π(2x + 1)u / (2N) ]

for N = 8 and varying values of u is shown in the figure. In accordance with the previous observation, the top-left waveform (u = 0) renders a constant (DC) value, whereas all other waveforms (u = 1, 2, …, 7) give waveforms at progressively increasing frequencies. These waveforms are called the cosine basis functions. Note that these basis functions are orthogonal: multiplication of any waveform in the figure with another waveform, followed by a summation over all sample points, yields zero, whereas multiplication of any waveform with itself followed by a summation yields a constant (scalar) value.
If the input sequence has more than N sample points, it can be divided into sub-sequences of length N, and the DCT can be applied to each chunk independently. A very important point to note is that the values of the basis function points do not change from one computation to the next; only the values of f(x) change in each sub-sequence. This is a very important property, since it shows that the basis functions can be pre-computed offline and then multiplied with the sub-sequences. This reduces the number of mathematical operations (multiplications and additions), thereby rendering the computation efficient.
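A direct NumPy sketch of equations (1)-(3): the basis matrix is computed once and reused for every length-N chunk, and its rows can be checked for orthonormality (the signal values below are illustrative):

```python
import numpy as np

def dct_basis(N):
    """Rows are the cosine basis functions of equation (1), scaled by alpha(u)."""
    u = np.arange(N).reshape(-1, 1)
    x = np.arange(N).reshape(1, -1)
    alpha = np.where(u == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return alpha * np.cos(np.pi * (2 * x + 1) * u / (2 * N))

N = 8
B = dct_basis(N)   # pre-computed once, reused for every chunk

f = np.array([1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0])  # example samples
C = B @ f          # forward DCT, equation (1)
f_rec = B.T @ C    # inverse DCT, equation (2)

# Orthonormality: basis-times-basis summation gives the identity matrix,
# so distinct waveforms sum to zero and each waveform with itself to a constant.
gram = B @ B.T
```

Because `B` is orthonormal, `f_rec` reproduces `f`, and `C[0]` equals the scaled average of the samples, i.e. the DC coefficient.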
ONE DIMENSIONAL COSINE BASIS FUNCTIONS (N = 8).
3.2 DISCRETE WAVELET TRANSFORM
3.3 DISCRETE WAVELET PACKET TRANSFORM
3.4 DISCRETE COSINE PACKET TRANSFORM
The discrete wavelet transform (DWT) is any wavelet transform for which the wavelets are discretely sampled. As with other wavelet transforms, a key advantage it has over Fourier transforms is temporal resolution: it captures both frequency and location information.

In the following, a statistical analysis of three discrete transforms is presented: the discrete wavelet transform (DWT), the discrete wavelet packet transform (DWPT) and the discrete cosine packet transform (DCPT). The analysis is based on simulations: the corresponding transform of a realization of a colored noise is computed, and the result is practically a white noise.
The DWT case
The whitening effect of the DWT
The result is presented in the figure. The top panel shows the power spectral density of the input noise: a colored noise, generated by filtering a white noise with a running-average filter with 20 taps. The bottom panel shows the power spectral density obtained by computing the DWT of the realization whose power spectral density is shown in the top panel. Because the envelope of the power spectral density in the bottom panel is practically constant, we can assert that
the output signal is a white noise. So, the DWT asymptotically converges to the Karhunen-Loève transform.
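This experiment can be reproduced in miniature. The sketch below is an assumption-laden stand-in for the original simulation: a plain Haar DWT replaces whatever wavelet the authors used, and lag-1 correlation replaces the full power-spectral-density estimate. White noise is colored with a 20-tap running average, as in the text, and the wavelet coefficients come out far less correlated than the input:

```python
import numpy as np

def haar_dwt(x):
    """Full-depth Haar DWT; returns all detail coefficients plus the
    final approximation, concatenated."""
    x = np.asarray(x, dtype=float)
    out = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / np.sqrt(2.0)
        det = (x[0::2] - x[1::2]) / np.sqrt(2.0)
        out.append(det)
        x = avg
    out.append(x)
    return np.concatenate(out)

def lag1_corr(x):
    """Correlation between each sample and its successor."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

rng = np.random.default_rng(1)
white = rng.standard_normal(4096 + 19)
colored = np.convolve(white, np.ones(20) / 20, mode="valid")  # 20-tap running average
coeffs = haar_dwt(colored[:4096])

# The colored input is strongly correlated sample-to-sample, while the
# DWT coefficients are close to uncorrelated -- the "whitening" effect.
```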
The DWPT case
The figure illustrates the DWPT whitening effect; the experiment already described is repeated for the DWPT. The top panel shows the power spectral density of the input signal, a realization of a colored noise obtained by band-pass filtering a white noise. The bottom panel shows the power spectral density of the signal obtained after computing the DWPT. Because the envelope of this power spectral density is constant almost everywhere, like the power spectral density of a white noise, we can assert that the DWPT also converges to the Karhunen-Loève transform.
The whitening effect of the DWPT.
The DCPT case
The DCPT is a combination of the discrete cosine transform (DCT) with the theory of wavelet packets. For this transform, the support of the signal to be analyzed (considered of length N) is segmented, and the lengths of the segments obtained depend on the number of iterations of the transform performed. At iteration index m the blocks have a length of N/2^m (assuming dyadic segmentation). The signal contained in each segment is transformed using the DCT.

The asymptotic analysis of the DCPT is based on the observation that, as the block length tends to infinity, the DCT asymptotically converges to the Karhunen-Loève transform. If the number of samples of each DCT coefficient sequence, corresponding to a given segment at a given iteration, tends to infinity, then the coefficients of the corresponding DCT converge to the Karhunen-Loève transform. Hence, the DCPT asymptotically converges to the Karhunen-Loève transform.
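A sketch of this dyadic segmentation step (the iteration depth and test signal are illustrative, and the per-block orthonormal DCT here stands in for the lapped cosine bases used in full cosine packet constructions). Since each block transform is orthonormal, the total energy of the coefficients equals that of the signal:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    u = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    alpha = np.where(u == 0, np.sqrt(1.0 / n), np.sqrt(2.0 / n))
    return alpha * np.cos(np.pi * (2 * x + 1) * u / (2 * n))

def cosine_packet_level(signal, m):
    """Split a length-N signal into 2**m blocks of length N / 2**m and
    apply the DCT to each block (one level of the packet segmentation)."""
    signal = np.asarray(signal, dtype=float)
    block = len(signal) // 2**m
    D = dct_matrix(block)                     # one basis, reused for all blocks
    return np.concatenate([D @ signal[i:i + block]
                           for i in range(0, len(signal), block)])

rng = np.random.default_rng(2)
s = rng.standard_normal(64)
coeffs = cosine_packet_level(s, m=2)          # 4 blocks of length 16

# Orthonormal blocks preserve energy (Parseval), so thresholding small
# coefficients discards only a controlled amount of signal energy.
```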
The DCPT whitening effect is presented in Fig. In top is represented the
power spectral density of a colored noise, obtained filtering white noise with the aid of a
running average and in bottom is represented the power spectral density of the signal
obtained after the computation of the DPCT of the signal with the power spectral density
represented in top. The envelope of the power spectral density of the signal represented in
bottom is a good approximation of a constant. Hence the signal with the power spectral
density represented in bottom is practically a white noise.
Hence the whitening effect of the DCPT has been highlighted. Comparing figures 1, 2 and 3, it can be observed that the DCPT has the highest convergence rate to a white noise. It is followed by the DWT; the DWPT has the slowest convergence. Each transform can be used in a compression system because all of them converge to the Karhunen-Loève transform. Taking the convergence speed into account, the best is the DCPT. This transform has another advantage as well: the temporal localization of its analysis filters. It is also recommended for the compression of speech, taking into account the sinusoidal model of speech. This proposition is proved in the following.
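The decorrelating ("whitening") effect discussed above can be checked numerically: a colored noise produced by a running average has strongly correlated neighbouring samples, while its per-segment DCT coefficients are almost uncorrelated. This is a rough sketch, not the paper's figure-based experiment; the segment length and filter length are my choices:

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(1)
white = rng.standard_normal(8192)
# colored noise: running average of white noise, as described in the text
colored = np.convolve(white, np.ones(8) / 8.0, mode='same')

def lag1_corr(v):
    """Correlation between a sequence and itself shifted by one sample."""
    return float(np.corrcoef(v[:-1], v[1:])[0, 1])

# per-segment orthonormal DCT (one cosine-packet level), coefficients concatenated
coeffs = dct(colored.reshape(-1, 512), type=2, norm='ortho', axis=1).ravel()

signal_corr = lag1_corr(colored)   # large: neighbouring samples are correlated
coeff_corr = lag1_corr(coeffs)     # near zero: the transform decorrelates
```

The drop in lag-one correlation is the time-domain counterpart of the flattened power spectral density shown in the paper's figures.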
4. SELECTING THE BEST TRANSFORM
The quality of a compression system can be appreciated with the aid of its rate-distortion function. A compression system is better than another if, at equal distortion, it achieves a higher compression rate. The compression rate can be maximized if a good selection of the parameters is made.
One of the three orthogonal transforms already mentioned, together with its parameters, must be selected, taking into account the fact that the signal to be processed is a speech signal. Each of the three transforms has some parameters. A parameter common to all three is the mother wavelet. This parameter must be selected in accordance with the type of signal to be compressed.
Each spoken word is a sequence of tones having different intensities, frequencies and durations. Every tone is a sinusoidal signal with a specific amplitude, frequency and duration. This is the sinusoidal model of speech. Its mathematical description is a sum of components named partials; each term of this sum is a double-modulated (amplitude- and frequency-modulated) signal. So, the speech signal is not stationary. Usually, however, the speech signal is regarded as a sequence of stationary signals: segmenting the speech signal into blocks with duration below 25 ms, a sequence of stationary signals is obtained. On every such segment the speech model is a sum of constant-amplitude sinusoids of the form Xs(t) = Σl Al·cos(ωl·t + φl).
This decomposition is very similar to a decomposition of the signal Xs(t) into a cosine
packet basis.
The same signal can also be decomposed into an orthogonal wavelet basis, where the basis functions are the wavelets generated by the mother wavelet ψ(t). The compression rate obtained using a specified mother wavelet is higher if the number of non-zero coefficients is smaller. The energy of the signal Xs(t) can be computed, by Parseval's relation, from these coefficients. Because the signal energy is a constant that does not depend on the mother wavelet used, the number of non-zero coefficients is smaller if the magnitude of the non-zero coefficients is higher. This magnitude is higher if the signal Xs(t) and the basis functions are more similar. Following (2), the highest similarity is obtained when the basis functions are elements of a cosine packet. This is the reason why, for the compression of speech signals, the best transform is the DCPT.
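The argument above, that a quasi-stationary sum of tones is represented by few large cosine coefficients, can be illustrated numerically. This is an illustrative sketch; the tone frequencies, amplitudes and the 22.05 kHz rate below are my choices (512 samples at that rate last about 23 ms, satisfying the text's 25 ms constraint):

```python
import numpy as np
from scipy.fft import dct

fs, n = 22050, 512                      # 512 samples at 22.05 kHz ~ 23 ms < 25 ms
t = np.arange(n) / fs
# a quasi-stationary "segment": three partials with fixed amplitudes,
# frequencies and phases (illustrative values, not taken from the paper)
x = (1.00 * np.cos(2 * np.pi * 440.0 * t + 0.3)
     + 0.50 * np.cos(2 * np.pi * 880.0 * t + 1.1)
     + 0.25 * np.cos(2 * np.pi * 1320.0 * t + 2.0))

c = dct(x, type=2, norm='ortho')
energy = np.sort(c ** 2)[::-1]
# how many of the largest coefficients carry 99% of the signal energy
k99 = int(np.searchsorted(np.cumsum(energy), 0.99 * energy.sum())) + 1
```

For such a signal `k99` is a small fraction of the 512 coefficients, which is exactly the sparsity that thresholding later exploits.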
Another parameter of the DCPT is the cost functional used for the selection of the best packet. This transform is an adaptive one: the result of its use in a given application can be optimized through the best-packet selection procedure, a very efficient procedure able to enhance considerably the quality of a given signal processing method. There are several cost functions that can be minimized for the selection of the best cosine packet. The most used is the entropy, but its use does not maximize the compression rate.
The optimal cost functional for compression is the one that minimizes the number Ns of coefficients superior to a given threshold t (which determines the distortion). Using this cost functional, Ns coefficients superior to the threshold t are obtained. This number is minimal because it was obtained using the cost functional appropriate for the selection of the best packet. This is the reason why this cost functional maximizes the compression rate.
Increasing the threshold value t, the number Ns decreases and the compression rate increases. Unfortunately, the reconstruction distortion also increases when t increases. This is why the threshold value t must be controlled to assure the transparency of the compression. Hence, the threshold detector (TD) block must be an adaptive one.
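A minimal sketch of this cost functional, and of a single split-versus-merge decision of the best-packet search (the full search applies this comparison recursively; the helper names are mine, and the cost is exactly the count of coefficients above the threshold t described above):

```python
import numpy as np
from scipy.fft import dct

def cost(coeffs, t):
    """Cost functional for compression: number of coefficients above t."""
    return int(np.sum(np.abs(coeffs) > t))

def best_packet(x, t):
    """One split/merge decision of the best-packet search: keep the whole-segment
    DCT or two half-segment DCTs, whichever leaves fewer coefficients above t."""
    whole = dct(x, type=2, norm='ortho')
    half = len(x) // 2
    left = dct(x[:half], type=2, norm='ortho')
    right = dct(x[half:], type=2, norm='ortho')
    if cost(whole, t) <= cost(left, t) + cost(right, t):
        return 'whole', [whole]
    return 'split', [left, right]
```

A tone spanning the whole segment is cheaper unsplit, while a signal whose frequency changes at mid-segment is cheaper after splitting; minimizing this cost is what maximizes the compression rate.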
Another parameter of the DCPT which must be considered for the
optimization of the compression is its number of iterations.
5. SPEECH COMPRESSION ALGORITHM
5.1 PROPOSED ALGORITHM
The idea of speech compression is to compress the speech signal so that it takes up less storage space and less bandwidth for transmission. To meet this goal, different methods for compression have been designed and developed.
The proposed adaptive algorithm for speech compression using the Cosine Packet Transform is shown in Fig 8.
The speech signal to be compressed is converted into packets of finite duration.
The Discrete Cosine Transform is applied to each packet and the transformed coefficients are computed.
The coefficients are extracted and fed into the adaptive threshold detector, which nullifies the inferior coefficients for better compression.
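The steps above can be sketched end-to-end in Python (a sketch, not the paper's MATLAB implementation; here a fixed threshold t stands in for the adaptive detector described in the next section, and the function names are mine):

```python
import numpy as np
from scipy.fft import dct, idct

def compress(x, seg_len=512, t=0.05):
    """Packet decomposition -> per-packet DCT -> nullify coefficients below t."""
    n_seg = len(x) // seg_len
    packets = np.asarray(x[:n_seg * seg_len], dtype=float).reshape(n_seg, seg_len)
    coeffs = dct(packets, type=2, norm='ortho', axis=1)
    coeffs[np.abs(coeffs) < t] = 0.0     # the compression mechanism
    return coeffs

def decompress(coeffs):
    """Inverse DCT per packet and concatenation."""
    return idct(coeffs, type=2, norm='ortho', axis=1).ravel()

def compression_ratio(coeffs):
    """Total number of coefficients over the number of coefficients kept."""
    return coeffs.size / max(np.count_nonzero(coeffs), 1)
```

On a tonal signal plus weak noise, most coefficients fall below the threshold and are zeroed, so the ratio of total to retained coefficients, the compression ratio reported in the results section, is large while the reconstruction stays close to the original.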
5.1.1 FLOW DIAGRAM FOR PROPOSED ADAPTIVE ALGORITHM
Packet Decomposition → Computation of DCT → Extracting the coefficients (ci) → Adaptive Threshold Detector
FLOW DIAGRAM FOR THE PROPOSED ADAPTIVE ALGORITHM
5.2 ADAPTIVE THRESHOLD DETECTOR
One of the most important processes of the proposed compression algorithm is the threshold detector. The main role of this process is to nullify all the coefficients obtained from the Cosine Packet Transform that are smaller than a threshold value. This is in fact the compression mechanism. This process is an adaptive system, which automatically chooses the threshold value depending on the transform coefficient values and repeats the process until a certain condition is met. The distortion due to compression is measured by the energy of the reconstruction error:
D = Σn (x[n] − x̂[n])²
where x̂[n] denotes the reconstructed signal.
Because the DCPT and its inverse, the IDCPT, are orthogonal transforms, the last relation becomes:
D = Σ ci²
where the sum runs over the nullified coefficients ci.
The value of the threshold t must be chosen such that the following condition is satisfied:
D < α·Ex (α < 1)
where Ex represents the energy of the input signal x[n]. So, to keep the distortion under the value α·Ex, it is sufficient to select a threshold value t ≤ sqrt(α·Ex/N), N being the number of transform coefficients (since the energy of the at most N nullified coefficients is then at most N·t² < α·Ex).
The constant α can be related to the signal-to-noise ratio of the signals x[n] and x̂[n] (which have almost the same energy): since the output signal-to-noise ratio is 10·log10(Ex/D) and D < α·Ex, a lower bound of the signal-to-noise ratio, depending on α, is established:
β = −10 ⋅ log10 α
Taking the relation with equality, a lower bound for the constant α is obtained: α = 10^(−β/10). Using this value, a lower bound tm for the threshold can be obtained too: tm = sqrt(10^(−β/10)·Ex/N). So, starting from the threshold value tm and raising it only while the distortion condition remains satisfied, an output signal-to-noise ratio superior to β is assured. Unfortunately, the exact value of Ex is not known a priori. This is the reason why an adaptive algorithm for the selection of the threshold value is recommended. This algorithm can use the value tm for initialization.
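The relation β = −10·log10 α can be checked with a short worked computation (the 16 dB figure below matches the per-segment SNR quoted in the conclusion; everything else is straightforward arithmetic):

```python
import numpy as np

def alpha_for_snr(beta_db):
    """Invert beta = -10*log10(alpha) to get the admissible alpha."""
    return 10.0 ** (-beta_db / 10.0)

beta = 16.0                      # target output SNR in dB (the paper's 16 dB figure)
alpha = alpha_for_snr(beta)      # roughly 0.0251: at most ~2.5% of Ex may be lost
# any distortion D < alpha * Ex then satisfies 10*log10(Ex / D) > beta
```

In words: guaranteeing a 16 dB output SNR means the energy removed by thresholding must stay below about 2.5 percent of the input energy.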
FLOW DIAGRAM FOR ADAPTIVE THRESHOLD DETECTOR
The flow diagram of the adaptive threshold detector is shown in Fig 9.
The energy of the input signal to be compressed is computed and the value of b is initialized.
The threshold value is calculated using the equation above and is then increased starting from this value.
At every iteration the value Ex is computed. If this value is higher than b, each extracted coefficient is compared with the threshold value t: if it is less than the threshold value, the corresponding coefficient is replaced with zero; otherwise the coefficient value is kept the same.
The proposed adaptive algorithm is stopped when, for the first time, the value Ex becomes smaller than b.
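The loop above can be sketched as follows. This is my reading of the flow, under the assumption that b is the allowed distortion bound α·Ex; the step size and the function name are my choices, not the paper's:

```python
import numpy as np

def adaptive_threshold(coeffs, alpha=10.0 ** (-16.0 / 10.0), step=1.05):
    """Raise the threshold geometrically and keep the last value for which the
    energy that nullification would remove stays below b = alpha * Ex.
    (One possible reading of the detector's flow diagram.)"""
    c = np.abs(np.ravel(coeffs))
    ex = float(np.sum(c ** 2))        # input energy Ex (Parseval: same in both domains)
    b = alpha * ex                    # allowed distortion bound
    t = max(float(c.min()), 1e-12)    # small initial threshold (the role of tm)
    best = t
    while t < c.max():
        d = float(np.sum(c[c < t] ** 2))   # energy removed by nullifying below t
        if d >= b:
            break                          # first violation: stop, keep previous t
        best = t
        t *= step
    return best
```

On a coefficient set with a few large values and many small ones, the returned threshold lands between the two groups, so the small coefficients are nullified while the distortion bound is respected.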
5.2.1 FLOW DIAGRAM FOR ADAPTIVE THRESHOLD DETECTOR
6. RESULTS AND DISCUSSION
Various speech signal samples are simulated using MATLAB. A generated speech signal sample is shown in Fig 3.
The generated speech signal is segmented into 15 packets of 512 samples each (the duration of each block being below 25 ms).
The Discrete Cosine Transform is computed for each packet using the corresponding equation.
The transformed coefficients are extracted for further processing.
The energy of the input signal is computed and the threshold value is calculated using the corresponding equation.
The value of the input energy is compared with b.
If it is higher, every transformed coefficient value is compared with the threshold value and the inferior coefficients are nullified. The new energy of the signal is calculated and compared with b.
If the energy is still higher than b, the above process is repeated with a new threshold value; otherwise the compression process is stopped.
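The segmentation figures quoted above imply a few simple numbers worth making explicit (only the 15-packet and 512-sample values come from the text; the rest is arithmetic):

```python
# Segmentation arithmetic from the simulation description:
packets, seg_len = 15, 512
total_samples = packets * seg_len       # 7680 samples are processed in all
# each 512-sample block must last less than 25 ms, so the sampling rate
# has to exceed 512 / 0.025 = 20480 Hz (the common 22.05 kHz rate qualifies)
min_fs = seg_len / 0.025
```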
In order to show that the proposed adaptive algorithm gives better performance for speech signals, compression is performed for 20 different speech signals using the Discrete Wavelet Transform, the Wavelet Packet Transform and the proposed adaptive algorithm. The compression ratios achieved through these methods are tabulated for the various speech signal samples.
SPEECH SIGNAL SAMPLE
Analyzing the table, the good performance of the proposed adaptive algorithm can be observed. The smallest compression ratio, 11.1433, was obtained on the 3rd sample and the best compression ratio, 68.0507, was obtained on the 20th sample. The proposed algorithm gives the better compression ratio for most of the speech samples.
TABLE 1. COMPARISON OF COMPRESSION RATIOS FOR 20 DIFFERENT SPEECH SAMPLES
The comparison of the compression ratios for speech signal samples 1 to 10 and 11 to 20 is plotted in Fig A and Fig B respectively, for easy understanding.
A. GRAPHICAL REPRESENTATION OF COMPRESSION RATIOS FOR
SPEECH SAMPLES (1-10)
B. GRAPHICAL REPRESENTATION OF COMPRESSION RATIOS FOR SPEECH SAMPLES (11-20)
The analysis of the above figures shows that, out of the 20 signal samples, only 2 samples have a lower compression ratio compared with the WPT method, and a higher one compared with the other two methods. The mean compression ratios for all the methods are computed and tabulated in Table 2.
TABLE 2. MEAN COMPRESSION RATIOS FOR 20 DIFFERENT SPEECH SAMPLES
The analysis of Table 2 shows that the mean compression ratio for the 20 samples achieved using the proposed adaptive algorithm is 28.275. This is a sufficiently high value, taking into account the fact that no lossless compression method was used.
7. APPLICATIONS OF SPEECH COMPRESSION
Speech compression has been, and still is, a major issue in the field of digital speech processing. Speech compression is the act of transforming the speech signal into a more compact form, such that it takes less storage space and less bandwidth for transmission.
Today, applications of speech compression involve real time
processing in mobile satellite communications, cellular
telephony, internet telephony, audio for videophones or video
teleconferencing systems, among others.
Other applications also include storage and synthesis systems used, for example, in voice mail systems, voice memo wristwatches, voice logging recorders and interactive PC software.
Speech compression is used in digital telephony, in multimedia and in the security of digital communications.
Speech compression is required in
Long-distance communications
High-quality speech storage
Message encryption
For example, in digital cellular technology many users need to share the same frequency bandwidth; utilizing speech compression makes it possible for more users to share the available system.
Another example where speech compression is required is digital voice storage: for a fixed amount of memory, compression makes it possible to store longer messages.
8. CONCLUSION
A new compression method based on an adaptive threshold detector is proposed and tested. Its construction is based on the rate-distortion function concept. Taking into account the high redundancy of the speech signal, such a method is very appropriate. The proposed method assures a good-quality reconstruction (on each segment an output signal-to-noise ratio superior to 16 dB, and a mean output signal-to-noise ratio superior to 19 dB). It can be said that this compression method is transparent.
The simulated results show that the proposed algorithm gives a better compression ratio compared with the other methods. Using this method, a mean compression ratio of 28.275 was obtained in the simulations. This value is superior to the mean compression ratios of the other methods.
The proposed compression method is also fast, requiring a relatively small number of multiplications. For example, the computation of the DCPT requires the same number of multiplications as the Fast Fourier Transform of the same signal. This is the reason why the proposed compression method can be implemented on a Digital Signal Processor. The proposed system represents a good alternative to speech compression systems based on linear prediction.