A New speech compression method
ADAPTIVE ALGORITHM FOR SPEECH COMPRESSION USING COSINE PACKET TRANSFORM

BY
K.RAJITHA REDDY
Department of Electronics & Communications Engineering
SREENIDHI INSTITUTE OF SCIENCE & TECHNOLOGY
Yamnampet, Ghatkesar, Hyderabad-501 301
ABSTRACT
This paper presents a new adaptive algorithm for speech compression, the act of transforming the speech signal into a more compact form that requires less storage space and less bandwidth. To meet this goal, different methods for compression have been designed and developed. The traditional speech compression method is based on linear prediction. The compression method proposed in this paper is instead based on the use of an orthogonal transform, the discrete cosine packet transform.

The proposed algorithm uses packet decomposition, which reduces the computational complexity of the system. One of the most important stages of the proposed compression algorithm is the threshold detector, whose main role is to nullify all coefficients obtained from the Cosine Packet Transform that are smaller than a threshold value. This is, in fact, the compression mechanism.
This paper compares the compression ratios of methods using the Wavelet Transform, the Cosine Transform, the Wavelet Packet Transform and the proposed adaptive algorithm using the Cosine Packet Transform for different speech signal samples. The mean compression ratio is calculated for all the methods and compared. The implementation results show that the proposed compression algorithm gives the best performance for speech signals.
ORGANIZATION OF THE REPORT
1. OVERVIEW
2. SPEECH PRODUCTION MECHANISM
3. ORTHOGONAL TRANSFORMS
4. SELECTION OF BEST PACKETS
5. SPEECH COMPRESSION ALGORITHM
6. RESULTS AND DISCUSSION
7. APPLICATIONS OF SPEECH COMPRESSION
8. CONCLUSION
9. REFERENCES
TABLE OF CONTENTS
ABSTRACT (iv)
LIST OF FIGURES (viii)
LIST OF TABLES (ix)
1. OVERVIEW
1.1 INTRODUCTION……………………………………………………...2
1.1.1 BRIEF HISTORY OF COMPRESSION………………………...2
1.1.2 SPEECH COMPRESSION………………………………………3
1.2 AIM OF THE REPORT………………………………………………..4
1.3 METHODOLOGY……………………………………………………..5
1.4 SIGNIFICANCE……………………………………………………….5
1.5 ORGANISATION OF THE REPORT………………………………...6
2. SPEECH PRODUCTION MECHANISM
2.1 SOUNDS, VOICE AND SPEECH……………………………………8
2.2 CHARACTERISTICS/ FEATURES OF SPEECH…………………...8
2.3 PROCESS OF SPEECH PRODUCTION IN HUMAN BEINGS…….9
2.4 SOURCE FILTER THEORY………………………………………...13
2.5 TIME AND FREQUENCY DOMAIN REPRESENTATION……….17
3. ORTHOGONAL TRANSFORMS
3.1 DISCRETE COSINE TRANSFORM………………………………...22
3.2 DISCRETE WAVELET TRANSFORM……………………………..25
3.3 DISCRETE WAVELET PACKET TRANSFORM………………….25
3.4 DISCRETE COSINE PACKET TRANSFORM……………………..25
4. SELECTING THE BEST TRANSFORM 29
5. SPEECH COMPRESSION ALGORITHM
5.1 PROPOSED ALGORITHM………………………………………….33
5.1.1 FLOW DIAGRAM…………………………………………….34
5.2 ADAPTIVE THRESHOLD DETECTOR……………………………36
5.2.1 FLOW DIAGRAM FOR THRESHOLD DETECTOR………..39
6. RESULTS AND DISCUSSION 41
7. APPLICATIONS OF SPEECH COMPRESSION 47
8. CONCLUSION 49
9. BIBLIOGRAPHY 51
LIST OF FIGURES
FIGURE 1: HUMAN VOCAL ORGANS………………………………...11
FIGURE 2: SCHEMATIC REPRESENTATION OF THE COMPLETE
SPEECH PRODUCTION MECHANISM…………………....12
FIGURE 3: SOURCE FILTER THEORY OF SPEECH…………………..13
FIGURE 4: SOURCE-FILTER MODEL OF SPEECH…………..............15
FIGURE 5: SOURCE FILTER THEORY REPRESENTATION
OF SPEECH PRODUCTION………………………………...16
FIGURE 6: THE TIME AND FREQUENCY DOMAIN
REPRESENTATION OF VOWELS (/a/, /i/, & /u/)………....18
FIGURE 7: THE TIME DOMAIN AND FREQUENCY DOMAIN
REPRESENTATION OF A WORD…………………………20
FIGURE 8: ONE DIMENSIONAL COSINE BASIS FUNCTION (N=8)...24
FIGURE 9: THE WHITENING EFFECT OF THE DWT………………...26
FIGURE 10: THE WHITENING EFFECT OF THE DWPT……………...27
FIGURE 11: SPEECH SIGNAL SAMPLE………………………………..43
LIST OF TABLES
1. COMPARISON OF COMPRESSION RATIO FOR 20 DIFFERENT SPEECH SAMPLES……………………44
2. BAR CHART REPRESENTATION OF COMPRESSION RATIOS FOR SPEECH SAMPLES (1-10)………45
3. BAR CHART REPRESENTATION OF COMPRESSION RATIOS FOR SPEECH SAMPLES (11-20)……45
4. MEAN COMPRESSION RATIOS FOR 20 DIFFERENT SPEECH SAMPLES…………………………….46
1. OVERVIEW
1.1 INTRODUCTION
1.1.1 BRIEF HISTORY OF COMPRESSION
The primary purpose of early computers was computing. With advancements in technology, computers were developed to provide a wide variety of applications. One of these is the representation and storage of non-numeric information in binary code, e.g. audio signals and images. Such applications require longer data representations to preserve quality. This data requires larger storage space and longer access times. Another setback is in data transmission: transmission time is longer since the data being sent is larger. With these disadvantages comes the need for the compression of data.
Data compression refers to the reduction of the number of bits needed to
represent the data used for storage and transmission. The ideal goal of data compression
is to contain the original information in as few bits as possible. Data compression can be
classified as:
Lossless compression
Lossy compression
Lossless compression refers to techniques that guarantee an exact representation of the data after it has gone through the compression-decompression process. This type of compression is well suited for applications where the loss of even minimal information has a tremendous negative effect, such as the storage of text files in databases, where even a minimal loss would have serious consequences.
On the other hand, lossy compression, as the name implies, compresses the data with a certain degree of loss in information content. This technique is better suited for applications where the overall information content is not dramatically diminished by a minimal loss of information. Examples of such applications are image and speech signals. In these applications, the loss of information is evidenced by a loss of signal quality, e.g. loss of sharpness in images and loss of intelligibility in speech.
1.1.2 SPEECH COMPRESSION
One of the most important speech processing methods is compression. This is because the speech signal is highly redundant. With the rapid deployment of speech compression technologies, more and more speech content is stored and transmitted in compressed formats. Speech signals have unique properties that differ from general audio/music signals. First, speech is a more structured signal, and it is band-limited around 4 kHz. These two facts can be exploited through different models and approaches and, in the end, make speech easier to compress.
Today, applications of speech compression involve real time processing in
mobile satellite communications, cellular telephony, internet telephony, audio for
videophones or video teleconferencing systems, among others.
The idea of speech compression is to compress speech signal to take up
less storage space and less bandwidth for transmission. To meet this goal different
methods for compression have been designed and developed. The speech compression is
used in digital telephony, in multimedia and in the security of digital communications.
The speech compression methods are lossy. There are two categories of speech
compression methods:
Methods based on LINEAR PREDICTION
Methods based on ORTHOGONAL TRANSFORMS.
Before the introduction of Packet based transform techniques, audio
coding techniques used DFT (Discrete Fourier transform) and DCT (Discrete Cosine
transform) with window functions such as Rectangular and sine-taper functions.
However, these early coding techniques failed to fulfill the contradictory requirements imposed by high-quality audio coding. For example, with a rectangular window the analysis/synthesis system is critically sampled, i.e., the overall number of transform-domain samples is equal to the number of time-domain samples, but the system suffers from poor frequency resolution and from block effects, which are introduced
after quantization or other manipulation in the frequency domain. Overlapped windows allow better frequency response functions but carry the penalty of additional values in the frequency domain, so the system is no longer critically sampled. The Discrete Cosine Packet Transform is currently the best solution, and it satisfactorily resolves this paradox.
The quality of a compression system can be appreciated with the aid of its rate-distortion function. A compression system is better than another if, at equal distortion, it realizes a higher compression rate. The compression rate can be maximized if a good selection of the orthogonal transform is made.
1.2 AIM OF THE REPORT
To propose a new adaptive algorithm for speech compression using the Cosine Packet Transform.

To show that the proposed compression algorithm gives better performance for speech signals, by comparing the compression ratio of methods using the Wavelet Transform, the Cosine Transform,
the Wavelet Packet Transform and the proposed adaptive algorithm using the Cosine Packet Transform for different speech signal samples.
1.3 METHODOLOGY
The proposed speech compression algorithm makes use of an orthogonal transform, which is used to de-correlate the speech signal. Among the several available orthogonal transforms, the one used in this paper is the DISCRETE COSINE PACKET TRANSFORM.
The proposed algorithm uses packet decomposition, which reduces the computational complexity of the system. One of the most important stages of the proposed compression algorithm is the threshold detector, whose main role is to nullify all coefficients obtained from the Cosine Packet Transform that are smaller than a threshold value. This is, in fact, the compression mechanism.
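As a sketch of this thresholding step (the coefficient values and threshold below are illustrative assumptions, not taken from the report; only the nullification rule comes from the text), coefficients whose magnitude falls below the threshold are set to zero, and only the survivors need to be stored:

```python
import numpy as np

def threshold_coefficients(coeffs, threshold):
    """Nullify every transform coefficient whose magnitude is below the threshold."""
    c = np.asarray(coeffs, dtype=float)
    return np.where(np.abs(c) < threshold, 0.0, c)

# Hypothetical Cosine Packet Transform coefficients, for illustration only.
coeffs = np.array([4.0, 0.2, -3.1, 0.05, 0.8, -0.01, 2.2, 0.3])
kept = threshold_coefficients(coeffs, threshold=0.5)

# Only the non-zero coefficients (plus their positions) must be stored,
# which is the source of the compression.
nonzero = np.count_nonzero(kept)
compression_ratio = coeffs.size / nonzero
```

Raising the threshold zeroes more coefficients and increases the compression ratio, at the cost of more distortion after the inverse transform.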
In order to show that the proposed compression algorithm gives better performance for speech signals, this paper compares the compression ratios of methods using the Wavelet Transform, the Cosine Transform, the Wavelet Packet Transform and the proposed adaptive algorithm using the Cosine Packet Transform for different speech signals.
1.4 SIGNIFICANCE
The compression of speech signals has many practical applications. One
example is in digital cellular technology where many users share the same frequency
bandwidth. Compression allows more users to share the system than otherwise possible.
Another example is in digital voice storage (e.g. answering machines). For a given
memory size, compression allows longer messages to be stored than otherwise.
Historically, digital speech signals are sampled at a rate of 8000 samples/sec, with each sample typically represented by 8 bits (using μ-law). This corresponds to an uncompressed rate of 64 kbps (kbits/sec). With current compression techniques (all of which are lossy), it is possible to reduce the rate to 8 kbps with almost no perceptible loss in quality.
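The bit-rate figures above follow directly from the sampling parameters; a quick check, using only the numbers stated in the text:

```python
# Uncompressed telephone-quality speech: 8000 samples/sec, 8 bits/sample (mu-law).
sample_rate_hz = 8000
bits_per_sample = 8

uncompressed_bps = sample_rate_hz * bits_per_sample  # 64000 bps = 64 kbps
compressed_bps = 8000                                # 8 kbps, per the text

compression_ratio = uncompressed_bps / compressed_bps  # an 8:1 reduction
```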
Although many other speech compression techniques exist, the proposed algorithm makes use of an orthogonal transform (the discrete cosine packet transform), as this achieves a higher compression rate than the alternatives.
Hence, an adaptive algorithm for speech compression using cosine packet
transform achieves higher performance (high compression rate) for speech signals.
1.5 ORGANIZATION OF THE REPORT
This report is organized as follows. The mechanism of speech production, followed by the mathematical model of the speech signal and descriptions of the Discrete Cosine Transform and three other orthogonal transforms (the discrete wavelet transform, the discrete wavelet packet transform and the discrete cosine packet transform), is presented in chapters 1 to 4.
With necessary mathematical modeling, the proposed adaptive algorithm
for speech compression is explained in chapter 5. In chapter 6 the developed algorithm is
tested on various speech signal samples and compared with the Wavelet Transform, the Cosine Transform and the Wavelet Packet Transform. Finally, chapters 7 and 8 conclude the report with some discussion and applications of speech compression.
2. SPEECH PRODUCTION MECHANISM
2.1 SOUNDS, VOICE AND SPEECH
Sound is generally known as the vibrational transmission of mechanical energy that propagates through matter as a wave and can be audibly perceived by a living organism through its sense of hearing. For humans, hearing is limited to frequencies between about 20 Hz and 20,000 Hz. Voice and speech are different facets of it.
Voice (or vocalization) is the sound produced by humans and other vertebrates using the lungs and the vocal folds in the larynx, or voice box. Voice is not always produced as speech, however: infants babble and coo; animals bark, moo, whinny and growl; and adult humans laugh, sing and cry.
Voice is generated by the airflow coming out from lungs as the vocal folds
are brought close together. When the air is pushed past the vocal folds with sufficient
pressure, the vocal folds vibrate, giving out voice. Voice is as unique as a fingerprint. It
helps to define the personality, mood and health.
Humans express thoughts, feelings, and ideas orally to one another
through a series of complex movements that alter and mold the basic tone created by
voice into specific, decodable sounds called speech. Speech is produced by precisely
coordinated muscle actions in the head, neck, chest, and abdomen. Speech development
is a gradual process that requires years of practice. During this process, a child learns how
to regulate these muscles to produce understandable speech. Speech bandwidth is 100 Hz
to 8000 Hz.
2.2 CHARACTERISTICS/ FEATURES OF SPEECH
Sound being a wave, periodicity is important for it to carry information. In
air, the disturbance propagates as the successive compression and decompression (also
called rarefaction) of small regions in the medium. If we generate a pure note and place a
detector (our ear, for example) at a point in the surrounding medium, a distance from the
source, the number of compression-decompression sequences arriving at the detector
during a unit time interval is called the frequency.
The time interval between successive maximal compressions is called the
period. The product of the frequency and the wavelength is the velocity.
Pitch is the perceived fundamental frequency of a sound. While the actual
fundamental frequency can be precisely determined through physical measurement, it
may differ from the perceived pitch because of overtones, or partials, in the sound and
sometimes, the human auditory perception system may also have trouble distinguishing
frequency differences between various types of notes under certain circumstances. Pitch is the auditory attribute of sound according to which sounds can be ordered on a scale from low to high; it can thus be described as the perceived highness or lowness of a sound, determined mainly by the frequency of the speech.
When the highness or lowness of speech is considered, loudness also comes into the picture. Loudness is the perceived volume of the speech signal. Intensity, a related physical measure, is proportional to the square of the amplitude. And finally, the quality of speech refers to the character or distinctive attributes of a sound.
2.3 PROCESS OF SPEECH PRODUCTION IN HUMAN
BEINGS
BIOLOGICAL MECHANISM
Speech is produced by the vocal organs shown in figure 1. The main energy source is the lungs with the diaphragm. When speaking, the airflow is forced through the glottis between the vocal cords and through the larynx to the three main cavities of the vocal tract: the pharynx and the oral and nasal cavities. From the oral and nasal cavities, the airflow exits through the mouth and nose, respectively.
The V-shaped opening between the vocal cords, called the glottis, is the most important sound source in the vocal system. The vocal cords may act in several different ways during speech. Their most important function is to modulate the airflow by rapidly opening and closing, causing a buzzing sound from which vowels and voiced consonants are produced. The fundamental frequency of vibration depends upon mass and tension and is about 110 Hz, 200 Hz and 300 Hz for men, women and children, respectively. The position of the vocal cords differs when rendering different (voiced and unvoiced) consonants.
The pharynx connects the larynx to the oral cavity. It has almost fixed dimensions, but its length may be changed slightly by raising or lowering the larynx at one end and the soft palate at the other. The soft palate also isolates or connects the route from the nasal cavity to the pharynx. At the bottom of the pharynx are the epiglottis and the false vocal cords, which prevent food from reaching the larynx and isolate the esophagus acoustically from the vocal tract. The epiglottis and the false vocal cords are closed during swallowing and open during normal breathing.
FIGURE 1: HUMAN VOCAL ORGANS
FIGURE 2: SCHEMATIC REPRESENTATION OF
THE COMPLETE SPEECH PRODUCTION
MECHANISM
2.4 SOURCE FILTER THEORY
The process of production of speech in human beings that has been
explained in the above section in biological terms can be explained in technical terms
using source filter theory.
FIGURE 3: SOURCE FILTER THEORY OF SPEECH
The source-filter theory describes speech production as a two-stage
process involving the generation of a sound source, with its own spectral shape and
spectral fine structure, which is then shaped or filtered by the resonant properties of the
vocal tract.
Most of the filtering of a source spectrum is carried out by that part of the
vocal tract interior to the sound source. In the case of a glottal source, the filter is the
entire supra-glottal vocal tract. The vocal tract filter always includes some part of the oral
cavity and can also, optionally, include the nasal cavity (depending upon whether the
velum is open or closed).
Sound sources can be either periodic or aperiodic. Glottal sound sources
can be periodic (voiced), aperiodic (whisper and /h/) or mixed (e.g. breathy voice).
Supra-glottal sound sources that are used contrastively in speech are aperiodic (i.e.
random noise) although some trill sounds can resemble periodic sources to some extent.
A voiced glottal source has its own spectrum, which includes spectral fine
structure (harmonics and some noise) and a characteristic spectral slope (sloping
downwards at approximately -12dB/octave).
An aperiodic source (glottal or supra-glottal) has its own spectrum, which
includes spectral fine structure (random spectral components) and a characteristic spectral
slope.
Periodic and aperiodic sources can be generated simultaneously to
produce mixed voiced and aperiodic speech typical of sounds such as voiced fricatives.
In voiced speech the fundamental frequency (perceived as vocal pitch) is a
characteristic of the glottal source acoustics whilst features such as vowel formants are
characteristics of the vocal tract filter (resonances).
The cavities of the supra-glottal vocal tract resonate when excited by
sound energy. The exact resonance characteristics of the vocal tract are a function of
many factors, but the two of the primary importance are the length of the vocal tract and
the shape of the vocal tract, that is, the cross-sectional area profile from the glottis to lips.
Resonance in the vocal tract reinforces the sound energy of the source function at
particular frequencies. These frequencies are variously called resonant frequencies, poles,
or formant frequencies.
In figure 5, we see the effect of passing a model source function with its dense harmonic structure through a model vocal tract producing a vowel. The bottom panel of the figure shows the spectrum of the source function. The middle panel shows the response characteristics of the vocal tract: a relatively smooth, continuous function of frequency with peaks at frequencies corresponding to the resonant frequencies of the vocal tract. The top panel shows the acoustic output of the model, which retains both the fine harmonic structure of the source and the structure introduced by the resonance characteristics of the vocal tract.
FIGURE 4: SOURCE-FILTER MODEL OF SPEECH
FIGURE 5 : SOURCE FILTER THEORY
REPRESENTATION OF SPEECH
PRODUCTION
2.5 TIME AND FREQUENCY DOMAIN REPRESENTATION OF SPEECH SIGNAL
Speech is basically an analog (continuous) signal. It is a set of complicated audio signals, and producing them artificially is rather difficult. Speech is usually classified as voiced or unvoiced, but in some cases it is something between these two.

Voiced speech is characterized as a quasi-periodic deterministic waveform whose general spectral properties are also known. Unvoiced speech is identified when the standardized energy Es of the signal is below a threshold; signals near the threshold are referred to as ambiguous.
The classification of speech signals into voiced and unvoiced segments is determined using a harmonicity measure obtained by evaluating the instantaneous frequency-amplitude spectrum.
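A minimal sketch of an energy-based voiced/unvoiced decision of the kind described above (the frame length, threshold value and test signal are illustrative assumptions, not parameters from the report):

```python
import numpy as np

def classify_frames(signal, frame_len=160, energy_threshold=0.01):
    """Label each frame 'voiced' or 'unvoiced' by comparing its
    normalized (per-sample) short-time energy against a threshold."""
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)
        labels.append("voiced" if energy >= energy_threshold else "unvoiced")
    return labels

# Illustrative test signal: a loud sinusoid (voiced-like) followed by
# low-level noise (unvoiced-like).
t = np.arange(160) / 8000.0
voiced_like = 0.5 * np.sin(2 * np.pi * 120 * t)
rng = np.random.default_rng(0)
unvoiced_like = 0.01 * rng.standard_normal(160)
labels = classify_frames(np.concatenate([voiced_like, unvoiced_like]))
```

A practical classifier would combine this energy test with a harmonicity measure, as the text notes, since low-energy voiced frames exist.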
Voiced sounds consist of a fundamental frequency (F0) and its harmonic components produced by the vocal cords. The vocal tract modifies this excitation signal, introducing formant (pole) and sometimes anti-formant (zero) frequencies. Each formant frequency also has an amplitude and a bandwidth, and it may sometimes be difficult to define these parameters correctly. The fundamental frequency and the formant frequencies are probably the most important concepts in speech processing.
Speech signals of the three vowels (/a/, /i/, and /u/) are presented in time
and frequency domain in the figure.
FIGURE 6: TIME AND FREQUENCY DOMAIN
REPRESENTATION OF VOWELS (/a/, /i/, & /u/).
The fundamental frequency is about 100 Hz in all cases. The formant frequencies F1, F2 and F3 are approximately 600 Hz, 1000 Hz and 2500 Hz for the vowel /a/; 200 Hz, 2300 Hz and 3000 Hz for /i/; and 300 Hz, 600 Hz and 2300 Hz for /u/. The harmonic structure of the excitation can also be perceived from the frequency domain representation.
It can be seen that the first three formants lie inside the normal telephone channel (from 300 Hz to 3400 Hz), so the needed bandwidth is at most about 10 kHz, which leads to a 20 kHz sampling frequency. Even if the fundamental frequency lies outside the telephone channel, the human hearing system is capable of reconstructing it from its harmonics.
Another commonly used method to describe a speech signal is the spectrogram, a time-frequency representation in which amplitude is mapped to intensity. The spectrogram and the time domain representation of a speech signal are presented in figure 7.

In a spectrogram, higher amplitudes are represented by darker levels, so the formant frequencies and their trajectories are easy to perceive. It also helps in understanding the spectral differences between vowels and consonants. Hence the spectrogram is a most powerful tool for the representation of speech signals and for speech research.
FIGURE 7: THE TIME DOMAIN AND FREQUENCY
DOMAIN REPRESENTATION OF A WORD.
3. ORTHOGONAL TRANSFORMS
The role of the orthogonal transform is to de-correlate the speech signal. There are many orthogonal transforms. The transform that realizes the best de-correlation is the Karhunen-Loève transform; unfortunately, there is no fast algorithm for its computation. This is the reason why other orthogonal transforms are used in practice, for example the Fast Fourier Transform or the Discrete Cosine Transform. The following four orthogonal transforms are used in this paper:
DISCRETE COSINE TRANSFORM (DCT)
DISCRETE WAVELET TRANSFORM (DWT)
DISCRETE WAVELET PACKET TRANSFORM (DWPT)
DISCRETE COSINE PACKET TRANSFORM (DCPT)
3.1 DISCRETE COSINE TRANSFORM (DCT)

The Discrete Cosine Transform (DCT) attempts to decorrelate the speech data. After decorrelation, each transform coefficient can be encoded independently without losing compression efficiency.
One-Dimensional DCT
The most common DCT definition of a 1-D sequence f(x) of length N is

    C(u) = α(u) Σ_{x=0}^{N−1} f(x) cos[ π(2x + 1)u / (2N) ]        (1)

for u = 0, 1, 2, …, N − 1. Similarly, the inverse transformation is defined as

    f(x) = Σ_{u=0}^{N−1} α(u) C(u) cos[ π(2x + 1)u / (2N) ]        (2)

for x = 0, 1, 2, …, N − 1. In both equations (1) and (2), α(u) is defined as

    α(u) = √(1/N) for u = 0,   α(u) = √(2/N) for u ≠ 0.            (3)

It is clear from (1) that for u = 0,

    C(0) = √(1/N) Σ_{x=0}^{N−1} f(x).

Thus, the first transform coefficient is (up to a scale factor) the average value of the sample sequence. In the literature, this value is referred to as the DC coefficient; all other transform coefficients are called AC coefficients.

To fix ideas, ignore the f(x) and α(u) components in (1). The plot of

    cos[ π(2x + 1)u / (2N) ]

for N = 8 and varying values of u is shown in the figure. In accordance with the previous observation, the top-left waveform (u = 0) renders a constant (DC) value, whereas all other waveforms (u = 1, 2, …, 7) give waveforms at progressively increasing frequencies. These waveforms are called the cosine basis functions. Note that these basis functions are orthogonal: multiplication of any waveform in the figure with another waveform, followed by a summation over all sample points, yields zero, whereas multiplication of any waveform with itself followed by a summation yields a constant (scalar) value.
If the input sequence has more than N sample points, it can be divided into sub-sequences of length N, and the DCT can be applied to each chunk independently. A very important point to note is that the values of the basis function points do not change from one computation to the next; only the values of f(x) change in each sub-sequence. This is a very important property, since it shows that the basis functions can be pre-computed offline and then multiplied with the sub-sequences. This reduces the number of mathematical operations (multiplications and additions), thereby rendering the computation efficient.
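A direct NumPy sketch of equations (1)-(3): the basis matrix is computed once and reused for every length-N chunk, and its rows can be checked for orthonormality (the signal values below are illustrative):

```python
import numpy as np

def dct_basis(N):
    """Rows are the cosine basis functions of equation (1), scaled by alpha(u)."""
    u = np.arange(N).reshape(-1, 1)
    x = np.arange(N).reshape(1, -1)
    alpha = np.where(u == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return alpha * np.cos(np.pi * (2 * x + 1) * u / (2 * N))

N = 8
B = dct_basis(N)   # pre-computed once, reused for every chunk

f = np.array([1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0])  # example samples
C = B @ f          # forward DCT, equation (1)
f_rec = B.T @ C    # inverse DCT, equation (2)

# Orthonormality: basis-times-basis summation gives the identity matrix,
# so distinct waveforms sum to zero and each waveform with itself to a constant.
gram = B @ B.T
```

Because `B` is orthonormal, `f_rec` reproduces `f`, and `C[0]` equals the scaled average of the samples, i.e. the DC coefficient.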
ONE DIMENSIONAL COSINE BASIS FUNCTIONS (N = 8).
3.2 DISCRETE WAVELET TRANSFORM
3.3 DISCRETE WAVELET PACKET TRANSFORM
3.4 DISCRETE COSINE PACKET TRANSFORM
The discrete wavelet transform (DWT) is any wavelet transform for which the wavelets are discretely sampled. As with other wavelet transforms, a key advantage it has over Fourier transforms is temporal resolution: it captures both frequency and location information.

In the following, a statistical analysis of three discrete transforms is presented: the discrete wavelet transform (DWT), the discrete wavelet packet transform (DWPT) and the discrete cosine packet transform (DCPT). The analysis is based on simulations: the corresponding transform of a realization of a colored noise is computed, and the result is practically a white noise.
The DWT case
The whitening effect of the DWT
The result is presented in the figure. The top panel shows the power spectral density of the input noise: a colored noise, generated by filtering a white noise with a running-average filter with 20 taps. The bottom panel shows the power spectral density obtained by computing the DWT of the realization whose power spectral density is shown in the top panel. Because the envelope of the power spectral density in the bottom panel is practically constant, we can assert that
the output signal is a white noise. So, the DWT asymptotically converges to the Karhunen-Loève transform.
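This experiment can be reproduced in miniature. The sketch below is an assumption-laden stand-in for the original simulation: a plain Haar DWT replaces whatever wavelet the authors used, and lag-1 correlation replaces the full power-spectral-density estimate. White noise is colored with a 20-tap running average, as in the text, and the wavelet coefficients come out far less correlated than the input:

```python
import numpy as np

def haar_dwt(x):
    """Full-depth Haar DWT; returns all detail coefficients plus the
    final approximation, concatenated."""
    x = np.asarray(x, dtype=float)
    out = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / np.sqrt(2.0)
        det = (x[0::2] - x[1::2]) / np.sqrt(2.0)
        out.append(det)
        x = avg
    out.append(x)
    return np.concatenate(out)

def lag1_corr(x):
    """Correlation between each sample and its successor."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

rng = np.random.default_rng(1)
white = rng.standard_normal(4096 + 19)
colored = np.convolve(white, np.ones(20) / 20, mode="valid")  # 20-tap running average
coeffs = haar_dwt(colored[:4096])

# The colored input is strongly correlated sample-to-sample, while the
# DWT coefficients are close to uncorrelated -- the "whitening" effect.
```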
The DWPT case
The figure illustrates the DWPT whitening effect; the experiment already described is repeated for the DWPT. The top panel shows the power spectral density of the input signal, a realization of a colored noise obtained by band-pass filtering a white noise. The bottom panel shows the power spectral density of the signal obtained after computing the DWPT. Because the envelope of this power spectral density is constant almost everywhere, like the power spectral density of a white noise, we can assert that the DWPT also converges to the Karhunen-Loève transform.
The whitening effect of the DWPT.
The DCPT case
The DCPT is a combination of the discrete cosine transform (DCT) with the theory of wavelet packets. For this transform, the support of the signal to be analyzed (considered of length N) is segmented, and the lengths of the segments obtained depend on the number of iterations of the transform performed. At iteration index m the blocks have a length of N/2^m (assuming dyadic segmentation). The signal contained in each segment is transformed using the DCT.

The asymptotic analysis of the DCPT is based on the observation that, as the block length tends to infinity, the DCT asymptotically converges to the Karhunen-Loève transform. If the number of samples of each DCT coefficient sequence, corresponding to a given segment at a given iteration, tends to infinity, then the coefficients of the corresponding DCT converge to the Karhunen-Loève transform. Hence, the DCPT asymptotically converges to the Karhunen-Loève transform.
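A sketch of this dyadic segmentation step (the iteration depth and test signal are illustrative, and the per-block orthonormal DCT here stands in for the lapped cosine bases used in full cosine packet constructions). Since each block transform is orthonormal, the total energy of the coefficients equals that of the signal:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    u = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    alpha = np.where(u == 0, np.sqrt(1.0 / n), np.sqrt(2.0 / n))
    return alpha * np.cos(np.pi * (2 * x + 1) * u / (2 * n))

def cosine_packet_level(signal, m):
    """Split a length-N signal into 2**m blocks of length N / 2**m and
    apply the DCT to each block (one level of the packet segmentation)."""
    signal = np.asarray(signal, dtype=float)
    block = len(signal) // 2**m
    D = dct_matrix(block)                     # one basis, reused for all blocks
    return np.concatenate([D @ signal[i:i + block]
                           for i in range(0, len(signal), block)])

rng = np.random.default_rng(2)
s = rng.standard_normal(64)
coeffs = cosine_packet_level(s, m=2)          # 4 blocks of length 16

# Orthonormal blocks preserve energy (Parseval), so thresholding small
# coefficients discards only a controlled amount of signal energy.
```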
The DCPT whitening effect is presented in Fig. In top is represented the
power spectral density of a colored noise, obtained filtering white noise with the aid of a
running average and in bottom is represented the power spectral density of the signal
obtained after the computation of the DPCT of the signal with the power spectral density
represented in top. The envelope of the power spectral density of the signal represented in
bottom is a good approximation of a constant. Hence the signal with the power spectral
density represented in bottom is practically a white noise.
Hence the whitening effect of the DCPT has been highlighted. Comparing figures 1, 2 and 3, it can be observed that the DCPT has the highest convergence rate to a white noise. It is followed by the DWT; the DWPT has the slowest convergence. Each transform can be used in a compression system because all of them converge to the Karhunen-Loève transform. Taking the convergence speed into account, the best is the DCPT. This transform has another advantage as well: the temporal localization of its analysis filters. It is also recommended for the compression of speech, taking into account the sinusoidal model of speech. This proposition is proved in the following.
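The decorrelating ("whitening") effect discussed above can be checked numerically: a colored noise produced by a running average has strongly correlated neighbouring samples, while its per-segment DCT coefficients are almost uncorrelated. This is a rough sketch, not the paper's figure-based experiment; the segment length and filter length are my choices:

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(1)
white = rng.standard_normal(8192)
# colored noise: running average of white noise, as described in the text
colored = np.convolve(white, np.ones(8) / 8.0, mode='same')

def lag1_corr(v):
    """Correlation between a sequence and itself shifted by one sample."""
    return float(np.corrcoef(v[:-1], v[1:])[0, 1])

# per-segment orthonormal DCT (one cosine-packet level), coefficients concatenated
coeffs = dct(colored.reshape(-1, 512), type=2, norm='ortho', axis=1).ravel()

signal_corr = lag1_corr(colored)   # large: neighbouring samples are correlated
coeff_corr = lag1_corr(coeffs)     # near zero: the transform decorrelates
```

The drop in lag-one correlation is the time-domain counterpart of the flattened power spectral density shown in the paper's figures.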
4. SELECTING THE BEST TRANSFORM
The quality of a compression system can be appreciated with the aid of its rate-distortion function. A compression system is better than another if, at equal distortion, it achieves a higher compression rate. The compression rate can be maximized if a good selection of the parameters is made.
One of the three orthogonal transforms already mentioned, together with its parameters, must be selected, taking into account the fact that the signal to be processed is a speech signal. Each of the three transforms has some parameters. A parameter common to all three is the mother wavelet. This parameter must be selected in accordance with the type of signal to be compressed.
Each spoken word is a sequence of tones having different intensities, frequencies and durations. Every tone is a sinusoidal signal with a specific amplitude, frequency and duration. This is the sinusoidal model of speech. Its mathematical description is a sum of components named partials; each term of this sum is a double-modulated (amplitude- and frequency-modulated) signal. So, the speech signal is not stationary. Usually, however, the speech signal is regarded as a sequence of stationary signals: segmenting the speech signal into blocks with duration below 25 ms, a sequence of stationary signals is obtained. On every such segment the speech model is a sum of constant-amplitude sinusoids of the form Xs(t) = Σl Al·cos(ωl·t + φl).
This decomposition is very similar to a decomposition of the signal Xs(t) into a cosine
packet basis.
The same signal can also be decomposed into an orthogonal wavelet basis, where the basis functions are the wavelets generated by the mother wavelet ψ(t). The compression rate obtained using a specified mother wavelet is higher if the number of non-zero coefficients is smaller. The energy of the signal Xs(t) can be computed, by Parseval's relation, from these coefficients. Because the signal energy is a constant that does not depend on the mother wavelet used, the number of non-zero coefficients is smaller if the magnitude of the non-zero coefficients is higher. This magnitude is higher if the signal Xs(t) and the basis functions are more similar. Following (2), the highest similarity is obtained when the basis functions are elements of a cosine packet. This is the reason why, for the compression of speech signals, the best transform is the DCPT.
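The argument above, that a quasi-stationary sum of tones is represented by few large cosine coefficients, can be illustrated numerically. This is an illustrative sketch; the tone frequencies, amplitudes and the 22.05 kHz rate below are my choices (512 samples at that rate last about 23 ms, satisfying the text's 25 ms constraint):

```python
import numpy as np
from scipy.fft import dct

fs, n = 22050, 512                      # 512 samples at 22.05 kHz ~ 23 ms < 25 ms
t = np.arange(n) / fs
# a quasi-stationary "segment": three partials with fixed amplitudes,
# frequencies and phases (illustrative values, not taken from the paper)
x = (1.00 * np.cos(2 * np.pi * 440.0 * t + 0.3)
     + 0.50 * np.cos(2 * np.pi * 880.0 * t + 1.1)
     + 0.25 * np.cos(2 * np.pi * 1320.0 * t + 2.0))

c = dct(x, type=2, norm='ortho')
energy = np.sort(c ** 2)[::-1]
# how many of the largest coefficients carry 99% of the signal energy
k99 = int(np.searchsorted(np.cumsum(energy), 0.99 * energy.sum())) + 1
```

For such a signal `k99` is a small fraction of the 512 coefficients, which is exactly the sparsity that thresholding later exploits.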
Another parameter of the DCPT is the cost functional used for the selection of the best packet. This transform is an adaptive one: the result of its use in a given application can be optimized through the best-packet selection procedure, a very efficient procedure able to enhance considerably the quality of a given signal processing method. There are several cost functions that can be minimized for the selection of the best cosine packet. The most used is the entropy, but its use does not maximize the compression rate.
The optimal cost functional for compression is the one that minimizes the number Ns of coefficients superior to a given threshold t (which determines the distortion). Using this cost functional, Ns coefficients superior to the threshold t are obtained. This number is minimal because it was obtained using the cost functional appropriate for the selection of the best packet. This is the reason why this cost functional maximizes the compression rate.
Increasing the threshold value t, the number Ns decreases and the compression rate increases. Unfortunately, the reconstruction distortion also increases when t increases. This is why the threshold value t must be controlled to assure the transparency of the compression. Hence, the threshold detector (TD) block must be an adaptive one.
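A minimal sketch of this cost functional, and of a single split-versus-merge decision of the best-packet search (the full search applies this comparison recursively; the helper names are mine, and the cost is exactly the count of coefficients above the threshold t described above):

```python
import numpy as np
from scipy.fft import dct

def cost(coeffs, t):
    """Cost functional for compression: number of coefficients above t."""
    return int(np.sum(np.abs(coeffs) > t))

def best_packet(x, t):
    """One split/merge decision of the best-packet search: keep the whole-segment
    DCT or two half-segment DCTs, whichever leaves fewer coefficients above t."""
    whole = dct(x, type=2, norm='ortho')
    half = len(x) // 2
    left = dct(x[:half], type=2, norm='ortho')
    right = dct(x[half:], type=2, norm='ortho')
    if cost(whole, t) <= cost(left, t) + cost(right, t):
        return 'whole', [whole]
    return 'split', [left, right]
```

A tone spanning the whole segment is cheaper unsplit, while a signal whose frequency changes at mid-segment is cheaper after splitting; minimizing this cost is what maximizes the compression rate.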
Another parameter of the DCPT which must be considered for the
optimization of the compression is its number of iterations.
5. SPEECH COMPRESSION ALGORITHM
5.1 PROPOSED ALGORITHM
The idea of speech compression is to compress the speech signal so that it takes up less storage space and less bandwidth for transmission. To meet this goal, different methods for compression have been designed and developed.
The proposed adaptive algorithm for speech compression using the Cosine Packet Transform is shown in Fig 8.
The speech signal to be compressed is converted into packets of finite duration.
The Discrete Cosine Transform is applied to each packet and the transformed coefficients are computed.
The coefficients are extracted and fed into the adaptive threshold detector, which nullifies the inferior coefficients for better compression.
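The steps above can be sketched end-to-end in Python (a sketch, not the paper's MATLAB implementation; here a fixed threshold t stands in for the adaptive detector described in the next section, and the function names are mine):

```python
import numpy as np
from scipy.fft import dct, idct

def compress(x, seg_len=512, t=0.05):
    """Packet decomposition -> per-packet DCT -> nullify coefficients below t."""
    n_seg = len(x) // seg_len
    packets = np.asarray(x[:n_seg * seg_len], dtype=float).reshape(n_seg, seg_len)
    coeffs = dct(packets, type=2, norm='ortho', axis=1)
    coeffs[np.abs(coeffs) < t] = 0.0     # the compression mechanism
    return coeffs

def decompress(coeffs):
    """Inverse DCT per packet and concatenation."""
    return idct(coeffs, type=2, norm='ortho', axis=1).ravel()

def compression_ratio(coeffs):
    """Total number of coefficients over the number of coefficients kept."""
    return coeffs.size / max(np.count_nonzero(coeffs), 1)
```

On a tonal signal plus weak noise, most coefficients fall below the threshold and are zeroed, so the ratio of total to retained coefficients, the compression ratio reported in the results section, is large while the reconstruction stays close to the original.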
5.1.1 FLOW DIAGRAM FOR PROPOSED ADAPTIVE ALGORITHM
Packet Decomposition → Computation of DCT → Extracting the coefficients (ci) → Adaptive Threshold Detector
FLOW DIAGRAM FOR THE PROPOSED ADAPTIVE ALGORITHM
5.2 ADAPTIVE THRESHOLD DETECTOR
One of the most important processes of the proposed compression algorithm is the threshold detector. The main role of this process is to nullify all the coefficients obtained from the Cosine Packet Transform that are smaller than a threshold value. This is in fact the compression mechanism. This process is an adaptive system, which automatically chooses the threshold value depending on the transform coefficient values and repeats the process until a certain condition is met. The distortion due to compression is measured by the energy of the reconstruction error:
D = Σn (x[n] − x̂[n])²
where x̂[n] denotes the reconstructed signal.
Because the DCPT and its inverse, the IDCPT, are orthogonal transforms, the last relation becomes:
D = Σ ci²
where the sum runs over the nullified coefficients ci.
The value of the threshold t must be chosen such that the following condition is satisfied:
D < α·Ex (α < 1)
where Ex represents the energy of the input signal x[n]. So, to keep the distortion under the value α·Ex, it is sufficient to select a threshold value t ≤ sqrt(α·Ex/N), N being the number of transform coefficients (since the energy of the at most N nullified coefficients is then at most N·t² < α·Ex).
The constant α can be related to the signal-to-noise ratio of the signals x[n] and x̂[n] (which have almost the same energy): since the output signal-to-noise ratio is 10·log10(Ex/D) and D < α·Ex, a lower bound of the signal-to-noise ratio, depending on α, is established:
β = −10 ⋅ log10 α
Taking the relation with equality, a lower bound for the constant α is obtained: α = 10^(−β/10). Using this value, a lower bound tm for the threshold can be obtained too: tm = sqrt(10^(−β/10)·Ex/N). So, starting from the threshold value tm and raising it only while the distortion condition remains satisfied, an output signal-to-noise ratio superior to β is assured. Unfortunately, the exact value of Ex is not known a priori. This is the reason why an adaptive algorithm for the selection of the threshold value is recommended. This algorithm can use the value tm for initialization.
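The relation β = −10·log10 α can be checked with a short worked computation (the 16 dB figure below matches the per-segment SNR quoted in the conclusion; everything else is straightforward arithmetic):

```python
import numpy as np

def alpha_for_snr(beta_db):
    """Invert beta = -10*log10(alpha) to get the admissible alpha."""
    return 10.0 ** (-beta_db / 10.0)

beta = 16.0                      # target output SNR in dB (the paper's 16 dB figure)
alpha = alpha_for_snr(beta)      # roughly 0.0251: at most ~2.5% of Ex may be lost
# any distortion D < alpha * Ex then satisfies 10*log10(Ex / D) > beta
```

In words: guaranteeing a 16 dB output SNR means the energy removed by thresholding must stay below about 2.5 percent of the input energy.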
FLOW DIAGRAM FOR ADAPTIVE THRESHOLD DETECTOR
The flow diagram of the adaptive threshold detector is shown in Fig 9.
The energy of the input signal to be compressed is computed and the value of b is initialized.
The threshold value is calculated using the equation above and is then increased starting from this value.
At every iteration the value Ex is computed. If this value is higher than b, each extracted coefficient is compared with the threshold value t: if it is less than the threshold value, the corresponding coefficient is replaced with zero; otherwise the coefficient value is kept the same.
The proposed adaptive algorithm is stopped when, for the first time, the value Ex becomes smaller than b.
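The loop above can be sketched as follows. This is my reading of the flow, under the assumption that b is the allowed distortion bound α·Ex; the step size and the function name are my choices, not the paper's:

```python
import numpy as np

def adaptive_threshold(coeffs, alpha=10.0 ** (-16.0 / 10.0), step=1.05):
    """Raise the threshold geometrically and keep the last value for which the
    energy that nullification would remove stays below b = alpha * Ex.
    (One possible reading of the detector's flow diagram.)"""
    c = np.abs(np.ravel(coeffs))
    ex = float(np.sum(c ** 2))        # input energy Ex (Parseval: same in both domains)
    b = alpha * ex                    # allowed distortion bound
    t = max(float(c.min()), 1e-12)    # small initial threshold (the role of tm)
    best = t
    while t < c.max():
        d = float(np.sum(c[c < t] ** 2))   # energy removed by nullifying below t
        if d >= b:
            break                          # first violation: stop, keep previous t
        best = t
        t *= step
    return best
```

On a coefficient set with a few large values and many small ones, the returned threshold lands between the two groups, so the small coefficients are nullified while the distortion bound is respected.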
5.2.1 FLOW DIAGRAM FOR ADAPTIVE THRESHOLD DETECTOR
6. RESULTS AND DISCUSSION
Various speech signal samples are simulated using MATLAB. A generated speech signal sample is shown in Fig 3.
The generated speech signal is segmented into 15 packets of 512 samples each (the duration of each block being below 25 ms).
The Discrete Cosine Transform is computed for each packet using the corresponding equation.
The transformed coefficients are extracted for further processing.
The energy of the input signal is computed and the threshold value is calculated using the corresponding equation.
The value of the input energy is compared with b.
If it is higher, every transformed coefficient value is compared with the threshold value and the inferior coefficients are nullified. The new energy of the signal is calculated and compared with b.
If the energy is still higher than b, the above process is repeated with a new threshold value; otherwise the compression process is stopped.
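The segmentation figures quoted above imply a few simple numbers worth making explicit (only the 15-packet and 512-sample values come from the text; the rest is arithmetic):

```python
# Segmentation arithmetic from the simulation description:
packets, seg_len = 15, 512
total_samples = packets * seg_len       # 7680 samples are processed in all
# each 512-sample block must last less than 25 ms, so the sampling rate
# has to exceed 512 / 0.025 = 20480 Hz (the common 22.05 kHz rate qualifies)
min_fs = seg_len / 0.025
```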
In order to show that the proposed adaptive algorithm gives better performance for speech signals, compression is performed for 20 different speech signals using the Discrete Wavelet Transform, the Wavelet Packet Transform and the proposed adaptive algorithm. The compression ratios achieved through these methods are tabulated for the various speech signal samples.
SPEECH SIGNAL SAMPLE
Analyzing the table, the good performance of the proposed adaptive algorithm can be observed. The smallest compression ratio, 11.1433, was obtained on the 3rd sample and the best compression ratio, 68.0507, was obtained on the 20th sample. The proposed algorithm gives the better compression ratio for most of the speech samples.
TABLE 1. COMPARISON OF COMPRESSION RATIOS FOR 20 DIFFERENT SPEECH SAMPLES
The comparison of the compression ratios for speech signal samples 1 to 10 and 11 to 20 is plotted in Fig A and Fig B respectively, for easy understanding.
A. GRAPHICAL REPRESENTATION OF COMPRESSION RATIOS FOR
SPEECH SAMPLES (1-10)
B. GRAPHICAL REPRESENTATION OF COMPRESSION RATIOS FOR SPEECH SAMPLES (11-20)
The analysis of the above figures shows that, out of the 20 signal samples, only 2 samples have a lower compression ratio compared with the WPT method, and a higher one compared with the other two methods. The mean compression ratios for all the methods are computed and tabulated in Table 2.
TABLE 2. MEAN COMPRESSION RATIOS FOR 20 DIFFERENT SPEECH SAMPLES
The analysis of Table 2 shows that the mean compression ratio for the 20 samples achieved using the proposed adaptive algorithm is 28.275. This is a sufficiently high value, taking into account the fact that no lossless compression method was used.
7. APPLICATIONS OF SPEECH COMPRESSION
Speech compression has been, and still is, a major issue in the field of digital speech processing. Speech compression is the act of transforming the speech signal into a more compact form, such that it takes less storage space and less bandwidth for transmission.
Today, applications of speech compression involve real time
processing in mobile satellite communications, cellular
telephony, internet telephony, audio for videophones or video
teleconferencing systems, among others.
Other applications also include storage and synthesis systems used, for example, in voice mail systems, voice memo wristwatches, voice logging recorders and interactive PC software.
Speech compression is used in digital telephony, in multimedia and in the security of digital communications.
Speech compression is required in
Long-distance communications
High-quality speech storage
Message encryption
For example, in digital cellular technology many users need to share the same frequency bandwidth; utilizing speech compression makes it possible for more users to share the available system.
Another example where speech compression is required is digital voice storage: for a fixed amount of memory, compression makes it possible to store longer messages.
8. CONCLUSION
A new compression method based on an adaptive threshold detector is proposed and tested. Its construction is based on the rate-distortion function concept. Taking into account the high redundancy of the speech signal, such a method is very appropriate. The proposed method assures a good-quality reconstruction (on each segment an output signal-to-noise ratio superior to 16 dB, and a mean output signal-to-noise ratio superior to 19 dB). It can be said that this compression method is transparent.
The simulated results show that the proposed algorithm gives a better compression ratio compared with the other methods. Using this method, a mean compression ratio of 28.275 was obtained in the simulations. This value is superior to the mean compression ratios of the other methods.
The proposed compression method is also fast, requiring a relatively small number of multiplications. For example, the computation of the DCPT requires the same number of multiplications as the Fast Fourier Transform of the same signal. This is the reason why the proposed compression method can be implemented on a Digital Signal Processor. The proposed system represents a good alternative to speech compression systems based on linear prediction.