Multimedia Specification Design and Production, 2013 / Semester 2 / week 3. Lecturer: Dr. Nikos Gazepidis



Outline

Introduction

Topics in speech processing: speech coding, speech recognition, speech synthesis, speaker verification/recognition

Audio Elements

Conclusion

Speech in Multimedia


Introduction

Speech is our basic communication tool.

We have long hoped to be able to communicate with machines using speech.



Speech Production Model


[Figure: anatomical structure and mechanical model of speech production]


Speech Production Model


[Figure: waveform and spectrogram of a speech signal; spectrogram axes are time and frequency]

Voiced and Unvoiced Speech


[Figure: speech waveform with silence, unvoiced, and voiced segments labeled]

Short Time Parameters


[Figure: speech waveform with its envelope, and the short-time power of the same signal]
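Short-time power is commonly computed frame by frame as the mean squared amplitude of the samples in each frame. The following is a minimal sketch under that assumption; the function name `short_time_power` and the parameters `frame_len` and `hop` are illustrative and do not come from the slides.

```python
def short_time_power(samples, frame_len, hop):
    """Mean squared amplitude of each frame.

    Frames are `frame_len` samples long and start every `hop` samples.
    A trailing partial frame is ignored.
    """
    powers = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        powers.append(sum(s * s for s in frame) / frame_len)
    return powers


# Toy signal: four silent samples followed by four loud ones.
signal = [0, 0, 0, 0, 1, -1, 1, -1]
print(short_time_power(signal, frame_len=4, hop=4))
```

A low value marks silence, while voiced segments usually show the highest values, which is why this parameter is useful for segmenting a waveform.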


Short Time Parameters (cont.)


[Figure: speech waveform annotated with its zero-crossing rate, and a voiced segment annotated with its pitch period]
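The zero-crossing rate and the pitch period can both be estimated from a single frame of samples. The sketch below is one simple way to do it, not necessarily the method behind the figures: the zero-crossing rate counts sign changes between adjacent samples, and the pitch period is taken as the lag with the largest autocorrelation. The names `zero_crossing_rate`, `pitch_period`, `min_lag`, and `max_lag` are assumptions made for illustration.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def pitch_period(frame, min_lag, max_lag):
    """Lag in [min_lag, max_lag] (in samples) with the largest autocorrelation."""
    def autocorr(lag):
        return sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
    return max(range(min_lag, max_lag + 1), key=autocorr)


# Toy "voiced" frame: a sine wave that repeats every 20 samples.
voiced = [math.sin(2 * math.pi * n / 20) for n in range(200)]
print(zero_crossing_rate(voiced))
print(pitch_period(voiced, min_lag=5, max_lag=30))
```

Unvoiced speech is noise-like and tends to have a high zero-crossing rate, while voiced speech is nearly periodic, so the autocorrelation has a clear peak at the pitch period.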


Linear Predictive Coding (LPC) Speech Coder


[Block diagram: speech → speech buffer (frame n, frame n+1) → speech analysis (pitch, voiced/unvoiced, vocal tract parameters, energy parameter) → quantizer → code generation → code stream]


LPC and Vocal Tract


Mathematically, speech can be modeled by the following generation model:

$$x(n) = \sum_{p=1}^{k} a_p \, x(n-p) + e(n)$$

{a_1, a_2, …, a_k} are called Linear Prediction Coefficients (LPC); they can be used to model the shape of the vocal tract.

e(n) is the excitation that generates the speech.
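The generation model can be run directly as a recursion: each output sample is a weighted sum of the previous k outputs plus the current excitation sample. The following is a minimal sketch, assuming x(n) = 0 for n < 0; the function name `lpc_synthesize` and the coefficient values are illustrative only.

```python
def lpc_synthesize(coeffs, excitation):
    """Generate x(n) = sum_{p=1..k} a_p * x(n-p) + e(n).

    `coeffs` holds a_1..a_k and `excitation` holds e(0), e(1), ...
    Samples before n = 0 are taken to be zero.
    """
    k = len(coeffs)
    x = []
    for n, e_n in enumerate(excitation):
        predicted = sum(
            coeffs[p - 1] * x[n - p] for p in range(1, k + 1) if n - p >= 0
        )
        x.append(predicted + e_n)
    return x


# A single impulse as excitation shows the decaying response of the filter.
print(lpc_synthesize([0.5], [1.0, 0.0, 0.0, 0.0]))
# → [1.0, 0.5, 0.25, 0.125]
```

For voiced speech the excitation would typically be a periodic pulse train at the pitch period, and for unvoiced speech a noise sequence.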


An Example for Synthesizing Speech


Blending region

Glottal Pulse

Go through vocal tract filter with gain control

Go through radiation filter


Speech Recognition


Speech recognition is the foundation of human-computer interaction using speech.

Speech recognition operates in different contexts: speaker-dependent or speaker-independent; discrete words or continuous speech; small or large vocabulary; quiet or noisy environment.

[Block diagram: speech → parameter analyzer → comparison and decision algorithm (using reference patterns and a language model) → words]


How does Speech Recognition Work?


Words: grey whales

Phonemes: g r ey w ey l z

Each phoneme has different characteristics (for example, the power distribution).


Speech Recognition


g g r ey ey ey ey w ey ey l l z

How do we “match” the word when there are time and other variations?
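The time variation can be made concrete with the frame-level labels above: a phoneme that lasts longer simply occupies more frames. As a simplified sketch, assuming each frame already carries a correct phoneme label (which a real recognizer does not get for free), merging consecutive repeats recovers the phoneme sequence. The function name `collapse_repeats` is illustrative.

```python
from itertools import groupby


def collapse_repeats(frame_labels):
    """Merge runs of identical consecutive labels into a single label."""
    return [label for label, _ in groupby(frame_labels)]


frames = "g g r ey ey ey ey w ey ey l l z".split()
print(collapse_repeats(frames))
# → ['g', 'r', 'ey', 'w', 'ey', 'l', 'z']
```

This only handles variation in duration. It would also merge a phoneme that genuinely occurs twice in a row, and it says nothing about noisy or ambiguous frames, which is why a probabilistic matching method is needed.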


Dynamic Programming in Decoding


[Figure: trellis of states plotted against time]

We can find the path through the states that corresponds to the most probable phoneme sequence for generating the observed feature sequence (one feature extracted from each speech frame).
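A standard dynamic programming method for finding such a path is the Viterbi algorithm. The sketch below assumes a small hidden Markov model in which each state is a phoneme and each observation is a discrete feature symbol; the slides do not name the algorithm, and every probability and symbol here is made up for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most probable state sequence."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               state that precedes s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            prob, back = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (prob * emit_p[s][obs], back)
        best.append(column)

    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return prob, path


# Hypothetical left-to-right model for the start of "grey": g -> r -> ey.
states = ["g", "r", "ey"]
start_p = {"g": 1.0, "r": 0.0, "ey": 0.0}
trans_p = {
    "g": {"g": 0.5, "r": 0.5, "ey": 0.0},
    "r": {"g": 0.0, "r": 0.5, "ey": 0.5},
    "ey": {"g": 0.0, "r": 0.0, "ey": 1.0},
}
emit_p = {
    "g": {"A": 0.8, "B": 0.1, "C": 0.1},
    "r": {"A": 0.1, "B": 0.8, "C": 0.1},
    "ey": {"A": 0.1, "B": 0.1, "C": 0.8},
}
prob, path = viterbi(["A", "A", "B", "C", "C"], states, start_p, trans_p, emit_p)
print(path)
# → ['g', 'g', 'r', 'ey', 'ey']
```

Practical implementations usually work with log probabilities to avoid numerical underflow on long utterances.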


Speech Synthesis


Speech synthesis generates (arbitrary) speech with desired properties (pitch, speed, loudness, articulation mode, etc.).

Speech synthesis has been widely used for text-to-speech systems and various telephone services. The easiest and most often used speech synthesis method is waveform concatenation.

Increase the pitch without changing the speed


Speaker Recognition


Identifying or verifying the identity of a speaker is an application in which computers outperform humans.

Vocal tract parameters can be used as features for speaker recognition.

[Figure: two panels, one for speaker one and one for speaker two]

Applications


Speech recognition: call routing, directory assistance, operator services, document input

Speaker recognition: personalized service, fraud control

Text-to-speech synthesis: speech interface, document correction, voice commands

Speech coding: wireless telephone, voice over Internet


Audio Elements in Speech

Audio Elements

Auditory icons use an intuitive linkage between the model world of sonically represented objects and events, using sounds familiar to listeners from the everyday world.


Earcons are short, structured musical phrases that can be parameterized to communicate information in an Auditory Display.



Earcons

An earcon is the audio equivalent of an icon, and just like visual icons, we hear earcons throughout the day. Its job is to communicate meaning through the use of sound. What’s powerful about this, and about sound in general, is that even though light travels faster than sound, we process sound more quickly.

Some examples of earcons:
Empty trash sound on your computer
Microwave end beeps (some models sing a song now)
Seatbelt warning signal in your car
Horn honk when the car doors lock
Beeps when you press a button on your phone



Auditory Icons

Auditory icons are caricatures of naturally occurring sounds that can be used to provide information about sources of data.

Some examples of auditory icons:
Car horn warning
Water splashing
A flowing river
Filling a bottle with water
A car engine starting and idling
A door opening or closing



Sound Filter Effects http://manual.audacityteam.org/man/Effect_Menu

1. Volume Normalization: Use the Normalize effect to set the peak amplitude of single or multiple tracks, or to equalize the peak amplitude of the left and right channels of stereo tracks.

2. Noise Reduction: This effect is ideal for removing constant background noise such as fans, tape noise, or hums. It will not work very well for removing talking or music in the background.

3. Amplify: This effect increases or decreases the volume of a track or set of tracks. When you open the dialog, Audacity automatically calculates the maximum amount by which you could amplify the selected audio without causing clipping (distortion from being too loud).
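The arithmetic behind peak normalization and the maximum safe amplification is simple enough to sketch. This is an illustration of the idea only, not Audacity's implementation; it assumes samples are floats in the range [-1.0, 1.0], and the names `peak`, `max_gain_without_clipping`, and `normalize` are hypothetical.

```python
def peak(samples):
    """Largest absolute sample value."""
    return max(abs(s) for s in samples)


def max_gain_without_clipping(samples, full_scale=1.0):
    """Largest gain that keeps every sample within [-full_scale, full_scale]."""
    p = peak(samples)
    if p == 0:
        raise ValueError("silent audio has no finite maximum gain")
    return full_scale / p


def normalize(samples, target_peak=1.0):
    """Scale the audio so that its peak amplitude equals `target_peak`."""
    gain = max_gain_without_clipping(samples, full_scale=target_peak)
    return [s * gain for s in samples]


audio = [0.25, -0.5, 0.125]
print(max_gain_without_clipping(audio))  # → 2.0
print(normalize(audio))                  # → [0.5, -1.0, 0.25]
```

Any gain larger than the value returned by `max_gain_without_clipping` would push the loudest sample past full scale, which is what clipping means.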



Sound Filter Effects

4. Fade In: Applies a fade-in to the selected audio, so that the amplitude changes gradually from silence at the start of the selection to the original amplitude at the end of the selection. The shape of the fade is linear.

5. Fade Out: Applies a fade-out to the selected audio, so that the amplitude changes gradually from the original amplitude at the start of the selection down to silence at the end of the selection. The shape of the fade is linear.

6. Equalizer: Equalization is a way of manipulating sounds by frequency. It allows you to adjust the volume levels of particular frequencies.
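The linear fades described in items 4 and 5 amount to multiplying each sample by a gain that ramps between 0 and 1 across the selection. The following is a minimal sketch of that idea, not Audacity's code; the function names `fade_in` and `fade_out` are illustrative, and the selection is assumed to be a plain list of samples.

```python
def fade_in(samples):
    """Linear ramp from silence (gain 0) to the original amplitude (gain 1)."""
    n = len(samples)
    if n < 2:
        return list(samples)
    return [s * i / (n - 1) for i, s in enumerate(samples)]


def fade_out(samples):
    """Linear ramp from the original amplitude (gain 1) down to silence (gain 0)."""
    n = len(samples)
    if n < 2:
        return list(samples)
    return [s * (n - 1 - i) / (n - 1) for i, s in enumerate(samples)]


selection = [1.0, 1.0, 1.0, 1.0, 1.0]
print(fade_in(selection))   # → [0.0, 0.25, 0.5, 0.75, 1.0]
print(fade_out(selection))  # → [1.0, 0.75, 0.5, 0.25, 0.0]
```

Using a constant-amplitude selection makes the gain ramp itself visible in the output.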
