Multimedia Specification Design and Production
2013 / Semester 2 / Week 3
Lecturer: Dr. Nikos [email protected]
Outline
Introduction
Topics in speech processing
  Speech coding
  Speech recognition
  Speech synthesis
  Speaker verification/recognition
Audio Elements
Conclusion
Speech in Multimedia
Introduction
Speech is our basic communication tool.
We have long hoped to communicate with machines using speech.
Speech Production Model
[Figure: speech waveform and its spectrogram (time vs. frequency)]
Voiced and Unvoiced Speech
[Figure: speech waveform with silence, unvoiced, and voiced segments marked]
Short Time Parameters
[Figure: speech waveform with its short-time power and waveform envelope]
Short Time Parameters (cont.)
[Figure: speech waveform with zero-crossing rate and pitch period]
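Two of the short-time parameters named on these slides, short-time power and zero-crossing rate, can be sketched as below. This is only an illustration: the frame length, hop size, sampling rate, and the synthetic test signal are assumptions, not values taken from the lecture.

```python
import math

def frames(signal, frame_len, hop):
    """Split a list of samples into frames; a trailing partial frame is dropped."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Synthetic signal (assumed): a loud 100 Hz sine as a stand-in for voiced speech,
# then a quiet sign-alternating sequence as a crude stand-in for unvoiced speech.
fs = 8000
voiced = [0.3 * math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]
unvoiced = [0.02 * (1 if n % 2 == 0 else -1) for n in range(400)]
signal = voiced + unvoiced

for frame in frames(signal, frame_len=200, hop=200):
    print(round(short_time_power(frame), 5), round(zero_crossing_rate(frame), 3))
```

In this toy signal the voiced-like frames show high power and a low zero-crossing rate, while the unvoiced-like frames show the opposite, which is the usual basis for a simple voiced/unvoiced decision.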
Linear Predictive Coding (LPC) Speech Coder
[Block diagram: speech -> speech buffer (frame n, frame n+1) -> speech analysis (pitch, voiced/unvoiced, vocal tract parameter, energy parameter) -> quantizer -> code generation -> code stream]
LPC and Vocal Tract
Mathematically, speech can be described by the following generation model:
$$x(n) = \sum_{p=1}^{k} a_p \, x(n-p) + e(n)$$
{a_1, a_2, …, a_k} are called Linear Prediction Coefficients (LPC); they can be used to model the shape of the vocal tract.
e(n) is the excitation that generates the speech.
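One way to estimate the coefficients a_p from a signal is sketched below. The slide does not say which estimation method is used; this sketch assumes the autocorrelation method solved with the Levinson-Durbin recursion, and the coefficient values and the single-pulse excitation are made up for the demonstration.

```python
def synthesize(coeffs, excitation):
    """Run the generation model x(n) = sum_p a_p * x(n-p) + e(n)."""
    x = []
    for n, e in enumerate(excitation):
        pred = sum(a * x[n - p] for p, a in enumerate(coeffs, start=1) if n - p >= 0)
        x.append(pred + e)
    return x

def autocorrelation(x, max_lag):
    """r[lag] = sum_n x(n) * x(n - lag) for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for the predictor coefficients a_1..a_order."""
    a = [0.0] * (order + 1)  # a[0] is unused so that indices match a_p
    err = r[0]               # prediction error energy at order 0
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err

# Hypothetical second-order "vocal tract" excited by a single pulse.
true_coeffs = [1.3, -0.6]
excitation = [1.0] + [0.0] * 499
x = synthesize(true_coeffs, excitation)

estimated, residual_energy = levinson_durbin(autocorrelation(x, 2), 2)
print([round(a, 4) for a in estimated], round(residual_energy, 4))
```

Because the test signal is generated by the same model, the estimated coefficients should come out very close to the ones used for synthesis; real speech frames would first be windowed and would only approximately follow the model.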
An Example for Synthesizing Speech
Blending region
Glottal Pulse
Go through vocal tract filter with gain control
Go through radiation filter
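The chain listed above (glottal pulses, then a vocal tract filter with gain control, then a radiation filter) can be sketched roughly as follows. Everything numeric here is an assumption for illustration: the pulse is a bare impulse rather than a shaped glottal pulse, the all-pole coefficients are invented, and the radiation filter is modeled as a first-order difference, which is a common textbook simplification rather than something stated on the slide. The blending region is not modeled.

```python
def pulse_train(n_samples, pitch_period):
    """Idealized glottal excitation: one unit impulse every pitch_period samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]

def vocal_tract_filter(excitation, coeffs, gain):
    """All-pole filter: y(n) = sum_p a_p * y(n-p) + gain * e(n)."""
    out = []
    for n, e in enumerate(excitation):
        pred = sum(a * out[n - p] for p, a in enumerate(coeffs, start=1) if n - p >= 0)
        out.append(pred + gain * e)
    return out

def radiation_filter(x, alpha=0.97):
    """First-order difference y(n) = x(n) - alpha * x(n-1) (assumed lip-radiation model)."""
    return [x[n] - (alpha * x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

# Hypothetical settings: 8 kHz sampling with an 80-sample pitch period (100 Hz).
excitation = pulse_train(400, pitch_period=80)
speech = radiation_filter(vocal_tract_filter(excitation, coeffs=[1.3, -0.6], gain=0.5))
print([round(s, 3) for s in speech[:5]])
```

Changing `pitch_period` changes the pitch of the output without touching the filter, which is what makes this source-filter split convenient for synthesis.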
Speech Recognition
Speech recognition is the foundation of human-computer interaction using speech.
Speech recognition in different contexts:
  Speaker-dependent or speaker-independent.
  Discrete words or continuous speech.
  Small vocabulary or large vocabulary.
  Quiet environment or noisy environment.
[Block diagram: speech -> parameter analyzer -> comparison and decision algorithm (using a language model and reference patterns) -> words]
How does Speech Recognition Work?
Words: grey whales
Phonemes: g r ey w ey l z
Each phoneme has different characteristics (for example, the power distribution).
Speech Recognition
g g r ey ey ey ey w ey ey l l z
How do we “match” the word when there are time and other variations?
Dynamic Programming in Decoding
[Figure: decoding trellis, time vs. states]
We can find the path through the states that corresponds to the most probable phoneme sequence for generating the observed sequence of features (one feature extracted from each speech frame).
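The path search can be sketched with the Viterbi algorithm, one common dynamic-programming decoder; the slides do not name a specific algorithm, so treat this as an assumption. To keep the example small, the per-frame "features" are replaced by noisy phoneme labels, the model is a simple left-to-right chain for "grey whales", and the probabilities `p_stay` and `p_match` are invented.

```python
import math

def viterbi_left_to_right(states, observations, p_stay=0.5, p_match=0.75, alphabet_size=6):
    """Align frames to a left-to-right phoneme chain; return (labels per frame, log score)."""
    p_other = (1.0 - p_match) / (alphabet_size - 1)

    def log_emit(state, obs):
        return math.log(p_match if states[state] == obs else p_other)

    n_states, n_frames = len(states), len(observations)
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_emit(0, observations[0])  # the path must start in the first state

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s] + math.log(p_stay)
            move = score[t - 1][s - 1] + math.log(1.0 - p_stay) if s > 0 else neg_inf
            if move > stay:
                score[t][s], back[t][s] = move, s - 1
            else:
                score[t][s], back[t][s] = stay, s
            score[t][s] += log_emit(s, observations[t])

    # Backtrack from the last state, since the whole word sequence must be spoken.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [states[s] for s in path], score[-1][-1]

phonemes = ["g", "r", "ey", "w", "ey", "l", "z"]
clean = "g g r ey ey ey ey w ey ey l l z".split()
noisy = clean[:5] + ["l"] + clean[6:]  # one frame misrecognized

aligned_clean, clean_score = viterbi_left_to_right(phonemes, clean)
aligned_noisy, noisy_score = viterbi_left_to_right(phonemes, noisy)
print(" ".join(aligned_noisy))
```

The decoder absorbs the timing variation (phonemes lasting different numbers of frames) through the stay transitions, and it tolerates the single bad frame because no valid path through the chain explains that frame better.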
Speech Synthesis
Speech synthesis generates (arbitrary) speech with desired properties (pitch, speed, loudness, articulation mode, etc.).
Speech synthesis has been widely used for text-to-speech systems and different telephone services. The easiest and most often used speech synthesis method is waveform concatenation.
Increase the pitch without changing the speed
Speaker Recognition
Identifying or verifying the identity of a speaker is an application in which computers can outperform humans.
Vocal tract parameters can be used as features for speaker recognition.
[Figure: feature plots for Speaker one and Speaker two]
Applications
Speech recognition
Call routing
Directory Assistance
Operator Services
Document input
Speaker recognition
Personalized service
Fraud Control
Text-to-Speech synthesis
Speech Interface
Document Correction
Voice Commands
Speech Coding
Wireless Telephone
Voice over Internet
Audio Elements in Speech
Audio Elements
Auditory Icons
Auditory icons use an intuitive linkage between the model world of sonically represented objects and events, using sounds familiar to listeners from the everyday world.
Earcons
Earcons are short, structured musical phrases that can be parameterized to communicate information in an Auditory Display.
Earcons
An earcon is the audio equivalent of an icon, and just like visual icons, we hear earcons throughout the day. Its job is to communicate meaning through sound. What is powerful about earcons, and about sound in general, is that even though light travels faster than sound, we process sound more quickly.
Some examples of earcons:
  Empty trash sound on your computer
  Microwave end beeps (some models sing a song now)
  Seatbelt warning signal in your car
  Horn honk when car doors are locked
  Beeps when you press a button on your phone
Auditory Icons
Auditory icons are caricatures of naturally occurring sounds that can be used to provide information about sources of data.
Some examples of auditory icons:
  Car horn warning
  Water splashing
  A flowing river
  Filling a bottle with water
  A car engine starting and idling
  A door opening or closing
Sound Filter Effects http://manual.audacityteam.org/man/Effect_Menu
1. Volume Normalization
Use the Normalize effect to set the peak amplitude of single or multiple tracks, and to equalize the peak amplitude of the left and right channels of stereo tracks.
2. Noise Reduction
This effect is ideal for removing constant background noise such as fans, tape noise, or hums. It will not work very well for removing talking or music in the background.
3. Amplitude (Amplify effect)
This effect increases or decreases the volume of a track or set of tracks. When you open the dialog, Audacity automatically calculates the maximum amount by which you could amplify the selected audio without causing clipping (that is, without it becoming too loud).
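The arithmetic behind peak normalization and the "maximum amplification without clipping" figure can be sketched as below. This is not Audacity's implementation; it assumes floating-point samples with full scale at 1.0, and the function names and the `target_peak_db` parameter are hypothetical.

```python
def peak(samples):
    """Largest absolute sample value."""
    return max(abs(s) for s in samples)

def max_gain_without_clipping(samples, full_scale=1.0):
    """Largest linear gain that keeps every sample within +/- full_scale."""
    return full_scale / peak(samples)

def amplify(samples, gain):
    """Scale every sample by a linear gain factor."""
    return [s * gain for s in samples]

def normalize(samples, target_peak_db=-1.0):
    """Scale the signal so that its peak sits at target_peak_db relative to full scale."""
    p = peak(samples)
    if p == 0:
        return list(samples)  # silence cannot be normalized
    target = 10 ** (target_peak_db / 20)  # convert dB to a linear amplitude
    return amplify(samples, target / p)

samples = [0.1, -0.25, 0.2]
print(max_gain_without_clipping(samples))
print([round(s, 3) for s in normalize(samples, target_peak_db=-6.0)])
```

Normalizing each stereo channel separately with the same target is what equalizes the left and right peak amplitudes; applying one shared gain to both channels would preserve their original balance instead.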
Sound Filter Effects
4. Fade In
Applies a fade-in to the selected audio, so that the amplitude changes gradually from silence at the start of the selection to the original amplitude at the end of the selection. The shape of the fade is linear.
5. Fade Out
Applies a fade-out to the selected audio, so that the amplitude changes gradually from the original amplitude at the start of the selection down to silence at the end of the selection. The shape of the fade is linear.
6. Equalizer
Equalization is a way of manipulating sounds by frequency. It allows you to adjust the volume levels of particular frequencies.
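The linear fades described in items 4 and 5 amount to multiplying the selection by a gain ramp. A minimal sketch follows, assuming the ramp runs exactly from 0 to 1 across the selection; Audacity's own per-sample ramp may differ slightly at the endpoints.

```python
def fade_in(samples):
    """Linear ramp from silence at the first sample to full level at the last."""
    n = len(samples)
    if n < 2:
        return list(samples)
    return [s * i / (n - 1) for i, s in enumerate(samples)]

def fade_out(samples):
    """Linear ramp from full level at the first sample down to silence at the last."""
    n = len(samples)
    if n < 2:
        return list(samples)
    return [s * (n - 1 - i) / (n - 1) for i, s in enumerate(samples)]

selection = [1.0, 1.0, 1.0, 1.0, 1.0]
print(fade_in(selection))
print(fade_out(selection))
```

Applied to a constant signal, the output traces the gain ramp itself, which makes it easy to check the shape of the fade.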