Upload
others
View
23
Download
0
Embed Size (px)
Citation preview
24.963���
Linguistic Phonetics���Basic Audition���
���
1
Diagram of the inner ear removed due to copyright restrictions.
• Reading: Keating 1985 .963 also read Flemming 2001 • 24
• Assignment 1 - basic acoustics. Due 9/22.
2
Audition
Anatomy 3
Eustachian Tube
Outer Ear MiddleEar Inner Ear
AuditoryNerve
CochleaEardrum
Ear Canal
Ear Flap
Hammer
Anvil
Stirrup
Image by MIT OCW.
Auditory ‘spectrograms’ The auditory system performs a running frequency analysis of
acoustic signals - cf. spectrogram. • But the auditory spectrogram differs from a regular
spectrogram in significant ways, e.g.: • Frequency - a regular spectrogram analyzes frequency bands
of equal widths, but the peripheral auditory system analyzes frequency bands that are wider at higher frequencies.
• Loudness is non-linearly related to intensity • Masking effects (simultaneous and non-simultaneous). • It is useful to bear these differences in mind when looking at
acoustic spectrograms. • It is possible to generate ‘auditory spectrograms’ based on
models of the auditory system.
4
Loudness • The perceived loudness of a sound depends on the
amplitude of the pressure fluctuations in the sound wave.
• Amplitude is usually measured in terms of root-mean-square (rms amplitude): – The square root of the mean of the squared amplitude
over some time window.
5
rms amplitude • Square each sample in the analysis window. • Calculate the mean value of the squared waveform:
– Sum the values of the samples and divide by the number of samples.
• Take the square root of the mean.
-1.5
-1
-0.5
0
0.5
1
1.5
0 0.05 0.1 0.15 0.2
time
pressure pressure
pressure^2rms amplitude
6
rms amplitude
7
.01 .020Time in seconds
Image by MIT OCW.
Adapted from Johnson, Keith. Acoustic and Auditory Phonetics.Malden, MA: Blackwell Publishers, 1997. ISBN: 9780631188483.
Intensity
• Perceived loudness is more closely related to intensity (power per unit area), which is proportional to the square of the amplitude.
• relative intensity in Bels = log10(x2/r2) • relative intensity in dB = 10 log 2
10(x /r2) = 20 log10(x/r) • In absolute intensity measurements, the comparison
amplitude is usually 20µPa, the lowest audible pressure fluctuation of a 1000 Hz tone (dB SPL).
8
logarithmic scales
• log xy = log x + log y
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
0 10 20 30 40 50x
log(x)
9
Loudness
• The relationship between intensity and perceived loudness of a pure tone is not exactly logarithmic.
– Loudness of a pure tone (> 40 dB) in Sones:
– Loudness is defined to be 1 sone for a 1000 Hz tone at 40 dB SPL
N = 2dB−40( )10
10
500,000 1,000,000 1,500,000 2,000,00000
10
20
30
40
50
60
70
80
90
100
0
2
4
6
8
10
12
14
16
18
20
Pressure (µPa)
dB S
PL
Sone
s
dB SPL
Sones
Image by MIT OCW.Adapted from Johnson, Keith. Acoustic and Auditory Phonetics.
Malden, MA: Blackwell Publishers, 1997. ISBN: 9780631188483.
Loudness • Pure tones: 131-2092 Hz• Which sounds loudest?
11
Loudness • Loudness also depends on frequency.• equal loudness contours for pure tones:
12
© Springer. All rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
Loudness • At short durations, loudness also depends on duration.• Temporal integration: loudness depends on energy in the
signal, integrated over a time window.• Duration of integration is often said to be about 200ms, i.e.
relevant to the perceived loudness of vowels.
13
Pitch
• Perceived pitch is approximately linear with respect tofrequency from 100-1000 Hz, between 1000-10,000 Hz therelationship is approximately logarithmic.
14
00
4
8
12
16
20
24
1 2 3 4 5 6 7 8 9 10Frequency (kHz)
Freq
uenc
y (B
ark)
Image by MIT OCW.Adapted from Johnson, Keith. Acoustic and Auditory Phonetics.
Malden, MA: Blackwell Publishers, 1997. ISBN: 9780631188483.
Pitch • The non-linear frequency response of the auditory system is related to the
physical structure of the basilar membrane.• basilar membrane uncoiled :
15
Image by MIT OCW.
100
300750
5001,1002,0003,700
8,700 4,900 2,700
6,500
1,500
11,500
15,400
‘ ’
Diagram of the inner ear removed due to copyright restrictions.
Frequency resolution • In the same way, the structure of the basilar membrane
affects the frequency resolution of the auditory‘spectrogram’.
• An acoustic spectrogram represents the variation in intensityover time in a set of frequency bands.– E.g. a standard broad-band spectrogram might use frequency bands of
0-200 Hz, 200-400 Hz, 400-600 Hz, etc.• The ear represents loudness in frequency bands that are
narrower at low frequencies and wider at high frequencies.– ?-100Hz, 100-200 Hz, …, 400-510 Hz, …1080-1270 Hz, …
12000-15500.– These ‘critical bands’ correspond to a length of about 1.3mm along
the basilar membrane (Fastl & Zwicker 2007: 162)the basilar membrane ( & Zwicker 2007: 162) & Zwicker 2007: 162)
16
100
300750
5001,1002,0003,700
8,700 4,900 2,700
6,500
1,500
11,500
15,400
Image by MIT OCW.
Auditory spectra
17
26242220181614121010
20
30
50
40
60
70
80
8642Number of ERBs, E
Exci
tatio
n Le
vel,
dB
Vowel /I/
The spectrum of a synthetic vowel /I/ (top) plotted on a linear frequency scale,and the excitation patterns for that vowel (bottom) for two overall levels, 50 and80 dB. The excitation patterns are plotted on an ERB scale.
80
50
4,0003,5003,0002,5002,0001,5001,000500040
45
50
55
60
70
65
75
80
85
Frequency, Hz
Vowel /I/
Leve
l, dB
Image by MIT OCW.Adapted from Moore, Brian. The Handbook of Phonetic Science. Edited by William J.
Hardcastle and John Laver. Malden, MA: Blackwell, 1997. ISBN: 9780631188483.
Image by MIT OCW.Adapted from Moore, Brian. The Handbook of Phonetic Science. Edited by William J.
Hardcastle and John Laver. Malden, MA: Blackwell, 1997. ISBN: 9780631188483.
Italian vowels
ie
a
ou
200
400
600
800
50070090011001300150017001900210023002500F2 (Hz)
F1 (
Hz)
ERB scales
6
8
10
12
14
16
151719212325F2 (E)
F1(E
)
E(F1)
18
19
© Wiley-Blackwell. All rights reserved. This content is excluded from our Creative Commonslicense. For more information, see https://ocw.mit.edu/help/faq-fair-use/.Source: Johnson, K. (2011) "Acoustic and Auditory Phonetics, 3rd ed. Wiley-Blackwell, Oxford.
Masking - simultaneous • Energy at one frequency can reduce audibility of
simultaneous energy at another frequency (masking).• Single tone, followed by the same tone and a higher-
frequency tone, repeated with progressive reduction in theintensity of the higher tone– For how many steps can you hear two tones?
• Same pattern, same amplitude tones, but greater difference inthe frequencies of the tones– For how many steps can you hear two tones?
20
Masking - simultaneous • Energy at one frequency can reduce audibility of
simultaneous energy at another frequency (masking).
Stevens (1999)
21
40003600320026002400200016001200
Frequency of masked tone
Mas
king
1000800600400
10
1
102
103
104
Example of masking of a tone by a tone. The frequency of the masking tone is 1200 Hz. Each curve corresponds to adifferent masker level, and gives the amount by which the threshold intensity of the masked tone is multiplied in thepresence of the masker, relative to its threshold in quiet. The dashed lines near 1200 Hz and its harmonics are estimatesof the masking functions in the absence of the effect of beats.
Image by MIT OCW.Adapted from Stevens, Kenneth N. Acoustic Phonetics. Cambridge, MA: MIT Press, 1999. ISBN; 9780262194044.
Time course of auditory nerve response Response to a noise burst: • Strong initial response• Rapid adaptation (~5 ms)• Slow adaptation (>100ms)• After tone offset, firing rate
only gradually returns tospontaneous level.
Kiang et al (1965)
22
256
-60 128
0
256
-40 128
00 64 128
msec
Image by MIT OCW.Adapted from Kiang et al. (1965)
Interactions between sequential sounds
• A preceding sound can affect the auditory nerve responseto a following tone (Delgutte 1980).
23
0 200 200 200
200
400
0
600
400M
AT
NO ATTT = 27 dB SPL
Dis
char
ge ra
te (S
P/S) AT = 13 dB SPL AT = 37 dB SPL
TT
0 400M
0 400M
Image by MIT OCW.Adapted from Stevens, Kenneth N. Acoustic Phonetics. Cambridge, MA: MIT Press, 1999. ISBN; 9780262194044, after Delgutte, B. "Representationof speech-like sounds in the discharge patterns of auditory-nerve fibers." Journal of the Acoustical Society of America 68, no. 3 (1980): 843-857.
24
Courtesy of the MIT Press from Schnupp, Jan, Israel Nelken, and Andrew King. Auditoryneuroscience: Making sense of sound. MIT press, 2011. Used with permission.Source: Schnupp, Nelken & King (2011) "Auditory Neuroscience: Making Senseof Sound. MIT Press.
24.963���
Linguistic Phonetics���Analog-to-digital conversion of
speech signals������
25Image by MIT OCW.
0 .01 .02 .03 s
-Air
pres
sure
+
Time in seconds
Analog-to-digital conversion • Almost all acoustic analysis is now computer-based.• Sound waves are analog (or continuous) signals, but digital
computers require a digital representation - i.e. a series ofnumbers, each with a finite number of digits.
• There are two continuous scales that must be divided intodiscrete steps in analog-to-digital conversion of speech: timeand pressure (or voltage).– Dividing time into discrete chunks is called sampling.– Dividing the amplitude scale into discrete steps is calledquantization.
26Image by MIT OCW.
0 .01 .02 .03 s
- Air
pres
sure
+
Time in seconds
Sampling • The amplitude of the analog
signal is sampled at regularintervals.
• The sampling rate is measuredin Hz (samples per second).
• The higher the sampling rate,the more accurate the digitalrepresentation will be.
27
0 .01 .02 .03Time Sec
0 .01 .02 .03Time Sec
0 .01 .02 .03Time Sec
A wave with a fundamental frequency of 100 Hz and a majorcomponent at 300Hz sampled at 1500Hz.
The same wave sampled at 600 Hz.
(a)
(b)
(c)
Image by MIT OCW.Adapted from Ladefoged, Peter. L104/204 Phonetic Theorylecture notes, University of California, Los Angeles.
Sampling • In order to represent a wave
component of a given frequency, itis necessary to sample the signalwith at least twice that frequency(the Nyquist Theorem).
• The highest frequency that can berepresented at a given sampling rateis called the Nyquist frequency.
• The wave at right has a significantharmonic at 300 Hz– (a) sampling rate 1500 Hz– (b) sampling rate 600 Hz– (c) sampling rate 500 Hz
28
0 .01 .02 .03Time Sec
0 .01 .02 .03Time Sec
0 .01 .02 .03Time Sec
A wave with a fundamental frequency of 100 Hz and a majorcomponent at 300Hz sampled at 1500Hz.
The same wave sampled at 600 Hz.
(a)
(b)
(c)
Image by MIT OCW.Adapted from Ladefoged, Peter. L104/204 Phonetic Theorylecture notes, University of California, Los Angeles.
What sampling rate should you use? • The highest frequency that (young, undamaged) ears can
perceive is about 20 kHz, so to ensure that all audiblefrequencies are represented we must sample at 2×20 kHz =40 kHz.
• The ear is relatively insensitive to frequencies above 10kHz, and almost all of the information relevant to speechsounds is below 10 kHz, so high quality sound is stillobtained at a sampling rate of 20 kHz.
• There is a practical trade-off between fidelity of the signaland memory, but memory is getting cheaper all the time.
29
What sampling rate should you use? • For some purposes (e.g. measuring vowel formants), a high
sampling rate can be a liability, but it is always possible todownsample before performing an analysis.
• Audio CD uses a sampling rate of 44.1 kHz.• Many A-to-D systems only operate at fractions of this rate
(44100 Hz, 22050 Hz, 11025 Hz).• For most purposes, use a sampling rate of 44.1 kHz.
30
Aliasing • Components of a signal which are above the Nyquist
frequency are misrepresented as lower frequencycomponents (aliasing).
• To avoid aliasing, a signal must be filtered to eliminatefrequencies above the Nyquist frequency.
• Since practical filters are not infinitely sharp, this willattenuate energy near to the Nyquist frequency also.
31
1086420Time (ms)
Am
plitu
de
Image by MIT OCW.Adapted from Johnson, Keith. Acoustic and Auditory Phonetics.
Malden, MA: Blackwell Publishers, 1997. ISBN: 9780631188483.
Quantization • The amplitude of the signal at each sampling point must be
specified digitally - quantization.• Divide the continuous amplitude scale into a finite number
of steps. The more levels we use, the more accurately weapproximate the analog signal.
32
20151050
20 steps
200 steps
Time (ms)
Am
plitu
de
Image by MIT OCW.Adapted from Johnson, Keith. Acoustic and Auditory Phonetics.Malden, MA: Blackwell Publishers, 1997. ISBN: 9780631188483.
Quantization • The number of levels is specified in terms of the number of
bits used to encode the amplitude at each sample.– Using n bits we can distinguish 2n levels of amplitude.– e.g. 8 bits, 256 levels.– 16 bits, 65536 levels.
• Now that memory is cheap, speech is almost alwaysdigitized at 16 bits (the CD standard).
33
Quantization • Quantizing an analog signal necessarily introduces
quantization errors.• If the signal level is lower, the degradation in signal-to-
noise ratio introduced by quantization noise will be greater,so digitize recordings at as high a level as possible withoutexceeding the maximum amplitude that can be represented(clipping).
• On the other hand, it is essential to avoid clipping.
34
Am
plitu
de
0 5 10 15 20 25Time (ms)
Image by MIT OCW.Adapted from Johnson, Keith. Acoustic and Auditory Phonetics.
Malden, MA: Blackwell Publishers, 1997. ISBN: 9780631188483.
MIT OpenCourseWarehttps://ocw.mit.edu
24.915 / 24.963 Linguistic PhoneticsFall 2015
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.