24.963 Linguistic Phonetics

Basic Audition

Diagram of the inner ear removed due to copyright restrictions.

•  Reading: Keating 1985

•  24.963: also read Flemming 2001

•  Assignment 1 - basic acoustics. Due 9/22.


Audition

Anatomy

[Figure: anatomy of the ear, divided into outer ear, middle ear, and inner ear. Labelled parts: ear flap, ear canal, eardrum, hammer, anvil, stirrup, Eustachian tube, cochlea, auditory nerve. Image by MIT OCW.]

Auditory ‘spectrograms’

•  The auditory system performs a running frequency analysis of acoustic signals - cf. spectrogram.
•  But the auditory spectrogram differs from a regular spectrogram in significant ways, e.g.:
•  Frequency - a regular spectrogram analyzes frequency bands of equal widths, but the peripheral auditory system analyzes frequency bands that are wider at higher frequencies.
•  Loudness is non-linearly related to intensity.
•  Masking effects (simultaneous and non-simultaneous).
•  It is useful to bear these differences in mind when looking at acoustic spectrograms.
•  It is possible to generate ‘auditory spectrograms’ based on models of the auditory system.

Loudness

•  The perceived loudness of a sound depends on the amplitude of the pressure fluctuations in the sound wave.
•  Amplitude is usually measured in terms of root-mean-square amplitude (rms amplitude):
–  The square root of the mean of the squared amplitude over some time window.

rms amplitude

•  Square each sample in the analysis window.
•  Calculate the mean value of the squared waveform:
–  Sum the values of the squared samples and divide by the number of samples.
•  Take the square root of the mean.
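The three steps above can be sketched in a few lines of Python. This is only an illustration: the function name `rms_amplitude` and the test signal (one period of a sine wave with peak amplitude 1.0) are assumptions, not part of the course materials.

```python
import math

def rms_amplitude(samples):
    """Root-mean-square amplitude of the samples in one analysis window."""
    if not samples:
        raise ValueError("analysis window is empty")
    squared = [s * s for s in samples]          # square each sample
    mean_square = sum(squared) / len(squared)   # mean of the squared waveform
    return math.sqrt(mean_square)               # square root of the mean

# Hypothetical test signal: one period of a sine wave with peak amplitude 1.0.
n = 1000
window = [math.sin(2 * math.pi * k / n) for k in range(n)]
print(rms_amplitude(window))
```

For a sine wave the rms amplitude comes out at the peak amplitude divided by the square root of 2, which is why rms values are smaller than peak values for the same signal.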

[Figure: pressure, squared pressure (pressure^2), and rms amplitude plotted against time, from 0 to 0.2.]

[Figure: rms amplitude of a waveform; x-axis: time in seconds. Image by MIT OCW. Adapted from Johnson, Keith. Acoustic and Auditory Phonetics. Malden, MA: Blackwell Publishers, 1997. ISBN: 9780631188483.]

Intensity

•  Perceived loudness is more closely related to intensity (power per unit area), which is proportional to the square of the amplitude.
•  relative intensity in Bels = $\log_{10}(x^2/r^2)$
•  relative intensity in dB = $10 \log_{10}(x^2/r^2) = 20 \log_{10}(x/r)$
•  In absolute intensity measurements, the comparison amplitude is usually 20 µPa, the lowest audible pressure fluctuation of a 1000 Hz tone (dB SPL).
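As a rough sketch of the dB formulas above, the following Python computes relative intensity from two amplitudes and dB SPL from an rms pressure. The function names and the constant name `P_REF` are illustrative assumptions. Pressures are taken to be in pascals, so the 20 µPa reference is written as 20e-6.

```python
import math

P_REF = 20e-6  # reference amplitude for dB SPL: 20 µPa, expressed in pascals

def relative_intensity_db(x, r):
    """Relative intensity in dB of amplitude x with respect to amplitude r."""
    return 20 * math.log10(x / r)

def db_spl(rms_pressure_pa):
    """Absolute level in dB SPL of an rms pressure given in pascals."""
    return relative_intensity_db(rms_pressure_pa, P_REF)

# The two forms of the formula agree: 10 log10(x^2/r^2) = 20 log10(x/r).
x, r = 0.3, 0.02
print(10 * math.log10(x**2 / r**2), relative_intensity_db(x, r))

# Doubling the amplitude adds about 6 dB. A tenfold increase adds 20 dB.
print(relative_intensity_db(2, 1), relative_intensity_db(10, 1))
```

A pressure equal to the reference gives 0 dB SPL, since the logarithm of 1 is 0.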

logarithmic scales

•  log xy = log x + log y

[Figure: log(x) plotted against x, for x from 0 to 50.]

Loudness

•  The relationship between intensity and perceived loudness of a pure tone is not exactly logarithmic.
–  Loudness of a pure tone (> 40 dB) in Sones: $N = 2^{(\mathrm{dB} - 40)/10}$
–  Loudness is defined to be 1 sone for a 1000 Hz tone at 40 dB SPL.
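The sone formula above can be tried out directly. This is a minimal sketch: the function name `sones` is an assumption, and the formula is only claimed by the slide for pure tones above 40 dB.

```python
def sones(db_spl):
    """Loudness in sones of a pure tone above 40 dB SPL: N = 2 ** ((dB - 40) / 10)."""
    return 2 ** ((db_spl - 40) / 10)

# Each 10 dB increase doubles the loudness in sones.
for level in (40, 50, 60, 70, 80):
    print(level, "dB SPL ->", sones(level), "sones")
```

On this scale a 10 dB step, which is a tenfold increase in intensity, corresponds only to a doubling of perceived loudness.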

[Figure: dB SPL and sones plotted against pressure (µPa), for pressures from 0 to 2,000,000 µPa. Image by MIT OCW. Adapted from Johnson, Keith. Acoustic and Auditory Phonetics. Malden, MA: Blackwell Publishers, 1997. ISBN: 9780631188483.]

Loudness

• Pure tones: 131-2092 Hz
• Which sounds loudest?

Loudness

• Loudness also depends on frequency.
• Equal loudness contours for pure tones:

© Springer. All rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.

Loudness

• At short durations, loudness also depends on duration.
• Temporal integration: loudness depends on energy in the signal, integrated over a time window.
• Duration of integration is often said to be about 200 ms, i.e. relevant to the perceived loudness of vowels.

Pitch

• Perceived pitch is approximately linear with respect to frequency from 100-1000 Hz; between 1000-10,000 Hz the relationship is approximately logarithmic.

[Figure: frequency (Bark) plotted against frequency (kHz), from 0 to 10 kHz.]



Pitch

• The non-linear frequency response of the auditory system is related to the physical structure of the basilar membrane.
• Basilar membrane ‘uncoiled’:

[Figure: the basilar membrane, with positions labelled by frequency from 100 to 15,400. Image by MIT OCW.]

Diagram of the inner ear removed due to copyright restrictions.

Frequency resolution

• In the same way, the structure of the basilar membrane affects the frequency resolution of the auditory ‘spectrogram’.
• An acoustic spectrogram represents the variation in intensity over time in a set of frequency bands.
– E.g. a standard broad-band spectrogram might use frequency bands of 0-200 Hz, 200-400 Hz, 400-600 Hz, etc.
• The ear represents loudness in frequency bands that are narrower at low frequencies and wider at high frequencies.
– ?-100 Hz, 100-200 Hz, …, 400-510 Hz, …, 1080-1270 Hz, …, 12000-15500 Hz.
– These ‘critical bands’ correspond to a length of about 1.3 mm along the basilar membrane (Fastl & Zwicker 2007: 162).

[Figure: the basilar membrane, with positions labelled by frequency from 100 to 15,400. Image by MIT OCW.]

Auditory spectra

[Figure: the spectrum of a synthetic vowel /I/ (top) plotted on a linear frequency scale, and the excitation patterns for that vowel (bottom) for two overall levels, 50 and 80 dB. The excitation patterns are plotted on an ERB scale. Top panel: level (dB) against frequency (Hz). Bottom panel: excitation level (dB) against number of ERBs, E. Image by MIT OCW. Adapted from Moore, Brian. The Handbook of Phonetic Science. Edited by William J. Hardcastle and John Laver. Malden, MA: Blackwell, 1997. ISBN: 9780631188483.]

Italian vowels

[Figure: the vowels i, e, a, o, u plotted by F1 (Hz) against F2 (Hz).]

ERB scales

[Figure: vowels plotted by F1 (E) against F2 (E).]

© Wiley-Blackwell. All rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/. Source: Johnson, K. (2011) Acoustic and Auditory Phonetics, 3rd ed. Wiley-Blackwell, Oxford.
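To give a feel for what replotting formants on an ERB scale involves, here is a small sketch that converts frequencies in Hz to ERB numbers. The slides do not say which conversion formula was used for the figures. The one below is the Glasberg & Moore (1990) approximation, chosen here as an assumption, and the function name `erb_rate` is likewise illustrative.

```python
import math

def erb_rate(f_hz):
    """Number of ERBs, E, below frequency f_hz (Glasberg & Moore 1990 approximation).

    Assumption: the slides may rely on a different ERB formula.
    """
    return 21.4 * math.log10(0.00437 * f_hz + 1)

# The same 100 Hz step spans fewer ERBs at high frequencies than at low ones.
low_step = erb_rate(200) - erb_rate(100)
high_step = erb_rate(2100) - erb_rate(2000)
print(low_step, high_step)
```

This compression of the high frequencies is why F2 differences between front vowels look smaller on an ERB plot than on a plot in Hz.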

Masking - simultaneous

• Energy at one frequency can reduce audibility of simultaneous energy at another frequency (masking).
• Single tone, followed by the same tone and a higher-frequency tone, repeated with progressive reduction in the intensity of the higher tone.
– For how many steps can you hear two tones?
• Same pattern, same amplitude tones, but greater difference in the frequencies of the tones.
– For how many steps can you hear two tones?

Masking - simultaneous

• Energy at one frequency can reduce audibility of simultaneous energy at another frequency (masking).

[Figure (Stevens 1999): masking plotted against frequency of masked tone. Example of masking of a tone by a tone. The frequency of the masking tone is 1200 Hz. Each curve corresponds to a different masker level, and gives the amount by which the threshold intensity of the masked tone is multiplied in the presence of the masker, relative to its threshold in quiet. The dashed lines near 1200 Hz and its harmonics are estimates of the masking functions in the absence of the effect of beats. Image by MIT OCW. Adapted from Stevens, Kenneth N. Acoustic Phonetics. Cambridge, MA: MIT Press, 1999. ISBN: 9780262194044.]

Time course of auditory nerve response

Response to a noise burst:
• Strong initial response
• Rapid adaptation (~5 ms)
• Slow adaptation (>100 ms)
• After tone offset, firing rate only gradually returns to spontaneous level.

[Figure: time course of the auditory nerve response; x-axis: time (msec). Image by MIT OCW. Adapted from Kiang et al. (1965).]

Interactions between sequential sounds

• A preceding sound can affect the auditory nerve response to a following tone (Delgutte 1980).

[Figure: discharge rate (SP/S) over time in three conditions: no AT, AT = 13 dB SPL, and AT = 37 dB SPL, with TT = 27 dB SPL. Image by MIT OCW. Adapted from Stevens, Kenneth N. Acoustic Phonetics. Cambridge, MA: MIT Press, 1999. ISBN: 9780262194044, after Delgutte, B. "Representation of speech-like sounds in the discharge patterns of auditory-nerve fibers." Journal of the Acoustical Society of America 68, no. 3 (1980): 843-857.]

Courtesy of the MIT Press from Schnupp, Jan, Israel Nelken, and Andrew King. Auditory Neuroscience: Making Sense of Sound. MIT Press, 2011. Used with permission.

24.963 Linguistic Phonetics

Analog-to-digital conversion of speech signals

[Figure: waveform of air pressure against time in seconds, from 0 to .03 s. Image by MIT OCW.]

Analog-to-digital conversion

• Almost all acoustic analysis is now computer-based.
• Sound waves are analog (or continuous) signals, but digital computers require a digital representation - i.e. a series of numbers, each with a finite number of digits.
• There are two continuous scales that must be divided into discrete steps in analog-to-digital conversion of speech: time and pressure (or voltage).
– Dividing time into discrete chunks is called sampling.
– Dividing the amplitude scale into discrete steps is called quantization.

[Figure: waveform of air pressure against time in seconds, from 0 to .03 s. Image by MIT OCW.]

Sampling

• The amplitude of the analog signal is sampled at regular intervals.
• The sampling rate is measured in Hz (samples per second).
• The higher the sampling rate, the more accurate the digital representation will be.

[Figure, panels (a), (b), (c), each plotted against time in seconds from 0 to .03: a wave with a fundamental frequency of 100 Hz and a major component at 300 Hz sampled at 1500 Hz; the same wave sampled at 600 Hz. Image by MIT OCW. Adapted from Ladefoged, Peter. L104/204 Phonetic Theory lecture notes, University of California, Los Angeles.]

Sampling

• In order to represent a wave component of a given frequency, it is necessary to sample the signal at a rate of at least twice that frequency (the Nyquist Theorem).
• The highest frequency that can be represented at a given sampling rate, i.e. half the sampling rate, is called the Nyquist frequency.
• The wave at right has a significant harmonic at 300 Hz:
– (a) sampling rate 1500 Hz
– (b) sampling rate 600 Hz
– (c) sampling rate 500 Hz

[Figure, panels (a), (b), (c), each plotted against time in seconds from 0 to .03: a wave with a fundamental frequency of 100 Hz and a major component at 300 Hz sampled at 1500 Hz; the same wave sampled at 600 Hz. Image by MIT OCW. Adapted from Ladefoged, Peter. L104/204 Phonetic Theory lecture notes, University of California, Los Angeles.]
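The Nyquist criterion for the three sampling rates on this slide can be checked with a short sketch. The function names are illustrative assumptions. The comparison uses "at least twice", following the wording of the slide.

```python
def nyquist_frequency(sampling_rate_hz):
    """Highest frequency representable at the given sampling rate."""
    return sampling_rate_hz / 2

def can_represent(component_hz, sampling_rate_hz):
    """True if the sampling rate is at least twice the component frequency."""
    return component_hz <= nyquist_frequency(sampling_rate_hz)

# The 300 Hz harmonic at the three sampling rates from the slide.
for rate in (1500, 600, 500):
    print(rate, "Hz:", "Nyquist =", nyquist_frequency(rate),
          "Hz, 300 Hz representable:", can_represent(300, rate))
```

At 500 Hz the Nyquist frequency is 250 Hz, so the 300 Hz component falls above it and cannot be represented correctly.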

What sampling rate should you use?

• The highest frequency that (young, undamaged) ears can perceive is about 20 kHz, so to ensure that all audible frequencies are represented we must sample at 2 × 20 kHz = 40 kHz.
• The ear is relatively insensitive to frequencies above 10 kHz, and almost all of the information relevant to speech sounds is below 10 kHz, so high quality sound is still obtained at a sampling rate of 20 kHz.
• There is a practical trade-off between fidelity of the signal and memory, but memory is getting cheaper all the time.

What sampling rate should you use?

• For some purposes (e.g. measuring vowel formants), a high sampling rate can be a liability, but it is always possible to downsample before performing an analysis.
• Audio CD uses a sampling rate of 44.1 kHz.
• Many A-to-D systems only operate at fractions of this rate (44100 Hz, 22050 Hz, 11025 Hz).
• For most purposes, use a sampling rate of 44.1 kHz.

Aliasing

• Components of a signal which are above the Nyquist frequency are misrepresented as lower frequency components (aliasing).
• To avoid aliasing, a signal must be filtered to eliminate frequencies above the Nyquist frequency.
• Since practical filters are not infinitely sharp, this will attenuate energy near to the Nyquist frequency also.

[Figure: amplitude plotted against time (ms), from 0 to 10 ms.]
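To make the idea of aliasing concrete, the sketch below computes the apparent frequency of an undersampled sinusoid by folding it back into the range from 0 to the Nyquist frequency, and then confirms that the samples really are indistinguishable. The function name `alias_frequency` and the choice of a cosine test signal are assumptions made for illustration.

```python
import math

def alias_frequency(f_hz, sampling_rate_hz):
    """Apparent frequency of a sinusoid at f_hz after sampling at sampling_rate_hz."""
    f = f_hz % sampling_rate_hz              # samples cannot tell f from f + k * rate
    return min(f, sampling_rate_hz - f)      # fold into the range 0..Nyquist

rate = 500  # Hz, so the Nyquist frequency is 250 Hz
apparent = alias_frequency(300, rate)
print("300 Hz sampled at", rate, "Hz appears as", apparent, "Hz")

# The samples of a 300 Hz cosine and of a cosine at the alias frequency coincide.
original = [math.cos(2 * math.pi * 300 * k / rate) for k in range(20)]
aliased = [math.cos(2 * math.pi * apparent * k / rate) for k in range(20)]
print(max(abs(a - b) for a, b in zip(original, aliased)))
```

Because the two sets of samples are identical, nothing can be done after sampling to recover the original component, which is why the filtering has to happen before the signal is sampled.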



Quantization

• The amplitude of the signal at each sampling point must be specified digitally - quantization.
• Divide the continuous amplitude scale into a finite number of steps. The more levels we use, the more accurately we approximate the analog signal.

[Figure: a waveform quantized with 20 steps and with 200 steps; amplitude against time (ms), from 0 to 20 ms. Image by MIT OCW. Adapted from Johnson, Keith. Acoustic and Auditory Phonetics. Malden, MA: Blackwell Publishers, 1997. ISBN: 9780631188483.]

Quantization

• The number of levels is specified in terms of the number of bits used to encode the amplitude at each sample.
– Using n bits we can distinguish $2^n$ levels of amplitude.
– e.g. 8 bits, 256 levels.
– 16 bits, 65536 levels.
• Now that memory is cheap, speech is almost always digitized at 16 bits (the CD standard).
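The relation between bits and levels, and the mapping of an amplitude onto one of those levels, can be sketched as follows. The function names are illustrative. The signed integer range (for 16 bits, -32768 to 32767) and the input range of -1.0 to 1.0 are assumptions about how the levels are laid out, not something the slides specify.

```python
def quantization_levels(bits):
    """Number of distinct amplitude levels available with the given number of bits."""
    return 2 ** bits

def quantize(x, bits):
    """Map amplitude x (nominal range -1.0..1.0) to the nearest signed integer level."""
    half = 2 ** (bits - 1)                     # e.g. 32768 for 16 bits
    level = round(x * half)
    return max(-half, min(half - 1, level))    # keep within the representable range

for bits in (8, 16):
    print(bits, "bits:", quantization_levels(bits), "levels")

print(quantize(0.5, 8), quantize(0.5, 16))
```

With 8 bits the step between adjacent levels is 256 times larger than with 16 bits, so the rounding error at each sample is correspondingly larger.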

Quantization

• Quantizing an analog signal necessarily introduces quantization errors.
• If the signal level is lower, the degradation in signal-to-noise ratio introduced by quantization noise will be greater, so digitize recordings at as high a level as possible without exceeding the maximum amplitude that can be represented (clipping).
• On the other hand, it is essential to avoid clipping.

[Figure: amplitude plotted against time (ms), from 0 to 25 ms.]
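The trade-off between recording level, quantization noise, and clipping can be illustrated with a small simulation. This is a sketch under stated assumptions: the test signal is a sine wave with an arbitrary frequency (0.1 radians per sample), the representable range is taken to be -1.0 to 1.0, and the function names are hypothetical.

```python
import math

def quantize(x, bits):
    """Round x (nominal range -1.0..1.0) to one of 2**bits levels, clipping at the extremes."""
    half = 2 ** (bits - 1)
    level = max(-half, min(half - 1, round(x * half)))
    return level / half

def snr_db(peak, bits, n=20000):
    """Signal-to-noise ratio in dB for a quantized sine wave of the given peak amplitude."""
    signal_power = 0.0
    noise_power = 0.0
    for k in range(n):
        x = peak * math.sin(0.1 * k)      # hypothetical test signal
        error = quantize(x, bits) - x     # quantization (or clipping) error
        signal_power += x * x
        noise_power += error * error
    return 10 * math.log10(signal_power / noise_power)

# A high recording level, a low recording level, and a level that clips (8 bits).
for peak in (0.9, 0.09, 1.5):
    print("peak", peak, "-> SNR", round(snr_db(peak, 8), 1), "dB")
```

The low-level recording should show a clearly worse signal-to-noise ratio than the high-level one, and the clipped recording should be worse still, which is the reasoning behind recording as high as possible while avoiding clipping.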



MIT OpenCourseWare
https://ocw.mit.edu

24.915 / 24.963 Linguistic Phonetics
Fall 2015

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
