
Page 1: Speech and Audio Processing  and Coding (cont.)

1

Speech and Audio Processing and Coding (cont.)

Dr Wenwu Wang

Centre for Vision Speech and Signal Processing

Department of Electronic Engineering

[email protected]

http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html

Page 2: Speech and Audio Processing  and Coding (cont.)

Timbre Perception

(Ack. S. Zielinski)

2

Page 3: Speech and Audio Processing  and Coding (cont.)

What is Timbre?

According to the American Standards Association, timbre is defined as “that attribute of sensation in terms of which a listener can judge that two sounds having the same loudness and pitch are dissimilar”.

Musically, it is “the quality of a musical note which distinguishes different types of musical instruments.”

It can be defined as “everything that is not loudness, pitch or spatial perception”.

• Loudness <-> Amplitude (frequency dependent)

• Pitch <-> Fundamental frequency

• Spatial perception <-> IID (interaural intensity difference), IPD (interaural phase difference)

• Timbre <-> ???

3

Page 4: Speech and Audio Processing  and Coding (cont.)

Physical Parameters

Timbre relates to:

• Static spectrum (e.g. harmonic content of spectrum)

• Envelope of spectrum (e.g. the peaks in the LPC spectrum, which correspond to formants)

• Dynamic spectrum (time evolving)

• Phase

• …

4

Page 5: Speech and Audio Processing  and Coding (cont.)

Static Spectrum

5

Page 6: Speech and Audio Processing  and Coding (cont.)

Spectrum Envelope

Formants affect the sensation of timbre.

6

Page 7: Speech and Audio Processing  and Coding (cont.)

Spectrum Envelope (cont)

Formants determine not only timbre but also the recognition of vowels.

7

Page 8: Speech and Audio Processing  and Coding (cont.)

Spectrum Envelope (cont)

This figure shows what the spectral envelope of a trumpet sound looks like.

8

Page 9: Speech and Audio Processing  and Coding (cont.)

Spectrum Envelope (cont)

The spectral envelopes of the flute (upper figure) and the piano (lower figure) show that the envelope differs from one musical instrument to another.

9

Page 10: Speech and Audio Processing  and Coding (cont.)

Dynamic Spectrum

This figure shows what the spectral envelope of a trumpet sound looks like.

10

Page 11: Speech and Audio Processing  and Coding (cont.)

Phase

The two magnitude spectra above are identical, while their waveforms are totally different. The timbres of these two sounds are almost identical, so phase affects timbre only to a very small extent. This also suggests that human hearing is not very sensitive to phase differences.

11
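The phase observation can be checked numerically. The sketch below is only an illustration: the harmonic amplitudes and the phase values are hypothetical, not taken from the figure. It builds two periodic waveforms from the same harmonic amplitudes with different phases, then compares their DFT magnitudes and their peak amplitudes.

```python
import math

def harmonic_wave(amplitudes, phases, n_samples=256):
    """One period of a sum of harmonics; harmonic k+1 uses amplitudes[k], phases[k]."""
    return [
        sum(a * math.cos(2 * math.pi * (k + 1) * n / n_samples + p)
            for k, (a, p) in enumerate(zip(amplitudes, phases)))
        for n in range(n_samples)
    ]

def dft_magnitude(signal, k):
    """Magnitude of DFT bin k, computed directly from the DFT definition."""
    n_samples = len(signal)
    re = sum(x * math.cos(2 * math.pi * k * n / n_samples) for n, x in enumerate(signal))
    im = -sum(x * math.sin(2 * math.pi * k * n / n_samples) for n, x in enumerate(signal))
    return math.hypot(re, im)

amplitudes = [1.0, 0.5, 0.33, 0.25, 0.2]  # hypothetical harmonic amplitudes
phases_a = [0.0] * 5                      # all harmonics in cosine phase
phases_b = [0.0, 1.3, 2.9, 0.7, 4.1]      # arbitrary phases (assumed values)

wave_a = harmonic_wave(amplitudes, phases_a)
wave_b = harmonic_wave(amplitudes, phases_b)

# Same magnitude at every harmonic bin, regardless of phase.
mags_a = [dft_magnitude(wave_a, k) for k in range(1, 6)]
mags_b = [dft_magnitude(wave_b, k) for k in range(1, 6)]

# The waveforms themselves differ, e.g. in their peak amplitude.
peak_a = max(abs(x) for x in wave_a)
peak_b = max(abs(x) for x in wave_b)

print([round(m, 3) for m in mags_a])
print([round(m, 3) for m in mags_b])
print(round(peak_a, 3), round(peak_b, 3))
```

The two magnitude lists agree to within rounding error while the peak amplitudes differ, which is the situation the slide describes: identical magnitude spectra, different waveforms.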

Page 12: Speech and Audio Processing  and Coding (cont.)

Demos for Timbre Perception

Resources: Audio Box CD from Univ. of Victoria

Examples of differences in timbres

12

Page 13: Speech and Audio Processing  and Coding (cont.)

Auditory Masking

13

Page 14: Speech and Audio Processing  and Coding (cont.)

What is masking?

Masking: One sound is made inaudible by another one.

• Simultaneous masking refers to the situation where one sound (the signal) is made inaudible by another sound (the masker) presented at the same time. It is also known as frequency masking or spectral masking: two sounds that share the same frequency band, such as tones at 440 Hz and 450 Hz, can be perceived clearly when presented separately but not when presented simultaneously.

• Non-simultaneous masking refers to the situation where one sound (the signal) is made inaudible by another sound (the masker) that precedes or follows the signal. In other words, the signal and the masker are not present at the same time.

14

Page 15: Speech and Audio Processing  and Coding (cont.)

What is masking? (cont)

15

Page 16: Speech and Audio Processing  and Coding (cont.)

Simultaneous Masking

On-frequency masking

The masker and the signal are within the same auditory filter band, and the louder sound masks the quieter one.

Off-frequency masking

The masker and the signal are in different frequency bands. The masking effect is weaker than in on-frequency masking.

(Source: figures from Wikipedia, 2010)

16

Page 17: Speech and Audio Processing  and Coding (cont.)

Simultaneous Masking (cont)

In off-frequency masking, the masker raises the threshold of the signal by much less than in on-frequency masking. It does, however, still have some masking effect on the signal, as shown in the figure above.

To achieve the same masking effect as in on-frequency masking, the masker level needs to be higher in off-frequency masking.

(Source: figures from Wikipedia, 2010)

17

Page 18: Speech and Audio Processing  and Coding (cont.)

Demos for Simultaneous Masking (Frequency Domain Masking)

Resources: Audio Box CD from Univ. of Victoria

A single tone is played, followed by the same tone and a higher frequency tone. The higher frequency tone is reduced in intensity first by 12 dB, then by steps of 5 dB.  The sequence is repeated twice.  The second time the frequency separation between the tones is increased.

Pure tones mask higher frequencies better than lower frequencies. This demo tries to mask high frequencies.

This demo shows that a tone of greater intensity masks a broader range of tones than a tone of lower intensity. A single tone is played, followed by the same tone and a higher frequency tone. The higher frequency tone is reduced in intensity first by 10 dB, then by steps of 3 dB. The sequence above is repeated twice, the second time increasing the intensity of the single tone by 28 dB.

Pure tones mask higher frequencies better than lower frequencies. This demo tries to mask low frequencies.

18

Page 19: Speech and Audio Processing  and Coding (cont.)

The Amount of Masking

In the example above, the amount of masking is 16 dB, which is the difference between the masked threshold and the unmasked threshold. Note that the threshold for a signal that is masked is raised compared with the threshold when the signal is not masked (for example, when the signal is heard in a quiet environment).

(Source: figures from Wikipedia, 2010)

19
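As a small worked example of this definition, the sketch below computes the amount of masking as a difference of two thresholds in dB. The two threshold values are hypothetical, chosen only so that the difference equals the 16 dB quoted on the slide; the figure's actual thresholds are not reproduced here.

```python
def amount_of_masking_db(masked_threshold_db, unmasked_threshold_db):
    """Amount of masking = masked threshold minus unmasked (quiet) threshold, in dB."""
    return masked_threshold_db - unmasked_threshold_db

# Assumed values for illustration only (not read from the figure).
unmasked = 10.0  # dB SPL, threshold of the signal in quiet
masked = 26.0    # dB SPL, threshold of the signal with the masker present

print(amount_of_masking_db(masked, unmasked))  # 16.0
```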

Page 20: Speech and Audio Processing  and Coding (cont.)

Masking Reveals the Frequency Resolution of the Auditory System

Frequency selectivity, also known as frequency resolution, refers to the ability of the human auditory system to separate the different frequency components of a complex sound. Recall the concept of the critical bandwidth: two sounds whose frequencies are sufficiently far apart can be heard as two separate tones.

It is achieved by the filtering process of the cochlea, where the complex sound is (band-pass) filtered and decomposed into individual frequency components (sinusoids), which are then coded independently in the auditory nerve.

Masking is usually used to quantify and characterise the frequency resolution of the auditory system. The auditory system would not be able to separate the two frequencies if the sound of one frequency is masked by that of the other. Therefore, masking explains the limits of frequency resolution of the human auditory system.

20

Page 21: Speech and Audio Processing  and Coding (cont.)

Use Masking to Estimate the Critical Band

The original experiment by Fletcher (1940) measured the threshold for detecting a sinusoidal signal as a function of the bandwidth of a bandpass noise masker.

Conditions: The noise was centred at the signal frequency. Noise power density was constant.

Findings: At first, the threshold increases as the noise bandwidth increases. However, it flattens off with further increases in noise bandwidth. This is due to the critical bandwidth: once the noise bandwidth exceeds the bandwidth of the auditory filter, the threshold ceases to increase even though the total noise power keeps increasing.

The power-spectrum model of masking assumes (Moore, 1995):

• The auditory system is a bank of linear overlapping band-pass filters.

• A single filter with a centre frequency close to that of the signal is used to detect the signal.

• The signal is only masked by the noise component that passes through the auditory filter.

• The threshold corresponds to a certain signal-to-noise (masker) ratio.

21
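Fletcher's band-widening result follows directly from these assumptions. The sketch below is a simplified illustration that additionally assumes an ideal rectangular auditory filter; the critical bandwidth, noise spectrum level and threshold signal-to-noise ratio are hypothetical values, not data from the experiment.

```python
import math

def masked_threshold_db(noise_bandwidth_hz, spectrum_level_db,
                        critical_bandwidth_hz, snr_at_threshold_db=0.0):
    """Signal threshold under the power-spectrum model with a rectangular filter.

    Only the noise inside the auditory filter contributes, so the effective
    noise bandwidth is capped at the critical bandwidth.
    """
    effective_bw = min(noise_bandwidth_hz, critical_bandwidth_hz)
    noise_power_db = spectrum_level_db + 10 * math.log10(effective_bw)
    return noise_power_db + snr_at_threshold_db

CB = 160.0  # hypothetical critical bandwidth in Hz
N0 = 20.0   # hypothetical noise spectrum level in dB per Hz

for bw in (20, 40, 80, 160, 320, 640):
    print(bw, round(masked_threshold_db(bw, N0, CB), 2))
```

Below the critical bandwidth the threshold rises by about 3 dB per doubling of noise bandwidth; above it the threshold stays constant, reproducing the flattening described in the findings.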

Page 22: Speech and Audio Processing  and Coding (cont.)

Psychophysical Tuning Curves

Psychophysical tuning curves (PTCs) are a method for estimating the shape of the auditory filter. The PTCs above were determined in simultaneous masking, using sinusoidal signals at 10 dB SPL. For each curve, the diamond below it shows the frequency and the level of the signal. The masker was a sinusoid that had a fixed starting phase relationship to the signal. The masker level required for threshold (i.e. just masking the signal) is plotted as a function of masker frequency on a logarithmic scale. The dashed line represents the absolute threshold for the signal. Figure from (Moore, 1995).

22

Page 23: Speech and Audio Processing  and Coding (cont.)

Shape of Auditory Filter

The shape of the auditory filter centred at 1 kHz, plotted for input sound levels ranging from 20 to 90 dB SPL/ERB. The output level of the filter is plotted as a function of frequency. On the low-frequency side, the filter becomes progressively less sharply tuned with increasing sound level. On the high-frequency side, the sharpness of tuning increases slightly with increasing sound level. At moderate sound levels the filter is approximately symmetric on the linear frequency scale used. Figure from (Moore, 1995).

23

Page 24: Speech and Audio Processing  and Coding (cont.)

Bark Scale

Proposed in 1961 by Eberhard Zwicker, named after Heinrich Barkhausen who proposed the first subjective measurement of loudness.

The scale ranges from 1 to 24 and corresponds to the first 24 critical bands of hearing. The successive band edges (in Hz) are 20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500.

$$\text{Bark} = 13\arctan(0.00076\,f) + 3.5\arctan\!\left((f/7500)^2\right)$$

where $f$ is the frequency in Hz.
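The conversion Bark = 13 arctan(0.00076 f) + 3.5 arctan((f/7500)^2), with f in Hz, and the band-edge table can be sketched in a few lines. The function names `hz_to_bark` and `critical_band_index` are illustrative choices, not part of any standard library.

```python
import math
from bisect import bisect_right

# Band edges in Hz as listed on the slide: 25 edges delimit 24 critical bands.
BAND_EDGES_HZ = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500, 12000, 15500]

def hz_to_bark(f_hz):
    """Approximate Bark value for a frequency in Hz (arctan formula above)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def critical_band_index(f_hz):
    """Band number (1 to 24) containing f_hz, or None if outside the table."""
    if not BAND_EDGES_HZ[0] <= f_hz < BAND_EDGES_HZ[-1]:
        return None
    return bisect_right(BAND_EDGES_HZ, f_hz)

print(round(hz_to_bark(1000), 2), critical_band_index(1000))
print(critical_band_index(440), critical_band_index(450))
```

Tones at 440 Hz and 450 Hz land in the same band of the table (400 to 510 Hz), which is consistent with the earlier remark that two such tones are hard to hear separately when played together.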

24

Page 25: Speech and Audio Processing  and Coding (cont.)

Non-Simultaneous Masking

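Forward masking

[Figure: along the time axis, the masking tone comes first and the masked tone follows after a gap T.] T cannot be as long as 20-30 ms.

Backward masking

[Figure: along the time axis, the masked tone comes first and the masking tone follows after a gap T.] T cannot be more than 10 ms.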

25

Page 26: Speech and Audio Processing  and Coding (cont.)

Forward Masking

The left figure shows the amount of forward masking of a 2kHz signal as a function of the time delay between the signal and the end of the noise masker. Each curve represents a different noise level. The results for each spectrum level fall on a straight line when the signal delay is plotted on a logarithmic scale. The right figure shows the same thresholds plotted as a function of the masker level. The slopes of these growth of masking functions decrease with increasing signal delay. Figures from (Moore, 1995)

26

Page 27: Speech and Audio Processing  and Coding (cont.)

Forward Masking

Forward masking is greater the nearer in time to the masker the signal occurs.

Increments in masker level do not produce equal increments in amount of forward masking, i.e. the slope of the growth of masking function is less than 1, which is in contrast to the simultaneous masking where the slope is close to 1.
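These two properties of forward masking can be illustrated with a simple parametric sketch. The functional form and the constants `a`, `b` and `c` below are assumptions chosen only to reproduce the stated behaviour (masking linear in log delay, growth slope below 1 and decreasing with delay); they are not taken from the slides or fitted to the data in the figures.

```python
import math

def growth_slope(delay_ms, a=0.2, b=2.3):
    """Slope of the growth-of-masking function (dB of masking per dB of masker)."""
    return a * (b - math.log10(delay_ms))

def forward_masking_db(masker_level_db, delay_ms, a=0.2, b=2.3, c=10.0):
    """Illustrative amount of forward masking in dB.

    a, b, c are hypothetical constants; the form is only meaningful for
    delays below 10**b ms and masker levels above c dB.
    """
    return max(0.0, growth_slope(delay_ms, a, b) * (masker_level_db - c))

for delay in (10, 20, 40):
    print(delay, round(growth_slope(delay), 3),
          round(forward_masking_db(60.0, delay), 2))
```

With these assumed constants, each doubling of the delay removes the same number of dB of masking, and the slope stays well below 1 at every delay, in contrast to the slope near 1 for simultaneous masking.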

27

Page 28: Speech and Audio Processing  and Coding (cont.)

PTCs Comparisons

Comparison of the psychophysical tuning curves determined by simultaneous masking (triangles) and forward masking (squares). The masker frequency is expressed as its deviation from the centre frequency divided by the centre frequency. The unit for the centre frequency is kHz. Figures from (Moore et al., 1984).

28

Page 29: Speech and Audio Processing  and Coding (cont.)

Demos for Non-simultaneous Masking (Time Domain Masking)

Resources: Audio Box CD from Univ. of Victoria

Forward masking: a masking tone is played, followed after a 100 ms delay by a tone that is a semitone lower. Both tones can be heard even as the second tone is decreased in 3 dB steps.

Forward masking: a masking tone is played, followed after a 10 ms delay by a tone that is a semitone lower. Masking occurs in this demo. How many steps are audible before the second tone is masked?

Backward masking: the initial tone is masked by the one that follows. The time delay is 100ms.

Backward masking: the initial tone is masked by the one that follows. The time delay is decreased but is still more than 10 ms.

Backward masking: the initial tone is masked by the one that follows. The time delay is below 10ms. Masking occurs. How many steps are audible?

29

Page 30: Speech and Audio Processing  and Coding (cont.)

Examples of Modern Audio Formats

MP3: MPEG-1 or MPEG-2 Audio Layer 3 (or III) is a patented lossy audio codec. It is a common audio format for consumer audio storage, as well as a standard of digital audio compression for the transfer and playback of music on digital audio players.

Ogg Vorbis: a lossy audio codec developed by the Xiph.Org Foundation (formerly Xiphophorus company). Free and open source.

AAC: Advanced Audio Coding, an audio compression format specified by MPEG-2 and MPEG-4, and successor to MPEG-1’s “MP3” format.

WMA: Windows Media Audio, is an audio codec developed by Microsoft.

MPEG-1 Layer II or MPEG-2 Audio Layer II (MP2): a lossy audio compression format defined by ISO/IEC 11172-3 alongside MPEG-1 Audio Layer I and MPEG-1 Audio Layer III (MP3). While MP3 is much more popular for PC and internet applications, MP2 remains a dominant standard for audio broadcasting.

ATRAC: Adaptive Transform Acoustic Coding (ATRAC) is a family of proprietary audio compression algorithms developed by Sony. ATRAC allowed a relatively small disc like MiniDisc to have the same running time as CD while storing audio information with minimal loss in perceptible quality.

30

Page 31: Speech and Audio Processing  and Coding (cont.)

Auditory Scene Analysis

31

Page 32: Speech and Audio Processing  and Coding (cont.)

Demos for Sequential Organisation

Resources: Audio Box CD from Univ. of Victoria

In this demo, the sound is perceived as a single stream of notes C4 G4 F4 B3

If the time delay is further decreased, we no longer hear a melody; we only hear the rhythmic beats. Our auditory system is now hearing four groups of one note each.

As the notes are sped up, rhythmic beats played as a melody begin to be heard.  The auditory system is now hearing two groups of two notes. 

32

Page 33: Speech and Audio Processing  and Coding (cont.)

Demo for Speech Segregation

Resources: Audio Box CD from Univ. of Victoria

This demo begins with the two melodies of “Camptown Races” and “Yankee Doodle” at the same pitch. Each time the interleaved melody is played, one of the songs is shifted in pitch until eventually the two melodies become distinguishable.

This demo adjusts the amplitude of the two songs while leaving the pitch constant. 

This demo plays the two melodies at the same pitch, but at different timbre.  The two melodies are distinguishable instantly.

33

Page 34: Speech and Audio Processing  and Coding (cont.)

Segregation of a melody from interfering tones

Track 1 in Bregman’s ASA Demonstration

34

Page 35: Speech and Audio Processing  and Coding (cont.)

Segregation of a melody from interfering tones

Track 5 in Bregman’s ASA Demonstration

35

Page 36: Speech and Audio Processing  and Coding (cont.)

Segregation of high notes from low ones in a sonata by Telemann

Track 6 in Bregman’s ASA Demonstration

36

Page 37: Speech and Audio Processing  and Coding (cont.)

Streaming in African xylophone music

Track 7 in Bregman’s ASA Demonstration

37

Page 38: Speech and Audio Processing  and Coding (cont.)

Effects of a timbre difference between the two parts in African xylophone music

Track 9 in Bregman’s ASA Demonstration

38

Page 39: Speech and Audio Processing  and Coding (cont.)

Stream segregation of vowels and diphthongs

Track 11 in Bregman’s ASA Demonstration

39

Page 40: Speech and Audio Processing  and Coding (cont.)

Stream segregation of high and low bands of noise

Track 14 in Bregman’s ASA Demonstration

40

Page 41: Speech and Audio Processing  and Coding (cont.)

Apparent Continuity

Track 28 in Bregman’s ASA Demonstration

41

Page 42: Speech and Audio Processing  and Coding (cont.)

Perceptual continuation

Track 29 in Bregman’s ASA Demonstration

42