Speech and Audio Processing and Coding (cont.)


Dr Wenwu Wang

Centre for Vision Speech and Signal Processing

Department of Electronic Engineering

[email protected]

http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html


Timbre Perception

(Ack. S. Zielinski)


What is Timbre?

According to the American Standards Association, it is defined as “that attribute of sensation in terms of which a listener can judge that two sounds having the same loudness and pitch are dissimilar”.

Musically, it is “the quality of a musical note which distinguishes different types of musical instruments.”

It can be defined as “everything that is not loudness, pitch or spatial perception”.

• Loudness <-> Amplitude (frequency dependent)

• Pitch <-> Fundamental frequency

• Spatial perception <-> IID, IPD (interaural intensity and phase differences)

• Timbre <-> ???


Physical Parameters

Timbre relates to:

• Static spectrum (e.g. harmonic content of spectrum)

• Envelope of spectrum (e.g. the peaks in the LPC spectrum, which correspond to formants)

• Dynamic spectrum (time evolving)

• Phase

• …


Static Spectrum


Spectrum Envelope

Formants affect the sensation of timbre.

Spectrum Envelope (cont)

Formants determine not only timbre but also the recognition of vowels.


This figure shows what the spectral envelope of a trumpet sound looks like.


The spectral envelopes of the flute (upper figure) and the piano (lower figure) show that the spectral envelope differs between musical instruments.

Dynamic Spectrum

This figure shows what the spectral envelope of a trumpet sound looks like.

Phase

The above two magnitude spectra are identical, while their waveforms are totally different. The timbres of these two sounds are almost identical, so phase affects timbre only to a very small extent. This also suggests that human hearing is not very sensitive to phase differences.
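The point about phase can be checked numerically. The sketch below is an illustration only, not the signals used in the figure: the five harmonic amplitudes and the phase values are arbitrary assumptions. It builds two waveforms from the same harmonics with different phases and compares their magnitude spectra using a naive DFT.

```python
import cmath
import math

N = 64  # samples in one period of the test signal
# Harmonic number -> amplitude (arbitrary illustrative values).
AMPLITUDES = {1: 1.0, 2: 0.5, 3: 0.33, 4: 0.25, 5: 0.2}


def synthesise(phases):
    """Sum of cosine harmonics, with one phase (in radians) per harmonic."""
    return [
        sum(a * math.cos(2 * math.pi * k * n / N + phases[k])
            for k, a in AMPLITUDES.items())
        for n in range(N)
    ]


def magnitude_spectrum(x):
    """Magnitudes of a naive DFT for bins 0..N/2."""
    size = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / size)
                for n in range(size)))
        for k in range(size // 2 + 1)
    ]


zero_phase = synthesise({k: 0.0 for k in AMPLITUDES})
shifted_phase = synthesise({1: 0.0, 2: 1.3, 3: 2.1, 4: 0.7, 5: 2.9})

mag_a = magnitude_spectrum(zero_phase)
mag_b = magnitude_spectrum(shifted_phase)

# The spectra agree up to rounding error, while the waveforms clearly differ.
max_spectrum_diff = max(abs(a - b) for a, b in zip(mag_a, mag_b))
max_waveform_diff = max(abs(a - b) for a, b in zip(zero_phase, shifted_phase))
print(f"largest magnitude-spectrum difference: {max_spectrum_diff:.2e}")
print(f"largest waveform difference: {max_waveform_diff:.2f}")
```

Whether the two sounds are actually heard as the same timbre is a perceptual question that this check cannot answer; it only confirms that phase changes alter the waveform and leave the magnitude spectrum untouched.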

Demos for Timbre Perception

Resources: Audio Box CD from Univ. of Victoria

Examples of differences in timbres


Auditory Masking


What is masking?

Masking: One sound is made inaudible by another one.

• Simultaneous masking refers to the situation where one sound (the signal) is made inaudible by another sound (the masker) presented at the same time. It is also known as frequency masking or spectral masking: two sounds that share the same frequency band, such as tones at 440 Hz and 450 Hz, can be perceived clearly when presented separately but not when presented simultaneously.

• Non-simultaneous masking refers to the situation where one sound (the signal) is made inaudible by another sound (the masker) that precedes or follows the signal. In other words, the two sounds are not present at the same time.

What is masking? (cont)


Simultaneous Masking

On-frequency masking

The masker and the signal are within the same auditory filter band, and the louder sound masks the quieter one.

Off-frequency masking

The masker and the signal are in different frequency bands. The masking effect is weaker than in on-frequency masking.

(Source: figures from Wikipedia, 2010)

Simultaneous Masking (cont)

In off-frequency masking, the masker raises the threshold of the signal much less than in on-frequency masking. However, it still has some masking effect on the signal, as shown in the above figure.

To achieve the same masking effect as in on-frequency masking, the masker level needs to be higher in off-frequency masking.


Demos for Simultaneous Masking (Frequency Domain Masking)

Resources: Audio Box CD from Univ. of Victoria

A single tone is played, followed by the same tone and a higher frequency tone. The higher frequency tone is reduced in intensity first by 12 dB, then by steps of 5 dB. The sequence is repeated twice. The second time the frequency separation between the tones is increased.

Pure tones mask higher frequencies better than lower frequencies. This demo tries to mask high frequencies.

This demo shows that a tone of greater intensity masks a broader range of tones than a tone of lower intensity. A single tone is played, followed by the same tone and a higher frequency tone. The higher frequency tone is reduced in intensity first by 10 dB, then by steps of 3 dB. The sequence above is repeated twice, the second time increasing the intensity of the single tone by 28 dB.

Pure tones mask higher frequencies better than lower frequencies. This demo tries to mask low frequencies.


The Amount of Masking

In the example above, the amount of masking is 16 dB, which is the difference between the masked threshold and the unmasked threshold. Note that the threshold for a masked signal is raised compared with the threshold for the same signal when it is not masked (for example, when it is heard in a quiet environment).
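As a small worked example of this definition, the sketch below computes the amount of masking from two thresholds. The two threshold values are hypothetical, chosen only so that their difference is 16 dB; the values in the figure itself are not recoverable from the text.

```python
def amount_of_masking(masked_threshold_db, unmasked_threshold_db):
    """Amount of masking in dB: how far the masker raises the signal threshold."""
    return masked_threshold_db - unmasked_threshold_db


# Hypothetical thresholds (dB SPL), not read from the slide's figure.
quiet_threshold_db = 10.0
masked_threshold_db = 26.0

masking_db = amount_of_masking(masked_threshold_db, quiet_threshold_db)
print(f"amount of masking: {masking_db:.0f} dB")
```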


Masking Interprets Frequency Resolution of Auditory System

Frequency selectivity, also known as frequency resolution, refers to the ability of the human auditory system to separate the different frequency components of a complex sound. Recall the concept of the critical bandwidth: two sounds with sufficiently different frequencies (pitches) can be heard as two separate tones.

It is achieved by the filtering process of the cochlea, where the complex sound is (band-pass) filtered and decomposed into individual frequency components (sinusoids), which are then coded independently in the auditory nerve.

Masking is usually used to quantify and characterise the frequency resolution of the auditory system. The auditory system would not be able to separate the two frequencies if the sound of one frequency is masked by that of the other. Therefore, masking explains the limits of frequency resolution of the human auditory system.


Use Masking to Estimate the Critical Band

The original experiment by Fletcher (1940) measured the threshold for detecting a sinusoidal signal as a function of the bandwidth of a band-pass noise masker.

Conditions: The noise was centred at the signal frequency. Noise power density was constant.

Findings: At first, the threshold increases as the noise bandwidth increases. However, it flattens off with further increases in noise bandwidth. This is due to the critical bandwidth: once the noise bandwidth exceeds the bandwidth of the auditory filter, the threshold ceases to increase even though the total noise power keeps increasing.

The power-spectrum model of masking assumes (Moore, 1995):

• The auditory system is a bank of linear, overlapping band-pass filters.

• The listener uses one filter, with a centre frequency close to that of the signal, to detect the signal.

• The signal is masked only by the noise component that passes through that auditory filter.

• The threshold corresponds to a certain signal-to-noise (signal-to-masker) ratio.
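The band-widening result can be reproduced with a minimal sketch of the power-spectrum model. Several simplifications here are assumptions rather than part of the slides: the auditory filter is idealised as rectangular, its width is taken as 160 Hz (the 920-1080 Hz band in the Bark band edges listed later), and the noise spectrum level and threshold ratio are hypothetical values.

```python
import math

CRITICAL_BANDWIDTH_HZ = 160.0  # assumed width of the filter around 1 kHz
NOISE_DENSITY_DB = 30.0        # hypothetical noise spectrum level, dB per Hz
THRESHOLD_SNR_DB = 0.0         # hypothetical signal-to-masker ratio at threshold


def masked_threshold_db(noise_bandwidth_hz):
    """Predicted signal threshold for band-pass noise centred on the signal."""
    # Only the noise that falls inside the (rectangular) auditory filter masks.
    effective_bw = min(noise_bandwidth_hz, CRITICAL_BANDWIDTH_HZ)
    noise_in_filter_db = NOISE_DENSITY_DB + 10.0 * math.log10(effective_bw)
    return noise_in_filter_db + THRESHOLD_SNR_DB


bandwidths = [20, 40, 80, 160, 320, 640]
thresholds = [masked_threshold_db(bw) for bw in bandwidths]
for bw, thr in zip(bandwidths, thresholds):
    print(f"noise bandwidth {bw:4d} Hz -> threshold {thr:.1f} dB")
```

With constant noise power density, each doubling of the bandwidth adds about 3 dB of noise power inside the filter until the noise is as wide as the filter. Beyond that point the extra noise falls outside the filter and the predicted threshold stays flat. A real auditory filter has sloping skirts, so the measured knee is more gradual than in this sketch.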

Psychophysical Tuning Curves

Psychophysical tuning curves (PTCs) are a method for estimating the shape of the auditory filter. The PTCs above were determined in simultaneous masking, using sinusoidal signals at 10 dB SPL. For each curve, the diamond below it shows the frequency and the level of the signal. The masker was a sinusoid that had a fixed starting phase relationship to the signal. The masker level required for threshold (i.e. the level that just masks the signal) is plotted as a function of masker frequency on a logarithmic scale. The dashed line represents the absolute threshold for the signal. Figure from (Moore, 1995).

Shape of Auditory Filter

The shape of the auditory filter centred at 1 kHz, plotted for input sound levels ranging from 20 to 90 dB SPL/ERB. The output level of the filter is plotted as a function of frequency. On the low-frequency side, the filter becomes progressively less sharply tuned with increasing sound level. On the high-frequency side, the sharpness of tuning increases slightly with increasing sound level. At moderate sound levels the filter is approximately symmetric on the linear frequency scale used. Figure from (Moore, 1995).

Bark Scale

Proposed by Eberhard Zwicker in 1961 and named after Heinrich Barkhausen, who proposed the first subjective measurement of loudness.

The scale ranges from 1 to 24 and corresponds to the first 24 critical bands of hearing. The successive band edges are (in Hz) 20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500.

$$\text{Bark} = 13\arctan(0.00076 f) + 3.5\arctan\left((f/7500)^2\right)$$ (f in Hz)
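A minimal sketch of both ways of mapping a frequency to the Bark scale is given below: the analytic approximation, and a lookup in the tabulated band edges. The function names are illustrative, and the choice to return None outside 20-15500 Hz is an assumption of this sketch.

```python
import bisect
import math

# Band edges in Hz, as listed on this slide (25 edges delimit 24 bands).
BAND_EDGES_HZ = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
                 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
                 9500, 12000, 15500]


def hz_to_bark(f_hz):
    """Analytic approximation: 13*atan(0.00076 f) + 3.5*atan((f/7500)^2)."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))


def critical_band_number(f_hz):
    """Band number 1..24 from the tabulated edges, or None if out of range."""
    if not BAND_EDGES_HZ[0] <= f_hz < BAND_EDGES_HZ[-1]:
        return None
    # Number of edges at or below f_hz equals the 1-based band number.
    return bisect.bisect_right(BAND_EDGES_HZ, f_hz)


for f in (100, 440, 1000, 4000, 10000):
    print(f"{f:6d} Hz: {hz_to_bark(f):5.2f} Bark, band {critical_band_number(f)}")
```

The formula gives a continuous value while the table gives an integer band, so the two agree only approximately; for example, 1 kHz lies in the band from 920 to 1080 Hz.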


Non-Simultaneous Masking

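Forward masking: the masking tone comes first and the masked tone follows after a time gap T. For masking to occur, T cannot be as long as 20-30 ms.

Backward masking: the masked tone comes first and the masking tone follows after a time gap T. For masking to occur, T cannot be more than 10 ms.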

Forward Masking

The left figure shows the amount of forward masking of a 2kHz signal as a function of the time delay between the signal and the end of the noise masker. Each curve represents a different noise level. The results for each spectrum level fall on a straight line when the signal delay is plotted on a logarithmic scale. The right figure shows the same thresholds plotted as a function of the masker level. The slopes of these growth of masking functions decrease with increasing signal delay. Figures from (Moore, 1995)


Forward Masking (cont)

Forward masking is greater the nearer in time the signal occurs to the masker.

Increments in masker level do not produce equal increments in the amount of forward masking, i.e. the slope of the growth-of-masking function is less than 1. This is in contrast to simultaneous masking, where the slope is close to 1.
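The effect of the slope can be made concrete with a short sketch. It assumes the growth-of-masking function is a straight line on dB-dB axes, and the forward-masking slope of 0.5 is a hypothetical value chosen for illustration, not a measurement from (Moore, 1995).

```python
def threshold_increase_db(masker_increase_db, growth_slope):
    """Rise in signal threshold for a given rise in masker level (linear model)."""
    return growth_slope * masker_increase_db


SIMULTANEOUS_SLOPE = 1.0  # "close to 1", as stated in the text
FORWARD_SLOPE = 0.5       # hypothetical value below 1, for illustration only

masker_step_db = 10.0
simultaneous_rise = threshold_increase_db(masker_step_db, SIMULTANEOUS_SLOPE)
forward_rise = threshold_increase_db(masker_step_db, FORWARD_SLOPE)

print(f"simultaneous masking: +{simultaneous_rise:.0f} dB threshold")
print(f"forward masking:      +{forward_rise:.0f} dB threshold")
```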


PTCs Comparisons

Comparison of the psychophysical tuning curves determined by simultaneous masking (triangles) and forward masking (squares). The masker frequency is expressed as the deviation from the centre frequency divided by the centre frequency. The unit for the centre frequency is kHz. Figures from (Moore et al., 1984).

Demos for Non-simultaneous Masking (Time Domain Masking)


Forward masking: a masking tone is played, followed 100 ms later by a tone that is a semitone lower. Both tones can be heard even as the second tone is decreased in 3 dB steps.

Forward masking: a masking tone is played, followed 10 ms later by a tone that is a semitone lower. Masking occurs in this demo. How many steps are audible before the second tone is masked?

Backward masking: the initial tone is masked by the one that follows. The time delay is 100ms.

Backward masking: the initial tone is masked by the one that follows. The time delay is decreased but is still more than 10 ms.

Backward masking: the initial tone is masked by the one that follows. The time delay is below 10ms. Masking occurs. How many steps are audible?
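A stimulus of the kind described in these demos can be sketched as follows. This is not the Audio Box CD material: the sample rate, tone durations, masker frequency and amplitudes are all hypothetical, and only the semitone interval, the gap and the 3 dB steps come from the descriptions above.

```python
import math

SAMPLE_RATE = 16000  # Hz, hypothetical


def tone(freq_hz, duration_s, amplitude):
    """Plain sine tone as a list of float samples."""
    n = int(round(duration_s * SAMPLE_RATE))
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]


def silence(duration_s):
    """Gap of zeros between masker and probe."""
    return [0.0] * int(round(duration_s * SAMPLE_RATE))


def forward_masking_trial(masker_hz, gap_s, probe_attenuation_db,
                          masker_s=0.2, probe_s=0.02):
    """Masker, then a gap, then a probe one semitone lower and attenuated."""
    probe_hz = masker_hz * 2.0 ** (-1.0 / 12.0)  # one semitone down
    probe_amp = 0.5 * 10.0 ** (-probe_attenuation_db / 20.0)
    return (tone(masker_hz, masker_s, 0.5)
            + silence(gap_s)
            + tone(probe_hz, probe_s, probe_amp))


# Probe lowered in 3 dB steps, with a 10 ms gap after the masker.
trials = [forward_masking_trial(1000.0, 0.010, step * 3.0) for step in range(5)]
print(f"{len(trials)} trials of {len(trials[0])} samples each")
```

A backward-masking trial would concatenate the same parts in the opposite order (probe, gap, masker). A listening-quality version would also add onset and offset ramps to avoid clicks, which this sketch omits.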


Examples of Modern Audio Formats

MP3: MPEG-1 or MPEG-2 Audio Layer 3 (or III) is a patented lossy audio codec. It is a common audio format for consumer audio storage, as well as a standard of digital audio compression for the transfer and playback of music on digital audio players.

Ogg Vorbis: a lossy audio codec developed by the Xiph.Org Foundation (formerly Xiphophorus company). Free and open source.

AAC: Advanced Audio Coding, an audio compression format specified by MPEG-2 and MPEG-4, and successor to MPEG-1’s “MP3” format.

WMA: Windows Media Audio, is an audio codec developed by Microsoft.

MPEG-1 Layer II or MPEG-2 Audio Layer II (MP2): a lossy audio compression format defined by ISO/IEC 11172-3 alongside MPEG-1 Audio Layer I and MPEG-1 Audio Layer III (MP3). While MP3 is much more popular for PC and internet applications, MP2 remains a dominant standard for audio broadcasting.

ATRAC: Adaptive Transform Acoustic Coding (ATRAC) is a family of proprietary audio compression algorithms developed by Sony. ATRAC allowed a relatively small disc like MiniDisc to have the same running time as CD while storing audio information with minimal loss in perceptible quality.


Auditory Scene Analysis


Demos for Sequential Organisation


In this demo, the sound is perceived as a single stream of notes C4 G4 F4 B3

If the time delay is decreased further, we no longer hear a melody; we only hear the rhythmic beats. Our auditory system is now hearing four groups of one note each.

As the notes are sped up, rhythmic beats played as a melody begin to be heard. The auditory system is now hearing two groups of two notes.


Demo for Speech Segregation

Resources: Audio Box CD from Univ. of Victoria

This demo begins with the two melodies “Camptown Races” and “Yankee Doodle” interleaved at the same pitch. Each time the interleaved melody is played, one of the songs is shifted in pitch until eventually the two melodies become distinguishable.

This demo adjusts the amplitude of the two songs while leaving the pitch constant.

This demo plays the two melodies at the same pitch but with different timbres. The two melodies are distinguishable instantly.

Segregation of a melody from interfering tones

Track 1 in Bregman’s ASA Demonstration


Segregation of a melody from interfering tones



Segregation of high notes from low ones in a sonata by Telemann



Streaming in African xylophone music



Effects of a timbre difference between the two parts in African xylophone music



Stream segregation of vowels and diphthongs



Stream segregation of high and low bands of noise

Track 14 in Bregman’s ASA Demonstration

Apparent Continuity



Perceptual continuation


