Pitch-Synchronous Spectrogram: Principles and Applications

C. Julian Chen

Department of Applied Physics and Applied Mathematics

May 24, 2018

Outline

• The traditional spectrogram
• Observations with the electroglottograph (EGG)
• Process of human voice production
• Pitch-synchronous segmentation of voice signals
• Pitch-synchronous spectrogram
• Display of timbre spectrum within each period
• Display of power evolution within each period
• Free evaluation version and full versions

The traditional spectrogram

The graph is always a mixture of pitch and timbre.

(A) With a wide window, the overtones of the fundamental frequency dominate.

(B) With a narrow window, a mixture of formant peaks and details within each pitch period dominates.
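
The window trade-off can be illustrated with a rough calculation. The sketch below assumes a Hamming window and uses the textbook approximation that its main lobe is about 4/T Hz wide (null to null) for a window of duration T. The resolution criterion and the 125 Hz fundamental are illustrative assumptions, not values from the slides.

```python
def hamming_mainlobe_width_hz(window_sec):
    """Approximate null-to-null main-lobe width of a Hamming window (textbook rule: 4/T)."""
    return 4.0 / window_sec


def harmonics_resolved(window_sec, f0_hz):
    """Rule of thumb (an assumption): harmonics spaced f0_hz apart appear as
    separate peaks when half the main-lobe width is smaller than f0_hz."""
    return hamming_mainlobe_width_hz(window_sec) / 2.0 < f0_hz


# Hypothetical voice with a 125 Hz fundamental.
for window_sec in (0.040, 0.005):
    width = hamming_mainlobe_width_hz(window_sec)
    print(window_sec, width, harmonics_resolved(window_sec, 125.0))
```

Under these assumptions the 40 msec window separates the overtones, while the 5 msec window smears them together and follows the detail inside each pitch period instead.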

Display of timbre spectrum

The curve is always a mixture of pitch and timbre. It is very difficult to decipher formant frequencies and peak profiles.

The Source-Filter Theory

Source: Fourier transform of the glottal airflow waveform, -12 dB per octave.

Filter: an all-pole transfer function.

Radiation factor: +6 dB per octave, which is against the law of energy conservation.

Pitch-Asynchronous Speech Parameterization (1)

The speech signal is blocked into overlapping frames with a fixed window size (25 msec) and a fixed shift (10 msec), and then multiplied by a processing window, typically a Hamming window. The windows often cross phoneme boundaries. Timbre and pitch cannot be separated.
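
A minimal sketch of this fixed-frame blocking, using the 25 msec window and 10 msec shift quoted above. The function names and the 16 kHz sample rate are illustrative assumptions.

```python
import math


def hamming(n):
    """Hamming window of n points."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(signal, sample_rate, window_sec=0.025, shift_sec=0.010):
    """Block a signal into overlapping, Hamming-weighted frames of fixed size."""
    win_len = int(round(window_sec * sample_rate))
    shift = int(round(shift_sec * sample_rate))
    window = hamming(win_len)
    frames = []
    start = 0
    while start + win_len <= len(signal):
        chunk = signal[start:start + win_len]
        frames.append([s * w for s, w in zip(chunk, window)])
        start += shift
    return frames


# One second of a constant signal at an assumed 16 kHz sample rate.
frames = frame_signal([1.0] * 16000, 16000)
print(len(frames), len(frames[0]))
```

The frame boundaries depend only on the clock, not on the signal, so a frame can start anywhere within a pitch period and can straddle two phonemes.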

Pitch-Asynchronous Speech Parameterization (2)

Using an all-pole filter model from LPC analysis, the formants of speech signals can be extracted. But the process is not convergent.
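
For reference, the sketch below shows the standard autocorrelation method of LPC with the Levinson-Durbin recursion, which fits an all-pole model to one frame. It is a generic textbook version, not code from the presentation. The test signal is a hypothetical single-pole decay chosen so that the expected coefficients are known.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a frame."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order; returns (a, error)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


# Hypothetical frame: x[n] = 0.9**n, i.e. one real pole at z = 0.9.
frame = [0.9 ** n for n in range(200)]
coeffs, residual = levinson_durbin(autocorrelation(frame, 2), 2)
print(coeffs)
```

Formant candidates are then read from the roots of A(z). The model order has to be chosen by the user, and the extracted peaks can shift as it changes.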

Anatomy of voice-production organs

Observation of Speech Signals (1)

Vowel [a], King-TTS-012, 050007, 2.23-2.28 sec.

Observation of Speech Signals (2)

Vowel [i], King-TTS-012, 004419, 1.938 – 1.968 sec.

Observation of Speech Signals (3)

Vowel [u], King-TTS-012, 005044, 1.06 – 1.11 sec.

Observation of Speech Signals (4)

Vowel [e], King-TTS-012, 050053, 2.535 – 2.585 sec.

Observation of Speech Signals (5)

Vowel [o], King-TTS-012, 051022, 1.827 – 1.877 sec.

The Electroglottograph (EGG)

A non-invasive instrument, introduced circa 1956, that detects changes in electric conductance between the two vocal cords and thereby monitors the opening and closing of the glottis.

What Does the Correlation of EGG Signals and Voice Signals Tell Us?

A voice waveform is triggered by a glottal closing, starting with an impulse. The acoustic wave is strong in the closed phase, and weak in the open phase. (Fig. 5.6, Resonance in Singing, D. G. Miller).

The Handclap Analogy (Robert Sataloff)

“Sound is actually produced by the closing of the vocal folds, in a manner similar to the sound generated by hand clapping. … (T)he more frequent they open and close, the higher the pitch.” (Sataloff).

The Water-Hammer Analogy (Ronald Baken)

“The sharp cutoff of flow is particularly crucial, because it is this relatively sudden stoppage of the air flow that is the raw material of voice. An impulse-like shock wave is produced that “excites” air molecules in the vocal tract.” (R. Baken)

Principle of Superposition (Peter Ladefoged)

The voice signal is a superposition of elementary decaying waves, each starting at a glottal closing event. Pitch is the repetition rate of glottal closings. (Ladefoged)
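
The superposition principle can be sketched in a few lines. Here each elementary wave is modeled as a single exponentially decaying sinusoid, which is a simplifying assumption. The sample rate, resonance frequency, decay rate, and closing instants are all hypothetical values.

```python
import math


def elementary_wave(n_samples, sample_rate, freq_hz, decay_per_sec):
    """One decaying wave, modeled here as a single damped sinusoid (an assumption)."""
    return [
        math.exp(-decay_per_sec * i / sample_rate)
        * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
        for i in range(n_samples)
    ]


def superpose(closing_instants, wave, total_len):
    """Add one copy of the elementary wave starting at every glottal closing instant."""
    out = [0.0] * total_len
    for start in closing_instants:
        for i, value in enumerate(wave):
            if start + i < total_len:
                out[start + i] += value
    return out


sample_rate = 8000
period = 80  # samples between closings, so pitch = 8000 / 80 = 100 Hz
wave = elementary_wave(200, sample_rate, 700.0, 300.0)
voice = superpose([0, period, 2 * period], wave, 400)
```

Changing `period` changes the pitch but not the elementary wave, while changing the wave changes the timbre but not the pitch.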

What Is a Timbre Spectrum?

• As the glottis closes, the air moving in the vocal tract at that moment maintains its momentum.
• The kinetic energy of the moving air in the vocal tract is converted into acoustic energy.
• The impulse resonates in the vocal tract.
• The decaying elementary wave in each pitch period is determined by the geometry of the vocal tract; thus it represents the instantaneous timbre.
• An accurate timbre spectrum must be computed from the waveform in each pitch period.
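
As a minimal sketch of a per-period spectrum, the code below takes the discrete Fourier transform of the samples of a single pitch period. For a period of N samples at sample rate fs, the frequency bins are spaced fs/N apart. The naive DFT and the synthetic test period are illustrative assumptions, not the presentation's implementation.

```python
import cmath
import math


def amplitude_spectrum(period):
    """Naive DFT amplitude of one pitch period, for bins 0 .. N//2."""
    n = len(period)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(period[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        spectrum.append(abs(acc))
    return spectrum


def bin_frequency_hz(k, n, sample_rate):
    """Center frequency of DFT bin k for an n-sample period."""
    return k * sample_rate / n


# Hypothetical period: 64 samples holding exactly two cycles of a sine.
period = [math.sin(2.0 * math.pi * 2 * i / 64) for i in range(64)]
spectrum = amplitude_spectrum(period)
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
print(peak_bin, bin_frequency_hz(peak_bin, 64, 8000))
```

Because the analysis span is exactly one period, no window function is needed and the spectrum contains no pitch harmonics to disentangle from the formant peaks.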

Process Within Each Pitch Period

• A glottal closing starts a pitch period.
• The acoustic wave decays exponentially during the closed phase.
• A glottal opening connects the vocal tract with the lungs and thus accelerates power decay.
• A glottal opening also generates random noise.
• The excitation at a glottal opening is mostly weaker than that at a glottal closing.
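
The decay within a period can be made visible by measuring short-time power in consecutive bins. The sketch below does this for a synthetic, purely exponential segment. The bin length, sample rate, and decay rate are hypothetical, and a real pitch period would also show the faster decay of the open phase.

```python
import math


def power_db_per_bin(segment, bin_len):
    """Mean-square power in dB for consecutive, non-overlapping bins of a segment."""
    powers = []
    for start in range(0, len(segment) - bin_len + 1, bin_len):
        chunk = segment[start:start + bin_len]
        mean_square = sum(v * v for v in chunk) / bin_len
        powers.append(10.0 * math.log10(mean_square) if mean_square > 0 else float("-inf"))
    return powers


sample_rate = 8000
decay_per_sec = 200.0  # amplitude decays as exp(-decay_per_sec * t)
segment = [math.exp(-decay_per_sec * i / sample_rate) for i in range(64)]
powers = power_db_per_bin(segment, bin_len=16)

# For a pure exponential the power drops by the same number of dB in every bin.
expected_step = -20.0 * decay_per_sec * 16 / sample_rate * math.log10(math.e)
print(powers[1] - powers[0], expected_step)
```

A steady drop in dB per bin indicates exponential decay; a steeper slope late in the period would be consistent with the accelerated decay after the glottis opens.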

Pitch-Synchronous Segmentation Using EGG

The sharp peaks in the EGG derivative occur about 1 msec before the starting impulse, in the weakly varying section of a pitch period, which makes them suitable as segmentation points.
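
A minimal sketch of peak picking on the EGG derivative follows. It assumes that glottal closings appear as positive peaks (the polarity depends on the equipment) and uses a hypothetical relative threshold. The synthetic EGG signal is a square wave chosen only to make the result easy to check.

```python
def egg_derivative_peaks(egg, rel_threshold=0.5, min_distance=1):
    """Return sample indices of sharp positive peaks in the first difference of an EGG signal.

    Assumptions: closings give positive peaks; rel_threshold is a fraction of the
    largest derivative value; min_distance is the minimum spacing between marks.
    """
    d = [0.0] + [egg[i] - egg[i - 1] for i in range(1, len(egg))]
    peak_level = rel_threshold * max(d)
    marks = []
    for i in range(1, len(d) - 1):
        is_local_max = d[i] > d[i - 1] and d[i] >= d[i + 1]
        if d[i] > 0 and d[i] >= peak_level and is_local_max:
            if not marks or i - marks[-1] >= min_distance:
                marks.append(i)
    return marks


# Synthetic EGG: a step up at samples 40, 120, 200 (one per 80-sample period).
egg = [1.0 if (n % 80) >= 40 else 0.0 for n in range(240)]
marks = egg_derivative_peaks(egg)
print(marks)
```

Consecutive marks delimit one pitch period each, and the spacing between them gives the instantaneous pitch.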

Pitch-Synchronous Segmentation from Voice

By multiplying the voice signal with an asymmetric window, an excitation profile function is generated. The peaks of the excitation profile function generate pitch marks.
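
One possible reading of this step is sketched below: an asymmetric window is slid along the signal, multiplied with the samples under it, and summed to give one profile value per position. The window shape, its half-width, and the step-like test signal are all assumptions made for illustration. The slides do not specify them.

```python
import math


def asymmetric_window(half_width):
    """Odd-symmetric window, negative before the center and positive after (assumed shape)."""
    n = half_width
    return [
        math.sin(math.pi * k / n) * (1.0 + math.cos(math.pi * k / n))
        for k in range(-n, n + 1)
    ]


def excitation_profile(signal, half_width):
    """Profile value at m = sum over k of window[k] * signal[m + k]; edges are left at zero."""
    window = asymmetric_window(half_width)
    profile = [0.0] * len(signal)
    for m in range(half_width, len(signal) - half_width):
        profile[m] = sum(
            w * signal[m + k] for w, k in zip(window, range(-half_width, half_width + 1))
        )
    return profile


# Toy signal: a sudden onset at sample 50 stands in for an excitation.
signal = [0.0] * 50 + [1.0] * 50
profile = excitation_profile(signal, half_width=10)
best = max(range(len(profile)), key=lambda i: profile[i])
print(best)
```

With this window the profile is largest where the signal to the right of the center outweighs the signal to the left, which is where a sudden excitation sits.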

Ends-meeting procedure to make waveform cyclic

After an ends-meeting procedure, the waveform of each pitch period becomes a sample of a smooth periodic function.
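
The slides do not give the details of the ends-meeting procedure. The sketch below shows one plausible way to achieve the stated goal, a linear cross-fade between the start of the period and the samples that follow its end. It is an assumption, not the author's method, and `period_len` and `overlap` are hypothetical parameters.

```python
def ends_meeting(samples, period_len, overlap):
    """Return period_len samples whose end joins smoothly onto their beginning.

    samples must hold the period plus `overlap` samples of what follows it.
    The first `overlap` output samples fade from the continuation into the period.
    """
    if overlap <= 0 or overlap > period_len or len(samples) < period_len + overlap:
        raise ValueError("need 0 < overlap <= period_len and period_len + overlap samples")
    cyclic = list(samples[:period_len])
    for i in range(overlap):
        weight = i / overlap  # 0 at the start of the period, approaching 1 at the end of the fade
        cyclic[i] = weight * samples[i] + (1.0 - weight) * samples[period_len + i]
    return cyclic


# A ramp has the worst possible mismatch between its two ends.
samples = [float(i) for i in range(14)]
cyclic = ends_meeting(samples, period_len=10, overlap=4)
print(cyclic)
```

After the cross-fade, the step from the last sample back to the first is no larger than the step between any other two neighbors, so the period can be repeated without a jump.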

Example of a pitch-synchronous spectrogram

For voiced sections, the vertical lines represent glottal closing instants. In each pitch period, the amplitude timbre spectrum is displayed. Unvoiced sections have no glottal closings.

Display of Timbre Spectrum and Power Decay

By left-clicking the spectrogram at a pitch period, its timbre spectrum is displayed. By right-clicking at a pitch period, a graph of power decay in that pitch period is displayed.

Examples: Timbre spectra of some vowels

Examples: Consistency of Timbre Spectra

Six examples of timbre spectra of the vowel [i]. All show a strong peak at about 300 Hz and a group of peaks around 2-4 kHz.

Examples: Timbre spectra of some consonants

Examples: Power decay in a single pitch period

A Free Evaluation Version

• Includes pitch-synchronous segmentation of voice signals, spectrogram generation, timbre spectrum generation, and power decay computation.
• Only works on Mac OS
• Requires an installation of Tcl/Tk
• Partially open-source: the C++ program is compiled, the Tcl/Tk source code is open.
• Includes two sets of standard speech data: the CMU ARCTIC databases for US English speakers, male speaker bdl and female speaker slt
• Manually corrected phoneme label files for the two sets of speech data are also included

Input panel of the evaluation version

The entire package is in a single directory, PSS. In that directory, type

IMAC: PSS username$ wish pss.tcl <enter>

An input panel appears:

References

1. D. G. Miller, Resonance in Singing, Inside View Press, 2008.
2. R. T. Sataloff, The Human Voice, Scientific American, December 1992, Vol. 108.
3. R. J. Baken, Electroglottography, Journal of Voice, Vol. 6, pages 98-110 (1992).
4. R. J. Baken, An Overview of Laryngeal Function for Voice Production, in Professional Voice, Third Edition, edited by R. T. Sataloff, Plural Publishing, Vol. 1, pages 237-256 (2005).
5. P. Ladefoged, Elements of Acoustic Phonetics, University of Chicago Press, 1966.
6. C. J. Chen, Elements of Human Voice, World Scientific Publishing, 2016.
