Pitch-Synchronous Spectrogram: Principles and Applications
C. Julian Chen
Department of Applied Physics and Applied Mathematics
May 24, 2018
Outline
• The traditional spectrogram
• Observations with the electroglottograph (EGG)
• Process of human voice production
• Pitch-synchronous segmentation of voice signals
• Pitch-synchronous spectrogram
• Display of timbre spectrum within each period
• Display of power evolution within each period
• Free evaluation version and full versions
The traditional spectrogram
The graph is always a mixture of pitch and timbre.
(A) with a wide window, the overtones of fundamental frequency dominate.
(B) with a narrow window, a mixture of formant peaks and details in each pitch period dominate.
Display of timbre spectrum
The curve is always a mixture of pitch and timbre. It is very difficult to decipher formant frequencies and peak profiles.
The Source-Filter Theory
Source: Fourier transform of glottal airflow waveform, -12 dB/oct. Filter: an all-pole transfer function. Radiation factor: +6 dB per octave, which is against the law of energy conservation.
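In decibel terms the factors of the source-filter model add, so the two slopes quoted above combine as in the worked restatement below. The symbols G (source), V (filter), R (radiation) and S (output) are labels introduced here for illustration; they do not appear on the slide.

```latex
\begin{aligned}
20\log_{10}|S(f)| &= 20\log_{10}|G(f)| + 20\log_{10}|V(f)| + 20\log_{10}|R(f)| \\
\text{slope of } |G(f)|\,|R(f)| &\approx -12\ \text{dB/oct} + 6\ \text{dB/oct} = -6\ \text{dB/oct}
\end{aligned}
```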
Pitch-Asynchronous Speech Parameterization (1)
The speech signal is blocked into overlapping frames with a fixed window size (25 msec) and a fixed shift (10 msec), and then multiplied by a processing window, typically a Hamming window. The windows often cross phoneme boundaries. Timbre and pitch cannot be separated.
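The fixed-frame blocking described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the 16 kHz sample rate and the 125 Hz test tone are assumptions made for the example, not part of the slides.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(signal, sample_rate, window_ms=25.0, shift_ms=10.0):
    """Cut the signal into overlapping fixed-size frames and window each one."""
    size = int(round(sample_rate * window_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    window = hamming(size)
    frames = []
    for start in range(0, len(signal) - size + 1, shift):
        frames.append([signal[start + i] * window[i] for i in range(size)])
    return frames

# Assumed test input: one second of a 125 Hz tone sampled at 16 kHz.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 125 * n / sample_rate) for n in range(sample_rate)]
frames = frame_signal(signal, sample_rate)
```

With these assumed values each frame holds 400 samples while one pitch period is 128 samples, so every frame covers about three periods and its edges fall at arbitrary positions within a period.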
Pitch-Asynchronous Speech Parameterization (2)
Using an all-pole filter model from LPC analysis, the formants of speech signals can be extracted. But the process is not convergent.
Anatomy of voice-production organs
Observation of Speech Signals (1)
Vowel [a], King-TTS-012, 050007, 2.23 – 2.28 sec.
Observation of Speech Signals (2)
Vowel [i], King-TTS-012, 004419, 1.938 – 1.968 sec.
Observation of Speech Signals (3)
Vowel [u], King-TTS-012, 005044, 1.06 – 1.11 sec.
Observation of Speech Signals (4)
Vowel [e], King-TTS-012, 050053, 2.535 – 2.585 sec.
Observation of Speech Signals (5)
Vowel [o], King-TTS-012, 051022, 1.827 – 1.877 sec.
The Electroglottograph (EGG)
A non-invasive instrument to detect the change of electric conductance between the two vocal cords, thus to monitor the opening and closing of the glottis (circa 1956).
What Does the Correlation of EGG Signals and Voice Signals Tell Us?
A voice waveform is triggered by a glottal closing, starting with an impulse. The acoustic wave is strong in the closed phase, and weak in the open phase. (Fig. 5.6, Resonance in Singing, D. G. Miller).
The Handclap Analogy (Robert Sataloff)
“Sound is actually produced by the closing of the vocal folds, in a manner similar to the sound generated by hand clapping. … (T)he more frequent they open and close, the higher the pitch.” (Sataloff).
The Water-Hammer Analogy (Ronald Baken)
“The sharp cutoff of flow is particularly crucial, because it is this relatively sudden stoppage of the air flow that is the raw material of voice. An impulse-like shock wave is produced that “excites” air molecules in the vocal tract.” (R. Baken)
Principle of Superposition (Peter Ladefoged)
The voice signal is a superposition of elementary decaying waves, each starting at a glottal closing event. Pitch is the repetition rate of glottal closings. (Ladefoged)
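The superposition principle can be illustrated with a small synthetic example. The single decaying sinusoid used as the elementary wave, and all numeric values, are assumptions chosen for the sketch; a real elementary wave contains several resonances.

```python
import math

def elementary_wave(length, sample_rate, freq_hz, decay_per_s):
    """One decaying sinusoid, standing in for the wave started by one glottal closing."""
    return [math.exp(-decay_per_s * n / sample_rate)
            * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(length)]

def superpose(closings, wave, total_length):
    """Add one copy of the elementary wave at every glottal closing instant."""
    out = [0.0] * total_length
    for c in closings:
        for i, v in enumerate(wave):
            if c + i < total_length:
                out[c + i] += v
    return out

sample_rate = 16000
period = 128                              # samples between closings (125 Hz pitch)
closings = list(range(0, 1600, period))   # assumed, evenly spaced closing instants
wave = elementary_wave(400, sample_rate, 700.0, 300.0)
voice = superpose(closings, wave, 1600)
```

The elementary wave here is longer than one period, so neighbouring waves overlap. Once the first wave has died out, the sum repeats at the closing rate, which is the pitch.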
What Is the Timbre Spectrum?
• As the glottis closes, the air moving in the vocal tract at that moment maintains its momentum.
• The kinetic energy of the moving air in the vocal tract is converted into acoustic energy.
• The impulse resonates in the vocal tract.
• The decaying elementary wave in each pitch period is determined by the geometry of the vocal tract, thus it represents the instantaneous timbre.
• An accurate timbre spectrum must be computed from the waveform in each pitch period.
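As a rough sketch of the last point, the amplitude spectrum of a single pitch period can be obtained with a discrete Fourier transform taken over exactly that period. The plain DFT, the function name and the synthetic one-resonance period below are assumptions for illustration; the slides do not specify the transform used by the software.

```python
import cmath
import math

def period_spectrum(period_samples):
    """Amplitude spectrum (bins 0..n/2) of one pitch period via a plain DFT."""
    n = len(period_samples)
    amps = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(period_samples))
        amps.append(abs(s) / n)
    return amps

# Assumed test input: one 8 ms period (128 samples at 16 kHz) holding a
# single decaying resonance at 750 Hz.
sample_rate = 16000
n = 128
one_period = [math.exp(-300.0 * i / sample_rate)
              * math.sin(2 * math.pi * 750.0 * i / sample_rate)
              for i in range(n)]
amps = period_spectrum(one_period)
peak_bin = amps.index(max(amps))
peak_hz = peak_bin * sample_rate / n
```

Because the transform length equals the pitch period, the bins are spaced at the pitch frequency (125 Hz here), and the spectrum describes the resonance rather than a mixture of overtones and formants.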
Process Within Each Pitch Period
• A glottal closing starts a pitch period.
• The acoustic wave decays exponentially during the closed phase.
• A glottal opening connects the vocal tract with the lungs and thus accelerates power decay.
• A glottal opening also generates random noise.
• The excitation at a glottal opening is mostly weaker than that at a glottal closing.
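The exponential decay in the closed phase can be quantified by measuring power in short blocks within one period and fitting a straight line to the power in dB. This is a minimal sketch under stated assumptions: the block size, the function names and the synthetic period are hypothetical, and the example models only a closed-phase decay, not the faster decay or the noise after glottal opening.

```python
import math

def block_power(samples, block):
    """Mean squared amplitude in consecutive, non-overlapping blocks."""
    return [sum(x * x for x in samples[i:i + block]) / block
            for i in range(0, len(samples) - block + 1, block)]

def decay_rate_db_per_ms(powers, block, sample_rate):
    """Least-squares slope of 10*log10(power) against block centre time in ms."""
    t = [(i + 0.5) * block * 1000.0 / sample_rate for i in range(len(powers))]
    y = [10.0 * math.log10(p) for p in powers]
    mt = sum(t) / len(t)
    my = sum(y) / len(y)
    num = sum((a - mt) * (b - my) for a, b in zip(t, y))
    den = sum((a - mt) ** 2 for a in t)
    return num / den

# Assumed test input: a 1 kHz resonance whose amplitude decays as exp(-300 t).
sample_rate = 16000
period = [math.exp(-300.0 * n / sample_rate)
          * math.sin(2 * math.pi * 1000.0 * n / sample_rate)
          for n in range(128)]
powers = block_power(period, 16)          # 1 ms blocks
rate = decay_rate_db_per_ms(powers, 16, sample_rate)
```

For an amplitude decay of exp(-300 t), power falls as exp(-600 t), which corresponds to about -2.6 dB per millisecond; the fitted slope should reproduce that value.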
Pitch-Synchronous Segmentation Using EGG
The sharp peaks in the EGG derivative occur about 1 msec before the starting impulse. They fall in the weakly varying section of a pitch period, which makes them suitable as segmentation points.
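A simple way to locate such peaks is to take the first difference of the EGG signal and keep its strong local maxima. The sketch below assumes a sign convention in which glottal closing produces a positive derivative peak; the threshold, the minimum gap and the synthetic EGG shape are hypothetical choices for illustration.

```python
import math

def synthetic_egg(num_periods, period=128, rise=5):
    """Toy EGG: a fast raised-cosine rise at closing, then a slow linear fall."""
    egg = []
    for _ in range(num_periods):
        for p in range(period):
            if p < rise:
                egg.append(0.5 * (1.0 - math.cos(math.pi * p / rise)))
            else:
                egg.append(1.0 - (p - rise) / (period - rise))
    return egg

def egg_segmentation_points(egg, sample_rate, rel_threshold=0.5, min_gap_ms=2.0):
    """Indices of strong local maxima of the first difference of the EGG."""
    diff = [egg[i + 1] - egg[i] for i in range(len(egg) - 1)]
    limit = rel_threshold * max(diff)
    min_gap = int(sample_rate * min_gap_ms / 1000.0)
    points = []
    for i in range(1, len(diff) - 1):
        is_peak = diff[i] > diff[i - 1] and diff[i] >= diff[i + 1]
        if is_peak and diff[i] >= limit:
            if not points or i - points[-1] >= min_gap:
                points.append(i)
    return points

sample_rate = 16000
egg = synthetic_egg(10)
points = egg_segmentation_points(egg, sample_rate)
```

Consecutive points delimit one pitch period each, so the spacing between them gives the period length in samples.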
Pitch-Synchronous Segmentation from Voice
By multiplying the voice signal with an asymmetric window, an excitation profile function is generated. The peaks of the excitation profile function generate pitch marks.
Ends-meeting procedure to make waveform cyclic
After an ends-meeting procedure, the waveform of each pitch period becomes a sample of a smooth periodic function.
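The slides do not spell out the ends-meeting procedure. One plausible reading, assumed here purely for illustration, is a linear crossfade that blends the first few samples of the period with the samples that follow the end of the period, so that the last sample is followed smoothly by the first when the period is repeated. The function name and the overlap length are hypothetical.

```python
import math

def make_cyclic(signal, start, period, overlap):
    """Return `period` samples whose ends meet when the segment is repeated."""
    if start + period + overlap > len(signal):
        raise ValueError("not enough samples after the period for the crossfade")
    seg = list(signal[start:start + period])
    for n in range(overlap):
        w = n / overlap
        # Fade from the continuation past the end of the period (w = 0)
        # to the original start of the period (w approaching 1).
        seg[n] = (1.0 - w) * signal[start + period + n] + w * signal[start + n]
    return seg

# Assumed test input: a sinusoid with a slow upward drift, so that the two
# ends of a raw 100-sample period do not meet.
signal = [math.sin(2 * math.pi * n / 100.0) + 0.01 * n for n in range(200)]
seg = make_cyclic(signal, 0, 100, 10)
jump_before = abs(signal[0] - signal[99])
jump_after = abs(seg[0] - seg[99])
```

After the crossfade the step from the last sample back to the first is no larger than an ordinary sample-to-sample step of the original signal, which is what allows the period to be treated as a sample of a smooth periodic function.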
Example of a pitch-synchronous spectrogram
For voiced sections, the vertical lines represent glottal closing instants. In each pitch period, the amplitude timbre spectrum is displayed. Unvoiced sections have no glottal closings.
Display of Timbre Spectrum and Power Decay
By left-clicking the spectrogram at a pitch period, its timbre spectrum is displayed. By right-clicking at a pitch period, a graph of power decay in that pitch period is displayed.
Examples: Timbre spectra of some vowels
Examples: Consistency of Timbre Spectra
Six examples of timbre spectra of vowel [i]. All show a strong peak at about 300 Hz and a group of peaks around 2-4 kHz.
Examples: Timbre spectra of some consonants
Examples: Power decay in a single pitch period
A Free Evaluation Version
• Includes pitch-synchronous segmentation of voice signals, spectrogram generation, timbre spectrum generation, and power decay computation.
• Only works on Mac OS
• Requires an installation of Tcl/Tk
• Partially open-source: the C++ program is compiled, the Tcl/Tk source code is open.
• Includes two sets of standard speech data: the CMU ARCTIC databases for US English speakers, male speaker bdl and female speaker slt
• Manually corrected phoneme label files for the two sets of speech data are also included
Input panel of the evaluation version
The entire package is in a single directory, PSS. In that directory, type
IMAC: PSS username$ wish pss.tcl <enter>
An input panel appears:
References
1. D. G. Miller, Resonance in Singing, Inside View Press, 2008.
2. R. T. Sataloff, The Human Voice, Scientific American, December 1992, Vol. 108.
3. R. J. Baken, Electroglottography, Journal of Voice, Vol. 6, pages 98-110 (1992).
4. R. J. Baken, An Overview of Laryngeal Function for Voice Production, in Professional Voice, Third Edition, edited by R. T. Sataloff, Plural Publishing, Vol. 1, pages 237-256 (2005).
5. P. Ladefoged, Elements of Acoustic Phonetics, University of Chicago Press, 1966.
6. C. J. Chen, Elements of Human Voice, World Scientific Publishing, 2016.