
1

Speech Synthesis

•A user-friendly machine must have complete voice communication abilities

•Voice communication involves:

•Speech synthesis

•Speech recognition


2

Elements of Speech

•Some popular electronic speech synthesizers are modeled directly on the human vocal tract

•Another technique constructs speech by putting together the various sounds required to produce it

•Therefore, it is important to understand the characteristics of speech sounds


3

Elements of Speech (cont)


4

Elements of Speech (cont)

•The vocal system can be broken into:

•Lungs

•Larynx

•Vocal cavity

•The lungs provide power to the system by forcing air up through the larynx and into the vocal cavity

•The vocal cords, which are made up of layers of skin, create sound when they flap or vibrate as air passes through

•The vibrating action generates several resonant frequencies within the vocal cavity (which has several harmonic frequencies)


5

Elements of Speech (cont)

•Different sounds are created by changing the shape of the vocal cavity with the throat, tongue, teeth and lips.

•Engineers use a frequency analyzer called a sound spectrograph to study speech.

•Spectrographs have shown that the range of most human speech is from 150 to 3600 Hz. This represents a frequency bandwidth of 3450 Hz.

•A bass singer or soprano can cover a bandwidth of 15 kHz (from 10 Hz to 15 kHz)

•The volume ratio for human speech is about 16,000 to 1, from a shout to a whisper


6

Elements of Speech (cont)

•However, we do not need to design a speech synthesizer that generates frequencies from 10 Hz to 15 kHz with a volume ratio of 16,000 to 1

•It is enough to produce "intelligible" speech. E.g. the telephone system has a bandwidth of 3000 Hz (from 300 to 3300 Hz) with a 1000:1 volume ratio.


7

Electronic Speech Synthesis (ESS)

•Two techniques commonly used in ESS are:

•Natural speech analysis/synthesis

•Artificial constructive/synthesis

Natural speech analysis/synthesis

•Involves recording and subsequent playback of human speech

•Can be analog or digital

•The best choice for producing limited speech, as in vending machines, appliances and automobiles.


8

Electronic Speech Synthesis (ESS) (cont)

Natural speech analysis/synthesis (cont)

•Involves an analysis phase and a synthesis phase

•Analysis phase: human speech is analyzed, coded in digital form and stored

•Synthesis phase: the digitized speech is recalled from memory and converted back to analog to re-create the original speech waveform


9

Electronic Speech Synthesis (ESS) (cont)

Natural speech analysis/synthesis (cont)

•The digital analysis/synthesis method provides more flexibility than the analog method, since the stored words or phrases can be randomly accessed from computer memory.

•However, the vocabulary size is limited by the amount of memory available.

•For this reason, several different encoding techniques are used to analyze and compress the speech waveform; they attempt to discard unimportant parts so that fewer bits are required.


10

Electronic Speech Synthesis (ESS) (cont)

Natural speech analysis/synthesis (cont)

•Two types of digital analysis/synthesis:

•Time domain analysis/synthesis

-The speech waveform is digitized in the time domain

-Analog samples are converted to digital with an ADC; the stored samples are later passed through a DAC to reproduce the speech.

-E.g. telephone directory assistance

•Frequency domain analysis/synthesis

-The frequency spectrum of the analog waveform is analyzed and coded

-The synthesis operation attempts to emulate the human vocal tract electronically, using stored frequency parameters obtained from the analysis


11

Electronic Speech Synthesis (ESS) (cont)

Artificial Constructive/Synthesis

• Speech is created artificially by putting together the various sounds that are required to produce a given speech segment

• The most popular technique is called phoneme speech synthesis

• Phoneme: The smallest phonetic unit in a language that is capable of conveying a distinction in meaning, as the m of mat and the b of bat in English

• Phonetic: Representing the sounds of speech with a set of distinct symbols, each designating a single sound: phonetic spelling

E.g. f&-’ne-tik ‘sim-b&l = phonetic symbol


12

Electronic Speech Synthesis (ESS) (cont)

Artificial Constructive/Synthesis (cont)

• Allophone: A predictable phonetic variant of a phoneme. For example, the aspirated t of top, the unaspirated t of stop, and the tt (pronounced as a flap) of batter are allophones of the English phoneme /t/.


13

Electronic Speech Synthesis (ESS) (cont)

Artificial Constructive/Synthesis (cont)

• Phoneme and allophone sounds are coded and stored in memory

• A software algorithm then connects the phonemes to produce a given word; words are strung together to produce phrases


14

Electronic Speech Synthesis (ESS) (cont)

Artificial Constructive/Synthesis (cont)

•There is also software consisting of a set of production rules used to translate written text into the appropriate allophone codes – text to speech

•With the phoneme technique, a computer can produce an unlimited vocabulary using a minimum amount of memory
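The phoneme technique can be sketched in a few lines. This is only an illustration: the phoneme symbols and the numeric sound codes below are invented, and do not correspond to any real synthesizer ROM.

```python
# Hypothetical phoneme table: each phoneme maps to a stored sound code.
# The symbols and code values here are made up for illustration.
PHONEME_ROM = {
    "f": 0x21, "&": 0x05, "n": 0x18, "e": 0x0B,
    "t": 0x1C, "i": 0x0F, "k": 0x14,
}

def synthesize(phonemes):
    """String the stored phoneme codes together to build a word."""
    return [PHONEME_ROM[p] for p in phonemes]

# "phonetic" = f&-'ne-tik, spelled out as a phoneme sequence
word_codes = synthesize(["f", "&", "n", "e", "t", "i", "k"])
```

A text-to-speech front end would apply its production rules to map written text to such phoneme sequences before the lookup step.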


15

Electronic Speech Synthesis (ESS) (cont)

Summary of the various methods that are used for electronic speech synthesis (ESS)


16

Time-Domain Analysis/Synthesis and Waveform Digitization


17

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)

•Any method that converts the amplitude variations of speech to digital code for subsequent playback can be considered time-domain speech analysis/synthesis

•Time-domain speech analysis/synthesis involves 2 operations:

•Encoding: the human speech waveform is digitized and stored using an analog-to-digital converter (ADC)

•Decoding: the digitized speech is converted back to analog form for playback using a digital-to-analog converter (DAC)


18

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)

•A low-pass filter is connected to the DAC output to smooth out the steps in the synthesized waveform

•Time-domain encoding attempts to reduce the amount of memory required to store digitized speech. Some examples are:

•Simple Pulse-Code Modulation (PCM)

•Delta Modulation (DM)

•Differential Pulse-Code Modulation (DPCM)

•Adaptive Differential Pulse-Code Modulation (ADPCM)


19

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)

Simple Pulse-Code Modulation (PCM)

•Direct waveform digitization using an ADC

•Factors that control the quality of the digitized speech:

•Sampling rate: a higher sampling rate creates higher-quality output

•ADC resolution: a higher-bit converter creates higher-quality output

•To "catch" all the subtleties of the waveform, it must be sampled about 30,000 times per second

•If each sample were converted to an 8-bit digital code, one second of speech would require 8 × 30,000 = 240,000 bits of memory (a data rate of 240,000 bits per second – not practical)


20

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)

Simple Pulse-Code Modulation (SPCM) (cont)

•To reduce the data rate, the sampling rate must be reduced

•Experimentation has shown that acceptable speech can be created using a sampling rate of at least two times the highest frequency component in the speech waveform.

•Therefore 6,000 (2 × 3,000) conversions per second is the minimum sampling rate required to produce acceptable speech (most speech falls in the 300 to 3000 Hz range).

•The ADC resolution determines the smallest analog increment the system will detect. Acceptable speech can be synthesized using an 8-bit ADC, giving a data rate of 8 × 6,000 = 48,000 bps. Therefore 10 seconds of speech needs 60,000 bytes of memory, or roughly 58.6 kB
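The data-rate and memory arithmetic above can be checked with a short calculation. This is a sketch: `pcm_memory_bytes` is a hypothetical helper, not part of any library.

```python
def pcm_memory_bytes(sample_rate_hz, bits_per_sample, seconds):
    """Memory needed to store PCM speech of the given length, in bytes."""
    total_bits = sample_rate_hz * bits_per_sample * seconds
    return total_bits // 8

# Full-detail capture: 30,000 samples/s at 8 bits -> 240,000 bps
full_rate_bps = 30_000 * 8

# Acceptable speech: 6,000 samples/s at 8 bits, for 10 seconds
mem = pcm_memory_bytes(6_000, 8, 10)
print(full_rate_bps, mem, round(mem / 1024, 1))  # 240000 60000 58.6
```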


21

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)


22

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)

Simple Pulse-Code Modulation (SPCM) (cont)

•Problem: SPCM requires too much memory to produce acceptable speech for any length of time.

•The answer is to use:

•Delta Modulation (DM)

•Differential Pulse Code Modulation (DPCM)

•Adaptive Differential Pulse Code Modulation (ADPCM)


23

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)

Delta Modulation (DM)

•Only a single bit is stored for each sample of the speech waveform, rather than 8 or 12 bits.

•The ADC still converts each sample to an 8- or 12-bit value, which is then compared to the previous sample value.

•If the present value is greater than the last value, a logic 1 is stored; if it is less, a logic 0 is stored. Thus a single bit is used to represent each sample

•An integrator is used on the circuit output to convert the serial bit stream back to an analog waveform
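The one-bit compare-and-integrate loop can be modeled in software as follows. This is a minimal sketch: the integrator is simulated in code, and `step` is an assumed fixed step size.

```python
def dm_encode(samples, step=1.0):
    """Delta modulation: store 1 if the input is above the running
    estimate, else 0; the estimate models the integrator's output."""
    bits, estimate = [], 0.0
    for s in samples:
        bit = 1 if s > estimate else 0
        bits.append(bit)
        estimate += step if bit else -step
    return bits

def dm_decode(bits, step=1.0):
    """Integrate the serial bit stream back into a staircase waveform."""
    out, estimate = [], 0.0
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out
```

Slope overload shows up exactly here: if the input rises or falls faster than `step` per sample, the staircase cannot keep up, which is why the DM sampling rate must be high.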


24

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)

Delta Modulation (DM) (cont)

•Disadvantages

•The sampling rate must be high to catch all the details of the speech signal. E.g. a typical sampling rate of 32,000 samples per second translates to a 32,000 bps data rate. Thus, 10 seconds of speech would require 39 kB of memory (a 40% data reduction from 8-bit SPCM)

•Compliance Error/Slope Overload

-Results when the speech waveform changes too rapidly for a given sampling rate. The resulting digitization does not truly represent the analog waveform and produces audible distortion in the output.

-Can be solved by increasing the sampling rate (which, however, results in an increased data rate and more memory)


25

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)

Delta Modulation (DM) (cont)


26

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)

Differential Pulse-Code Modulation (DPCM)

•Same as DM, but several bits are used to represent the actual difference between two successive samples rather than a single bit.

•Since the speech waveform contains many duplicate sounds and pauses, the change in amplitude from one sample to the next is relatively small compared to the actual amplitude of a sample. As a result, fewer bits are required to store the difference value than the absolute sample value

•The difference between two successive samples can be represented with a 6- or 7-bit value (1 sign bit, representing the slope of the input waveform, plus 5 or 6 bits for the difference value)
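A minimal DPCM sketch, assuming sign-plus-magnitude difference codes; the clamping models the limited difference range that causes slope overload when the waveform jumps by more than the code can express.

```python
def dpcm_encode(samples, bits=7):
    """Store the difference between successive samples in `bits` bits
    (1 sign bit + magnitude); large jumps are clamped (slope overload)."""
    max_mag = 2 ** (bits - 1) - 1
    codes, prev = [], 0
    for s in samples:
        diff = max(-max_mag, min(max_mag, s - prev))
        codes.append(diff)
        prev += diff  # track what the decoder will reconstruct
    return codes

def dpcm_decode(codes):
    """Accumulate the difference values back into sample values."""
    out, prev = [], 0
    for d in codes:
        prev += d
        out.append(prev)
    return out
```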


27

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)

Differential Pulse-Code Modulation (DPCM) (cont)

•E.g. a 7-bit DPCM system using a sampling rate of 6,000 samples per second would require about 51 kB of memory for 10 seconds of speech:

7 × 6,000 × 10 s = 420,000 bits

420,000 bits / 8 = 52,500 bytes => 52,500 / 1024 ≈ 51.3 kB

•A bipolar DAC is used for playback to convert the successive difference values into a continuous analog waveform

•Has the same problems as DM; these are overcome by increasing the sampling rate and bit rate.


28

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)

Differential Pulse-Code Modulation (DPCM) (cont)


29

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)

Adaptive Differential Pulse-Code Modulation (ADPCM)

•A variation of DPCM that eliminates the slope-overload problem

•Only 3 or 4 bits are required to represent each sample

•The waveform is sampled at 6,000 samples per second with an 8- or 12-bit ADC. The computer then subtracts the current sample value from the previous one to get a differential value, as in DPCM. However, the differential value is then adjusted to compensate for slope using a quantization factor.

•The quantization factor adjusts the differential value dynamically, according to the rate of change (slope) of the input waveform. The adjusted differential value can then be represented using only 3 or 4 bits.
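The adaptive quantization factor can be modeled as a step size that grows when the code saturates (steep slope) and shrinks otherwise. This is a sketch only: the 1.5 growth/decay factor below is an assumption for illustration, not a value from the slides.

```python
def adpcm_encode(samples, bits=3):
    """ADPCM sketch: quantize each difference with an adaptive step."""
    max_code = 2 ** (bits - 1) - 1
    codes, prev, step = [], 0.0, 1.0
    for s in samples:
        code = int(round((s - prev) / step))
        code = max(-max_code, min(max_code, code))
        codes.append(code)
        prev += code * step
        # adapt: widen the step near slope overload, narrow it otherwise
        step = step * 1.5 if abs(code) == max_code else max(1.0, step / 1.5)
    return codes

def adpcm_decode(codes, bits=3):
    """Mirror the encoder's step adaptation to rebuild the waveform."""
    max_code = 2 ** (bits - 1) - 1
    out, prev, step = [], 0.0, 1.0
    for code in codes:
        prev += code * step
        out.append(prev)
        step = step * 1.5 if abs(code) == max_code else max(1.0, step / 1.5)
    return out
```

Because the decoder repeats the same adaptation rule, only the small codes need to be stored; the step size itself is never transmitted.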


30

Time-Domain Analysis/Synthesis and Waveform Digitization (cont)

Adaptive Differential Pulse-Code Modulation (ADPCM) (cont)

•In addition to requiring fewer bits per sample, the sampling rate of ADPCM can be reduced (to 4,000 samples per second), since slope overload is minimized

•E.g. 4,000 samples per second with a 3-bit code results in a data rate of 12,000 bps (3 × 4,000). Therefore 10 s of speech requires 15 kB of memory ((12,000 / 8) × 10 s = 15,000 bytes)

•However, 8,000 samples per second and a 4-bit code are more common => roughly 39 kB of memory for 10 s of speech

•Disadvantage: needs sophisticated software