SPEECH RECOGNITION 1 DAY 14 – SEPT 27, 2013 Brain & Language LING 4110-4890-5110-7960 NSCI 4110-4891-6110 Harry Howard Tulane University


SPEECH RECOGNITION 1

DAY 14 – SEPT 27, 2013

Brain & Language

LING 4110-4890-5110-7960

NSCI 4110-4891-6110

Harry Howard

Tulane University


Course organization

• The syllabus, these slides and my recordings are available at http://www.tulane.edu/~howard/LING4110/.
• If you want to learn more about EEG and neurolinguistics, you are welcome to participate in my lab. This is also a good way to get started on an honors thesis.
• The grades are posted to Blackboard.

9/27/13 Brain & Language, Harry Howard, Tulane University


REVIEW

Modularity


Coltheart’s grouping & my explanation

1. Specific to a domain: by definition.
2. Information is encapsulated: by definition.
3. Fixed neural structure: in order to keep out all the other stuff.
4. Matures in a specific way: in order to build the fixed structure.
5. Fails in a specific way: because it was built in a specific way.
6. Limits central access: in order to keep out other stuff.
7. Operates mandatorily: since there is no external access, it can’t be turned on or off.
8. Acts quickly: because there is no other stuff to get in the way of optimizing speed.
9. Analyzes ‘shallowly’: because other stuff is necessary to analyze deeply.


SPEECH RECOGNITION

Ingram §5


Three systems involved in speech production

Respiratory

Laryngeal

Supralaryngeal


Vocal folds and their location in the larynx


Phonation

• Phonation, or speech sound, is created by turbulent oscillation between phases in which the passage of air through the larynx is unconstricted (the expiratory airflow has pushed the vocal folds apart) and phases in which the passage of air is blocked (the vocal folds snap back to their semi-closed position).


Turbulent oscillation of vocal air

• The following figure depicts such a transition, in which increasing darkness symbolizes increasing compression of the airflow.
• The heavy line represents the pressure of the airflow through the vocal folds as a single quantity between a minimum and a maximum.
  • As the vocal folds close, the outflow of air is compressed and its pressure rises;
  • as they open, the outflow of air is rarefied and its pressure falls.
• A single cycle of closing and opening is defined by the distance between two peaks, marked by dotted white lines.


Graph of turbulent oscillation of vocal air


An example: "phonetician"


f o n ə t ɪ ʃ ən


Frequency

• This cycling of airflow has a certain frequency.
  • The frequency of a phenomenon refers to the number of units that occur during some fixed extent of measurement.
• The basic unit of frequency, the hertz (Hz), is defined as one cycle per second.
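The definition above amounts to dividing a count of cycles by the time over which they were counted. A minimal sketch follows; the function name `frequency_hz` and the sample measurement are illustrative assumptions, not part of the slides.

```python
def frequency_hz(cycles: float, seconds: float) -> float:
    """Return frequency in hertz: cycles completed per second."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return cycles / seconds


# Hypothetical measurement: 250 open/close cycles counted over 2 seconds.
print(frequency_hz(250, 2.0))  # 125.0
```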


Two sine functions with different frequencies

• A simple illustration can be found in the next diagram. It consists of the graphs of two sine functions.
  • The one marked with o’s, like beads on a necklace, completes an entire cycle in 0.628 s, which gives it a frequency of 1.59 Hz.
  • The other wave, marked with x’s so that it looks like barbed wire, completes two cycles in this period. Thus, its frequency is twice as much, 3.18 Hz.
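The two frequencies quoted above can be checked from the period alone, since frequency is the number of cycles divided by the time they take. The sketch below does that arithmetic and samples both waves a quarter of the way through the period. Only the value 0.628 s comes from the slide; the helper `sine_wave` and the variable names are assumptions made for illustration.

```python
import math

PERIOD_S = 0.628  # time for one cycle of the slower wave (from the slide)

slow_hz = 1.0 / PERIOD_S  # one cycle per 0.628 s
fast_hz = 2.0 / PERIOD_S  # two cycles in the same 0.628 s


def sine_wave(freq_hz: float, t: float) -> float:
    """Value at time t (seconds) of a unit-amplitude sine of the given frequency."""
    return math.sin(2.0 * math.pi * freq_hz * t)


print(round(slow_hz, 2), round(fast_hz, 2))  # 1.59 3.18

# A quarter of the way through the period, the slow wave is at its peak
# while the fast wave has already returned to zero (half of its own cycle).
quarter = PERIOD_S / 4.0
print(round(sine_wave(slow_hz, quarter), 6), round(abs(sine_wave(fast_hz, quarter)), 6))
```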


Graph of two sine functions with different frequencies


Fundamental frequency

• The pitch of the human voice corresponds to the frequency of vocal fold oscillation, called fundamental frequency or F0.
• Fundamental frequency & gender
  • The fundamental frequency of a man’s voice averages 125 Hz;
  • the fundamental frequency of a woman’s voice averages 200 Hz.
  • This 60% increase in the pitch of a woman’s voice can be accounted for entirely by the fact that a man’s vocal folds are on average 60% longer than a woman’s.
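The 60% figure follows directly from the two averages. The sketch below works through the arithmetic and then shows how a length difference could produce it, under the assumption (not stated on the slide) that the vocal folds behave like an ideal string whose frequency is inversely proportional to its length, with tension and mass per unit length held equal. The function `f0_for_length` is a hypothetical helper.

```python
MALE_F0_HZ = 125.0    # average from the slide
FEMALE_F0_HZ = 200.0  # average from the slide

ratio = FEMALE_F0_HZ / MALE_F0_HZ
percent_increase = (ratio - 1.0) * 100.0
print(ratio, round(percent_increase))  # 1.6 60


def f0_for_length(reference_f0_hz: float, length_ratio: float) -> float:
    """Assumed ideal-string scaling: frequency is inversely proportional to length."""
    return reference_f0_hz / length_ratio


# Folds that are 1.6 times as long (60% longer) lower 200 Hz to about 125 Hz.
print(round(f0_for_length(FEMALE_F0_HZ, 1.6), 1))  # 125.0
```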


An example: "phonetician"


The fundamental & higher frequencies

• This brief introduction to the pitch of the human voice leads one to believe that the vocal folds vibrate at a single frequency, that of their fundamental frequency, much as the schematic string on the left side is shown vibrating at its fundamental frequency.


Higher frequencies

• However, this is but an idealization for the sake of simplification of a rather complex subject.
• In reality, the vocal folds vibrate at a variety of frequencies that are multiples of the fundamental.
• The diagram depicts how this is possible – a string can vibrate at a frequency higher than its fundamental because smaller lengths of the string complete a cycle in a shorter period of time.
• In the particular case of the central diagram, each half of the string completes a cycle in half the time.
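The string picture above can be put into numbers: when a string vibrates in n equal segments, each segment completes a cycle in 1/n of the time, so the frequency is n times the fundamental. The sketch below assumes an idealized string and uses the 125 Hz male average from the earlier slide as the fundamental; the function names are illustrative.

```python
def mode_frequency(fundamental_hz: float, n: int) -> float:
    """Frequency of the n-th mode: the string vibrates in n equal segments."""
    return n * fundamental_hz


def mode_period(fundamental_hz: float, n: int) -> float:
    """Time in seconds for one cycle of the n-th mode."""
    return 1.0 / mode_frequency(fundamental_hz, n)


F0_HZ = 125.0  # assumed fundamental, taken from the male average above

for n in (1, 2, 3):
    print(n, mode_frequency(F0_HZ, n), mode_period(F0_HZ, n))
```

With n = 2, the case of the central diagram, the period is half that of the fundamental.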


Superposition of frequencies

• This figure displays the outcome of superimposing both frequencies on the string and the waveform.
• The result is that a pulse of vibration created by the vocal folds projects an abundance of different frequencies in whole-number multiples of the fundamental.
• If we could hear just this pulse, it would sound, as Loritz (1999:93) says, “more like a quick, dull thud than a ringing bell”.
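Superposition here simply means adding the waves point by point. The sketch below sums sine waves at whole-number multiples of a fundamental and checks that the sum still repeats once per fundamental period. Giving every harmonic the same amplitude is a simplifying assumption made for illustration (in a real voice the harmonics are not equally strong), and `harmonic_sum` is a hypothetical name.

```python
import math


def harmonic_sum(t: float, f0_hz: float, n_harmonics: int) -> float:
    """Add unit-amplitude sines at f0, 2*f0, ..., n_harmonics*f0 at time t."""
    return sum(
        math.sin(2.0 * math.pi * k * f0_hz * t)
        for k in range(1, n_harmonics + 1)
    )


F0_HZ = 125.0          # assumed fundamental
PERIOD_S = 1.0 / F0_HZ

t = 0.001
now = harmonic_sum(t, F0_HZ, 10)
one_period_later = harmonic_sum(t + PERIOD_S, F0_HZ, 10)

# Every harmonic completes a whole number of cycles in one fundamental
# period, so the combined wave repeats at the fundamental frequency.
print(math.isclose(now, one_period_later, abs_tol=1e-9))  # True
```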


An example: the spectrogram of "phonetician"


f o n ə t ɪ ʃ ən


Cavities & resonance

• But the human voice does not sound like a quick, dull thud; it sounds, well, it sounds like a human voice. This is because the human vocal tract sits on top of the larynx, and the vocal tract enhances the glottal pulse just as a trumpet enhances the buzz of the player’s lips, as illustrated previously.
• In particular, the buccal and nasal cavities resonate at certain frequencies, thereby exaggerating some harmonics while muting others.
• The oral cavity itself sits in a channel between two smaller cavities whose size varies according to the position of the tongue and lips. The next diagram zooms in on the buccal cavity to distinguish the other two. Counting from the back, there is
  1. a pharyngeal cavity,
  2. an oral cavity properly speaking, and
  3. a labiodental cavity, between the teeth and the lips.
• Notice how the difference in tongue position for [i], the vowel in seed, and [a], the vowel in sod, changes the size of the oral and pharyngeal cavities.


The three buccal cavities, articulating [i] and [a]


Formants

• This difference produces a marked contrast in the frequencies that resonate in these cavities, as shown by the schematic plots of frequency over time in the next figure.
• Such enhanced frequencies, known as formants, carry the acoustic information that allows us to distinguish [i] from [a], as well as most other speech sounds. Roughly speaking,
  • the resonance of all three cavities together produces the lowest or first formant,
  • the resonance of the pharyngeal & oral cavities produces the second formant,
  • and the resonance of the labiodental cavity produces the third formant (Loritz 1999:96).
• We hedge with “roughly” because the pharyngeal cavity can take on special resonance properties, and the labiodental cavity can combine with the oral cavity; see Ladefoged (1996:123ff) for more detailed discussion.
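To connect formants back to the harmonics of the glottal pulse, the sketch below finds which harmonic of a 125 Hz fundamental lies closest to each of the first three formants of [i] and [a]. The formant values are rough illustrative figures for an adult male voice and are an assumption, not data from these slides; `nearest_harmonic` and `ASSUMED_FORMANTS_HZ` are hypothetical names.

```python
F0_HZ = 125.0  # male average fundamental from the earlier slide

# Assumed, approximate formant frequencies in Hz (F1, F2, F3).
ASSUMED_FORMANTS_HZ = {
    "i": (270.0, 2290.0, 3010.0),
    "a": (730.0, 1090.0, 2440.0),
}


def nearest_harmonic(f0_hz: float, target_hz: float) -> float:
    """Return the whole-number multiple of f0_hz closest to target_hz."""
    n = max(1, round(target_hz / f0_hz))
    return n * f0_hz


for vowel, formants in ASSUMED_FORMANTS_HZ.items():
    boosted = [nearest_harmonic(F0_HZ, f) for f in formants]
    print(vowel, boosted)
# i [250.0, 2250.0, 3000.0]
# a [750.0, 1125.0, 2500.0]
```

Under these assumed values the two vowels share the same harmonic series but have different harmonics reinforced, which is the contrast the schematic spectrograms are meant to show.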


Schematic spectrograms of the lowest three resonant frequencies (formants) of [i] and [a]


What it really looks like


NEXT TIME

Q4

Finish Ingram §5 & start §6.

☞ Go over questions at end of chapter.
