146
Multilingual Speech Processing - Rapid Language Adaptation Tools & Technologies Tanja Schultz 1,2 & Alan W Black 1 1 Language Technologies Institute (LTI), Carnegie Mellon 2 Cognitive Systems Lab, Karlsruhe Institute of Technology Interspeech 2010, Tutorial on Multilingual Speech Processing Sunday, September 26 2010, T-S2-R3 13:00 15:45, Makuhari, Japan

Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

  • Upload
    lymien

  • View
    248

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

Multilingual Speech Processing -Rapid Language Adaptation Tools & Technologies

Tanja Schultz1,2 & Alan W Black1

1 Language Technologies Institute (LTI), Carnegie Mellon2 Cognitive Systems Lab, Karlsruhe Institute of Technology

Interspeech 2010, Tutorial on Multilingual Speech ProcessingSunday, September 26 2010, T-S2-R3 13:00 – 15:45, Makuhari, Japan

Page 2: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

2/145

Tutorial Agenda

13:00 – 14:15 Part 1: Introduction and Motivation

o Motivation

o History and Leveraged Work

o Rapid Language Adaptation Server: Spice

o What, Why, and How

o Building process

14:15 – 14:30 BREAK

14:30 – 15:45 Part 2: SPICE - Under the hood

o Latest Experiments and Results in ASR and TTS

o Lessons Learnt from past studies

o Future

Page 3: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

3/145

Outline Part 1

13:00 – 14:15 Part 1: Introduction

o Motivation

o World of Languages – Languages of the World

o Speech Processing Systems

o Speech-to-Speech Translation

o Spoken Dialog Systems

o History and Leveraged Work

o Globalphone

o FestVox

o Spice: A Rapid Language Adaptation Server

o User's Level View

o Walkthrough

o What, Why and How to use SPICE

o Overview of the Building Process

Page 4: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

4/145

Introduction Outline

o Many Languages – so what?

o Growing Language Diversity on the web

o Why do we need Speech Processing in many languages?

o Is this really science – not just retraining on a new language?

o Language Characteristics

o Written form, scripts, letter-to-sound relationship

o Issues and Differences between languages

o Language Extinction

o Do we care? What can we do about?

o Challenges of Multilingual Speech Processing

o Lack of Resources

o Lack of Experts

o Solutions

o Prior Work: GlobalPhone and FestVox

o Intelligent Learning Systems

o Rapid Language Adaptation Server

Page 5: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

5/145

Introduction Outline

o Many Languages – so what?

o Growing Language Diversity on the web

o Why do we need Speech Processing in many languages?

o Is this really science – not just retraining on a new language?

o Language Characteristics

o Written form, scripts, letter-to-sound relationship

o Issues and Differences between languages

o Language Extinction

o Do we care? What can we do about?

o Challenges of Multilingual Speech Processing

o Lack of Resources

o Lack of Experts

o Solutions

o Prior Work: GlobalPhone and FestVox

o Intelligent Learning Systems

o Rapid Language Adaptation Server

Page 6: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

6/145

Many Languages – So What?

Do we really need Speech Processing in many languages?

Myth: “Everyone speaks English, why bother?”

NO: About 6900 different languages in the world

Increasing number of languages on the web

Humanitarian and military needs

Rural areas, uneducated people, illiteracy

Why is this an research issue?

Myth: “It’s just retraining on foreign data – simple!”

NO: Other languages bring unseen challenges, for example:

different scripts, no vowelization, no writing system

no word segmentation, rich morphology,

tonality, click sounds,

social factors: trust, access, exposure, cultural background

Page 7: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

7/145

Everyone speaks English, why bother?

o Huge number of Languages in the world: 6912

o Language is not only a communication tool but

fundamental to cultural identity and empowerment

o Treat linguistic

diversity as we treat

bio-diversity

(David Crystal)

o The strongest

eco systems are the

most diverse

o Cultures, ideas,

memories are

transmitted

through language

8 75

264

892

17791967

1071

344

204308

0

200

400

600

800

1000

1200

1400

1600

1800

2000

[100

- 99

9] M

io

[10

- 99]

Mio

[1 - 9]

Mio

[100

,000

- 1M

io]

[10,

000

- 99,

999]

[100

0 - 9

999]

[100

- 99

9]

[10

- 99]

[1 - 9]

UNK

Each dot gives the geographic center of the 6,912 living

languages, http://www.ethnologue.com (accessed Jul 2007)

Page 8: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

8/145

Top Languages – Distribution

http://www.ethnologue.com, as of 25.08.2010

Distribution of Living Languages

Page 9: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

9/145

So we need language support but why Speech?

Computerization: Speech is the key technology

Ubiquitous Information Access: on the go, phone-based

Mobile Devices: Too small and cumbersome for keyboards

Globalization:

Cross-cultural Human-Human Interaction

Multilingual Communities: EU, South Africa, …

Humanitarian needs, disaster, health care

Military ops, communicate with local people

Human-Machine Interfaces

People expect speech-driven applications in their mother tongue

Speech Processing in multiple Languages

Why Speech Processing?

Page 10: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

10/145

ML Speech Processing – Research Issue?

It’s just retraining on foreign data - no science!

o New language – new challenges

o Writing system: different or no script, no vowelization, G-2-P

o Word segmentation, morphology

o Sound system: tonals, clicks

o Different Cultures – social factors

o trust, access, exposure, background

o Lack of Data and Resources

o Audio recordings, corresponding transcripts

o Pronunciation Dictionaries, Lexicon

o Text corpora, parallel bilingual data

o Lack of Experts

o Technology experts without language expertise

o Native language experts without technology expertise

No

No

No

No

Page 11: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

11/145

Introduction Outline

o Many Languages – so what?

o Growing Language Diversity on the web

o Why do we need Speech Processing in many languages?

o Is this really science – not just retraining on a new language?

o Language Characteristics

o Written form, scripts, letter-to-sound relationship

o Issues and Differences between languages

o Language Extinction

o Do we care? What can we do about?

o Challenges of Multilingual Speech Processing

o Lack of Resources

o Lack of Experts

o Solutions

o Prior Work: GlobalPhone and FestVox

o Intelligent Learning Systems

o Rapid Language Adaptation Server

Page 12: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

12/145

Language Characteristics

Prosody, Tonality: Stress, Pitch, Lenght pattern, Tonal contours

(e.g. Mandarin 4, Cantonese 8, Thai & Vietnamese 5)

Sound system: simple vs very complex sound systems

(e.g. Hawaiian 5V+8C vs. German 17V+3D+22C)

Phonotactics: simple syllable structure vs complex

consonant clusters(e.g. Japanese Mora-syllables vs. German pf,st,ks)

Segmentation: Written form separate words by white space?

(NO: Chinese, Japanese, Thai, Vietnamese)

Morphology: short units, compounds, agglutination

English: Natural segmentation into short units – great!

German: Compounds – not quite so good

Donau-dampf-schiffahrts-gesellschafts-kapitäns-mütze …

Turkish: Agglutination – looooong phrases

Osman-l-laç-tr-ama-yabil-ecek-ler-imiz-den-miş-siniz

behaving as if you were of those whom we might consider not

converting into Ottoman

Page 13: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

13/145

Writing Systems

Writing systems – basic unit is a Grapheme:

Logographic: based on semantic units, grapheme represents meaning

Chinese: >10.000 hanzi; Japanese ~7000 kanji, Korean to some extend

Phonographic: based on sound units, grapheme represents sound

Segmental: grapheme roughly corresponds to phonemes

Latin (190), Cyrillic (65), Arabic (22) graphems

Abjads = consonantal segmental phonographic, e.g. Arabic

Syllabic: grapheme represents entire syllable, e.g. Japanese kana

Abugidas = mix of segmental and syllabic systems

Featural: elements smaller than phone, e.g. articulatory features

e.g. Korean: ~5600 gulja

Segmental: Latin, Cyrillic, Latin&Cyrillic, Greek,

Georgian or Armenian

Abjads: Arabic, Arabic&Latin, Hebrew&Arabic

Abugidas: North Indic, South Indic, Ethiopic,

Thaana, Canadian Syllabic ,

Logographic+syllabic: Pure logographic,

Mixed logographic&syllabaries,

Featural syllabary+lmtd logographic

Featural-alphabetic syllabary

Wikipedia: August 2007

Page 14: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

14/145

Scripts – Examples

Scripts of some languages: Arabic, Bulgarian, Catalan, Chinese, Croatian, Czech, English, Greek,

Hebrew, Hindi, Italian, Japanese, Korean, Romanian, Russian, Serbian, Thai

How many languages do have a written form?

• Omniglot lists about 780 languages that have scripts

• True number might be closer to 1000

(Source Simon Ager, 2007, www.omniglot.com)

Logographic scripts, mostly 2 representatives:

• Chinese: ~ 10.000 hanzi,

• Japanese: ~7000 kanji (+ 3 other scripts )

Phonographic:

• Korean: ~5600 gulja,

• Arabic, Devanagari, Cyrillic, Roman: ~100 characters

Page 15: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

15/145

Grapheme-to-Phoneme Relation

Grapheme-to-Phoneme (Letter-to-Sound) Relationship:

Logographic: NO relationship at all

concern for Chinese, Japanese, Korean

Phonographic: segmental: close – far – complicated

e.g. Finnish, Spanish: more or less 1:1, -- English: try „Phydough“

Phonographic: segmental – consonantal

e.g. Arabic: no short vowels written

Phonographic: syllabic

e.g. Thai, Devanagari: C-V flips

Automatic Generation of Pronunciations might get complicated

Phonographic Logographic

English Korean

JapaneseFrench

Finnish Chinese

Ratio Phonetic/Semantic Code

De

Fra

ncis

/Ung

er

Page 16: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

16/145

Introduction Outline

o Many Languages – so what?

o Growing Language Diversity on the web

o Why do we need Speech Processing in many languages?

o Is this really science – not just retraining on a new language?

o Language Characteristics

o Written form, scripts, letter-to-sound relationship

o Issues and Differences between languages

o Language Extinction

o Do we care? What can we do about?

o Challenges of Multilingual Speech Processing

o Lack of Resources

o Lack of Experts

o Solutions

o Intelligent Learning Systems

o Prior Work: GlobalPhone and FestVox

o Rapid Language Adaptation Server

Page 17: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

17/145

One more Reason for MLSP …

6900 Languages in the world …. BUT

o Extinction of languages on massive scale (David Crystal, Spotlight 3/2000)

o Half of all existing languages die out over next century On Average: One language dies every two weeks!

o Survey Feb 1999 from Summer Institute of Linguistics

51 languages with 1 speaker left

28 of those in Australia alone

500 languages with 500 spks

1500 languages with < 1000 spks

3000 languages with < 10.000

5000 languages with < 100.000

96% of world‟s languages are

spoken by only 4% of its people

8 75

264

892

17791967

1071

344

204308

0

200

400

600

800

1000

1200

1400

1600

1800

2000

[100

- 99

9] M

io

[10

- 99]

Mio

[1 -

9] M

io

[100

,000

- 1M

io]

[10,

000

- 99,

999]

[100

0 - 9

999]

[100

- 99

9]

[10

- 99]

[1 -

9]

UNK

Page 18: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

18/145

The Future of Language

Is a language with 100.000 speakers safe?

o Survival for generations depends on pressure imposed on language

o Dominance of another language, Attitude of the speakers

o Example Breton: beginning of 20th century has 1 Mio speakers, now

down to 250.000; Without effort Breton could be gone in 50 years

Reasons that languages die:

o Disaster: Earthquake on Papua New Guinea: Sissano, Warapu, Arop

o Genocide: 90% America‟s natives died within 200 years Europeans

o Cultural assimilation: Colonialism, Suppression, Assimilation:

o (1) Political, social, economic pressure to speak the dominant language,

o (2) Emerging bilingualism,

o (3) self-conscious semilingualism, (4) monolingualism

Why should we care?

o Massive death of languages reduces the diversity

o Bio-diversity has been accepted to be a good thing

o Maybe we should accept this for language diversity (D. Crystal)

Page 19: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

19/145

What can we do?

What do we learn from other languages?

o Intellectual issues: increase awareness of world history

such as movements of early civilization

o Practical issues: medical practices, alternative treatment forms

o Literature … but also new things about the language itself

o Slovakian proverb: “with each newly learned language you acquire a new soul”

How to save endangered languages:

o Community itself must want it, Surrounding culture must respect it

o Funding for courses, materials, and teachers, support the community

o Get linguists into the field, publish information, grammars, dictionaries

Costs associated:

o Depends on conditions (written vs. unwritten languages, etc.)

o Crystal estimates about $80.000 / year per language

o 3000 endangered languages is about $700Mio …

o Organizations to raise funds

o Foundation of endangered languages (FEL), UNESCO project

Page 20: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

20/145

Introduction Outline

o Many Languages – so what?

o Growing Language Diversity on the web

o Why do we need Speech Processing in many languages?

o Is this really science – not just retraining on a new language?

o Language Characteristics

o Written form, scripts, letter-to-sound relationship

o Issues and Differences between languages

o Language Extinction

o Do we care? What can we do about?

o Challenges of Multilingual Speech Processing

o Lack of Resources

o Lack of Experts

o Solutions

o Intelligent Learning Systems

o Prior Work: GlobalPhone and FestVox

o Rapid Language Adaptation Server

Page 21: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

21/145

o Lack of Resources: Stochastic approach needs many data

o Hundreds of hours audio recordings and corresponding transcriptions

Audio data 40 languages; Transcriptions take up to 40x real time

o Pronunciation dictionaries for large vocabularies (>100.000 words)

Large vocabulary pronunciation dictionaries 20 languages

o Mono- and bilingual text corpora: few language pairs, pivot mostly English

o Algorithms are language independent – MLSP is not!

o Other Languages bring unseen challenges (segmentation, G2P, etc.)

o Have we already seen ALL or MOST of the language characteristics?

o Social and Cultural Aspects

o Non-native speech and language, code switching

o Combinatorical explosion (domain, speaking style, accent, dialect, ...)

o Few native speakers at hand for minority (endangered) languages

o Do we have the right data?

o Lack of Language Experts

o Bridge the gap between technology experts and language experts

Challenges of MLSP

Page 22: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

22/145

Introduction Outline

o Many Languages – so what?

o Growing Language Diversity on the web

o Why do we need Speech Processing in many languages?

o Is this really science – not just retraining on a new language?

o Language Characteristics

o Written form, scripts, letter-to-sound relationship

o Issues and Differences between languages

o Language Extinction

o Do we care? What can we do about?

o Challenges of Multilingual Speech Processing

o Lack of Resources

o Lack of Experts

o Solutions

o Intelligent Learning Systems

o Prior Work: GlobalPhone and FestVox

o Rapid Language Adaptation Server

Page 23: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

23/145

Intelligent systems that learn a language from the user

o Efficient learning algorithms for speech processing

o Learning:

o Interactive learning with user in the loop

o Statistical modeling approaches

o Efficiency:

o Reduce amount of data (save time and costs): at least by factor of 10

o Speed up development cycles: days rather than months

Rapid Language Adaptation from universal models

o Bridge the gap between language and technology experts

o Technology experts do not speak all languages in question

o Native users are not in control of the technology

One Solution: Learning Systems

Page 24: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

24/145

Introduction Outline

o Many Languages – so what?

o Growing Language Diversity on the web

o Why do we need Speech Processing in many languages?

o Is this really science – not just retraining on a new language?

o Language Characteristics

o Written form, scripts, letter-to-sound relationship

o Issues and Differences between languages

o Language Extinction

o Do we care? What can we do about?

o Challenges of Multilingual Speech Processing

o Lack of Resources

o Lack of Experts

o Solutions

o Intelligent Learning Systems

o Prior Work: GlobalPhone and FestVox

o Rapid Language Adaptation Server

Page 25: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

25/145

GlobalPhone

Prior Work: GlobalPhone and FestVox

Page 26: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

26/145

Multilingual Database

Widespread languages

Native Speakers

Uniform Data

Broad Domain

Large Text Resources

Internet, Newspaper

Corpus

19 Languages … counting

1800 native speakers

400 hrs Audio data

Read Speech

Filled pauses annotated

Arabic

Ch-Mandarin

Ch-Shanghai

German

French

Japanese

Korean

Croatian

Czech

Portuguese

Russian

Spanish

Swedish

Tamil

Turkish

+ Thai

+ Creole

+ Polish

+ Bulgarian

+ Vietnamese

+ ... ???

GlobalPhone

http://www.cs.cmu.edu/~tanja/GlobalPhone

Available from ELRA

Or check with Tanja

Page 27: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

27/145

Phones in GlobalPhone

Multilingual Speech Processing, Schultz&Kirchhoff (ed.), Chapter 4, p.86

Page 28: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

28/145

1011.8

14 14.5 14.516.9 18 19 20 20 20.3

23.1 24.3

36.633.8

44.546.4

36.1

45.2 44.1

36.1

46.8

36.7

43.5

0

5

10

15

20

25

30

35

40

45

50

JA DE EN KO CH TU FR PO KR SP BL CZ PL RU

Err

or

Rate

[%

]

Word error rate Phoneme error rate

GlobalPhone Recognizers in 14 Languages

Page 29: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

29/145

GlobalPhone: Morphology & OOV-Rate

Language Corpus Size

(in Mio of

word tokens)

Vocabulary

(in thousands of

word types)

OOV at

60k / 64k

English WSJ 19 105 1%

Spanish Newspaper 100 490 ~1.5%

Portuguese Newspaper 11 270 4.3%

German Broadcast N 45 >900 4.4%

Czech Newspaper 16 415 8%

Serbo-Croatian Internet 12 350 8.7% (49k)

MSA Newspaper 19 690 11%

Turkish Newspaper 16 500 15%

Korean Newspaper 15 (eojeols) 1400 (eojeols) 31%

44 (syllables) 3.5 (syllables) ~0.01%

Chinese Newspaper 82 (pinyin) 59 0%

Multilingual Speech Processing, Schultz&Kirchhoff (ed.), Chapter 4, 5, 9

Page 30: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

30/145

FestVox

Prior Work: GlobalPhone and FestVox

Page 31: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

31/145

FestVox: Building Synthetic Voices

http://festvox.org [Black and Lenzo 2000]

o Documentation, Tools, Scripts, Examples

o Building Synthetic Voices in the Festival Speech Synthesis System

o Supports:

o Diphone, unit selection, (later Statistical Parametric Synthesis)

o Lexicon, letter to sound rules

o Text processing support.

Page 32: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

32/145

Early FestVox Example Languages

o CMU development

o Croatian, Thai, Chinese (Mandarin), Japanese,

Catalan, Spanish, Nepali, Telugu, Tamil, Dari,

Pashto, Farsi

o Non-CMU

o At least: Italian, Malay, Maori, Mongolian,

Spanish, Telugu, Hindi, Japanese, English

(Many), German, Swedish, Polish, …

Page 33: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

33/145

TTS Build Tasks

o Define phone set

o Define pronunciations (LTS vs. Lexicon)

o Design prompt list

o Record data

o Write text front-end

o Number, symbol expansion

o Write/train prosody model

o Deal with something novel

o Word segmentation, no vowels, declensions

Page 34: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

34/145

TTS Build Results

o Results strongly correlated to effort

o Must-have for funded project

o Involve speech experts

o End of semester (student graduates)

o Almost random distribution rights

o Others can‟t always use the previous results

o No explicit copyrights (and no way to change them)

o Results often not in format for re-use

Page 35: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

35/145

Joint Speech Model Development

CMU projects: Arabic, Thai, Croatian, Farsi

o Shared audio data collection

o Prompts with phonetic coverage

o Lots of (ASR) / Single (TTS) speaker(s)

o Shared Phone set

o Sometimes “similar” e.g. with/without Tone

o Shared Pronunciation Data

o (Note) input and output are different vocab

But we need a much tighter coupling ….

Page 36: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

36/145

Introduction Outline

o Many Languages – so what?

o Growing Language Diversity on the web

o Why do we need Speech Processing in many languages?

o Is this really science – not just retraining on a new language?

o Language Characteristics

o Written form, scripts, letter-to-sound relationship

o Issues and Differences between languages

o Language Extinction

o Do we care? What can we do about?

o Challenges of Multilingual Speech Processing

o Lack of Resources

o Lack of Experts

o Solutions

o Intelligent Learning Systems

o Prior Work: GlobalPhone and FestVox

o Rapid Language Adaptation Server

Page 37: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

37/145

Speech Processing: Interactive Creation & Evaluation toolkit

• National Science Foundation, Grant 2004-2008 (Schultz & Black)

• Bridge the gap between technology experts language experts

• Automatic Speech Recognition (ASR),

• Machine Translation (MT),

• Text-to-Speech (TTS)

• Develop web-based intelligent systems

• Interactive Learning with user in the loop

• Rapid Adaptation from universal models

• SPICE webpage http://cmuspice.org

Rapid Language Adaptation Toolkit (RLAT)

• Text Data Webcrawling, Focused Recrawling, Text Normalization

• Wiktionary-based Pronunciation Generation, Telephone Interface

• RLAT webpage http://csl.ira.uka.de/rlat-dev

SPICE and RLAT

Page 38: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

38/145

Input: Speech

Pronunciation rules

hi /h//ai/you /j/u/we /w//i/

hi youyou areI am

AM Lex LMOutput:

Speech & Text

Hello NLP

/

MT

TTS

Text data

Phone set & Speech data

Speech Processing Systems

Page 39: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

39/145

Lexst LMt

Word s

Word t N-grams

AMtDictt

Word

phone

sequence

LMt

N-grams

AMs Dicts

Word

phone

sequence

Lexts

Word s

Word t

LMs

N-grams

AMs Dicts LMs

Word

phone

sequenceN-grams

AMtDictt

Word

phone

sequence

Input Ls Output Lt

Input LtOutput Ls

Speech-to-Speech Translation

Lsource Ltarget

Lsource Ltarget

SPICE Design Principles

o Data Sharing → Language Universal Models

o Knowledge Sharing across System Components

Page 40: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

40/145

Monolingual Dialog Systems

o Speech Models

o ASR acoustic and language models

o TTS models

o Lexical coverage

o NLP models

o Parsing

o Generation

o Interpretation (back-end processing)

Page 41: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

41/145

SPICE

o Collects:

o Appropriate text data

o Appropriate audio data

o Defines:

o Phoneme set

o Rich prompt set

o Lexical pronunciations

o Produces:

o Pronunciation model

o ASR acoustic model

o ASR language model

o TTS voice

o Maintains:

o Projects and users login

o Data and Models

Page 42: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

42/145

User„s View of Spice

Building support for a new language

o Login and Project Registration

o Text Collection & Prompt Selection

o Speech Data Collection

o Phone Set Specification and Selection

o Lexical construction

o TTS Voice Building

o ASR Bootstrap & training

o ASR Language model

Page 43: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

43/145

Login and Project Registration

o Separate “projects“ for each language

o could share info between different projects

o All tasks times are logged

o Allow us to do cost/efficiency studies

Page 44: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

44/145

Page 45: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

45/145

Page 46: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

46/145

Text Collection

o We need text data for the target language

o Web crawler

o Plus boost data from similar sites

o Language encoding

o Non-trivial, but ...

o Deal with very common alphabets

o Internally all utf-8

o In-domain vs general text

o Character analysis

o Find the character classes:

o casing, numerals, punctuation etc

Page 47: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

47/145

Page 48: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

48/145

Prompt Selection

o Prompts for recording:

o Collection without transcription

o “Good” coverage will give “clean” models

o Prompts should be:

o Easy to say (no hard words, numerals etc)

o Rich in variability

Page 49: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

49/145

Finding Nice Prompts

o Only contain high frequency words

o No unusual words with unusual spelling

o 5 to 15 words

o Make them easy to say without errors

o Easy to say in one breath group

o “Phonetically” rich

o But we have no phonetic information yet

o Make them orthographically rich

o Greedily select to maximize tri-graphs

Page 50: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

50/145

Speech Data Collection

o Online audio recording tools

o Collaboratively record large number of speakers

o Speakers may separate from developer

o Visual feedback during recording

o Automatic upload on completion

o Java based for portability

o Works with *many* browsers

o In control of recording

o We can control the recording format

o File contents and directory structure

Page 51: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

51/145

Page 52: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

52/145

New Alternative: Telephone-based Collection

52

o Option 1: use web-based recorder

o Option 2: Telephone (RLAT)

IVR: Interactive Voice Response

PSTN: Public Switched Telephone Network

VoIP: Voice over IP

Page 53: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

53/145

Implementation

• Dialplan implementation: extensions.conf

• Phone number +49 721 180 30 681

53

Page 54: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

54/145

Phoneme Selection

o Selection from standard IPA chart

o User‟s names for phonemes

o Can match their lexicon (if one exists)

o Can match their familiarity

o Audio feedback

o Click to hear recording of each phone

o Allows us to map their phone names

o We map phones to IPA

o Get phonetic features for user‟s phones

o (what are vowels, what are stops etc)

Page 55: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

55/145

Page 56: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

56/145

Page 57: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

57/145

User„s View of Spice

Building support for a new language

o Login and Project Registration

o Text Collection & Prompt Selection

o Speech Data Collection

o Phone Set Specification and Selection

o Lexical construction

o ASR Bootstrap & training

o ASR Language model

o TTS Voice Building

Page 58: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

58/145

Page 59: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

59/145

Page 60: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

60/145

Page 61: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

61/145

Page 62: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

62/145

Page 63: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

63/145

Page 64: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

64/145

BREAK

13:00 – 14:15 Part 1: Introduction

o Motivation, History, Leveraged Work

o Spice: A Rapid Language Adaptation Server

o User's Level Walkthrough

Login, Data Collection, Phone Selection

After the Break:

o Spice – Under the Hood

o Lexical construction

o ASR Bootstrap & training

o ASR Language model

o TTS Voice Building

o Latest Experiments and Results

o Future

Page 65: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

65/145

SPICE: Demo Tape

Page 66: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

66/145

Outline Part 2

14:30 – 15:45 Part 2:

o SPICE – Under the hood

o Text collection & Prompt Selection

o Phone set specification

o Lexical construction

1. Forget construction, i.e. perform Grapheme-based ASR

2. Harvest Internet (Wiktionary) Resources

3. Interactive Learning (LexLearner)

o TTS Voice Building

o ASR Bootstrap & training

o ASR Language model

o Latest Experiments and Results

o Evaluation

o Future Steps

Page 67: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

67/145

Input: Speech

Pronunciation rules

hi /h//ai/you /j/u/we /w//i/

hi youyou areI am

AM Lex LMOutput:

Speech & Text

NLP

/

MT

TTS

Textdaten„adios“ /a/ /d/ /i/ /o/ /s/

„Hallo“ /h/ /a/ /l/ /o/

„Phydough“ ???

Hello

Rapid Portability: Pronunciation Dictionary

Page 68: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

68/145

11,5

19,218,4

24,526,8

15,614 12,7

3336,4

32,8

16

26,4

18,3

0,0

10,0

20,0

30,0

40,0

50,0W

ord

Err

or

Ra

te [

%]

Phoneme Grapheme (FTT)Grapheme

English Spanish German Russian Thai

Phoneme- vs Grapheme based ASR

Problem:

• 1 Grapheme 1 Phoneme

Flexible Tree Tying (FTT):

One decision tree

• Improved parameter tying

• Less over specification

• Fewer inconsistencies

0=vowel?

0=obstruent? 0=begin-state?

-1=syllabic? 0=mid? -1=obstruent? 0=end?

AX-m

IX-m

AX-b

Page 69: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

69/145

Wiktionary as Source

o Automatic Pronunciation Dictionary Generation

o Idea: Automatically extract pronunciations from Wiktionary

69

Paper here at Interspeech 2010, Oral presentation, Wednesday, 5:20pm, Hall A/B

Tim Schlippe, Sebastian Ochs, and Tanja Schultz

Wiktionary as a Source for Automatic Pronunciation Extraction

Page 70: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

70/145

Wiktionary

• Quantity Check: Given a word list, what is the percentage of words for which phonetic

notations are found in a complete IPA representation?

• Quality Check: How many pronunciations derived from Wiktionary are identical to

existing GlobalPhone pronunciations?

How does adding Wiktionary pronunciations impact the performance

of ASR systems?

Page 71: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

71/145

Wiktionary as a Source

Top-Ten of Wiktionary

Language Editions

(July 2010)

http://meta.wikimedia.org

/wiki/List of Wiktionaries

Quantity of Pronunciations found

(% of pages with pronunciations)

Page 72: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

72/145

Wiktionary as a Source

• Quantity of proper names Proper names can be of diverse etymological origin and can surface

in another language without undergoing the process of assimilation

to the phonetic system of the new language

important as difficult to generate with letter-to-sound rules

Search pronunciations of 189 international city names and

201 country names to investigate the coverage of proper

names:

Page 73: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

73/145

Wiktionary as a Source

Amount of compared pronunciations, percentage of

identical ones and amount of new pronunciation variants:

Approach I: Using all Wiktionary pronunciations for training and decoding

Impact on ASR performance

Approach II: Using only those Wiktionary pronunciations in decoding that were

chosen in training (see table):

Page 74: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

74/145

* Follow the work of

Davel&Barnard

* Word list:

extract from text

User

Word list W

i:= best select

Word wi

Generate

pronunciation P(wi)

TTS

P(wi) okay?Yes

Delete wi

No

Update G-2-P

Improve

P(wi)

G-2-P

Delete wi

* Update after each wi

effective training

* G-2-P

- explicit map rules

- neural networks

- decision trees

- instance learning

(grapheme context)

LexSkip

Dictionary: Interactive Learning

Page 75: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

75/145

Page 76: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

76/145

Page 77: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

77/145

Lex Learner

Page 78: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

78/145

Lex Learner

Page 79: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

79/145

Issues and Challenges

o How to make best use of the human?

o Definition of successful completion

o Which words to present in what order

o How to be robust against mistakes

o Feedback that keeps users motivated to continue

o How many words to be solicited?

o G2P complexity depends on the

language (SP easy, EN hard)

o 80% coverage

hundred (SP) to thousands (EN)

o G2P rule system perplexity

Language Perplexity

English 50.11

Dutch 16.80

German 16.70

Afrikaans 11.48

Italian 3.52

Spanish 1.21

Page 80: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

80/145

Lex Learner TTS

o TTS feedback good for lexical pronunciation

o [Davel and Barnard 2004]

o Play TTS version of predicted pronunciation

o This even helps expert phoneticians

o Need full IPA TTS voice

o We do phonetic based TTS

o Unit selection (high fidelity)

o Flat prosody (but only isolated words)

o Good enough to keep user on track

Page 81: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

81/145

Outline Part 2

14:15 – 15:45 Part 2:

o SPICE – Under the hood

o Text collection & Prompt Selection

o Phone set specification

o Lexical construction

o TTS Voice Building

o ASR Bootstrap & training

o ASR Language model

o Latest Experiments and Results

o Lessons Learnt from past studies

o Evaluation

o Future

Page 82: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

82/145

Input: Speech

hi /h//ai/you /j/u/we /w//i/

hi youyou areI am

AM Lex LMOutput:

Speech & Text

NLP

/

MT

TTS

Phone set & Speech data

Hello

Rapid Portability: TTS

Page 83: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

83/145

Page 84: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

84/145

Statistical Parametric TTS

o Text-to-speech for Applications:

o Common technologieso Diphone: too hard to record and label

o Unit selection: too much to record and label accurately

o Statistical Parametric: “just right”

o Statistical Parametric Synthesis

o “HMM synthesis”

o clustergen trajectory synthesiso Clusters representing context-dependent allophones

o PRO: o can work with little speech (10 minutes)

o Robust to poor data

o CON: o Signal sounds “buzzy”, can lack varied prosody

Page 85: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

85/145

Voice Building Process

o Can usually collect 300-500 utterances

o Single speaker, rich prompt set

o Have lexical coverage (from Lex Learner)

o Automatic labeling from acoustic models

o Automatic: spectral and prosodic models

o But prosody will be similar to recordings

o No text processing front end (yet)

Page 86: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

86/145

Cross Lingual Voice Conversion

o Use non-native acoustic models

o Adapt them to target language

o Use small amount of target data

o Align and build mapping function (mllr or GMM)

o Requires phoneme mapping

o Automatic or by hand

[Anumachipalli and Black SLTU 2010]

Page 87: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

87/145

CLVC from English

o Conversion to German and Telugu

Page 88: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

88/145

Using ASR data for TTS

o Conventional TTS databases:

o Single speaker, well recorded

o Conventional ASR databases:

o Multi speaker, varied quality

o [Yamagishi et al IS2009]

o Speaker Adaptive Training

o Statistical Parametric Synthesis

o (Select “similar” speakers from DB)

Page 89: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

89/145

Outline Part 2

14:15 – 15:45 Part 2:

o SPICE – Under the hood

o Text collection & Prompt Selection

o Phone set specification

o Lexical construction

o TTS Voice Building

o ASR Bootstrap & training

o ASR Language model

o Latest Experiments and Results

o Lessons Learnt from past studies

o Evaluation

o Future

Page 90: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

90/145

Page 91: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

91/145

Input: Speech

hi /h//ai/you /j/u/we /w//i/

hi youyou areI am

AM Lex LMOutput:

Speech & Text

NLP

/

MT

TTS

Phone set & Speech data

+

Hello

Rapid Portability: Acoustic Models

Page 92: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

92/145

Phone set & Speech data

Rapid Portability: Data

Step 1:

• Uniform multilingual database (GlobalPhone)

• Build Monolingual acoustic models in many languages

Page 93: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

93/145

Multilingual Acoustic Modeling

Step 2:

• Combine monolingual acoustic models to a set of

multilingual “language independent” acoustic model

Page 94: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

94/145

Speech Production is independent from Language IPA

1) IPA-based Universal Sound Inventory

2) Each sound class is trained by data sharing

Reduction from 485 to 162 sound classes

m,n,s,l appear in all 12 languages

p,b,t,d,k,g,f and i,u,e,a,o in almost all

Universal Sound Inventory

Page 95: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

95/145

Input: Speech

hi /h//ai/you /j/u/we /w//i/

hi youyou areI am

AM Lex LMOutput:

Speech & Text

NLP

/

MT

TTS

+

Hello

Rapid Portability: Acoustic Models

Step 3:

• Define mapping between ML set and new language

• Bootstrap acoustic model of unseen language

Page 96: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

96/145

Acoustic Model Building

o Acoustic Model Building requires:

o Recorded Read Speech Data

(since data is read, we have the transcripts!)

o Phone set definition

o Pronunciation Lexicon

o Two step process:

1. Configuration

2. Model Training

Page 97: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

97/145

Acoustic Model Building - Configuration

o Checks dependencies and errors

o Lexicon and phone set correspond

o Words in recorded prompts are covered by the lexicon

o Divides the recorded data into training and test sets

o Performance evaluation

o Few data: K-fold cross-validation, with K = #speakers

o More data: Data split into 90% (train) and 10% (test)

Page 98: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

98/145

Acoustic Model Building - Configuration

Page 99: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

99/145

Acoustic Model Building - Training

o Requires successful configuration

o Creates Log files and Displays to the user

o All steps of training

o EM Training for Context Independent Models

o 3-state HMM

o Number of Gaussians per Model depends on data

o EM Training for Context Dependent Models

o Number of models depends on data

o MFCC front-end, LDA

o Progress of training procedure

o Results of performance evaluation

Page 100: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

100/145

Acoustic Model Building - Training

Page 101: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

101/145

Unsupervised Adaptation

• Goal:

– Build Automatic Speech Recognition (ASR) for unseen

Language/Accent/Dialect with minimal human effort

• Challenge:

– No or Few Data, i.e. no transcribed Data!!!

• Solution:

– No transcriptions

apply unsupervised training approaches

– Lack of Linguistic Knowledge

transfer knowledge from other languages

– Here:

• Use several languages

• … of the same language family

• … and combine knowledge

Page 102: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

102/145

Experimental Setup

• Given: Data, Transcripts, ASR for several languages

• Recognition Systems for 4 Slavic Languages

– Croatian (South-Slavic, 7M spks)

– Russian (East-Slavic, 165M spks)

– Bulgarian (South-Slavic, 12M spks)

– Polish (West-Slavic, 56M spks)

• Wanted: ASR for Czech: (West-Slavic, 12M spks)

• GlobalPhone (ELRA), read speech, 100spks, ~ 20hrs per language

Page 103: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

103/145

SOURCERussian

Phone

Set

AM

SOURCEBulgarian

Phone

Set

AM

SOURCECroatian

Phone

Set

AM

Cross-Language Transfer

• Benefit :Source Acoustic Model not touched, apply CD models

• Benefit: Faster Decoding

• Drawback: If applied iteratively (see later), no adaptation of target AM

TARGETCzech

Dict

LM

SOURCEPolish

Phone

Set

AM

Cross-

Language

Transfer

AM

Mapped

Dict

LM

Page 104: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

104/145

Manual Phone Mapping

Page 105: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

105/145

Cross-Language Transfer (C-T)o Given: ASR in 4 languages (Bulgarian, Croatian, Polish, Russian)

o Audio Data in Czech but NO Transcripts

o Czech Pronunciation Rules (G-2-P Mapping for Dictionary)

o Text Data, Vocabulary Selection (Automatically derived)

o ASR Performance on Czech

o Apply Manual Phone Mapping to Dictionary

o Apply Source Language Acoustic Models (AM) as is

o Recognize Target Language Czech (Dev set)

o Overall performance is depressingly bad

o Word Error Rates (WER) vary with source language

o Automatic Mapping gives slightly better numbers but not worth it

Page 106: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

106/145

Improvements: Unsupervised Training (UT)

• UT: Assume audio but no transcription

– take recognizer hypothesis as transcription

• Several interesting works on UT (Zavaliagkos, 1998; Lamel

et. al 2002; Wessel/Lööf/Ney 1999-2009, …)

• UT effective only in combination with „confidence scores“

to select/weight the correct portions of the hypotheses

• Kemp and Schaaf, 1997 proposed „gamma“ and „A-stabil“

as confidence measures (JRTk)

• Problem: A-Stabil works well

for well trained Acoustic

Models (AM) but not with

badly trained AMs,

so NO option for C-T

Page 107: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

107/145

Multilingual A-Stabil

Page 108: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

108/145

Results on Multilingual A-Stabil

Page 109: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

109/145

Bootstrap Framework

WER C-T Iter1 Iter2

Bulgarian 61.0 24.5 23.6

Croatian 57.2 24.6 23.7

Polish 55.8 24.4 24.1

Russian 64.3 24.1 23.8

BestCZ 23.1(supervised)

Performance on

Czech Development Set

Poster here at IS2010:

N.T. Vu, T.Schlippe, F. Kraus, T. Schultz,

Rapid Bootstrapping of five Eastern European

languages using the Rapid Language Adaptation

Toolkit , Tuesday 10:00, Room B

Page 110: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

110/145

Outline Part 2

14:15 – 15:45 Part 2:

o SPICE – Under the hood

o Text collection & Prompt Selection

o Phone set specification

o Lexical construction

o TTS Voice Building

o ASR Bootstrap & training

o ASR Language model

o Latest Experiments and Results

o Lessons Learnt from past studies

o Evaluation

o Future

Page 111: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

111/145

Language Model Building

Goal:

o Get as much relevant text data as possible

o Use the text data for

o Generating recording prompts

o Generating vocabulary lists

o Build Language Models for ASR

Approach

1. User supplies an URL to SPICE for crawling

2. Crawler retrieves N documents (web-pages)

3. Compute the statistics (TF-IDF) from the N documents

4. Terms with highest TF-IDF score form query terms

5. Query search engine (Google) to get the URLs for the

query terms

6. Crawl the URLs for the data

Page 112: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

112/145

Page 113: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

113/145

RLAT – Snapshot function

o Informative feedback about the quality of the crawled text

o Results show quality (PP, OOV), computed and displayed periodically

(to be defined by the user) during the crawling process

RLAT webpage:

http://csl.ira.uka.de/rlat-dev

Page 114: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

114/145

RLAT – Snapshot function

RLAT webpage: http://csl.ira.uka.de/rlat-dev

o PP

o OOV

o 10-fold

Page 115: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

115/145

RLAT – Snapshot function

Example: 18 days of crawling www.leparisien.fr (French)

n-gram coverage

7.0

6.0

5.0

4.0

3.0

2.0

1.0

0.0

OOV Rate (%)

100 Mio

50 Mio

150 Mio

200 Mio

250 Mio

300 Mio

350 Mio

400 Mio

100 K

50 K

150 K

200 K

250 K

300 K

Vocabulary size Total words

Page 116: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

116/145

Text Normalization based on SMT and Internet User Support

o Text extraction, HTML tag removal

o Language independent normalization

o Language-specific normalization

o Common abbreviations, punctuation, numbers, dates, casing

o New approach: (See Interspeech 2010 paper)

o Web-based user interface for language-specific text normalization

o Hybrid approach (rules + SMT)

Figure: Web-based User Interface for Text Normalization

Page 117: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

117/145

Text Normalization based on SMT and Internet User Support

o Experiments and Results:

o How well does SMT perform in comparison to LI-rule, LS-rule and

human?

o How does the performance of SMT evolve over the amount of

training data?

o How can we modify our system to get a time and effort reduction?

o Evaluation:

o comparing the quality of 1k output sentences derived from the

systems to text which was normalized by native speakers in our lab

o creating 3-gram LMs from our hypotheses and evaluated their

perplexities on 500 sentences manually normalized by native

speakers

o Detailed Results:Paper here at Interspeech 2010

Tim Schlippe, Chenfei Zhu, Jan Gebhardt, and Tanja Schultz,

Text Normalization based on Statistical Machine Translation and

Internet User Support

Page 118: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

118/145

Text Normalization based on SMT and Internet User Support

Figure: Performance (edit dist.) over amount of training data

Page 119: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

119/145

Outline Part 2

14:15 – 15:45 Part 2:

o SPICE – Under the hood

o Text collection & Prompt Selection

o Phone set specification

o Lexical construction

o TTS Voice Building

o ASR Bootstrap & training

o ASR Language model

o Latest Experiments and Results

o Lessons Learnt from past studies

o Evaluation

o Future

Page 120: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

120/145

o Goal: Build Afrikaans – English Speech Translation System with SPICE

o Cooperation with University Stellenbosch and ARMSCOR

o Bilingual PhD visited CMU for 3 month

o Afrikaans: Related to Dutch and English,

g-2-p very close, regular grammar, simple morphology

o SPICE, all components apply statistical modeling paradigm

o ASR: HMMs, N-gram LM (JRTk-ISL)

o MT: Statistical MT (SMT-ISL)

o TTS: Unit-Selection (Festival)

o Dictionary: G-2-P rules using CART decision trees

o Text: 39 hansards; 680k words; 43k bilingual aligned sentence pairs;

Audio: 6 hours read speech; 10k utterances, telephone speech (AST)

SPICE 2005: Afrikaans – English

Page 121: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

121/145

o Good results: ASR 20% WER; MT A-E (E-A) Bleu 34.1 (34.7), Nist 7.6 (7.9)

o Shared pronunciation dictionaries (for ASR+TTS) and LM (for ASR+MT)

o Most time consuming process: data preparation reduce amount of data!

o Still too much expert knowledge required (e.g. ASR parameter tuning!)

58 7

311

5 50

5

10

15

20

25

Data Training Tuning Evaluation Prototype

daysAM (ASR) Lex LM (ASR, MT) TM (MT) TTS S-2-S

Time Effort

Herman Engelbrecht, Tanja Schultz, Rapid Development of an Afrikaans-English Speech-to-Speech Translator ,

IWSLT 2005, Pittsburgh, PA, October 2005

Page 122: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

122/145

SPICE 2007: Field Experiments

o Now targeting more languages in a shorter time frame

o 6-weeks Hands-on Course at CMU in Spring 2007

o Adopt native languages of participating students as targets

o Added up to 10 different languages: Bulgarian, English, French,

German, Hindi, Konkani, Mandarin, Telugu, Turkish, Vietnamese

o Teams of two students with different native language

o Course goal was to build a simple S-2-S system and use

this to communicate with each other in their mother tongue

o Solely rely on SPICE tools

o Build speech recognition components in two languages

o Build simple SMT component in two directions

o Build speech synthesis components in two languages

o Report back on problems and system shortfalls

Page 123: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

123/145

Field Experiments (2)

o The 10 languages cover broad range of peculiarities

o Writing system:

o Logographic Hanzi (Mandarin);

o Cyrillic (Bulgarian);

o Roman (German, French and English);

o phonographic segmental (Telugu and Hindi);

o phonographic featural (Vietnamese)

o No script: Konkani

o Segmentation: No segmentation (Chinese); Segmentation white

spaces do not necessarily indicate word (Vietnamese)

o Morphology: simple, low inflecting (English), compounding (German),

agglutinating (Turkish) …

o Sound System: tonal (Mandarin and Vietnamese), stress (Bulgarian)

o G-2-P: straightforward (Turkish), challenging (Hindi), difficult (English),

no relationship (Chinese), invented (Konkani)

Page 124: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

124/145

Lessons Learned

o It is possible to create speech processing components for

10 languages in 6-weeks using SPICE

o Each language brings new challenges

o Many SPICE features turned out to be very helpful, e.g.

only ONE speaker of Konkani in Pittsburgh, web recorder

allowed remote collection of more speakers

o Log: time spent

in SPICE interface

o Improve interface

using breakdown

o Use feedback

o Interface allows for

collaborative work

Task Time Spent

[hh:mm]

Text Collection 8:35

Audio Collection 10:07

Phoneme Selection 4:05

LM building 1:25

G-2-P specs 1:30

Page 125: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

125/145

Outline Part 2

14:15 – 15:45 Part 2:

o SPICE – Under the hood

o Text collection & Prompt Selection

o Phone set specification

o Lexical construction

o TTS Voice Building

o ASR Bootstrap & training

o ASR Language model

o Latest Experiments and Results

o Lessons Learnt from past studies

o Evaluation of time vs expertise

o Future

Page 126: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

126/145

Initial evaluations

o Conducted 2 semester-long lab courses

o students use SPICE to create working ASR

and TTS in a language of their choice

o bonus for the ambitious

o train statistical MT system between two

languages to create a speech-to-speech

translation system

o Evaluation includes

o time to complete

o task difficulties

o ASR word error rate

o TTS voice quality

Page 127: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

127/145

Evaluation details on TTS

o Research questions:

o Which features contribute most?

o Are language-dependent features critical?

o How does voice quality vary with:

o Amount of recorded data

o Number of lexical entries

o Can objective measures estimate voice quality?

o Can any measures motivate the user?

Page 128: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

128/145

TTS from phonemes

o “welcome”

o W EH L K AH M

W23

EH912

L41

K941

AH553

M85

lexicon or LTS

clustered 'allophones'

symbolics

units or trajectories

acoustics

parametric synthesis

(trajectory-based)

predicted from

phonetic and linguistic

context using CART trees

Page 129: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

129/145

CART tree features

o Four categories of training features

1. name: phoneme and HMM state context

2. position: e.g., % from start (of phrase,

word, phone, HMM-state)

3. IPA features: based on phoneme set

4. linguistic: e.g. PoS, syllable segmentation

o Increasing difficulty

o 1. and 2. are language-independent

o 3. requires a defined phoneset

o 4. requires a computational linguist

Page 130: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

130/145

Effect of feature classes

Feature class Number Lang-dep. Δ MCD

no CART trees 1 no baseline

name symbolics 16 no - 0.452

position values 7 no - 0.402

IPA symbolics 72 yes - 0.001

linguistic sym. 14 yes + 0.004

o Mel-cepstral distortion (MCD)

o Standard objective measure in TTS/VC

o lower numbers are better

o ~ 0.2 is perceptually noticeable

o ~ 0.08 is statistically significant

o first two feature classes (name and position) matter

Page 131: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

131/145

Effect of feature classes on MCD

arctic_slt

Wagon Stop Value

10 100 1000

MCD

1-2

4

4.5

5.0

5.5

6.0Cummulative Feature Classes

Legend

names

names + posn

names + posn + IPA

+ linguistic (hidden)

Page 132: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

132/145

Effect of database size

o How does MCD improve with more speech?

o improvement of 0.2 from 3.75->7.5 minutes

o improvement 0.1-0.12 beyond that:

(near-linear increase with a doubling of data)

o 2x is perceptually better when <10m speech

o 4x is perceptually better after 10m speech

o Where is the asymptote?

o don't know yet!

o maybe 20 hours

Page 133: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

133/145

Database size vs MCD

tested on

training data

tested on 10% heldout

Wagon Stop Value

10 100 1000

MCD

1-2

4

4.0

4.5

5.0

5.5

6.0Effect of Database Size on MCD

arctic_slt

Legend

1/16 hour

1/8 hour

1/4 hour

1/2 hour

1 hour

1 hour

Page 134: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

134/145

Effect of a good Lexicon

o Grapheme-based voice

o 26 letters a-z are a substitute 'phone' set

o no IPA and linguistics features

o 3 slides ago showed that these don't matter

o English has highly irregular spelling

o the acoustic classes are impure

o measuring global voice quality – overlooking

mispronounced words

o Results:

o MCD improves by 0.27

o consistent across CART leaf node size

(“stop value”)

Page 135: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

135/145

Grapheme vs Phoneme (English)

Wagon Stop Value

10 100 1000

MCD

1-2

4

4.5

5.0

5.5

6.0Grapheme versus Phoneme-based Voices

arctic_slt

Legend

grapheme based

phoneme based

Page 136: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

136/145

10 non-English test languages

o European

o Bulgarian, French, German, Turkish

o Indian

o Hindi, Konkani, Tamil, Telugu

o East Asian

o Mandarin, Vietnamese

Page 137: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

137/145

Evaluating non-English voices

o Need a 'good' and a 'bad' voice

o Phoneme-based English is good

o Grapheme-based English is bad

o Data covers 3m to 1h of speech

o may be extrapolated to about 4h

o Following voices are from student

lab projects

Page 138: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

138/145

Non-English languages

Database Size (h)

0.1 1

MC

D 1

-24

4.5

5.0

5.5

6.0Effect of Database Size on MCD - Multi-Lingual

arctic_slt

Konkani

Mandarin

Bulgarian

HindiTamil

German

French

Vietnamese Legend

character-based

phoneme-based

Page 139: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

139/145

Characterizing voice quality

o MCD and size permits a quick assessment

o French is in good shape

o German could use lexicon improvements

o Hindi and Tamil are good for their size

o recommendation: collect more speech

o Bulgarian, Konkani and Mandarin need more

speech and a better lexicon

o Vietnamese voice had character set issues

o resulted in only ¼ of the speech being used

Page 140: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

140/145

More speech or a better lexicon?

o From the English MCD error curves

o 5x the speech = fixing the phoneset+lexicon

o Which is more time effective?

o assume 3-4 sentence recordings per minute

o assume 2-3 lexicon verifications per minute

o Answer

o If speech DB is small, record more speech

o If speech DB is large, work on the lexicon

o The transition point is language dependent

Page 141: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

141/145

Evaluation on TTS: Conclusions

o Discovered

o a way to estimate the quality of new voices

o a framework for directing the best course of

actions (recording more data vs lexical

improvements)

o that IPA and our particular language

dependent features are not critical to

success (Phew!)

o Next stage

o detecting bad lexical entries from acoustics

Page 142: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

142/145

Outline Part 2

15:40 – 17:00 Part 2:

o SPICE – Under the hood

o Text collection & Prompt Selection

o Phone set specification

o Lexical construction

o TTS Voice Building

o ASR Bootstrap & training

o ASR Language model

o Latest Experiments and Results

o Lessons Learnt from past studies

o Evaluation of time vs expertise

o Future

Page 143: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

143/145

Conclusion

o Challenges in Multilingual Speech Processing

o Well defined build processes: ASR, MT, TTS … BUT:

o Every new language brings unseen challenges

o Current (statistical) approaches require lots of data

o … and native language expert and technology expertise

o How to bridge the gap between language and tech expert?

o Proposed solution: SPICE and RLAT

o Learning by interaction from a cooperative (but naïve) user

o Rapid adaptation from language universal models

o Knowledge sharing across components

o Development cycle: Days rather than weeks

o Better automatic identification of problem types

Page 144: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

144/145

Future

o Continuous Server Support

o Improve Interface based on user feedback and lessons learned

o Improve Language Robustness: font encoding, …

o Software Engineering, Scaling

o Collaboration

o Multiple people working on the same project

o Leverage from archived projects

o Cross-confirmation

o Multiple views for within and across project confirmation

o Confidence measure to find appropriate combination

o Error-blaming

o End-to-end system Evaluation vs Component Evaluation

o Automatic Generation of Recommendations to improve systems

Page 145: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

145/145

Latest Information

o System is online at http://cmuspice.org

o Use system for your own project

o Create new login/passwd and project

o Preloaded Hindi Example

o Login as

o Login: demo

o Passwd: demo

o Chose project # (your birth day)

Page 146: Multilingual Speech Processing - KITcsl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/PP02_MSP_MS… · 1000 1200 1400 1600 1800 2000 [ 100 - 999 ... o Slovakian proverb:

146/145