
  • Speech Recognition for Under-resourced Languages using Probabilistic Transcriptions

    Preethi Jyothi
Department of CSE, IIT Bombay

    CS344 Guest Lecture

    February 7, 2017

  • Introduction

    Automatic speech recognition (ASR): translate spoken words into text

    Examples: Siri, Cortana

  • ASR isn’t to blame for this…

    Image from http://takeitwithagrainofsalt.quora.com/Thanks-Siri?srid=pr0Y&share=1

  • Introduction

    Automatic speech recognition (ASR): translate spoken words into text

    Examples: Siri, Cortana

    Modern ASR systems are dominated by statistical methods pioneered by [Jelinek '76]

  • Standard ASR Pipeline

    W → LANGUAGE MODEL → PRONUNCIATION MODEL → ACOUSTIC MODEL → X, jointly modeled as p(X, Q, W)

    Decoding: Given X, find argmax_W p(W | X)
      = argmax_W p(W, X)
      = argmax_W Σ_Q p(X, Q, W)

    p(X, Q, W) ≈ p(X | Q) · p(Q | W) · p(W)

      Acoustic model: p(X | Q) ≅ p(X | Q, W)
      Pronunciation model: p(Q | W)
      Language model: p(W)
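    To make the decoding rule concrete, here is a minimal sketch on a toy model; the vocabulary, pronunciations, and all probabilities below are invented for illustration and are not from the lecture.

```python
# Toy noisy-channel decoding: argmax_W sum_Q p(X|Q) * p(Q|W) * p(W).
# All numbers are made up for illustration.

p_W = {"cheek": 0.6, "chig": 0.4}           # language model p(W)

p_Q_given_W = {                             # pronunciation model p(Q|W)
    "cheek": {("tS", "i:", "k"): 1.0},
    "chig":  {("tS", "I", "g"): 1.0},
}

# Acoustic model p(X|Q): likelihood of the observed audio X under each
# phone sequence (stand-in numbers for a real acoustic model).
p_X_given_Q = {("tS", "i:", "k"): 0.02, ("tS", "I", "g"): 0.01}

def decode():
    # score each word W by sum_Q p(X|Q) p(Q|W) p(W)
    scores = {}
    for w, prior in p_W.items():
        scores[w] = sum(p_X_given_Q.get(q, 0.0) * pq * prior
                        for q, pq in p_Q_given_W[w].items())
    return max(scores, key=scores.get), scores

best, scores = decode()
print(best, scores)   # cheek {'cheek': 0.012, 'chig': 0.004}
```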

  • ASR over the years

    [Figure: "NIST STT Benchmark Test History—May 2009": word error rate (%) on a log scale vs. year (1988–2011) for major NIST ASR benchmarks: read speech (1k/5k/20k-word vocabularies, noisy, varied microphones), air travel planning kiosk speech, broadcast speech, Switchboard and Switchboard II conversational speech, broadcast news (English 1X/10X/unlimited, Arabic 10X, Mandarin 10X), CTS (Fisher, Arabic, Mandarin), and meeting speech (IHM, SDM, MDM); the range of human error in transcription is shown for reference.]

    Image reproduced from He & Deng, IEEE Signal Proc. Magazine, 2011

    • Great progress in ASR performance

    • Aided by algorithmic and computational advances

    • Recently: Baidu's Deep Speech 2 is comparable to human performance

    • Trained on about 10,000 hours of labeled speech

    • But limited language diversity

  • Languages with ASR

    E.g., Google Voice Search supports < 80 out of 7000 languages

    [Chart: fraction of each region's languages supported in Google Voice Search (0–10%), by region: Africa, Americas, Asia, Europe, Pacific.]

    https://googleblog.blogspot.com/2012/08/voice-search-arrives-in-13-new-languages.html

  • ASR for all languages

    • ASR systems in all languages?

    • Speech is the primary means of human communication

    • Develop natural interfaces for both literate & illiterate users

    • Contribute to preservation of endangered languages

  • Lack of Transcribed Corpora

    • Major challenge: building ASR systems is very data-hungry

    • Requires large amounts of labeled speech data: speech audio with matching transcriptions

    • Transcription by native speakers is a laborious and expensive process

    • Crowdsourcing might help alleviate the problem

    • However, there is a significant mismatch between the native languages of crowd workers and the native-language populations of the world

  • Native Language Mismatch

    • Very few (to zero) crowd workers speak minority languages

    • Distributional mismatch between the language backgrounds of crowd workers and the language expertise required to complete transcription tasks

  • Mismatched Crowdsourcing

    • A major bottleneck for ASR in new languages: labeled speech

    • Transcribers need to be native speakers

    Use non-native speakers? How can it possibly work?!

    Mismatched Crowdsourcing¹

    ¹[Jyothi & Hasegawa-Johnson AAAI-15 & Interspeech-15]


  • Mismatched Crowdsourcing

    • How can it possibly work?! [Best '94, Flege '95, etc.]

    • We are typically bad at perceiving speech in foreign languages!

    • Unfamiliar sounds, no vocabulary, no language model to go by, perception distorted by our native languages, …

    • An extremely noisy channel

  • Mismatched Crowdsourcing

    • An extremely noisy channel

    मेरा नाम राम है मैं यहां पहली बार … ("My name is Ram, I am here for the first time …")  →  "mara number mail bar.."

    [Diagram: ENCODER (Speaker) → X → CHANNEL (Transcriber) → Y → DECODER → X̃]


  • Solution: Error Correction

    • Learn the channel characteristics of the foreign listener

    मेरा नाम राम है मैं यहां पहली बार …  →  "mara number mail bar.."

    [Diagram: ENCODER (Speaker) → X → CHANNEL (Transcriber) → Y → DECODER → X̃]

    A probabilistic finite-state model, trained using the Expectation-Maximization (EM) algorithm

    x = [tʃ] [iː] [k]

    [FST fragment with arcs such as [tʃ] : c / 0.7, ε : h / 1, [iː] : e / 0.6, ε : e / 1, [k] : k / 0.6, [k] : g / 0.6]

    p("cheek" | x) = 0.25
    p("chieg" | x) = 0.11
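    As a rough illustration of how such a channel assigns probabilities, the sketch below assumes a simplified memoryless model in which each phone independently emits a short letter chunk. The chunk distributions are toy numbers chosen to echo the slide's p("cheek" | x) ≈ 0.25; the real model is a probabilistic FST with insertions and deletions, trained with EM on crowd transcripts.

```python
from functools import lru_cache

# Toy per-phone emission distributions over letter chunks (illustrative
# numbers only; not the trained FST weights from the lecture).
emit = {
    "tS": {"ch": 0.7, "j": 0.3},
    "i:": {"ee": 0.6, "i": 0.4},
    "k":  {"k": 0.6, "g": 0.4},
}

def p_transcript(y, x):
    """p(y | x): sum over ways of segmenting y into one chunk per phone."""
    @lru_cache(maxsize=None)
    def rec(i, j):  # i: phones consumed, j: letters of y consumed
        if i == len(x):
            return 1.0 if j == len(y) else 0.0
        total = 0.0
        for chunk, p in emit[x[i]].items():
            if y.startswith(chunk, j):
                total += p * rec(i + 1, j + len(chunk))
        return total
    return rec(0, 0)

x = ("tS", "i:", "k")
print(p_transcript("cheek", x))  # 0.7 * 0.6 * 0.6 = 0.252, ~ the slide's 0.25
print(p_transcript("jig", x))    # 0.3 * 0.4 * 0.4 = 0.048
```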

  • Visualizing the Channel

    [Chart: conditional probability (0.0–1.0) of English letter sequences given Hindi phones (IPA). For each Hindi phone (m, s, n, l, a, k, o, p, t, r, iː, i, e, h, j, æ), the distribution over the English letters crowd workers wrote for it.]

  • Solution: Error Correction

    • Learn the channel characteristics of the foreign listener

    • But we also need to use an error-correcting code

    मेरा नाम राम है मैं यहां पहली बार …  →  "mara number mail bar.."

    [Diagram: ENCODER (Speaker) → X → CHANNEL (Transcriber) → Y → DECODER → X̃]

    But how can we encode speech?


  • Solution: Error Correction

    • Learn the channel characteristics of the foreign listener

    • Use a repetition code! The same utterance is transcribed k times

    [Diagram: Speaker → ENCODER (Repeat) → k independent invocations of the CHANNEL → Y(1), Y(2), …, Y(k) → DECODER → X̃]
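    A small simulation of the idea, under an assumed symbol-substitution channel and a toy four-word vocabulary (none of this is from the lecture's setup): maximum-likelihood decoding from k independent transcripts, with labeling error falling as k grows.

```python
import math
import random

random.seed(0)

# Toy repetition-code experiment: a "word" is sent through k independent
# noisy channels (transcribers) and decoded by maximum likelihood.
VOCAB = ["pura", "puda", "para", "pira"]
ALPHABET = "abdipru"
P_KEEP = 0.6   # probability a symbol survives the channel unchanged

def channel(word):
    # with prob P_KEEP keep the symbol, else draw uniformly from ALPHABET
    return "".join(c if random.random() < P_KEEP
                   else random.choice(ALPHABET) for c in word)

def log_p(y, x):
    # position-wise likelihood consistent with channel() above
    lp = 0.0
    for cy, cx in zip(y, x):
        p = (1 - P_KEEP) / len(ALPHABET)
        if cy == cx:
            p += P_KEEP
        lp += math.log(p)
    return lp

def ml_decode(transcripts):
    # argmax_x sum_i log p(y_i | x): channel invocations are independent
    return max(VOCAB, key=lambda x: sum(log_p(y, x) for y in transcripts))

truth = "pura"
for k in (1, 2, 4, 8, 15):
    errs = sum(ml_decode([channel(truth) for _ in range(k)]) != truth
               for _ in range(500))
    print(k, errs / 500)   # error rate shrinks as repetitions accumulate
```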

  • Example

    Transcripts (in order received): puda, waatap, fotup, puddop, pooda, puda, purap, poduck, purap, foodap

    Cumulative decoding output after each additional transcript: paḍegā, paḍegā, pratī, paḍā, then pūrā consistently once enough transcripts accumulate

  • Labeling Error Rates

    • Impressive accuracy (~5% error) on a medium-vocabulary isolated-word task!

    [Chart: labeling error rate (%) vs. number of repetitions (1, 2, 4, 8, 10, 12, 15), for 1-best and 2-best decoding; error falls steadily with more repetitions, compared against an ASR error rate of 26.5%.]

    [Jyothi & Hasegawa-Johnson AAAI-15]

  • Information-theoretic Analysis

    • The conditional entropy H(X | Y) of the spoken words (Hindi) X given the crowd transcripts Y captures the amount of information lost in transmission

    • H(X | Y) can be naively upper-bounded using the corpus cross-entropy

    • However, errors in our channel model accumulate as the number of repetitions increases, making this upper bound less tight

  • Information-theoretic Analysis

    Tighter bound on H(X | Y) using an auxiliary random variable Z ∈ {0, 1}:

    • Define W = ε when Z = 0 and W = X when Z = 1; we set Z = 1 when q(x|y) is sufficiently low

    • W represents side-channel information indicating when the model needs to be corrected

    H(X | Y) ≤ p₀ · H(X | Y, Z=0) + H(Z) + (1 − p₀) · log |𝒳|

    where p₀ = p(Z=0) and 𝒳 is the input alphabet; the first term is upper-bounded using the corpus cross-entropy
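    As a sanity check, the sketch below verifies the bound numerically on a small invented joint distribution p(x, y), using the true posterior p(x|y) as a stand-in for the learned channel model q(x|y) and a simple threshold rule for Z; everything here is illustrative.

```python
import math
from collections import defaultdict

# Small invented joint distribution p(x, y)
p_xy = {("a", 0): 0.30, ("b", 0): 0.15, ("c", 0): 0.05,
        ("a", 1): 0.05, ("b", 1): 0.10, ("c", 1): 0.35}
X = {"a", "b", "c"}
THRESH = 0.15   # Z = 1 (model unreliable) when q(x|y) < THRESH

def H(dist):   # entropy (bits) of a dict of probabilities
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

p_y = defaultdict(float)
for (x, y), p in p_xy.items():
    p_y[y] += p

def post(x, y):   # p(x | y), standing in for the channel model q(x|y)
    return p_xy[(x, y)] / p_y[y]

# exact H(X | Y)
HXY = sum(p_y[y] * H({x: post(x, y) for x in X}) for y in p_y)

# Z is a deterministic function of (x, y)
z = {(x, y): int(post(x, y) < THRESH) for (x, y) in p_xy}
p0 = sum(p for xy, p in p_xy.items() if z[xy] == 0)
HZ = H({0: p0, 1: 1 - p0})

# H(X | Y, Z=0): entropy over the "reliable" region only
p_y_z0 = defaultdict(float)
for (x, y), p in p_xy.items():
    if z[(x, y)] == 0:
        p_y_z0[y] += p
HXYZ0 = sum(
    (py / p0) * H({x: p_xy[(x, y)] / py for x in X if z.get((x, y)) == 0})
    for y, py in p_y_z0.items())

bound = p0 * HXYZ0 + HZ + (1 - p0) * math.log2(len(X))
print(round(HXY, 3), "<=", round(bound, 3))   # e.g., 1.226 <= 1.384
```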

  • Conditional Entropy Estimates

    [Chart: conditional entropy estimate (in bits, 0–5) vs. number of Turkers (1, 2, 4, 8, 10, 12, 15), comparing the naive cross-entropy bound (Naive-CE) with the tighter bound (Tighter-CE).]

    Our upper-bound estimates for CE are clearly tighter than the naive cross-entropy upper-bound estimates

  • Mismatched Crowdsourcing for Continuous Speech

    February 7, 2017

  • Continuous Speech

    [Diagram: Speaker → ENCODER (Repeat) → k independent invocations of the CHANNEL → Y(1), Y(2), …, Y(k) → DECODER → X̃]

    Exact Maximum-Likelihood Decoding of multiple strings is intractable for long utterances

    Instead: decode each transcript individually using Maximum-Likelihood Decoding and pick the output with the best score

    Word error rate: 77%

  • Continuous Speech

    Can we do better? An outlier with a good score shouldn't be chosen over what many similar-looking transcripts predict

  • Continuous Speech

    Data Filtering: use an edit-distance-based similarity metric to discover a "cluster" of transcripts to retain

  • Data Filtering

    [Diagram: five transcripts as nodes with pairwise edit distances on the edges; the tightly connected "cluster" is retained and outliers are dropped.]

    a v i e m e k e
    a b i a n - k e
    a v e a m e g i
    a v e a n - k a
    a b i a n - k a
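    A minimal sketch of such filtering on the slide's five transcripts (spaces removed), assuming a simple rule (keep transcripts whose summed Levenshtein distance to the rest is at most the median) in place of whatever clustering criterion the paper actually uses:

```python
# Edit-distance data filtering on the slide's five transcripts.
# The retention rule below is an assumption for illustration.

def edit_distance(a, b):
    # classic Levenshtein dynamic program
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

transcripts = ["aviemeke", "abian-ke", "aveamegi", "avean-ka", "abian-ka"]
score = {t: sum(edit_distance(t, u) for u in transcripts) for t in transcripts}
median = sorted(score.values())[len(score) // 2]
cluster = [t for t in transcripts if score[t] <= median]
print(cluster)   # the three mutually similar transcripts survive
```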

  • Continuous Speech

    With data filtering before decoding, the word error rate improves from 77% to 68%

  • Continuous Speech

    Can we do even better?

  • Channel Merger

    Instead of decoding transcripts independently, merge them:

    • Data Filtering: discover "typical" transcripts

    • Alignment: exact multi-transcript alignment is NP-hard! Approximate it via an incremental alignment algorithm

    • Merge the aligned transcripts and decode with a merged-channel decoder

  • Alignment

    Four transcripts of the Hindi phrase "kiyā jāyegā" ("will be done"):

    k e e a j a g a
    g i y a j a y g a
    k e e a j a y g a h
    c h a i j e g a

    Aligned (with gaps):

    k e e - a - j a - g a -
    - g i y a - j a y g a -
    k e e - a - j a y g a h
    - - c h a i j e - g a -

  • Alignment

    After mapping letter sequences to channel symbols (e.g., "ee" → E, "ai"/"ay" → Y, "ch" → C):

    k E - a j a g a -
    g i y a j Y g a -
    k E - a j Y g a h
    C - - Y j e g a -
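    The incremental algorithm builds on pairwise alignment; below is a minimal sketch of that building block (standard edit-distance alignment with backtracking), not the paper's full incremental procedure:

```python
# Pairwise edit-distance alignment (Needleman-Wunsch with unit costs),
# the building block of incremental multi-transcript alignment.

def align(a, b, gap="-"):
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    out_a, out_b, i, j = [], [], n, m      # backtrack from the corner
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1])):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

# Two of the slide's transcripts, with spaces removed:
print(align("keeajaga", "giyajayga"))
```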

  • Channel Merger

    Merge the aligned transcripts into one probabilistic transcript

  • Merge Transcripts

    k E - a j a g a -
    g i y a j Y g a -
    k E - a j Y g a h
    C - - Y j e g a -

    (kiyā jāyegā)

    Merged column-wise into one probabilistic transcript: …, {a/0.75, Y/0.25}, {j/1}, {Y/0.5, e/0.25, …}, {g/1}, …
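    A minimal sketch of the merge step on the slide's aligned rows: each column becomes a distribution over symbols (here gaps are kept as a "-" symbol; they could equally be treated as ε):

```python
from collections import Counter

# The slide's aligned transcripts, one string per row
rows = ["kE-ajaga-",
        "giyajYga-",
        "kE-ajYgah",
        "C--Yjega-"]

# Each column of the alignment becomes a symbol distribution
merged = []
for col in zip(*rows):
    counts = Counter(col)
    merged.append({s: c / len(col) for s, c in counts.items()})

print(merged[3])  # {'a': 0.75, 'Y': 0.25}, matching the slide
print(merged[4])  # {'j': 1.0}
print(merged[5])  # {'a': 0.25, 'Y': 0.5, 'e': 0.25}
```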

  • Channel Merger

    Build a model for the merged channel, i.e., for the probabilistic transcript produced by the merge

  • Channel Merger

    Full pipeline: Data Filtering (discover "typical" transcripts) → Alignment (NP-hard; approximated via an incremental alignment algorithm) → Merge into one probabilistic transcript → Model for the merged channel → Shortlist & Decode (list decoding, then exact decoding from the list)

  • Probabilistic Transcriptions

    Ten mismatched transcripts of the same utterance:

    Tacapo piza
    strucka po zapecham
    trakapo trabiza
    Straka pose ta peesome
    straka po ta pisha
    strah kah poh chah peesh um
    chaka-pu shapisha
    stakkappoo sabeesham
    takapo chapiser
    Strike a pose some pizza

    ¹[P. Jyothi & Hasegawa-Johnson, Interspeech-15]

  • Probabilistic Transcriptions

    Probabilistic phone-based transcriptions derived from alignments¹

    s t r a - k a p o z t a - p E - s o m
    s t r a - k a p o - t a - p i - S a -
    s t r a h k a p o - C a h p E - S u m
    s t r Y - k a p o z s a m p E t s a -

    Merged: s tʰ r ɑ/ɑi kʰ ɑ pʰ ɔ tʃʰ/t/s ɑ pʰ i/iː ʃ/s ɑ

    ¹[P. Jyothi & Hasegawa-Johnson, Interspeech-15]

  • Transcription Error Rates

    [Chart: phone error rates (%, 0–80) for three decoding strategies (Individual, Merged, Decoding from Shortlist), each with and without data filtering; the best configuration gives a 44% drop!]

    [Jyothi & Hasegawa-Johnson Interspeech-15]

  • Adapting ASR Systems using Mismatched Transcriptions

    February 7, 2017

  • Next Step?

    • Respectable accuracy from mismatched transcriptions

    • But can this be leveraged for building ASR systems?

    • Plan: adapt a baseline ASR system trained on other languages using mismatched transcriptions

    • The baseline could use data-hungry technology like Deep Neural Networks (DNNs)

    • Project at the 2015 Jelinek Summer Workshop [JSALT '15]

    • Several languages considered: Hungarian, Mandarin, Swahili, etc.

  • More than meets the eye

    • Measuring the additional information in probabilistic transcripts:

    • How error rates fall when more "advice" is made available to the decoder

    [Chart: phone error rates (%, 30–75) vs. advice per phone (in bits: 0, 0.2, 0.4, 0.6, 0.8, 1) for Hungarian, Mandarin, and Swahili.]

    • Mismatched transcripts are too noisy to be used in the traditional way for ASR training

    • Use them as probabilistic transcripts, e.g., {a/0.75, Y/0.25} → {j/1} → {Y/0.5, e/0.25} → {g/1} → … (a simple training sketch follows below)
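    One simple way to train against such targets, sketched below under assumptions (the workshop systems train against the full transcript lattice, so this is only an illustration): replace each one-hot label with the transcript's per-position symbol distribution and minimize the expected negative log-likelihood. The model probabilities here are made up.

```python
import math

def expected_nll(model_probs, prob_transcript):
    """Expected negative log-likelihood of a probabilistic transcript.

    model_probs[t][s]: model's probability of symbol s at position t
    prob_transcript[t][s]: probabilistic-transcript weight of s at t
    """
    loss = 0.0
    for p_model, q_label in zip(model_probs, prob_transcript):
        loss += -sum(q * math.log(p_model[s]) for s, q in q_label.items())
    return loss

# Two positions of the merged transcript from the earlier slide
prob_transcript = [{"a": 0.75, "Y": 0.25}, {"j": 1.0}]
model_probs = [{"a": 0.7, "Y": 0.2, "j": 0.1},
               {"a": 0.1, "Y": 0.1, "j": 0.8}]
print(expected_nll(model_probs, prob_transcript))  # ~0.89 nats
```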

  • ASR Systems for Comparison

    • Multilingual: train on 6 languages (Arabic, Cantonese, Dutch, Hungarian, Mandarin, Urdu) and test on a new target language (Swahili)

    • Semi-supervised DNN: transcribe unlabeled audio from the target language using a DNN-based multilingual ASR system and use those transcripts to further retrain the DNN models

  • Mismatched Transcriptions for ASR

    [Chart: phone error rates (%, 0–80) for Swahili, Mandarin, and Hungarian, comparing Multilingual, Semi-supervised DNN, and + Adaptation systems; adaptation with mismatched transcriptions yields a 25% drop!]

    [Liu*, Jyothi* et al. ICASSP '16]

  • Native Language Backgrounds of Mismatched Crowds

    February 7, 2017

  • Channel Selection

    • Can we do better by selecting the language background of the transcribers?

    • How should we select the transcribers?

    • Understanding when phones get misperceived

    • How is this correlated with transcribers' language backgrounds?

  • Understanding the Mismatched Channel

    • When are two phones confused with each other?

    • If they are "phonologically close" to each other

    • Phonological distance between two phones: use distinctive features (DFs) from linguistic theory [Chomsky & Halle '68] [Phoible '15]

    • 37 DFs, e.g., nasal, tone, sonorant, labial, trill, front, back

    • Phones as "code words" in DF-space; Hamming distance measures phone contrast (see the sketch below)
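    A minimal sketch of this distance, using a small illustrative subset of the 37 DFs with made-up (though phonologically plausible) feature values rather than the actual Phoible entries:

```python
# Phones as binary code words over distinctive features (DFs).
# The feature set and values below are illustrative assumptions.
FEATURES = ["sonorant", "nasal", "labial", "front", "back"]

phones = {
    "m": (1, 1, 1, 0, 0),
    "n": (1, 1, 0, 0, 0),
    "p": (0, 0, 1, 0, 0),
    "i": (1, 0, 0, 1, 0),
    "u": (1, 0, 1, 0, 1),
}

def hamming(a, b):
    """Phonological distance: number of DFs on which two phones differ."""
    return sum(x != y for x, y in zip(phones[a], phones[b]))

print(hamming("m", "n"))  # 1: differ only in labial
print(hamming("m", "i"))  # 3: differ in nasal, labial, front
```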

  • Distance Distribution of the Code

    [Chart: distribution of pairwise Hamming distances between phone code words, per language.]

    • Codes for different languages exhibit similar distributions

    [Varshney, Jyothi & Hasegawa-Johnson ITA-16]

  • Understanding the Mismatched Channel

    • Phone confusion quantified using the total variational distance between the channel's output (letter) distributions for the two phones

    • We call it phone-pair distinction (sketch below)

    • Hypothesis: phonological distance in DF-space is correlated with phone confusion in the mismatched channel
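    A minimal sketch of phone-pair distinction as the total variational distance between two output distributions; the distributions below are invented for illustration:

```python
# Phone-pair distinction: total variational distance between the
# channel's letter distributions for two phones (toy numbers).

def tv_distance(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

out_p = {"p": 0.6, "b": 0.3, "f": 0.1}    # letters written for phone /p/
out_b = {"b": 0.6, "p": 0.25, "v": 0.15}  # letters written for phone /b/
print(tv_distance(out_p, out_b))  # 0.45: a fairly distinguishable pair
```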

  • Phone Pair Distinction vs. DF-Distance

    [Scatter: Hamming distance between phone pairs (0–30) vs. phone-pair distinction by English speakers (0.00–0.75), shaded by frequency.]

  • Phone Pair Distinction

    [Panels: Hamming distance between phone pairs vs. phone-pair distinction by Mandarin speakers (0.00–1.00) and by English speakers (0.00–0.75), shaded by frequency; and phone-pair distinction by Mandarin speakers vs. by English speakers, with frequency on a log scale.]

  • Phone Pair Distinction

    [Panels as above: Hamming distance vs. phone-pair distinction for Mandarin and English transcribers.]

    • Clearly, DF-distance is positively correlated with phone-pair distinction

    • Differences across native language backgrounds

    • Different DFs are prominent in different languages

    • Ongoing work: a model that takes DF presence/prominence into account

  • Summary

    • ASR for low-resource languages presents challenging research problems

    • In this talk:

    • Establish the possibility of acquiring speech transcriptions using mismatched crowds

    • Demonstrate the impact of mismatched transcriptions on ASR performance

    • Investigate the relation of transcribers' native languages to phone confusion

    • Future research: optimally select mismatched transcribers to further improve the impact of mismatched transcriptions

    ¹Based on joint work with Mark Hasegawa-Johnson, Lav Varshney and participants at the 2015 Jelinek Summer Workshop.