Upload
others
View
12
Download
0
Embed Size (px)
Citation preview
Automatic Speech Recognition:
Trials, Tribulations and Triumphs
Sadaoki Furui
Professor Emeritus
Tokyo Institute of Technology
Outline
• Progress of automatic speech recognition (ASR)
technology
– 4 generations
– Synchronization with computer technology
– Recent activities at Tokyo Tech
– Major ASR applications
• How to narrow the gap between machine and
human speech recognition
• Data-intensive approach for knowledge extraction
• Summary
Radio Rex – 1920’s ASR
A sound-activated toy dog named “Rex” (from Elmwood
Button Co.) could be called by name from his doghouse.
Generations of ASR technology
1950 1960 1970 1980 1990 2000 2010
1952 1970 1G
Heuristic approaches (analog filter bank + logic circuits)
2G Pattern matching (LPC, FFT, DTW)
1970 1980
3G
Statistical framework (HMM, n-gram, neural net)
1980 1990
3.5G Discriminative approaches, machine learning, robust training, adaptation, rich transcription
1990
4G
Extended knowledge processing
?
Prehistory ASR (1925)
Our research
NTT Labs (+Bell Labs), Tokyo Tech
Collaboration with other labs
1st generation technology (1950’s-1960’s)
“Heuristic approaches”
• General
– The earliest attempts to devise
digit/syllable/vowel/phoneme recognition systems
– Spectral resonances extracted by an analogue filter
bank and logic circuits
– Statistical syntax at the phoneme level
• Early systems
– Bell labs, RCA Labs, MIT Lincoln Labs (USA)
– University College London (UK)
– Radio Research Lab, Kyoto Univ., NEC Labs (Japan)
(David C. Moschella: “Waves of Power”)
10
100
1,000
3,000
Nu
mb
er
of
use
rs
(mil
lio
n)
Year:1970 1980 1990 2000 2010 2020 2030
System centered
PC centered
Network centered
Knowledge resource centered
IT technology progress
2G 3G 3.5G 4G ASR
Mic
ropro
cessor
PC
Lapto
p P
C
Sm
art
phone
Flanagan writes, “… the computer in the background is a Honeywell DDP516. This
was the first integrated circuits machine that we had in the laboratory. … with
memory of 8K words (one-half of which was occupied by the Fortran II compiler) …”
(November 1970)
• DTW (dynamic time warping) – Vintsyuk in Russia (USSR) proposed the use of DP
– Sakoe & Chiba at NEC labs started to use DP
• Isolated word recognition – The area of isolated word or discrete utterance
recognition became a viable and usable technology based on fundamental studies in Russia and Japan (Velichko & Zagoruyko, Sakoe & Chiba, and Itakura)
• IBM Labs: large-vocabulary ASR
• Bell Labs: speaker-independent ASR
2nd generation technology (1970’s) (1)
“Pattern matching approaches”
2nd generation technology (1970’s) (2)
• Continuous speech recognition
– Reddy at CMU conducted pioneering research based
on dynamic tracking of phonemes
• DARPA program
– Focus on speech understanding
– Goal: 1000-word ASR, a few speakers, continuous
speech, constrained grammar, less than 10%
semantic error
– Hearsay I & II systems at CMU
– Harpy system at CMU
– HWIM system at BBN
(David C. Moschella: “Waves of Power”)
10
100
1,000
3,000
Nu
mb
er
of
use
rs
(mil
lio
n)
Year:1970 1980 1990 2000 2010 2020 2030
System centered
PC centered
Network centered
Knowledge resource centered
IT technology progress
2G 3G 3.5G 4G ASR
Mic
ropro
cessor
PC
Lapto
p P
C
Sm
art
phone
3rd generation technology (1980’s) “Statistical framework”
• Connected word recognition – Two-level DP, one-pass method, level-building (LB)
approach, frame-synchronous LB approach.
• Statistical framework
– HMM
– N-gram
– cepstrum + Dcepstrum
• Neural net
• DARPA program (Resource management task) – SPHINX system at CMU
– BYBLOS system at BBN
– DECIPHER system at SRI
– Lincoln Labs, MIT, AT&T Bell Labs
Structure of speech production and recognition system based on information transmission theory
(Transmission theory)
(Speech recognition process)
Acoustic channel
W S X W ^
Speech recognition system
Information source Channel Decoder
Text generation
Acoustic processing
Speech production
Linguistic decoding
)(
)()(maxarg)(maxargˆ
XP
WPWXPXWPW
WW
MFCC (Mel-frequency cepstral coefficients)-based front-end processor
FFT
FFT based spectrum
Speech
DCT Log
Mel scale triangular filters
Acoustic vector D
D 2
MFCC
Structure of phoneme HMMs
Statistical language modeling
Probability of the word sequence w1k w1w2...wk :
k k
P (w1k) P P (wi
|w1w2…wi-1) P P (wi |w1
i-1) i 1 i 1
P(wi |w1
i-1) N(w1i) / N(w1
i-1)
where N (w1i) is the number of occurrences of the string w1
i in the given training corpus.
Approximation by Markov processes:
Bigram model P(wi |w1
i-1) P(wi |wi-1)
Trigram model P(wi |w1
i-1) P(wi |wi-2wi-1)
Smoothing of trigram by unigram and bigram:
P(wi |wi-2wi-1) l1P(wi
|wi-2wi-1)+ l2P(wi |wi-1) + l3P(wi) ^
(David C. Moschella: “Waves of Power”)
10
100
1,000
3,000
Nu
mb
er
of
use
rs
(mil
lio
n)
Year:1970 1980 1990 2000 2010 2020 2030
System centered
PC centered
Network centered
Knowledge resource centered
IT technology progress
2G 3G 3.5G 4G ASR
Mic
ropro
cessor
PC
Lapto
p P
C
Sm
art
phone
3.5th generation technology (1990’s)
• Error minimization (discriminative) approach – MCE (Minimum Classification Error) approach
– MMI (Maximum Mutual Information) criterion
– MPE (Minimum Phone Error) criterion
• Robust ASR – Background noise, voice individuality, microphones,
transmission channel, room reverberation, etc.
– VTLN, MLLR, HLDA, fMPE, PMC, etc.
• DARPA program – ATIS task
– Broadcast news (BN) transcription integrated with information extraction and retrieval technology
– Switchboard task
• Applications – Automate and enhance operator services
Channel Speech
recognition system
Main causes of acoustic variation in speech
Noise • Other speakers • Background noise • Reverberations
Distortion Noise Echoes Dropouts
Microphone • Distortion • Electrical noise • Directional characteristics
Speaker Task/Context • Voice quality • Pitch • Gender • Dialect Speaking style Phonetic/Prosodic context • Stress/Emotion • Speaking rate • Lombard effect
• Man-machine dialogue • Dictation • Free conversation • Interview
3.5th generation technology (2000’s)
• DARPA program – EARS (Effective Affordable Reusable Speech-to-
Text) program for rich transcription, GALE
– Detecting, extracting, summarizing, and translating important information
• Spontaneous speech recognition – CSJ lecture project in Japan
– Meeting projects in US and Europe
• Robust ASR – Utterance verification and confidence measures
– Combining systems or subsystems
• Machine learning – Graphical models (DBN)
– Deep neural network (DNN)
Word error rate (WER) as a function of the size of
acoustic model training data (8/8 = 510 hours)
25
25.5
26
26.5
27
27.5
1/8 2/8 3/8 4/8 5/8 6/8 7/8 8/8
Training corpus size
Wo
rd e
rro
r ra
te (
%)
Task: CSJ (Corpus of Spontaneous Japanese)
Out-of-vocabulary (OOV) rate, word error rate (WER)
and adjusted test-set perplexity (APP) as a function of
the size of language model training data (8/8 = 6.84M
words)
Task: CSJ (Corpus of Spontaneous Japanese)
WFST (Weighted Finite State Transducer)-based
“T3 decoder”
Problems of WFST-based decoder:
– Large memory requirement
– Small flexibility
• Difficult to change partial models
H C L G speech
Context
dependent
phones
Context
independent
phones Word
Word
sequence
N-gram
Word sequence Speech
Composition
and
optimization
H: HMM
C: Triphone
L: Lexicon
G: N-gram
H∘C∘L∘G
On-the-fly composition
Parallel decoding
Audio
Config
Frontend filters
…..….….. Decoder
Active search GC
AM + Scoring GPU
Disk On-the-fly
WFST Control
Language model
Lexicon Triphone WFST tools
Static network
Word sequence/ Lattice
HMM acoustic model
Recognizer
Structure of the T3 decoder
RTF vs. word accuracy for the various
decoders and acoustic models (WSJ 5k task)
(det(C odet(L oG)))
• Cascade construction:
• Decoder:
T3, Sphinx3, HDecode,
Juicer
• AM:
Sphinx, HDecode
DBN representing a factor analyzed HMM
tq 1+tq
tO 1+tO
tx 1tx +
x
t x
t 1+
o
t o
t 1+Discrete
Continuous
xt : state vector, Ot : observation vector, qt : HMM state,
tx, t
o : mixture indicator
(A.-V. I. Rosti et al., 2002)
Latent variables
1-tWtW 1+tW
1-tqtq 1+tq
1-tOtO 1+tO
1-tltl 1+tl Acoustic condition
Speaker
DBN for acoustic factorization
(M. J. F. Gales, 2001)
(Dong Yu, 2012)
Deep neural network
Restricted Boltzmann machine
Hidden Layer No within layer connection
Visible Layer No within layer connection
(Dong Yu, 2012)
Why deep network is helpful
• Many simple non-linearities = One complicated non-linearity
• More efficient in representation: need fewer computational units for the same function
• More constrained space of transformations determined by the structure of the model – less likely to overfit
• Lower layer features are typically task independent (e.g., edges) and thus can be learned in an unsupervised way
• Higher layer features are task dependent (e.g., object parts or object) and are easier to learn given the low-level features
• Higher layers are easier to be classified using linear models
(Dong Yu, 2012)
CD-DNN-HMM: 3 key components
Model senones (tied triphone states) directly
Many layers of nonlinear
feature transformation
Long window frames, incl. Δ, Δ2
(Dong Yu, 2012)
Empirical evidence: Summary (Dahl, Yu, Deng, Acero 2012, Seide, Li, Yu 2011 + new result)
• Voice Search SER (24 hours training)
AM Setup Hub5’00-SWB RT03S-FSH
GMM-HMM BMMI (9K 40-mixture) 23.6% 27.4%
DNN-HMM 7 x 2048 15.8% (-33%) 18.5% (-33%)
AM Setup Test
GMM-HMM MPE 36.2%
DNN-HMM 5 layers x 2048 30.1% (-17%)
Switch Board WER (309 hours training)
Switch Board WER (2000 hours training)
AM Setup Hub5’00-SWB RT03S-FSH
GMM-HMM (A) BMMI (18K 72-mixture) 21.7% 23.0%
GMM-HMM (B) BMMI + fMPE 19.6% 20.5%
DNN-HMM 7 x 3076 14.4% (A: -34% B: -27%)
15.6% (A: -32% B: -24%)
(Dong Yu, 2012)
Deeper models more powerful? (Seide, Li, Yu 2011, Seide, Li, Chen, Yu 2011)
DBN-Pretrain
BP LBP Discri- Pretrain
DBN-Pretrain
24.2 24.3 24.3 24.1 24.2 20.4 22.2 20.7 20.4 - - 18.4 20.0 18.9 18.6 - - 17.8 18.7 17.8 17.8 - - 17.2 18.2 17.4 17.1 22.5 17.1 17.4 17.4 16.8 22.6 17.0 16.9 16.9 - - - 17.9 - - - - - 17.0 - - - - -
22.1
Compare BP with DBN pre-training, pure backpropagation (BP), layer-wise BP-based model growing (LBP), and discriminative pretraining. Shown are word-error rates in %. ML alignment.
(Dong Yu, 2012)
Progress of speech recognition technology
since 1980
2 20 200 2000 20000 Unrestricted
Spontaneous speech
Read speech
Fluent speech
Connected speech
Isolated words
Vocabulary size (number of words)
Sp
eak
ing
sty
le
1980
1990
2000
natural conversation 2-way
dialogue transcription
network agent &
intelligent messaging
system driven dialogue
office dictation
name dialing
form fill by voice
directory assistance
word spotting
digit strings
voice commands
car navigation
Many applications now
Major speech recognition applications
• Spoken dialog systems for accessing information
services and voice search
(e.g. directory assistance, flight information systems,
stock price quotes, virtual agents, How May I Help You?,
GOOG-411, livesearch411, Vlingo, and Google VS)
• Systems for transcription, summarization and
information extraction from speech documents
(e.g. broadcast news, meetings, lectures, presentations,
congressional records, court records, and voicemails)
Voice search system architecture
Acoustic model Pronunciation
model Language model
Listing search
index
ASR
Search
Confidence
measures
Dialog
manager
Disambiguation
module
(Y.-Y. Wang et al., 2008)
Query
ASR m-best
results
Combined n-best
results
Listing
While driving 42%
In Public (bars, restaurants, stores, public transportation) 8%
Context of Vlingo mobile interface usage
(Self-reported)
email 5.7%
face book 1.3%
speak to text box 41.9%
tell a fiend 1.8%
twitter 0.4%
voice dial 10.4%
web search 18.7%
launch app 5.2%
note to self 1.2%
Vlingo Blackberry usage function
“Speak to text box” usage is the case where users speak into an existing application (hence, using the speech interface as a keyboard).
Average words per task
Average spoken words per Vlingo task
When speaking to Vlingo, users accomplish most tasks with only a small number of words.
Vlingo user behavior by expertise
% of utterances
Initial users tend to clear results when faced with recognition issues, whereas expert users are much more likely to correct either by typing or selecting from alternate results.
Google Voice Search accuracy evolution over time
model number
Webscore
(%)
Size of transcribed data: AM1: mismatched, AM2: 1K hours, AM3: 2K hours, AM4: 7K hours, AM5: more. Model structure and training methods are also improved.
(J. Schalkwyk at al., 2010)
Vocabulary
size: 46K
Perplexity and WER as a function of 3-gram LM
size Pe
rple
xity
WER
3-gram LM size 1 billion
Perplexity and WER as a function of 5-gram LM
size
WER
Perp
lexi
ty
5-gram LM size 1 billion
Human vs. machine recognition word error
rates across a variety of tasks (Lippmann, 1997)
Task Machine
performance %
Human performance %
Connected digits 0.72 0.009
Letters 5 1.6
Resource management 3.6 0.1
WSJ (Wall Street
Journal)
7.2 0.9
Switchboard 43 4
Pronunciation
ASR error analysis for a voice search application
(Y.-Y. Wang et al., 2008)
Major ASR problems
• Robustness against various speech variations
- Speakers, noise, language, topics, etc.
• Out-of-vocabulary (OOV) problem
- Don’t know when we know
• Few advances in basic understanding
• It takes a long time to build a system for a new
language; requires a large amount of
expensive speech databases
Knowledge sources for speech recognition
• Knowledge (Meta-data)
– Domain and topics – Context – Semantics – Speakers – Environment, etc.
• How to incorporate knowledge sources into
the statistical ASR framework is a key issue
• Unsupervised or lightly
supervised training is crucial
Speech (Data)
Recognition
Information extraction
Knowledge
Generalization (Meta-data)
Abstraction
Human speech recognition is a matching process whereby an audio signal is matched to existing knowledge (comprehension)
Next-generation ASR using comprehensive knowledge sources
Speech
Feature extraction
•Spectrum
•Waveform
•Pitch
•Energy
(Multiple time windows)
Decoder
Recognition
results
•Sentence
•Meaning
Multilingual model
Emotion
model
Prosody
model
Segmental
model
•Phone
•Syllable
Language
model
•N-gram
•Syntax
•Semantics
Speaker
model
Speaking style
model Topic
model
Noise
model
Adaptation
(David C. Moschella: “Waves of Power”)
10
100
1,000
3,000
Nu
mb
er
of
use
rs
(mil
lio
n)
Year:1970 1980 1990 2000 2010 2020 2030
System centered
PC centered
Network centered
Knowledge resource centered
IT technology progress
2G 3G 3.5G 4G ASR
Mic
ropro
cessor
PC
Lapto
p P
C
Sm
art
phone
Data-intensive (“Big data”) ASR
Individuality
Huge speech database with various variations
Meaning (Message)
Emotion
Current DB
Others
Mapping
Mapping
Mapping
Mapping
“We have actually made fair progress on simulation tools, but not very much on
data analysis tools.” (Jim Gray)
“Quantity decides quality.” “Resources are better spent collecting more data.”
Technical issues • How to collect rich and orders of magnitude larger
speech DBs, covering all possible variations – Current typical speech DB (for a single task)
• 1000-10,000 h speech ~ 100 M - 1 Gbyte
– Current model complexity (for a single task) • HMM : several million parameters
• Number of Trigrams : 100 million - 1 billion
– Many sources of variations
• How to build and utilize huge DBs (Scalable approach) – Cheap, fast and good enough metadata/transcription – Well structured model (machine learning) – Efficient data selection for annotation (active learning) – Unsupervised, semi-supervised or lightly-supervised
training/adaptation – High-performance computing
In the Mel-cepstral domain, mismatch function relating
clean speech x and noisy speech y is given by:
y = x + h + C log (1 + exp (C-1(n - x - h)))
where
n and h : additive and convolutional noise, respectively
C : DCT matrix
The model is first adapted to the target speaker by MLLR,
and then adapted to the target noise environment via
model-based vector Taylor series (VTS) compensation.
These transforms can be jointly estimated using ML.
Speaker and noise factorization
(Y.-Q. Wang and M. J. F. Gales, 2011)
Subspace mixture model
j : phonetic state index
m : substate index
cjm : mixture weight
i : Gaussian index
wjmi : mixture weight
s : speaker
Mi : mean projection matrix
vjm : state-substate-specific
vector
Niv(s) : speaker-specific offset
v(s) : speaker vector
)()(
)(
11
)(
vNvM
),;()(
s
ijmi
s
jmi
is
jmi
I
i
jmi
M
m
jm
s xwcjxpj
+
N|
(D. Povey et al., 2010)
Similar to Joint Factor Analysis, Eigenvoice, and Cluster AdaptiveTraining.
(S. Novotney et al., 2010)
Cheap, fast and good enough: non-expert transcription through crowdsourcing
Efficient data selection for annotation (Active learning)
• Select most informative utterances, having
– Low certainty (word posterior probability, confidence score, relative likelihood, margin) (Hakkani-Tur et al., 2002)
– Large word lattice entropy (Varadarajan et al., 2009)
– Large classification disagreement among committees (query by committee) (Seung et al., 1992; Hamanaka et al., 2010)
• Manually transcribe selected utterances and use them for model training
• Problem: how to remove outliers
– Density-based approaches
• Combine semi-supervised training for remaining utterances (Riccardi et al., 2003)
Committee-based utterance selection
1. Randomly and equally divide
transcribed data T into
K data sets
2. Train K recognizers (committees)
using the K data sets
3. Recognize all the utterances in
the un-transcribed data U by
each recognizer
4. Select N hour utterances with
relatively high degree of
disagreement (vote entropy)
among K recognizers
5. Transcribe the selected
utterances and add them to T,
and go back to Step 1.
• Random selection: randomly select utterances to be transcribed
• WPP (word posterior probability)-based method
• Committee-based method (8 and 1 committees for AM and LM, respectively)
69
70
71
72
73
74
75
76
25 50 75 100
Wo
rd a
ccu
racy
(%
)
Amount of transcribed training data (h)
Random
WPP
AM8-LM1
Comparison with other methods
The amount of data T to achieve a word accuracy of 74%: Random: 95 h, WPP-based: 72 h, Committee-based: 63 h
Accuracy when all
training data (191 h) was transcribed
(Database: Corpus of Spontaneous Japanese)
Create initial model from limited
amount of transcribed corpus
Select utterances considering both of
Informativeness and representativeness
Speech data
pool (Nu
untranscribed
utterances)
Combination of informativeness and representativeness
Decode all utterances (ui) (i=1,…,Nu) in
the data pool using the initial model
N-best phone n-gram sequence
Calculate N-best entropy (H(ui)) for each
utterance
(Informativeness)
Extract phone based
tf-idf features: t(ui)
Calculate dot products between t(ui) and
t(uj), j=1,…,Nu, j≠i
(Representativeness)
(N. Itoh et al., 2012)
Lightly-supervised AM training for broadcast news ASR
• BN has closed-captions which are close, but not exact transcription, and only coarsely time-aligned with the audio signal.
• A speech recognizer is used to automatically transcribe unannotated BN data. The closed-captions are aligned with transcriptions and only segments that agree are selected for training.
• Using the closed captions to provide supervision via the LM is sufficient, and there is only a small advantage in using them to filter the “incorrect” hypotheses/words.
(L. Lamel et al, 2000)
Semi-supervised batch-mode adaptation
M
M
M
T
D
Initial model
Copy
Untranscribed speech data
Recognition
hypotheses Recognition results
Model update (with confidence-measures)
Speech recognition
Iterate
Run a decoder to obtain recognition hypotheses
Use the hypotheses as reference for adaptation (eg. MLLR)
Problems of the batch-mode adaptation
• Errors are unavoidable in the recognition
• Model parameters are estimated using the
hypotheses including errors
• In the next decoding step, the adapted model
tends to make the same errors
• During the iterations, errors are reinforced
How to improve the adaptation performance by
reducing the influence of the recognition errors
M(2)
D(1)
Initial model
Copy
Evaluation speech data
Recognition hypotheses
Recognition results
Model update
Speech recognition
Iteration
M(1) M(K)
M(2) M(1) M(K)
D(2) D(K)
M
T(1) T(2) T(K)
Reducing the influence of recognition errors by separating the data used for the decoding step and the model update step
Semi-supervised cross-validation (CV) adaptation
Number of iterations and WER
CV:
K=40
Ag:
K=10, K’=6, N=8 18.5
19
19.5
20
20.5
21
21.5
22
22.5
23
0 1 2 3 4 5 6 7 8
Baseline
CV(40)
Ag(10,6,8)
Wo
rld e
rro
r ra
te (
%)
Number of iterations
Relative WER reductions
Batch-adapt:12%
CV-adapt:17%
Ag-adapt:16%
Summary
• Speech recognition technology has made very significant progress in the past 30 years with the help of computer technology.
• The majority of technological changes have been directed toward the purpose of increasing robustness of recognition.
• Major successful applications are spoken dialog and transcription systems.
• Much greater understanding of the human and physical speech processes is required before automatic speech recognition systems can approach human performance.
• Significant advances will come from data-intensive approach for knowledge extraction (“Big data” ASR).
• Active learning, and unsupervised, semi-supervised or lightly-supervised training/adaptation technologies are crucial (Machine learning).