Wreck a nice beach: adventures in speech recognition
Stephen Marquard
Centre for Educational Technology, University of Cape Town
Department of Computer Science Seminar, April 2011
Overview
• Project goals
• Speech recognition
• Acoustic modelling
• Language modelling
• Integration into a lecture capture system
Project goals
• Integrate speech recognition into a lecture capture system:
  – Opencast Matterhorn
  – CMU Sphinx ASR engine
• Generate automatic transcripts of recorded lectures
• Allow users to correct and improve the transcripts (crowdsourcing)
• Use feedback to improve recognition accuracy (of the same, similar or subsequent recordings)
• Experiment and implement at UCT
Why is it important?
• Video and audio are more useful if you can:
  – Navigate them easily
  – Locate relevant recordings from a large set
• Use by students:
  – Catch up on missed lectures (continuous play or read the transcript)
  – Revision: jump to a particular point or find the lectures which cover topic X
• On the public web:
  – Discoverability (search indexing)
Easy or hard?
• Easiest: small, fixed vocabulary, prescriptive grammar, discrete words, known audio conditions (command-and-control systems)
• In between: dictation applications in a specific domain, e.g. Dragon NaturallySpeaking
• Hardest: speaker-independent, large vocabulary continuous speech recognition, adverse or unknown audio conditions
Why is it hard?
• People have huge amounts of prior experience and a rich (complex) understanding of context
• Modelling of context in ASR engines is currently very limited
• Even people misrecognize speech (e.g. new / foreign accents, specialized terminology, background noise)
Speech recognition
• Wreck a nice beach … you sing calm incense
• Reckon eyes peach
• Recognize speech
… using common sense
Early history
First known device 1952 (digits)
Above: IBM Shoebox, 1961
http://www-03.ibm.com/ibm/history/exhibits/specialprod1/specialprod1_7.html
Linguistics vs statistics
Early approaches tried to recognize individual phonemes (phonetic units) and hence the words they formed.
But not very successfully.
Airplanes don’t flap their wings
“Every time I fire a linguist, my system improves”
Fred Jelinek
1985/1988
Speech recognition pipeline
• Audio (signal processing, extract features)
• Acoustic model (features to phonemes)
• Pronunciation dictionary (lexicon)
• Language model (likelihood of words)
• Confusion lattice (possible options)
• Results with confidence scores
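The stages above can be sketched as a chain of functions. Everything below is a hypothetical stand-in for illustration (hard-coded lookups instead of real models), not CMU Sphinx's actual API:

```python
def extract_features(samples):
    # Signal processing stage: frame the audio and compute feature vectors.
    # (Stand-in: one "feature" per 160-sample frame.)
    return [samples[i:i + 160] for i in range(0, len(samples), 160)]

def acoustic_model(features):
    # Map each feature vector to its most likely phoneme.
    # (Stand-in: a fixed phoneme sequence.)
    return ["r", "eh", "k"][:len(features)]

def lexicon_lookup(phonemes):
    # Pronunciation dictionary: phoneme sequences -> candidate words.
    lexicon = {("r", "eh", "k"): ["wreck", "rec"]}
    return lexicon.get(tuple(phonemes), [])

def language_model(candidates):
    # Score candidates by likelihood; return the best word and a confidence.
    scores = {"wreck": 0.7, "rec": 0.3}
    best = max(candidates, key=lambda w: scores.get(w, 0.0))
    return best, scores[best]

def recognize(samples):
    feats = extract_features(samples)
    phonemes = acoustic_model(feats)
    candidates = lexicon_lookup(phonemes)
    return language_model(candidates)

word, confidence = recognize([0] * 480)
```

A real engine keeps many hypotheses alive at each stage (the confusion lattice) rather than committing to one path as this toy does.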
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-345-automatic-speech-recognition-spring-2003/lecture-notes/lecture1.pdf
Hidden Markov Models
• HMMs model transition probabilities:
Alice talks to Bob three days in a row and discovers that on the first day he went for a walk, on the second day he went shopping, and on the third day he cleaned his apartment.
Alice has a question: what is the most likely sequence of rainy/sunny days that would explain these observations?
http://en.wikipedia.org/wiki/Viterbi_algorithm
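Alice's puzzle is the standard worked example for the Viterbi algorithm. A minimal implementation, using the transition and emission probabilities from the Wikipedia article cited above:

```python
# Viterbi decoding for Alice's rainy/sunny puzzle.
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(observations):
    # V[t][s] = probability of the best state path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    path = {s: [s] for s in states}
    for obs in observations[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # Best predecessor state for s at this time step
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][obs], p)
                             for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return V[-1][best], path[best]

prob, days = viterbi(["walk", "shop", "clean"])
# days == ["Sunny", "Rainy", "Rainy"]
```

In ASR the hidden states are phoneme (sub-)states rather than weather, and the observations are acoustic feature vectors, but the decoding principle is the same.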
Training in action
“training 3 (decision) trees to depth 20 from 1 million images takes about a day on a 1000 core cluster”
http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf
Characteristics of the field
“the standard approach in our field [is] state-of-the-art system A is gently perturbed to create system B, resulting in a relative decrease in error rate of from 1 to 10%”
Bourlard, Hermansky and Morgan. Towards increasing speech recognition error rates, 1996.
• Algorithmic, drawing on many disciplines (especially signal processing, statistics, linguistics, natural language processing)
• Empirical: lots of different algorithms and optimizations
• Almost no theory to describe why particular approaches work better than others, or how to find optimal solutions
• Massive infrastructure is a big advantage: large and varied data sets, significant computing resources
Audio issues
• Bandwidth
• Recording noise
• Ambient noise
• Reverberation
• Microphones
• Microphone arrays
Acoustic models
• Generated from a corpus of recorded, transcribed audio
• Both artificial and natural corpora (TIMIT, Broadcast News, Meetings)
• Audio needs to match the application
  – Audio bandwidth = ½ sampling rate
  – Phone speech (sampled 8 kHz, bandwidth 4 kHz)
  – Microphone speech (sampled 16 kHz, bandwidth 8 kHz, typical analysis on 130 Hz – 6800 Hz)
• There is a South African corpus of phone speech
• But no South African corpus of microphone speech
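The bandwidth = ½ sampling rate relation (the Nyquist limit) can be checked directly from a recording's WAV header with Python's standard-library wave module. A small sketch, building a synthetic half-second file in memory so it is self-contained:

```python
import io
import wave

def usable_bandwidth_hz(wav_bytes):
    """Audio bandwidth is at most half the sampling rate (Nyquist)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getframerate() // 2

# Build half a second of 16-bit silence at 16 kHz to demonstrate.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(16000)    # "microphone speech" sampling rate
    w.writeframes(b"\x00\x00" * 8000)

bandwidth = usable_bandwidth_hz(buf.getvalue())
# 16 kHz sampling -> at most 8 kHz of usable bandwidth
```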
The TIMIT audio corpus
0 47719 She had your dark suit in greasy wash water all year
2214 4428 she
4428 8316 had
7308 9691 your
9691 15331 dark
15331 19634 suit
20929 22453 in
22453 27697 greasy
27697 32326 wash
33120 36575 water
37597 39644 all
39644 43982 year

0 2214 h#
2214 3744 sh
3744 4428 ax-h
4428 5229 hv
5229 6927 ae
6927 7308 dcl
7308 8316 jh
8316 9691 axr
9691 11697 dcl
11697 12114 d
12114 13075 aa …
Word and phoneme alignment by timecode.
630 speakers from 8 US dialect regions, speaking 10 sentences each.
Dialect regions
The Nationwide Speech Project: A new corpus of American English dialects
http://web.mit.edu/~nancyc/Public/Papers/Clopper_Pisoni_06_SC.pdf
Crowdsourcing the creation of a GPL speech corpus and open source acoustic models (Sphinx, ISIP, Julius, HTK).
An important effort, but still small (84 hours as of Dec 2010)
www.voxforge.org
Language modelling
• Pronunciation dictionary (lexicon)
TOMATO T AH0 M EY1 T OW2
TOMATO(1) T AH0 M AA1 T OW2
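Entries in this format (the CMU pronouncing dictionary style, where a suffix like "(1)" marks an alternative pronunciation) are easy to parse. A minimal sketch:

```python
import re

def parse_cmudict(lines):
    """Parse CMU-dictionary-format lines into word -> list of pronunciations."""
    lexicon = {}
    for line in lines:
        head, *phones = line.split()
        # Strip a variant marker like "(1)" so alternatives group under one word.
        word = re.sub(r"\(\d+\)$", "", head)
        lexicon.setdefault(word, []).append(phones)
    return lexicon

entries = ["TOMATO T AH0 M EY1 T OW2",
           "TOMATO(1) T AH0 M AA1 T OW2"]
lex = parse_cmudict(entries)
# lex["TOMATO"] now holds both pronunciations ("tomayto" / "tomahto")
```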
• Language model: a statistical sequence model of words. Trigram models (3 words) are common:
-2.0998 YORK MONEY FUND
-0.0798 YORK HEDGE FUND
-0.1392 YORK MUTUAL FUND
Statistical sequence models
• Truly Madly _____
• Widely used
• Applications:
  – Auto-suggest
  – Spell-checkers
  – Lossless compression
  – Machine translation
  – Language models for speech recognition
• Probability of token w in context of preceding tokens c, e.g. P(deeply | truly madly)
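A maximum-likelihood trigram estimate of P(w | c) is just a ratio of counts. A minimal sketch, with no smoothing, over an invented toy corpus:

```python
from collections import Counter, defaultdict

def train_trigrams(tokens):
    # counts[(a, b)][c] = how often word c followed the bigram (a, b)
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def next_word_prob(counts, context, word):
    # Maximum-likelihood estimate P(word | context); real LMs add smoothing
    # so that unseen trigrams do not get probability zero.
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

corpus = "truly madly deeply truly madly deeply truly madly guilty".split()
model = train_trigrams(corpus)
p = next_word_prob(model, ("truly", "madly"), "deeply")
# "deeply" follows "truly madly" in 2 of 3 occurrences -> p == 2/3
```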
Context is king
• Micro-context (e.g. bi- and trigrams)
  – United Kingdom
  – United Airlines
  – United Arab Emirates
• Long-range context
  – "Cricket and rugby are amongst the most popular sports in the United _________"
(example from The Sequence Memoizer, Wood et al, 2011)
Characteristics of language
• Power law frequency / rank distribution. Zipf’s law:
“given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table”
http://en.wikipedia.org/wiki/Zipf%27s_law
• Also more frequent words are shorter.
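The law says frequency × rank is roughly constant. A small sketch computing a rank/frequency table; the word counts here are invented to follow Zipf's curve exactly for illustration (real corpora only follow it approximately):

```python
from collections import Counter

def rank_frequency(text):
    """Return (word, rank, frequency) tuples sorted by descending frequency."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return [(word, rank + 1, freq) for rank, (word, freq) in enumerate(ranked)]

# Illustrative counts following freq ~ C / rank with C = 60
sample = " ".join(["the"] * 60 + ["of"] * 30 + ["and"] * 20 + ["to"] * 15)
table = rank_frequency(sample)
products = [rank * freq for _, rank, freq in table]
# products == [60, 60, 60, 60]: frequency x rank is constant under Zipf's law
```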
How to get large language data sets
• Linguistic Data Consortium (by subscription, restricted)
• Some other more specialized corpora
• Microsoft (free, restricted)
• Google (Creative Commons license)
• Wikipedia (CC / GFDL license)
Using Wikipedia as a language resource
• Download a snapshot (6G compressed)
• Convert from XML and markup to plain text
• Create dictionaries of target size (by word frequency)
• Create language models of target size
• Approximately equal in size to English Gigaword Corpus
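The markup-to-plain-text step can be sketched with a few regular expressions. This is a deliberately rough illustration that only handles simple wiki links, bold/italic quotes, templates and HTML tags; a real pipeline needs a proper dump extractor:

```python
import re

def strip_wiki_markup(text):
    """Very rough wiki-markup stripping (illustration only)."""
    # [[target|label]] or [[target]] -> visible label
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)          # bold/italic quote runs
    text = re.sub(r"\{\{[^}]*\}\}", "", text)  # simple templates
    text = re.sub(r"<[^>]+>", "", text)        # HTML tags
    return re.sub(r"\s+", " ", text).strip()

raw = "'''Speech recognition''' converts [[speech|spoken language]] to text.{{cite}}"
plain = strip_wiki_markup(raw)
# -> "Speech recognition converts spoken language to text."
```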
Grid computing for language modelling
• For when you need lots of RAM and/or lots of CPU
• www.sagrid.ac.za
• ICTS at UCT: Tim Carr, Andrew Lewis
Accounting for context: LM adaptation
• Adapt a language model to more closely resemble the target speech
• Using related text for:
  – Topic modelling (vocabulary, concepts)
  – Style-of-speech modelling
“ok and um it's quite useful to have a very good diagnostic test of of acute hepatitis um you know to prevent kind of unnecessary um surgery um so hepatitis is really one um example of a cause of acute abdominal pain that doesn't need surgery”
What’s special about lectures?
• Possibly helpful assumptions:
  – Coherent topic(s) within a course
  – One lecturer presents many lectures
• Specialized vocabulary
• Spoken language differs from written language
Using Wikipedia for LM adaptation
• Goal is to adapt a "standard" LM to be specific to the topic of the audio
• Start somewhere: title, keywords, text from slides
• Select a set of documents, adapt the LM
• Using Wikipedia, select by similarity: identify the set of documents most closely related to the starting point or keywords
Vector space modelling
• Represents documents as n-dimensional vectors (n terms)
• Document similarity established by comparing vectors, producing a similarity score
• Gensim VSM toolkit: memory usage independent of corpus size (so good for Wikipedia)
• LSI, LDA, TF-IDF measures
• Create a "similarity crawler" to build a corpus of documents related to the topic
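The idea behind the similarity score can be shown in a few lines. A self-contained toy version of TF-IDF vectors plus cosine similarity (what a VSM toolkit like Gensim does at scale, with streaming and optimized math); the three tokenized documents are invented:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a TF-IDF vector (term -> weight) for each tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Rare terms (low df) get high weight; ubiquitous terms get weight 0.
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

docs = ["hepatitis causes abdominal pain".split(),
        "acute hepatitis diagnostic test".split(),
        "rugby cricket popular sports".split()]
vecs = tfidf_vectors(docs)
sims = [cosine(vecs[0], v) for v in vecs]
# doc 0 is most similar to itself, somewhat similar to doc 1, unrelated to doc 2
```

A similarity crawler would repeatedly take the top-scoring unseen documents and add them to the adaptation corpus.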
Metrics
• Perplexity (average number of guesses required)
• Word Error Rate (edit distance: insertions, deletions, substitutions)
• Information Retrieval: precision and recall
• What's sufficient? Need to close an accuracy gap
  – Munteanu research: %WER for a transcript
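Word Error Rate is the word-level edit distance between the recognizer's output and a reference transcript, normalized by the reference length. A minimal implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("recognize speech", "wreck a nice beach")
# 2 substitutions + 2 insertions over 2 reference words -> WER of 2.0
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, as in the title example above.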
What is lecture capture?
www.opencastproject.org
Largely automated:
• Recording
• Processing
• Output

Recreates the lecture experience by recording:
• audio
• video
• screen output (VGA)
Licensing constraints
• Opencast Matterhorn is licensed under the ECL open source license (similar to Apache 2.0 license)
• Allows closed commercial derivatives
• Therefore cannot use software or datasets which are non-commercial or research-only
• Can use Apache, BSD, LGPL, maybe GPL code and data
Prior work in ASR for lectures
• MIT Lecture Browser (SUMMIT recognizer)
• U. Toronto / ePresence: PhD prototype by Cosmin Munteanu (SONIC recognizer)
• ETH Zurich: integration of CMU Sphinx with REPLAY
Work in progress
• Get consistently good quality audio recordings
• Implement dynamic language model adaptation
• Integrate into Opencast Matterhorn workflow
• Show transcript to users in UI, enable search
• Allow users to edit / improve transcript
• Use edits to improve recognition
Speech recognition in the cloud
• Google Android: 70 CPU-years to build models
• Nexiwave: cloud service using GPUs
• Advantages: potentially massive computing resources
• Disadvantages: generic issues and risks with cloud services
  – Bandwidth, lock-in, terms of service, data ownership and retention, etc.
Find out more
Truly Madly Wordly: my blog on open source language modelling and speech recognition: http://trulymadlywordly.blogspot.com
CMU Sphinx
http://cmusphinx.sourceforge.net/
Opencast
http://www.opencastproject.org