
Speech Recognition
Kunal Shalia and Dima Smirnov

What is Speech Recognition?
Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format.

Speech Recognition vs. Voice Recognition
There is a common misconception that speech recognition and voice recognition are the same thing, but they are not. Speech recognition is used to identify words, while voice recognition is used to identify a particular individual's voice.

Speech Recognition Demonstration

Early Automatic SR Systems
- Based on the theory of acoustic phonetics, which describes how phonetic elements are realized in speech
- Compared input speech to reference patterns
- Trajectories along the first and second formant frequencies for the numbers 1 through 9 and "oh":

Used in the first speech recognizer built by Bell Laboratories in 1952

For example, to produce a steady vowel sound, the vocal cords need to vibrate, and the air that propagates through the vocal tract results in sound with natural modes of resonance, similar to what occurs in an acoustic tube. These natural modes of resonance, called the formants or formant frequencies, are manifested as major regions of energy concentration in the speech power spectrum. The first functioning speech recognizer was built in 1952 by Bell Laboratories and was able to recognize isolated digits using the formant frequencies measured during the vowel regions of each digit.
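To make the idea of formants concrete, here is a minimal sketch (not part of the original slides): it synthesizes a vowel-like signal from two assumed resonances at 700 Hz and 1200 Hz and then finds them as peaks in the power spectrum. The specific frequencies, damping, and peak-picking approach are assumptions for illustration only.

```python
import numpy as np

# Hypothetical vowel-like signal: two damped resonances ("formants") at 700 Hz and 1200 Hz.
fs = 8000                        # sampling rate in Hz (assumed)
t = np.arange(0, 0.1, 1 / fs)    # 100 ms of signal
signal = (np.exp(-30 * t) * np.sin(2 * np.pi * 700 * t)
          + 0.6 * np.exp(-30 * t) * np.sin(2 * np.pi * 1200 * t))

# Power spectrum via the FFT.
spectrum = np.abs(np.fft.rfft(signal)) ** 2
freqs = np.fft.rfftfreq(len(signal), 1 / fs)

# Crude peak picking: local maxima in the power spectrum, strongest first.
peaks = [i for i in range(1, len(spectrum) - 1)
         if spectrum[i] > spectrum[i - 1] and spectrum[i] > spectrum[i + 1]]
peaks.sort(key=lambda i: spectrum[i], reverse=True)
print("Estimated formant frequencies (Hz):", sorted(freqs[i] for i in peaks[:2]))
```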

The Development of SR
1950s
- RCA Laboratories: recognizing 10 syllables spoken by a single speaker
- MIT Lincoln Lab: speaker-independent 10-vowel recognition
1960s
- Kyoto University: speech segmenter
- University College: first to use a statistical model of allowable phoneme sequences in the English language
- RCA Laboratories: non-uniform time scale instead of speech segmentation
1970s
- Carnegie Mellon: graph search based on a beam algorithm

The lowest level of speech segmentation is the breakup and classification of the sound signal into a string of phones. In contrast, an isolated digit recognizer implicitly assumed that the unknown utterance contained a complete digit (and no other speech sounds or words) and thus did not need an explicit segmenter. A phoneme is the smallest segmental unit of sound employed to form meaningful contrasts between utterances. Martin recognized the need to deal with the temporal non-uniformity in repeated speech events and suggested a range of solutions, including detection of utterance endpoints, which greatly enhanced the reliability of recognizer performance. Vintsyuk proposed the use of dynamic programming for time alignment between two utterances in order to derive a meaningful assessment of their similarity. The speech recognition language was represented as a connected network derived from lexical representations of words, with syntactical production rules and word boundary rules. In the proposed Harpy system, the input speech, after going through a parametric analysis, was segmented, and the segmented parametric sequence of speech was then subjected to phone template matching using the Itakura distance. The graph search, based on a beam search algorithm, compiled, hypothesized, pruned, and then verified the recognized sequence of words (or sounds) that satisfied the knowledge constraints with the highest matching score (smallest distance to the reference patterns).

The Two Schools of SR
Two schools of thought on the applicability of ASR to commercial applications developed in the 1970s.
IBM
- Speaker-dependent
- Converted sentences into letters and words
- Transcription: focus on the probability associated with the structure of the language model
- N-gram model
AT&T
- Speaker-independent
- Emphasis on an acoustic model over the language model

IBM: voice-activated typewriter. The technical focus was on the size of the vocabulary and the structure of the language model, represented by statistical syntactical rules describing the probability of a sequence of language symbols (the language model). An n-gram model gives the probability of occurrence of a sequence of n words; such models proved better than humans at completing word sequences.
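As an illustration of the n-gram idea (not from the original slides), here is a minimal bigram sketch; the toy corpus and the helper names are invented for demonstration, and a real language model would be estimated from far more text.

```python
from collections import Counter

# Toy corpus (invented for illustration).
corpus = "call home please call the office please call home".split()

# Count unigrams and bigrams.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """Maximum-likelihood estimate P(word | prev_word) = count(prev_word, word) / count(prev_word)."""
    return bigrams[(prev_word, word)] / unigrams[prev_word]

def sequence_prob(words):
    """Probability of a word sequence under the bigram model (sentence boundaries ignored for simplicity)."""
    p = 1.0
    for prev_word, word in zip(words, words[1:]):
        p *= bigram_prob(prev_word, word)
    return p

print(sequence_prob("call home".split()))          # P(home | call)
print(sequence_prob("please call home".split()))   # P(call | please) * P(home | call)
```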

AT&T: the goal was to provide automated telecommunication services to the public, e.g. voice dialing and command/control for routing of calls. The speaker-independent system led to the creation of speech clustering algorithms for building word and sound reference patterns. The emphasis was on the acoustic model over the language model, since voice dialing and call routing involved only a few words. Research to understand the acoustic variability of speech led to the study of a range of spectral distance measures and statistical modeling techniques that produced sufficiently rich representations of utterances from a vast population. Keyword spotting, detecting a keyword of significance embedded in a longer utterance, accommodated talkers who spoke in natural sentences rather than rigid commands.

Both AT&T and IBM made mathematical formalism and rigor an aspect of SR.

Markov Models
A stochastic model where each state depends only on the previous state in time. The simplest Markov model is the Markov chain, which undergoes transitions from one state to another through a random process.

Markov Property

Markov models exhibit the Markov property, which states that the next state depends only on the current state and not on the past.
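As a small illustration of the Markov property (not from the original slides), the following sketch simulates a two-state Markov chain; the states and transition probabilities are invented for demonstration, and the next state is drawn using only the current state.

```python
import random

# Hypothetical two-state chain (invented for illustration).
states = ["sunny", "rainy"]
# transition[s] gives P(next state | current state s); each row sums to 1.
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def simulate(start, steps):
    """Generate a state sequence; the next state depends only on the current one (Markov property)."""
    sequence = [start]
    for _ in range(steps):
        probs = transition[sequence[-1]]
        sequence.append(random.choices(list(probs), weights=probs.values())[0])
    return sequence

print(simulate("sunny", 10))
```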

Hidden Markov Models
A Hidden Markov Model (HMM) is a Markov model using the Markov property with unobserved (hidden) states. In a Markov model the states are directly visible to the observer, while in an HMM the state is not directly visible, but the output, which depends on the state, is visible.

Elements of an HMM
- There are a finite number of N states, and each state possesses some measurable, distinctive properties.
- At each clock time t, a new state is entered based upon a transition probability distribution which depends on the previous state (Markovian property).
- After each transition, an observation output symbol is produced according to the probability distribution of the state.

Urn and Ball Example
- We assume that there are N glass urns in a room.
- In each urn there is a large quantity of colored balls of M distinct colors.
- A genie is in the room and randomly chooses the initial urn.
- Then a ball is chosen at random, its color is recorded, and the ball is replaced in the same urn.
- A new urn is selected according to a random procedure associated with the current urn.
- Each state corresponds to a specific urn.
- A color probability distribution is defined for each state (hidden).

x1, x2, x3 represent the urns, which are the states; y1, y2, y3, and y4 represent the possible observations; a denotes the state transition probabilities, which are hidden; b denotes the observation probabilities.
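A minimal sketch of how such a model generates an observation sequence (not from the original slides; the number of urns, the colors, and all probability values below are invented for illustration):

```python
import random

# Hypothetical urn-and-ball HMM: 2 urns (hidden states), 3 ball colors (observations).
colors = ["red", "green", "blue"]
pi = [0.6, 0.4]                      # initial urn probabilities
A = [[0.7, 0.3],                     # A[i][j] = P(next urn j | current urn i)
     [0.4, 0.6]]
B = [[0.5, 0.3, 0.2],                # B[i][k] = P(color k | urn i)
     [0.1, 0.4, 0.5]]

def generate(T):
    """Draw a sequence of T ball colors; the urn sequence stays hidden."""
    urn = random.choices([0, 1], weights=pi)[0]
    observations = []
    for _ in range(T):
        observations.append(random.choices(colors, weights=B[urn])[0])
        urn = random.choices([0, 1], weights=A[urn])[0]
    return observations

print(generate(10))   # only the colors are visible; the urns are not
```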

Coin Toss Example
You are in a room with a barrier and cannot see what is happening on the other side. On the other side, another person is performing a coin (or multiple-coin) tossing experiment. You won't know what is happening, but you will receive the results of each coin flip. Thus a sequence of hidden coin tosses is performed and you can only observe the results.

One coin toss

Two coins being tossed

There are cases where the coins can be biased. In that case the observation probabilities become lopsided and either heads or tails is favored heavily. A fair coin is used to decide which biased coin is tossed at each trial, so the state transition probabilities remain unbiased.

Three coins being tossed
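For the two-biased-coin case, the model parameters might look like the following sketch; the specific bias values are assumptions for illustration, since the original slides give no numbers.

```python
import numpy as np

# Hidden states: which biased coin is currently being tossed (coin 1 or coin 2).
# A fair coin selects the next coin, so every transition probability is 0.5.
A = np.array([[0.5, 0.5],
              [0.5, 0.5]])

# Observation probabilities: rows are coins, columns are [P(heads), P(tails)].
# Assumed biases: coin 1 favors heads, coin 2 favors tails.
B = np.array([[0.75, 0.25],
              [0.25, 0.75]])

# Initial state distribution: the first coin is also chosen by the fair coin.
pi = np.array([0.5, 0.5])

print(A, B, pi, sep="\n")
```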

HMM Notation
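The notation slide's contents did not survive extraction; the following summary is a reconstruction that assumes the conventional definitions the three problems on the next slide refer to.

```latex
\begin{align*}
N &: \text{number of states; the state at time } t \text{ is } q_t \\
M &: \text{number of distinct observation symbols per state} \\
A &= \{a_{ij}\}, \quad a_{ij} = P(q_{t+1} = j \mid q_t = i) \quad \text{(state transition probabilities)} \\
B &= \{b_j(k)\}, \quad b_j(k) = P(o_t = v_k \mid q_t = j) \quad \text{(observation probabilities)} \\
\pi &= \{\pi_i\}, \quad \pi_i = P(q_1 = i) \quad \text{(initial state distribution)} \\
\lambda &= (A, B, \pi) \quad \text{(compact notation for the full model)}
\end{align*}
```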

The Three Problems for HMMs
1. Given the observation sequence O = (o1 ... oT) and a model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model?

2. Given the observation sequence O = (o1 ... oT) and a model λ = (A, B, π), how do we choose a corresponding state sequence q = (q1 ... qT) that is optimal in some sense (i.e., best explains the observations)?

3. How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?

Problem 1 gives us a model and a sequence of observations, and we have to figure out the probability that the observation sequence was generated by that model. The solution of Problem 1 allows us to choose the model which best matches the observations. Problem 2 is where we try to uncover the hidden part of the model, the state sequence, that led to the observations. Problem 3 is where we attempt to optimize the model parameters so we can best describe the sequences we receive; this is essentially the training problem, which uses training data in order to create the best models.

Three Types of HMM
- Ergodic model
- Left-to-right model
- Parallel left-to-right model

Ergodic Model
In an ergodic model it is possible to reach any state from any other state.

Left-to-Right (Bakis) Model
As time increases, the state index increases or stays the same.

The left-to-right model has the desirable property that it can readily model signals whose properties change over time in a successive manner, for example speech.

Parallel Left-to-Right Model
A left-to-right model in which there are several parallel paths through the states.
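To make the topologies concrete, here is a small sketch (not from the slides; the state counts and probability values are assumed for illustration) contrasting an ergodic transition matrix, where every entry is nonzero, with a left-to-right one, where transitions can only stay or move forward.

```python
import numpy as np

# Ergodic topology (assumed 3 states): any state can be reached from any other,
# so every entry of the transition matrix is nonzero.
A_ergodic = np.array([[0.4, 0.3, 0.3],
                      [0.2, 0.5, 0.3],
                      [0.3, 0.3, 0.4]])

# Left-to-right (Bakis) topology (assumed 4 states): the state index can only
# stay the same or increase, so the matrix is upper triangular and the last
# state absorbs.
A_left_to_right = np.array([[0.6, 0.4, 0.0, 0.0],
                            [0.0, 0.7, 0.3, 0.0],
                            [0.0, 0.0, 0.8, 0.2],
                            [0.0, 0.0, 0.0, 1.0]])

# Each row is a probability distribution over next states.
assert np.allclose(A_ergodic.sum(axis=1), 1.0)
assert np.allclose(A_left_to_right.sum(axis=1), 1.0)
```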

HMMs in SR
- 1980s: shift to a rigorous statistical framework
- HMMs can model the variability in speech
- Markov chains used to represent linguistic structure along with the associated set of probability distributions
- Baum-Welch algorithm to find the unknown parameters
- Hidden Markov model merged with the finite-state network

There was a shift from the template-based approach to a more rigorous statistical modeling framework. The HMM models the intrinsic variability of the speech signal as well as the structure of spoken language in an integrated and consistent statistical modeling framework. The speech signal is highly variable due to pronunciation, accent, reverberation, noise, etc. Given a set of known utterances, called the training set, the Baum-Welch algorithm is used to obtain the best set of parameters defining the model; the model then provides an indication of the probability that an unknown utterance was produced by it. The Baum-Welch algorithm works by assigning initial probabilities to all the parameters. Then, until the training converges, it adjusts the probabilities of the HMM's parameters so as to increase the probability the model assigns to the training set.
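The probability that an utterance was produced by a model, P(O|λ), is exactly Problem 1 above, and it is typically computed with the forward algorithm; Baum-Welch re-estimation builds on these same forward probabilities together with the backward ones. Here is a minimal sketch, not from the slides, reusing the invented urn-and-ball parameters from the earlier example.

```python
import numpy as np

# Invented example model (same values as the urn-and-ball sketch above).
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])          # state transition probabilities
B = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.4, 0.5]])     # observation probabilities per state
pi = np.array([0.6, 0.4])           # initial state distribution

def forward(observations):
    """Forward algorithm: return P(O | lambda) for an integer-coded observation sequence."""
    alpha = pi * B[:, observations[0]]          # initialization
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]           # induction: sum over previous states
    return alpha.sum()                          # termination

# Likelihood of seeing observation symbols 0, 2, 1 under this model.
print(forward([0, 2, 1]))
```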

Use of a finite-state grammar (graph) in large-vocabulary continuous speech recognition represented a consistent extension of the Markov chain that the HMM utilized to account for the structure of the language, albeit at a level that accounted for the interaction between articulation and pronunciation. Although these structures (for various levels of the language constraints) were at best crude approximations to the real speech phenomenon, they were computationally efficient and often sufficient to yield reasonable (first-order) performance results. Large-vocabulary continuous recognition thus rested on the merger of the hidden Markov model (with its advantage in statistical consistency, particularly in handling acoustic variability) and the finite-state network (with its search and computational efficiency, particularly in handling word sequence hypotheses).

Speech Recognition Today
Developments in algorithms and data storage models have allowed more efficient methods of storing larger vocabulary bases.

Modern Applications
- Military: aircraft control, access to battle information databases, training air traffic controllers
- Health care: form filling, queries, signing documents
- Telephony: voice control, automated systems
- Computing: speech-to-text; hands-free computing