ISSUES IN SPEECH RECOGNITION Shraddha Sharma

Contents:

• Introduction
• What is speech recognition?
• Terminology of speech recognition
• Why we want speech recognition?
• What is speech?
• Difficulties with ASR
• Solutions for difficulties of ASR

Speech Recognition:

Definition:- The process of interpreting human speech by a computer.

A more technical definition, from Jurafsky:

ASR (automatic speech recognition) is the building of a system for mapping acoustic signals to a string of words.

• Terminology of Speech Recognition:
• Speaker Dependent Recognition
– The recognition system is designed to work with just one or a small number of individual speakers.
• Speaker Independent Recognition
– These systems are designed to work with all the speakers from a given linguistic community.
• Large Vocabulary Recognition
– It is very difficult to make accurate large-vocabulary, speaker-independent systems.

• Small Vocabulary Recognition
– Typically recognition of a few keywords, such as digits or a set of commands.
– Example: voice-operated telephone number dialing.

• Isolated Word Recognition:
– Systems which can only recognize individual words that are preceded and followed by a relatively long period of silence.

• Connected Word Recognition:
– Systems which can recognize a limited sequence of words spoken in succession.
– e.g. “Ninety-eight thirty-five four thousand”

• Continuous Word Recognition:
– These systems recognize speech as it occurs, in real time.
– Such systems usually work with a large vocabulary, but with moderate accuracy.

• Why do we want speech recognition?

The main goal of speech recognition is to provide efficient ways for humans to communicate with computers.

• Speech recognition is important, not because it is natural for us to communicate via speech, but because in some cases it is the most efficient way to interface with a computer.

• Applications of speech recognition:-
1. Telephone applications
2. Hands-free operation
3. Applications for the physically handicapped
4. Dictation
5. Translation
6. Environmental control

• What is speech?
• When humans speak, we let air pass from our lungs through our mouth and nasal cavity, and this air stream is restricted and changed with our tongue and lips. This produces contractions and expansions of the air: an acoustic wave, a sound.

• The sounds we form, the vowels and consonants, are usually called phones. The phones are combined together into words.
• However, speech is more than sequences of phones that form words.

• The term speech signal within ASR refers to the analog electrical representation of the contractions and expansions of air.

• The analog signal is then converted into a digital representation by sampling the analog continuous signal.

• A high sampling rate in the A/D conversion gives a more accurate description of the analog signal, but also requires more storage space.
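The trade-off between sampling rate and storage can be made concrete with a little arithmetic. The sketch below assumes uncompressed PCM audio; the function name `pcm_bytes_per_second`, the 16-bit mono default, and the example sampling rates are illustrative choices, not values taken from the slides.

```python
def pcm_bytes_per_second(sample_rate_hz, bits_per_sample=16, channels=1):
    """Storage needed for one second of uncompressed PCM audio."""
    return sample_rate_hz * (bits_per_sample // 8) * channels

# Compare a few commonly used sampling rates (assumed examples).
for rate in (8000, 16000, 44100):
    per_minute = pcm_bytes_per_second(rate) * 60
    print(f"{rate} Hz -> {per_minute / 1e6:.2f} MB per minute")
```

Doubling the sampling rate doubles the storage (and the amount of data the recognizer has to process) for the same stretch of speech.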

• Difficulties with ASR:-
• 1. Human comprehension of speech compared with ASR
• 2. Body language
• 3. Noise
• 4. Spoken language / Written language
• 5. Continuous speech
• 6. Channel variability
• 7. Speaker variability
• 8. Amount of data & search space
• 9. Ambiguity

• Human comprehension of speech compared to ASR :-

• Humans use the knowledge they have about the speaker and the subject.

• Words are not arbitrarily sequenced together; there is a grammatical structure and redundancy that humans use to predict words not yet spoken.

• In ASR we only have the speech signal. We can construct a model for the grammatical structure and use some kind of statistical model to improve prediction, but there is still the problem of how to model world knowledge, the knowledge of the speaker and encyclopedic knowledge.

• Body language:-
• A human speaker does not only communicate with speech, but also with body signals - hand waving, eye movements, postures etc.
• This information is completely missed by ASR.

• Noise:-
• Speech is uttered in an environment of sounds.
• Unwanted information in the speech signal is called noise.
• In ASR we have to identify and filter out these noises from the speech signal.

• Spoken language / Written language:-
1. Written communication is usually a one-way communication, but speech is dialogue-oriented.
2. Speech contains disfluencies: normal speech is filled with hesitations, repetitions, changes of subject in the middle of an utterance, slips of the tongue etc.
3. The grammaticality of spoken language is quite different from that of written language at many different levels.

• Continuous speech:-
• Natural speech is continuous; it does not have pauses between the words, so the recognition of continuously spoken speech is significantly more difficult.
• The complexity of ASR is caused mainly by 3 properties of continuous speech:
• 1. Word boundaries
• 2. Coarticulatory effects
• 3. Content words

• Channel variability:-
• One aspect of variability is the context where the acoustic wave is uttered.
• Here we have the problem of noise that changes over time, different kinds of microphones, and everything else that affects the content of the acoustic wave on its way from the speaker to the discrete representation in a computer.
• This phenomenon is called channel variability.

• Speaker variability:-
• All speakers have their own special voices, due to their unique physical body and personality. The voice is not only different between speakers; there are also wide variations within one specific speaker.

• Some of these variations are:
• 1. Realization
• 2. Speaking style
• 3. The sex of the speaker
• 4. Anatomy of the vocal tract
• 5. Speed of speech
• 6. Regional and social dialects

• Regional dialects involve features of pronunciation, vocabulary and grammar which differ according to the geographical area the speaker comes from.

• Social dialects are distinguished by features of pronunciation, vocabulary and grammar according to the social group of the speaker.

• Amount of data and search space:-
• Communication with a computer via a microphone produces a large amount of speech data every second. This has to be matched to groups of phones (the sounds), the words and the sentences.

• Groups of phones build up words, and words build up sentences. The number of possible sentences is enormous.

• To reduce the search space we can also minimize our lexicon, i.e. the set of words. This introduces another problem, called out-of-vocabulary, which means that the intended word is not in the lexicon.

• An ASR system has to handle out-of-vocabulary words in a robust way.
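One common way to handle out-of-vocabulary words robustly is to map them to a reserved unknown-word symbol rather than forcing them onto the nearest lexicon entry. The following is a minimal sketch of that idea, not a method described in the slides; the `<unk>` token, the function names `map_oov` and `oov_rate`, and the toy lexicon are all assumptions made for illustration.

```python
UNK = "<unk>"  # hypothetical placeholder symbol for out-of-vocabulary words

def map_oov(words, lexicon):
    """Replace every word that is not in the lexicon with the UNK symbol."""
    return [w if w in lexicon else UNK for w in words]

def oov_rate(words, lexicon):
    """Fraction of words that fall outside the lexicon."""
    if not words:
        return 0.0
    return sum(w not in lexicon for w in words) / len(words)

# Toy small-vocabulary lexicon (illustrative only).
lexicon = {"call", "dial", "home", "office"}
utterance = ["call", "grandma", "home"]

print(map_oov(utterance, lexicon))
print(oov_rate(utterance, lexicon))
```

Tracking the out-of-vocabulary rate on real input is one way to judge whether a reduced lexicon has become too small for the task.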

• Ambiguity:-
• Natural language has an inherent ambiguity, i.e. we cannot always decide which of a set of words is actually intended.
• There are two ambiguities that are particular to ASR:
1. Homophones
2. Word boundary ambiguity
• The concept of homophones refers to words that sound the same but have different orthography. They are two unrelated words that just happen to sound the same.

• Word boundary ambiguity:-
• When a sequence of groups of phones is put into a sequence of words, we sometimes encounter word boundary ambiguity.

• Word boundary ambiguity occurs when there are multiple ways of grouping phones into words.
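Word boundary ambiguity can be illustrated by enumerating every way a phone sequence splits into lexicon words. This is a minimal sketch under stated assumptions: the function `segmentations`, the toy lexicon, and the ARPAbet-style phone symbols are illustrative and do not come from the slides.

```python
def segmentations(phones, lexicon):
    """Return every way of splitting `phones` into words from `lexicon`.

    `lexicon` maps a tuple of phones to the list of words pronounced that way.
    """
    if not phones:
        return [[]]  # one way to segment nothing: the empty word list
    results = []
    for end in range(1, len(phones) + 1):
        prefix = tuple(phones[:end])
        for word in lexicon.get(prefix, []):
            for rest in segmentations(phones[end:], lexicon):
                results.append([word] + rest)
    return results

# Toy lexicon (illustrative only). "I" and "eye" are homophones.
lexicon = {
    ("AY",): ["I", "eye"],
    ("AY", "S"): ["ice"],
    ("S", "K", "R", "IY", "M"): ["scream"],
    ("K", "R", "IY", "M"): ["cream"],
}

for words in segmentations(["AY", "S", "K", "R", "IY", "M"], lexicon):
    print(" ".join(words))
```

The same six phones yield both "ice cream" and "I scream", and the homophone pair adds a further reading. The acoustic signal alone cannot choose between them.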

• Solutions for issues in speech recognition:-
• A general solution to many of the above problems effectively requires human knowledge and experience, and would thus require advanced artificial intelligence technologies to be implemented on a computer.

• In particular, statistical language models are often employed for disambiguation and improvement of the recognition accuracies.

• Language Model:
• The choice of language model has a significant impact on the recognition process.
• The constraint provided by a language model can substantially improve a system's performance and reduce the size of the search space.

• There are 4 types of L.M.:
• 1. Uniform L.M.:- Every word in a sentence is equally probable.
• 2. Stochastic L.M.:- Trigram, bigram & unigram models.
• 3. Finite state L.M.:- A simple artificial language that models all legal sentences using a single network.
• 4. Other possible L.M.:- These include context-free, unification, statistical tree-based & case frame grammars.
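The difference between a uniform and a stochastic language model can be sketched with a toy bigram model. The code below is an illustrative assumption, not the author's implementation: the function names, the `<s>`/`</s>` boundary markers, and the three-sentence corpus are made up, and the estimate is plain maximum likelihood without the smoothing a real system would need.

```python
from collections import Counter

def train_bigram(sentences):
    """Count context words and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def uniform_prob(vocabulary):
    """Uniform L.M.: every word is equally probable."""
    return 1.0 / len(vocabulary)

# Toy corpus of command phrases (illustrative only).
corpus = [["dial", "home"], ["dial", "office"], ["call", "home"]]
unigrams, bigrams = train_bigram(corpus)
vocabulary = {w for s in corpus for w in s}

print(uniform_prob(vocabulary))                        # same for every word
print(bigram_prob("dial", "home", unigrams, bigrams))  # depends on the previous word
```

Unlike the uniform model, the bigram model assigns different probabilities depending on context, which is what lets it narrow the search.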

• Fundamental equation of speech recognition:-

P(w) -> the a priori probability of the word sequence w. It is computed from the language model.
P(y/w) -> the conditional probability of the acoustic sequence y given w. It is computed from the acoustic model.
P(y) -> the probability of the acoustic sequence.
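The three terms above fit the standard Bayes decision rule for recognition. As a sketch in LaTeX, assuming \hat{w} (a symbol not used in the slides) denotes the recognized word sequence and writing P(y/w) as P(y \mid w):

```latex
\begin{align*}
\hat{w} &= \arg\max_{w} P(w \mid y) \\
        &= \arg\max_{w} \frac{P(y \mid w)\, P(w)}{P(y)} \\
        &= \arg\max_{w} P(y \mid w)\, P(w)
\end{align*}
```

The last step holds because P(y) does not depend on w, so it does not affect which word sequence maximizes the expression.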

• Combining language & acoustic models:-
• Probability theory suggests that acoustic & language probabilities can be combined through multiplication, but in practice some weighting is necessary.
• To balance the two, replace the term P(w) with P(w)^l. The range of l is between 2 and 5.
• Here l indicates the language model weight.
• It is determined so as to optimize the recognition performance in CSR (continuous speech recognition).
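The weighting is easiest to see in the log domain, where P(y/w) * P(w)^l becomes log P(y/w) + l * log P(w). The sketch below only illustrates that formula: the candidate sentences and their probabilities are invented, and the names `combined_score` and `best` are hypothetical.

```python
import math

def combined_score(acoustic_prob, lm_prob, lm_weight=3.0):
    """log P(y/w) + l * log P(w), i.e. the log of P(y/w) * P(w)^l."""
    return math.log(acoustic_prob) + lm_weight * math.log(lm_prob)

def best(candidates, lm_weight):
    """Pick the word sequence with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(c[1], c[2], lm_weight))[0]

# Hypothetical candidates: (word sequence, P(y/w), P(w)). Values are made up.
candidates = [
    ("wreck a nice beach", 1e-8, 1e-6),
    ("recognize speech", 5e-9, 1e-4),
]

print(best(candidates, lm_weight=0.0))  # acoustic score alone decides
print(best(candidates, lm_weight=3.0))  # language model now carries more weight
```

A weight of 0 lies outside the 2 to 5 range and is used here only to show what the recognizer would pick without any language model contribution.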

• Improve the acoustic models so that they better represent the statistics of the true incoming audio data.

• Continuous recognition requires lots of CPU power. Isolated-word recognizers can run on slower machines, but only because when you pause between words, you're telling the system where the words start and stop.

• But in continuous speech, any word could potentially start and stop at any time, so the system has to search through and consider every possible start time and every possible end time for every possible word to be recognized, and find the sequence that fits the best.

• Design your grammar using words that differ by multiple phonemes, and you should have good results.
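One way to check a grammar against this advice is to measure how many phonemes separate each pair of words. The sketch below uses Levenshtein (edit) distance over phoneme sequences, which is an assumed choice of measure rather than one named in the slides; the function names, the threshold of 2, and the ARPAbet-style pronunciations are illustrative.

```python
def phoneme_distance(a, b):
    """Levenshtein (edit) distance between two phoneme sequences."""
    previous = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        current = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def confusable_pairs(pronunciations, min_distance=2):
    """Word pairs whose pronunciations differ by fewer than `min_distance` phonemes."""
    words = sorted(pronunciations)
    return [(w1, w2)
            for i, w1 in enumerate(words)
            for w2 in words[i + 1:]
            if phoneme_distance(pronunciations[w1], pronunciations[w2]) < min_distance]

# Hypothetical command grammar (pronunciations are illustrative).
pronunciations = {
    "go": ["G", "OW"],
    "no": ["N", "OW"],
    "stop": ["S", "T", "AA", "P"],
    "start": ["S", "T", "AA", "R", "T"],
}

print(confusable_pairs(pronunciations))
```

Pairs flagged by such a check are candidates for replacement with longer or more distinct command words.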

THANK YOU…
