A flexible real-time recognizer of spoken words for man-machine communication

Int. J. Man-Machine Studies (1970) 2, 317-326

A Flexible Real-time Recognizer of Spoken Words for Man-Machine Communication

R. DE MORI, L. GILLI AND A. R. MEO

Istituto Elettrotecnico Nazionale G. Ferraris and Politecnico di Torino, Italy

(Received 3 November 1969)

A relatively simple real-time recognizer of spoken words is described. The main characteristics of this system are the following: (1) the vocabulary of accepted words is settled with simple operations on the panel of the machine; (2) the system is quasi-adaptive in the sense that the characteristic parameters of a given word uttered by a certain speaker can be measured and displayed on a set of Nixie tubes, and it is very easy (operating on the keyboard) to fit the recognizer to the speaker according to the displayed data; (3) at present, a maximum of 15 words can be classified, but, owing to the modularity of the system, other units can be added in order to enlarge the accepted vocabulary.

The coder consists of a bank of active filters and a set of circuits translating spectral information into a set of binary digits. The processor is composed (a) of a set of combinational units which evaluate the Hamming distance of the input patterns from the characteristic sequences of phonemes or other "tracts" of words; (b) of a set of sequential units which complete the classification of the uttered word by analysing the time evolutions of the combinational network outputs. The panel controls which make it possible to fix the accepted vocabulary and to adapt the recognizer to the speaker operate on the connections between combinational and sequential units and on the operation parameters of all the units.

The machine, used for recognizing the ten digits from 0 to 9 and five other words spoken in Italian, reaches an efficiency larger than 99% on condition of its being previously adapted to the speaker. Programmed according to the average characteristics of male speakers and for classifying words spoken by voices not used in the preceding stage of learning, it reaches average efficiencies of about 90Yo (from 85 to 9 9 ~ for a given speaker).

Introduction

Despite many illuminating discoveries made in the last years on sound production and perception, the complete solution of the speech recognition problem, i.e. the automatic classification of all the words of a language, seems to be a long way off. The great difficulties of the problem have obliged the researchers to restrict their studies to relatively simple recognition processes.

318 R. DE MORI, L. GILLI AND A. R. MEO

In particular, some equipments have been proposed which make it possible to recognize a small vocabulary of spoken words with a greater or lesser degree of accuracy (Davis et al., 1952; Teacher et al., 1967). These partial solutions are interesting from a theoretical as well as from a practical point of view on account of their possible applications to man-machine communication systems, such as, for example, the input of a digital computer.

The system described in this paper belongs to this class of partial recog- nizers, which accept only a small voacabulary of words and must be very reliable. The main characteristics of the proposed solution are the following:

(1) The vocabulary of classified words is settled with simple operations on the panel of the machine. At present, a maximum of 15 words can be recognized, but, since each word corresponds to a different subnetwork, other units can be added in order to enlarge the accepted vocabulary.

(2) The system is quasi-adaptive in the sense that the characteristic parameters of a given word uttered by a certain speaker can be measured and displayed on a set of Nixie tubes, and it is very easy (operating on the keyboard) to fit the recognizer to the speaker according to the displayed data. Thus it is possible to achieve very high efficiencies.

(3) The structure is relatively simple and operates in real time.

Basic Features of the Coder

The unit for extracting the parameters characteristic of the uttered words, or "coder", is of a rather conventional type, consisting essentially of a bank of filters and a set of circuits translating spectral information into a set of binary digits. A block schematic diagram is shown in Fig. 1.

The microphonic signal is pre-amplified in an amplifier with a fiat frequency characteristic and then is shunt-fed to ten 1/3-octave bandwidth active filters having the following central frequencies: 400, 500, 630, 800, 1000, 1250, 1600, 2000, 2500, 4000 Hz.

The choice of these frequency values followed from heuristic considerations based on the results of preceding recognition experiences (Gilli & Meo, 1967, 1968; Meo & Righini, 1965). Each filter is coupled to the chain of one amplifier with adjustable gain, one rectifier, one integrator, one level- detector and one flip-flop operating as a binary quantizer. The mean of the output signals given by the integrators is extracted by a suitable circuit and used as a reference signal in each level-detector unit. So, when the signal

FLEXIBLE SPEECH RECOGNIZER IN REAL TIME 319

~ Int~rator I Level- detector I I ' - - " [ ~ I0 Reference I level

. . . . Reference level

: Other 7 channels • i

I1 ,

Reference level ,extroctor [

FIG. 1. Block diagram of the coder.

component through the ith filter exceeds a certain threshold (proportional to a suitably weighted mean of the spectral components), the output of the ith flip-flop is switched on: otherwise it stays switched off. Notice that, if the output of each integrator is compared with the mean of all integrator outputs, the use of an automatic gain control for the microphonic signal appears to be no more necessary. It should be remarked also that the output of the mean- extractor is biased; that means that when the signal power is under a certain threshold (adjusted just above the noise level), the outputs of all the integrators are under the refrence signal they are compared with and, therefore, all the flip-flops are switched off. This resolution makes it possible to detect the silence intervals between different words or in the utterance of the words con- taining plosives.

General Structure of the Processor

Since the logic values 1 and 0 can be assigned to the two output levels of each flip-flop, we can say that a set of ten binary functions of the continuous variable "time" describes the time evolution of the signal to be classified. Besides, these functions are sampled with a period of 15 msec; it follows that the features of the uttered word are represented by a sequence of patterns, each of which is composed of ten binary digits. The ith digit of the kth pattern indicates that in the kth sampling instant the signal power in the ith band exceeds the previously settled reference signal.

The operation of the processing unit of the recognizer, or processor, can

320 R. DE MOP.I, L. GILLI AND A. R. MEO

be subdivided into two phases, which we shall call "combinational phase" and "sequential phase".

In the combinational phase the binary pattern relative to a certain sampling instant is analysed and is submitted to a first classification, for deciding which phonemes or characteristic "tracts" of speech the analysed pattern "may" belong to. This result is obtained by means of a set of combinational units, each of which is fed by the ten binary outputs of the coder. Each of these units can be settled with simple operations on the keyboard in order to recognize a given vowel or phoneme or other characteristic tract. If, for example, the outputs of the combinational networks associated to vowels E and I are excited, the binary pattern delivered by the coder in that instant should be an E or an I.

In the sequential phase the time evolution of the results of the combinational phase are examined in all the successive sampling instants in order to perform a full classification of the uttered word. In practice this result is obtained by means of a set of sequential circuits, fed by the level outputs of the combinational units and by the clock pulses. Each of these units is associated to a different word of the accepted vocabulary, and gives an output indi- cating that the corresponding word has been recognized. Since it is possible that two, or more than two, units are switched on, a simple decision network has been added suppressing some of the contemporaneous indications.

A block shematic diagram of the whole system is shown in Fig. 2. At present, the recognizer contains 10 combinational and 15 sequential units;

C°m ei n° . i }1 - .

I

-T- ~Comb~no.ono,] ~ . . ~ . ~ network j---- I ,

From I o the ~ o coder . . . .

~0 o

I I t I

-..4 t I ---Other 7 uni ts- - ! I I I -'T-- ~Combinationai ~_

"-I--I,.. ~.~ net~o°rk [ . . I I I

I I

! /

"~rt--Other .12 units . . . .

I :

ino! - i s i ( n twok

FIG. 2. Block diagram of the processor.

Lamp I " - - ~ (wordl)

._Q Lamp2 (word 2 )

--Other 12 lamps

Lamp 15 (word 15)


but, owing to the modularity of the structure, other units can be added in order to improve the efficiency (increasing the number of the combinational units) or enlarge the accepted vocabulary (increasing the number of the sequential units).

The Combinational Units

Each combinational unit is associated to a "characteristic tract" of a word. This tract may be a phoneme or any other acoustic basic unit (for example, the silence preceding the plosives). To each of these tracts (and therefore to each unit) we can associate a "characteristic sequence" of ten ternary symbols (1, 0, X) defined by the following rule: a symbol "1" (or "0") in the kth position of the sequence denotes that a symbol "1" (or "0") is rather likely to appear at the output lead of the kth channel of the coder, when the considered tract is uttered, while a symbol X denotes that the information content of the output of the kth channel is of little usefulness in that case.

In every sampling instant the combinational unit associated to a certain tract performs the following operations:

(a) it computes the Hamming distance D of the characteristic sequence from the binary pattern given by the coder (obviously neglecting the positions in which an X appears);

(b) it gives an excited output if the computed distance D does not exceed a given threshold Dmax, that is, if D ~< Dmax.

A single ten-input majority gate G (that is, a combinational unit giving an excited output if, and only if, the number of the excited inputs exceeds a given threshold T) and some inverters ! implement the Boolean function described by operations (a) and (b). Obviously, the threshold T of the majority gate is settled equal to D . . . . and its inputs are connected to the corresponding outputs of the coder according to the following rules:

(a) if the symbol a i of the characteristic sequence is 1, the input 1 i of the majority gate is connected to the complement of the output 0i of the ith coder channel;

(b) i f a i is 0, I i is connected to 0 i directly; (c) if ai is X, input li is left disconnected.

The majority gates have been obtained with simple transistor circuits. Notice that, in order to fix the characteristic sequence and D~ax it is sufficient to settle the right positions of a number of switches accessible on the panel of the machine.


The Sequential Units

Each of the sequential units makes it possible to verify whether all the "tracts" of the corresponding word have been recognized by the combinational networks in a sufficient number of sampling instants and in the proper sequence. Thus, to recognize the word SEI (in English: "six"), a certain sequential unit can be settled in such a way that an excited output is released after the level outputs of the units related to S, E and I have been on for at least 4, 6 and 4 sampling instants.

Each sequential unit is composed of a number of counter registers, each of which corresponds to a given "tract" of word. A counter register is driven by the clock pulse sequence and is enabled to count by the following two level signals:

(a) the output of the related combinational unit, (b) a "consensus" signal delivered by a preceding counter register when

the "tract" of the word associated to the latter has been recognized.

The number of counter registers contained in a sequential unit is variable from 3 to 8. Indeed, as is shown on the right of Plate l(a), eight units (from 1 to 8) contain three counters each, two units (9 and 10) contain four counters, two units (13 and 14) contain six counters and finally one unit (16) contains eight counters whose indications are visualized by Nixie tubes (the last one can be used to analyse in detail the behaviour of the system in the most difficult cases); of course, the recognition of a more complex word will require a unit with more counters, and vice versa.

The counters are connected to the combinational units by inserting some pins in an A-MP Matrix Pinboard [Plate l(b), right side; any row corresponds to one of 66 counters and any column to one of the ten combinational units]. Besides, the minimal counting times can be fixed (from 0 to 9) by operating on 10-position switches [Plate l(a)].

Notice that also two complementary level signals denoting "silence" or "noise" are available and can be used for enabling or disabling counter registers for fixed time intervals.

Learning and Settling

In order to perform the learning process, the coder outputs are visualized on a set of Nixie tubes which make it possible to record the spectral characteristics of the speaker. The procedure is the following one. A certain "tract" of a word is isolated by means of suitable counters, and for each channel of the coder the number of sampling instants in which the output is excited is


counted and recorded by a subset of the Nixie tubes. The characteristic sequence of that tract is fixed according to those indications. For example, if a channel is excited for more than 60 sampling instants out of 100, the corresponding digit of the sequence is set equal to 1 ; if the percentage of excitation instants is less than 40, the digit is set equal to 0; otherwise, in that position an X is set.

The recognizer can be settled by operating the following controls on the panel of the machine (Plate 1):

(1) The potentiometer knots controlling the gains of the amplifiers 1 + 10 and the bias of the mean-extractor of Fig. 1. These controls are adjusted once for all in such a way as to achieve a complete utilization of the information content of all the spectral components. Theoretical considerations show that to reach such an aim would require a very complex analysis of the complete set of the words to be recognized. Therefore, the simple criterium has been followed of controlling the gain of any amplifier in such a way that after a large number of experiences pertaining to many words uttered by many speakers the excitation percentage at the corresponding channel output is nearly 50.

(2) The pins in the A-MP Matrix Pinboard which make it possible to connect a sequential unit to a suitable subset of combinational units.

(3) The ten-position switches which fix the minimal counting times of every tract of a word.

(4) The three-position switches of Plate l(a) (right side) which settle the combinational unit for comparing the characteristic sequence with the input pattern. Obviously, the positions of the switches must copy exactly the characteristic sequence determined in learning.

(5) The switch T [Plate l(a)] which fixes Dmax for a certain combinational unit. Dmx is assumed equal to 0, 1, 2 or 3 according to the lesser or greater stability of the patterns.

In brief, the controls described in 2 and 3 are used for establishing the vocabulary of the classified words while the controls described in (4) and (5) make it possible to adapt the system to the speaker. The former operation requires about half an hour (for 15 words); the latter can be performed in a few minutes (learning included).

The machine has other controls and facilities of minor importance, such as a circuit for resetting the counters automatically during the silence intervals between two successive words.

324 R. DE MORI, L. GILL! AND A. R. MEO

TABLE 1

Characteristic sequences

Tract Sequence Dmax

A 0 0 1 1 1 1 1 0 0 0 1 O 0 1 l l l X 0 0 0 0 1 El 0 1 1 0 0 0 1 1 0 0 1 CI 0 0 0 0 0 0 0 0 1 1 0 S 0 0 0 0 0 0 0 0 0 1 0 U 1 1 1 X 0 0 0 0 0 0 0 E, 0 1 1 0 0 0 X X O 0 0 I X l X O 0 0 0 0 1 1 1

Silence 0 0 0 0 0 0 0 0 0 0 0

TABLE 2

Minimal durations of the characteristic sequences

Word Minimal durations of the characteristic sequences

UNO 6U 40 DUE 6U 5E, TRE 4 El

QUATTRO 1U 4A CINQUE 1CI 3 Sil

SE1 4S 6El SETTE 4S 4El OTTO 60 6 Sil NOVE 80 4Es ZERO 9E 20

6 Sil 30 1U 3Es 4I

8 Sil 4Es 40

Efficiency

When used for recognizing the ten digits from 0 to 9 (and also five other words suitably chosen) spoken in Italian, the machine reaches an efficiency larger than 9 9 ~ provided it has been previously adapted to the speaker. The preceding percentage refers to a set of experiences made in different days on samples of 100 words each pronounced in random order by five selected speakers.

It may be interesting to report the behaviour of the machine when pro-


Classifiec word

Pr0nounc ~ ' e a ~ I ° o w w word ~ N~ -~z o~ ~-tr

ZERO (01 777 2 04444 ZERO:

UN0 ( I )= 99 'u:no:

DUE ( 2 ) 3 5566 'du,E

TRE (3) ire

QUATTRO (4) 447 9 ~kwa :ttro :

ClNQUE (5) 667778 15899 'tgirlkw¢ t 999

SEI (6) 346 sei

SETTE (7)1 9 'sEttE

OTTO ( 8 )1 7 9999 'o : tt0 :

NOVE (9)1 56789 Ino:v£

O I--- z m ~u

6 55

~2589 8

33666 66777 88

138

FIG. 3. Error matrix referring to 1000 utterances of ten speakers not considered in the learning process. (The figures in the matrix denote the speaker.)

grammed according to the average characteristics of male speakers (Tables 1 and 2) and used for recognizing speakers not considered in the learning process. The error matrix shown in Fig. 3 refers to 1000 utterances of ten speakers, each of which pronounced ten times the ordered sequence of the ten digits. The speakers were requested to pronounce the words clearly, at a fixed distance from the microphone, in a loud voice, and the experiences were performed after a few minutes' training. The environment was very noisy, because a teletype was substituted for the lamps for these experiences. The right-classification percentage ranges from 8 5 ~ (speaker 9) to 99Yo (speaker 0, one of the authors of this paper). The overall average efficiency appears to be 93 .6~ .

We wish to thank Professor R. Sartori, Professor G. Sacerdote and Dr G. Righini, of Istituto Elettrotecnico Nazionale G. Ferraris for many useful suggestions and criticisms during the course of this research.

This research was supported in part by the Consiglio Nazionale delle Ricerche of Italy.


References

DAVIS, K. H., BIDDULPH, R. & BALASHEK, S. (1952). Automatic recognition of spoken digits../, acoust. Soc. Am. 24, 637.

D~NES, P. & MATI-mWS, M. V. (1960). Spoken digit recognition using time-frequency pattern matching, d. acoust. Soc. Am. 32, 1450.

DUDLEY, H. & BALASHEK, S. (1958). Automatic recognition of phonetic patterns. d. acoust. Soc. Am. 30, 721.

Gmu, L. & MEo, A. R. (1967). Sequential system for recognizing spoken digits in real time. Acustica 19, 38.

GILLI, L. & M~o, A. R. (1968). A recognition system using majority gates and sequential circuits. Alta Frequenza XXXVII, 158.

GOLD, B. (1966). Word recognition computer program. M.I.T. Research Lab. of Electronics, Cambridge, Mass., Tech. Rept. 452.

LINImREN, N. (1965). Machine recognition of human language. I.E.E.E. Spectrum 2, 114.

MEo, A. R. & RIOmNL G. (1965). Riconoscitore istantaneo di suoni vocalici. Alia Frequenza XXXIV, 256.

OLSON, H. F., BELANO, H., & ROOERS, E. S. (1967). Speech processing techniques and applications. LE.E.E. Trans. AU-15, 120.

SAKAI, T. & DOStltTA, S. (1963). The automatic speech recognition system. LE.E.E. Trans. EC-12, 835.

TEACHER, C. F., KELLETT, H. G. • FOCHT, L. R. (1967). Experimental, limited vocabulary speech recognizer. LE.E,E. Trans. AU-15, 127.

Documents

A flexible real-time recognizer of spoken words for man-machine communication