Pattern-Matching Procedure for Automatic Talker Recognition

THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA VOLUME 35, NUMBER 3 MARCH 1963

Pattern-Matching Procedure for Automatic Talker Recognition

SANDRA PRUZANSKY

Bell Telephone Laboratories, Inc., Murray Hill, New Jersey (Received 18 December 1962)

A pattern-matching procedure for automatic recognition of talkers was used to study the effects of variations in patterns upon recognition performance. Several utterances of common words, excerpted from context, were spoken by ten talkers and converted to time-frequency-energy patterns. Some of each talker's utterances were used to form reference patterns and the remaining utterances served as test patterns. The recognition procedure consisted of cross-correlating the test patterns with the reference patterns and selecting the talker corresponding to the reference pattern with the highest correlation as the talker of the test utterance. The same recognition procedure was used with patterns reduced to two dimensions. The recognition score for three-dimensional patterns was 89%. Reducing the original patterns to time-energy patterns resulted in a lower recognition score; however, when only spectral information was retained, recognition results were the same as those for three-dimensional patterns. No errors were made in recognition based on a small sample of patterns consisting of pooled spectra of several different voiced sounds.

INTRODUCTION

Recognition of talkers by human observers has been studied from several different points of view. Meeker and Nelson, 1 in evaluating transmission systems, found that trained listeners were able to identify talkers from an ensemble of five with a high degree of accuracy. Pollack, Pickett, and Sumby, 2 studying the effect of each of several variables upon identification performance, showed that the speech wave conveyed considerable information about talker identification if the listener heard a long enough speech sample. Kersta, 3 using a visual display of acoustic features, showed that a group of trained observers was able to identify talkers by spectrogram matching. All of these studies indicate that talkers can be identified primarily on the basis of acoustic cues. The present study is concerned with ways in which a computer can be programmed to recognize talkers solely on the basis of acoustic information.

Several studies have demonstrated the feasibility of automatic recognition of speech sounds based only on acoustic information, provided the sample of words to be recognized is limited. A report of one such study by Denes and Mathews 4 described a spectral-pattern correlation procedure for the automatic recognition of spoken digits. They reported that no errors occurred when reference patterns and patterns to be recognized were based on utterances of the same talker, but the error rate was high when one-talker reference patterns were used to recognize utterances by different talkers. These results suggested that characteristic inter-talker differences might exist in the spectral patterns of talkers uttering the same text. Accordingly, a pilot experiment was undertaken to assess the feasibility of automatic talker recognition by spectral pattern matching. The basic program used in the Denes-Mathews study was altered to perform talker recognition rather than word recognition and applied to the spectral patterns from the same study. This technique proved quite successful with the materials of limited vocabulary and small sample of talkers. (See Appendix for a brief description of this experiment and discussion of results.) Therefore, the study was expanded, using a larger and less restricted set of materials to examine the effects of certain stimulus parameters upon recognition performance.

The experiment reported here begins with a pattern-matching procedure using three-dimensional patterns; the same recognition scheme is then used for patterns reduced to two dimensions. Most of the recognition results are based on individual words with identity of the word provided; however, some work done with patterns consisting of the pooled spectra of several different voiced sounds is also discussed.

1 W. F. Meeker and A. L. Nelson, "Vocoder Evaluation Studies," Radio Corporation of America, AFCRL 547 (30 June 1962).

2 I. Pollack, J. M. Pickett, and W. H. Sumby, J. Acoust. Soc. Am. 26, 403-406 (1954).

3 "Voice Spectrograms for Unique Personal Identifications," Bell Labs Record 40, 214-215 (1962).

4 P. B. Denes and M. V. Mathews, J. Acoust. Soc. Am. 32, 1450-1455 (1960).

Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms. Download to IP: 132.174.254.155 On: Tue, 30 Sep 2014 23:15:12

PREPARATION OF MATERIALS

Five sentences were specially constructed to include words commonly occurring in telephone conversations. Ten talkers (7 male and 3 female) read each sentence several times. The speech signal was led from the sound-treated recording booth over a Western Electric 633 microphone and Bogen RP2 preamplifier to an Ampex 300 tape recorder.

After the recordings were made, the ten key words, spoken several times by each talker, were excerpted from context in the following manner. A sentence was recorded on a continuous tape loop. The loop output was split, one lead going directly to one track of a two-track Ampex tape recorder, the other to one channel of a Grason-Stadler electronic switch, which was used for monitoring purposes. A 1000-cps tone, gated by the second channel of the electronic switch, was recorded on the other track of the tape recorder at a time coincident with the desired word. This was accomplished by listening to the output of the monitor channel and adjusting the timers, which controlled the gate width, until just the desired word was located. A block diagram of the apparatus is shown in Fig. 1.

Digital representation of the spoken words in the form of spectral patterns was obtained by spectral analysis, quantization, and digital recording. Spectral analysis was accomplished by playing back the recordings through a 16-channel filter bank and a special 17th channel which bandpassed 4000-7000 cps. The center frequencies of the other bandpass filters were arranged at approximately equal intervals along the Koenig scale and covered the range 200-4000 cps. The cross-over point between adjacent filters occurred at the 6-dB point. The output of the filter bank was sampled sequentially by a multiplexer at a rate of 100 samples per second. The samples were quantized into 10-bit binary numbers which were recorded on digital tape; analogue-digital conversion and digital recording equipment have been described elsewhere. 5 The 1000-cps tone was used to control the starting and stopping operation of the digital recording equipment; thus, only the key words were recorded on digital tape. A special code marking the beginning of each word was included on the tape. This tape was used as the input to an IBM 7090 computer.

RECOGNITION PROCEDURE

Pattern matching, in general, consists of comparing an unknown pattern with standard patterns that serve as references. The present procedure involved comparing individual array points of the unknown with corresponding array points of the reference patterns. Since the patterns were allowed to vary in length, alignment was critical.

[Figure 1: block diagram. Labeled blocks: Ampex 300 (source recordings); Ampex 300 tape loop; 1000-cps oscillator; Grason-Stadler switch (monitor channel, tone-gate channel); earphone; Ampex S 3567 (Track 1, Track 2).]

FIG. 1. A block diagram of apparatus for excerpting words from context.

5 E. E. David, Jr., M. V. Mathews, and H. S. McDonald, "Description and Results of Experiments with Speech Using Digital Computer Simulation," 1958 IRE Wescon Convention Record, Part 7.

The method of aligning the beginnings of utterances, as used for the digits, was unsatisfactory for excerpted words, so a different alignment scheme was developed. The new procedure involved locating the maximum of the energy-vs-time function of each word and lining up these points. Specifically, the point selected was the midpoint of the longest run of time sections during which the energy exceeded 75% of the maximum.
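The alignment rule just described can be sketched in a few lines of Python. This is a minimal illustration, not the original IBM 7090 program: the function name `alignment_point`, the treatment of the energy-vs-time function as a plain list of per-section energies, the choice of the first run when two runs tie for longest, and the rounding of the midpoint for even-length runs are all assumptions made here.

```python
def alignment_point(energy, fraction=0.75):
    """Index of the midpoint of the longest run of time sections whose
    energy exceeds `fraction` of the maximum energy.

    `energy` is a list with one total-energy value per time section.
    Ties between equally long runs go to the earliest run (an assumption).
    """
    if not energy or max(energy) <= 0:
        raise ValueError("energy must contain at least one positive value")
    threshold = fraction * max(energy)
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, e in enumerate(energy):
        if e > threshold:
            if run_len == 0:
                run_start = i          # a new run begins here
            run_len += 1
            if run_len > best_len:     # strictly longer: keep earliest on ties
                best_start, best_len = run_start, run_len
        else:
            run_len = 0                # run broken
    # Midpoint of the run; rounds down for even-length runs (an assumption).
    return best_start + (best_len - 1) // 2


# Hypothetical energy contour: the longest run above 7.5 spans sections 3..6.
energy = [1, 9, 2, 8, 9, 10, 9, 1]
print(alignment_point(energy))
```

Two patterns would then be lined up by shifting one so that its alignment point coincides with that of the other.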

Reference patterns were formed by adding together corresponding array points of three utterances of the same word by the same talker, making ten different reference patterns (one for each talker) for each word. The procedure used for recognition of talkers consisted of cross-correlating the array of the remaining single utterances of each word with each of the ten reference arrays for that word using the product-moment coefficient of correlation. This measure is defined by 6

$$ r = \frac{\sum xy}{N \sigma_x \sigma_y}. $$

In the present application x and y, deviation measures of corresponding array points from the respective means of the test and reference arrays, are summed over time and frequency (or one of these dimensions, depending on the type of pattern under test); N is the number of cross products, and the sigmas are the standard devi- ations of the two arrays. This measure renders the cross products independent of the means and variances of the two arrays. The talker corresponding to the reference pattern with the highest correlation was recognized as the talker of the test utterance.
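As an illustration of the procedure, the following Python sketch forms a reference pattern by summing utterances, computes the product-moment correlation over all array points, and selects the talker with the highest correlation. The function names and the toy data are hypothetical. The sketch also assumes that the patterns have already been aligned and trimmed to the same size; the text does not say how patterns of unequal length were handled after alignment.

```python
import math

def flatten(pattern):
    """Turn a list of time sections (each a list of channel values) into one list."""
    return [v for section in pattern for v in section]

def make_reference(utterances):
    """Add corresponding array points of several equal-size patterns."""
    return [[sum(vals) for vals in zip(*sections)] for sections in zip(*utterances)]

def correlation(test, reference):
    """Product-moment correlation r = sum(xy) / (N * sigma_x * sigma_y),
    where x and y are deviations from the respective array means."""
    x, y = flatten(test), flatten(reference)
    if len(x) != len(y):
        raise ValueError("patterns must have the same number of array points")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    sx = math.sqrt(sum(a * a for a in dx) / n)
    sy = math.sqrt(sum(b * b for b in dy) / n)
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant pattern")
    return sum(a * b for a, b in zip(dx, dy)) / (n * sx * sy)

def recognize(test, references):
    """Return the talker whose reference pattern correlates best with the test."""
    return max(references, key=lambda talker: correlation(test, references[talker]))


# Toy 2-section x 2-channel patterns for two hypothetical talkers "A" and "B".
references = {
    "A": make_reference([[[1, 2], [3, 4]], [[1, 2], [3, 5]], [[1, 3], [3, 4]]]),
    "B": make_reference([[[4, 3], [2, 1]]] * 3),
}
print(recognize([[1, 2], [3, 4]], references))
```

Because r is computed from deviations and normalized by both standard deviations, summing three utterances rather than averaging them makes no difference to the result.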

RESULTS

The recognition results for patterns of energy in the time-frequency plane (e-t-f) are shown in Fig. 2(a) in the form of a confusion matrix; the asterisks indicate female talkers. Talkers were correctly recognized for 89% of the 393 utterances tested. Results for individual talkers, shown in the right column of Fig. 2(a), ranged from 77% correct to 98% correct. The distribution of errors among words is shown in Fig. 3(a); it can be seen that the errors were not uniformly distributed over all of the words.

[Figure 2: three confusion matrices, panels (a), (b), and (c). Columns: recognized talker (MM, BM, PB, SP, JM, RG, CL, LK, LG, NG), with a right-hand column of percent correct and an over-all score. * = female talker.]

FIG. 2. Inter-talker confusions for: (a) time-frequency-energy patterns, (b) frequency-energy patterns, and (c) combined results.

6 Q. McNemar, Psychological Statistics (John Wiley & Sons, Inc., New York, 1949).

The correlation procedure was used to examine recognition performance with two-dimensional patterns, eliminating temporal information. The original e-t-f patterns were reduced to energy-frequency arrays (e-f) containing 17 array points, each point being simply the sum of energy from each of the time sections for that frequency band. A confusion matrix for these patterns is shown in Fig. 2(b). With the time dimension eliminated, talkers were still correctly recognized for 89% of the 393 utterances tested; the percentage of correct recognition for each talker is shown in the right-hand column of Fig. 2(b). However, the distribution of errors differs from that of the e-t-f patterns. There is also a difference in the error rate for individual words, as shown in the graph of Fig. 3(b).

The effect on recognition success of eliminating spectral information was also observed. Two-dimensional patterns, consisting of the total energy in each time section, were formed by summing the energy in the 17 frequency bands at each time section. The patterns were aligned by the same method used for the three-dimensional patterns. Only 47% of the unknowns were correctly recognized. Although recognition success for individual talkers ranged from 30 to 81%, only two talkers had more than 45% of approximately 40 utterances correctly recognized.
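The two reductions amount to summing the three-dimensional array along one axis. A short Python sketch, assuming a pattern is stored as a list of time sections with one energy value per filter channel (the storage layout and function names are illustrative, not taken from the original program):

```python
def to_energy_frequency(pattern):
    """Sum over time sections: one value per frequency band (e-f pattern)."""
    return [sum(band) for band in zip(*pattern)]

def to_energy_time(pattern):
    """Sum over frequency bands: one value per time section (e-t pattern)."""
    return [sum(section) for section in pattern]


# Toy pattern with 2 time sections and 3 channels (the study used 17 channels).
pattern = [[1, 2, 3],
           [4, 5, 6]]
print(to_energy_frequency(pattern))  # [5, 7, 9]
print(to_energy_time(pattern))       # [6, 15]
```

With 17 channels, the e-f pattern always has 17 points regardless of word duration, so no time alignment is needed before correlating; the e-t pattern keeps one point per time section and still requires alignment.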

Since the distributions of errors for the two- and three-dimensional patterns were quite different, it is possible that an increase in recognition success might be achieved by combining the results from the two best pattern types. An examination of agreement and error for e-f and e-t-f patterns showed that of the 321 cases in which the two patterns agreed on the answer, only 4 of these answers were wrong, and of the 72 cases in which they disagreed, they were both wrong in only 9 cases. A simple linear combination of the e-f and e-t-f correlations, equally weighted, resulted in correct resolution of 50 of the disagreements. Inter-talker confusions and distribution of errors among words are shown in Fig. 2(c) and Fig. 3(c), respectively. Over-all correct recognition was 93%. This is a slight improvement over the results with either pattern type alone.
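The combination rule and the arithmetic behind the 93% figure can be sketched as follows. The function name, the dictionary layout, and the example correlations are hypothetical; only the counts (393, 321, 4, 72, 50) come from the text.

```python
def combine_and_recognize(r_etf, r_ef, w_etf=0.5, w_ef=0.5):
    """Pick the talker with the largest weighted sum of the two correlations.

    r_etf, r_ef: dicts mapping talker -> correlation for the e-t-f and e-f
    patterns of the same test utterance. Equal weights by default.
    """
    combined = {t: w_etf * r_etf[t] + w_ef * r_ef[t] for t in r_etf}
    return max(combined, key=combined.get)


# Hypothetical disagreement: e-t-f alone favors "B", e-f alone favors "A".
r_etf = {"A": 0.90, "B": 0.92}
r_ef = {"A": 0.95, "B": 0.85}
print(combine_and_recognize(r_etf, r_ef))

# Consistency check on the reported counts.
total = 393
agreed_correct = 321 - 4          # agreements that were right
resolved_correct = 50             # disagreements resolved correctly
print(round(100 * (agreed_correct + resolved_correct) / total))
```

The counts give 317 + 50 = 367 correct out of 393, which rounds to the reported 93%.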

Pooled Spectra

Since the talkers could apparently be distinguished on the basis of frequency spectra alone, it was speculated that distinctiveness might be retained in the long-time frequency spectra of these voices. Once again the original patterns were reduced to two dimensions (e-f), this time modified to represent an approximation to the long-time spectra of voiced speech. First, the voiced portions of speech were detected in a manner similar to that described by Lochbaum. 7 Then a reference pattern for each talker was formed by summing the energy of the voiced speech across approximately 30 different utterances (3 sets of 10 different words). Each test pattern consisted of different utterances of single sets of the same text. This reduced the number of patterns to be recognized from almost 400 to only 40. Using the same recognition procedure as was used with single words, all 40 patterns were correctly recognized.

7 C. C. Lochbaum, J. Acoust. Soc. Am. 32, 914 (A) (1960).

8 D. M. Green and T. G. Birdsall, "The Effect of Vocabulary Size on Articulation Score," University of Michigan, AFCRC TR 57-58, AD 146, 759 (January 1958).

[Figure 3: three graphs of error rate by word, panels (a), (b), and (c). Horizontal axis: WORD (KNOW, DAY, JUST, MINUTE, GET, THAT, WOULD, THINK, ABOUT, PRICE). Vertical axis ticks: 10, 20, 30.]

FIG. 3. Error-rate variation among words: (a) time-frequency-energy patterns, (b) frequency-energy patterns, and (c) combined results.

A perfect score on a given sample provides little information about the capability of a recognition scheme, except that it has not been adequately tested. Consequently, the possibility of extrapolation to a larger sample of talkers was explored. Examination of correlation coefficients indicated that a model similar to that of Green and Birdsall 8 seemed applicable to the problem of predicting the percentage of correct recognition as a function of the number of talkers to be recognized. First, the distribution of correlation coefficients for the same-talker reference and test patterns (correct distribution) was examined. Transforming these data to z scores, 6 where

$$ z = \tfrac{1}{2}\log_e(1+r) - \tfrac{1}{2}\log_e(1-r), $$

rendered this distribution normal. The calculated standard deviation was very close to the theoretical value [$\sigma_z = 1/(N-3)^{1/2}$] and the within-subject variance was larger than the between-subject variance. This is further evidence that the z scores are samples from a single normal population. A chi-square test of the distribution of all of the different-talker reference and test patterns indicated that the z transforms of these values were normally distributed about a mean that was considerably smaller than that of the correct-answer distribution; the standard deviation was almost the same.
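For reference, the z transformation and its theoretical standard deviation are easy to compute. The Python sketch below uses hypothetical function names, and the sample values of r and N are illustrative only; they are not data from the study.

```python
import math

def fisher_z(r):
    """z = (1/2) ln(1 + r) - (1/2) ln(1 - r), defined for -1 < r < 1."""
    return 0.5 * math.log(1.0 + r) - 0.5 * math.log(1.0 - r)

def theoretical_sd(n):
    """Theoretical standard deviation of z: 1 / sqrt(N - 3)."""
    return 1.0 / math.sqrt(n - 3)


# Illustrative values only: high correlations are stretched apart by the
# transformation, which is what makes their distribution closer to normal.
for r in (0.90, 0.95, 0.99):
    print(r, round(fisher_z(r), 3))

# Hypothetical N chosen so that the result is a round number.
print(theoretical_sd(28))
```

The transformation is the inverse hyperbolic tangent of r, so `math.atanh(r)` gives the same value.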

The model expresses the percentage of correct recognition for n talkers as the probability ($P_n$) that a sample from the population of correct-answer z scores ($H_1$) is greater than each of the $n-1$ samples from the population of wrong-answer z scores ($H_0$), one for each of the other talkers, where

$$ P_n = \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{x} f(y \mid H_0)\, dy \right]^{n-1} f(x \mid H_1)\, dx. $$

In this application it appears that $H_0$ and $H_1$ are normally distributed with essentially the same standard deviation; so we may let

$$ f(y \mid H_0) = \frac{1}{\sigma (2\pi)^{1/2}} \exp\left[ -\frac{(y - m_0)^2}{2\sigma^2} \right] $$

and

$$ f(x \mid H_1) = \frac{1}{\sigma (2\pi)^{1/2}} \exp\left[ -\frac{(x - m_1)^2}{2\sigma^2} \right]. $$

It is seen that $P_n$, for a given n, depends solely upon the difference between means of the two populations (measured in units of the common standard deviation). Assuming the sample means are the best estimate of the population means, this model predicts 99% correct recognition for 100 talkers; assuming that the true population means are each $2\sigma_m$ (standard error of the mean) closer to each other, a pessimistic estimate, $P_n$ is 0.90 for an n of 100.
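A numerical sketch of this kind of model may make the extrapolation concrete. The Python code below evaluates the probability that one sample from a normal "correct" population exceeds every one of n - 1 independent samples from a normal "wrong" population with the same standard deviation, using a simple midpoint rule. The function names, the integration range of 8 standard deviations, and the step count are choices made here; the exponent n - 1 reflects the assumption that one wrong-answer score is drawn per competing talker. The means used in the example are hypothetical, since the sample means are not given in the text.

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-((x - mean) ** 2) / (2.0 * sd * sd)) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def prob_correct(n_talkers, mean_correct, mean_wrong, sd, steps=4000):
    """P(correct-answer score exceeds all n_talkers - 1 wrong-answer scores),
    integrated by the midpoint rule over mean_correct +/- 8 sd."""
    lo = mean_correct - 8.0 * sd
    width = 16.0 * sd / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        beats_all = normal_cdf(x, mean_wrong, sd) ** (n_talkers - 1)
        total += normal_pdf(x, mean_correct, sd) * beats_all * width
    return total


# Hypothetical separation of 4 standard deviations between the two means.
for n in (10, 100, 1000):
    print(n, round(prob_correct(n, 4.0, 0.0, 1.0), 3))
```

Two sanity checks follow from symmetry: with identical populations the result is 1/n, and rescaling both means and the standard deviation together leaves the result unchanged, which is the sense in which only the separation of the means matters.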

DISCUSSION

The present experiments demonstrate the success of one particular automatic talker-recognition procedure for a limited sample of talkers uttering words excerpted from a sentence context. Of 393 utterances 89% were correctly recognized when recognition was based on patterns of energy in the time-frequency plane. Recognition remained high, 89% correct, for two-dimensional patterns of energy and frequency. Combining data from both of the patterns resulted in 93% correct recognition. This is not the first report of a successful automatic-recognition technique; however, most of the previous studies, dealing with the problem of automatic recognition of speech sounds, have required the abstraction of distinctive features either by expert judgment or by statistical decision processes. The present results indicate that a relatively simple pattern-matching procedure, which involves no abstraction of features, might be quite useful in automatic recognition of talkers.

No attempt has been made to determine the number of parameters necessary to identify a talker. However, recognition performance did remain high with considerable reduction of information. When only spectral information was retained, reducing the number of cross products to 17 per correlation, results were as good as when the correlation procedure used about 50 times as many cross products. This indicates that voices might be specified by relatively few parameters.



Not all reduced patterns yielded high recognition scores; recognition based on temporal patterns of energy was quite poor. However, in most cases there were very high correlations between the unknown pattern and each of the reference patterns (above 0.90) and very small differences between the pattern selected and the other nine. Either the distribution of energy in time for a given word was quite similar between talkers in this sample, or the procedure used to line up the time sections (matching the portions of peak energy) tended to make the patterns more similar. If the former alternative be true, and words differ in their energy-time patterns, such patterns might be useful in word recognition, where representations of words invariant with talkers are desirable.

Although this experiment provides some information about physical specification of talkers for a limited number of persons, little can be learned from the present data about what people use as cues to talker identity. However, these results can be compared with recognition performance by human observers if a task comparable to that given the computer can be developed. If there is any similarity between man and machine performance, the present pattern-matching scheme might serve as a tool for exploring factors important to human recognition of talkers.

The results with pooled spectra have perhaps the most interesting implications for talker recognition and speech recognition. Their reliability and generality are limited because the sample was small and the test and reference patterns included voiced portions of exactly the same text words; therefore, they represent "idealized" long-time spectra, and not true random samples of voiced speech of equivalent duration. Within these limitations, however, the results suggest that spectral distinctiveness of talkers is retained in long-time spectra. The implications of this distinctiveness are that: (1) spectral procedures for recognition of vowels, say, will be plagued by these differences between talkers, and (2) if the differences are reliable, a finite sample of standard text might be used to adjust speech recognizers to a particular talker.

APPENDIX

The Denes-Mathews materials consisted of digitalized patterns of utterances of the digits zero through nine spoken in isolation by five male talkers. Each utterance was recorded on digital tape as a separate block of data. Each block consisted of the first 60 sweeps of a multiplexer (sampling 70 times per second) across 17 filter channels, making 1020 array points per pattern. Patterns were aligned so that the beginning of the utterances they represented would coincide. The beginning was located as the time section that first exceeded 10% of the peak energy. Reference patterns consisted of three utterances of the same digit by a talker. There were five reference arrays for each digit and a total of 100 test utterances, two utterances of each digit by each talker.
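The onset rule used for the digits can be sketched in Python as follows. The function names are hypothetical, and taking the fixed-length block from the detected onset onward is an assumption about how the alignment and the 60-sweep block fit together.

```python
def onset_index(energy, fraction=0.10):
    """First time section whose energy exceeds `fraction` of the peak energy."""
    if not energy or max(energy) <= 0:
        raise ValueError("energy must contain at least one positive value")
    threshold = fraction * max(energy)
    for i, e in enumerate(energy):
        if e > threshold:
            return i

def aligned_block(pattern, energy, n_sweeps=60):
    """Fixed-length block of sweeps starting at the detected onset."""
    start = onset_index(energy)
    return pattern[start:start + n_sweeps]


# Hypothetical energy contour with peak 10: the threshold is 1.0.
print(onset_index([0, 0.5, 2, 10, 3]))

# Pattern size stated in the text: 60 sweeps across 17 channels.
print(60 * 17)
```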

Recognition by cross correlation of reference and test patterns resulted in correct recognition of talkers for 94% of the utterances tested. Converting the patterns to a standard duration by changing the time scale did not change recognition scores; however, the distribution of errors differed. When recognition was based on two-dimensional patterns in the time-energy plane, considerably more errors were made; the talkers were correctly recognized for only 78% of the utterances tested. However, using two-dimensional patterns of frequency and energy resulted in 98% correct recognition. No consistent error pattern emerged for the different digits; the errors were spread over all the digits regardless of pattern alteration.

ACKNOWLEDGMENTS

The author is indebted to Mr. P. D. Bricker and Mr. M. V. Mathews for help in planning this experiment, for much needed statistical advice, and for their valuable guidance throughout the course of the study.

