A Refined Approach to Intoxicated Speaker Classification
Daniel Wilkey ([email protected]) John Graham ([email protected])
1. Project Goal
The INTERSPEECH 2011 Speaker State Challenge [1], in its “Alcohol Language Corpus” sub-challenge, called on the SLP research community to train a classifier that determines whether or not a speaker is intoxicated, defined as a blood alcohol concentration (BAC) greater than 0.05%. For the purposes of this study, INTERSPEECH refers to speakers considered intoxicated by this definition as “alcoholised” (AL) and to all other speakers as “non-alcoholised” (NAL). Many research teams attempted this challenge, and most met with limited success. Using INTERSPEECH’s unconventional metric of un-weighted average recall, the best-performing team [2] achieved just a 7.04% relative improvement over the baseline of 65.9%. The goal of this project is to reuse some of the techniques of the best teams in this challenge, in combination with a few new ones of our own, to achieve better overall performance.
2. Research Hypothesis
We hypothesize that a combination of the following techniques will produce better relative performance, as compared to the baseline, than that of any of the teams that attempted the challenge. The techniques we will explore include: the 4368 standard acoustic features extracted using openSMILE that were provided with the challenge, gain-ratio attribute selection over that feature set, down-sampling of the training data, global speaker normalization, gender normalization, sober-class normalization, removal of samples at the margin, and optimization of various machine learning facets of classification (the algorithm, the kernel, the number of folds for cross-validation, and the number of iterations).
3. Background
The corpus used in this challenge is an extension of one originally collected for a German study on detecting intoxication from speech, whose speakers included impaired German police officers [3]. The INTERSPEECH corpus consists of 162 speakers (84 male and 78 female) from 5 different areas in Germany. Each subject chose a target BAC between 0.28 and 1.75 per mille, and then drank until a breathalyzer confirmed they had reached their target [1], at which point the subject was asked to speak for 15 minutes. Two weeks later, all subjects were called back in to record 30 minutes of sober speech. One major limitation of the corpus is that each recording was later divided into roughly 30-second chunks. Therefore, despite the samples being large in number (5400 for the training set), there are about 75 samples for every subject. Additionally, because there are twice as many sober samples as intoxicated ones, the baseline configuration is heavily biased toward guessing sober. As pointed out by [4], down-sampling is a good way to correct for this.
For the teams that actually competed in this challenge, there was withheld test data in
addition to the training and development sets we were supplied for this study. Teams were
encouraged to train preliminary classifiers and submit them to be tested up to 5 times for
refinement. Because the competition is over, we must find another way of measuring our classifier: we will treat the development corpus as our test set. If we were to keep the
two data sets entirely distinct, then the teams to whom we are comparing ourselves would
effectively have about twice as much training data. To even the odds, we will report cross-
validation results instead of test results. This allows our classifier to learn some information from
the development set, while still affording a means of measurement. We concede that this method
increases the likelihood our classifier will over-fit to the data, but we feel it is a fair compromise.
One group from which we draw inspiration for this study is that of Prof. Hirschberg at
Columbia in [4]. Her team noted that down-sampling the provided training set caused their
classifier to be less likely to choose the majority class, NAL, and thus generalize better.
Especially considering that this data set is so heavily skewed toward the NAL class (around 70/30), this is something we would definitely like to explore.
We also take note of the success of Prof. Narayanan’s lab at USC [2]. They showed that
various forms of global speaker normalization and iterative normalization significantly improved
classification performance. Specifically, we wish to try global speaker-ID normalization (which
they claim gave fair results), and sober-class normalization (which they claim gave very good
results).
Finally, Prof. Batliner, who worked on the original corpus we referenced in [3], noted
that most of his classification error came from a small subset of the data, namely the range from
0.8 to 1.6 per mille. The major difference between the corpus Batliner used and the one we are
using is the intoxication threshold. Batliner’s team was working off of 0.8 per mille, while in this
study the cutoff is 0.5 per mille. We would like to try removing some of the error-prone fringe samples that he describes, but we will have to adjust the range accordingly.
4. Baseline and Pre-Processing
For the competitors of the INTERSPEECH challenge, a baseline was provided off which to work. Using Weka, the organizers trained a support vector machine (SVM) with a linear kernel on the training and development data, and then tested it against the test set. This configuration produced a baseline un-weighted average recall of 65.9% [1]. Because we do not have access to their testing data, we needed to create our own baseline. For our baseline, we used the same
settings (SVM with a linear kernel), but trained using only the training data and tested using 5-
fold cross validation with the development set. With this method we obtained a baseline f-
measure for the AL class of 0.595 (the statistic that we will prefer) and an un-weighted average
recall of 71.0%. Note that this is significantly greater than the INTERSPEECH baseline.
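For illustration, our baseline configuration can be sketched in Python, with scikit-learn standing in for Weka (an assumption of ours; the original experiments used Weka). The random feature matrix below is a toy stand-in, not the actual openSMILE features:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Toy stand-in for the openSMILE feature matrix; the real corpus has
# 4368 features and thousands of samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)  # 1 = AL, 0 = NAL

# Linear-kernel SVM evaluated with 5-fold cross-validation,
# mirroring the baseline configuration described above.
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print(round(scores.mean(), 3))
```

On the real features, the same setup yields the 0.595 f-measure and 71.0% average recall reported above; the toy numbers here are meaningless beyond showing the mechanics.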
The first reason we prefer f-measure to un-weighted average recall is that it is insulated from the two major bias types. If the classifier were to always guess AL, then the large
number of false positives would cause precision to be very low. Alternatively, if the classifier
were to guess NAL every time, then recall would be 0 in the absence of true positives. Because
the F1-measure equally weights these two statistics, it receives protection from both types of bias.
Secondly, f-measure more accurately represents an answer to the question we are posing: ‘is the
speaker intoxicated?’ Average recall, by contrast, takes the answers to two questions (‘can you
recognize someone who is drunk?’ and ‘can you recognize someone who is not drunk?’) and
averages them. Distinguishing the sober people from the intoxicated does not have many
applications, while detecting the intoxicated amongst the sober is useful in areas such as field
sobriety tests and cars that will not start if the driver is impaired. For these reasons, we chose to
optimize for the more widely-accepted f-measure statistic in this study. That being said, we will
still report un-weighted average recall for the purpose of comparison. The formulas for both
metrics are shown below (given relative to the AL class).
Equation 1 – F1-Measure and Un-Weighted Average Recall

F1(AL) = 2 · precision(AL) · recall(AL) / (precision(AL) + recall(AL))
UAR = (recall(AL) + recall(NAL)) / 2
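Both metrics can be computed directly from confusion-matrix counts. A minimal sketch (the helper name and the illustrative counts are ours; the per-class recalls of 58% and 84% echo the baseline breakdown reported in this study):

```python
def f1_and_uar(tp, fp, fn, tn):
    """F1 for the AL class and un-weighted average recall over both classes."""
    precision = tp / (tp + fp)
    recall_al = tp / (tp + fn)    # recall on the AL (intoxicated) class
    recall_nal = tn / (tn + fp)   # recall on the NAL (sober) class
    f1 = 2 * precision * recall_al / (precision + recall_al)
    uar = (recall_al + recall_nal) / 2
    return f1, uar

# Example: 58% AL recall and 84% NAL recall, as in the baseline breakdown.
f1, uar = f1_and_uar(tp=58, fp=16, fn=42, tn=84)
print(round(uar, 2))  # average of 0.58 and 0.84 -> 0.71
```

Note how a classifier that always guesses NAL drives the AL recall, and hence the F1, to zero, while UAR would still credit it with 50%.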
When generating the baseline, we noticed that a single simulation took approximately 12
hours. This was due in large part to the fact that the initial data set contained thousands of
features. To cut down on this time and remove features that were not providing additional
information (yet still contributing to the curse of dimensionality), we ran attribute selection using Weka’s GainRatioAttributeEval algorithm. This algorithm estimates the ratio of the classification information gained from an attribute to that attribute’s own intrinsic information, and ranks attributes by this gain ratio. Using this technique, we reduced the initial feature set from 4368 attributes to just a few hundred. Furthermore, our two metrics were exactly the same after this change, which strongly suggests that no important features were filtered out.
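The gain-ratio criterion behind GainRatioAttributeEval can be sketched for discrete attribute values (a hedged reconstruction of ours, not Weka's code; Weka additionally discretizes numeric attributes first):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information gain of the (discretized) feature, divided by the
    feature's own split entropy: the quantity used to rank attributes."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0

# A feature that perfectly separates the classes gets the maximum ratio.
print(gain_ratio(["lo", "lo", "hi", "hi"], ["NAL", "NAL", "AL", "AL"]))  # -> 1.0
```

Dividing by the split entropy penalizes attributes with many distinct values, which plain information gain would otherwise favor.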
The final preprocessing step we took was to down-sample the training data as was done
in [4]. When the classes are as unbalanced as they were in this corpus (70/30), there is a tendency
for classifiers trained from them to be biased in favor of the larger group. In the case of this
corpus, even though the initial average recall was 71%, the breakdown was 84% recall for the
NAL class and 58% recall for the AL class. This implies that the classifier was
disproportionately guessing NAL over AL due to its class size. The initial split in the training set
was 3750 NAL samples to 1650 AL samples. To even out the distribution, we randomly
removed (3750-1650) = 2100 samples from the NAL data set, leaving us an exact 50/50 split.
Down-sampling is a viable manipulation in general, but that was especially true of this corpus
because, as mentioned previously, there are about 75 samples per person and thus many near-duplicate samples. Down-sampling is an example of a transformation that can only be applied to training data; in testing, we cannot remove samples (that would amount to refusing to classify them). The technique forces the classifier being trained to focus more upon the features and less upon the relative class sizes, and once training is complete, its influence is fixed. The impact down-sampling had on our results was favorable: a 9.3% relative improvement in f-measure
(significant for α=0.001) and a 0.85% relative increase in average recall (not significant).
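The down-sampling step can be sketched as follows (the helper name is ours; the class counts match the training split described above):

```python
import random

def down_sample(samples, labels, seed=0):
    """Randomly discard majority-class samples until the classes are balanced.
    Applied to training data only; test data is never down-sampled."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = min(len(v) for v in by_class.values())
    balanced = []
    for y, group in by_class.items():
        for s in rng.sample(group, target):
            balanced.append((s, y))
    return balanced

# 3750 NAL vs 1650 AL, as in the training set described above.
data = down_sample(range(5400), ["NAL"] * 3750 + ["AL"] * 1650)
print(len(data))  # 1650 + 1650 = 3300
```

Fixing the seed makes the removal reproducible, which matters when comparing configurations against each other.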
5. Normalization Techniques
Prof. Shrikanth Narayanan’s team attempted various types of speaker normalization with varying
degrees of success [2]. Two such techniques that had a significant impact on his team’s results
were global speaker normalization and normalizing by the sober class, the latter more so than the
former. In both cases, they used z-score normalization, which has the following formula:
Equation 2 – Z-score

z = (x − μ) / σ

where x is the attribute value, μ the group mean, and σ the group standard deviation.
We decided to try both of the aforementioned normalization techniques and add a new technique
of our own: gender normalization.
Global speaker normalization requires speaker ID as meta-information. When using meta-information such as this, a team must be prepared to detect that information automatically at test time. Although we make no such guarantee for this study, we look to Prof. Narayanan’s team as proof that it can be done if needed. Next, for a given speaker and a given
attribute, compute the average and standard deviation across all of that speaker’s samples. Finally, replace the
value in each sample by the z-score of that value, using the formula shown above. This form of
normalization yielded poor results for us. After its application, we saw a 0.23% relative decrease
in f-measure and a 0.19% relative decrease in average recall. Neither change was significant.
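A minimal sketch of this group-wise z-score normalization (the helper name and toy values are ours; grouping by speaker ID is shown, but the same routine serves for grouping by gender):

```python
import numpy as np

def zscore_by_group(X, group_ids):
    """Replace each value by its z-score computed within the sample's group
    (here: speaker ID). Mirrors Equation 2 applied per group."""
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for g in np.unique(group_ids):
        mask = np.asarray(group_ids) == g
        mu = X[mask].mean(axis=0)
        sigma = X[mask].std(axis=0)
        sigma[sigma == 0] = 1.0  # guard against constant attributes
        out[mask] = (X[mask] - mu) / sigma
    return out

# Two speakers with very different baseline ranges end up on a common scale.
X = [[100.0], [120.0], [200.0], [240.0]]
speakers = ["s1", "s1", "s2", "s2"]
print(zscore_by_group(X, speakers).round(1).tolist())  # [[-1.0], [1.0], [-1.0], [1.0]]
```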
Next, we attempted sober-class normalization. To normalize by a class, for each attribute, one must compute an average and a standard deviation for that class across all samples from all
speakers. Then, as before, replace the value of each sample by the z-score of that value. Like
down-sampling, class normalization cannot be applied to the testing data (which is the
development data in our case). If we were able to automatically detect the class of the current
sample in order to perform this normalization, then we necessarily would have already solved the
problem and subsequent normalization would be irrelevant. Instead, we used the same computed
mean and standard deviation from the training set to normalize the development set. Our results
from this normalization technique were also unfavorable. We saw a 0.24% decrease in f-measure
and a 0.16% decrease in average recall. Again, the changes were not significant.
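The key constraint above, fitting the statistics on the training set's sober class and reusing them on the development set, can be sketched as follows (the helper name and toy values are ours):

```python
import numpy as np

def fit_sober_stats(X_train, y_train):
    """Mean and std of each attribute over the NAL (sober) training samples."""
    sober = np.asarray(X_train, dtype=float)[np.asarray(y_train) == "NAL"]
    mu, sigma = sober.mean(axis=0), sober.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant attributes
    return mu, sigma

# Statistics come from training data only; the development set is normalized
# with the same numbers, since test-time class labels are of course unknown.
X_train = [[1.0], [3.0], [10.0]]
y_train = ["NAL", "NAL", "AL"]
mu, sigma = fit_sober_stats(X_train, y_train)
X_dev = np.array([[2.0], [10.0]])
print(((X_dev - mu) / sigma).tolist())  # -> [[0.0], [8.0]]
```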
Finally, we attempted gender normalization. The rationale behind gender normalization is
that women tend to have a higher overall pitch as compared to men, which manifests itself in F0.
To accomplish gender normalization, we iterated over each attribute dependent upon F0 and
computed an average and standard deviation by gender. Then, we replaced each sample’s value
by the z-score of that value. Like speaker ID, gender is a type of meta-information that can be extracted on the fly at test time. Though our classifier does not do this either, gender detection has been accomplished in the past with good results, so we take for granted that this capability could be trivially added. (Note that normalization is only applied to attributes that derive from F0, not to all attributes, because F0 is an acoustic feature whose range varies greatly by speaker; the same cannot be said for all attribute types.) Gender normalization was the one
normalization technique that did show positive results. After applying gender normalization, we
saw a 0.88% relative increase in average recall as compared to the down-sampled results (1.7%
overall) and a 1.22% relative increase in f-measure (10.69% overall). Neither improvement was
significant over the prior result. The overall improvements are as before: not significant for
average recall and significant for α=0.001 for f-measure.
Before wrapping up normalization, we also tried combining the approaches. When
combining global speaker ID normalization with gender normalization, we saw a decrease in
both f-measure and average recall relative to the results from using gender alone. When
combining gender normalization with sober-class normalization, however, our metrics improved
to 1.78% overall relative increase in average recall and 10.75% relative increase in f-measure.
Neither of these changes was significant, but we chose to go with the combination of gender and
sober-class normalization in our final configuration due to the nominal improvement.
6. Removing Fringe Samples
Although it is not a proven machine learning technique, many researchers, such as Jurafsky [5],
have empirically found that removing fringe cases from the dataset can produce a better overall
classifier. The idea is that, due to error rates in labeling, training a classifier on ambiguous fringe cases can cause it to learn false conclusions. Now, a breathalyzer is about as good a labeler as one could hope for, but as Batliner pointed out in [3], the vast majority of the error in his
classification came from samples near the margin (BAC between 0.08% and 0.16%). To some
extent this can be said of any classification task, but we felt it was worth exploring nonetheless.
We should point out that the BAC threshold that Prof. Batliner trained for was 0.08% while in
our case it is 0.05%. Adjusting for this difference, we decided to try removing all samples with a
BAC between 0.02% and 0.08% (within .03 of the threshold), which amounted to about 21% of
our training data. The results were interesting. Average recall increased to 72.81%, which is a
2.5% relative increase over the baseline configuration and significantly greater for α=0.05. Its
increase over the prior configuration, down-sampled and gender-normalized, was not significant.
F-measure, on the other hand, decreased to 0.637, a relative 7.1% improvement over the baseline
and a 3.25% relative decrease as compared to the prior configuration. This decrease is significant
for α=0.05. Because removing fringe samples caused a significant decrease in f-measure with no
significant change in average recall and because we are more interested in optimizing the former
than the latter, we decided not to use the technique in our final configuration.
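The fringe-removal step reduces to a simple filter over the training set (the helper name and toy BAC values are ours; the band matches the 0.02% to 0.08% range described above):

```python
def remove_fringe(samples, bacs, low=0.02, high=0.08):
    """Drop training samples whose BAC falls in the ambiguous band around the
    0.05% threshold. Like down-sampling, this applies to training data only."""
    return [s for s, bac in zip(samples, bacs) if not (low <= bac <= high)]

bacs = [0.00, 0.03, 0.05, 0.07, 0.12]
kept = remove_fringe(["a", "b", "c", "d", "e"], bacs)
print(kept)  # -> ['a', 'e']
```

Note that this requires the BAC labels themselves, which are available for the training corpus but naturally never at test time.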
7. Machine Learning Optimizations
To this point, all of the changes we have made have been to the data set itself: either normalizing
samples or removing them. Next, we sought to optimize some of the machine learning aspects of
classification, such as the algorithm to use and the parameters of that chosen algorithm. Recall
that the baseline configuration used 5-fold cross-validation on a support vector machine with a
linear kernel. Our first experiment was to compare the performance of this baseline to three other
popular classification algorithms: naïve Bayes, perceptron, and decision trees. The results of
these trials are shown below.
Table 1 – Classification Algorithm Comparison
algorithm SVM (baseline) Bayes Perceptron Decision Tree
average recall 0.7227 0.5837 0.5908 0.6282
relative change 0 -0.1923 -0.1825 -0.1308
f-measure 0.6586 0.4917 0.4954 0.5455
relative change 0 -0.2534 -0.2478 -0.1717
Each tested algorithm did significantly worse than the baseline support vector machine.
We decided not to try a hidden Markov model because HMMs are typically best suited to feature sets that include history, and therefore state transitions, of which we have none.
results, we stuck with our original algorithm, the SVM.
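The four-way comparison in Table 1 can be sketched with scikit-learn counterparts of the Weka algorithms (our assumption; the synthetic data and resulting scores are illustrative only, not the Table 1 numbers):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic two-class data standing in for the corpus features.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=150) > 0).astype(int)

# Hypothetical scikit-learn counterparts of the four algorithms compared.
models = {
    "SVM": SVC(kernel="linear"),
    "Bayes": GaussianNB(),
    "Perceptron": Perceptron(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    print(f"{name}: {score:.3f}")
```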
Next, we sought to optimize a few parameters of the SVM. The first such parameter was the kernel. In machine learning algorithms, the kernel function maps samples from the n-dimensional feature space into a k-dimensional kernel space. As with polynomial regression, the greater the dimensionality of the kernel, the better the fit the algorithm will achieve, but the worse it will generalize. To guard against over-fitting, we abandoned cross-validation for this portion
of the experiment. Although in theory cross-validation results should decline once the kernel
exponent is large enough to cause over-fitting, holding out the development set entirely as our
test set further reduces these odds. Moreover, the magnitude of the metrics was inconsequential
here as we merely intended to compare the alternatives. Table 2 shows a comparison of many
different kernels.
Table 2 – Kernel Comparison
experiment n=1 n=2 n=3 n=4 n=5 n=6 n=7 RBF norm n=2
average recall 0.6876 0.7111 0.7319 0.7282 0.731 0.7361 0.7363 0.7172 0.6319
relative change 0 0.0341 0.0644 0.059 0.0631 0.0705 0.0708 0.043 -0.0811
f-measure 0.6161 0.6435 0.6742 0.6637 0.6684 0.6779 0.6727 0.6337 0.4834
relative change 0 0.0445 0.0944 0.0773 0.0849 0.1004 0.0919 0.0287 -0.2153
The first 7 columns of Table 2 show the results for polynomial kernels with the specified
exponent. The eighth column contains results for a radial basis function kernel and the final
column shows the results for a normalized polynomial kernel with an exponent of 2. All changes
are reported relative to the first column, which is the baseline linear kernel. The normalized
quadratic kernel was the only configuration to show an overall decrease in either average recall
or f-measure, so this kernel was eliminated immediately. The cubic kernel (n=3) was particularly impressive, showing a 6.44% relative increase in average recall and a 9.44% relative increase in f-measure over the linear kernel (both significant for α=0.001). As n increased beyond 3, we began to see the leveling-off in performance that is indicative of over-fitting. Interestingly, however, for
n=6 we see another increase in performance. Relative to n=3, n=6 shows a 0.57% increase in
average recall and a 0.55% increase in f-measure, though neither change is significant. Despite
the greater nominal value, we preferred n=3 over n=6 in our final configuration because the
difference between the two was not significant and, in general, a larger exponent increases the
chances of over-fitting. The last kernel to discuss is that of the radial basis function (column 8).
The RBF kernel performed well, though below some of the better polynomial alternatives.
Theoretically, the radial basis function projects the feature vector into an infinite-dimensional space. It works well for several high-dimensional applications, but by its nature has
a tendency to over-fit. We were not disappointed to see it outdone by polynomial kernels herein.
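The kernel sweep of Table 2 can be sketched with scikit-learn on synthetic data (our assumption; a held-out split stands in for the development set, and the accuracies printed bear no relation to the Table 2 numbers):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data with a non-linear class boundary, so higher-degree
# kernels have something to gain over the linear one.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = (np.sum(X[:, :2] ** 2, axis=1) > 2).astype(int)
X_train, y_train = X[:150], y[:150]
X_dev, y_dev = X[150:], y[150:]

# Held-out development accuracy for polynomial kernels of increasing
# degree, plus an RBF kernel, mirroring the sweep in Table 2.
for kernel, degree in [("poly", 1), ("poly", 3), ("poly", 6), ("rbf", None)]:
    clf = SVC(kernel=kernel, degree=degree or 3, gamma="scale")
    acc = clf.fit(X_train, y_train).score(X_dev, y_dev)
    print(kernel, degree, round(acc, 3))
```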
The final two machine learning parameters we sought to optimize were the number of
folds for cross-validation and the number of iterations to run the algorithm. For our baseline, we
used 5-fold cross-validation. We decided to try 10-fold cross-validation for comparison simply
because it is another common value. In terms of iterations, our baseline used 1. As a form of
binary search (multiple-iteration experiments take a long time so we did not want to try every
combination), we decided to start by increasing the iterations to 10. The reasoning for 10 is that it is the largest number we were willing to try in conjunction with a support vector machine (this
experiment took about 30 hours to complete). If 10 iterations showed a significant increase in
performance, then we would try 5, followed by either 8 or 3 depending upon how 5 did, and so
on until we narrowed in on where the increase was most significant. If 10 iterations did not show
better performance, however, then we could give up because there was no improvement to be
had from tweaking this parameter. The results of these trials are shown in Table 3, below.
Table 3 – Folds and Iterations
experiment baseline 10 folds 10 iterations
average recall 0.7227 0.7304 0.7201
relative change 0 0.0107 -0.0036
f-measure 0.6586 0.6692 0.6568
relative change 0 0.0161 -0.0027
Once again, the change values are given relative to the baseline configuration in the first
column. Increasing the number of iterations did not improve the overall performance of the classifier; in fact, more iterations led to slightly worse results, though this change was not significant.
Increasing the number of folds, however, did show favorable results. The 10-fold configuration
showed a 1.07% increase in average recall and a 1.61% relative increase in f-measure over the
baseline. Even though neither change was significant, we decided to use 10 folds in our final
configuration purely in response to the nominal increase.
8. Results
With knowledge gained from all of the aforementioned configurations, we ran one final
experiment that had: an SVM with a cubic kernel, 10-fold cross validation, 1 iteration, gender
normalization, and sober-class normalization. This configuration yielded 75.91% average recall
(a 6.87% relative increase) and an f-measure for the AL class of 0.7061 (an 18.67% relative
increase). Both increases are significant at the 99.9% level of confidence. We believe our f-measure results would rank near the best among the participants from INTERSPEECH 2011, but unfortunately f-measure was not the challenge metric; average recall was. To compare our average
recall results with the best-performing team from the INTERSPEECH challenge, we present
Table 4.
Table 4 – Result Comparison
experiment         INTERSPEECH Baseline   Narayanan   Our Baseline   Our Result
average recall     65.9                   70.54       71.03          75.91
relative increase  -                      7.041       -              6.8703
It is very hard to compare these two classifiers, but the above is our best effort. Prof. Narayanan’s numbers are relative to a test set, which we did not have. Also, most teams reported
that the test set was very different from both the training and development sets, which caused
their performance to decrease. Furthermore, Prof. Narayanan’s results are test results whereas we
are reporting cross-validation results. On our end, due to the cross validation, we had a much
higher baseline. All that being said, the results are similar. We both performed about 7% above
our respective baselines. Although we did not significantly outperform the best team as we had hoped, the comparison is ultimately one of apples to oranges.
9. Conclusions and Extensions
The goal of this study was to learn from what teams had done in the past and combine it with our
own ideas to create a classifier for the 2011 INTERSPEECH speaker state challenge that could
outperform any of the submissions. Although our results stack up very well, it is interesting to
note that most of our success did not come from adding new features or replicating the features
that other teams had used, but from manipulations of the data sets and from optimizing the problem from a machine learning standpoint. While various forms of normalization, most notably gender,
did provide some boost in performance, none of these improvements was significant. Also, the
removal of fringe samples from the dataset, suggested by [3], actually had a negative impact
upon our results.
On the other hand, the ideas we borrowed from prior research were not all bad. Down-sampling provided a significant increase in f-measure and a nominal increase in average recall, as reported by [4]. Our best
results, by far, came from optimizing the kernel to use and the number of folds to cross-validate.
These changes nearly doubled our relative increase in f-measure (from 10.7% to 18.7%) and
caused almost a 4-fold relative improvement in average recall (from 1.8% to 6.9% over the
baseline). Although it is difficult to compare the results of this study to those of the
INTERSPEECH challengers, looking at relative increase in average recall alone, our
performance was on par with the best teams, albeit using a very different technique.
One extension that we propose is the incorporation into our model of the GMM supervectors that Prof. Narayanan’s team explored. This feature set was one of the primary reasons their
model performed so well in testing and we believe that, if combined with our techniques, it could
produce the best classifier yet. Another interesting extension is the collection of a better data set.
There were several problems with the corpus provided for this challenge. Many of the challenge
teams mentioned that the test set varied greatly from both the training and development data,
causing their classifiers to perform significantly worse in testing than they had in training.
Additionally, each speaker appeared in the corpus about 75 times. This type of repetition causes
extreme over-fitting on the training data. Two speech samples broken into 75 parts do not make
75 samples; they make two data points with 36-fold over-sampling. Finally, it is not necessary to
force the same people to record intoxicated and sober speech samples. For most applications of
this research, such as field sobriety tests, one would not have a sober speech sample of the
subject with which to compare. We believe the ideal corpus for this study would contain a large
number of sober samples and a large number of intoxicated samples, each from a different
subject.
10. References
[1] B. Schuller, S. Steidl, A. Batliner, F. Schiel and J. Krajewski, "The INTERSPEECH 2011
Speaker State Challenge," in Proc. INTERSPEECH, Florence, Italy, 2011.
[2] D. Bone, M. P. Black, M. Li, A. Metallinou, S. Lee and S. S. Narayanan, "Intoxicated Speech
Detection by Fusion of Speaker Normalized Hierarchical Features and GMM Supervectors,"
in Proc. INTERSPEECH, Florence, Italy, 2011.
[3] M. Levit, R. Huber, A. Batliner and E. Noeth, "Use of prosodic speech characteristics for
automated detection of alcohol intoxication," in ITRW on Prosody in Speech Recognition and
Understanding, Red Bank, NJ, 2001.
[4] F. Biadsy, W. Y. Wang, A. Rosenberg and J. Hirschberg, "Intoxication Detection using
Phonetic, Phonotactic and Prosodic Cues," in Proc. INTERSPEECH, Florence, Italy, 2011.
[5] D. Jurafsky, R. Ranganath and D. McFarland, "Extracting Social Meaning: Identifying
Interactional Style in Spoken Conversation," in Proc. NAACL Human Language
Technologies, Stroudsburg, PA, 2009.