A Refined Approach to Intoxicated Speaker Classification
Daniel Wilkey ([email protected]) John Graham ([email protected])
1. Project Goal
The INTERSPEECH 2011 Speaker State Challenge [1], in its “Alcohol Language Corpus” sub-challenge, called on the SLP research community to train a classifier that determines whether or not a speaker is intoxicated, defined as a blood alcohol concentration (BAC) greater than 0.05%. For the purposes of this study, INTERSPEECH refers to speakers considered intoxicated by this definition as “alcoholised” (AL) and to all other speakers as “non-alcoholised” (NAL). Many research teams attempted this challenge, and most met with limited success. Using INTERSPEECH’s unconventional metric of un-weighted average recall, the best-performing team [2] achieved just a 7.04% relative improvement over the baseline of 65.9%. The goal of this project is to reuse some of the techniques of the best teams in this challenge, in combination with a few new ones of our own, to achieve better overall performance.
2. Research Hypothesis
We hypothesize that a combination of the following techniques will produce better relative performance, as compared to the baseline, than that of any of the teams that attempted the challenge. The techniques we will explore include: the 4368 standard acoustic features extracted using openSMILE that were provided with the challenge, gain-ratio attribute selection over that feature set, down-sampling of the training data, global speaker normalization, gender normalization, sober-class normalization, removal of samples at the margin, and optimization of various machine learning facets of classification (the algorithm, the kernel, the number of folds for cross-validation, and the number of iterations).
3. Background
The corpus used in this challenge is an extension of one originally collected for a German study on detecting intoxication from speech, whose speakers included impaired German police officers [3]. The INTERSPEECH corpus consists of 162 speakers (84 male and 78 female) from 5 different areas in Germany. Each subject chose a target BAC between 0.28 and 1.75 per mille, and then drank until a breathalyzer confirmed they had reached their target [1], at which point the subject was asked to speak for 15 minutes. Two weeks later, all subjects were called back in to record 30 minutes of sober speech. One major limitation of the corpus is that each recording was later divided into roughly 30-second chunks. Therefore, despite the samples being large in number (5400 for the training set), there are about 75 samples for every subject. Additionally, because there are twice as many sober samples as intoxicated ones, the baseline configuration is heavily biased toward guessing sober. As pointed out by [4], down-sampling is a good way to correct for this.
For the teams that actually competed in this challenge, there was withheld test data in
addition to the training and development sets we were supplied for this study. Teams were
encouraged to train preliminary classifiers and submit them to be tested up to 5 times for
refinement. Because the competition is over, we must find another way of measuring our classifier: we will treat the development corpus as our test set. If we were to keep the
two data sets entirely distinct, then the teams to whom we are comparing ourselves would
effectively have about twice as much training data. To even the odds, we will report cross-
validation results instead of test results. This allows our classifier to learn some information from
the development set, while still affording a means of measurement. We concede that this method
increases the likelihood our classifier will over-fit to the data, but we feel it is a fair compromise.
One group from which we draw inspiration for this study is that of Prof. Hirschberg at
Columbia in [4]. Her team noted that down-sampling the provided training set caused their
classifier to be less likely to choose the majority class, NAL, and thus generalize better.
Especially considering that this data set is so heavily skewed toward the NAL class (around 70/30), this is something we would definitely like to explore.
We also take note of the success of Prof. Narayanan’s lab at USC [2]. They showed that
various forms of global speaker normalization and iterative normalization significantly improved
classification performance. Specifically, we wish to try global speaker-ID normalization (which
they claim gave fair results), and sober-class normalization (which they claim gave very good
results).
Finally, Prof. Batliner, who worked on the original corpus we referenced in [3], noted
that most of his classification error came from a small subset of the data, namely the range from
0.8 to 1.6 per mille. The major difference between the corpus Batliner used and the one we are
using is the intoxication threshold. Batliner’s team was working off of 0.8 per mille, while in this
study the cutoff is 0.5 per mille. We would like to try removing some of the error-prone fringe samples that he describes, but we will have to adjust the range accordingly.
4. Baseline and Pre-Processing
For the competitors of the INTERSPEECH challenge, a baseline was provided off which to work. Using Weka, the organizers trained a support vector machine (SVM) with a linear kernel on the training and development data, and then tested it against the test set. This configuration produced a baseline un-weighted average recall of 65.9% [1]. Because we do not have access to their testing data, we needed to create our own baseline. For our baseline, we used the same
settings (SVM with a linear kernel), but trained using only the training data and tested using 5-
fold cross validation with the development set. With this method we obtained a baseline f-
measure for the AL class of 0.595 (the statistic that we will prefer) and an un-weighted average
recall of 71.0%. Note that this is significantly greater than the INTERSPEECH baseline.
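For illustration, our baseline configuration can be sketched in Python, with scikit-learn standing in for Weka (an assumption of ours; the original experiments used Weka). The random feature matrix below is a toy stand-in, not the actual openSMILE features:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Toy stand-in for the openSMILE feature matrix; the real corpus has
# 4368 features and thousands of samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)  # 1 = AL, 0 = NAL

# Linear-kernel SVM evaluated with 5-fold cross-validation,
# mirroring the baseline configuration described above.
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print(round(scores.mean(), 3))
```

On the real features, the same setup yields the 0.595 f-measure and 71.0% average recall reported above; the toy numbers here are meaningless beyond showing the mechanics.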
The first reason we prefer f-measure to un-weighted average recall is that it is insulated from the two major bias types. If the classifier were to always guess AL, then the large
number of false positives would cause precision to be very low. Alternatively, if the classifier
were to guess NAL every time, then recall would be 0 in the absence of true positives. Because
the F1-measure equally weights these two statistics, it receives protection from both types of bias.
Secondly, f-measure more accurately represents an answer to the question we are posing: ‘is the
speaker intoxicated?’ Average recall, by contrast, takes the answers to two questions (‘can you
recognize someone who is drunk?’ and ‘can you recognize someone who is not drunk?’) and
averages them. Distinguishing the sober people from the intoxicated does not have many
applications, while detecting the intoxicated amongst the sober is useful in areas such as field
sobriety tests and cars that will not start if the driver is impaired. For these reasons, we chose to
optimize for the more widely-accepted f-measure statistic in this study. That being said, we will
still report un-weighted average recall for the purpose of comparison. The formulas for both
metrics are shown below (given relative to the AL class).
Equation 1 – F1-Measure and Un-Weighted Average Recall

F1(AL) = 2 · precision(AL) · recall(AL) / (precision(AL) + recall(AL))
UAR = (recall(AL) + recall(NAL)) / 2
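Both metrics can be computed directly from confusion-matrix counts. A minimal sketch (the helper name and the illustrative counts are ours; the per-class recalls of 58% and 84% echo the baseline breakdown reported in this study):

```python
def f1_and_uar(tp, fp, fn, tn):
    """F1 for the AL class and un-weighted average recall over both classes."""
    precision = tp / (tp + fp)
    recall_al = tp / (tp + fn)    # recall on the AL (intoxicated) class
    recall_nal = tn / (tn + fp)   # recall on the NAL (sober) class
    f1 = 2 * precision * recall_al / (precision + recall_al)
    uar = (recall_al + recall_nal) / 2
    return f1, uar

# Example: 58% AL recall and 84% NAL recall, as in the baseline breakdown.
f1, uar = f1_and_uar(tp=58, fp=16, fn=42, tn=84)
print(round(uar, 2))  # average of 0.58 and 0.84 -> 0.71
```

Note how a classifier that always guesses NAL drives the AL recall, and hence the F1, to zero, while UAR would still credit it with 50%.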
When generating the baseline, we noticed that a single simulation took approximately 12
hours. This was due in large part to the fact that the initial data set contained thousands of
features. To cut down on this time and remove features that were not providing additional
information (yet still contributing to the curse of dimensionality), we ran attribute selection using Weka’s GainRatioAttributeEval algorithm. This algorithm estimates the ratio of the classification information gained from an attribute to that attribute’s own intrinsic information, and ranks attributes by this gain ratio. Using this technique, we reduced the initial feature set from 4368 attributes to just a few hundred. Furthermore, our two metrics were exactly the same after this change, which strongly suggests that no important features were filtered out.
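The gain-ratio criterion behind GainRatioAttributeEval can be sketched for discrete attribute values (a hedged reconstruction of ours, not Weka's code; Weka additionally discretizes numeric attributes first):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information gain of the (discretized) feature, divided by the
    feature's own split entropy: the quantity used to rank attributes."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0

# A feature that perfectly separates the classes gets the maximum ratio.
print(gain_ratio(["lo", "lo", "hi", "hi"], ["NAL", "NAL", "AL", "AL"]))  # -> 1.0
```

Dividing by the split entropy penalizes attributes with many distinct values, which plain information gain would otherwise favor.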
The final preprocessing step we took was to down-sample the training data as was done
in [4]. When the classes are as unbalanced as they were in this corpus (70/30), there is a tendency
for classifiers trained from them to be biased in favor of the larger group. In the case of this
corpus, even though the initial average recall was 71%, the breakdown was 84% recall for the
NAL class and 58% recall for the AL class. This implies that the classifier was
disproportionately guessing NAL over AL due to its class size. The initial split in the training set
was 3750 NAL samples to 1650 AL samples. To even out the distribution, we randomly
removed (3750-1650) = 2100 samples from the NAL data set, leaving us an exact 50/50 split.
Down-sampling is a viable manipulation in general, but that was especially true of this corpus
because, as mentioned previously, there are about 75 samples per person and thus many near-duplicate samples. Down-sampling is an example of a transformation that can only be applied to training data; in testing, we cannot remove samples (that would amount to refusing to classify them). The technique forces the classifier being trained to focus more upon the features and less upon the relative class sizes, and once training is complete, its influence is fixed. The impact down-sampling had on our results was favorable: a 9.3% relative improvement in f-measure
(significant for α=0.001) and a 0.85% relative increase in average recall (not significant).
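The down-sampling step can be sketched as follows (the helper name is ours; the class counts match the training split described above):

```python
import random

def down_sample(samples, labels, seed=0):
    """Randomly discard majority-class samples until the classes are balanced.
    Applied to training data only; test data is never down-sampled."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = min(len(v) for v in by_class.values())
    balanced = []
    for y, group in by_class.items():
        for s in rng.sample(group, target):
            balanced.append((s, y))
    return balanced

# 3750 NAL vs 1650 AL, as in the training set described above.
data = down_sample(range(5400), ["NAL"] * 3750 + ["AL"] * 1650)
print(len(data))  # 1650 + 1650 = 3300
```

Fixing the seed makes the removal reproducible, which matters when comparing configurations against each other.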
5. Normalization Techniques
Prof. Shrikanth Narayanan’s team attempted various types of speaker normalization with varying
degrees of success [2]. Two such techniques that had a significant impact on his team’s results
were global speaker normalization and normalizing by the sober class, the latter more so than the
former. In both cases, they used z-score normalization, which has the following formula:
Equation 2 – Z-score

z = (x − μ) / σ

where x is the attribute value, μ the group mean, and σ the group standard deviation.
We decided to try both of the aforementioned normalization techniques and add a new technique
of our own: gender normalization.
Global speaker normalization requires speaker ID as meta-information. When using meta-information such as this, a team must be prepared to detect that information automatically at test time. Although we make no such guarantee for this study, we look to Prof. Narayanan’s team as proof that it can be done if needed. Next, for a given speaker and a given
attribute, compute the average and standard deviation across all of that speaker’s samples. Finally, replace the
value in each sample by the z-score of that value, using the formula shown above. This form of
normalization yielded poor results for us. After its application, we saw a 0.23% relative decrease
in f-measure and a 0.19% relative decrease in average recall. Neither change was significant.
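A minimal sketch of this group-wise z-score normalization (the helper name and toy values are ours; grouping by speaker ID is shown, but the same routine serves for grouping by gender):

```python
import numpy as np

def zscore_by_group(X, group_ids):
    """Replace each value by its z-score computed within the sample's group
    (here: speaker ID). Mirrors Equation 2 applied per group."""
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for g in np.unique(group_ids):
        mask = np.asarray(group_ids) == g
        mu = X[mask].mean(axis=0)
        sigma = X[mask].std(axis=0)
        sigma[sigma == 0] = 1.0  # guard against constant attributes
        out[mask] = (X[mask] - mu) / sigma
    return out

# Two speakers with very different baseline ranges end up on a common scale.
X = [[100.0], [120.0], [200.0], [240.0]]
speakers = ["s1", "s1", "s2", "s2"]
print(zscore_by_group(X, speakers).round(1).tolist())  # [[-1.0], [1.0], [-1.0], [1.0]]
```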
Next, we attempted sober-class normalization. To normalize by a class, for each attribute, one must compute an average and a standard deviation for that class across all samples from all
speakers. Then, as before, replace the value of each sample by the z-score of that value. Like
down-sampling, class normalization cannot be applied to the testing data (which is the
development data in our case). If we were able to automatically detect the class of the current
sample in order to perform this normalization, then we necessarily would have already solved the
problem and subsequent normalization would be irrelevant. Instead, we used the same computed
mean and standard deviation from the training set to normalize the development set. Our results
from this normalization technique were also unfavorable. We saw a 0.24% decrease in f-measure
and a 0.16% decrease in average recall. Again, the changes were not significant.
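The key constraint above, fitting the statistics on the training set's sober class and reusing them on the development set, can be sketched as follows (the helper name and toy values are ours):

```python
import numpy as np

def fit_sober_stats(X_train, y_train):
    """Mean and std of each attribute over the NAL (sober) training samples."""
    sober = np.asarray(X_train, dtype=float)[np.asarray(y_train) == "NAL"]
    mu, sigma = sober.mean(axis=0), sober.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant attributes
    return mu, sigma

# Statistics come from training data only; the development set is normalized
# with the same numbers, since test-time class labels are of course unknown.
X_train = [[1.0], [3.0], [10.0]]
y_train = ["NAL", "NAL", "AL"]
mu, sigma = fit_sober_stats(X_train, y_train)
X_dev = np.array([[2.0], [10.0]])
print(((X_dev - mu) / sigma).tolist())  # -> [[0.0], [8.0]]
```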
Finally, we attempted gender normalization. The rationale behind gender normalization is
that women tend to have a higher overall pitch as compared to men, which manifests itself in F0.
To accomplish gender normalization, we iterated over each attribute dependent upon F0 and
computed an average and standard deviation by gender. Then, we replaced each sample’s value
by the z-score of that value. Like speaker ID, gender is a type of meta-information that can be extracted on the fly at test time. Though our classifier does not do this either, gender detection has been accomplished in the past with good results, so we take for granted that this capability could be trivially added. (Note that normalization is only applied to attributes that derive from F0, not to all attributes, because F0 is an acoustic feature whose range varies greatly by speaker; the same cannot be said for all attribute types.) Gender normalization was the one
normalization technique that did show positive results. After applying gender normalization, we
saw a 0.88% relative increase in average recall as compared to the down-sampled results (1.7%
overall) and a 1.22% relative increase in f-measure (10.69% overall). Neither improvement was
significant over the prior result. The overall improvements are as before: not significant for
average recall and significant for α=0.001 for f-measure.
Before wrapping up normalization, we also tried combining the approaches. When
combining global speaker ID normalization with gender normalization, we saw a decrease in
both f-measure and average recall relative to the results from using gender alone. When
combining gender normalization with sober-class normalization, however, our metrics improved
to 1.78% overall relative increase in average recall and 10.75% relative increase in f-measure.
Neither of these changes was significant, but we chose to go with the combination of gender and
sober-class normalization in our final configuration due to the nominal improvement.
6. Removing Fringe Samples
Although it is not a proven machine learning technique, many researchers, such as Jurafsky [5],
have empirically found that removing fringe cases from the dataset can produce a better overall
classifier. The idea is that, due to error rates in labeling, training a classifier on ambiguous fringe cases can cause it to learn false conclusions. Now, a breathalyzer is about as good a labeler as one could hope for, but as Batliner pointed out in [3], the vast majority of the error in his
classification came from samples near the margin (BAC between 0.08% and 0.16%). To some
extent this can be said of any classification task, but we felt it was worth exploring nonetheless.
We should point out that the BAC threshold that Prof. Batliner trained for was 0.08% while in
our case it is 0.05%. Adjusting for this difference, we decided to try removing all samples with a
BAC between 0.02% and 0.08% (within .03 of the threshold), which amounted to about 21% of
our training data. The results were interesting. Average recall increased to 72.81%, which is a
2.5% relative increase over the baseline configuration and significantly greater for α=0.05. Its
increase over the prior configuration, down-sampled and gender-normalized, was not significant.
F-measure, on the other hand, decreased to 0.637, a relative 7.1% improvement over the baseline
and a 3.25% relative decrease as compared to the prior configuration. This decrease is significant
for α=0.05. Because removing fringe samples caused a significant decrease in f-measure with no
significant change in average recall and because we are more interested in optimizing the former
than the latter, we decided not to use the technique in our final configuration.
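The fringe-removal step reduces to a simple filter over the training set (the helper name and toy BAC values are ours; the band matches the 0.02% to 0.08% range described above):

```python
def remove_fringe(samples, bacs, low=0.02, high=0.08):
    """Drop training samples whose BAC falls in the ambiguous band around the
    0.05% threshold. Like down-sampling, this applies to training data only."""
    return [s for s, bac in zip(samples, bacs) if not (low <= bac <= high)]

bacs = [0.00, 0.03, 0.05, 0.07, 0.12]
kept = remove_fringe(["a", "b", "c", "d", "e"], bacs)
print(kept)  # -> ['a', 'e']
```

Note that this requires the BAC labels themselves, which are available for the training corpus but naturally never at test time.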
7. Machine Learning Optimizations
To this point, all of the changes we have made have been to the data set itself: either normalizing
samples or removing them. Next, we sought to optimize some of the machine learning aspects of
classification, such as the algorithm to use and the parameters of that chosen algorithm. Recall
that the baseline configuration used 5-fold cross-validation on a support vector machine with a
linear kernel. Our first experiment was to compare the performance of this baseline to three other
popular classification algorithms: naïve Bayes, perceptron, and decision trees. The results of
these trials are shown below.
Table 1 – Classification Algorithm Comparison
algorithm SVM (baseline) Bayes Perceptron Decision Tree
average recall 0.7227 0.5837 0.5908 0.6282
relative change 0 -0.1923 -0.1825 -0.1308
f-measure 0.6586 0.4917 0.4954 0.5455
relative change 0 -0.2534 -0.2478 -0.1717
Each tested algorithm did significantly worse than the baseline support vector machine.
We decided not to try a hidden Markov model because HMMs are typically best suited to feature sets that include history, and therefore state transitions, of which we have none.
results, we stuck with our original algorithm, the SVM.
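The four-way comparison in Table 1 can be sketched with scikit-learn counterparts of the Weka algorithms (our assumption; the synthetic data and resulting scores are illustrative only, not the Table 1 numbers):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic two-class data standing in for the corpus features.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=150) > 0).astype(int)

# Hypothetical scikit-learn counterparts of the four algorithms compared.
models = {
    "SVM": SVC(kernel="linear"),
    "Bayes": GaussianNB(),
    "Perceptron": Perceptron(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    print(f"{name}: {score:.3f}")
```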
Next, we sought to optimize a few parameters of the SVM. The first such parameter was the kernel. In machine learning algorithms, the kernel function maps samples from the n-dimensional feature space into a k-dimensional kernel space. As with polynomial regression, the greater the dimensionality of the kernel, the better the fit the algorithm will achieve, but the worse it will generalize. To guard against over-fitting, we abandoned cross-validation for this portion
of the experiment. Although in theory cross-validation results should decline once the kernel
exponent is large enough to cause over-fitting, holding out the development set entirely as our
test set further reduces these odds. Moreover, the magnitude of the metrics was inconsequential
here as we merely intended to compare the alternatives. Table 2 shows a comparison of many
different kernels.
Table 2 – Kernel Comparison
experiment n=1 n=2 n=3 n=4 n=5 n=6 n=7 RBF norm n=2
average recall 0.6876 0.7111 0.7319 0.7282 0.731 0.7361 0.7363 0.7172 0.6319
relative change 0 0.0341 0.0644 0.059 0.0631 0.0705 0.0708 0.043 -0.0811
f-measure 0.6161 0.6435 0.6742 0.6637 0.6684 0.6779 0.6727 0.6337 0.4834
relative change 0 0.0445 0.0944 0.0773 0.0849 0.1004 0.0919 0.0287 -0.2153
The first 7 columns of Table 2 show the results for polynomial kernels with the specified
exponent. The eighth column contains results for a radial basis function kernel and the final
column shows the results for a normalized polynomial kernel with an exponent of 2. All changes
are reported relative to the first column, which is the baseline linear kernel. The normalized
quadratic kernel was the only configuration to show an overall decrease in either average recall
or f-measure, so this kernel was eliminated immediately. The cubic kernel (n=3) was particularly impressive, showing a 6.44% relative increase in average recall and a 9.44% relative increase in f-measure over the linear kernel (both significant for α=0.001). As n increased beyond 3, we began to see the leveling-off in performance that is indicative of over-fitting. Interestingly, however, for
n=6 we see another increase in performance. Relative to n=3, n=6 shows a 0.57% increase in
average recall and a 0.55% increase in f-measure, though neither change is significant. Despite
the greater nominal value, we preferred n=3 over n=6 in our final configuration because the
difference between the two was not significant and, in general, a larger exponent increases the
chances of over-fitting. The last kernel to discuss is that of the radial basis function (column 8).
The RBF kernel performed well, though below some of the better polynomial alternatives.
Theoretically, the radial basis function projects the feature vector into an infinite-dimensional space. It works well for several high-dimensional applications, but by its nature has
a tendency to over-fit. We were not disappointed to see it outdone by polynomial kernels herein.
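The kernel sweep of Table 2 can be sketched with scikit-learn on synthetic data (our assumption; a held-out split stands in for the development set, and the accuracies printed bear no relation to the Table 2 numbers):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data with a non-linear class boundary, so higher-degree
# kernels have something to gain over the linear one.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = (np.sum(X[:, :2] ** 2, axis=1) > 2).astype(int)
X_train, y_train = X[:150], y[:150]
X_dev, y_dev = X[150:], y[150:]

# Held-out development accuracy for polynomial kernels of increasing
# degree, plus an RBF kernel, mirroring the sweep in Table 2.
for kernel, degree in [("poly", 1), ("poly", 3), ("poly", 6), ("rbf", None)]:
    clf = SVC(kernel=kernel, degree=degree or 3, gamma="scale")
    acc = clf.fit(X_train, y_train).score(X_dev, y_dev)
    print(kernel, degree, round(acc, 3))
```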
The final two machine learning parameters we sought to optimize were the number of
folds for cross-validation and the number of iterations to run the algorithm. For our baseline, we
used 5-fold cross-validation. We decided to try 10-fold cross-validation for comparison simply
because it is another common value. In terms of iterations, our baseline used 1. As a form of
binary search (multiple-iteration experiments take a long time so we did not want to try every
combination), we decided to start by increasing the iterations to 10. The reasoning for 10 is that it is the largest number we were willing to try in conjunction with a support vector machine (this
experiment took about 30 hours to complete). If 10 iterations showed a significant increase in
performance, then we would try 5, followed by either 8 or 3 depending upon how 5 did, and so
on until we narrowed in on where the increase was most significant. If 10 iterations did not show
better performance, however, then we could give up because there was no improvement to be
had from tweaking this parameter. The results of these trials are shown in Table 3, below.
Table 3 – Folds and Iterations
experiment baseline 10 folds 10 iterations
average recall 0.7227 0.7304 0.7201
relative change 0 0.0107 -0.0036
f-measure 0.6586 0.6692 0.6568
relative change 0 0.0161 -0.0027
Once again, the change values are given relative to the baseline configuration in the first
column. Increasing the number of iterations did not improve the overall performance of the classifier; in fact, more iterations led to slightly worse results, though this change was not significant.
Increasing the number of folds, however, did show favorable results. The 10-fold configuration
showed a 1.07% increase in average recall and a 1.61% relative increase in f-measure over the
baseline. Even though neither change was significant, we decided to use 10 folds in our final
configuration purely in response to the nominal increase.
8. Results
With knowledge gained from all of the aforementioned configurations, we ran one final
experiment that had: an SVM with a cubic kernel, 10-fold cross validation, 1 iteration, gender
normalization, and sober-class normalization. This configuration yielded 75.91% average recall
(a 6.87% relative increase) and an f-measure for the AL class of 0.7061 (an 18.67% relative
increase). Both increases are significant at the 99.9% level of confidence. We believe our f-measure results would rank near the best among the participants from INTERSPEECH 2011, but unfortunately f-measure was not the challenge metric; average recall was. To compare our average
recall results with the best-performing team from the INTERSPEECH challenge, we present
Table 4.
Table 4 – Result Comparison
experiment         INTERSPEECH Baseline   Narayanan   Our Baseline   Our Result
average recall     65.9                   70.54       71.03          75.91
relative increase  -                      7.041       -              6.8703
It is very hard to compare these two classifiers, but the above is our best effort. Prof. Narayanan’s numbers are relative to a test set, which we did not have. Also, most teams reported
that the test set was very different from both the training and development sets, which caused
their performance to decrease. Furthermore, Prof. Narayanan’s results are test results whereas we
are reporting cross-validation results. On our end, due to the cross validation, we had a much
higher baseline. All that being said, the results are similar. We both performed about 7% above
our respective baselines. Although we did not significantly outperform the best team as we had hoped, the comparison is ultimately one of apples to oranges.
9. Conclusions and Extensions
The goal of this study was to learn from what teams had done in the past and combine it with our
own ideas to create a classifier for the 2011 INTERSPEECH speaker state challenge that could
outperform any of the submissions. Although our results stack up very well, it is interesting to
note that most of our success did not come from adding new features or replicating the features
that other teams had used, but from manipulations of the data sets and from optimizing the problem from a machine learning standpoint. While various forms of normalization, most notably gender,
did provide some boost in performance, none of these improvements was significant. Also, the
removal of fringe samples from the dataset, suggested by [3], actually had a negative impact
upon our results.
On the other hand, the ideas we borrowed from prior research were not all bad. Down-sampling provided a significant increase in f-measure and a nominal increase in average recall, as reported by [4]. Our best
results, by far, came from optimizing the kernel to use and the number of folds to cross-validate.
These changes nearly doubled our relative increase in f-measure (from 10.7% to 18.7%) and
caused almost a 4-fold relative improvement in average recall (from 1.8% to 6.9% over the
baseline). Although it is difficult to compare the results of this study to those of the
INTERSPEECH challengers, looking at relative increase in average recall alone, our
performance was on par with the best teams, albeit using a very different technique.
One extension that we propose is the incorporation into our model of the GMM supervectors that Prof. Narayanan’s team explored. This feature set was one of the primary reasons their
model performed so well in testing and we believe that, if combined with our techniques, it could
produce the best classifier yet. Another interesting extension is the collection of a better data set.
There were several problems with the corpus provided for this challenge. Many of the challenge
teams mentioned that the test set varied greatly from both the training and development data,
causing their classifiers to perform significantly worse in testing than they had in training.
Additionally, each speaker appeared in the corpus about 75 times. This type of repetition causes
extreme over-fitting on the training data. Two speech samples broken into 75 parts do not make
75 samples; they make two data points with 36-fold over-sampling. Finally, it is not necessary to
force the same people to record intoxicated and sober speech samples. For most applications of
this research, such as field sobriety tests, one would not have a sober speech sample of the
subject with which to compare. We believe the ideal corpus for this study would contain a large
number of sober samples and a large number of intoxicated samples, each from a different
subject.
10. References
[1] B. Schuller, S. Steidl, A. Batliner, F. Schiel and J. Krajewski, "The INTERSPEECH 2011
Speaker State Challenge," in Proc. INTERSPEECH, Florence, Italy, 2011.
[2] D. Bone, M. P. Black, M. Li, A. Metallinou, S. Lee and S. S. Narayanan, "Intoxicated Speech
Detection by Fusion of Speaker Normalized Hierarchical Features and GMM Supervectors,"
in Proc. INTERSPEECH, Florence, Italy, 2011.
[3] M. Levit, R. Huber, A. Batliner and E. Noeth, "Use of prosodic speech characteristics for
automated detection of alcohol intoxication," in ITRW on Prosody in Speech Recognition and
Understanding, Red Bank, NJ, 2001.
[4] F. Biadsy, W. Y. Wang, A. Rosenberg and J. Hirschberg, "Intoxication Detection using
Phonetic, Phonotactic and Prosodic Cues," in Proc. INTERSPEECH, Florence, Italy, 2011.
[5] D. Jurafsky, R. Ranganath and D. McFarland, "Extracting Social Meaning: Identifying
Interactional Style in Spoken Conversation," in Proc. NAACL Human Language
Technologies, Stroudsburg, PA, 2009.