Automatic Classification of Married Couples' Behavior using Audio Features
Matthew Black, Athanasios Katsamanis, Chi-Chun Lee, Adam C. Lammert, Brian R. Baucom, Andrew Christensen, Panayiotis G. Georgiou, and Shrikanth Narayanan
September 29, 2010


Speaker Segmentation
- Segment the sessions into meaningful regions
- Exploit the known lexical content (transcriptions with speaker labels)
- Recursive automatic speech-text alignment technique [Moreno 1998]

- Session split into regions: wife/husband/unknown
- Aligned more than 60% of a session's words for 293 of the 574 sessions

Abbreviations: AM = Acoustic Model, LM = Language Model, Dict = Dictionary, MFCC = Mel-Frequency Cepstral Coefficients, ASR = Automatic Speech Recognition, HYP = ASR Hypothesized Transcript

- First, we needed to segment the sessions into meaningful regions.
- The recursive automatic speech-text alignment worked as follows: speech recognition was run across the entire 10-minute session using language models trained on the session's transcript, and the resulting ASR transcript was compared to the input transcript.
- The two transcripts were aligned, and long strings of agreement marked regions where we had high confidence that the text for that portion of the audio was correct. These regions are referred to as anchor regions.
- The process was then iterated between anchor regions.
- After convergence, the session was split into wife/husband/unknown regions.
- Segmented more than 60% of a session's words into wife/husband regions for 293 of the 574 sessions.
- Only these 293 sessions were considered for our initial experiments.
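To make the anchor-region step concrete, here is a minimal Python sketch of how high-confidence anchors could be located by comparing the reference transcript with the ASR hypothesis; the use of difflib and the min_run threshold are illustrative assumptions, not details from the slides.

```python
# Minimal sketch of the anchor-region idea: compare the reference transcript
# with the ASR hypothesis and treat long runs of word-level agreement as
# high-confidence anchors. difflib and the min_run threshold are illustrative
# assumptions, not details taken from the slides.
from difflib import SequenceMatcher

def find_anchor_regions(ref_words, hyp_words, min_run=3):
    """Return (start, end) word spans in ref_words where ref and hyp agree
    on at least min_run consecutive words."""
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    return [(m.a, m.a + m.size)
            for m in matcher.get_matching_blocks() if m.size >= min_run]

ref = "you are insisting on the anger that you have".split()
hyp = "you insisting on the anger to not".split()
print(find_anchor_regions(ref, hyp))  # [(2, 6)] -> "insisting on the anger"
```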

Overview of Study (slide diagram)
- Trained human evaluator
- Available data (e.g., audio, video, text)
- Judgments (e.g., how much is one spouse blaming the other spouse?)
- Signal processing (e.g., feature extraction)
- Computational modeling (e.g., machine learning)
- Feedback
- Example: married couples discussing a problem in their relationship

Data Coding

Motivation
- Psychology research depends on perceptual judgments
- Interaction recorded for offline hand coding
- Relies on a variety of established coding standards [Margolin et al. 1998]

- Manual coding process is expensive and time consuming: creating the coding manual, training coders, ensuring coder reliability

- Technology can help code audio-visual data
- Certain measurements are difficult for humans to make
- Computers can extract these low-level descriptors (LLDs) [Schuller et al. 2007]
- Consistent way to quantify human behavior from objective signals

Sample Transcript
Husband: what did I tell you about you can spend uh everything that we uh earn

Wife: then why did you ask

Husband: and spend more and get us into debt

Wife: yeah why did you ask see my question is

Husband: mm hmmm

Wife: if if you told me this and I agree I would keep the books and all my expenses and everything

Goals
- Separate extreme cases of session-level perceptual judgments using speaker-dependent features derived from the (noisy) audio signal

- Extraction and analysis of relevant acoustic (prosodic/spectral) features

Relevance
- Automatic coding of real data using objective features
- Extraction of high-level speaker information from complex interactions

- Make a new histogram of behaviors: show the full histogram and then a new histogram where we only care about half of the behavior

Goals for the initial experiments were to:
- Separate extreme cases of session-level perceptual judgments (e.g., high blame from low blame) for each spouse using features derived from the (noisy) audio signal
- Extract, model, and analyze relevant acoustic (prosodic/spectral) features

Inputs
- 10-minute audio clip of a spontaneous dyadic conversation about a problem
- Transcript with speaker labels

Relevance
- Automatic coding of data using objective features and quantitative modeling
- Extraction of useful speaker information from long, spontaneous dyadic conversations

Feature Extraction (1/3)
- Explore the use of 10 acoustic low-level descriptors (LLDs)
- Broadly useful in psychology and engineering research

1) Speaking rate
- Extracted for each aligned word [words/sec, letters/sec]
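A small sketch of how the two per-word speaking-rate values could be computed from the word-level alignment; the (word, start_sec, end_sec) tuple format is an assumed representation of the aligner output.

```python
# Illustrative per-word speaking rates (words/sec and letters/sec) from the
# word-level alignment; the (word, start_sec, end_sec) tuple format is an
# assumed representation of the aligner output.
def speaking_rates(aligned_words):
    rates = []
    for word, start, end in aligned_words:
        dur = max(end - start, 1e-3)      # guard against zero-length alignments
        rates.append((1.0 / dur,          # words per second
                      len(word) / dur))   # letters per second
    return rates

print(speaking_rates([("insisting", 1.20, 1.75), ("anger", 2.10, 2.40)]))
```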

2) Voice Activity Detection (VAD) [Ghosh et al. 2010]
- Trained on a 30-second clip from a held-out session
- Separated speech from non-speech regions
- Extracted 2 session-level first-order Markov chain features: Pr(x_i = speech | x_{i-1} = speech) and Pr(x_i = non-speech | x_{i-1} = speech)
- Extracted durations of each speech and non-speech segment to use as an LLD for later feature extraction
- We explore the use of both prosodic and spectral features that have been shown to be broadly useful in psychology and engineering research
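A short sketch of the two session-level Markov chain features and the speech/non-speech segment durations, assuming the VAD output is available as a 0/1 label per fixed-length frame (the 10 ms frame length is an assumption).

```python
# Sketch of the two session-level first-order Markov chain features,
# Pr(speech_t | speech_{t-1}) and Pr(non-speech_t | speech_{t-1}), plus the
# speech/non-speech segment durations used as an LLD. Assumes the VAD output
# is a 0/1 label per fixed-length frame (10 ms here is an assumption).
import numpy as np

def vad_markov_features(vad_frames):
    x = np.asarray(vad_frames)
    prev_speech = x[:-1] == 1                    # frames whose predecessor is speech
    if prev_speech.sum() == 0:
        return 0.0, 0.0
    p_ss = float(np.mean(x[1:][prev_speech] == 1))
    return p_ss, 1.0 - p_ss                      # Pr(speech|speech), Pr(non-speech|speech)

def segment_durations(vad_frames, frame_sec=0.01):
    x = np.asarray(vad_frames)
    runs = np.split(x, np.flatnonzero(np.diff(x)) + 1)
    return [(int(r[0]), len(r) * frame_sec) for r in runs]  # (label, duration in sec)

print(vad_markov_features([0, 1, 1, 1, 0, 0, 1, 1, 0]))   # (0.6, 0.4)
print(segment_durations([0, 1, 1, 1, 0, 0, 1, 1, 0]))
```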

Feature Extraction (2/3)
Extracted over each voiced region (every 10 ms with a 25 ms window):
3) Pitch
4) Root-mean-square (RMS) energy
5) Harmonics-to-noise ratio (HNR)
6) Voice quality (zero-crossing rate of the autocorrelation function)
7) 13 MFCCs
8) Magnitude of 26 Mel-frequency bands (MFBs)
9) Magnitude of the spectral centroid
10) Magnitude of the spectral flux

- LLDs 4-10 extracted with openSMILE [Eyben et al. 2009]
- Pitch extracted with Praat [Boersma 2001]
- Pitch median filtered (N = 5) and linearly interpolated
- No interpolation across speaker-change points (using the automatic alignment)
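A hedged sketch of the pitch post-processing described above (5-point median filter, then linear interpolation within speaker regions only); handling each speaker region separately is one plausible implementation, not necessarily the authors' exact one.

```python
# One plausible implementation of the pitch post-processing: 5-point median
# filtering, then linear interpolation over unvoiced gaps, handled separately
# within each speaker region so that no interpolation crosses a speaker-change
# point. The exact handling in the original system may differ.
import numpy as np
from scipy.signal import medfilt

def smooth_pitch(f0, change_points, kernel=5):
    """f0: per-frame pitch in Hz (0 = unvoiced); change_points: frame indices
    where the speaker changes (from the automatic alignment)."""
    f0 = np.asarray(f0, dtype=float)
    out = np.zeros_like(f0)
    bounds = [0] + sorted(change_points) + [len(f0)]
    for lo, hi in zip(bounds[:-1], bounds[1:]):       # one speaker region at a time
        seg = f0[lo:hi].copy()
        if len(seg) >= kernel:
            seg = medfilt(seg, kernel_size=kernel)    # 5-point median filter
        voiced = seg > 0
        if voiced.sum() >= 2:                         # interpolate only inside the region
            idx = np.arange(len(seg))
            seg = np.interp(idx, idx[voiced], seg[voiced])
        out[lo:hi] = seg
    return out
```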


Pitch Example & Normalization
Normalized the pitch stream in two ways:

- Mean pitch value, F0, computed across the session using the automatic alignment
- Unknown regions treated as coming from one unknown speaker
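One plausible way to use the per-speaker session mean F0 is to divide each voiced frame by its speaker's mean, with the unknown regions treated as one extra pseudo-speaker; the slides do not spell out both normalization variants, so this is only an illustrative sketch.

```python
# Illustrative use of the per-speaker session mean F0: divide each voiced frame
# by its speaker's mean, with unknown regions treated as one pseudo-speaker.
# The slides do not spell out both normalization variants, so this is only one
# plausible form.
import numpy as np

def normalize_pitch_by_speaker(f0, speaker_labels):
    """f0: per-frame pitch (Hz, 0 = unvoiced); speaker_labels: one of
    'wife', 'husband', 'unknown' per frame (from the automatic alignment)."""
    f0 = np.asarray(f0, dtype=float)
    labels = np.asarray(speaker_labels)
    out = np.zeros_like(f0)
    for spk in np.unique(labels):
        mask = (labels == spk) & (f0 > 0)
        if mask.any():
            out[mask] = f0[mask] / f0[mask].mean()   # session-level mean F0 per speaker
    return out
```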


Feature Extraction (3/3)
Each session split into 3 domains:
- Wife (when the wife was the speaker)
- Husband (when the husband was the speaker)
- Speaker-independent (full session)

Extracted 13 functionals across each domain for each LLD:
- Mean, standard deviation, skewness, kurtosis, range, minimum, minimum location, maximum, maximum location, lower quartile, median, upper quartile, interquartile range

Final set of 2007 features, designed to capture global acoustic properties of the spouses and the interaction
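A sketch of the 13 functionals computed over a single LLD stream; repeating this for every LLD dimension and the three domains yields the 2007-dimensional session feature vector. Reporting the min/max locations as relative positions is an assumption.

```python
# Sketch of the 13 functionals over a single LLD stream for one domain;
# repeating this for every LLD dimension and the three domains yields the
# 2007-dimensional session feature vector. Reporting min/max locations as
# relative positions in the stream is an assumption.
import numpy as np
from scipy.stats import skew, kurtosis

def functionals(x):
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return {
        "mean": x.mean(), "std": x.std(),
        "skewness": float(skew(x)), "kurtosis": float(kurtosis(x)),
        "range": x.max() - x.min(),
        "min": x.min(), "min_loc": int(x.argmin()) / len(x),
        "max": x.max(), "max_loc": int(x.argmax()) / len(x),
        "lower_quartile": q1, "median": med, "upper_quartile": q3, "iqr": q3 - q1,
    }
```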


Classification Experiment
- Binary classification task
- Only analyzed sessions that had mean evaluator scores in the top/bottom 20% of the code range
- Goal: separate the 2 extremes automatically
- Leave-one-couple-out cross-validation
- Trained wife and husband models separately [Christensen 1990]
- Error metric: % of misclassified sessions

- Classifier: Fisher's linear discriminant analysis (LDA)
- Forward feature selection to choose which features to train the LDA on
- Empirically better than other common classifiers (SVM, logistic regression)
- LDA maximizes the ratio of between-class scatter to within-class scatter using a linear combination of the features
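A hedged sketch of the classification pipeline: Fisher LDA wrapped in greedy forward feature selection, scored with leave-one-couple-out cross-validation. scikit-learn's LinearDiscriminantAnalysis and LeaveOneGroupOut stand in for the authors' implementation; the feature matrix X, binary labels y, couple IDs in groups, and the stopping rule are assumptions.

```python
# Hedged sketch of the classification setup: Fisher LDA with greedy forward
# feature selection, scored by leave-one-couple-out cross-validation.
# scikit-learn stands in for the authors' implementation; X (sessions x 2007),
# binary labels y, couple IDs in groups, and the stopping rule are assumptions.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def forward_select_lda(X, y, groups, max_features=5):
    logo, selected, best = LeaveOneGroupOut(), [], 0.0
    for _ in range(max_features):
        scored = []
        for j in range(X.shape[1]):
            if j in selected:
                continue
            acc = cross_val_score(LinearDiscriminantAnalysis(), X[:, selected + [j]],
                                  y, cv=logo, groups=groups).mean()
            scored.append((acc, j))
        acc, j = max(scored)
        if acc <= best:               # stop when no feature improves CV accuracy
            break
        best, selected = acc, selected + [j]
    return selected, 1.0 - best       # selected feature indices, fraction misclassified
```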

Classification Results (1/2)
- Separated extreme behaviors better than chance (50% error) for 3 of the 6 codes: acceptance, global positive affect, global negative affect
- Global speaker-dependent cues captured the evaluators' perception well
- Bar plot shows the classification results for this initial experiment; the x-axis is grouped by code, and the y-axis is the % of misclassified sessions (lower is better)
- The dotted line shows the error needed to achieve performance significantly better than chance agreement of 50% (at the 5% significance level using McNemar's test)

- Separated the two extremes significantly better than chance for 3 of the 6 codes: acceptance, global positive affect, global negative affect
- Other codes may depend more crucially on non-acoustic cues (e.g., lexical cues) or may need to be modeled using more dynamic methods

Classification Results (2/2)
Other codes:
- Need to be modeled with more dynamic methods
- Depend more crucially on non-acoustic cues
- Are inherently less separable


Results: Feature Selection
LDA classifier feature selection:
- Mean = 3.4 features selected, std. dev. = 0.81 features
Domain breakdown:
- Wife models: speaker-independent = 56%, wife = 32%, husband = 12%
- Husband models: speaker-independent = 41%, husband = 37%, wife = 22%
Feature breakdown:
- Pitch = 30%, MFCC = 26%, MFB = 26%, VAD = 12%, speaking rate = 4%, other = 2%

Also explored using features from a single domain (make bar graphs!)

- As expected, performance decreases in mismatched conditions
- Good performance using speaker-independent features: the session is inherently interactive (need to use more dynamic methods)

Conclusions
- This work represents an initial analysis of a novel and challenging corpus consisting of real couples interacting about problems in their relationship

- Showed we could train binary classifiers, using only audio features, that separated the spouses' behavior significantly better than chance for 3 of the 6 codes we analyzed

- Provides a partial explanation of the coders' subjective judgments in terms of objective acoustic signals

Future Work
More automation:
- Front-end: speaker segmentation using only the audio signal

Multimodal fusion:
- Acoustic features + lexical features

Saliency detection:
- Using supervised methods (manual coding at a finer temporal scale)
- Using unsupervised, signal-driven methods

References
P. Boersma. Praat, a system for doing phonetics by computer. Glot International, 5(9/10):341-345, 2001.

A. Christensen, D.C. Atkins, S. Berns, J. Wheeler, D. H. Baucom, and L.E. Simpson. Traditional versus integrative behavioral couple therapy for significantly and chronically distressed married couples. J. of Consulting and Clinical Psychology, 72:176-191, 2004.

F. Eyben, M. Wöllmer, and B. Schuller. openEAR: Introducing the Munich open-source emotion and affect recognition toolkit. In Proc. IEEE ACII, 2009.

P. K. Ghosh, A. Tsiartas, and S. S. Narayanan. Robust voice activity detection using long-term signal variability. IEEE Trans. Audio, Speech, and Language Processing, 2010 (accepted).

C. Heavey, D. Gill, and A. Christensen. Couples interaction rating system 2 (CIRS2). University of California, Los Angeles, 2002.

J. Jones and A. Christensen. Couples interaction study: Social support interaction rating system. University of California, Los Angeles, 1998.

G. Margolin, P.H. Oliver, E.B. Gordis, H.G. O'Hearn, A.M. Medina, C.M. Ghosh, and L. Morland. The nuts and bolts of behavioral observation of marital and family interaction. Clinical Child and Family Psychology Review, 1(4):195-213, 1998.

P.J. Moreno, C. Joerg, J.-M. van Thong, and O. Glickman. A recursive algorithm for the forced alignment of very long audio segments. In Proc. ICSLP, 1998.

V. Rozgić, B. Xiao, A. Katsamanis, B. Baucom, P. G. Georgiou, and S. Narayanan. A new multichannel multimodal dyadic interaction database. In Proc. Interspeech, 2010.

B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, N. Amir, and L. Kessous. The relevance of feature type for automatic classification of emotional user states: Low level descriptors and functionals. In Proc. Interspeech, 2007.

A. Vinciarelli, M. Pantic, and H. Bourlard. Social signal processing: Survey of an emerging domain. Image and Vision Computing, 27:1743-1759, 2009.

Thank you! Questions?

Recruited Couples
- Couples married an average of 10.0 years (SD = 7.7 years)

- Ages ranged from 22 to 72 years
- Men: median age = 43 years (SD = 8.8 years)
- Women: median age = 42 years (SD = 8.7 years)

- College-educated on average: median level of education for both men and women = 17 years (SD = 3.2 years)

- Ethnicity breakdown: 77% Caucasian, 8% African American, 5% Asian or Pacific Islander, 5% Latino/Latina, 5% other

Social Support Interaction Rating System
- Designed in [Jones et al. 1998]

- 18 codes, measuring the emotional content of the interaction and the topic of conversation

- 4 categories: Affectivity, Dominance/Submission, Features of Interaction, Topic Definition

- Example: Global positive affect. An overall rating of the positive affect the target spouse showed during the interaction. Examples of positive behavior include overt expressions of warmth, support, acceptance, affection, positive negotiation, and compromise. Positivity can also be expressed through facial and bodily expressions, such as smiling and looking happy, talking easily, looking comfortable and relaxed, and showing interest in the conversation.

Automatic Alignment Details
Reject an aligned region if:
- Audio duration < 0.25 sec (too short)
- Text string length < 6 letters (too short)
- Speaking rate < 8 letters/sec (too slow)
- Speaking rate > 40 letters/sec (too fast)
Align reference and hypothesized transcripts:
- Dynamic programming string alignment to minimize the edit distance
- Insertion/deletion/substitution errors assigned penalty scores of 3/3/4
- Example of alignment shown below (matching words appear in lowercase)

REF Trans: YOU ARE INSISTING ON THE ANGER THAT YOU HAVE
REF Align: you ARE insisting on ***** the anger THAT YOU HAVE
HYP Align: you insisting on the anger **** TO NOT
HYP Trans: (has start and stop times for each recognized word)
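A minimal implementation of the weighted edit distance behind this alignment, using the 3/3/4 insertion/deletion/substitution penalties quoted above; the word-level granularity and the example strings follow the slide.

```python
# Minimal weighted edit distance with the insertion/deletion/substitution
# penalties of 3/3/4 quoted above, applied at the word level as in the
# REF/HYP example on this slide.
def align_cost(ref_words, hyp_words, ins=3, dele=3, sub=4):
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * dele
    for j in range(1, m + 1):
        d[0][j] = j * ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if ref_words[i - 1] == hyp_words[j - 1] else sub
            d[i][j] = min(d[i - 1][j - 1] + match,   # match / substitution
                          d[i - 1][j] + dele,        # word missing from HYP
                          d[i][j - 1] + ins)         # extra word in HYP
    return d[n][m]

ref = "YOU ARE INSISTING ON THE ANGER THAT YOU HAVE".lower().split()
hyp = "you insisting on the anger to not".split()
print(align_cost(ref, hyp))  # weighted edit distance between REF and HYP
```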

- Reject low-confidence aligned regions

Future Work: Multimodal Fusion
- Fusion of audio and lexical features for multimodal prediction (acoustic feature = …, lexical feature = …)
- Experiment 1: make one decision for the entire session
- Experiment 2: make a decision for each turn/utterance
- Slide diagram: two panels, "Feature-level Fusion" and "Score-level Fusion", each a feature-selection-then-classifier pipeline applied per session or per turn (Turn 1, Turn 2, …, Turn N)
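An illustrative contrast between the two fusion strategies sketched in the diagram, assuming per-session acoustic and lexical feature matrices already exist; the choice of LDA and the simple averaging of posteriors are assumptions, since the slide only outlines the idea.

```python
# Illustrative contrast between the two fusion strategies in the diagram,
# assuming per-session acoustic and lexical feature matrices already exist.
# The choice of LDA and the simple averaging of posteriors are assumptions;
# the slide only outlines the idea.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def feature_level_fusion(X_acoustic, X_lexical, y):
    """Concatenate the two modalities, then train a single classifier."""
    return LinearDiscriminantAnalysis().fit(np.hstack([X_acoustic, X_lexical]), y)

def score_level_fusion(X_acoustic, X_lexical, y):
    """Train one classifier per modality and average their posterior scores."""
    clf_a = LinearDiscriminantAnalysis().fit(X_acoustic, y)
    clf_l = LinearDiscriminantAnalysis().fit(X_lexical, y)
    def predict(Xa, Xl):
        scores = (clf_a.predict_proba(Xa) + clf_l.predict_proba(Xl)) / 2.0
        return scores.argmax(axis=1)
    return predict
```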

Future Work: Saliency Detection
- Incorporate saliency detection into the computational framework
- Certain regions are more relevant than others for human subjective judgment
- Can be viewed as a data selection or data weighting problem

Proposal: supervised method
- Mimic the saliency trends of trained human evaluators
- Manually code a subset of the data at a finer-grained temporal scale
- Can inform which feature streams are more useful (i.e., which features are active during regions specified as relevant/salient by human evaluators)
- Can be used to train classifiers at the turn/utterance level
- Turn/utterance-level classifiers can then be mapped to session-level ones


Ambiguous Displays
- Saliency detection could lead to a more accurate representation for ambiguous displays of human behavior
- Move beyond just separating extreme behaviors

With the possibility of detecting codes at a shorter temporal scale:
- Can specify how much a code is occurring in a session
- Provides a framework for a more regression-like, continuous output

Ongoing Work
- This work is part of a larger problem set our lab is working on: Behavioral Signal Processing (BSP)

- New data collection [Rozgić et al. 2010]
- Designed a smart room with 10 HD cameras, 15 microphones, and motion capture
- Allows for multimodal analysis of interpersonal interactions

- Show video clip; point out the specific behavior we are targeting: approach/avoidance
