Automatic Depression Scale Prediction using Facial Expression Dynamics and Regression

Asim Jan, Hongying Meng, Yona Falinie Binti Abd Gaus, Fan Zhang, Saeed Turabzadeh
Department of Electronic and Computer Engineering
Brunel University London, Uxbridge, Middlesex UB8 3PH, U.K.
{asim.jan, hongying.meng, yonafalinie.abdgaus}@brunel.ac.uk
[email protected], [email protected]

August 2014


Abstract

Depression is a state of low mood and aversion to activity that can affect a person's thoughts, behavior, feelings and sense of well-being. In such a low mood, both the facial expression and the voice appear different from those in normal states. In this paper, an automatic system is proposed to predict the Beck Depression Inventory scale from naturalistic facial expressions of patients with depression. Firstly, features are extracted from the corresponding video and audio signals to represent characteristics of facial and vocal expression under depression. Secondly, a dynamic feature generation method is proposed in the extracted video feature space, based on the idea of the Motion History Histogram (MHH) for 2-D video motion extraction. Thirdly, Partial Least Squares (PLS) and Linear regression are applied to learn the relationship between the dynamic features and depression scales using training data, and then to predict the depression scale for unseen samples. Finally, decision-level fusion is performed to combine the predictions from the video and audio modalities. The proposed approach is evaluated on the AVEC2014 dataset and experimental results demonstrate its effectiveness.


1 Introduction

The study of mental health problems has received increasing attention from various domains of modern society in recent years. Among mood disorders, depression occurs commonly and heavily threatens the mental health of human beings. Generally, depression is a state of low mood and aversion to activity that can affect a person's thoughts, behavior, feelings, and sense of well-being [22]. Depressed people may feel sad, anxious, hopeless, empty, worried, helpless, worthless, hurt, irritable, guilty, or restless. They may lose interest in activities that once were pleasurable, experience loss of appetite or overeating, have trouble concentrating, remembering details, or making decisions, and may contemplate or even attempt suicide. Insomnia, excessive sleeping, fatigue, loss of energy, aches, pains, or digestive problems that are resistant to treatment may be present as well [1].

Despite the very limited progress achieved so far on automatic depression scale prediction or other mental disorders, recent technological advances in automatic emotion analysis in the fields of affective computing and social signal processing are extensive, and can be regarded as a good place to start. People express their emotions through the visual (i.e. facial expressions and bodily gestures), vocal, and physiological modalities, and among these modalities, facial expression plays a primary role in representing the emotional states of human beings. Much effort has hence been dedicated by psychologists to model the mapping from facial expressions to emotional states [10]. Initial work aimed at describing expressions using static features extracted from still face images [21] [24]. However, for some expressions, especially subtle ones like anger and disgust, these features prove insufficient. Recently, the focus has gradually shifted to facial expression analysis in video data, since videos convey richer information than still images. From videos, dynamic features can be obtained, which are also critical for representing facial expressions formed by periodic muscular movements, e.g. [6] [13] [18]. Meanwhile, video data often contains vocal information corresponding to the visual content, and this is another important channel for emotion recognition. Therefore, a natural trend is to combine the vocal modality with the visual one, on the grounds that the two cues are complementary to each other and their joint use improves system performance [5].

Similarly, in depression scale prediction, visual and vocal features (both contained in video data) are indispensable, since depressed people tend to show disordered behavior in facial expression, gesture, verbal communication, etc. For example, they may seem unable to relax, be quicker to anger, or be full of restless energy, which can be reflected by changes in their facial expressions; their speech may become slurred, slow, and monotonous, which can be represented by variations of their voices. Based on these considerations, this paper proposes a novel approach for depression scale prediction using both visual and vocal cues, aiming to combine their advantages. Based on the idea of the Motion History Histogram (MHH) for 2-D motion capture, dynamic features are extracted from the features computed on the videos. Feature selection is performed to reduce the dimension of the feature vector. Partial Least Squares (PLS) regression and Linear regression are then used to model the mapping between the dynamic features and the depression scales. Finally, predictions from the video and audio modalities are combined at decision level. Experimental results achieved on the AVEC2014 dataset illustrate the effectiveness of the proposed method.

The rest of the paper is organized as follows. Section 2 briefly reviews related work in this area. Section 3 provides a detailed description of the proposed method, and Section 4 presents and discusses the experimental results on the AVEC2014 dataset [27]. Section 5 concludes the paper.

2 Related Work

Recent years have witnessed growing interest in clinical and mental health analysis from facial and vocal expressions [29] [32] [23] [12], due to significant progress in emotion recognition from facial expression. Wang et al. [29] proposed a computational approach that creates probabilistic facial expression profiles for video data and helps to automatically quantify differences in emotional expression between patients with neuropsychiatric disorders (schizophrenia) and healthy controls.

Figure 1: Overview of the proposed automatic depression scale recognition system.

Concerning depression analysis in particular, Cohn et al. [7] pioneered the search for solutions from the perspective of affective computing. They fused both cues of facial actions and vocal prosody, attempting to investigate systematic and efficient ways of incorporating behavioral observations that are strong indicators of psychological disorders, much of which may occur outside the awareness of either individual. Their findings suggest the feasibility of automatic detection of depression, and have exciting implications for clinical theory and practice. Yang et al. [32] explored variations in vocal prosody of participants, and found moderate predictability of the depression scores based on a combination of F0 and switching pauses. Girard et al. [12] analyzed both manual and automatic facial expressions during semi-structured clinical interviews of clinically depressed patients, concluding that participants with high symptom severity showed more expressions associated with contempt, smiled less, and that their smiles were more likely to be related to contempt. Scherer et al. [23] studied the correlation between the properties of gaze, head pose, and smile and three mental disorders (i.e. depression, post-traumatic stress disorder and anxiety), and discovered that significant differences in automatically detected behaviors appear between the highest and lowest distressed participant groups.

The depression recognition sub-challenge of AVEC2013 [28] was held in 2013, and several methods with good results were proposed [30] [14]. Williamson et al. [30] exploited effects that reflect changes in coordination of vocal tract motion associated with Major Depressive Disorder. Specifically, they investigated changes in correlation that occur at different time scales across formant frequencies and also across channels of the delta-mel-cepstrum. Both feature domains provide measures of coordination in vocal tract articulation while reducing the effects of a slowly varying linear channel, which can be introduced by time-varying microphone placements. Based on these two complementary feature sets, they designed a novel Gaussian mixture model (GMM)-based multivariate regression scheme, referred to as Gaussian Staircase Regression, that provided very good prediction on the standard Beck depression rating scale. Meng et al. [14] presented a novel method that comprehensively models the visual and vocal modalities with dynamic features to automatically predict the scale of depression. Motion History Histogram (MHH) was first used to extract the dynamics from the corresponding video and audio data to represent characteristics of subtle changes in facial and vocal expression of depression. The Partial Least Squares (PLS) regression algorithm was then applied to learn the relationship between the dynamic features and depression scales using training data, and to predict the depression scale for unseen samples. Predicted values from the visual and vocal cues are further combined at decision level for the final decision.

From these two methods, it can be seen that feature extraction is the key to depression scale prediction. In this paper, we propose a new way of dynamic feature extraction. Firstly, standard feature extraction methods are applied to each frame or vocal segment. Then the dynamic variations in the feature space are extracted based on the idea of MHH for images. The reason is that the feature space contains more useful information than the raw images that were used in [14]. Finally, a fusion process over advanced regression methods is investigated and proposed to improve the overall performance.

3 Depression Scale Prediction

Since human facial expression and voice under depression are theoretically different from those under normal mental states, we attempt to address the problem of depression scale prediction by combining dynamic descriptions of naturalistic facial and vocal expressions. This paper proposes a novel method that comprehensively models the variations in visual and vocal cues and automatically predicts the Beck Depression Inventory scale of depression.

3.1 System Overview

Figure 1 illustrates how the features are extracted from both the visual and the audio data, then combined, reduced, and used with machine learning. For the visual data, a sequence of steps is taken to combine different extraction methods used for dynamic and static data. The data itself is in video format, from which Edge Orientation Histogram (EOH), Local Phase Quantization (LPQ) and Local Binary Patterns (LBP) features are extracted for each frame of the video, yielding different descriptors that capture different characteristics of the facial expression images. The idea of Motion History Histograms (MHH) is then applied to capture the dynamic movement of the features, describing temporal information of the different extracted facial descriptors; it is effective in summarizing the facial feature movements. All the features are then concatenated for a better representation and reduced in dimensionality to make the system more accurate and efficient. Partial Least Squares (PLS) regression and Linear regression are then combined to predict the depression scale.

3.2 Image Feature Extraction

Based on the frames of facial expression in the video data, three different texture features are extracted.

Local Binary Patterns

Local Binary Patterns (LBP) describe the local texture structure of an image by creating patterns from each pixel and its surroundings. LBP is invariant to monotonic gray-level changes [19] and computationally efficient, which makes it useful for face recognition [2] and for capturing the variation of facial structure across expressions.

Each pixel is compared with its surrounding 8 pixels, thresholding on whether each neighbour is higher or lower than the center. From this, an 8-bit pattern is generated and converted to decimal format, and a histogram of these codes is created. This is applied to the textured facial images from the database.
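As an illustration, the following is a minimal sketch of the basic 8-neighbour LBP histogram described above, written in plain NumPy; the function name, the 256-bin layout and the normalisation are our own choices, not the paper's implementation (which produces 944-dimensional LBP descriptors per frame).

```python
import numpy as np

def lbp_histogram(gray):
    """Basic 8-neighbour LBP: compare each pixel with its 8 neighbours,
    build an 8-bit code per pixel, and return a 256-bin histogram."""
    g = gray.astype(np.float32)
    c = g[1:-1, 1:-1]                      # centre pixels (image border skipped)
    # Offsets of the 8 neighbours, ordered so each contributes one bit.
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                  (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.int32)
    for bit, (dy, dx) in enumerate(neighbours):
        n = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes += (n >= c).astype(np.int32) * (1 << bit)   # set one bit per neighbour
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()               # normalised 256-dimensional descriptor
```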

Edge Orientation Histogram

EOH, an efficient and powerful operator, can be regarded as a simpler version of the Histogram of Oriented Gradients (HOG) [8] that captures the edge or local shape information of an image. It has been widely used in a variety of computer vision applications such as hand gesture recognition [11], object tracking [31] and facial expression recognition [18].
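A rough sketch of an edge orientation histogram is given below, assuming Sobel gradients and a fixed number of orientation bins; the bin count and the gradient operator are illustrative assumptions rather than the exact configuration used in the paper.

```python
import numpy as np
from scipy import ndimage

def edge_orientation_histogram(gray, n_bins=8):
    """Histogram of gradient orientations, weighted by gradient magnitude."""
    gx = ndimage.sobel(gray.astype(np.float32), axis=1)   # horizontal gradient
    gy = ndimage.sobel(gray.astype(np.float32), axis=0)   # vertical gradient
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)                      # range [-pi, pi]
    hist, _ = np.histogram(orientation, bins=n_bins,
                           range=(-np.pi, np.pi), weights=magnitude)
    return hist / (hist.sum() + 1e-12)                    # normalised n_bins-dim descriptor
```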

Local Phase Quantization

Local Phase Quantization (LPQ) is widely used in face recognition and texture classification [20] because of its ability to extract local phase information. A Discrete Fourier Transform is computed over M-by-M windows across the whole image, and the phase differences are quantized to produce a histogram for each window; the histograms are then combined to produce one overall histogram for the image.
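For orientation, here is a simplified LPQ sketch that computes local Fourier coefficients at four low frequencies and quantizes the signs of their real and imaginary parts into an 8-bit code per pixel; it omits the decorrelation (whitening) step of reference LPQ implementations, and the window size is an assumed value.

```python
import numpy as np
from scipy.signal import convolve2d

def lpq_histogram(gray, win=7):
    """Simplified LPQ: local Fourier coefficients at four low frequencies,
    8-bit quantisation of the signs of their real/imaginary parts,
    then a 256-bin histogram over the image (no decorrelation step)."""
    img = gray.astype(np.float64)
    r = (win - 1) // 2
    y, x = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing='ij')
    a = 1.0 / win
    freqs = [(a, 0), (0, a), (a, a), (a, -a)]           # four lowest non-zero frequencies
    codes = np.zeros(img.shape, dtype=np.int32)
    bit = 0
    for (v, u) in freqs:
        kernel = np.exp(-2j * np.pi * (v * y + u * x))  # complex STFT basis over the window
        resp = convolve2d(img, kernel, mode='same')
        codes += (resp.real > 0).astype(np.int32) * (1 << bit)
        codes += (resp.imag > 0).astype(np.int32) * (1 << (bit + 1))
        bit += 2
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()
```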

3.3 Audio Feature Extraction

The spectral low-level descriptors (LLDs) and MFCCs 11-16 are included in the AVEC2014 baseline audio feature set and are adopted as the basic representation of the vocal cue. The baseline feature set consists of 2268 features, composed of 32 energy and spectral related LLDs × 42 functionals, 6 voicing related LLDs × 32 functionals, 32 delta coefficients of the energy/spectral LLDs × 19 functionals, 6 delta coefficients of the voicing related LLDs × 19 functionals, and 10 voiced/unvoiced durational features.

LLD features are extracted from 25 ms and 60 ms overlapping windows which are shifted forward at a rate of 10 ms. Among these features, pitch (F0) based LLDs are extracted from 60 ms windows, while all other LLDs are extracted from 25 ms windows. For detailed information, please see the baseline paper [27].

Figure 2: The selected audio descriptors from the audio features provided in AVEC2014.

Firstly, the 2268 features are split into several vectors, each containing the features from one individual LLD extraction method, and these are investigated separately. The best audio descriptors are selected by testing each individual feature vector on the development data set and taking the top 8 performing descriptors, which are: Flatness, Energy in Bands 1k-4kHz (Band1000), Entropy, MFCC, Probability of Voicing (POV), PSY Sharpness, Shimmer and Zero Crossing Rate (ZCR). Then, all 256 possible combinations of these 8 vectors are investigated, again evaluated on the development data set. From this, the combination of Flatness, Band1000, PSY Sharpness, POV, Shimmer and ZCR performed best; these are concatenated to produce the final descriptor for the audio modality, as shown in Figure 2.
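The exhaustive search over the 2^8 descriptor combinations could be scripted roughly as follows; the `evaluate_on_dev` callback, the descriptor dictionary and the error criterion are illustrative assumptions about how the selection described above might be implemented, not the authors' code.

```python
from itertools import combinations
import numpy as np

def select_descriptor_combination(descriptors, evaluate_on_dev):
    """Try every non-empty subset of the candidate descriptors and keep the
    one with the lowest development-set error.

    descriptors: dict name -> (n_samples, dim) feature matrix
    evaluate_on_dev: callable taking a concatenated feature matrix and
                     returning an error score (e.g. RMSE on the dev set).
    """
    names = sorted(descriptors)
    best_score, best_subset = np.inf, None
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            feats = np.hstack([descriptors[n] for n in subset])
            score = evaluate_on_dev(feats)
            if score < best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score
```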

3.4 1-D Motion History Histogram

MHH is a descriptive temporal template representation for visual motion recognition. It was originally proposed and applied to human action recognition [16]; detailed information can be found in [15] and [17]. It records the grey-scale value changes of each pixel in the video. In comparison with other well-known motion features, such as the Motion History Image (MHI) [4], it contains more dynamic information about the pixels and achieves better performance in human action recognition [15]. MHH not only provides rich motion information, but also remains computationally inexpensive [17].

In previous research, MHH was used to extract the motion of each pixel in a 2-D image sequence and then to create multi-scale histograms (M is the number of scales) for the whole video. Here, we propose a 1-D MHH that extracts the changes of each component in a feature vector sequence (instead of one pixel in an image sequence) and then creates the histogram over all components of the feature vector for one video. The dynamics of facial movement are thus replaced by the dynamics of the features. Figure 3 shows the process of computing the 1-D MHH on a sequence of feature vectors. Firstly, for one component of the feature vector, each pair of consecutive values is compared and thresholded to produce a binary value: if the change is bigger than the threshold, it becomes '1', otherwise '0'. Along the time line, a binary sequence is generated. Within this binary sequence, M different patterns are counted, so for each component, M counts are generated. All these counts together make up the 1-D MHH features of the feature vector sequence.

Figure 3: Process of computing 1-D MHH on the sequence of feature vectors.
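To make the pattern counting concrete, here is a small sketch of one plausible reading of the 1-D MHH described above, where pattern P_m is taken to be a run of exactly m consecutive '1's in the binarized change sequence; the threshold value and the handling of runs longer than M are our own assumptions.

```python
import numpy as np

def mhh_1d(feature_seq, M=3, threshold=0.01):
    """1-D MHH over a (n_frames, n_components) feature-vector sequence.

    For each component, consecutive values are compared and thresholded to a
    binary change sequence; runs of exactly m consecutive '1's (m = 1..M) are
    counted, giving M counts per component.  Output length: n_components * M.
    """
    seq = np.asarray(feature_seq, dtype=np.float64)
    changes = np.abs(np.diff(seq, axis=0)) > threshold    # (n_frames-1, n_components)
    n_comp = seq.shape[1]
    hist = np.zeros((n_comp, M), dtype=np.int64)
    for c in range(n_comp):
        run = 0
        for bit in changes[:, c]:
            if bit:
                run += 1
            else:
                if 1 <= run <= M:
                    hist[c, run - 1] += 1
                run = 0
        if 1 <= run <= M:                                  # close a trailing run
            hist[c, run - 1] += 1
    return hist.reshape(-1)                                # concatenated M-scale histogram
```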


The dynamic property between consecutive audio segments is also critical. Vocal expression changes are considered to be discriminative between depression and other emotional states, so the same idea of MHH is applied to the audio as was done for the video. Instead of operating on each pixel of a video, each component of the audio baseline feature vector is used for dynamic modelling. For each component, a change sequence is created and its corresponding histogram is computed from the variation patterns. Similarly, M values are obtained for each component, so the final dynamic audio feature vector is M times longer than the original one.

3.5 Feature Combination and Fusion

The LBP, LPQ and EOH feature extraction methods are applied to each frame, as each algorithm highlights different patterns, all of which are considered useful. This effectively converts a video with frames of images into three sequences of extracted feature vectors. The 1-D MHH algorithm is applied to each feature vector sequence and produces M feature vectors per sequence, containing the dynamic details of the motion that occurred across the features. The resulting histograms are then concatenated to provide one feature vector.

All the features are combined into a large feature vector, which can make the training stage of the machine learning very slow. Two approaches are used to reduce the dimensionality of the large feature set and improve training speed and performance. Firstly, a novel method is used to analyze each feature column and decide whether it can be discarded to increase the performance of the system. Secondly, Principal Component Analysis (PCA) is used to obtain the coefficients for each feature set.

The initial method normalizes each column of the feature matrix; e.g. if the training set contains 50 videos with 2000 features each, a column vector represents 1 feature across the 50 videos, i.e. (1x50). The goal is to reduce the total number of columns whilst retaining key information. From each individual column, a histogram is then taken and tested for how well distributed it is. Equation 1 describes the criterion for selecting which columns of the feature vector to remove, where k is the number of bins, m(i) is the number of samples in each individual bin, n is the total number of samples, and T is the sampling threshold.

$$\sum_{i=1}^{k} m(i) > n \times T \qquad (1)$$

If the number of bins k is 10, the 50 samples are distributed among the 10 bins depending on their value. The number of samples in each of the 10 bins, m(i), is then checked against the threshold value n × T. If m(i) is greater, that column is removed, as this means that the contents of the column are too similar across the samples. This method not only reduces the vector size but also increases system performance, since the system is less confused by near-identical features across the majority of the samples. Once the vector is reduced, PCA is applied to further reduce the vector size by calculating the coefficients that best describe the features. The number of coefficients generated from the features is between 5 and 10, as this range gives the best performance when testing on the development data set. This new vector is then used at the regression stage to predict the depression values.
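The following is a minimal sketch of this two-stage reduction, following the per-bin check described in the prose (remove a column if any histogram bin holds more than n × T of its samples) and then projecting with PCA; the bin count, threshold and component count are the illustrative values quoted above, not a verified reproduction of the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_features(X, k=10, T=0.8, n_components=8):
    """Stage 1: drop columns whose value histogram is too concentrated.
       Stage 2: project the remaining columns with PCA.
       X: (n_samples, n_features) training feature matrix."""
    n = X.shape[0]
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        rng = col.max() - col.min()
        col = (col - col.min()) / rng if rng > 0 else np.zeros_like(col)
        counts, _ = np.histogram(col, bins=k, range=(0.0, 1.0))
        if counts.max() <= n * T:          # keep only well-distributed columns
            keep.append(j)
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X[:, keep])
    return X_reduced, keep, pca
```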

3.6 Regression

Partial Least Squares (PLS) regression [9] is a statistical algorithm that bears some relation to principal components regression. Instead of finding hyperplanes of minimum variance between the response and independent variables, it builds a linear regression model by projecting the response and independent variables into another common space. Since both the response and independent variables are projected into a new space, the approaches in the PLS family are known as bilinear factor models.

More specifically, PLS tries to find fundamental relations between two matrices (response and independent variables), i.e. a latent variable approach to modelling the covariance structures in the two spaces. A PLS model seeks the multidimensional direction in the independent variable space that explains the maximum multidimensional variance direction in the response variable space. PLS regression is particularly suited when the matrix of predictors has more variables than observations and when there is multicollinearity among the independent variable values. By contrast, standard regression will fail in these cases.

In the case of depression recognition, it is necessary to reduce the dimension of the feature vector. In the AVEC2014 dataset, the first 50 samples are for training, another 50 for development, and the remaining 50 for testing. This is a relatively small number compared with the feature dimensionality, leaving the regression problem highly redundant. For this reason, a feature selection technique is used to keep only the feature components that are relevant to the depression label. The correlation between each feature component and the depression labels is computed on the training set, and only components whose absolute correlation exceeds a threshold are kept; the others are discarded.
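A compact sketch of this correlation-based screening is shown below; the threshold value and the use of Pearson correlation are assumptions consistent with, but not stated in, the text.

```python
import numpy as np

def select_by_correlation(X_train, y_train, threshold=0.3):
    """Keep feature components whose absolute Pearson correlation with the
    depression labels on the training set exceeds the threshold."""
    y = y_train - y_train.mean()
    Xc = X_train - X_train.mean(axis=0)
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(y) + 1e-12
    corr = (Xc.T @ y) / denom                  # per-component correlation with the labels
    keep = np.where(np.abs(corr) > threshold)[0]
    return keep
```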

The general underlying model of multivariate PLS is

$$X = T P^{T} + E, \qquad Y = U Q^{T} + F \qquad (2)$$

where X is an n × m matrix of predictors and Y is an n × p matrix of responses; T and U are n × l matrices that are, respectively, projections of X (scores, components or factor matrix) and projections of Y (scores); P and Q are, respectively, m × l and p × l orthogonal loading matrices; and the matrices E and F are the error terms, assumed to be independent and identically distributed normal random variables. The decompositions of X and Y are made so as to maximize the covariance of T and U.

Linear regression is another statistical approach for modelling the relationship between a scalar dependent variable and one or more explanatory variables. It is used in the system alongside PLS regression for decision fusion. The decision fusion stage aims to combine multiple decisions into a single, consensus one [25]. The linear opinion pool method is used in this case because of its simplicity [3], and a weighted sum rule is defined to combine the predicted values from each decision, as in [26].
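As an illustration of the regression and weighted-sum fusion step, here is a minimal sketch using scikit-learn's PLS and linear regression; the weight value and model settings are placeholders, not the tuned values used in the paper.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression

def train_and_fuse(X_train, y_train, X_test, n_pls_components=5, w_pls=0.7):
    """Fit PLS and linear regression on the training data, then fuse their
    predictions on the test data with a weighted sum (linear opinion pool)."""
    pls = PLSRegression(n_components=n_pls_components).fit(X_train, y_train)
    lr = LinearRegression().fit(X_train, y_train)
    pred_pls = pls.predict(X_test).ravel()
    pred_lr = lr.predict(X_test).ravel()
    return w_pls * pred_pls + (1.0 - w_pls) * pred_lr
```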

4 Experimental Results

4.1 AVEC2014 Dataset

The proposed approach is evaluated on the Audio/Visual Emotion Challenge (AVEC) 2014 dataset [27], a subset of the audio-visual depressive language corpus (AViD-Corpus). The dataset contains 340 video clips from 292 subjects performing a Human-Computer Interaction task while being recorded by a webcam and a microphone in a number of quiet settings. There is only one person in each clip, and some subjects feature in more than one clip. All participants were recorded between one and four times, with an interval of two weeks: 5 subjects appear in 4 recordings, 93 in 3, 66 in 2, and 128 in only one session. The length of the clips is between 20 and 50 minutes, with an average of 25 minutes, and the total duration of all clips is 240 hours. The mean age of subjects is 31.5 years, with a standard deviation of 12.3 years and a range of 18 to 63 years.

4.2 Experimental Setting

AVEC2014 addresses two sub-challenges: the Affect Recognition Sub-Challenge (ASC) and the Depression Recognition Sub-Challenge (DSC). ASC concentrates on continuous affect recognition of the dimensions of dominance, valence and arousal, where the level of affect has to be predicted for each frame of the recording, while DSC requires predicting the level of self-reported depression, as indicated by the Beck Depression Inventory (BDI), for every session, that is, one continuous value per video clip. This study focuses on DSC, where a single regression problem needs to be solved. The Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) over all sessions are both used as measurements in the competition.
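For reference, the two evaluation measures over N sessions, with predicted scores $\hat{y}_i$ and ground-truth BDI scores $y_i$, follow the standard definitions:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left|\hat{y}_i - y_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2}$$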

For each video clip, we work in the spatial domain to produce local features using EOH, LBP and LPQ. These features are extracted frame by frame to produce 384, 944 and 256 dimensional histograms, respectively, for each frame. The MHH algorithm was then modified for the 1-D case and is operated on all the data to produce M = 3 vectors of temporal information. The vectors EOH MHH, LBP MHH and LPQ MHH are reshaped, producing 1152, 2835 and 768 components, which are concatenated to produce a vector of 4755 components. These components are produced for both the Northwind and Freeform videos and are then concatenated, producing a total of 9510 components per sample, denoted as MIX MHH. Figure 4 shows the MIX MHH feature vectors for 50 samples of the development set of the AVEC2014 dataset.

Figure 4: MIX MHH feature vectors for 50 samples of the development subset of the AVEC2014 dataset.

To demonstrate the advantages of the proposed approach to depression recognition, we compare the achieved results with those of other configurations. The vectors EOH MHH, LBP MHH and LPQ MHH were tested on the development set before being concatenated, to compare their individual and combined benefits. Furthermore, we also model the temporal features of facial expressions in the dynamic feature space, similar to [14], i.e. first operating MHH on the video to produce 5 (M=5) frames, then extracting the local features (EOH, LBP and LPQ) from each of the 5 frames and concatenating all the vectors; this is denoted as MHH MIX. The baseline audio features (2268) provided by the data set for the short segments (short) and per instance (inst) are used, denoted as Audio. The combined audio features of Flatness, Band1000, POV, PSY Sharpness, Shimmer and ZCR are also used, containing 285 of the 2268 features, and denoted as Comb. For all the dynamic features, both visual and vocal, the dimensionality is reduced with PCA and the results are provided by PLS regression, Linear regression, and a fusion of the two methods.

4.3 Performance Comparison

Table 1 shows the individual performance of the three image feature extraction methods combined with 1-D MHH, from which the depression scales are predicted using the two regression techniques separately and fused. Using PLS for regression is better than Linear regression in all tests. However, when they are fused with a weighting towards PLS, the results improve further. LBP is shown to be the weakest of the three and LPQ the strongest.

Table 2 compares the combined performance of the three image feature extraction methods with MHH operated before and after them. MIX MHH applies MHH after EOH, LBP and LPQ have been applied to each frame, and MHH MIX applies MHH first, on the original video clip. MIX MHH shows a significant improvement over the individual performances in Table 1, as well as outperforming MHH MIX with all regression methods, showing that combining the different individual features provides a much more reliable performance. The benefit of applying MHH after EOH, LBP and LPQ is that these three methods highlight more features and MHH then captures the frequency of feature changes across the frames, whereas if MHH is applied first to the raw video, a lot of useful information is compressed into six frames before being extracted. Since the AVEC2014 data set is different from the earlier AVEC2013 data set, MHH MIX, the approach used in [14], does not produce results similar to those on the 2013 data set.

In Table 3, the audio features for short segments and per instance are considered. From the 2268 audio features, the combined features (Comb) are also tested separately. The combined features produce good results when PLS and Linear regression are fused together, whereas when all the audio features are used, PLS alone works best. These four tests use only audio features and show good results compared with the individual visual feature results in Table 1; however, they are not as good as the combined visual features (MIX MHH).

Table 4 combines the audio domain with the best features from the visual domain at decision level, to show the performance when fused together. Compared with Tables 2 and 3, the results show a significant improvement over audio alone and some improvement over visual alone. The fusion of PLS and Linear regression is chosen for testing because in most cases it improves performance. The Comb audio features provide a better outcome than using all the audio features provided.

Table 1: Performance of depression scale prediction using the dynamic visual features, measured in MAE and RMSE averaged over all sequences in the development set.

Partition     Method              MAE    RMSE
Development   EOH MHH PLS         8.87   11.09
Development   EOH MHH LR          9.92   12.39
Development   EOH MHH (PLS+LR)    9.14   11.39
Development   LBP MHH PLS         9.34   11.16
Development   LBP MHH LR          9.86   12.68
Development   LBP MHH (PLS+LR)    9.18   11.15
Development   LPQ MHH PLS         8.79   10.88
Development   LPQ MHH LR          9.73   11.49
Development   LPQ MHH (PLS+LR)    8.70   10.63

Figure 5: Predicted and actual depression scales for the development subset of the AVEC2014 dataset.

Table 5 displays the final performance of each modality, as well as of fusing both at decision level, on both the development and test sets. The combined audio/visual modality outperforms the single modalities on both the development and test sets. The fusion process gives more of the weighting to the visual-only predictions than to the audio, based on their performance. However, the results show that even though the audio predictions are not as good as the visual ones, they still help in predicting a more accurate set of labels overall.

The baseline results [27], which are only partially reported for the development and test sets, are listed in Table 6. In comparison with the results in Table 5, our method outperforms the baseline development RMSE score and is better overall than the baseline result on the test data set.

5 Conclusions and Perspectives

In this paper, a novel approach is proposed for automatic depression scale prediction based on facial and vocal expression in naturalistic video recordings. Based on the idea of MHH for 2-D video motion features, we proposed a 1-D MHH that can be applied to feature vector sequences and provides dynamic features (e.g. EOH MHH, LBP MHH, LPQ MHH) for the video. This dynamic feature is better than the MHH EOH used in previous research [14] because it is based on feature vectors instead of raw images. Finally, PLS regression and Linear regression are adopted to capture the correlation between the feature space and the depression scales.

From the experimental results, it can be seen that the proposed method achieves good results on the AVEC2014 dataset. Table 2 clearly demonstrates that the proposed dynamic feature is better than the MHH EOH used in previous research [14]. In comparison with Table 1, the fusion of the three image features produces better results than any of the individual features. For the audio features, Comb achieves better results than the whole audio feature set provided by the organizers. The decision-level fusion achieved better results on the development set, and the same result on the test set.

Table 2: Performance of depression scale prediction using MHH before and after the MIX (EOH, LBP and LPQ) visual features, measured in MAE and RMSE averaged over all sequences in the development set.

Partition     Methods             MAE     RMSE
Development   MIX MHH PLS         7.72    9.68
Development   MIX MHH LR          7.52    10.05
Development   MIX MHH (PLS+LR)    7.36    9.49
Development   MHH MIX PLS         8.91    10.78
Development   MHH MIX LR          10.59   12.71
Development   MHH MIX (PLS+LR)    9.00    10.95

Table 3: Performance of depression scale prediction using the complete audio features and the Comb features, measured in MAE and RMSE averaged over all sequences in the development set.

Partition     Methods               MAE    RMSE
Development   Comb inst (PLS+LR)    8.35   10.56
Development   Comb short (PLS+LR)   8.16   10.32
Development   Audio inst PLS        9.45   11.25
Development   Audio short PLS       8.92   10.69

There are two main contributions of this paper. The first is the dynamic feature extraction that applies the idea of MHH in the feature space. The second is the fusion of different features extracted from the images. The overall results on the test set are better than the baseline results.

There are still other image features, such as Gabor features, that could be used to extract dynamic features and could easily be added to the system to improve performance. There are also dynamic features such as LBP-TOP; combining these kinds of features with our dynamic features will be an interesting topic. For the audio features, a combination of descriptors was used; other features should also be considered for integration into the system. All of this will be our future work.

6 Acknowledgments

The work by Asim Jan was supported by a School of Engineering & Design / Thomas Gerald Gray PGR Scholarship. The work by Hongying Meng and Saeed Turabzadeh was partially funded by the award of the Brunel Research Initiative and Enterprise Fund (BRIEF). The work by Yona Falinie Binti Abd Gaus was supported by a Majlis Amanah Rakyat (MARA) Scholarship.


Table 4: Performance of depression scale prediction using the audio features combined with the best visual features (MIX MHH), fused at decision level, measured in MAE and RMSE averaged over all sequences in the development set.

Partition     Methods                  MAE    RMSE
Development   (MIX+Audio) (PLS+LR)     8.92   10.71
Development   (MIX+Comb) (PLS+LR)      7.34   9.09

Table 5: System performance of the proposed depression scale prediction method, measured in MAE and RMSE averaged over all sequences in the development and test sets.

Partition     Modality      MAE    RMSE
Development   Audio         8.92   10.69
Development   Video         7.36   9.49
Development   Video&Audio   7.34   9.09
Test          Audio         9.10   11.30
Test          Video         8.44   10.50
Test          Video&Audio   8.30   10.26

Table 6: Baseline performance of depression recognition, measured in MAE and RMSE over all sequences [27].

Partition     Modality   MAE    RMSE
Development   Audio      -      -
Development   Video      -      9.26
Test          Audio      -      -
Test          Video      8.86   10.86


Bibliography

[1] http://www.nimh.nih.gov/health/publications/depression/index.shtml, retrieved 15 July 2013.

[2] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, 2006.

[3] I. Bloch. Information combination operators for data fusion: a comparative review with classification. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 26(1):52–67, 1996.

[4] A. F. Bobick and J. W. Davis. The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell., 23(3):257–267, 2001.

[5] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan. Analysis of emotion recognition using facial expressions, speech and multimodal information. In International Conference on Multimodal Interaction, pages 205–211, 2004.

[6] I. Cohen, N. Sebe, A. Garg, L. S. Chen, and T. S. Huang. Facial expression recognition from video sequences: temporal and static modeling. Computer Vision and Image Understanding, 91(1-2):160–187, 2003. Special Issue on Face Recognition.

[7] J. F. Cohn, T. S. Kruez, I. Matthews, Y. Yang, M. H. Nguyen, M. Padilla, F. Zhou, and F. De la Torre. Detecting depression from facial actions and vocal prosody. In International Conference on Affective Computing and Intelligent Interaction and Workshops, pages 1–7, 2009.

[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 886–893, 2005.

[9] S. de Jong. SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3):251–263, 1993.

[10] P. Ekman and W. V. Friesen. Facial Action Coding System. Consulting Psychologists Press, 1978.

[11] W. T. Freeman and M. Roth. Orientation histograms for hand gesture recognition. In IEEE International Workshop on Automatic Face and Gesture Recognition, pages 296–301, 1994.

[12] J. M. Girard, J. F. Cohn, M. H. Mahoor, S. Mavadati, and D. Rosenwald. Social risk and depression: Evidence from manual and automatic facial expression analysis. In IEEE International Conference on Automatic Face and Gesture Recognition, 2013.

[13] R. Kaliouby and P. Robinson. Real-time inference of complex mental states from facial expressions and head gestures. In B. Kisacanin, V. Pavlovic, and T. Huang, editors, Real-Time Vision for Human-Computer Interaction, pages 181–200. Springer US, 2005.

[14] H. Meng, D. Huang, H. Wang, H. Yang, M. Al-Shuraifi, and Y. Wang. Depression recognition based on dynamic facial and vocal expression features using partial least square regression. In Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, AVEC '13, pages 21–30, New York, NY, USA, 2013. ACM.


[15] H. Meng and N. Pears. Descriptive temporal template features for visual motion recognition. Pattern Recognition Letters, 30(12):1049–1058, 2009.

[16] H. Meng, N. Pears, and C. Bailey. A human action recognition system for embedded computer vision application. In CVPR Workshop on Embedded Computer Vision, 2007.

[17] H. Meng, N. Pears, M. Freeman, and C. Bailey. Motion history histograms for human action recognition. In B. Kisacanin, S. Bhattacharyya, and S. Chai, editors, Embedded Computer Vision, Advances in Pattern Recognition, pages 139–162. Springer, 2009.

[18] H. Meng, B. Romera-Paredes, and N. Bianchi-Berthouze. Emotion recognition by two view SVM 2K classifier on dynamic facial expression features. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 854–859, 2011.

[19] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:971–987, 2002.

[20] V. Ojansivu and J. Heikkila. Blur insensitive texture classification using local phase quantization. In A. Elmoataz, O. Lezoray, F. Nouboud, and D. Mammass, editors, Image and Signal Processing, volume 5099 of Lecture Notes in Computer Science, pages 236–243. Springer Berlin Heidelberg, 2008.

[21] M. Pantic and L. J. Rothkrantz. Automatic analysis of facial expressions: The state of the art. IEEE Trans. Pattern Anal. Mach. Intell., 22:1424–1445, 2000.

[22] S. Salmans. Depression: Questions You Have - Answers You Need. People's Medical Society, 1995.

[23] S. Scherer, G. Stratou, J. Gratch, J. Boberg, M. Mahmoud, A. S. Rizzo, and L.-P. Morency. Automatic behavior descriptors for psychological disorder analysis. In IEEE International Conference on Automatic Face and Gesture Recognition, 2013.

[24] C. Shan, S. Gong, and P. W. McOwan. Facial expression recognition based on local binary patterns: A comprehensive study. Image Vision Comput., 27(6):803–816, May 2009.

[25] A. Sinha, H. Chen, D. G. Danu, T. Kirubarajan, and M. Farooq. Estimation and decision fusion: A survey. In IEEE International Conference on Engineering of Intelligent Systems, pages 1–6, 2006.

[26] N. Ueda. Optimal linear combination of neural networks for improving classification performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(2):207–215, 2000.

[27] M. F. Valstar, B. Schuller, K. Smith, T. Almaev, F. Eyben, J. Krajewski, R. Cowie, and M. Pantic. AVEC 2014 - 3D dimensional affect and depression recognition challenge. In International Conference on ACM Multimedia - Audio/Visual Emotion Challenge and Workshop, 2014.

[28] M. F. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, and M. Pantic. AVEC 2013 - the continuous audio/visual emotion and depression recognition challenge. In International Conference on ACM Multimedia - Audio/Visual Emotion Challenge and Workshop, 2013.

[29] P. Wang, F. Barrett, E. Martin, M. Milanova, R. E. Gur, R. C. Gur, C. Kohler, and R. Verma. Automated video-based facial expression analysis of neuropsychiatric disorders. Journal of Neuroscience Methods, 168(1):224–238, 2008.

[30] J. R. Williamson, T. F. Quatieri, B. S. Helfer, R. Horwitz, B. Yu, and D. D. Mehta. Vocal biomarkers of depression based on motor incoordination. In Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, AVEC '13, pages 41–48, New York, NY, USA, 2013. ACM.


[31] C. Yang, R. Duraiswami, and L. Davis. Fast multiple object tracking via a hierarchical particle filter. In IEEE International Conference on Computer Vision, pages 212–219, Washington, DC, USA, 2005. IEEE Computer Society.

[32] Y. Yang, C. Fairbairn, and J. Cohn. Detecting depression severity from intra- and interpersonal vocal prosody. In IEEE International Conference on Automatic Face and Gesture Recognition, 2013.
