
Model Based Emotion Detection using Point Clouds



Page 1: Model Based Emotion Detection using Point Clouds

Problem Statement

The goal is to use a model-based approach for facial emotion recognition of a driver in a real-time environment. The system should work on an embedded platform. The program is developed to exploit a pipelined architecture and parallel processing.

Page 2: Model Based Emotion Detection using Point Clouds

Why a Model-Based Approach?

Illumination and pose variations are major concerns in facial emotion recognition; a model-based approach can overcome them.

Page 3: Model Based Emotion Detection using Point Clouds

State Of The Art

Robert Niese, Ayoub Al-Hamadi, Axel Panning and Bernd Michaelis, “Emotion Recognition based on 2D-3D Facial Feature Extraction from Color Image Sequences”

Narendra Patel, Mukesh Zaveri, “3D Facial Model Construction and Expression Synthesis using a Single Frontal Face Image”

Page 4: Model Based Emotion Detection using Point Clouds

State Of The Art

Aitor Azcarate, Felix Hageloh, Koen van de Sande, Robert Valenti, “Automatic facial emotion recognition”

Tie Yun, Ling Guan, “Human Emotion Recognition Using Real 3D Visual Features from Gabor Library”

Page 5: Model Based Emotion Detection using Point Clouds

Challenges

Faces are non-rigid and show a high degree of variability in location, colour and pose; these properties make facial-expression-based emotion recognition complex.

Occlusion, lighting distortions and changing illumination conditions can also alter the overall appearance of the face. Such changes make emotion classification harder.

Spontaneous emotion recognition.

Complexity of the background: when there is more than one face in the image, the system should be able to distinguish which one is being tracked.

Page 6: Model Based Emotion Detection using Point Clouds

Emotion Recognition based on 2D-3D Facial Feature Extraction from Color Image Sequences

Page 7: Model Based Emotion Detection using Point Clouds

Facial Feature Points in 2D
1. Detect the face
2. Define fiducial points
3. Detect eyes and mouth

The complete set of feature points is as shown in the figure.

Page 8: Model Based Emotion Detection using Point Clouds

Classifier cascade framework (Overview | Integral Image | AdaBoost | Cascade)

Training phase: training set of sub-windows -> integral representation -> feature computation -> AdaBoost feature selection -> cascade trainer.

Testing phase: a candidate window passes through strong classifier 1 (cascade stage 1), strong classifier 2 (cascade stage 2), ..., strong classifier N (cascade stage N); a window that survives every stage is reported as FACE IDENTIFIED.

Slide courtesy: Kostantina Palla, University of Edinburgh
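The slides contain no code; as a rough sketch of applying such a cascade, OpenCV's pre-trained Haar-cascade face detector can be used. The cascade file and parameter values below are assumptions, not part of the original work:

```python
import cv2

# Load a pre-trained Haar cascade bundled with OpenCV (assumed path).
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

def detect_faces(frame):
    """Run the classifier cascade over sub-windows of the frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Every returned detection has survived all cascade stages.
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return faces  # list of (x, y, w, h) boxes

# Hypothetical usage with a webcam frame:
# ok, frame = cv2.VideoCapture(0).read()
# print(detect_faces(frame))
```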

Page 9: Model Based Emotion Detection using Point Clouds

Camera Model
A pinhole camera model is used.
Using the camera parameters, the transformation of 3D world points to image points is well described.
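For illustration, a minimal sketch of the pinhole projection, assuming an intrinsic matrix K and an extrinsic pose [R|t]; all numeric values are made up:

```python
import numpy as np

# Assumed intrinsics: focal lengths and principal point (pixels).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

R = np.eye(3)                      # assumed rotation (world -> camera)
t = np.array([0.0, 0.0, 1000.0])   # assumed translation, mm

def project(points_3d):
    """Project Nx3 world points to Nx2 image points with the pinhole model."""
    cam = points_3d @ R.T + t          # world -> camera coordinates
    uvw = cam @ K.T                    # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]    # perspective division

print(project(np.array([[0.0, 0.0, 0.0], [50.0, 20.0, 10.0]])))
```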

Page 10: Model Based Emotion Detection using Point Clouds

Geometric 3D model
In an initial registration step, the subject is captured once in frontal pose and with a neutral expression.
The face is localized in the stereo point cloud by using the observation that “surfaces are represented by more or less connected point clusters.”
A similarity criterion h is used for clustering; it combines colour and the Euclidean distance of points.
Surface reconstruction is then performed on the face cluster.
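The exact form of the criterion h is not given on the slides; the following is a hypothetical sketch that combines a colour difference with the Euclidean distance, using made-up weights and threshold:

```python
import numpy as np

def similarity_h(p, q, w_color=0.5, w_dist=0.5):
    """Hypothetical similarity between two points p, q = (xyz, rgb).
    Lower values mean the points are more likely to belong to one cluster."""
    xyz_p, rgb_p = p
    xyz_q, rgb_q = q
    d_euclid = np.linalg.norm(np.asarray(xyz_p) - np.asarray(xyz_q))
    d_color = np.linalg.norm(np.asarray(rgb_p, float) - np.asarray(rgb_q, float))
    return w_color * d_color + w_dist * d_euclid

# Two points join the same cluster if h is below an assumed threshold.
p = ((0.0, 0.0, 1000.0), (180, 140, 120))
q = ((2.0, 1.0, 1001.0), (178, 142, 119))
print(similarity_h(p, q) < 10.0)
```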

Page 11: Model Based Emotion Detection using Point Clouds

Estimation of face pose
Correspondence between the model and the real world is established using fiducial points.
According to the camera model, the image projection of each anchor point is determined.
The goal of pose estimation is to minimize the error between the projected 3D anchor points and the detected fiducial points.
After the pose is determined, the image feature points are projected onto the surface model at its current pose.
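As an illustrative sketch (not the authors' implementation), the reprojection error between 3D anchor points and 2D fiducial points can be minimized with OpenCV's solvePnP; the point coordinates and intrinsics below are assumptions:

```python
import numpy as np
import cv2

# Assumed 3D anchor points on the face model (eye corners, nose tip,
# mouth corners, chin), in mm.
model_points = np.array([[-30.0,  35.0,  0.0],
                         [ 30.0,  35.0,  0.0],
                         [  0.0,   0.0, 30.0],
                         [-25.0, -30.0,  0.0],
                         [ 25.0, -30.0,  0.0],
                         [  0.0, -45.0,  5.0]])

# Corresponding 2D fiducial points detected in the image (assumed, pixels).
image_points = np.array([[290.0, 210.0], [350.0, 212.0], [320.0, 250.0],
                         [295.0, 290.0], [345.0, 292.0], [320.0, 310.0]])

K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(4)  # assume no lens distortion

# solvePnP finds rotation and translation minimizing the reprojection error.
ok, rvec, tvec = cv2.solvePnP(model_points, image_points, K, dist_coeffs)
print(ok, rvec.ravel(), tvec.ravel())
```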

Page 12: Model Based Emotion Detection using Point Clouds

Feature Vector
The feature vector consists of angles and distances between a series of facial feature points in 3D.
Feature vectors are normalized to increase classification robustness.
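A small sketch of how such a feature vector might be assembled; the point names, the chosen pairs and triples, and the normalization by inter-ocular distance are assumptions:

```python
import numpy as np

def angle(a, b, c):
    """Angle at vertex b (radians) formed by 3D points a-b-c."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def feature_vector(pts):
    """pts: dict of named 3D feature points (hypothetical names)."""
    # Distances between selected point pairs, normalized by eye distance.
    eye_dist = np.linalg.norm(pts["eye_l"] - pts["eye_r"])
    dists = [np.linalg.norm(pts["mouth_l"] - pts["mouth_r"]) / eye_dist,
             np.linalg.norm(pts["brow_l"] - pts["eye_l"]) / eye_dist]
    # Angles between selected point triples.
    angles = [angle(pts["mouth_l"], pts["mouth_top"], pts["mouth_r"])]
    return np.array(dists + angles)

pts = {k: np.random.rand(3) for k in
       ["eye_l", "eye_r", "mouth_l", "mouth_r", "mouth_top", "brow_l"]}
print(feature_vector(pts))
```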

Page 13: Model Based Emotion Detection using Point Clouds

Classification
Classification is done using an artificial neural network (ANN).
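The slides do not specify the network; a hedged sketch using scikit-learn's MLPClassifier, with assumed layer sizes, labels and data shapes, could look like this:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["neutral", "happy", "sad", "angry", "surprised", "fearful", "disgusted"]

# Assumed training data: normalized feature vectors and emotion labels.
X_train = np.random.rand(200, 10)                      # 200 samples, 10 features (made up)
y_train = np.random.randint(0, len(EMOTIONS), size=200)

# A small feed-forward network; the architecture is an assumption.
ann = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
ann.fit(X_train, y_train)

new_feature_vector = np.random.rand(1, 10)
print(EMOTIONS[ann.predict(new_feature_vector)[0]])
```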

Page 14: Model Based Emotion Detection using Point Clouds

Cons
Misclassification at transitions between facial expressions due to indistinct features.
Performance is not optimised.
The stereo-based initialization step can be inconvenient and requires calibrated cameras.

Page 15: Model Based Emotion Detection using Point Clouds

References

[1] H. D. Vankayalapati and K. Kyamakya, “Nonlinear Feature Extraction Approaches for Scalable Face Recognition Applications,” ISAST Transactions on Computers and Intelligent Systems, vol. 2, 2009.

[2] Hang-Bong Kang, “Various Approaches for Driver and Driving Behavior Monitoring: A Review,” ICCV 2013 Workshop.

[3] K. Sreenivasa Rao and Shashidhar G. Koolagudi, Emotion Recognition using Speech Features.

[4] Yun Tie, “Human Emotional State Recognition Using 3D Facial Expression Features,” thesis, Ryerson University.

Page 16: Model Based Emotion Detection using Point Clouds

Thank You

Page 17: Model Based Emotion Detection using Point Clouds

http://en.wikipedia.org/wiki/Perspective_%28graphical%29

Camera calibration and the use of subject-specific surface model data reduce perspective foreshortening, for example in the case of out-of-plane rotations.

Photogrammetry is the science, technology and art of obtaining reliable information about the Earth from non-contact imaging and other sensor systems.

Page 18: Model Based Emotion Detection using Point Clouds

Different Face Detection Techniques
Two groups: holistic, where the face is treated as a whole unit, and analytic, where the co-occurrence of characteristic facial elements is studied.

Holistic face models: • Huang and Huang [7] used a Point Distribution Model (PDM), which represents the mean geometry of the human face. First, a Canny edge detector is applied to find two symmetrical vertical edges that estimate the face position, and then the PDM is fitted. • Pantic and Rothkrantz [8] proposed a system which processes images of the frontal and profile face views. Vertical and horizontal histogram analysis is used to find the face boundaries. Then, the face contour is obtained by thresholding the image with HSV color space values.

Analytic face models: • Kobayashi and Hara [9] used an image captured in monochrome mode to find the face brightness distribution. The position of the face is estimated by iris localization. • Kimura and Yachida's [10] technique processes the input image with an integral projection algorithm to find the positions of the eye and mouth corners from color and edge information. The face is represented with a Potential Net model, which is fitted to the positions of the eyes and mouth.

All of the above-mentioned systems were designed to process facial images; however, they are not able to detect whether a face is present in the image. Systems which handle arbitrary images are listed below:

• Essa and Pentland [11] created the “face space” by performing Principal Component Analysis on 128 face images to obtain eigenfaces. A face is detected in the image if its distance from the face space is acceptable (a rough sketch of this idea appears after this list).

• Rowley et al. [12] proposed neural-network-based face detection. The input image is scanned with a window, and a neural network decides whether a particular window contains a face or not.

• Viola and Jones [13] introduced a very efficient algorithm for object detection that uses Haar-like features as the object representation and AdaBoost as the machine learning method. This algorithm is widely used in face detection.
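As a rough illustration of the eigenface “face space” idea from Essa and Pentland's entry above (not their actual implementation; the image size, component count and threshold are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed training set: 128 face images flattened to vectors (random stand-ins here).
IMG_SIZE = 32 * 32
faces = np.random.rand(128, IMG_SIZE)

# Build the "face space" from the leading eigenfaces.
pca = PCA(n_components=20)
pca.fit(faces)

def distance_from_face_space(image_vec):
    """Reconstruction error: how far a window lies from the face space."""
    coords = pca.transform(image_vec.reshape(1, -1))
    reconstruction = pca.inverse_transform(coords)
    return np.linalg.norm(image_vec - reconstruction.ravel())

# A window is accepted as a face if the distance is below an assumed threshold.
window = np.random.rand(IMG_SIZE)
print(distance_from_face_space(window) < 5.0)
```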

Page 19: Model Based Emotion Detection using Point Clouds

Three-Dimensional Techniques
Three-dimensional models inherently provide more information than 2D models due to the presence of depth information, and are more robust than 2D models. However, many 3D model extraction solutions suffer from high computational cost or from oversimplified models that do not accurately represent the object.

The acquisition of 3D data can also produce image artifacts that may affect the rendered model [25]. The camera can receive light at intensities that saturate the detector, or receive light levels too low to produce high-quality images; this can occur in areas where there is specular reflection in stereo systems. Stereo-based systems also have trouble obtaining truly dense sampling of the face surface, and sample only sparsely in regions with too little natural texture (too smooth), leading to the exclusion of certain features. Multimodal analysis with 3D and 2D data may provide better data for classification (for face recognition) than single modalities, but compared to multiple 2D images (without 3D rendering) it does not show significant improvement, leading to a possible optimization problem in determining the best ways to use the acquired data [25].

A process completed by Chaumont et al. [26] breaks this problem into two steps: it first formulates an estimation of the 3D model, followed by model refinement. In the estimation step, a CANDIDE wireframe model (a 3D wireframe of an average face) is projected from the 3D space onto the 2D space under the assumption that all feature points are coplanar. This approximation is realistic because the differences in depth between features are very small compared to the distance to the camera. Making this assumption results in a projection of a 2D image onto a 2D plane, which is a much easier problem to solve. Also, since few 2D-3D correspondence points are available for use, the matrix is very sparse and can be solved very quickly. After this approximation is determined, the wireframe is refined by perturbing the 3D points separately to match the 2D points. This is a fast method for face tracking and 3D face model extraction; it can predict feature positions under rotations and translations, and it allows model recovery in the presence of occlusion because 3D information about the object is known.

Soyel et al. [27] used 3D distance vectors between feature points to obtain 3D FAPs measuring quantities like the openness of the eyes, the height of the eyebrows, the openness of the mouth, etc., yielding distance vectors for test and training data for different expressions. They use only 23 facial features that are associated with the selected measurements and classify with a neural network. Tang et al. [28] use the same approach, but apply an algorithm to the set of distances between the 83 points to determine the measurements that contain the most variation and are the most discriminatory, allowing for better recognition than empirically determined measurements.

Shape information is located in geometric features like ridges, ravines, peaks, pits, saddles, etc. Local surface fitting is done by centering the coordinate system at the vertex of interest (for ease of computation). The patch can be expressed in local coordinates, and a cubic approximation (x^3, x^2y, xy^2, etc.) can be used to fit the surface locally, yielding two principal vectors that describe the maximum and minimum curvature at that point, and two corresponding eigenvalues. Along with the normal direction at that point, the surface properties can be classified into labels (flat, peak, ridge, etc.), and a Primitive Surface Feature Distribution (PSFD) [29] can be generated as a feature.
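A rough sketch of this local fitting idea, using a quadratic rather than a full cubic fit and made-up neighbour points; the principal curvatures are taken as the eigenvalues of the fitted second-order terms:

```python
import numpy as np

def principal_curvatures(neighbors):
    """Fit z = a*x^2 + b*x*y + c*y^2 to neighbours expressed in local
    coordinates (vertex at the origin, z along the normal) and return
    the principal curvatures as eigenvalues of the Hessian."""
    x, y, z = neighbors[:, 0], neighbors[:, 1], neighbors[:, 2]
    A = np.column_stack([x * x, x * y, y * y])
    (a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)
    hessian = np.array([[2 * a, b], [b, 2 * c]])
    k1, k2 = np.linalg.eigvalsh(hessian)
    return k1, k2  # signs of k1, k2 distinguish flat / peak / ridge / saddle

# Made-up neighbourhood of a peak-like vertex.
pts = np.array([[1, 0, -0.1], [0, 1, -0.1], [-1, 0, -0.1],
                [0, -1, -0.1], [1, 1, -0.2], [-1, -1, -0.2]], dtype=float)
print(principal_curvatures(pts))
```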

Other methods attempt to fit surface models onto point clouds of 3D sensor data. Mpiperis et al. [30, 31] used a neutral face with an average identity and deformed it to the appropriate expression/identity. A triangular 3D mesh is placed on the face and subdivided into sub-triangles to increase the density. First, a set of landmarks is associated with vertices on the mesh, which remain unchanged during the fitting process. Fitting is posed as an energy minimization problem consisting of terms that describe the opposing forces between the landmarks and mesh points, the distance between the surface and the mesh, and a smoothness constraint; it is solved by setting the partial derivatives to 0 and using SVD. Asymmetric bilinear models are used for facial expression recognition, modelling identity in one dimension and expression in another. 3D facial shapes obtained by finding the difference between neutral and expressive faces in 3D can also be used to classify facial expressions [32].

Venkatesh et al. employed principal component analysis on 3D mesh datasets to attempt to classify facial expressions [10]. PCA is a popular mathematical technique that allows the dimensionality of the problem to be reduced, making it easier to solve. For the training set, 68 feature points, which are known to effectively represent facial expressions, were manually selected around the eyes, mouth and eyebrows. PCA is performed on the x, y, and z locations of these feature points to determine eigenvectors onto which a given data matrix A can be projected. This method automatically extracts features after they are divided into bounding boxes using anthropomorphic properties. The method achieves automatic selection of points; however, it is very computationally expensive.
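A hedged sketch of PCA over 3D landmark coordinates in this spirit; the data, component count and shapes are made up, and only the 68-landmark count comes from the text above:

```python
import numpy as np
from sklearn.decomposition import PCA

N_LANDMARKS = 68   # feature points around the eyes, mouth and eyebrows

# Assumed training set: each row is one face, with x, y, z of every landmark.
X_train = np.random.rand(300, N_LANDMARKS * 3)

# Reduce dimensionality before expression classification.
pca = PCA(n_components=30)
Z_train = pca.fit_transform(X_train)

# Project a new face's landmarks into the same low-dimensional space.
new_face = np.random.rand(1, N_LANDMARKS * 3)
z_new = pca.transform(new_face)
print(Z_train.shape, z_new.shape)   # (300, 30) (1, 30)
```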