
A supervised approach to support the analysis and the

classification of non verbal humans communications

Vitoantonio Bevilacqua 1,2,*, Marco Suma 1, Dario D'Ambruoso 1, Giovanni Mandolino 1, Michele Caccia 1, Simone Tucci 1, Emanuela De Tommaso 1, Giuseppe Mastronardi 1,2

1 Dipartimento di Elettrotecnica ed Elettronica, Polytechnic of Bari, Italy; 2 e.B.I.S. s.r.l. (electronic Business in Security), Spin-Off of Polytechnic of Bari, Italy

* corresponding author: [email protected]

Abstract. Background: It is well known that non-verbal communication is sometimes more useful and robust than verbal communication in understanding sincere emotions, by means of the analysis of spontaneous body gestures and facial expressions acquired from video sequences. At the same time, the automatic or semi-automatic procedure to segment a human from a video stream and then extract several features to drive a robust supervised classification is still a relevant field of interest in computer vision and intelligent data analysis.

Materials and Methods: We obtained data from four datasets: first dataset contains

100 images of humans silhouettes (or templates) acquired from a video sequence

dataset, second dataset contains 543 images of gestures from a preregistered video of 

MotoGp driver Jorge Lorenzo, the third one 200 images of mouths and finally the

fourth one 100 images of noses; third and fourth datasets contain images acquired by

a tool implemented from the authors and also samples available in literature in public

databases. We used supervised methods to train the proposed classifiers and, in

particular, three different EBP Neural-Network architectures for humans templates,

mouths and noses and J48 algorithm for gestures.

Results: We obtained on average an 80% correct classification rate for the binary classifier of human templates (with no false positives), 90% correct classification for the happy/non-happy emotion, 85% for the binary disgust/non-disgust emotion, and 80% correct classification over the 4 different gestures.

Keywords: Neural Network, Emotions Recognition, Human Silhouettes, Gesture Recognition, Facial Expressions Recognition, Human Detection, Hands, Action Units, Centre of Gravity, Pose Estimation.

1 Introduction

Good communication is the foundation of successful relationships, both personal and professional. But we communicate with much more than words: in fact, much research shows that the majority of our communication is non-verbal. Non-verbal communication, or body language, includes facial expressions, gestures, eye contact, posture and even the tone of our voice. Although the details of his theory have evolved substantially since the 1960s, Ekman remains the most vocal proponent of


the idea that emotions are discrete entities [1]. Unlike some forms of non-verbal communication, facial expressions are universal. Regarding gesture recognition, we consider how the way we move communicates a wealth of information to the world. This type of non-verbal communication includes our posture, stance, and subtle movements. Gestures are omnipresent in our daily lives. However, the meaning of gestures can differ greatly across cultures and regions, so it is important to be careful to avoid misinterpretation. Using these ideas, we want to provide an automatic system able to evaluate emotions in particular situations (videoconferences, meetings, neurological examinations, investigations).

2 Materials

Materials for all four datasets were collected with the goal of increasing the variance of their samples, and hence the amount of information in the training examples available to the proposed supervised classifiers.

2.1 Human silhouettes

The human silhouettes used in this paper come from people walking in a video stream dataset, where the training examples consist of only 20 different binary silhouette images obtained after a pre-processing phase of background subtraction. By this method, the training examples consist of the several different human silhouettes extracted from each frame.

Fig. 1. a) and b) sample frames, and c) 4 different examples of human silhouettes with their varying dimensions and behaviours.
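The background-subtraction step described above can be sketched as follows. This is a minimal illustrative helper in Python; the function name and threshold value are our assumptions, not the paper's implementation:

```python
# Minimal sketch of background subtraction (illustrative, not the
# authors' code): a pixel is foreground when it differs from the
# background model by more than a fixed threshold.

def silhouette_mask(frame, background, threshold=30):
    """Return a binary mask (1 = foreground) for one gray-scale frame."""
    h, w = len(frame), len(frame[0])
    return [[1 if abs(frame[y][x] - background[y][x]) > threshold else 0
             for x in range(w)] for y in range(h)]

# Toy 3x4 frame: two bright pixels stand out against a dark background.
background = [[10] * 4 for _ in range(3)]
frame = [row[:] for row in background]
frame[1][1] = frame[1][2] = 200
mask = silhouette_mask(frame, background)
# mask[1][1] == 1 while the unchanged pixels stay 0
```

In practice the mask would then be cleaned with morphological filtering before the silhouettes are cropped.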

2.2 Facial Expressions

First of all, we explain the concept of Action Units (AUs): minimal, non-separable facial actions that are the elements from which facial expressions are constructed. Combinations of these, with different intensities, generate facial expressions. According to our previous work [2], we can assert that, generally, independently of the other AUs, the presence of AU-10 unequivocally discriminates the disgust emotion, and the presence of AU-12 or AU-13 unequivocally discriminates the happy emotion. For this reason we are able to recognize two of the six primary emotions defined by Paul Ekman: happiness and disgust. To extract the middle and lower parts of the face we have used our tool;


moreover, we have used public databases of faces [3], from which we have then extracted our regions of interest.
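The AU-based decision rule stated above can be written out explicitly. The encoding below (a set of active AU numbers, and the precedence given to AU-10) is our assumption for illustration, not the authors' code:

```python
# Sketch of the AU -> emotion rule from the text: AU-10 discriminates
# disgust; AU-12 or AU-13 discriminates happiness.

def classify_emotion(active_aus):
    """Map a set of detected Action Unit numbers to one of two emotions."""
    if 10 in active_aus:          # AU-10: upper-lip raiser -> disgust
        return "disgust"
    if 12 in active_aus or 13 in active_aus:  # lip-corner puller -> happy
        return "happy"
    return "neutral"

assert classify_emotion({10}) == "disgust"
assert classify_emotion({12, 25}) == "happy"
```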

2.3 Gestures

Each frame of the video has a resolution of 640x480 pixels. The automatic classification of gestures is based on different studies by the psychologist David McNeill [4], who divides gestures into four main categories:

- deictic gestures: typical pointing movements, usually emphasized by the movement of the fingers or by other parts of the body that can be used for this purpose;

- iconic gestures: gestures that express a formal relation to the semantic content of the discourse. They mainly occur in the area occupied by the torso of the subject being filmed;

- metaphoric gestures: gestures that present concrete imagery for abstract concepts, such as moods or language. The density of such gestures is concentrated in the lower part of the torso;

- beat gestures: these may be recognized only by focusing attention on the characteristics of their movements.

It was decided to monitor the movement of the centre of gravity (CG) of the hands in each frame, so as to be able to calculate various evaluation parameters, such as the velocity with which gestures are made.
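The CG tracking and velocity computation can be sketched as follows; the helper names and the frame rate are our assumptions for illustration:

```python
# Sketch of tracking a hand's centre of gravity (CG) across frames and
# deriving its speed (illustrative helpers, not the paper's code).

def centre_of_gravity(mask):
    """CG of a binary hand mask as (x, y) in pixel coordinates."""
    xs = ys = n = 0
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            if v:
                xs += x
                ys += y
                n += 1
    return (xs / n, ys / n) if n else None

def speed(cg_prev, cg_curr, fps=25.0):
    """CG displacement between consecutive frames, in pixels per second."""
    dx = cg_curr[0] - cg_prev[0]
    dy = cg_curr[1] - cg_prev[1]
    return (dx * dx + dy * dy) ** 0.5 * fps

mask = [[0, 1, 1],
        [0, 1, 1]]
print(centre_of_gravity(mask))  # (1.5, 0.5)
```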

3 Methods

The application of a supervised neural network trained with the Error Back Propagation algorithm offers an easier solution to complex problems such as the correct classification of silhouette shapes, facial expressions and gestures. Advantages of neural networks include their high tolerance to noise as well as their ability to classify patterns not used for training. In particular, we implemented supervised neural network classifiers for the silhouette, mouth and nose emotion features, and the J48 classifier for gestures.

3.1 Silhouettes classification

The neural network classifier is a two-layer feed-forward network with 396 inputs (corresponding to the 33*12 dimensions of the smallest figure, previously resized to contain the smallest human silhouette), with 6 logistic neurons in the first layer and 1 output neuron. The images passed to the neural network have the following characteristics: the height is greater than the width, the ratio between height and width ranges between 1.9 and 4, the height is greater than 33 pixels and the width is greater than 12 pixels. All images are split into sub-images, each containing a single human silhouette, always resized to 33*12 pixels in order to have the same number of


inputs for each neural network classification sample. This procedure guarantees a constant number of neural network inputs. In any case, to achieve good generalization performance, the training set was selected with large variability: positive examples of poses and movements, i.e. people not facing the camera (non-frontal images), people with their arms far from or close to the body, and people not very well identified owing to the visibility of just one arm; and negative examples, i.e. objects similar to people used as counter-examples.
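The geometric constraints and the fixed-size resize that yields the 396 network inputs can be sketched as below; the helper names and the nearest-neighbour resize are our assumptions:

```python
# Sketch of the candidate filter (height > width, ratio in [1.9, 4],
# height > 33 px, width > 12 px) and the resize to 33x12 = 396 inputs.

def is_candidate(height, width):
    """Geometric constraints a blob must satisfy before classification."""
    return (height > width
            and height > 33 and width > 12
            and 1.9 <= height / width <= 4)

def resize_nearest(img, out_h=33, out_w=12):
    """Nearest-neighbour resize of a binary image to 33x12 pixels."""
    in_h, in_w = len(img), len(img[0])
    return [[img[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]

assert is_candidate(100, 40)      # ratio 2.5 -> accepted
assert not is_candidate(100, 60)  # ratio ~1.7 -> rejected
small = resize_nearest([[1] * 24 for _ in range(66)])
assert len(small) * len(small[0]) == 396
```

The flattened 396-element vector of such a resized silhouette is what the 396-6-1 network receives.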

3.2 Facial Expression classification

We have realized two NNs that work in parallel: the first one receives the shape of the mouth (in happy expressions the mouth should be open, the teeth should be visible and its shape is curved: AU-12, AU-13); the second one receives the nose (in disgust expressions the nasolabial furrows are visible: AU-10).

Fig. 2. Segmentation and vectorization of the face.

Each gray-scale bitmap image is a band of 40x80 pixels containing, respectively, the lower or the middle part of the face; to use it as input for the neural networks, the pixels have been arranged in an array and then normalized, obtaining a 1x50 vector (a function computes a mean value for each 8x8 block of pixels). In the case of non-happy and non-disgust expressions the network returns 0 (zero); otherwise the network returns 1. To train the NN for the mouth, we used a training set of 200 photos, composed of 100 negative and 100 positive examples, over 20000 epochs. This NN has a first layer of 300 neurons, a second layer of 200 neurons, a third layer of 10 neurons and 1 output neuron (300x200x10x1). To train the NN for the nose, we used a training set of 100 examples, composed of 50 negative and 50 positive examples, over 20000 epochs. This NN has a first layer of 400 neurons, a second layer of 80 neurons, a third layer of 10 neurons and 1 output neuron (400x80x10x1).
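The 8x8 block-averaging that reduces the 40x80 band to a 1x50 vector (50 = (40/8) * (80/8)) can be sketched as follows; the function name is our assumption:

```python
# Sketch of the vectorization step: average every non-overlapping 8x8
# block of the 40x80 gray-scale band, producing a 50-element vector.

def block_means(img, block=8):
    """Mean of every non-overlapping block x block tile, row by row."""
    h, w = len(img), len(img[0])
    out = []
    for by in range(0, h, block):
        for bx in range(0, w, block):
            tile = [img[y][x]
                    for y in range(by, by + block)
                    for x in range(bx, bx + block)]
            out.append(sum(tile) / len(tile))
    return out

band = [[y % 256 for x in range(40)] for y in range(80)]  # 80 rows x 40 cols
vec = block_means(band)
assert len(vec) == 50
```

The resulting vector would then be normalized before being fed to the mouth or nose network.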

Fig. 3. Mouths and noses from our tool (the first four images) and from public databases.



3.3 Gestures

For gesture analysis, the supervised classifier is implemented by means of the J48 algorithm instead of an EBP NN classifier. Rule induction systems are currently employed in several different environments, ranging from loan request evaluation to fraud detection, bioinformatics and medicine [5]. In particular, the main goal of this scheme is to minimize the number of tree levels and tree nodes, thereby maximizing data generalization. The input is a 10-element array whose features are: the x and y coordinates of the right hand CG; the x and y coordinates of the left hand CG; the positions of the right and left hands (with respect to the torso of the subject being shot); the slants of the right and left hands (measured in radians); and the velocities of the movements of the right and left hands. To find the CGs, frames have been processed according to the following workflow: skin detection by colour-space conversion from RGB to HSV; background subtraction to isolate only the hand regions; image smoothing and binarization; tracing of the rectangles that contain the hands; CG identification; edge and feature detection; template matching to detect the resting position of the hands; gesture classification; and storing the data in a .csv file.
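The assembly of the 10-element feature vector and its export to the .csv file can be sketched as below; the field names and the ordering are our assumptions, not the paper's file format:

```python
# Sketch of building one 10-element gesture feature row and appending it
# to a CSV (field names are illustrative assumptions).
import csv
import io
import math

FIELDS = ["rx", "ry", "lx", "ly", "r_pos", "l_pos",
          "r_slant", "l_slant", "r_speed", "l_speed"]

def feature_row(r_cg, l_cg, r_pos, l_pos, r_slant, l_slant,
                r_speed, l_speed):
    """Flatten the per-frame hand measurements into the 10-element array."""
    return [r_cg[0], r_cg[1], l_cg[0], l_cg[1],
            r_pos, l_pos, r_slant, l_slant, r_speed, l_speed]

buf = io.StringIO()  # stands in for the .csv file on disk
writer = csv.writer(buf)
writer.writerow(FIELDS)
writer.writerow(feature_row((120, 200), (500, 210), "torso", "torso",
                            math.pi / 4, -math.pi / 4, 12.5, 3.0))
```

Each such row becomes one training or test instance for the J48 tree.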

4 Experimental results

In this paper we have presented a system that separately recognizes shapes and two of the six primary emotions, and analyzes information derived from gestures. The complete project aims to recognize all the primary emotions. In the following we show, separately, the results related to facial expressions, gestures and silhouettes. Regarding facial processing, using about 150 test images, the NNs achieved success rates of about 90% for the happy/non-happy emotion and 85% for the disgust/non-disgust emotion. We can assert that the results are reliable, also because in some particular cases not even human beings can distinguish emotions exactly. Regarding gestures, the confusion matrix is shown in Table 1. The classifier correctly labelled approximately 80% of the gestures, and was specifically able to label "metaphoric gestures" precisely. Performance is not optimal for the recognition of "deictic gestures" and "beat gestures". "Iconic gestures" are not present in the pre-recorded video.

Table 1. Confusion matrix of the gestures data set. Deictic (A); Spontaneous (B); Beat (C); not recognized (D); Metaphoric (E).

       A     B     C     D     E
A     53    71     3     0     0
B      6   280    10     3     1
C      1     9    54     1     0
D      1     2     1    20     0
E      0     1     1     0    25
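As a quick sanity check, the overall accuracy can be recomputed from Table 1 in Python (note that the row totals sum to 543, the size of the gesture dataset):

```python
# Recompute the overall gesture accuracy from Table 1: correct
# classifications lie on the diagonal of the confusion matrix.

matrix = {            # rows: true class A..E, columns: predicted A..E
    "A": [53, 71, 3, 0, 0],
    "B": [6, 280, 10, 3, 1],
    "C": [1, 9, 54, 1, 0],
    "D": [1, 2, 1, 20, 0],
    "E": [0, 1, 1, 0, 25],
}
total = sum(sum(row) for row in matrix.values())
correct = sum(row[i] for i, row in enumerate(matrix.values()))
print(total, correct, round(100 * correct / total, 1))  # 543 432 79.6
```

432 of 543 samples lie on the diagonal, i.e. 79.6%, consistent with the "approximately 80%" figure reported above.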

On average, the neural network shows good results in terms of false positives; the following figure reports the detected vs. total humans for each tested frame.


Fig. 4. Detected vs. total number of humans per tested frame: 4/5, 4/6, 4/6, 3/4, 2/5 and 2/5 detected/total humans in the six frames shown.

5 Conclusions

The goal of this paper was to investigate emotion-related patterns and to realize a system that separately recognizes emotional patterns of the body and face using neural networks. The research aims at developing an intelligent system that can interpret conversation between human beings. When we interact with others, we continuously give and receive countless wordless signals. The non-verbal signals we send either produce a sense of interest, trust and desire for connection, or they generate disinterest, distrust and confusion. The analyzed gestures and facial emotions represent non-verbal communication: they provide the user with cues about what the speaker is saying, thus helping the listener to interpret the meaning of words. Future work foresees the design of a new multimodal system performing, at the same time, emotion recognition by means of other facial bands (the eyebrow and eye bands), gesture recognition and human silhouette detection.

References

[1] Paul Ekman, FACS: Facial Action Coding System, Research Nexus division of Network Information Research Corporation, Salt Lake City, UT 84107 (2002).

[2] V. Bevilacqua, D. D'Ambruoso, G. Mandolino, M. Suma, A new tool to support diagnosis of neurological disorders by means of facial expressions, IEEE Proc. of MeMeA, pp. 544-549.

[3] http://www.emotional-face.org

[4] http://mcneilllab.uchicago.edu

[5] F. Menolascina, V. Bevilacqua et al., Novel Data Mining Techniques in aCGH based Breast Cancer Subtypes Profiling: the Biological Perspective, Proc. of IEEE Symp. on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2007), pp. 9-16.