CHAPTER 1
INTRODUCTION
1.1 A Brief Description
Virtual learning is growing day by day, and good Human Computer Interaction is necessary to make virtual learning a better experience. The emotions of a person play a major role in the learning process. Hence the proposed work detects a person's emotions from his or her facial expressions.

For a facial expression to be detected, the face location and area must be known; therefore most emotion detection algorithms start with face detection, taking into account the fact that emotions are depicted mostly by the mouth. Algorithms for eye and mouth detection and tracking are then necessary in order to provide the features for subsequent emotion recognition. In this project we propose a detection system for natural emotion recognition.
1.2 Need For Face Detection
Human activity analysis is a major concern in a wide variety of applications such as video surveillance, human computer interfaces, face recognition and face database management. Most face recognition algorithms assume that the face location is known. Similarly, face-tracking algorithms often assume that the initial face location is known. An efficient face detection algorithm is therefore needed to improve the efficiency of face recognition systems.
1.3 Need For Emotion Detection
Human beings communicate through facial emotions in day-to-day interactions with others. Perceiving the emotions of a fellow human is natural and inherently accurate for people. Humans can express their inner state of mind through emotions, and many times an emotion indicates that a person needs help. Enabling computers to recognise emotions is therefore an important research topic in Human Computer Interaction (HCI). Such an interface can be a welcome aid for the physically disabled, for those who are unable to express their requirements by voice or by other means, and especially for those who are confined to bed. Human emotion can be detected through facial actions or through biosensors. Facial actions are imaged through still or video cameras. From still images taken at discrete times, the changes in the eye and mouth areas can be exposed; measuring and analysing such changes leads to the determination of human emotions.
1.4 Existing Face Detection Approaches
1.4.1 Feature Invariant Methods
These methods aim to find structural features that exist even when
the pose, viewpoint, or lighting conditions vary, and then use these to
locate faces. These methods are designed mainly for face localization.
Texture
Human faces have a distinct texture that can be used to separate them from other objects. The textures are computed using second-order statistical features on sub-images of 16×16 pixels. Three types of features are considered: skin, hair, and others. To infer the presence of a face from the texture labels, the votes of occurrence of hair and skin textures are used. Colour information can also be incorporated into the face-texture model, giving a scanning scheme for face detection in colour scenes in which the orange-like parts, including the face areas, are enhanced. One advantage of this approach is that it can detect faces which are not upright or which have features such as beards and glasses.
Skin Colour
Human skin colour has been used and proven to be an effective feature in many applications, from face detection to hand tracking. Although skin colour differs from person to person, several studies have shown that the major difference lies largely in intensity rather than in chrominance. Several colour spaces have been utilized to label pixels as skin, including RGB, normalized RGB, HSV, YCbCr, YIQ, YES, CIE XYZ and CIE LUV.
1.4.2 Template Matching Methods
In template matching, a standard face pattern is manually predefined or parameterized by a function. Given an input image, the correlation values with the standard patterns are computed independently for the face contour, eyes, nose, and mouth. The existence of a face is determined based on these correlation values. This approach has the advantage of being simple to implement. However, it has proven inadequate for face detection on its own, since it cannot effectively deal with variation in scale, pose, and shape. Multiresolution, multiscale, sub-template, and deformable-template methods have subsequently been proposed to achieve scale and shape invariance.
Predefined Face Template
In this approach several sub-templates for the nose, eyes, mouth and face contour are used to model a face. Each sub-template is defined in terms of line segments. Lines in the input image are extracted based on the greatest gradient change and then matched against the sub-templates. The correlations between sub-images and contour templates are computed first to detect candidate face locations. Then, matching with the other sub-templates is performed at the candidate positions. In other words, the first phase determines the focus of attention or region of interest and the second phase examines the details to determine the existence of a face.
1.4.3 Appearance Based Methods
In appearance-based methods the templates are learned from examples in images. In general, appearance-based methods rely on techniques from statistical analysis and machine learning to find the relevant characteristics of face and non-face images. The learned characteristics take the form of distribution models that are subsequently used for face detection.
1.5 Existing Emotion Detection Approaches
1.5.1 Genetic Algorithm
The eye feature plays a vital role in classifying facial emotion using a Genetic Algorithm. The acquired images must go through a few pre-processing steps such as grayscale conversion, histogram equalization and filtering. The Genetic Algorithm methodology then estimates the emotion from the eye feature alone. Observation of various emotions leads to a unique characteristic of the eye: the eye exhibits an ellipse of different parameters in each emotion. The Genetic Algorithm is adopted to optimize the ellipse characteristics of the eye features. The processing time of the Genetic Algorithm varies for each emotion.
1.5.2 Neural Network
Neural networks have found profound success in the area of pattern recognition. By repeatedly showing a neural network inputs classified into groups, the network can be trained to discern the criteria used to classify them, and it can do so in a generalized manner, allowing successful classification of new inputs not used during training. With the explosion of research on emotions in recent years, the application of pattern recognition technology to emotion detection has become increasingly interesting. Emotion has become an important interface for communication between human and machine, and it plays a basic role in rational decision-making, learning, perception, and various cognitive tasks.

Human emotion can be detected from physiological measurements or from facial expression. Since humans engage the same facial muscles when expressing a particular emotion, the emotion can be quantified. Primary emotions such as anger, disgust, fear, happiness, sadness and surprise can be classified using a neural network.
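As a rough illustration of this idea, the sketch below trains a small feedforward pattern-recognition network in MATLAB to map facial feature vectors to the six primary emotions; the feature matrix, target labels and network size are placeholder assumptions, not part of the proposed system.

    % Hypothetical data: 17 facial features per sample, 200 training samples.
    X = rand(17, 200);                      % feature vectors (placeholder)
    labels = randi(6, 1, 200);              % emotion class 1..6 (placeholder)
    T = zeros(6, 200);                      % one-hot target matrix
    T(sub2ind(size(T), labels, 1:200)) = 1;
    net = patternnet(10);                   % one hidden layer of 10 neurons (assumed)
    net = train(net, X, T);                 % backpropagation training
    emotion = vec2ind(net(X(:, 1)));        % predicted class of a sample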
1.5.3 Feature Point Extraction
Template Matching
An interesting approach to the problem of automatic facial feature extraction is a technique based on the use of template prototypes, portrayed in 2-D grayscale format. The technique is, to some extent, easy to use, but also effective. It uses correlation as the basic tool for comparing the template with the part of the image that we wish to recognize. An interesting question that arises is the behaviour of template matching recognition at different resolutions, which can be studied through multi-resolution representations using Gaussian pyramids. Experiments have shown that very high resolutions are not needed for template matching recognition; for example, templates of 36×36 pixels proved sufficient. This shows that template matching is not as computationally complex as might originally be imagined.
The face detection algorithm used here starts by scanning the given image with the SSR filter and locating the face candidates. It then assembles candidates that are close to each other using connected components, so that fewer candidates need to be treated, which means less processing time in this real-time application. The centre of each cluster is taken and a template is extracted around it; the template is passed to a Support Vector Machine, which tells us whether it is a face or not. If it is, the eyes and then the nose are located.
Face detection techniques are of two categories:
1. Feature-based approach
2. Image-based approach.
Template matching provides the basis for the human face detection system described here.
1. Feature Based Technique:
Techniques in the first category make use of apparent properties of the face such as face geometry, skin colour, and motion. Although feature-based techniques can achieve high speed in face detection, they suffer from poor reliability under varying lighting conditions.
2. Image Based Technique:
The image-based approach takes advantage of recent advances in pattern recognition theory. Most image-based approaches apply a window scanning technique for detecting faces, which requires large computation.
To achieve a high-speed and reliable face detection system, we propose a method which combines both the feature-based and the image-based approach using the SSR filter.
1.5.4 Template Matching
Template matching is a technique in digital image processing for finding small parts of an image which match a template image, or for detecting edges in images. The basic method of template matching uses a convolution mask (template) tailored to a specific feature of the search image which we want to detect. The technique can easily be performed on grey images or edge images. The convolution output will be highest at places where the image structure matches the mask structure, that is, where large image values are multiplied by large mask values.
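A minimal MATLAB sketch of template matching by normalised cross-correlation, a robust variant of the masked correlation described above, is given below; the image and template file names are placeholders.

    scene    = rgb2gray(imread('scene.jpg'));         % search image (placeholder)
    template = rgb2gray(imread('eye_template.jpg'));  % feature template (placeholder)
    c = normxcorr2(template, scene);                  % correlation peaks where structures match
    [ypeak, xpeak] = find(c == max(c(:)), 1);         % strongest response
    top  = ypeak - size(template, 1) + 1;             % top-left corner of the match
    left = xpeak - size(template, 2) + 1;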
Eyes and Nose Detection using the SSR Filter
A real-time face detection algorithm uses a Six-Segmented Rectangular (SSR) filter for eye and nose detection. The SSR filter is a six-segmented rectangle, as illustrated in Figure 1.1.

Figure 1.1 SSR Filter

At the beginning, a rectangle is scanned throughout the input image. This rectangle is segmented into six segments as shown in Figure 1.1. The SSR filter is used to detect the Between-the-Eyes point based on two characteristics of face geometry.

BTE - Between The Eyes
The detection of the BTE is based on the image characteristics of that area of the face. The intensity of the BTE image closely resembles a hyperbolic surface, as shown in Figure 1.2; the BTE is the saddle point on that surface. A rotationally invariant filter can thus be devised for detecting the BTE area.
Figure 1.2 Determination of BTE
The nose tip search area is usually calculated to be 2/3 of the value of L, as shown in Figure 1.3, where L is the approximate distance between the two eyes and from eye level to the nose.

Figure 1.3 Nose Tip Search Area Relative to Eyes

The common BTE area on a human face resembles a hyperbolic surface. The proposed work uses this hyperbolic model to describe the BTE region; the centre of the BTE is thus the saddle point on the surface.
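The sketch below evaluates the SSR test at a single window position using an integral image for constant-time box sums; the window position, segment size and the exact brightness comparisons are assumptions for illustration.

    I  = double(rgb2gray(imread('scene.jpg')));       % placeholder input
    ii = padarray(cumsum(cumsum(I, 1), 2), [1 1], 0, 'pre');   % integral image
    boxsum = @(r1, c1, r2, c2) ii(r2+1, c2+1) - ii(r1, c2+1) - ii(r2+1, c1) + ii(r1, c1);
    r = 40; c = 60;                                   % assumed window top-left corner
    segH = 7; segW = 8;                               % assumed size of each of the 2x3 segments
    S = zeros(2, 3);                                  % segment sums: S(1,:) top row, S(2,:) bottom row
    for row = 0:1
        for col = 0:2
            S(row+1, col+1) = boxsum(r + row*segH + 1, c + col*segW + 1, ...
                                     r + (row+1)*segH, c + (col+1)*segW);
        end
    end
    % Candidate BTE (assumed criterion): the bridge segment is brighter than the
    % eye segments beside it, and the eyes are darker than the cheeks below them.
    isBTE = S(1,2) > S(1,1) && S(1,2) > S(1,3) && S(1,1) < S(2,1) && S(1,3) < S(2,3);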
Blobs
Blobs provide a complementary description of image structures in terms of regions, as opposed to corners, which are more point-like. Nevertheless, blob descriptors often contain a preferred point (a local maximum of an operator response or a centre of gravity), which means that many blob detectors may also be regarded as interest point operators. Blob detectors can detect areas in an image which are too smooth to be detected by a corner detector.
Gabor Filtering
Gabor filtering can be used in a facial recognition system. The neighbourhood of a pixel may be described by the responses of a group of Gabor filters at different frequencies and orientations, all referenced to that pixel. In that way, a feature vector containing the responses of those filters may be formed.
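For illustration, one Gabor kernel can be built by hand and applied as below; repeating this over several wavelengths and orientations and stacking the responses at a pixel yields the feature vector described above. The kernel parameters are illustrative assumptions.

    I = im2double(rgb2gray(imread('face.jpg')));      % placeholder input
    theta = pi/4; lambda = 8; sigma = 4;              % orientation, wavelength, spread (assumed)
    [x, y] = meshgrid(-10:10, -10:10);                % 21x21 kernel support
    xr =  x*cos(theta) + y*sin(theta);                % rotate coordinates
    yr = -x*sin(theta) + y*cos(theta);
    g = exp(-(xr.^2 + yr.^2) / (2*sigma^2)) .* cos(2*pi*xr/lambda);   % Gabor kernel
    response = imfilter(I, g, 'symmetric');           % filter response at every pixel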
Automated Facial Feature Extraction
In this approach, as far as frontal images are concerned, the automated localization of the predetermined points is based on two steps: the hierarchic and reliable selection of specific blocks of the image, and subsequently the use of a standardized procedure for the detection of the required benchmark points. For the former process to be successful, a secure method of approach is needed: the detection of a block describing a facial feature relies on a previously, effectively detected feature. Following this reasoning, the most significant characteristic - the ground of the cascade routine - has to be chosen. The importance of each of the commonly used facial features for face recognition has already been studied by other researchers; surveys have shown the eyes to be the most dependable and easily located of all facial features, and as such they were used. The techniques that were developed and tried separately utilize a combination of template matching and Gabor filtering.
The Hybrid Method
The basic search for the desired feature blocks is performed by a simple template matching procedure. Each feature prototype is selected from one of the frontal images of the face base. The comparison criterion used is the maximum correlation coefficient between the prototype and the repeatedly examined blocks of a carefully restricted area of the face.

In order for the search area to be incisively and functionally limited, knowledge of human face physiology has been applied, without hindering the satisfactory performance of the algorithm in cases of small violations of the initial limitations. However, the final block selection by this method alone has not always been successful, so a measure of reliability was needed. For that reason, Gabor filtering was deemed a suitable tool: as can be deduced mathematically from the filter's form, it ensures simultaneous optimum localization in the spatial domain as well as in the frequency domain.

The filter is applied both on the localized area and on the template at four different spatial frequencies. Its response is regarded as valid only if its amplitude exceeds a saliency threshold. The area with minimum phase distance from its template is considered the most reliably traced block.
1.5.5 Preprocessing and Postprocessing of Images
The Image Processing Toolbox provides reference-standard algorithms for pre-processing and post-processing tasks that solve frequent system problems, such as interfering noise, low dynamic range, out-of-focus optics, and differences in colour representation between input and output devices. Region-of-interest tools can be used to select items in the original image and create a mask.

The image enhancement techniques in the Image Processing Toolbox enable the user to increase the signal-to-noise ratio and accentuate image features by modifying the colours or intensities of an image. We can:
• Perform histogram equalization
• Perform decorrelation stretching
• Remap the dynamic range
• Adjust the gamma value
• Perform linear, median or adaptive filtering.
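The listed operations map onto Image Processing Toolbox calls roughly as follows; the input file name and filter parameters are placeholders.

    I = imread('input.jpg');                % placeholder input image
    g = rgb2gray(I);
    h  = histeq(g);                         % histogram equalization
    d  = decorrstretch(I);                  % decorrelation stretching
    r  = imadjust(g);                       % remap the dynamic range
    gm = imadjust(g, [], [], 0.8);          % adjust the gamma value (assumed gamma)
    m  = medfilt2(g);                       % median filtering
    w  = wiener2(g, [5 5]);                 % adaptive (Wiener) filtering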
1.5.6 Typical Tasks of Computer Vision
Each of the application areas of computer vision systems employs a range of computer vision tasks: more or less well-defined measurement or processing problems which can be solved using a variety of methods. Some examples of typical computer vision tasks are presented below.
Recognition
The classical problem in computer vision, image processing and machine vision is that of determining whether or not the image data contains some specific object, feature, or activity. This task can normally be solved robustly and without effort by a human, but it is still not satisfactorily solved in computer vision for the general case: arbitrary objects in arbitrary situations. The existing methods for dealing with this problem can at best solve it only for specific objects, such as simple geometric objects (e.g., polyhedrons), human faces, printed or hand-written characters, or vehicles, and in specific situations, typically described in terms of well-defined illumination, background, and pose of the object relative to the camera.
Different varieties of the recognition problem are described in the literature:

Recognition: one or several pre-specified or learned objects or object classes can be recognized, usually together with their 2D positions in the image or 3D poses in the scene.

Identification: an individual instance of an object is recognized. Examples: identification of a specific person's face or fingerprint, or identification of a specific vehicle.

Detection: based on relatively simple and fast computations, detection is sometimes used for finding smaller regions of interesting image data.
CHAPTER 2
LITERATURE SURVEY
Jarkiewicz et al [1] propose an emotion detection system where analysis is done using a Haar-like detector and face detection is done using a hybrid approach. The technique proposed is to localize seventeen characteristic points on the face; based on their displacements, certain emotions can be recognized automatically. The present work improves on this method in its feature extraction technique.
A face detection algorithm for colour images is proposed by Zhao et al [2]. This work is based on an adaptive threshold and a chroma chart that gives the probability of skin colours. By identifying the skin region, the facial part of the image can be located. This technique, when used with the feature extraction technique, yields better results.
Maglogiannis et al [3] present an integrated system for emotion detection. The system uses colour images and is composed of three modules. The first module implements skin detection, using Markov random fields for image segmentation and face detection. A second module is responsible for eye and mouth detection and extraction; this module uses the HSV colour space of the specified eye and mouth regions. The third module detects the emotions pictured in the eyes and mouth, using edge detection and measuring the gradient of the eye and mouth regions.
A detailed experimental study of face detection algorithms based on skin colour has been made by Singh et al [4]. Three colour spaces, RGB, YCbCr and HSI, are of main concern. The algorithms for these three colour spaces have been compared and then combined to produce a new skin-colour-based face detection algorithm which gives higher accuracy.

A survey by Yang et al [5] categorizes and evaluates the various face detection algorithms. Other relevant issues such as benchmarking, data collection and evaluation techniques are also discussed. The algorithms have been analysed and their limitations identified.
The Eigenface method [6], which uses principal components analysis for dimensionality reduction, yields projection directions that maximize the total scatter across all classes, i.e., across all images of all faces. In choosing the projection which maximizes total scatter, principal components analysis retains unwanted variations due to lighting and facial expression. The Eigenface method is based on linearly projecting the image space to a low-dimensional feature space.
The Bunch Graph technique [7] has been fairly reliable for determining facial attributes from single images, such as gender or the presence of glasses or a beard. If this technique were developed to extract independent and stable personal attributes, such as age, race or gender, recognition from large databases could be improved and sped up considerably by preselecting corresponding sectors of the database. Available image deblurring algorithms include blind, Lucy-Richardson, Wiener and regularized filter deconvolution, as well as conversions between point spread and optical transfer functions.
The Fisherfaces method [8], a derivative of Fisher's Linear Discriminant (FLD), maximizes the ratio of between-class scatter to within-class scatter and appears to be the best at extrapolating and interpolating over variation in lighting, although the Linear Subspace method is a close second. Like the Eigenface method, it is based on linearly projecting the image space to a low-dimensional feature space; however, the Eigenface method's principal components analysis yields projection directions that maximize only the total scatter.
Cheng-Chin Chiang et al [9] present a real-time face detection algorithm for locating faces in images and videos. This algorithm finds not only the face regions, but also the precise locations of facial components such as the eyes and lips. The algorithm starts from the extraction of skin pixels based upon rules derived from a simple quadratic polynomial model. Interestingly, with a minor modification, this polynomial model is also applicable to the extraction of lips. The benefits of applying these two similar polynomial models are twofold. First, much computation time is saved. Second, both extraction processes can be performed simultaneously in one scan of the image or video frame. The eye components are then extracted after the extraction of skin pixels and lips. Afterwards, the algorithm removes falsely extracted components by verifying them with rules derived from the spatial and geometrical relationships of facial components. Finally, the precise face regions are determined accordingly. According to the experimental results, the proposed algorithm exhibits satisfactory performance in terms of both accuracy and speed for detecting faces with wide variations in size, scale, orientation, colour, and expression.
Hironori Yamauchi [9] proposed bio-security using face recognition for industrial use, noting that current face recognition systems often use either SVM or AdaBoost techniques for the face detection part and PCA for the face recognition part.
In "Robust real-time face tracking for the analysis of human behaviour", Damien Douxchamps and Nick Campbell [10] presented a real-time system for face detection, tracking and characterization from omnidirectional video. Viola-Jones is used as the basis for face detection, and various filters are then applied to eliminate false positives. Gaps between two detections of a face by the Viola-Jones algorithm are filled using colour-based tracking.
Shinjiro Kawato and Nobuji Tetsutani [11] proposed scale-adaptive face detection and tracking for detecting and tracking faces in video sequences in real time. It can be applied to a wide range of face scales. Fast extraction of face candidates is done with a Six-Segmented Rectangular (SSR) filter, and face verification by a support vector machine.
Oraya Sawettanusorn et al [12] proposed a real-time face detection algorithm using the Six-Segmented Rectangular (SSR) filter, distance information, and a template matching technique. Between-the-Eyes is selected as the face representative because its characteristics are common to most people and it is easily seen over a wide range of face orientations. The image is scanned and divided into six segments throughout the face image.
Research by Li Zhang et al [13] concentrates on intelligent neural-network-based facial emotion recognition and Latent Semantic Analysis based topic detection for a humanoid robot. The work first incorporates the Facial Action Coding System, describing physical cues and anatomical knowledge of facial behaviour, for the detection of the neutral state and six basic emotions from real-time posed facial expressions. Feedforward neural networks (NN) are used to implement upper and lower facial Action Unit (AU) analyzers to recognize six upper and eleven lower facial actions, including Inner and Outer Brow Raiser, Lid Tightener, Lip Corner Puller, Upper Lip Raiser, Nose Wrinkler, Mouth Stretch, etc. An artificial-neural-network-based facial emotion recognizer then accepts the derived seventeen Action Units as inputs to decode the neutral state and six basic emotions from facial expressions. Moreover, in order to let the robot make appropriate responses based on the detected affective facial behaviours, Latent Semantic Analysis is used to focus on the underlying semantic structure of the data and go beyond linguistic restrictions to identify topics embedded in the users' conversations. The overall development is integrated with a modern humanoid robot platform under its Linux C++ SDKs. The work presented shows great potential for developing personalized intelligent agents/robots with emotional and social intelligence.
CHAPTER 3
PROBLEM DEFINITION
The aim of this project is to detect the human facial emotions of happiness, sadness and surprise. This is done by first detecting the face in an image using a skin colour detection technique, followed by image segmentation and feature extraction, in which the eye and mouth regions are extracted. The emotions are then detected from the eye and mouth variations. From the position of the eyes, emotions are detected: if the person is happy or sad the eyes will be open, and when a person is surprised the eyes will be wide open. Similarly, for the lips the shape and colour properties are important. Depending on the shape of the lips, emotions are detected; for example, lips that are closed and curved upwards indicate happiness, while open lips indicate surprise. Therefore, based on facial features such as the eyes and mouth, emotions are detected and recognized.
CHAPTER 4
FACIAL EMOTION DETECTION AND RECOGNITION
4.1 Overview of the Algorithm
Our project proposes an emotion detection system in which the facial emotions happy, sad and surprised are detected. First the face is detected in an image using the skin colour model. This is followed by extraction of features such as the eyes and mouth, which are used in further processing to detect the emotion. For detecting the emotion we take into account the fact that emotions are chiefly represented by mouth expressions, and use the shape and colour properties of the lips.
4.1.1 Video Fragmentation
The input video of an e-learning student is acquired using an image acquisition device and stored in a database. This video is extracted and fragmented into several frames in order to detect the emotions of the e-learning student and thereby improve the virtual learning environment. With the video acquisition feature, which is used to record and register the ongoing emotional changes of the e-learning student, the resulting emotions are detected by mapping the changes in the eye and lip regions. The videos are recorded into a database before processing, making it possible to analyse the changes of emotion for a particular subject or during a particular time of the day.
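A minimal MATLAB sketch of the fragmentation step is shown below; the video file name, frame interval and output naming are assumptions for illustration.

    vid  = VideoReader('student_session.avi');   % recorded e-learning video (placeholder)
    step = 5;                                    % keep every 5th frame (assumed interval)
    for k = 1:step:vid.NumberOfFrames
        frame = read(vid, k);                    % extract frame k as an RGB image
        imwrite(frame, sprintf('frame_%04d.png', k));   % store frame for emotion analysis
    end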
Frame rate and motion blur are important aspects of video quality. Motion blur is a natural effect when filming the world in discrete time intervals. When a film is recorded at 25 frames per second, each frame has an exposure time of up to 40 milliseconds (1/25 second). All the changes in the scene over that entire 40 milliseconds will blend into the final frame. Without motion blur, animation appears to jump and does not look fluid. When the frame rate of a movie is too low, the mind is no longer convinced that the contents of the movie are continuous, and the movie appears to jump (also called strobing).
The human eye and its brain interface, the human visual system, can process 10 to 12 separate images per second, perceiving them individually. The threshold of perception is more complex, however, with different stimuli having different thresholds: the average shortest noticeable dark period, such as the flicker of a cathode ray tube monitor or fluorescent lamp, is 16 milliseconds, while a single-millisecond visual stimulus may have a perceived duration between 100 ms and 400 ms due to persistence of vision in the visual cortex. This may cause images perceived in this duration to appear as one stimulus, such as a 10 ms green flash of light immediately followed by a 10 ms red flash of light being perceived as a single yellow flash.
4.1.2 Face Detection
The first step in face detection is to build a skin colour model. After the skin colour model is produced, the test image is skin-segmented into a binary image and the face is detected. The result of face detection is processed by a decision function based on the chroma components (Cr and Cb from YCbCr, and Hue from HSV). Before the result is passed to the next module, it is cropped according to the skin mask; small background areas which could lead to errors during the next stages are deleted.
A model image of face detection with the bounding box is
illustrated below in Figure 4.1.
Figure 4.1 Face Detection
4.1.3 Feature Extraction
After the face has been detected, the next step is feature extraction, where the eyes and mouth are extracted from the detected face. Eye extraction is done by creating two eye maps, a chrominance eye map and a luminance eye map. The two maps are then combined to locate the eyes in a face image, as shown in Figure 4.2.
Figure 4.2 Feature Detection
To locate the mouth region, we use the fact that it contains stronger red components and weaker blue components than other facial regions (Cr > Cb), and thus a mouth map is constructed, based on which the mouth region is extracted. Finally, the eyes and mouth extracted from the face image according to the maps are passed on to the next module of our algorithm.
4.1.4 Emotion Detection
The last module is emotion detection. This module makes use of the fact that emotions are expressed chiefly with the help of eye and mouth expressions, as shown in Figure 4.3. Emotion detection from lip images is based on the colour and shape properties of human lips. Given a binary lip image, shape detection can be performed; thus, depending on the shape of the lips and other morphological properties, the emotions are detected. Teaching a computer to interpret human emotions based on lip patterns, as explored in research published in the International Journal of Artificial Intelligence and Soft Computing, could improve the way we interact with computers and perhaps allow disabled people to use computer-based communication devices, such as voice synthesizers, more effectively and more efficiently.
Figure 4.3 Emotion Detection
4.2 Architectural Design
The architectural diagram shows the overall working of the system: a captured colour image sample is taken as the input, processed using image processing tools, and analysed to locate facial features such as the eyes and mouth, which are further processed to recognize the emotion of the person. After the localization of the facial features, the next step is to localize the characteristic points on the face. This is followed by the feature extraction process, in which features such as the eyes and mouth are extracted.

Based on the variations of the eyes and mouth, the emotion of a person is detected and recognized. For a person who is happy, the eyes will be open and the lips will be closed and curved upwards, whereas for a person who is sad, the eyes will be open and the lips will be closed and curved downwards. Similarly, for a person who is surprised, the eyes will be wide open, there will be a considerable displacement of the eyebrows from the eyes, and the mouth will be wide open. Based on these measures the mood exhibited by a person is detected and recognized.
Figure 4.4 shows the overall working of the system, where the input is the image and the output is the recognized emotion: happy, sad or surprised.
Figure 4.4 – Architectural Diagram
CHAPTER 5
REQUIREMENT ANALYSIS
The Software Requirements Specification is based on the problem definition. Ideally, the requirement specification states the "what" of the software product without implying the "how": it is the software design that is concerned with specifying how the product will provide the required features.
5.1 Product Requirements
5.1.1 Input Requirements
The input for this work is the video of an e-learning student, which
may contain the human face.
5.1.2 Output Requirements
The output is the detected facial emotion such as happy, sad, and
surprised.
5.2 Resource Requirements
The hardware configuration requirements are shown in Table 5.1 and the software configuration required to run this software is shown in Table 5.2.

5.2.1 Hardware Requirements

Table 5.1 – Hardware Requirements

S.No   Feature        Configuration
1      CPU            Intel Core 2 Duo processor
2      Main memory    1 GB RAM
3      Hard disk      60 GB disk size

The configuration in Table 5.1 is the minimum hardware requirement for the proposed system.
5.2.2 Software Requirements

Table 5.2 – Software Requirements

S.No   Software    Version
1      Windows     7
2      Matlab      R2012a
3      Picasa      3

The proposed system is executed using Windows 7, Matlab R2012a and Picasa 3, as shown in Table 5.2.
CHAPTER 6
DEVELOPMENT PROCESS AND DOCUMENTATION
6.1 Face Detection
Face detection is used in biometrics, often as a part of or together with
a facial recognition system. It is also used in video surveillance, human
computer interface and image database management. Some recent digital
cameras use face detection for autofocus. Face detection is also useful
for selecting regions of interest in photo slideshows that use a pan-and-
scale Ken Burns effect.
Face detection can be regarded as a specific case of object-class
detection. In object-class detection, the task is to find the locations and
sizes of all objects in an image that belong to a given class. Examples
include upper torsos, pedestrians, and cars.
Face detection can be regarded as a more general case of face
localization. In face localization, the task is to find the locations and
sizes of a known number of faces. In face detection, one does not have
this additional information.
6.1.1 Sample Collection
Sample skin-coloured pixels are collected from images of people belonging to different races. Each pixel is carefully chosen from the images so that regions not belonging to skin colour are not included.
6.1.2 Chroma Chart Preparation
The chroma chart shown in Figure 6.1 is the distribution of the skin colour of different people over the chromatic colour space.

Figure 6.1 – Chroma Chart Diagram

Here the chromatic colour is taken in the (Cb, Cr) colour space. Normally images are stored in the (R, G, B) format, so a conversion to the YCbCr colour space is needed. The collected sample pixel values are converted from the (R, G, B) colour space to the YCbCr colour space and a chart is drawn by taking Cb along the x-axis and Cr along the y-axis. The resulting chart shows the distribution of the skin colour of different people. The intensity (Y) component is not considered because it has very little effect on the chrominance variation.
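The chart can be reproduced with a sketch such as the following, which accumulates a two-dimensional (Cb, Cr) histogram over manually collected skin samples; the sample file names are placeholders.

    counts = zeros(256, 256);                    % (Cb, Cr) occurrence counts
    files = {'skin1.png', 'skin2.png'};          % placeholder skin sample images
    for i = 1:numel(files)
        ycc = rgb2ycbcr(imread(files{i}));       % convert RGB samples to YCbCr
        cb = double(ycc(:,:,2)) + 1;             % shift to 1-based bin indices
        cr = double(ycc(:,:,3)) + 1;
        counts = counts + accumarray([cb(:) cr(:)], 1, [256 256]);
    end
    imagesc(0:255, 0:255, counts'); axis xy;     % Cb along x-axis, Cr along y-axis
    xlabel('Cb'); ylabel('Cr');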
6.1.3 Skin Colour Model
The skin-likelihood image is obtained using the developed skin colour model, which is the distribution of skin colour over the chromatic colour space. Each pixel in the given input image is compared with the skin colour model. If the particular chrominance pair is present in the model, the pixel is made white, by assigning 255 to its red, green and blue components. If the chrominance pair is not present, the pixel is made black, by assigning 0 to its red, green and blue components.

The result of face detection is first processed by a decision function based on the chroma components (Cr and Cb from YCbCr, and Hue from HSV). A pixel is marked as skin if all of the following conditions are true: 140 < Cr < 165 and 140 < Cb < 195. The obtained image is a binary image in which the white regions show the possible skin-coloured regions and the black regions show the non-skin regions. Before the result is passed to the next module, it is cropped according to the skin mask; small background areas which could lead to errors during the next stages are deleted.
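A direct MATLAB translation of this rule is sketched below; the input file name is a placeholder.

    rgb = imread('input.jpg');                   % placeholder input image
    ycc = rgb2ycbcr(rgb);
    cb  = ycc(:,:,2);  cr = ycc(:,:,3);
    skin = (cr > 140) & (cr < 165) & (cb > 140) & (cb < 195);   % decision rule above
    out = uint8(skin) * 255;                     % white = skin, black = non-skin
    imshow(out);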
6.2 Feature Extraction
Feature extraction is the process of detecting the required features of the face and extracting them by cropping or other such techniques.
6.2.1 Eye Detection
Two separate eye maps are built, one from the chrominance component and the other from the luminance component. These two maps are then combined into a single eye map. The eye map from the chrominance is based on the fact that high Cb and low Cr values are found around the eyes, and is constructed as

EyeMapC = (1/3) * ( Cb^2 + (255 - Cr)^2 + Cb/Cr )

with the chroma values normalised to the range [0, 255].

Eyes usually contain both dark and bright pixels in the luminance component, so grayscale operators can be designed to emphasize brighter and darker pixels in the luminance component around eye regions. Such operators are dilation and erosion: we use grayscale dilation and erosion with a spherical structuring element to construct the luminance eye map.

The eye map from the chrominance is then combined with the eye map from the luminance by an AND (multiplication) operation: EyeMap = (EyeMapChr) AND (EyeMapLum). The resulting eye map is dilated and normalized to brighten the eyes and suppress other facial areas. Then, with an appropriate choice of threshold, we can track the location of the eye region.
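A sketch of the eye map construction follows; the structuring element radius and the final threshold are assumptions, and the face crop file name is a placeholder.

    ycc = double(rgb2ycbcr(imread('face.jpg'))); % placeholder face crop
    Y  = ycc(:,:,1);  Cb = ycc(:,:,2);  Cr = ycc(:,:,3);
    eyeC = (1/3) * (Cb.^2 + (255 - Cr).^2 + Cb ./ max(Cr, 1));  % chrominance eye map
    eyeC = mat2gray(eyeC);                       % normalise to [0, 1]
    se   = strel('disk', 4);                     % assumed structuring element radius
    eyeL = mat2gray(imdilate(Y, se) ./ (imerode(Y, se) + 1));   % luminance eye map
    eyeMap = mat2gray(imdilate(eyeC .* eyeL, se));   % AND via multiplication, then dilate
    eyes = eyeMap > 0.8;                         % assumed threshold for eye regions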
6.2.2 Mouth Detection
To locate the mouth region, we use the fact that it contains stronger red components and weaker blue components than other facial regions (Cr > Cb), so the mouth map is constructed as follows:

n = 0.95 * ( (1/k) * sum( Cr(x,y)^2 ) ) / ( (1/k) * sum( Cr(x,y) / Cb(x,y) ) )

MouthMap = Cr^2 * ( Cr^2 - n * Cr/Cb )^2

where k is the number of pixels in the face region.
The mouth detection diagram is shown in Figure 6.2.

Figure 6.2 – Mouth Detection Diagram (happy and surprised examples)
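A corresponding MATLAB sketch is given below; the face crop file name and the final threshold are placeholders.

    ycc = double(rgb2ycbcr(imread('face.jpg'))); % placeholder face crop
    Cb = ycc(:,:,2);  Cr = ycc(:,:,3);
    k  = numel(Cr);                              % number of pixels in the face region
    n  = 0.95 * (sum(Cr(:).^2) / k) / (sum(Cr(:) ./ Cb(:)) / k);
    mouthMap = Cr.^2 .* (Cr.^2 - n * (Cr ./ Cb)).^2;
    mouth = mat2gray(mouthMap) > 0.7;            % assumed threshold for the mouth region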
6.3 Emotion Detection
Emotion detection from lip images is based on the colour and shape properties of human lips. For this task we assume we already have a rectangular colour image containing the lips and surrounding skin (with as little skin as possible). Given this, we can extract a binary image of the lips, which gives us the necessary information about the shape.

To extract a binary image of the lips, a double-threshold approach is used. First, a binary image (the mask) containing objects similar to lips is extracted; the mask is extracted so that it contains a superset of the exact set of lip pixels. Then another image (the marker) is generated by extracting the pixels which are lips with the highest probability. Finally, the mask image is reconstructed using the marker image to make the result more accurate.
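The double-threshold idea can be sketched as follows; the hue ranges are loud assumptions, since suitable thresholds vary with skin tone and illumination.

    hsv = rgb2hsv(imread('mouth.jpg'));          % placeholder mouth crop
    hue = hsv(:,:,1);                            % lip hues sit near the red end (0 and 1)
    mask   = (hue > 0.90) | (hue < 0.10);        % loose range: lips plus some skin (assumed)
    marker = (hue > 0.95) | (hue < 0.05);        % tight range: almost surely lips (assumed)
    lips = imreconstruct(marker & mask, mask);   % grow the marker inside the mask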
Having a binary lip image, shape detection can be performed. Some lip features of a face expressing certain emotions are obvious: the side corners of happy lips are higher relative to the lip centre than they are for serious or sad lips. One way to express this mathematically is to find the leftmost and rightmost pixels (the lip corners), draw a line between them, and calculate the position of the lip centre with respect to that line: the lower the centre lies below the line, the happier the lips are. Another morphological lip property that can be extracted is mouth openness; open lips imply certain emotions, usually happiness and surprise.

For example (surprised and happy), the processing proceeds as follows, with a sketch of the corresponding code given after the list:

1. Based on the original binary image, the first step is to remove small areas, which is done with the 'sizethre(x,y,'z')' function.
2. In the second step a morphological closing (imclose(bw,se)) with a 'disk' structuring element is done.

3. In the third step some properties of the image regions are measured (blob analysis). More precisely:

A 'BoundingBox' is calculated, which contains the smallest rectangle enclosing the region (in our case the green box). In digital image processing, the bounding box is merely the coordinates of the rectangular border that fully encloses a digital image when it is placed over a page, a canvas, a screen or other similar bidimensional background.

'Extrema' is calculated, an 8-by-2 matrix that specifies the extrema points in the region. Each row of the matrix contains the x- and y-coordinates of one of the points. The format of the vector is [top-left top-right right-top right-bottom bottom-right bottom-left left-bottom left-top] (in our case the cyan dots).

A 'Centroid' is calculated, a 1-by-ndims(L) vector that specifies the centre of mass of the region (in our case the blue 'star').
The decision is then based on:

1. p_poly_dist - calculates the distance (shown as a red line) between the centroid and the left-top-to-right-top line.

2. lipratio - the ratio between the width and height of the bounding box.

3. lip_sign - a positive/negative number, which is calculated to detect whether the left-top-to-right-top line runs over or under the centroid.

4. The decision is made whether the mood is 'happy', 'sad' or 'surprised'.
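A sketch of these steps, using standard toolbox routines in place of the custom helpers (bwareaopen stands in for sizethre, and the decision cutoffs are assumptions), is shown below.

    bw = bwareaopen(lips, 50);                   % step 1: remove small areas (assumed size)
    bw = imclose(bw, strel('disk', 3));          % step 2: morphological closing
    s  = regionprops(bw, 'BoundingBox', 'Extrema', 'Centroid');   % step 3: blob analysis
    ext = s(1).Extrema;                          % assumes the first region is the lips
    leftTop  = ext(8, :);                        % left-top extreme point [x y]
    rightTop = ext(3, :);                        % right-top extreme point [x y]
    lineY = interp1([leftTop(1) rightTop(1)], [leftTop(2) rightTop(2)], s(1).Centroid(1));
    lipSign  = s(1).Centroid(2) - lineY;         % > 0: centroid sags below the corner line
    lipRatio = s(1).BoundingBox(3) / s(1).BoundingBox(4);   % width / height
    if lipRatio < 1.5                            % step 4: tall box = open mouth (assumed cutoff)
        mood = 'surprised';
    elseif lipSign > 0                           % centre below the corner line
        mood = 'happy';
    else
        mood = 'sad';
    end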
After reviewing several illumination correction (colour constancy) algorithms, we decided to use the "Max-RGB" (also known as "white patch") algorithm. This algorithm assumes that in every image there is a white patch, which is then used as a reference for the present illumination. A more accurate "Colour by Correlation" algorithm was also considered, but it required building a precise colour-illumination correlation table under controlled conditions, which would be beyond the scope of this task.
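A minimal sketch of Max-RGB correction is given below; the file name is a placeholder.

    img = im2double(imread('input.jpg'));        % placeholder image
    for c = 1:3                                  % scale each channel so its maximum
        img(:,:,c) = img(:,:,c) / max(max(img(:,:,c)));   % value acts as the white reference
    end
    img = im2uint8(img);                         % corrected image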
As face detection is always the first step in these recognition or transmission systems, its performance puts a strict limit on the achievable performance of the whole system. Ideally, a good face detector should accurately extract all faces in images regardless of their positions, scales, orientations, colours, shapes, poses, expressions and lighting conditions. However, for the current state of the art in image processing, this goal is a big challenge; for this reason, many face detectors deal only with upright and frontal faces in well-constrained environments. The lip emotion detection algorithm used here has one restriction: the face cannot be rotated more than 90 degrees, since then the corner detection would obviously fail.
CHAPTER 7
EXPERIMENTAL RESULTS
7.1 General
The results obtained after successful implementation of the project are given in this chapter, on a step-by-step basis.
7.2 Chroma Chart
The chroma chart displayed in Figure 7.1 is the distribution of the skin colour of different people over the chromatic colour space. Here the chromatic colour is taken in the (Cb, Cr) colour space. The intensity (Y) component is not considered because it has very little effect on the chrominance variation.

Figure 7.1 – Chroma Chart
7.3 Result Analysis
This section gives the overall efficiency of the proposed system at each step. The system was analysed for its detection rate and the time taken at each stage for a specified number of input images. Three stages were considered: skin detection, face detection (eyes and mouth), and emotion detection and recognition; at each of these stages the detection rate and the time taken were calculated. The results are tabulated in Table 7.1.
Table 7.1 – Result Analysis

Stage                                Detection Rate (%)   Number of Images   Time (s)
Skin detection                       94.44                17                 1.4
Face detection (eyes and mouth)      83.33                15                 1
Emotion detection and recognition    88.88                16                 0.5

According to the table, 17 image samples were taken to determine the skin detection rate; skin was detected in 16 of the 17 images, giving a detection rate of 94.44 % with an average time of 1.4 seconds per image. The face detection rate was calculated for 15 images, out of which the face was detected successfully in 12, giving a detection rate of 83.33 % with an average time of 1 second per image. Similarly, the emotion detection and recognition rate was calculated for 16 images, out of which the exact emotions were detected and recognized in 14, giving a detection rate of 88.88 % with an average time of 0.5 seconds per image.
The video fragmentation rate of a video depends on the duration and length of the original video; the frames per second (fps) rate is dependent on the time span of the video. Frame rate (also known as frame frequency) is the frequency at which an imaging device produces unique consecutive images called frames. The term applies equally well to film and video cameras, computer graphics, and motion capture systems. Frame rate is most often expressed in frames per second (fps) and, for progressive scan monitors, in hertz (Hz). If a video with a greater time span is given, the interval between the fragments remains constant. For every fragment produced, the emotion of the person is detected, giving an indication of the intervals at which changes of emotion occur and narrowing down the corresponding reason for their occurrence.
CHAPTER 8
CONCLUSION AND FUTURE WORK
Conclusion
The proposed system utilizes feature extraction techniques and determines the emotion of a person based on the facial features, namely the eyes and lips. The emotion exhibited by a person is determined with good accuracy and the system is user friendly.

Face Detection and Segmentation

In this project we have proposed an emotion detection and recognition system for colour images. Although our application is constructed only for full-frontal pictures with one person per picture, face detection is necessary to decrease the area of interest needed for further processing in order to achieve the best results. Detecting the skin of a face in an image is a hard task due to the variance of illumination; the success of correct detection depends a great deal on the light sources and the illumination properties of the environment in which the picture is taken.
Emotion Detection
The major difficulty of the approach used is determining the right hue threshold range for lip extraction. Lip colours vary mostly according to the face owner's race, the presence of make-up, and the illumination under which the photo was taken. The last of these is the least problematic, since illumination correction algorithms exist.
Future Enhancements
Future work includes enhancement of the system so that it is able to detect the emotions of a person even in complex backgrounds with different illumination conditions, and elimination of the lip colour constraint in coloured images. Another criterion that can be worked on is to detect more emotions beyond happy, sad and surprised.
APPENDIX 1
SCREENSHOTS
SCREEN 1: The detected face for the given video input.
SCREEN 2: The interface which is used to select the input image.
SCREEN 3: The image which is to be given as reference.
SCREEN 4: The image to be tested.
SCREEN 5: The smoothened reference image.
SCREEN 6: The test image after smoothening.
SCREEN 7: The image after the detection of edges.
SCREEN 8: The result screen, which displays the end result of the system: the emotion portrayed by the person in the image.
REFERENCES
[1] J. Jarkiewicz, R. Kocielnik and K. Marasek, "Anthropometric Facial Emotion Recognition", Novel Interaction Methods and Techniques, Lecture Notes in Computer Science, Volume 5611, 2009.

[2] L. Zhao, X. LinSun, J. Liu and X. Hexu, "Face Detection Based on Skin Colour", Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 2004.

[3] I. Maglogiannis, D. Vouyioukas and C. Aggelopoulos, "Face Detection and Recognition of Natural Human Emotion Using Markov Random Fields", Personal and Ubiquitous Computing, 2009.

[4] M. H. Yang, D. J. Kriegman and N. Ahuja, "Detecting Faces in Images", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, 2002.

[5] Pedro J. Muñoz-Merino, Carlos Delgado Kloos and Mario Muñoz-Organero, "Enhancement of Student Learning Through the Use of a Hinting Computer e-Learning System and Comparison With Human Teachers", IEEE journal, vol. 52, 2011.

[6] Emily Mower, Maja J. Mataric and Shrikanth Narayanan, "A Framework for Automatic Human Emotion Classification Using Emotion Profiles", IEEE journal, vol. 23, 2011.

[7] Xiaogang Wang and Xiaoou Tang, "Face Photo-Sketch Synthesis and Recognition", IEEE transactions, 2009.

[8] Yan Tong, Jixu Chen and Qiang Ji, "A Unified Probabilistic Framework for Spontaneous Facial Action Modelling", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, 2010.

[9] L. S. Chen and T. S. Huang, "Emotional Expressions in Audiovisual Human Computer Interaction", IEEE International Conference, Volume 1, 2000.

[10] L. C. De Silva and P. C. Ng, "Bimodal Emotion Recognition", Fourth IEEE International Conference, 2000.