Decoding Human Face Processing
Ankit Awasthi, Prof. Harish Karnick
Motivation
• One of the most important goals of Computer Vision researchers is to come up with an algorithm which can process face images and classify them into different categories (based on gender, emotion, identity, etc.)
• Humans are extremely good at these tasks
• In order to match human performance and eventually beat it, it is imperative that we understand how humans do it
Motivation
• Moreover, similar cognitive processes might be involved in processing of other kinds of visual data or even data from other modalities
• Discovery of computational basis of face processing might be a good indication of generic cognitive structures
Where does our work fit in??
• A large number of neurological and psychological experimental findings
• Implications for computer vision algorithms • Closing the loop
Neural Networks (~1985)
[Figure: feed-forward network — input vector → hidden layers → outputs; compare the outputs with the correct answer to get an error signal, then back-propagate the error signal to get derivatives for learning]
Why Deep Learning??
• Brains have a deep architecture
• Humans organize their ideas hierarchically, through composition of simpler ideas
• Insufficiently deep architectures can be exponentially inefficient
• Deep architectures facilitate feature and sub-feature sharing
Restricted Boltzmann Machines (RBM)
• We restrict the connectivity to make learning easier:
  – Only one layer of hidden units
  – No connections between hidden units
• Energy of a joint configuration is defined as
  E(v, h) = −Σᵢ aᵢvᵢ − Σⱼ bⱼhⱼ − Σᵢ,ⱼ vᵢhⱼwᵢⱼ  (for binary visible units)
  E(v, h) = Σᵢ (vᵢ − aᵢ)²/2σᵢ² − Σⱼ bⱼhⱼ − Σᵢ,ⱼ (vᵢ/σᵢ)hⱼwᵢⱼ  (for real visible units)
[Figure: bipartite graph of visible units v (index i) and hidden units h (index j)]
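The learning rule hinted at on this slide can be sketched as a binary-binary RBM trained with one step of contrastive divergence (CD-1). This is a minimal illustration, not the configuration used in the experiments; the sizes, learning rate, and random data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
        self.a = np.zeros(n_visible)   # visible biases
        self.b = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.a)

    def cd1_update(self, v0):
        # Positive phase: sample hidden units given the data.
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one step of Gibbs sampling back to the visibles.
        pv1 = self.visible_probs(h0)
        ph1 = self.hidden_probs(pv1)
        # CD-1 approximation to the log-likelihood gradient.
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / batch
        self.a += self.lr * (v0 - pv1).mean(axis=0)
        self.b += self.lr * (ph0 - ph1).mean(axis=0)
        return np.mean((v0 - pv1) ** 2)   # reconstruction error

rbm = RBM(n_visible=16, n_hidden=8)
data = (rng.random((64, 16)) < 0.3).astype(float)  # toy binary data
errs = [rbm.cd1_update(data) for _ in range(200)]
print(round(errs[-1], 3))
```

The reconstruction error is only a heuristic monitor of learning progress, but it typically falls over the first few hundred CD updates.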
Training a deep network
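The greedy layer-wise procedure this slide illustrates can be sketched as follows: each layer is trained as an RBM on the (mean-field) hidden activations of the layer below. `train_rbm` here is a stand-in for a full contrastive-divergence trainer — it only returns random weights of the right shape so the control flow is runnable; the 3000→1000→500→200→100 layer sizes follow the architecture quoted later in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden):
    # Placeholder: a real implementation would run CD updates here.
    W = rng.normal(0, 0.01, size=(data.shape[1], n_hidden))
    b = np.zeros(n_hidden)
    return W, b

def pretrain(data, layer_sizes):
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b = train_rbm(x, n_hidden)
        layers.append((W, b))
        x = sigmoid(x @ W + b)   # activations become input to the next RBM
    return layers

data = rng.random((32, 2500))            # e.g. 50x50 images, flattened
stack = pretrain(data, [3000, 1000, 500, 200, 100])
print([W.shape for W, _ in stack])
```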
Sparse DBNs (Lee et al., 2007)
• In order to have a sparse hidden layer, the average activation of each hidden unit over the training set is constrained to a small target value p
• The optimization problem in the learning algorithm then looks like
  minimize  −Σₗ log P(v⁽ˡ⁾) + λ Σⱼ ( p − (1/m) Σₗ E[hⱼ | v⁽ˡ⁾] )²
Oriented edge detectors using DBNs
Important observations about DBNs
• We found in our experiments that:
  – Fine-tuning was important only for the construction of the autoencoder
  – The final softmax layer can be learned on top of the learned features with marginal loss in accuracy
• Fine-tuning the autoencoder is important
Neural Underpinnings (Sinha et al., 2006)
• The human visual system appears to devote specialized resources for face perception
• Latency of responses to faces in infero-temporal cortex is about 120 ms, suggesting a largely feed-forward computation
• Facial identity and emotion might be processed separately
  – One of the reasons we restricted ourselves to emotion and gender classification
Experiments and Dataset
• Gender and Emotion Recognition (happy, neutral)
• Training images: 300 images of size 50x50
• Test images: 98 images of size 50x50
Results on Normal images
• Same network architecture used for all experiments (3000->1000->500->200->100)
• Gender Recognition– 94%
• Emotion Recognition– 93%
Low vs High Spatial Frequency
• A number of contradictory results
• General consensus:
  – Low spatial frequencies are more important than high spatial frequencies
  – Hints at the importance of configural information
• High-frequency information by itself does not lead to good performance
  – How to reconcile this with the observed recognizability of line drawings in everyday experience?
• The spatial frequency band employed for emotion classification is higher than that employed for gender classification (Deruelle and Fagot, 2004)
Experiments
• We cut off all spatial frequencies above 8 cycles per face
• Two cases each in gender and emotion recognition:
  1. A model trained on ‘normal’ images is tested on low spatial frequency images
  2. A model trained on low spatial frequency images is tested on low spatial frequency images
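The low-spatial-frequency manipulation can be sketched as an ideal low-pass filter in the Fourier domain, keeping only frequencies up to the cutoff in cycles per image. The 8-cycles-per-face cutoff follows the slides; the circular mask shape and the random test image are assumptions.

```python
import numpy as np

def low_pass(img, cutoff=8):
    # Shift the zero-frequency component to the center of the spectrum.
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(yy - h // 2, xx - w // 2)   # radius in cycles/image
    F[dist > cutoff] = 0.0                      # zero out high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

face = np.random.default_rng(0).random((50, 50))  # stand-in for a 50x50 face
lsf = low_pass(face)
print(lsf.shape)
```

Because the DC component (dist = 0) is kept, the filtered image preserves the original mean luminance while discarding fine detail.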
Results
• Gender Recognition
  1. Model trained on ‘normal’ images ~ 89%
  2. Model trained on LSF images ~ 91%
• Emotion Recognition
  1. Model trained on ‘normal’ images ~ 87%
  2. Model trained on LSF images ~ 90.5%
Discussion
• The decrease in the accuracy is not much considering the significant reduction in the amount of information
• Implies low spatial frequency information can be used to classify a majority of images
• Tests with different spatial frequencies need to be done to reach a conclusive answer
• Importance of HSF is not apparent here because of the simplicity of the task
• In some other experiments where we looked at only HSF images, the results weren’t good!
Component and Configural Information
• Facial features are processed holistically in recognition (Sinha et al., 2006) and in emotion recognition (Durand et al., 2007)
• The configural information affects how individual features are processed
• On the other hand, there is evidence that we process face images by matching parts– Thatcher illusion
Configural information affects how individual features are processed
Thatcher Illusion
Experiments
• Two kinds of experiments:
  1. Models trained on ‘normal’ images tested on new images
  2. Same set of training and test images
Results (Gender Classification)
• Models trained on ‘normal’ images
~ 91% ~80%
~ 70% ~ random!!
Results (Gender Classification)
• Same training and test images
~ 93%
~ 85%
~ 79%
Results (Emotion Classification)
• Models trained on ‘normal’ images
~ 87% ~81%
~ 87% ~ random!!
Results (Emotion Classification)
• Same training and test images
~ 92%
~ 84%
~ 82%
Agreement with Human Performance
• Preliminary results show that humans are:
  – Perfect on the normal images we are using
  – Error-prone when the parts are removed (3 out of 20 images on average)
• Accuracy depends a lot upon the time of exposure
• Properly timed experiments are expected to yield results much closer to the algorithm’s
Discussion
• The importance of key features (eyes, mouth) is evident
• Eyes/eyebrows are important for gender recognition
• When trained on ‘normal’ images the algorithm learns features corresponding to important parts
• In the absence of these features the algorithm learns to extract other features to increase accuracy
Inversion Effect
• One of the first findings which hinted at a dedicated face processing pathway
• Another indicator of configural processing of face images
• Inverted images take significantly longer to process
Experiments and Results
• Models trained on ‘normal’ images
  – The results are “random”!!
• Training and testing on inverted images gives the same results as doing it for ‘normal’ images
• Results show that the face image processing is not purely part-based
Thatcher Illusion
Experiment and Results
• Models trained on ‘normal’ images– Random for both tasks!!
• Same training and test images– Gender: 92%– Emotion: 91%
High Level Features
• Only a few connections to the previous layer have weights that are either very high or very low
• Some of the largest weighted connections are used for linear combination
• Overlooks the non-linearity in the network from one layer to the other
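The linear-combination visualization the slide critiques can be sketched as follows: a second-layer unit's "filter" is approximated by summing first-layer filters weighted by its k largest-magnitude incoming weights, ignoring the non-linearity between layers. All sizes are illustrative assumptions, and the random weights stand in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2500, 100))   # pixels (50x50) -> layer-1 units
W2 = rng.normal(size=(100, 50))     # layer-1 -> layer-2 units

def visualize_unit(unit, k=10):
    w = W2[:, unit]
    top = np.argsort(np.abs(w))[-k:]          # k strongest connections
    # Weighted sum of the corresponding first-layer filters; this is
    # exactly the step that overlooks the sigmoid between the layers.
    return (W1[:, top] * w[top]).sum(axis=1)

img = visualize_unit(0).reshape(50, 50)  # pixel-space approximation
print(img.shape)
```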
Natural Extensions
• A more exhaustive set of experiments needs to be done to verify our preliminary observations
• It would be interesting to compare other models with Deep networks
• Some of the problems or inconsistencies are due to the lack of translation-invariant features
  – The best solution is to use a convolutional model:
    • Natural regularizer
    • Translational invariance
    • Biologically plausible
Conclusion
• We have done preliminary investigations with respect to various phenomena
• Observed results certainly hint at the cognitive relevance of the model
References
• Geoffrey E. Hinton, Simon Osindero and Yee-Whye Teh, A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, pages 1527-1554, Volume 18, 2006
• Geoffrey E. Hinton (2010). A Practical Guide to Training Restricted Boltzmann Machines, Technical Report, Volume 1
• Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent (2010). Visualizing Higher-Layer Features of a Deep Network, Technical Report 1341
• Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng. Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, ICML 2009
• Geoffrey E. Hinton, Learning Multiple Layers of Representation, Trends in Cognitive Sciences, Vol. 11 No. 10, 2007
• Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng, Sparse Deep Belief Net Model for Visual Area V2, NIPS, 2007
References
• Olshausen BA, Field DJ (1997). Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1? Vision Research, 37: 3311-3325
• Karine Durand, Mathieu Gallay, Alix Seigneuric, Fabrice Robichon, Jean-Yves Baudouin, The Development of Facial Emotion Recognition: The Role of Configural Information, Journal of Experimental Child Psychology, 2007
• Pawan Sinha, Benjamin Balas, Yuri Ostrovsky, Richard Russell, Face Recognition by Humans: Nineteen Results All Computer Vision Researchers Should Know About, Proceedings of the IEEE, 2006
• Christian Wallraven, Adrian Schwaninger, Heinrich H. Bülthoff, Learning from Humans: Computational Modeling of Face Recognition, Network: Computation in Neural Systems
• Christine Deruelle and Joël Fagot, Categorizing Facial Identities, Emotions, and Genders: Attention to High- and Low-Spatial Frequencies by Children and Adults, 2004