Decoding Human Face Processing
Ankit Awasthi, Prof. Harish Karnick
Motivation
• One of the most important goals of Computer Vision researchers is to come up with an algorithm which can process face images and classify them into different categories (based on gender, emotion, identity, etc.)
• Humans are extremely good at these tasks
• In order to match human performance and eventually beat it, it is imperative that we understand how humans do it
Motivation
• Moreover, similar cognitive processes might be involved in processing of other kinds of visual data or even data from other modalities
• Discovery of computational basis of face processing might be a good indication of generic cognitive structures
Where does our work fit in??
• A large number of neurological and psychological experimental findings
• Implications for computer vision algorithms • Closing the loop
Neural Networks (~1985)
[Figure: feed-forward network — input vector → hidden layers → outputs; compare the outputs with the correct answer to get an error signal, then back-propagate the error signal to get derivatives for learning]
Why Deep Learning??
• Brains have a deep architecture
• Humans organize their ideas hierarchically, through composition of simpler ideas
• Insufficiently deep architectures can be exponentially inefficient
• Deep architectures facilitate feature and sub-feature sharing
Restricted Boltzmann Machines (RBM)
• We restrict the connectivity to make learning easier:
  – Only one layer of hidden units
  – No connections between hidden units
• Energy of a joint configuration is defined as
  E(v, h) = −Σᵢ aᵢvᵢ − Σⱼ bⱼhⱼ − Σᵢ,ⱼ vᵢhⱼwᵢⱼ  (for binary visible units)
  E(v, h) = Σᵢ (vᵢ − aᵢ)²/2σᵢ² − Σⱼ bⱼhⱼ − Σᵢ,ⱼ (vᵢ/σᵢ)hⱼwᵢⱼ  (for real visible units)
[Figure: bipartite graph of visible units v (index i) and hidden units h (index j)]
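The learning rule hinted at on this slide can be sketched as a binary-binary RBM trained with one step of contrastive divergence (CD-1). This is a minimal illustration, not the configuration used in the experiments; the sizes, learning rate, and random data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
        self.a = np.zeros(n_visible)   # visible biases
        self.b = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.a)

    def cd1_update(self, v0):
        # Positive phase: sample hidden units given the data.
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one step of Gibbs sampling back to the visibles.
        pv1 = self.visible_probs(h0)
        ph1 = self.hidden_probs(pv1)
        # CD-1 approximation to the log-likelihood gradient.
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / batch
        self.a += self.lr * (v0 - pv1).mean(axis=0)
        self.b += self.lr * (ph0 - ph1).mean(axis=0)
        return np.mean((v0 - pv1) ** 2)   # reconstruction error

rbm = RBM(n_visible=16, n_hidden=8)
data = (rng.random((64, 16)) < 0.3).astype(float)  # toy binary data
errs = [rbm.cd1_update(data) for _ in range(200)]
print(round(errs[-1], 3))
```

The reconstruction error is only a heuristic monitor of learning progress, but it typically falls over the first few hundred CD updates.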
Training a deep network
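The greedy layer-wise procedure this slide illustrates can be sketched as follows: each layer is trained as an RBM on the (mean-field) hidden activations of the layer below. `train_rbm` here is a stand-in for a full contrastive-divergence trainer — it only returns random weights of the right shape so the control flow is runnable; the 3000→1000→500→200→100 layer sizes follow the architecture quoted later in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden):
    # Placeholder: a real implementation would run CD updates here.
    W = rng.normal(0, 0.01, size=(data.shape[1], n_hidden))
    b = np.zeros(n_hidden)
    return W, b

def pretrain(data, layer_sizes):
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b = train_rbm(x, n_hidden)
        layers.append((W, b))
        x = sigmoid(x @ W + b)   # activations become input to the next RBM
    return layers

data = rng.random((32, 2500))            # e.g. 50x50 images, flattened
stack = pretrain(data, [3000, 1000, 500, 200, 100])
print([W.shape for W, _ in stack])
```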
Sparse DBNs (Lee et al., 2007)
• In order to have a sparse hidden layer, the average activation of each hidden unit over the training set is constrained to a small target value p
• The optimization problem in the learning algorithm then looks like
  minimize  −Σₗ log P(v⁽ˡ⁾) + λ Σⱼ ( p − (1/m) Σₗ E[hⱼ | v⁽ˡ⁾] )²
Oriented edge detectors using DBNs
Important observations about DBNs
• We found in our experiments that:
  – Fine-tuning was important only for the construction of the autoencoder
  – The final softmax layer can be learned on top of the learned features with marginal loss in accuracy
• Fine-tuning the autoencoder is important
Neural Underpinnings (Sinha et al., 2006)
• The human visual system appears to devote specialized resources for face perception
• Latency of responses to faces in infero-temporal cortex is about 120 ms, suggesting a largely feed-forward computation
• Facial identity and emotion might be processed separately
  – One of the reasons we restricted ourselves to emotion and gender classification
Experiments and Dataset
• Gender and Emotion Recognition (happy, neutral)
• Training images: 300 images of size 50x50
• Test images: 98 images of size 50x50
Results on Normal images
• Same network architecture used for all experiments (3000->1000->500->200->100)
• Gender Recognition– 94%
• Emotion Recognition– 93%
Low vs High Spatial Frequency
• A number of contradictory results
• General consensus:
  – Low spatial frequencies are more important than high spatial frequencies
  – Hints at the importance of configural information
• High-frequency information by itself does not lead to good performance
  – How to reconcile this with the observed recognizability of line drawings in everyday experience?
• The spatial frequency band employed for emotion classification is higher than that employed for gender classification (Deruelle and Fagot, 2004)
Experiments
• We cut off all spatial frequencies above 8 cycles per face
• Two cases each in gender and emotion recognition:
  1. A model trained on ‘normal’ images is tested on low spatial frequency images
  2. A model trained on low spatial frequency images is tested on low spatial frequency images
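The low-spatial-frequency manipulation can be sketched as an ideal low-pass filter in the Fourier domain, keeping only frequencies up to the cutoff in cycles per image. The 8-cycles-per-face cutoff follows the slides; the circular mask shape and the random test image are assumptions.

```python
import numpy as np

def low_pass(img, cutoff=8):
    # Shift the zero-frequency component to the center of the spectrum.
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(yy - h // 2, xx - w // 2)   # radius in cycles/image
    F[dist > cutoff] = 0.0                      # zero out high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

face = np.random.default_rng(0).random((50, 50))  # stand-in for a 50x50 face
lsf = low_pass(face)
print(lsf.shape)
```

Because the DC component (dist = 0) is kept, the filtered image preserves the original mean luminance while discarding fine detail.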
Results
• Gender Recognition
  1. Model trained on ‘normal’ images ~ 89%
  2. Model trained on LSF images ~ 91%
• Emotion Recognition
  1. Model trained on ‘normal’ images ~ 87%
  2. Model trained on LSF images ~ 90.5%
Discussion
• The decrease in the accuracy is not much considering the significant reduction in the amount of information
• Implies low spatial frequency information can be used to classify a majority of images
• Tests with different spatial frequencies need to be done to reach a conclusive answer
• Importance of HSF is not apparent here because of the simplicity of the task
• In some other experiments where we looked at only HSF images, the results weren’t good!
Component and Configural Information
• Facial features are processed holistically in recognition (Sinha et al., 2006) and in emotion recognition (Durand et al., 2007)
• The configural information affects how individual features are processed
• On the other hand, there is evidence that we process face images by matching parts– Thatcher illusion
Configural information affects how individual features are processed
Thatcher Illusion
Experiments
• Two kinds of experiments:
  1. Models trained on ‘normal’ images tested on new images
  2. Same set of training and test images
Results (Gender Classification)
• Models trained on ‘normal’ images
~ 91% ~80%
~ 70% ~ random!!
Results (Gender Classification)
• Same training and test images
~ 93%
~ 85%
~ 79%
Results (Emotion Classification)
• Models trained on ‘normal’ images
~ 87% ~81%
~ 87% ~ random!!
Results (Emotion Classification)
• Same training and test images
~ 92%
~ 84%
~ 82%
Agreement with Human Performance
• Preliminary results show that humans are:
  – Perfect on the normal images we are using
  – Error-prone when the parts are removed (3 out of 20 images on average)
• Accuracy depends a lot upon the time of exposure
• Properly timed experiments are expected to yield results much closer to the algorithm’s
Discussion
• The importance of key features (eyes, mouth) is evident
• Eyes/eyebrows are important for gender recognition
• When trained on ‘normal’ images the algorithm learns features corresponding to important parts
• In the absence of these features the algorithm learns to extract other features to increase accuracy
Inversion Effect
• One of the first findings which hinted at a dedicated face processing pathway
• Another indicator of configural processing of face images
• Inverted images take significantly longer to process
Experiments and Results
• Models trained on ‘normal’ images
  – The results are “random”!!
• Training and testing on inverted images gives the same results as doing it for ‘normal’ images
• Results show that the face image processing is not purely part-based
Thatcher Illusion
Experiment and Results
• Models trained on ‘normal’ images– Random for both tasks!!
• Same training and test images– Gender: 92%– Emotion: 91%
High Level Features
• Only a few connections to the previous layer have weights that are either very high or very low
• Some of the largest weighted connections are used for linear combination
• Overlooks the non-linearity in the network from one layer to the other
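The linear-combination visualization the slide critiques can be sketched as follows: a second-layer unit's "filter" is approximated by summing first-layer filters weighted by its k largest-magnitude incoming weights, ignoring the non-linearity between layers. All sizes are illustrative assumptions, and the random weights stand in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2500, 100))   # pixels (50x50) -> layer-1 units
W2 = rng.normal(size=(100, 50))     # layer-1 -> layer-2 units

def visualize_unit(unit, k=10):
    w = W2[:, unit]
    top = np.argsort(np.abs(w))[-k:]          # k strongest connections
    # Weighted sum of the corresponding first-layer filters; this is
    # exactly the step that overlooks the sigmoid between the layers.
    return (W1[:, top] * w[top]).sum(axis=1)

img = visualize_unit(0).reshape(50, 50)  # pixel-space approximation
print(img.shape)
```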
Natural Extensions
• A more exhaustive set of experiments needs to be done to verify our preliminary observations
• It would be interesting to compare other models with Deep networks
• Some of the problems or inconsistencies are due to the lack of translation-invariant features
  – The best solution is to use a convolutional model:
    • Natural regularizer
    • Translational invariance
    • Biologically plausible
Conclusion
• We have done preliminary investigations with respect to various phenomena
• Observed results certainly hint at the cognitive relevance of the model
References
• Geoffrey E. Hinton, Simon Osindero and Yee-Whye Teh, A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, pages 1527-1554, Volume 18, 2006
• Geoffrey E. Hinton (2010). A Practical Guide to Training Restricted Boltzmann Machines, Technical Report, Volume 1
• Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent (2010). Visualizing Higher-Layer Features of a Deep Network, Technical Report 1341
• Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng. Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, ICML 2009
• Geoffrey E. Hinton, Learning Multiple Layers of Representation, Trends in Cognitive Sciences, Vol. 11 No. 10, 2007
• Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng, Sparse Deep Belief Net Model for Visual Area V2, NIPS, 2007
References
• Olshausen BA, Field DJ (1997). Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1? Vision Research, 37: 3311-3325
• Karine Durand, Mathieu Gallay, Alix Seigneuric, Fabrice Robichon, Jean-Yves Baudouin, The Development of Facial Emotion Recognition: The Role of Configural Information, Journal of Experimental Child Psychology, 2007
• Pawan Sinha, Benjamin Balas, Yuri Ostrovsky, Richard Russell, Face Recognition by Humans: Nineteen Results All Computer Vision Researchers Should Know About, Proceedings of the IEEE, 2006
• Christian Wallraven, Adrian Schwaninger, Heinrich H. Bülthoff, Learning from Humans: Computational Modeling of Face Recognition, Network: Computation in Neural Systems
• Christine Deruelle and Joël Fagot, Categorizing Facial Identities, Emotions, and Genders: Attention to High- and Low-Spatial Frequencies by Children and Adults, 2004