Trade-off between recognition and reconstruction: Application of Robotics Vision to Face Recognition

TEL-AVIV UNIVERSITY

The Iby and Aladar Fleischman

Faculty of Engineering

TRADE-OFF BETWEEN RECOGNITION AND

RECONSTRUCTION:

APPLICATION OF NEURAL NETWORKS

TO ROBOTIC VISION

Thesis submitted for the degree “Doctor of Philosophy”

by

INNA STAINVAS

Submitted to the Senate of Tel-Aviv University

1999

TEL-AVIV UNIVERSITY

This work was carried out under the supervision of

Doctor Nathan Intrator

and

Doctor Amiram Moshaiov

This work is dedicated to my family

Acknowledgment

I would like to thank my husband, daughter and parents for their tolerance and moral

support during the completion of this thesis.

I am greatly indebted to my first advisor Dr. Amiram Moshaiov, who gave me a

chance to start as a Ph.D. Student at the Engineering Faculty of Tel-Aviv University,

when I had been in Israel for only two months. I am very grateful to him for proposing that I work in

Neural Networks and Computer Vision and for allowing me freedom in my research.

I have been pleasantly surprised by the flexibility of the educational system of the Tel-

Aviv University in allowing me to listen and participate in courses at different faculties,

such as the Engineering Faculty, Computer Science and Foreign Languages.

While taking courses in Neural Networks, I met Dr. Nathan Intrator, who became my

main supervisor and collaborator for more than five years. He opened up to me a new world

of Neural Networks, and I have learned much from him, not only about the technical aspects

but also about scientific research methodologies. Without him, this thesis would never have

appeared. I am grateful to him for his tolerance, endless support and guidance.

It is impossible to thank all the people who helped me, but I would like to mention the

system administrator of the Engineering faculty, Udi Mottelo, the Department secretary

Ariella Regev, the secretary of the Emigration Support department Ahuva, my friends,

and the people of the Neural Computation Group of Computer Science faculty, Yair

Shimshoni, Nurit Vatnick and Natalie Japkowich.

This work was supported by grants from the Rich Foundation, the Don and Sara

Marejn Scholarship Fund and by a grant from the Ministry of Science to Dr. Nathan

Intrator.

Inna Stainvas

March 8, 1999

Abstract

Autonomous and efficient action of robots requires a robust robot vision system that can

cope with variable light and view conditions. These include partial occlusion, blur, and

mainly a large scale difference of object size due to variable distance to the objects. This

change in scale leads to reduced resolution for objects seen from a distance. One of the

most important tasks for the robot’s visual system is object recognition. This task is also

affected by orientation and background changes. These real-world conditions require the

development of specific object recognition methods.

This work is devoted to robotic object recognition. We develop recognition methods

based on training that includes incorporation of prior knowledge about the problem.

The prior knowledge is incorporated via learning constraints during training (parameter

estimation). A significant part of the work is devoted to the study of reconstruction

constraints. In general, there is a tradeoff between the prior-knowledge constraints and

the constraints emerging from the classification or regression task at hand. In order to

avoid the additional estimation of the optimal tradeoff between these two constraints, we

treat this tradeoff as a hyper-parameter (under the Bayesian framework) and integrate

over a certain (discrete) distribution. We also study various constraints resulting from

information theory considerations.

Experimental results on two face data-sets are presented. Significant improvement in

face recognition is achieved for various image degradations, such as various forms of image

blur, partial occlusion, and noise. Additional improvement in recognition performance is

achieved when the degraded images are preprocessed with state-of-the-art image restoration

techniques.

Contents

1 Introduction

1.1 General motivation

1.1.1 Robotic vision

1.1.2 Internal data representation

1.1.3 Data compression

1.1.4 Face recognition

1.2 Overview of the thesis

2 Statistical formulation of the problem

2.1 Bias-Variance error decomposition for a single predictor

2.2 Variance control without imposing a learning bias

2.3 Variance control by imposing a learning bias

2.3.1 Smoothness constraints

2.3.2 Invariance bias constraints

2.3.3 Specific bias constraints

2.4 Reconstruction bias constraints

2.5 Minimum Description Length (MDL) Principle

2.5.1 Minimum description length

2.6 Bayesian framework

2.7 MDL in the feed-forward NN

2.7.1 MDL and EPP bias constraints

2.8 Appendix to Chapter 2: Regularization problem

3 Imposing bias via reconstruction constraints

3.1 Introduction

3.1.1 Principal Component Analysis (PCA)

3.1.2 Autoencoder network and MDL

3.1.3 Reconstruction and generative models

3.1.4 Classification via reconstruction

3.1.5 Other applications of reconstruction

3.2 Imposing reconstruction constraints

3.2.1 Reconstruction as a bias imposing mechanism

3.2.2 Hybrid classification/reconstruction network

3.2.3 Hybrid network and MDL

3.2.4 Hybrid network as a generative probabilistic model

3.2.5 Hybrid Neural Network architecture

3.2.6 Network learning rule

3.2.7 Hybrid learning rule

4 Imposing bias via unsupervised learning constraints

4.1 Introduction

4.2 Information principles for sensory processing

4.3 Mathematical background

4.3.1 Entropy maximization (ME)

4.3.2 Minimization of the output mutual information (MMI)

4.3.3 Relation to Exploratory Projection Pursuit

4.3.4 BCM

4.3.5 Sum of entropies of the hidden units

4.3.6 Nonlinear PCA

4.3.7 Reconstruction issue

4.4 Imposing unsupervised constraints

4.5 Imposing unsupervised and reconstruction constraints

5 Real world recognition

5.1 Introduction

5.1.1 Face recognition

5.2 Methodology

5.2.1 Different architecture constraints

5.2.2 Regularization

5.2.3 Neural Network Ensembles

5.2.4 Face data-sets

5.2.5 Face normalization

5.2.6 Learning parameters

5.3 Type of image degradations

5.4 Experimental results

5.4.1 Different architecture constraints and regularization ensembles

5.5 Saliency detection

5.5.1 Saliency map

5.6 Conclusions

5.7 Appendix to Chapter 5: Hidden representation exploration

6 Blurred image recognition

6.1 Methodology

6.1.1 Experimental design

6.2 Image degradation

6.2.1 Main filters

6.2.2 Other types of degradation

6.3 Image restoration

6.3.1 MSE minimization and regularization

6.3.2 Image restoration in the frequency domain

6.3.3 Denoising

6.4 Results

6.4.1 Image filtering

6.4.2 Classification of noisy data

6.4.3 Gaussian blur

6.4.4 Motion blur

6.4.5 Blind deconvolution

6.4.6 All training schemes

6.5 Conclusions

7 Summary and future work

7.1 Summary

7.2 Directions for future work

List of Figures

2.1 Supervised feed-forward network

2.2 Hybrid network with EPP constraints

3.1 Autoencoder network architecture

3.2 Eigenspaces extracted by PCA

3.3 Combined recognition/reconstruction network

3.4 Hybrid network with reconstruction and EPP constraints

3.5 Detailed architecture of the recognition/reconstruction network

4.1 Feed-forward network for independent component extraction

4.2 Pdf’s graphs for a family of the exponential density functions

4.3 Exploratory projection pursuit network

5.1 Misclassification rate time evolution

5.2 MSE (mean-squared) recognition error time evolution

5.3 Classification based regularization

5.4 “Caricature” faces in three resolutions

5.5 Image degradation and reconstruction (TAU data-set)

5.6 Summary of different networks and different image degradations

5.7 Saliency map construction

5.8 Hidden unit activities vs. classes - for an unconstrained network

5.9 Hidden unit activities vs. classes - for a reconstruction network

5.10 Pdf’s of the hidden unit activities

5.11 Hidden weight representation

6.1 Experimental design schemes

6.2 Training scheme C

6.3 Degraded images

6.4 Noisy Images

6.5 Gaussian blur and restoration

6.6 Motion blur and deblur

6.7 Blind deconvolution

6.8 Recognition of blurred images via schemes A–C

6.9 Reconstruction of Gaussian blurred images

List of Tables

4.1 Unsupervised constraints

5.1 Classification results for Pentland data-set

5.2 Different ensemble types (Pentland data-set)

5.3 Different ensemble types (TAU data-set)

5.4 Recognition using saliency map (Pentland data-set)

5.5 Recognition using saliency map (TAU data-set)

6.1 Classification results for filtered data

6.2 Noise and restoration

6.3 Gaussian blur and restoration

6.4 Motion blur and restoration

6.5 Blind deconvolution

6.6 Blurred image recognition via joined ensembles

7.1 Classification error for reconstructed images

Chapter 1

Introduction

1.1 General motivation

1.1.1 Robotic vision

Nowadays, robots that can move and operate autonomously in the real world are in high demand. One of the main perception tasks that has to be addressed in this context is recognition. Recognition in a real-world environment is challenging because it has to cope with data variability arising from orientation, changing background, partial occlusion, blur and similar factors.

For illustration, let us consider a vision-guided robot helicopter that has to navigate autonomously using only on-board sensors and computing power (Chopper, 1997). One of the basic difficulties in recognizing images taken by the helicopter cameras during operation is the significant difference between these images and the images with which the robot is acquainted from ideal flight conditions. Images taken during operation usually contain a large amount of degradation caused by diverse factors, such as changing illumination, bad weather conditions, relative motion between the cameras and the object of interest in the scene, shadows, and the low resolution of the cameras. Some of these factors cause images to look blurred and foggy; others lead to noise and partial occlusion. All these factors are crucial for recognition performance and require special care.

One possible approach to improving the recognition of degraded images is to recover the images using state-of-the-art restoration techniques as preprocessing before the recognition stage. This preprocessing requires estimation of the degradation process, e.g. the type and parameters of the blur operation. Another approach is to address the variability directly in the recognition system. It is well known that for a restoration process to be successful, the degradation process has to be accurately modeled. However, in many cases exact modeling is impractical, and the restored images remain partially degraded and contain artifacts. Furthermore, restoration methods are often computationally expensive and require a-priori knowledge or human interaction. It follows that efforts have to be concentrated on the development of recognition methods that are more robust to image degradations.

1.1.2 Internal data representation

An important aspect of robust recognition methods is the construction of an internal data representation (feature extraction) that captures the significant structure of the data. According to D. Marr (1982), finding an internal representation is an inherent component of the vision process.

Feature based representation. Many recognition methods include grouping or perceptual organization as a first stage of visual processing. In this stage, objects are represented as models containing the essential features and the tight logical rules needed for recognition. Some methods extract “anchor points” (Ullman, 1989; Brunelli and Poggio, 1992); others consider edge segments as the interesting feature elements (Bhanu and Ming, 1987; Liu and Srinath, 1984). Relatively new approaches are deformable template matching (Grenander, 1978; Brunelli and Poggio, 1993; Jain et al., 1996) and the use of generalized splines for object classification (Lai, 1994). These methods attempt to extract salient features locally, in the low-level stage of visual processing, according to the subjective understanding of the investigator. Therefore, an internal representation based on the extraction of object features and the relations between them may be limited.

Learning internal representations via Neural Networks. A radically different approach is to use all the available intensity information for finding an internal representation. Principal Component Analysis (PCA) (Fukunaga, 1990) is a non-neural-network example of this approach, in which the internal representation space is spanned by the eigenvectors of the data covariance matrix that correspond to the largest eigenvalues. These eigenvectors are macro-features extracted implicitly from the images. When fed with intensity images, Neural Networks, similarly to PCA, extract an internal representation in the space of hidden unit activities.
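The PCA representation described above can be illustrated with a minimal sketch. It assumes toy two-pixel "images" generated at random, and it uses plain power iteration to find the leading eigenvector of the covariance matrix; the function names and the data are hypothetical and are not taken from the thesis.

```python
import math
import random


def mean_vector(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n, d = len(samples), len(samples[0])
    return [sum(s[j] for s in samples) / n for j in range(d)]


def covariance_matrix(samples):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(samples), len(samples[0])
    mean = mean_vector(samples)
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_eigenvector(matrix, iterations=100):
    """Power iteration: converges to the eigenvector of the largest eigenvalue."""
    d = len(matrix)
    # Starting vector; it must not be orthogonal to the leading eigenvector.
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy "images" with two pixels whose intensities vary together (illustrative data).
rng = random.Random(0)
samples = []
for _ in range(200):
    t = rng.gauss(0.0, 3.0)  # shared factor driving both pixels
    samples.append([t + rng.gauss(0.0, 0.1), t + rng.gauss(0.0, 0.1)])

component = leading_eigenvector(covariance_matrix(samples))

# One-dimensional internal representation of the first sample.
mean = mean_vector(samples)
code = sum(c * (x - m) for c, x, m in zip(component, samples[0], mean))
```

Because both pixels are driven by the same factor, the leading eigenvector points approximately along the diagonal, and a single number per image captures most of the variation.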

Processing an image as a whole is a high dimensional recognition task that leads to the curse of dimensionality (Bellman, 1961): there is not enough data to robustly train a classifier in a high dimensional space. As an example, a network with a single hidden unit and input images of 60 × 60 pixels has 3600 weight parameters that have to be estimated. Thus, the main issue is finding an intrinsic low dimensional representation of the images. As was pointed out by Geman et al. (1992), a way to avoid the curse of dimensionality in Neural Networks is to prewire the important generalizations by purposefully introducing a learning bias.

The work presented in this thesis is devoted specifically to this issue. We develop image recognition techniques using hybrid feed-forward Neural Networks, obtained by introducing a learning bias. In particular, we investigate the influence of novel reconstruction learning constraints on the recognition performance of feed-forward Neural Networks. In addition, we propose other learning constraints based on information theory and compare their efficiency with that of the reconstruction learning constraints. We demonstrate that hybrid Neural Networks are robust to real-world degradation of the input visual data and show that their performance can be further enhanced when state-of-the-art restoration (deblurring) techniques are also incorporated.

1.1.3 Data compression

Often, the goal of compression is defined as finding a compact data representation that leads to good data reconstruction. Principal Component Analysis (PCA), the Discrete Fourier Transform (DFT) and its generalization, the Wavelet Transform, and advanced best basis representations (Coifman and Wickerhauser, 1992) are examples of compression techniques. Compression may also be realized via an autoencoder network (Cottrell et al., 1987). The autoencoder is a multi-layer perceptron (MLP) whose output layer is trained to reproduce the input layer through a hidden layer of small size.

Recently, a novel type of autoencoder network has been proposed by Zemel (1993). Its hidden layer is allowed to have a large number of hidden units, but different constraints are placed on the developed hidden representation. The network is trained simultaneously to reconstruct the input accurately and to find a succinct representation in the hidden layer, assuming the formation of a sparse or population code in the autoencoder hidden layer.

When the main task is recognition, the compressed data representation has been used instead of the original (high-dimensional) data (Kirby and Sirovich, 1990; Turk and Pentland, 1991; Murase and Nayar, 1993; Bartlett et al., 1998). Recognition from this representation is faster and may have better generalization performance. However, such compression is task-independent and may be inappropriate for a specific recognition task (Huber, 1985; Turk and Pentland, 1993).

We seek a compact data description that is task-dependent, and is good for recognition.

Thus, the quality of the compression scheme is judged by its generalization property.

Often, a separate low-dimensional representation is created for every specific task at hand.

Another strategy could be to discover a hidden representation that is suitable for several


potential visual tasks (Intrator and Edelman, 1996). We show that a good task-dependent

compression is obtained when the data representation is constructed not only to minimize

the mean-squared recognition error, but also to maintain data fidelity and/or to extract

good statistical properties. These good properties may be the independence of hidden

neurons, maximum information transfer in the hidden layer or a multi-modal distribution

of the hidden unit activities. Therefore, in this case compression is task-dependent and is

assisted by a-priori knowledge.

In summary, we investigate lossy compression techniques based on two visual tasks: image recognition and reconstruction. Our goal is to find a hidden representation that optimizes recognition using hints from the reconstruction task.

1.1.4 Face recognition

The performance of the proposed recognition schemes is examined on two facial data sets. Face recognition has gained much attention in recent years due to the variety of commercial applications, such as video conferencing, security, human communication and robotics. It has recently attracted the special attention of various human robotics groups that work intensively on the creation of personal adaptive robots to assist frail and elderly blind people, and on the creation of working mobile robots for delivery assistance (Hirukawa, 1997; Connolly, 1997).

This recognition task is a very difficult one (Chellapa et al., 1995), since it is a high dimensional classification problem and is therefore subject to the “curse of dimensionality”. It is further complicated by the large variability of the facial data sets due to:

• viewpoint dependence

• nonrigidity of the faces

• variable lighting conditions

• motion

The task of face recognition is a particular case of learning in which the variability of the data describing the same class is comparable with the similarity between different classes. Other important recognition tasks in the same category include the recognition of different kinds of tanks, ships, planes and cars.


1.2 Overview of the thesis

The thesis focuses on developing Neural Network techniques that improve the recognition

performance. A key aspect of this work is finding data representations that lead to better

generalization. We show that networks which are trained to recognize and reconstruct

images simultaneously extract features that improve recognition. Improved performance

is also achieved when networks are trained to find other statistical structures in the data.

The thesis is organized as follows:

Chapter 2: Formulates the recognition task in the framework of the “bias-variance”

dilemma. We show that for a good generalization ability the variance portion of the

generalization error has to be properly controlled. We discuss different methods to control

the variance portion of the generalization error and present two main approaches: reducing

the variance via ensemble averaging and introducing a learning bias. We review different

types of learning bias constraints, and finally, propose reconstruction constraints as a

novel type of bias constraints in the context of feed-forward networks.

Starting from Section 2.5, we discuss the relation between the “bias-variance” dilemma in statistics, the MDL principle and the Bayesian framework. We show that the introduction of a learning bias corresponds to a model-cost in the description length, which has to be minimized along with an error-cost under the MDL principle. At the same time, under the Bayesian framework, the model-cost corresponds to prior knowledge about the distributions of the weights and of the hidden representation.

Chapter 3: Introduces a hybrid feed-forward network architecture, which uses the re-

construction constraints as a bias imposing mechanism for the recognition task. This net-

work, which can be interpreted under MDL and Bayesian frameworks, modifies the low

dimensional representation by minimizing concurrently the mean squared error (MSE)

of reconstruction and classification outputs. In other words, it attempts to improve the

quality of the hidden layer representation by imposing a feature selection useful for both

tasks, classification and reconstruction. The relative significance of each task is controlled

by a trade-off parameter λ, which is interpreted as a hyper-parameter in the Bayesian

framework. Finally, this chapter presents technical details about the network architecture

and its learning rule.
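The role of the trade-off parameter λ can be sketched as follows. The additive form used here, classification MSE plus λ times reconstruction MSE, is an assumption chosen to be consistent with the statement that λ = 0 corresponds to recognition alone; the exact weighting and normalization used in the thesis are defined in Chapter 3. All names and numbers below are hypothetical.

```python
def mse(outputs, targets):
    """Mean squared error between two equal-length vectors."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


def hybrid_error(class_out, class_target, recon_out, image, lam):
    """Classification MSE plus lam times reconstruction MSE (assumed weighting)."""
    return mse(class_out, class_target) + lam * mse(recon_out, image)


# Hypothetical outputs of one network for a 4-pixel "image" and 3 classes.
image = [0.2, 0.8, 0.5, 0.1]
recon_out = [0.25, 0.7, 0.5, 0.2]
class_target = [0.0, 1.0, 0.0]
class_out = [0.1, 0.8, 0.2]

# lam = 0: the error depends on the recognition task alone.
recognition_only = hybrid_error(class_out, class_target, recon_out, image, lam=0.0)
# lam > 0: the reconstruction error also shapes the hidden representation.
with_reconstruction = hybrid_error(class_out, class_target, recon_out, image, lam=0.5)
```

In a trained network both error terms are functions of the same hidden layer, so gradients of the reconstruction term act as a constraint on the features used for classification.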

Chapter 4: Discusses various information theory principles as constraints for the clas-

sification task. We introduce a hybrid neural network with a hidden representation which


has some useful properties, such as the independence between hidden layer neurons or

maximum information transfer in the hidden layer, etc.

Chapter 5: Discusses the face recognition task. We review different Neural Network methods used for face recognition and apply the hybrid networks introduced in Chapters 3–4. This chapter contains technical details related to face normalization and learning procedures. It is shown that the best regularized network is impractical for degraded image recognition, and that integration over different regularization parameters and different initial weights is preferable. This integration is roughly approximated by averaging over network ensembles. We consider three ensemble types: the unconstrained ensemble, which corresponds to integration over initial weights for a fixed trade-off parameter λ = 0, i.e. the hidden representation is based on the recognition task alone; the reconstruction ensemble, which corresponds to integration over different values of the trade-off parameter λ for fixed initial weights; and the joined ensemble, which corresponds to integration over both the trade-off parameter λ and the initial weights and is obtained by merging the unconstrained and reconstruction ensembles.
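The three ensemble types can be illustrated with a small sketch. It assumes that ensemble members are combined by averaging their class-output vectors and that the predicted class is the largest averaged output; the member outputs below are invented for illustration only.

```python
def average_outputs(member_outputs):
    """Element-wise average of the class-output vectors of the ensemble members."""
    n = len(member_outputs)
    return [sum(column) / n for column in zip(*member_outputs)]


def predict(member_outputs):
    """Index of the class with the largest averaged output."""
    averaged = average_outputs(member_outputs)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Hypothetical class outputs (3 classes) of trained networks for one test image.
unconstrained = [[0.6, 0.3, 0.1], [0.4, 0.5, 0.1]]   # lam = 0, different initial weights
reconstruction = [[0.2, 0.7, 0.1], [0.3, 0.6, 0.1]]  # different lam > 0, fixed initial weights
joined = unconstrained + reconstruction              # merge of the two ensembles

predicted_unconstrained = predict(unconstrained)
predicted_reconstruction = predict(reconstruction)
predicted_joined = predict(joined)
```

In this toy case the two smaller ensembles disagree, and the joined ensemble follows the class that receives the larger averaged output over all four members.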

Classification results on the degraded images, such as noisy, partially occluded and

blurred images are presented. We show that the joined ensemble is superior to the recon-

struction ensemble, which in turn is superior to the unconstrained ensemble. Finally we

conclude that reconstruction constraints improve generalization, especially under image

degradations. In addition we show that via saliency maps (Baluja, 1996) reconstruction

can deemphasize degraded regions of the input, thus leading to classification improvement

under “Salt and Pepper” noise.

Chapter 6: Addresses recognition of blurred and noisy images. In practice, images appear blurred due to motion, weather conditions and camera defocusing. Several methods that address recognition of blurred images are proposed: (i) expansion of the training set with Gaussian blurred images; (ii) constraining the reconstruction of blurred images to the original images during training; (iii) use of state-of-the-art restoration methods as preprocessing of degraded images.
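Method (i) can be sketched as a simple data-augmentation step. The sketch assumes a separable Gaussian kernel truncated at a small radius and border handling by clamping; the blur widths, the radius and the function names are illustrative choices, not the settings used in the thesis.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve each row with the kernel, clamping indices at the image border."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [[sum(kernel[k + radius] * row[min(max(x + k, 0), width - 1)]
                 for k in range(-radius, radius + 1))
             for x in range(width)]
            for row in image]


def transpose(image):
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma, radius=2):
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))


def expand_training_set(images, labels, sigmas):
    """Append a blurred copy of every image for each sigma, keeping its label."""
    out_images, out_labels = list(images), list(labels)
    for sigma in sigmas:
        for image, label in zip(images, labels):
            out_images.append(gaussian_blur(image, sigma))
            out_labels.append(label)
    return out_images, out_labels


# Tiny hypothetical training set: a flat image and an image with one bright pixel.
flat = [[0.5] * 5 for _ in range(5)]
spot = [[0.0] * 5 for _ in range(5)]
spot[2][2] = 1.0
train_images, train_labels = expand_training_set([flat, spot], [0, 1], sigmas=[1.0, 2.0])
```

For method (ii), the same blurred copies would be presented at the input while the original, unblurred image is kept as the reconstruction target.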

Three types of joined ensembles were considered and compared: an ensemble of networks trained on the original training data only, and two ensembles trained on the training set expanded with Gaussian blurred images, with reconstruction constraints of two types. In the first type, the input is simply duplicated at the output; in the second, the reconstruction target is as described in (ii) above.

It was shown that training with blurred images leads to robust classification under different types of blur operations and is more important than the restoration methods.

Chapter 7: Summarizes our research and gives some perspective to its future develop-

ment, such as:

• Testing the performance of the hybrid architecture on non-face data sets of images of similar objects, such as military, medical and astronomical images

• Ensemble interpretation

• Using the recurrent network architecture

• Weighted network ensemble averaging based on the different error types between

input and output reconstruction layers

• Using invariance-constraint regularization terms (tangent-prop-like, see Chapter 2) for different types of blur operations, for both the recognition and the reconstruction tasks

• Generalization of the proposed hybrid network to other types of generative (reconstruction) models constrained by the classification task

Chapter 2

Statistical formulation of the problem

Images as input to Neural Networks are very high dimensional data, with dimension equal to the number of pixels in the image. In this case, the number of network weight parameters is considerably larger than the size of the training set. This leads to the curse of dimensionality (Bellman, 1961), which means that there is not enough data to robustly train a classifier in a high dimensional space. Until recently, estimation in such cases seemed unrealistic, but it is now accepted that such estimation is possible if the actual dimensionality of the input data is much smaller. In other words, a true, intrinsic dimensionality reduction is possible. A simple dimensionality reduction solely via a bottleneck network architecture does not solve the problem, since the network remains an over-parameterized model (i.e. the number of free weight parameters remains large).

It is well known that the estimation error is composed of two portions, bias and variance (Geman et al., 1992). Over-parameterized models usually have a small bias (unless they are incorrect) but a high variance, since the available data is always small compared to the number of free parameters, and this leads to a high sensitivity to noise in the training data. To make the estimator robust, the variance portion of the error has to be controlled. One way to control the variance is to average single estimators trained on the same task. The other method controls the variance by introducing a learning bias as constraints on the network architecture. Different types of smoothing constraints are widely used (Wahba, 1990; Murray and Edwards, 1993; Raviv and Intrator, 1996; Munro, 1997). However, as has been pointed out by Geman et al. (1992), innovative bias constraints have to be used to solve the bias/variance dilemma. Introduction of these constraints into the network model leads naturally to a true dimensionality reduction (Intrator, 1999).

Below, we present the bias-variance dilemma and review methods to control the vari-

ance and bias portions of the prediction error. Then we propose to use image reconstruc-

tion as an innovative bias constraint for image classification. We proceed with discussion

on the relation between the “bias-variance” dilemma in statistics, MDL principle and

Bayesian networks.

2.1 Bias-Variance error decomposition for a single

predictor

The basic objective of the estimation problem is to find a function $f_D(x) = f(x; D)$ given a

finite training set $D$, composed of $n$ input/output pairs, $D = \{(x^\mu, y^\mu)\}_{\mu=1}^{n}$, $x \in R^d$, $y \in R^1$, drawn independently according to an unknown distribution $P(x, y)$, which “best”

approximates the “target” function y (Geman et al., 1992).

Evaluation of the performance of the estimator is usually done via a mean squared

error by taking the expectation with respect to the conditional probability P (y|x):

$$
\begin{aligned}
E(x; D) \equiv E[(y - f_D(x))^2 \mid x, D] ={}& \underbrace{E[(y - E[y|x])^2 \mid x, D]}_{Var(y|x)} + E[(f_D(x) - E[y|x])^2 \mid x, D] \\
& - \underbrace{2E[(y - E[y|x])(f_D(x) - E[y|x]) \mid x, D]}_{=0}
\end{aligned}
\tag{2.1.1}
$$

It can be seen that the third term in the sum is equal to zero, since $(f_D(x) - E[y|x])$

does not depend on the distribution P (y|x) and plays the role of a constant factor, while $E[(y - E[y|x]) \mid x, D]$ is equal to zero. The first term does not depend on the predictor f and

measures the variability of y given x (in the model with additive independent noise y =

f(x) + η(x) this term measures the noise variance at x). The contribution of the second

term can be reduced by optimizing f . This term measures the squared distance between

the estimator fD(x) and the mean of y given x (E[y|x]).

A good estimator has to generalize well to new sets drawn from the same distribution

P (y,x). A natural measure of the estimator effectiveness is an average error $E(x) \equiv E_D[E(x; D)] = E_D[E[(y - f_D(x))^2 \mid x, D]]$ over all possible training sets D of fixed size:

$$
E(x) = \underbrace{Var(y|x)}_{\text{intrinsic error}} + \underbrace{(E_D[f_D(x)] - E[y|x])^2}_{\text{squared bias } b^2(f|x)} + \underbrace{E_D[(f_D(x) - E_D[f_D(x)])^2]}_{\text{variance } var(f|x)}
\tag{2.1.2}
$$

The first term is an intrinsic error that can not be altered. If on average, fD(x) is

different from E[y|x], then fD(x) is biased. As we can see, an unbiased estimator may

still have a large mean squared error if the variance is large. Thus, either bias or variance

can contribute to poor performance (Geman et al., 1992). When training with a fixed


training set D, reducing the bias with respect to this set may increase the variance of

the estimator and contribute to poor generalization performance. This is known as the

tradeoff between variance and bias.
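The decomposition (2.1.2) can be checked numerically at a single input point. The following is only a minimal sketch under stated assumptions: the toy model y = sin(x) + Gaussian noise, the k-nearest-neighbour averaging estimator and all function names are illustrative choices that do not come from the thesis. It estimates the squared bias and the variance of f_D(x0) over many training sets and compares a flexible estimator (k = 1) with a smooth one (k = 15).

```python
import math
import random


def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training inputs closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def bias_variance_at(x0, k, n=30, trials=2000, sigma=0.3, seed=0):
    """Monte Carlo estimate of the squared bias and variance of f_D(x0).

    Each trial draws a fresh training set D of size n from the toy model
    y = sin(x) + noise, so that E[y|x0] = sin(x0) and Var(y|x0) = sigma**2.
    """
    rng = random.Random(seed)
    target = math.sin(x0)
    preds = []
    for _ in range(trials):
        xs = [rng.uniform(0.0, math.pi) for _ in range(n)]
        ys = [math.sin(x) + rng.gauss(0.0, sigma) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    mean_pred = sum(preds) / trials
    bias2 = (mean_pred - target) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    # Average squared distance to the regression function, E_D[(f_D - E[y|x])^2].
    distance = sum((p - target) ** 2 for p in preds) / trials
    return bias2, variance, distance


for k in (1, 15):
    b2, var, dist = bias_variance_at(math.pi / 2, k)
    print(f"k={k:2d}  bias^2={b2:.4f}  variance={var:.4f}  bias^2+variance={dist:.4f}")
```

In this toy setting the single-neighbour estimator is nearly unbiased but inherits the full noise variance, while averaging 15 neighbours lowers the variance at the price of a larger bias, which is the tradeoff described above.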

2.2 Variance control without imposing a learning bias

The variance portion of a prediction error can sometimes be reduced, without introducing

a bias, by ensemble averaging. An ensemble (committee) is a combination of single

predictors trained on the same task. For example, in neural networks, an ensemble is a

combination of individual networks that are trained separately and then their predictions

are combined. This combination is done by majority or plurality rules (in classification)

(Hansen and Salamon, 1990) or by a weighted linear combination of predictors in regression

(Meir, 1994; Naftaly et al., 1997). The plurality rule selects the decision that receives

the largest number of votes among the networks. The majority rule selects the decision agreed by

more than half of the networks; otherwise the ensemble declines to classify and an error is

reported. The most general method to create an ensemble has been presented by Wolpert

(Wolpert, 1992). The method is called stacked generalization and a non-linear network

learns how to combine the network outputs with the weights that vary over the feature

space.
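The two voting rules for classification ensembles can be sketched as follows. This is an illustrative sketch only: the function names, the example labels and the tie-breaking choice are assumptions, not part of the cited work.

```python
from collections import Counter


def plurality_vote(decisions):
    """Return the label that receives the largest number of votes.

    Ties are broken in favour of the label that was seen first, which is
    an arbitrary choice made for this sketch.
    """
    return Counter(decisions).most_common(1)[0][0]


def majority_vote(decisions):
    """Return the label chosen by more than half of the networks.

    Returns None when no label reaches a strict majority, i.e. the
    ensemble declines to classify the pattern.
    """
    label, count = Counter(decisions).most_common(1)[0]
    return label if 2 * count > len(decisions) else None


votes = ["face_A", "face_B", "face_A", "face_C", "face_D"]
print(plurality_vote(votes))   # face_A
print(majority_vote(votes))    # None
```

With five networks and only two votes for the leading label, the plurality rule still classifies, whereas the majority rule rejects the pattern.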

It is well known that an ensemble is useful if its individual predictors are independent

in their errors or disagree on some inputs. Thus, the main question is to find network

candidates that achieve this independence. One of the widely spread methods to create

neural network ensembles is based on the fact that neural networks are non-identifiable

models, i.e. the selection of the weights is an optimization problem with many local

minima. Thus, a network ensemble is created by varying the set of initial random weights

(Perrone, 1993). Another way is to use different types of predictors, like a mixture of

networks with a different topology and complexity or a mixture of networks with completely

different types of learning rules (Jacobs, 1997). Another way is to train the networks on

different training sets. Below, a bias-variance error decomposition for a weighted linear

combination of predictors is presented (Raviv, 1998; Tesauro et al., 1995).

Let us consider M predictors fi(x,Di), each trained on a training set Di. All training

sets have the same size and are drawn from the same joint distribution P (y,x). Consider

the ensemble based on the linear combination of predictors:

$$
f_{ens}(x) = \sum_i a_i f_i(x, D_i), \qquad \sum_i a_i = 1, \quad a_i \ge 0, \quad i = 1, 2, \ldots, M.
\tag{2.2.3}
$$


The normalization condition $\sum_i a_i = 1$ is imposed to make the ensemble unbiased when

each individual estimator $f_i$ is unbiased. Let us consider the error (2.1.2) for this ensemble:

$$
E_{ens}(x) = Var(y|x) + b^2(f_{ens}|x) + var(f_{ens}|x),
\tag{2.2.4}
$$

where the bias $b(f_{ens}|x)$ is given as:

$$
b(f_{ens}|x) = E_{D_1, D_2, \ldots, D_M}\Big[\sum_i a_i f_i(x, D_i) - E[y|x]\Big] = \sum_i a_i E_{D_i}\big[f_i(x, D_i) - E[y|x]\big] = \sum_i a_i b(f_i|x).
\tag{2.2.5}
$$

Thus the bias of the ensemble is the same linear combination of the biases of the estima-

tors. Expanding the ensemble variance term we get:

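$$
\begin{aligned}
var(f_{ens}|x) &= E_{D_1, D_2, \ldots, D_M}\Big[\Big(\sum_i a_i f_i(x, D_i) - E_{D_1, D_2, \ldots, D_M}\Big[\sum_i a_i f_i(x, D_i)\Big]\Big)^2\Big] \\
&= E_{D_1, D_2, \ldots, D_M}\Big[\Big(\sum_i a_i f_i(x, D_i) - \sum_i a_i E_{D_i}[f_i(x, D_i)]\Big)^2\Big] \\
&= E_{D_1, D_2, \ldots, D_M}\Big[\Big(\sum_i a_i \big(f_i(x, D_i) - E_{D_i}[f_i(x, D_i)]\big)\Big)^2\Big] \\
&= E_{D_1, D_2, \ldots, D_M}\Big[\sum_i a_i^2 \big(f_i(x, D_i) - E_{D_i}[f_i(x, D_i)]\big)^2 \\
&\qquad + 2\sum_{i>j} a_i a_j \big(f_i(x, D_i) - E_{D_i}[f_i(x, D_i)]\big)\big(f_j(x, D_j) - E_{D_j}[f_j(x, D_j)]\big)\Big] \\
&= \sum_i a_i^2\, var(f_i|x) + 2\sum_{i>j} a_i a_j E_{D_i, D_j}\big[(f_i - E_{D_i}[f_i])(f_j - E_{D_j}[f_j])\big]
\end{aligned}
$$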

Finally, we get the next expression for the ensemble error:

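$$
\begin{aligned}
E_{ens}(x) ={}& Var(y|x) + \Big(\sum_i a_i b(f_i|x)\Big)^2 + \sum_i a_i^2\, var(f_i|x) \\
& + 2\sum_{i>j} a_i a_j E_{D_i, D_j}\big[(f_i - E_{D_i}[f_i])(f_j - E_{D_j}[f_j])\big]
\end{aligned}
\tag{2.2.6}
$$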

If all estimators are uncorrelated and have identical biases and variances, simple averaging

with the same weights $a_i = 1/M$ leads to the following ensemble error (Raviv, 1998):

$$
E(x) = Var(y|x) + b^2(f|x) + \frac{1}{M}\, var(f|x).
$$

This decomposition shows that when biases are small and predictors are independent a

significant reduction of order 1/M in the variance may be attained.
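The role of the cross terms in (2.2.6) can be illustrated with a small numerical sketch. The covariance matrices below are hypothetical (equal variances and one common correlation), and the function names are illustrative only.

```python
def ensemble_variance(weights, cov):
    """Variance of sum_i a_i f_i given the covariance matrix of the f_i.

    cov[i][j] stands for E[(f_i - E f_i)(f_j - E f_j)] at a fixed x, so the
    double sum below equals the last two terms of (2.2.6).
    """
    m = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j]
               for i in range(m) for j in range(m))


def constant_cov(m, variance, correlation):
    """Covariance matrix with equal variances and one common correlation."""
    return [[variance if i == j else correlation * variance
             for j in range(m)] for i in range(m)]


M, v = 10, 0.5
uniform = [1.0 / M] * M
print(ensemble_variance(uniform, constant_cov(M, v, 0.0)))  # about v / M
print(ensemble_variance(uniform, constant_cov(M, v, 1.0)))  # about v
```

Uncorrelated predictors give the 1/M reduction, while fully correlated predictors give no reduction at all, which is why independence of the errors matters.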

If estimators are unbiased and uncorrelated it is easy to show that optimal weights

have to be inversely proportional to the variance of the individual predictors, $a_i \propto 1/var(f_i|x)$

(Tresp and Taniguchi, 1995; Taniguchi and Tresp, 1997). Intuitively it means that a

predictor that is uncertain about its own prediction should obtain a smaller weight.
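For completeness, the inverse-variance weighting can be obtained with a Lagrange multiplier. The derivation below is a sketch under the stated assumptions (unbiased, uncorrelated estimators, so that only the term $\sum_i a_i^2\, var(f_i|x)$ of (2.2.6) depends on the weights); the shorthand $v_i$ and the multiplier $\mu$ are introduced here for illustration.

$$
\begin{aligned}
&\min_{a} \sum_i a_i^2 v_i \quad \text{subject to} \quad \sum_i a_i = 1, \qquad v_i \equiv var(f_i|x), \\
&\frac{\partial}{\partial a_i}\Big[\sum_j a_j^2 v_j - \mu\Big(\sum_j a_j - 1\Big)\Big] = 2 a_i v_i - \mu = 0 \;\Rightarrow\; a_i = \frac{\mu}{2 v_i}, \\
&\sum_i a_i = 1 \;\Rightarrow\; a_i = \frac{1/v_i}{\sum_j 1/v_j}, \qquad var(f_{ens}|x) = \frac{1}{\sum_j 1/v_j}.
\end{aligned}
$$

The resulting weights are automatically non-negative, and for equal variances they reduce to $a_i = 1/M$ with ensemble variance $v/M$.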


2.3 Variance control by imposing a learning bias

The regression function E[y|x] is the best estimator in the mean squared error sense. In order to find an unbiased estimator,

the family of possible estimators has to be rich. In MLP (multi-layer perceptron)

networks, this may be attained by growing the network architecture. This

eliminates bias, but increases variance unless the training data is infinite. In practice, the

training data is finite and the main question is how to make both the bias and the variance “small”

using finite training sets (Geman et al., 1992). Geman et al. point out that under this

limitation the learning task is to generalize in a very nontrivial sense, since the training

data will never “cover” the space of possible inputs. This extrapolation is possible if the

important generalizations are prewired in learning algorithms by purposefully introducing

a bias.

The most general and weakest a-priori constraints assume that the mapping is smooth.

Other, stronger a-priori constraints may be expressed as an invariance of the mapping

to some group of transformations or as an assumption about the class of possible mappings.

Another type of specific bias constraints appears when a supervised task is learned in

parallel with its other related tasks.

A categorization of the different types of constraints into two groups, variance constraints and

bias constraints, has been proposed in (Intrator, 1999). Both types of constraints serve

to reduce the variance portion of the generalization error, however they have a different

effect on the bias portion of the error. Variance constraints always result in an increase of

the bias portion of the error. In contrast, bias constraints assist in learning and even may

reduce the bias portion of the error. When networks are trained to satisfy constraints only,

the bias constraints lead to a meaningful hidden representation, capturing the structure

of the input domain; while a hidden representation extracted via the variance constraints

is less interesting.

2.3.1 Smoothness constraints

The easiest way to smooth the mapping approximated by neural networks is by controlling

network structure parameters such as the numbers of hidden units and hidden layers. The

larger the number of network units, the larger the number of adjustable weight parameters.

Over-parameterized models are highly flexible and reduce bias. However, they are

sensitive to noise, which leads to a large variance and a large generalization error. Another

way to control smoothness in neural networks, borrowed from the spline theory (Wahba,

1990), is to use weight decay. This involves adding a penalty term controlling the norm of the

weights to the network cost function $E = \sum_i \| y_i - f(x_i, \omega) \|^2$ (other forms of cost functions

are presented in (Bishop, 1995a)):

$$
E_\lambda = E + \lambda \| \omega \|^2,
$$

where $x_i$ and $y_i$ are the suitably scaled input and output samples ($\| z \|$ is the norm

in the space of the element z). Another tightly related approach is to constrain the range

of the weights to some middle values. The method is called weight elimination and the

regularization term has the form $\lambda \sum_i \omega_i^2 / (\omega_i^2 + \omega_{i0}^2)$. A direct approach is to consider a

regularizer which penalizes curvature explicitly:

$$
E_\lambda = E + \lambda \| P f \|^2,
$$

where P is a differential operator. Another way to control the smoothness is to inject noise

during the learning. The noise is usually added to the training data (Bishop, 1995a; Raviv

and Intrator, 1996), but may be added to the hidden units (Munro, 1997) or weights

(Murray and Edwards, 1993) during learning as well. It has been shown (Bishop, 1995b)

that learning with input noise is equivalent to Tikhonov (direct curvature) regularization.

Though smoothness constraints bias toward smooth models, they are essentially variance

constraints.
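The two weight penalties of this section can be written down directly. The sketch below is illustrative only: the function names are hypothetical, a single scale w0 shared by all weights is assumed for weight elimination, and the one-parameter linear model is used merely because its penalized minimizer has a closed form.

```python
def weight_decay(weights, lam):
    """Penalty lam * ||w||^2 added to the cost in weight decay."""
    return lam * sum(w * w for w in weights)


def weight_elimination(weights, lam, w0=1.0):
    """Penalty lam * sum_i w_i^2 / (w_i^2 + w0^2).

    A single scale w0 shared by all weights is an assumption of this sketch.
    Each term saturates at 1, so large weights are penalised roughly equally.
    """
    return lam * sum(w * w / (w * w + w0 * w0) for w in weights)


def ridge_slope(xs, ys, lam):
    """Minimiser of sum_i (y_i - w * x_i)^2 + lam * w^2 for a scalar w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_slope(xs, ys, lam))   # the fitted slope shrinks as lam grows
```

Increasing λ pulls the fitted slope towards zero: the variance of the estimate is reduced, but a bias towards small weights is introduced, in line with the remark that smoothness constraints are essentially variance constraints.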

2.3.2 Invariance bias constraints

Given infinite training data and unlimited training time, a network can learn the

regression function. However, the data is rather limited in practice and this limitation

may be overcome by imposing bias as invariance constraints. One way to implement this

regularization is by training the system with additional data. This data is obtained by

distorting (translating, rotating, etc.) the original patterns (Baird, 1990; Baluja, 1996),

while leaving the corresponding targets unchanged. This procedure, called the distortion

model, has two drawbacks. First, the magnitude of distortion and the number of artificially

degraded patterns have to be defined. Second, the generated data is correlated with

the original training data. This type of regularization is referred to as a data driven

regularization (Raviv, 1998).

An alternative way is to impose invariance constraints by adding a regularization term

to the mean squared error E (Simard et al., 1992). The regularization term penalizes

changes in the output when the input is transformed under the invariance group. Let

x be an input, y = f(x,w) be the input-output function of the network and s(α,x) a

transformation parameterized by some parameter α, such that s(0,x) = x. The

invariance condition for every pattern $x^\mu$ is written as:

$$
f(s(\alpha, x^\mu), w) - f(s(0, x^\mu), w) = 0.
\tag{2.3.7}
$$

For an infinitesimal α this constraint may be rewritten as:

$$
\frac{\partial f(s(\alpha, x^\mu), w)}{\partial \alpha}\Big|_{\alpha=0} = 0, \quad \text{or} \quad f_x(x^\mu, w) \cdot t^\mu = 0, \qquad t^\mu = \frac{\partial s(\alpha, x^\mu)}{\partial \alpha}\Big|_{\alpha=0},
\tag{2.3.8}
$$

where $f_x$ is the Jacobian (matrix) of the estimator f for a pattern $x^\mu$, and $t^\mu$ is a tangent

vector associated with the transformation s. The penalty term is written as $\Omega(f, w) = \sum_\mu \| f_x \cdot t^\mu \|^2$ and the penalized function is $E_\lambda = E + \lambda \Omega(f, w)$. This regularization term

states that the function f should have zero derivatives in the directions defined by the

group of invariance and is called tangent prop.

The tangent prop is an infinitesimal form of the invariance “hint” proposed by Abu-Mostafa

(Abu-Mostafa, 1993). The conditions of equivalence between adding distorted

examples and regularized cost function are presented in (Leen, 1995). In particular, it is

shown that smoothed regularizers may be obtained as a special case of a random shifting

invariance group: s(α, x) = x + α, where α is a Gaussian variable with a spherical

covariance matrix. Obviously, non-trivial invariance constraints belong to a bias type of

constraints.
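The tangent prop penalty can be sketched for a two-dimensional input and the rotation group. This is an illustration under stated assumptions rather than the procedure of Simard et al.: the derivative along the tangent vector is approximated by a central finite difference instead of being back-propagated, the mapping f is a scalar toy function, and all names are hypothetical.

```python
import math


def rotate(alpha, x):
    """s(alpha, x): rotate the 2-D point x by the angle alpha; s(0, x) = x."""
    c, s = math.cos(alpha), math.sin(alpha)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])


def tangent_prop_penalty(f, patterns, transform=rotate, h=1e-5):
    """Sum over patterns of (d f(s(alpha, x)) / d alpha at alpha = 0) squared.

    The directional derivative f_x . t is approximated by a central finite
    difference instead of being computed analytically.
    """
    total = 0.0
    for x in patterns:
        d = (f(transform(h, x)) - f(transform(-h, x))) / (2.0 * h)
        total += d * d
    return total


patterns = [(0.0, 1.0), (1.0, 1.0)]
radial = lambda x: x[0] ** 2 + x[1] ** 2      # invariant to rotation
first_coordinate = lambda x: x[0]             # not invariant to rotation
print(tangent_prop_penalty(radial, patterns))
print(tangent_prop_penalty(first_coordinate, patterns))
```

The rotation-invariant mapping incurs an essentially zero penalty, while the mapping that reads out the first coordinate is penalized, so adding λΩ to the cost pushes the network towards the invariant solution.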

2.3.3 Specific bias constraints

These constraints express our a-priori heuristic knowledge about the problem. A com-

bination of the Exploratory Projection Pursuit (EPP) method with Projection Pursuit

Regression (PPR) in feed-forward neural networks (Intrator, 1993a; Intrator et al., 1996;

Intrator, 1999) and the multi-task learning (MTL) method (Caruana, 1995), are examples

of this type of the bias constraints.

Hybrid EPP/PPR neural networks

PPR is a method to perform dimensionality reduction by approximating the desired func-

tion as a composition of lower dimensional smooth functions that act on linear dimensional

projections of the input data (Friedman, 1987). In other words, PPR tries to approximate

the best estimator, that is a regression function f(x) = E[Y |X = x] from observations

$D = \{(x^\mu, y^\mu)\}_{\mu=1}^{n}$ by a sum of ridge functions $g_j$ (functions that are constant along lines):

$$
f(x) \approx \sum_{j=1}^{m} g_j(a_j \cdot x), \quad j = 1, \ldots, m.
\tag{2.3.9}
$$

In the feed-forward neural networks, the ridge functions are set in advance (as logistic


sigmoidal, for example) and the output is approximated as

$$
f(x) \approx \sum_{j=1}^{m} \beta_j \sigma(a_j \cdot x), \quad j = 1, \ldots, m, \quad x, a_j \in R^d
\tag{2.3.10}
$$

where an input vector x is usually extended by adding an additional component equal

to 1. Thus, in neural networks only projection directions aj and coefficients βj have

to be estimated. However, when the input is high-dimensional, even the dimensionality

reduction neural networks ($m \ll d$) are over-parameterized models that require additional

regularization constraints.
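The approximation (2.3.10) and the resulting parameter count can be made concrete with a short sketch. The image size, the number of hidden units and the training set size below are hypothetical numbers chosen only to illustrate the over-parameterization; the function names are likewise illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def network_output(x, directions, betas):
    """f(x) = sum_j beta_j * sigmoid(a_j . x~), with x~ = (x, 1)."""
    x_ext = list(x) + [1.0]    # extra component equal to 1 (bias input)
    return sum(beta * sigmoid(sum(a_i * x_i for a_i, x_i in zip(a, x_ext)))
               for a, beta in zip(directions, betas))


def n_parameters(d, m):
    """m projection directions of length d + 1 plus m output coefficients."""
    return m * (d + 1) + m


# Hypothetical sizes: a 64 x 64 image, 10 hidden units, 200 training images.
d, m, n_train = 64 * 64, 10, 200
print(n_parameters(d, m), "free parameters versus", n_train, "training samples")
```

Even with only ten hidden units, the number of free parameters in this hypothetical setting exceeds the number of training samples by two orders of magnitude, which is why additional constraints are needed.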

The smoothness constraint considered above is one way to reduce the variance of

the network. Another way to impose bias constraints related to the data structure has

been proposed by Intrator (Intrator, 1993a). The idea is to train a network (via a back-propagation

algorithm) to fit the desired output and to extract a low-dimensional structure

of the data using EPP (Friedman, 1987) simultaneously. EPP is an unsupervised method

that searches in the high dimensional space directions with good clustering properties,

characterized by projection indices. An example of a combination of supervised learning

with unsupervised learning using a BCM (Bienenstock, Cooper and Munro) neuron (Bienenstock

et al., 1982; Intrator and Cooper, 1992) has been proposed in (Intrator, 1993b). This

neuron is trained by minimizing a specific projection index that emphasizes the

multi-modality in the data.

Computationally, EPP constraints are expressed as minimization of a function ρ(w)

measuring the quality of the input after projection and a possible nonlinear transformation

φ: ρ(w) ≡ E[H(φ(w · x))], where φ(w · x) is a hidden representation A of the network,

H is a function measuring the quality of the hidden representation, and averaging takes

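place over an ensemble of the input. The EPP constraints are introduced by modifying

the synaptic weight learning rule:

$$
\frac{\partial w_{ij}}{\partial t} = -\varepsilon \Big[ \frac{\partial E(w, x)}{\partial w_{ij}} + \frac{\partial \rho(w)}{\partial w_{ij}} + C \Big],
\tag{2.3.11}
$$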

where C is an additional complexity penalty term, such as smoothness constraints or the

number of learning parameters.

Multi-task learning (MTL)

Another attractive, intuitive way to derive different types of bias constraints is MTL.

MTL is a widespread method in machine learning. It proposes to learn additional

tasks defined on the same data domain as the main task in order to improve the generalization

ability of the latter. Though the MTL idea is borrowed from the observation that humans

successfully learn many related tasks at once, it has a rigorous mathematical base. It is

easy to see that the additional task learning in MTL emerges as a bias imposing mecha-

nism, that controls the balance between the bias-variance portions of the generalization

error.

In artificial neural networks, the MTL approach is realized via connectionist network

architectures in which one shared hidden representation is used for multiple

tasks. The hidden weights, which connect the input to this shared representation, are updated

with a linear combination of the gradients of the individual tasks during back-propagation of their errors.

Such learning moves the shared hidden layer towards representations that better reflect

regularities of the input domain.
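The update of the shared weights can be sketched on the smallest possible example. Everything in this sketch is an assumption made for illustration: a single linear hidden unit, two linear task heads with fixed coefficients, squared errors, and a hypothetical weight `lam` on the auxiliary task.

```python
def forward(w, v_main, v_aux, x):
    """One shared linear hidden unit h = w . x feeding two linear task heads."""
    h = sum(wi * xi for wi, xi in zip(w, x))
    return h, v_main * h, v_aux * h


def mtl_loss(w, v_main, v_aux, x, t_main, t_aux, lam):
    """Squared error of the main task plus lam times that of the auxiliary task."""
    _, y_main, y_aux = forward(w, v_main, v_aux, x)
    return (y_main - t_main) ** 2 + lam * (y_aux - t_aux) ** 2


def shared_gradient(w, v_main, v_aux, x, t_main, t_aux, lam):
    """Gradient of the combined loss with respect to the shared weights w.

    It is the main-task gradient plus lam times the auxiliary-task gradient.
    """
    _, y_main, y_aux = forward(w, v_main, v_aux, x)
    g_main = [2.0 * (y_main - t_main) * v_main * xi for xi in x]
    g_aux = [2.0 * (y_aux - t_aux) * v_aux * xi for xi in x]
    return [gm + lam * ga for gm, ga in zip(g_main, g_aux)]


w, x = [0.2, -0.1], [1.0, 2.0]
grad = shared_gradient(w, 0.5, 1.5, x, t_main=1.0, t_aux=0.0, lam=0.3)
w = [wi - 0.1 * gi for wi, gi in zip(w, grad)]   # one gradient-descent step
print(w)
```

The auxiliary task never changes the main-task output weights directly; it influences the main task only through the shared representation, which is where the bias is imposed.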

Though the measure of task relation can not be rigorously defined, some mechanisms

explaining the benefit of MTL have been suggested (Caruana, 1995; Abu-Mostafa, 1994).

Nevertheless, the way to test the appropriateness of the related task as a proper bias

is empirical. It is easy to see that the combination of EPP and PPR neural networks

can be also considered in the MTL framework, though in MTL, a related task is usually

expressed more loosely and heuristically than the EPP constraints.

2.4 Reconstruction bias constraints

As shown above in Section 2.3.3, feed-forward Neural Networks, which require estimation

of many parameters, are subject to the bias/variance dilemma. We have also seen in

Sections 2.2–2.3 that different ways to control the bias/variance portion of the predictor

error exist. However, when the dimensionality of the input is very high, innovative ways

to reduce the variance portion of the error, as well as methods to impose (reasonable)

bias, are required.

In this thesis, continuing the previous line of study, we propose a new kind of spe-

cific bias constraints for image classification feed-forward networks in the form of the

image reconstruction. We also consider new information theory constraints, seeking di-

verse structure in the data and compare the effect of the different constraints on the

generalization performance of the classification neural network.

Below, we discuss Bayesian and minimum description length (MDL) frameworks for

learning in neural networks. We show that the bias-variance dilemma can be naturally

reformulated in the MDL framework, where learning constraints emerge as a model-cost,

that has to be minimized along with an error-cost, which is represented as the mean

squared error (MSE) on the main learning task.


2.5 Minimum Description Length (MDL) Principle

In the MDL formulation, one searches for a model that allows the shortest data encoding,

together with a description of the model itself (Rissanen, 1985). One of the first perspec-

tives for applying the MDL principle in Neural Networks was pointed out by Nowlan and

Hinton (1992) for supervised learning. In supervised learning, the output y is predicted

from the input x which is presented at the input layer. The network model is defined by

the weight parameters. Thus, to specify the desired output y given x, the weights and

errors in the output layer have to be described. If it is assumed that the output errors

are Gaussian with a fixed variance, then the number of bits needed to describe the errors is, up to an additive constant, proportional to the mean-squared

recognition error. The weights are encoded using different weight probability models

and their description length is the negative log of the weight probabilities. The weight description

length is equivalent to different complexity terms and the MDL principle leads to a

regularization approach in the Neural Networks. For example, the Gaussian probabilistic

model leads to the weight decay regularization term (see Section 2.7). A more sophis-

ticated form of weight decay is obtained when the weights are encoded as a mixture of

Gaussians (Nowlan and Hinton, 1992).
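The link between Gaussian coding and squared-error terms can be checked numerically. The sketch below is an illustration under stated assumptions: a zero-mean Gaussian with a fixed standard deviation is used as the coding distribution, the density stands in for a probability (so the constant cost of the quantization precision is omitted), and the error values are made up.

```python
import math


def gaussian_code_length(values, sigma):
    """Sum of -log2 N(v; 0, sigma^2) over values, i.e. an idealised length in bits.

    The density is used in place of a probability, so the (constant) cost of
    the quantisation precision is left out.
    """
    const = 0.5 * math.log2(2.0 * math.pi * sigma ** 2)
    scale = 1.0 / (2.0 * sigma ** 2 * math.log(2.0))
    return sum(const + scale * v * v for v in values)


errors_a = [0.1, -0.2, 0.05]
errors_b = [0.4, -0.5, 0.3]
sse = lambda es: sum(e * e for e in es)

# The difference in code length is the difference in squared error,
# rescaled by 1 / (2 sigma^2 ln 2).
sigma = 0.5
extra_bits = gaussian_code_length(errors_b, sigma) - gaussian_code_length(errors_a, sigma)
print(extra_bits, (sse(errors_b) - sse(errors_a)) / (2 * sigma ** 2 * math.log(2)))
```

Applying the same function to a weight vector gives a cost proportional to the squared weight norm plus a constant, which is the weight decay term mentioned above.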

Later on the MDL principle was applied for unsupervised learning, in particular for

autoencoder networks (Zemel, 1993) (see also Section 3.1.2). The autoencoder network

is a feed-forward network which duplicates the observed input in the output layer. The

autoencoder network has a natural interpretation in the MDL framework (Hinton and

Zemel, 1994). It discovers an efficient way to communicate data to a receiver. A sender

uses a set of input-to-hidden weights and, in general, non-linear activation functions to

convert the input into a compact hidden representation. This representation has to be

communicated to the receiver along with the reconstruction errors and hidden-to-top

weights. Receiving the hidden-to-top weights, the receiver reconstructs the input from

this abstract representation and communicated errors. The description length in this case

consists of three parts:

1. The set of activities A of the representation units. These are codes that the net

assigns to each training input sample. Encoding activities of the representation

(hidden) units makes it possible to avoid communicating the hidden weights and does not

require the knowledge of the input data X . However, the sender and the receiver

have to agree on the a-priori distribution of the internal representation. This part

of the message corresponds to the representation-cost.

2. The set of hidden-to-output weights W . This part of the message is represented by

the weight-cost.


3. The reconstruction error, which is a disagreement between desired and predicted

outputs. This part of the message is represented by the reconstruction or the error-

cost. In order to evaluate the latter, the sender and receiver have to agree on the

probability of the desired output of the network given its actual output.

In the standard autoencoder, the weight cost is neglected and the representation cost

is considered to be small and proportional to the number of network hidden units, since it

is assumed that all units participate equally in the data representation. However,

instead of the direct evaluation of the representation code, the autoencoder with a

bottleneck in the hidden layer is trained to minimize the MSE reconstruction error. In

contrast, in the nonstandard versions of autoencoders (Zemel, 1993), the representation

cost is evaluated explicitly and its minimization encourages sparse distributed represen-

tation, where only few neurons are active, which are responsible for the presence of the

specific features in the patterns.

The main difference between the MDL principle for supervised and unsupervised learn-

ing proposed by Zemel may be understood considering the unlimited number of training

samples. When the number of patterns is infinite, the model cost of the supervised

learning, which is the cost of the weights, is negligible. In contrast, in the unsupervised

learning, the model cost never vanishes and the MDL is applied per sample to minimize

representation cost and to maintain data fidelity.

In this thesis, we combine supervised and unsupervised learning in the hybrid re-

construction/recognition network and formulate the MDL principle for this case (see Sec-

tion 3.2.3). It turns out that this interpretation is three-fold, depending on what is defined

as the main task:

1. When the main task is reconstruction (Gluck and Myers, 1993, a hippocampus

model), the reconstruction MSE is an error cost and the recognition MSE is a model

cost (or a representation cost, since the recognition MSE depends on the hidden

layer representation and on the recognition top weights, which must not affect the

description length). Thus, the network maintains the data fidelity and encourages

representation with a good discriminative property.

2. When the main task is recognition and it is assumed that the sender observes both

the input and output, while the receiver sees only the input, the recognition MSE is

an error cost as in supervised learning and the reconstruction MSE is a model cost

(or a representation cost). However, in contrast to a standard supervised learning

the representation cost never vanishes.


3. When the main task is recognition, but the receiver sees neither x nor y, he

has to reconstruct x and predict y in parallel. Thus, the sender encodes x, taking

into account also the dependence of y on x. He sends the encoded data and errors

of recognition and reconstruction outputs, since in the supervised learning the task

is to predict y for the given x. In this case, both the recognition and reconstruction

MSE stand for error codes and the representation cost is restricted to a small number

of the hidden units.

2.5.1 Minimum description length

MDL can be formulated based on an imaginary communication game, in which a sender

observes the data D and communicates it to the receiver. Having observed the data, the

sender discovers that the data has some regularity that can be captured by a model M.

This fact encourages the sender to encode the data using a model, instead of sending the

data as it is. Due to noise, there are always aspects of the data which are unpredicted by

the model, that can be seen as errors. Both the errors and the model have to be conveyed

to the receiver to enable him to reproduce the data. The goal of the sender is to encode

data so that it can be transmitted as accurately and compactly as possible.

It is clear that complex models can achieve a high accuracy, but their description

is expensive. In contrast, models which are too simple or wrong, are not able to extract the

data regularity. Intuitively, such a communication game can be thought of as a tradeoff

between the compactness of the model and its accuracy.

To transmit the data the sender composes a message consisting of two parts. The first

part of the message with a length L(M) specifies the model and the second with a length

L(D|M) describes the data D with respect to the model M. The goal of the sender is

to find a model that minimizes the length of this encoded message L(M,D), called the

description length:

$$
L(M, D) = L(D|M) + L(M).
\tag{2.5.12}
$$

According to Shannon’s theory (Shannon, 1948; Cover and Thomas, 1991) to encode

a random variable X with the known distribution p(X) by the minimum number of bits,

a realization x has to be encoded by − log p(x) bits. Thus the description length (2.5.12)

is represented as:

$$
L(M, D) = -\log p(D|M) - \log p(M),
\tag{2.5.13}
$$

where p(D|M) is the probability of the output data given the model, and p(M) is an

a-priori model probability. The MDL principle requires searching for a model $M^\star$ that

minimizes the description length (2.5.13):

$$
M^\star = \arg\min_M \big( -\log p(D|M) - \log p(M) \big).
\tag{2.5.14}
$$
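The selection rule (2.5.14) can be illustrated on a deliberately small example. The sketch below is not taken from the thesis: the data are ten binary symbols, the candidate models are two hypothetical Bernoulli models with assumed prior probabilities, and the code lengths are idealized (fractional bits).

```python
import math


def description_length(data, theta, prior):
    """L(M, D) = -log2 p(D|M) - log2 p(M) for a Bernoulli(theta) model M.

    theta must lie strictly between 0 and 1.
    """
    ones = sum(data)
    zeros = len(data) - ones
    data_bits = -(ones * math.log2(theta) + zeros * math.log2(1.0 - theta))
    model_bits = -math.log2(prior)
    return data_bits + model_bits


# Hypothetical candidate models: name -> (theta, prior probability p(M)).
models = {"fair": (0.5, 0.5), "biased": (0.9, 0.5)}
data = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]     # nine ones, one zero

lengths = {name: description_length(data, theta, prior)
           for name, (theta, prior) in models.items()}
best = min(lengths, key=lengths.get)
print(lengths, best)
```

With equal priors the model cost is the same for both candidates, so the model that captures the regularity in the data wins; lowering the prior of a model adds bits to its description and can reverse the choice.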

As we have seen in Section 2.1, in the supervised learning the problem is to find a model

that describes output y as a function of input x based on the available input/output pairs

$D = \{(x^\mu, y^\mu)\}_{\mu=1}^{n}$. In a standard application of MDL to supervised learning, the output y

is treated as the data D that has to be communicated between the sender and the receiver,

while the input data X is assumed to be known by them. Therefore, all the probabilities

in the formula (2.5.13) are conditioned on the input data, i.e. p(M) ≡ p(M |X ) and

p(D|M) ≡ p(D|M,X ). However, to simplify the notation we omit X in these expressions.

The connection between MDL and Bayesian theory for Neural Networks is demon-

strated in the next section.

2.6 Bayesian framework

In the Bayesian framework, one seeks a model that maximizes a posterior probability of

the model M given the observed input/output data (X , D):

$$
p(M|D, \mathcal{X}) = \frac{p(D|M, \mathcal{X})\, p(M|\mathcal{X})}{p(D|\mathcal{X})}.
\tag{2.6.15}
$$

Usually, in the feed-forward networks trained by supervised learning the distribution of

the input data p(x) is not modeled¹. Thus, in (2.6.15), X always appears as a conditioning

variable, which we omit to simplify the notation (similar to the convention accepted for

the description length evaluation):

$$
p(M|D) = \frac{p(D|M)\, p(M)}{p(D)}.
\tag{2.6.16}
$$

Since p(D) does not depend on the model and the most plausible model $M^\star$ has to

minimize the negative logarithm of the posterior probability, we get:

$$
M^\star = \arg\min_M \big[ -\log p(D|M) - \log p(M) \big].
\tag{2.6.17}
$$

Usually, to apply both the MDL and Bayesian frameworks, one decides in advance on a

class of parameterized models and then searches within this class of parameters to optimize

a corresponding criterion. The probability of the data, given a model parameterized by

w, can be computed by integrating over the model parameter distribution:

$$
p(D|M) = \int p(D|M, w)\, p(w|M)\, dw.
\tag{2.6.18}
$$

¹ In Section 3.2.3 we will consider the effect of such modelling.


Using the Bayesian formula we get:

$$
p(w|M, D) = \frac{p(w, D|M)}{p(D|M)} = \frac{p(D|M, w)\, p(w|M)}{p(D|M)},
\tag{2.6.19}
$$

that shows that a posterior probability of the weights p(w|M,D) is proportional to

p(D|M,w)p(w|M). It is usually assumed that a posterior probability of the weights

p(w|M,D) is highly peaked at the most plausible parameter $w^\star$, and the integral (2.6.18)

may be approximated by the height of the peak of the integrand p(D|M,w)p(w|M), times

a width of this distribution $\Delta w_{|M,D}$ (MacKay, 1992):

$$
p(D|M) \approx \underbrace{p(D|w^\star, M)}_{\text{best fit likelihood}} \times \underbrace{p(w^\star|M)\, \Delta w_{|M,D}}_{\text{Occam factor}}
\tag{2.6.20}
$$

The quantity $\Delta w_{|M,D}$ is the posterior uncertainty in w. Assuming that the prior $p(w^\star|M)$

is uniform on some large interval $\Delta^0 w$, representing the range of values of w that the

model M admits before seeing the data D, $p(w^\star|M)$ simplifies to $p(w^\star|M) \approx 1/\Delta^0 w$, and

$$
\text{Occam factor} = \frac{\Delta w}{\Delta^0 w}.
\tag{2.6.21}
$$

Thus the Occam factor is the ratio of the posterior accessible volume of the model parameter

space to the prior accessible volume. Typically, a complex model with many

parameters has a larger prior weight uncertainty $\Delta^0 w$. Thus, the Occam factor is smaller

and it penalizes the complex model more strongly (MacKay, 1992).
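The approximation (2.6.20) with the Occam factor (2.6.21) can be compared with a direct numerical integration of (2.6.18) in one dimension. The sketch below rests on assumptions introduced for illustration: a toy likelihood that is a Gaussian bump in w, a uniform prior on a wide interval, and the posterior width taken as sqrt(2π) times the standard deviation of the bump, which is the choice that makes the approximation exact for a Gaussian peak.

```python
import math


def likelihood(w, w_best=0.3, s=0.1, peak=2.0):
    """Toy likelihood p(D|M, w): a Gaussian bump of height `peak` around w_best."""
    return peak * math.exp(-0.5 * ((w - w_best) / s) ** 2)


def evidence_numeric(lo=-5.0, hi=5.0, steps=20000):
    """Midpoint-rule integral of p(D|M, w) p(w|M) with a uniform prior on [lo, hi]."""
    prior = 1.0 / (hi - lo)
    dw = (hi - lo) / steps
    return sum(likelihood(lo + (i + 0.5) * dw) * prior * dw for i in range(steps))


def evidence_occam(lo=-5.0, hi=5.0, w_best=0.3, s=0.1, peak=2.0):
    """Best-fit likelihood times Occam factor, with width sqrt(2 pi) * s."""
    occam_factor = math.sqrt(2.0 * math.pi) * s / (hi - lo)
    return likelihood(w_best, w_best, s, peak) * occam_factor, occam_factor


approx, occam = evidence_occam()
print(evidence_numeric(), approx)
print("bits charged for the model:", -math.log2(occam))
```

Doubling the prior range halves the Occam factor and therefore the evidence, which adds one bit to the description length of the model.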

Another interpretation of the Occam factor is obtained by viewing the model M as

composed of a certain number of equivalent sub-models. When data arrive, only one

sub-model survives and thus the Occam factor appears to be inversely proportional to

the number of sub-models. Thus, − log(Occam factor) is the maximal number of bits

required to describe/indicate this remaining sub-model.

Using the Occam factor (2.6.21) the condition (2.6.17) states that the most plausible

model has to minimize the description length:

$$
L(M, D) = \underbrace{-\log p(D|w^\star, M)}_{\text{inaccuracy for the best parameters}} \; \underbrace{- \log p(M) - \log(\text{Occam factor})}_{\text{model complexity}}
\tag{2.6.22}
$$

The first term in (2.6.22) is the length of the ideal shortest message that encodes the data D using $w^*$; it characterizes the inaccuracy of the model prediction for the best parameters. The second term characterizes the complexity of the model. The more complex the model is, the smaller the discrepancy between the data and their prediction, but this accuracy is achieved at the expense of a longer model description. This relationship between model accuracy and complexity is closely related to the bias-variance dilemma considered in the previous section. We have seen that introducing many parameters leads to better accuracy (decreases bias), but incurs high variance. Thus MDL and the Bayesian approach offer a natural way to resolve the dilemma by seeking a model with good generalization ability.

Another MDL interpretation of (2.6.20) is straightforward:

$$L(D,M) = \underbrace{-\log p(D|w^*,M)}_{\text{error-cost}} \; \underbrace{-\log p(w^*|M)}_{\text{weight-cost}} \; \underbrace{-\log \Delta w_{|M,D}}_{\text{precision-cost}} - \log p(M). \qquad (2.6.23)$$

The first term in (2.6.23) is the length of the ideal shortest message that encodes the data D using the best parameters $w^*$. The second term is the number of bits required to encode the best model parameters. In addition, the negative logarithm of the uncertainty about the parameters after observing the data ($-\log \Delta w_{|M,D}$) penalizes models that have to be described with high precision to fit the data. Usually, the third term is neglected, since the model parameters are communicated only once, while the data arrive one after another. A way to take the third term into account in neural networks, while neglecting the second term, which describes the a-priori knowledge about the model parameters, has been considered in (Hochreiter and Schmidhuber, 1997).

2.7 MDL in the feed-forward NN

A feed-forward neural network is an example of a parameterized model. It is represented graphically as a feed-forward diagram of several layers of activation units, connected by the so-called synaptic weights, which are the model parameters. The network architecture allows the output data to be evaluated as a function of the input data. The input data are presented at the bottom (input) layer of the network. The input is then propagated successively through the hidden layers in the forward direction, using the weights and the activation functions of the network units, to obtain the output data D at the top (output) layer of the network. The network weights, the number of hidden units and the unit activation functions are the main parameters that define the network complexity. In general, it is often assumed that the network architecture is already defined and the main problem is to find the weight parameters.

Figure 2.1: Feed-forward supervised network (input X, hidden weights w, hidden representation A, top weights W, output). A single arrow between two layers indicates that the units of both layers are fully connected.

Implementing the MDL principle in neural networks is easy. For simplicity we consider training a single hidden layer feed-forward neural network (Figure 2.1). Neglecting the third term in the description length (2.6.23) and assuming that the models have the same a-priori probabilities p(M), an optimal weight vector has to minimize (see footnote 2):

$$L(M,D) = \underbrace{-\log p(D|w,W,M)}_{\text{error-cost}} \; \underbrace{-\log p(w,W|M)}_{\text{model-cost}} + \text{const} \qquad (2.7.24)$$

The first term in this expression is the error-cost of specifying the data for the given weights, i.e. the cost of specifying the errors between the true outputs and the outputs predicted by the model with the given weights. The second term is the model-cost.

To evaluate the error-cost, the receiver and the sender have to agree on the specific form of the conditional distribution of the output t ∈ R^n. Under the assumption of independent additive Gaussian noise with zero mean in the output layer, the conditional probability of the output is given by:

$$p(t|x,w,W) = \frac{1}{C^n(\lambda)} \exp\left(-\frac{\lambda}{2}\,\| t(x,w,W) - t \|^2\right), \qquad (2.7.25)$$

where $C(\lambda) = \sqrt{2\pi/\lambda}$ and the parameter λ is inversely proportional to the Gaussian variance ($\lambda = 1/\sigma^2$).

Provided the samples are drawn independently from the distribution (2.7.25), we get:

$$p(D|w,W,M) = \prod_{i=1}^{r} p(t_i|x_i,w,W), \qquad (2.7.26)$$

where r is the number of training samples. The assumptions (2.7.25) and (2.7.26) produce

$$p(D|w,W,M) = \frac{1}{C^{nr}(\lambda)} \exp\left(-\frac{\lambda}{2} E_D\right), \quad \text{where} \quad E_D = \sum_{i=1}^{r} \| t(x_i,w,W) - t_i \|^2. \qquad (2.7.27)$$

2 We have omitted the * superscript for convenience.

When the weight probability distribution is Gaussian and the hidden weights w and top weights W are independent, we get:

$$p(w,W|M) = p(w|M)\,p(W|M),$$

$$p(w|M) = \frac{1}{C^{N_w}(\gamma_w)} \exp\left(-\frac{\gamma_w}{2}\,\| w - m_w \|^2\right),$$

$$p(W|M) = \frac{1}{C^{N_W}(\gamma_W)} \exp\left(-\frac{\gamma_W}{2}\,\| W - m_W \|^2\right), \qquad (2.7.28)$$

where $N_w$, $N_W$ are the numbers of hidden and top weights, the coefficients $\gamma_w$, $\gamma_W$ are inversely proportional to the corresponding Gaussian variances, and $m_w$, $m_W$ are the mean values of the hidden and top weights, respectively. Assumptions (2.7.25) and (2.7.28) lead to the following expression for the description length (2.7.24):

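$$L(M,D) = \underbrace{\frac{\lambda}{2} E_D}_{\text{error}} + \underbrace{\frac{\gamma_w}{2}\,\| w - m_w \|^2 + \frac{\gamma_W}{2}\,\| W - m_W \|^2}_{\text{weight decay}} + N_w \log C(\gamma_w) + N_W \log C(\gamma_W) + nr \log C(\lambda) + \text{const} \qquad (2.7.29)$$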

The first term may be recognized as an error term and the next as a modified weight decay term. The third term is constant for a chosen net architecture. Thus, the weight-decay term controls the network complexity by imposing smoothness constraints. Another form of the weight decay term has been obtained by modelling the weights as a mixture of Gaussians (Nowlan and Hinton, 1992).
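The weight-dependent part of the description length (2.7.29) can be evaluated directly. The following is a minimal sketch under assumptions that are not fixed by the text: tanh hidden units, linear output units, no bias terms and zero prior means ($m_w = m_W = 0$). The function names are hypothetical, and the constant terms of (2.7.29) are omitted.

```python
import math

def forward(x, w, W):
    """Single hidden layer network: A = tanh(w x), output = W A (assumed form)."""
    hidden = [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in w]
    return [sum(Wkj * aj for Wkj, aj in zip(row, hidden)) for row in W]

def squared_norm(matrix):
    return sum(v * v for row in matrix for v in row)

def description_length(samples, w, W, lam, gamma_w, gamma_W):
    """Error term plus weight decay term of (2.7.29), with zero prior means."""
    e_d = 0.0                                    # E_D as in (2.7.27)
    for x, t in samples:
        y = forward(x, w, W)
        e_d += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    error_cost = 0.5 * lam * e_d
    weight_decay = 0.5 * gamma_w * squared_norm(w) + 0.5 * gamma_W * squared_norm(W)
    return error_cost + weight_decay

samples = [([1.0, 0.0], [1.0]), ([0.0, 1.0], [-1.0])]
w = [[1.0, 0.0], [0.0, 1.0]]      # hidden weights, one row per hidden unit
W = [[1.0, -1.0]]                 # top weights, one row per output unit
print(description_length(samples, w, W, lam=2.0, gamma_w=0.1, gamma_W=0.1))
```

Larger values of the coefficients $\gamma_w$, $\gamma_W$ correspond to narrower priors and therefore to a stronger penalty on large weights.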

There is a deep relationship linking the MDL approach and regularization techniques. The intuitive idea is that complex models can fit the training data better, but are not robust to small variations in the data. This relationship between the generalization ability of the model and its complexity is related to the bias-variance dilemma in statistics (Geman et al., 1992): over-parameterized models have high variance, while restricting the model parameters incurs a high bias in the generalization error. The MDL formulation allows bias and variance to be controlled in a natural way.

2.7.1 MDL and EPP bias constraints

Let us assume again that the network architecture, such as the number of hidden units and the nonlinear activation functions, is fixed. Does there nevertheless exist another way to control the complexity of the network? It turns out that this can be done by imposing bias constraints on the supervised neural network. A general framework for imposing EPP bias constraints in neural networks (Figure 2.2) has been considered in Section 2.3.3.

Figure 2.2: A hybrid feed-forward network with exploratory projection pursuit (EPP) constraints (input X, hidden weights w, hidden representation A with bias constraints, top weights W, output). A single arrow between two layers indicates that the units of both layers are fully connected.

We have seen that computationally these constraints are expressed as the minimization of some function H, which measures the quality of the hidden layer representation A and is averaged over an ensemble of inputs. In other words, EPP constraints are constraints on the specific form of the hidden representation that are known a-priori. Thus, the projection index ρ(w) is a complex function that depends on the hidden weights via the hidden representation A: ρ(w) ≡ E[H(A)], where A = f(w,x). This form of constraint is easily incorporated into the MDL framework by assuming a particular form of the a-priori probability of the hidden weights:

$$p(w|M) = C_H(\mu) \exp\left(-\frac{\mu}{2}\, E[H(f(w,x))]\right), \qquad (2.7.30)$$

where $C_H(\mu)$ is a normalization constant. The notation p(w|M) in (2.7.30) does not show the dependence on the input x explicitly, although such a dependence exists; this is consistent, since in the Bayesian formulation (2.6.16) all the probabilities are conditioned on the input data X. Assuming independence of the hidden and top weights, we get:

$$L(M,D) = \underbrace{\frac{1}{2}\lambda E_D}_{\text{error-cost}} + \underbrace{\frac{1}{2}\mu E[H(A)]}_{\text{representation-cost}} \; \underbrace{-\log p(W|M)}_{\text{weight-cost}} + \text{const}. \qquad (2.7.31)$$

The expression for the description length (2.7.31) gives a deeper level of description to

the data communication and is close (though not equivalent) to Zemel’s interpretation of

MDL (Zemel, 1993).

Zemel's interpretation gives a more realistic view of the communication game, in which the actual communication takes place between the hidden layer, with internal representation A, and the top layer. The receiver requires three items in order to be able to recover the desired output:

1. The set of activities A of the representation units; these are codes that nets assign to

each training input sample. Encoding activities of the representation (hidden) units

avoids communication of the hidden weights and does not require the knowledge of

the input data X . However, the sender and the receiver have to agree on the a-priori

distribution of the internal representation. This part of the message corresponds to

the representation-cost.

2. The set of hidden-to-output weights W . This part of the message is represented by

the weight-cost.

3. Reconstruction error, which is a misfit between desired and predicted outputs. This

part of the message is represented by the reconstruction or the error-cost. In order

to evaluate the latter, the sender and receiver have to agree on the probability of

the desired output of the network given its actual output.

Usually, the weight-cost, i.e. the number of bits required to communicate the hidden-to-top weights, is not taken into account, since the weights have to be communicated only once, while the representation-cost and error-cost have to be sent for every sample. Thus, the main communication trade-off takes place between the representation and error costs. Reducing the dimensionality of the data in the hidden layer, i.e. compressing the data, yields a shorter description, but at the same time the errors become larger. The MDL principle is a tool for achieving a good data representation that is both compact and accurate.

We see that similar to Zemel’s interpretation of MDL, imposing EPP constraints leads

to the description length (Eq. 2.7.31) that consists of three parts. It requires the same

agreement on probabilities of hidden representation and errors between the sender and

receiver as described above. However, the representation cost in (Eq. 2.7.31) is taken only


once for all samples, while in Zemel’s interpretation it is permanent and is assigned to each

training input sample. When the number of input patterns is infinite, the representation

cost induced by EPP constraints is negligible. Thus, in a manner similar to supervised

learning, EPP constraints lead to a model in which model cost vanishes as the number of

input patterns becomes infinite.

We postpone the consideration of the hybrid autoencoder network with reconstruction

constraints and its MDL interpretation to the next section, where reconstruction task and

its application are considered.

Chapter 2: Regularization problem 28

2.8 Appendix to Chapter 2: Regularization problem

Regularization may be expressed as a minimization problem with a goal function that is a penalized cost function:

$$E_\lambda = E + \lambda\,\Omega(f,\omega), \qquad E = \sum_i \| y_i - f(x_i,\omega) \|^2.$$

A large value of the regularization parameter λ leads to a network with a large bias (unless the regularization term captures the underlying structure of the data), while a small value reduces bias but increases variance. The regularization task is then to find an optimal parameter $\lambda^*$ and the corresponding model parameters $\omega_{\lambda^*}$ that provide the minimal generalization error:

$$E_{\lambda^*} = E\left[\| y - f(x,\omega_{\lambda^*}) \|^2\right].$$

This task is computationally very expensive.

Split-sample validation and hold-out method The simplest way to find the regularization parameter is to use split-sample validation. This process includes the following steps for each tested value of the regularization parameter λ (the same process is commonly used for choosing other regularization parameters, such as the number of hidden units, the early-stopping time, etc.):

• The data is randomly split into a training set and a validation set. Often 2/3 of the data is used for training and 1/3 for validation.

• The training set is used to estimate the predictor parameters by minimizing $E_\lambda$.

• The validation set is used to measure the prediction error (E). The validation set must not be used in any way during training.

• The predictor with the smallest prediction error corresponds to the optimal regularization parameter λ.

The validation error of the best predictor is in general an overly optimistic estimate of its generalization error. The prediction error on a third, separately kept data set, called the test set, is more realistic and is often reported as the predictor accuracy. This method is called the hold-out method.
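The split-sample procedure described above can be sketched for the simplest possible predictor. The example assumes a one-parameter linear model y ≈ a·x with penalty Ω = a² (a scalar ridge regression), synthetic data and a hypothetical list of candidate values of λ; none of these choices come from the text.

```python
import random

def fit_ridge_slope(points, lam):
    """Minimize sum (y - a*x)^2 + lam * a^2 over the single parameter a."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)

def squared_error(points, slope):
    return sum((y - slope * x) ** 2 for x, y in points)

def split_sample(points, train_fraction=2 / 3, seed=0):
    """Randomly split the data into a training and a validation set."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def select_lambda(points, candidates, seed=0):
    """Train on the training set for each lambda; pick the best on the validation set."""
    train, valid = split_sample(points, seed=seed)
    errors = {lam: squared_error(valid, fit_ridge_slope(train, lam)) for lam in candidates}
    best = min(errors, key=errors.get)
    return best, errors

rng = random.Random(1)
points = [(x / 10, 2.0 * x / 10 + rng.gauss(0.0, 0.3)) for x in range(-15, 15)]
best_lambda, validation_errors = select_lambda(points, [0.0, 0.1, 1.0, 10.0])
print(best_lambda, validation_errors)
```

For the hold-out method, a third subset would be set aside before the split and used only once, to report the error of the selected predictor.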

The disadvantage of the split-sample validation and hold-out method is that they

reduce the amount of data available for both training and validation. Two methods that


overcome this drawback are cross-validation and bootstrapping (Efron and Tibshirani,

1993; Bishop, 1995a).

Cross-validation In k-fold cross-validation, the data is divided into k subsets of (ap-

proximately) equal size. A network is trained k times, each time leaving out one of the

subsets from the training set and using the omitted subset as a validation set to compute

an error. If k equals the sample set size, this is called “leave-one-out” cross-validation.

“Leave-v-out” is a more elaborate and expensive version of cross-validation that involves

leaving out all possible subsets of v cases. A generalization error is then measured as an

average performance over all possible validation tests. Cross-validation is an improvement

on split-sample validation.
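A minimal sketch of k-fold cross-validation is given below. To keep it self-contained, the "network" is replaced by a trivial predictor (the mean of the training targets); this substitution and all function names are assumptions made for illustration only. In practice the data would be shuffled before being divided into folds.

```python
def k_fold_indices(n, k):
    """Divide indices 0..n-1 into k contiguous folds whose sizes differ by at most one."""
    folds = []
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(values, k, fit, error):
    """Average validation error per sample over the k held-out folds."""
    total = 0.0
    for held_out in k_fold_indices(len(values), k):
        held = set(held_out)
        train = [v for i, v in enumerate(values) if i not in held]
        valid = [values[i] for i in held_out]
        total += error(valid, fit(train))
    return total / len(values)

def fit_mean(train):
    return sum(train) / len(train)

def squared_error(valid, mean):
    return sum((v - mean) ** 2 for v in valid)

values = [1.0, 2.0, 3.0]
# k equal to the sample size gives leave-one-out cross-validation
print(cross_validate(values, k=len(values), fit=fit_mean, error=squared_error))
```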

Bootstrapping In many cases, the bootstrap seems to be better than cross-validation (Efron and Tibshirani, 1993). In the simplest form of bootstrapping, instead of repeatedly analyzing subsets of the data as in cross-validation, one repeatedly analyzes bootstrap samples of the data. Given a data set of size n, a bootstrap sample is created by sampling n instances uniformly from the data with replacement; it serves as the training set, and the instances that were never drawn form the test set. The probability that an instance remains in the test set is then $(1 - 1/n)^n \approx e^{-1} \approx 0.368$, and the probability that it is in the training data is 0.632. Given a number b of bootstrap samples, the average performance is evaluated as a weighted sum of the training ($E_i^{\text{training}}$) and testing ($E_i^{\text{testing}}$) errors:

$$E = \frac{1}{b} \sum_{i=1}^{b} \left(0.368\, E_i^{\text{training}} + 0.632\, E_i^{\text{testing}}\right) \qquad (2.8.32)$$

Usually the recommended number of bootstrap samples is between 200 and 2000 (Kohavi, 1995).
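The fractions 0.368 and 0.632 quoted above can be checked by simulation. The sketch below draws bootstrap samples of indices and measures the average fraction of instances that are never drawn; the function name, the sample size and the number of bootstrap samples are illustrative assumptions.

```python
import random

def out_of_bag_fraction(n, b=200, seed=0):
    """Average fraction of instances left out of a bootstrap sample of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(b):
        drawn = {rng.randrange(n) for _ in range(n)}   # sampling with replacement
        total += (n - len(drawn)) / n                  # instances left for testing
    return total / b

n = 1000
print(out_of_bag_fraction(n), (1 - 1 / n) ** n)
```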

Cross-validation and bootstrapping require many runs, which may be computationally prohibitive, especially for the most interesting perception tasks, where the input dimensionality is very high. Both cross-validation and bootstrapping work well for continuous error functions, such as the mean squared error, but they may perform poorly for non-continuous error functions, such as the misclassification rate.

Chapter 3

Imposing bias via reconstruction constraints

3.1 Introduction

Reconstruction is one of the important tasks of complex visual processing. It is the process of reproducing the input via some reasonably well chosen model. It is commonly assumed that there is a compression via a bottleneck model, and thus the input is reproduced from a reduced internal representation. The oldest and most widespread reconstruction method is Principal Component Analysis (PCA). PCA is an optimal linear compression, based on minimization of the mean squared error between the input and its reconstruction. A simple generalization of PCA to the nonlinear case is a nonlinear autoencoder. Below, we present both these models and discuss their relationship to the MDL principle.

We proceed then with a more general notion of reconstruction via a generative model

and reexamine diverse applications of the reconstruction models. Finally, we introduce a

novel method that uses reconstruction as a bias constraint to a supervised classification

task.

3.1.1 Principal Component Analysis (PCA)

PCA is widely used in multivariate analysis (Duda and Hart, 1973). PCA, also known as

the Karhunen-Loeve transformation (Oja, 1982; Fukunaga, 1990), is a process of mapping

the original data into a more efficient representation, using an orthonormal linear trans-

formation that minimizes the mean squared error between the data and its reconstructed

version.

It is well-known that the optimal orthogonal basis of the data space is formed by the

eigenvectors of the covariance matrix of the data. New data representation is obtained


by projecting the data onto this new optimal basis. The eigenvectors corresponding to the largest eigenvalues are the most significant (accounting for most of the variance in the data). Thus, discarding coordinates in these directions leads to the largest error in the mean-squared sense. Therefore, the coordinates corresponding to the small eigenvalues should be deleted first when compression is performed.
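The relation between discarded eigenvalues and reconstruction error can be verified on a small example. The sketch below is restricted to two-dimensional data so that the eigen-decomposition of the covariance matrix has a closed form; the data points and function names are hypothetical. Discarding the coordinate along the second eigenvector gives a mean squared error equal to the smaller eigenvalue, while discarding the coordinate along the first gives an error equal to the larger one.

```python
import math

def covariance_2d(points):
    """Mean and covariance entries (a, b, c) of [[a, b], [b, c]] for 2-D data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (a, b, c)

def eigen_2d(a, b, c):
    """Eigenvalues (largest first) and first unit eigenvector of [[a, b], [b, c]]."""
    mid = (a + c) / 2
    rad = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    l1, l2 = mid + rad, mid - rad
    if abs(b) > 1e-12:
        v = (b, l1 - a)
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return (l1, l2), (v[0] / norm, v[1] / norm)

def reconstruction_mse(points, mean, direction):
    """Mean squared error after keeping only the coordinate along `direction`."""
    total = 0.0
    for x, y in points:
        dx, dy = x - mean[0], y - mean[1]
        coef = dx * direction[0] + dy * direction[1]
        total += (dx - coef * direction[0]) ** 2 + (dy - coef * direction[1]) ** 2
    return total / len(points)

points = [(2.0, 1.9), (-1.0, -1.2), (0.5, 0.4), (-1.5, -1.1), (1.0, 1.2), (3.0, 2.7)]
mean, (a, b, c) = covariance_2d(points)
eigenvalues, e1 = eigen_2d(a, b, c)
e2 = (-e1[1], e1[0])
print(eigenvalues, reconstruction_mse(points, mean, e1), reconstruction_mse(points, mean, e2))
```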

Different PCA algorithms using neural networks have been reported (see Haykin, 1994, for a review). The first PCA network, proposed by Oja (1982), uses a Hebbian learning rule to find the first eigenvector, corresponding to the maximal eigenvalue. Its generalized version, the generalized Hebbian algorithm (GHA) (Sanger, 1989), successively extracts the first several eigenvectors and uses feed-forward connections only. A modification of GHA, the adaptive principal component extraction (APEX) algorithm (Kung and Diamantaras, 1990), uses additional lateral connections to decorrelate the network outputs. GHA and APEX are examples of the reestimation and decorrelating types of PCA algorithms, respectively.
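The Hebbian rule with decay commonly attributed to Oja, w ← w + η·y·(x − y·w) with y = w·x, can be sketched for a single linear unit. The learning rate, the initial weights and the synthetic zero-mean data (strongly correlated along the direction (1, 1)) are assumptions chosen for illustration.

```python
import math
import random

def oja_first_component(samples, eta=0.01, w0=(1.0, 0.0)):
    """Online Hebbian learning with decay for one linear unit."""
    w = list(w0)
    for x in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))               # unit output
        w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

rng = random.Random(0)
samples = []
for _ in range(5000):
    s = rng.gauss(0.0, 1.0)                                    # shared component
    samples.append((s + 0.1 * rng.gauss(0.0, 1.0), s + 0.1 * rng.gauss(0.0, 1.0)))

w = oja_first_component(samples)
norm = math.hypot(w[0], w[1])
cosine = abs(w[0] + w[1]) / (math.sqrt(2.0) * norm)            # alignment with (1, 1)
print(w, norm, cosine)
```

For this data the first eigenvector of the covariance matrix is close to (1, 1)/√2, so the learned weight vector should have a norm close to one and be nearly aligned with that direction.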

PCA using Hebbian networks has been considered as a first principle of perceptual

processing (Miller, 1995; Atick and Redlich, 1992; Hancock et al., 1992; Field, 1994). The

main goal of these studies is to explore the similarities between the PCA eigenvectors

and the receptive fields of cells in the visual pathway. It may be shown (Fukunaga, 1990;

Gonzalez and Wintz, 1993; Field, 1994), that for stationary and ergodic processes, PCA is

approximately equivalent to the Fourier transform. Natural images are not stationary, however, and their covariance matrix does not describe the data distribution completely. It has been shown (Hancock et al., 1992) that the first 3-4 eigenvectors extracted from Gaussian smoothed natural images resemble "Gabor functions", which provide good models of cortical receptive fields. However, the subsequent eigenvectors no longer look like cortical receptive fields. PCA extracts a fully distributed representation, because only the few neurons that carry most of the variance are kept, and thus all components of the observation vector participate in its projection into the eigenspace.

Below, we present the autoencoder network, which is closely related to PCA, and discuss its interpretation in the MDL framework.

3.1.2 Autoencoder network and MDL

An autoencoder network (Figure 3.1) is a feed-forward multi-layer perceptron (MLP) network whose output layer coincides with the input layer. Usually, it contains a single hidden layer, though variants with additional hidden layers have also been considered (Kramer, 1991). The number of hidden units is assumed to be much smaller than the dimensionality of the input. Therefore, the network reduces the dimensionality of the input, extracting the so-called internal representation in the hidden layer.

Figure 3.1: Autoencoder network architecture (w: hidden weights; W: hidden-to-top weights). Reconstruction of the inputs is done from the hidden layer representation.

The autoencoder network has a natural interpretation in the MDL framework (Hinton

and Zemel, 1994). It discovers an efficient way to communicate data to a receiver. A

sender uses a set of input-to-hidden weights and, in general, non-linear activation functions

to convert the input into a compact hidden representation. This representation has to

be communicated to the receiver along with the reconstruction errors and hidden-to-top

weights. Knowing the hidden-to-top weights the receiver reconstructs the input from this

abstract representation and communicated errors.

From Eq. 2.7.24 the description length is composed of the error-cost and the model-

cost. Assuming that the errors are encoded using a zero-mean Gaussian with the same

predetermined variance for each output unit, the error-cost is given by the sum of the

squared errors. Since in the autoencoder the hidden units are always active, the model cost

may be approximated by the size of the hidden layer. Often, the model cost is ignored,

and the MDL principle leads to a simple minimization of the sum of squared errors

via a network with a bottleneck structure. Thus, the autoencoder learns the compact

representation of the input. In addition, the bottleneck structure forces the network to

learn prominent features of the input distribution which are useful for generalization. The

network is robust to noise and may be used for pattern completion, when part of the input

is corrupted or absent.

A linear autoencoder with one hidden layer is closely related to PCA, since its hidden weights span the same subspace as the principal eigenvectors (Bourlard and Kamp, 1988). However, contrary to PCA, the hidden weights are not forced to be orthogonal and do not coincide with the hidden-to-top weights. The analytical solution of the optimization problem imposed by the linear autoencoder is given by:

$$W = U T^{-1}, \qquad w = T U^t \qquad (3.1.1)$$

where $T \in R^{p \times p}$ is an arbitrary nonsingular scaling matrix; $U \in R^{n \times p}$ (p ≤ n) is a matrix of the principal eigenvectors stacked by columns; W and w are the hidden-to-top and hidden weights, respectively; n and p are the numbers of units in the input and hidden layers, respectively. However, since learning in the autoencoder relies on a gradient descent technique, it can get trapped in local minima.
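The role of the arbitrary matrix T in (3.1.1) can be checked numerically: the overall input-to-output map W w equals the projector U U^t for any nonsingular T, while the rows of w need not be orthogonal. The sketch below uses a hypothetical U with orthonormal columns (n = 3, p = 2) and an arbitrary T; both are assumptions for illustration, not quantities from the text.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse_2x2(t):
    det = t[0][0] * t[1][1] - t[0][1] * t[1][0]
    return [[t[1][1] / det, -t[0][1] / det], [-t[1][0] / det, t[0][0] / det]]

r = 1.0 / math.sqrt(2.0)
U = [[r, 0.0], [r, 0.0], [0.0, 1.0]]      # orthonormal columns (hypothetical)
T = [[2.0, 1.0], [0.5, 3.0]]              # arbitrary nonsingular matrix

W = matmul(U, inverse_2x2(T))             # hidden-to-top weights, W = U T^{-1}
w = matmul(T, transpose(U))               # hidden weights, w = T U^t
reconstruction_map = matmul(W, w)         # input -> output map of the autoencoder
projector = matmul(U, transpose(U))       # projection onto the principal subspace

hidden_dot = sum(x * y for x, y in zip(w[0], w[1]))
print(reconstruction_map, projector, hidden_dot)
```

The nonzero inner product between the two rows of w illustrates that the hidden weights span the principal subspace without being orthogonal.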

In the nonlinear case, Bourlard and Kamp claim that nonlinear and linear autoencoders are equivalent, since when the norm of the scaling matrix T is infinitely small, sigmoidal activation functions can be approximated arbitrarily closely by linear activation functions. However, their proof is valid only from the viewpoint of reconstruction error minimization, and not from the viewpoint of the extracted internal representation. Their analysis does not take the convergence issue into account. Indeed, to make the nonlinear and linear autoencoder solutions arbitrarily close, the norm of the matrix T has to be arbitrarily small (for example, by introducing some scaling parameter ε → 0). While ε is positive, the linear autoencoder hidden weights span the same space as the principal eigenvectors, but at the same time there is a difference between the hidden weights extracted by the linear and nonlinear autoencoders. This difference disappears only for ε = 0, when the matrix T becomes singular. Thus, it is not obvious that the hidden weights obtained in the limit of this convergence span the space spanned by the principal eigenvectors.

It has been recently shown, that when the data is whitened (i.e. the data covariance

matrix is unit and spherical) and non-linear activation functions are adjusted properly, the

autoencoder is able to extract the independent components (Oja, 1995a) (i.e. responses

of different hidden neurons are independent, see also Chapter 4), while the PCA solution

is not well defined. Thus, the non-linear autoencoder can be made sensitive to higher

order statistics, while PCA is sensitive to the second order statistics of the data.

The presence of the proper nonlinearities in the autoencoder allows a sparse representation to be extracted, while PCA forms a distributed representation. In a distributed representation, all the hidden units participate in the pattern encoding, while in a sparse representation only a few units are active, and these are responsible for the presence of some specific features in the pattern. PCA forms a distributed representation, since only the few neurons that carry most of the variance are kept for data reconstruction and they are active for all patterns.

Other variants of autoencoders that encourage sparse hidden representations have been proposed by Zemel (1993). The code-cost of a sparse representation is small, even when the number of hidden units is large. Thus, though these autoencoders are trained to minimize the sum of the representation (code) and error costs, they do not necessarily have a bottleneck structure, and they develop interesting, biologically plausible representations.

3.1.3 Reconstruction and generative models

There is evidence from several psychological experiments (for example, completion of partially occluded contours (Lesher, 1995)) that humans perceive a reconstructed version of the input instead of the raw ambiguous input. The reconstruction may be a more complex process than simple duplication of the incoming information, including deblurring, denoising, completion of occluded areas, etc. It is often assumed that the observed signals are synthesized by some generative model from an abstract internal representation. Thus, the reconstruction is considered to be composed of two phases (Hinton and Ghahramani, 1997). The first is a recognition phase, which infers the underlying internal representation of the incoming input; the second is a generative phase, which converts the internal representation into an input form (the reconstructed object).

From a statistical viewpoint, learning to reconstruct is the problem of maximizing the likelihood of the observed data under a generative model. This estimation is often an ill-posed problem, which can be solved using the expectation maximization (EM) algorithm (Dempster et al., 1977; Neal and Hinton, 1993). This iterative algorithm increases (or leaves unchanged) the likelihood at every iteration, which consists of two steps, expectation and maximization. In EM, the recognition phase corresponds to the expectation step (E-step) and the generative phase to the maximization step (M-step). In the E-step, a distribution of the internal representation is estimated from the observed data and the current model parameters. In the M-step, this distribution and the observed data are used to update the generative model parameters by maximizing the average likelihood.

Different generative models and assumptions about the distribution of the internal representation lead to different network models and sensory representations. The inference phase is difficult. In logistic belief networks (LBN) and Boltzmann machines (Hinton and Ghahramani, 1997) the hidden state is picked using Gibbs sampling, i.e. each unit is visited one at a time and its new state is stochastically picked from its posterior distribution given the current states of all the other units (see Jordan, 1999, for a comprehensive survey). In the wake-sleep algorithm (Hinton et al., 1995), a model uses separate bottom-up recognition connections to pick binary states for the units in one layer, given the already selected binary states of the units in the layer below.

Both PCA and the autoencoder network may be interpreted as generative models.

PCA as a generative model emerges as a constrained case of factor analysis (Roweis and

Ghahramani, 1997; Hinton and Ghahramani, 1997). In factor analysis the observation


is a linear transformation of the hidden variables, corrupted with an additive sensory

noise that is Gaussian. The linear transformation is realized via a matrix of the gener-

ative weight vectors. Each generative weight vector connects hidden variables with the

corresponding observation variable. Hidden variables are referred to as factors and are

assumed to be Gaussian. PCA is obtained when the covariance matrix of the sensory

noise is assumed to be a scaled identity matrix εI, with the infinitesimal scaling factor

ε → 0. In this limiting case, the posterior distribution of the hidden variables shrinks to a single point, i.e. given the observation, the hidden representation becomes non-random. In PCA the generative weight vectors are forced to be orthogonal, which leads to a simple recognition of the deterministic hidden representation as a linear transformation, with the matrix of the recognition weight vectors equal to the transpose of the generative weight matrix.

Interpretation of PCA as a generative model disregards the order of the hidden vari-

ables, but allows the use of EM for the extraction of eigenvectors (Roweis, 1997). This

method is especially efficient for high dimensional data, where a covariance matrix is not

full rank and has a large size that makes the simple diagonalization of the covariance

matrix computationally difficult.
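The idea of extracting eigenvectors by EM without forming the covariance matrix can be sketched for a single hidden variable. The alternation below (hidden values from the current generative weight vector, then a least-squares update of that vector) is a simplified reading of the zero-noise limit described above, not a reproduction of the cited algorithm; the data, the initial vector and the function names are assumptions.

```python
import math
import random

def em_first_direction(points, iterations=50):
    """EM-style iteration for a one-dimensional principal subspace of centered data."""
    c = [1.0, 0.0]                                   # initial generative weight vector
    for _ in range(iterations):
        cc = sum(ci * ci for ci in c)
        # E-step: hidden value of each observation, x_i = (c . y_i) / (c . c)
        hidden = [sum(ci * yi for ci, yi in zip(c, y)) / cc for y in points]
        # M-step: least-squares update, c = sum_i x_i y_i / sum_i x_i^2
        hh = sum(x * x for x in hidden)
        c = [sum(x * y[d] for x, y in zip(hidden, points)) / hh for d in range(len(c))]
    norm = math.sqrt(sum(ci * ci for ci in c))
    return [ci / norm for ci in c]

rng = random.Random(0)
raw = []
for _ in range(500):
    s = rng.gauss(0.0, 1.0)
    raw.append((s + 0.1 * rng.gauss(0.0, 1.0), s + 0.1 * rng.gauss(0.0, 1.0)))
mean = [sum(p[d] for p in raw) / len(raw) for d in range(2)]
points = [(p[0] - mean[0], p[1] - mean[1]) for p in raw]   # centered observations

direction = em_first_direction(points)
cosine = abs(direction[0] + direction[1]) / math.sqrt(2.0)
print(direction, cosine)
```

Each iteration touches only the data points, so the cost per iteration grows with the number of samples and the input dimension rather than with the square of the input dimension.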

In the autoencoder net, the transformation from the input to the hidden layer is associated with the recognition phase, and the transformation from the hidden layer to the output with the generative phase. Therefore, the hidden weights emerge as recognition weights and the hidden-to-top weights as generative weights.

3.1.4 Classification via reconstruction

As we have shown above, an implicit goal of reconstruction is to find a meaningful internal representation of the data, which can obviously be used for data compression and communication. Interpreted as a set of good features, it may also be applied for further processing and learning. This usage is not self-evident, since some information is lost during feature extraction. Below, we consider some examples of using internal representations extracted via reconstruction for recognition.

PCA for classification PCA was first used as a means of preprocessing for subsequent

face recognition in (Kirby and Sirovich, 1990; Turk and Pentland, 1991). Later PCA was

used for a man-made object recognition and pose estimation (Murase and Nayar, 1993).

PCA proceeds by scanning images and representing them as points in a high dimensional space whose dimension is equal to the number of image pixels. The eigenvectors of the data covariance matrix, represented as images, are called the eigenpictures. The first eigenvectors, i.e. those with the largest eigenvalues, form the basis of a low-dimensional subspace called the eigenspace. All the sample images and new images of the objects are projected into the eigenspace, and the recognition problem is solved in the reduced dimensional space by different statistical methods (nearest neighbor rule, vector quantization, etc.).

Though the application of PCA to recognition has been relatively successful, the question of the optimality of PCA for the recognition task has also been addressed (Turk and Pentland, 1993; O’Toole et al., 1993). Experimental studies of Turk et al. (1993) show that the first few eigenfaces primarily capture the effects of changing illumination, and that neglecting the first few eigenfaces can lead to a substantial increase in recognition accuracy. This observation has been supported by a different study (O’Toole et al., 1991; O’Toole et al., 1993). It has been shown that a representation of the faces in the subspace associated with the small eigenvalues is better for face classification and familiarity than a representation in the subspace associated with the large eigenvalues, when these subspaces have the same small dimensionality.

The explanation of a PCA utility is based on the fact that the eigenvectors correspond-

ing to the large eigenvalues are the directions with the large data variability (Figure 3.2a).

Thus, it seems reasonable that these directions are good for recognition. However, this

assumption fails as can be easily seen from Figure 3.2b. This figure demonstrates the main

Figure 3.2: Two examples, (a) and (b), of eigenspaces extracted by PCA. The first principal eigenvector e1 is marked with a bold line and the second, e2 (e1 ⊥ e2), with a dashed line. Example (a) demonstrates why PCA can be used for dimensionality reduction before classification. The projection on the e1 direction captures all information needed for classification. In contrast, example (b) indicates a PCA drawback. Classification after projecting data to the e1 direction is impossible.

drawback of the PCA technique, namely its high sensitivity to scaling, which changes

the ordering of the eigenvectors. Scaling therefore changes the low-dimensional eigenspace

extracted by PCA; although this eigenspace is optimal for reconstruction, it may be inappropriate for

recognition. When the data are whitened, PCA is not even well defined, since all orthogonal systems

are equivalent from a PCA viewpoint.
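The sensitivity to scaling can be checked numerically. The sketch below is a toy construction under stated assumptions (NumPy, synthetic two-dimensional data, a hypothetical helper `leading_direction`): one feature separates two classes and the other carries only within-class noise, and rescaling the noise feature by a factor of ten flips the first principal direction.

```python
import numpy as np

def leading_direction(X):
    """First principal eigenvector of the sample covariance of X."""
    cov = np.cov(X - X.mean(axis=0), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, -1]

rng = np.random.default_rng(1)
n = 500
labels = np.repeat([0, 1], n)
# Feature 0 separates the classes; feature 1 is pure within-class noise.
x0 = np.where(labels == 0, -1.0, 1.0) + 0.1 * rng.normal(size=2 * n)
x1 = 0.5 * rng.normal(size=2 * n)
X = np.column_stack([x0, x1])

e1 = leading_direction(X)                       # aligned with feature 0
e1_scaled = leading_direction(X * [1.0, 10.0])  # aligned with feature 1
```

Before rescaling, projecting on the first eigenvector keeps the class information; after rescaling, the same projection keeps only the noise, although the classes themselves have not changed.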

Autoencoder network The autoencoder networks have been successfully used not

only for compression (Mougeot et al., 1991; Cottrell et al., 1987), but for classification as

well (Elman and Zipser, 1988; Japkowicz et al., 1995; Schwenk and Milgram, 1995). In

these works, a classification process is considered to consist of two phases. In the first

phase several autoencoders are trained. Each autoencoder is trained separately on the

samples of the corresponding class. The second phase is heuristic and is based on the idea

that the reconstruction error is, in general, much lower for examples of the learned class

than for the other ones.

In (Japkowicz et al., 1995) classification is restricted to a two-class discrimination

task, which is replaced by the dual task of familiarity with a concept. In the first phase, a

single autoencoder is trained solely on examples of the concept. In the second phase, either the

conceptual examples or examples of both classes are used to estimate a decision threshold

for the reconstruction error (the sum of squared errors) between input and output. If

the reconstruction error is smaller than the decision threshold, the instance is classified

as conceptual; if it is larger, the instance is classified as counter-conceptual. Similarly, in (Elman and

Zipser, 1988) the autoencoder is trained on segmented sounds, which makes it possible to segment

continuous speech on the basis of the mean squared error.

In (Schwenk and Milgram, 1995) the basic idea is to use one autoencoder for each

class and to train it only with examples of the corresponding class. In contrast with the

usual autoencoder, a tangent distance is used instead of the squared reconstruction error.

The tangent distance makes it possible to incorporate high-level knowledge about typical input

transformations. Classification is done using the reconstruction errors of the autoencoders

as discriminant functions.
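The two-phase scheme shared by these works can be sketched as follows. As a simplifying assumption, each class model here is a linear PCA subspace standing in for a trained nonlinear autoencoder, and the plain squared reconstruction error is used rather than a tangent distance. The class `SubspaceAutoencoder`, the function `classify` and the synthetic data are hypothetical.

```python
import numpy as np

class SubspaceAutoencoder:
    """PCA subspace used as a stand-in for a bottleneck autoencoder."""

    def __init__(self, n_hidden):
        self.n_hidden = n_hidden

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.basis = vt[: self.n_hidden]
        return self

    def reconstruction_error(self, X):
        Xc = np.asarray(X, dtype=float) - self.mean
        recon = Xc @ self.basis.T @ self.basis
        return ((Xc - recon) ** 2).sum(axis=1)

def classify(models, X):
    """Phase 2: the class whose model reconstructs the sample best."""
    errors = np.column_stack([m.reconstruction_error(X) for m in models])
    return errors.argmin(axis=1)

# Synthetic classes: each lies near its own 2-D affine subspace of R^10.
rng = np.random.default_rng(2)
dim, k = 10, 2
offsets = [np.full(dim, 3.0), np.full(dim, -3.0)]
bases = [rng.normal(size=(k, dim)) for _ in range(2)]

def sample(c, n):
    codes = rng.normal(size=(n, k))
    return offsets[c] + codes @ bases[c] + 0.01 * rng.normal(size=(n, dim))

# Phase 1: one model per class, trained only on that class.
models = [SubspaceAutoencoder(k).fit(sample(c, 50)) for c in range(2)]
test_x = np.vstack([sample(0, 20), sample(1, 20)])
test_y = np.repeat([0, 1], 20)
predicted = classify(models, test_x)
```

For the two-class familiarity setting of Japkowicz et al., the `argmin` over models would be replaced by a comparison of a single model's reconstruction error with a decision threshold.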

“Wake-sleep” network Another example of classification based on the reconstruction

has been proposed via the “wake-sleep” network (Hinton et al., 1995). Similarly to au-

toencoders, each “wake-sleep” network is trained separately on different examples of the

same digit. Classification is done by observing which of the networks provides the most

economical description of the data.


3.1.5 Other applications of reconstruction

Reconstruction via a modified autoencoder has been used for input reconstruction relia-

bility estimation (IRRE) for autonomous car navigation (Pomerleau, 1993). In IRRE a

connectionist network is trained simultaneously to produce the correct steering response

for car navigation and to reconstruct the input image in the mean squared error sense.

After learning, a reliability measure, defined as the correlation between the input image and its

reconstruction, is evaluated. This reliability measure may be used to control the speed

and position of the vehicle in confusing situations that are known a priori. Another application

of IRRE is the integration of the outputs of multiple networks trained for different driving

situations, i.e. the network that has the best reliability is used for the navigation

task.
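A minimal sketch of such a reliability measure is given below. It assumes that the correlation is the Pearson correlation coefficient between pixel values and that the reconstructions of several networks are already available; the function names `reconstruction_reliability` and `select_network` and the toy images are hypothetical.

```python
import numpy as np

def reconstruction_reliability(image, reconstruction):
    """Pearson correlation between an image and its reconstruction."""
    a = np.asarray(image, dtype=float).ravel()
    b = np.asarray(reconstruction, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def select_network(image, reconstructions):
    """Index of the network whose reconstruction is most reliable."""
    scores = [reconstruction_reliability(image, r) for r in reconstructions]
    return int(np.argmax(scores)), scores

# Toy example: the first "network" reconstructs the image almost perfectly,
# the second one returns an inverted image.
image = np.arange(12, dtype=float).reshape(3, 4)
good = image + 0.01
bad = -image
best, scores = select_network(image, [good, bad])
```

A score close to 1 indicates that the network has seen similar inputs during training; a low score suggests that its steering output should not be trusted.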

Another related recurrent network has been used for autonomous vehicle navigation

(Baluja and Pomerleau, 1995). Baluja et al. use prediction of the next input image

as a task related to the navigation task, i.e. the MLP network is trained to predict the

input image and to produce the correct steering response simultaneously. Computationally,

the hidden weights are updated based on the navigation task only, and the prediction is

learned from the resulting hidden activities. Recurrence arises by propagating

the predicted image back to the input layer, where it refines the next input image by eliminating

noise and unpredicted objects.

A connectionist network similar to IRRE has been proposed as a hippocampal model

(Gluck and Myers, 1993). This model assumes that the hippocampal region develops

an internal stimulus representation that enhances the discrimination of predictive cues while

compressing the representation of redundant cues.

3.2 Imposing reconstruction constraints

3.2.1 Reconstruction as a bias imposing mechanism

We have shown above that the reconstruction task is related to the classification task

and that there are two main approaches to classification via reconstruction. The first approach

uses a common hidden representation obtained for all data as a preprocessing

step for the subsequent learning (Kirby and Sirovich, 1990; Murase and Nayar,

1993; Moghaddam and Pentland, 1994). In the second approach (Japkowicz et al., 1995;

Schwenk and Milgram, 1995; Hinton et al., 1995), reconstruction (generative) networks

are used to extract the underlying structure of the data drawn from the same class.

The assumption is that an example drawn from another class does not share the already


learned structure and produces a high description length. Thus, the description length

may be used as a discriminant function.

Though these approaches have been relatively successful, there are cases when they are

not appropriate. For example, when the samples belonging to the same class have a multi-modal

distribution, or the classes are very similar, it is not obvious how to apply the second approach.

As for the first approach, we have shown in Section 3.1.4 that PCA is very sensitive to data scaling.

This consideration favors the view that each perceptual task needs data preprocessing

that cannot be obtained from another related task alone. In contrast to the approaches

considered above, we propose to use reconstruction, realized via a modified autoencoder, as

a bias-imposing mechanism in feed-forward networks for improving the classification

task. An intuitive way to conceive the idea of imposing reconstruction as a proper bias

constraint for classification is via the multi-task learning (MTL) approach (2.3.3). First,

as has been shown above, recognition and reconstruction are related but different tasks

of visual processing, and in some cases one has been substituted for the other. Second,

it has been shown experimentally (Elman and Zipser, 1988; Cottrell et al., 1987) that

reconstruction via an autoencoder extracts a valuable internal representation. Thus, it is

reasonable to expect that a hidden representation relying on both the recognition and reconstruction tasks

can improve the generalization performance of classification. This assumes that such a

hidden representation captures some prominent (recognition) features of the data,

while keeping the most important information needed for reconstruction.

As an illustration, suppose that we want to discriminate between two individuals and

that one of them has some prominent features in the training images (glasses, hair

style, moustache, beard and so on). It then seems plausible that recognition will tend

to process the corresponding areas of the face, and that all the other information

will be redundant for the recognition goal. However, these features may be absent or

appear rarely in new images of this person, so failure in the testing phase is likely. In

contrast, the addition of the reconstruction task during training of the system, forces the

system to extract other features which may not be so useful for recognition of the original

training images, but may be of use with the novel test set. This motivates our suggestion

to add reconstruction constraints during learning of the classification task.

A similar approach has been proposed in (Gluck and Myers, 1993) to model

hippocampal function. It is assumed that one of the roles of the hippocampus is to extract a

common recognition/reconstruction internal representation of the input stimulus. Though

our work is conceptually close to this model, there are notable differences, which are

elucidated later on. Below, we present a hybrid classification/reconstruction network.


3.2.2 Hybrid classification/reconstruction network

Figure 3.3 presents the architecture of the combined classification/reconstruction network.

This network attempts to improve the low dimensional representation by minimizing

concurrently the mean squared error (MSE) of reconstruction and classification outputs.

In other words, it attempts to improve the quality of the hidden layer representation by

imposing a feature selection useful for both tasks, classification and reconstruction. The

hidden layer should have a smaller number of units compared with the input, so as to

achieve a bottleneck compression and to allow for generalization. The combined learning

Figure 3.3: Combined recognition/reconstruction network. A single hidden layer drives the classification layer and the reconstruction layer.

rule for the hidden layer units is a composition of the errors backpropagated from both

reconstruction and recognition layers. The relative influence of each of the output layers

is determined by a constant λ which represents a tradeoff between reconstruction and

classification confidence.

Below, we present a rigorous mathematical explanation of the hybrid network in the

MDL framework.

3.2.3 Hybrid network and MDL

It is easy to see that the proposed network is a modified autoencoder network. The mod-

ified autoencoder shares a common hidden representation with the supervised (classifica-

tion) network. It finds the compact hidden representation that is good for reconstruction

in addition to a task at hand (Figure 3.4). In contrast to the autoencoder (Section 3.1.2)


Figure 3.4: Hybrid network with reconstruction and EPP constraints (diagram labels: input X; hidden weights w; hidden representation A; hidden-to-top weights W1 and W2; reconstruction and output sub-layers; bias constraints). The hidden layer drives the reconstruction and classification outputs. In addition, a search for other statistical structure in the data is made.

and supervised feed-forward network (Section 2.7), the hybrid network is associated with

a different communication game, in which the sender uses a compact internal representa-

tion to communicate both the observed data and the corresponding desired output (for

example, class labels of the images). Since this internal representation has to encode

efficiently both the observed data and corresponding output, a cost for communicating

the input data X has to be involved in the description length (2.7.24), yielding:

$$ L(M, D, X) = -\log p(D, X \mid w, W^1, W^2, M) - \log p(w, W^1, W^2 \mid M) + \text{const}. $$

Assuming that, given the input and the net weights, the conditional probabilities of the reconstruction

and supervised outputs are independent and Gaussian, we obtain, similarly to (2.7.29),

the following expression for the description length:

$$ L(M, D, X) = \frac{1}{2}\left(\lambda_1 E_D + \lambda_2 E_X\right) - \log p(w, W^1, W^2 \mid M) + r_1 d \log C(\lambda_1) + r_2 n \log C(\lambda_2) + \text{const}. \tag{3.2.2} $$

In the expression for the description length (3.2.2), λ1 and λ2 are inversely proportional to

the variances of the specific task and reconstruction outputs respectively; ED and EX are

sums of the squared errors of the supervised task and reconstruction outputs, respectively;

r1, r2 are numbers of training samples for reconstruction and specific tasks.

Assuming the same a-priori probability for the hidden weights as in (2.7.30), the entire


description length may be simplified to:

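$$ L(M, D, X) = \underbrace{\frac{1}{2}\left(\lambda_1 E_D + \lambda_2 E_X + \mu E[H(w, x)]\right)}_{(1)} + \underbrace{r_1 d \log C(\lambda_1) + r_2 n \log C(\lambda_2) + \log C_H(\mu) + \text{const}}_{(2)}. \tag{3.2.3} $$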

In general, the numbers of training samples for reconstruction (r1) and specific (r2)

tasks may be different, which seems to be a common situation in real-world learning.

In the limit, when supervised learning does not provide enough information, the

internal representation is constructed based on unsupervised learning only.

Since, in our consideration, parameters λ1, λ2, µ are assumed to be fixed, the second

part of the description length (3.2.3) is a constant and the description length may be

rewritten as:

$$ L(M, D, X) = \frac{1}{2}\left(\lambda_1 E_D + \lambda_2 E_X + \mu E[H(w, x)]\right) + \text{Const} \tag{3.2.4} $$

Therefore, when one is interested in both tasks, the two scaled sum-of-squares errors ED and EX

represent the error cost and the third term µE[H(w,x)] is the model-cost or representation-cost.

This interpretation of the hybrid network in the MDL framework is not unique.

Indeed, the interpretation depends on the way one looks at the hybrid network.

When one is mainly interested in reconstruction via the bottleneck hybrid structure,

the task may be formulated as a compression problem. This compression has benefits

compared to a conventional autoencoder, since it provides not only a good reconstruction

of the data, but also successful handling of the specific task, such as classification, for

example. In this formulation, the reconstruction error EX is recognized as the error cost

and the scaled classification error and the third term as the model-cost.

The third and the last interpretation is produced when one is mainly interested in

the specific task (for example, classification or control tasks). In this case, the specific

error ED is recognized as the error cost and the scaled reconstruction error and third term

as the model-cost. This interpretation gives a rigorous mathematical way of imposing

reconstruction and other unsupervised types of constraints in the supervised network.

Below, based on this last interpretation, we explain why the hybrid network may be

better than the conventional classification feed-forward network. Let us consider two

different principal melodies. Suppose that the first is embellished with specific ornamental tones,

whereas the second is not arranged at all. Hearing these two melodies, arranged and not,

many times, one may decide that these specific tones are enough for recognizing which

of the melodies is played. However, the next time the first melody may be played by

an unskilled pianist, who skips all the beautiful ornaments. Obviously, in this case, the

first melody will not be recognized if recognition is based on the presence of the ornaments only.

This example demonstrates that a bottleneck network for classification, attempting to

minimize the description length, has a tendency to throw away salient information from

the data. The internal representation extracted on the basis of the classification task alone

may be too poor. Reconstruction helps to process the information as a whole: it does not

concentrate on particular details, and it balances the relationship between the whole and its

parts, resulting in a better prediction on the supervised specific task.

Bayesian interpretation for the hybrid NN

We have shown that the MDL approach naturally explains and interprets the proposed hy-

brid classification/reconstruction network. It also states that the most probable network

weights have to minimize the following part of the description length (3.2.4):

$$ R(w, W^1, W^2) = \lambda_1 E_D(w, W^1) + \lambda_2 E_X(w, W^2) + \lambda_3 H(w). \tag{3.2.5} $$

We recall now that the MDL principle is tightly related to the Bayesian approach, where

parameters λ1, λ2, µ are recognized as hyper-parameters. When the hyper-parameters

are unknown, the correct Bayesian treatment (Bishop, 1995a) is to integrate the

hyper-parameters out of any predictions P:

$$ p(P \mid D, X) = \int p(P \mid D, X, \lambda)\, p(\lambda \mid D, X)\, d\lambda, \tag{3.2.6} $$

where λ = (λ1, λ2, µ) is a vector of hyper-parameters and p(λ|D,X ) is the evidence

for the hyper-parameters. This integration is similar to generating an ensemble of

networks that depend on the hyper-parameters, where instead of evaluating the

hyper-parameter evidence (which cannot be done analytically), we integrate predictions in

the vicinity of the most likely hyper-parameters, assuming equal evidence.

Thus, contrary to (Gluck and Myers, 1993; Pomerleau, 1993), we do not consider

fixed, manually adjusted parameters, but a class of reconstruction-classification

networks depending on the “regularization” parameter, with the subsequent combination

of the networks into ensembles.
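Under the equal-evidence assumption, the integral (3.2.6) reduces to a plain average of the predictions of networks trained with different hyper-parameter values. The sketch below illustrates only this averaging step; the function name `ensemble_predict` and the numerical outputs of the three networks are made up for illustration.

```python
import numpy as np

def ensemble_predict(class_outputs_per_network):
    """Average the class outputs of several networks (equal weights).

    Each entry has shape (n_samples, n_classes) and comes from a network
    trained with a different value of the regularization parameter.
    """
    stacked = np.stack(class_outputs_per_network)
    mean_outputs = stacked.mean(axis=0)
    return mean_outputs, mean_outputs.argmax(axis=1)

# Hypothetical recognition outputs of three networks
# for two samples and three classes.
outputs = [
    np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]),
    np.array([[0.5, 0.4, 0.1], [0.4, 0.3, 0.3]]),
    np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]),
]
mean_outputs, ensemble_labels = ensemble_predict(outputs)
```

In this toy case the second network alone would assign the second sample to class 0, while the ensemble assigns it to class 1.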

3.2.4 Hybrid network as a generative probabilistic model

Both “recognition” and “generative” phases can be identified in the proposed hybrid

model. The “recognition” phase infers an internal/hidden representation of the input data.


The “generative” phase reconstructs the input from the inferred compact representation

in the reconstruction output sublayer and, in addition, predicts the specific task output

in the corresponding sublayer.

From a Bayesian viewpoint, learning in the hybrid network is equivalent to maximiza-

tion of the joint probability of the input and specific task output, given the observation

and specific constraints on the internal representation. According to Bayesian theory, the

best classification is based on the conditional probability of the image classes given the

input (i.e. the conditional probabilities of the image classes are the best discriminant

functions, that lead to the minimal classification error). The output of the recognition

layer of the hybrid network estimates this conditional probability and the reconstruction

sublayer regenerates the input data, implicitly estimating the probability of the input

data.

The proposed architecture differs from a probability network that has a generative

model (reconstruction) and a recognition model in a manner similar to the binary wake/sleep

architecture (Hinton et al., 1995), or the Rectified Gaussian Belief Network (Hinton and

Ghahramani, 1997). First, it is not a full forward/backward model, namely there are no

two hidden unit representations, one for the top-down and one for the bottom-up, but

instead a single hidden representation is used for both (Figure 3.3). Second, its learning

goal is to minimize the classification error (via the mean squared error) as well as to

minimize the reconstruction error, as opposed to the goal of constructing a probabilistic

model of internal representations. The two goals may coincide under a continuous hidden

unit network, but are certainly different for a binary network.

3.2.5 Hybrid Neural Network architecture

The detailed architecture of the network is presented in Figure 3.5. This hybrid network

is a modification of the well-known feed-forward network. Images are presented to the

input layer and propagated via a hidden layer to the output layer. The output layer

consists of two sub-layers: one sub-layer reconstructs the image, and the second serves

for classification. The number of units in the output reconstruction sub-layer and in the

input layer equals the number of pixels in the image. The hidden layer has a smaller number

of units, because we are looking for aggressive compression techniques to overcome the

“curse of dimensionality”. The output classification layer has a number of units equal to

the number of image classes. Each image is propagated to the hidden layer in the form:

$$ h_j = \sum_{i=1}^{N} w_{ji} x_i + w_{j0} \tag{3.2.7} $$

$$ y_j = \sigma(h_j), \quad j = 1, \ldots, m, \tag{3.2.8} $$


Figure 3.5: Detailed architecture of the recognition/reconstruction network: a feed-forward neural network with recognition and reconstruction output sub-layers (diagram labels: input image Xi, i = 1, ..., n; hidden units Hj, j = 1, ..., m, with weights Wji; recognition sub-layer Ps, s = 1, ..., K, with weights W1sj; reconstruction sub-layer producing the reconstructed image, with weights W2ij).

where m ≪ N is the number of hidden units, N is the number of pixels in the image, and σ

is the sigmoid activation function:

$$ \sigma(x) = \frac{1}{1 + \exp(-x)}. \tag{3.2.9} $$

Image reconstruction, based on the hidden layer representation, is given by:

$$ \hat{x}_i = \sum_{j=1}^{m} W^2_{ij} y_j + W^2_{i0}, \quad i = 1, \ldots, N, \tag{3.2.10} $$

and the output of the recognition layer unit is calculated according to the formula:

$$ p_s = \sigma\Big(\sum_{j=1}^{m} W^1_{sj} y_j + W^1_{s0}\Big), \quad s = 1, \ldots, K, \tag{3.2.11} $$


where K is the number of individuals (number of classes). The classification is made

according to the maximal response of the recognition sub-layer (ps is interpreted as the

probability that the sample belongs to class s).
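The forward pass (3.2.7)-(3.2.11) can be written compactly in matrix form. The following sketch assumes NumPy, images stored as rows of a matrix, and small random initial weights; the names `init_params` and `forward`, the dictionary keys and the layer sizes are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_pixels, n_hidden, n_classes, rng, scale=0.1):
    """Small random weights; biases (the index-0 weights) start at zero."""
    return {
        "w": scale * rng.normal(size=(n_hidden, n_pixels)),
        "w0": np.zeros(n_hidden),
        "W1": scale * rng.normal(size=(n_classes, n_hidden)),
        "W1_0": np.zeros(n_classes),
        "W2": scale * rng.normal(size=(n_pixels, n_hidden)),
        "W2_0": np.zeros(n_pixels),
    }

def forward(params, x):
    """Propagate a batch of images x with shape (n_samples, n_pixels)."""
    h = x @ params["w"].T + params["w0"]               # (3.2.7)
    y = sigmoid(h)                                     # (3.2.8)
    x_hat = y @ params["W2"].T + params["W2_0"]        # (3.2.10), linear
    p = sigmoid(y @ params["W1"].T + params["W1_0"])   # (3.2.11)
    return h, y, x_hat, p

rng = np.random.default_rng(3)
params = init_params(n_pixels=64, n_hidden=8, n_classes=4, rng=rng)
images = rng.random(size=(5, 64))
h, y, x_hat, p = forward(params, images)
predicted_class = p.argmax(axis=1)  # maximal response of the recognition sub-layer
```

Note that the reconstruction sub-layer is linear while the recognition sub-layer passes through the sigmoid, as in the equations above.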

3.2.6 Network learning rule

Let us consider the error back-propagation learning rule with a goal of minimizing the

cost function, which is a weighted sum of scaled recognition and reconstruction errors

with coefficients λ1 and λ2, respectively¹:

$$ E(w, W^1, W^2) = \lambda_1 E_1(w, W^1)/K + \lambda_2 E_2(w, W^2)/N. \tag{3.2.12} $$

As has been shown in Section 3.2.3, the coefficients λ1 and λ2 are inversely proportional

to the noise variances in the recognition and reconstruction channels, respectively.

Therefore, the larger the noise in a channel, the smaller the weight of the error-cost

corresponding to this channel. The recognition error E1 and the reconstruction error E2, summed

over all samples, are given by:

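$$ E_1 = \sum_{\mu=1}^{M} \sum_{s=1}^{K} \left(p_s^{\mu} - t_s^{\mu}\right)^2, \tag{3.2.13} $$

$$ E_2 = \sum_{\mu=1}^{M} \sum_{i=1}^{N} \left(\hat{x}_i^{\mu} - t(x_i^{\mu})\right)^2. \tag{3.2.14} $$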

In this expression, $t(x_i^{\mu})$ is the target of sample µ at reconstruction

unit i. The most reasonable choice for $t(x_i^{\mu})$, not demanding any a priori knowledge, is

$t(x_i^{\mu}) = x_i^{\mu}$. Correspondingly, $t_s^{\mu}$ is the target for recognition, given by:

$$ t_s^{\mu} = \begin{cases} 1 & \text{if } s \text{ coincides with the class of sample } \mu, \\ 0 & \text{otherwise.} \end{cases} $$
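To make the cost (3.2.12)-(3.2.14) concrete, the sketch below evaluates it for a tiny hand-made example with the reconstruction target chosen as the input itself. NumPy is assumed; `one_hot` and `hybrid_cost` are hypothetical helper names, and the numbers are invented.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Recognition targets: 1 for the class of the sample, 0 otherwise."""
    targets = np.zeros((len(labels), n_classes))
    targets[np.arange(len(labels)), labels] = 1.0
    return targets

def hybrid_cost(p, t, x_hat, x, lam1, lam2):
    """Weighted sum of scaled recognition and reconstruction errors."""
    n_classes, n_pixels = p.shape[1], x.shape[1]
    e1 = ((p - t) ** 2).sum()      # (3.2.13)
    e2 = ((x_hat - x) ** 2).sum()  # (3.2.14) with target t(x) = x
    return lam1 * e1 / n_classes + lam2 * e2 / n_pixels, e1, e2

# Two samples, two classes, four pixels.
p = np.array([[0.8, 0.2], [0.4, 0.6]])
t = one_hot([0, 1], n_classes=2)
x = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])
x_hat = np.array([[0.5, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.5]])

cost, e1, e2 = hybrid_cost(p, t, x_hat, x, lam1=1.0, lam2=2.0)
# e1 = 0.4, e2 = 0.5, cost = 1.0 * 0.4 / 2 + 2.0 * 0.5 / 4 = 0.45
```

Dividing by the number of classes and by the number of pixels keeps the two terms on a comparable scale even though the reconstruction sub-layer has many more units.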

The hidden-to-output and input-to-hidden weights are updated according

to the gradient descent rule:

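$$ \Delta W^1 = -\eta^{\star} \lambda_1 \nabla_{W^1} E_1(w, W^1) \tag{3.2.15} $$

$$ \Delta W^2 = -\eta^{\star} \lambda_2 \nabla_{W^2} E_2(w, W^2) \tag{3.2.16} $$

$$ \Delta w = -\eta^{\star} \left(\lambda_1 \nabla_w E_1 + \lambda_2 \nabla_w E_2\right) \tag{3.2.17} $$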

1 For convenience, the errors E1 and E2 are scaled by the number of image classes K and the number of pixels in the image N, respectively. This scaling serves to balance the values of the recognition and reconstruction errors in Eq. 3.2.4.


Specifically, the changes of the weights between the hidden layer and the reconstruction output sub-layer are given by:

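$$ \Delta W^2_{ij} = \eta \lambda_2 \sum_{\mu=1}^{M} \delta_i^{\mu,2} y_j^{\mu} \tag{3.2.18} $$

$$ \delta_i^{\mu,2} = \Delta_i^{\mu,2} \equiv \left(x_i^{\mu} - \hat{x}_i^{\mu}\right)/N \tag{3.2.19} $$

$$ i = 1, \ldots, N, \quad j = 1, \ldots, m, $$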

where $y_j^{\mu}$ (3.2.8) is the output of hidden unit j in the feed-forward propagation of

input image µ, and $\delta_i^{\mu,2}$ is the image reconstruction error, scaled by the number of pixels

in the image. Similarly, the weights between the hidden layer and the recognition output sub-layer change

by:

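$$ \Delta W^1_{sj} = \eta \lambda_1 \sum_{\mu=1}^{M} \delta_s^{\mu,1} y_j^{\mu} \tag{3.2.20} $$

$$ \delta_s^{\mu,1} = \sigma'\Big(\sum_{j=1}^{m} W^1_{sj} y_j^{\mu} + W^1_{s0}\Big) \Delta_s^{\mu,1} \tag{3.2.21} $$

$$ s = 1, \ldots, K, \quad j = 1, \ldots, m, $$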

where $\Delta_s^{\mu,1}$ is the recognition (regression) error scaled by the number of image classes:

$$ \Delta_s^{\mu,1} = \left(t_s^{\mu} - p_s^{\mu}\right)/K. \tag{3.2.22} $$

We call δ the output error of the layer and ∆ the input error to the layer in the

backward propagation. According to the generalized delta-rule (Hertz et al., 1991), the

change of the deeper weights between the input and hidden layers has the form:

$$ \Delta w_{ji} = \eta \sum_{\mu=1}^{M} \delta_j^{\mu} x_i^{\mu} \tag{3.2.23} $$

and the output error $\delta_j^{\mu}$ of hidden unit j in the backward propagation of the error is

given by:

$$ \delta_j^{\mu} = \sigma'(h_j^{\mu}) \Delta_j^{\mu}. \tag{3.2.24} $$

The input error $\Delta_j^{\mu}$ to hidden unit j has the form:

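$$ \Delta_j^{\mu} = \lambda_1 \Delta_j^{\mu,1} + \lambda_2 \Delta_j^{\mu,2} \tag{3.2.25} $$

$$ \Delta_j^{\mu,1} = \sum_{s=1}^{K} W^1_{sj} \delta_s^{\mu,1} \tag{3.2.26} $$

$$ \Delta_j^{\mu,2} = \sum_{i=1}^{N} W^2_{ij} \delta_i^{\mu,2}. \tag{3.2.27} $$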


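From (3.2.25–3.2.27) it is easy to see that the output error $\delta_j^{\mu}$ may be written as the

sum of the errors back propagated concurrently from the reconstruction and recognition

sub-layers:

$$ \delta_j^{\mu} = \lambda_1 \delta_j^{\mu,1} + \lambda_2 \delta_j^{\mu,2} \tag{3.2.28} $$

$$ \delta_j^{\mu,1} = \sigma'(h_j^{\mu}) \Delta_j^{\mu,1} \tag{3.2.29} $$

$$ \delta_j^{\mu,2} = \sigma'(h_j^{\mu}) \Delta_j^{\mu,2}. \tag{3.2.30} $$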

In general, the input/output errors to any layer of the network are a weighted sum

of the input/output errors back propagated from the lateral sub-layers (by the chain rule for

derivatives). Thus, in the error back-propagation mode, the hybrid network with lateral

sub-layers emerges as a linear superposition of the conventional (classical) subnetworks.
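The back-propagated errors can be checked against a finite-difference gradient. The sketch below is an illustration under assumptions (NumPy, images as rows, hypothetical function names and random toy data). It computes the hidden-layer errors from the two sub-layers as in (3.2.19), (3.2.21)-(3.2.22) and (3.2.24)-(3.2.27), and verifies that the sum over samples of δ_j x_i in (3.2.23) equals minus one half of the derivative of the cost (3.2.12) with respect to the input-to-hidden weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(w, w0, W1, b1, W2, b2, X):
    y = sigmoid(X @ w.T + w0)
    p = sigmoid(y @ W1.T + b1)
    x_hat = y @ W2.T + b2
    return y, p, x_hat

def cost(w, w0, W1, b1, W2, b2, X, T, lam1, lam2):
    _, p, x_hat = forward(w, w0, W1, b1, W2, b2, X)
    K, N = T.shape[1], X.shape[1]
    return (lam1 * ((p - T) ** 2).sum() / K
            + lam2 * ((x_hat - X) ** 2).sum() / N)

def hidden_weight_gradient(w, w0, W1, b1, W2, b2, X, T, lam1, lam2):
    """Gradient of the cost with respect to w via back-propagated errors."""
    K, N = T.shape[1], X.shape[1]
    y, p, x_hat = forward(w, w0, W1, b1, W2, b2, X)
    delta1 = p * (1 - p) * (T - p) / K   # recognition output error
    delta2 = (X - x_hat) / N             # reconstruction output error
    # Input error to the hidden layer: weighted sum over both sub-layers.
    Delta = lam1 * delta1 @ W1 + lam2 * delta2 @ W2
    delta = y * (1 - y) * Delta          # output error of the hidden layer
    # sum_mu delta_j x_i equals -1/2 dE/dw_ji, hence the factor -2.
    return -2.0 * delta.T @ X

def numerical_gradient(f, w, eps=1e-6):
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        step = np.zeros_like(w)
        step[idx] = eps
        grad[idx] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(5)
M, N, m, K = 6, 5, 3, 2
X = rng.random(size=(M, N))
T = np.eye(K)[rng.integers(0, K, size=M)]
w, w0 = rng.normal(size=(m, N)), rng.normal(size=m)
W1, b1 = rng.normal(size=(K, m)), rng.normal(size=K)
W2, b2 = rng.normal(size=(N, m)), rng.normal(size=N)
lam1, lam2 = 0.7, 0.3

analytic = hidden_weight_gradient(w, w0, W1, b1, W2, b2, X, T, lam1, lam2)
numeric = numerical_gradient(
    lambda v: cost(v, w0, W1, b1, W2, b2, X, T, lam1, lam2), w)
```

The factor 2 that arises from differentiating the squared errors can be absorbed into the learning rate, so the update direction is the one given by the delta rule above.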

3.2.7 Hybrid learning rule

We follow the gradient descent algorithm, in which the errors are back propagated from the output

layer to the hidden layer, with a properly scaled cost function (3.2.12):

$$ E(w, W^1, W^2) = (1 - \lambda) E_1(w, W^1)/K + \lambda E_2(w, W^2)/N, \tag{3.2.31} $$

where λ ∈ [0, 1] (λ = λ2/(λ1 + λ2)) is a regularization parameter, which represents a

tradeoff between reconstruction and classification confidences.

According to the gradient descent method, updating of the weight vector in each

iteration has to be done in the direction that has a negative projection on the gradient

direction. This permits us to rescale a learning rule (3.2.15-3.2.16):

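$$ \Delta W^1 = -\eta^{\star} \nabla_{W^1} E_1(w, W^1) $$

$$ \Delta W^2 = -\eta^{\star} \nabla_{W^2} E_2(w, W^2) $$

$$ \Delta w = -\eta^{\star} \left((1 - \lambda) \nabla_w E_1 + \lambda \nabla_w E_2\right) \tag{3.2.32} $$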

We emphasize that the parameter λ in our implementation affects only the weights w

between the input and hidden layers, i.e. the hidden layer representation. Our rule (3.2.32)

may be treated as the hidden layer belief in the performance of the two upper channels,

transferring backward information from reconstruction and recognition sub-layers. Thus,

we take the errors of the reconstruction layer with the weight λ, and the errors of the

recognition layer with the weight 1 − λ. It can be seen that for λ = 0 the hidden

representation is built based only on the recognition task, and reconstruction is learned

from the hidden layer. This marginal case corresponds to the approach of Baluja (Baluja

and Pomerleau, 1995). In contrast, when λ = 1, the hidden representation is based

solely on the reconstruction task, and we attempt to solve the recognition task in the reduced

space. This marginal case is equivalent to the first approach to classification via

reconstruction (Kirby and Sirovich, 1990; Turk and Pentland, 1991; Murase and Nayar,

1993).

This network and its hybrid rule may be interpreted as the parallel concurrent work

of two separate feed-forward networks for recognition and reconstruction. The hybrid net

hidden weight updating is a linear combination of the gradient directions of both networks

in the common hidden weight space. For small λ our method is a kind of gradient descent

method that prevents the zig-zags characteristic of the steepest descent method (Ripley,

1996) in the search for the optimal weights minimizing the recognition regression error.
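One batch update of the hybrid rule (3.2.32) might look as follows. This is a sketch under assumptions, not the implementation used in this work: NumPy is assumed, the errors are scaled by K and N as in (3.2.31), constant factors are absorbed into the learning rate `eta`, and the function name `hybrid_step`, the dictionary keys and the toy data are hypothetical. The parameter λ enters only the update of the input-to-hidden weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hybrid_step(params, X, T, lam, eta):
    """Return updated parameters after one batch step of the hybrid rule."""
    w, w0 = params["w"], params["w0"]
    W1, b1 = params["W1"], params["b1"]
    W2, b2 = params["W2"], params["b2"]
    K, N = T.shape[1], X.shape[1]

    y = sigmoid(X @ w.T + w0)
    p = sigmoid(y @ W1.T + b1)
    x_hat = y @ W2.T + b2

    d1 = p * (1 - p) * (T - p) / K   # recognition output error
    d2 = (X - x_hat) / N             # reconstruction output error
    # The hidden layer mixes both back-propagated errors with weights
    # (1 - lam) and lam; the output weights do not depend on lam.
    dh = y * (1 - y) * ((1 - lam) * d1 @ W1 + lam * d2 @ W2)

    return {
        "W1": W1 + eta * d1.T @ y, "b1": b1 + eta * d1.sum(axis=0),
        "W2": W2 + eta * d2.T @ y, "b2": b2 + eta * d2.sum(axis=0),
        "w": w + eta * dh.T @ X, "w0": w0 + eta * dh.sum(axis=0),
    }

rng = np.random.default_rng(6)
M, N, m, K = 8, 6, 3, 2
X = rng.random(size=(M, N))
T = np.eye(K)[rng.integers(0, K, size=M)]
params = {
    "w": rng.normal(size=(m, N)), "w0": np.zeros(m),
    "W1": rng.normal(size=(K, m)), "b1": np.zeros(K),
    "W2": rng.normal(size=(N, m)), "b2": np.zeros(N),
}
updated = hybrid_step(params, X, T, lam=0.2, eta=0.1)
```

With `lam=0.0` the hidden weights are driven by the recognition error only, and with `lam=1.0` by the reconstruction error only, which reproduces the two marginal cases discussed above.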

Chapter 4

Imposing bias via unsupervised learning constraints

4.1 Introduction

Information theory provides some explanation of sensory processing (Rieke et al., 1996).

According to its principles, neural cell responses are developed by optimizing criteria

based on information theory. The first proposed information principles are redundancy

reduction (Barlow, 1961) and “infomax” (Linsker, 1988), which are similar and lead to the

formation of a factorial code under some conditions (Nadal and Parga, 1994). Recently, with

the parallel development of independent component analysis (ICA) (Comon, 1994) in

the signal processing, new efficient algorithms for the factorial code formation have been

proposed. Of particular interest are algorithms via feed-forward networks with no hidden

layer (Bell and Sejnowski, 1995; Yang and Amari, 1997).

In this chapter, we propose to use information theoretical measures as constraints for

the classification task. We introduce a hybrid neural network with a hidden representa-

tion that is arranged mainly for the classification task and, in addition, has some useful

properties, such as the independence of hidden neurons or maximum information transfer

in the hidden layer, etc.

The chapter is organized as follows. In the first section, the main information principles

and their relation to sensory processing are discussed. The second section presents the

mathematical background and algorithms for ICA and other related information principles.

In the third section, a hybrid neural network with unsupervised constraints is

introduced and some algorithmic details are presented.


4.2 Information principles for sensory processing

Mammals process incomplete and noisy sensory information in an apparently effortless

way. This is possible since sensory inputs (images, sounds, etc.) have very specific

statistical properties that are efficiently encoded by biological nervous systems. The

sensory inputs usually appear smooth over large spatial and temporal regions, which leads

to redundancy in the sensory input. The redundancy emerges as a statistical regularity,

which means that many pieces of a signal are a priori predictable from other pieces, and

hence by clever recoding it is possible to obtain a more economical representation of the data.

In the past, the principle of redundancy reduction (Barlow, 1961) was suggested as a

coding strategy in neurons. According to this principle each neuron should encode features

that are as statistically independent as possible from other neurons over a natural ensemble

of inputs. The resulting representation is called the factorial code (Redlich, 1993).

In the factorial code, the multivariate probability density function (pdf) is factorized as

a product of marginal pdfs. This property provides an efficient way of storing statistical

knowledge¹ about the input (Barlow, 1989).
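The storage saving of a factorial code is easy to quantify: a full joint table over N pixels with Q intensity levels has Q^N entries, whereas N marginal tables have only NQ entries in total. The short sketch below uses hypothetical helper names and a made-up 3-pixel binary example to show the counts and to evaluate the probability of an image as a product of marginals.

```python
from itertools import product

def joint_table_size(n_pixels, n_levels):
    """Number of probabilities needed for an arbitrary joint distribution."""
    return n_levels ** n_pixels

def factorial_table_size(n_pixels, n_levels):
    """Number of probabilities needed when the code is factorial."""
    return n_pixels * n_levels

def factorial_probability(image, marginals):
    """Probability of an image as the product of per-pixel marginals."""
    prob = 1.0
    for pixel_value, marginal in zip(image, marginals):
        prob *= marginal[pixel_value]
    return prob

# A 16-pixel image with 4 intensity levels.
full = joint_table_size(16, 4)         # 4294967296
compact = factorial_table_size(16, 4)  # 64

# Three binary pixels with made-up marginal distributions.
marginals = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
total = sum(factorial_probability(img, marginals)
            for img in product(range(2), repeat=3))
```

The products over all eight possible images sum to one, so the three small tables define a valid distribution over images.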

One of the earliest attempts to construct the factorial representation via neural networks

was proposed by Atick (1992). The underlying computational learning rule is based

on the minimization of the sum of the entropies of the hidden units under the constraint of

preserving the input entropy (the total information about the signal). A type of gradient

descent algorithm, under the assumption of a Gaussian input signal and a linear output,

results in a Hebbian-like learning rule and a decorrelated hidden representation. The major

limitation of Hebbian-like rules is their dependence on the linear, pairwise correlations among

image pixels (second order statistics). Thus, they are not sensitive to the phase changes in

the image that are responsible for oriented localized structures, such as lines, edges and corners

(Field, 1994).

Motivated by the principle of redundancy reduction, Field (1994) contrasts two different

coding approaches. Both approaches take advantage of the input redundancy, but in a

different manner. The first one, compact coding, is based on the mean-squared error and

uses only the second order statistics of the input. The main goal of this coding is to

reduce the dimensionality of the input by discarding the directions with low input variance. PCA

and linear auto-associator networks, considered in Chapter 3, are examples of this coding

scheme. The alternative, sparse distributed coding, does not necessarily imply a reduction

of dimensionality; on the contrary, the dimensionality may be enlarged. A sparse distributed

1 For an image description, the probability of each possible set of pixel values has to be known. For instance, an image having N pixels with Q intensity quantization levels requires the storing of $Q^N$ possible probabilities. If the code is factorial, the number of required probabilities reduces to NQ.


coding approach encourages representations where only a small, input-dependent subset

of hidden units is simultaneously active.

Although there is no general tool for forming a sparse code, such a code has some typical

features. The sparse code is characterized by an extremely peaked distribution of the

hidden unit activities, so that each neuron is with high probability either silent or

active, according to its relevance to the representation of the input pattern. A way to construct

a sparse code based on this feature has been proposed in (Olshausen and Field, 1996) by

minimizing a cost functional consisting of a mean-squared error and a penalty term for

neuron activities.

Peaked distributions are characterized by high kurtosis or low entropies (Oja, 1995b),

thus, maximization of kurtosis or entropy minimization can be used for sparse coding

formation. At the same time, via minimization of the sum of the entropies sparse coding

is related to a factorial coding. It is also known that under a fixed variance the Gaussian

distribution has the largest entropy (Cover and Thomas, 1991). Thus, hidden unit entropy

minimization is tightly related to exploratory projection pursuit (EPP), which tries to

find a structure in the projected data, seeking directions that are as far from Gaussian as

possible (Friedman, 1987). Therefore, a deviation from the Gaussian distribution serves

as a good measure for hidden unit independence and can be used as a strategy for sparse

coding construction.
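The link between peakedness and kurtosis can be illustrated numerically. The sketch below is only an illustration and not part of the original experiments: it assumes synthetic unit activities drawn from a Gaussian and from a Laplacian distribution (a hypothetical stand-in for a sparse unit), and the function name `excess_kurtosis`, the seed and the sample size are arbitrary choices.

```python
import random

def excess_kurtosis(samples):
    """Sample excess kurtosis mu4 / mu2^2 - 3 (zero for a Gaussian)."""
    n = len(samples)
    mean = sum(samples) / n
    mu2 = sum((s - mean) ** 2 for s in samples) / n
    mu4 = sum((s - mean) ** 4 for s in samples) / n
    return mu4 / mu2 ** 2 - 3.0

rng = random.Random(0)   # fixed seed, chosen only for repeatability
n_samples = 50_000

# Gaussian activities: the reference case that projection pursuit tries to avoid.
gaussian = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

# Laplacian activities: sharply peaked at zero with heavy tails (a "sparse" unit).
laplacian = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(n_samples)]

kurt_gaussian = excess_kurtosis(gaussian)
kurt_laplacian = excess_kurtosis(laplacian)
print(f"Gaussian  excess kurtosis: {kurt_gaussian:+.2f}")
print(f"Laplacian excess kurtosis: {kurt_laplacian:+.2f}")
```

The peaked distribution should show a clearly positive value, while the Gaussian sample stays close to zero, which is the sense in which deviation from Gaussianity can serve as a sparseness index.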

Recently, interest in EPP has been revived, and the formulation of new unsupervised rules based on information theory has been stimulated by the development of independent component analysis (ICA). In the next section, ICA is formulated and some algorithms producing factorial codes are presented.

4.3 Mathematical background

ICA has been developed as a tool for blind source separation. The problem is to recover

independent sources from sensory observations which are unknown linear mixtures of the

unobserved independent source signals. Let us consider m unknown mutually independent sources $s_i(t)$, $i = 1, \ldots, m$, with no more than one being normally distributed. In general, t is a sampling variable, which may be a time variable for signals, a two-dimensional spatial variable for images, or an index of the pattern in a data-set. The sources are mixed together linearly by an unknown non-singular matrix $A \in R^{n \times m}$:

$$x(t) = A s(t), \quad s(t) = [s_1(t), \ldots, s_m(t)] \qquad (4.3.1)$$

It is assumed that in (4.3.1) the number of sensors $x_i(t)$, $i = 1, \ldots, n$, is greater than or equal to the number of sources ($n \ge m$). The task is to recover the original signals via a linear transform defined by a matrix $W \in R^{m \times n}$:

$$u(t) = W x(t), \quad u(t) = [u_1(t), \ldots, u_m(t)] \qquad (4.3.2)$$

Since recovered signals may be permuted and scaled versions of the sources, the de-mixing

matrix W has to be a solution of the following linear equation:

ΛP = WA,

where Λ is a non-singular diagonal matrix and P is a permutation matrix.
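The permutation and scaling ambiguity can be checked on a small example. The following is a minimal sketch for the square case n = m = 2; the mixing matrix `A`, the scaling `Lam`, the permutation `P` and the source sample `s` are hypothetical values chosen for illustration only.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def inverse_2x2(m):
    """Inverse of a non-singular 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[1.0, 0.5], [0.3, 2.0]]       # hypothetical mixing matrix
Lam = [[2.0, 0.0], [0.0, -0.5]]    # arbitrary non-singular diagonal scaling
P = [[0.0, 1.0], [1.0, 0.0]]       # permutation that swaps the two sources

# Any W of this form satisfies W A = Lam P and is therefore a valid de-mixing matrix.
W = matmul(matmul(Lam, P), inverse_2x2(A))
WA = matmul(W, A)
LamP = matmul(Lam, P)

s = [[0.7], [-1.2]]                # one sample of the sources (column vector)
x = matmul(A, s)                   # observed mixture, x = A s
u = matmul(W, x)                   # recovered signals, u = W x

# u[0] is the second source scaled by 2.0, u[1] is the first source scaled by -0.5.
print(u)
```

The recovered vector contains the original sources, but in a different order and with different amplitudes, which is all that blind separation can guarantee.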

4.3.1 Entropy maximization (ME)

One of the first algorithms extracting the independent components via a neural network has been proposed by Bell and Sejnowski (1995). Assuming that the number of sources is equal to the number of sensors, a fully connected n → n feed-forward network consisting of an input layer and a nonlinear output layer, each having the same number of units as the number of sources, has been considered (Figure 4.1). The network has been trained to maximize the joint entropy H(y) of the nonlinear output y:

$$u = Wx + w_0, \quad y = g(u), \quad y \in R^n, \; u \in R^n, \; w_0 \in R^n \qquad (4.3.3)$$

$$H(y) = -\int p(y) \log p(y) \, dy$$

Figure 4.1: A one layer n → n feed-forward network for independent component extraction (input x; presynaptic output u = Wx, the recovered sources; output y with $y_i = g_i(u_i)$).

In the case of additive output noise, entropy maximization (ME) is equivalent to maximization of the mutual information between input and output (Nadal and Parga, 1994). As has been shown earlier (Linsker, 1988), the principle of mutual information maximization, called “infomax”, leads in the case of a linear neural network to a Hebbian-like learning rule that is sensitive to the second order statistics only; therefore, a nonlinearity in the output layer is essential.

The joint entropy of the output can be represented as:

$$H(y) = \sum_i H(y_i) - I(y), \qquad (4.3.4)$$

where $H(y_i) = -\int p(y_i) \log p(y_i) \, dy_i$ are the marginal entropies of the outputs and I(y) is their mutual information. The mutual information (MI) of the output y is a Kullback-Leibler measure between the output distribution p(y) and the product of the marginal distributions $\prod_i p(y_i)$:

$$I(y) = \int p(y) \log \frac{p(y)}{\prod_i p(y_i)} \, dy \qquad (4.3.5)$$

Due to the ∩-convexity of the log function, the Kullback-Leibler measure is nonnegative and attains its minimum value of zero if and only if the outputs $y_i$ are independent almost everywhere. Maximization of the joint entropy consists of maximizing the marginal entropies and minimizing the mutual information. Since the nonlinear functions bound the outputs, the marginal entropies are maximal for a uniform distribution of $y_i$. The mutual information I(y) is invariant under an invertible component-wise transform (I(y) = I(u)) and attains its minimum of zero when the presynaptic outputs u (4.3.3) are independent. Thus, if the nonlinear functions $g_i$ have the form of the cumulative distribution functions (cdfs) of the true sources, then the matrix W recovers the independent sources as the presynaptic output u (4.3.3), and this is the single global maximum of the joint entropy H(y), which is a ∩-convex function.
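The decomposition (4.3.4) and the nonnegativity of the mutual information can be checked on a discrete analogue. The sketch below is an illustration under that assumption: it replaces the continuous densities of the text with 2 × 2 probability tables, and the two tables `dependent` and `independent` are hypothetical examples.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mutual_information(joint):
    """Kullback-Leibler measure between a joint table and the product of its marginals."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    return sum(p * math.log(p / (rows[i] * cols[j]))
               for i, r in enumerate(joint) for j, p in enumerate(r) if p > 0.0)

def decomposition(joint):
    """Return (H(y), sum_i H(y_i) - I(y), I(y)) for a two-variable joint table."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    h_joint = entropy([p for r in joint for p in r])
    mi = mutual_information(joint)
    return h_joint, entropy(rows) + entropy(cols) - mi, mi

dependent = [[0.4, 0.1], [0.1, 0.4]]          # correlated outputs
independent = [[0.25, 0.25], [0.25, 0.25]]    # uniform and factorized outputs

h_dep, rhs_dep, mi_dep = decomposition(dependent)
h_ind, rhs_ind, mi_ind = decomposition(independent)
print(f"dependent:   H = {h_dep:.4f}, I = {mi_dep:.4f}")
print(f"independent: H = {h_ind:.4f}, I = {mi_ind:.4f}")
```

In this toy setting the joint entropy is largest for the table that is both uniform in its marginals and factorized, mirroring the argument that entropy maximization pushes the outputs toward independent, uniformly distributed values.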

As has been rigorously proven (Yang and Amari, 1997), the ME approach leads to the independent components only if the nonlinear activation functions $g_i$ in the output layer coincide with the cumulative distribution functions (cdfs) of the sources. For zero mean mixtures and functions $g_i$ not equal to the cdfs of the sources, the ME algorithm does not converge to the ICA solution $W = \Lambda P A^{-1}$. However, if the initial matrix is the right ICA solution $W_0 = \Lambda P A^{-1}$, the algorithm does not update the de-mixing matrix W in the directions of increasing cross-talk. This fact partially explains the success of ME even when the cdfs are not known exactly. In the applications considered by Bell and Sejnowski (1995), the nonlinear activation functions have been chosen ad hoc as logistic sigmoids, which have a highly peaked derivative with long tails. Since sound signals are super-Gaussian (see footnote 2), this type of nonlinearity appears to be appropriate for the “infomax” principle.

2 Super-Gaussian signals have a pdf with large tail areas and a sharp peak. In contrast, sub-Gaussian signals have a pdf with small tail areas and a flat peak (see also Appendix A to Chapter 4).


The de-mixing matrix W is found as the synaptic weights of the network, iteratively, using the stochastic gradient ascent method applied to the joint entropy H(y):

$$\Delta W = \eta \left( [W^t]^{-1} + (1 - 2y) x^t \right), \qquad \Delta w_0 = \eta (1 - 2y),$$

where $1 \in R^n$ is a vector of ones.

Amari et al. (1997) have suggested a modification of this rule that utilizes the natural gradient and does not require the inversion of the weight matrix. It proceeds by multiplying the ordinary gradient by $W^t W$, producing (see footnote 3):

$$\Delta W = \eta \left( I + (1 - 2y) u^t \right) W \qquad (4.3.6)$$
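The step from the ordinary gradient to (4.3.6) can be verified numerically. The sketch below assumes a logistic output nonlinearity (which is what produces the term 1 − 2y), a zero bias, and a hypothetical 2 × 2 weight matrix and input sample; it multiplies the ordinary gradient by $W^t W$ and compares the result with the closed form.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(r) for r in zip(*m)]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

W = [[1.0, 0.4], [-0.3, 0.8]]     # hypothetical current weights
x = [[0.5], [-1.5]]               # hypothetical input sample (column vector)

u = matmul(W, x)                                           # u = W x
y = [[1.0 / (1.0 + math.exp(-ui[0]))] for ui in u]         # logistic output
r = [[1.0 - 2.0 * yi[0]] for yi in y]                      # the vector 1 - 2y

# Ordinary gradient: [W^t]^{-1} + (1 - 2y) x^t
wt_inv = inverse_2x2(transpose(W))
r_xt = matmul(r, transpose(x))
ordinary = [[wt_inv[i][j] + r_xt[i][j] for j in range(2)] for i in range(2)]

# Natural gradient obtained by right-multiplication with W^t W ...
natural_from_product = matmul(matmul(ordinary, transpose(W)), W)

# ... and the closed form (I + (1 - 2y) u^t) W of (4.3.6), which needs no inversion.
r_ut = matmul(r, transpose(u))
bracket = [[(1.0 if i == j else 0.0) + r_ut[i][j] for j in range(2)] for i in range(2)]
natural_closed_form = matmul(bracket, W)

max_diff = max(abs(natural_from_product[i][j] - natural_closed_form[i][j])
               for i in range(2) for j in range(2))
print(max_diff)
```

The two matrices agree up to rounding error, which is why the natural-gradient rule can be applied without ever inverting W.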

4.3.2 Minimization of the output mutual information (MMI)

Another way to derive independent outputs for the blind separation problem has been presented in (Amari et al., 1996). The algorithm minimizes the mutual information (MI) of the linear outputs, $I_u(W)$:

$$I_u(W) = -H(u) + \sum_{i=1}^{n} H(u_i), \qquad (4.3.7)$$

$$u = Wx, \qquad (4.3.8)$$

which attains its minimum if and only if the outputs $u_i$ are independent almost everywhere. In order to approximate the marginal entropies $H(u_i)$, a truncated Gram-Charlier expansion (Stuart and Ord, 1994) of the marginal pdfs $p(u_i)$ has been used and a mild assumption about the original source statistics has been made. It has been assumed that the original sources have zero mean and that their variances are normalized to 1. Stochastic gradient descent applied to the approximated expression of the mutual information $I_u(W)$ leads to the following equation for the network weight dynamics:

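$$\Delta W = \eta \left( [W^t]^{-1} - \Phi(u) x^t \right) \qquad (4.3.9)$$

where $\Phi(u) = f(k_3, k_4) \circ u^2 + g(k_3, k_4) \circ u^3$ and the following notation is used ($\circ$ denotes the component-wise product):

$$f \circ y = [f_1 y_1, \ldots, f_n y_n]^t, \qquad u^k = u \circ u^{k-1}$$

$$f(k_3, k_4) = [f(k_3^1, k_4^1), \ldots, f(k_3^n, k_4^n)]^t, \qquad g(k_3, k_4) = [g(k_3^1, k_4^1), \ldots, g(k_3^n, k_4^n)]^t$$

$$k_3^i = m_3^i = E[u_i^3], \qquad k_4^i = E[u_i^4] - 3 (E[u_i^2])^2$$

$$f(a, b) = -\frac{1}{2} a + \frac{9}{4} a b, \qquad g(a, b) = -\frac{1}{6} b + \frac{3}{2} a^2 + \frac{3}{4} b^2$$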

3 $w_0$ is assumed to be zero.


The natural gradient descent for MMI leads to the following algorithm:

$$\Delta W = \eta(t) \left[ I - \Phi(u) u^t \right] W \qquad (4.3.10)$$

As has been pointed out in (Yang and Amari, 1997), both ME and MMI algorithms have the same typical form (4.3.10). In ME, Φ depends on the nonlinear activation functions $g_i$ and is given by:

$$\Phi(u) = -\left( \frac{g_1''(u_1)}{g_1'(u_1)}, \ldots, \frac{g_n''(u_n)}{g_n'(u_n)} \right)^t \qquad (4.3.11)$$

Since $g_i$ should coincide with the cdfs of the unknown original signals, Φ(u) has to be chosen properly. In MMI, the functions Φ depend on the third- and fourth-order cumulants $k_3^i$, $k_4^i$ of the linear scalar output $u_i$. These cumulants may be replaced by instantaneous values or be estimated. Another possibility is to use a-priori knowledge about the cumulants of the unknown original signals. Therefore, whereas the success of the ME algorithm depends on a-priori knowledge about the data statistics, MMI is more flexible.

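In (Yang and Amari, 1997) the following types of $\Phi(u) = (\phi(u_1), \ldots, \phi(u_n))^t$ have been used:

$$\text{(a)} \quad \phi(u) = u^3 \qquad (4.3.12)$$

$$\text{(b)} \quad \phi(u) = \tanh(u) \qquad (4.3.13)$$

$$\text{(c)} \quad \phi(u) = \frac{3}{4} u^{11} + \frac{15}{4} u^9 - \frac{14}{3} u^7 - \frac{29}{4} u^5 + \frac{29}{4} u^3 \qquad (4.3.14)$$

The (a-b) forms of Φ(u) correspond to the ME algorithm and assume the pdfs (and equivalently g(u), see footnote 4) to be proportional to:

$$\text{(a)} \quad p(u) \propto \exp(-u^4/4) \qquad (4.3.15)$$

$$\text{(b)} \quad p(u) \propto (\cosh(u))^{-1} \qquad (4.3.16)$$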

Therefore, in both cases the distributions are assumed to be symmetrical; (a) is sub-Gaussian, whereas (b), whose tails are heavier than Gaussian, is super-Gaussian.

The form (4.3.14) of Φ(u) is the instantaneous form of MMI ($k_3^i = u_i^3$, $k_4^i = u_i^4 - 3$) and it does not assume the shape of the source distributions.
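The polynomial (4.3.14) can be recovered from the functions f and g of (4.3.9) by substituting the instantaneous cumulants. The sketch below does this with exact rational arithmetic; polynomials in u are stored as dictionaries mapping a power to its coefficient, and the helper names `poly_mul`, `poly_add` and `scale` are illustrative only.

```python
from fractions import Fraction as Fr

def poly_mul(p, q):
    """Product of two polynomials stored as {power: coefficient}."""
    out = {}
    for i, a in p.items():
        for j, b in q.items():
            out[i + j] = out.get(i + j, Fr(0)) + a * b
    return {k: v for k, v in out.items() if v != 0}

def poly_add(*polys):
    """Sum of several polynomials."""
    out = {}
    for p in polys:
        for k, v in p.items():
            out[k] = out.get(k, Fr(0)) + v
    return {k: v for k, v in out.items() if v != 0}

def scale(c, p):
    return {k: c * v for k, v in p.items()}

k3 = {3: Fr(1)}                # instantaneous third cumulant: u^3
k4 = {4: Fr(1), 0: Fr(-3)}     # instantaneous fourth cumulant: u^4 - 3

# f(a, b) = -a/2 + (9/4) a b   and   g(a, b) = -b/6 + (3/2) a^2 + (3/4) b^2
f = poly_add(scale(Fr(-1, 2), k3), scale(Fr(9, 4), poly_mul(k3, k4)))
g = poly_add(scale(Fr(-1, 6), k4),
             scale(Fr(3, 2), poly_mul(k3, k3)),
             scale(Fr(3, 4), poly_mul(k4, k4)))

# Phi(u) = f * u^2 + g * u^3
phi = poly_add(poly_mul(f, {2: Fr(1)}), poly_mul(g, {3: Fr(1)}))

for power in sorted(phi, reverse=True):
    print(f"u^{power}: {phi[power]}")
```

The resulting coefficients are 3/4, 15/4, −14/3, −29/4 and 29/4 for the powers 11, 9, 7, 5 and 3, in agreement with (4.3.14).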

The ME and MMI learning rule (4.3.10) has been obtained under the assumption of a square weight matrix W. However, in some applications it may be interesting to separate only a part of the sources. This may be done via multiplication of the right side of the learning rule (4.3.9) by $W^t \Lambda W$, where the block matrix $\Lambda \in R^{n \times n}$ is given by:

$$\Lambda = \begin{pmatrix} I & 0 \\ 0^t & 0_s \end{pmatrix}. \qquad (4.3.17)$$

In (4.3.17) $I \in R^{m \times m}$ is an identity matrix, $0 \in R^{m \times (n-m)}$ is a rectangular zero matrix, $0_s \in R^{(n-m) \times (n-m)}$ is a square zero matrix and m < n. The final learning rule for the part of the weight matrix W obtained by deleting the last (n − m) rows of W is the same as (4.3.10):

$$\Delta W = \eta(t) \left[ I - \Phi(u) u^t \right] W,$$

$$W \in R^{m \times n}, \quad \Phi(u) \in R^{m \times 1}, \quad u \in R^{m \times 1}, \quad I \in R^{m \times m}$$

The network architecture then implies dimensionality reduction since the number of output units is less than the number of input units. In addition, such a network extracts independent components.

4 Here we use the fact that $-g''(u)/g'(u) = \phi(u)$, which leads to $g'(u) \propto \exp(-\int_0^u \phi(v) \, dv)$; at the same time $g'(u)$ must coincide with the pdfs of the original sources.

4.3.3 Relation to Exploratory Projection Pursuit.

MMI has been considered as the starting point for a large family of ICA contrast functions proposed by Hyvarinen (1997a). It has been noted that MI can be expressed using the negentropies $J(u)$, $J(u_i)$ (see footnote 5) (Hyvarinen, 1997a; Girolami and Fyfe, 1996):

$$I_u(W) = J(u) - \sum_i J(u_i) + \frac{1}{2} \log \frac{\prod_i C_{ii}}{\det(C)}, \qquad (4.3.18)$$

where C is the covariance matrix of u and $C_{ii}$ are its diagonal elements. Since the negentropy J(u) is invariant under invertible linear transformations (J(u) = J(x); note that $J(u_i) = J(x_i)$ holds only when the transformation x → u is component-wise, with $u_i = f(x_i)$), MMI is roughly equivalent to finding directions in which the negentropy is maximized. This equivalence is rigorous when the components $u_i$ are constrained to be uncorrelated (the last term of (4.3.18) is then zero). This means that the directions in which the data distribution is as non-Gaussian as possible are preferable. This is the point where EPP and ICA come into contact.

The natural gradient ascent applied to the sum of the marginal negentropies leads to the same learning rule (4.3.10) (Girolami and Fyfe, 1996; Lee et al., 1998). When the nonlinearities Φ are taken to be:

$$\phi_i(u_i) = \begin{cases} u_i + \tanh(u_i) & \text{for a super-Gaussian source} \\ u_i - \tanh(u_i) & \text{for a sub-Gaussian source} \end{cases}$$

the learning rule may be written in the elegant form:

$$\Delta W = \eta(t) \left[ I - K \tanh(u) u^t - u u^t \right] W, \qquad (4.3.19)$$

where K is a diagonal matrix with elements $\mathrm{sign}(\mathrm{kur}(u_i))$ and $\mathrm{kur}(u_i)$ is the kurtosis of the i-th source.

5 The negentropy of a multivariate random variable u is the difference between the entropy of the multivariate Gaussian distribution with the same covariance matrix as u and the entropy of u: $J(u) = H(u_G) - H(u)$. It measures the deviation of the distribution from Gaussian and is nonnegative. A valuable property of negentropy is its invariance under invertible linear transforms.

The advantage of EPP, however, is the possibility of finding independent components recursively, one by one, by maximization of the 1-D negentropy. Under the same conditions as in (Yang and Amari, 1997), $E[u_i] = 0$, $E[u_i^2] = 1$, the negentropy may be approximated by:

$$J(u_i) \approx \frac{1}{12} (E[u_i^3])^2 + \frac{1}{48} (k_4(u_i))^2 \qquad (4.3.20)$$

When the source distributions are assumed to be symmetrical, the negentropy simplifies to $J(u_i) \propto (k_4(u_i))^2 = \mathrm{kur}_i^2$ and minimization of the output mutual information is approximately equivalent to maximization of the sum of the squared source kurtoses (see footnote 6):

$$F_{max}(W) = \sum_i \mathrm{kur}_i^2 \qquad (4.3.21)$$

In other words, the directions in which the signal distribution is highly peaked or extremely flat are considered interesting.

In (Hyvarinen, 1997b) a new family of approximate ICA contrast functions has been proposed via the negentropy approximation:

$$J(u) \propto (E[G(u)] - E[G(\nu)])^2, \qquad (4.3.22)$$

where ν is a standardized Gaussian variable and the function G fulfills some orthogonality property, is suited to the assumed original source statistics and is reasonably simple to compute. The simplest proposed choice for the function G is the power function $G(u) = |u|^\alpha$, where α < 2 for super-Gaussian densities and α > 2 for sub-Gaussian densities. This approach appears finally as a generalization of different projection pursuit indices (Blais et al., 1998), where skewness and kurtosis are used explicitly to measure deviation from the Gaussian distribution. It is also related to the BCM neuron learning rule (Intrator and Cooper, 1992).

6 See Appendix A to Chapter 4 for the definitions of cumulants and kurtosis.


4.3.4 BCM

The idea of BCM is to find a direction w which emphasizes data multi-modality by minimizing a specific loss function (a specific projection index):

$$F(w) = -\mu \left( \frac{1}{3} E[u^3] - \frac{1}{4} \theta^2 \right), \qquad u = w^t x, \quad \theta = E[u^2] \qquad (4.3.23)$$

In order to make this measure robust to outliers, a rectifying nonlinear function is applied to the linear output. Thus, in general, $y = g(w^t x)$. Gradient descent yields the following learning rule:

$$\Delta w = \mu E[\phi(y, \theta) g'(u) x] \qquad (4.3.24)$$

where $\phi(y, \theta) = y^2 - y\theta$, $\theta = E[y^2]$.
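That the rule (4.3.24) is a descent direction of the loss (4.3.23) can be checked with finite differences. The sketch below treats only the linear case (g is the identity, so y = u and g′ = 1) with µ = 1, and replaces the expectation by an average over a small hypothetical data set; `bcm_loss` and `bcm_update` are illustrative names.

```python
def project(w, x):
    """Linear output u = w^t x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def bcm_loss(w, data):
    """F(w) = -(E[u^3] / 3 - theta^2 / 4) with theta = E[u^2] (linear case, mu = 1)."""
    us = [project(w, x) for x in data]
    n = len(us)
    theta = sum(u * u for u in us) / n
    return -(sum(u ** 3 for u in us) / (3 * n) - theta ** 2 / 4)

def bcm_update(w, data):
    """Batch direction E[phi(u, theta) x] with phi(u, theta) = u^2 - u * theta."""
    us = [project(w, x) for x in data]
    n = len(us)
    theta = sum(u * u for u in us) / n
    return [sum((u * u - u * theta) * x[j] for u, x in zip(us, data)) / n
            for j in range(len(w))]

# Hypothetical two-dimensional inputs and starting weights.
data = [(1.0, 0.2), (-0.5, 1.5), (2.0, -1.0), (0.3, 0.7), (-1.2, -0.4)]
w = [0.6, -0.3]

delta = bcm_update(w, data)

# Central finite differences of the loss with respect to each weight.
eps = 1e-6
numeric_grad = []
for j in range(len(w)):
    w_plus = list(w)
    w_plus[j] += eps
    w_minus = list(w)
    w_minus[j] -= eps
    numeric_grad.append((bcm_loss(w_plus, data) - bcm_loss(w_minus, data)) / (2 * eps))

print(delta)
print(numeric_grad)
```

The update direction equals the negative of the numerical gradient of F, as expected for gradient descent on the projection index.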

4.3.5 Sum of entropies of the hidden units

Being motivated to obtain a hidden representation where each neuron contains as much information as possible, we suggest maximizing the sum of the entropies of the output units:

$$F(W) = \sum_{i=1}^{m} H(y_i).$$

The stochastic gradient ascent method leads to the following equation for the weight dynamics (details are given in Appendix B to Chapter 4):

$$\Delta W = \eta \left( f(u) + \frac{g''}{g'} \right) x^t, \qquad (4.3.25)$$

where f(u) is defined as φ(u) in (4.3.14). Since the nonlinear output functions bound the output values $y_i$, the entropy is maximized when $y_i$ is uniformly distributed, which leads to the relation $p_u(u_i) = dg_i/du_i$. This means that the distribution of the presynaptic variables $u_i$ is controlled by the nonlinearities in the learning rule (4.3.25).

For logistic sigmoidal activation functions, (4.3.25) simplifies to:

$$\Delta W = \eta \left( f(u) + (1 - 2y) \right) x^t \qquad (4.3.26)$$

The same rule, but with a negative parameter η, can be used for sparse code formation, as suggested in (Olshausen and Field, 1996; Atick, 1992) (see footnote 7).

7 When the output is bounded, c < y < d, due to the ∩-convexity of the log function the entropy of the output is bounded from above: $H(y) = \int_c^d p(y) \log \frac{1}{p(y)} \, dy \le \log \left( \int_c^d \frac{p(y)}{p(y)} \, dy \right) = \log(d - c)$. Therefore, the entropy maximization is properly defined mathematically. At the same time the lower estimate depends on the distribution: $-H(y) = \int_c^d p(y) \log p(y) \, dy \le \log \int_c^d (p(y))^2 \, dy \le 2 \log((d - c) \max p(y))$. It is clear that in practice max p(y) is bounded and therefore the problem of minimizing the sum of entropies is also properly defined mathematically.


4.3.6 Nonlinear PCA

Although the nonlinear PCA method has no apparent connection to ME or MMI, it has been shown that it allows separation of whitened linear mixtures of sources (Oja, 1995b; Oja, 1995a). In nonlinear PCA, the input signals are first prewhitened, i.e. the signals are represented as projections on the eigenspace of the input covariance matrix and are properly scaled. As a result, the prewhitened signal x has zero mean and a unit (spherical) covariance matrix. The learning rule is an approximate stochastic gradient descent algorithm that minimizes the mean-squared reconstruction error:

$$E = E[\| x - W^t y \|^2], \qquad (4.3.27)$$

where the weight matrix W and the nonlinear output y are defined as in (4.3.3) and the bias is assumed to be zero, $w_0 = 0$.

The approximate learning rule has the form:

$$\Delta W = \eta y (x^t - y^t W) \qquad (4.3.28)$$
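One way to read the word "approximate" in (4.3.28) is that the rule is the exact gradient step on the reconstruction error when the output y is held fixed, so that the dependence of y on W is ignored. The sketch below illustrates this reading for a single sample; the tanh nonlinearity, the learning rate and the values of `W` and `x` are assumptions made only for the example.

```python
import math

def reconstruction_residual(W, x, y):
    """x - W^t y for a given output vector y."""
    return [x[j] - sum(y[i] * W[i][j] for i in range(len(W))) for j in range(len(x))]

def squared_error(W, x, y):
    """Squared reconstruction error ||x - W^t y||^2."""
    return sum(r * r for r in reconstruction_residual(W, x, y))

def nonlinear_pca_step(W, x, eta, g=math.tanh):
    """One update W + eta * y (x^t - y^t W), with y = g(W x) and zero bias."""
    m, n = len(W), len(x)
    y = [g(sum(W[i][j] * x[j] for j in range(n))) for i in range(m)]
    residual = reconstruction_residual(W, x, y)
    W_new = [[W[i][j] + eta * y[i] * residual[j] for j in range(n)] for i in range(m)]
    return W_new, y

W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]   # hypothetical 2 x 3 weight matrix
x = [1.0, -0.5, 0.25]                      # hypothetical (already whitened) input
eta = 0.1

W_new, y = nonlinear_pca_step(W, x, eta)
err_before = squared_error(W, x, y)
err_after = squared_error(W_new, x, y)     # same y: its dependence on W is ignored
print(err_before, err_after)
```

With y frozen, the residual is scaled by the factor 1 − η‖y‖², so the error decreases for any sufficiently small positive learning rate.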

For separation, odd, twice differentiable nonlinear functions $g_i$ have to be chosen so as to satisfy some stability conditions depending on the data statistics. In particular, it is shown (Oja, 1995a) that a sigmoidal nonlinear activation function such as $g(u) = \tanh(\beta u)$, β > 0, is feasible for sub-Gaussian original signals and the polynomial $g(u) = u^3$ for super-Gaussian densities (in this analysis it was assumed that the sources are statistically identical and have a symmetrical distribution).

The MSE for whitened data and nonlinear activation functions of the form $g(u) = u^3$ or tanh(u) may be approximated as $-\sum_i \mathrm{kur}_i$ (Lee et al., 1998). Thus, minimization of the MSE leads to maximization of the sum of the kurtoses:

$$F_{max}(W) = \sum_i \mathrm{kur}_i \qquad (4.3.29)$$

The latter expression is equivalent to (4.3.21) for super-Gaussian original sources. This evaluation shows that in some cases nonlinear PCA can also be viewed from information-theoretic principles, as a method to minimize approximately the mutual information of the output.

4.3.7 Reconstruction issue

Learning in nonlinear PCA and in nonlinear autoencoders is based on the reconstruction mean-squared error. Similarly to the linear case, nonlinear PCA and a nonlinear autoencoder extract different weights. A nonlinear autoencoder with proper activation functions does not necessarily extract the independent components, as nonlinear PCA does in some cases. However, this consideration sheds light on the relation between unsupervised learning based on information theory and reconstruction.

Reconstruction and ICA are also related via the generative model approach (MacKay, 1996; Roweis and Ghahramani, 1997; Lee et al., 1998). ICA, recovering the independent components (hidden variables) and the de-mixing weight matrix W, is itself a recognition phase of the reconstruction process, with a nonlinear generative model that differs from the generative models underlying PCA and the linear autoencoder. Thus, although ICA (information-theoretic) constraints may also be considered as some type of “generalized” reconstruction constraints with another underlying generative model, we keep the notion of reconstruction constraints for an autoencoder network.

ICA, similarly to PCA, has also been used as a preprocessing step for face classification (Bartlett et al., 1998). As will be clear later, the hybrid classification/feature extraction scheme introduced in the next section corresponds to this type of preprocessing when the trade-off parameter λ = 1.

4.4 Imposing unsupervised constraints

The unsupervised learning rules we have used are based on different assumptions about the

quality of the low dimensional representation (LDR). These rules are based on statistics

of order higher than two, and use low order moments of the distribution and a sigmoidal

squashing function for robustness against outliers.

The learning rule for hidden weights modification for the constrained network (Figure 2.2) is described by:

$$\Delta w = -\eta \left( (1 - \lambda) \nabla_w E_1 - \lambda h(w, x) \right), \qquad (4.4.30)$$

where the term h(w, x) corresponds to the weight updating that emerges via additional unsupervised feature extraction. When h(w, x) is a gradient of some information measure H(w, x) (see footnote 8), the learning rule (4.4.30) corresponds to minimization of the penalized mean squared recognition error:

$$F(w, W_1) = (1 - \lambda) E_1(w, W_1) - \lambda H(w, x). \qquad (4.4.31)$$

Table 4.1 summarizes different learning constraints with the corresponding h(w, x)-function. The bottom rows of Table 4.1 describe a few variations on the sum of entropy rules, based on different types of the function f(u). These functions emphasize different statistical properties of the input distribution and are discussed in (Blais et al., 1998). In particular, the last two rows use the Gram-Charlier approximation to the entropy, which is done via moments (Stuart and Ord, 1994). The last row represents a minimization of entropy rather than maximization, as might be suggested by the desire to find distributions that are far from Gaussian.

Unsupervised Constraints

Type of constraint | h(x, w)
Entropy maximization (ME) (Bell and Sejnowski, 1995) with sigmoidal activation function | $\Delta W = \eta (I + (1 - 2y) u^t) W$ (4.3.6)
BCM (Intrator and Cooper, 1992) with sigmoidal activation function | $\Delta w_{ij} = \eta E[\phi(y_i, \theta_i) g'(u_i) x_j]$ (4.3.24)
Sum of entropies, A | $\Delta w_{ij} = \eta (f(u_i) + (1 - 2 g(u_i))) x_j$ with $f(u) = u^3$
Sum of entropies, B | same rule as A with $f(u) = 2 \tanh(u)$
Sum of entropies, C | same rule as A with $f(u) = \frac{3}{4} u^{11} + \frac{15}{4} u^9 - \frac{14}{3} u^7 - \frac{29}{4} u^5 + \frac{29}{4} u^3$
Sum of entropies, D | $\Delta w_{ij} = -\eta (f(u_i) + (1 - 2 g(u_i))) x_j$ with $f(u) = \frac{3}{4} u^{11} + \frac{15}{4} u^9 - \frac{14}{3} u^7 - \frac{29}{4} u^5 + \frac{29}{4} u^3$
Nonlinear PCA | $\Delta W = \eta y (x^t - y^t W)$

Table 4.1: Different learning rules used as unsupervised constraints in addition to reconstruction (see text and Appendix for details).

8 The term h(w, x) can appear as a gradient of some information measure scaled by a positive definite matrix P(x, w); then, in general, the corresponding H(w, x) may not exist. We use the negative sign before the h(w, x) term in Eq. 4.4.30 for convenience, since most of the feature extraction rules used are formulated as maximization problems.

Similarly to the hybrid network with reconstruction constraints, the constrained network with the learning rule (4.4.30) may be interpreted as competitive learning of two nets, one for classification and one for statistical feature extraction. The output layer of the feature extraction network coincides with the hidden layer of the classification network. Thus, the hybrid network learns to classify and to extract useful statistical properties simultaneously.

4.5 Imposing unsupervised and reconstruction constraints

Generalizing our approach further, we propose to constrain classification by reconstruction and other types of unsupervised constraints (see Figure 3.4). The generalized learning rule has the form:

$$\Delta W_1 = -\eta \nabla_{W_1} E_1(w, W_1)$$

$$\Delta W_2 = -\eta \nabla_{W_2} E_2(w, W_2)$$

$$\Delta w = -\eta \left( (1 - \lambda) \nabla_w E_1 + \lambda \left( (1 - \mu) \nabla_w E_2 - \mu h(w, x) \right) \right), \qquad (4.5.32)$$

where now we have two regularization parameters λ and µ. Thus, the most general network corresponds to the goal function (3.2.4) and its flow-chart is presented in Figure 3.4.
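The way the two regularization parameters mix the three sources of weight change can be sketched as a small function. This is only an illustration of the arithmetic of (4.5.32): the gradients and the unsupervised term are hypothetical numbers, and `hybrid_hidden_update` is an illustrative name, not a routine from the thesis software.

```python
def hybrid_hidden_update(grad_e1, grad_e2, h, eta, lam, mu):
    """Delta w = -eta * ((1 - lam) * grad_E1 + lam * ((1 - mu) * grad_E2 - mu * h))."""
    return [-eta * ((1.0 - lam) * g1 + lam * ((1.0 - mu) * g2 - mu * hu))
            for g1, g2, hu in zip(grad_e1, grad_e2, h)]

grad_e1 = [0.2, -0.1]   # hypothetical gradient of the recognition error E1
grad_e2 = [0.05, 0.3]   # hypothetical gradient of the reconstruction error E2
h = [1.0, -2.0]         # hypothetical unsupervised term h(w, x)
eta = 0.1

# lam = 0: the hidden weights follow the classification error only.
classification_only = hybrid_hidden_update(grad_e1, grad_e2, h, eta, lam=0.0, mu=0.5)

# mu = 1: the reconstruction term vanishes and the rule reduces to (4.4.30).
unsupervised_hybrid = hybrid_hidden_update(grad_e1, grad_e2, h, eta, lam=0.3, mu=1.0)

# mu = 0: only the reconstruction constraint regularizes the classification.
reconstruction_hybrid = hybrid_hidden_update(grad_e1, grad_e2, h, eta, lam=0.3, mu=0.0)

print(classification_only, unsupervised_hybrid, reconstruction_hybrid)
```

Setting λ = 1 removes the classification gradient from the hidden-layer update altogether, which corresponds to using the unsupervised rules purely as a preprocessing stage.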


Appendix A to Chapter 4: Order statistics

Here we give some definitions and relations between order statistics (see (Stuart and Ord, 1994)).

Definition: Moments of order r about the point a

$$\mu'_r = \int_{-\infty}^{\infty} (x - a)^r \, dF, \qquad (4.5.33)$$

where F is a distribution function.

Definition: Characteristic function (c.f.)

$$\phi(t) = \int_{-\infty}^{\infty} \exp(itx) \, dF \qquad (4.5.34)$$

It may be easily seen that the moments $\mu'_r$ of the distribution about the point 0 are related to the r-th order derivative $d_t^r \phi(t)$ of the characteristic function φ(t) via:

$$\mu'_r = (-i)^r [d_t^r \phi(t)]_{t=0} \qquad (4.5.35)$$

Another set of statistical measures that are widely used in statistics are cumulants.

Definition: The cumulants are defined by the identity

$$\sum_{r=1}^{\infty} k_r \frac{(it)^r}{r!} = \log \phi(t) \qquad (4.5.36)$$

Thus, if the moment of order r, $\mu'_r$, is the coefficient of $(it)^r/r!$ in the Taylor series expansion of the characteristic function φ(t), then $k_r$ is the coefficient of $(it)^r/r!$ in the Taylor series expansion of log φ(t). Here we present the relations between the first four moments and cumulants:

$$k_1 = \mu'_1$$

$$k_2 = \mu'_2 - \mu_1'^2$$

$$k_3 = \mu'_3 - 3 \mu'_1 \mu'_2 + 2 \mu_1'^3$$

$$k_4 = \mu'_4 - 4 \mu'_3 \mu'_1 - 3 \mu_2'^2 + 12 \mu'_2 \mu_1'^2 - 6 \mu_1'^4$$
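The four relations can be exercised on distributions whose moments and cumulants are known in closed form. The sketch below uses two standard facts that are not taken from the thesis: the exponential distribution with unit rate has raw moments r! and cumulants (r − 1)!, and the standard Gaussian has raw moments 0, 1, 0, 3 with all cumulants above the second equal to zero.

```python
from math import factorial

def cumulants_from_raw_moments(m1, m2, m3, m4):
    """First four cumulants from the first four moments about zero."""
    k1 = m1
    k2 = m2 - m1 ** 2
    k3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    k4 = m4 - 4 * m3 * m1 - 3 * m2 ** 2 + 12 * m2 * m1 ** 2 - 6 * m1 ** 4
    return k1, k2, k3, k4

# Exponential distribution with unit rate: raw moments 1!, 2!, 3!, 4!.
exp_cumulants = cumulants_from_raw_moments(*(factorial(r) for r in range(1, 5)))

# Standard Gaussian: raw moments 0, 1, 0, 3.
gauss_cumulants = cumulants_from_raw_moments(0, 1, 0, 3)

print(exp_cumulants)    # (1, 1, 2, 6)
print(gauss_cumulants)  # (0, 1, 0, 0)
```

The vanishing fourth cumulant of the Gaussian is the reason why $k_4$ (and hence the kurtosis defined next) can serve as a measure of deviation from Gaussianity.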

In order to describe some interesting properties of the distribution, some other statistical measures have been defined:

Definition: Kurtosis

$$\mathrm{kur}(u) = \frac{\mu_4}{\mu_2^2} - 3 = \frac{k_4}{k_2^2} \qquad (4.5.37)$$

The kurtosis characterizes the degree of peakedness of the graph of a statistical distribution. It is indicative of the concentration around the mean. A distribution for which the kurtosis is equal to zero is called mesokurtic. Those with positive kurtosis are called leptokurtic and those with negative kurtosis platykurtic. The kurtosis is equal to zero for the Gaussian distribution, is negative for sub-Gaussian and positive for super-Gaussian random variables. The super-Gaussian random variable is “sharper” than the Gaussian: its pdf has large tail areas and is more sharply peaked. The pdf of the sub-Gaussian random variable has smaller tail areas and is also flatter-topped (see Figure 4.2). For a normalized random variable ($\mu_1 = 0$ and $\mu_2 = 1$), the kurtosis coincides with the cumulant of the fourth order. Consider the family of density functions:

$$f_\alpha(x) = C_1 \exp(-C_2 |x|^\alpha), \qquad (4.5.38)$$

where the positive constants $C_1$, $C_2$ are normalization constants that ensure that $f_\alpha$ is a probability density with unit variance:

$$C_1 = \left( \frac{m_2}{m_1^3} \right)^{1/2}; \qquad C_2 = \left( \frac{m_2}{m_1} \right)^{\alpha/2},$$

where

$$m_1 = \int_{-\infty}^{\infty} \exp(-|x|^\alpha) \, dx = \frac{2}{\alpha} \Gamma\left(\frac{1}{\alpha}\right) \qquad (4.5.39)$$

$$m_2 = \int_{-\infty}^{\infty} x^2 \exp(-|x|^\alpha) \, dx = \frac{2}{\alpha} \Gamma\left(\frac{3}{\alpha}\right)$$

Different values of the positive parameter α give different shapes of the distribution. The random variable is super-Gaussian for 0.5 < α < 2 and is sub-Gaussian for α > 2 (Figure 4.2).

Figure 4.2: Sample graphs for a family of the exponential density functions (pdfs plotted for α = 0.75, α = 2 and α = 5). This figure demonstrates the typical shapes of the super-Gaussian (α = 0.75) and sub-Gaussian (α = 5) random variables.
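The sign change of the kurtosis at α = 2 can be confirmed directly from the Gamma-function expressions. The closed form used below, kur = Γ(5/α)Γ(1/α)/Γ(3/α)² − 3, is not stated in the text; it is derived here under the assumption that the fourth moment integral is evaluated in the same way as $m_1$ and $m_2$, giving $m_4 = \frac{2}{\alpha}\Gamma(5/\alpha)$. The function names are illustrative.

```python
from math import gamma, pi, sqrt

def family_constants(alpha):
    """Normalization constants C1, C2 of f_alpha(x) = C1 exp(-C2 |x|^alpha)."""
    m1 = 2.0 / alpha * gamma(1.0 / alpha)
    m2 = 2.0 / alpha * gamma(3.0 / alpha)
    c1 = (m2 / m1 ** 3) ** 0.5
    c2 = (m2 / m1) ** (alpha / 2.0)
    return c1, c2

def family_kurtosis(alpha):
    """Kurtosis of f_alpha: Gamma(5/a) Gamma(1/a) / Gamma(3/a)^2 - 3 (derived, see lead-in)."""
    return gamma(5.0 / alpha) * gamma(1.0 / alpha) / gamma(3.0 / alpha) ** 2 - 3.0

# alpha = 2 must reproduce the standard Gaussian: C1 = 1/sqrt(2 pi), C2 = 1/2.
c1_gauss, c2_gauss = family_constants(2.0)

for alpha in (0.75, 1.0, 2.0, 5.0):
    print(f"alpha = {alpha}: kurtosis = {family_kurtosis(alpha):+.3f}")
```

The values are positive below α = 2, zero at α = 2 and negative above it, matching the super-Gaussian and sub-Gaussian curves of Figure 4.2.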

Appendix B to Chapter 4: Derivation of the sum of

entropies learning rule

We consider compression of the input $x = (x_1, x_2, \ldots, x_n)$ via the following nonlinear transformation:

$$u_i = \sum_{j=1}^{n} w_{ij} x_j + w_{i0}, \quad i = 1, \ldots, m, \quad m < n, \qquad y_i = g(u_i).$$

Here u and y are the vectors of pre- and post-synaptic activations of the hidden layer, $w_{ij}$ are the network weights and $w_{i0}$ the network biases, and g is a nonlinear monotonically increasing activation function. The network architecture is presented in Figure 4.3. As a learning rule we choose to maximize the sum of the entropies of the hidden units:

$$F(W) = \sum_{i=1}^{m} H(y_i).$$

Figure 4.3: Exploratory projection pursuit network (input x; u = Wx; output y with $y_i = g_i(u_i)$). Feature extraction is achieved via (nonlinear) projection and dimensionality reduction.

The probability density of the output of the hidden unit $p_y(y_i)$ can be written as:

$$p_y(y_i) = \frac{p_u(u_i)}{y'_i}, \quad \text{where } y'_i = \frac{dy_i}{du_i}$$

This leads to:

$$\int p(y_i) \ln(p(u_i)) \, dy_i = \int p(u_i) \ln(p(u_i)) \, du_i,$$


which implies the following expression for the sum of entropies:

$$F(W) = E\left[ \sum_{i=1}^{m} \ln \frac{y'_i}{p(u_i)} \right] = \underbrace{\sum_{i=1}^{m} H(u_i)}_{F_1} + E\Big[ \underbrace{\sum_{i=1}^{m} \ln y'_i}_{F_2} \Big]$$

Thus, our goal consists of two terms: the first, $F_1$, is the sum of the entropies of the pre-synaptic activations of the hidden units and was evaluated by Amari et al. (1996) using the truncated Gram-Charlier expansion to approximate the probability density function (pdf) $p_u(u_i)$; the second, $E[F_2]$, is the expectation of the sum of the log-terms. The weights W have to be adjusted to maximize F(W). Using a gradient ascent algorithm we obtain:

$$\Delta w_{ij} = \eta \left( \frac{\partial F_1}{\partial w_{ij}} + E\left[ \frac{\partial F_2}{\partial w_{ij}} \right] \right)$$

Replacing the gradient method by a stochastic method we obtain:

$$\Delta w_{ij} = \eta \left( f(u_i) x_j + \frac{\partial F_2}{\partial w_{ij}} \right),$$

where f(u) is defined by Amari et al. (1996) and is the same as the function $\phi(u_i)$ in the expression (4.3.14). However, in our simulations, similarly to Amari et al., we use f(u) as in (4.3.12, 4.3.13).

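The second term, for nonlinearities $g_i(u_i)$, may be written as:

$$\frac{\partial F_2}{\partial W} = \frac{g''}{g'} x^t, \qquad \frac{g''}{g'} = \left( \frac{g_1''(u_1)}{g_1'(u_1)}, \ldots, \frac{g_m''(u_m)}{g_m'(u_m)} \right)^t \qquad (4.5.40)$$

Thus, the learning rule simplifies to:

$$\Delta W = \eta \left( f(u) + \frac{g''}{g'} \right) x^t \qquad (4.5.41)$$

The second term, for any nonlinear function y = g(u) whose derivative depends only on y itself, $y'_u = G(y)$, can be simplified as follows:

$$\frac{\partial \ln y'_i}{\partial w_{ij}} = \frac{1}{y'_i} \frac{\partial y'_i}{\partial w_{ij}} = \frac{1}{y'_i} \frac{\partial y'_i}{\partial u_i} x_j = \frac{1}{y'_i} y'_i \frac{\partial G(y_i)}{\partial y_i} x_j = \frac{\partial G(y_i)}{\partial y_i} x_j$$

For the logistic sigmoidal activation function $g(u) = \frac{1}{1 + \exp(-u)}$ the derivative $G(y_i)$ can be easily evaluated as $G(y_i) = y'_i = y_i (1 - y_i)$. Thus, we obtain:

$$\Delta w_{ij} = \eta \left( f(u_i) + (1 - 2 y_i) \right) x_j. \qquad (4.5.42)$$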
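The simplification of the log-derivative term for the logistic function can be checked numerically. The sketch below compares a central finite-difference estimate of $d \ln g'(u) / du = g''(u)/g'(u)$ with the closed form $1 - 2y$; the evaluation points and the step size are arbitrary choices.

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def log_slope(u):
    """ln g'(u) for the logistic function, using g'(u) = y (1 - y)."""
    y = logistic(u)
    return math.log(y * (1.0 - y))

eps = 1e-6
checks = []
for u in (-2.0, -0.5, 0.0, 0.7, 3.0):
    # Numerical derivative of ln g'(u), i.e. g''(u) / g'(u).
    numeric = (log_slope(u + eps) - log_slope(u - eps)) / (2.0 * eps)
    # Closed form dG/dy with G(y) = y (1 - y).
    closed_form = 1.0 - 2.0 * logistic(u)
    checks.append((u, numeric, closed_form))

for u, numeric, closed_form in checks:
    print(f"u = {u:+.1f}: numeric = {numeric:+.6f}, 1 - 2y = {closed_form:+.6f}")
```

The two columns agree to within the accuracy of the finite-difference estimate, which is the step that turns (4.5.41) into (4.5.42).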


The only difference in the evaluation of $\partial F_1 / \partial w_{ij}$ is the presence of the bias $w_{i0}$ in (4.5.40). Therefore, we must require that the expectation of the pre-synaptic activations of the hidden units $u_i$ be zero and that their second moments be $m_2^i = E[u_i^2] = 1$. This can be achieved by normalizing u before the calculation of $f(u_i)$. Furthermore, the network’s input is normalized so that E[x] = 0; consequently, when the bias is omitted altogether ($w_0 = 0$), the condition E[u] = 0 is automatically satisfied. The second condition, $m_2^i = E[u_i^2] = 1$, constrains the norm of W. The same rule, but with a negative η, can be used as a goal for sparse coding (Olshausen and Field, 1996; Atick, 1992).

Chapter 5

Real world recognition

5.1 Introduction

Real-world object recognition is impeded by natural climate conditions such as fog, rain

or snow and also by other conditions such as partial occlusion and noise. This is fur-

ther complicated by changes in illumination and shadows that are due to movement of

surrounding objects. Some of these factors cause image blur, and all these factors are

crucial for recognition performance and have to be properly addressed during training

and testing.

This chapter addresses face recognition under various image degradations. We com-

pare different regularized recognition networks and different ensembles by testing their

performance on the degraded images. Results on two data-sets under various resolutions

and image degradations are demonstrated. We conclude that a combination that includes

ensembles with reconstruction constraints achieves the best performance on the degraded

images. In addition we show that via saliency maps reconstruction can deemphasize de-

graded regions of the input, thus leading to classification improvement under “Salt and

Pepper” noise.

5.1.1 Face recognition

Face recognition is an active field of research with possible applications in such areas as

man-machine interaction, robotics, access control, automatic search in visual databases

and low bit-rate compression. This task is challenging, since faces do not appear as

fixed image patterns; they can appear anywhere, at any size and orientation and with

varied background (Chellapa et al., 1995). Thus, face detection and normalization are

usually performed that reduce variability caused by these factors. However, such local-

ization preprocessing is not sufficient, since faces are not rigid and lighting conditions


are not uniform. Different facial expressions, changes in hair-style and eyeglasses, and

lighting conditions lead to a large amount of face variability. In some applications, this

normalization may be further complicated by low quality of the images. For example,

systems installed at airports yield foggy, blurred images; cheap cameras, such as those

used for robot navigation, lead to images with low resolution. Thus, face recognition is a

particular case of the training when the variability of the data describing the same class

is comparable with the similarity between different classes (Moses, 1994).

Face recognition approaches can be divided into two basic groups: feature-based methods (Samal and Iyengar, 1992, survey) and methods that process images as a whole (Kirby and Sirovich, 1990; Turk and Pentland, 1991; Moghaddam and Pentland, 1994; Valentin et al., 1994, survey). Most of the effort in the feature-based methods is focused on finding

individual features (e.g., eyes, mouth, nose, head outline, etc.) and measuring statistical

parameters to describe those features and their relationship. Different methods for feature

extraction were proposed such as template matching (Baron, 1981), deformable templates

(Yuille et al., 1989), combination of perceptual organization and Bayesian networks (Yow

and Cipolla, 1996) and methods using facial symmetry and elementary knowledge of faces

(Reisfeld et al., 1990; Tankus, 1996), etc. However, selecting a set of features that captures

the information required for face recognition is not easy, and there is no completely

satisfactory solution to it.

An alternative approach, inspired by the Gestalt school of perception (Hochberg, 1974;

Kanizsa and Gaetano, 1970), is to process faces as a whole. One of the methods representing

this approach is PCA, which was used for face recognition (Kirby and Sirovich, 1990; Turk

and Pentland, 1991, see Section 3.1.4 for a description). Another way is to process images

via Neural Networks. Under this processing faces are presented as pixel intensity images

and extraction of geometrical relationship, texture and subtle facial details is realized

implicitly. Recognition from intensity images is also sensitive to substantial variations

in lighting conditions, head orientation and size. In order to avoid these problems, an

automatic preprocessing of the faces (i.e., normalization for size and position) is required.

Although this normalization stage is also based on feature extraction, it is rather

constrained: it is completed once the locations of the eyes and mouth or nose are defined.

Among the first network models proposed for face recognition are autoassociative

networks and autoencoders (Valentin et al., 1994, survey). Although these network models

were proposed for recognition, they are trained to reconstruct faces. In autoassociative

networks, the recognition task is constrained to a face familiarity task. The cosine between

every face and its reconstructed version is evaluated and is thresholded to decide if the face

is familiar or not (O’Toole et al., 1991). In autoencoders, the hidden representation

has been used as an input to back-propagation sex and identity networks without

a hidden layer (Cottrell and Fleming, 1990).
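
The familiarity decision described above for autoassociative networks can be sketched as follows. This is a minimal illustration and not the procedure of O’Toole et al. (1991): the `reconstruct` callable, the threshold value 0.9 and the toy projection used as a stand-in for a trained network are all assumptions introduced here.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for degenerate vectors
    return dot / (norm_u * norm_v)


def is_familiar(face, reconstruct, threshold=0.9):
    """Declare a face familiar if it is close to its own reconstruction.

    `reconstruct` stands for a trained autoassociative network (hypothetical
    here); `threshold` is an illustrative value, not one taken from the text.
    """
    return cosine(face, reconstruct(face)) >= threshold


# Toy stand-in for a trained network: projection onto one stored pattern.
stored = [1.0, 2.0, 2.0]


def project_on_stored(face):
    scale = sum(a * b for a, b in zip(face, stored)) / sum(b * b for b in stored)
    return [scale * b for b in stored]
```

A face proportional to the stored pattern is reconstructed exactly and is accepted, while a face orthogonal to it is reconstructed as the zero vector and is rejected.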

Radial basis function (RBF) networks were first implemented in the context of face

recognition by Edelman et al. (1992). The famous data-set (Turk and Pentland,

1991) described below in Section 5.2.4 has been used in their experiments. The faces

were normalized by the same procedure as described below in Section 5.2.5, to reduce

variability due to viewpoint and illumination direction. A set of Gaussian receptive fields

(RFs) of different size and elongation were applied to reduce dimensionality of the input.

These RFs were applied in different locations, inspired by observations of the RFs of simple

cells in the primary visual cortex of mammals. Every RBF network was intended to recognize

a particular person and was trained only on positive examples, for which a single

output neuron had a desired value equal to 1. A face was considered recognized

by an individual RBF network if the network output exceeded some threshold. Once

training of the individual RBF networks had ended, their outputs were used as inputs to

a new RBF network with the number of output units equal to the number of persons.

The desired activity was set to 1 for the neuron responsible for a given

input image and to 0 for the others. This new RBF network achieved a misclassification

rate of 9%, versus 22% for the individual networks.

Recently, interest in RBF networks as a tool for face recognition has been revived.

Different novel variants of the RBF network schemes were proposed (Howell, 1997; Sato

et al., 1998; Gutta et al., 1996). In (Howell, 1997), the hyper RBF network, which has

the number of hidden units equal to the number of training samples and is trained on the

images of all persons, is reorganized into a group of smaller face recognition unit networks.

Each face recognition unit network is intended to recognize a particular person and has

two output units. The first unit signals the presence of that person and the

second has to be active when an “anti” person is presented. The network uses views of

that person as positive examples and some selected ambiguous images of other people

as negative ones. Although this approach increases complexity, as more networks need to

be trained, it reduces the dimensionality of each unit network and adapts easily to

the addition of a new person. When a new person is added, only one additional unit network

has to be trained, and perhaps a small number of ambiguous unit networks needs to be

retrained. A way to combine the standard RBF network with face unit networks based

on their confidences was also proposed.

Ensembles of standard RBF networks for face recognition have been proposed in

(Gutta et al., 1996). Two ensemble variants, defined in terms of their specific topol-

ogy (connections and RBF nodes) and the data they are trained on, were considered.


In the first variant (ERBF1), three groups of networks were combined: networks trained

on the original data, on the original data with added Gaussian noise, and on the original

data subjected to some degree of geometrical distortion. Each group contained three

networks with different topologies. The decision is based on averaging

the network outputs (see Section 2.2) and is accepted only if the maximal response is larger

than some threshold. In the second variant (ERBF2), three RBF networks with different

topologies were trained on extended data consisting of the original data and their corrupted

versions. These ensembles were later combined with inductive decision tree classifiers.

Sato et al. (1998) use partial face images, such as ears, eyes and nose, cropped by

hand, as input to RBF networks. The network is trained with sub-images of known

and unknown faces, taken under uniform lighting conditions and at a fixed distance

between the camera and the subjects. Each output unit of the RBF network corresponds to

a particular person. The input is recognized according to the unit with the maximal output

response, provided that this response is larger than some threshold. The threshold is set by hand,

relying on the separability of the maximal responses of known and unknown sub-images. Thus, the

network is also able to reject unknown faces.

A variant of a hybrid supervised/unsupervised network for automatic face recognition

has been proposed by Intrator et al. (1996). A network is trained using a hybrid training

method. This method is based on a formulation that combines unsupervised (exploratory)

methods for finding structure (extracting features) and supervised methods for reducing

classification error. The unsupervised training is based on the biologically motivated BCM

neuron (Intrator and Cooper, 1995) and is aimed at finding hidden units with a multi-

modal distribution of their activities. The supervised portion is aimed at finding features

(in network hidden units) that minimize classification error on the training set. The same

data-set and normalization as in (Edelman et al., 1992) were used. The classification

result for the averaged output of five hybrid BCM/recognition networks was 99.38%, which is better

than using RBF networks (Edelman et al., 1992).

A new approach to face recognition using Support Vector Machines (SVM) has been

proposed by Phillips (1998). SVM is a binary classification method that finds the optimal

linear decision surface based on the concept of the structural risk minimization (Vapnik,

1995). Since face classification is a multi-class problem, the task was first

reformulated as a two-class recognition problem in a difference space (the space of differences

between face images). In other words, the multi-class problem is replaced by the problem

of discriminating between the within-class difference set (differences between faces of the same

person) and the between-class difference set (differences between faces of different persons). The

extension of SVM to nonlinear decision surfaces was used and slightly adapted by

introducing a threshold parameter $\Delta$ into the decision surface parameterization. When the

task is recognition of some unknown probe face $x$, it is converted to a set of difference

faces $x - x_g$, where $x_g$ are faces of known individuals, which form the gallery set. For

each difference face a similarity score $\delta_g$, which depends on the decision surface parameters

(but does not include $\Delta$), is evaluated. The probe face is identified as the person whose

face $x_g$ from the gallery set has the minimal similarity score $\delta^{*}_{g}$, provided that

$\delta^{*}_{g} < \Delta$; otherwise the probe face is declared unfamiliar. When the probe is verified

rather than identified, the task is simplified, since the difference images are constructed only

between the probe face and the faces of the person under verification. Some

results on the FERET database (Phillips et al., 1996; Phillips et al., 1997) are reported,

such as a 77%-78% classification rate. Although these results are not impressive, it is

noted that only two images per person were used for training, for 50 different persons chosen

as the most difficult ones.
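
The identification rule in difference space can be sketched as follows. This is only an illustration under stated assumptions: `score` is a hypothetical stand-in for the trained SVM-based similarity score (smaller meaning more likely the same person), and the squared norm used in the example is chosen purely for demonstration, not taken from Phillips (1998).

```python
def identify(probe, gallery, score, delta_threshold):
    """Return the label of the best gallery match, or None if unfamiliar.

    gallery:         list of (label, face_vector) pairs of known individuals.
    score:           callable mapping a difference face to a similarity
                     score; hypothetical stand-in for the trained SVM.
    delta_threshold: the threshold the minimal score must stay below.
    """
    best_label, best_score = None, float("inf")
    for label, face in gallery:
        difference = [a - b for a, b in zip(probe, face)]
        s = score(difference)
        if s < best_score:
            best_label, best_score = label, s
    # Accept the best match only if its score is below the threshold.
    return best_label if best_score < delta_threshold else None


def squared_norm(difference):
    """Illustrative score only: squared Euclidean norm of the difference."""
    return sum(d * d for d in difference)
```

With a gallery of two faces, a probe close to one of them is assigned its label, while a probe far from both is declared unfamiliar.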

Another approach to face recognition from live video has been recently proposed by

Atick et al. (1997). Their scheme, called FaceIt, is based on the construction of the

factorial code, by transforming facial images into a large set of simpler statistically inde-

pendent elements. The recognition task then consists of estimating the probability that

a scene contains any pattern that was processed previously.

A different scheme, which attempts to find a good new representation for face

recognition, has been proposed in (Bartlett et al., 1998). Bartlett et al. applied ICA (see

Section 4) to a reduced face representation that was extracted using PCA. Classification

from the extracted independent components is improved compared to classification from

principal components.

Another advanced feature-based method for face recognition using Hidden Markov

Models (HMM) has been proposed by Samaria et al. (1993). An HMM whose

states are five facial features (forehead, eyes, nose, mouth and chin) is constructed,

and the HMM parameters are estimated separately for the face images of each person. To

identify an unknown face, its conditional probabilities given the parameters of the different

HMMs are evaluated, and the face is assigned the label of the model with the

highest conditional probability.

Another advanced feature-based method is the dynamic link approach (Wiskott and

von der Malsburg, 1993; Wiskott et al., 1997). The method proceeds by applying Gabor

filters of 5 different frequencies and 8 orientations in a set of fiducial points (the pupils,

the corners of the mouth, the tip of the nose, the top and bottom of the ears, etc.). The

obtained responses in every point compose the so called bunch Gabor jet. Subsequently

every known face is represented as a labeled graph of these fiducial points and edges


between them. The nodes are labeled by their jets and edges are labeled with vectors

between the nodes, which they connect. The geometrical structure of the graphs unlabeled

by jets is called a grid. It is assumed that different known faces have the same grids and

correspondence between graph nodes of their models is set by hand. The face models

corresponding to the same orientation are joined into an FBG (face bunch graph), which has

the average geometrical structure and combination of the bunch jets of all its models.

Therefore, FBG is a representation of all faces with the same orientation.

When an unknown image is given, the fiducial point locations that maximize the similarity

between the unknown image graph and the FBG are searched for. The similarity measure

between a face graph and the FBG is defined as a sum of jet and geometrical similarity measures,

controlled by a trade-off parameter. The optimization task is simplified by constraining

the group of possible geometrical transformations of the FBG to translation, scale, aspect ratios

and local distortions. Subsequently, the similarity measures between the image graph found

and the image graphs of all FBG faces are evaluated. Recognition is done by picking the

known face with the highest similarity measure. The similarity measure between image

graphs is defined as the average similarity between corresponding jets. Here

it is assumed that the unknown face is normalized, i.e. its position is estimated

before the recognition procedure.

In this chapter, we apply the hybrid networks presented in Chapters 3–4 to

face recognition. Our approach continues the hybrid supervised/unsupervised

network approach (Intrator et al., 1996) with a novel type of unsupervised constraints.

Different types of the bias constraints are given below in Section 5.2, where a regularization

procedure is also presented. The regularization procedure is completed by creation of

various hybrid network ensembles. These ensembles are tested on degraded facial data-

sets. Image degradation, which has been simulated in our experiments, is briefly described

in Section 5.3 and recognition results are presented in Section 5.4. In particular, for the

same data and normalization as in (Edelman et al., 1992; Intrator et al., 1996) (see also

Section 5.2.5), we achieve a misclassification rate of 0.5% despite using smaller training

and larger testing sets.

5.2 Methodology

The face recognition problem requires extrapolation from the training set, since its distribution

may be rather different from the distribution of the testing set. Thus this problem requires

an efficient use of a-priori knowledge that can be introduced in the form of bias constraints

during training (Section 2.3).


5.2.1 Different architecture constraints

In Chapter 3, reconstruction constraints were suggested as a learning bias and the hybrid

recognition/reconstruction network was introduced (Figure 3.3). This hybrid network

attempts to improve the low dimensional representation by minimizing concurrently the

mean squared error (MSE) of reconstruction and classification outputs. The proposed

reconstruction/classification network is controlled by a trade-off parameter λ and includes

a conventional classification network for λ = 0. We refer to the networks corresponding

to λ = 0 as unconstrained networks or conventional classification networks. In the special

case of λ = 1, we get a nonlinear autoencoder for nonlinear activation functions and a

linear autoencoder for linear ones (see Section 3.1.2). As discussed in Section 3.1.2,

the linear autoencoder hidden weights span the PCA eigenspace. Below, we refer to the

network obtained in this case as a PCA network. All the networks corresponding to the

trade-off parameter inside the interval [0, 1] are called reconstruction networks.
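
The role of the trade-off parameter λ can be illustrated with a short sketch. The convex combination used below is an assumption made here for illustration (the exact objective is the one defined in Chapter 3); it is chosen only because it reproduces the two limiting cases named above, pure classification at λ = 0 and a pure autoencoder at λ = 1. All function names are hypothetical.

```python
def mse(outputs, targets):
    """Mean squared error between two equally long sequences."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)


def hybrid_error(class_out, class_target, recon_out, image, lam):
    """Assumed hybrid objective: (1 - lam) * E_class + lam * E_recon.

    lam = 0 gives the unconstrained classification network,
    lam = 1 gives an autoencoder (reconstruction only).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in the interval [0, 1]")
    e_class = mse(class_out, class_target)
    e_recon = mse(recon_out, image)
    return (1.0 - lam) * e_class + lam * e_recon
```

Intermediate values of `lam` weight the two tasks against each other, which is why the parameter has to be selected by a regularization procedure.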

In Chapter 4, unsupervised constraints were introduced as statistical feature extraction

constraints on the hybrid network. The hybrid neural network with a hidden representa-

tion that is arranged mainly for the classification task and, in addition, has some useful

properties, was considered. We have used such statistical properties as independence

of hidden neurons or maximum information transfer in the hidden layer. The proposed

unsupervised/classification networks are also controlled by a trade-off parameter λ and

include a conventional classification network for λ = 0. We consider several types of

unsupervised constraints (see also Table 4.1):

• Entropy maximization constraint, which maximizes a joint entropy of the hidden

layer (Section 4.3.1)

• BCM constraints, which emphasize data multi-modality by minimizing a specific

loss function (Section 4.3.4)

• Sum of entropies of the hidden units constraints. We consider four variants of these

constraints (see Table 4.1). Constraints A-C maximize the information carried by

each hidden neuron (Section 4.3.5). The case D corresponds to the sum of entropies

minimization.

• Nonlinear PCA constraints, which extract nonlinear principal components in the

hidden layer (Section 4.3.6)

In the general case, bias constraints are a composition of reconstruction and unsupervised

constraints (see Section 4.5). For simplicity we take these constraints with the same


strength, i.e., the parameter µ in Eq. 4.5.32 is set to 0.5, and only the trade-off parameter

λ is variable. In particular, we consider the combination of reconstruction and entropy

maximization constraints. We refer to the corresponding hybrid networks as reconstruction

with entropy maximization networks. Thus, independently of the applied constraints,

the networks are controlled by a trade-off parameter λ, and regularization is required.

5.2.2 Regularization

The regularization task is to find an optimal parameter λ and the corresponding synaptic weights

ωλ which provide the minimal misclassification rate. The choice of the optimal parame-

ter can be done by hold-out, cross-validation or bootstrap methods (see Appendix 2.8).

We have not used cross-validation and bootstrap methods as they are computationally

demanding. Our regularization scheme is a variant of the split-sample validation method.

We split the data into approximately equal portions of training and validation sets. Find-

ing optimal weights ωλ depends on a stopping time in the training stage. The stopping

time has been set observing the behavior of the misclassification rate on the validation

set. Our regularization method includes the following steps (see Figures 5.1, 5.2, 5.3).

1. For every λ, train the corresponding network until a minimum misclassification rate is

achieved on the validation set within a predefined number of epochs.

2. Since the misclassification rate is a stepwise function, we further choose a stopping

time, which corresponds to a minimum misclassification rate together with a minimal

recognition MSE on the validation set.

3. A λ-value providing a minimum misclassification rate on the validation set is an

optimal one.

4. Choose an ensemble of networks around the optimal λ value. Later this ensemble

is combined with a zero-λ ensemble.
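
The selection steps above can be sketched as follows, assuming that for every λ a per-epoch validation history of (number of misclassified faces, recognition MSE) pairs has been recorded during training. The data layout, the function names and the `width` parameter that defines "around the optimal λ" are assumptions of this sketch; breaking ties between λ values by the recognition MSE is also an assumption, since the text only specifies the tie-break between epochs.

```python
def best_stopping_epoch(history):
    """Steps 1-2: epoch with minimal misclassification, ties broken by MSE.

    history: list of (misclassified, recognition_mse) tuples, one per epoch.
    Tuples compare lexicographically, so the MSE acts as the tie-breaker.
    """
    return min(range(len(history)), key=lambda epoch: history[epoch])


def select_lambda(histories):
    """Step 3: the lambda whose stopping point has minimal misclassification.

    histories: dict mapping a lambda value to its per-epoch history.
    Returns (best_lambda, stopping_epochs).
    """
    stops = {lam: best_stopping_epoch(h) for lam, h in histories.items()}
    best = min(histories, key=lambda lam: histories[lam][stops[lam]])
    return best, stops


def ensemble_lambdas(lambdas, best, width):
    """Step 4: lambda values within `width` of the optimal one."""
    return sorted(lam for lam in lambdas if abs(lam - best) <= width)
```

The networks trained with the returned λ values, each stopped at its own selected epoch, would then form the ensemble that is combined with the zero-λ ensemble.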

In order to study solely the effect of the trade-off parameter λ on the classification performance,

we have fixed all other training conditions, such as the initial weights and the learning

rate. The initial weights have been chosen at random from a uniform distribution on

the interval [0, µ]. The learning rate has been taken small enough in order to ensure

convergence.

From a practical viewpoint, choosing a single best network is not reasonable, since

it depends on the degradation, which is unknown a-priori. Instead of searching for the

optimal λ, we average over several regularization values, which is roughly equivalent to


[Figure 5.1 plot: Misclassification rate time evolution. Four panels, λ = 0, 0.1, 0.2 and 0.3; horizontal axis: epochs (0 to 4000); vertical axis: logarithmic scale (10^0 to 10^2).]

Figure 5.1: Validation set results vs. the regularization parameter λ. Regularization with λ > 0.3 provides larger error than with λ = 0.3 (see also the top graph of Figure 5.3).


[Figure 5.2 plot: MSE recognition error time evolution. Four panels, λ = 0, 0.1, 0.2 and 0.3; horizontal axis: epochs (0 to 4000); vertical axis: logarithmic scale (10^-3 to 10^0).]

Figure 5.2: Validation set recognition MSE scaled per sample vs. the regularization parameter λ.


[Figure 5.3 plot: Classification based regularization. Three graphs versus λ (0 to 0.3): misclassification error, recognition error and reconstruction error.]

Figure 5.3: Classification based regularization (for the Pentland data-set in the intermediate resolution (32×32)): The upper graph shows the minimal number of misclassified faces in the validation set versus λ. The middle graph shows the minimal mean-squared recognition error corresponding to the level of misclassification error in the upper graph. The bottom graph shows the mean-squared reconstruction error corresponding to the upper graphs. All errors are calculated on the validation set per sample.


integrating over a uniform regularization distribution between some values. Such averaging

is equivalent to the Bayesian approach (see Section 3.2.3) for combining neural networks

that have the same evidence over the chosen interval of the hyper-parameter λ. We have

experimentally found that training several networks on different λ values around several

optimal values that were found once, and then averaging the different network results,

yields a performance that is close to the optimal (a posteriori) λ and sometimes is even

better (see Section 5.4). Thus, we do not regard the need to estimate an appropriate λ

as problematic. In the results described below, we refer to an optimal λ as the one which

gives the best test results versus degradation. It is thus clear that this is the upper limit of

performance under this scheme and this limit can be attained and sometimes surpassed

by a simple method of averaging over several λ values.

5.2.3 Neural Network Ensembles

An ensemble of experts is capable of improving on the performance of single experts (Section

2.2). We have used two types of ensemble classification prediction. The first is a

majority rule over all the experts in the ensemble. We call this a classification ensemble.

Another rule is based on averaging the real values of the outputs of all the ensemble

members and then producing a decision by the Bayesian classification rule. We call this

a regression ensemble.
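
The two combination rules can be sketched as follows. It is assumed here that every ensemble member produces one real-valued output per class and that the Bayesian classification rule amounts to picking the class with the largest averaged output; the function names are hypothetical, and ties in the majority vote are resolved arbitrarily (by first occurrence).

```python
from collections import Counter


def argmax(values):
    """Index of the largest value (the first one in case of ties)."""
    return max(range(len(values)), key=lambda k: values[k])


def classification_ensemble(member_outputs):
    """Majority vote over the classes predicted by the individual members."""
    votes = Counter(argmax(outputs) for outputs in member_outputs)
    return votes.most_common(1)[0][0]


def regression_ensemble(member_outputs):
    """Average the real-valued outputs first, then pick the largest one."""
    n_members = len(member_outputs)
    mean_outputs = [sum(column) / n_members for column in zip(*member_outputs)]
    return argmax(mean_outputs)
```

The two rules can disagree: if two members weakly prefer class 0 and a third strongly prefers class 1, the vote returns class 0 while the averaged outputs favour class 1.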

It was shown (Section 2.2), that the largest reduction in the variance portion of the

error is achieved when the predictors are independent, and this may be achieved by combining

networks with different initial weights. We generate such an ensemble of unconstrained

nets (λ = 0) and use it as a baseline for ensemble performance comparison.

It turns out that by averaging (in either way) over ensemble members that have been

trained with different values of the trade-off parameter λ (see Section 3.2.3), some ad-

ditional independence is achieved, leading to a useful collective decision. We call these

ensembles regularization ensembles and classify them further according to the

constraints that were used during training of the ensemble networks. Thus, an ensemble

whose networks are constrained by the reconstruction task is called the reconstruction

ensemble, one constrained by BCM is called the BCM ensemble, etc.

Different ensembles are further combined with each other to generate more powerful

predictors. The additional variance reduction is attained due to the different constraints used

for network training, which makes the networks independent. In particular, we have considered the

combination of the reconstruction and unconstrained λ = 0 ensembles, and the combina-

tion of the reconstruction and reconstruction with entropy maximization ensembles. We

refer to the latter ensemble as the reconstruction and entropy maximization ensemble.


5.2.4 Face data-sets

The widely available facial data-set (Turk and Pentland, 1991) as well as a face data-set

locally collected by the Tel-Aviv University Computer Vision Group (Tankus, 1996) were

used in our simulations. While there have been many successful classification approaches

to the Turk/Pentland data, we demonstrate that when the images are given in low res-

olution, or are degraded either by blur or partial occlusion, classification performance

deteriorates dramatically. The Turk/Pentland data-set contains 27 images of each of 15 male

faces (we omitted the single bearded person). For each face, we randomly chose 14 training

images and 13 validation images (a total of 210 training and 195 validation images).

Preprocessing details and previous results studying the effect of background, illumination

and comparison with PCA are given in (Intrator et al., 1996). The preprocessing partially

removes the variability due to viewpoint, by setting (automatically) the eyes and tip of

the mouth to the same position in all images (see Section 5.2.5). Further preprocess-

ing evaluates the difference between each image and an average over all the training set

patterns, leading to the so called “caricature” images (Kirby and Sirovich, 1990). Three

resolutions were used: high - (64×64), intermediate - (32×32) and low - (16×16) pixels.

Examples of a face in three resolutions are shown in Figure 5.4.
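
The per-person split and the “caricature” preprocessing can be sketched as follows. Images are represented as flat lists of pixel intensities; this representation, the function names and the fixed random seed are assumptions of the sketch, and the alignment of the eyes and mouth is taken as already done.

```python
import random


def split_per_person(images_by_person, n_train, seed=0):
    """Randomly choose n_train training images per person; the rest validate.

    images_by_person: dict mapping a person label to a list of images.
    Returns two lists of (label, image) pairs.
    """
    rng = random.Random(seed)
    train, validation = [], []
    for person, images in images_by_person.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        train += [(person, image) for image in shuffled[:n_train]]
        validation += [(person, image) for image in shuffled[n_train:]]
    return train, validation


def mean_image(images):
    """Pixel-wise average over a list of equally sized images."""
    n_images = len(images)
    return [sum(pixels) / n_images for pixels in zip(*images)]


def caricature(image, average):
    """Difference between an image and the training-set average."""
    return [pixel - avg for pixel, avg in zip(image, average)]
```

With 27 images for each of 15 persons and `n_train=14`, the split yields 210 training and 195 validation images; the average is computed on the training images only and is then subtracted from training and validation images alike.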

[Figure 5.4 images: “Caricature” faces in three resolutions: 64×64, 32×32 and 16×16.]

Figure 5.4: Pixel resolutions used in the classification results (Pentland data-set).

The second data-set contains images of 37 male and female faces with 10 pictures for

each person in high resolution (84 × 56). We split the data into 6 training images and 4

validation images for each person and used preprocessing similar to that described above,

except that only the eye locations were fixed.

5.2.5 Face normalization

This section describes the face normalization which was used for the facial data-sets. The

normalization is based on finding anchor points: eyes, nose or mouth and then warping the

face images to some predefined locations of these points. The anchor points are identified

using the Generalized Symmetry Transform (Reisfeld, 1993; Tankus et al., 1997).


The method proceeds starting from an edge map and assigning a symmetry measure

at each point, producing a ”symmetry map” of the image. A symmetry measure for

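each point and direction is defined as follows. Let $p_k = (x_k, y_k)$ be any image point and

$\nabla I(p_k) = \left.\left(\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right)\right|_{(x,y)=p_k}$ be the gradient of the intensity at point $p_k$. The gradient is

considered in the logarithmic scale, i.e. a vector $v_k = (r_k, \theta_k)$ is associated with each point

$p_k$, where $r_k = \log(1 + \|\nabla I(p_k)\|)$ and $\theta_k = \left.\arctan\left(\frac{\partial I}{\partial x} \Big/ \frac{\partial I}{\partial y}\right)\right|_{(x,y)=p_k}$. For each two points

$p_i$ and $p_j$ the line $l$ passing through them and the counterclockwise angle $\alpha_{ij}$ between it

and the horizontal are introduced. The sets $\Gamma_1(p, \psi)$ and $\Gamma_\sigma(p)$, a distance weight function $D_\sigma(i, j)$ and

a phase weight function $P(i, j)$ are defined by:

$$\Gamma_1(p, \psi) = \left\{ (i, j) \;\middle|\; \frac{p_i + p_j}{2} = p,\ \alpha_{ij} = \psi \right\}$$

$$\Gamma_\sigma(p) = \left\{ (i, j) \;\middle|\; \frac{p_i + p_j}{2} = p,\ \|p_i - p_j\| < 3\sigma \right\}$$

$$D_\sigma(i, j) = \frac{1}{\sqrt{2\pi\sigma}} \exp\left(-\frac{\|p_i - p_j\|}{2\sigma}\right)$$

$$P(i, j) = \left(1 - \cos(\theta_i + \theta_j - 2\alpha_{ij})\right)\left(1 - \cos(\theta_i - \theta_j)\right)$$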

The first factor of $P(i, j)$ has a peak when the gradients at $p_i$ and $p_j$

are oriented in the same direction towards each other, while the second factor suppresses

$P(i, j)$ when $\theta_i = \theta_j = \pi/2$, which occurs for points lying on a straight line. The radial

symmetry measure $M(p)$ and the directional symmetry measure $S_\sigma(p, \psi)$ of each point $p$ in

direction $\psi$ are defined as:

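$$S_\sigma(p, \psi) = \sum_{(i,j) \in \Gamma_1(p, \psi)} D_\sigma(i, j)\, P(i, j)\, r_i r_j$$

$$M(p) = \sum_{(i,j) \in \Gamma_2(p)} D_\sigma(i, j)\, P(i, j)\, r_i r_j \sin^2\left(\frac{\theta_i + \theta_j}{2} - \alpha(p)\right),$$

where

$$\alpha(p) = \frac{\theta_{i^*} + \theta_{j^*}}{2} \quad \text{and} \quad (i^*, j^*) = \arg\max_{(i,j) \in \Gamma_\sigma(p)} D_\sigma(i, j)\, P(i, j)\, r_i r_j$$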

The maps produced by these operators are then subjected to detection of the highest

peaks. The geometrical relationships among these peaks, together with the location of the

midline, are used to infer the face position as well as the eyes and mouth in the image.

The midline of the face image is detected as a peak in the autocorrelation

function of the edge map. Common information, such as the assumption that eyes should

be on both sides of the midline and the mouth should intersect it, is used.
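
The contribution of a single pair of points to the symmetry measures of this section can be sketched as follows. The sketch follows the distance and phase weight formulas as they are printed above, including the normalization of the distance weight. Passing the gradient magnitudes and orientations as ready-made arguments, computing the line angle with `atan2`, and all function names are assumptions introduced here; this is not the implementation used for the experiments.

```python
import math


def distance_weight(p_i, p_j, sigma):
    """D_sigma(i, j): decays with the distance between the two points."""
    dist = math.hypot(p_i[0] - p_j[0], p_i[1] - p_j[1])
    return math.exp(-dist / (2.0 * sigma)) / math.sqrt(2.0 * math.pi * sigma)


def phase_weight(theta_i, theta_j, alpha_ij):
    """P(i, j): product of the two orientation-dependent factors."""
    first = 1.0 - math.cos(theta_i + theta_j - 2.0 * alpha_ij)
    second = 1.0 - math.cos(theta_i - theta_j)
    return first * second


def pair_contribution(p_i, p_j, r_i, theta_i, r_j, theta_j, sigma):
    """D_sigma(i, j) * P(i, j) * r_i * r_j for one pair of points.

    r_* are the logarithmic gradient magnitudes, theta_* the gradient
    orientations; the angle of the connecting line is taken from atan2
    (an assumption about the coordinate convention).
    """
    alpha_ij = math.atan2(p_j[1] - p_i[1], p_j[0] - p_i[0])
    weight = distance_weight(p_i, p_j, sigma)
    return weight * phase_weight(theta_i, theta_j, alpha_ij) * r_i * r_j
```

For two points on a horizontal line whose gradients point towards each other, both factors of the phase weight reach their maximum of 2, whereas two parallel gradients give a phase weight of zero.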

5.2.6 Learning parameters

We have used a hidden layer consisting of 10 units for both data-sets. This number was

chosen by trial. The value of the parameter µ, which places the initial weights in a small

vicinity of the weight space origin, was set to µ* = 0.001 for the experiments with the

Pentland data-set in the intermediate resolution 32×32, and to µ = 4µ* = 0.004

and µ = 0.25µ* = 0.00025 for low and high resolution, respectively, in order to obtain

consistent results in all three resolutions. The number of predefined training steps was set ad

hoc to 5000 epochs for intermediate resolution and 3000 epochs for high and low resolutions.

The learning rate η has been adjusted according to the bias constraints. In experiments

with reconstruction constraints, the learning rate was equal to 0.2. For the TAU data-set, µ*

was equal to 0.001, the learning rate η was set to 0.05 and about 10000

epochs were used.

5.3 Type of image degradations

For the Pentland data-set, we have performed experiments in three resolutions: low (16×16), intermediate (32×32) and high (64×64). The test images were obtained by simulating

degradation on the validation-set only, i.e. all results are based on networks that were

trained on ”clean” data and were tested with either clean or degraded validation data. A

few examples of degraded faces and their reconstructed versions by different networks are

shown in Figure 5.5.

Below, we briefly describe the type of degradations that were used. For a comprehen-

sive treatment of degradation see Chapter 6.

“Clean” data: The original test set without any image degradation.

Blurring with Gaussian filter: Blurring with a Gaussian filter is one of the simplest

types of image degradations. We used a Gaussian blurring with a standard deviation σ =

2. This scale of smoothing retains many details needed for human perceptual recognition

for high resolution images, but for intermediate and low resolutions, many details around

the eyes and mouth appear to be lost.

Blurring with DOG filter: Difference of Gaussians (DOG) filter, which produces a

Mexican hat type receptive field, is a form of image preprocessing known to be present

in early mammal vision (center-surround cells) (Marr, 1982; Kandel and Schwartz, 1991)

(see also Section 6.2.1). Standard deviations of the on and off center (positive and negative

Gaussians) were 1 and 2 respectively. This type of preprocessing is known to enhance

edges.
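
Both blurring degradations can be sketched with a separable Gaussian filter. Images are lists of rows of intensities; truncating the kernel at three standard deviations and replicating border pixels are assumptions of this sketch, as the text does not specify them, and the function names are hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel, truncated at 3 sigma (assumption)."""
    radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def _filter_rows(image, kernel):
    """Convolve every row with the kernel, replicating border pixels."""
    radius = len(kernel) // 2
    width = len(image[0])
    filtered = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xx]
            new_row.append(acc)
        filtered.append(new_row)
    return filtered


def _transpose(image):
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    rows_done = _filter_rows(image, kernel)
    return _transpose(_filter_rows(_transpose(rows_done), kernel))


def dog_filter(image, sigma_on=1.0, sigma_off=2.0):
    """Difference of Gaussians: narrow (on) blur minus wide (off) blur."""
    on = gaussian_blur(image, sigma_on)
    off = gaussian_blur(image, sigma_off)
    return [[a - b for a, b in zip(row_on, row_off)]
            for row_on, row_off in zip(on, off)]
```

A constant image passes through `gaussian_blur` unchanged and is mapped to zero by `dog_filter`, while an isolated bright pixel yields a positive response at its centre, in line with the centre-surround behaviour described above.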


[Figure 5.5 images: Image degradation and reconstruction (TAU data-set).]

Figure 5.5: Reconstruction is done using an architecture with reconstruction constraints. The faces in each row from left to right represent: a “clean” face, a corresponding “caricature”, a degraded version, a reconstruction of the degraded version obtained by the first 10 Principal Components, a reconstruction by a single unconstrained network with λ = 0, and a reconstruction by a network ensemble with reconstruction constraints and trade-off parameters λ = 0.04, 0.3. Degraded faces from top to bottom: Upper row: “Salt and Pepper” noise with 20% degradation. Middle row: nose area was replaced by the average intensity in that area. Bottom row: DOG-blur with the deviations of the on and off centers equal to 1 and 3.

Partial occlusion: This is achieved by replacing the pixel values at a certain rectan-

gular area of arbitrary size in any part of the face by the average intensity of the pixels

in that rectangle.
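
A minimal sketch of this degradation, assuming the image is a list of rows of intensities and the rectangle is given by its top-left corner and size (the function and parameter names are hypothetical):

```python
def occlude(image, top, left, height, width):
    """Return a copy in which a rectangle is replaced by its mean intensity."""
    rows = range(top, top + height)
    cols = range(left, left + width)
    values = [image[r][c] for r in rows for c in cols]
    mean = sum(values) / len(values)
    occluded = [list(row) for row in image]  # leave the input untouched
    for r in rows:
        for c in cols:
            occluded[r][c] = mean
    return occluded
```

Because the rectangle is filled with its own average, the overall brightness of the image is preserved while the local detail inside the rectangle is removed.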

“Salt and Pepper” noise: This degradation replaces pixel intensities by either the

maximum or minimum grey-level value at random locations of a certain percentage of the

image (Rosenfeld and Kak, 1982). Results presented here were done with 10% and 20%

replacement.
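
A minimal sketch of this degradation is given below. The grey-level range of 0 to 255, the equal probability of "salt" and "pepper", and the function and parameter names are assumptions made for illustration.

```python
import random


def salt_and_pepper(image, fraction, low=0.0, high=255.0, seed=None):
    """Replace a given fraction of the pixels by the minimum or maximum level.

    image:    list of rows of intensities.
    fraction: share of pixels to corrupt, e.g. 0.1 or 0.2.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    n_corrupt = int(round(fraction * height * width))
    positions = rng.sample(range(height * width), n_corrupt)
    noisy = [list(row) for row in image]  # leave the input untouched
    for pos in positions:
        noisy[pos // width][pos % width] = rng.choice((low, high))
    return noisy
```

Sampling positions without replacement guarantees that exactly the requested share of pixels is replaced.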

5.4 Experimental results

Table 5.1 presents results on classification schemes generated by networks with reconstruction

constraints and their combination into ensembles. The results are in three image

resolutions with different image degradations. They show that constrained networks,

which may not show a significant performance difference on tests with the original, undegraded

test-set, do show a significant improvement when tested with degraded images. Below,

we highlight some consequences of Table 5.1.

Classification results for Pentland data-set

Classification scheme       Low resolution   Intermediate resolution   High resolution
                            16×16            32×32                     64×64

Top left
  λ = 0                      3.1              2.6                       1.5
  λopt                       3.1              1.5                       1.0
  classification ensemble    2.6              0.5                       1.0
  regression ensemble        2.6              1.0                       0.5
  PCA                       15.9             13.8                      17.9

Top right
  λ = 0                      5.1             12.3                      16.9
  λopt                       4.6              7.2                      11.8
  PCA                       22.1             35.4                      50.8

Bottom left
  λ = 0                      5.1              3.6                       1.5
  λopt                       3.6              2.6                       1.5
  PCA                       16.4             14.8                      17.9

Bottom right
  λ = 0                     36.4             13.8                       5.6
  λopt                      34.4             11.7                       3.6
  PCA                       46.7             33.3                      26.7

Table 5.1: Percent misclassification rate for Turk-Pentland data-set in three resolutions. Top left: on the “clean” testing set. Top right: Blurred images with a DOG-filter with σ1 = 1, σ2 = 2. Bottom left: Results of partial occlusion around the nose. Bottom right: Results of a “Salt and Pepper” noise of 20% of the image. For 32×32 resolution, single unconstrained net with λ = 0 and reconstruction ensemble correspond to initial “weights B” of Table 5.2. PCA stands for PCA network.

Single PCA network: When λ = 1 and the activation functions of the hidden and output units are linear, the hidden weights of the network span the space of the principal eigenvectors (Section 3.1.2). Classification results for the PCA network representations are presented in Table 5.1 (bottom rows). These results are inferior to those of the other methods and demonstrate that the first few principal components may be inefficient for classification. It is known, however, that a larger number of principal components produces improved results (Kirby and Sirovich, 1990).

Ensemble combination: Classification ensemble, or voting, is quite common in computational learning theory (Section 2.2). We find that a regression ensemble is superior to a classification ensemble, especially at higher image resolutions. We note that for a regression ensemble, variance reduction by averaging is achieved when the errors of the different classifiers are independent. It appears that the use of different λ values leads to some independence in misclassification, and thus the regression ensemble produces better results.
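The difference between the two combination rules can be illustrated with a small sketch. It assumes that each network produces one output value per class and that the classification ensemble is a plain majority vote; the precise definitions are those of Section 2.2, so the code below is only an approximation of them, with hypothetical function names.

```python
import numpy as np

def classification_ensemble(outputs):
    """Majority vote over the class decisions of the individual networks.

    outputs: array of shape (n_networks, n_samples, n_classes).
    Ties are resolved in favor of the lowest class index.
    """
    votes = outputs.argmax(axis=2)                       # (n_networks, n_samples)
    n_classes = outputs.shape[2]
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)], axis=1)
    return counts.argmax(axis=1)

def regression_ensemble(outputs):
    """Average the network outputs first, then take the largest averaged output."""
    return outputs.mean(axis=0).argmax(axis=1)

# Two networks weakly prefer class 0, a third strongly prefers class 1.
outputs = np.array([[[0.55, 0.45]],
                    [[0.55, 0.45]],
                    [[0.05, 0.95]]])
```

In this toy case voting returns class 0, while averaging returns class 1, because averaging keeps the confidence of each network instead of only its decision.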

Different image resolutions: Generally, the results for the 16 × 16 resolution are only slightly worse than the results for higher resolutions. This resolution is less sensitive to difference of Gaussians blur, but very sensitive to “Salt and Pepper” noise, for which the results are significantly worse. This is a strong indication of the usefulness of multi-resolution detection as a means to improve performance under various image degradations.

In short, Table 5.1 indicates that reconstruction constraints combined with a regression ensemble produce more robust results. In the following set of experiments, we consider other network constraints.

5.4.1 Different architecture constraints and regularization ensembles

Table 5.2 presents results on different classification schemes that were generated by various

network constraints and regression ensemble combinations. All results were obtained on

the Turk-Pentland data-set with an intermediate (32 × 32) resolution, using networks

trained with two sets of initial random weights A and B.

Unconstrained ensembles The first two rows in Table 5.2 represent two single (con-

ventional) unconstrained networks, corresponding to training with different initial weights

A and B. This serves as a base-line comparison and demonstrates the increased sensitivity

of single networks to image degradation, in particular to blur. Before concentrating on the

effects of additional constraints we note that ensemble without additional constraints (1st

numbered row of Table 5.2) is already significantly better than a single network. Similar

results for the TAU data-set are presented in the second row of Table 5.3.

Reconstruction networks and their ensembles The next two rows (numbered 2

and 3, in Table 5.2) show the variability of the reconstruction ensemble results due to

a different initial set of weights (A and B). Classification results of the reconstruction

ensemble for the TAU data-set are shown in the 3rd row of Table 5.3. For this (more

difficult) data-set, the ensemble of unconstrained networks is always inferior to the en-

semble with reconstruction constraints. The largest difference between the unconstrained

and reconstruction ensembles is observed for blurred images.

The 4th row of Table 5.2 represents the reconstruction ensemble composed of the networks of the two reconstruction ensembles with weights A and B. The main observation is that this combined reconstruction ensemble is better than the unconstrained and reconstruction ensembles with either weights A or B in classification of Gaussian and DOG blurred images. As can be seen, the unconstrained (λ = 0) ensemble is slightly better than this combined reconstruction ensemble when the distortion is small. This motivated us to combine the unconstrained and the combined reconstruction ensembles (5th row of Table 5.2). We note that this joined ensemble leads to robust results and is superior to the other ensembles. Similar results for the TAU data-set, with a reconstruction ensemble, are presented in the 3rd and 4th rows of Table 5.3.

Different Ensemble Types (Pentland data-set)

Columns: Opt. NN = optimal NN for the testing set; Reg. ens. = regression ensemble on the testing set; Gauss = Gaussian filter, σ = 2; DOG = DOG filter, σ1 = 1, σ2 = 2; Nose, Half = occlusion of the nose area and of the half face area; S&P = “Salt and Pepper” noise with d = 0.1 and d = 0.2.

Ensembles: type of regularization constraints        Opt.   Reg.   Gauss  DOG    Nose   Half   S&P    S&P
                                                     NN     ens.                               0.1    0.2
Single unconstrained net with initial “weights A”    1.0    *      10.3   8.2    1.0    7.2    4.6    10.8
Single unconstrained net with initial “weights B”    2.6    *      12.8   12.3   3.6    8.7    6.2    13.8
1. Ensemble for λ = 0 and different initial weights  1.0    0.5    6.7    7.7    0.5    5.6    2.1    7.2
2. Reconstruction with initial “weights A”           2.1    2.1    8.2    4.1    1.5    5.6    6.7    12.8
   λ: 0.05 0.1 0.3 0.35
3. Reconstruction with initial “weights B”           1.5    1.0    8.7    8.2    1.5    6.2    3.1    13.3
   λ: 0.1-0.3, step 0.05
4. Reconstruction ensemble with initial “weights A,B” 1.5   1.5    6.2    4.6    2.1    4.6    4.1    9.7
5. Reconstruction (A+B) and λ = 0 ensembles          1.0    0.5    5.6    4.6    0.5    4.6    2.6    6.7
6. Reconstruction with entropy maximization          1.5    2.1    7.2    4.6    2.6    4.6    4.1    8.7
   λ: 0.05-0.3, step 0.05
7. Reconstruction and entropy maximization ensembles 1.0    1.5    5.6    3.1    1.5    4.6    4.1    7.2
8. Entropy maximization                              0.5    1.5    8.7    4.1    2.1    6.7    3.6    11.3
   λ: 0-0.4, step 0.05
9. BCM                                               2.1    2.6    11.3   5.1    2.1    8.2    4.1    11.8
   λ: 0.05-0.3, step 0.05
10. Sum of entropies A                               1.5    2.1    8.2    7.7    2.6    3.6    4.1    10.8
   λ: 0.05 0.1 0.2 0.25 0.3
11. Sum of entropies B                               1.5    2.1    8.2    7.7    2.6    3.6    3.6    13.3
   λ: 0.05 0.1 0.2 0.25 0.3
12. Sum of entropies C                               0.5    2.1    7.7    7.7    2.1    6.7    2.1    9.7
   λ: 0.05 0.1 0.2 0.25 0.3
13. Sum of entropies D                               1.0    2.1    8.2    6.7    2.1    5.6    3.6    8.7
   λ: 0.05-0.3, step 0.05
14. Nonlinear PCA                                    2.6    3.6    20     21.5   7.6    26.2   51.3   74.4
   λ: 0-0.3, step 0.1

Table 5.2: Percent classification error for different image corruptions on the Turk-Pentland data-set in intermediate (32 × 32) resolution. All results are for an ensemble of networks that includes the indicated λ values. The column optimal NN refers to the single best λ-network in the ensemble. In the “Salt and Pepper” experiments, either 10% or 20% of the image was corrupted. Information that is not relevant for single networks is marked with *.

Different Ensemble Types (TAU data-set)

Columns: Best net = best single net results on the test set; Reg. ens. = regression ensemble on the test set; Gauss = Gaussian filter, σ = 2; DOG = DOG blur with σ1 = 1, σ2 = 2; Nose, Half = occlusion of the nose area and of the half face area; S&P = “Salt and Pepper” noise with d = 0.1 and d = 0.2.

Type of regularization constraints                   Best   Reg.   Gauss  DOG    Nose   Half   S&P    S&P
                                                     net    ens.                               0.1    0.2
Best single unconstrained net, λ = 0                 15.5   *      23.0   44.6   29.1   35.1   24.3   31.1
Ensemble for λ = 0 and different initial weights     15.5   12.8   19.6   31.8   18.2   20.9   16.2   22.3
Reconstruction ensemble, λ = 0, 0.04, 0.1, 0.2, 0.3  15.5   12.8   16.2   26.4   18.2   26.4   16.2   14.9
Joined reconstruction and unconstrained ensemble     15.5   13.5   16.9   23.6   16.9   20.9   14.2   16.2
Entropy maximization, λ = 0 : 0.4, step 0.05         20.3   12.8   18.2   32.4   16.9   23     13.5   20.3

Table 5.3: Misclassification error (in percent) for various ensembles and joined reconstruction and unconstrained ensembles. Results are given for the TAU data-set with different image corruptions. In the “Salt and Pepper” experiments, 10% or 20% of the image was corrupted. Information that is not relevant for single networks is marked with *.

Ensembles with unsupervised constraints In a manner similar to networks with

reconstruction constraints, we have generated different families of networks with unsu-

pervised feature extraction constraints (see Section 5.2.1). The entropy maximization

constraint is superior (on the TAU data-set) to an unconstrained ensemble under image

occlusions and all types of image degradations (the last row of Table 5.3). The BCM

constraint (9th row of Table 5.2) and the sum of entropies constraints were useful under

image degradations using a DOG filter.

Ensembles with reconstruction and unsupervised constraints

The reconstruction ensemble with entropy maximization (6th row of Table 5.2) is better than the unconstrained ensemble and the reconstruction ensemble with “weights B” in classification of DOG blurred images. Joining this ensemble with both reconstruction ensembles (7th row of Table 5.2) gives better results for DOG blurred images than the joined reconstruction and unconstrained ensemble. In general, however, merging ensembles with reconstruction constraints and with no constraints (λ = 0) leads to more robust results and is superior to the joined reconstruction and entropy constraints ensemble.

Figure 5.6 summarizes most of the results of Table 5.2 and compares different ensemble averaging schemes and different learning constraints on the original and degraded images. It shows that the “joined reconstruction ensemble” (pink, fifth bar) performs better than each of the reconstruction ensembles from which it is composed. Additional merging with the unconstrained (λ = 0) ensemble (black, seventh bar) gives better performance in most cases. The same useful property of the reconstruction ensemble can be observed when merging the reconstruction and entropy maximization ensembles (yellow, sixth bar). This ensemble is superior for DOG blurred images.

5.5 Saliency detection

This section presents a way to improve recognition of corrupted images using the ability of the network to generalize in reconstruction. Due to the bottleneck structure of the network, reconstruction is effective even when images contain a large amount of noise or are partially occluded by contrasting objects. Reconstructed images, which we call prototypes, are able to recover partially degraded or occluded areas of the input. However, it is difficult to locate these degraded areas in the input or, more generally, to define the relevance (confidence) of the image areas. This task is common in artificial intelligence and robotic vision. It is referred to as saliency detection or intelligent filtering (Baluja, 1996). The definition of relevance depends on the desired task and the learning algorithm.

For example, for autonomous vehicle navigation (Baluja and Pomerleau, 1995), a saliency map derived from a specific Neural Network representation (see Section 3.1.5) was designed to highlight significant (salient) regions of the input and deemphasize unimportant regions. Their saliency map is based on the difference between an input image and its prediction by the network from the previous video frame. Below, we present a saliency map construction for still images that is suitable for the classification task and uses the hybrid network introduced in Chapter 3.

5.5.1 Saliency map

After training, a difference map (image) x_d, given by the absolute difference between the input image x and its prototype x_p, x_d = |x − x_p|, can be used for extracting unreliable areas (areas with a large amount of noise or unexpected objects) in the input image. Due to the


[Figure: four bar-chart panels of misclassification rate (%): a: “Clean” data; b: DOG blurring; c: “Salt and Pepper” noise; d: Half face crop. In each panel the bars are grouped as Av, Clas and Reg.]

Figure 5.6: Summary of different networks and different image degradations. Misclassification rate (%) for different network ensembles and different types of image degradation for the Pentland data-set at 32 × 32. “Av”: average performance of the network ensembles. “Clas”: performance of the classification ensembles. “Reg”: performance of the regression ensembles. The bars in each group, from left to right, correspond to the following ensembles of Neural Networks:
1. Ensemble of unconstrained Neural Networks with λ = 0 and different initial weights.
2. Reconstruction with initial weights A.
3. Reconstruction with initial weights B.
4. Reconstruction with entropy maximization.
5. Joined reconstruction ensemble (merged ensembles 2 and 3).
6. Joined reconstruction and entropy maximization ensemble (merged ensembles 2, 3 and 4).
7. Joined reconstruction and unconstrained λ = 0 ensemble (merged ensembles 1, 2 and 3).
For comparison, the classification errors of single Neural Networks with initial weights A are shown by dashed horizontal lines.


bottleneck structure of the network, the output of the reconstruction layer should be better suited for recognition than the original signal in areas where x_d is large, i.e., where the original signal x is corrupted. Thus, we propose to replace, before recognition, the original image x by the image x_n using a saliency map Φ(x_d):

$$x_n = \Phi(x_d)\,x + \bigl(1 - \Phi(x_d)\bigr)\,x_p, \qquad (5.5.1)$$

where all operations are pixel-wise.

We have constrained the saliency map Φ(x_d) to be a decreasing function such that Φ(0) = 1, and have considered two types of saliency maps. The first type is given by

$$\Phi(x) = \exp(-\mu x^2),$$

with the parameter µ tuned to µ = 0.9. The second saliency map was taken as

$$\Phi(x) = \begin{cases} 1 & \text{if } x < \bar{x} \\ 0.5 & \text{otherwise,} \end{cases}$$

where the threshold $\bar{x}$ was adjusted to 0.3. Figure 5.7 shows examples of the x_n images obtained using the two saliency maps. Classification was improved for some types of the

[Figure: saliency map construction. Panels: Input; Reconstruction; Difference; Saliency map-1; Reconstruction with map-1; Saliency map-2; Reconstruction with map-2.]

Figure 5.7: Reconstruction using saliency maps for a network with reconstruction constraints and trade-off parameter λ = 0.04 (TAU data-set). The white pixels of the first map (map-1) correspond to intensity equal to 1 and the black pixels to zero intensity. In map-2 the white pixels have intensity equal to 0.5 and the black pixels 0.

degradation process, especially for “Salt and Pepper” noise (Tables 5.4–5.5). For other


Recognition using saliency maps (Pentland data-set)

Types of degradation                     Regression ensembles
                                         Unconstrained   Reconstruction A   Reconstruction B   Joined
“Salt and Pepper” noise with d = 0.1
  input                                  1.5             2.1                3.1                1.0
  map-1                                  1.5             3.1                3.1                1.0
  map-2                                  1.5             3.1                4.1                1.5
“Salt and Pepper” noise with d = 0.2
  input                                  7.2             11.3               11.3               6.7
  map-1                                  2.6             4.6                4.1                3.6
  map-2                                  2.6             4.6                5.6                3.6
“Salt and Pepper” noise with d = 0.3
  input                                  25.1            26.2               30.8               23.6
  map-1                                  11.3            13.3               13.3               11.8
  map-2                                  12.8            14.9               15.9               12.3
“Right eye” with ν = 3
  input                                  3.1             1.5                2.1                2.1
  map-1                                  3.1             1.5                2.1                0.5
  map-2                                  2.6             2.1                2.6                1.0
“Half face” with ν = 3
  input                                  15.9            26.2               22.1               18.5
  map-1                                  16.9            26.2               22.1               16.9
  map-2                                  16.9            25.6               19.5               17.4
“DoG 1-2”
  input                                  7.7             4.1                9.2                4.6
  map-1                                  7.7             3.6                9.7                4.1
  map-2                                  8.2             3.6                8.7                5.6

Table 5.4: Percent misclassification error results for images obtained using two types of saliency maps. Reconstruction ensembles A and B correspond to training with weights A and B. In contrast with the reconstruction ensembles A and B in Table 5.2 (rows 2 and 3), they also contain one unconstrained network (λ = 0). Rows marked “input” stand for the degraded input images. Pentland data-set at 32 × 32 resolution.


Recognition using saliency maps (TAU data-set)

Types of degradation                     Regression ensembles
                                         Unconstrained   Reconstruction   Joined
“Salt and Pepper” noise with d = 0.1
  input                                  16.2            16.9             13.5
  prototype                              23.0            16.2             14.9
  map-1                                  14.2            14.2             12.8
  map-2                                  14.2            14.9             12.2
“Salt and Pepper” noise with d = 0.2
  input                                  25.0            20.3             20.3
  prototype                              31.1            18.2             20.3
  map-1                                  18.2            14.2             12.8
  map-2                                  18.2            15.5             12.8
“Salt and Pepper” noise with d = 0.3
  input                                  37.8            31.8             31.1
  prototype                              43.2            31.1             32.4
  map-1                                  25.7            18.9             20.3
  map-2                                  21.6            17.6             18.9
“Right eye” with ν = 3
  input                                  14.2            15.5             13.5
  prototype                              14.9            14.9             13.5
  map-1                                  14.2            15.5             13.5
  map-2                                  13.5            15.5             13.5
“Half face” with ν = 3
  input                                  43.9            36.5             34.5
  prototype                              43.2            41.9             36.5
  map-1                                  42.6            36.5             34.5
  map-2                                  41.9            35.8             36.5
“DoG 1-2”
  input                                  31.8            26.4             23.6
  prototype                              33.1            26.4             27.0
  map-1                                  32.4            27.7             24.3
  map-2                                  33.1            27.0             25.0

Table 5.5: Percent misclassification error results for images obtained using two types of saliency maps. Rows marked “prototype” stand for the reconstructed images (TAU data-set).


types of image degradation, the classification improvement was not significant. To increase the effectiveness of the saliency map in the experiments with partially occluded images, the occluded region was enhanced by multiplying the average intensity over the occluded area by some factor ν.
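The replacement rule of Eq. (5.5.1) with the two saliency maps can be sketched as follows. The sketch assumes that the prototype has already been produced by the trained network and that intensities are scaled to [0, 1]; the function names are illustrative only.

```python
import numpy as np

def saliency_exp(xd, mu=0.9):
    """First saliency map: Phi(x) = exp(-mu * x^2), with mu = 0.9 as in the text."""
    return np.exp(-mu * xd**2)

def saliency_step(xd, threshold=0.3):
    """Second saliency map: 1 below the threshold (0.3 in the text), 0.5 otherwise."""
    return np.where(xd < threshold, 1.0, 0.5)

def blend(x, xp, saliency):
    """Eq. (5.5.1): pixel-wise mix of the input x and its prototype xp."""
    xd = np.abs(x - xp)              # difference map
    phi = saliency(xd)
    return phi * x + (1.0 - phi) * xp
```

Where the input agrees with its prototype, Φ is 1 and the input pixel is kept unchanged. Where they disagree strongly, the result moves toward the prototype; with the second map it moves at most halfway.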

5.6 Conclusions

We have shown that constraints on the properties of the low-dimensional internal repre-

sentation of the images, such as entropy maximization, BCM and the sum of entropies,

are useful and can be considered in conjunction with reconstruction constraints, to im-

prove generalization for classification. It was further shown that an averaging of Neural

Networks with different constraint strengths is preferable to a simple choice of the optimal

regularized network parameters. The best classification results were obtained by merging

the ensemble with reconstruction constraints and the unconstrained, λ = 0 ensemble.

Reconstruction constraints significantly improve classification results under partial

occlusion, lossy compression, “Salt and Pepper” noise and some image blur operations. In

addition, we have shown that via saliency maps, reconstruction can deemphasize degraded

regions of the input, thus leading to classification improvement under “Salt and Pepper”

noise. In the next chapter, we investigate the influence of the reconstruction constraints

on image recognition under a wide family of image blur and consequent deblur operations.


5.7 Appendix to Chapter 5: Hidden representation exploration

Image recognition improvement is based on the extraction of a good hidden data representation. Although recognition performance is the only reliable measure by which to judge the quality of the hidden representation, it may be interesting to consider some statistics of the hidden layer units. Statistics of the hidden unit activities characterize the data distribution after projection onto the hidden weight directions. Some properties of the hidden representation are presented below.

In Figures 5.8–5.9, the hidden unit activities per class are shown for different bias constraints. As can be seen, in both networks images of the same class excite similar activation patterns in the hidden space, while patterns corresponding to different classes differ markedly. Such a representation should be good for recognition. However, from these observations it is difficult to decide which type of constraint is preferable.

The pdfs of the hidden unit activities are presented in Figure 5.10. As can be seen, they are multi-modal for the unconstrained network and multi-modal or super-Gaussian for the reconstruction network. Both of these properties are useful for recognition (Chapter 4).

Another way to get an impression of the hidden layer structure is to look at the hidden weights as images (Figure 5.11). We note, however, that the hidden representation of a network ensemble is not well defined.


[Figure: hidden unit activities vs. classes for the unconstrained network, λ = 0. One panel per class (class 1 to class 15); the vertical axis of each panel is labelled “Neuron”, with ticks from 2 to 10.]

Figure 5.8: Hidden unit activities vs. classes for an unconstrained network. Results on the “clean” Pentland data set at intermediate resolution 32 × 32. Each square area represents a neuron activity magnitude vs. different inputs (such a representation is similar to Hinton diagrams for the representation of network weights). The color indicates the sign of the magnitude: red for negative and green for positive values (on non-color printers, red appears darker than green).


[Figure: hidden unit activities vs. classes for the reconstruction network, λ = 0.3. One panel per class (class 1 to class 15); the vertical axis of each panel is labelled “Neuron”, with ticks from 2 to 10.]

Figure 5.9: Hidden unit activities vs. classes for a reconstruction network. Results on the “clean” Pentland data set at intermediate resolution 32 × 32.


[Figure: pdfs of hidden unit activities; panels for neuron 1 to neuron 5.]

Figure 5.10: Hidden unit activity pdfs for the unconstrained (λ = 0) and reconstruction (λ = 0.3) networks, for the “clean” Pentland data set at intermediate resolution 32 × 32.


Figure 5.11: Hidden weight representation. Pentland data set at intermediate resolution 32 × 32.

Chapter 6

Blurred image recognition

This chapter studies a case where the required generalizations are for data which may

be “far” from data in the training set, namely data with a different distribution than

the training set. In the previous chapter, we considered unsupervised and particularly

reconstruction constraints, as a mechanism to impose useful bias during training. We

have shown that these constraints improve generalization performance for various image

degradations, such as “Salt and Pepper” noise, low resolution and partial occlusion. How-

ever, sensitivity to image blur was still too high. This chapter is devoted to performance

improvement under various types of image blur.

6.1 Methodology

Recognition of blurred images requires a substantial amount of training data processed

by different blur operators. Unfortunately, such data is not available, and therefore, an

alternative way to solve the problem is to impose a priori information about possible

degradation transformations.

For example, in the character recognition problem, the possible transformations are

geometrical, such as shift, rotation and scaling (Simard et al., 1992; Baird, 1990). The

regularization there appears as the invariance tangent prop constraints in the form of

the penalty term to the cost function or using the distortion model, i.e. by data driven

regularization (Section 2.3.2).

We choose to add Gaussian blurred images to the training set as a representative of all

blur operations and recognition is done on a wide variety of blur operations. We further

propose to enforce reconstruction of blurred images to either their copy or to the original

non-blurred images. Such training causes the hidden representation to become insensitive

to blur operation.

Another obvious way to improve classification of the blurred images is to restore the


blurred images beforehand. In this case, before testing the recognition system on blurred

images, their degradation is reduced via image restoration techniques.

6.1.1 Experimental design

Training schemes

In Chapter 5, hybrid networks were trained to classify and reconstruct “clean” images

(Figure 6.1 A, training stage), i.e., the reconstruction of a copy of the input in the output

layer was used. Below, we refer to this training scheme as training scheme A. This

training encourages internal representation where patterns of the same class are clustered

together (due to the reconstruction part of the learning), while the distance between

patterns of different classes is stretched (due to the discriminative/classification part of

learning) (Gluck and Myers, 1993). As a result, classification in this hidden space is

simpler and is more robust to various forms of degradation.

To further improve recognition of degraded images, we have added Gaussian blurred

images (with standard deviation σ = 2) during training. This data expansion procedure

gives two additional types of the training procedure with reconstruction constraints. The

first training scheme B enforces reconstruction of the original “clean” images from the

blurred inputs (Figure 6.1 B, training stage) and the second scheme C, is a simple dupli-

cation of the inputs at the output (Figure 6.2). Both training schemes B and C encourage

internal representation to be more robust to blurring, but training scheme B introduces

additional invariance constraints on the image reconstruction task.
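The three training schemes differ only in which reconstruction target is paired with each input. A minimal sketch of the data expansion follows; the separable blur with edge padding and the function names are assumptions made for illustration, while σ = 2 comes from the text.

```python
import numpy as np

def gaussian_blur(image, sigma=2.0):
    """Separable Gaussian blur with edge padding (assumed boundary handling)."""
    radius = int(3 * sigma)
    ax = np.arange(-radius, radius + 1)
    k = np.exp(-ax**2 / (2.0 * sigma**2))
    k /= k.sum()
    padded = np.pad(image, radius, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def make_training_pairs(clean_images, labels, scheme):
    """Return (inputs, reconstruction_targets, labels) for scheme 'A', 'B' or 'C'."""
    clean = [np.asarray(im, dtype=float) for im in clean_images]
    if scheme == "A":                       # clean -> clean
        return clean, clean, list(labels)
    blurred = [gaussian_blur(im) for im in clean]
    inputs = clean + blurred
    all_labels = list(labels) + list(labels)
    if scheme == "B":                       # blurred -> clean original
        targets = clean + clean
    elif scheme == "C":                     # blurred -> blurred copy
        targets = clean + blurred
    else:
        raise ValueError("scheme must be 'A', 'B' or 'C'")
    return inputs, targets, all_labels
```

In all three schemes the class labels are unchanged by the blur, so the classification part of the training is the same; only the reconstruction targets differ.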

As in Chapter 5, three types of ensembles are studied for each of the training schemes A–C: unconstrained, with reconstruction constraints, and joined. The number of networks in the unconstrained ensemble is 6 for all schemes A–C. Ensembles with reconstruction constraints for all schemes A–C are composed of networks with the trade-off parameter λ ranging from 0 to 0.3 in increments of 0.05.

Testing schemes

Two testing schemes were used to evaluate the generalization ability of networks and

their ensembles. The first testing scheme A is the same as in Chapter 5, i.e., various

image degradations are simulated and a misclassification rate for different ensembles is

evaluated (Figure 6.1 A, testing stage). In testing scheme B, the degraded images are

first preprocessed using several restoration methods and only then classification is carried

out (Figure 6.1 B, testing stage). Our experiments consist of several groups which differ

by simulated degradation types and applied restoration techniques. In the next section,


[Figure: block diagrams of the experimental design schemes A and B, each with a training stage (classification to a class label and reconstruction of a clean image) and a testing stage (in B, a degraded image is restored before classification).]

Figure 6.1: Experimental design schemes. (A): In the training stage, networks are trained to classify and reconstruct “clean” images. In the testing stage A, the generalization ability to classify artificially degraded images is tested. (B): Artificially blurred images are added to the training stage. Networks are trained to classify images and reconstruct their “clean” prototypes. In the testing stage B, restoration preprocessing is used before the recognition schemes.

we review image degradation operations and restoration methods which we apply.

6.2 Image degradation

The degradation process is usually modeled as a space-invariant blur with a convolution operator h together with corruption by an additive noise n:

g = h ∗ f + n, (6.2.1)

where f is the original image. The major known causes for image blur are misfocus, camera

jitter, object motion and atmospheric turbulence. These types of blur lead to a low pass

operation on the image. Of particular interest is a difference of Gaussians (DOG) filter,

which is a band-pass filter, and is known to be present in early mammal vision (Kandel

and Schwartz, 1991). This operator is equivalent to simultaneous image smoothing and

enhancement. A third family of image filters is the high pass filter which leads to image

sharpening. This filter is common in medical imaging, industrial inspection and military


[Figure: block diagram of training scheme C, training stage: clean and blurred inputs are mapped to a class label (classification) and to a reconstruction output.]

Figure 6.2: Training scheme C. In the training stage, the network is trained to classify and reconstruct “clean” and blurred images. Reconstruction is a copy of the input in the output sublayer.

applications. The presence of noise in images is inevitable. It may be a result of image

generation, recording, transmission, etc. Noise corruption complicates image acquisition

and even a small amount of it is harmful for restoration of blurred images. We consider

two types of additive noise: Gaussian white noise and pulse noise. We limit ourselves to

Gaussian noise that acts independently on each pixel, with zero mean and some variance σ².

Pulse noise (otherwise called “Salt and Pepper” noise) replaces pixel intensities by either

the maximum or minimum grey-level values with some probability (Rosenfeld and Kak,

1982), producing separate high contrast black-and-white points. This noise is common in

video transmission.

6.2.1 Main filters

Filtering may be done both in the frequency and spatial domains. Convolution in the

spatial domain is equivalent to multiplication of the Fourier transforms of the image and

the filter in the frequency domain. In each particular case we indicate in which domain

filtering is done and represent point spread function or its Fourier transform (referred to

as a transfer function) as required. Examples of images with various degradations are

shown in Figure 6.3.
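The degradation model of Eq. (6.2.1) and the equivalence between spatial convolution and multiplication of Fourier transforms can be sketched together. The sketch assumes circular (periodic) boundary conditions, which is what the discrete Fourier transform implies, and a point spread function whose origin is at index [0, 0]; the function name is illustrative.

```python
import numpy as np

def degrade(f, h, noise_std, rng):
    """g = h * f + n, with the convolution computed in the frequency domain.

    f: original image; h: point spread function (origin at h[0, 0]);
    n: zero-mean Gaussian noise acting independently on each pixel.
    """
    H = np.fft.fft2(h, s=f.shape)               # transfer function, PSF zero-padded
    blurred = np.real(np.fft.ifft2(np.fft.fft2(f) * H))
    noise = rng.normal(0.0, noise_std, size=f.shape)
    return blurred + noise
```

A point spread function that sums to one preserves the total image intensity, and a unit impulse at the origin leaves the image unchanged apart from the added noise.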

Ideal filters

Ideal filters represent a class of frequency domain filtering that are easy to simulate.

Transfer functions of these filters are radially symmetric about the origin and though

they are not physically realizable, they are widely used in image processing for comparing

the behavior of different types of filters. The name ideal indicates that some specified


[Figure: degraded images. Panels: original, a, b, c, d (top row); e, f, g, h, i (bottom row).]

Figure 6.3: Degraded images. a) Result of Gaussian noise with σ = 2; b) Result of pulse noise with density 20%; c) Result of replacement of the nose area with the average intensity over this area; d) Result of the root filter with α = 0.6; e) Result of the out-of-focus filter with the blur radius R = 5; f) Motion blur with blur propagation over 7 pixels; g) Result of Gaussian blur with σ = 2; h) Result of the DOG filter with on and off centers equal to σ1 = 1 and σ2 = 2; i) Result of the ideal high pass filter with cutoff w = 3.

frequencies are completely eliminated. Depending on the eliminated frequencies ideal low,

band and high pass filters are known (Gonzalez and Wintz, 1993). A transfer function of

the ideal filter in the frequency domain (u, v) is given by the expression:

$$H(u, v) = \begin{cases} 1 & \text{if } (u, v) \in D \\ 0 & \text{otherwise,} \end{cases}$$

where the area of the unchanged frequencies D is:

a) $\sqrt{u^2 + v^2} \le W^{*}$, b) $\sqrt{u^2 + v^2} \ge W'$, c) $W^{*} < \sqrt{u^2 + v^2} < W'$,

for low, high and band-pass filters respectively; $W^{*}$ and $W'$ are called cutoff frequencies.
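A sketch of the three ideal filters, applied in the frequency domain, is given below. It assumes that the cutoffs are measured in discrete frequency indices (cycles per image), which the text does not specify; the function and parameter names are illustrative.

```python
import numpy as np

def ideal_transfer(shape, kind, w_low=None, w_high=None):
    """Transfer function H(u, v): 1 on the retained frequencies D, 0 elsewhere."""
    u = np.fft.fftfreq(shape[0]) * shape[0]     # integer frequency indices
    v = np.fft.fftfreq(shape[1]) * shape[1]
    r = np.sqrt(u[:, None]**2 + v[None, :]**2)
    if kind == "low":
        keep = r <= w_low
    elif kind == "high":
        keep = r >= w_high
    elif kind == "band":
        keep = (r > w_low) & (r < w_high)
    else:
        raise ValueError("kind must be 'low', 'high' or 'band'")
    return keep.astype(float)

def apply_transfer(image, H):
    """Multiply the image spectrum by H and transform back."""
    return np.real(np.fft.ifft2(np.fft.fft2(image) * H))
```

For instance, any high-pass filter with a positive cutoff removes the zero frequency, so the filtered image has zero mean.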

Motion blur

Motion blur is a form of image degradation that may degrade recognition performance

(Figure 6.3f). It is due to a relative motion between the camera and the object. Assuming

that a relative camera motion is horizontal and uniform and the total displacement during

the exposure time T is a, the transfer function H(u, v) (Gonzalez and Wintz, 1993) is

given by:

$$H(u, v) = \frac{T}{\pi u a}\,\sin(\pi u a)\,\exp(-\pi i u a). \qquad (6.2.2)$$

H vanishes at values of u given by u = n/a, where n is a nonzero integer. In general, the amplitude of H(u, v) is characterized by periodic lines of zeros, which are orthogonal to the direction of motion and are spaced at intervals of 1/a on both sides of the frequency plane.
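The location of the zeros can be checked numerically with a short sketch of Eq. (6.2.2). It uses the NumPy convention `np.sinc(x) = sin(πx)/(πx)`, which also handles u = 0; the function name and the example value a = 7 are illustrative.

```python
import numpy as np

def motion_blur_transfer(u, a, T=1.0):
    """H(u) = T * sin(pi*u*a) / (pi*u*a) * exp(-i*pi*u*a) for horizontal uniform motion."""
    u = np.asarray(u, dtype=float)
    return T * np.sinc(u * a) * np.exp(-1j * np.pi * u * a)

a = 7.0
zeros = [abs(motion_blur_transfer(n / a, a)) for n in (1, 2, 3)]
```

The values in `zeros` vanish up to rounding error, while H(0) = T, which agrees with the spacing of 1/a between the lines of zeros.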

Out-of-focus blur

The point spread function (PSF) of a defocused lens with a circular aperture is approxi-

mated by the cylinder whose radius R depends on the extent of the focus defect (Cannon,

1976):

$$h(x, y) = \begin{cases} \dfrac{1}{\pi R^2} & \text{if } x^2 + y^2 \le R^2 \\ 0 & \text{otherwise,} \end{cases}$$

where R is the “blur radius”, which is proportional to the extent of defocusing. The Fourier transform of h(x, y) in this case is H(u, v) = J1(2πRr)/(πRr), where r = √(u² + v²) and J1 is the first-order Bessel function. H(u, v) is characterized by “almost-periodic” circles of zeros, which occur for r satisfying 2πRr = 3.83, 7.02, 10.2, 13.3, 16.5, . . .
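The quoted zero locations can be verified with a short numerical sketch. It assumes the unit-volume cylinder PSF, whose transform is J1(2πRr)/(πRr), and evaluates J1 through Bessel's integral J1(x) = (1/π)∫₀^π cos(τ − x sin τ) dτ with a midpoint rule. The names `bessel_j1` and `defocus_transfer` are ours.

```python
import math


def bessel_j1(x, n=400):
    """First-order Bessel function via Bessel's integral (midpoint rule)."""
    step = math.pi / n
    total = sum(math.cos(t - x * math.sin(t))
                for t in (step * (k + 0.5) for k in range(n)))
    return total * step / math.pi


def defocus_transfer(r, R):
    """Transfer function of the cylinder PSF at radial frequency r."""
    if r == 0.0:
        return 1.0  # the PSF integrates to one
    return bessel_j1(2.0 * math.pi * R * r) / (math.pi * R * r)
```

For a blur radius R = 5, the first ring of zeros is expected near r = 3.83/(2πR), roughly 0.12 cycles per pixel.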

The well-defined structure of H(u, v) zeros in the case of motion and misfocus blur is

used for the identification of the blur parameter (Cannon, 1976; Fabian and Malah, 1991)

for the purpose of image restoration. However, these methods are sensitive to noise. To overcome this drawback, a preprocessing stage for noise reduction and estimation has been used (Fabian and Malah, 1991). An example of a misfocus image with blur radius R = 5 is shown in Figure 6.3e.

Gaussian blur

Gaussian blur may be caused by atmospheric and optical blur. It is known that the

eyes’ lenses cause such blur. Computer tomography images also suffer from Gaussian blur

(Kimia and Zucker, 1993). The Gaussian convolution filter written in polar coordinates

h(r, φ) in the spatial domain is given by:

$$h(r, \varphi) = C \sigma^{-2} \exp\left(-\frac{r^2}{2\sigma^2}\right), \tag{6.2.3}$$

where C is a normalization constant. The lack of zero crossing of the Gaussian filter in the

frequency domain makes its identification very difficult. Moreover, Gaussian deblurring

is numerically unstable (Humel et al., 1987; Kimia and Zucker, 1993). An example of an

image blurred by this filter with σ = 2 is shown in Figure 6.3g.


DOG filter

The difference of Gaussian (DOG) filter is a good approximation to the circular symmetric

Mexican hat type receptive fields (center-surround) found in early mammal vision (Marr,

1982; Kandel and Schwartz, 1991). It acts as a band-pass filter that approximates the result of applying the Laplacian operator ∇² to an image which is blurred with a Gaussian filter.

The zero-crossings of the resulting convolved image are commonly used for edge detection

and segmentation. The DOG filter written in polar coordinates is described by:

$$h(r, \varphi) = C \sigma_1^{-2} \exp\left(-\frac{r^2}{2\sigma_1^2}\right) - C \sigma_2^{-2} \exp\left(-\frac{r^2}{2\sigma_2^2}\right), \tag{6.2.4}$$

where σ1 < σ2 are the standard deviations of the on and off centers (the positive and negative Gaussians, respectively). An image filtered with a DOG filter is shown in Figure 6.3h.
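The center-surround shape of the DOG profile is easy to inspect numerically. In the sketch below the function name `dog` is ours, and the choice C = 1/(2π), which makes each Gaussian integrate to one over the plane, is an assumption; the text only states that C is a normalization constant.

```python
import math


def dog(r, sigma1=1.0, sigma2=2.0, C=1.0 / (2.0 * math.pi)):
    """Radial profile of the difference-of-Gaussians filter."""
    on = C * sigma1 ** -2 * math.exp(-r * r / (2.0 * sigma1 ** 2))
    off = C * sigma2 ** -2 * math.exp(-r * r / (2.0 * sigma2 ** 2))
    return on - off


# For sigma1 = 1 and sigma2 = 2 the profile changes sign where the two
# Gaussians are equal, i.e. at r0 = sqrt(8 * ln(4) / 3).
r0 = math.sqrt(8.0 * math.log(4.0) / 3.0)
```

The profile is positive inside the radius `r0` (the on center) and negative outside it (the off surround).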

Root filter

The root filter is commonly used for image enhancement and deblurring (Jain, 1989). It raises the magnitude of the frequency response V of an image to the power α while leaving the phase unchanged: $\| V \| \mapsto \| V \|^{\alpha}$. For small values of α < 1, it acts as a high pass filter, increasing the ratio between amplitudes in the high and low frequencies. An image enhanced with a root filter (α = 0.6) is shown in Figure 6.3d.
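A minimal sketch of the root filter acting on a list of complex Fourier coefficients is given below; the function name `root_filter` and the toy two-coefficient spectrum are illustrative assumptions. The phase is kept and only the magnitude is raised to the power α.

```python
import cmath


def root_filter(spectrum, alpha):
    """Raise the magnitude of every coefficient to the power alpha, keep its phase."""
    out = []
    for value in spectrum:
        magnitude, phase = cmath.polar(value)
        out.append(cmath.rect(magnitude ** alpha, phase))
    return out


# Toy spectrum: a strong low-frequency and a weak high-frequency coefficient.
low, high = 100.0 + 0.0j, 1.0j
filtered_low, filtered_high = root_filter([low, high], 0.5)
```

Before filtering, the ratio of the high to the low amplitude is 0.01; after filtering with α = 0.5 it is 0.1, which illustrates the relative boost of the high frequencies.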

6.2.2 Other types of degradation

Noise

We consider two types of additive noise: Gaussian white noise and pulse noise. Gaussian

white noise is commonly used to model sensor noise and quantization process. We limit

ourselves to Gaussian noise that acts independently on each pixel with zero mean and

some variance σ2 (Figure 6.3a). Pulse noise replaces pixel intensities by either the

maximum or minimum grey-level value with some probability (Rosenfeld and Kak, 1982),

producing separate high contrast black-and-white points. This explains why pulse noise is also called “Salt and Pepper” noise. Pulse noise often appears during TV image transmission (Figure 6.3b).
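Both noise models can be sketched in a few lines. The function names, the representation of an image as a list of rows, and the grey-level range 0 to 255 are assumptions made for the illustration; equal probability of the minimum and maximum value is also an assumption.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add independent zero-mean Gaussian noise with standard deviation sigma."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in image]


def add_pulse_noise(image, density, rng, low=0, high=255):
    """Replace each pixel, with probability density, by the min or max grey level."""
    noisy = []
    for row in image:
        new_row = []
        for p in row:
            if rng.random() < density:
                new_row.append(rng.choice((low, high)))  # "pepper" or "salt"
            else:
                new_row.append(p)
        noisy.append(new_row)
    return noisy
```

A call such as `add_pulse_noise(image, 0.2, random.Random(0))` corrupts about 20% of the pixels, which corresponds to the density used in Figure 6.3b.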

Occlusion

Occlusion occurs as a result of motion, when two or more objects touch or overlap one

another. Another cause for occlusion in 2D images is the change of viewpoint, when

part of an object is occluded by another one. We simulate occlusion by replacing pixel


intensities in a rectangular area, which may lie in any part of the image, by a constant intensity (Figure 6.3c). The level of occlusion is characterized by a factor ν applied to the average intensity of the occluded area.

6.3 Image restoration

Image restoration refers to the problem of recovering an image from its blurred and noisy

version, using some a priori knowledge of the degradation phenomenon and the image

nature. It is well-known that the restoration problem is an ill-posed problem (Gonzalez

and Wintz, 1993; Jain, 1989; Stark, 1987), i.e. a small noise in the observed image results

in an unbounded perturbation in the solution. This instability is often addressed by

a regularization approach (Tikhonov and Arsenin, 1977; Katsaggelos, 1989; Sezan and

Tekalp, 1990; Rudin et al., 1992; You and Kaveh, 1996) that includes restricting the set

of admissible solutions and introducing some a priori knowledge about the image and the

degradation model.

6.3.1 MSE minimization and regularization

Assuming the blur operator H is known, a natural criterion for estimating an original

pixel image f from an observed pixel image g in the absence of any knowledge about

noise, is to minimize the difference between the observed image and a blurred version of

the restored image:

$$\min_f M(f) = \min_f \| g - Hf \|^2. \tag{6.3.5}$$

Often, gradient or conjugate gradient descent methods are used for M(f) minimization

(Katsaggelos, 1989; Sezan and Tekalp, 1990). An application of the gradient method to

the minimization problem (6.3.5) produces the following iterative scheme:

$$f_{k+1} = f_k + \beta (H^t g - H^t H f_k), \qquad f_0 = 0. \tag{6.3.6}$$

When the blur matrix H is nonsingular and β is sufficiently small, the iterative scheme converges to f = H⁻¹g. This solution is known as the inverse filter method. In the frequency domain, it corresponds to the following estimate of the ideal image frequency response:

$$F(u, v) = \frac{G(u, v)}{H(u, v)}. \tag{6.3.7}$$

As mentioned before, blur such as motion or defocusing leads to a singular H matrix. In

this case, the above optimization method yields an iterative scheme that converges to the


minimum norm least squares solution H⁺g of Eq. 6.3.5 (Katsaggelos, 1989; Jain, 1989), where H⁺ is the generalized inverse of the matrix H.
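Both limits of the gradient iteration f ← f + β(Hᵗg − HᵗHf), started from zero, can be observed on a toy problem. The sketch below uses plain Python lists; the helper names and the 2 × 2 matrices are illustrative assumptions, and the name `landweber` reflects that an iteration of this form is often called the Landweber iteration.

```python
def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def landweber(H, g, beta, steps):
    """Iterate f <- f + beta * (H^t g - H^t H f), starting from f = 0."""
    Ht = transpose(H)
    Htg = matvec(Ht, g)
    f = [0.0] * len(H[0])
    for _ in range(steps):
        HtHf = matvec(Ht, matvec(H, f))
        f = [fi + beta * (a - b) for fi, a, b in zip(f, Htg, HtHf)]
    return f


# Nonsingular blur: the iteration approaches the inverse-filter solution [1, 2].
f_inverse = landweber([[1.0, 0.5], [0.5, 1.0]], [2.0, 2.5], beta=0.4, steps=500)

# Singular blur: the iteration approaches the minimum norm solution [1, 1].
f_min_norm = landweber([[1.0, 1.0], [1.0, 1.0]], [2.0, 2.0], beta=0.2, steps=200)
```

In the singular case many vectors f satisfy Hf = g exactly; starting from zero keeps the iterates in the range of Hᵗ, which is why the minimum norm solution is selected.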

In the presence of noise, the iterative algorithm converges to H⁺g_b + H⁺n (where g_b is the blurred image without noise and n is the noise) and thus contains noise filtered by the pseudo-inverse matrix. Often, H is a low-pass filter; its pseudo-inverse therefore amplifies the high-frequency noise, and the obtained solution may be very far from the desired one.

To overcome this sensitivity to noise, some a priori information about the noise or

the ideal image is often introduced as a quantitative constraint that replaces an ill-posed

problem by a well-posed one. This method is called regularization. The most well known

regularization methods (Tikhonov and Arsenin, 1977; Sezan and Tekalp, 1990) have a

general formulation as a minimization of the function:

$$L(f) = \| Hf - g \|^2 + \alpha \| Cf \|^2,$$

where the regularization operator C is chosen to suppress the energy of the restored image

in the high frequencies, that is equivalent to an assumption about the smoothness of the

original image in the spatial domain. Since H is usually a low-pass filter, the regularization operator C is commonly taken to be the Laplacian, Cf = ∇ · ∇f, where ∇ is the gradient operator, so that the restored image is smooth. The regularization parameter α may be known a priori or estimated; theoretically, it is inversely proportional to the signal-to-noise ratio (SNR).
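As a worked step, and assuming that H and C are real matrices acting on the vectorized image, setting the gradient of the regularized criterion ‖Hf − g‖² + α‖Cf‖² to zero gives the normal equations below, together with the corresponding gradient iteration. This is a standard derivation rather than a formula taken from the cited works; the step size β plays the same role as in the unregularized gradient scheme, and the factor 2 of the gradient is absorbed into it.

```latex
\nabla L(f) = 2 H^t (Hf - g) + 2 \alpha C^t C f = 0
\quad\Longrightarrow\quad
(H^t H + \alpha C^t C) f = H^t g,
\qquad
f_{k+1} = f_k + \beta \left( H^t g - (H^t H + \alpha C^t C) f_k \right).
```

For α = 0 the iteration reduces to the unregularized gradient scheme. For α > 0 the matrix HᵗH + αCᵗC is nonsingular whenever H and C share no common null vector, which is how the constraint turns the ill-posed problem into a well-posed one.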

Although regularization of the MSE criterion with smoothness constraint ‖ Cf ‖ is

the basis for most of the work in image restoration, it often leads to unacceptable ringing

artifacts around sharp intensity transitions. This effect is due to image blurring around

lines and edges. One solution to this problem is given by minimization of the following functional (Katsaggelos, 1989):

$$L(f) = \sum_{x \in \Omega} [g(x) - h(x) * f(x)]^2 + \lambda \sum_{x \in \Omega} \omega(x) [c(x) * f(x)]^2. \tag{6.3.8}$$

The first term in (6.3.8) represents the fidelity of the restored image with respect to the observation and the second represents a smoothness constraint; ∗ denotes the convolution operator. The space adaptivity is achieved through the introduction of the weight function ω. The weight function ω is set to be small around the edge areas and larger near the smooth areas; in practice, it is usually taken as the inverse of the local variance of the image.

The space adaptivity approach has been extended to the case of an unknown blur

operator (You and Kaveh, 1996; Chan and Wong, 1997). The method incorporates a

priori knowledge about the image and the point spread function (PSF) simultaneously. It

proceeds by minimizing the cost function, which consists of a restoration error measure


and two regularization terms for the image and the blurring kernel; under constraints on

the blur filter energy.

You and Kaveh (1996) formulate the problem as minimization of a function of the discrete image and filter values (the 2D image and filter functions are discretized on a grid):

$$L(f, h) = \sum_{x \in \Omega} \omega(x) [g(x) - h(x) * f(x)]^2 + \lambda_1 \sum_{x \in \Omega} \omega_1(x) [c_1(x) * f(x)]^2 + \lambda_2 \sum_{x \in \Omega} \omega_2(x) [c_2(x) * h(x)]^2. \tag{6.3.9}$$

In (6.3.9) the first term is responsible for the image fidelity and the second and third terms

represent smoothing constraints on the image and the blur filter, respectively. Smoothness

is introduced adaptively via the weights ω1(x) and ω2(x).

Though the gradient descent method is commonly applied for minimization, an alternating minimization (AM) algorithm is used instead, which is a particular realization of the coordinate descent method (Luenberger, 1989). The filter and the image are considered as dual variables. The algorithm alternately minimizes the cost function by descending with respect to the filter or the image, while fixing the dual variable. In every alternating step, a quadratic cost function L(f, h|f) or L(f, h|g) is minimized by the conjugate gradient method. We note that this formulation is equivalent to minimization of the functional:

$$L(f, h) = \| \sqrt{\omega}\,(h * f - g) \|_{L_2}^2 + \lambda_1 \| \sqrt{\omega_1}\, C_1 * f \|_{L_2}^2 + \lambda_2 \| \sqrt{\omega_2}\, C_2 * h \|_{L_2}^2,$$

where f and h are the image and blur kernel 2D real functions and ‖ · ‖_{L2} is the L2 norm.
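The alternating structure can be illustrated on a deliberately tiny scalar analogue, in which the image f, the filter h and the observation g are single numbers and the cost is (g − hf)² + λ1 f² + λ2 h². This toy problem is our assumption and not the cited method: here each partial minimization has a closed form, whereas in the full problem each quadratic sub-problem is solved by conjugate gradients.

```python
def cost(f, h, g, lam1, lam2):
    """Scalar analogue of the blind restoration cost."""
    return (g - h * f) ** 2 + lam1 * f ** 2 + lam2 * h ** 2


def alternating_minimization(g, lam1, lam2, steps, f0, h0):
    """Alternately minimize the cost over f (h fixed) and over h (f fixed)."""
    f, h = f0, h0
    history = [cost(f, h, g, lam1, lam2)]
    for _ in range(steps):
        f = h * g / (h * h + lam1)  # exact minimizer over f for fixed h
        history.append(cost(f, h, g, lam1, lam2))
        h = f * g / (f * f + lam2)  # exact minimizer over h for fixed f
        history.append(cost(f, h, g, lam1, lam2))
    return f, h, history


# Start from the observation as the image and from an identity ("delta") filter.
f_est, h_est, history = alternating_minimization(2.0, 0.1, 0.1, 20, f0=2.0, h0=1.0)
```

Because every half-step minimizes the cost exactly in one variable, the recorded cost values in `history` never increase, which is the property that makes coordinate descent attractive here.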

Regularization with another form of constraint has been considered in (Chan and

Wong, 1997), where the problem is formulated as a minimization of the functional:

$$L(f, h) = \| h * f - g \|_{L_2}^2 + \alpha_1 \int_\Omega |\nabla f|\,dx + \alpha_2 \int_\Omega |\nabla h|\,dx. \tag{6.3.10}$$

The proposed method is called total variation blind deconvolution (TV regularization). In Eq. (6.3.10) the regularization term has the form ∫_Ω |∇f| dx, called the total variation (TV) norm (Rudin et al., 1992). It follows the idea that the image consists of smooth patches instead of being smooth everywhere, thus providing better recovery of image edges.
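Why the TV norm favors edges can be seen on two 1D signals that rise from 0 to 1, one abruptly and one gradually. The sketch below uses discrete differences in place of the gradient; the function names and the two test signals are illustrative assumptions.

```python
def total_variation(f):
    """Discrete TV norm: sum of absolute differences of neighboring samples."""
    return sum(abs(b - a) for a, b in zip(f, f[1:]))


def quadratic_smoothness(f):
    """Discrete quadratic penalty: sum of squared differences."""
    return sum((b - a) ** 2 for a, b in zip(f, f[1:]))


step = [0.0, 0.0, 0.0, 1.0, 1.0]     # sharp edge
ramp = [0.0, 0.25, 0.5, 0.75, 1.0]   # blurred edge of the same height
```

Both signals have the same total variation, so the TV term does not prefer the blurred edge, while the quadratic penalty of the sharp edge is four times that of the ramp and therefore pushes the solution towards blurred transitions.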

6.3.2 Image restoration in the frequency domain

All the restoration methods considered up to this point were derived in the space domain,

though historically the first methods were designed in the frequency domain. Here we briefly survey the most widely used frequency domain restoration methods.


Wiener filter

A fundamental result in filtering theory that is commonly used for image restoration is the Wiener filter. Wiener filtering has been successfully used to filter images corrupted by both noise and blurring. This filter gives the best estimate of the object from the observations in the MSE sense. The Wiener filter frequency response is given as (Jain, 1989):

$$H_W = \frac{H^{\star} S_{ff}}{\| H \|^2 S_{ff} + S_{\eta\eta}} = \frac{S_{gg} - S_{\eta\eta}}{H S_{gg}}, \tag{6.3.11}$$

where H⋆ denotes the complex conjugate of H. In the case where only one observation is available, S_ff and S_gg are the power spectra of the ideal and observed images, respectively, and S_ηη is the power spectrum of the noise. Since the phase of the Wiener filter coincides with the phase of the inverse filter, it does not compensate for phase distortions due to noise in the observations.

In the absence of the blurring, the Wiener filter becomes:

$$H_W = \frac{S_{ff}}{S_{ff} + S_{\eta\eta}} = \frac{\mathit{snr}}{\mathit{snr} + 1}, \tag{6.3.12}$$

where snr = S_ff/S_ηη is the signal-to-noise ratio. In practice, snr is defined as the ratio between the variances of the blurred image and the noise (or as 10 log10 snr, if the signal-to-noise ratio is measured in dB).

This filter (6.3.12) is called the Wiener smoothing filter. It suppresses all frequency

components in which the signal-to-noise ratio is small and does not change the frequency

components when snr is large (snr ≫ 1). For images, S_ff is usually very small for high

frequencies, therefore the noise smoothing filter is a low pass filter. Another marginal

case is the absence of noise, in which the Wiener filter coincides with the inverse filter

HW = H−1. Since the blurring process is usually a low pass filter, the Wiener filter acts

in this case as a high pass filter.

In the presence of noise and blur, the Wiener filter achieves a compromise between

low-pass noise smoothing and high-pass inverse filtering, resulting in a band-limited filter.

It is clear, nevertheless, that the Wiener filter is also unstable (like the inverse filter), if

the frequency response is zero or close to it.
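The two limiting cases can be checked with a short sketch of the Wiener response at a single frequency. Dividing the numerator and denominator of the Wiener response by the noise power spectrum expresses it through the blur response H and the ratio snr of the signal and noise power spectra at that frequency; the function name `wiener_response` is ours.

```python
def wiener_response(H, snr):
    """Wiener filter response at one frequency.

    H is the blur transfer function at that frequency and snr is the ratio
    of the signal and noise power spectra there.
    """
    H = complex(H)
    return H.conjugate() * snr / (abs(H) ** 2 * snr + 1.0)


no_blur = wiener_response(1.0, 3.0)        # smoothing filter snr / (snr + 1)
almost_no_noise = wiener_response(0.5, 1e12)  # approaches the inverse 1 / H
```

With H = 1 the response equals snr/(snr + 1), and for a very large snr it approaches 1/H, in agreement with the two marginal cases described above.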

Inverse and pseudo-inverse filters

As already mentioned, in the absence of noise the Wiener filter becomes an inverse filter and requires stabilization. A standard stabilized version of the inverse filter is described by the following equation:

$$H^{-1}(w_1, w_2) = \begin{cases} \dfrac{1}{H(w_1, w_2)} & \text{if } H(w_1, w_2) \ge \varepsilon_1 \\ 0 & \text{otherwise.} \end{cases}$$

Instead, we have used the following version of the pseudo-inverse filter in our simulations:

$$H^{-1}(w_1, w_2) = \begin{cases} \dfrac{1}{H(w_1, w_2)} & \text{if } H(w_1, w_2) \ge \varepsilon_1 \\ \dfrac{1}{H(w_1, w_2) + \varepsilon_2} & \text{otherwise.} \end{cases}$$

The choice of the ε1 and ε2 parameters determines the quality of the deblurred image. In our simulations, they have been chosen by trial, once for the whole data set. It is known that great care must be taken to obtain approximate solutions that achieve the proper balance between accuracy and stability (Stark, 1987). Another nonlinear deblurring filter is the root filter (see Section 6.2.1), which is also used for image enhancement.
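The difference between the two stabilized inverses is easiest to see at a single frequency. The sketch below assumes a real, nonnegative transfer function value H (as for a Gaussian blur) and positive thresholds; the function names and the sample values are illustrative.

```python
def stabilized_inverse(H, eps1):
    """Standard stabilized inverse: frequencies with small H are discarded."""
    return 1.0 / H if H >= eps1 else 0.0


def pseudo_inverse(H, eps1, eps2):
    """Variant used in the simulations: small H is regularized by eps2."""
    return 1.0 / H if H >= eps1 else 1.0 / (H + eps2)


strong = pseudo_inverse(0.5, eps1=0.1, eps2=0.05)  # plain inverse, 1 / H
weak = pseudo_inverse(0.0, eps1=0.1, eps2=0.05)    # bounded by 1 / eps2
```

Where the standard version sets weak frequencies to zero, the second version amplifies them by at most 1/ε2, so ε2 limits the gain applied to frequencies that the blur has nearly removed.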

6.3.3 Denoising

Denoising may be considered a particular restoration method when the PSF of the blur

operator is a delta function. Thus, some of the methods described above are appropriate

for denoising (Rudin et al., 1992; You and Kaveh, 1996). We also consider two examples of rank algorithms (Yaroslavsky and Eden, 1996), which are designed especially for noise reduction. They are based on statistics extracted from the variational row, that is, the sequence of the central pixel and its neighbors ranked in increasing order of intensity. Different definitions of the neighborhood and of the rank statistics lead to diverse rank algorithms. Rank statistics may also be obtained from local histograms and are rather computationally efficient when applied recursively. The main advantage of the rank algorithms is local adaptivity. Different denoising algorithms may also be applied in cascade.

First, we consider an averaging technique, called peer group averaging (PGA), in which

a central pixel intensity is replaced by an average intensity of some predefined neighboring

pixels, which are closest by intensity value. The number of pixels over which averaging is

performed is called the peer group size and it controls the amount of smoothing.
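A minimal sketch of peer group averaging in a 3 × 3 window is given below. Several details are our assumptions: the image is a list of rows, the central pixel itself belongs to the window, ties in intensity distance are broken by window order, and border pixels are left unchanged.

```python
def peer_group_average(image, ng):
    """3x3 peer group averaging with peer group size ng (1 <= ng <= 9)."""
    rows, cols = len(image), len(image[0])
    out = [list(row) for row in image]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            center = image[i][j]
            window = [image[i + di][j + dj]
                      for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            # Peer group: the ng window pixels closest in intensity to the center.
            peers = sorted(window, key=lambda p: abs(p - center))[:ng]
            out[i][j] = sum(peers) / ng
    return out


# A vertical edge between intensities 0 and 10.
edge = [[0, 0, 10, 10] for _ in range(4)]
smoothed = peer_group_average(edge, 5)
```

On this test image the pixels on both sides of the edge keep their values, because each one is averaged only with neighbors from its own side; a plain 3 × 3 mean would blur the edge.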

The second method, the median filter, replaces the gray level intensity of each pixel by the median of its neighboring pixel intensities. This method is particularly effective when the noise is spike-like. It is nonlinear, very robust, and preserves edge sharpness.
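A 3 × 3 median filter can be sketched as follows. As assumptions for the illustration, the image is a list of rows, the window includes the central pixel, and border pixels are left unchanged.

```python
import statistics


def median_filter(image):
    """Replace every interior pixel by the median of its 3x3 window."""
    rows, cols = len(image), len(image[0])
    out = [list(row) for row in image]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            window = [image[i + di][j + dj]
                      for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            out[i][j] = statistics.median(window)
    return out


# A flat patch with a single "salt" spike in the middle.
patch = [[100] * 5 for _ in range(5)]
patch[2][2] = 255
cleaned = median_filter(patch)
```

The isolated spike is removed completely, while the surrounding pixels are unchanged, which is the behavior that makes the median filter suitable for pulse noise.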

6.4 Results

Our experiments have shown that training with both schemes B and C (see Section 6.1.1)

leads to recognition improvement compared with the training scheme A. We have also observed that scheme B is superior to scheme C, but the difference between them is insignificant. Therefore, below we concentrate on ensembles obtained using the two training schemes A and B, and postpone the summary comparison of all three schemes until Section 6.4.6. All experiments are carried out on the TAU facial data-set.

6.4.1 Image filtering

In the first group of experiments, the abilities of different ensembles to classify images

processed by ideal and some typical low, band and high pass filters have been compared.

Classification results are presented in Table 6.1 and some degraded images in Figure 6.3.

Classification results for filtered data (A: training scheme A; B: training scheme B with extra blurred images)

| Types of corruption | A: Unconstrained (λ = 0) | A: Reconstruction ensemble | A: Joined ensemble | B: Unconstrained (λ = 0) | B: Reconstruction ensemble | B: Joined ensemble |
|---|---|---|---|---|---|---|
| “Clean data” | 12.8 | 12.8 | 13.5 | 9.5 | 10.8 | 8.8 |
| Ideal low-pass, cutoff w = 10 | 15.5 | 14.2 | 13.5 | 9.5 | 10.1 | 9.5 |
| Gaussian blur with σ = 2 | 19.6 | 16.2 | 16.9 | 14.2 | 11.5 | 10.8 |
| Out-of-focus blur with r = 5 | 20.9 | 20.9 | 17.6 | 16.2 | 10.8 | 11.5 |
| Motion blur in the diagonal direction with d = 5 | 32.4 | 26.4 | 26.4 | 29.7 | 24.3 | 19.6 |
| Motion blur in the horizontal direction with d = 5 | 21.6 | 24.3 | 19.6 | 16.9 | 14.9 | 12.2 |
| Ideal band-pass, 3 < ‖w‖ ≤ 10 | 41.2 | 41.2 | 35.8 | 39.2 | 31.8 | 28.4 |
| DOG filter with σ1 = 1 and σ2 = 2 | 31.8 | 26.4 | 23.6 | 23.0 | 26.4 | 20.9 |
| Ideal high pass, ‖w‖ > 3 | 39.2 | 35.1 | 32.4 | 33.8 | 31.8 | 27.7 |
| Root filter with α = 0.6 | 16.9 | 17.6 | 12.8 | 10.1 | 10.8 | 8.1 |
| Root filter with α = 0.8 | 12.8 | 13.5 | 12.8 | 9.5 | 9.5 | 8.1 |

Table 6.1: Percent classification error for filtered data (TAU data set)

Low-pass filtering

We have considered the ideal low-pass filter with cutoff w = 10 , the Gaussian blur with

standard deviation σ = 2, motion blur in diagonal and horizontal directions and the

out-of-focus blur, all with blur propagation on 5 pixels.

We note that for each of the training schemes A and B, the unconstrained (λ = 0) ensemble is, in most cases, inferior to the reconstruction and joined ensembles in recognition of blurred images. For

example, for Gaussian blurred images the unconstrained ensemble of the training scheme

A yields the misclassification rate of 19.6%, while the reconstruction ensemble produces


16.2%. For ensembles trained with the training scheme B, the misclassification rate falls

from 14.2% for the unconstrained ensemble to 11.5% for the reconstruction ensemble.

Merging the unconstrained and reconstruction ensembles improves the classification results further. For example, for out-of-focus images, the joined ensemble of the training scheme A has a misclassification rate of 17.6%, while the reconstruction ensemble produces 20.9%. For diagonal motion, the joined ensemble of the training scheme B has a misclassification rate of 19.6%, compared with 24.3% for the reconstruction ensemble.

In summary, reconstruction ensembles often give better classification results than unconstrained ensembles, and joined ensembles improve classification further.

Band-pass filtering

Band-pass filtering is represented by the DOG filter with the sizes of the on and off receptive fields equal to 1 and 2 pixels respectively, and by ideal band-pass filtering with inner and outer cutoff radii equal to 3 and 10 respectively. Our experiments show that joined ensembles are better than reconstruction ensembles, which in most cases are better than unconstrained (λ = 0) ensembles. For example, for the training scheme A with testing on DOG filtered images, the misclassification rate falls from 31.8% for the unconstrained ensemble to 26.4% for the reconstruction ensemble, and then to 23.6% for the joined ensemble. For the training scheme B the reconstruction ensemble is inferior to the unconstrained ensemble, but the joined ensemble is superior: its misclassification rate is 2.1 percentage points lower than that of the unconstrained λ = 0 ensemble. Finally, the joined ensemble of the scheme B reduces the misclassification rate by 10.9 percentage points in comparison with the classical unconstrained ensemble of the training scheme A.

High-pass filtering

High-pass filtering is represented by the ideal high-pass filter with cutoff w = 3 and by the root filter. Though images degraded with the ideal high-pass filter bear a resemblance to the original images (Figure 6.3i), they are difficult to recognize. The smallest misclassification rate on this data is achieved by the joined ensemble of the training scheme B (27.7%). When the degradation is milder, recognition improves, and mild enhancement may even be useful: for the joined ensembles, classification results on root filtered images are slightly better than the results for “clean” images. Surprisingly, humans also recognize slightly enhanced images better than the original images. Remarkably, joined ensembles are best in recognition of differently degraded images.


6.4.2 Classification of noisy data

In the following section, we shall test the performance of our scheme under realistic noise

and blur degradations. We first test the performance under various noise operations on

non-blurred objects in order to have a base line for comparison with the blurred results.

Results of an ensemble of networks on noisy and restored images are presented in Table 6.2.

Two kinds of noise, “Salt and Pepper” and Gaussian, each at a small and a large level, are considered. “Salt and Pepper” noise is applied with density parameters d = 0.2 and d = 0.6. Gaussian noise corresponds to snr = 10 and snr = 1. A median filter with a window of size 3 × 3 is used to denoise images corrupted with “Salt and Pepper” noise. To denoise images degraded with Gaussian noise, peer group averaging (PGA) has been used, with a window of size 3 × 3 and group size ng = 5 for snr = 10 and ng = 6 for snr = 1.

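Noise and Restoration (A: training scheme A; B: training scheme B)

| Types of corruption | A: Unconstrained (λ = 0) | A: Reconstruction ensemble | A: Joined ensemble | B: Unconstrained (λ = 0) | B: Reconstruction ensemble | B: Joined ensemble |
|---|---|---|---|---|---|---|
| “Salt and Pepper” noise with d = 0.2 | 25.0 | 20.3 | 20.3 | 20.3 | 18.2 | 14.2 |
| Median filter denoising | 13.5 | 12.8 | 12.8 | 9.5 | 8.8 | 8.8 |
| “Salt and Pepper” noise with d = 0.6 | 70.3 | 66.9 | 69.6 | 81.8 | 76.4 | 74.3 |
| Median filter denoising | 25.0 | 20.3 | 21.6 | 20.9 | 20.9 | 14.9 |
| Gaussian noise with snr = 10 | 13.5 | 13.5 | 12.8 | 8.1 | 10.8 | 8.1 |
| PGA denoising with ng = 5 | 13.5 | 14.9 | 13.5 | 8.8 | 10.8 | 8.8 |
| Gaussian noise with snr = 1 | 15.5 | 16.9 | 12.8 | 10.1 | 10.8 | 8.1 |
| PGA denoising with ng = 6 | 14.9 | 15.5 | 12.8 | 10.8 | 12.8 | 8.8 |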

Table 6.2: Percent classification error for noisy data (TAU data set)

Examples of noisy and restored images are presented in Figure 6.4. We note that

classification is more sensitive to “Salt and Pepper” noise than to Gaussian noise, which

may be explained by the quasi-linear type of MLP network transformations.

For “Salt and Pepper” noise of density d = 0.6, 60% of the image pixel intensities are replaced by extreme intensity values, which leads to a very high misclassification rate. Additional preprocessing by the median filter significantly improves classification and gives a moderate misclassification rate of 14.9% for the best ensemble, the joined ensemble of the training scheme B.

Sensitivity of the network ensembles to Gaussian noise is small. Moreover, the joined


Noisy Images (panels: a, b, c, d)

Figure 6.4: a) An image contaminated with “Salt and Pepper” noise at 20% corruption. b) Result of median smoothing in a window of size 3 × 3. c) An image contaminated with Gaussian noise with snr = 1. d) Result of peer group averaging in a window of size 3 × 3 with a peer group of size ng = 6.

ensembles of both schemes A and B are insensitive to Gaussian noise, and denoising carried out beforehand even slightly degrades the classification results.

6.4.3 Gaussian blur

The classification results for Gaussian blurred images without noise interference and for

their restored images are presented in Table 6.3. The Gaussian operator has the standard

deviation equal to σ = 2.

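Gaussian Blur and Restoration (A: training scheme A; B: training scheme B)

| Types of corruption | A: Unconstrained (λ = 0) | A: Reconstruction ensemble | A: Joined ensemble | B: Unconstrained (λ = 0) | B: Reconstruction ensemble | B: Joined ensemble |
|---|---|---|---|---|---|---|
| Gaussian blur with σ = 2 | 19.6 | 16.2 | 16.9 | 14.2 | 11.5 | 10.8 |
| Pseudoinverse filter with σ = 1.5 | 15.5 | 13.5 | 14.2 | 8.1 | 10.1 | 8.8 |
| Pseudoinverse filter with σ = 2.0 | 13.5 | 13.5 | 12.8 | 9.5 | 10.8 | 8.8 |
| Pseudoinverse filter with σ = 2.5 | 15.5 | 15.5 | 12.8 | 9.5 | 10.1 | 7.4 |
| Root filter with α = 0.6 | 12.8 | 13.5 | 12.8 | 14.9 | 12.8 | 10.8 |
| Root filter with α = 0.8 | 13.5 | 14.2 | 14.2 | 12.2 | 12.2 | 8.8 |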

Table 6.3: Percent classification error for deblurred data

The unconstrained λ = 0 ensemble of the training scheme A is the most sensitive to the Gaussian blur, and the joined ensemble of the training scheme B performs best.

For deblurring, pseudo-inverse and root filters have been used. For the pseudo-inverse filter, the standard deviation of the Gaussian kernel is assumed to be known only approximately. The inverse Gaussian operator with an approximated standard deviation σ̂ in the frequency domain is given by:

$$H^{-1}_{\hat\sigma}(w) = \exp(2\pi^2 \hat\sigma^2 w^2). \tag{6.4.13}$$

Thus two main cases exist. In the first case, the guessed value is less than the original (σ̂ < σ) and the image remains partially blurred with a Gaussian filter. In the second case, the guessed value exceeds the original (σ̂ > σ), which corresponds to filtering with a high-pass filter that is given in the frequency domain by:

$$H_\beta(w) = \exp(2\pi^2 \beta^2 w^2), \qquad \beta = \sqrt{\hat\sigma^2 - \sigma^2}. \tag{6.4.14}$$

This analysis does not consider computational problems connected with the asymptotic behavior of $H^{-1}_{\hat\sigma}(w)$ as w tends to infinity.
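The two cases can be checked numerically by dividing the transfer function of the true Gaussian blur by that of a Gaussian with the guessed standard deviation. The sketch assumes the frequency response exp(−2π²σ²w²) of a spatial Gaussian with standard deviation σ; the function names are ours.

```python
import math


def gaussian_transfer(w, sigma):
    """Frequency response of a spatial Gaussian with standard deviation sigma."""
    return math.exp(-2.0 * math.pi ** 2 * sigma ** 2 * w ** 2)


def residual_after_deblurring(w, sigma, sigma_guess):
    """Response left after inverse filtering with a guessed standard deviation."""
    return gaussian_transfer(w, sigma) / gaussian_transfer(w, sigma_guess)


w = 0.1
under = residual_after_deblurring(w, 2.0, 1.5)  # guess too small
over = residual_after_deblurring(w, 2.0, 2.5)   # guess too large
beta = math.sqrt(2.5 ** 2 - 2.0 ** 2)
```

With an underestimated guess the residual is below one, that is, a remaining Gaussian blur with standard deviation √(σ² − σ̂²); with an overestimated guess it exceeds one and equals exp(2π²β²w²) with β = 1.5 in this example.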

Classification results for pseudo-inverse filtered images are presented in Table 6.3 in the rows marked “Pseudoinverse filter”, and restored images are given in Figure 6.5 (d-f). The pseudo-inverse filter has been applied three times, with approximated standard deviations σ̂ = 1.5, 2 and 2.5. As expected, deblurring improves the classification results and

Gaussian blur and restoration (panels: a, b, c, d, e, f)

Figure 6.5: a) Image blurred with Gaussian filter with standard deviation σ = 2. b) Enhancement with root filter with α = 0.8. c) Enhancement with root filter with α = 0.6. d) Pseudo-inverse filter with guessed σ = 1.5. e) Pseudo-inverse filter with guessed σ = 2. f) Pseudo-inverse filter with guessed σ = 2.5.

the best results are obtained by the joined ensemble trained with the scheme B. We note that both joined ensembles classify pseudo-inverse deblurred images with guessed σ = 2.5 slightly better than “clean” data. We have observed a similar behavior for high-pass filtered data. A simple enhancement with the root filter also improves the classification results.

6.4.4 Motion blur

Motion Blur and Restoration (A: training scheme A; B: training scheme B)

| Types of corruption | A: Unconstrained (λ = 0) | A: Reconstruction ensemble | A: Joined ensemble | B: Unconstrained (λ = 0) | B: Reconstruction ensemble | B: Joined ensemble |
|---|---|---|---|---|---|---|
| d = 5 pixels, snr = inf (no noise) | 21.6 | 24.3 | 19.6 | 16.9 | 14.9 | 12.2 |
| deblurring | 12.8 | 12.8 | 12.8 | 9.5 | 10.8 | 8.8 |
| d = 5 pixels and Gaussian noise, snr = 100 | 20.9 | 24.3 | 19.6 | 16.2 | 15.5 | 12.2 |
| deblurring | 13.5 | 14.2 | 13.5 | 9.5 | 10.8 | 8.8 |
| d = 5 pixels and Gaussian noise, snr = 10 | 21.6 | 23.6 | 19.6 | 16.9 | 15.5 | 12.2 |
| smoothing and deblurring | 14.9 | 14.9 | 14.2 | 9.5 | 10.8 | 9.5 |
| d = 7 pixels, snr = inf (no noise) | 27.0 | 29.1 | 23.6 | 20.3 | 23.0 | 16.2 |
| blind deconvolution | 13.5 | 15.5 | 12.8 | 10.8 | 11.5 | 9.5 |

Table 6.4: Percent misclassification rate for motion blurred and restored images. Motion takes place in the horizontal direction and Gaussian noise is added. Motion propagation is given as a parameter d. The noise level is indicated as a signal-to-noise ratio snr; if noise is absent, snr = inf. MSE minimization with adaptive Tikhonov regularization is used for restoration. Lines marked with “deblurring” stand for deblurring with a known blur operator.

Table 6.4 presents classification results for images degraded as a result of horizontal

motion and additive Gaussian noise (Figure 6.6). As expected, classification deteriorates as the blur propagation increases. As we have already seen, the influence of noise is less dramatic, in particular for joined ensembles. The negative role of the noise is instead revealed during image restoration. The blur propagation may be estimated from the well-defined periodic structure of the zero lines of the motion filter in the frequency domain. However, this method is highly sensitive to noise.

For restoration, MSE minimization with adaptive Tikhonov regularization is used. In all experiments with motion propagation of d = 5 pixels, the motion filter is assumed to be known. For noise degradation with snr = 10 (10 dB), a simple smoothing (averaging) in a window of size 3 × 3 pixels is carried out before restoration. Classification after deblurring of images degraded with small noise, snr = 100 (20 dB), is the same as for “clean”


Motion blur and deblur (panels: a, b, c, d)

Figure 6.6: a) Motion blur with propagation on 5 pixels and Gaussian noise with snr = 10. b) Motion deblur using the constrained regularization method with the known blur filter and with simple averaging in a 3 × 3 window before its application. c) Motion blur with blur propagation equal to 7 pixels. d) Blindly restored image.

images for both joined ensembles. For larger noise with snr = 10 (10 dB), classification is slightly worse.

To restore the images blurred as a result of motion with blur propagation parameter d = 7 pixels in the absence of noise, Tikhonov regularization for both the image and the filter is applied. Since the direction of motion blur can be estimated more easily than the motion propagation parameter, it is assumed to be known. The kernel support of the blur filter is taken to be 9 pixels in the motion direction. The initial guesses are the observed blurred image for the image and a delta function for the blurring operator. The results of this experiment are presented in the last two rows of Table 6.4. Though deblurred images differ slightly visually from the “clean” data, their classification is nearly the same as for “clean” data. The joined ensemble obtained using the training scheme B is the best in classification of motion blurred and restored images. The classification results for images blurred with a Gaussian filter and contaminated with Gaussian noise, along with their deblurring by blind deconvolution, are presented below.

6.4.5 Blind deconvolution

This section presents classification results for blindly deconvolved images. The blurred images are obtained as a spatial convolution of the original images with a Gaussian kernel with standard deviation σ = 2, pruned (truncated) to a support of 7 × 7 pixels. Blind deconvolution is done using the regularization approach to image identification and restoration (You and Kaveh, 1996). The filter and image are assumed to be positive, and the kernel support is taken to be 15 × 15 pixels. The sum of the filter kernel coefficients and the total image intensity are normalized to 1. The initial guess for the image is the degraded face, and we start from a delta function filter; no symmetry constraints (Chan and Wong, 1997) are used. The regularization parameters are set by hand from visual appearance, once for all images. An image blurred with a truncated Gaussian filter and contaminated with Gaussian noise of snr = 100, together with its blind deconvolution, is presented in Figure 6.7. Classification results for two cases, with and without noise

Blind deconvolution (panels: a, b, c, d)

Figure 6.7: a) Image blurred with Gaussian filter with standard deviation σ = 2, pruned to a support area 7 × 7, and Gaussian noise with snr = 100. b) Blind deblurring of the degraded image. c) Original filter. d) Found filter, pruned to the same support as the original filter.

interference, are presented in Table 6.5.

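Blind Deconvolution (A: training scheme A; B: training scheme B)

| Types of corruption | A: Unconstrained (λ = 0) | A: Reconstruction ensemble | A: Joined ensemble | B: Unconstrained (λ = 0) | B: Reconstruction ensemble | B: Joined ensemble |
|---|---|---|---|---|---|---|
| Blur with pruned Gaussian filter | 18.2 | 16.2 | 17.6 | 10.1 | 10.1 | 10.8 |
| Blind deconvolution | 12.8 | 13.5 | 12.8 | 9.5 | 10.8 | 8.1 |
| Blur with pruned Gaussian filter and Gaussian noise, snr = 100 | 18.9 | 16.2 | 18.2 | 10.8 | 10.1 | 10.8 |
| Blind deconvolution | 12.8 | 14.2 | 12.8 | 10.1 | 10.1 | 8.8 |

Table 6.5: Percent misclassification rate for blurred and blindly deblurred images. The images are blurred with pruned Gaussian filter.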

We note that among the ensembles obtained with the training scheme A, the reconstruction ensemble is the least sensitive to blurring and noise. Ensembles obtained with the training scheme B are less sensitive to noise and blur than those obtained with the training scheme A. The joined ensemble obtained with the training scheme B has the best classification performance.


6.4.6 All training schemes

Recognition of blurred images via schemes A–C (bar graph; vertical axis: misclassification rate, %, from 0 to 30; bars Bd, B, Cd, C, Ad, A for each degradation type a–f)

Figure 6.8: Percent classification error bar graph for reconstructed images. Regression ensembles A-C correspond to joined ensembles obtained using recognition schemes A-C, respectively. Heights of the bars marked with Ad-Cd show misclassification of ensembles A-C respectively on restored images. See also corresponding Table 6.6 for description of degradation types a-f.

Summary classification results for joined ensembles, corresponding to all training

schemes A–C, are presented in Figure 6.8 and Table 6.6. First, we observe that ensembles

of networks trained using the expanded training data-set are superior to the

joined ensemble trained without blurred images. Secondly, we note that ensembles

B and C have about the same classification performance, with ensemble B slightly better.

This may be explained by the drastic compression rate, which causes reconstructed images

to look blurred in both cases and makes the two types of reconstruction

constraints similar (see Figure 6.9).

The third and important observation is that training with blurred images seems to be

more important than restoration preprocessing. Indeed, recognition of restored images

using scheme A (column and bars marked with Ad) is inferior to degraded image recognition

using schemes B–C. However, as was already noted, usage of image preprocessing


Blurred image recognition via joined ensembles

Image degradation types                          Joined ensembles
and deblur type                          Bd      B     Cd      C     Ad      A

a) Clean and root filter α = 0.8        8.1    8.8    9.5    8.1   12.8   13.5
b) Gaussian blur σ = 2,
   root filter α = 0.8                  8.8   10.8    8.8   10.8   14.2   16.9
c) Truncated Gaussian blur σ = 2
   and Gaussian noise snr = 20 dB,
   and blind deconvolution              8.8   10.8    8.1   10.1   12.8   18.2
d) Motion blur d = 7
   and blind deconvolution              9.5   16.2   10.8   16.9   12.8   23.6
e) Out-of-focus blur with a = 5           *   11.5      *   12.8      *   17.6
f) DoG filter σ1 = 1 and σ2 = 2           *   20.9      *   20.9      *   23.6

Table 6.6: Percent classification error for reconstructed images. Regression ensembles A-C correspond to joined ensembles obtained using recognition schemes A-C, respectively. Columns marked with Ad-Cd show misclassification of ensembles A-C respectively on restored images. Entries for which the restoration process was not performed are marked with *. Experiments with the TAU data-set.

before recognition schemes leads to improved classification results.

6.5 Conclusions

Two ways to address the challenging problem of blurred image recognition were proposed:

(i) Preprocess the blurred images using blind deconvolution methods before recognition;

(ii) Apply our regularized reconstruction constraints technique (Chapter 3) to a training

set that has been expanded by blurred images of some form. This forces the reconstruction

operator that is estimated during training to become less sensitive to the blur operation.

For this reason, training without reconstruction using the expanded training set does not

improve results.

Two training schemes with and without blurred images have been compared and dif-

ferent network ensembles have been considered. The best classification scheme is the

scheme that includes both the hybrid recognition/reconstruction architecture and usage

of blurred images. The best network ensemble is the joined ensemble, obtained by merging

the unconstrained and the reconstruction ensembles trained with blurred images.


We have shown that the combination of both ways, restoration and regularized

classification, is superior to either one separately. Since restoration techniques

are very sensitive to noise and require a priori knowledge or visual human interaction, it is

important that the hybrid classification/reconstruction is less sensitive to the restoration

parameters.


Figure 6.9: Reconstruction of Gaussian blurred images by Neural Networks obtained using training schemes B–C. Images in the top row from left to right are an original image, its “caricature” image and Gaussian blurred image. In the middle row, images reconstructed by Neural Networks with λ = 0.05, 0.15, 0.25 and with reconstruction defined by the training scheme B are presented. In the bottom row, images reconstructed by Neural Networks with λ = 0.05, 0.15, 0.25 and with reconstruction defined by the training scheme C are presented. Note that though images in the middle row are sharper than images in the bottom row, they nevertheless look blurred.

Chapter 7

Summary and future work

In this final chapter, we summarize the main contributions of the thesis and present several

possible directions for future work.

7.1 Summary

Our primary goal in this thesis was to improve the performance of a high dimensional

image recognition task, by extracting a good hidden representation of the image data.

We developed several approaches to achieve good generalization in image recognition.

First, we developed a novel hybrid feed-forward reconstruction/recognition network

architecture, with two output sublayers for reconstruction and recognition, and one common

hidden layer shared by both tasks (Chapter 3). The network was trained to concurrently

minimize the MSE of the reconstruction and recognition output sublayers.

Although a similar architecture was used previously (see Section 3.1.5), we were the first to use

it for improving image recognition, and we gave a new interpretation of the hybrid network

as a tool to control bias by imposing a novel type of reconstruction bias constraints. In

addition, we introduced a trade-off parameter λ that defines the influence of each of the

tasks and is unknown a priori. We have considered networks with different values of λ,

instead of considering only a single value, as has been proposed previously.
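
The hybrid architecture and its trade-off parameter can be illustrated with a minimal Python sketch. It assumes a single shared tanh hidden layer, a linear reconstruction sublayer, a sigmoid recognition sublayer, and a cost of the form (1 − λ)·E_recognition + λ·E_reconstruction, so that λ = 0 gives a pure recognition network. The exact weighting and activation functions used in Chapter 3 are not restated in this chapter, so these choices, like all function names below, are assumptions made for illustration only.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def mse(a, b):
    """Mean squared error between two equally long vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def hybrid_forward(x, W_hidden, W_rec, W_cls):
    """One shared hidden layer feeding two output sublayers."""
    hidden = [math.tanh(a) for a in matvec(W_hidden, x)]
    x_rec = matvec(W_rec, hidden)  # reconstruction sublayer (linear, assumed)
    y_cls = [1.0 / (1.0 + math.exp(-a)) for a in matvec(W_cls, hidden)]  # recognition sublayer
    return x_rec, y_cls

def hybrid_loss(x, label, weights, lam):
    """Assumed cost: (1 - lam) * recognition MSE + lam * reconstruction MSE."""
    x_rec, y_cls = hybrid_forward(x, *weights)
    return (1.0 - lam) * mse(y_cls, label) + lam * mse(x_rec, x)

def random_matrix(rows, cols, rng):
    """Small random weight matrix, used only to make the example runnable."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_pixels, n_hidden, n_classes = 6, 3, 2  # toy sizes; the bottleneck is n_hidden < n_pixels
weights = (
    random_matrix(n_hidden, n_pixels, rng),
    random_matrix(n_pixels, n_hidden, rng),
    random_matrix(n_classes, n_hidden, rng),
)
x = [rng.random() for _ in range(n_pixels)]
label = [1.0, 0.0]

# The same network evaluated under several trade-off values.
losses = {lam: hybrid_loss(x, label, weights, lam) for lam in (0.0, 0.05, 0.15, 0.25)}
```

Training one network per value of λ, rather than selecting a single value, is what produces the family of predictors that is later combined into an ensemble.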

In addition, the network and its learning rule were interpreted in the MDL and

Bayesian frameworks. In the Bayesian formulation, the network is trained to maximize the

conditional joint probability of the reconstructed image and its class label given the

observed image. In the proposed architecture, the reconstructed image and its class label

are independent given the observed image; under the assumption of a Gaussian

distribution of the errors, this maximization leads to the proposed learning rule. The trade-off

parameter λ emerges as a hyper-parameter and, according to Bayesian theory, the

right approach is to integrate predictors over this parameter. If the initial weights of the

feed-forward network are also considered as hyper-parameters, then the predictor f is

given by:

$$f(x) = \int\!\!\int f_{\lambda, w_0}(x)\, p(\lambda, w_0 \mid \mathcal{X})\, dw_0\, d\lambda \qquad (7.1.1)$$

This interpretation has led us to the second approach to improve image recognition.

We have proposed to replace the integration in Eq. 7.1.1 by a rough approximation via

ensemble network averaging. Networks with a good recognition performance were included

in the ensemble and their posterior probabilities p(λ,w0|X ) were assumed to be equal.
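
The equal-weight approximation of Eq. 7.1.1 can be sketched in a few lines of Python. Each ensemble member is represented here by a function that maps an input to a vector of class outputs; the toy members and the names `ensemble_predict` and `classify` are hypothetical and serve only to make the example runnable.

```python
def ensemble_predict(members, x):
    """Average the class-output vectors of all members with equal weights."""
    outputs = [member(x) for member in members]
    n_members = len(outputs)
    return [sum(column) / n_members for column in zip(*outputs)]

def classify(members, x):
    """Return the index of the class with the largest averaged output."""
    averaged = ensemble_predict(members, x)
    return max(range(len(averaged)), key=averaged.__getitem__)

# Toy members standing in for trained networks f_{lambda, w0};
# each ignores its input and returns fixed class outputs.
members = [
    lambda x: [0.2, 0.8],
    lambda x: [0.6, 0.4],
    lambda x: [0.1, 0.9],
]
averaged = ensemble_predict(members, x=None)
predicted_class = classify(members, x=None)
```

Assigning every member the same weight corresponds to the assumption of equal posterior probabilities; the integral over λ and w0 is replaced by a finite sum over the trained networks.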

It is well known that ensemble averaging can reduce the variance portion of the

prediction error. We have considered three ensemble types (Chapter 5):

• Unconstrained ensemble, which corresponds to integration over w0 for λ = 0

• Reconstruction ensemble, which corresponds to integration over λ for fixed w0

• Joined ensemble, which is a combination of unconstrained and reconstruction en-

sembles and corresponds to integration over both parameters

We have shown that the joined ensemble is superior to the reconstruction ensemble, which

in turn is superior to the unconstrained ensemble, in recognition of images degraded by

Gaussian and pulse noises as well as by partial occlusion or image blur.
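
The membership of the three ensemble types can be made concrete with a small Python sketch in which every network is identified by a pair (λ, seed). The seed is a hypothetical stand-in for the initial weights w0, and the λ values are example values of the kind quoted in the caption of Figure 6.9; neither is meant to reproduce the actual ensemble sizes used in the experiments.

```python
def build_ensembles(lambdas, seeds, fixed_seed):
    """Return the member lists of the three ensemble types as (lambda, seed) pairs."""
    unconstrained = [(0.0, seed) for seed in seeds]          # vary w0, lambda = 0
    reconstruction = [(lam, fixed_seed) for lam in lambdas]  # vary lambda, fixed w0
    joined = unconstrained + reconstruction                  # merge of both ensembles
    return unconstrained, reconstruction, joined

unconstrained, reconstruction, joined = build_ensembles(
    lambdas=[0.05, 0.15, 0.25],  # example trade-off values
    seeds=[0, 1, 2],             # hypothetical seeds standing in for w0
    fixed_seed=0,
)
```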

Our third contribution specifically concerns improving recognition of blurred images.

It is well known that in many practical recognition tasks, images appear blurred due to

motion, bad weather conditions and defocusing of cameras. Three ways were proposed

for improving blurred image recognition:

1. Expanding the training set with Gaussian blurred images

2. During training, constraining reconstruction of the blurred images to the original

clean images

3. Application of state-of-the-art restoration methods to the blurred images before

using the hybrid architecture

The first two ways have led to two additional joined ensembles that we trained with extra

Gaussian blurred images and reconstruction constraints. Reconstruction was either to the blurred

image or to the clean (non-blurred) image (Chapter 6). We have shown that ensembles that

were trained on extra (blurred) images had improved recognition performance on different

image degradation types. In addition, we have shown that training with extra images

combined with restoration techniques achieved the most robust and the best recognition performance

under a wide range of blur operators and parameters.
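
The first two ways listed above can be sketched as a data-preparation step. The following Python fragment is a minimal illustration, assuming images stored as lists of rows, a separable Gaussian blur with replicated borders, and a hypothetical `target` switch that selects whether a blurred input is reconstructed to the clean image or to the blurred image itself. It is not the thesis implementation, and it deliberately does not say which of the two settings corresponds to which training scheme.

```python
import math

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, replicating border pixels."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for j in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                jj = min(max(j + k - radius, 0), width - 1)
                acc += w * row[jj]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred

def transpose(image):
    return [list(column) for column in zip(*image)]

def gaussian_blur(image, sigma=2.0, radius=3):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel_1d(sigma, radius)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))

def expand_training_set(images, labels, sigma=2.0, radius=3, target="clean"):
    """Return (input, reconstruction_target, label) triples.

    Every clean image is kept and a Gaussian blurred copy is added.
    target="clean" reconstructs blurred inputs to the clean image;
    target="blurred" reconstructs them to the blurred image itself.
    """
    if target not in ("clean", "blurred"):
        raise ValueError("target must be 'clean' or 'blurred'")
    samples = []
    for image, label in zip(images, labels):
        samples.append((image, image, label))
        blurred = gaussian_blur(image, sigma, radius)
        rec_target = image if target == "clean" else blurred
        samples.append((blurred, rec_target, label))
    return samples

# Toy 3 x 3 "image" with a hypothetical class label.
images = [[[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]]
samples = expand_training_set(images, ["person_1"], target="clean")
```

The class label of a blurred copy is always that of the original image, so the recognition task is unchanged; only the reconstruction target differs between the two settings.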

An additional contribution of the thesis is the development of hybrid networks with unsupervised

learning constraints (Chapter 5), which were mainly used for comparison with reconstruction

constraints. We have shown that these constraints can also be used for improving

the recognition performance instead of, or in parallel with, reconstruction constraints.

In addition, we addressed the issue of network interpretability by investigating the

network hidden representation and hidden weights (Appendix 5.7), and by constructing the

saliency map (Section 5.5). In contrast to an explicit understanding of what information

is encoded in the hidden space, the saliency map allows one to decide which features in the

input are more important. We showed that usage of the saliency maps further improves

recognition of images degraded with “Salt and Pepper” noise.

7.2 Directions for future work

Non-face data sets We have tested the proposed hybrid system on facial data sets.

Faces, however, are a special type of stimuli where all pixels are important (Biederman and

Kalocsai, 1997). It would be interesting to test the hybrid architecture performance on

data sets of similar objects, such as military images (different kinds of tanks, ships, cars,

etc.), medical images (different kinds of tumor cells) and astronomical images (images of

different stars and galaxies).

Ensemble interpretation In Appendix 5.7, hidden representations of single Neural

Networks with reconstruction constraints were investigated. In addition, it was noted

that the hidden representation of a network ensemble is not well defined. However, another form

of interpretation, using the mean derivative (over networks and images) with respect to

the inputs for each of the classes (Intrator and Intrator, 1993), may be very interesting.

Recurrent network architecture Images reconstructed by Neural Networks (which

we called prototypes, see Section 5.5) are reduced representations of the original images,

since a drastic compression occurs via the bottleneck architecture (see Figure 5.5). However,

as can be seen, prototypes corresponding to the same class look similar, while

prototypes corresponding to different classes look different. It is also clear that a good

reconstruction/recognition network has to be able to recognize its own prototype images.

Table 7.1 presents the recognition performance of the unconstrained and reconstruction ensembles

(see Chapter 3) when they are tested on the prototype images. These results


Classification error for reconstructed images

Types of degradation                            Regression Ensembles
                                   Unconstrained   Reconstruction A   Reconstruction B

“Clean data”
   input                                0.5              1.5               1
   prototype                            1.5              2.1               2.6
“Salt and Pepper” noise with d = 0.2
   input                                7.2             12.8              13.3
   prototype                            7.2             11.8              12.8
“nose” occlusion
   input                                0.5              1.5               1.5
   prototype                            1.5              2.1               2.1
“half face” occlusion
   input                                5.6              5.6               6.2
   prototype                            6.2              6.7               7.2
“DOG 1-2”
   input                                7.7              4.1               8.2
   prototype                            8.2              4.1               8.2

Table 7.1: Errors are given in percent (Pentland data-set).

show that networks are better at recognizing the original images than their own prototypes.

This can be corrected by propagating reconstructed images back to the input

layer during learning. In other words, during learning we propose to extend the training

set with extra images x_e, which are a linear combination of the input x and its prototype

image x_p:

$$x_e = \rho(t)\,x + (1 - \rho(t))\,x_p, \qquad \rho \in [0, 1],$$

where ρ(t) is a non-increasing function of the training epoch number t, equal to 1 at the

beginning and tending to 0 as t goes to infinity. This procedure may give better results, which should be

tested by simulation.
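
The proposed mixing of an input with its prototype can be sketched as follows. The text only requires ρ(t) to be non-increasing, equal to 1 at the start and approaching 0 for large t; the exponential decay ρ(t) = exp(−t/τ) and the time constant `tau` used below are assumptions chosen for illustration, as are the function names.

```python
import math

def rho(t, tau=10.0):
    """Assumed schedule: exponential decay from 1 at t = 0 towards 0."""
    return math.exp(-t / tau)

def mix_with_prototype(x, x_p, t, tau=10.0):
    """Return x_e = rho(t) * x + (1 - rho(t)) * x_p, element by element."""
    r = rho(t, tau)
    return [r * a + (1.0 - r) * b for a, b in zip(x, x_p)]

x = [1.0, 0.0]    # toy input image (flattened)
x_p = [0.0, 1.0]  # toy prototype, i.e. the network's reconstruction of x

early = mix_with_prototype(x, x_p, t=0)       # equals the input
late = mix_with_prototype(x, x_p, t=10_000)   # practically the prototype
```

Early in training the extra images coincide with the inputs, and they drift towards the prototypes as training proceeds, which is the behaviour described above.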

Network ensembles We considered ensembles corresponding to the simplest version

of the integration (7.1.1), with equal posterior probability p(λ,w0|X ). Though it is impossible

to find the posterior probability p(λ,w0|X ) analytically, it may be heuristically postulated.

Therefore, the integration (7.1.1) may be replaced by weighted network ensemble averaging.

We tried to use weights based on different error types between the input and output

reconstruction layers, such as the Euclidean metric or a correlation measure, and their soft

versions using the exp(−x) function. However, our preliminary experiments do not show

significant recognition improvement.
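
A weighted ensemble of this kind can be sketched in Python as follows. The sketch uses the Euclidean distance between the input and each network's reconstruction as the error, and the soft weighting exp(−error); normalizing the weights so that they sum to one is an assumption, and all function names are hypothetical.

```python
import math

def reconstruction_error(x, x_rec):
    """Euclidean distance between an input and its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_rec)))

def soft_weights(errors):
    """Weights proportional to exp(-error), normalized to sum to 1."""
    raw = [math.exp(-e) for e in errors]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_ensemble_predict(class_outputs, errors):
    """Weighted average of the members' class-output vectors."""
    weights = soft_weights(errors)
    n_classes = len(class_outputs[0])
    return [sum(w * out[c] for w, out in zip(weights, class_outputs))
            for c in range(n_classes)]

# Two toy members: class outputs and reconstructions of the same input.
x = [0.0, 1.0, 0.0]
class_outputs = [[0.9, 0.1], [0.2, 0.8]]
reconstructions = [[0.0, 0.9, 0.1], [0.5, 0.5, 0.5]]
errors = [reconstruction_error(x, r) for r in reconstructions]
prediction = weighted_ensemble_predict(class_outputs, errors)
```

With equal errors the weights are equal and the scheme reduces to plain ensemble averaging; a member that reconstructs the input poorly contributes less to the final decision.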

Since the hybrid networks solve both recognition and reconstruction tasks, it is reason-

able to use the ensemble of hybrid networks for reconstruction. The obtained prototype

may be used for recognition by all the networks.


Degradation invariance constraints We considered the simplest version of invariance

constraints: expanding the data with Gaussian blurred images. Another type of

invariance constraint is the tangent prop constraint, which was used for a group of geometrical

transformations (see Section 2.3.2). This type of constraint may be adapted for

different types of blur operations for both recognition and reconstruction tasks.

Generalization It would be interesting to generalize the hybrid architecture in the di-

rection taken by other generative models (Hinton and Ghahramani, 1997; Ullman, 1995).

Bibliography

Abu-Mostafa, Y. (1993). A method for learning from hints. In Touretzky, D. S. and Lipp-

mann, R. P., editors, Advances in Neural Information Processing Systems, volume 5,

pages 73–80. Morgan Kaufmann, San Mateo, CA.

Abu-Mostafa, Y. (1994). Learning from hints. J. of Complexity, 10:165–178.

Amari, S., Cichocki, A., and Yang, H. H. (1996). A new learning algorithm for blind

signal separation. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in

Neural Information Processing Systems, volume 8, pages 757–763. MIT Press.

Atick, J. (1992). Entropy minimization: A design principle for sensory perception. In-

ternational Journal of Neural Systems. Proceedings of the Neural Networks: From

biology to High Energy Physics, 3:81–90.

Atick, J. J., Griffin, P. A., and Redlich, A. N. (1995). Face recognition from live video for

real-world applications. Advanced Imaging.

Atick, J. J. and Redlich, N. (1992). What does the retina know about natural scenes.

Neural Computation, 4:196–210.

Baird, H. (1990). Document image defect models. In IAPR, Workshop on Syntactic and

Structural Pattern Recognition, pages 38–46. Murray Hill, NJ.

Baluja, S. (1996). Expectation-based selective attention. PhD thesis, School of computer

science, CMU.

Baluja, S. and Pomerleau, D. A. (1995). Using the representation in a neural network’s

hidden layer for task-specific focus of attention. In Proceedings of the International

Joint Conference on Artificial Intelligence, Montreal, Canada.

Barlow, H. B. (1961). Possible principles underlying the transformations of sensory messages.

In Rosenblith, W., editor, Sensory Communication, pages 217–234. MIT Press,

Cambridge, MA.

Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1(3):295–311.

Baron, R. J. (1981). Mechanisms of human facial recognition. The international Journal

of Robotics research, 15:137–178.

Bartlett, M. S., Lades, H. M., and Sejnowski, T. J. (1998). Independent component

representations for face recognition. Proceedings of the SPIE Symposium on Electronic

Imaging: Science and Technology. Conference on Human Vision and Electronic

Imaging III.

Bell, A. J. and Sejnowski, T. J. (1995). An information-maximisation approach to blind

separation and blind deconvolution. Neural Computation, 7(6):1129–1159.

Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press, Princeton,

NJ.

Bhanu, B. and Ming, J. C. (1987). Recognition of occluded objects: A cluster-structure

algorithm. Pattern Recognition, 20(2):199–211.

Biederman, I. and Kalocsai, P. (1997). Neurocomputational bases of object and face

recognition. Philosophical Transactions of the Royal Society of London B, 352:1203–

1219. Comparison of face and other object recognition.

Bienenstock, E. L., Cooper, L. N., and Munro, P. W. (1982). Theory for the development

of neuron selectivity: orientation specificity and binocular interaction in visual cortex.

Journal of Neuroscience, 2:32–48.

Bishop, C. M. (1995a). Neural Networks for Pattern Recognition. Oxford University Press.

Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization.

Neural Computation, 7(1):108–116.

Blais, B. S., Intrator, N., Shouval, H., and Cooper, L. N. (1998). Receptive field forma-

tion in natural scene environments: comparison of single cell learning rules. Neural

Computation, 10(7):1797–1813.

Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and

singular value decomposition. Biological Cybernetics, 59:291–294.

Brunelli, R. and Poggio, T. (1992). Face recognition through geometrical features. Pro-

ceedings of the 2nd European Conference on Computer Vision.

Brunelli, R. and Poggio, T. (1993). Face recognition: Features versus templates. IEEE

PAMI, 15:1042–1052.

Cannon, M. (1976). Blind deconvolution of spatially invariant image blurs with phase.

ICASSP, 24:58–63.

Caruana, R. (1995). Learning many related tasks at the same time with backpropagation.

In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information

Processing Systems, volume 7, pages 657–664. Morgan Kaufmann, San Mateo, CA.

Chan, T. F. and Wong, C. K. (1997). Total variation blind deconvolution. Technical

report, UCLA.

Chellappa, R., Wilson, C. L., and Sirohey, S. (1995). Human and machine recognition of faces: A survey.

Proceedings of the IEEE, pages 704–740.

Chopper (1997). Autonomous helicopter project. CMU.

Coifman, R. R. and Wickerhauser, M. (1992). Entropy-based algorithms for best basis

selection. IEEE Trans. Info. Theory, 38(2):713–719.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing,

36:287–314.

Connolly (1997). Robotics internet resources page. UMass Laboratory for Perceptual

Robotics.

Cottrell, G. W. and Fleming, M. K. (1990). Categorization of faces using unsupervised

feature extraction. Proc. Int. Conf. on Neural Networks, 2:65–70.

Cottrell, G. W., Munro, P., and Zipser, D. (1987). Image compression by back propagation: An

example of extensional programming. Advances in Cognitive science, 3.

Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley-

Interscience, New York.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from

incomplete data via the EM algorithm. Journal of the Royal Statistical Society,

B-39:1–38.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. John

Wiley, New York.

Edelman, S., Reisfeld, D., and Yeshurun, Y. (1992). Learning to recognize faces from

examples. In Proceedings of the 2nd European Conference on Computer Vision, pages

787–791, Santa Margherita Ligure, Italy.

Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and

Hall, New York.

Elman, J. L. and Zipser, D. (1988). Learning the hidden structure of speech. Journal of

the Acoustical Society of America, 4(83):1615–1626.

Fabian, R. and Malah, D. (1991). Robust identification of motion and out-of-focus blur

parameters from blurred and noisy images. CVGIP, 53(5):403–412.

Field, D. J. (1994). What is the goal of sensory coding. Neural Computation, 6:559–601.

Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statis-

tical Association, 82:249–266.

Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Academic press,

London.

Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias-

variance dilemma. Neural Computation, 4:1–58.

Girolami, M. and Fyfe, C. (1996). Negentropy and kurtosis as projection pursuit indices

provide generalised ICA algorithms. Preprint.

Gluck, M. A. and Myers, C. E. (1993). Hippocampal mediation of stimulus representation:

A computational theory. Hippocampus, 3(4):491–516.

Gonzalez, R. C. and Wintz, P. (1993). Digital Image Processing. Addison-Wesley Pub-

lishing Company.

Grenander, U. (1978). Pattern analysis. Springer-Verlag, New York.

Gutta, S., Huang, J., Imam, I. F., and Wechsler, H. (1996). Face and hand gesture

recognition using hybrid classifiers. In Proceedings of the second IEEE International

Conference on Automatic Face and Gesture Recognition, pages 164–169, Washington,

Brussels, Tokyo.

Hancock, J. B., Baddeley, R. J., and Smith, L. S. (1992). The principal components of

natural images. In Network: Computation in Neural Systems, 3(1):61–70.

Hansen, L. K. and Salamon, P. (1990). Neural network ensembles. IEEE Transactions

on Pattern Analysis and Machine Intelligence, 12(10):993–1001.

Haykin, S. (1994). Neural Networks. Macmillan, New York.

Hertz, J., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory of Neural

Computation. Addison-Wesley, New York.

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The ”Wake-Sleep” algorithm

for unsupervised neural networks. Science, 268:1158–1161.

Hinton, G. E. and Ghahramani, Z. (1997). Generative models for discovering sparse

distributed representations. Philosophical Transactions of the Royal Society B,

352:1177–1190.

Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and

Helmholtz free energy. Advances in Neural Information Processing Systems.

Hirukawa (1997). Etl robotics research group. Japan.

Hochberg, J. (1974). Organization and the gestalt tradition. In Carterette, C. and Fried-

man, M., editors, Handbook of Perception, New York. Academic Press.

Hochreiter, S. and Schmidhuber, J. (1997). Flat minima. Neural Computation, 9(1):1–42.

Howell, A. J. (1997). Automatic Face Recognition using Radial Basis Function Networks.

PhD thesis, University of Sussex.

Huber, P. J. (1985). Projection pursuit. (with discussion). The Annals of Statistics,

13:435–475.

Hummel, R. A., Kimia, B. B., and Zucker, S. W. (1987). Deblurring Gaussian blur. CVGIP,

38(1):66–80.

Hyvarinen (1997a). Independent component analysis by minimization of mutual infor-

mation. Technical report a46, Laboratory of Computer and Information Science,

Helsinki University of Technology.

Hyvarinen (1997b). New approximations of differential entropy for independent compo-

nent analysis and projection pursuit. Technical report a46, Laboratory of Computer

and Information Science, Helsinki University of Technology.

Intrator, N. (1993a). Combining exploratory projection pursuit and projection pursuit

regression with application to neural networks. Neural Computation, 5(3):443–455.

Intrator, N. (1993b). On the combination of supervised and unsupervised learning. Physica

A, pages 655–661.

Intrator, N. (1999). Robust prediction in many parameter models: Specific control of

variance and bias. In Kay, J. W. and Titterington, D. M., editors, Statistics and

Neural Networks: Advances at the Interface. Oxford University Press. To appear.

Intrator, N. and Cooper, L. N. (1992). Objective function formulation of the BCM the-

ory of visual cortical plasticity: Statistical connections, stability conditions. Neural

Networks, 5:3–17.

Intrator, N. and Cooper, L. N. (1995). BCM theory of visual cortical plasticity. In Arbib,

M., editor, The Handbook of Brain Theory and Neural Networks, pages 153–157. MIT

Press.

Intrator, N. and Edelman, S. (1996). Making a low-dimensional representation suitable

for diverse tasks. Connection Science, Special issue on Reuse of Neural Networks

Through Transfer, 8(2):205–224. Also in Learning to Learn, S. Thrun and L. Pratt

(ed., Kluwer press.).

Intrator, N., Reisfeld, D., and Yeshurun, Y. (1996). Face recognition using a hybrid

supervised/unsupervised neural network. Pattern Recognition Letters, 17:67–76.

Intrator, O. and Intrator, N. (1993). Interpreting neural-network models. Preprint.

Jacobs, R. (1997). Bias/variance analyses of mixtures-of-experts architectures. Neural

Computation, pages 369–383.

Jain, A., Zheng, Y., and Lakshmanan, S. (1996). Object matching using deformable

templates. T-PAMI, 18:267–278.

Jain, A. K. (1989). Fundamentals of Digital Image Processing. Prentice Hall, London.

Japkowicz, N., Myers, C., and Gluck, M. (1995). A novelty detection approach to classification.

In the proceedings of the Fourteenth International Joint Conference on

Artificial Intelligence, pages 518–523.

Jordan, M., editor (1999). Learning in Graphical Models. The MIT Press, Cambridge,

Massachusetts, London, England.

Kandel, E. R. and Schwartz, J. H. (1991). Principles of Neural Science. Elsevier, New

York, third edition.

Kanizsa and Gaetano (1970). Organization in Vision: Essays on Gestalt Perception.

Praeger, New York.

Katsaggelos, A. K. (1989). Iterative image restoration algorithms. Optical Engineering,

28(7):735–748.

Kimia, B. B. and Zucker, S. W. (1993). Analytic inverse of discrete Gaussian blur. Optical

Engineering, 32(1):166–176.

Kirby, M. and Sirovich, L. (1990). Application of the Karhunen-Loeve procedure for

characterization of human faces. PAMI, 12(1):103–108.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and

model selection. International Joint Conference on Artificial Intelligence.

Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative

neural networks. AIChE Journal, 37(2):233–243.

Kung, S. Y. and Diamantaras, C. I. (1990). A neural network learning algorithm for adap-

tive principal component extraction (apex). International Conference on Acoustics,

Speech and Signal Processing, 2:861–864.

Lai, K. F. (1994). Deformable Contours: Modeling, Extraction, Detection and Classification.

PhD thesis, University of Wisconsin-Madison, Electrical Engineering.

Lee, T., Girolami, M., Bell, A. J., and Sejnowski, T. J. (1998). A unifying information-

theoretic framework for independent component analysis. International Journal on

Mathematical and Computer Modeling. in press.

Leen, T. (1995). From data distribution to regularization in invariant learning. In Touret-

zky, D. S. and Lippmann, R. P., editors, Advances in Neural Information Processing

Systems, volume 7, pages 223–230. Morgan Kaufmann, San Mateo, CA.

Lesher, G. W. (1995). Illusory contours: Toward a neurally based perceptual theory.

Psychonomic Bulletin and Review, 2(3).

Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 88:105–117.

Liu, H.-C. and Srinath, M. D. (1984). Partial shape classification using contour matching

in distance transformation. IEEE Transactions on PAMI, 12(11):1072–1078.

Luenberger, D. G. (1989). Linear and Nonlinear Programming. Addison-Wesley,

Reading, MA, second edition.

MacKay, D. (1992). Bayesian interpolation. Neural Computation, 4(3):415–447.

MacKay, D. (1996). Maximum likelihood and covariant algorithms for independent com-

ponent analysis. Technical report draft 3.7, Cavendish laboratory, University of Cam-

bridge.

Marr, D. (1982). Vision. Freeman, New York.

Meir, R. (1994). Bias, variance and the combination of estimators: The case of linear least

squares. ftp://archive.cis.ohio-state.edu/pub/neuroprose/meir.bias-variance.ps.Z.

Miller, K. D. (1995). Receptive fields and maps in the visual cortex: Models of ocular

dominance and orientation columns. In Domany, E., van Hemmen, J. L., and Schul-

ten, K., editors, Models of Neural Networks 3, pages 55–78. Springer-Verlag, New

York.

Moghaddam, B. and Pentland, A. (1994). Face recognition using view-based and modular

eigenspaces. Automatic Systems for the Identification and Inspection of Humans,

SPIE, 2227.

Moses, Y. (1994). Face recognition: the problem of compensating for illumination changes.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):721–732.

Mougeot, M., Azencott, R., and Angeniol, B. (1991). Image compression with back

propagation: Improvement of the visual restoration using different cost functions.

Neural Networks, 4:467–476.

Munro, P. (1997). Noisy encoders. NIPS*97 Workshop: Advances in Autoencoder-

Autoassociator Based Computations.

Murase, H. and Nayar, S. K. (1993). Learning object models from appearance. Proceedings

of the Eleventh National Conference on Artificial Intelligence.

Murray, A. F. and Edwards, P. J. (1993). Synaptic weight noise during multilayer per-

ceptron training: Fault tolerance and training improvements. IEEE Transactions on

Neural Networks, 4(4):722–725.

Nadal, J.-P. and Parga, N. (1994). Nonlinear neurons in the low-noise limit: a factorial

code maximizes information transfer. Network, 5:565–581.

Naftaly, U., Intrator, N., and Horn, D. (1997). Optimal ensemble averaging of neural

networks. Network, 8(3):283–296.

Neal, R. M. and Hinton, G. E. (1993). A new view of the EM algorithm that justifies

incremental and other variants. Submitted to Biometrika.

Nowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight-

sharing. Neural Computation, 4:473–493.

Oja, E. (1982). A simplified neuron model as a principal component analyzer. Math.

Biology, 15:267–273.

Oja, E. (1995a). The nonlinear PCA learning rule and signal separation - mathematical

analysis. Technical Report A26, Helsinki University, CS and Inf. Sci. Lab.

Oja, E. (1995b). Principal and independent components in neural networks - recent

developments. In Proc. VII Italian Workshop on Neural Nets WIRN’95, Vietri sul

Mare, Italy.

Olshausen, B. A. and Field, D. J. (1996). Natural image statistics and efficient coding.

Network, 7:333–339.

O’Toole, A. J., Valentin, D., and Abdi, H. (1991). Categorization and identification of

human face images by neural networks: A review of the linear autoassociative and

principal component approaches. Journal of Biological Systems, 2(3):413–430.

O’Toole, A. J., Valentin, D., and Abdi, H. (1993). A low dimensional representation of

faces in the higher dimensions of the space. Journal of the Optical Society of America,

series A, 10:405–411.

Perrone, M. P. (1993). Improving Regression Estimation: Averaging Methods for Variance

Reduction with Extensions to General Convex Measure Optimization. PhD thesis,

Brown University, Institute for Brain and Neural Systems.

Phillips, P. J. (1998). Support vector machines applied to face recognition. In Touretzky,

D. S. and Lippmann, R. P., editors, Advances in Neural Information Processing

Systems, pages 1–7. Morgan Kaufmann, San Mateo, CA. To appear.

Phillips, P. J., Moon, H., Rauss, P. J., and Der, S. Z. (1997). The FERET September 1996

database and evaluation procedure. In The First International Conference on Audio

and Video-based Biometric Person Authentication, Crans-Montana, Switzerland.

Phillips, P. J., Rauss, P. J., and Der, S. Z. (1996). FERET (face recognition technology)

recognition algorithm development and test results. Technical report ARL-TR-995, US

Army Research Laboratory.

Pomerleau, D. A. (1993). Input reconstruction reliability estimation. In Giles, C. L.,

Hanson, S. J., and Cowan, J. D., editors, Advances in Neural Information Processing

Systems, volume 5, pages 279–286. Morgan Kaufmann.

Raviv, Y. (1998). Bootstrapping with noise ensemblening. Master’s thesis, The De-

partment of Computer Science. The Raymond and Beverly Sackler Faculty of Exact

Sciences. Tel-Aviv University.

Raviv, Y. and Intrator, N. (1996). Bootstrapping with noise: An effective regularization

technique. Connection Science, Special issue on Combining Estimators, 8:356–372.

Redlich, A. N. (1993). Redundancy reduction as a strategy for unsupervised learning.

Neural Computation, 5:289–304.

Reisfeld, D. (1993). Generalized Symmetry Transforms: Attentional Mechanisms and Face

Recognition. PhD thesis, Tel-Aviv University.

Reisfeld, D., Wolfson, H., and Yeshurun, Y. (1990). Detection of interest points using

symmetry. In Third International Conference on Computer Vision, pages 62–65,

Osaka, Japan.

Rieke, F., Warland, D., de Ruyter van Steveninck, R., and Bialek, W. (1996). Spikes:

Exploring the Neural Code (Computational Neuroscience). The MIT Press, London.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Oxford Press.

Rissanen, J. (1985). Minimum description length principle. Encyclopedia of Statistical

Sciences, pages 523–527.

Rosenfeld, A. and Kak, A. C. (1982). Digital Picture Processing. Academic Press, New

York.

Roweis, S. (1997). EM algorithms for PCA and SPCA. NIPS97.

Roweis, S. and Ghahramani, Z. (1997). A unifying review of linear Gaussian models.

Submitted for Publication.

Rudin, L. I., Osher, S., and Fatemi, E. (1992). Nonlinear total variation based noise

removal algorithms. Physica D, 60:259–268.

Samal, A. and Iyengar, P. A. (1992). Automatic recognition and analysis of human

faces and facial expressions: A survey. Pattern Recognition, 25:65–77.

Samaria, F. and Fallside, F. (1993). Face identification and feature extraction using hidden

Markov models. Elsevier.

Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward

neural network. Neural Networks, 2:459–473.

Sato, K., Shah, S., and Aggarwal, J. K. (1998). Partial face recognition using radial basis

function networks. In Proceedings of the third IEEE International Conference on

Automatic Face and Gesture Recognition, pages 288–293, Nara, Japan.

Schwenk, H. and Milgram, M. (1995). Transformation invariant autoassociation with

application to handwritten character recognition. Neural Information Processing

Systems (NIPS 7), pages 991–998.

Sezan, M. I. and Tekalp, A. M. (1990). Survey of recent developments in digital image

restoration. Optical Engineering, 29(5):393–404.

Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst. Tech. J.,

27:379–423 and 623–656.

Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop – a formalism

for specifying selected invariances in an adaptive network. In Moody, J., Lippmann,

R., and Hanson, S. J., editors, Neural Information Processing Systems, volume 4,

pages 895–903. Morgan Kaufmann, San Mateo, CA.

Stark, H. (1987). Image Recovery: Theory and Application. Academic Press, San Diego.

Stuart, A. and Ord, J. K. (1994). Kendall’s Advanced Theory of Statistics. Edward

Arnold.

Taniguchi, M. and Tresp, V. (1997). Averaging regularized estimators. Neural

Computation, 9:1163–1178.

Tankus, A. (1996). Automatic face detection and recognition. Master’s thesis, Tel-Aviv

University.

Tankus, A., Yeshurun, Y., and Intrator, N. (1997). Face detection by direct convexity

estimation. Pattern Recognition Letters, 18(9):913–922.

Tesauro, G., Touretzky, D., and Leen, T., editors (1995). Neural Network Ensembles,

Cross Validation, and Active Learning. The MIT Press, London.

Tikhonov, A. N. and Arsenin, V. Y. (1977). Solutions of Ill-Posed Problems. V. H.

Winston and Sons, Washington.

Tresp, V. and Taniguchi, M. (1995). Combining estimators using non-constant weighting

function. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural

Information Processing Systems, volume 7. MIT Press.

Turk, M. and Pentland, A. (1991). Eigenfaces for recognition. J. of Cognitive

Neuroscience, 3:71–86.

Turk, M. and Pentland, A. (1993). Experiments with eigenfaces. Looking At People

Workshop, IJCAI’93, pages 1–6.

Ullman, S. (1989). Aligning pictorial descriptions: an approach to object recognition.

Cognition, 32:193–254.

Ullman, S. (1995). Sequence-seeking and counter-streams: a model for information flow

in the cortex. Cerebral Cortex, 5:1–11.

Valentin, D., Abdi, H., O’Toole, A. J., and Cottrell, G. W. (1994). Connectionist models of

face processing: A survey. Pattern Recognition, 27:1209–1230.

Vapnik, V. N. (1995). Statistical learning theory. Springer, New York.

Wahba, G. (1990). Spline Models for Observational Data. Series in Applied Mathematics,

Vol. 59, SIAM, Philadelphia.

Wiskott, L. and von der Malsburg, C. (1993). A neural system for the recognition of

partially occluded objects in cluttered scenes. Int. J. of Pattern Recognition and

Artificial Intelligence, 7(4):935–948. Also in NIPS 7, 1994.

Wiskott, L., von der Malsburg, C., Fellous, and Kruger, N. (1997). Face recognition by

elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine

Intelligence, 19(7):775–779.

Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5:241–259.

Yang, H. and Amari, S. (1997). Adaptive on-line learning algorithms for blind

separation – maximum entropy and minimum mutual information. Neural Computation,

9(7):1457–1482.

Yaroslavsky, L. and Eden, M. (1996). Fundamentals of Digital Optics. Birkhäuser,

Boston.

You, Y.-L. and Kaveh, M. (1996). A regularization approach to joint blur identification

and image restoration. IEEE Transactions on Image Processing, 5(3):416–427.

Yow, K. C. and Cipolla, R. (1996). A probabilistic framework for perceptual grouping

of features for human face detection. In Proceedings of the second IEEE International

Conference on Automatic Face and Gesture Recognition, pages 16–21, Washington,

Brussels, Tokyo.

Yuille, A. L., Cohen, D., and Hallinan, P. W. (1989). Feature extraction from faces using

deformable templates. In Proc. Computer Vision and Pattern Recognition, pages 104–

109, Washington. IEEE Computer Society Press.

Zemel, R. (1993). A Minimum Description Length Framework for Unsupervised Learning.

PhD thesis, University of Toronto.
