
Neural Networks to Discover Image Patterns


Image recognition with convolutional neural networks

Introduction

The main objective of this work is to understand the models used to train algorithms for image pattern recognition, which are also used in many other fields, such as social data mining or speech and music recognition.

We first give an introduction to neural networks with a basic approach: training a network with the gradient descent method.

What is a neural network?

A neural network consists of a combination of logical units called perceptrons or artificial neurons, which are programmed to emulate the behaviour of a human neuron by receiving an input and computing an output using weights and biases.

The idea of a perceptron can be expressed with the following equation:

$$\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases}$$

where x_j are the inputs coming from the given data or from other perceptrons and w_j are the weights of those inputs. We can also express the previous output function more compactly, writing the inputs and weights as vectors:

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x \le \text{threshold} \\ 1 & \text{if } w \cdot x > \text{threshold} \end{cases}$$

The output of a perceptron is 0 if the scalar product of the inputs with their weights is lower than or equal to a given threshold, and 1 if it is higher. Perceptrons are organized in layers, and several layers together form a neural network. From now on we use the notation z := w·x + b, where the bias b plays the role of the negative threshold.
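As a quick illustration (a minimal sketch, not part of the original text, with made-up weights and bias), a single perceptron can be written in Python as:

```python
import numpy as np

def perceptron(x, w, b):
    """Perceptron output: 1 if z = w·x + b is positive, 0 otherwise."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# Hypothetical example: two inputs with made-up weights and bias.
x = np.array([1.0, 0.0])
w = np.array([0.6, 0.4])
b = -0.5
print(perceptron(x, w, b))  # prints 1, since z = 0.6 + 0.0 - 0.5 = 0.1 > 0
```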


The first layer is called the input layer; it receives the information that has to be processed by the network (image pixels, song beats, audio frames, etc.). After that come the hidden layers, which process the input layer's information so that the output layer can give a result according to a target.

A problem with such a perceptron output function is that slight changes in the weights and thresholds (from now on, biases) can induce big changes in the output, which is not desirable. So we use the sigmoid function as the output function instead: it works similarly but has a smooth shape, taking all real values between 0 and 1 rather than the two discrete values of the function described above.

This ensures a continuity property: small changes in the input cause small changes in the output. So instead of defining the output as above, we improve the output function as follows:

$$a = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w \cdot x + b.$$

This way, when z = w·x + b is large and positive the resulting output tends to 1, and when it is large and negative the output tends to 0.
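For example (a small sketch with made-up numbers, not from the original text), the sigmoid behaviour described above looks like this in Python:

```python
import numpy as np

def sigmoid(z):
    """Smooth activation function taking values strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(x, w, b):
    """Neuron output a = sigmoid(w·x + b)."""
    return sigmoid(np.dot(w, x) + b)

# Large positive z gives an output near 1, large negative z gives an output near 0.
print(sigmoid(10.0))   # ~0.99995
print(sigmoid(-10.0))  # ~0.000045
```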

So we have briefly defined which elements form a neural network. Let’s see now how it works.


We want the network to produce a specific output when we feed it a specific input. For example, we give as input a matrix of pixels containing a handwritten digit, and we want the network to return the number from 0 to 9 that is written in the picture.

The output that the network should give is called the target value.

In the first example we are going to train the network by approximating the most suitable biases and weights, so that when we give the network an input image containing a digit it returns the same number that we see in the picture.

To do that, the first approach is the stochastic gradient descent method, a method for minimizing multivariate functions. We are going to use it to find the minimum of a cost function that represents how different the network outputs are from the target outputs:

$$C(w, b) = \frac{1}{2n} \sum_x \| y(x) - a \|^2,$$

where w and b are the weights and biases of the perceptrons, n is the number of training inputs, x are the inputs, y are the output targets, and a are the network outputs when x is the input (the outputs also depend on w and b, but that is omitted for the sake of simplicity).

The gradient descent method consists of getting close to the minimum of the function by stepwise subtracting the gradient vector, since the gradient of a multivariate function points in the direction of maximum growth. So if we subtract the gradient multiplied by a small scalar from the function variables, we move in the direction of maximum decrease, towards lower values of the function. Once subtracting the gradient no longer gives a lower value of C, we are close to the minimum and stop the algorithm. Expressed as a formula, at each step:

$$v \rightarrow v' = v - \eta \nabla C,$$

where v represents the vector of weights and biases and η (the learning rate) is a small quantity that we multiply by the gradient so that we approach the minimum stepwise without overshooting it.
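As an illustrative sketch (a toy two-variable function rather than the actual network cost, with hypothetical names), the update rule above can be coded as:

```python
import numpy as np

def gradient_descent(grad_C, v0, eta=0.1, steps=100):
    """Repeatedly subtract eta times the gradient from the variable vector v."""
    v = np.array(v0, dtype=float)
    for _ in range(steps):
        v -= eta * grad_C(v)
    return v

# Toy cost C(v) = v1^2 + v2^2, whose gradient is (2*v1, 2*v2); minimum at (0, 0).
grad_C = lambda v: 2.0 * v
print(gradient_descent(grad_C, [3.0, -4.0]))  # close to [0, 0]
```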

So the next step is to calculate the gradient. To do that, we introduce the backpropagation algorithm.

The backpropagation algorithm uses the following notation and equalities:


w^l_{jk} is the weight from the kth neuron in the (l−1)th layer to the jth neuron in the lth layer.

a^l_j is the activation of the jth neuron in the lth layer.

b^l_j is the bias of the jth neuron in the lth layer.

L is the total number of layers.

The error of the output layer is

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \, \sigma'(z^L_j),$$

and in matrix-based form

$$\delta^L = \nabla_a C \odot \sigma'(z^L),$$

where ⊙ is the Hadamard product, which is an elementwise product.


Once we know these equations, we have a way of calculating the gradient that we subtract at each step, multiplied by a given learning rate, so that we get closer to the optimal (w, b).

Weights and biases initialization: we initialize the weight and bias vectors with a standard Gaussian distribution, with mean 0 and standard deviation 1.

The steps of the backpropagation algorithm are:

1. Initialize the weights and biases.
2. Set the activation a^1 for the input layer.
3. For each layer l = 2, 3, ..., L compute the corresponding activations a^l = σ(w^l a^{l−1} + b^l).
4. Compute the output error vector δ^L.
5. Backpropagate the error: for layers l = L−1, ..., 2 calculate the error δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).
6. Output: the gradient of the cost function is given by

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \qquad \text{and} \qquad \frac{\partial C}{\partial b^l_j} = \delta^l_j.$$

Once we know how to calculate the gradient, we continue with the gradient descent method applied to mini-batches of the training dataset.
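To make the whole procedure concrete, here is a compact sketch (assuming sigmoid activations, the quadratic cost, and 1-D NumPy vectors for each example; the helper names are hypothetical) of backpropagation together with a mini-batch update:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Return the gradients (dC/dw, dC/db) for one training example (x, y)."""
    # Forward pass: store the weighted inputs z^l and the activations a^l.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Output error (quadratic cost): delta^L = (a^L - y) * sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_w = [np.zeros_like(w) for w in weights]
    nabla_b = [np.zeros_like(b) for b in biases]
    nabla_w[-1] = np.outer(delta, activations[-2])
    nabla_b[-1] = delta
    # Backpropagate: delta^l = ((w^{l+1})^T delta^{l+1}) * sigma'(z^l).
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_w[-l] = np.outer(delta, activations[-l - 1])
        nabla_b[-l] = delta
    return nabla_w, nabla_b

def update_mini_batch(weights, biases, batch, eta):
    """Average the gradients over a mini-batch and take one descent step."""
    nabla_w = [np.zeros_like(w) for w in weights]
    nabla_b = [np.zeros_like(b) for b in biases]
    for x, y in batch:
        dw, db = backprop(weights, biases, x, y)
        nabla_w = [nw + d for nw, d in zip(nabla_w, dw)]
        nabla_b = [nb + d for nb, d in zip(nabla_b, db)]
    weights = [w - (eta / len(batch)) * nw for w, nw in zip(weights, nabla_w)]
    biases = [b - (eta / len(batch)) * nb for b, nb in zip(biases, nabla_b)]
    return weights, biases
```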

Cross-entropy function

After learning the most basic neural network model, we are going to see that this approach has some drawbacks, and how to solve them.


The main problem is that this algorithm learns slowly: for very negative or very large values of z (= w·x + b) the derivative of the sigmoid function is close to zero, so learning will be very slow, since (for a single sigmoid neuron and a single training example with the quadratic cost)

$$\frac{\partial C}{\partial w} = (a - y)\,\sigma'(z)\,x$$

and

$$\frac{\partial C}{\partial b} = (a - y)\,\sigma'(z).$$

So we introduce the cross-entropy cost function, which solves the problem:

$$C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln(1 - a) \right].$$

The partial derivative with respect to the weights is this time

$$\frac{\partial C}{\partial w_j} = -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \sigma'(z)\, x_j.$$

Simplifying,

$$\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x \frac{\sigma'(z)\, x_j}{\sigma(z)\,(1 - \sigma(z))} \,\big(\sigma(z) - y\big),$$

and since σ'(z) = σ(z)(1 − σ(z)),

$$\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j \,\big(\sigma(z) - y\big),$$

which is a better expression for the learning speed: the greater the error σ(z) − y, the faster the algorithm will learn, since the derivative is proportional to the error and the σ'(z) factor has cancelled out.

Similarly, the derivative with respect to the bias is

$$\frac{\partial C}{\partial b} = \frac{1}{n} \sum_x \big(\sigma(z) - y\big).$$

These changes will improve the learning speed for large initial errors.
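A short sketch (toy numbers, hypothetical helper names) of the cross-entropy cost and its weight gradient for a single sigmoid neuron, showing that the gradient is proportional to the error σ(z) − y:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(a, y):
    """C = -mean over examples of [y ln a + (1 - y) ln(1 - a)]."""
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

def cross_entropy_grad_w(x, z, y):
    """dC/dw_j = mean over examples of x_j * (sigmoid(z) - y): no sigma'(z) factor."""
    return np.mean(x * (sigmoid(z) - y)[:, None], axis=0)

# Toy data: 3 examples with 2 input features each, made-up weights and bias.
x = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])
w, b = np.array([0.3, -0.2]), 0.1
z = x @ w + b
y = np.array([1.0, 0.0, 1.0])
print(cross_entropy_cost(sigmoid(z), y))
print(cross_entropy_grad_w(x, z, y))
```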

Softmax

Another approach to the problem of learning slowdown is softmax layers. A softmax layer is a modified version of the output layer in which we don't apply the sigmoid function to the weighted inputs z^L_j; we apply the softmax function instead. The activation of the jth neuron of the output layer L is going to be

$$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}.$$


This function has the property of forming a probability distribution, which means that all outputs of a softmax layer are values between 0 and 1 and they sum to 1.
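A brief sketch (with made-up weighted inputs) showing that the softmax outputs lie between 0 and 1 and sum to 1:

```python
import numpy as np

def softmax(z):
    """a_j = exp(z_j) / sum_k exp(z_k); shifting by max(z) improves numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # made-up weighted inputs z^L
a = softmax(z)
print(a)          # e.g. [0.659 0.242 0.099]
print(a.sum())    # 1.0
```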

Why does softmax help with learning slowdown?

Consider the log-likelihood cost function

$$C \equiv -\ln a^L_y,$$

where y is the desired output and L is the last layer. This cost function has the needed properties: if the network is confident that the output is y, then a^L_y is going to be close to 1, so −ln a^L_y is going to be close to 0.

After differentiating the log-likelihood cost function we get error expressions similar to the ones derived from the cross-entropy cost function:

$$\frac{\partial C}{\partial b^L_j} = a^L_j - y_j \qquad \text{and} \qquad \frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k \,\big(a^L_j - y_j\big).$$

We know from the previous error equations that these expressions for the cost derivatives imply that we don't have the learning slowdown problem.

Overfitting

With the model described so far we find another problem: our network has many parameters, and the predictions stop improving after a limited number of epochs because of the overfitting phenomenon. The idea is that if we have enough parameters we can build a model that fits any dataset, but that doesn't mean our model is good at predicting new data.

To solve the overfitting problem we introduce a regularization term in the cost function.


We can also add a regularization term to the cross-entropy cost function,

$$C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a^L_j + (1 - y_j) \ln\!\big(1 - a^L_j\big) \right] + \frac{\lambda}{2n} \sum_w w^2,$$

or to a general cost function,

$$C = C_0 + \frac{\lambda}{2n} \sum_w w^2,$$

where C_0 is the original, unregularized cost and λ > 0 is the regularization parameter.

Although there is no widely accepted theoretical explanation of why regularization gives better results for data fitting, the idea of regularization is to “penalize” large weights, because heuristically large weights tend to fit the particular details of the training set instead of learning its general pattern.

There are many other cost-function modifications and regularization techniques that we are not going to explain here, but the technique explained above is widely used.

Another regularization technique is dropout, which consists of training the network after randomly deleting half of the hidden-layer neurons, repeating this with different random deletions, and averaging the outputs of the differently modified networks. The idea is that if most of the modified networks agree on a specific result, we trust that result and assume the networks that disagree are making mistakes.
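A minimal sketch of the training-time part of the idea (the hidden-activation values here are hypothetical): each hidden neuron's activation is zeroed out with probability 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(hidden_activations, p_drop=0.5):
    """Randomly zero out each hidden activation with probability p_drop (training only)."""
    mask = rng.random(hidden_activations.shape) >= p_drop
    return hidden_activations * mask

hidden = np.array([0.2, 0.9, 0.5, 0.7])  # hypothetical hidden-layer activations
print(dropout(hidden))                    # roughly half of the entries become 0
```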

We can also reduce overfitting by artificially expanding the training data, for example by taking small rotations of the images. This technique improves the goodness of fit on the test data.
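For instance (a sketch assuming images stored as 2-D NumPy arrays; the 28x28 size is just a stand-in for a handwritten-digit image), small rotations can be generated with SciPy:

```python
import numpy as np
from scipy.ndimage import rotate

def augment_with_rotations(image, angles=(-10, -5, 5, 10)):
    """Return slightly rotated copies of an image to artificially expand the training set."""
    return [rotate(image, angle, reshape=False) for angle in angles]

image = np.random.rand(28, 28)             # stand-in for a 28x28 digit image
extra_examples = augment_with_rotations(image)
print(len(extra_examples))                  # 4 additional training images
```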

Weight initialization to avoid slowdown:

As mentioned before, the first idea for initializing the weights and biases was to set them to values drawn from a standard Gaussian distribution. It turns out that if we do so, it is likely that |z| will be very large, so the output of the neuron σ(z) will be very close to 0 or 1; the neuron will then be saturated and learning is going to be slow.

If instead we initialize the weights as Gaussian random variables with mean 0 and standard deviation 1/√(n_in), where n_in is the number of input connections to the neuron, we get much better results: most values of z are then close to 0, so the derivative of σ is not close to 0, which makes learning faster.
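A small sketch comparing the two initializations (the layer size is made up), showing that the scaled version keeps z = w·x + b close to 0 so the neuron does not saturate:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in = 1000                     # hypothetical number of input connections to one neuron

x = np.ones(n_in)               # a saturating worst case: every input equal to 1
w_standard = rng.normal(0.0, 1.0, n_in)                  # N(0, 1) weights
w_scaled = rng.normal(0.0, 1.0 / np.sqrt(n_in), n_in)    # N(0, 1/sqrt(n_in)) weights
b = rng.normal(0.0, 1.0)

print(abs(w_standard @ x + b))  # typically of order sqrt(n_in) ~ 30: sigma(z) saturates
print(abs(w_scaled @ x + b))    # typically of order 1: sigma'(z) is not tiny
```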
