You have all heard about deep neural networks, and maybe used it. In this post, I will try to explain the mathematics behind neural nets with a basic classification example. This tutorial has two parts, check out the part 2 here!

## tl;dr¶

1. A neural network is defined by a cascade of perceptron models, each with tunable parameters.
2. Activations are used to compute the probability or make a prediction from the features
3. The feed-forward pass evaluates the output(s) given the input(s)
4. Back-propagation computes the gradients, used in the optimization process to update the perceptron parameters

Deep learning is a tool that scientists use to solve a central problem in data science, how to model from data ? Depending of the context, data can be very complex and has diffrent forms: 1D signals (audio, finance), 2D images (camera, ultrasound), 3D volumes (X-ray, computer graphics) or videos, 4D moving volumes (f-MRI).

One way of infering predictions from these data is using machine learning. After designing a feature extractor by hand, you choose a system that predict from the feature whether the space is linear or not (SVM with kernel trick). On the other hand, deep learning strives to combine these two steps at once! It extracts the high-level, abstract features from the data itself and use them to do the prediction.

This of course comes at a cost both in term of mathematical understanding or even money.

## 1. Background¶

### 1.1. Perceptron model¶

McCulloch et al. [1] were the first to introduce the idea of neural network as computing machine. It is a simple model defined by two free parameters, the weight vector $\mathbf{w}$ and bias $b$:

$$$$\label{eq:perceptron} f(\mathbf{x}) = \mathbf{x}\mathbf{w} + b$$$$

where the input vector $\mathbf{x}$ has size $[1\times D]$ with $D$ the number of features, and $\mathbf{w}$ has size $[D\times 1]$. This simple model can classify input features that are linearly separable.

Looking at its derivatives w.r.t the parameters $\mathbf{w}$ and $b$ is usefull when we want to analyse the impact of the parameters on the model.

$$$$\label{eq:softmax} \frac{\partial f(\mathbf{w})}{\partial \mathbf{w}} = \mathbf{x};\quad \frac{\partial f(b)}{\partial b} = 1$$$$

Things are getting more complicated for multiple neurons $C$ with parameters $\mathbf{W}$ of size $[D\times C]$ and $\mathbf{b}$ $[1\times C]$. Because the neuron activation is a function $f(\mathbf{W}):\mathbb{R}^{D\times C}\rightarrow \mathbb{R}^C$, the resulting jacobian matrix would have a size of $[D\times C\times D]$. There are different strategies to handle this but one way of doing it is to flatten $\mathbf{W}$ so it becomes a vector $\mathbf{w}$ with size $DC$, then the jacobians are: $$$$\frac{\partial f(\mathbf{W})}{\partial \mathbf{W}} = \begin{bmatrix} x_1\\ \vdots & \ddots & x_1\\ x_D & \ddots & \vdots\\ & & x_D\\ \end{bmatrix}_{[DC \times C]} \frac{\partial f(\mathbf{b})}{\partial \mathbf{b}} = \begin{bmatrix} 1\\ & \ddots\\ & & 1 \end{bmatrix}_{[C \times C]}\\$$$$

### 1.2. Activations¶

Having a model is not enough to come up with a complete algorithm, we need a classifying rule that will compute the response based on the model output. Rosenblatt [2] used a simple activation function (also reffered as heavyside), so the inputs would be classified as target $t=1$ if $\mathbf{w} \mathbf{x} + b > 0$ and target $t=0$ otherwise:

$$$$f(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$$$

Nowadays, we used the widely adopted softmax [3] (sigmoid for binary). It has two nice properties that make it a good function to model probability distributions: each value ranges between 0 and 1 and the sum of all values is always 1. Given the input vector $\mathbf{x}$ of size $[C\times 1]$ classes:

$$$$f(\mathbf{x}) = \frac{1}{\sum_{c=1}^C e^{x_c}} \begin{bmatrix} e^{x_{i=1}} \\ \vdots \\ e^{x_C} \\ \end{bmatrix}$$$$

To compute the jacobian matrix, we need to find the derivative of each output $f_{i=1}(\mathbf{x}) \dots f_C(\mathbf{x})$ (lines) w.r.t. each inputs $x_{j=1} \dots x_{j=D}$ (columns). The resulting jacobian matrix for the softmax function is :

$$$$\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} = \begin{bmatrix} y_{c=1}(\mathbf{x})\cdot (1 - y_{d=1}(\mathbf{x})) & -y_1(\mathbf{x})\cdot y_2(\mathbf{x}) & \dots & -y_1(\mathbf{x})\cdot y_D(\mathbf{x}) \\ -y_2(\mathbf{x})\cdot y_1(\mathbf{x}) & \ddots & & \vdots \\ \vdots & & & \\ -y_C(\mathbf{x})\cdot y_1(\mathbf{x}) & & & y_C(\mathbf{x})\cdot (1 - y_D(\mathbf{x})) \\ \end{bmatrix}_{[C\times D]}$$$$

Note that $C=D$ in this case (the input and output vector has the same length).

## 1.3. Cost function¶

The fitness of the data to the model is defined by the cost function, and can be calculated using the likelihood function. Given the weight vector $\mathbf{w}$ (which defines our model), the ground truth vector $\mathbf{t}$ $[1\times C]$ (which is either $0$ or $1$ in classification), the $c$ th activation of the softmax $y_c(\mathbf{x})$ and the input data $\mathbf{x}$ $[1\times C]$, the likelihood is: $$$$\label{eq:cost} \mathcal{L}(\mathbf{w}|\mathbf{t}, \mathbf{x}) = \prod_{c=1}^{C} y_c(\mathbf{x})^{t_c}$$$$

After minimizing the negative $\log$ likelihood, the loss function over one sample looks like:

$$$$\label{eq:dv_cost} \xi(\mathbf{t}, y(\mathbf{x})) = - \sum_{c=1}^{C} t_{c} \cdot \log( y_c(\mathbf{x}))$$$$

As we could expect, this is similar to the cross entropy function!

We can also look at the derivative of $\xi$ w.r.t. the output of the model $y(\mathbf{x})$ (softmax function in our case):

$$$$\frac{\partial \xi(\mathbf{t}, y(\mathbf{x}))}{\partial y(\mathbf{x})} = (-1)\cdot \begin{bmatrix} t_{c=1}/y_{c=1}(\mathbf{x}) & \cdots & t_{C}/y_{C}(\mathbf{x}) \end{bmatrix}_{[1 \times C]}\\$$$$

For multiple samples $n$ (each has its own specific features $\mathbf{x}_i$), you have different strategies. A basic strategy is to add each sample's fitness, you can do the same for the jacobian of the cost function.

## 1.4. Optimization¶

The standard approach to extract the optimal weights for the model is through the gradient descent algorithm. Knowing the gradient of the cost function w.r.t each parameter, $\frac{\partial \xi}{\partial \mathbf{w}}$, one can extract the optimal set of weights. This equation can be decomposed into multiple simpler derivatives by looking at how the data flows from the input $\mathbf{x}$ to the model output $y(\mathbf{x})$.

We will design the complete equation later after defining the whole model. For now just remember that to compute $\frac{\partial \xi}{\partial \mathbf{w}}$, we need the intermediate derivative of $\xi$ w.r.t. the outputs $y(\mathbf{x})$ that we just definied previously.

The proces of calculating all the gradient and updating the parameters for the neural network is called the backpropagation.

## 2. Hands on¶

### 2.1. Input data¶

Our goal will be to predict three different classes $C : \{c_a, c_b, c_c\}$, given some data with two features $x_1$ and $x_2$. We suppose that the features can indeed be used to discriminate the samples, and are statistically independent. The data will be a $N\times 3$ vector where each line is a sample $X = [x_1, x_2]$. The three classes are issued from three different noisy gaussians distribution.

In [1]:
In [2]:

Let's take a look at the data.

In [3]:

### 2.2. Model¶

As the name suggests, a neural network is an architecture combining many neurons! Because the 2D input data is linearly separable, we will design a model with just one layer that has three neurons (one for each class). We will then compute the probabilities for the three classes through the softmax activation function, where the highest probability will gives us our class.

We will start by coding the different blocks that we will need for this task. First the perceptron model from eq.\ref{eq:perceptron}, which will allow us to adjust our model (via the weight and bias parameters).

In [4]:
# Neuron class definition
class Neuron:
def __init__(self, weights, bias):
# first dim is the number of features
# second dim is the number of neurons
self.weights = weights
self.bias = bias

def output(self, input):
return input @ self.weights + self.bias

D = np.kron(np.eye(self.weights.shape[1]), input)
return [D, np.eye(self.weights.shape[1])]


Now we can code the softmax activation function using eq.\ref{eq:softmax} (so we can predict the class), and its derivative (used for the gradient descent).

In [5]:
# Activation function
def softmax(input):
return np.exp(input) / np.sum(np.exp(input))

# Derivative of the activation function
def dv_softmax(input):
diag = softmax(input)*(1-softmax(input))
xj, xi = np.meshgrid(softmax(input), softmax(input))
jacob = (-1)*xi*xj
np.fill_diagonal(jacob, diag)

return jacob


Let's take a look at the activation function:

In [6]:

It is time to define the whole model described previously! Given the weights $\mathbf{W}$ and bias $\mathbf{b}$ from the three neurons, the output $y(\mathbf{x})$ over the input vector $\mathbf{x}=\{x_1, x_2\}$ is:

$$$$y(\mathbf{x}) = \frac{1}{e^{x_1w_{11} + x_2w_{21} + b_1} + e^{x_1w_{12} + x_2w_{22} + b_2} + e^{x_1w_{13} + x_2w_{23} + b_3}} \begin{bmatrix} e^{x_1w_{11} + x_2w_{21} + b_1}\\ e^{x_1w_{12} + x_2w_{22} + b_2}\\ e^{x_1w_{13} + x_2w_{23} + b_3}\\ \end{bmatrix}$$$$

This is called the feed-forward pass of the model, i.e. the evaluation of the output(s) probability(ies) given the input(s).

In [7]:

Let's compute the feed-forward pass on our model, with zero bias and all weights $\mathbf{w} = [1, 0.5]$.

In [8]:
# Feedforward pass

weights = np.array([[1, 1, 1], [0.5, 0.5, 0.5]])
bias = np.array([0, 0, 0])
feed_forward(inputs=X, weights=weights, bias=bias)

Out[8]:
array([[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333]])

The output probability is $33,3\%$ for each class! This is because each neuron contribute equally to the output (they have the same weights), so there is no information in this model.

### 2.3. Parameter optimization¶

We will use the gradient descent to infer the best model paramaters. Even if we will compute the gradient by hand for the purpose of this example, it is of course not feasible in practice when you have billions of parameters. Estimating efficiently the derivative of a (very) deep neural network numerically is possible thanks to a big achievement in AI which is automatic derivatives.

Given the ground truth class $\mathbf{t}$ and the model output $\mathbf{y}$ (which is the input of the cost function), we can code the cost function from eq.\ref{eq:cost} and its derivative with eq.\ref{eq:dv_cost} to extract the optimal parameters.

In [9]:
# Optimization of the model's parameters

# cost function given ground truth t and model output y (here refered as x)
def cost_function(input, t):
return (-1)*np.sum(t * np.log(input))

# derivative of the cost function for gradient descent
def dv_cost_function(input, t):
return (-1)*(t/input)


The derivative of the cost function w.r.t the output of the model $\frac{\partial \xi}{\partial \mathbf{y}}$ is not sufficient alone. Remember, for the gradient descent we need the derivative of the cost function w.r.t the output of the model parameters $\frac{\partial \xi}{\partial \mathbf{w}}$. Using the chain rule, one can decompose this derivative into "smaller" consecutive ones, from the output of the network $\mathbf{y}$ to the input $\mathbf{x}$:

$$$$\frac{\partial \xi(\mathbf{t}, y(\mathbf{x}))}{\partial \mathbf{W}} = \frac{\partial \xi(\mathbf{t}, y(\mathbf{x}))}{\partial y(\mathbf{x})} \frac{\partial y(\mathbf{x})}{\partial z(\mathbf{x})} \frac{\partial z(\mathbf{x})}{\partial \mathbf{W}},$$$$

where $z(\mathbf{x})$ is the outputs of the neurons (just before the activation function). The same derivative formula stands for the bias parameters $\mathbf{b}$.

The process of computing the gradient for a given neural network is called the backward pass (more often refered as backpropagation). Let's create a new class for our model,

In [10]:

Now, we can try to update our parameters and compute the new probabilites.

In [11]:
# backpropagation

weights = np.array([[1, 1, 1], [0.5, 0.5, 0.5]])
bias = np.array([0, 0, 0])
my_model = LinearModel(weights=weights, bias=bias)
weights = weights - my_model.back_propagation(inputs=X, t=C)[0]
bias = bias - my_model.back_propagation(inputs=X, t=C)[1]
my_model.update_params(weights, bias)
my_model.feed_forward(X)

Out[11]:
array([[1.00000000e+000, 2.18897098e-053, 3.86972453e-088],
[1.83893138e-126, 4.63180127e-050, 1.00000000e+000],
[6.72623122e-041, 1.90754792e-013, 1.00000000e+000],
[1.00000000e+000, 2.18823989e-051, 2.53382152e-093],
[9.99996472e-001, 2.66593897e-006, 8.62099197e-007],
[1.34019568e-119, 9.37291478e-050, 1.00000000e+000],
[9.99999947e-001, 5.24797675e-008, 9.26893852e-010],
[1.00000000e+000, 1.61700675e-032, 3.59585167e-053],
[5.54852239e-043, 1.76871610e-018, 1.00000000e+000],
[5.32618861e-084, 2.33883378e-037, 1.00000000e+000],
[1.38369979e-059, 1.41578864e-020, 1.00000000e+000],
[1.30450301e-110, 6.68934311e-041, 1.00000000e+000],
[1.00000000e+000, 2.00196074e-029, 1.35305807e-055],
[1.00000000e+000, 4.13133909e-030, 3.99483019e-057],
[1.39798246e-170, 6.99345017e-065, 1.00000000e+000],
[7.62449988e-045, 7.19330957e-017, 1.00000000e+000],
[1.70476439e-100, 4.90001706e-043, 1.00000000e+000],
[1.00000000e+000, 1.00320346e-029, 9.80030962e-062],
[1.00000000e+000, 1.07104756e-096, 1.22186184e-164],
[1.00000000e+000, 3.32295122e-048, 2.52721574e-080],
[1.06216301e-097, 7.35670171e-041, 1.00000000e+000],
[3.67243700e-061, 7.02709582e-027, 1.00000000e+000],
[2.51367362e-135, 6.46563062e-053, 1.00000000e+000],
[2.61598282e-053, 1.37541638e-020, 1.00000000e+000],
[1.71227095e-083, 7.11752357e-035, 1.00000000e+000],
[2.33374097e-071, 5.93010236e-030, 1.00000000e+000],
[1.00000000e+000, 9.59458261e-067, 2.92405415e-110],
[1.00000000e+000, 1.20037907e-027, 1.43023527e-049],
[7.33687574e-045, 1.66389239e-014, 1.00000000e+000],
[1.34520001e-024, 1.84116207e-007, 9.99999816e-001],
[2.11919553e-064, 4.43836272e-022, 1.00000000e+000],
[1.00000000e+000, 1.66147846e-048, 2.61316307e-077],
[1.18403709e-144, 4.81973027e-058, 1.00000000e+000],
[3.16663031e-048, 2.03323019e-020, 1.00000000e+000],
[1.86682595e-094, 1.17201777e-033, 1.00000000e+000],
[1.00000000e+000, 1.30221183e-010, 9.43522069e-027],
[4.08824710e-057, 2.25630225e-015, 1.00000000e+000],
[4.91208029e-094, 4.36511322e-035, 1.00000000e+000],
[9.98566545e-001, 1.43345459e-003, 1.61962958e-014],
[4.62552806e-033, 5.21292619e-006, 9.99994787e-001],
[3.05527173e-097, 8.29663347e-032, 1.00000000e+000],
[4.62810114e-058, 8.79882546e-022, 1.00000000e+000],
[3.17685799e-022, 3.61920398e-004, 9.99638080e-001],
[9.81432652e-100, 1.83979279e-035, 1.00000000e+000],
[9.97650514e-001, 2.34948610e-003, 1.93467597e-010],
[1.00000000e+000, 3.26076583e-055, 4.42109210e-095],
[8.97319446e-024, 1.79787067e-007, 9.99999820e-001],
[9.99945769e-001, 5.42311684e-005, 2.88724821e-010],
[8.55340506e-064, 4.27764745e-021, 1.00000000e+000],
[1.00000000e+000, 2.26983276e-085, 1.17553878e-135],
[1.00000000e+000, 8.32109455e-016, 4.05561921e-022],
[9.35070044e-133, 6.85648631e-053, 1.00000000e+000],
[2.41141789e-047, 6.44203155e-017, 1.00000000e+000],
[1.00000000e+000, 1.24242362e-045, 4.51207052e-078],
[1.00000000e+000, 1.32116287e-072, 1.76190579e-125],
[1.00000000e+000, 3.35539117e-052, 6.05543518e-088],
[4.02687989e-059, 1.32740346e-026, 1.00000000e+000],
[1.00000000e+000, 1.42411499e-037, 1.79253425e-063],
[1.00000000e+000, 3.69158852e-013, 1.01124852e-030],
[1.00000000e+000, 1.95193804e-059, 3.07220245e-101]])

We see that updating the parameters starts to give differents probabilities for each sample. In the next section, we will update the parameters inside a loop to gradually improve our weights.

## 3. Results¶

### 3.1 Cost function¶

We will first check out the cost at each epoch (step) during the training of the model. An important parameter is the learning rate as it impact the size of the gradient step at each iteration, lot of noise are introduced during training if it is too large. Chossing the optimal learning rate is part of what we call hyperparameter selection.

Here, we will fix it at $\Lambda=0.025$.

In [12]:
# learning phase

# hyper-parameters and model instanciation
lr = 0.025
n_iter = 50
weights = np.array([[1, 1, 1], [0.5, 0.5, 0.5]])
bias = np.array([0, 0, 0])
my_model = LinearModel(weights=weights, bias=bias)
cost=np.array([])
for i in range(n_iter):
# backpropagation step
weights = weights - lr*my_model.back_propagation(inputs=X, t=C)[0]
bias = bias - lr*my_model.back_propagation(inputs=X, t=C)[1]
my_model.update_params(weights=weights, bias=bias)
# cost function
probs = my_model.feed_forward(X)
cost = np.append(cost, cost_function(input=probs, t=C))

In [13]:

### 3.2 Qualitative analysis: decision function¶

Now let's take a look at the decision boundary of our classifier, by creating inputs ranging from $-5$ to $5$ in $x_1$ and $x_2$ space. It can be interresting to see at the same time the importance of the bias parameter.

In [14]:

It is clear that without the bias, some blue crosses are not well categorized.

### 3.3 Quantitative analysis: accuracy¶

By comparing the accuracy between the model with and whithout bias, the difference is even more drastic.

In [15]:
# compute probabilities

probs = my_model.feed_forward(X)
probs_no_bias = my_model_without_bias.feed_forward(X)
Y = np.argmax(probs, axis=1)
Y_no_bias = np.argmax(probs_no_bias, axis=1)
Y_truth = np.argmax(C, axis=1)
acc_Y = np.sum(Y == Y_truth)/len(Y_truth)
acc_Y_no_bias = np.sum(Y_no_bias == Y_truth)/len(Y_truth)

print("Accuracy for model with bias: %.2f" %acc_Y)
print("Accuracy for model without bias: %.2f" %acc_Y_no_bias)

Accuracy for model with bias: 0.88
Accuracy for model without bias: 0.70


Indeed, having the bias allows the model to have a better decision point since the bias offset the decision function.

## Conclusion¶

We made our own neural network from end to end, isn't that fantastic!? We saw the different and fundamental steps, and how to compute the gradient w.r.t the parameters. The data used here was quite simple since it was highly linear separable in space (gaussian), but what happened if this is not the case? If you are curious, check now the second part on classifying non-linear data!

## To go further¶

A first and one of the most important ressource is the official deep learning book, co-authored by the people who put forward this field. More on the perceptron mode in this book. Check this nice overview of how one can compute derivatives automatically here, and most important for deep learning.

## Acknowledgements¶

Thanks to peter roelants who owns a nice blog on machine learning. It helped me to have deeper understanding behind the neural network mathematics. Some code were also inspired from his work. To design the networks, I used this web tool developped by Alexander Lenail. There are also other styles available: LeNet or AlexNet.

## References ¶

1. McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics. 5, 115–133 (1943).

2. Rosenblatt, F.: The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory (1957).

3. Bridle, J.S.: Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. Neurocomputing. 227–236 (1990).