In the previous post, we saw an example of a linear neural network where the data was clearly linearly separable. This time, we will try to classify non-linearly separable data. Before showing how to do that, it is useful to introduce a few important functions.


  1. Importance of data analysis for modeling
  2. Adding non-linear activations for non-linearly separable data
  3. Importance of gradient checking

1. Background

1.1. Introducing non-linearity in a system

Considering the basic McCulloch-Pitts model (with parameters $w$ and $b$) [1], one might think that stacking multiple linear layers would introduce non-linearity into the system. But stacking multiple linear layers has no effect on the linearity of the system. Indeed, if you stack $3$ linear layers together, the resulting output is:

$$ \begin{equation} \begin{split} f_1\circ f_2\circ f_3(x) & = ((x w_3 + b_3) w_2 + b_2) w_1 + b_1 \\ & = x w_3 w_2 w_1 + b_3 w_2 w_1 + b_2 w_1 + b_1 \end{split} \end{equation} $$

This is exactly the same as a single linear layer with $w = w_3 w_2 w_1$ and $b = b_3 w_2 w_1 + b_2 w_1 + b_1$! Hence the importance of introducing a non-linear activation function, which projects the data into a space where it becomes linearly separable.
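The collapse of stacked linear layers can be checked numerically. The sketch below is only an illustration: the layer shapes and the random values are arbitrary assumptions, not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three linear layers with arbitrary (hypothetical) shapes: 2 -> 4 -> 3 -> 2
w3, b3 = rng.normal(size=(2, 4)), rng.normal(size=(1, 4))  # applied first
w2, b2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 3))
w1, b1 = rng.normal(size=(3, 2)), rng.normal(size=(1, 2))  # applied last

x = rng.normal(size=(5, 2))  # 5 samples with 2 features each

# Stacked layers: ((x w3 + b3) w2 + b2) w1 + b1
stacked = ((x @ w3 + b3) @ w2 + b2) @ w1 + b1

# Equivalent single linear layer
w = w3 @ w2 @ w1
b = b3 @ w2 @ w1 + b2 @ w1 + b1
single = x @ w + b

# Both outputs agree up to floating-point error
print(np.allclose(stacked, single))
```

Whatever the intermediate sizes, the three layers reduce to one weight matrix of shape (2, 2) and one bias of shape (1, 2).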

1.2. Limits of the softmax

We already used the softmax function, popularized by Bridle [2]. If we stacked multiple layers with softmax outputs, our model would be non-linear. The issue is that the gradient of the softmax is close to zero when it saturates, that is, when one of its inputs is much larger than the others. This can lead to the vanishing gradient problem, where learning slows down because the gradients are too small to update the parameters meaningfully.
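To make the saturation concrete, here is a small sketch that compares the Jacobian of the softmax for similar inputs and for inputs that are far apart. The two input vectors are arbitrary values chosen for illustration.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array, shifted by max(z) for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """Jacobian dy_i/dz_j = y_i * (delta_ij - y_j) of the softmax."""
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

balanced = np.array([0.0, 0.0, 0.0])      # similar inputs
saturated = np.array([10.0, 0.0, -10.0])  # one input dominates the others

jac_balanced = softmax_jacobian(balanced)
jac_saturated = softmax_jacobian(saturated)

# Largest entry (in absolute value) of each Jacobian
print(np.abs(jac_balanced).max())
print(np.abs(jac_saturated).max())
```

The largest entry is 2/9 for the balanced inputs and several orders of magnitude smaller for the saturated ones, so very little gradient flows back through a saturated softmax.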


1.3. Rectified linear unit function

The vanishing gradient issue motivated the introduction of the ReLU activation function back in 2010 [3]. It is defined as follows,

$$ \begin{equation} f(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \end{equation} $$

One clear advantage is that this activation is fast, easy to use and not computationally intensive. The gradient is always one for positive inputs, but it is zero for negative inputs, so a unit can still "die" if its input stays negative. That is why some people use the leaky ReLU, which keeps a small non-zero slope for negative inputs.

Be careful: the ReLU is not differentiable at zero, hence the introduction of the softplus activation, a smooth approximation of the ReLU.
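The three activations mentioned in this section can be sketched side by side, together with their derivatives. This is an element-wise illustration only: the slope `alpha=0.01` of the leaky ReLU and the choice of a zero derivative at exactly zero for the ReLU are assumptions, not values taken from the post.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Convention (assumption): the derivative at x == 0 is set to 0
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    # alpha is a hypothetical default slope for negative inputs
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

def softplus(x):
    # log(1 + exp(x)), computed in a numerically stable way
    return np.logaddexp(0.0, x)

def softplus_grad(x):
    # The derivative of the softplus is the logistic function
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), relu_grad(x))
print(leaky_relu(x), leaky_relu_grad(x))
print(softplus(x), softplus_grad(x))
```

The leaky ReLU never has a zero gradient, and the softplus has a well-defined derivative everywhere, including at zero.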

2. Hands on

2.1. Input data

Let's take a look at the data.

It is quite clear that the linear model from part 1 is not suited to this data. As an exercise, you can try to fit the linear model to this data and see how it performs!

2.2. Model design and parameters optimization

The model that we will design is close to the previous one; the only difference is the addition of one layer with a ReLU activation. Such layers are called hidden layers in the deep learning community. Afterwards, we compute the probabilities for the three classes through the softmax activation function. Our final classification rule is that the class with the highest probability is the predicted class.


We first need to implement the ReLU function and its derivative, which are quite easy to code. It is also worth comparing it with the logistic activation.

In [6]:
import numpy as np

# Activation function
def relu(input):
    """Rectified linear unit: x if x > 0, 0 otherwise."""
    return input * (input > 0)

# Derivative of the activation function
def dv_relu(input):
    """Jacobian of the ReLU for a single sample of shape (1, n_units)."""
    # Diagonal matrix with 1 where the input is positive and 0 elsewhere
    return np.eye(input.shape[-1]) * (input > 0)

def logistic(input):
    """Logistic function."""
    return 1. / (1. + np.exp(-input))

def dv_logistic(input):
    """Jacobian of the logistic function for a single sample of shape (1, n_units)."""
    return np.eye(input.shape[-1]) * (logistic(input) * (1 - logistic(input)))

Given the weights $\mathbf{W_h}$ and bias $\mathbf{b_h}$ of the hidden layer, the ReLU activation of the hidden layer $h(\mathbf{x})$ over the input vector $\mathbf{x}=\{x_1, x_2\}$ is written below for two hidden neurons to keep the formulas readable (the training code further down uses three hidden neurons):

$$ \begin{equation} h(\mathbf{x}) = \begin{bmatrix} \text{relu}(x_1w_{h_{11}} + x_2w_{h_{21}} + b_{h_1}) \\ \text{relu}(x_1w_{h_{12}} + x_2w_{h_{22}} + b_{h_2}) \end{bmatrix} \end{equation} $$

The softmax activation of the output layer $y(h(\mathbf{x}))$ can be calculated from the hidden activations $h = \{h_1, h_2\}$, given the weights $\mathbf{W_o}$ and bias $\mathbf{b_o}$ of the three output neurons:

$$ \begin{equation} y(h(\mathbf{x})) = \frac{1}{e^{h_1w_{o_{11}} + h_2w_{o_{21}} + b_{o_1}} + e^{h_1w_{o_{12}} + h_2w_{o_{22}} + b_{o_2}} + e^{h_1w_{o_{13}} + h_2w_{o_{23}} + b_{o_3}}} \begin{bmatrix} e^{h_1w_{o_{11}} + h_2w_{o_{21}} + b_{o_1}}\\ e^{h_1w_{o_{12}} + h_2w_{o_{22}} + b_{o_2}}\\ e^{h_1w_{o_{13}} + h_2w_{o_{23}} + b_{o_3}}\\ \end{bmatrix} \end{equation} $$
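The forward pass described by these two formulas can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the model class used in the post: the function name `forward`, the random parameters and the batch of four random inputs are all hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(x, W_h, b_h, W_o, b_o):
    """Return the hidden activations h and the class probabilities y."""
    z_h = x @ W_h + b_h  # hidden pre-activation, shape (n_samples, 2)
    h = relu(z_h)
    z_o = h @ W_o + b_o  # output pre-activation, shape (n_samples, 3)
    y = softmax(z_o)
    return h, y

rng = np.random.default_rng(0)
W_h, b_h = rng.normal(size=(2, 2)), np.zeros((1, 2))  # 2 inputs -> 2 hidden neurons
W_o, b_o = rng.normal(size=(2, 3)), np.zeros((1, 3))  # 2 hidden neurons -> 3 classes

x = rng.normal(size=(4, 2))  # 4 hypothetical input samples
h, y = forward(x, W_h, b_h, W_o, b_o)

# Classification rule: the class with the highest probability
predicted_class = y.argmax(axis=1)
print(y.sum(axis=1), predicted_class)
```

Each row of `y` sums to one, and `argmax` implements the rule that the highest probability gives the class.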

Given the targets $\mathbf{t}$ and following the chain rule, we can decompose the derivative of the cost function $\xi(\mathbf{t}, y)$ w.r.t the output neuron parameters:

$$ \begin{equation} \frac{\partial \xi(\mathbf{t}, y)}{\partial \mathbf{W_o}} = \frac{\partial \xi(\mathbf{t}, y)}{\partial y} \frac{\partial y}{\partial z_o} \frac{\partial z_o}{\partial \mathbf{W_o}}, \end{equation} $$

where $z_o$ is the output of the neurons for the output layer (just before the activation function).

The derivative is quite different for $\mathbf{W_h}$ since we need to go "deeper" into the model to compute it. But it is still possible to reuse some previous results to avoid redundancy (here $z_h$ is the output of the hidden neurons just before the activation function):

$$ \begin{equation} \begin{split} \frac{\partial \xi(\mathbf{t}, y)}{\partial \mathbf{W_h}} & = \frac{\partial \xi(\mathbf{t}, y)}{\partial h} \frac{\partial h}{\partial z_h} \frac{\partial z_h}{\partial \mathbf{W_h}} \\ & = \frac{\partial \xi(\mathbf{t}, y)}{\partial z_o} \frac{\partial z_o}{\partial h} \frac{\partial h}{\partial z_h} \frac{\partial z_h}{\partial \mathbf{W_h}}, \end{split} \end{equation} $$

with $$ \begin{equation} \frac{\partial z_o}{\partial h} = \mathbf{W_o} \end{equation} $$

The same process stands for the bias parameters of the output $\mathbf{b_o}$ and hidden layer $\mathbf{b_h}$.
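As a rough sketch of how these chain-rule products translate into code, the snippet below computes the four gradients for a whole batch at once. It assumes a ReLU hidden layer and a cross-entropy cost averaged over the samples; the function names `cost` and `gradients` and the random data are hypothetical and are not the implementation used in the post.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cost(x, t, W_h, b_h, W_o, b_o):
    """Mean cross-entropy (assumed cost) of the ReLU + softmax network."""
    h = np.maximum(x @ W_h + b_h, 0.0)
    y = softmax(h @ W_o + b_o)
    return -np.mean(np.sum(t * np.log(y), axis=1))

def gradients(x, t, W_h, b_h, W_o, b_o):
    """Gradients of the mean cross-entropy w.r.t. all parameters."""
    n = x.shape[0]
    # Forward pass
    z_h = x @ W_h + b_h
    h = np.maximum(z_h, 0.0)
    y = softmax(h @ W_o + b_o)
    # Output layer: d(cost)/d(z_o) for softmax combined with cross-entropy
    d_zo = (y - t) / n
    grad_W_o = h.T @ d_zo
    grad_b_o = d_zo.sum(axis=0, keepdims=True)
    # Hidden layer: reuse d_zo, then go through W_o and the ReLU derivative
    d_h = d_zo @ W_o.T
    d_zh = d_h * (z_h > 0)
    grad_W_h = x.T @ d_zh
    grad_b_h = d_zh.sum(axis=0, keepdims=True)
    return grad_W_h, grad_b_h, grad_W_o, grad_b_o

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 2))               # 6 hypothetical samples
t = np.eye(3)[rng.integers(0, 3, size=6)]  # one-hot targets
W_h, b_h = rng.normal(size=(2, 2)), np.zeros((1, 2))
W_o, b_o = rng.normal(size=(2, 3)), np.zeros((1, 3))

grad_W_h, grad_b_h, grad_W_o, grad_b_o = gradients(x, t, W_h, b_h, W_o, b_o)
```

Note how `d_zo` is computed once and reused for the hidden layer, which is the reuse of previous results mentioned above.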

With all the previous code, we can now design the entire model:

It is now time to train the model!

In [8]:
# Learning phase

# Hyper-parameters and model instantiation
lr = 0.01      # learning rate
n_iter = 1000  # number of gradient descent iterations
# Both models start from the same initial parameters: 2 inputs, 3 hidden units, 3 classes
weights = [np.random.randn(2, 3), np.random.randn(3, 3)]
bias = [np.zeros((1, 3)), np.zeros((1, 3))]
cost_relu = np.array([])
cost_logits = np.array([])
relu_model = NoneLinearModel(weights=weights, bias=bias, hidden_activation="relu")
logits_model = NoneLinearModel(weights=weights, bias=bias, hidden_activation="logits")

for i in range(n_iter):
    # backpropagation
    Jw, Jb = relu_model.back_propagation(inputs=X, t=C)
    weights = [relu_model.weights[l] - lr * Jw[l] for l in range(relu_model.n_layers)]
    bias = [relu_model.bias[l] - lr * Jb[l] for l in range(relu_model.n_layers)]
    relu_model.update_model(weights=weights, bias=bias)
    # cost function
    probs = relu_model.feed_forward(X)[-1]
    cost_relu = np.append(cost_relu, cost_function(input=probs, t=C))

    # backpropagation
    Jw, Jb = logits_model.back_propagation(inputs=X, t=C)
    weights = [logits_model.weights[l] - lr * Jw[l] for l in range(logits_model.n_layers)]
    bias = [logits_model.bias[l] - lr * Jb[l] for l in range(logits_model.n_layers)]
    logits_model.update_model(weights=weights, bias=bias)
    # cost function
    probs = logits_model.feed_forward(X)[-1]
    cost_logits = np.append(cost_logits, cost_function(input=probs, t=C))

2.3. Quantitative and qualitative analysis

By comparing the decision functions of the ReLU and logistic models (the latter is called `logits` in the code), we see that the ReLU model converges faster, but its decision boundary is made of straight segments. Because of the smoothness of the logistic activation, its decision function is less prone to error.


You should now have a deeper understanding of the mathematics behind a fully-connected neural network! In practice, dense layers scale poorly to high-dimensional inputs such as images, because the number of parameters grows with the product of the input and output sizes. This is why convolutional neural networks are so widely used for such data: sharing weights across positions reduces the number of parameters and makes learning easier.
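A quick back-of-the-envelope count shows the difference in the number of parameters. The image size (32 x 32 x 3), the 100 units or filters and the 3 x 3 kernel are arbitrary assumptions chosen for illustration.

```python
def dense_params(n_inputs, n_units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return n_inputs * n_units + n_units

def conv2d_params(kernel_size, in_channels, out_channels):
    """One kernel_size x kernel_size filter per (input, output) channel pair, plus biases."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

height, width, channels = 32, 32, 3  # hypothetical small colour image

n_dense = dense_params(height * width * channels, 100)
n_conv = conv2d_params(3, channels, 100)

print(n_dense)  # 307300
print(n_conv)   # 2800
```

The convolutional layer does not depend on the image size at all, which is why it stays small when the input grows.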

To go further

Deriving the gradients by hand like we did is prone to errors (especially for large networks), which is why it is worth checking them numerically, for example with finite differences. Deep learning packages such as TensorFlow or PyTorch avoid the problem altogether by computing the gradients through automatic differentiation. Peter Roelants's blog post has a good explanation of numerical gradient checking!
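A minimal gradient check can be sketched as follows. To keep it short, it uses a single softmax layer without bias and a mean cross-entropy cost; the function names, the step `eps=1e-6` and the random data are assumptions made for this illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cost(W, X, T):
    """Mean cross-entropy of a single softmax layer (no bias, for brevity)."""
    Y = softmax(X @ W)
    return -np.mean(np.sum(T * np.log(Y), axis=1))

def analytic_gradient(W, X, T):
    """Gradient derived by hand: X^T (Y - T) / n_samples."""
    Y = softmax(X @ W)
    return X.T @ (Y - T) / X.shape[0]

def numerical_gradient(f, W, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))                 # 10 hypothetical samples
T = np.eye(3)[rng.integers(0, 3, size=10)]   # one-hot targets
W = rng.normal(size=(2, 3))

num = numerical_gradient(lambda w: cost(w, X, T), W)
ana = analytic_gradient(W, X, T)
max_diff = np.abs(num - ana).max()
print(max_diff)
```

If the hand-derived gradient is correct, `max_diff` should be tiny; a large value usually points to a mistake in the derivation or in its implementation.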

Check these nice animations if you want to understand how convolutions are used in neural networks.

Understanding deep learning is a hot topic. If you are interested, you should definitely check this Distill blog.


Thanks to Peter Roelants, who runs a nice blog on machine learning. It helped me gain a deeper understanding of the mathematics behind neural networks. Some of the code was also inspired by his work.


1. McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics. 5, 115–133 (1943).

2. Bridle, J.S.: Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. Neurocomputing. 227–236 (1990).

3. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: ICML (2010).