Introduction

With the increased popularity of neural networks, some of the mathematical intuition and foundations can be abstracted away thanks to frameworks like Tensorflow, which make neural networks readily available due to their ease of implementation. However, in order to use such tools effectively, or simply to appreciate their effectiveness, it is important to remind ourselves of their humble foundations.

Linear Algebra – A Preface

Linear algebra provides us with a framework for working with linear transformations via matrix multiplication. In fact, neural networks provide us with another framework for doing the same thing (and more). For example, consider the following linear transformation, which takes a vector $(x, y)$ and produces $(2x, y)$:

$$ \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 6 \\ 1 \end{bmatrix} $$

This is how the equation above could be viewed as a neural network with one input layer and one output layer:

[Figure: a two-unit input layer holding (3, 1) connected to a two-unit output layer producing (6, 1), with connection weights 2, 0, 0, 1.]

In the neural network diagram above, each output unit produces the linear combination of the inputs and the connection weights, which is the same thing we do with matrix multiplication.
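To make this concrete, here is a minimal NumPy sketch of the same computation, once as a matrix product and once as per-unit weighted sums (the variable names are our own):

```python
import numpy as np

W = np.array([[2.0, 0.0],
              [0.0, 1.0]])  # connection weights of the output layer
x = np.array([3.0, 1.0])    # input vector (x, y)

# Linear algebra view: a single matrix-vector product
print(W @ x)  # [6. 1.]

# Neural network view: each output unit takes a weighted sum of the inputs
output = np.array([np.dot(W[0], x), np.dot(W[1], x)])
print(output)  # [6. 1.]
```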

We’d also like to handle affine transformations, which are essentially linear transformations with translations allowed. With linear algebra, we usually handle affine transformations using vector addition:

$$ \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 3 \\ 1 \end{bmatrix} + \begin{bmatrix} 0 \\ 4 \end{bmatrix} = \begin{bmatrix} 6 \\ 5 \end{bmatrix} $$

In our neural network, this affine transformation would take the form of bias inputs:

[Figure: the same network with a constant bias input of 1 added alongside the inputs (3, 1); the weights 2, 0, 0, 1 are unchanged, the bias connects to the output units with weights 0 and 4, and the output is (6, 5).]

Like before, each output unit performs a linear combination of the incoming weights and inputs. This time, though, the units have a constant bias input, which each output unit can weight independently to achieve the effect of a translation vector. In this case, we use a weight of 0 for the first output unit to zero out the bias, and a weight of 4 for the second output unit to scale the bias accordingly.
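As a quick sketch (again with our own variable names), the bias weights 0 and 4 reproduce the translation vector from the matrix equation above:

```python
import numpy as np

W = np.array([[2.0, 0.0],
              [0.0, 1.0]])           # weights on the (x, y) inputs
bias_weights = np.array([0.0, 4.0])  # each output unit's weight on the constant bias
bias_input = 1.0                     # the constant bias input

x = np.array([3.0, 1.0])

# Affine transformation: linear part plus the independently weighted bias
output = W @ x + bias_weights * bias_input
print(output)  # [6. 5.]
```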

Neural networks as general nonlinear transformers

Let’s take a simple yet classic classification problem, one that is not linearly separable, to demonstrate the superior capabilities of neural networks over linear models. We can synthetically generate a spiral dataset, as shown below, with 2 classes and 1,000 points.
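The generation code is not shown here, but a two-class spiral with 1,000 points can be synthesized roughly as follows (a sketch; the radii, angles, and noise level are assumptions, not necessarily the settings behind the plot below):

```python
import numpy as np

def make_spiral(n_points=1000, n_classes=2, noise=0.2, seed=0):
    """Generate interleaved spirals, one per class."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    per_class = n_points // n_classes
    for c in range(n_classes):
        r = np.linspace(0.1, 1.0, per_class)                           # radius grows outwards
        t = np.linspace(c * np.pi, c * np.pi + 3 * np.pi, per_class)   # angle, offset per class
        t += rng.normal(scale=noise, size=per_class)                   # jitter so the arms overlap slightly
        X.append(np.column_stack([r * np.sin(t), r * np.cos(t)]))
        y.append(np.full(per_class, c))
    return np.vstack(X), np.concatenate(y)

X, y = make_spiral()
```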

A linear model is not going to be able to separate these classes, since linear models have a straight line as a decision boundary. However, it is very simple to train a neural network on this data to classify the points with high accuracy. Using just two hidden layers with 128 and 2 units, respectively, gives us an accuracy of 95%. As we can see below, the neural network created a non-linear decision boundary.
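A model matching that description might look like this in Keras (a sketch based only on the details stated here: two hidden layers of 128 and 2 units with ReLU and a softmax output; the optimizer, loss, and training settings are our own assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),                      # (x, y) coordinates of a point
    tf.keras.layers.Dense(128, activation="relu"),   # first hidden layer
    tf.keras.layers.Dense(2, activation="relu"),     # final hidden layer: 2 units, easy to plot
    tf.keras.layers.Dense(2, activation="softmax"),  # one output per class
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# X, y: the spiral dataset from the sketch above
model.fit(X, y, epochs=200, batch_size=32, verbose=0)
```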

We know that neural networks are a stack of linear transformations with non-linearities in between, with the final layer usually being a linear transformation. So, if we consider all the layers before the last layer, all they are doing is applying matrix multiplications and activation functions so that the data becomes linearly separable for the final layer. From the geometric interpretation of linear algebra, each of these matrix multiplications is a linear transformation of the vector space.
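This is easy to verify numerically: two weight matrices with nothing in between collapse into a single matrix, i.e. a single linear transformation, whereas a ReLU in between breaks that equivalence. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 2))   # first layer: 2 -> 4
W2 = rng.normal(size=(2, 4))   # second layer: 4 -> 2
x = rng.normal(size=2)

# Without an activation, stacking the layers is just one matrix product
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# With a ReLU in between, the composition is no longer a single linear map
relu = lambda z: np.maximum(z, 0.0)
print(W2 @ relu(W1 @ x))       # in general differs from (W2 @ W1) @ x
```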

Let us again visualize the points in the original space as well as the output of the final hidden layer of the network.

We can see that what the hidden layers have learnt is a transformation from the original input space to another space in which the points are linearly separable! This “learning” is essentially the warping and folding of space by the neural network so that the input points become linearly separable. This is made possible by the activation function (in this case ReLU), which introduces non-linearity. Without an activation function, the total transformation would still be linear and would not be able to resolve the non-linearly distributed points into linearly separable ones.
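One way to produce the visualization above is to read out the activations of the final hidden layer with a feature-extractor model that reuses the trained layers (a sketch assuming the Keras model defined earlier):

```python
import matplotlib.pyplot as plt
import tensorflow as tf

# A second model that stops at the final hidden layer (the 2-unit Dense)
feature_extractor = tf.keras.Model(inputs=model.inputs,
                                   outputs=model.layers[-2].output)
hidden = feature_extractor.predict(X, verbose=0)  # shape (1000, 2)

# Original points vs. their images under the learned transformation
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=y, s=5)
ax1.set_title("Original input space")
ax2.scatter(hidden[:, 0], hidden[:, 1], c=y, s=5)
ax2.set_title("Output of the final hidden layer")
plt.show()
```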