
Singular value decomposition (SVD) is a widely used method that plays a central role in data preprocessing for machine learning. It is mostly used to filter noise out of the data, reduce its dimensionality/complexity, and obtain uncorrelated features. SVD extracts, using linear algebra, the mathematical basis that best describes the data.

tl;dr

  1. Organize your input data as an $n\times d$ matrix (one sample per row).
  2. Center the data by subtracting its mean from the matrix.
  3. Use SVD to extract the singular vectors of the data; the first one is the direction of the line.
  4. The parametric form of the line is given by this direction and the average point.
  5. Calculate any point of the line by fixing the free parameter (see the sketch after this list).
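
A compact sketch of these five steps (assuming numpy; the rest of this post walks through each step in detail):

# minimal end-to-end sketch of the steps above
import numpy as np

points = np.random.rand(25, 3)           # 1. n x d matrix, one sample per row
avg = points.mean(axis=0)                # 2. mean of the data
_, _, V = np.linalg.svd(points - avg)    # 2./3. center, then SVD
direction = V[0]                         # 3. first right singular vector = line direction
p = avg + 2.0 * direction                # 4./5. p(t) = p0 + t*d, here with t fixed to 2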

1. Theory

The key idea behind SVD is that any matrix $M$ can be factorized in the form: \begin{equation} M = U\Sigma V^* \end{equation} where $U$ and $V^*$ are orthogonal (rotation) matrices and $\Sigma$ is a diagonal scaling matrix containing the singular values. The rows of $V^*$ give you the set of vectors onto which the data can be projected in the new space.
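
As a quick check of this factorization (a minimal sketch, assuming numpy), we can factor an arbitrary, even rectangular, matrix and reconstruct it exactly:

import numpy as np

M = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])                   # any real matrix, no determinant condition needed

U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(np.allclose(M, U @ np.diag(S) @ Vt))   # True: M = U Sigma V*
print(S)                                     # singular values, in decreasing order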

In the following figure, the axis $y_1$ explains the data best: when you project the data onto $y_1$, the variance is higher than when you project it onto $y_2$.

Figure: the same data projected onto two candidate axes, $y_1$ (high variance) and $y_2$ (low variance).
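
To make the figure concrete, here is a small sketch (not part of the original notebook) that compares the variance of a 2-D point cloud projected onto its first and second right singular vectors, playing the roles of $y_1$ and $y_2$:

import numpy as np

cloud = np.random.randn(200, 2) * np.array([3.0, 0.3])   # elongated point cloud
cloud -= cloud.mean(axis=0)

_, _, Vt = np.linalg.svd(cloud)
y1, y2 = Vt[0], Vt[1]                  # orthonormal candidate axes

print(np.var(cloud @ y1))              # large: y1 explains most of the data
print(np.var(cloud @ y2))              # small: y2 explains little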

For the rest of this tutorial, we will use SVD to fit a 3-dimensional line to some random 3D points (but this method can be extended to $n$-D).

2. Data generation

First, we generate $n$ points along a line and add some random uniform noise. We organize the input data as an $n$×$d$ matrix, where $d$ is the number of dimensions (or features) and $n$ the number of samples, so each row is one sample.

In [1]:
import numpy as np

# fix the random seed so the results below are reproducible
np.random.seed(0)

In [2]:
# input data: n noisy points along the line x = 5t, y = 5 + 5t, z = 5
n = 25
points = np.array([5*np.arange(n), 5 + 5*np.arange(n), 5*np.ones(n)]).T + 0.5 - np.random.rand(n, 3)
points
Out[2]:
array([[-4.88135039e-02,  4.78481063e+00,  4.89723662e+00],
       [ 4.95511682e+00,  1.00763452e+01,  4.85410589e+00],
       [ 1.00624128e+01,  1.46082270e+01,  4.53633724e+00],
       [ 1.51165585e+01,  1.97082750e+01,  4.97110508e+00],
       [ 1.99319554e+01,  2.45744034e+01,  5.42896394e+00],
       [ 2.54128707e+01,  3.04797816e+01,  4.66738015e+00],
       [ 2.97218432e+01,  3.46299879e+01,  4.52138166e+00],
       [ 3.47008414e+01,  4.00385206e+01,  4.71947082e+00],
       [ 4.03817256e+01,  4.48600790e+01,  5.35664671e+00],
       [ 4.45553311e+01,  4.99781517e+01,  5.08533806e+00],
       [ 5.02354444e+01,  5.47257663e+01,  5.04384967e+00],
       [ 5.49315661e+01,  6.04812102e+01,  4.88236450e+00],
       [ 5.98879043e+01,  6.48830660e+01,  4.55625192e+00],
       [ 6.48181797e+01,  7.01404921e+01,  5.06296805e+00],
       [ 6.98023688e+01,  7.54397745e+01,  4.83323328e+00],
       [ 7.48293621e+01,  8.02896174e+01,  5.37107370e+00],
       [ 8.01845716e+01,  8.51362892e+01,  4.92980323e+00],
       [ 8.50613985e+01,  8.95116262e+01,  5.39795519e+00],
       [ 9.02911232e+01,  9.53386905e+01,  4.84689167e+00],
       [ 9.52467084e+01,  1.00033689e+02,  5.25557441e+00],
       [ 1.00341030e+02,  1.05389625e+02,  4.84367041e+00],
       [ 1.05361817e+02,  1.10303418e+02,  5.13127483e+00],
       [ 1.09679007e+02,  1.15402899e+02,  4.66205509e+00],
       [ 1.15403902e+02,  1.19523541e+02,  5.03134880e+00],
       [ 1.19523239e+02,  1.24895154e+02,  4.76073642e+00]])

3. Performing SVD

Before performing SVD, it is necessary to center the data by subtracting its mean. Without centering, the first singular vector would mostly point from the origin toward the mean of the data rather than along the direction of maximal variance: the factorization only rotates and scales about the origin (as we saw earlier), so it cannot account for an offset in the data (i.e. the intercept in linear regression).
Sometimes it can also be useful to normalize the input; since our data does not have large scale differences between dimensions, we do not need to here. We will use the linear algebra package from numpy for the SVD.

In [3]:
# calculating the mean of the points
avg = np.mean(points, axis=0)

# subtracting the mean from all points
subtracted = points - avg

# performing SVD; the rows of V are the right singular vectors,
# sorted by decreasing singular value
_, _, V = np.linalg.svd(subtracted)
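
To illustrate why centering matters (a small sketch, not from the original notebook), we can push the same points far from the origin: without centering, the first right singular vector follows the offset toward the mean instead of the line direction.

# shift the cloud far away from the origin along z
shifted = points + np.array([0.0, 0.0, 1000.0])

_, _, V_raw = np.linalg.svd(shifted)                          # no centering
_, _, V_ctr = np.linalg.svd(shifted - shifted.mean(axis=0))   # with centering

print(V_raw[0])   # roughly the direction of the shifted mean, up to sign
print(V_ctr[0])   # recovers the line direction, ~ (0.71, 0.71, 0)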

4. Finding the line

To estimate the equation of a line, we need its direction (a vector) and one point that lies on that line. Previously, we performed SVD and extracted the $V^*$ matrix, whose rows are the right singular vectors (the eigenvectors of the covariance matrix of the data). The first of these vectors is the one that best describes the data, which in our case is the direction of the line that best fits all the points!

One example of a point that lies on this line is the average of the samples that we calculated previously. Any point of the line is then given by: \begin{equation} p(t) = p_0 + t\,d \end{equation} where $t$ is a free parameter that can be any real number.

In [4]:
# find the direction vector (which is the right singular vector corresponding to the largest singular value)
direction = V[0, :]

# A line is defined by the average and its direction
p0 = avg
d = direction
print(d)
[0.70644472 0.70776785 0.00072705]
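
As a sanity check (not part of the original notebook), the points were generated with increments of 5 in $x$ and $y$ and a constant $z$, so the true direction is $(1, 1, 0)/\sqrt{2} \approx (0.707, 0.707, 0)$; the fitted direction agrees closely, up to sign:

true_d = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)   # direction used to generate the data
print(np.abs(np.dot(true_d, d)))                  # ~ 1.0, so the fit matches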

We can calculate the angle $\alpha$ between two lines with directions $d_0$ and $d_1$ using: \begin{equation} \alpha = \arccos\Big(\frac{d_0 \cdot d_1}{\|d_0\|\,\|d_1\|}\Big) \end{equation}

For example, this is the angle between our line and the $z$-axis $(0, 0, 1)$.

In [5]:
d0 = np.array([0, 0, 1])
angle = np.arccos(np.dot(d0,d)/(np.linalg.norm(d0) * np.linalg.norm(d)))
print(angle*180/np.pi)
89.95834289649719

5. Plotting the line

Using the parametric form of the line, we can compute two different points by fixing the free parameter $t$ to two values (choose values large enough for the segment to span the data).

In [6]:
pa = p0 + (-100)*d
pb = p0 + 100*d

To plot the 3D line, we will use Plotly, which has good HTML embedding and smooth 3D rendering.

In [7]:
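# The original plotting cell is not shown; this is a minimal sketch using
# plotly.graph_objects (assumed to be installed) to draw the points and the fitted line.
import plotly.graph_objects as go

fig = go.Figure(data=[
    go.Scatter3d(x=points[:, 0], y=points[:, 1], z=points[:, 2],
                 mode='markers', name='points'),
    go.Scatter3d(x=[pa[0], pb[0]], y=[pa[1], pb[1]], z=[pa[2], pb[2]],
                 mode='lines', name='fitted line'),
])
fig.show()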

To go further

If you want to improve your understanding of SVD and its relation to PCA, check this nice paper on the web. On the importance of data normalization, check this thread.