These notes cover some of the mathematical concepts and derivations behind simple machine learning algorithms. They can be viewed as a high-level summary of introductory machine learning.

Helpful Prior Knowledge

  • Basic linear algebra, fundamental operations with vectors and matrices, transpose and inverse
  • Basic conceptual calculus, taking derivatives

Notation

Common idiomatic notation used in machine learning and linear algebra, used throughout these notes.

  • $\theta$ - Vector of parameters
  • $x$ - Matrix of features
  • $n$ - Number of features
  • $m$ - Number of training examples
  • $y$ - The correct values for each set of features in the training set
  • $h_{\theta}(x)$ - Hypothesis function
  • $J(\theta)$ - Cost function

Given a matrix $A$:

  • $A_{ij}$ - the entry in the $i^{th}$ row, $j^{th}$ column
  • $A^T$ - the transpose of $A$
  • $A^{-1}$ - the inverse of $A$

For our matrix of features $x$:

  • $x^{(i)}$ - vector of the features in the $i^{th}$ training example
  • $x^{(i)}_{j}$ - value of feature $j$ in the $i^{th}$ training example

Linear Regression

Basic Concepts

  • Features: The numerical inputs for the algorithm from which it makes predictions.
  • Parameters: The weights which we multiply to the features to produce the final output (corresponds to the prediction we make). These are the values we are trying to learn.
  • Hypothesis function: Function which uses the parameters to map the input features to the final prediction. It is represented as $h_{\theta}(x)$, taking the features $x$ (could be a single value or a vector) as input.
  • Some ways to learn $\theta$ values:
    • Gradient Descent:
      • Cost function: Function that returns the error between the hypothesis function's predictions and the actual correct values in the training set. It is represented as $J(\theta)$, taking the parameters used to make the prediction ($\theta$) as input.
      • We try to find values of $\theta$ which minimize the cost function, because the lower the cost function, the lower our error is. We do this by finding the global minimum.
    • Normal Equation: An equation to calculate the minimum without the need for gradient descent. We will talk about this later.

How it works

Linear regression fits a straight-line model to the dataset, so our hypothesis function for univariate (one-feature) linear regression is:

$$h_{\theta}(x) = \theta_0 + \theta_1x$$

This is a basic 2D straight line, where $x$ is our feature and we are trying to learn the bias parameter $\theta_0$ and the weight parameter $\theta_1$. We only have a single feature parameter because we only have one feature.
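For example, if we had learned $\theta_0 = 1$ and $\theta_1 = 2$ (arbitrary values for illustration), the input $x = 3$ would give the prediction $h_{\theta}(x) = 1 + 2 \cdot 3 = 7$.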

If we do the same for multiple features, we will get a linear multi-dimensional equation:

$$h_{\theta}(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \cdots + \theta_nx_n$$

Where $n$ is the number of features. Each feature (the $x$ terms) has a weight parameter ($\theta_1$ through $\theta_n$), and the bias is collected into the single term $\theta_0$. Notice how the input $x$ is no longer a single value, and is instead a collection of values $x_1, x_2, x_3, \cdots, x_n$, which we can represent as a column vector. We can do the same thing with our $\theta$ values:

$$x = \begin{bmatrix} x_0 = 1 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \ \ \ \ \ \ \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$$

Notice how we have added an extra $x_0$ term into the start of the $x$ vector, and we set it equal to 1. This corresponds to the bias term $\theta_0$, which we used in the hypothesis equations. We do this because $\theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n= \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n$ if $x_0 = 1$. This also matches the dimensions of both vectors, enabling us to do operations such as multiplication with them.
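As a concrete sketch of this (using NumPy, with made-up feature values), adding the $x_0 = 1$ entry to a feature vector might look like the following:

import numpy as np

# a single training example with n = 3 features (the values are made up)
x_raw = np.array([2104.0, 3.0, 1.5])

# prepend x_0 = 1 so that x lines up with theta = [theta_0, theta_1, ..., theta_n]
x = np.concatenate(([1.0], x_raw))   # array([1., 2104., 3., 1.5])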

With our two vectors, we can write out a vectorized version of the hypothesis function as $\theta^T x$, which we can see is equivalent to our original equation:

$$h_{\theta}(x) = \theta^T x = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & \cdots & \theta_n \end{bmatrix} \times \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
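Continuing the sketch above, the vectorized hypothesis is then a single dot product (the $\theta$ values here are arbitrary placeholders):

theta = np.array([80.0, 0.1, 50.0, 10.0])   # placeholder values for [theta_0, theta_1, theta_2, theta_3]

# h_theta(x) = theta^T x
prediction = theta @ x                      # equivalent to np.dot(theta, x)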

Gradient Descent

One way to learn the values of $\theta$ is gradient descent. In order to implement this, we need a cost function which calculates the error of our hypothesis function above. There are a variety of cost functions that could be used, but the typical one for simple regression is a variation on the average of the squared error:

$$J(\theta) = \dfrac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2$$

Recall that $m$ represents the number of training examples we have, and $y$ represents the actual correct value for each set of $x$ features in our training set. For each training example, this function takes the value of the hypothesis for that example ($h_{\theta}(x^{(i)})$), calculates the difference between it and the corresponding actual value $y^{(i)}$, and squares that difference, which guarantees a positive value. We then sum up these squared differences and divide by $2m$, a slight variation on the mean squared error (which would divide by $m$ only). The reason we also divide by 2 is that it makes the derivative nicer: since the term inside the summation is squared, differentiating produces a coefficient of 2 in front, which cancels with the 2 in the denominator.
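One possible vectorized sketch of this cost function, assuming X is the $m \times (n+1)$ matrix whose rows are the training examples (each with the leading 1) and y is the vector of correct values:

def cost(theta, X, y):
    """J(theta): half the mean squared error over the training set."""
    m = len(y)
    errors = X @ theta - y              # h_theta(x^(i)) - y^(i) for every example at once
    return (errors @ errors) / (2 * m)  # sum of squared errors, divided by 2m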

The actual gradient descent step comes from finding the values of $\theta$ that minimize this function, in other words, the global minimum. At the minimum, the derivative (in this case the partial derivative) of the cost function with respect to $\theta$ will be 0. We can calculate the derivative as follows:

\begin{align*}
\dfrac{\partial}{\partial\theta} J(\theta) &= \dfrac{1}{2m} \cdot \dfrac{\partial}{\partial\theta} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2  &\text{(Note: } m \text{ is a constant)} \\
&= \dfrac{1}{2m} \cdot \sum_{i=1}^m \dfrac{\partial}{\partial \theta} (\theta^T x^{(i)} - y^{(i)})^2  &(\theta^T x^{(i)} \text{ is substituted for } h_{\theta}(x^{(i)})) \\
&= \dfrac{1}{2m} \cdot \sum_{i=1}^m \:2(\theta^Tx^{(i)} - y ^{(i)}) \cdot x^{(i)} &\text{(Note: } y^{(i)} \text{ is a constant)} \\
&= \dfrac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}  &\text{(Simplify and substitute back } h_{\theta}(x^{(i)}))
\end{align*}
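The same derivative written as a vectorized sketch, using the X and y described above:

def gradient(theta, X, y):
    """Vector of partial derivatives of J(theta), one entry per theta_j."""
    m = len(y)
    errors = X @ theta - y
    return (X.T @ errors) / m   # the j-th entry is (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)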

One way to get to the minimum is to repeatedly subtract the value of the derivative from the old $\theta$ value. By doing this, when the derivative is positive (indicating we are to the right of the minimum), $\theta$ will be lowered (move to the left), when the derivative is negative (indicating we are to the left of the minimum), $\theta$ will be raised (move to the right). Thus, with many iterations of this, we will eventually approach the minimum. Here is the mathematical representation (the $:=$ is used to show that we are updating the value, rather than as an equality operator):

\begin{align*} & \text{For } j = 0, \cdots, n \\ & \text{repeat until convergence \{} \\ & \qquad \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta) \\ &\}\end{align*}

Substituting in the derivative we took above, $x^{(i)}$ is replaced with $x^{(i)}_j$ because when we take the partial derivative with respect to a specific $\theta_j$, the only coefficient that remains is the value of feature $j$ in the $i^{th}$ training example:

\begin{align*} & \text{For } j = 0, \cdots, n \\ & \text{repeat until convergence \{} \\ & \qquad \theta_j := \theta_j - \dfrac{\alpha}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}_j \\ &\}\end{align*}

We have added a new variable: $\alpha$. This is called the learning rate, and as you can probably guess from the equation, it controls the size of the step we take with each iteration. A large $\alpha$ value will lead to subtracting or adding larger values to $\theta_j$ each time. Too small a learning rate will lead to gradient descent taking too long to converge, because we are taking very small steps each time. Too large a learning rate can cause the algorithm to never converge, because it will overshoot the minimum each time.

One important point is that we are repeating this step for multiple variables. If we were to write it out fully, assuming we have 50 features (meaning that $x \in \mathbb{R}^{51}$ and $\theta \in \mathbb{R}^{51}$):

\begin{align*} & \text{repeat until convergence \{} \\ & \qquad \theta_0 := \theta_0 - \dfrac{\alpha}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}_0 \\ & \qquad \theta_1 := \theta_1 - \dfrac{\alpha}{m} \sum_{i=1}^m (h_{\theta} (x^{(i)}) - y^{(i)})x^{(i)}_1 \\ & \qquad \theta_2 := \theta_2 - \dfrac{\alpha}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}_2 \\ & \qquad \qquad \vdots \\ & \qquad \theta_{50} := \theta_{50} - \dfrac{\alpha}{m} \sum_{i=1}^m (h_{\theta} (x^{(i)}) - y^{(i)})x^{(i)}_{50} \\ &\}\end{align*}

Because our $h_{\theta}(x^{(i)})$ depends on the values of the parameter vector $\theta$, we need to make sure we update all of the values simultaneously, only after we are done with the computations that use them. Consider the following incorrect pseudocode for a single gradient descent step with three parameters:

# assume:
#   theta_0 is the bias term
#   theta_1 is the first feature parameter, theta_2 is the second, ... etc.
#   alpha is the learning rate, m is the number of training examples
#   dcost_0, dcost_1, ... etc. stand for the summation term of the partial derivative
#   for each respective theta, evaluated with the current theta values when the line runs

theta_0 = theta_0 - ((alpha / m) * dcost_0)
theta_1 = theta_1 - ((alpha / m) * dcost_1)  # already sees the new theta_0!
theta_2 = theta_2 - ((alpha / m) * dcost_2)  # already sees the new theta_0 and theta_1!

This is wrong because we are updating the values before we are finished using all of them! Here is a correct implementation, where we update the $\theta$ values simultaneously after the computation:

# compute every update using the old theta values
temp0 = theta_0 - ((alpha / m) * dcost_0)
temp1 = theta_1 - ((alpha / m) * dcost_1)
temp2 = theta_2 - ((alpha / m) * dcost_2)

# then assign all of the new values at once
theta_0 = temp0
theta_1 = temp1
theta_2 = temp2
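With vectorized operations, the simultaneous update comes for free, because the entire new $\theta$ vector is computed from the old one in a single expression before anything is overwritten. Here is a minimal sketch of the full loop (again assuming X contains the column of ones, and using a fixed number of iterations in place of a convergence check):

import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression; returns the learned theta vector."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = X @ theta - y
        theta = theta - (alpha / m) * (X.T @ errors)   # every theta_j is updated simultaneously
    return theta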