Linear Regression
Regression is fitting a line to our data, i.e., discovering the hidden relationship between the input and the output.
Univariate linear regression
- Linear regression with one independent variable (or feature).
- A training example: $(x^{(i)},y^{(i)})$, where $i=1,\ldots,m$ and $m$ is the number of training examples.
- $X$ - space of input variables; $Y$ - space of output variables
- $h: X \mapsto Y$, where $h$ is a hypothesis.
- Hypothesis: $h_{\theta}(x)=\theta_0 + \theta_1 x$; shorthand notation: $h(x)$. $\theta_0$ is the intercept, $\theta_1$ is the slope.
- $\theta_i$'s: parameters. How do we choose the $\theta_i$'s?
- Choose $\theta_0$, $\theta_1$ so that $h(x)$ is close to $y$ for the training set.
- Cost function: Squared error function \begin{equation} J(\theta_0, \theta_1)= \frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})^2 \label{eq:cost_fn_linear_uni} \end{equation} \begin{equation*} J(\theta_0, \theta_1)= \frac{1}{2m} \sum_{i=1}^m (\theta_0 + \theta_1 x^{(i)}-y^{(i)})^2 \end{equation*}
- Goal: $\underset{\theta_0, \theta_1}{\text{Minimize}}$ $J$; At $J$ minimum, $\frac{\partial J}{\partial \theta_0}=0$ and $\frac{\partial J}{\partial \theta_1}=0$.
- Sometimes the fitted intercept $\theta_0$ takes unrealistic values, e.g., when $x=0$ lies far outside the range of the data.
- A one-unit change in $x$ changes the predicted $y$ by $\theta_1$ units.
- Goodness of fit, $R^2$: the fraction of the variance in $y$ explained by the model, ranging from 0 to 1; if $R^2=0.55$, the model explains 55% of the variance.
- $R^2$: 0 (bad) $\to$ 1 (good); a sketch computing $J$ and $R^2$ follows this list.
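A minimal sketch of the cost function $J$ and of $R^2$, assuming NumPy and a hypothetical 1-D dataset `x`, `y` (the toy data below is purely illustrative):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) for univariate linear regression."""
    m = len(y)
    residuals = theta0 + theta1 * x - y
    return (residuals ** 2).sum() / (2 * m)

def r_squared(theta0, theta1, x, y):
    """Fraction of the variance in y explained by the fitted line."""
    predictions = theta0 + theta1 * x
    ss_res = ((y - predictions) ** 2).sum()   # unexplained variation
    ss_tot = ((y - y.mean()) ** 2).sum()      # total variation
    return 1 - ss_res / ss_tot

# Hypothetical toy data: y is roughly 2 + 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 1, size=50)
print(cost(2.0, 3.0, x, y), r_squared(2.0, 3.0, x, y))
```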
Derivation of coefficients
Expanding equation \eqref{eq:cost_fn_linear_uni},
\begin{equation}
J(\theta_0, \theta_1)= \frac{1}{2m} \sum_{i=1}^m \Big(\theta_0^2+\theta_1^2 (x^{(i)})^2 +(y^{(i)})^2 + 2 \theta_0 \theta_1 x^{(i)} - 2 \theta_1 x^{(i)} y^{(i)} - 2 \theta_0 y^{(i)}\Big)
\end{equation}
Calculating $\frac{\partial J}{\partial \theta_0}$,
\begin{equation}
\begin{split}
\frac{\partial J}{\partial \theta_0} & = \frac{1}{2m} \sum_{i=1}^m (2 \theta_0 + 2 \theta_1 x^{(i)}- 2 y^{(i)}) \cr
& = \theta_0 + \theta_1 \frac{ \sum_{i=1}^m x^{(i)}}{m} - \frac{\sum_{i=1}^m y^{(i)}}{m} \cr
& = \theta_0 + \theta_1 \bar{x} - \bar{y}
\end{split}
\label{eq:dj_dtheta_0}
\end{equation}
Setting $\frac{\partial J}{\partial \theta_0}=0$, \begin{equation} \theta_0 = \bar{y} - \theta_1 \bar{x} \label{eq:dtheta_0_eqn} \end{equation}
Calculating $\frac{\partial J}{\partial \theta_1}$,
\begin{equation}
\begin{split}
\frac{\partial J}{\partial \theta_1} & = \frac{1}{2m} \sum_{i=1}^m (2 \theta_1 (x^{(i)})^2 + 2 \theta_0 x^{(i)}- 2 x^{(i)} y^{(i)}) \cr
& = \theta_1 \frac{ \sum_{i=1}^m (x^{(i)})^2}{m} + \theta_0 \frac{ \sum_{i=1}^m x^{(i)}}{m} - \frac{\sum_{i=1}^m x^{(i)} y^{(i)}}{m} \cr
& = \theta_1 \frac{ \sum_{i=1}^m (x^{(i)})^2}{m} + \theta_0 \bar{x} - \frac{\sum_{i=1}^m x^{(i)} y^{(i)}}{m}
\end{split}
\end{equation}
Setting $\frac{\partial J}{\partial \theta_1}=0$,
\begin{equation}
\theta_1 \frac{ \sum_{i=1}^m (x^{(i)})^2}{m} + \theta_0 \bar{x} - \frac{\sum_{i=1}^m x^{(i)} y^{(i)}}{m}=0
\label{eq:dj_dtheta_1}
\end{equation}
Substituting equation \eqref{eq:dtheta_0_eqn} in equation \eqref{eq:dj_dtheta_1}, we get
\begin{equation}
\begin{split}
\theta_1 & = \frac{ \sum_{i=1}^m x^{(i)} y^{(i)} - m \bar{x}\bar{y}}{\sum_{i=1}^m (x^{(i)})^2 - m \bar{x}^2} \cr
& = \frac{Cov(x,y)}{Var(x)}
\end{split}
\end{equation}
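A minimal sketch of the closed-form coefficients just derived, under the same assumptions (NumPy, hypothetical arrays `x` and `y`):

```python
import numpy as np

def fit_univariate(x, y):
    """Closed-form fit: theta1 = Cov(x, y) / Var(x), theta0 = y_bar - theta1 * x_bar."""
    x_bar, y_bar = x.mean(), y.mean()
    theta1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

# With the hypothetical toy data above, fit_univariate(x, y) should return values
# close to (2, 3), matching the intercept and slope used to generate it.
```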
Gradient descent
- Repeat until convergence:
\begin{equation} \theta_j := \theta_j -\alpha \frac{\partial}{\partial \theta_j} J (\theta_0, \theta_1), \quad j=0,1 \text{ (update simultaneously)}, \end{equation} where $\alpha$ is the learning rate.
- If $\alpha$ is too small, gradient descent can be slow.
- If $\alpha$ is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
Batch gradient descent
All the training data are used to calculate $J$: the parameters are updated with the average of the gradients over all examples in the training set. One epoch corresponds to one gradient-descent step, and the cost decreases smoothly toward the minimum.
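A minimal sketch of batch gradient descent for the univariate case (the learning rate `alpha` and the number of epochs are illustrative assumptions, not prescribed by these notes):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, epochs=1000):
    """Each epoch is a single update using the mean gradient over all m examples."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        error = theta0 + theta1 * x - y      # h(x) - y for every example, shape (m,)
        grad0 = error.mean()                 # dJ/dtheta0
        grad1 = (error * x).mean()           # dJ/dtheta1
        theta0 -= alpha * grad0              # simultaneous update of both parameters
        theta1 -= alpha * grad1
    return theta0, theta1
```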
Stochastic gradient descent (SGD)
If the training set is very large, using all the examples to compute $J$ at every step is inefficient. In stochastic gradient descent, each step uses the gradient of a single example. Here, one epoch means going through all the examples one after the other, updating the parameters after each one.
So one epoch consists of $m$ update steps. Since each step looks at only one example, the cost fluctuates: it does not necessarily decrease at every step, but in the long run it trends downward. It also never settles exactly at the minimum because of its fluctuating nature. When the dataset is large, SGD can converge faster because it updates the parameters more frequently.
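A corresponding sketch of SGD; the per-epoch shuffling and the hyperparameter values are assumptions added for illustration:

```python
import numpy as np

def stochastic_gradient_descent(x, y, alpha=0.01, epochs=10, seed=0):
    """Each epoch makes m updates, one per (shuffled) training example."""
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):            # visit examples in random order
            error = theta0 + theta1 * x[i] - y[i]    # gradient from a single example
            theta0 -= alpha * error
            theta1 -= alpha * error * x[i]
    return theta0, theta1
```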
Mini batch gradient descent
The problem with SGD is that we cannot use a vectorized implementation, which slows down the computations. To address this, a mix of batch gradient descent and SGD is used: the training set is divided into mini-batches of fixed size. Here, one epoch means stepping through all the mini-batches one after the other, updating the parameters with the mean gradient of each mini-batch. As with SGD, the cost fluctuates.
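A sketch of mini-batch gradient descent under the same assumptions (`batch_size=16` is an arbitrary illustrative choice):

```python
import numpy as np

def minibatch_gradient_descent(x, y, alpha=0.01, epochs=100, batch_size=16, seed=0):
    """Each update uses the mean gradient of one mini-batch (vectorized within the batch)."""
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    m = len(y)
    for _ in range(epochs):
        order = rng.permutation(m)                   # reshuffle each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = theta0 + theta1 * x[idx] - y[idx]
            theta0 -= alpha * error.mean()
            theta1 -= alpha * (error * x[idx]).mean()
    return theta0, theta1
```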
Multivariate Linear Regression
- Also known as multiple linear regression
- Multiple features
- Hypothesis: $h_{\theta}(x)=\theta_0+\theta_1 x_1+\theta_2 x_2+ \ldots + \theta_n x_n$
- Define $x_0=1$ for convenience.
- \begin{align*} x = \begin{bmatrix} x_0 \cr x_1 \cr \vdots \cr x_n \end{bmatrix} \in \mathbb{R}^{n+1} \end{align*}
- \begin{align*} \theta = \begin{bmatrix} \theta_0 \cr \theta_1 \cr \vdots \cr \theta_n \end{bmatrix} \in \mathbb{R}^{n+1} \end{align*}
- $h_{\theta}(x)=\theta^T x$
- Cost function $J(\theta_0,\theta_1,\ldots,\theta_n)=\frac{1}{2m} \sum_{i=1}^m \Big( h_{\theta}(x^{(i)})-y^{(i)} \Big)^2$
- Gradient descent: simultaneously update for $j=0,1,\ldots,n$ (see the vectorized sketch after this list) \begin{equation} \begin{split} \theta_j & := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \cr & = \theta_j - \alpha \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^m \Big( h_{\theta}(x^{(i)})-y^{(i)} \Big)^2 \cr & = \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \Big( h_{\theta}(x^{(i)})-y^{(i)} \Big) x_j^{(i)} \end{split} \label{eq:dj_dtheta_mse} \end{equation}
- Feature scaling: bring the features to comparable ranges (e.g., divide each by its range) so that gradient descent converges faster.
- Mean normalization: subtract each feature's mean so the features are roughly centred at zero, often combined with scaling as $x_j := \frac{x_j - \mu_j}{s_j}$.
- As a sanity check, predict whether each $\theta_j$ should be positive or negative and compare with the fitted sign.
- The $R^2$ value is lower for the test data than for the training data, because the model has not seen the test dataset during fitting.
- The test-set $R^2$ gives a rough feel for how the model would perform in the real world.
- For non-convex cost functions, try multiple starting values to find the global minimum (the easy way) or try different algorithms.
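A vectorized sketch of the gradient-descent update in equation \eqref{eq:dj_dtheta_mse}, assuming a NumPy design matrix `X` whose first column is the $x_0=1$ feature:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, epochs=1000):
    """theta := theta - (alpha / m) * X^T (X theta - y), updating all components at once."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(epochs):
        gradient = X.T @ (X @ theta - y) / m    # vector of dJ/dtheta_j, j = 0, ..., n
        theta -= alpha * gradient
    return theta

# Hypothetical usage: prepend the x_0 = 1 column to a raw feature matrix `features`.
# X = np.column_stack([np.ones(len(features)), features])
```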
Polynomial Regression
- House price prediction: $h_{\theta}(x)=\theta_0+\theta_1 \times \text{frontage}+\theta_2 \times \text{depth}$
- Since area $=$ frontage $\times$ depth, we can instead use $h_{\theta}(x)=\theta_0+ \theta_1 \times \text{area}$
- A model hypothesis: \begin{equation} \begin{split} h_{\theta}(x) &= \theta_0+\theta_1 (size) + \theta_2 (size)^2 + \theta_3 (size)^3 \cr &= \theta_0+\theta_1 (x) + \theta_2 (x)^2 + \theta_3 (x)^3 \cr &= \theta_0+\theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 \end{split} \end{equation}
- Another hypothesis: $h_{\theta}(x)=\theta_0 + \theta_1 (size) + \theta_2 \sqrt{(size)}$
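A sketch of building the polynomial features $x_1=size$, $x_2=size^2$, $x_3=size^3$ as columns of a design matrix (NumPy assumed; the helper name is hypothetical):

```python
import numpy as np

def polynomial_features(size, degree=3):
    """Columns [1, size, size^2, ..., size^degree]; x_0 = 1 is the first column.

    Feature scaling is especially important here, since size^2 and size^3
    live on very different scales from size itself.
    """
    return np.column_stack([size ** d for d in range(degree + 1)])

# The resulting matrix can be passed to the multivariate gradient-descent sketch
# above (or to the normal equation below), with x_1 = size, x_2 = size^2, x_3 = size^3.
```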
Normal equation - Multivariate Regression
- $m$ examples and $n$ features
- \begin{align*} x^{(i)} = \begin{bmatrix} x_0^{(i)} \cr x_1^{(i)} \cr \vdots \cr x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1} \end{align*}
- \begin{align*} X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \cr \cdots & (x^{(2)})^T & \cdots \cr & \vdots & \cr \cdots & (x^{(m)})^T & \cdots \end{bmatrix} \in \mathbb{R}^{m \times (n+1)} \end{align*}
- $X \theta = y$
- Normal equation: $\theta = (X^T X)^{-1} X^T y$ ($X$ is generally not square, so $X^{-1}$ does not exist); see the sketch at the end of this section.
- No need to choose $\alpha$; no need to iterate
- Need to compute $(X^T X)^{-1}$
- Slow if $n$ is large
- What if $(X^T X)$ is non-invertible ? (singular/degenerate)
- Redundant features (linearly dependent)
- Too many features (e.g., $m \leq n$): delete some features or use regularization
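A sketch of solving the normal equation with NumPy; `lstsq` also copes with the non-invertible $(X^T X)$ case mentioned above by returning a minimum-norm least-squares solution:

```python
import numpy as np

def normal_equation(X, y):
    """Least-squares solution of X theta ~= y, i.e. theta = (X^T X)^{-1} X^T y when invertible."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Equivalent, when X^T X is invertible:
# theta = np.linalg.solve(X.T @ X, X.T @ y)
```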