Linear Regression
Regression is fitting a line to our data, i.e., discovering the hidden relationship between the input and the output.
Univariate linear regression
- Linear regression with one independent variable (or feature).
- A training example: $(x^{(i)},y^{(i)})$, where $i=1,\ldots,m$ and $m$ is the number of training examples.
- $X$ - space of input variables; $Y$ - space of output variables
- $h: X \mapsto Y$, where $h$ is a hypothesis.
- Hypothesis: $h_{\theta}(x)=\theta_0 + \theta_1 x$; shorthand notation: $h(x)$. $\theta_0$ is the intercept, $\theta_1$ is the slope.
- $\theta_i$'s: parameters. How do we choose the $\theta_i$'s?
- Choose $\theta_0$, $\theta_1$ so that $h(x)$ is close to $y$ for the training set.
- Cost function: Squared error function \begin{equation} J(\theta_0, \theta_1)= \frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})^2 \label{eq:cost_fn_linear_uni} \end{equation} \begin{equation*} J(\theta_0, \theta_1)= \frac{1}{2m} \sum_{i=1}^m (\theta_0 + \theta_1 x^{(i)}-y^{(i)})^2 \end{equation*}
- Goal: $\underset{\theta_0, \theta_1}{\text{Minimize}}$ $J$; At $J$ minimum, $\frac{\partial J}{\partial \theta_0}=0$ and $\frac{\partial J}{\partial \theta_1}=0$.
- Sometimes the fitted intercept $\theta_0$ takes unrealistic values, e.g., when $x=0$ lies far outside the range of the data.
- A one-unit change in $x$ changes the predicted $y$ by $\theta_1$ units.
- Goodness of fit, $R^2$: the fraction of the variance in $y$ explained by the model, ranging from 0 to 1; if $R^2=0.55$, the model explains 55% of the variance.
- $R^2$: 0 (bad) $\to$ 1 (good); a sketch computing $J$ and $R^2$ follows this list.
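A minimal sketch of the cost function $J$ and of $R^2$, assuming NumPy and a hypothetical 1-D dataset `x`, `y` (the toy data below is purely illustrative):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) for univariate linear regression."""
    m = len(y)
    residuals = theta0 + theta1 * x - y
    return (residuals ** 2).sum() / (2 * m)

def r_squared(theta0, theta1, x, y):
    """Fraction of the variance in y explained by the fitted line."""
    predictions = theta0 + theta1 * x
    ss_res = ((y - predictions) ** 2).sum()   # unexplained variation
    ss_tot = ((y - y.mean()) ** 2).sum()      # total variation
    return 1 - ss_res / ss_tot

# Hypothetical toy data: y is roughly 2 + 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 1, size=50)
print(cost(2.0, 3.0, x, y), r_squared(2.0, 3.0, x, y))
```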
Derivation of coefficients
Expanding equation \eqref{eq:cost_fn_linear_uni},
\begin{equation}
J(\theta_0, \theta_1)= \frac{1}{2m} \sum_{i=1}^m \Big(\theta_0^2+\theta_1^2 (x^{(i)})^2 +(y^{(i)})^2 + 2 \theta_0 \theta_1 x^{(i)} - 2 \theta_1 x^{(i)} y^{(i)} - 2 \theta_0 y^{(i)}\Big)
\end{equation}
Calculating $\frac{\partial J}{\partial \theta_0}$,
\begin{equation}
\begin{split}
\frac{\partial J}{\partial \theta_0} & = \frac{1}{2m} \sum_{i=1}^m (2 \theta_0 + 2 \theta_1 x^{(i)}- 2 y^{(i)}) \cr
& = \theta_0 + \theta_1 \frac{ \sum_{i=1}^m x^{(i)}}{m} - \frac{\sum_{i=1}^m y^{(i)}}{m} \cr
& = \theta_0 + \theta_1 \bar{x} - \bar{y}
\end{split}
\label{eq:dj_dtheta_0}
\end{equation}
Setting $\frac{\partial J}{\partial \theta_0}=0$, \begin{equation} \theta_0 = \bar{y} - \theta_1 \bar{x} \label{eq:dtheta_0_eqn} \end{equation}
Calculating $\frac{\partial J}{\partial \theta_1}$,
\begin{equation}
\begin{split}
\frac{\partial J}{\partial \theta_1} & = \frac{1}{2m} \sum_{i=1}^m (2 \theta_1 (x^{(i)})^2 + 2 \theta_0 x^{(i)}- 2 x^{(i)} y^{(i)}) \cr
& = \theta_1 \frac{ \sum_{i=1}^m (x^{(i)})^2}{m} + \theta_0 \frac{ \sum_{i=1}^m x^{(i)}}{m} - \frac{\sum_{i=1}^m x^{(i)} y^{(i)}}{m} \cr
& = \theta_1 \frac{ \sum_{i=1}^m (x^{(i)})^2}{m} + \theta_0 \bar{x} - \frac{\sum_{i=1}^m x^{(i)} y^{(i)}}{m}
\end{split}
\end{equation}
Setting $\frac{\partial J}{\partial \theta_1}=0$,
\begin{equation}
\theta_1 \frac{ \sum_{i=1}^m (x^{(i)})^2}{m} + \theta_0 \bar{x} - \frac{\sum_{i=1}^m x^{(i)} y^{(i)}}{m}=0
\label{eq:dj_dtheta_1}
\end{equation}
Substituting equation \eqref{eq:dtheta_0_eqn} in equation \eqref{eq:dj_dtheta_1}, we get
\begin{equation}
\begin{split}
\theta_1 & = \frac{ \sum_{i=1}^m x^{(i)} y^{(i)} - m \bar{x}\bar{y}}{\sum_{i=1}^m (x^{(i)})^2 - m \bar{x}^2} \cr
& = \frac{Cov(x,y)}{Var(x)}
\end{split}
\end{equation}
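A minimal sketch of the closed-form coefficients just derived, under the same assumptions (NumPy, hypothetical arrays `x` and `y`):

```python
import numpy as np

def fit_univariate(x, y):
    """Closed-form fit: theta1 = Cov(x, y) / Var(x), theta0 = y_bar - theta1 * x_bar."""
    x_bar, y_bar = x.mean(), y.mean()
    theta1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

# With the hypothetical toy data above, fit_univariate(x, y) should return values
# close to (2, 3), matching the intercept and slope used to generate it.
```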
Gradient descent
- Repeat until convergence:
\begin{equation} \theta_j := \theta_j -\alpha \frac{\partial}{\partial \theta_j} J (\theta_0, \theta_1), \quad j=0,1 \text{ (update simultaneously)}, \end{equation} where $\alpha$ is the learning rate.
- If $\alpha$ is too small, gradient descent can be slow.
- If $\alpha$ is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
Batch gradient descent
All the training data are used to calculate $J$: the parameters are updated with the average of the gradients over all examples in the training set. One epoch corresponds to one gradient-descent step, and the cost decreases smoothly toward the minimum.
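A minimal sketch of batch gradient descent for the univariate case (the learning rate `alpha` and the number of epochs are illustrative assumptions, not prescribed by these notes):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, epochs=1000):
    """Each epoch is a single update using the mean gradient over all m examples."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        error = theta0 + theta1 * x - y      # h(x) - y for every example, shape (m,)
        grad0 = error.mean()                 # dJ/dtheta0
        grad1 = (error * x).mean()           # dJ/dtheta1
        theta0 -= alpha * grad0              # simultaneous update of both parameters
        theta1 -= alpha * grad1
    return theta0, theta1
```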
Stochastic gradient descent (SGD)
If the training set is very large, using all the examples to compute $J$ at every step is inefficient. In stochastic gradient descent, each step uses the gradient of a single example. Here, one epoch means going through all the examples one after the other, updating the parameters after each one.
So one epoch consists of $m$ update steps. Since each step looks at only one example, the cost fluctuates: it does not necessarily decrease at every step, but in the long run it trends downward. It also never settles exactly at the minimum because of its fluctuating nature. When the dataset is large, SGD can converge faster because it updates the parameters more frequently.
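A corresponding sketch of SGD; the per-epoch shuffling and the hyperparameter values are assumptions added for illustration:

```python
import numpy as np

def stochastic_gradient_descent(x, y, alpha=0.01, epochs=10, seed=0):
    """Each epoch makes m updates, one per (shuffled) training example."""
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):            # visit examples in random order
            error = theta0 + theta1 * x[i] - y[i]    # gradient from a single example
            theta0 -= alpha * error
            theta1 -= alpha * error * x[i]
    return theta0, theta1
```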
Mini batch gradient descent
The problem with SGD is that we cannot use a vectorized implementation, which slows down the computations. To address this, a mix of batch gradient descent and SGD is used: the training set is divided into mini-batches of fixed size. Here, one epoch means stepping through all the mini-batches one after the other, updating the parameters with the mean gradient of each mini-batch. As with SGD, the cost fluctuates.
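A sketch of mini-batch gradient descent under the same assumptions (`batch_size=16` is an arbitrary illustrative choice):

```python
import numpy as np

def minibatch_gradient_descent(x, y, alpha=0.01, epochs=100, batch_size=16, seed=0):
    """Each update uses the mean gradient of one mini-batch (vectorized within the batch)."""
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    m = len(y)
    for _ in range(epochs):
        order = rng.permutation(m)                   # reshuffle each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = theta0 + theta1 * x[idx] - y[idx]
            theta0 -= alpha * error.mean()
            theta1 -= alpha * (error * x[idx]).mean()
    return theta0, theta1
```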
Multivariate Linear Regression
- Also known as multiple linear regression
- Multiple features
- Hypothesis: $h_{\theta}(x)=\theta_0+\theta_1 x_1+\theta_2 x_2+ \ldots + \theta_n x_n$
- Define $x_0=1$ for convenience.
- \begin{align*} x = \begin{bmatrix} x_0 \cr x_1 \cr \vdots \cr x_n \end{bmatrix} \in \mathbb{R}^{n+1} \end{align*}
- \begin{align*} \theta = \begin{bmatrix} \theta_0 \cr \theta_1 \cr \vdots \cr \theta_n \end{bmatrix} \in \mathbb{R}^{n+1} \end{align*}
- $h_{\theta}(x)=\theta^T x$
- Cost function $J(\theta_0,\theta_1,\ldots,\theta_n)=\frac{1}{2m} \sum_{i=1}^m \Big( h_{\theta}(x^{(i)})-y^{(i)} \Big)^2$
- Gradient descent: simultaneously update for $j=0,1,\ldots,n$ (see the vectorized sketch after this list) \begin{equation} \begin{split} \theta_j & := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \cr & = \theta_j - \alpha \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^m \Big( h_{\theta}(x^{(i)})-y^{(i)} \Big)^2 \cr & = \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \Big( h_{\theta}(x^{(i)})-y^{(i)} \Big) x_j^{(i)} \end{split} \label{eq:dj_dtheta_mse} \end{equation}
- Feature scaling: bring the features to comparable ranges (e.g., divide each by its range) so that gradient descent converges faster.
- Mean normalization: subtract each feature's mean so the features are roughly centred at zero, often combined with scaling as $x_j := \frac{x_j - \mu_j}{s_j}$.
- As a sanity check, predict whether each $\theta_j$ should be positive or negative and compare with the fitted sign.
- The $R^2$ value is lower for the test data than for the training data, because the model has not seen the test dataset during fitting.
- The test-set $R^2$ gives a rough feel for how the model would perform in the real world.
- For non-convex cost functions, try multiple starting values to find the global minimum (the easy way) or try different algorithms.
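A vectorized sketch of the gradient-descent update in equation \eqref{eq:dj_dtheta_mse}, assuming a NumPy design matrix `X` whose first column is the $x_0=1$ feature:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, epochs=1000):
    """theta := theta - (alpha / m) * X^T (X theta - y), updating all components at once."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(epochs):
        gradient = X.T @ (X @ theta - y) / m    # vector of dJ/dtheta_j, j = 0, ..., n
        theta -= alpha * gradient
    return theta

# Hypothetical usage: prepend the x_0 = 1 column to a raw feature matrix `features`.
# X = np.column_stack([np.ones(len(features)), features])
```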
Polynomial Regression
- House price prediction: $h_{\theta}(x)=\theta_0+\theta_1 \times \text{frontage}+\theta_2 \times \text{depth}$
- Since area $=$ frontage $\times$ depth, we can instead use $h_{\theta}(x)=\theta_0+ \theta_1 \times \text{area}$
- A model hypothesis: \begin{equation} \begin{split} h_{\theta}(x) &= \theta_0+\theta_1 (size) + \theta_2 (size)^2 + \theta_3 (size)^3 \cr &= \theta_0+\theta_1 (x) + \theta_2 (x)^2 + \theta_3 (x)^3 \cr &= \theta_0+\theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 \end{split} \end{equation}
- Another hypothesis: $h_{\theta}(x)=\theta_0 + \theta_1 (size) + \theta_2 \sqrt{(size)}$
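A sketch of building the polynomial features $x_1=size$, $x_2=size^2$, $x_3=size^3$ as columns of a design matrix (NumPy assumed; the helper name is hypothetical):

```python
import numpy as np

def polynomial_features(size, degree=3):
    """Columns [1, size, size^2, ..., size^degree]; x_0 = 1 is the first column.

    Feature scaling is especially important here, since size^2 and size^3
    live on very different scales from size itself.
    """
    return np.column_stack([size ** d for d in range(degree + 1)])

# The resulting matrix can be passed to the multivariate gradient-descent sketch
# above (or to the normal equation below), with x_1 = size, x_2 = size^2, x_3 = size^3.
```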
Normal equation - Multivariate Regression
- $m$ examples and $n$ features
- \begin{align*} x^{(i)} = \begin{bmatrix} x_0^{(i)} \cr x_1^{(i)} \cr \vdots \cr x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1} \end{align*}
- \begin{align*} X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \cr \cdots & (x^{(2)})^T & \cdots \cr & \vdots & \cr \cdots & (x^{(m)})^T & \cdots \end{bmatrix} \in \mathbb{R}^{m \times (n+1)} \end{align*}
- $X \theta = y$
- Normal equation: $\theta = (X^T X)^{-1} X^T y$ ($X$ is generally not square, so $X^{-1}$ does not exist); see the sketch at the end of this section.
- No need to choose $\alpha$; no need to iterate
- Need to compute $(X^T X)^{-1}$
- Slow if $n$ is large
- What if $(X^T X)$ is non-invertible ? (singular/degenerate)
- Redundant features (linearly dependent)
- Too many features (e.g., $m \leq n$): delete some features or use regularization
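A sketch of solving the normal equation with NumPy; `lstsq` also copes with the non-invertible $(X^T X)$ case mentioned above by returning a minimum-norm least-squares solution:

```python
import numpy as np

def normal_equation(X, y):
    """Least-squares solution of X theta ~= y, i.e. theta = (X^T X)^{-1} X^T y when invertible."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Equivalent, when X^T X is invertible:
# theta = np.linalg.solve(X.T @ X, X.T @ y)
```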