Data Science Notes

Logistic Regression

Coursera Notes:

Classification

  • Email: spam / not spam
  • Online transactions: Fraudulent (yes/no)
  • Tumor: Malignant / Benign?
    • $y=0$: negative class (benign tumor)
    • $y=1$: positive class (malignant tumor)
  • Threshold classifier output $h_{\theta}(x)$ at 0.5:
    • if $h_{\theta}(x) \geq 0.5$, predict $y=1$
    • if $h_{\theta}(x) < 0.5$, predict $y=0$
  • In classification, $y$ is 0 or 1, but the linear regression hypothesis $h_{\theta}(x)$ can be $<0$ or $>1$
  • Logistic regression model: want $0 \leq h_{\theta}(x) \leq 1$
  • Sigmoid function $g(z)$: output values range from 0 to 1 \begin{equation} g(z)=\frac{1}{1+e^{-z}} \end{equation}
  • Set $z=\Theta^T x$
  • Hypothesis (sketched in code after this list): \begin{equation} h_{\theta} (x) = \frac{1}{1+e^{-\Theta^T x}} \end{equation}
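
A minimal Python sketch of the sigmoid and the hypothesis (not from the course; NumPy assumed, parameter values are made up):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}); output always lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x)."""
    return sigmoid(np.dot(theta, x))

theta = np.array([-1.0, 2.0])   # illustrative parameters
x = np.array([1.0, 0.8])        # x_0 = 1 is the intercept feature
print(hypothesis(theta, x))     # ~0.646, read as P(y=1 | x; theta)
```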

Interpretation of hypothesis output

  • $h_{\theta}(x)$ is the estimated probability that $y=1$ on input $x$
  • $h_{\theta}(x)=0.7$: tell the patient there is a 70% chance of the tumor being malignant.
  • $h_{\theta}(x)=P(y=1|x; \Theta)$, the probability that $y=1$, given $x$, parameterized by $\Theta$
  • $P(y=0|x; \Theta) + P(y=1|x; \Theta) = 1$, so
    $P(y=0|x; \Theta) = 1 - P(y=1|x; \Theta) = 1 - h_{\theta}(x)$
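  • Worked example (using the $h_{\theta}(x)=0.7$ case above): \begin{equation} P(y=0|x; \Theta) = 1 - P(y=1|x; \Theta) = 1 - 0.7 = 0.3 \end{equation}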

Decision boundary

  • The decision boundary follows from the threshold set at $h_{\theta}(x)=g(\Theta^T x)= 0.5$
  • $h_{\theta}(x)= 0.5$ when $z=\Theta^T x=0$
  • so the line $\Theta^T x=0$ in the input space becomes the decision boundary (prediction rule sketched in code after this list)
    • predict $y=1$ if $\Theta^T x \geq 0$
    • predict $y=0$ if $\Theta^T x < 0$
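
A sketch of that prediction rule in Python (NumPy assumed; the parameters and data are illustrative, not from the course):

```python
import numpy as np

def predict(theta, X):
    """Predict y=1 exactly when theta^T x >= 0, i.e. h_theta(x) >= 0.5."""
    return (X @ theta >= 0).astype(int)

theta = np.array([-3.0, 1.0, 1.0])   # illustrative parameters
X = np.array([[1.0, 1.0, 1.0],       # theta^T x = -1 -> predict 0
              [1.0, 2.0, 2.0]])      # theta^T x =  1 -> predict 1
print(predict(theta, X))             # [0 1]; decision boundary is x1 + x2 = 3
```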

Non-linear decision boundary

  • Threshold at $h_{\theta}(x)=g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)=0.5$, where $g$ is the sigmoid function
    • decision boundary $\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2=0$
  • Imagine a decision boundary that is the unit circle centered at the origin $(x_1=0,x_2=0)$; what would the hypothesis look like?
    • $h_{\theta}(x)=g(x_1^2+x_2^2-1)$
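
A sketch of that circular boundary as a prediction rule (plain Python; equivalent to $\theta = (-1, 0, 0, 1, 1)$ on the features $(1, x_1, x_2, x_1^2, x_2^2)$, illustrative only):

```python
def predict_circle(x1, x2):
    """y=1 outside or on the unit circle, y=0 strictly inside it."""
    z = x1**2 + x2**2 - 1.0   # theta^T x for the feature vector above
    return int(z >= 0)        # threshold g(z) at 0.5, i.e. z at 0

print(predict_circle(0.5, 0.5))   # 0: inside the unit circle
print(predict_circle(1.0, 1.0))   # 1: outside the unit circle
```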

Cost function

  • Can’t use the squared-error cost $J= \frac{1}{2m} \sum_{i=1}^m \Big( h_{\theta}(x^{(i)})-y^{(i)} \Big)^2$; plugging in the sigmoid hypothesis makes $J$ non-convex. \begin{equation} h_{\theta} (x)= \frac{1}{1+e^{-\Theta^T x}} \end{equation}
  • Consider \begin{equation} Cost(h_{\theta} (x),y) = \begin{cases} -\log(h_{\theta} (x)) \quad \text{if } y=1 \cr -\log(1-h_{\theta} (x)) \quad \text{if } y=0 \end{cases}
    \end{equation}
    • For $y=1$,
      • $h_{\theta}(x)=1$ then $Cost=0$,
      • $h_{\theta}(x) \to 0$ then $Cost \to \infty$
    • For $y=0$,
      • $h_{\theta}(x)=0$ then $Cost=0$,
      • $h_{\theta}(x) \to 1$ then $Cost \to \infty$
  • Cost function: \begin{equation} \begin{split} J(\theta) & = \frac{1}{m} \sum_{i=1}^m Cost(h_{\theta} (x^{(i)}),y^{(i)}) \cr & = -\frac{1}{m} \sum_{i=1}^m \Big[ y^{(i)}\log(h_{\theta} (x^{(i)})) +(1-y^{(i)}) \log(1-h_{\theta} (x^{(i)})) \Big] \end{split} \end{equation}
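
A vectorized sketch of this cost (NumPy assumed; the design matrix X has a leading column of ones and y is a 0/1 vector, both illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta) averaged over the m training examples."""
    m = len(y)
    h = sigmoid(X @ theta)
    eps = 1e-12   # guards against log(0); a numerical detail, not in the notes
    return -(1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
print(cost(np.zeros(2), X, y))   # log(2) ~ 0.693, since h = 0.5 everywhere at theta = 0
```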

Gradient descent

  • Repeat until convergence:
    \begin{equation} \theta_j := \theta_j -\alpha \frac{\partial}{\partial \theta_j} J (\theta), \quad \text{update all $\theta_j$ simultaneously}, \end{equation} where $\alpha$ is the learning rate
  • $\underset{\theta}{\text{Minimize}}$ $J(\theta)$
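
A minimal sketch of this update loop (not from the course); grad_fn stands in for $\frac{\partial J}{\partial \theta_j}$, which is derived in the next subsection:

```python
import numpy as np

def gradient_descent(theta, grad_fn, alpha=0.1, iters=1000):
    """Repeat theta_j := theta_j - alpha * dJ/dtheta_j until iters runs out."""
    for _ in range(iters):
        theta = theta - alpha * grad_fn(theta)   # all theta_j updated simultaneously
    return theta

# Smoke test on J(theta) = ||theta||^2, whose gradient is 2*theta:
print(gradient_descent(np.array([3.0, -2.0]), lambda t: 2 * t))   # -> ~[0, 0]
```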

Finding $\frac{\partial J}{\partial \theta_j} $

  • We have already set $z=\Theta^T x$
  • Set $h_{\theta}(x)=g(z(x))=a(z(x))$, where $g$ is the sigmoid function and $a$ denotes its output

\begin{equation} \frac{\partial}{\partial \theta_j} J (\theta)=\frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial \theta_j} Cost(h_{\theta} (x^{(i)}),y^{(i)}) \end{equation}

  • Chain rule: \begin{equation} \begin{split} \frac{\partial}{\partial \theta_j} Cost(h_{\theta} (x^{(i)}),y^{(i)}) & = \frac{\partial}{\partial \theta_j} Cost(a^{(i)},y^{(i)}) \cr & = \frac{\partial Cost(a^{(i)},y^{(i)})}{\partial a^{(i)}} \times \frac{\partial a^{(i)}}{\partial z^{(i)}} \times \frac{\partial z^{(i)}}{\partial \theta_j} \end{split} \end{equation}
  • $\frac{\partial Cost(a,y)}{\partial a}=?$ \begin{equation} \begin{split} \frac{\partial Cost(a,y)}{\partial a} & = \frac{\partial}{\partial a} \big(-y \log(a)-(1-y)\log(1-a)\big) \cr & = -\frac{y}{a}- \frac{1-y}{1-a}\times (-1) = -\frac{y}{a}+ \frac{1-y}{1-a} \cr & = \frac{a-y}{a(1-a)} \end{split} \end{equation}
  • $\frac{\partial a}{\partial z}=?$ \begin{equation} \begin{split} \frac{\partial a}{\partial z} & = \frac{\partial}{\partial z} \Big(\frac{1}{1+e^{-z}}\Big) \cr & = \frac{-1}{(1+e^{-z})^2} \times e^{-z} \times (-1) =\frac{e^{-z}}{(1+e^{-z})^2} \cr & =\frac{1}{1+e^{-z}}-\frac{1}{(1+e^{-z})^2} \cr & = a - a^2 = a (1-a) \end{split} \end{equation}

\begin{equation} \frac{\partial z^{(i)}}{\partial \theta_j}=x_j^{(i)} \end{equation}
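
A quick finite-difference check of the two chain-rule pieces above (NumPy assumed; the point z = 0.3 with label y = 1 is arbitrary):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
cost = lambda a, y: -y * np.log(a) - (1 - y) * np.log(1 - a)

z, y, h = 0.3, 1.0, 1e-6
a = sigmoid(z)

dcost_da = (cost(a + h, y) - cost(a - h, y)) / (2 * h)   # numerical dCost/da
da_dz = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)      # numerical da/dz

print(dcost_da, (a - y) / (a * (1 - a)))   # both ~ -1.741
print(da_dz, a * (1 - a))                  # both ~ 0.2445
```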

  • Substituting equations in \begin{equation} \begin{split} \frac{\partial J}{\partial \theta_j} & = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)}) \times x_j^{(i)} \cr & = \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)}) \times x_j^{(i)} \end{split} \label{eq:dj_dtheta_log} \end{equation}

  • The above equation has the same form as the linear regression $\frac{\partial J}{\partial \theta_j}$ expression, where mean squared error is the loss function; only the hypothesis $h_{\theta}(x)$ differs.
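
Putting the pieces together, a vectorized sketch of the gradient and a full training loop (NumPy assumed; X, y, the learning rate, and the iteration count are all illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """dJ/dtheta = (1/m) * X^T (h_theta(X) - y), the expression derived above."""
    m = len(y)
    return (1.0 / m) * X.T @ (sigmoid(X @ theta) - y)

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])   # leading column of ones
y = np.array([1.0, 0.0, 1.0])
theta = np.zeros(2)
for _ in range(2000):
    theta = theta - 0.1 * gradient(theta, X, y)       # gradient-descent update

print(np.round(sigmoid(X @ theta)))   # [1. 0. 1.], matching y on this toy set
```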