Logistic Regression

Logistic regression (LR) is widely used in classification tasks. The LR model is as follows:

${ y }^{ (i) }=g({ \theta}_{ 0 }+{ \theta}_{ 1 }{ x }_{ 1 }^{ (i) }+{ \theta}_{ 2 }{ x }_{ 2 }^{ (i) }+...+{ \theta}_{ m }{ x }_{ m }^{ (i) })+{ \varepsilon}^{ (i) }$

The observed class (y) for individual i equals the sum of the predicted class and the corresponding error (ε(i)). The predicted class is estimated by a function (g) that transforms the linear combination of the θ's and the predictors (x1 to xm). To simplify the linear combination, θ and x can be written as column vectors, with x0 = 1 so that θ0 acts as the intercept. The linear combination then becomes θTx, the dot product of θ and x:

${ y }^{ (i) }=g({ { \theta}^{ T } }x)+{ \varepsilon}^{ (i) }$

To isolate the function g, the error term can be moved to the left side of the equation. The left-hand side can then be treated as the predicted class (ŷ), which is written as the hypothesis hθ(x):

${ y }^{ (i) }-{ \varepsilon}^{ (i) }=g({ { \theta}^{ T } }x)$

${ \hat { y }}^{ (i) }={ h }_{ \theta}(x)=g({ { \theta}^{ T } }x)$

The function g in logistic regression is the logistic (or sigmoid) function. It has horizontal asymptotes at 0 and 1, so its output always lies between the two class labels of a binary classification problem, where y ∈ {0, 1}:

${ h }_{ \theta}(x)=g({ { \theta}^{ T } }x)=\frac { 1 }{ 1+{ e }^{ -{ \theta}^{ T}x } }$

where

$g(z)=\frac { 1 }{ 1+{ e }^{ -z } }$
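The logistic function and the hypothesis above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `sigmoid` and `hypothesis` and the example values of θ and x are assumptions made for the sketch, and x is assumed to carry a leading x0 = 1 for the intercept.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), split by sign to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, e^(-z) can overflow, so use the equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is expected to start with x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-1.0, 2.0]   # hypothetical parameters: theta_0, theta_1
x = [1.0, 0.5]        # x_0 = 1 (intercept), x_1 = 0.5

print(hypothesis(theta, x))   # theta^T x = 0, so this prints 0.5
print(sigmoid(-1000.0))       # very negative input approaches the asymptote at 0
print(sigmoid(1000.0))        # very positive input approaches the asymptote at 1
```

Splitting on the sign of z is one common way to keep the exponential from overflowing; the plain formula gives the same values for moderate z.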

The g(θTx) term is interpreted as the conditional probability (P) of the outcome variable equaling a "1" class given θ and x:

$P(y=1|\theta;x)=h_{ \theta}(x)$

and, as its complement, for class "0":

$P(y=0|\theta;x)=1-h_{ \theta}(x)$

Typically in binary classification, when P(y=1|θ;x) ≥ 0.5 the predicted classification (ŷ) is 1, and when P(y=1|θ;x) < 0.5 it is 0. P(y=1|θ;x) = 0.5 occurs exactly when θTx = 0, which defines the decision boundary. A threshold other than 0.5 can be used, depending on how many false positives and false negatives can be tolerated in the predictions.
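The classification rule can be illustrated with a short Python sketch. The helper names `predict_proba` and `predict_class`, the `threshold` argument, and the parameter values are all hypothetical, chosen so that the decision boundary is the line x1 + x2 = 3.

```python
import math

def predict_proba(theta, x):
    """P(y=1 | x; theta) = g(theta^T x), with x[0] = 1 as the intercept term."""
    z = sum(t * xi for t, xi in zip(theta, x))
    # The plain formula is fine for moderate z; a large negative z would overflow.
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(theta, x, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    return 1 if predict_proba(theta, x) >= threshold else 0

theta = [-3.0, 1.0, 1.0]        # hypothetical parameters; boundary is x1 + x2 = 3
on_boundary = [1.0, 1.0, 2.0]   # theta^T x = 0, so P(y=1) = 0.5
above = [1.0, 2.0, 2.0]         # theta^T x = 1, so P(y=1) is about 0.73

print(predict_class(theta, on_boundary))            # 1, since 0.5 >= 0.5
print(predict_class(theta, above))                  # 1
print(predict_class(theta, above, threshold=0.8))   # 0, the stricter threshold rejects it
```

Raising the threshold trades false positives for false negatives: the same point is labeled 1 at a threshold of 0.5 and 0 at a threshold of 0.8.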

Cost Function

A cost (or loss) function quantifies the amount of deviation between the predicted and observed values. A cost function is used to estimate the most suitable θ values in order to minimize the penalty from misclassification. The cost function for logistic regression is as follows:

$Cost({ h }_{ \theta}(x),y)=\begin{cases} -log({ h }_{ \theta}(x))\quad \quad \quad \quad \quad if\quad y=1 \\ -log(1-{ h }_{ \theta}(x))\quad \quad \quad if\quad y=0 \end{cases}$

It can be condensed into a single equation:

$Cost({ h }_{ \theta}(x),y)=-ylog({ h }_{ \theta}(x))-(1-y)log(1-{ h }_{ \theta}(x))$

The above cost function is for a single training data point. The cost function J(θ) for the entire training set of p training samples is:

$J(\theta)=\frac { 1 }{ p } \sum _{ i=1 }^{ p }{ Cost({ h }_{ \theta}({ x }^{ (i) }),{ y }^{ (i) }) } =-\frac { 1 }{ p } \sum _{ i=1 }^{ p }{ \left[ { y }^{ (i) }\log({ h }_{ \theta}({ x }^{ (i) }))+(1-{ y }^{ (i) })\log(1-{ h }_{ \theta}({ x }^{ (i) })) \right] }$
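As a rough illustration of how the cost behaves, the sketch below evaluates the single-point cost and J(θ) in plain Python. The names `cost_single` and `cost_total`, the clipping constant `EPS`, and the toy data are assumptions introduced for this example; the clipping is only there to keep log(0) from raising an error.

```python
import math

EPS = 1e-12  # hypothetical clipping constant so that log(0) is never evaluated

def sigmoid(z):
    """Logistic function, split by sign so math.exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost_single(h, y):
    """Cost for one point: -y*log(h) - (1-y)*log(1-h)."""
    h = min(max(h, EPS), 1.0 - EPS)
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

def cost_total(theta, X, Y):
    """J(theta): the single-point cost averaged over the p training points."""
    p = len(X)
    total = 0.0
    for x, y in zip(X, Y):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += cost_single(h, y)
    return total / p

# A confident correct prediction is cheap; a confident wrong one is expensive.
print(cost_single(0.9, 1))   # about 0.105
print(cost_single(0.1, 1))   # about 2.303

# Toy data (made up for illustration): each row starts with x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
Y = [0, 0, 1, 1]
print(cost_total([0.0, 0.0], X, Y))   # all h = 0.5, so J = log(2), about 0.693
```

With θ set to zero every prediction is 0.5, so each point contributes log(2); any useful θ should score below that baseline.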

The objective is to find the parameter values θ that minimize J(θ):

$\underset { \theta}{ \arg\min } \; J(\theta)$

This minimization can be carried out with gradient descent. The gradient descent update rule is as follows, where all θj are updated simultaneously:

Repeat {

${ \theta}_{ j }:={ \theta}_{ j }-\alpha \frac { \partial J(\theta) }{ \partial { \theta}_{ j } }$}

The learning rate parameter α scales the size of each step, while the partial derivative of J(θ) sets the direction of descent. The partial derivative term can be expressed as:

$\frac { \partial J(\theta) }{ \partial { \theta}_{ j } } =\frac { 1 }{ p } \sum _{ i=1 }^{ p }{ ({ h }_{ \theta}({ x }^{ (i) })-{ y }^{ (i) }){ x }_{ j }^{ (i) } }$

and with simple substitution, the gradient descent becomes:

Repeat {

${ \theta}_{ j }:={ \theta}_{ j }-\alpha \frac { 1 }{ p } \sum _{ i=1 }^{ p }{ ({ h }_{ \theta}({ x }^{ (i) })-{ y }^{ (i) }){ x }_{ j }^{ (i) } }$ }
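The update rule can be sketched as batch gradient descent in plain Python. This is a minimal illustration, not a tuned implementation: the function names, the toy data, the zero initialization, and the defaults `alpha=0.1` and `iterations=5000` are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic function, split by sign so math.exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), with x[0] = 1 as the intercept term."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, Y):
    """J(theta): mean cost over the p training points (no clipping, for brevity)."""
    total = 0.0
    for x, y in zip(X, Y):
        h = hypothesis(theta, x)
        total += -y * math.log(h) - (1 - y) * math.log(1.0 - h)
    return total / len(X)

def gradient(theta, X, Y):
    """Partial derivatives: (1/p) * sum_i (h(x_i) - y_i) * x_ij for each j."""
    p = len(X)
    grad = [0.0] * len(theta)
    for x, y in zip(X, Y):
        error = hypothesis(theta, x) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj
    return [g / p for g in grad]

def gradient_descent(X, Y, alpha=0.1, iterations=5000):
    """Repeat the update rule, starting from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        grad = gradient(theta, X, Y)   # computed from the current theta only
        # Building a new list updates every theta_j simultaneously.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy data (made up for illustration): each row starts with x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
Y = [0, 0, 1, 1]

theta = gradient_descent(X, Y)
print("J before:", cost([0.0, 0.0], X, Y), "J after:", cost(theta, X, Y))
print([1 if hypothesis(theta, x) >= 0.5 else 0 for x in X])
```

Computing the whole gradient before touching θ is what makes the update simultaneous; updating θ0 first and then reusing it to compute the step for θ1 would be a different algorithm.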

Regularization

To reduce the chance of overfitting, a regularization term can be added to the logistic regression cost function to shrink the magnitude of each θ:

$J(\theta)=-\frac { 1 }{ p } \sum _{ i=1 }^{ p }{ \left[ { y }^{ (i) }\log({ h }_{ \theta}({ x }^{ (i) }))+(1-{ y }^{ (i) })\log(1-{ h }_{ \theta}({ x }^{ (i) })) \right] } +\frac { \lambda}{ 2p } \sum _{ j=1 }^{ m }{ { \theta}_{ j }^{ 2 } }$

where p is the total number of training samples, m is the number of predictors (so θ1 to θm are penalized), and 𝜆 is the regularization parameter. 𝜆 controls the trade-off between 1) the model's ability to fit the data well and 2) keeping the magnitudes of the θ parameters small to avoid overfitting. The larger 𝜆 is, the more the θ's are shrunk. Shrinking them too much can in turn lead to underfitting, leaving the model unable to fit the data well.
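To make the role of 𝜆 concrete, the sketch below evaluates the regularized cost for a fixed θ at several values of 𝜆. The function name `regularized_cost`, the toy data, and the chosen θ are assumptions for illustration; the penalty skips θ0, matching the sum from j = 1 to m.

```python
import math

def sigmoid(z):
    """Logistic function, split by sign so math.exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def regularized_cost(theta, X, Y, lam):
    """J(theta) = mean cost over the data + (lam / (2p)) * sum of theta_j^2 for j >= 1."""
    p = len(X)
    data_term = 0.0
    for x, y in zip(X, Y):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        data_term += -y * math.log(h) - (1 - y) * math.log(1.0 - h)
    data_term /= p
    # theta[0] (the intercept) is excluded from the penalty.
    penalty = lam / (2 * p) * sum(t * t for t in theta[1:])
    return data_term + penalty

# Toy data (made up for illustration): each row starts with x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
Y = [0, 0, 1, 1]
theta = [-4.0, 2.0]   # hypothetical parameters that fit the toy data well

for lam in (0.0, 1.0, 10.0):
    print(lam, regularized_cost(theta, X, Y, lam))
```

Here p = 4 and θ1 = 2, so the penalty adds 0.5 to the cost at 𝜆 = 1 and 5.0 at 𝜆 = 10. The same θ looks increasingly expensive as 𝜆 grows, which is what pushes the minimizer toward smaller parameters.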

Finally, the gradient descent rules with the regularization term incorporated are as follows. Regularization applies to every θ except θ0:

Repeat {

${ \theta}_{ 0 }:={ \theta}_{ 0 }-\alpha \frac { 1 }{ p } \sum _{ i=1 }^{ p }{ ({ h }_{ \theta}({ x }^{ (i) })-{ y }^{ (i) }){ x }_{ 0 }^{ (i) } }$

${ \theta}_{ j }:={ \theta}_{ j }-\alpha \left[ \frac { 1 }{ p } \sum _{ i=1 }^{ p }{ ({ h }_{ \theta}({ x }^{ (i) })-{ y }^{ (i) }){ x }_{ j }^{ (i) } } +\frac { \lambda}{ p } { \theta}_{ j } \right]$ }

where j = 1, 2,…, m corresponding to θ1, θ2,…, θm.
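A minimal Python sketch of regularized gradient descent follows. It takes the penalty's contribution to the gradient as (𝜆/p)·θj, which is what differentiating the 𝜆/(2p) term of J(θ) gives, and scales it by α along with the rest of the gradient. The function names, toy data, and default settings are assumptions made for this example.

```python
import math

def sigmoid(z):
    """Logistic function, split by sign so math.exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def regularized_step(theta, X, Y, alpha, lam):
    """One simultaneous update of all theta_j; theta_0 is not penalized."""
    p = len(X)
    grad = [0.0] * len(theta)
    for x, y in zip(X, Y):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for j, xj in enumerate(x):
            grad[j] += (h - y) * xj
    new_theta = []
    for j, (t, g) in enumerate(zip(theta, grad)):
        # Gradient of the penalty term: (lam / p) * theta_j, skipped for j = 0.
        penalty = 0.0 if j == 0 else (lam / p) * t
        new_theta.append(t - alpha * (g / p + penalty))
    return new_theta

def regularized_gradient_descent(X, Y, alpha=0.1, lam=1.0, iterations=5000):
    """Repeat the regularized update, starting from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = regularized_step(theta, X, Y, alpha, lam)
    return theta

# Toy data (made up for illustration): each row starts with x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
Y = [0, 0, 1, 1]

plain = regularized_gradient_descent(X, Y, lam=0.0)   # no regularization
shrunk = regularized_gradient_descent(X, Y, lam=1.0)  # with regularization
print("lambda = 0:", plain)
print("lambda = 1:", shrunk)
```

On this separable toy set the unregularized θ1 keeps growing with more iterations, while the regularized run settles at a smaller value, which illustrates the shrinking effect of 𝜆.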
