Linear Regression

Ecole Nationale Supérieure de Cognitique

Baptiste Pesquet


Summary

  • Introduction
  • Analytical Approach: Normal Equation
  • Iterative Approach: Gradient Descent
  • Polynomial Regression
  • Regularization

Introduction

Search for a linear relationship between inputs and the output we try to predict.

Linear Regression example

Problem formulation

  • Features $ x_i $: properties of a data sample.
  • Parameters $ \theta_i $: coefficients of the linear model.
  • Hypothesis $ \mathcal{h}_\theta $: model output, function of input.

$x^{(i)} = \left\{ x^{(i)}_0, x^{(i)}_1, \dotsc, x^{(i)}_n \right\} \in \mathbf{R}^{n+1} $, with $ x^{(i)}_0 = 1$ (bias term)

$$\theta = \left\{ \theta_0, \theta_1, \dotsc, \theta_n \right\} \in \mathbf{R}^{n+1}$$

$$y'^{(i)} = \mathcal{h}_\theta(x^{(i)}) = \theta_0 + \theta_1x^{(i)}_1 + \dotsc + \theta_nx^{(i)}_n$$
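As a minimal sketch (not taken from the original slides), the hypothesis for one sample can be computed as a dot product in NumPy, assuming the bias term $x^{(i)}_0 = 1$ is included in the feature vector; all values are illustrative:

```python
import numpy as np

# Illustrative parameters: theta_0 (bias), theta_1, theta_2
theta = np.array([1.0, 2.0, -0.5])
# One sample with x_0 = 1 (bias term), x_1 = 3, x_2 = 4
x = np.array([1.0, 3.0, 4.0])

# h_theta(x) = theta^T . x
y_pred = theta @ x
print(y_pred)  # 1 + 2*3 - 0.5*4 = 5.0
```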

Analytical approach

  • Technique for computing the regression coefficients $\theta_i$ analytically (by calculus).
  • One-step learning algorithm (no iterations).
  • Also called Ordinary Least Squares.

$$x^{(i)} = \begin{pmatrix} \ x^{(i)}_0 \\ \ x^{(i)}_1 \\ \ \vdots \\ \ x^{(i)}_n \end{pmatrix}\;\;\; \theta = \begin{pmatrix} \ \theta_0 \\ \ \theta_1 \\ \ \vdots \\ \ \theta_n \end{pmatrix}\;\;\; \mathcal{h}_\theta(x^{(i)}) = \theta^T\cdot x^{(i)}$$

$$X = \begin{pmatrix} \ x^{(1)T} \\ \ x^{(2)T} \\ \ \vdots \\ \ x^{(m)T} \\ \end{pmatrix} = \begin{pmatrix} \ x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_n \\ \ x^{(2)}_0 & x^{(2)}_1 & \cdots & x^{(2)}_n \\ \ \vdots & \vdots & \ddots & \vdots \\ \ x^{(m)}_0 & x^{(m)}_1 & \cdots & x^{(m)}_n \end{pmatrix}\;\;\; y = \begin{pmatrix} \ y^{(1)} \\ \ y^{(2)} \\ \ \vdots \\ \ y^{(m)} \end{pmatrix}$$

Normal equation

Loss function: MSE

$\mathcal{L}(\theta) = \frac{1}{m}\sum_{i=1}^m (y'^{(i)} - y^{(i)})^2 = \frac{1}{m}\sum_{i=1}^m (\theta^T\cdot x^{(i)} - y^{(i)})^2$

Find $\hat{\theta}$ such that $\nabla_{\theta}\mathcal{L}(\theta) = 0$ in order to minimize the loss

Solution: $\hat{\theta} = (X^T \cdot X)^{-1} \cdot X^T \cdot y$
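A possible NumPy sketch of the normal equation on synthetic data (variable names and values are ours, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X_raw = rng.normal(size=(m, n))                      # m samples, n features
y = 4 + X_raw @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=m)

# Design matrix: prepend the bias column x_0 = 1
X = np.hstack([np.ones((m, 1)), X_raw])

# theta_hat = (X^T . X)^-1 . X^T . y
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta_hat)  # close to [4, 2, -1, 0.5]
```

In practice, `np.linalg.pinv(X) @ y` or `np.linalg.lstsq` is numerically safer than an explicit matrix inversion.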

(Math proof)

Pros/cons of analytical approach

Pros:

  • Computed in one step.
  • Exact solution.

Cons:

  • Computation of $(X^T \cdot X)^{-1}$ is slow when the number of features is large ($n > 10^4$).
  • Problem if $X^T \cdot X$ is not invertible.

Iterative approach: gradient descent

  • Same objective: find $\hat{\theta}$ that minimizes loss.
  • General idea:
    • Start with random values for the $\theta$ vector.
    • Update $\theta$ in small steps towards the loss minimum.
  • To know in which direction to update $\theta$, we compute the gradient of the loss function w.r.t. $\theta$ and go in the opposite direction.

Gradient descent line graph

Computation of gradients

$\mathcal{L}(\theta) = \frac{1}{m}\sum_{i=1}^m (\theta^T\cdot x^{(i)} - y^{(i)})^2$

$\frac{\partial}{\partial \theta_j} \mathcal{L}(\theta) = \frac{2}{m}\sum_{i=1}^m (\theta^T\cdot x^{(i)} - y^{(i)})x^{(i)}_j$

$$\nabla_{\theta}\mathcal{L}(\theta) = \begin{pmatrix} \ \frac{\partial}{\partial \theta_0} \mathcal{L}(\theta) \\ \ \frac{\partial}{\partial \theta_1} \mathcal{L}(\theta) \\ \ \vdots \\ \ \frac{\partial}{\partial \theta_n} \mathcal{L}(\theta) \end{pmatrix} = \frac{2}{m}X^T\cdot (X\cdot \theta - y)$$

Parameters update

The learning rate $ \eta $ governs the amplitude of updates.

$$\theta_{next} = \theta - \eta\nabla_{\theta}\mathcal{L}(\theta)$$
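A hedged sketch of batch gradient descent implementing these two formulas in NumPy, reusing the $X$ (with bias column) and $y$ conventions from the normal-equation sketch above; the function name, learning rate and number of epochs are arbitrary illustrative choices:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_epochs=1000):
    m, n_params = X.shape                      # X includes the bias column
    theta = np.random.randn(n_params)          # random initialization
    for _ in range(n_epochs):
        # Gradient of the MSE loss: (2/m) X^T . (X . theta - y)
        gradients = (2 / m) * X.T @ (X @ theta - y)
        # Update step: move against the gradient, scaled by the learning rate
        theta = theta - eta * gradients
    return theta

# Usage with the X and y built in the earlier sketch:
# theta_hat = batch_gradient_descent(X, y)
```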

Importance of learning rate

Types of gradient descent

  • Batch: use the whole dataset to compute gradients.
    • Safe but potentially slow.
  • Stochastic: use only one sample (see the sketch after this list).
    • Fast but potentially erratic.
  • Mini-Batch: use a small set of samples (10-1000).
    • Good compromise.
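For the stochastic variant, one option in practice is scikit-learn's `SGDRegressor`; the hyperparameter values below are illustrative only, and `X_raw`, `y` come from the earlier normal-equation sketch:

```python
from sklearn.linear_model import SGDRegressor

# Stochastic gradient descent: parameters are updated one sample at a time
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.01)
sgd_reg.fit(X_raw, y)
print(sgd_reg.intercept_, sgd_reg.coef_)
```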

Pros/cons of iterative approach

Pros:

  • Works well with a large number of features.
  • The MSE loss is convex, which guarantees convergence to the global minimum.

Cons:

  • Convergence depends on learning rate and GD type.
  • Sensitive to feature scaling.

Polynomial Regression

  • How can a linear model fit non-linear data?
  • Solution: add powers of each feature as new features (see the sketch after this list).
  • The hypothesis function is still linear.
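A possible scikit-learn sketch: `PolynomialFeatures` adds powers of each feature, then an ordinary linear model is fitted on the augmented features. The data and the degree are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y_quad = 0.5 * x[:, 0] ** 2 + x[:, 0] + 2 + rng.normal(scale=0.5, size=100)

poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)        # columns: x, x^2

lin_reg = LinearRegression()          # the hypothesis is still linear in theta
lin_reg.fit(x_poly, y_quad)
print(lin_reg.intercept_, lin_reg.coef_)  # close to 2, [1, 0.5]
```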

Polynomial degree examples

Regularization

  • Solution against overfitting: constraining model parameters (Ridge and Lasso are sketched in code after this list).

  • Ridge regression (using L2 norm): $\mathcal{L}(\theta) = MSE(\theta) + \frac{\lambda}{2}\sum_{i=1}^n \theta_i^2$

  • Lasso regression (using L1 norm): $\mathcal{L}(\theta) = MSE(\theta) + \lambda\sum_{i=1}^n |\theta_i|$

  • $\lambda$ is called the regularization rate.
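An illustrative scikit-learn sketch of both penalties, reusing `X_raw` and `y` from the earlier sketch; note that scikit-learn names the regularization rate `alpha` rather than $\lambda$, and the values below are arbitrary:

```python
from sklearn.linear_model import Lasso, Ridge

ridge_reg = Ridge(alpha=1.0)   # L2 penalty on the weights
ridge_reg.fit(X_raw, y)

lasso_reg = Lasso(alpha=0.1)   # L1 penalty, tends to drive some weights to 0
lasso_reg.fit(X_raw, y)

print(ridge_reg.coef_, lasso_reg.coef_)
```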

Effects of regularization rate

(The regularization rate is called $\alpha$ here.) Left: linear model; right: 10th-degree polynomial model.

Ridge regression example