# Linear Regression

Ecole Nationale Supérieure de Cognitique

Baptiste Pesquet

## Summary

• Introduction
• Analytical Approach: Normal Equation
• Iterative Approach: Gradient Descent
• Polynomial Regression
• Regularization

## Introduction

Linear regression searches for a linear relationship between the inputs (features) and the output we try to predict.

## Problem formulation

• Features $x_i$: properties of a data sample.
• Parameters $\theta_i$: coefficients of the linear model.
• Hypothesis $\mathcal{h}_\theta$: model output, function of input.

$x^{(i)} = \left\{ x^{(i)}_0, x^{(i)}_1, \dotsc, x^{(i)}_n \right\} \in \mathbf{R}^{n+1}$, with $x^{(i)}_0 = 1$ (bias term)

$$\theta = \left\{ \theta_0, \theta_1, \dotsc, \theta_n \right\} \in \mathbf{R}^{n+1}$$

$$y'^{(i)} = \mathcal{h}_\theta(x^{(i)}) = \theta_0 + \theta_1x^{(i)}_1 + \dotsc + \theta_nx^{(i)}_n$$
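As a minimal sketch (not from the original slides), the hypothesis above reduces to a single dot product once the bias term $x_0 = 1$ is prepended to the sample; the values below are arbitrary examples:

```python
import numpy as np

# Hypothesis of a linear model: y' = theta_0 + theta_1 * x_1 + ... + theta_n * x_n.
# Prepending x_0 = 1 to the sample turns the sum into a single dot product theta . x.
theta = np.array([2.0, 3.0, -1.0])  # theta_0 (bias), theta_1, theta_2
x = np.array([1.0, 4.0, 5.0])       # x_0 = 1, x_1 = 4, x_2 = 5

y_pred = theta @ x  # 2 + 3*4 - 1*5
print(y_pred)       # 9.0
```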

## Analytical approach

• Technique for computing the regression coefficients $\theta_i$ analytically (by calculus).
• One-step learning algorithm (no iterations).
• Also called Ordinary Least Squares.

$$x^{(i)} = \begin{pmatrix} \ x^{(i)}_0 \\ \ x^{(i)}_1 \\ \ \vdots \\ \ x^{(i)}_n \end{pmatrix}\;\;\; \theta = \begin{pmatrix} \ \theta_0 \\ \ \theta_1 \\ \ \vdots \\ \ \theta_n \end{pmatrix}\;\;\; \mathcal{h}_\theta(x^{(i)}) = \theta^T\cdot x^{(i)}$$

$$X = \begin{pmatrix} \ x^{(1)T} \\ \ x^{(2)T} \\ \ \vdots \\ \ x^{(m)T} \\ \end{pmatrix} = \begin{pmatrix} \ x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_n \\ \ x^{(2)}_0 & x^{(2)}_1 & \cdots & x^{(2)}_n \\ \ \vdots & \vdots & \ddots & \vdots \\ \ x^{(m)}_0 & x^{(m)}_1 & \cdots & x^{(m)}_n \end{pmatrix}\;\;\; y = \begin{pmatrix} \ y^{(1)} \\ \ y^{(2)} \\ \ \vdots \\ \ y^{(m)} \end{pmatrix}$$

## Normal equation

Loss function: MSE

$\mathcal{L}(\theta) = \frac{1}{m}\sum_{i=1}^m (y'^{(i)} - y^{(i)})^2 = \frac{1}{m}\sum_{i=1}^m (\theta^T\cdot x^{(i)} - y^{(i)})^2$

Find $\hat{\theta}$ such that $\nabla_{\theta}\mathcal{L}(\theta) = 0$ in order to minimize the loss

Solution: $\hat{\theta} = (X^T \cdot X)^{-1} \cdot X^T \cdot y$
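The normal equation can be sketched in a few lines of NumPy (the synthetic data below, generated from $y = 4 + 3x$ plus noise, is an illustrative assumption, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated by y = 4 + 3*x plus Gaussian noise.
m = 100
x = rng.uniform(0, 2, size=m)
y = 4 + 3 * x + rng.normal(scale=0.1, size=m)

# Design matrix X: one row per sample, first column set to 1 for the bias term.
X = np.column_stack([np.ones(m), x])

# Normal equation: theta_hat = (X^T X)^-1 X^T y.
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta_hat)  # close to [4, 3]
```

In practice `np.linalg.lstsq` or a QR/SVD-based solver is preferred over explicitly inverting $X^T X$, which can be numerically unstable.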


## Pros/cons of analytical approach

Pros:

• Computed in one step.
• Exact solution.

Cons:

• Computation of $(X^T \cdot X)^{-1}$ is slow when the number of features is large ($n > 10^4$).
• Problem if $X^T \cdot X$ is not invertible.

## Iterative approach: gradient descent

• Same objective: find $\hat{\theta}$ that minimizes the loss.
• General idea:
  • Start with random values for the $\theta$ vector.
  • Update $\theta$ in small steps towards the loss minimum.
  • To know in which direction to update $\theta$, we compute the gradient of the loss function w.r.t. $\theta$ and move in the opposite direction.

$\mathcal{L}(\theta) = \frac{1}{m}\sum_{i=1}^m (\theta^T\cdot x^{(i)} - y^{(i)})^2$

$\frac{\partial}{\partial \theta_j} \mathcal{L}(\theta) = \frac{2}{m}\sum_{i=1}^m (\theta^T\cdot x^{(i)} - y^{(i)})x^{(i)}_j$

$$\nabla_{\theta}\mathcal{L}(\theta) = \begin{pmatrix} \ \frac{\partial}{\partial \theta_0} \mathcal{L}(\theta) \\ \ \frac{\partial}{\partial \theta_1} \mathcal{L}(\theta) \\ \ \vdots \\ \ \frac{\partial}{\partial \theta_n} \mathcal{L}(\theta) \end{pmatrix} = \frac{2}{m}X^T\cdot (X\cdot \theta - y)$$
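The vectorized gradient $\frac{2}{m}X^T(X\theta - y)$ can be sanity-checked against a finite difference; this sketch (with arbitrary random data, an assumption for illustration) does that on the first coordinate:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])  # bias column + n features
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

def loss(t):
    # MSE loss: (1/m) * sum((X.theta - y)^2)
    return np.mean((X @ t - y) ** 2)

# Analytic gradient: (2/m) * X^T (X.theta - y)
grad = (2 / m) * X.T @ (X @ theta - y)

# Central finite difference on the first coordinate, as a sanity check.
eps = 1e-6
e0 = np.zeros(n + 1)
e0[0] = eps
numeric = (loss(theta + e0) - loss(theta - e0)) / (2 * eps)
print(grad[0], numeric)  # should agree to several decimal places
```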

## Parameters update

The learning rate $\eta$ governs the amplitude of updates.

$$\theta_{next} = \theta - \eta\nabla_{\theta}\mathcal{L}(\theta)$$

• Batch: use the whole dataset to compute gradients.
  • Safe but potentially slow.
• Stochastic: use only one sample.
  • Fast but potentially erratic.
• Mini-batch: use a small set of samples (10-1000).
  • Good compromise.
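Putting the update rule and the gradient together, a batch gradient descent loop can be sketched as follows (the synthetic $y = 4 + 3x$ data and the hyperparameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated by y = 4 + 3*x plus Gaussian noise.
m = 100
x = rng.uniform(0, 2, size=m)
y = 4 + 3 * x + rng.normal(scale=0.1, size=m)
X = np.column_stack([np.ones(m), x])  # bias column + feature

eta = 0.1                    # learning rate
theta = rng.normal(size=2)   # start with random values

# Batch gradient descent: each step uses the whole dataset.
for _ in range(1000):
    grad = (2 / m) * X.T @ (X @ theta - y)  # gradient of the MSE loss
    theta = theta - eta * grad              # step in the opposite direction

print(theta)  # converges towards the normal-equation solution, close to [4, 3]
```

Stochastic and mini-batch variants differ only in which rows of `X` and `y` are used at each step.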

## Pros/cons of iterative approach

Pros:

• Works well with a large number of features.
• MSE loss convex => guarantee of a global minimum.

Cons:

• Convergence depends on learning rate and GD type.
• Sensitive to feature scaling.

## Polynomial Regression

• How can a linear model fit non-linear data?
• Solution: add powers of each feature as new features.
• The hypothesis function is still linear in the parameters $\theta$.
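A minimal sketch of this idea, assuming quadratic synthetic data ($y = 0.5x^2 + x + 2$ plus noise, an illustrative choice): adding $x^2$ as a new column lets the normal equation fit the curve while the model stays linear in $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-linear (quadratic) data: y = 0.5*x^2 + x + 2 plus Gaussian noise.
m = 200
x = rng.uniform(-3, 3, size=m)
y = 0.5 * x**2 + x + 2 + rng.normal(scale=0.1, size=m)

# Add x^2 as a new feature; the hypothesis stays linear in its parameters.
X = np.column_stack([np.ones(m), x, x**2])
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta_hat)  # close to [2, 1, 0.5]
```

Libraries such as scikit-learn automate this feature expansion (e.g. `PolynomialFeatures`).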

## Regularization

• Solution against overfitting: constraining model parameters.

• Ridge regression (using L2 norm): $\mathcal{L}(\theta) = MSE(\theta) + \frac{\lambda}{2}\sum_{i=1}^n \theta_i^2$

• Lasso regression (using L1 norm): $\mathcal{L}(\theta) = MSE(\theta) + \lambda\sum_{i=1}^n |\theta_i|$

• $\lambda$ is called the regularization rate.
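Under the conventions above (MSE with a $\frac{1}{m}$ factor, penalty $\frac{\lambda}{2}\sum_{i=1}^n \theta_i^2$ excluding the bias), ridge regression also has a closed form, $\hat{\theta} = (X^T X + \frac{m\lambda}{2} A)^{-1} X^T y$ with $A$ the identity matrix whose top-left entry is zeroed. A sketch on the same kind of synthetic $y = 4 + 3x$ data (an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 2, size=m)
y = 4 + 3 * x + rng.normal(scale=0.1, size=m)
X = np.column_stack([np.ones(m), x])

# A zeroes out the penalty on the bias theta_0, which is left unregularized.
A = np.eye(2)
A[0, 0] = 0.0

def ridge(lam):
    # Closed form minimizing MSE(theta) + (lam/2) * sum_{i>=1} theta_i^2:
    # theta = (X^T X + (m * lam / 2) * A)^-1 X^T y
    return np.linalg.inv(X.T @ X + (m * lam / 2) * A) @ X.T @ y

print(ridge(0.0))   # plain least squares, close to [4, 3]
print(ridge(10.0))  # the slope is shrunk towards 0
```

Lasso has no such closed form (the L1 penalty is not differentiable at 0) and is solved iteratively, e.g. by coordinate descent.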

## Effects of regularization rate

(Figure: effect of the regularization rate, called $\alpha$ here. Left: linear model. Right: 10th-degree polynomial model.)