École Nationale Supérieure de Cognitique

- K-NN in a nutshell
- K-NN for classification
- K-NN for regression

- Instance-based (no model construction).
- No training phase: computations happen at prediction time.
- K-NN `$ \neq $` K-Means: do not confuse the two algorithms.
- Very simple yet often effective.
- Can be used for classification or regression.

A new point is assigned to the class which has the most representatives within the `k` nearest neighbors of the point.

The `k` parameter:

- Positive integer, chosen by the user. Typically small.
- A larger `k` suppresses the effects of noise, but makes the classification boundaries less distinct. `$k=5$` is a frequent choice.
- `$k=1$` => the *nearest neighbor* algorithm.

- Common choice: Euclidean distance.
- Alternatives: Hamming distance, Manhattan distance, or a custom distance metric (problem-dependent).
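As an illustration, these common metrics can be sketched in a few lines of Python (the function names below are ours, not from a particular library):

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences ("city block" distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Number of positions at which the two sequences differ.
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))       # 5.0
print(manhattan((0, 0), (3, 4)))       # 7
print(hamming("karolin", "kathrin"))   # 3
```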

- Brute force: compute the distance between the new point and every training sample.
- Efficient only when there are few samples.

- Other methods try to reduce the required number of distance calculations.
- Basic idea: if `A` is very distant from `B` and `B` is very close to `C`, we can infer that `A` and `C` are very distant.
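This inference comes from the triangle inequality: distances already computed give a lower bound on distances not yet computed, so some candidates can be discarded without any new calculation. A minimal sketch of the bound (the points and names are illustrative):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# A is far from B, and B is close to C.
A, B, C = (0.0, 0.0), (10.0, 0.0), (10.0, 1.0)

d_ab = euclidean(A, B)  # 10.0
d_bc = euclidean(B, C)  # 1.0

# Triangle inequality: d(A, C) >= d(A, B) - d(B, C).
# We get a lower bound on d(A, C) without computing it.
lower_bound = d_ab - d_bc  # 9.0
assert euclidean(A, C) >= lower_bound
```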

- Basic form: uniform weights (simple majority vote). Suboptimal when the dataset is skewed (imbalanced).
- Distance weights: closer neighbors weigh more than farther ones.
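A possible sketch of distance-weighted voting, assuming each neighbor votes with weight `1/distance` (one common choice among several; the helper name is ours):

```python
from collections import defaultdict

def weighted_vote(neighbors):
    # neighbors: list of (distance, label) pairs for the k nearest samples.
    # Each neighbor votes with weight 1/distance: closer neighbors weigh more.
    scores = defaultdict(float)
    for distance, label in neighbors:
        scores[label] += 1.0 / (distance + 1e-9)  # epsilon avoids division by zero
    return max(scores, key=scores.get)

# Two close "b" neighbors outvote three distant "a" neighbors,
# whereas a uniform majority vote would have picked "a".
neighbors = [(0.1, "b"), (0.2, "b"), (1.0, "a"), (1.1, "a"), (1.2, "a")]
print(weighted_vote(neighbors))  # "b"
```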

```
Set k
For each training sample
    Compute the distance between the sample and the new point
Keep the k samples with the smallest distances
Find the most frequent class among the k samples
Assign this class to the new point
```
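The pseudocode above can be turned into a short Python sketch (brute-force search, uniform majority vote; the function names and toy dataset are ours):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(samples, labels, new_point, k=5):
    # Compute the distance between each training sample and the new point.
    distances = [(euclidean(s, new_point), label)
                 for s, label in zip(samples, labels)]
    # Keep the k samples with the smallest distances.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Find the most frequent class among the k samples.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy dataset: two well-separated classes.
samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(samples, labels, (0.5, 0.5), k=3))  # "a"
```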

The target is predicted by local interpolation of the targets associated with the `k` nearest neighbors.
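In its simplest form (uniform weights), this interpolation is just the mean of the k neighbors' targets. A minimal sketch under that assumption (names and data are ours):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_regress(samples, targets, new_point, k=3):
    # Sort training samples by distance to the new point.
    pairs = sorted(zip(samples, targets),
                   key=lambda st: euclidean(st[0], new_point))
    # Predict the mean of the targets of the k nearest neighbors.
    nearest_targets = [t for _, t in pairs[:k]]
    return sum(nearest_targets) / len(nearest_targets)

# Toy 1-D dataset: the target is roughly 2 * x.
samples = [(1,), (2,), (3,), (10,)]
targets = [2.0, 4.0, 6.0, 20.0]
print(knn_regress(samples, targets, (2.5,), k=2))  # 5.0
```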

- Good results for small datasets with few features.
- Distance computation can become intractable with many samples, or with samples that have many features.
- Categorical features make distance computation difficult.