- K-NN in a nutshell
- K-NN for classification
- K-NN for regression
K-NN in a nutshell
- Instance-based (no model construction).
- No training phase: computations happen at prediction time.
- Not to be confused with the K-Means clustering algorithm (K-NN $\neq$ K-Means).
- Very simple, yet often effective.
- Can be used for classification or regression.
K-NN for classification
A new point is assigned to the class that has the most representatives among the k nearest neighbors of the point.
- k is a positive integer, chosen by the user. Typically small.
- A larger k suppresses the effects of noise, but makes the classification boundaries less distinct (illustrated in the sketch below).
- $k=5$ is a frequent choice.
- $k=1$ corresponds to the nearest neighbor algorithm.
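For illustration (not part of the original slides; scikit-learn is assumed, and the dataset and values of k are arbitrary choices), training accuracy shows how $k=1$ memorizes the training set while larger values smooth the boundary:

```python
# Sketch: larger k smooths the decision boundary, seen here indirectly
# through training accuracy on a noisy dataset.
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
for k in (1, 5, 25):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X, y).score(X, y)
    print(f"k={k}: training accuracy = {acc:.2f}")  # k=1 always scores 1.0
```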
Distances computation
- Brute force: compute the distance between the new point and every training sample.
- Efficient only when the number of samples is small.
- Other methods try to reduce the required number of distance computations.
- Basic idea: if A is very distant from B and B is very close to C, we can infer that A and C are very distant without computing their distance (see the sketch below).
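As a sketch (scikit-learn assumed; not in the original slides), the same query can be answered by brute force or by tree structures (KD-tree, ball tree) that exploit this idea to prune distance computations:

```python
# Sketch: comparing brute-force search with tree-based structures that
# prune distances using the reasoning above. Data is random and illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((1000, 3))      # 1000 training samples, 3 features
query = rng.random((1, 3))     # one new point

for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X)
    distances, indices = nn.kneighbors(query)
    print(algorithm, indices[0])   # same neighbors, different search cost
```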
Neighborhood class computation
- Basic form: uniform weights (simple majority vote). Suboptimal when the dataset is skewed (imbalanced classes).
- Distance weights: closer neighbors are weighted more heavily than farther ones.
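Both schemes are exposed by scikit-learn's KNeighborsClassifier through its weights parameter; a minimal sketch (the dataset is an illustrative choice):

```python
# Sketch: uniform vs. distance-based voting in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

uniform = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)

print(uniform.predict(X[:3]))   # simple majority vote
print(weighted.predict(X[:3]))  # closer neighbors count more
```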
Pseudocode (brute force, uniform weights)
For each training sample:
    Compute the distance between the sample and the new point
Keep the k samples with the smallest distances
Find the most frequent class among these k samples
Assign this class to the new point
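A direct Python translation of this pseudocode might look as follows (a sketch; knn_classify and the toy data are illustrative names, not from the original slides):

```python
# Brute-force k-NN classification with uniform weights, following the
# pseudocode above step by step.
from collections import Counter
import numpy as np

def knn_classify(X_train, y_train, new_point, k=5):
    # Compute the distance between every training sample and the new point.
    distances = np.linalg.norm(X_train - new_point, axis=1)
    # Keep the k samples with the smallest distances.
    nearest = np.argsort(distances)[:k]
    # Find the most frequent class among these k samples and assign it.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Usage example with a tiny two-class dataset
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 0
```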
K-NN for regression
The target is predicted by local interpolation of the targets associated with the k nearest neighbors.
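For instance (a sketch assuming scikit-learn; not part of the original slides), KNeighborsRegressor averages the targets of the k nearest neighbors, optionally weighting them by distance:

```python
# Sketch: k-NN regression as a local average of neighbor targets.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.linspace(0, 5, 50).reshape(-1, 1)   # one feature
y = np.sin(X).ravel()                      # continuous target

model = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X, y)
print(model.predict([[2.5]]))              # close to sin(2.5) ≈ 0.60
```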
- Good results on small datasets with few features.
- Computing the distances can become intractable with many samples, or with samples that have many features.
- Categorical features make the distance computation difficult.