- CNN Architecture
- CNN History
The visual world has the following properties:
- Translation invariance.
- Spatial hierarchy: complex and abstract concepts are composed from simple elements.
Classical models are not designed to detect local patterns in images.
The convolution operation
Apply a filter (or kernel) to the input data by sliding it across the input. The result is called a feature map.
- Filter dimensions: 2D for images.
- Filter size: generally 3x3 or 5x5.
- Number of filters: determines the number of feature maps created by the convolution operation.
- Stride: step for sliding the convolution window. Generally equal to 1.
- Padding: blank rows/columns with all-zero values added on sides of the input feature map.
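The parameters above can be illustrated with a minimal pure-Python sketch. The function name `conv2d`, the square-kernel assumption, and the sample values are illustrative choices, not taken from any library. Like most deep learning frameworks, it slides the kernel without flipping it.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` (lists of lists) and return the feature map."""
    k = len(kernel)  # assumes a square k x k kernel
    # Zero padding: surround the input with `padding` rows/columns of zeros.
    width = len(image[0]) + 2 * padding
    padded = [[0] * width for _ in range(padding)]
    padded += [[0] * padding + list(row) + [0] * padding for row in image]
    padded += [[0] * width for _ in range(padding)]
    height = len(padded)

    feature_map = []
    for i in range(0, height - k + 1, stride):
        out_row = []
        for j in range(0, width - k + 1, stride):
            # Element-wise product of the window and the kernel, summed to a scalar.
            out_row.append(sum(
                padded[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k)
            ))
        feature_map.append(out_row)
    return feature_map


# Hypothetical 4x4 single-channel image and 3x3 vertical-edge filter.
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

valid = conv2d(image, kernel)                         # no padding: 2x2 feature map
same = conv2d(image, kernel, padding=1)               # padding of 1: 4x4 feature map
strided = conv2d(image, kernel, stride=2, padding=1)  # stride of 2: 2x2 feature map
```

Using several filters simply means calling this once per kernel, which yields one feature map per filter.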
Preserving output dimensions with padding
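As a rough sketch of why padding preserves dimensions, the output size along one axis can be computed from the input size, filter size, stride, and padding. The helper names `conv_output_size` and `same_padding` and the 28-pixel input are hypothetical; the padding rule assumes an odd filter size and a stride of 1.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one axis for input length n and filter size k."""
    return (n + 2 * padding - k) // stride + 1


def same_padding(k):
    """Padding that keeps the output size equal to the input size.

    Assumes an odd filter size k and a stride of 1.
    """
    return (k - 1) // 2


n = 28  # hypothetical input width
without_padding = conv_output_size(n, 3)                          # shrinks by k - 1
with_padding = conv_output_size(n, 3, padding=same_padding(3))    # size preserved
with_padding_5 = conv_output_size(n, 5, padding=same_padding(5))  # size preserved
```

Without padding, each 3x3 convolution trims one pixel from every side, so stacked layers shrink the feature maps steadily.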
2D convolutions on 3D tensors
- An image has several color channels.
- Number of channels = filter depth.
- The convolution result is still a scalar.
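A small sketch may make the last point concrete: the filter has one 2D slice per channel, and the products from all channels are summed into a single number. The function name `conv_at`, the `[channel][row][col]` layout, and the sample values are assumptions for illustration.

```python
def conv_at(image, kernel, i, j):
    """Convolution result at window position (i, j) for a multi-channel input.

    image:  nested lists indexed as [channel][row][col]
    kernel: nested lists indexed as [channel][row][col]
    """
    assert len(image) == len(kernel), "filter depth must equal number of channels"
    k = len(kernel[0])  # assumes square k x k slices
    total = 0
    for c in range(len(kernel)):
        for di in range(k):
            for dj in range(k):
                total += image[c][i + di][j + dj] * kernel[c][di][dj]
    return total  # one scalar, whatever the number of channels


# Hypothetical 2x2 RGB patch and a 2x2 filter of depth 3.
image = [
    [[1, 2], [3, 4]],  # red channel
    [[0, 1], [1, 0]],  # green channel
    [[2, 2], [2, 2]],  # blue channel
]
kernel = [
    [[1, 1], [1, 1]],
    [[0, 1], [1, 0]],
    [[1, 0], [0, 0]],
]

result = conv_at(image, kernel, 0, 0)  # 10 (red) + 2 (green) + 2 (blue)
```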
- An activation function is applied to the (scalar) convolution result.
- Introduces non-linearity in the model.
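The activation step can be sketched as follows. ReLU is used here as an assumption, since it is a common choice but the text does not name a specific function; `apply_activation` is a hypothetical helper name.

```python
def relu(x):
    """Rectified linear unit: keeps positive values, replaces negatives with 0."""
    return max(0, x)


def apply_activation(feature_map, activation=relu):
    """Apply the activation to every scalar of a 2D feature map."""
    return [[activation(value) for value in row] for row in feature_map]


# Hypothetical feature map produced by a convolution.
feature_map = [[-2, -2], [2, -2]]
activated = apply_activation(feature_map)
```

Without such a non-linear step, stacked convolutions would collapse into a single linear operation.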
The pooling operation
- Reduces the dimensionality of feature maps.
- Often done by selecting maximum values (max pooling).
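A minimal sketch of max pooling, assuming a 2x2 window moved with a stride of 2 (a common setting, not one stated above). The function name `max_pool2d` and the sample feature map are illustrative.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size window of a 2D feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, rows - size + 1, stride):
        pooled.append([
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, cols - size + 1, stride)
        ])
    return pooled


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool2d(feature_map)  # 4x4 input reduced to 2x2
```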
- Convolution layers act as feature extractors.
- Dense layers use the extracted features to classify data.
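The classification half of this split can be sketched in pure Python. The sketch assumes the convolution and pooling stage has already produced some feature maps, and it uses hypothetical, untrained weights for a single two-class dense layer followed by a softmax. All names and values are illustrative.

```python
import math


def flatten(feature_maps):
    """Turn a list of 2D feature maps into one flat feature vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum of all inputs per output unit."""
    return [sum(w * x for w, x in zip(unit_weights, inputs)) + b
            for unit_weights, b in zip(weights, biases)]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output of the feature extraction stage: two 2x2 feature maps.
feature_maps = [[[6, 2], [2, 9]], [[0, 1], [3, 0]]]
features = flatten(feature_maps)  # 8 extracted features

# Hypothetical (untrained) weights: one row of 8 weights per class.
weights = [
    [0.1, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.1, 0.0, 0.0, 0.1, 0.1, 0.0],
]
biases = [0.0, 0.0]

probabilities = softmax(dense(features, weights, biases))
predicted_class = probabilities.index(max(probabilities))
```

In a trained network, the filter values and the dense weights are learned together, so the extracted features end up tailored to the classification task.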
The beginnings: LeNet-5 (1998)
The breakthrough: ILSVRC
AlexNet (2012)
- Trained on 2 GPUs for 5 to 6 days.
GoogLeNet (2014)
- 9 Inception modules, more than 100 layers.
- Trained on several GPUs for about a week.
Microsoft ResNet (2015)
- 152 layers
- Trained on 8 GPUs for 2 to 3 weeks.
- Lower error rate than an average human.