Convolutional Neural Networks

Ecole Nationale Supérieure de Cognitique

Baptiste Pesquet



  • CNN Architecture
  • CNN History

CNN Architecture


The visual world has the following properties:

  • Translation invariance.
  • Spatial hierarchy: complex and abstract concepts are composed from simple elements.

Classical models such as dense networks are not designed to detect local patterns in images.

General Design

General CNN architecture

The convolution operation

Apply a filter (or kernel) to the data. The result is called a feature map.

Convolution with a 3x3 filter of depth 1 applied on 5x5 data

Convolution example
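A minimal NumPy sketch of the operation: a 3x3 filter slides over 5x5 data with a stride of 1 and no padding, producing a 3x3 feature map. The input values, the filter values and the function name `conv2d` are illustrative assumptions, not taken from the figure. As in most deep learning libraries, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np

def conv2d(data, kernel):
    """Slide `kernel` over `data` (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = data.shape[0] - kh + 1
    out_w = data.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the window and the filter, summed to a scalar
            window = data[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

# Hypothetical 5x5 input: values 0..24 laid out row by row
data = np.arange(25, dtype=float).reshape(5, 5)
# Hypothetical filter that responds to horizontal intensity changes
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

feature_map = conv2d(data, kernel)
print(feature_map.shape)  # (3, 3)
print(feature_map)        # every entry is -6.0 for this input
```

Each output value is a single scalar summarizing one 3x3 window of the input.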

Convolution parameters

  • Filter dimensions: 2D for images.
  • Filter size: generally 3x3 or 5x5.
  • Number of filters: determines the number of feature maps created by the convolution operation.
  • Stride: step for sliding the convolution window. Generally equal to 1.
  • Padding: rows/columns of zeros added around the input feature map.
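These parameters determine the size of the output feature map. A short sketch of the standard relation, output = floor((input + 2 x padding - filter) / stride) + 1, applied to one spatial dimension. The function name `conv_output_size` is hypothetical.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Size of one output dimension for a convolution or pooling window."""
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(5, 3))             # 3: a 3x3 filter on 5x5 data
print(conv_output_size(5, 3, padding=1))  # 5: one ring of zeros preserves the size
print(conv_output_size(32, 5))            # 28
print(conv_output_size(32, 2, stride=2))  # 16: a 2x2 window with stride 2 halves the size
```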

Preserving output dimensions with padding
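A small sketch of the idea, assuming a 5x5 input and a 3x3 filter (both sizes and values are illustrative): adding one ring of zeros with `np.pad` gives a 7x7 input, so the 3x3 filter again produces a 5x5 feature map.

```python
import numpy as np

data = np.ones((5, 5))     # hypothetical 5x5 input
kernel = np.ones((3, 3))   # hypothetical 3x3 filter

# Add one row/column of zeros on every side: 5x5 becomes 7x7
padded = np.pad(data, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (7, 7)

# All 3x3 windows of the padded input, shape (5, 5, 3, 3)
windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
# Multiply each window by the filter and sum to one scalar per position
feature_map = np.einsum("ijkl,kl->ij", windows, kernel)
print(feature_map.shape)  # (5, 5): same as the unpadded input
```

Border values are smaller than central ones here (4 in the corners, 9 in the center) because the border windows overlap the added zeros.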

2D convolutions on 3D tensors

  • An image has several color channels.
  • Number of channels = filter depth.
  • At each position, the convolution result is still a scalar: products are summed over all channels.

2D convolution on a 32x32x3 image with 10 filters

2D convolution over RGB image
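A sketch of the shapes involved, assuming 3x3 filters, a stride of 1 and no padding (the filter size used in the figure is not stated here, so this is an assumption). Each filter has a depth of 3 to match the image channels, and each of the 10 filters yields one feature map.

```python
import numpy as np

def conv2d_multichannel(image, filters):
    """Convolve an (h, w, c) image with (n, fh, fw, c) filters; stride 1, no padding."""
    n_filters, fh, fw, depth = filters.shape
    assert depth == image.shape[2], "filter depth must equal the number of channels"
    out_h = image.shape[0] - fh + 1
    out_w = image.shape[1] - fw + 1
    output = np.zeros((out_h, out_w, n_filters))
    for k in range(n_filters):
        for i in range(out_h):
            for j in range(out_w):
                # The window spans all channels; the sum is a single scalar
                window = image[i:i + fh, j:j + fw, :]
                output[i, j, k] = np.sum(window * filters[k])
    return output

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))       # random stand-in for an RGB image
filters = rng.random((10, 3, 3, 3))   # 10 hypothetical filters of size 3x3, depth 3

output = conv2d_multichannel(image, filters)
print(output.shape)  # (30, 30, 10): one 30x30 feature map per filter
```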

Activation function

  • Applied to the (scalar) convolution result.
  • Introduces non-linearity in the model.
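As an illustration, ReLU is one common choice of activation function (the choice of ReLU is an assumption here, since no specific function is named). It is applied independently to every value of the feature map; the values below are made up.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: keep positive values, replace negative ones with 0."""
    return np.maximum(x, 0)

# Hypothetical convolution results
feature_map = np.array([[-6.0, 2.0],
                        [0.5, -1.0]])
activated = relu(feature_map)
print(activated)
# [[0.  2. ]
#  [0.5 0. ]]
```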


The pooling operation

  • Reduces the dimensionality of feature maps.
  • Often done by selecting maximum values (max pooling).

Max pooling with 2x2 filter and stride of 2

Pooling output

Pooling with a 2x2 filter and stride of 2 on 10 32x32 feature maps
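A minimal sketch of max pooling with a 2x2 window and a stride of 2. The function name `max_pool` and the input values are illustrative assumptions. Applied to 10 feature maps of size 32x32, it returns 10 maps of size 16x16.

```python
import numpy as np

def max_pool(feature_maps, size=2, stride=2):
    """Max pooling over an array of feature maps with shape (n, h, w)."""
    n, h, w = feature_maps.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    pooled = np.zeros((n, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_maps[:, i * stride:i * stride + size,
                                     j * stride:j * stride + size]
            # Keep only the largest value of each window, for every map at once
            pooled[:, i, j] = window.max(axis=(1, 2))
    return pooled

# One hypothetical 4x4 feature map
small = np.array([[[1, 3, 2, 4],
                   [5, 6, 7, 8],
                   [3, 2, 1, 0],
                   [1, 2, 3, 4]]], dtype=float)
print(max_pool(small)[0])
# [[6. 8.]
#  [3. 4.]]

# 10 random 32x32 feature maps
maps = np.random.default_rng(0).random((10, 32, 32))
print(max_pool(maps).shape)  # (10, 16, 16)
```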

Training a CNN

Same principle as a dense neural network: backpropagation + gradient descent.

Backpropagation In Convolutional Neural Networks
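A toy sketch of the principle, under simplifying assumptions: a single 3x3 filter, a squared-error loss, and a target feature map generated by a hypothetical "true" filter. For a convolution with stride 1 and no padding, the gradient of the loss with respect to the filter is itself a convolution of the input with the output error. Gradient descent then updates the filter weights.

```python
import numpy as np

def conv2d(data, kernel):
    """Stride-1 convolution without padding (no filter flip)."""
    kh, kw = kernel.shape
    out_h = data.shape[0] - kh + 1
    out_w = data.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(data[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
data = rng.random((5, 5))                      # hypothetical input
true_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])     # hypothetical filter to imitate
target = conv2d(data, true_kernel)

kernel = np.zeros((3, 3))   # filter weights to learn
learning_rate = 0.02        # small enough to be stable for inputs in [0, 1)
losses = []
for step in range(200):
    output = conv2d(data, kernel)              # forward pass
    error = output - target
    losses.append(0.5 * np.sum(error ** 2))    # squared-error loss
    # Backward pass: dL/dkernel[a, b] = sum over (i, j) of error[i, j] * data[i + a, j + b]
    gradient = conv2d(data, error)
    kernel -= learning_rate * gradient         # gradient descent update

print(losses[0], losses[-1])  # the loss shrinks as the filter is updated
```

In a real CNN the same chain rule is applied through every layer, and the gradients are computed automatically by the deep learning library.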


  • Convolution layers act as feature extractors.
  • Dense layers use the extracted features to classify data.

Feature extraction with a CNN
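A forward-pass sketch of this division of labor, assuming one convolution layer with ReLU, one max pooling layer and one dense softmax layer. All sizes, weights and function names are hypothetical, and the weights are random (untrained), so the predicted probabilities are meaningless. Only the data flow and the shapes are of interest.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_layer(image, filters):
    """Stride-1 convolution of an (h, w, c) image with (n, fh, fw, c) filters, then ReLU."""
    _, fh, fw, _ = filters.shape
    # windows has shape (out_h, out_w, c, fh, fw)
    windows = np.lib.stride_tricks.sliding_window_view(image, (fh, fw), axis=(0, 1))
    maps = np.einsum("ijchw,nhwc->ijn", windows, filters)
    return np.maximum(maps, 0)

def max_pool(maps, size=2):
    """Non-overlapping max pooling on (h, w, n) feature maps."""
    h, w, n = maps.shape
    h, w = h - h % size, w - w % size   # drop rows/columns that do not fill a window
    blocks = maps[:h, :w].reshape(h // size, size, w // size, size, n)
    return blocks.max(axis=(1, 3))

def dense_softmax(features, weights, bias):
    """Dense layer followed by softmax: one probability per class."""
    logits = features @ weights + bias
    exp = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return exp / exp.sum()

image = rng.random((28, 28, 1))            # random stand-in for an MNIST-sized grayscale image
filters = rng.normal(size=(8, 3, 3, 1))    # 8 hypothetical 3x3 filters

# Feature extraction: convolution + pooling, then flatten to a vector
features = max_pool(conv_layer(image, filters)).reshape(-1)

# Classification: dense layer over the extracted features, 10 classes
weights = rng.normal(size=(features.size, 10)) * 0.01
bias = np.zeros(10)
probabilities = dense_softmax(features, weights, bias)

print(features.shape)       # (1352,): 13 x 13 x 8 values
print(probabilities.shape)  # (10,)
```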

Visualizing convnet layers on MNIST

Another visualization of intermediate layers on MNIST

CNN History

The beginnings: LeNet-5 (1998)


The breakthrough: ILSVRC

ILSVRC results

AlexNet (2012)

Trained on 2 GPUs for 5 to 6 days.


We need to go deeper

VGG (2014)


GoogLeNet/Inception (2014)

  • 9 Inception modules; 22 layers deep, about 100 layers when counting every building block.
  • Trained on several GPUs for about a week.


Microsoft ResNet (2015)

  • 152 layers
  • Trained on 8 GPUs for 2 to 3 weeks.
  • Lower error rate than an average human.


Deeper model