Convolutional Neural Networks

Environment setup

import platform

print(f"Python version: {platform.python_version()}")
# Compare version components as integers: a lexicographic string comparison
# would wrongly report e.g. "10" < "6"
assert tuple(int(v) for v in platform.python_version_tuple()[:2]) >= (3, 6)

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
Python version: 3.7.5
# Setup plots
%matplotlib inline
plt.rcParams["figure.figsize"] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()
import tensorflow as tf

print(f"TensorFlow version: {tf.__version__}")
print(f"Keras version: {tf.keras.__version__}")
print('GPU found :)' if tf.config.list_physical_devices("GPU") else 'No GPU :(')

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D, Dropout
from tensorflow.keras.datasets import fashion_mnist, cifar10
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.applications import VGG16
TensorFlow version: 2.3.1
Keras version: 2.4.0
No GPU :(
def plot_loss_acc(history):
    """Plot training and (optionally) validation loss and accuracy
    Takes a Keras History object as parameter"""

    loss = history.history["loss"]
    epochs = range(1, len(loss) + 1)

    plt.figure(figsize=(10, 10))

    plt.subplot(2, 1, 1)
    plt.plot(epochs, loss, ".--", label="Training loss")
    final_loss = loss[-1]
    title = "Training loss: {:.4f}".format(final_loss)
    plt.ylabel("Loss")
    if "val_loss" in history.history:
        val_loss = history.history["val_loss"]
        plt.plot(epochs, val_loss, "o-", label="Validation loss")
        final_val_loss = val_loss[-1]
        title += ", Validation loss: {:.4f}".format(final_val_loss)
    plt.title(title)
    plt.legend()

    acc = history.history["accuracy"]

    plt.subplot(2, 1, 2)
    plt.plot(epochs, acc, ".--", label="Training acc")
    final_acc = acc[-1]
    title = "Training accuracy: {:.2f}%".format(final_acc * 100)
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    if "val_accuracy" in history.history:
        val_acc = history.history["val_accuracy"]
        plt.plot(epochs, val_acc, "o-", label="Validation acc")
        final_val_acc = val_acc[-1]
        title += ", Validation accuracy: {:.2f}%".format(final_val_acc * 100)
    plt.title(title)
    plt.legend()

Architecture

Justification

The visual world has the following properties:

  • Translation invariance: a pattern keeps its meaning regardless of its position in the image.

  • Locality: nearby pixels are more strongly correlated than distant ones.

  • Spatial hierarchy: complex and abstract concepts are composed from simple, local elements.

Classical models are not designed to detect local patterns in images.

Visual world

Topological structure of objects

Topological structure

From edges to objects

General CNN design

General CNN architecture

The convolution operation

Applies a kernel to the input data. The result is called a feature map.

Convolution with a 3x3 filter of depth 1 applied on 5x5 data

Convolution example
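To make the operation concrete, here is a minimal NumPy sketch (the values are hypothetical) of a 3x3 kernel sliding over 5x5 data with a stride of 1 and no padding:

# Hypothetical 5x5 input and 3x3 kernel
data = np.arange(25).reshape(5, 5)
kernel = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])

# Slide the kernel over the data: at each location, multiply element-wise
# and sum the results to obtain one value of the feature map
feature_map = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        feature_map[i, j] = np.sum(data[i : i + 3, j : j + 3] * kernel)

print(feature_map)  # 3x3 feature map (5 - 3 + 1 = 3)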

Convolution parameters

  • Filter dimensions: 2D for images.

  • Filter size: generally 3x3 or 5x5.

  • Number of filters: determines the number of feature maps created by the convolution operation.

  • Stride: step for sliding the convolution window. Generally equal to 1.

  • Padding: blank rows/columns with all-zero values added on each side of the input feature map.

Preserving output dimensions with padding

Preserving output dimensions with padding

Valid padding

Output size = input size - kernel size + 1

Valid padding

Full padding

Output size = input size + kernel size - 1

Full padding

Same padding

Output size = input size

Same padding
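These three output sizes can be checked with SciPy's convolve2d function, whose mode parameter matches the three padding schemes (SciPy is an assumption here; it is not used elsewhere in this notebook):

from scipy import signal

data = np.random.rand(5, 5)    # 5x5 input
kernel = np.random.rand(3, 3)  # 3x3 kernel

print(signal.convolve2d(data, kernel, mode="valid").shape)  # (3, 3): 5 - 3 + 1
print(signal.convolve2d(data, kernel, mode="full").shape)   # (7, 7): 5 + 3 - 1
print(signal.convolve2d(data, kernel, mode="same").shape)   # (5, 5): unchanged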

Convolution inputs and outputs: tensors

Convolution inputs and outputs

2D convolutions on 3D tensors

  • Convolution input data is 3-dimensional: images with height, width and color channels, or feature maps produced by previous layers.

  • Each convolution filter is a collection of kernels with distinct weights, one for every input channel.

  • At each location, every input channel is convolved with the corresponding kernel. The results are summed to compute the (scalar) filter output for the location.

  • Sliding one filter over the input data produces a 2D output feature map.

2D convolution on a 32x32x3 image with 10 filters

2D convolution over RGB image
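As a quick check of these shapes, here is a minimal sketch using the Conv2D layer imported above, applied to a random batch containing one 32x32x3 image:

# A batch containing one random 32x32 RGB image
images = np.random.rand(1, 32, 32, 3).astype("float32")

conv = Conv2D(filters=10, kernel_size=(3, 3))
output = conv(images)

# One 30x30 feature map per filter (valid padding: 32 - 3 + 1 = 30)
print(output.shape)  # (1, 30, 30, 10)

# Each filter holds one 3x3 kernel per input channel, plus one bias
kernels, biases = conv.get_weights()
print(kernels.shape)  # (3, 3, 3, 10): height x width x input channels x filters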

Activation function

  • Applied to the (scalar) convolution result.

  • Introduces non-linearity in the model.

  • Standard choice: \(ReLU(z) = \max(0, z)\)
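In NumPy terms, ReLU simply clips negative values to zero:

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(np.maximum(z, 0))  # [0.  0.  0.  1.5]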

The pooling operation

  • Reduces the dimensionality of feature maps.

  • Often done by selecting maximum values (max pooling).

Max pooling with 2x2 filter and stride of 2

Pooling result

Pooling result

Pooling output

Pooling with a 2x2 filter and stride of 2 on 10 32x32 feature maps
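The following minimal sketch reproduces this case with the MaxPooling2D layer imported above, on a random batch of feature maps:

# A batch of one set of 10 random 32x32 feature maps
feature_maps = np.random.rand(1, 32, 32, 10).astype("float32")

# A 2x2 window with a stride of 2 halves both spatial dimensions;
# the number of feature maps is unchanged
pooled = MaxPooling2D(pool_size=(2, 2))(feature_maps)
print(pooled.shape)  # (1, 16, 16, 10)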

Interpretation

  • Convolution layers act as feature extractors.

  • Dense layers use the extracted features to classify data.

A convnet

Feature extraction with a CNN

Visualizing convnet layers on MNIST

Another visualization of intermediate layers on MNIST

History

Humble beginnings: LeNet5 (1998)

LeNet5

from IPython.display import YouTubeVideo

YouTubeVideo("FwFduRA_L6Q")

The breakthrough: ILSVRC

ILSVRC results

AlexNet (2012)

Trained on 2 GPUs for 5 to 6 days.

AlexNet

VGG (2014)

VGG16

GoogLeNet/Inception (2014)

  • 9 Inception modules, more than 100 layers.

  • Trained on several GPUs for about a week.

Inception

Microsoft ResNet (2015)

  • 152 layers, trained on 8 GPUs for 2 to 3 weeks.

  • Smaller error rate than an average human.

ResNet

Deeper model

Depth: challenges and solutions

  • Challenges

    • Computational complexity

    • Optimization difficulties

  • Solutions

    • Careful initialization

    • Sophisticated optimizers

    • Normalization layers

    • Network design

Training a convnet

General principle

Same principle as a dense neural network: backpropagation + gradient descent.

Backpropagation In Convolutional Neural Networks

Data augmentation

By design, convnets are only robust against translation. Data augmentation can make them robust against rotation and scaling.

Principle: the dataset is enriched with new samples created by applying operations on existing ones.

Data Augmentation
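As an illustration, here is a minimal sketch using Keras' ImageDataGenerator (one possible augmentation API; the parameter values below are arbitrary choices):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,       # random rotations of up to 15 degrees
    zoom_range=0.1,          # random zooms of up to 10%
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    horizontal_flip=True,    # random left-right flips
)

# flow() yields an endless stream of batches of randomly transformed images,
# which can be passed directly to a model's fit() method, e.g.:
# history = model.fit(augmenter.flow(x_train, y_train, batch_size=128), ...)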

Example: training a CNN to recognize fashion items

The Fashion-MNIST dataset contains 70,000 28x28 grayscale images of fashion items.

It is slightly more challenging than the ubiquitous MNIST handwritten digits dataset.

# Load the Fashion-MNIST dataset
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

print(f"Training images: {train_images.shape}. Training labels: {train_labels.shape}")
print(f"Test images: {test_images.shape}. Test labels: {test_labels.shape}")
Training images: (60000, 28, 28). Training labels: (60000,)
Test images: (10000, 28, 28). Test labels: (10000,)
# Plot the first 10 training images
with sns.axes_style("white"):  # Temporarily hide Seaborn grid lines
    plt.figure(figsize=(12, 6))
    for i in range(10):
        image = train_images[i]
        fig = plt.subplot(2, 5, i + 1)
        plt.imshow(image, cmap=plt.cm.binary)
# Labels are integer scalars between 0 and 9
df_train_labels = pd.DataFrame(train_labels)
df_train_labels.columns = ["label"]
df_train_labels.sample(n=8)
       label
48873      3
50246      8
52153      3
29803      1
23481      8
23519      5
18528      8
12866      1

Data preprocessing

# Change pixel values from (0, 255) to (0, 1)
x_train = train_images.astype("float32") / 255
x_test = test_images.astype("float32") / 255

# Make sure images have shape (28, 28, 1) to apply convolution
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

# One-hot encoding of expected results
y_train = to_categorical(train_labels)
y_test = to_categorical(test_labels)

print(f"x_train: {x_train.shape}. y_train: {y_train.shape}")
print(f"x_test: {x_test.shape}. y_test: {y_test.shape}")
x_train: (60000, 28, 28, 1). y_train: (60000, 10)
x_test: (10000, 28, 28, 1). y_test: (10000, 10)

Expected convnet architecture

Example CNN architecture

# Create a linear stack of layers
cnn_model = Sequential()

# Convolution module 1: Conv2D+MaxPooling2D
#   filters: number of convolution filters
#   kernel_size: size of the convolution kernel (2D convolution window)
#   input_shape: shape of the input feature map
#   (Here, the expected input shape is a 3D tensor corresponding to an image)
cnn_model.add(
    Conv2D(filters=32, kernel_size=(3, 3), activation="relu", input_shape=(28, 28, 1))
)
#   pool_size: factors by which to downscale (vertical, horizontal)
cnn_model.add(MaxPooling2D(pool_size=(2, 2)))

# Convolution module 2
cnn_model.add(Conv2D(filters=64, kernel_size=(3, 3), activation="relu"))
cnn_model.add(MaxPooling2D(pool_size=(2, 2)))

# Flattening the last output feature map (a 3D tensor) to feed the Dense layer
cnn_model.add(Flatten())
cnn_model.add(Dense(128))
# To fight overfitting
cnn_model.add(Dropout(0.5))
# Classification layer
cnn_model.add(Dense(10, activation="softmax"))
# Print model summary
cnn_model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 128)               204928    
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290      
=================================================================
Total params: 225,034
Trainable params: 225,034
Non-trainable params: 0
_________________________________________________________________

Convnet training

# Preparing the model for training
cnn_model.compile(
    optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"]
)

# Training the model, using 10% of the training set for validation
# (May take several minutes depending on your system)
history = cnn_model.fit(
    x_train, y_train, epochs=10, verbose=2, batch_size=128, validation_split=0.1
)
Epoch 1/10
422/422 - 26s - loss: 0.5861 - accuracy: 0.7897 - val_loss: 0.3855 - val_accuracy: 0.8582
Epoch 2/10
422/422 - 25s - loss: 0.3822 - accuracy: 0.8639 - val_loss: 0.3370 - val_accuracy: 0.8768
Epoch 3/10
422/422 - 26s - loss: 0.3342 - accuracy: 0.8799 - val_loss: 0.3098 - val_accuracy: 0.8857
Epoch 4/10
422/422 - 26s - loss: 0.3036 - accuracy: 0.8916 - val_loss: 0.2872 - val_accuracy: 0.8978
Epoch 5/10
422/422 - 26s - loss: 0.2855 - accuracy: 0.8980 - val_loss: 0.2693 - val_accuracy: 0.9047
Epoch 6/10
422/422 - 25s - loss: 0.2677 - accuracy: 0.9037 - val_loss: 0.2637 - val_accuracy: 0.9013
Epoch 7/10
422/422 - 25s - loss: 0.2524 - accuracy: 0.9097 - val_loss: 0.2612 - val_accuracy: 0.9085
Epoch 8/10
422/422 - 25s - loss: 0.2379 - accuracy: 0.9156 - val_loss: 0.2675 - val_accuracy: 0.9027
Epoch 9/10
422/422 - 25s - loss: 0.2261 - accuracy: 0.9188 - val_loss: 0.2547 - val_accuracy: 0.9080
Epoch 10/10
422/422 - 25s - loss: 0.2157 - accuracy: 0.9228 - val_loss: 0.2542 - val_accuracy: 0.9082
# Plot training history
plot_loss_acc(history)
# Evaluate model performance on test data
_, test_acc = cnn_model.evaluate(x_test, y_test, verbose=0)

print(f"Test accuracy: {test_acc:.5f}")
Test accuracy: 0.90320
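Beyond the aggregate accuracy, a quick sanity check (not part of the original evaluation) is to compare a few individual predictions with their expected labels:

# Fashion-MNIST class names, in label order
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Predicted class probabilities for the first test images
predictions = cnn_model.predict(x_test[:5])
for probs, label in zip(predictions, test_labels[:5]):
    print(f"Predicted: {class_names[np.argmax(probs)]}, expected: {class_names[label]}")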

Using a pretrained convnet

An efficient strategy

A pretrained convnet is a saved network that was previously trained on a large dataset (typically on a large-scale image classification task). If the training set was general enough, it can act as a generic model and its learned features can be useful for many problems.

It is an example of transfer learning.

There are two ways to use a pretrained model: feature extraction and fine-tuning.

Feature extraction

Reuse the convolution base of a pretrained model, and add a custom classifier trained from scratch on top of it.

State-of-the-art models (VGG, ResNet, Inception…) are regularly published by top AI institutions.

Fine-tuning

Slightly adjusts the top feature extraction layers of the model being reused, in order to make it more relevant for the new context.

These top layers and the custom classification layers on top of them are jointly trained.

Fine-tuning
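As a self-contained sketch (not trained here), fine-tuning could unfreeze only the top convolution block of a pretrained VGG16 base while keeping earlier layers frozen, then recompile with a low learning rate so the reused weights are only slightly adjusted:

# Pretrained base; "block5" layers form VGG16's last convolution block
base = VGG16(weights="imagenet", include_top=False, input_shape=(32, 32, 3))
base.trainable = True
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

# Custom classifier on top of the partially unfrozen base
fine_tuned_model = Sequential(
    [base, Flatten(), Dense(512, activation="relu"), Dense(10, activation="softmax")]
)

# A low learning rate limits how much the pretrained weights are distorted
fine_tuned_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)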

Example: using a pretrained convnet to recognize common objects

The CIFAR10 dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. The classes are completely mutually exclusive.

There are 50,000 training images and 10,000 test images.

# Load the CIFAR10 dataset
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()

print(f"Training images: {train_images.shape}. Training labels: {train_labels.shape}")
print(f"Test images: {test_images.shape}. Test labels: {test_labels.shape}")
Training images: (50000, 32, 32, 3). Training labels: (50000, 1)
Test images: (10000, 32, 32, 3). Test labels: (10000, 1)
# Plot the first training images
with sns.axes_style("white"):  # Temporarily hide Seaborn grid lines
    plt.figure(figsize=(16, 6))
    for i in range(30):
        image = train_images[i]
        fig = plt.subplot(3, 10, i + 1)
        plt.imshow(image, cmap=plt.cm.binary)
# Change pixel values from (0, 255) to (0, 1)
x_train = train_images.astype("float32") / 255
x_test = test_images.astype("float32") / 255

# One-hot encoding of expected results
y_train = to_categorical(train_labels)
y_test = to_categorical(test_labels)

print(f"x_train: {x_train.shape}. y_train: {y_train.shape}")
print(f"x_test: {x_test.shape}. y_test: {y_test.shape}")
x_train: (50000, 32, 32, 3). y_train: (50000, 10)
x_test: (10000, 32, 32, 3). y_test: (10000, 10)
# Using the convolutional base of VGG16
conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(32, 32, 3))

# Freezing the convolutional base
# This prevents weight updates during training
conv_base.trainable = False

conv_base.summary()
Model: "vgg16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 32, 32, 3)]       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 32, 32, 64)        1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 32, 32, 64)        36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 16, 16, 64)        0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 16, 16, 128)       73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 16, 16, 128)       147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 8, 8, 128)         0         
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 8, 8, 256)         295168    
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 8, 8, 256)         590080    
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 8, 8, 256)         590080    
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 4, 4, 256)         0         
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 4, 4, 512)         1180160   
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 4, 4, 512)         2359808   
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 4, 4, 512)         2359808   
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 2, 2, 512)         0         
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 2, 2, 512)         2359808   
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 2, 2, 512)         2359808   
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 2, 2, 512)         2359808   
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 1, 1, 512)         0         
=================================================================
Total params: 14,714,688
Trainable params: 0
Non-trainable params: 14,714,688
_________________________________________________________________
# Create our new model
pretrained_cnn_model = Sequential()
# Add VGG as its base
pretrained_cnn_model.add(conv_base)
# Add a Dense classifier on top of the pretrained base
pretrained_cnn_model.add(Flatten())
pretrained_cnn_model.add(Dense(512, activation="relu"))
pretrained_cnn_model.add(Dropout(0.5))
pretrained_cnn_model.add(Dense(10, activation="softmax"))

pretrained_cnn_model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
vgg16 (Functional)           (None, 1, 1, 512)         14714688  
_________________________________________________________________
flatten_1 (Flatten)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                5130      
=================================================================
Total params: 14,982,474
Trainable params: 267,786
Non-trainable params: 14,714,688
_________________________________________________________________
# Preparing the model for training
pretrained_cnn_model.compile(
    optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"]
)

# Training the model, using 10% of the training set for validation
# (May take several minutes depending on your system)
history = pretrained_cnn_model.fit(
    x_train, y_train, epochs=20, verbose=1, batch_size=32, validation_split=0.1
)
Epoch 1/20
1407/1407 [==============================] - 318s 226ms/step - loss: 1.4528 - accuracy: 0.4889 - val_loss: 1.2485 - val_accuracy: 0.5622
Epoch 2/20
1407/1407 [==============================] - 307s 218ms/step - loss: 1.2735 - accuracy: 0.5536 - val_loss: 1.1642 - val_accuracy: 0.5968
Epoch 3/20
1407/1407 [==============================] - 9552s 7s/step - loss: 1.2200 - accuracy: 0.5700 - val_loss: 1.1345 - val_accuracy: 0.6000
Epoch 4/20
1407/1407 [==============================] - 17599s 13s/step - loss: 1.1845 - accuracy: 0.5829 - val_loss: 1.1143 - val_accuracy: 0.6118
Epoch 5/20
1407/1407 [==============================] - 307s 219ms/step - loss: 1.1577 - accuracy: 0.5936 - val_loss: 1.1115 - val_accuracy: 0.6158
Epoch 6/20
1407/1407 [==============================] - 309s 220ms/step - loss: 1.1330 - accuracy: 0.6043 - val_loss: 1.0977 - val_accuracy: 0.6144
Epoch 7/20
1407/1407 [==============================] - 283s 201ms/step - loss: 1.1148 - accuracy: 0.6071 - val_loss: 1.0855 - val_accuracy: 0.6190
Epoch 8/20
1407/1407 [==============================] - 275s 195ms/step - loss: 1.0938 - accuracy: 0.6144 - val_loss: 1.0883 - val_accuracy: 0.6140
Epoch 9/20
1407/1407 [==============================] - 951s 676ms/step - loss: 1.0749 - accuracy: 0.6228 - val_loss: 1.0831 - val_accuracy: 0.6242
Epoch 10/20
1407/1407 [==============================] - 284s 202ms/step - loss: 1.0620 - accuracy: 0.6257 - val_loss: 1.0767 - val_accuracy: 0.6202
Epoch 11/20
1407/1407 [==============================] - 286s 203ms/step - loss: 1.0435 - accuracy: 0.6302 - val_loss: 1.0717 - val_accuracy: 0.6266
Epoch 12/20
1407/1407 [==============================] - 286s 204ms/step - loss: 1.0284 - accuracy: 0.6363 - val_loss: 1.0729 - val_accuracy: 0.6258
Epoch 13/20
1407/1407 [==============================] - 277s 197ms/step - loss: 1.0119 - accuracy: 0.6420 - val_loss: 1.0881 - val_accuracy: 0.6168
Epoch 14/20
1407/1407 [==============================] - 273s 194ms/step - loss: 1.0008 - accuracy: 0.6466 - val_loss: 1.0772 - val_accuracy: 0.6286
Epoch 15/20
1407/1407 [==============================] - 273s 194ms/step - loss: 0.9804 - accuracy: 0.6498 - val_loss: 1.0703 - val_accuracy: 0.6246
Epoch 16/20
1407/1407 [==============================] - 273s 194ms/step - loss: 0.9735 - accuracy: 0.6566 - val_loss: 1.0654 - val_accuracy: 0.6278
Epoch 17/20
1407/1407 [==============================] - 299s 212ms/step - loss: 0.9625 - accuracy: 0.6599 - val_loss: 1.0730 - val_accuracy: 0.6262
Epoch 18/20
1407/1407 [==============================] - 287s 204ms/step - loss: 0.9469 - accuracy: 0.6618 - val_loss: 1.0576 - val_accuracy: 0.6330
Epoch 19/20
1407/1407 [==============================] - 281s 200ms/step - loss: 0.9359 - accuracy: 0.6667 - val_loss: 1.0725 - val_accuracy: 0.6388
Epoch 20/20
1407/1407 [==============================] - 280s 199ms/step - loss: 0.9224 - accuracy: 0.6719 - val_loss: 1.0847 - val_accuracy: 0.6340
# Plot training history
plot_loss_acc(history)
# Evaluate model performance on test data
_, test_acc = pretrained_cnn_model.evaluate(x_test, y_test, verbose=0)

print(f"Test accuracy: {test_acc:.5f}")
Test accuracy: 0.62750