```
In [1]:
```

```
from IPython.display import display, Image
```

# CNTK 103: Part D - Convolutional Neural Network with MNIST

We assume that you have successfully completed CNTK 103 Part A (MNIST Data Loader).

In this tutorial we will train a Convolutional Neural Network (CNN) on MNIST data. This notebook provides the recipe using the Python API. If you are looking for this example in BrainScript, please look here

## Introduction

A convolutional neural network (CNN, or ConvNet) is a type of feed-forward artificial neural network made up of neurons that have learnable weights and biases, very similar to the ordinary multi-layer perceptron (MLP) networks introduced in 103C. CNNs take advantage of the spatial nature of the data. In nature, we perceive different objects by their shapes, sizes and colors. For example, objects in a natural scene are typically composed of edges, corners/vertices (defined by two or more edges), color patches, etc. These primitives are often identified using different detectors (e.g., edge detectors, color detectors) or combinations of detectors interacting to facilitate image interpretation (object classification, region of interest detection, scene description, etc.) in real-world vision tasks. These detectors are also known as filters. Convolution is a mathematical operator that takes an image and a filter as input and produces a filtered output (representing, say, edges, corners or colors in the input image). Historically, these filters were sets of weights that were often hand-crafted or modeled with mathematical functions (e.g., Gaussian / Laplacian / Canny filters). The filter outputs are mapped through non-linear activation functions, mimicking the brain cells called neurons.
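To make the idea of a hand-crafted filter concrete, here is a minimal NumPy sketch. The helper name `filter2d`, the toy image and the Sobel-like kernel are illustrative assumptions, not part of the tutorial's code. Like most deep learning toolkits, the sketch slides the kernel without flipping it (strictly speaking, a cross-correlation).

```python
import numpy as np

def filter2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Dot product between the kernel and the image patch under it
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Toy 5 x 5 image: dark on the left (0), bright on the right (1)
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-crafted vertical-edge detector (a Sobel-like kernel)
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

response = filter2d(image, kernel)
print(response)  # each row is [4. 4. 0.]: strong response at the edge, none in the flat region
```

A CNN replaces the fixed numbers in `kernel` with weights that are learned from data.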

Convolutional networks provide the machinery to learn these filters directly from the data instead of relying on explicit mathematical models, and they have been found to be superior (in real-world tasks) to historically hand-crafted filters. With convolutional networks, the focus is on learning the filter weights instead of learning individual fully connected pair-wise (between inputs and outputs) weights. In this way, the number of weights to learn is reduced compared with the traditional MLP networks from the previous tutorials. In a convolutional network, one learns several filters, ranging from a handful to a few thousand depending on the network complexity.

Many of the CNN primitives have been shown to have conceptually parallel components in the brain’s visual cortex. A group of neurons in the visual cortex emits responses when a particular region of the visual field is stimulated. This region is known as the receptive field (RF). Equivalently, in convolution the input region corresponding to the filter dimensions can be considered the receptive field. Popular deep CNNs or ConvNets (such as AlexNet, VGG, Inception, ResNet) that are used for various computer vision tasks have many of these architectural primitives (inspired by biology).

In this tutorial, we will introduce the convolution operation and gain familiarity with the different parameters in CNNs.

**Problem**: As in CNTK 103C, we will continue to work on the same
problem of recognizing digits in MNIST data. The MNIST data comprises
hand-written digits with little background noise.

```
In [2]:
```

```
# Figure 1
Image(url= "http://3.bp.blogspot.com/_UpN7DfJA0j4/TJtUBWPk0SI/AAAAAAAAABY/oWPMtmqJn3k/s1600/mnist_originals.png", width=200, height=200)
```

```
Out[2]:
```

**Goal**: Our goal is to train a classifier that will identify the
digits in the MNIST dataset.

**Approach**:

The same 5 stages we have used in the previous tutorial are applicable: Data reading, Data preprocessing, Creating a model, Learning the model parameters and Evaluating (a.k.a. testing/prediction) the model.

- Data reading: We will use the CNTK Text reader
- Data preprocessing: Covered in part A (suggested extension section).

In this tutorial, we will experiment with two models with different architectural components.

```
In [3]:
```

```
from __future__ import print_function # Use a function definition from future version (say 3.x from 2.7 interpreter)
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
import time
import cntk as C
import cntk.tests.test_utils
cntk.tests.test_utils.set_device_from_pytest_env() # (only needed for our build system)
C.cntk_py.set_fixed_random_seed(1) # fix a random seed for CNTK components
%matplotlib inline
```

## Data reading

In this section, we will read the data generated in CNTK 103 Part A (MNIST Data Loader).

We are using the MNIST data that you have downloaded using the
CNTK_103A_MNIST_DataLoader notebook. The dataset has 60,000 training
images and 10,000 test images with each image being 28 x 28 pixels. Thus
the number of features is equal to 784 (= 28 x 28 pixels), 1 per pixel.
The variable `num_output_classes` is set to 10 corresponding to the
number of digits (0-9) in the dataset.

In previous tutorials, we have always flattened the input image into a vector. With convolutional networks, we do not flatten the image in this way.

**Input Dimensions**:

In convolutional networks for images, the input data is often shaped as a 3D matrix (number of channels, image width, image height), which preserves the spatial relationship between the pixels. In the figure above, the MNIST image is single-channel (grayscale) data, so the input dimension is specified as a (1, image width, image height) tuple.

Natural scene color images are often presented as Red-Green-Blue (RGB) color channels. The input dimension of such images is specified as a (3, image width, image height) tuple. If one has RGB input data as a volumetric scan with volume width, volume height and volume depth representing the 3 axes, the input data format would be specified by a tuple of 4 values (3, volume width, volume height, volume depth). In this way CNTK enables specification of input images in arbitrary higher-dimensional space.
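As a rough illustration of these shapes, the following NumPy sketch reshapes a flattened 784-value vector into a single-channel image tensor. The array contents are made up, and the row-major pixel order is an assumption of this sketch rather than something the tutorial states.

```python
import numpy as np

# A stand-in for one flattened MNIST sample: 784 values (assumed row-major order)
flat = np.arange(28 * 28, dtype=np.float32)

# Single-channel image tensor: (channels, first spatial axis, second spatial axis)
image = flat.reshape(1, 28, 28)
print(image.shape)  # (1, 28, 28)

# The reshape only changes the indexing; the value at position (2, 3) of channel 0
# is the same number that sat at flat index 2 * 28 + 3
same_pixel = image[0, 2, 3] == flat[2 * 28 + 3]
print(same_pixel)  # True

# A hypothetical RGB image of the same spatial size would carry 3 channels
rgb = np.zeros((3, 28, 28), dtype=np.float32)
print(rgb.shape)  # (3, 28, 28)
```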

```
In [5]:
```

```
# Define the data dimensions
input_dim_model = (1, 28, 28) # images are 28 x 28 with 1 channel of color (gray)
input_dim = 28*28 # used by readers to treat input data as a vector
num_output_classes = 10
```

**Data Format**: The data is stored on our local machine in the CNTK CTF
format. The CTF format is a simple text format that contains a set of
samples with each sample containing a set of named fields and their
data. For our MNIST data, each sample contains 2 fields: labels and
features, formatted as:

```
|labels 0 0 0 1 0 0 0 0 0 0 |features 0 255 0 123 ...
(784 integers each representing a pixel gray level)
```

In this tutorial we are going to use the image pixels corresponding to
the integer stream named “features”. We define a `create_reader`
function to read the training and test data using the CTF
deserializer.

The labels are 1-hot encoded
(the label representing the output class of 3 becomes `0001000000`
since we have 10 classes for the 10 possible digits), where the first
index corresponds to digit `0` and the last one corresponds to digit
`9`.
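The 1-hot encoding and the CTF line layout described above can be sketched in a few lines of Python. The helper names `one_hot` and `to_ctf_line` are hypothetical, and the 4-pixel feature list is shortened for readability (a real sample has 784 values).

```python
import numpy as np

def one_hot(digit, num_classes=10):
    """Return a vector of zeros with a single 1 at index `digit`."""
    vec = np.zeros(num_classes, dtype=int)
    vec[digit] = 1
    return vec

def to_ctf_line(digit, pixels):
    """Format one sample as '|labels ... |features ...'."""
    labels = " ".join(str(v) for v in one_hot(digit))
    features = " ".join(str(int(p)) for p in pixels)
    return "|labels {} |features {}".format(labels, features)

line = to_ctf_line(3, [0, 255, 0, 123])
print(line)  # |labels 0 0 0 1 0 0 0 0 0 0 |features 0 255 0 123
```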

```
In [6]:
```

```
# Read a CTF formatted text (as mentioned above) using the CTF deserializer from a file
def create_reader(path, is_training, input_dim, num_label_classes):
    ctf = C.io.CTFDeserializer(path, C.io.StreamDefs(
        labels=C.io.StreamDef(field='labels', shape=num_label_classes, is_sparse=False),
        features=C.io.StreamDef(field='features', shape=input_dim, is_sparse=False)))
    return C.io.MinibatchSource(ctf,
        randomize = is_training, max_sweeps = C.io.INFINITELY_REPEAT if is_training else 1)
```

```
In [7]:
```

```
# Ensure the training and test data is available for this tutorial.
# We search in two locations in the toolkit for the cached MNIST data set.
data_found = False  # A flag to indicate if train/test data found in local cache
for data_dir in [os.path.join("..", "Examples", "Image", "DataSets", "MNIST"),
                 os.path.join("data", "MNIST")]:
    train_file = os.path.join(data_dir, "Train-28x28_cntk_text.txt")
    test_file = os.path.join(data_dir, "Test-28x28_cntk_text.txt")
    if os.path.isfile(train_file) and os.path.isfile(test_file):
        data_found = True
        break
if not data_found:
    raise ValueError("Please generate the data by completing CNTK 103 Part A")
print("Data directory is {0}".format(data_dir))
```

```
Data directory is ..\Examples\Image\DataSets\MNIST
```

## CNN Model Creation

A CNN is a feedforward network made up of a number of layers arranged so that the output of one layer becomes the input to the next layer (similar to an MLP). In an MLP, every input pixel is connected to every output node, with each connection having its own weight, thus leading to a combinatorial explosion of parameters to be learnt and also increasing the possibility of overfitting (details). Convolution layers take advantage of the spatial arrangement of the pixels and learn multiple filters that significantly reduce the number of parameters in the network (details). The size of the filter is a parameter of the convolution layer.

In this section, we introduce the basics of convolution operations. We show the illustrations in the context of RGB images (3 channels), even though the MNIST data we are using in this tutorial is a grayscale image (single channel).

### Convolution Layer

A convolution layer is a set of filters. Each filter is defined by a
weight matrix (**W**) and a bias (\(b\)).

These filters are scanned across the image, performing the dot product between the weights and the corresponding input values (\(x\)). The bias value is added to the output of the dot product and the resulting sum is optionally mapped through an activation function. This process is illustrated in the following animation.

```
In [8]:
```

```
Image(url="https://www.cntk.ai/jup/cntk103d_conv2d_final.gif", width= 300)
```

```
Out[8]:
```

Convolution layers incorporate the following key features:

- Instead of being fully connected to all pairs of input and output nodes, each convolution node is **locally connected** to a subset of input nodes localized to a smaller input region, also referred to as the receptive field (RF). The figure above illustrates a small 3 x 3 region in the image as the RF region. In the case of an RGB image, there would be three such 3 x 3 regions, one for each of the 3 color channels.
- Instead of having a single set of weights (as in a Dense layer), convolutional layers have multiple sets (shown in the figure with multiple colors), called **filters**. Each filter detects features within each possible RF in the input image. The output of the convolution is a set of `n` sub-layers (shown in the animation below), where `n` is the number of filters (refer to the above figure).
- Within a sublayer, instead of each node having its own set of weights, a single set of **shared weights** is used by all nodes in that sublayer. This reduces the number of parameters to be learnt and thus the risk of overfitting. This also opens the door to several aspects of deep learning that have enabled very practical solutions to be built:
  - Handling larger images (say 512 x 512)
  - Trying larger filter sizes (corresponding to a larger RF), say 11 x 11
  - Learning more filters (say 128)
  - Exploring deeper architectures (100+ layers)
  - Achieving translation invariance (the ability to recognize a feature independent of where it appears in the image).

### Strides and Pad parameters

**How are filters positioned?** In general, the filters are arranged in
overlapping tiles, from left to right, and top to bottom. Each
convolution layer has a parameter `filter_shape`, specifying the width
and height of the filter in the case of most natural scene
images. There is a parameter (`strides`) that controls how far to
step to the right when moving the filters through multiple RF’s in a row,
and how far to step down when moving to the next row. The boolean
parameter `pad` controls if the input should be padded around the
edges to allow a complete tiling of the RF’s near the borders.

The animation above shows the results with a `filter_shape` = (3, 3),
`strides` = (2, 2) and `pad` = False. The two animations below show
the results when `pad` is set to True. First, with a stride of 2 and
second having a stride of 1. Note: the shape of the output (the teal
layer) is different between the two stride settings. Often, your
decision whether to pad and which stride values to choose is based on the shape
of the output layer needed.
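The effect of `filter_shape`, `strides` and `pad` on the output size along one spatial axis can be estimated with a small helper. This is a sketch under a stated assumption: with padding enabled, the input is padded just enough that the output size is the input size divided by the stride, rounded up. The function name `conv_output_size` is hypothetical and not a CNTK API.

```python
def conv_output_size(input_size, filter_size, stride, pad):
    """Output length along one spatial axis of a convolution (or pooling) layer."""
    if pad:
        # Assumption: padding is added so that every stride step yields one output
        return (input_size + stride - 1) // stride  # ceil(input_size / stride)
    # No padding: the filter must fit entirely inside the input
    return (input_size - filter_size) // stride + 1

# 28-pixel axis, 5-wide filter
print(conv_output_size(28, 5, stride=2, pad=True))   # 14
print(conv_output_size(14, 5, stride=2, pad=True))   # 7
print(conv_output_size(28, 5, stride=1, pad=True))   # 28
print(conv_output_size(28, 5, stride=2, pad=False))  # 12
```

The first two results (28 to 14 to 7) agree with the layer shapes reported for the two-convolution model built later in this tutorial.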

```
In [9]:
```

```
# Plot images with strides of 2 and 1 with padding turned on
images = [("https://www.cntk.ai/jup/cntk103d_padding_strides.gif", 'With stride = 2'),
          ("https://www.cntk.ai/jup/cntk103d_same_padding_no_strides.gif", 'With stride = 1')]
for im in images:
    print(im[1])
    display(Image(url=im[0], width=200, height=200))
```

```
With stride = 2
```

```
With stride = 1
```

## Building our CNN models

In this CNN tutorial, we first define two containers: one for the input
MNIST image and a second one for the labels corresponding to the 10
digits. When reading the data, the reader automatically maps the 784
pixels per image to a shape defined by the `input_dim_model` tuple (in
this example it is set to (1, 28, 28)).

```
In [10]:
```

```
x = C.input_variable(input_dim_model)
y = C.input_variable(num_output_classes)
```

The first model we build is a simple convolution-only network. Here we
have two convolutional layers. Since our task is to detect the 10
digits in the MNIST database, the output of the network should be a
vector of length 10, 1 element corresponding to each digit. This is
achieved by projecting the output of the last convolutional layer using
a dense layer with the output being `num_output_classes`. We have seen
this before with Logistic Regression and MLP where features were mapped
to the number of classes in the final layer. Also, note that since we
will be using the `softmax` operation that is combined with the
`cross entropy` loss function during training (see a few cells below),
the final dense layer has no activation function associated with it.

The following figure illustrates the model we are going to build. Note the parameters in the model below are to be experimented with. These are often called network hyperparameters. Increasing the filter shape leads to an increase in the number of model parameters, increases the compute time and helps the model better fit to the data. However, one runs the risk of overfitting. Typically, the number of filters in the deeper layers is greater than the number of filters in the layers before them. We have chosen 8 and 16 for the first and second layers, respectively. These hyperparameters should be experimented with during model building.

```
In [11]:
```

```
# function to build model
def create_model(features):
    with C.layers.default_options(init=C.glorot_uniform(), activation=C.relu):
        h = features
        h = C.layers.Convolution2D(filter_shape=(5,5),
                                   num_filters=8,
                                   strides=(2,2),
                                   pad=True, name='first_conv')(h)
        h = C.layers.Convolution2D(filter_shape=(5,5),
                                   num_filters=16,
                                   strides=(2,2),
                                   pad=True, name='second_conv')(h)
        r = C.layers.Dense(num_output_classes, activation=None, name='classify')(h)
        return r
```

Let us create an instance of the model and inspect the different
components of the model. `z` will be used to represent the output of a
network. In this model, we use the `relu` activation function. Note:
using the `C.layers.default_options` is an elegant way to write
concise models. This is key to minimizing modeling errors, saving
precious debugging time.

```
In [12]:
```

```
# Create the model
z = create_model(x)
# Print the output shapes / parameters of different components
print("Output Shape of the first convolution layer:", z.first_conv.shape)
print("Bias value of the last dense layer:", z.classify.b.value)
```

```
Output Shape of the first convolution layer: (8, 14, 14)
Bias value of the last dense layer: [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
```

Understanding the number of model parameters to be estimated is key to deep learning, since it directly affects the amount of data one needs. A model with a larger number of parameters needs more data to prevent overfitting. In other words, with a fixed amount of data, one has to constrain the number of parameters. There is no golden rule for the amount of data one needs for a given model. However, there are ways to boost the performance of model training, such as data augmentation.

```
In [13]:
```

```
# Number of parameters in the network
C.logging.log_number_of_parameters(z)
```

```
Training 11274 parameters in 6 parameter tensors.
```

**Understanding Parameters**:

Our model has 2 convolution layers each having a weight and bias. This adds up to 4 parameter tensors. Additionally the dense layer has weight and bias tensors. Thus, the 6 parameter tensors.

Let us now count the number of parameters:

- *First convolution layer*: There are 8 filters each of size (1 x 5 x 5) where 1 is the number of channels in the input image. This adds up to 200 values in the weight matrix and 8 bias values.
- *Second convolution layer*: There are 16 filters each of size (8 x 5 x 5) where 8 is the number of channels in the input to the second layer (= output of the first layer). This adds up to 3200 values in the weight matrix and 16 bias values.
- *Last dense layer*: There are 16 x 7 x 7 input values and it produces 10 output values corresponding to the 10 digits in the MNIST dataset. This corresponds to (16 x 7 x 7) x 10 weight values and 10 bias values.

Adding these up gives the 11274 parameters in the model.
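The same count can be reproduced with a few lines of plain Python. The helper names `conv_params` and `dense_params` are illustrative and not part of CNTK.

```python
def conv_params(num_filters, in_channels, filter_shape):
    """Weights plus one bias per filter for a 2D convolution layer."""
    weights = num_filters * in_channels * filter_shape[0] * filter_shape[1]
    return weights + num_filters

def dense_params(num_inputs, num_outputs):
    """Weights plus one bias per output for a dense layer."""
    return num_inputs * num_outputs + num_outputs

first_conv = conv_params(8, 1, (5, 5))     # 200 weights + 8 biases
second_conv = conv_params(16, 8, (5, 5))   # 3200 weights + 16 biases
classify = dense_params(16 * 7 * 7, 10)    # 7840 weights + 10 biases

total = first_conv + second_conv + classify
print(first_conv, second_conv, classify, total)  # 208 3216 7850 11274
```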

**Knowledge check**: Does the dense layer shape align with the task
(MNIST digit classification)?

**Suggested Activity**

- Try printing shapes and parameters of different network layers,
- Record the training error you get with `relu` as the activation function,
- Now change to `sigmoid` as the activation function and see if you can improve your training error.

*Quiz*: Different supported activation functions can be found
here.
Which activation function gives the least training error?

### Learning model parameters

Same as the previous tutorial, we use the `softmax` function to map
the accumulated evidences or activations to a probability distribution
over the classes (Details of the softmax function and other activation functions).

## Training

Similar to CNTK 102, we minimize the cross-entropy between the label and predicted probability by the network. If this terminology sounds strange to you, please refer to the CNTK 102 for a refresher. Since we are going to build more than one model, we will create a few helper functions.
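For readers who want a quick refresher without leaving this page, here is a minimal NumPy sketch of softmax followed by cross-entropy for a single sample. It is written for illustration only and is not how CNTK implements `cross_entropy_with_softmax`; the logits are made-up numbers.

```python
import numpy as np

def softmax(logits):
    """Map raw network outputs to a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def cross_entropy(probs, one_hot_label):
    """Negative log-probability assigned to the true class."""
    return -np.sum(one_hot_label * np.log(probs))

label = np.zeros(10)
label[3] = 1.0  # the true digit is 3

# Made-up network outputs that favor class 3
logits = np.array([1.0, 2.0, 0.5, 6.0, 0.0, 0.0, 1.5, 0.2, 0.1, 0.3])
probs = softmax(logits)
loss = cross_entropy(probs, label)
print(np.argmax(probs), loss)

# An untrained network that spreads probability evenly over the 10 classes
uniform_loss = cross_entropy(softmax(np.zeros(10)), label)
print(uniform_loss)  # ln(10), roughly 2.3026
```

The uniform case explains why the loss reported at minibatch 0 in the training logs later in this tutorial is close to 2.3.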

```
In [14]:
```

```
def create_criterion_function(model, labels):
loss = C.cross_entropy_with_softmax(model, labels)
errs = C.classification_error(model, labels)
return loss, errs # (model, labels) -> (loss, error metric)
```

Next we will need a helper function to perform the model training. First let us create additional helper functions that will be needed to visualize different functions associated with training.

```
In [15]:
```

```
# Define a utility function to compute the moving average.
# A more efficient implementation is possible with np.cumsum() function
def moving_average(a, w=5):
    if len(a) < w:
        return a[:]    # Need to send a copy of the array
    return [val if idx < w else sum(a[(idx-w):idx])/w for idx, val in enumerate(a)]

# Defines a utility that prints the training progress
def print_training_progress(trainer, mb, frequency, verbose=1):
    training_loss = "NA"
    eval_error = "NA"
    if mb%frequency == 0:
        training_loss = trainer.previous_minibatch_loss_average
        eval_error = trainer.previous_minibatch_evaluation_average
        if verbose:
            print ("Minibatch: {0}, Loss: {1:.4f}, Error: {2:.2f}%".format(mb, training_loss, eval_error*100))
    return mb, training_loss, eval_error
```
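The comment in the cell above mentions that `np.cumsum()` allows a more efficient moving average. One possible version is sketched below. The name `moving_average_cumsum` is hypothetical, and the function is written to mirror the behavior of `moving_average`: the first `w` values are passed through, and each later value is replaced by the mean of the `w` values that precede it.

```python
import numpy as np

def moving_average_cumsum(a, w=5):
    a = np.asarray(a, dtype=float)
    if len(a) < w:
        return a.copy()
    # c[i] holds the sum of a[0:i], so the sum of a[i-w:i] is c[i] - c[i-w]
    c = np.concatenate(([0.0], np.cumsum(a)))
    out = a.copy()
    idx = np.arange(w, len(a))
    out[w:] = (c[idx] - c[idx - w]) / w
    return out

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(moving_average_cumsum(data).tolist())
# [1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 4.0, 5.0]
```

Each window sum costs two lookups instead of `w` additions, which matters only for long loss histories.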

### Configure training

In the previous tutorials we have described the concepts of the `loss`
function, the optimizers or learners and the
associated machinery needed to train a model. Please refer to earlier
tutorials to gain familiarity with these concepts. In this
tutorial, we combine model training and testing in a helper function
below.

```
In [16]:
```

```
def train_test(train_reader, test_reader, model_func, num_sweeps_to_train_with=10):
    # Instantiate the model function; x is the input (feature) variable
    # We will scale the input image pixels within 0-1 range by dividing all input value by 255.
    model = model_func(x/255)

    # Instantiate the loss and error function
    loss, label_error = create_criterion_function(model, y)

    # Instantiate the trainer object to drive the model training
    # (z is the module-level model instance that do_train_test assigns before calling this function)
    learning_rate = 0.2
    lr_schedule = C.learning_parameter_schedule(learning_rate)
    learner = C.sgd(z.parameters, lr_schedule)
    trainer = C.Trainer(z, (loss, label_error), [learner])

    # Initialize the parameters for the trainer
    minibatch_size = 64
    num_samples_per_sweep = 60000
    num_minibatches_to_train = (num_samples_per_sweep * num_sweeps_to_train_with) / minibatch_size

    # Map the data streams to the input and labels.
    input_map = {
        y : train_reader.streams.labels,
        x : train_reader.streams.features
    }

    # Print training progress every 500 minibatches
    training_progress_output_freq = 500

    # Start a timer
    start = time.time()

    for i in range(0, int(num_minibatches_to_train)):
        # Read a mini batch from the training data file
        data = train_reader.next_minibatch(minibatch_size, input_map=input_map)
        trainer.train_minibatch(data)
        print_training_progress(trainer, i, training_progress_output_freq, verbose=1)

    # Print training time
    print("Training took {:.1f} sec".format(time.time() - start))

    # Test the model
    test_input_map = {
        y : test_reader.streams.labels,
        x : test_reader.streams.features
    }

    # Test data for trained model
    test_minibatch_size = 512
    num_samples = 10000
    num_minibatches_to_test = num_samples // test_minibatch_size
    test_result = 0.0

    for i in range(num_minibatches_to_test):
        # We are loading test data in batches specified by test_minibatch_size
        # Each data point in the minibatch is a MNIST digit image of 784 dimensions
        # with one pixel per dimension that we will classify with the
        # trained model.
        data = test_reader.next_minibatch(test_minibatch_size, input_map=test_input_map)
        eval_error = trainer.test_minibatch(data)
        test_result = test_result + eval_error

    # Average of evaluation errors of all test minibatches
    print("Average test error: {0:.2f}%".format(test_result*100 / num_minibatches_to_test))
```
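The loop bounds used in `train_test` follow from simple arithmetic, sketched below with the constants from the cell above. One consequence, assuming every call to `next_minibatch` returns a full minibatch, is that integer division leaves a small tail of the test set unrequested.

```python
# Training: 10 sweeps over 60,000 samples in minibatches of 64
num_samples_per_sweep = 60000
num_sweeps_to_train_with = 10
minibatch_size = 64
num_minibatches_to_train = (num_samples_per_sweep * num_sweeps_to_train_with) // minibatch_size
print(num_minibatches_to_train)  # 9375

# Testing: 10,000 samples in minibatches of 512
num_samples = 10000
test_minibatch_size = 512
num_minibatches_to_test = num_samples // test_minibatch_size
samples_requested = num_minibatches_to_test * test_minibatch_size
samples_left_over = num_samples - samples_requested
print(num_minibatches_to_test, samples_requested, samples_left_over)  # 19 9728 272
```

With progress printed every 500 minibatches, the last progress line of a 9375-minibatch run is the one for minibatch 9000.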

### Run the trainer and test model

We are now ready to train our convolutional neural net.

```
In [17]:
```

```
def do_train_test():
    global z
    z = create_model(x)
    reader_train = create_reader(train_file, True, input_dim, num_output_classes)
    reader_test = create_reader(test_file, False, input_dim, num_output_classes)
    train_test(reader_train, reader_test, z)

do_train_test()
```

```
Minibatch: 0, Loss: 2.3132, Error: 87.50%
Minibatch: 500, Loss: 0.2041, Error: 10.94%
Minibatch: 1000, Loss: 0.1134, Error: 1.56%
Minibatch: 1500, Loss: 0.1540, Error: 3.12%
Minibatch: 2000, Loss: 0.0078, Error: 0.00%
Minibatch: 2500, Loss: 0.0240, Error: 1.56%
Minibatch: 3000, Loss: 0.0083, Error: 0.00%
Minibatch: 3500, Loss: 0.0581, Error: 3.12%
Minibatch: 4000, Loss: 0.0247, Error: 0.00%
Minibatch: 4500, Loss: 0.0389, Error: 1.56%
Minibatch: 5000, Loss: 0.0368, Error: 1.56%
Minibatch: 5500, Loss: 0.0015, Error: 0.00%
Minibatch: 6000, Loss: 0.0043, Error: 0.00%
Minibatch: 6500, Loss: 0.0120, Error: 0.00%
Minibatch: 7000, Loss: 0.0165, Error: 0.00%
Minibatch: 7500, Loss: 0.0097, Error: 0.00%
Minibatch: 8000, Loss: 0.0044, Error: 0.00%
Minibatch: 8500, Loss: 0.0037, Error: 0.00%
Minibatch: 9000, Loss: 0.0506, Error: 3.12%
Training took 30.4 sec
Average test error: 1.57%
```

Note that the average test error is very comparable to our training error, indicating that our model has good “out of sample” error, a.k.a. generalization error. This implies that our model can very effectively deal with observations not seen during the training process. This is key to avoiding overfitting.

Let us check the values of some of the network parameters. We will check the bias values of the output dense layer. Previously, they were all 0. Now you see there are non-zero values, indicating that the model parameters were updated during training.

```
In [18]:
```

```
print("Bias value of the last dense layer:", z.classify.b.value)
```

```
Bias value of the last dense layer: [-0.03064867 -0.01484577 0.01883961 -0.27907506 0.10493447 -0.08710711
0.00442157 -0.09873096 0.33425555 0.04781624]
```

### Run evaluation / prediction

We have so far been dealing with aggregate measures of error. Let us now
get the probabilities associated with individual data points. For each
observation, the `eval` function returns the probability distribution
across all the classes. The classifier is trained to recognize digits,
hence has 10 classes. First let us route the network output through a
`softmax` function. This maps the aggregated activations across the
network to probabilities across the 10 classes.

```
In [19]:
```

```
out = C.softmax(z)
```

Let us read a small minibatch sample from the test data.

```
In [20]:
```

```
# Read the data for evaluation
reader_eval=create_reader(test_file, False, input_dim, num_output_classes)
eval_minibatch_size = 25
eval_input_map = {x: reader_eval.streams.features, y:reader_eval.streams.labels}
data = reader_eval.next_minibatch(eval_minibatch_size, input_map=eval_input_map)
img_label = data[y].asarray()
img_data = data[x].asarray()
# reshape img_data to: M x 1 x 28 x 28 to be compatible with model
img_data = np.reshape(img_data, (eval_minibatch_size, 1, 28, 28))
predicted_label_prob = [out.eval(img_data[i]) for i in range(len(img_data))]
```

```
In [21]:
```

```
# Find the index with the maximum value for both predicted as well as the ground truth
pred = [np.argmax(predicted_label_prob[i]) for i in range(len(predicted_label_prob))]
gtlabel = [np.argmax(img_label[i]) for i in range(len(img_label))]
```

```
In [22]:
```

```
print("Label :", gtlabel[:25])
print("Predicted:", pred)
```

```
Label : [7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4, 9, 6, 6, 5, 4]
Predicted: [7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4, 9, 6, 6, 5, 4]
```

Let us visualize some of the results.

```
In [23]:
```

```
# Plot a random image
sample_number = 5
plt.imshow(img_data[sample_number].reshape(28,28), cmap="gray_r")
plt.axis('off')
img_gt, img_pred = gtlabel[sample_number], pred[sample_number]
print("Image Label: ", img_pred)
```

```
Image Label: 1
```

## Pooling Layer

Oftentimes, one needs to control the number of parameters, especially in deep networks. For every layer of the convolution layer output (each layer corresponds to the output of a filter), one can have a pooling layer. Pooling layers are typically introduced to:

- Reduce the dimensionality of the previous layer (speeding up the network),
- Make the model more tolerant to changes in object location in the image. For example, even when a digit is shifted to one side of the image instead of being in the middle, the classifier would perform the classification task well.

The calculation on a pooling node is much simpler than a normal feedforward node. It has no weight, bias, or activation function. It uses a simple aggregation function (like max or average) to compute its output. The most commonly used function is “max”: a max pooling node simply outputs the maximum of the input values corresponding to the filter position of the input. The figure below shows the input values in a 4 x 4 region. The max pooling window size is 2 x 2 and starts from the top left corner. The maximum value within the window becomes the output of the region. Every time the window is shifted by the amount specified by the stride parameter (as shown in the figure below), the maximum pooling operation is repeated.

Another alternative is average pooling, which emits the average value instead of the maximum value. The two different pooling operations are summarized in the animation below.
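Both pooling operations can be sketched in NumPy on a 4 x 4 input with a 2 x 2 window and a stride of 2. The helper `pool2d` and the input values are made up for illustration and are not taken from the animations.

```python
import numpy as np

def pool2d(x, window, stride, reduce_fn):
    """Apply `reduce_fn` (e.g. np.max or np.mean) to each window of `x` (no padding)."""
    out_h = (x.shape[0] - window) // stride + 1
    out_w = (x.shape[1] - window) // stride + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = x[r * stride:r * stride + window,
                      c * stride:c * stride + window]
            out[r, c] = reduce_fn(patch)
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)

max_pooled = pool2d(x, window=2, stride=2, reduce_fn=np.max)
avg_pooled = pool2d(x, window=2, stride=2, reduce_fn=np.mean)
print(max_pooled.tolist())  # [[6.0, 8.0], [3.0, 4.0]]
print(avg_pooled.tolist())  # [[3.75, 5.25], [2.0, 2.0]]
```

Note that the pooling step has nothing to learn: the 4 x 4 input shrinks to 2 x 2 without adding any parameters.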

```
In [24]:
```

```
# Plot the max pooling and average pooling animations
images = [("https://www.cntk.ai/jup/c103d_max_pooling.gif", 'Max pooling'),
          ("https://www.cntk.ai/jup/c103d_average_pooling.gif", 'Average pooling')]
for im in images:
    print(im[1])
    display(Image(url=im[0], width=300, height=300))
```

```
Max pooling
```

```
Average pooling
```

## Typical convolution network

A typical CNN contains a set of alternating convolution and pooling layers followed by a dense output layer for classification. You will find variants of this structure in many classical deep networks (VGG, AlexNet, etc.). This is in contrast to the MLP network we used in CNTK_103C, which consisted of 2 dense layers followed by a dense output layer.

The illustrations are presented in the context of 2-dimensional (2D) images, but the concept and the CNTK components can operate on data of any dimensionality. The above schematic shows 2 convolution layers and 2 max-pooling layers. A typical strategy is to increase the number of filters in the deeper layers while reducing the spatial size of each intermediate layer.

## Task: Create a network with MaxPooling

Typical convolutional networks have interlacing convolution and max pool layers. The previous model had only convolution layers. In this section, you will create a model with the following architecture.

You will use the CNTK MaxPooling function to achieve this task. You will edit the `create_model` function below and add the MaxPooling operation.

Hint: We provide the solution a few cells below. Refrain from looking ahead and try to add the layer yourself first.

```
In [25]:
```

```
# Modify this model
def create_model(features):
    with C.layers.default_options(init = C.glorot_uniform(), activation = C.relu):
        h = features
        h = C.layers.Convolution2D(filter_shape=(5,5),
                                   num_filters=8,
                                   strides=(2,2),
                                   pad=True, name='first_conv')(h)
        h = C.layers.Convolution2D(filter_shape=(5,5),
                                   num_filters=16,
                                   strides=(2,2),
                                   pad=True, name='second_conv')(h)
        r = C.layers.Dense(num_output_classes, activation = None, name='classify')(h)
        return r

# do_train_test()
```

**Quiz**: How many parameters do we have in the model with MaxPooling
and Convolution? Which of the two models produces the lower error rate?

**Exploration Suggestion**

- Does the use of LeakyRelu help improve the error rate?
- What percentage of the parameters does the last dense layer contribute w.r.t. the overall number of parameters for (a) the purely two convolutional layer model and (b) the model with alternating 2 convolutional and maxpooling layers?

## Solution

```
In [26]:
```

```
# function to build model
def create_model(features):
    with C.layers.default_options(init = C.layers.glorot_uniform(), activation = C.relu):
        h = features
        h = C.layers.Convolution2D(filter_shape=(5,5),
                                   num_filters=8,
                                   strides=(1,1),
                                   pad=True, name="first_conv")(h)
        h = C.layers.MaxPooling(filter_shape=(2,2),
                                strides=(2,2), name="first_max")(h)
        h = C.layers.Convolution2D(filter_shape=(5,5),
                                   num_filters=16,
                                   strides=(1,1),
                                   pad=True, name="second_conv")(h)
        h = C.layers.MaxPooling(filter_shape=(3,3),
                                strides=(3,3), name="second_max")(h)
        r = C.layers.Dense(num_output_classes, activation = None, name="classify")(h)
        return r

do_train_test()
```

```
Minibatch: 0, Loss: 2.3257, Error: 96.88%
Minibatch: 500, Loss: 0.0592, Error: 0.00%
Minibatch: 1000, Loss: 0.1007, Error: 3.12%
Minibatch: 1500, Loss: 0.1299, Error: 3.12%
Minibatch: 2000, Loss: 0.0077, Error: 0.00%
Minibatch: 2500, Loss: 0.0337, Error: 1.56%
Minibatch: 3000, Loss: 0.0038, Error: 0.00%
Minibatch: 3500, Loss: 0.0856, Error: 3.12%
Minibatch: 4000, Loss: 0.0052, Error: 0.00%
Minibatch: 4500, Loss: 0.0171, Error: 1.56%
Minibatch: 5000, Loss: 0.0266, Error: 1.56%
Minibatch: 5500, Loss: 0.0028, Error: 0.00%
Minibatch: 6000, Loss: 0.0070, Error: 0.00%
Minibatch: 6500, Loss: 0.0144, Error: 0.00%
Minibatch: 7000, Loss: 0.0083, Error: 0.00%
Minibatch: 7500, Loss: 0.0033, Error: 0.00%
Minibatch: 8000, Loss: 0.0114, Error: 0.00%
Minibatch: 8500, Loss: 0.0589, Error: 1.56%
Minibatch: 9000, Loss: 0.0186, Error: 1.56%
Training took 31.9 sec
Average test error: 1.05%
```
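As a cross-check for the parameter quiz, the count for the solution model can be estimated by hand. The sketch below assumes that the padded, stride-1 convolutions keep the spatial size unchanged and that the `MaxPooling` layers do not pad, so a window must fit entirely inside its input. The helper names are hypothetical, and the result is best confirmed by calling `C.logging.log_number_of_parameters(z)` on the trained model.

```python
def conv_params(num_filters, in_channels, filter_shape):
    """Weights plus one bias per filter for a 2D convolution layer."""
    return num_filters * in_channels * filter_shape[0] * filter_shape[1] + num_filters

def pool_output_size(input_size, window, stride):
    # Assumption: no padding, so the window must fit entirely inside the input
    return (input_size - window) // stride + 1

size = 28                              # first_conv: pad=True, stride 1 keeps 28 x 28
size = pool_output_size(size, 2, 2)    # first_max: 14 x 14 (second_conv keeps this size)
size = pool_output_size(size, 3, 3)    # second_max: 4 x 4

conv_total = conv_params(8, 1, (5, 5)) + conv_params(16, 8, (5, 5))  # 208 + 3216
dense_total = 16 * size * size * 10 + 10                              # 2560 weights + 10 biases
total = conv_total + dense_total
print(size, total)                              # 4 5994
print(round(100.0 * dense_total / total, 1))    # 42.9 (percent of parameters in the dense layer)
```

Under these assumptions the pooling model has roughly half the parameters of the convolution-only model (11274), mostly because the dense layer sees a 16 x 4 x 4 input instead of 16 x 7 x 7.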