CNTK 302 Part B: Image super-resolution using CNNs and GANs

Contributed by Borna Vukorepa October 30, 2017

Introduction

This notebook downloads the image data and trains models that create a higher resolution image from a lower resolution one. Users should complete tutorial CNTK 302A first to familiarize themselves with the super-resolution problem and the methods that address it. The goal of the single image super-resolution (SISR) problem is to upscale a given image by some factor while keeping as much image detail as possible and without making the image blurry.

In order to train our models, we need to prepare the data. Different models might need different training sets.

We will be using the Berkeley Segmentation Dataset (BSDS). It contains 300 images, which we will download and then prepare for training the models. Image dimensions are 481 x 321 and 321 x 481.

In [1]:
# Import the relevant modules to be used later
import urllib
import re
import os
import numpy as np
from PIL import Image
import sys
import cntk as C

try:
    from urllib.request import urlretrieve, urlopen
except ImportError:
    from urllib import urlretrieve, urlopen

Select the notebook runtime environment devices / settings

Set the device to cpu / gpu for the test environment. If you have both CPU and GPU on your machine, you can optionally switch the devices. By default, we choose the best available device.

In [2]:
# Select the right target device when this notebook is being tested:
if 'TEST_DEVICE' in os.environ:
    if os.environ['TEST_DEVICE'] == 'cpu':
        C.device.try_set_default_device(C.device.cpu())
    else:
        C.device.try_set_default_device(C.device.gpu(0))

There are two run modes:

  • Fast mode: isFast is set to True. This is the default mode for the notebooks, which means we train for fewer iterations or train/test on limited data. This ensures functional correctness of the notebook though the models produced are far from what a completed training would produce.

  • Slow mode: We recommend setting this flag to False once you are familiar with the notebook content and want to gain insight from running the notebook for a longer period with different training parameters.

Note: If isFast is set to False, the notebook will take several days on a GPU-enabled machine. You can try fewer iterations by setting NUM_MINIBATCHES to a smaller number, which comes at the expense of the quality of the generated images. You can also try reducing MINIBATCH_SIZE.

In [3]:
isFast = True

Data download

The data is not packaged as a .zip, .gz or similar archive. It is located in a folder on this link. Therefore, we will use a regular expression to find all image names and download them one by one into the destination folder. However, if the data is already downloaded, or the notebook is run in test mode, the notebook uses cached data.
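
The name-matching step can be tried on its own before downloading anything. The sketch below is only an illustration: sample_html is a made-up excerpt, and the real listing page may be laid out differently. It escapes the dot in the pattern so that only a literal ".jpg" ending matches.

```python
import re

# Hypothetical excerpt of a directory listing page; the real page may differ.
sample_html = (
    '<a href="100075.jpg">100075.jpg</a> '
    '<a href="100080.jpg">100080.jpg</a> '
    '<a href="index.html">index.html</a> '
    '<a href="12345ajpg">12345ajpg</a>'
)

# The dot is escaped so that it matches a literal "." only.
image_regex = r"[0-9]+\.jpg"

# Each name appears twice per link (href and link text), so a set removes duplicates.
image_names = sorted(set(re.findall(image_regex, sample_html)))
print(image_names)

# With an unescaped dot, "." matches any character and "12345ajpg" would slip through.
loose_matches = re.findall(r"[0-9]+.jpg", "12345ajpg")
print(loose_matches)
```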

In [4]:
# Determine the data path for testing
# Check for an environment variable defined in CNTK's test infrastructure
envvar = 'CNTK_EXTERNAL_TESTDATA_SOURCE_DIRECTORY'
def is_test(): return envvar in os.environ

if is_test():
    test_data_path_base = os.path.join(os.environ[envvar], "Tutorials", "data")
    test_data_dir = os.path.join(test_data_path_base, "BerkeleySegmentationDataset")
    test_data_dir = os.path.normpath(test_data_dir)

# Default directory in a local folder where the tutorial is run
data_dir = os.path.join("data", "BerkeleySegmentationDataset")

if not os.path.exists(data_dir):
    os.makedirs(data_dir)

#folder with images to be evaluated
example_folder = os.path.join(data_dir, "example_images")
if not os.path.exists(example_folder):
    os.makedirs(example_folder)

#folders with resulting images
results_folder = os.path.join(data_dir, "example_results")
if not os.path.exists(results_folder):
    os.makedirs(results_folder)
In [5]:
def download_data(images_dir, link):
    #Open the url
    images_html = urlopen(link).read().decode('utf-8')

    #looking for .jpg images whose names are numbers (the dot is escaped to match a literal '.')
    image_regex = r"[0-9]+\.jpg"

    #remove duplicates
    image_list = set(re.findall(image_regex, images_html))
    print("Starting download...")

    num = 0

    for image in image_list:
        num = num + 1
        filename = os.path.join(images_dir, image)

        if num % 25 == 0:
            print("Downloading image %d of %d..." % (num, len(image_list)))
        if not os.path.isfile(filename):
            urlretrieve(link + image, filename)
        else:
            print("File already exists", filename)

    print("Images available at: ", images_dir)

Now we only need to call this function with appropriate parameters. This might take a few minutes.

In [6]:
#folder for raw images, before preprocess
images_dir = os.path.join(data_dir, "Images")
if not os.path.exists(images_dir):
    os.makedirs(images_dir)

#Get the path for pre-trained models and example images
if is_test():
    print("Using cached test data")
    models_dir = os.path.join(test_data_dir, "PretrainedModels")
    images_dir = os.path.join(test_data_dir, "Images")
else:
    models_dir = os.path.join(data_dir, "PretrainedModels")
    if not os.path.exists(models_dir):
        os.makedirs(models_dir)

    images_dir = os.path.join(data_dir, "Images")
    if not os.path.exists(images_dir):
        os.makedirs(images_dir)

    #link to BSDS dataset
    link = "https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/BSDS300/html/images/plain/normal/color/"

    download_data(images_dir, link)

print("Model directory", models_dir)
print("Image directory", images_dir)
Starting download...
Downloading image 25 of 300...
Downloading image 50 of 300...
Downloading image 75 of 300...
Downloading image 100 of 300...
Downloading image 125 of 300...
Downloading image 150 of 300...
Downloading image 175 of 300...
Downloading image 200 of 300...
Downloading image 225 of 300...
Downloading image 250 of 300...
Downloading image 275 of 300...
Downloading image 300 of 300...
Images available at:  data/BerkeleySegmentationDataset/Images
Model directory data/BerkeleySegmentationDataset/PretrainedModels
Image directory data/BerkeleySegmentationDataset/Images

Data preparation

This dataset contains only 300 images, which is not enough for super-resolution training. The idea is to split the images into 64 x 64 patches, which augments the training data. The function prep_64 does exactly that.

Once we select a 64 x 64 patch, we downscale it by a factor of 2 and then upscale it by a factor of 2 again using bicubic interpolation. This gives us a blurry version of the original patch. In some approaches in this tutorial, the idea is to learn a model which turns blurry patches into clear ones. We put blurry patches into one folder and normal patches into another. They will serve as minibatch sources for training. We also set aside a few test images here.

After processing each patch, we move 42 pixels down/right (so there is some overlap between patches) as long as we can. This preprocessing can take a few minutes.
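
To make the stride concrete, here is a small sketch of the sliding window along one image dimension. The helper name patch_offsets is hypothetical and is not part of the notebook; it only mirrors the "move 42 pixels as long as we can" rule described above.

```python
def patch_offsets(length, patch, stride):
    """Return the start offsets of all patches that fit entirely within `length`."""
    offsets = []
    start = 0
    while start + patch <= length:
        offsets.append(start)
        start += stride
    return offsets

# A 481 x 321 (width x height) image, 64 x 64 patches, stride 42.
cols = patch_offsets(481, 64, 42)
rows = patch_offsets(321, 64, 42)

print(cols)                   # offsets along the width
print(rows)                   # offsets along the height
print(len(cols) * len(rows))  # patches per image
```

With a stride of 42 and a patch size of 64, neighbouring patches overlap by 22 pixels, and a single 481 x 321 image yields 10 x 7 = 70 patches.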

In [7]:
#extract 64 x 64 patches from BSDS dataset
def prep_64(images_dir, patch_h, patch_w, train64_lr, train64_hr, tests):
    if not os.path.exists(train64_lr):
        os.makedirs(train64_lr)

    if not os.path.exists(train64_hr):
        os.makedirs(train64_hr)

    if not os.path.exists(tests):
        os.makedirs(tests)

    k = 0
    num = 0

    print("Creating 64 x 64 training patches and tests from:", images_dir)

    for entry in os.listdir(images_dir):
        filename = os.path.join(images_dir, entry)
        img = Image.open(filename)
        rect = np.array(img)

        num = num + 1

        if num % 25 == 0:
            print("Processing image %d of %d..." % (num, len(os.listdir(images_dir))))

        if num % 50 == 0:
            img.save(os.path.join(tests, str(num) + ".png"))
            continue

        x = 0
        y = 0

        while(y + patch_h <= img.width):
            x = 0
            while(x + patch_w <= img.height):
                patch = rect[x : x + patch_h, y : y + patch_w]
                img_hr = Image.fromarray(patch, 'RGB')

                img_lr = img_hr.resize((patch_w // 2, patch_h // 2), Image.ANTIALIAS)
                img_lr = img_lr.resize((patch_w, patch_h), Image.BICUBIC)

                out_hr = os.path.join(train64_hr, str(k) + ".png")
                out_lr = os.path.join(train64_lr, str(k) + ".png")

                k = k + 1

                img_hr.save(out_hr)
                img_lr.save(out_lr)

                x = x + 42
            y = y + 42
    print("Done!")

We will download a pretrained CNTK VGG19 model and later use it in training the SRGAN model. That model operates on 224 x 224 images, so we need to prepare training data for models which turn 112 x 112 images into 224 x 224 images.

We use similar reasoning to augment the training data here with prep_224. We select 224 x 224 patches and downscale them to 112 x 112 patches. The only difference is that we now also rotate the patches to increase the number of training samples: with the larger 224 x 224 patch size, fewer patches fit into each image, so we need to augment the training set more. As before, 224 x 224 patches go into one folder and 112 x 112 patches go into another.
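
The rotation idea can be illustrated without any image library. The following is a minimal sketch on a tiny 2 x 2 grid standing in for a patch; rotate_90_ccw is a hypothetical helper, not the PIL call used in the notebook.

```python
def rotate_90_ccw(grid):
    """Rotate a 2-D list of values by 90 degrees counterclockwise."""
    return [list(row) for row in zip(*grid)][::-1]

# A tiny stand-in for an image patch.
patch = [[1, 2],
         [3, 4]]

# Collect the patch at 0, 90, 180 and 270 degrees.
variants = []
current = patch
for _ in range(4):
    variants.append(current)
    current = rotate_90_ccw(current)

for variant in variants:
    print(variant)
```

After four rotations the patch is back in its starting orientation, so each extracted patch contributes four training samples.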

In [8]:
#extract 224 x 224 and 112 x 112 patches from BSDS dataset
def prep_224(images_dir, patch_h, patch_w, train112, train224):
    if not os.path.exists(train112):
        os.makedirs(train112)

    if not os.path.exists(train224):
        os.makedirs(train224)

    k = 0
    num = 0

    print("Creating 224 x 224 and 112 x 112 training patches from:", images_dir)

    for entry in os.listdir(images_dir):
        filename = os.path.join(images_dir, entry)
        img = Image.open(filename)
        rect = np.array(img)

        num = num + 1
        if num % 25 == 0:
            print("Processing image %d of %d..." % (num, len(os.listdir(images_dir))))

        x = 0
        y = 0

        while(y + patch_h <= img.width):
            x = 0
            while(x + patch_w <= img.height):
                patch = rect[x : x + patch_h, y : y + patch_w]
                img_hr = Image.fromarray(patch, 'RGB')

                img_lr = img_hr.resize((patch_w // 2, patch_h // 2), Image.ANTIALIAS)

                for i in range(4):
                    out_hr = os.path.join(train224, str(k) + ".png")
                    out_lr = os.path.join(train112, str(k) + ".png")

                    k = k + 1

                    img_hr.save(out_hr)
                    img_lr.save(out_lr)

                    img_hr = img_hr.transpose(Image.ROTATE_90)
                    img_lr = img_lr.transpose(Image.ROTATE_90)

                x = x + 64
            y = y + 64
    print("Done!")

We take the roughly 300 originally downloaded Berkeley Segmentation Dataset images and run the prep functions on them with suitable parameters. The exact number of images used for training depends on how many go to the test set.

In the folder train64_LR we will have around 20000 64 x 64 blurry image patches that will be used for training. Their original counterparts will be located in the folder train64_HR.

In the folder train224 we will have around 12000 original 224 x 224 patches. Their downscaled 112 x 112 counterparts will be located in train112. Images that can later be used for testing will be in tests.
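
The approximate totals quoted above can be checked with a little arithmetic. This is a rough estimate, assuming all 300 images are 481 x 321 or 321 x 481 and that every 50th image is set aside for testing in the 64 x 64 case; the helper positions is hypothetical.

```python
def positions(length, patch, stride):
    """Number of patch positions along one dimension (0 if the patch does not fit)."""
    if length < patch:
        return 0
    return (length - patch) // stride + 1

width, height = 481, 321  # portrait images give the same product

# 64 x 64 patches with stride 42
per_image_64 = positions(width, 64, 42) * positions(height, 64, 42)

# 224 x 224 patches with stride 64, each saved in 4 rotations
per_image_224 = positions(width, 224, 64) * positions(height, 224, 64) * 4

# Every 50th image is kept as a test image instead of being cut into 64 x 64 patches.
train_images_64 = 300 - 300 // 50

total_64 = per_image_64 * train_images_64
total_224 = per_image_224 * 300
print(total_64, total_224)
```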

In [9]:
#blurry 64x64 destination
train64_lr = os.path.join(data_dir, "train64_LR")

#original 64x64 destination
train64_hr = os.path.join(data_dir, "train64_HR")

#112x112 patches destination
train112 = os.path.join(data_dir, "train112")

#224x224 patches destination
train224 = os.path.join(data_dir, "train224")

#tests destination
tests = os.path.join(data_dir, "tests")

#prep
prep_64(images_dir, 64, 64, train64_lr, train64_hr, tests)
prep_224(images_dir, 224, 224, train112, train224)
Creating 64 x 64 training patches and tests from: data/BerkeleySegmentationDataset/Images
Processing image 25 of 300...
Processing image 50 of 300...
Processing image 75 of 300...
Processing image 100 of 300...
Processing image 125 of 300...
Processing image 150 of 300...
Processing image 175 of 300...
Processing image 200 of 300...
Processing image 225 of 300...
Processing image 250 of 300...
Processing image 275 of 300...
Processing image 300 of 300...
Done!
Creating 224 x 224 and 112 x 112 training patches from: data/BerkeleySegmentationDataset/Images
Processing image 25 of 300...
Processing image 50 of 300...
Processing image 75 of 300...
Processing image 100 of 300...
Processing image 125 of 300...
Processing image 150 of 300...
Processing image 175 of 300...
Processing image 200 of 300...
Processing image 225 of 300...
Processing image 250 of 300...
Processing image 275 of 300...
Processing image 300 of 300...
Done!

Data illustration

The figure below shows what the data looks like. The big image is one of the 300 Berkeley Segmentation Dataset images.

The larger patches (highlighted in red) are an example of one pair of 224 x 224 and 112 x 112 patches. They can be found in the folders train224 and train112 respectively.

The smaller patches (highlighted in blue) are an example of one pair of 64 x 64 clear and blurry patches. They can be found in the folders train64_HR and train64_LR respectively.

In [10]:
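# Note: this rebinds the name Image, shadowing PIL's Image imported earlier;
# re-run the first import cell before calling prep_64 or prep_224 again.
from IPython.display import Image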
Image(url = "https://superresolution.blob.core.windows.net/superresolutionresources/train_data.png")
Out[10]:

Now we are ready to start the training.

Used models

In the remaining part of this tutorial, we show how to use Cognitive Toolkit (CNTK) to create several deep convolutional networks for solving the SISR problem.

We will try out several different models and see what results we can get. The models we will train are:

  • VDSR (very deep super resolution)

  • DRRN (deep recursive residual network)

  • SRResNet (super resolution residual network)

  • SRGAN (super resolution generative adversarial network)

Each model has a key idea behind it, which we discuss in more detail throughout this notebook. We will start with the VDSR model.

VDSR super-resolution model

The key idea behind the VDSR model is residual learning. Instead of predicting the high resolution image directly from the low resolution image, we first upscale our starting image using some cheap method, such as bicubic interpolation. Then we predict the so-called “residual image”, that is, the difference between the high resolution image and the image we obtained with bicubic interpolation in the first step. Therefore, all the intermediate values in the model and in the predicted image will be small, which is more stable. The final result is obtained by adding the predicted residual to the cheaply upscaled image. You can see the related paper here.
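
As a toy illustration of residual learning, the sketch below uses made-up one-dimensional "images" with pixel values scaled to [0, 1]. It assumes a perfect prediction, which a trained network would only approximate.

```python
# Toy 1-D "images" with pixel values scaled to [0, 1]; the numbers are made up.
high_res = [0.20, 0.80, 0.75, 0.10]
interpolated = [0.30, 0.60, 0.60, 0.25]  # stand-in for the bicubic result (ILR)

# The target the network learns to predict.
residual = [h - i for h, i in zip(high_res, interpolated)]

# Reconstruction: add the (here, perfect) residual back to the interpolated image.
reconstructed = [i + r for i, r in zip(interpolated, residual)]

print(residual)
print(reconstructed)
```

The residual values are noticeably smaller in magnitude than the pixel values themselves, which is the property the paragraph above relies on.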

Notice that the model acts on the interpolated low resolution image, which is already upscaled by a factor of 2, so the input and output image dimensions are the same. The training set must be prepared in a way which supports this idea.

The convolutional network consists of alternating convolutional layers and rectified linear units (ReLUs). There are 19 convolutional layers in total. There is no ReLU layer after the last convolution.

In [11]:
#Figure 1
Image(url = "https://superresolution.blob.core.windows.net/superresolutionresources/vdsr_architecture.png")
Out[11]:

Figure 1 above was taken from the paper Accurate Image Super-Resolution Using Very Deep Convolutional Networks. It shows the architecture of VDSR model. We start from the interpolated low resolution image (ILR) and predict the residual image. Most values in it are zero or very small. Adding the residual image and ILR gives us our prediction of high resolution image.

Training configuration

We will use minibatches of size 64 (16 in fast mode). Lower the minibatch size if you get a CUDA out-of-memory error. The model is trained on 64 x 64 input patches prepared from the Berkeley Segmentation Dataset (BSDS300) (folders train64_LR and train64_HR).

In [12]:
#training configuration
MINIBATCH_SIZE = 16 if isFast else 64
NUM_MINIBATCHES = 200 if isFast else 200000

# Ensure the training and test data is generated and available for this tutorial.
# We look for the prepared Berkeley Segmentation Dataset in the local data directory.
data_found = False

for data_dir in [os.path.join("data", "BerkeleySegmentationDataset")]:
    train_hr_path = os.path.join(data_dir, "train64_HR")
    train_lr_path = os.path.join(data_dir, "train64_LR")
    if os.path.exists(train_hr_path) and os.path.exists(train_lr_path):
        data_found = True
        break

if not data_found:
    raise ValueError("Please generate the data by completing the first part of this notebook.")

print("Data directory is {0}".format(data_dir))

#folders with training data (high and low resolution images) and paths to map files
training_folder_HR = os.path.join(data_dir, "train64_HR")
training_folder_LR = os.path.join(data_dir, "train64_LR")
MAP_FILE_Y = os.path.join(data_dir, "train64_HR", "map.txt")
MAP_FILE_X = os.path.join(data_dir, "train64_LR", "map.txt")

#image dimensions
NUM_CHANNELS = 3
IMG_H, IMG_W = 64, 64
IMAGE_DIMS = (NUM_CHANNELS, IMG_H, IMG_W)
Data directory is data/BerkeleySegmentationDataset

Creating minibatch sources

The function below creates a map file from a folder of image patches (for both high and low resolution images). The map file is used to create minibatches of training data. This function is used to create map files for all models in this tutorial.

In [13]:
# create a map file from a flat folder
import cntk.io.transforms as xforms

def create_map_file_from_flatfolder(folder):
    file_endings = ['.jpg', '.JPG', '.jpeg', '.JPEG', '.png', '.PNG']
    map_file_name = os.path.join(folder, "map.txt")
    with open(map_file_name , 'w') as map_file:
        for entry in os.listdir(folder):
            filename = os.path.join(folder, entry)
            if os.path.isfile(filename) and os.path.splitext(filename)[1] in file_endings:
                tempName = '/'.join(filename.split('\\'))
                tempName = '/'.join(tempName.split('//'))
                tempName = '//'.join(tempName.split('/'))
                map_file.write("{0}\t0\n".format(tempName))
    return map_file_name
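
To show what such a map file contains, here is a minimal sketch that writes two entries into a temporary folder and reads them back. The file names are hypothetical, and the path rewriting done by the function above is left out. Each line holds an image path, a tab, and an integer label; since only the image stream is used during training in this tutorial, the label appears to act as a placeholder.

```python
import os
import tempfile

def map_line(path, label=0):
    """Format one map-file line: image path, a tab, then an integer label."""
    return "{0}\t{1}\n".format(path, label)

with tempfile.TemporaryDirectory() as folder:
    map_path = os.path.join(folder, "map.txt")
    with open(map_path, "w") as map_file:
        for name in ["0.png", "1.png"]:  # hypothetical patch file names
            map_file.write(map_line(os.path.join(folder, name)))
    with open(map_path) as map_file:
        lines = map_file.read().splitlines()

# Split each line back into its two tab-separated fields.
fields = [line.split("\t") for line in lines]
for path, label in fields:
    print(os.path.basename(path), label)
```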

The function below creates a minibatch source from the map file mentioned above. Later, we will use it to create minibatch sources for both low resolution and high resolution images. This function is used to create minibatch sources for all models.

In [14]:
# creates a minibatch source for training or testing
def create_mb_source(map_file, width, height, num_classes = 10, randomize = True):
    transforms = [xforms.scale(width = width,  height = height, channels = NUM_CHANNELS, interpolations = 'linear')]
    return C.io.MinibatchSource(C.io.ImageDeserializer(map_file, C.io.StreamDefs(
        features = C.io.StreamDef(field = 'image', transforms = transforms),
        labels = C.io.StreamDef(field = 'label', shape = num_classes))), randomize = randomize)

VDSR architecture

For the VDSR model, we use only convolution layers and ReLU layers, always stacked one after another, except that there is no ReLU layer after the last convolution layer.

We will have 19 convolution layers. Every neuron has a receptive field of size 3 x 3 and every layer contains 64 filters, with the exception of the last one, which has three filters. Its three outputs are the RGB representation of the result. Below is the function definition which builds this architecture.
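
For a sense of scale, the following back-of-the-envelope sketch counts the convolution weights and the overall receptive field implied by this description. It assumes stride 1, no bias terms, and an RGB input with three channels; it is an estimate derived from the text, not a value reported by CNTK.

```python
# (input channels, output channels) for each of the 19 convolution layers
layers = [(3, 64)] + [(64, 64)] * 17 + [(64, 3)]

kernel = 3

# Weights per layer: kernel * kernel * in_channels * out_channels (no bias terms).
num_weights = sum(kernel * kernel * c_in * c_out for c_in, c_out in layers)

# Each 3 x 3 convolution with stride 1 widens the receptive field by 2 pixels.
receptive_field = 1 + (kernel - 1) * len(layers)

print(len(layers), num_weights, receptive_field)
```

Under these assumptions the network has 630144 weights, and each output pixel depends on a 39 x 39 neighbourhood of the input.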

In [15]:
def VDSR(h0):
    print('Generator input shape: ', h0.shape)

    with C.layers.default_options(init = C.he_normal(), activation = C.relu, bias = False):
        model = C.layers.Sequential([
            C.layers.For(range(18), lambda :
                C.layers.Convolution((3, 3), 64, pad = True)),
            C.layers.Convolution((3, 3), 3, activation = None, pad = True)
        ])

    return model(h0)

Computation graph

Our model takes in blurry (interpolated low resolution) 64 x 64 images. They were obtained from their high resolution counterparts by downscaling them and then upscaling them back to the original size. The model evaluates the input and predicts a residual image, which is added to the low resolution image. Our loss, which we want to minimize, is based on the squared Euclidean distance between that result and the original high resolution image. The output is also 64 x 64, but less blurry and closer to the original.
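
The loss can be written out in plain Python for a tiny example. This is a sketch of the same arithmetic on made-up values (half of the squared error, averaged over all values and scaled by the patch area); vdsr_loss is a hypothetical helper and not the CNTK expression itself.

```python
def vdsr_loss(high_res, interpolated, predicted_residual, img_h, img_w):
    """Half of the squared error, averaged over all values and scaled by the patch area."""
    errors = [(h - i - p) ** 2
              for h, i, p in zip(high_res, interpolated, predicted_residual)]
    return img_h * img_w * (sum(errors) / len(errors)) / 2.0

# Toy flattened 2 x 2 single-channel patch with values in [0, 1]; numbers are made up.
high_res = [0.5, 0.5, 1.0, 0.0]
interpolated = [0.5, 0.25, 0.75, 0.0]

perfect_residual = [0.0, 0.25, 0.25, 0.0]  # exactly high_res - interpolated
zero_residual = [0.0, 0.0, 0.0, 0.0]       # what an untrained "do nothing" model predicts

loss_perfect = vdsr_loss(high_res, interpolated, perfect_residual, 2, 2)
loss_zero = vdsr_loss(high_res, interpolated, zero_residual, 2, 2)
print(loss_perfect, loss_zero)
```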

real_X stores low resolution images and real_Y stores original high resolution images.

For optimization we use the Adam learner. The learning rate starts at 0.1 and decreases by a factor of 10 in steps down to 0.0001.
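
The stepwise schedule can be sketched as a simple lookup. The code below reflects one reading of how a list of (epochs, rate) pairs with an epoch size of 50000 samples is interpreted, with the last rate kept once the list is exhausted; that reading is an assumption, and learning_rate_at is a hypothetical helper rather than a CNTK function.

```python
def learning_rate_at(sample_count, schedule, epoch_size):
    """Look up the rate for a given number of samples seen so far."""
    epoch = sample_count // epoch_size
    for num_epochs, rate in schedule:
        if epoch < num_epochs:
            return rate
        epoch -= num_epochs
    return schedule[-1][1]  # the last rate is kept once the schedule is exhausted

schedule = [(1, 0.1), (1, 0.01), (1, 0.001), (1, 0.0001)]
EPOCH_SIZE = 50000

for samples in (0, 50000, 100000, 150000, 1000000):
    print(samples, learning_rate_at(samples, schedule, EPOCH_SIZE))
```

Under this reading, a fast-mode run of 200 minibatches of 16 samples sees only 3200 samples and would therefore stay at the initial rate of 0.1 throughout.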

In [16]:
#computation graph
def build_VDSR_graph(lr_image_shape, hr_image_shape, net):
    input_dynamic_axes = [C.Axis.default_batch_axis()]
    real_X = C.input(lr_image_shape, dynamic_axes = input_dynamic_axes, name = "real_X")
    real_Y = C.input(hr_image_shape, dynamic_axes = input_dynamic_axes, name = "real_Y")

    real_X_scaled = real_X/255
    real_Y_scaled = real_Y/255

    genG = net(real_X_scaled)

    #Note: this is where the residual error is calculated and backpropagated through Generator
    g_loss_G = IMG_H * IMG_W * C.reduce_mean(C.square(real_Y_scaled - real_X_scaled - genG)) / 2.0

    G_optim = C.adam(g_loss_G.parameters, lr = C.learning_rate_schedule(
        [(1, 0.1), (1, 0.01), (1, 0.001), (1, 0.0001)], C.UnitType.minibatch, 50000),
                   momentum = C.momentum_schedule(0.9), gradient_clipping_threshold_per_sample = 1.0)

    G_G_trainer = C.Trainer(genG, (g_loss_G, None), G_optim)

    return (real_X, real_Y, genG, real_X_scaled, real_Y_scaled, G_optim, G_G_trainer)

Training

Now we are ready to train our model. Using the functions and training configuration above, we create minibatch sources and computation graph. Then we take minibatches one by one and slowly update our model. Number of iterations (minibatches) depends on whether isFast is True.

Notice that the train function accepts model architecture, dimensions of low and high resolution images and computational graph function. This will enable us to use this same function for all models, except for SRGAN model where we will also need to train the discriminator.

The training with isFast set to True might take around 10 minutes.

In [17]:
#training
def train(arch, lr_dims, hr_dims, build_graph):
    create_map_file_from_flatfolder(training_folder_LR)
    create_map_file_from_flatfolder(training_folder_HR)

    print("Starting training")

    reader_train_X = create_mb_source(MAP_FILE_X, lr_dims[1], lr_dims[2])
    reader_train_Y = create_mb_source(MAP_FILE_Y, hr_dims[1], hr_dims[2])
    real_X, real_Y, genG, real_X_scaled, real_Y_scaled, G_optim, G_G_trainer = build_graph(lr_image_shape = lr_dims,
                                                                                           hr_image_shape = hr_dims, net = arch)

    print_frequency_mbsize = 50

    pp_G = C.logging.ProgressPrinter(print_frequency_mbsize)

    input_map_X = {real_X: reader_train_X.streams.features}
    input_map_Y = {real_Y: reader_train_Y.streams.features}

    for train_step in range(NUM_MINIBATCHES):

        X_data = reader_train_X.next_minibatch(MINIBATCH_SIZE, input_map_X)
        batch_inputs_X = {real_X: X_data[real_X].data}

        Y_data = reader_train_Y.next_minibatch(MINIBATCH_SIZE, input_map_Y)
        batch_inputs_X_Y = {real_X : X_data[real_X].data, real_Y : Y_data[real_Y].data}

        G_G_trainer.train_minibatch(batch_inputs_X_Y)
        pp_G.update_with_trainer(G_G_trainer)
        G_trainer_loss = G_G_trainer.previous_minibatch_loss_average

    return (G_G_trainer.model, real_X, real_X_scaled, real_Y, real_Y_scaled)
In [18]:
VDSR_model, real_X, real_X_scaled, real_Y, real_Y_scaled = train(VDSR, IMAGE_DIMS, IMAGE_DIMS, build_VDSR_graph)
Starting training
Generator input shape:  (3, 64, 64)
 Minibatch[   1-  50]: loss = 1583398986.511333 * 800;
 Minibatch[  51- 100]: loss = 3.904708 * 800;
 Minibatch[ 101- 150]: loss = 3.996467 * 800;
 Minibatch[ 151- 200]: loss = 4.015535 * 800;

Now this model can be used to super-resolve arbitrary images. Remember that the VDSR model operates on 64 x 64 images obtained by bicubic interpolation and returns a 64 x 64 image that is clearer than the starting one. How can it super-resolve images of arbitrary size? The idea is to first upscale the target image with bicubic interpolation, then go patch by patch, clearing each one up with our model, and present the resulting image.

Also remember that the loss function is based on scaled images (pixel values between 0 and 1), so the predicted residual has to be scaled back by 255 before it is added to the low resolution image. If some pixels become negative or go above 255, we set them to 0 and 255 respectively before saving.
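
The rescaling and clipping step might look like the following sketch. The pixel values and network outputs are made up, and to_pixels is a hypothetical helper operating on a flat list rather than a full image.

```python
def to_pixels(interpolated, predicted_residual):
    """Combine a 0-255 interpolated image with a residual predicted on the 0-1 scale."""
    result = []
    for pixel, residual in zip(interpolated, predicted_residual):
        value = pixel + 255.0 * residual     # scale the residual back to 0-255
        value = min(255.0, max(0.0, value))  # clip to the valid pixel range
        result.append(int(round(value)))
    return result

interpolated = [10, 128, 250]            # made-up pixel values
predicted_residual = [-0.1, 0.02, 0.1]   # made-up network outputs

pixels = to_pixels(interpolated, predicted_residual)
print(pixels)
```

The first and last values would fall outside the valid range without clipping (-15.5 and 275.5), so they are set to 0 and 255.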

Figure 2 is an example of the results we could get with this model. On the left is the effect of bicubic interpolation and on the right is the result of applying our model.

In [19]:
#Figure 2
Image(url = "https://superresolution.blob.core.windows.net/superresolutionresources/vdsr_small_test.png")
Out[19]:

DRRN super-resolution model

DRRN stands for deep “recursive” residual network. Strictly speaking, the model is not recursive since there is no backward flow of data. It is very similar to the VDSR model described above because it also uses residual learning: we only predict the residual image, that is, the difference between the interpolated low resolution image and the high resolution image. As in VDSR, adding the two gives the final result, which will hopefully have a lot of high-frequency detail. You can see the related paper here. Images showing the model architecture also come from that paper.

DRRN differs from VDSR only in model architecture, so we only need to define one new function which builds that architecture. The DRRN architecture consists of several recursive blocks. Each recursive block consists of two convolutional layers. Each of them is followed by a batch normalization layer and a ReLU layer.

Figure 3 is taken from the mentioned paper and shows how to stack recursive blocks one after another (one, two and three blocks respectively).

In [20]:
#Figure 3
Image(url = "https://superresolution.blob.core.windows.net/superresolutionresources/recblock_architecture.png")
Out[20]:

Figure 4, also from the same paper, shows DRRN model architecture with two recursive blocks.

In [21]:
#Figure 4
Image(url = "https://superresolution.blob.core.windows.net/superresolutionresources/drnn_architecture.png")
Out[21]:

DRRN architecture and training

We will have 9 recursive blocks. There will be no ReLU layer after the last convolution. Every neuron has a receptive field of size 3 x 3 and every layer contains 128 filters, with the exception of the last one, which has three filters. Its three outputs are the RGB representation of the result.
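
The wiring of the recursive blocks can be sketched with scalar stand-ins for the layers. Everything below is illustrative: conv_unit simply halves its input and counts how often it is called, and drrn_skeleton is a hypothetical helper that only mirrors the data flow (an input convolution, nine blocks of two convolution units each, and an output convolution).

```python
calls = {"conv": 0}

def conv_unit(value):
    """Scalar stand-in for a convolution unit; counts how often it is applied."""
    calls["conv"] += 1
    return 0.5 * value

def drrn_skeleton(h0, num_blocks=9):
    h1 = conv_unit(h0)                   # input convolution
    h4 = h1
    for _ in range(num_blocks):
        h3 = conv_unit(conv_unit(h4))    # two convolution units per recursive block
        h4 = h1 + h3                     # the identity branch always starts from h1
    return conv_unit(h4)                 # output convolution

output = drrn_skeleton(1.0)
print(calls["conv"], output)
```

The count comes to 1 + 9 * 2 + 1 = 20 convolution units, and every block adds its result to the same h1 rather than to the output of the previous block.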

After that, we only need to call the train function with DRRN as argument since everything else is the same as for VDSR, including the training data and the computation graph.

The training with isFast set to True might take around 25 minutes.

In [22]:
#basic DRRN block
def DRRN_basic_block(inp, num_filters):
    c1 = C.layers.Convolution((3, 3), num_filters, init = C.he_normal(), pad = True, bias = False)(inp)
    c1 = C.layers.BatchNormalization(map_rank = 1)(c1)
    return c1

def DRRN(h0):
    print('Generator input shape: ', h0.shape)

    with C.layers.default_options(init = C.he_normal(), activation = C.relu, bias = False):
        h1 = C.layers.Convolution((3, 3), 128, pad = True)(h0)
        h2 = DRRN_basic_block(h1, 128)
        h3 = DRRN_basic_block(h2, 128)
        h4 = h1 + h3

        for _ in range(8):
            h2 = DRRN_basic_block(h4, 128)
            h3 = DRRN_basic_block(h2, 128)
            h4 = h1 + h3

        h_out = C.layers.Convolution((3, 3), 3, activation = None, pad = True)(h4)

        return h_out
In [23]:
DRRN_model, real_X, real_X_scaled, real_Y, real_Y_scaled = train(DRRN, IMAGE_DIMS, IMAGE_DIMS, build_VDSR_graph)
Starting training
Generator input shape:  (3, 64, 64)
 Minibatch[   1-  50]: loss = 2977.761893 * 800;
 Minibatch[  51- 100]: loss = 166.646252 * 800;
 Minibatch[ 101- 150]: loss = 141.640209 * 800;
 Minibatch[ 151- 200]: loss = 166.618866 * 800;

Now this model can be used to super-resolve arbitrary images in the same way as described in the VDSR section. As we already mentioned, we will evaluate our models and create filters at the end.

Figure 5 is an example of the results we could get with this model. As before, on the left is the effect of bicubic interpolation and on the right is the result of applying our model. More iterations could give further improvement. Quality is lower than with VDSR (visible from the training errors) because we trained for only a very small number of iterations. DRRN and VDSR show similar performance when the models are trained for a larger number of iterations.

In [24]:
#Figure 5
Image(url = "https://superresolution.blob.core.windows.net/superresolutionresources/drnn_small_test.png")
Out[24]:

SRResNet super-resolution model

Unlike the previous two models, SRResNet does not use residual learning and there is no upscaling during preprocessing. The image is upscaled inside the model using a convolution transpose (also referred to as deconvolution) operation. The training set is now different than before because input and output have different dimensions. Our model takes in 112 x 112 images and outputs 224 x 224 images. This is because the pretrained CNTK VGG19 model, which is needed later for SRGAN, operates strictly on 224 x 224 images.
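
The size arithmetic of a transposed convolution can be sketched with the general relation used by many frameworks. This is not CNTK-specific; the padding and output_padding values below are assumptions chosen to illustrate a 3 x 3 kernel with stride 2, and conv_transpose_output_size is a hypothetical helper.

```python
def conv_transpose_output_size(size, kernel, stride, padding, output_padding=0):
    """Output length of a transposed convolution along one dimension."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# 112 input pixels, 3 x 3 kernel, stride 2, padding 1
without_extra = conv_transpose_output_size(112, 3, 2, 1)
with_extra = conv_transpose_output_size(112, 3, 2, 1, output_padding=1)

print(without_extra, with_extra)
```

Because a stride-2 transposed convolution can map 112 pixels to either 223 or 224 depending on how the border is handled, stating the target size explicitly removes the ambiguity.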

The basis of the model architecture is the residual block. Each residual block has two convolutional layers, each followed by a batch normalization (BN) layer, with a parametric rectified linear unit (PReLU) after the first one. The convolutional layers have a 3 x 3 receptive field and each of them contains 64 filters. Image resolution is increased near the end of the model, which is less complex than increasing it at the beginning. You can see the related paper here.
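
For readers unfamiliar with PReLU, here is a scalar sketch of the activation. In a network the slope alpha is a learned parameter; the value 0.25 below is just an example.

```python
def prelu(x, alpha):
    """Parametric ReLU: identity for non-negative inputs, slope alpha for negative ones."""
    return x if x >= 0 else alpha * x

print(prelu(2.0, 0.25))   # positive inputs pass through unchanged
print(prelu(-2.0, 0.25))  # negative inputs are scaled by alpha
print(prelu(-2.0, 0.0))   # alpha = 0 reduces to the ordinary ReLU
```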

Figure 6 from the paper shows the model architecture in a more detailed way. k is the receptive field size, n is the number of filters in the layer and s is the stride.

In [25]:
#Figure 6
Image(url = "https://superresolution.blob.core.windows.net/superresolutionresources/srresnet_architecture.PNG")
Out[25]:

Training configuration

Since the model operates on different image sizes and the input dimensions differ from the output dimensions (remember, upscaling is done inside the model and not during preprocessing), we have to change the training configuration. We follow the paper and use a minibatch size of 16 (8 in fast mode). We use the training set prepared for this model (folders train224 and train112), so we need to change the training folder and map file paths.

In [26]:
#training configuration
MINIBATCH_SIZE = 8 if isFast else 16
NUM_MINIBATCHES = 200 if isFast else 1000000

# Ensure the training and test data is generated and available for this tutorial.
# We look for the prepared Berkeley Segmentation Dataset in the local data directory.
data_found = False

for data_dir in [os.path.join("data", "BerkeleySegmentationDataset")]:
    train_hr_path = os.path.join(data_dir, "train224")
    train_lr_path = os.path.join(data_dir, "train112")
    if os.path.exists(train_hr_path) and os.path.exists(train_lr_path):
        data_found = True
        break

if not data_found:
    raise ValueError("Please generate the data by completing the first part of this notebook.")

print("Data directory is {0}".format(data_dir))

#folders with training data (high and low resolution images) and paths to map files
training_folder_HR = os.path.join(data_dir, "train224")
training_folder_LR = os.path.join(data_dir, "train112")
MAP_FILE_Y = os.path.join(data_dir, "train224", "map.txt")
MAP_FILE_X = os.path.join(data_dir, "train112", "map.txt")

#image dimensions
NUM_CHANNELS = 3
LR_H, LR_W, HR_H, HR_W = 112, 112, 224, 224
LR_IMAGE_DIMS = (NUM_CHANNELS, LR_H, LR_W)
HR_IMAGE_DIMS = (NUM_CHANNELS, HR_H, HR_W)
Data directory is data/BerkeleySegmentationDataset

The function for constructing the architecture presented in Figure 6 is defined below.

In [27]:
#basic resnet block
def resblock_basic(inp, num_filters):
    c1 = C.layers.Convolution((3, 3), num_filters, init = C.he_normal(), pad = True, bias = False)(inp)
    c1 = C.layers.BatchNormalization(map_rank = 1)(c1)
    c1 = C.param_relu(C.Parameter(c1.shape, init = C.he_normal()), c1)

    c2 = C.layers.Convolution((3, 3), num_filters, init = C.he_normal(), pad = True, bias = False)(c1)
    c2 = C.layers.BatchNormalization(map_rank = 1)(c2)
    return inp + c2

def resblock_basic_stack(inp, num_stack_layers, num_filters):
    assert (num_stack_layers >= 0)
    l = inp
    for _ in range(num_stack_layers):
        l = resblock_basic(l, num_filters)
    return l

#SRResNet architecture
def SRResNet(h0):
    print('Generator inp shape: ', h0.shape)
    with C.layers.default_options(init = C.he_normal(), bias = False):

        h1 = C.layers.Convolution((9, 9), 64, pad = True)(h0)
        h1 = C.param_relu(C.Parameter(h1.shape, init = C.he_normal()), h1)

        h2 = resblock_basic_stack(h1, 16, 64)

        h3 = C.layers.Convolution((3, 3), 64, activation = None, pad = True)(h2)
        h3 = C.layers.BatchNormalization(map_rank = 1)(h3)

        h4 = h1 + h3
        #upscaling: the transposed convolution below doubles the spatial resolution

        h5 = C.layers.ConvolutionTranspose2D((3, 3), 64, pad = True, strides = (2, 2), output_shape = (224, 224))(h4)
        h5 = C.param_relu(C.Parameter(h5.shape, init = C.he_normal()), h5)

        h6 = C.layers.Convolution((3, 3), 3, pad = True)(h5)

        return h6
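
The transposed convolution in SRResNet is the step where 112 x 112 feature maps become 224 x 224. The sketch below uses the textbook output-size formulas for symmetric padding to illustrate one plausible reason for passing an explicit output_shape: with a 3 x 3 kernel and stride 2, both 223 and 224 map back to 112 under the forward convolution, so the inverse direction is ambiguous. The helper names, the padding of 1 and the output_padding argument are assumptions made for illustration; CNTK's automatic padding may compute sizes differently internally.

```python
def conv_out(size, kernel, stride, pad):
    """Output size of a strided convolution with symmetric padding."""
    return (size + 2 * pad - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride, pad, output_padding=0):
    """Output size of the matching transposed convolution."""
    return (size - 1) * stride - 2 * pad + kernel + output_padding

# Two different high resolution sizes collapse to the same low resolution size...
print(conv_out(223, kernel=3, stride=2, pad=1))   # 112
print(conv_out(224, kernel=3, stride=2, pad=1))   # 112

# ...so going the other way needs extra information to pick between them.
print(conv_transpose_out(112, kernel=3, stride=2, pad=1))                    # 223
print(conv_transpose_out(112, kernel=3, stride=2, pad=1, output_padding=1))  # 224
```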

We follow the paper, so the computation graph differs a bit from the one used for the previous models. The differences are in the loss function and the learning rate schedule. The loss function is the MSE, and the learning rate is 0.0001 for almost the whole training. We set it higher at the beginning, which can help the model produce somewhat reasonable results early on.

The training with isFast set to True might take around 25 minutes.
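
To make the learning rate schedule concrete, here is a minimal plain-Python sketch of how the piecewise schedule used in the next cell can be read. It assumes that each pair is (number of epochs, rate), that an epoch is 10000 samples, and that the last rate stays in effect once the listed epochs are used up. The helper lr_at_sample is hypothetical and not part of CNTK.

```python
# Values copied from the learning_rate_schedule call in build_SRResNet_graph.
SCHEDULE = [(1, 0.01), (1, 0.001), (98, 0.0001)]
EPOCH_SIZE = 10000  # samples per epoch

def lr_at_sample(schedule, epoch_size, samples_seen):
    """Return the rate in effect after `samples_seen` training samples (assumed semantics)."""
    boundary = 0
    for num_epochs, rate in schedule:
        boundary += num_epochs * epoch_size
        if samples_seen < boundary:
            return rate
    # Assumption: the last rate is kept after the schedule is exhausted.
    return schedule[-1][1]

for samples in (0, 9999, 10000, 20000, 5000000):
    print(samples, lr_at_sample(SCHEDULE, EPOCH_SIZE, samples))
```

Under these assumptions and a minibatch size of 16, the two higher rates cover only the first 1250 minibatches (625 each), which matches the statement that the rate is 0.0001 almost the whole time.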

In [28]:
def build_SRResNet_graph(lr_image_shape, hr_image_shape, net):
    inp_dynamic_axes = [C.Axis.default_batch_axis()]
    real_X = C.input(lr_image_shape, dynamic_axes=inp_dynamic_axes, name="real_X")
    real_Y = C.input(hr_image_shape, dynamic_axes=inp_dynamic_axes, name="real_Y")

    real_X_scaled = real_X/255
    real_Y_scaled = real_Y/255

    genG = net(real_X_scaled)

    G_loss = C.reduce_mean(C.square(real_Y_scaled - genG))

    G_optim = C.adam(G_loss.parameters,
                    lr = C.learning_rate_schedule([(1, 0.01), (1, 0.001), (98, 0.0001)], C.UnitType.minibatch, 10000),
                    momentum = C.momentum_schedule(0.9), gradient_clipping_threshold_per_sample = 1.0)

    G_G_trainer = C.Trainer(genG, (G_loss, None), G_optim)

    return (real_X, real_Y, genG, real_X_scaled, real_Y_scaled, G_optim, G_G_trainer)
In [29]:
SRResNet_model, real_X, real_X_scaled, real_Y, real_Y_scaled = train(SRResNet, LR_IMAGE_DIMS,
                                                                     HR_IMAGE_DIMS, build_SRResNet_graph)
Starting training
Generator inp shape:  (3, 112, 112)
 Minibatch[   1-  50]: loss = 0.071668 * 400;
 Minibatch[  51- 100]: loss = 0.010785 * 400;
 Minibatch[ 101- 150]: loss = 0.008225 * 400;
 Minibatch[ 151- 200]: loss = 0.006657 * 400;

Figure 7 shows the results we can get with only 1000 iterations of this model. Images on the left are the original 112 x 112 images and those on the right are the outputs of our model. As we can see, the quality is far from acceptable. This model requires around \(10^6\) iterations for the quality to become solid. At the end, we will show what results we can obtain with a sufficient number of iterations.

In [30]:
#Figure 7
Image(url = "https://superresolution.blob.core.windows.net/superresolutionresources/resnet_small_test.png",
      width = 390, height = 390)
Out[30]:

The SRResNet model created here is a starting point for the last model we will try out: the SRGAN.

SRGAN super-resolution model

A GAN (generative adversarial network) consists of two neural networks which supervise each other. GANs were first introduced by Ian Goodfellow et al. in the paper Generative Adversarial Nets.

Simply put, our GAN, like any other, will consist of two parts:

  • Generator network: Our generator network will take in 112 x 112 images and output their super-resolved versions of dimensions 224 x 224. The architecture of our generator network will be the same as for SRResNet above.

  • Discriminator network: Our discriminator network will take in both real 224 x 224 high resolution images and those produced by the generator network, so it can learn to distinguish one from the other. For every 224 x 224 image it takes, our discriminator outputs one real number between 0 and 1, which estimates the probability of the input image being real (and not produced by the generator). The architecture of the discriminator is essentially a sequence of several convolutional layers, each followed by a batch normalization layer and a leaky ReLU layer with α = 0.2. Figure 8 shows the exact architecture, taken from the SRGAN paper. k is the receptive field size, n is the number of filters in the layer and s is the stride.

In [31]:
#Figure 8
Image(url = "https://superresolution.blob.core.windows.net/superresolutionresources/discr_architecture.PNG")
Out[31]:
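
As a quick illustration of the leaky ReLU mentioned above, here is a minimal NumPy sketch with α = 0.2. The function name leaky_relu is chosen for this example only. PReLU, used in the generator, has the same form but learns the slope instead of fixing it.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    """Identity for positive inputs, slope alpha for negative inputs."""
    return np.maximum(x, 0.0) + alpha * np.minimum(x, 0.0)

x = np.array([-2.0, -0.5, 0.0, 1.5], dtype=np.float32)
# Negative entries are scaled by 0.2, non-negative entries pass through unchanged.
print(leaky_relu(x))
```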

In every iteration, the generator tries to produce images that make the loss value as small as possible (as in any other model). The idea here is to include the adversarial loss in the loss function, which takes the discriminator’s opinion into account.

Similarly, in every iteration, the discriminator tries to become better at distinguishing generated images from real ones.

The key idea is to initialize the generator network weights with the weights of the previously trained SRResNet model and then start the training. Since GANs can be unstable during training (especially in a problem as delicate as super-resolution), this is a good idea because now we only need to refine the generator a bit, so we don’t need massive parameter changes and gradients will hopefully not become very large. Therefore, it makes sense to say that SRGAN is a refinement of SRResNet.

Now we can define the function for constructing the discriminator based on the showcased architecture.

In [32]:
#basic discriminator block
def conv_bn_lrelu(inp, filter_size, num_filters, strides = (1, 1), init = C.he_normal()):
    r = C.layers.Convolution(filter_size, num_filters, init = init, pad = True, strides = strides, bias = False)(inp)
    r = C.layers.BatchNormalization(map_rank = 1)(r)
    return C.param_relu(C.constant((np.ones(r.shape) * 0.2).astype(np.float32)), r)

#discriminator architecture
def discriminator(h0):
    print('Discriminator input shape: ', h0.shape)
    with C.layers.default_options(init = C.he_normal(), bias = False):
        h1 = C.layers.Convolution((3, 3), 64, pad = True)(h0)
        h1 = C.param_relu(C.constant((np.ones(h1.shape) * 0.2).astype(np.float32)), h1)

        h2 = conv_bn_lrelu(h1, (3, 3), 64, strides = (2, 2))

        h3 = conv_bn_lrelu(h2, (3, 3), 128)
        h4 = conv_bn_lrelu(h3, (3, 3), 128, strides = (2, 2))

        h5 = conv_bn_lrelu(h4, (3, 3), 256)
        h6 = conv_bn_lrelu(h5, (3, 3), 256, strides = (2, 2))

        h7 = conv_bn_lrelu(h6, (3, 3), 512)
        h8 = conv_bn_lrelu(h7, (3, 3), 512, strides = (2, 2))

        h9 = C.layers.Dense(1024)(h8)
        h10 = C.param_relu(C.constant(0.2, h9.shape), h9)

        h11 = C.layers.Dense(1, activation = C.sigmoid)(h10)
        return h11
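
To see what the first Dense layer receives, the spatial size can be traced through the convolutions defined above. This is a rough sketch that assumes a padded convolution with stride s maps a side of length n to ceil(n / s) and that Dense flattens its whole input. The helper strided_same_out is hypothetical.

```python
import math

def strided_same_out(size, stride):
    """Spatial size after a padded convolution; assumes ceil(size / stride)."""
    return math.ceil(size / stride)

size = 224      # discriminator input is 3 x 224 x 224
channels = 3
# (number of filters, stride) for the eight convolutions in discriminator()
layers = [(64, 1), (64, 2), (128, 1), (128, 2), (256, 1), (256, 2), (512, 1), (512, 2)]
for num_filters, stride in layers:
    size = strided_same_out(size, stride)
    channels = num_filters
    print((channels, size, size))

dense_inputs = channels * size * size
print("Values flattened into Dense(1024):", dense_inputs)
```

Under these assumptions, the four stride-2 convolutions reduce 224 x 224 to 14 x 14, so the first Dense layer sees 512 x 14 x 14 = 100352 values.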

Training configuration

The training configuration is identical to the configuration for SRResNet, but we use a smaller number of iterations because we only need refinement. Other paths and parameters are unchanged. We set the minibatch size to 4 here (2 in fast mode) to speed up the process. Larger minibatch sizes offer more accurate gradient approximations and better results. Smaller minibatch sizes offer more speed.

In [33]:
#training configuration
MINIBATCH_SIZE = 2 if isFast else 4
NUM_MINIBATCHES = 200 if isFast else 100000
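
The trade-off between minibatch size and gradient accuracy can be illustrated with a toy experiment. The sketch below is not tied to the SRGAN model. It models each per-sample gradient as a true value of 1.0 plus unit Gaussian noise (an assumption made purely for illustration) and measures how much the minibatch average fluctuates.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_estimate_std(batch_size, num_trials=2000):
    """Spread of the minibatch mean of noisy per-sample 'gradients'."""
    # Toy model: true gradient 1.0 plus unit Gaussian noise per sample.
    samples = 1.0 + rng.standard_normal((num_trials, batch_size))
    return samples.mean(axis=1).std()

std_4 = gradient_estimate_std(4)
std_16 = gradient_estimate_std(16)
# The spread shrinks roughly like 1 / sqrt(batch_size).
print(std_4, std_16)
```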

Computation graph

The computation graph is now quite different from before. The generator and the discriminator each have their own loss function, and we use each loss to update the parameters of the corresponding network. The discriminator needs to act both on real 224 x 224 images and on those produced by the generator. The standard way to enable this is to use the clone function. Effectively, we will have two discriminators, one acting on the real images and the other on the synthetic images. However, because we clone with parameter sharing, the two copies use the same parameters.

The biggest challenge is in the loss functions. It is not easy to create a loss function which will make our GAN work well. Coefficients need to be carefully selected to ensure that our model does not diverge and start generating unusable images. There is no general rule for setting the coefficients of the different parts of the loss functions; it all depends on the problem we are tackling. Finding a good coefficient configuration usually takes several failed attempts.

If \(G\) is our generator and \(D\) our discriminator, following the Generative Adversarial Nets paper, \(G\) and \(D\) are playing the following min-max game:

In [34]:
Image(url = "https://superresolution.blob.core.windows.net/superresolutionresources/min_max.PNG")
Out[34]:

Here, the role of \(x\) is taken by real 224 x 224 images and the role of \(z\) is taken by the 112 x 112 images that are the input of the generator. Therefore, the discriminator wants to return larger numbers (closer to 1) for real 224 x 224 images and smaller numbers (closer to 0) for synthetic images. Conversely, the generator wants to perform in such a way that the discriminator gives larger outputs (closer to 1) when evaluated on the generator’s outputs.
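
For reference, the min-max objective from the Generative Adversarial Nets paper can be written as follows. This assumes the figure above reproduces the paper's original formulation. \(p_{\text{data}}\) and \(p_z\) denote the distributions of the real images and of the generator inputs.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```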

Logarithms can be problematic because their gradients are unbounded. No matter how small the coefficient is in front of the adversarial loss of the generator or the discriminator, there is a chance that our gradients might explode. Once they do, it is very hard for the model to return to generating reasonable images.

To address this, we adopt the idea from the CycleGAN paper, where logarithm losses are replaced by square losses as shown below. Also notice that the game is now max-min instead of min-max.

In [35]:
Image(url = "https://superresolution.blob.core.windows.net/superresolutionresources/square_min_max.PNG")
Out[35]:
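
The difference between the two kinds of losses can be checked numerically. The sketch below compares the derivative of log(D) with the derivative of the square term (1 - D)^2 as the discriminator output D approaches 0. The sample values of D are chosen for illustration only.

```python
import numpy as np

d = np.array([1e-1, 1e-3, 1e-6])   # discriminator outputs approaching 0

# d/dD of log(D): grows without bound as D -> 0
grad_log = 1.0 / d
# d/dD of (1 - D)^2: magnitude never exceeds 2 for D in [0, 1]
grad_square = -2.0 * (1.0 - d)

print(grad_log)
print(grad_square)
```

The same reasoning applies to log(1 - D) as D approaches 1, whereas the derivative of D^2 stays within [0, 2].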

In addition to the adversarial and MSE losses, the generator’s loss function will also have a perceptual loss part. A common way to introduce the perceptual loss is with the pre-trained VGG19 network, as mentioned in the SRGAN paper.

The idea is to take the high resolution image and its generated counterpart and run them both through the VGG19 network. The mean squared difference of those two evaluations at the layer relu5_4 is the perceptual loss. Reportedly, introducing this loss can help GANs generate perceptually more pleasing results than using the MSE loss only (with adversarial losses, of course).
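
As a minimal sketch of this computation, the perceptual loss is simply a mean squared difference taken in feature space rather than pixel space. The arrays below are random stand-ins for relu5_4 activations, and the shape (512, 14, 14) is an assumption about what VGG19 produces for a 224 x 224 input. No real VGG19 model is evaluated here.

```python
import numpy as np

def perceptual_loss(features_real, features_fake):
    """Mean squared difference between two feature maps of the same shape."""
    return np.mean(np.square(features_real - features_fake))

rng = np.random.default_rng(0)
# Stand-ins for feature maps of the real image and of the generated image.
feat_real = rng.random((512, 14, 14)).astype(np.float32)
feat_fake = feat_real + 0.1   # features that are off by 0.1 everywhere

print(perceptual_loss(feat_real, feat_fake))   # close to 0.1 ** 2
print(perceptual_loss(feat_real, feat_real))   # identical features give zero loss
```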

Now we are ready to define a function which builds our computation graph for SRGAN. Coefficients in the generator loss were determined with the help of several papers and by trial and error until we got decent results. By trying out more values, we could probably make further improvements, and users are encouraged to experiment.

In [36]:
#gan computation graph
def build_GAN_graph(genG, disc, VGG, real_X_scaled, real_Y_scaled, real_Y):
    #discriminator on real images
    D_real = disc(real_Y_scaled)

    #discriminator on fake images
    D_fake = D_real.clone(method = 'share', substitutions = {real_Y_scaled.output: genG.output})

    #VGG on real images
    VGG_real = VGG.clone(method = 'share', substitutions = {VGG.arguments[0]: real_Y})

    #VGG on fake images
    VGG_fake = VGG.clone(method = 'share', substitutions = {VGG.arguments[0]: 255 * genG.output})

    #generator loss: GAN loss + MSE loss + perceptual (VGG) loss
    G_loss = -C.square(D_fake)*0.001 + C.reduce_mean(C.square(real_Y_scaled - genG)) + C.reduce_mean(C.square(VGG_real - VGG_fake))*0.08

    #discriminator loss: loss on real + loss on fake images
    D_loss = C.square(1.0 - D_real) + C.square(D_fake)

    G_optim = C.adam(G_loss.parameters,
                    lr = C.learning_rate_schedule([(20, 0.0001), (20, 0.00001)], C.UnitType.minibatch, 5000),
                    momentum = C.momentum_schedule(0.9), gradient_clipping_threshold_per_sample = 0.1)

    D_optim = C.adam(D_loss.parameters,
                    lr = C.learning_rate_schedule([(20, 0.0001), (20, 0.00001)], C.UnitType.minibatch, 5000),
                    momentum = C.momentum_schedule(0.9), gradient_clipping_threshold_per_sample = 0.1)

    G_trainer = C.Trainer(genG, (G_loss, None), G_optim)

    D_trainer = C.Trainer(D_real, (D_loss, None), D_optim)

    return (G_trainer, D_trainer)
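
To get a feeling for how the terms are weighted, the two losses above can be written as plain scalar functions. This is only a sketch: the function names are made up for this example and the input values are hypothetical, not measured results.

```python
def generator_loss(d_fake, mse, perceptual, adv_weight=0.001, vgg_weight=0.08):
    """Weighted sum mirroring G_loss: adversarial + MSE + perceptual terms."""
    adversarial = -adv_weight * d_fake ** 2
    return adversarial + mse + vgg_weight * perceptual

def discriminator_loss(d_real, d_fake):
    """Square losses mirroring D_loss: push d_real towards 1 and d_fake towards 0."""
    return (1.0 - d_real) ** 2 + d_fake ** 2

# Hypothetical per-minibatch values, chosen only for illustration.
print(generator_loss(d_fake=0.0, mse=0.006, perceptual=0.05))
print(generator_loss(d_fake=1.0, mse=0.006, perceptual=0.05))
print(discriminator_loss(d_real=1.0, d_fake=0.0))
```

With these weights, fooling the discriminator completely lowers the generator loss by at most 0.001, so the MSE and perceptual terms dominate unless they are already small.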

Training the model

The train function is also considerably different now, since we have to alternate updates to the discriminator and the generator. Map files and image dimensions are the same as in the SRResNet model. Before the training, we download the pretrained VGG19 model. It might take a few minutes.

In [37]:
data_dir = os.path.join("data", "BerkeleySegmentationDataset")
if not os.path.exists(data_dir):
    # os.makedirs returns None, so its result must not be assigned to data_dir
    os.makedirs(data_dir)

models_dir = os.path.join(data_dir, "PretrainedModels")

if not os.path.exists(models_dir):
    os.makedirs(models_dir)

print("Downloading VGG19 model...")
urlretrieve("https://www.cntk.ai/Models/Caffe_Converted/VGG19_ImageNet_Caffe.model",
            os.path.join(models_dir, "VGG19_ImageNet_Caffe.model"))
print("Done!")
Downloading VGG19 model...
Done!
In [38]:
def train_GAN(SRResNet_model, real_X, real_X_scaled, real_Y, real_Y_scaled):
    print("Starting training")

    reader_train_X = create_mb_source(MAP_FILE_X, LR_W, LR_H)
    reader_train_Y = create_mb_source(MAP_FILE_Y, HR_W, HR_H)

    VGG19 = C.load_model(os.path.join(models_dir, "VGG19_ImageNet_Caffe.model"))
    print("Loaded VGG19 model.")

    layer5_4 = VGG19.find_by_name('relu5_4')
    relu5_4  = C.combine([layer5_4.owner])

    G_trainer, D_trainer = build_GAN_graph(genG = SRResNet_model, disc = discriminator, VGG = relu5_4,
                                           real_X_scaled = real_X_scaled, real_Y_scaled = real_Y_scaled, real_Y = real_Y)

    print_frequency_mbsize = 50

    print("First row is discriminator loss, second row is generator loss:")
    pp_D = C.logging.ProgressPrinter(print_frequency_mbsize)
    pp_G = C.logging.ProgressPrinter(print_frequency_mbsize)

    inp_map_X = {real_X: reader_train_X.streams.features}
    inp_map_Y = {real_Y: reader_train_Y.streams.features}

    for train_step in range(NUM_MINIBATCHES):
        X_data = reader_train_X.next_minibatch(MINIBATCH_SIZE, inp_map_X)
        batch_inps_X = {real_X: X_data[real_X].data}

        Y_data = reader_train_Y.next_minibatch(MINIBATCH_SIZE, inp_map_Y)
        batch_inps_X_Y = {real_X: X_data[real_X].data, real_Y : Y_data[real_Y].data}

        D_trainer.train_minibatch(batch_inps_X_Y)
        pp_D.update_with_trainer(D_trainer)
        D_trainer_loss = D_trainer.previous_minibatch_loss_average

        G_trainer.train_minibatch(batch_inps_X_Y)
        pp_G.update_with_trainer(G_trainer)
        G_trainer_loss = G_trainer.previous_minibatch_loss_average

    model = G_trainer.model

    return model

Remember that the requirement for training SRGAN is the SRResNet model, so we pass it into the training function.

Training with isFast set to True might take around 20 minutes. Since our SRResNet (which is used to initialize the SRGAN) and SRGAN were both trained for very few iterations, the results of SRGAN will probably not be impressive. We don’t mind, since this is only to showcase the code and the training procedure; the real models (trained for enough iterations) are used in tutorial CNTK 302 part A.

In [39]:
SRGAN_model = train_GAN(SRResNet_model, real_X, real_X_scaled, real_Y, real_Y_scaled)
Starting training
Loaded VGG19 model.
Discriminator input shape:  (3, 224, 224)
First row is discriminator loss, second row is generator loss:
 Minibatch[   1-  50]: loss = 0.971728 * 100;
 Minibatch[   1-  50]: loss = 0.063828 * 100;
 Minibatch[  51- 100]: loss = 1.000000 * 100;
 Minibatch[  51- 100]: loss = 0.006320 * 100;
 Minibatch[ 101- 150]: loss = 1.000000 * 100;
 Minibatch[ 101- 150]: loss = 0.005728 * 100;
 Minibatch[ 151- 200]: loss = 1.000000 * 100;
 Minibatch[ 151- 200]: loss = 0.006376 * 100;

These models can now be used to evaluate any image by following the description in CNTK 302 part A. However, we don’t recommend using them, since they have been trained for a very small number of iterations. We recommend the pre-trained models used in CNTK 302 part A instead. Training here was done only to showcase the training speed and procedure.