{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Read and feed data to CNTK Trainer\n", "\n", "Feeding data is an integral part of training a deep neural network. While expressive and succinct model representation is one of the key aspects of CNTK, efficient and flexible data reading is also available to users. Key to deep learning is the ability to provide randomly sampled training data to the CNTK model trainers. These small sampled data sets are called mini-batches. In this manual, we show how minibatch samples can be read from data sources and passed on to trainer objects. The built-in readers follow the [convention over configuration principle](https://en.wikipedia.org/wiki/Convention_over_configuration) and greatly simplify the training procedure.\n", "\n", "One of the following four mechanisms to read data should cover the different use cases when it comes to training a model: \n", "\n", "A. **Small data (in memory)** or **Chunked Data in NumPy/SciPy**: Users can generate their own data using *NumPy* or read in data as *NumPy* arrays or *SciPy* sparse (CSR) matrices. This is the way to go when the data set is small and can be loaded in memory. This option is also a good choice if one pre-processes the data into a format that fits into *NumPy* arrays or *SciPy* sparse matrices. In such cases one would read the data in small chunks, pre-process them and feed them to *NumPy*/*SciPy* data containers.\n", "\n", "B. **Large data (cannot be loaded in memory)**: \n", "\n", "> (1) Using the built-in **MinibatchSource** class is the choice when data cannot be loaded in memory and one wants to build models leveraging distributed computing capabilities in CNTK. This is supported via built-in readers and optional parameters that reduce the amount of boilerplate code one would otherwise have to write. \n", "\n", "> (2) Using an explicit **minibatch-loop** when the data being used does not fit in one of the aforementioned categories. 
For instance, if one needs to alter training parameters based on certain loss conditions, an explicit control over the training loop is desirable. \n", "\n", "C. **Custom data**: In case the built-in readers are not adequate and one is using CNTK's training session constructs (that will be explained in the tutorial) or needs data to be formatted for distributed scenarios, creating a custom minibatch source is a suggested option. \n", "\n", "We will use a text data reader to walk you through the aforementioned options. CNTK does support image and speech data as well. We will discuss how to read those data types towards the end." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from __future__ import print_function\n", "import cntk as C\n", "import numpy as np\n", "import os\n", "import requests\n", "import scipy.sparse\n", "\n", "import cntk.tests.test_utils\n", "cntk.tests.test_utils.set_device_from_pytest_env() # (only needed for CNTK internal build system)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sections A - C below are applicable to any model architecture when using CNTK. For illustration of how the different data feeding options can be leveraged with a CNTK model for training, we will use a simple Logistic Regression model in this notebook. \n", "\n", "\n", "## A. Small data (in memory)\n", "\n", "The `train()` and `test()` functions accept a tuple of NumPy arrays or SciPy sparse matrices (in CSR format) for their `minibatch_source` arguments.\n", "The tuple members must be in the same order as the arguments of the `criterion` function that `train()` or `test()` are called on.\n", "For dense tensors, use NumPy arrays, while sparse data should have the type `scipy.sparse.csr_matrix`.\n", "\n", "Each of the arguments should be a Python list of NumPy/SciPy arrays, where each list entry represents a data item. 
For arguments declared as a sequence, the first axis (dimension) of the NumPy/SciPy array is the sequence length, while the remaining axes are the shape of each element of the sequence. Arguments that are not sequences consist of a single tensor. The shapes, data types (`np.float32/float64`) and sparseness must match the argument types. \n", "\n", "As an optimization, arguments that are not sequences can also be passed as a single large NumPy/SciPy array (instead of a list).\n", "\n", "**Note**: \n", "- It is the responsibility of the user to randomize the data.\n", "- It is the responsibility of the user to provide the data in the row-major memory layout that CNTK uses for the Python interface. The C++ interface uses a column-major format. This becomes more relevant for tensors of higher dimensions. Refer to the Image data section later in the notebook for more information." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data =\n", " [[ 2.2741797 3.56347561]\n", " [ 5.12873602 5.79089499]\n", " [ 1.3574543 5.5718112 ]\n", " [ 3.54340553 2.46254587]]\n", "labels =\n", " [[ 1. 0.]\n", " [ 0. 1.]\n", " [ 0. 1.]\n", " [ 1. 
0.]]\n" ] } ], "source": [ "# Generate your own data\n", "input_dim_lr = 2 # classify 2-dimensional data\n", "num_classes_lr = 2 # into one of two classes\n", "\n", "# This example uses synthetic data from normal distributions,\n", "# which we generate in the following.\n", "# X_lr[corpus_size,input_dim] - input data\n", "# Y_lr[corpus_size] - labels (0 or 1), one-hot-encoded\n", "np.random.seed(0)\n", "def generate_synthetic_data(N):\n", " Y = np.random.randint(size=N, low=0, high=num_classes_lr) # labels\n", " X = (np.random.randn(N, input_dim_lr)+3) * (Y[:,None]+1) # data\n", " # Our model expects float32 features, and cross-entropy\n", " # expects one-hot encoded labels.\n", " Y = scipy.sparse.csr_matrix((np.ones(N,np.float32), (range(N), Y)), shape=(N, num_classes_lr))\n", " X = X.astype(np.float32)\n", " return X, Y\n", "X_train_lr, Y_train_lr = generate_synthetic_data(20000)\n", "X_test_lr, Y_test_lr = generate_synthetic_data(1024)\n", "print('data =\\n', X_train_lr[:4])\n", "print('labels =\\n', Y_train_lr[:4].todense())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a small model and train on the data. Note: this manual does not focus on training instead just on different ways to pass data into the trainers." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Composite(Tensor[2], SparseTensor[2]) -> Tensor[1]\n" ] } ], "source": [ "## Define a small Logistic Regression model function\n", "x = C.input_variable(input_dim_lr)\n", "y = C.input_variable(num_classes_lr, is_sparse=True)\n", "model = C.layers.Dense(num_classes_lr, activation=None)\n", "loss = C.cross_entropy_with_softmax(model(x), y) # applies softmax to the model output under the hood\n", "print(loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feed NumPy data from memory" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " average since average since examples\n", " loss last metric last \n", " ------------------------------------------------------\n", "Learning rate per minibatch: 0.1\n", " 3.58 3.58 0 0 32\n", " 1.61 0.629 0 0 96\n", " 1.1 0.715 0 0 224\n", " 0.88 0.688 0 0 480\n", " 0.734 0.598 0 0 992\n", " 0.637 0.543 0 0 2016\n", " 0.541 0.447 0 0 4064\n", " 0.45 0.359 0 0 8160\n", " 0.366 0.284 0 0 16352\n" ] } ], "source": [ "learner = C.sgd(model.parameters,\n", " C.learning_parameter_schedule(0.1))\n", "progress_writer = C.logging.ProgressPrinter(0)\n", "\n", "train_summary = loss.train((X_train_lr, Y_train_lr), parameter_learners=[learner],\n", " callbacks=[progress_writer])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: We have not set any function defining the `metric`. Hence, you see the metric values are all set to 0.0. The plot below shows how the loss reduces with the training iterations. 
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYwAAAEZCAYAAACEkhK6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XuYXVV9//H3J4kJt1xBuSYBucjFC1CF8EOaUYQCFuiv\nBQERFS8PUnxArLYgSlK0Lba1FlSk9IcUqMitJaLcEYZbNVCSQAgJBkEIAQJCJuRCQiDf3x9rD3Ny\n5sxkz+Ts2WfPfF7Pc57Zl3XW/p49M+d71tprr6OIwMzMbEOGlR2AmZlVgxOGmZnl4oRhZma5OGGY\nmVkuThhmZpaLE4aZmeXihDHESfqxpHP6WlbSVEmLio3u7eM+LemjA3GswSD73cxtdtl+xHGfpE8X\nUbeVY0TZAVgxJP0e2AbYLiJerdk+G/gAsGNEPBsRp+ats0HZft3EI2ky8DQwIiLW9aeOwULSh4Fb\nSOdyGLAZsAJQtm3PiHiuL3VGxD3A+5pd1swtjMErSG/KJ3RukPReYFP6+UbfRJ1vhir8QNLwoo+x\nMSLi/ogYHRFjgL1I52Vs57b6ZKFMKcHakOeEMbhdCXymZv0zwOW1BSRdJum8bHmqpEWSvippiaTF\nkj7bqGzXJp0t6WVJT0n6ZM2OIyTNkrRM0jOSptU8757sZ4ek1yTtnz3ni5Iez7Y9JmnvmufsI+kR\nSUsl/UzSyEYvWNJnJN0v6V8kvQxMkzRN0pU1ZSZLWidpWLZ+t6Tzsue9JulWSRN6qP9xSUfUrA+X\n9JKkvSWNknSlpD9kcc6U9M5G9WzAegkh69o5T9L/kFofEyV9vuZcLZT0+ZryB0t6umZ9kaQzJT2a\nxfVTSe/oa9ls/9mSXsjKfSE7j5M2+IKScyX9XtKLkn4iaXS2b9PsOJ3n7Ted5z97nU9nr/NJSZ/o\nx/m0JnHCGNx+A4yW9J7szfE44D/p/ZP9NsBoYDvgC8CPJI3tpeyErOxngUsk7ZrtWwGcFBFjgY8D\nX5J0VLbvj7OfY7JP0TMlHQucC3wq+7R9FPBKzbGOBQ4FdiJ1qX22l9ewP/AksDXwd9m2+lZV/foJ\npIT6TmAU8LUe6r4K+GTN+mHAyxExJ3v+GGB70nn5EvB6L3H2xadIr3kMsBh4ETg8O1dfBH6QtSA7\n1b++Y4GDgXcDHwRO6mtZSX8KnAZMBXYDPtrguT35Ium8/TGwM+n8/Gu272RSy3e7bPtfAquzhPI9\n4ODsdR4IPJrzeFYAJ4zBr7OVcQgwH3h+A+XfAL4dEW9FxC2kN/739FA2gG9FxNqIuBe4CfgEQETc\nGxHzsuXHgKtJbzS1ahPX54F/jIhZ2XOeiojai+oXRMSSiOgAfgHUtj7qLY6IiyJiXUSs2cDr7XRZ\nRPwuK39tL/X/DDhK0ibZ+gnZNoC1wJbAbpHMjogVOY+/IT+JiN9mv5e3IuKmiHgGICLagV8BB/Xy\n/O9HxMsRsRT4Jb2fv57KHgtcmsXxOvC3fYj/k8A/Z9fNVgLfoCvxrgW2ouu8zYqIVdm+dcD7JI3K\nfv8L+nBMazInjMHvP0n/mJ8FrshR/pW6C9GrgC16KLs0IlbXrD9D+pSIpP0l3ZV113QAp5DeFHoy\nEfhdL/uX5IwJoD+jt17MU39E/A54HDhS0qakltBV2e4rgduAqyU9J+l8Ne8aynqvSdKfZl03r0ha\nSvpA0Nv57cv566nsdnVxLCL/dajtSH8fnZ4BRmVddv8B3Alcm3V1/b2kYRGxnJSQvwy8KOnGmhas\nlcAJY5CLiGdJF78PB/67ydWPz940O02iqwXzU2AGsH1EjAP+ja43l0bdGItIXRXNUF//StLoo07b\nb
mT9V5OS8NHAvIh4CiAi3oyIb0fEXsD/AY4EmjWs9O3XlLVuriN1t70zIsYDd1D8IIIXgB1q1ieR\nv0vqeWByzfpkYE3WklkbEedFxJ7Ah4E/B04EiIjbIuIQUvfn70h/R1YSJ4yh4XPAR7NuhGYS8LeS\n3iHpINK1imuzfVuQWiBrJe3H+v3+L5O6GmoTxP8DviZpXwBJO0ua2KQ45wB/LGlidj3mrI2s72rS\n9ZRT6WpdIKlN0nuz60UrSF0tfR02nOdNfxTwDuAPQGTXFg7u43H641rg85J2k7QZ8M0+PPdnwFez\nAQejge+QnTtJH5G0lyRRc94kbZO1pDYF3iQl/rea+YKsb5wwBq+3P/lFxNOd1wbq9/WlngZeAJaS\nPj1eCZwSEQuzfX8JfFvSMtIbyzU18bxO+nT8gKRXJe0XEddn266S9BpwA+kCaF/j7f4CIu7Mjv8o\n8BDpGsh6RfpY34vAr4Ep1Lwu0qfg64FlwDzgbtJ56bzp8aI81W9oW0QsA84kteBeIX0ir39NG6qz\nz2Uj4pfAj4F7gSeA+7NdPV0nqq3r30nn6j7SgIRlwFeyfduRWr/LgLnA7aRkMhz4Ounv62XgANJF\ndyuJBuILlLJPXP8LPBcRRzXYfyGpy2Ql8NlsxImZtbBsVNbDETGq7FhsYAxUC+MM0oXCbiQdDuwc\nEbuSLoxePEAxmVkfSfqzrAtyAnA+qZVjQ0ThCUPSDsARpD7qRo4mG70TETOBsZK2LjouM+uX00jX\nTn5LGkH15XLDsYE0EHNJfZ/UD9nTzV/bs/5QvcXZtiWNi5tZWbIRSzZEFdrCkPRxYEl2TUIMwNxB\nZmZWjKJbGAeS7oo9gnTr/2hJV0RE7dj0xaSbtjrtkG1bj6SyJ8wzM6ukiGjKh/VCWxgR8Y2ImBQR\n7waOB+6qSxYAN5Ld3CRpCtAREQ27oyKipR7Tpk0rPYaqxOWYHNNQiKsVY2qmUr4PQ9IpQETEJRFx\ns9LMpk+ShtWeXEZMZmbWuwFLGJG+qOWebPnf6vZ5pIWZWYvznd4boa2trewQGmrFuBxTPo4pv1aM\nqxVjaqYBudO7GSRFVWI1M2sVkogqXPQ2M7PBwwnDzMxyccIwM7NcnDDMzCwXJwwzM8vFCcPMzHJx\nwjAzs1ycMMzMLBcnDDMzy8UJw8zMcnHCMDOzXJwwzMwsFycMMzPLxQnDzMxyccIwM7NcnDDMzCwX\nJwwzM8ul0IQhaZSkmZJmS5oraVqDMlMldUialT2+2VN9xx5bZLRmZtabEUVWHhFrJH0kIlZJGg48\nIOmWiHiwrui9EXHUhup74IFi4jQzsw0rvEsqIlZli6NICarRF3Pn+r7ZdeuaFZWZmfVV4QlD0jBJ\ns4EXgTsi4qEGxQ6QNEfSTZL27KkuJwwzs/IMRAtjXUTsA+wA7N8gITwMTIqIvYEfAjN6rqu4OM3M\nrHeFXsOoFRGvSbobOAx4vGb7iprlWyRdJGlCRLxaX8eKFdOZPj0tt7W10dbWVnTYZmaV0t7eTnt7\neyF1Kwr82C5pK2BtRCyTtClwG3B+RNxcU2briFiSLe8HXBsROzaoK+69NzjooMLCNTMbdCQREbmu\nE29I0S2MbYHLJQ0jdX9dExE3SzoFiIi4BDhG0qnAWuB14LieKnOyMDMrT6EtjGaSFFWJ1cysVTSz\nheE7vc3MLBcnDDMzy8UJw8zMcqlUwvjYx8qOwMxs6KrURe/hw4M33yw7EjOz6hiyF709NYiZWXkq\nlTAq0hgyMxuUKpUwwEnDzKwslUoYw4a5W8rMrCyVShi/+hWoKZduzMysryo1SqoqsZqZtYohO0rK\nzMzK44RhZma5OGGYmVkuThhmZpZLpRLGkUfC8uVlR2FmNjRVKmE88ACsXVt2FGZmQ1OlEoZv3DMz\nK48ThpmZ5VJowpA0StJMSbMlzZU0rYdyF0paKGmOpL17qs8Jw8y
sPCOKrDwi1kj6SESskjQceEDS\nLRHxYGcZSYcDO0fErpL2By4GpjSqzwnDzKw8hXdJRcSqbHEUKUHVz+9xNHBFVnYmMFbS1o3quu46\n2HLLoiI1M7PeFJ4wJA2TNBt4EbgjIh6qK7I9sKhmfXG2rZsDD4RRo4qJ08zMeldolxRARKwD9pE0\nBpghac+IeLw/dU2fPv3t5ba2Ntra2poSo5nZYNHe3k57e3shdQ/obLWSvgWsjIh/qdl2MXB3RFyT\nrS8ApkbEkrrnerZaM7M+qsxstZK2kjQ2W94UOARYUFfsRuDTWZkpQEd9sjAzs/IV3SW1LXC5pGGk\n5HRNRNws6RQgIuKSbP0ISU8CK4GTC47JzMz6oVJfoHTSScF3vgOTJpUdjZlZNVSmS6rZHnoIVq4s\nOwozs6GpUgnDN+6ZmZWnUglDcsIwMytLpRLGsGFQkUsuZmaDTuUShlsYZmblqNQoqQcfDHbfHUaP\nLjsaM7NqaOYoqUoljKrEambWKobssFozMyuPE4aZmeXihGFmZrk4YZiZWS6VShinnw6PPFJ2FGZm\nQ1OlEsYjj0BHR9lRmJkNTZVKGL5xz8ysPJVKGJ5LysysPJVKGJ5LysysPJVLGG5hmJmVo1JTg8yZ\nE0ycCBMmlB2NmVk1eC4pMzPLpTJzSUnaQdJdkuZJmivp9AZlpkrqkDQre3yzyJjMzKx/RhRc/5vA\nVyNijqQtgIcl3R4RC+rK3RsRRxUci5mZbYRCWxgR8WJEzMmWVwDzge0bFG1Kc8nMzIozYKOkJO0I\n7A3MbLD7AElzJN0kac+BisnMzPIruksKgKw76nrgjKylUethYFJErJJ0ODAD2K1RPVOnTmennWDH\nHaGtrY22trYiwzYzq5z29nba29sLqbvwUVKSRgC/BG6JiAtylH8a+KOIeLVuexxzTPCJT8CxxxYU\nrJnZIFOZUVKZnwCP95QsJG1ds7wfKYm92qisb9wzMytPoV1Skg4ETgTmSpoNBPANYDIQEXEJcIyk\nU4G1wOvAcT3X54RhZlaWQhNGRDwADN9AmR8BP8pTn+eSMjMrj+eSMjOzXCo1Nci8ecGECbDNNmVH\nY2ZWDZ5LyszMcqnaKCkzMxsEnDDMzCwXJwwzM8vFCcPMzHKpVML4/vdhxoyyozAzG5oGZPLBZnny\nSRg5suwozMyGpkq1MHzjnplZeZwwzMwsl8olDN+7Z2ZWjsolDLcwzMzKUampQRYuDEaNgokTy47G\nzKwaPJeUmZnl4rmkzMxswOVKGJLOkDRGyaWSZkk6tOjgzMysdeRtYXwuIl4DDgXGAycB5xcWlZmZ\ntZy8CaOz/+sI4MqImFezzczMhoC8CeNhSbeTEsZtkkYDGxzgKmkHSXdJmidprqTTeyh3oaSFkuZI\n2run+i6/HC67LGfEZmbWVHnnkvo8sDfwVESskjQBODnH894EvhoRcyRtQZZ4ImJBZwFJhwM7R8Su\nkvYHLgamNKrsmWfgzTdzRmxmZk2Vt4VxAPBERHRI+hTwTWDZhp4UES9GxJxseQUwH9i+rtjRwBVZ\nmZnAWElbNwzWN+6ZmZUmb8L4MbBK0geAvwJ+R/Ymn5ekHUmtlJl1u7YHFtWsL6Z7UsnqcMIwMytL\n3i6pNyMiJB0N/DAiLpX0+bwHybqjrgfOyFoa/XLXXdNZswamT4e2tjba2tr6W5WZ2aDU3t5Oe3t7\nIXXnutNb0j3ArcDngIOAl4BHIuJ9OZ47AvglcEtEXNBg/8XA3RFxTba+AJgaEUvqysX55wevvgrf\n/e6GX5iZmTX3Tu+8LYzjgE+S7sd4UdIk4J9yPvcnwOONkkXmRuA04BpJU4CO+mTR6fjjfdHbzKws\nueeSyi5EfyhbfTAiXsrxnAOBe4G5QGSPbwCTgYiIS7JyPwQOA1YCJ0fErAZ1eS4pM7M+GvDJByV9\ngtSiaCfdsHcQ8PWIuL4ZQeT
hhGFm1ndlJIxHgEM6WxWS3gncGREfaEYQeThhmJn1XRmz1Q6r64J6\npQ/PNTOzQSDvRe9bJd0G/CxbPw64uZiQzMysFfXlovdfAAdmq/dFxA2FRdX4+DFjRvDUU3DmmQN5\nZDOz6ipjWC0R8V/AfzXjoP31wgvwxBNlRmBmNnT1mjAkLScNhe22izQsdkwhUfXAc0mZmZWn14QR\nEaMHKpA8PJeUmVl5KjXSadgw8MhaM7NyVC5huIVhZlaO3KOkyiYpnn8+6OiAPfYoOxozs2oY8Du9\nW4Hv9DYz67sy7vQ2M7MhzgnDzMxyccIwM7NcnDDMzCyXSiWMe+6Bc88tOwozs6GpUgnjlVfgscfK\njsLMbGiqVMLwjXtmZuWpVMLwXFJmZuUpNGFIulTSEkmP9rB/qqQOSbOyxzd7q89zSZmZlSf392H0\n02XAD4Areilzb0Qclacyd0mZmZWn0IQREfdLmryBYrlvWZ8yBSZO3MigzMysX1rhGsYBkuZIuknS\nnr0V3HJLeP/7ByosMzOrVXSX1IY8DEyKiFWSDgdmALv1VHj69OlvL7e1tdHW1lZ0fGZmldLe3k57\ne3shdRc+W23WJfWLiNhg20DS08AfRcSrDfZ5tlozsz6q2my1oofrFJK2rlnej5TAuiULMzMrX6Fd\nUpKuAtqALSU9C0wDRgIREZcAx0g6FVgLvA4cV2Q8ZmbWf5X6AqU5c4JLL4ULLyw7GjOzaqhal1TT\nLF8Os2aVHYWZ2dBUqYThG/fMzMpTqYThuaTMzMpTqYThuaTMzMpTuYThFoaZWTkqNUrqtdeCJ56A\nD36w7GjMzKqhmaOkKpUwqhKrmVmrGLLDas3MrDxOGGZmlosThpmZ5eKEYWZmuVQqYSxaBJ/6VNlR\nmJkNTZVKGKtXw8yZZUdhZjY0VSph+MY9M7PyVCpheC4pM7PyVCpheC4pM7PyVC5huIVhZlaOSiWM\nd70Lrrmm7CjMzIYmzyVlZjaIVWYuKUmXSloi6dFeylwoaaGkOZL2LjIeMzPrv6K7pC4D/qSnnZIO\nB3aOiF2BU4CLC47HzMz6qdCEERH3A0t7KXI0cEVWdiYwVtLWRcZkZmb9U/ZF7+2BRTXri7NtZmbW\nYkaUHUBfnH32dK6/Hk48Edra2mhrays7JDOzltLe3k57e3shdRc+SkrSZOAXEfH+BvsuBu6OiGuy\n9QXA1IhY0qBsdHQEkybBsmWFhmxmNmhUZpRURtmjkRuBTwNImgJ0NEoWnXzjnplZeQrtkpJ0FdAG\nbCnpWWAaMBKIiLgkIm6WdISkJ4GVwMm91+eEYWZWlkrduLdyZbDVVrBqVdnRmJlVQ9W6pJrGXVJm\nZuWpVMIYORJuv73sKMzMhqZKdUlVJVYzs1YxZLukzMysPE4YZmaWixOGmZnl4oRhZma5VC5hfOxj\nsHZt2VGYmQ09lRslNWoUvPYajBpVdkRmZq1vSI+S8s17ZmblqFzC6JxP6s474Xvfg6eeKjsiM7Oh\noXIJY9gwWLMGvvAFeOghOOAA2Gcf+Pa3YfXqsqMzMxu8Kpkwli2DU06Bq6+G55+HCy9MSWTkyLKj\nMzMbvCp30fu++1KrYkSlvivQzKwczbzoXbmE0Ve33gp77AGTJxcQlJlZixvSo6T66rHHYN994Ywz\n4KWXyo7GzKy6Bn3C+NrX4PHH0/Iee8C3vgUdHeXGZGZWRYO+S6rW738P06eni+Y33NB9f0Qatmtm\nNlhU6hqGpMOAfyW1Zi6NiO/W7Z8K/BzovKPivyPiOw3qadr3YfSUGC6/HM46K7VE9tgDdt89/Xz/\n++Fd72rKoc3MBlRlEoakYcBvgYOB54GHgOMjYkFNmanAX0XEURuoq/AvUIqA556D+fNhwYL0c/58\nOOQQOOec7uVXrIBNNvGILTNrXc1MGEW/1e0HLIyIZwAkXQ0cDSyoK9cSHUESTJyYHoceuuHyF
1yQ\nbhicNAl23RV22SU9Dj0U3vOe4uM1MxtIRV/03h5YVLP+XLat3gGS5ki6SdKeBcfUNOecky6g//zn\n8KUvpaG78+enayWN3HMP3HQTPPEEvPHGgIZqZrbRWqEz5WFgUkSsknQ4MAPYreSYcttkk65rHhsy\nfz7MmAELF6aur+23Ty2S73wH9tuv+FjNzDZG0QljMTCpZn2HbNvbImJFzfItki6SNCEiXq2vbPr0\n6W8vt7W10dbW1ux4C/WlL6UHpBbGM8/Ak0+mLrBGzjortWBqu7ve/W7YdNOBi9nMqqW9vZ329vZC\n6i76ovdw4AnSRe8XgAeBEyJifk2ZrSNiSba8H3BtROzYoK7CL3q3mvvvh0ceSUml8/H00/Dgg2nk\nVr033vB8Wma2vsqMkoK3h9VeQNew2vMlnQJERFwi6TTgVGAt8DpwZkTMbFDPkEsYjbz1Vro4P6zB\n1adddoHXX+9qiUyYAOPGwZlnwhZbdC+/enX6Iirfe2I2eFUqYTSLE8aGvfUWLF7c1RJZujQ9zjkH\nNtuse/ltt4VXX4Xx41NiGTcuLV97LYwe3b38vffC5pt3lR03DoYPL/51mVn/OWFY06xena6TdCaX\njo40LLjRvSVHHAFLlnSVW7YsJZAlSxpfV/mHf0j7OxNS588992zcQjKz5nPCsJawbh0sXw5jxnTv\n1oqAc8/tSkYdHV2PRx7p3jKJSAlp7Nj1k8v48fDFL7rbzKy/nDBs0Fm3Dm6/vSu5dP5csQIuuqh7\n+TVrYKut1u9KGz8+bbv00sb1z5vXVXbzzZ2EbGhwwrAhLyK1buoTzMqVcOKJ3cuvWJG+eKuz3Jo1\nqTUzcSLMnt29/KpVaabjUaPWf4wdC6ed1r382rVw331d5TbZJP3cdFPYYYfmv36zvKo0NYhZIaTU\nFTZmTL4vx9piC5g7t2v9jTdS4li+vOf699orJZbOx/LlaRRaI6tWpWliOsuuXp1+br556oKr98or\nsNtu3RPS1lunlla9FSsaJ7Bx4+DLX+5e/o030iCF2uTVmcAmTepe3iwPtzDMSrBuXRqhVpuQ1qxJ\n2/fdt3v5VavSbMr15UeOhGnTupdfuhSOOWb95LVmTRr99vDD3csvWZJmZ65PSNtu2ziBLVsGX/96\n9/ITJjROYGvWpKlx6hPYZpv52zCL5i4pM2uqdeu6uurqE9jee3cvv3Il/PSn3RPSJps0ntn5lVfg\n+OO71z9uHPz6193LP/98mm6nPiFNnJi+drne0qWpBVabjEaNSte0GiWw11/vSmC1j803h5126vv5\na2VOGGY2qK1bB6+91j3BRMD73te9/IoVcPXV6yevNWtSF9xZZ3Uv/9JLcNJJ3evfcsvUlVfv2WdT\nF2V9gpk8uXEC+8MfUgusM4GNGQPnnbfx56U/nDDMzAbQunUpKTVKYHvt1b388uVw3XVdCUyCr3xl\n4OMGJwwzM8upmQnD99uamVkuThhmZpaLE4aZmeXihGFmZrk4YZiZWS5OGGZmlosThpmZ5eKEYWZm\nuThhmJlZLoUnDEmHSVog6beS/qaHMhdKWihpjqQGU52ZmVnZCk0YkoYBPwT+BNgLOEHS7nVlDgd2\njohdgVOAi4uMqZna29vLDqGhVozLMeXjmPJrxbhaMaZmKrqFsR+wMCKeiYi1wNXA0XVljgauAIiI\nmcBYSVsXHFdTtOofRyvG5ZjycUz5tWJcrRhTMxWdMLYHFtWsP5dt663M4gZlzMysZL7obWZmuRQ6\nvbmkKcD0iDgsWz8LiIj4bk2Zi4G7I+KabH0BMDUiltTV5bnNzcz6oVnTm49oRiW9eAjYRdJk4AXg\neOCEujI3AqcB12QJpqM+WUDzXrCZmfVPoQkjIt6S9GXgdlL316URMV/SKWl3XBIRN0s6QtKTwErg\n5CJjMjOz/qnMN+6ZmVm5KnHRO8/Nf0081qWSlkh6tGbbe
Em3S3pC0m2SxtbsOzu76XC+pENrtu8r\n6dEs5n/dyJh2kHSXpHmS5ko6vey4JI2SNFPS7CymaWXHVFPfMEmzJN3YCjFJ+r2kR7Jz9WCLxDRW\n0nXZMeZJ2r8FYtotO0ezsp/LJJ3eAnGdKemxrL6fShrZAjGdkf3fDez7QUS09IOU1J4EJgPvAOYA\nuxd4vA8DewOP1mz7LvDX2fLfAOdny3sCs0ldeztmcXa22mYCH8qWbwb+ZCNi2gbYO1veAngC2L0F\n4tos+zkc+A3pvptSY8rqOBP4T+DGFvn9PQWMr9tWdkz/AZycLY8AxpYdU118w4DngYllxgVsl/3+\nRmbr1wCfKTmmvYBHgVGk/73bgZ0HIqaN/sUW/QCmALfUrJ8F/E3Bx5zM+gljAbB1trwNsKBRLMAt\nwP5Zmcdrth8P/LiJ8c0APtYqcQGbAf8LfKjsmIAdgDuANroSRtkxPQ1sWbettJiAMcDvGmxvib+n\nrK5DgfvKjouUMJ4BxpPecG8s+38POAb495r1bwJfB+YXHVMVuqTy3PxXtHdFNnIrIl4E3tVDbJ03\nHW5PirNT02KWtCOpBfQb0h9HaXFlXT+zgReBOyLiobJjAr5P+uepvThXdkwB3CHpIUlfaIGYdgL+\nIOmyrPvnEkmblRxTveOAq7Ll0uKKiOeB7wHPZvUvi4g7y4wJeAw4KOuC2gw4gtQSKzymKiSMVlTK\nSAFJWwDXA2dExIoGcQxoXBGxLiL2IX2q30/SXmXGJOnjwJKImAP0Ngx7oH9/B0bEvqR/7NMkHdQg\nhoGMaQSwL/CjLK6VpE+hpf49dZL0DuAo4Loe4hjIv6lxpOmLJpNaG5tLOrHMmCJiAan76Q5SN9Js\n4K1GRZt97CokjMXApJr1HbJtA2mJsvmtJG0DvFQT28QGsfW0vd8kjSAliysj4uetEhdARLwGtAOH\nlRzTgcBRkp4CfgZ8VNKVwItlnqeIeCH7+TKpO3E/yj1PzwGLIuJ/s/X/IiWQlvh7Ag4HHo6IP2Tr\nZcb1MeCpiHg1It4CbgD+T8kxERGXRcQHI6IN6CBd1yw8piokjLdv/pM0ktTPdmPBxxTrf0K9Efhs\ntvwZ4Oc124/PRk3sBOwCPJg1B5dJ2k+SgE/XPKe/fkLqb7ygFeKStFXnKAxJmwKHkPpQS4spIr4R\nEZMi4t2kv5O7IuIk4BdlxSRps6xliKTNSX3zcyn3PC0BFknaLdt0MDCvzJjqnEBK+J3KjOtZYIqk\nTbK6DgYeLzkmJL0z+zkJ+L+k7rviY2rGBaqiH6RPrk8AC4GzCj7WVaTRGWtIfywnky543ZnFcDsw\nrqb82aRRB/OBQ2u2/xHpjWEhcMFGxnQgqck5h9T8nJWdkwllxQW8L4tjDmnExjnZ9tJiqotvKl0X\nvcs8TzukDqQVAAADuklEQVTV/N7mdv79ln2egA+QPozNAf6bNEqq9N8daQDFy8Domm1ln6tpWf2P\nApeTRmuWHdO9pGsZs4G2gTpPvnHPzMxyqUKXlJmZtQAnDDMzy8UJw8zMcnHCMDOzXJwwzMwsFycM\nMzPLxQnDWo6kdZKuqFkfLulldU1XfqSkv95AHdtKujZb/oykH/QxhrNzlLlM0p/3pd5mknS3pH3L\nOr4NPU4Y1opWAu+VNCpbP4SaydMi4hcR8Y+9VRARL0TEJ2o39TGGb/SxfKVIGl52DFY9ThjWqm4G\nPp4trzdVRG2LIfuUf4GkByQ92fmJP5tKZm5NfZOyT+RPSDq3pq4bsllk53bOJCvpH4BNs5lcr8y2\nfVpdX4J0eU29U+uPXSuL43GlGWEfk3RrZyKsbSFI2lLS0zWv7walL8N5StJpSl/iM0vS/yhNiNfp\n01lMj0r6UPb8zZS+COw3kh6WdGRNvT+X9CvSHcFmfeKEYa0ogKuBE7I31/eTvuilvkynbSLiQOBI\n0iyejcp8iDTnzgeAY
2u6ck6OiA9l+8+QND4izgZWRcS+EXGSpD1JLY62SLPznpHj2LV2AX4QEe8F\nlgF/0cvr7rQX8GekiQr/DlgRaWbZ35Dm/Om0aRbTaaT5xgDOAX4VEVOAjwL/nM33BbAP8OcR8ZEe\nYjDrkROGtaSIeIz07WAnADfR+3TlM7LnzKfrOwDq3RERHRGxmjR30oez7V+RNIf0RrwDsGu2vfZ4\nHwWui4il2XE6+njspyOis7XzcPa6NuTuiFgVacbWDuCX2fa5dc//WXb8+4DRksaQJjg8S+m7StqB\nkXTN+HxHRCzLcXyzbkaUHYBZL24E/on07Xlb9VJuTc1yT4ml2/cXSJpKSgb7R8QaSXcDm/QxxjzH\nri3zVs0x3qTrQ1v9cWufEzXr61j//7bR9zII+IuIWFi7Q9IU0vUhs35xC8NaUecb70+Av42Ief14\nbr1DJI3Lumb+DHiANEPr0ixZ7E76OuBOb9RcGL6L1I01AUDS+D4eu6ftvwc+mC0f20OZDTkui+nD\npG+DWw7cBpz+9sGlvftZt9l6nDCsFQVARCyOiB/mKdvLeqcHSV1Rc0jdS7OAW4F3SJoH/D3w65ry\nlwBzJV0ZEY9n++/Junm+18dj97T9n4FTJT1Mmpq6J73Vu1rSLOAi4HPZ9m+TXtejkh4DzuulbrPc\nPL25mZnl4haGmZnl4oRhZma5OGGYmVkuThhmZpaLE4aZmeXihGFmZrk4YZiZWS5OGGZmlsv/B62V\nz6qUkuh8AAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from collections import defaultdict\n", "\n", "plotdata = defaultdict(list)\n", "for d in train_summary['updates']:\n", " plotdata[\"samples\"].append(d[\"samples\"])\n", " plotdata[\"loss\"].append(d[\"loss\"])\n", "\n", "# Plot the data \n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "plt.plot(plotdata[\"samples\"], plotdata[\"loss\"], 'b--')\n", "plt.xlabel(\"Minibatch number\")\n", "plt.ylabel(\"loss\")\n", "plt.title('Minibatch run vs. Training loss')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feed NumPy data for sequences\n", "\n", "Many real world data involving a notion of time (such as stock market data) or notion of state (such as words in a sentence or utterances in a speech recording) present data in the form of sequences. Sequences are often modelled using recurrent networks (RNNs). CNTK makes it really easy to model such networks and requires one to specify their input variables as `C.sequence.input` instead of `C.input_variables`. 
[Introduction to sequence modeling](https://cntk.ai/pythondocs/sequence.html#sequence-classification) and [feeding sequences with NumPy](https://cntk.ai/pythondocs/sequence.html#feeding-sequences-with-numpy) provide details on how to model sequence data in CNTK. \n", "\n", "## B. Large data (not in memory)\n", "\n", "In deep learning, the data often does not fit into memory (RAM). For this case, CNTK provides the `MinibatchSource` class.\n", "\n", "### B.1 Feeding data using the `MinibatchSource` class \n", "\n", "The `MinibatchSource` class provides:\n", "\n", " * A **chunked randomization algorithm** that holds only part of the data in RAM at any given time.\n", " * **Distributed reading** where each worker reads a different subset.\n", " * A **transformation pipeline** for images and image augmentation.\n", " * **Composability** across multiple data types (e.g. image captioning).\n", " * Transparent **asynchronous loading** so that the GPU does not stall while a minibatch is read/prepared.\n", "\n", "At present, the `MinibatchSource` class implements a limited set of data types in the form of \"deserializers\":\n", "\n", " * **Text/Basic**: `CTFDeserializer`\n", " * **Image**: `ImageDeserializer`.\n", " * **Speech**: `HTKFeatureDeserializer`, `HTKMLFDeserializer`. \n", "\n", "We will briefly discuss the **Image** and **Speech** deserializers towards the end of this notebook. In this section, however, we focus on using the `MinibatchSource` with the **CNTK text format (CTF)**, which consists of a set of named feature channels, each containing a one-dimensional sparse or dense sequence per example. The `CTFDeserializer` then associates each feature channel with an input of your model or criterion. A utility of great value here is the [Text to CTF converter](https://github.com/Microsoft/CNTK/blob/master/Scripts/txt2ctf.py).\n", "\n", "Here we will read MNIST data using CTF deserializers. 
In addition to the [CTF format API](https://www.cntk.ai/pythondocs/_modules/cntk/io.html#CTFDeserializer) documentation, the format with respect to MNIST data and the deserialization features are illustrated in the [CNTK 103B tutorial](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_103B_MNIST_LogisticRegression.ipynb). We reuse the code from this tutorial to demonstrate the use of CTF file readers.\n", "\n", "#### MNIST data reading using CTF deserializer\n", "\n", "In this notebook we use the MNIST data downloaded using the CNTK_103A_MNIST_DataLoader notebook. The dataset has 60,000 training images and 10,000 test images with each image being 28 x 28 pixels. Thus, the number of features is equal to 784 (= 28 x 28 pixels), 1 per pixel. The variable `mnist_num_output_classes` is set to 10, corresponding to the number of digits (0-9) in the dataset.\n", "\n", "The data is in the following format:\n", "\n", " |labels 0 0 0 1 0 0 0 0 0 0 |features 0 0 0 0 ... \n", " (784 integers each representing a pixel)\n", " \n", "In this tutorial we are going to use the image pixels corresponding to the integer stream named \"features\". We define a `create_reader` function to read the training and test data using the [CTF deserializer](https://cntk.ai/pythondocs/cntk.io.html?highlight=ctfdeserializer#cntk.io.CTFDeserializer). The labels are [1-hot encoded](https://en.wikipedia.org/wiki/One-hot). Refer to the CNTK 103A tutorial for a visual description of the one-hot representation.\n", "\n", "Note: One can directly convert the input `features` stream (a 784-dimensional vector) from the CTF reader to another input shaped as, say, a (1, 28, 28) image; refer to the [CNTK 103D tutorial](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_103D_MNIST_ConvolutionalNeuralNetwork.ipynb) for usage."
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data directory is ..\\Examples\\Image\\DataSets\\MNIST\n" ] } ], "source": [ "# Ensure the training and test data is generated and available for this tutorial.\n", "# We search in two locations in the toolkit for the cached MNIST data set.\n", "data_found = False\n", "\n", "for data_dir in [os.path.join(\"..\", \"Examples\", \"Image\", \"DataSets\", \"MNIST\"),\n", " os.path.join(\"data\", \"MNIST\")]:\n", " train_file = os.path.join(data_dir, \"Train-28x28_cntk_text.txt\")\n", " test_file = os.path.join(data_dir, \"Test-28x28_cntk_text.txt\")\n", " if os.path.isfile(train_file) and os.path.isfile(test_file):\n", " data_found = True\n", " break\n", " \n", "if not data_found:\n", " raise ValueError(\"Please generate the data by completing CNTK 103 Part A\")\n", " \n", "print(\"Data directory is {0}\".format(data_dir))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Define the data dimensions\n", "mnist_input_dim = 784\n", "mnist_num_output_classes = 10\n", "\n", "# Read a CTF formatted text (as mentioned above) using the CTF deserializer from a file\n", "def create_reader(path, is_training, input_dim, num_label_classes):\n", " \n", " labelStream = C.io.StreamDef(field='labels', shape=num_label_classes, is_sparse=False)\n", " featureStream = C.io.StreamDef(field='features', shape=input_dim, is_sparse=False)\n", " \n", " deserializer = C.io.CTFDeserializer(path, C.io.StreamDefs(labels = labelStream, features = featureStream))\n", " \n", " return C.io.MinibatchSource(deserializer,\n", " randomize = is_training, max_sweeps = C.io.INFINITELY_REPEAT if is_training else 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note, the class `cntk.io.MinibatchSource` uses CTFDeserializer, enables randomization and specifies number of sweeps to be 
made through the data. In deep learning, we take small randomly sampled instances of data and update the parameters of the model in an iterative manner. Making multiple passes through the data often improves the performance of a model as it learns to cope with different instances of randomly sampled data. This makes specification of the number of sweeps important. In this case, since the data set is relatively small, we set the value to `cntk.io.INFINITELY_REPEAT` when training. If you are faced with very large data sets and CPU memory pressure is high, reducing the size of the randomization window in chunks (`randomization_window_in_chunks`) can help reduce memory usage. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Create a MNIST LR model\n", "## Define a small model function\n", "\n", "x = C.input_variable(mnist_input_dim)\n", "y = C.input_variable(mnist_num_output_classes)\n", "model = C.layers.Dense(mnist_num_output_classes, activation=None)\n", "z = model(x/255.0) #scale the input to 0-1 range\n", "loss = C.cross_entropy_with_softmax(z, y)\n", "learner = C.sgd(z.parameters,\n", " C.learning_parameter_schedule(0.05))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Feeding data into `training_session`\n", "\n", "A training session wraps several pieces of functionality, such as a minibatch source, a trainer and optionally a validation minibatch source. Additionally, it provides consistent checkpointing and reporting of training progress at a specified frequency. In the code below, we use `training_session` to accept data from the reader (using `cntk.io.MinibatchSource`). \n", "\n", "Note: The current mechanism for validation is mistakenly documented as cross-validation. In practical deep learning applications, performing a k-fold cross validation can be expensive and time consuming. 
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " average since average since examples\n", " loss last metric last \n", " ------------------------------------------------------\n", "Learning rate per minibatch: 0.05\n", " 2.33 2.33 0 0 64\n", " 2.31 2.3 0 0 192\n", " 2.22 2.16 0 0 448\n", " 2.05 1.9 0 0 960\n", " 1.81 1.58 0 0 1984\n", " 1.48 1.16 0 0 4032\n", " 1.16 0.842 0 0 8128\n", " 0.898 0.639 0 0 16320\n", " 0.701 0.504 0 0 32704\n", " 0.561 0.422 0 0 65472\n", " 0.465 0.369 0 0 131008\n", " 0.398 0.33 0 0 262080\n", " 0.35 0.303 0 0 524224\n" ] } ], "source": [ "reader_train = create_reader(train_file, True, mnist_input_dim, mnist_num_output_classes)\n", "minibatch_size = 64\n", "input_map = {x : reader_train.streams.features, y : reader_train.streams.labels }\n", "num_samples_per_sweep = 60000\n", "num_sweeps_to_train_with = 10\n", "\n", "\n", "progress_writer = C.logging.ProgressPrinter(0)\n", "trainer = C.Trainer(z, (loss, None), learner, progress_writer)\n", "\n", "C.train.training_session(\n", " trainer=trainer,\n", " mb_source = reader_train,\n", " mb_size = minibatch_size,\n", " model_inputs_to_streams = input_map,\n", " max_samples = num_samples_per_sweep * num_sweeps_to_train_with,\n", " ).train()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### B.2 Feeding data with full-control over each minibatch update \n", "\n", "This is the most granular method for feeding data into trainer. Note the `cntk.Trainer` class provides access to functions such as `previous_minibatch_loss_average` and `previous_minibatch_evaluation_average` to gain fine grain control over how the training proceeds." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " average since average since examples\n", " loss last metric last \n", " ------------------------------------------------------\n", "Learning rate per minibatch: 0.05\n", " 0.175 0.175 0 0 64\n", " 0.342 0.426 0 0 192\n", " 0.345 0.347 0 0 448\n", " 0.311 0.282 0 0 960\n", " 0.292 0.273 0 0 1984\n", " 0.282 0.274 0 0 4032\n", " 0.288 0.294 0 0 8128\n", " 0.292 0.296 0 0 16320\n", " 0.291 0.29 0 0 32704\n", " 0.287 0.283 0 0 65472\n", " 0.286 0.285 0 0 131008\n", " 0.284 0.282 0 0 262080\n", " 0.28 0.277 0 0 524224\n" ] } ], "source": [ "z1 = model(x/255.0) #scale the input to 0-1 range\n", "loss = C.cross_entropy_with_softmax(z1, y)\n", "learner = C.sgd(z1.parameters,\n", " C.learning_parameter_schedule(0.05))\n", "\n", "num_minibatches_to_train = (num_samples_per_sweep * num_sweeps_to_train_with) / minibatch_size\n", "\n", "progress_writer = C.logging.ProgressPrinter(0)\n", "trainer = C.Trainer(z1, (loss, None), learner, progress_writer)\n", "\n", "for i in range(0, int(num_minibatches_to_train)):\n", " \n", " # Read a mini batch from the training data file\n", " data = reader_train.next_minibatch(minibatch_size, input_map = input_map)\n", " \n", " trainer.train_minibatch(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## C. Custom data\n", "\n", "When built-in readers are not adequate and one is using CNTK's training session constructs (with MinibatchSource class) for scenarios such as distributed training, creating a custom minibatch source is a suggested option. The extension can be written in pure Python.\n", "\n", "We summarize the two steps for creating a [user defined minibatch source](https://cntk.ai/pythondocs/extend.html#user-defined-minibatch-sources). \n", "- Inherit base functionalities from `UserMinibatchSource` class. 
\n", "- Implement the `stream_infos()` and `next_minibatch()` methods.\n", "\n", "**Note**: Creating a custom `UserMinibatchSource` makes it easy to extend CNTK to accept any type of data. However, there are a couple of [details](https://cntk.ai/pythondocs/extend.html#user-defined-minibatch-sources) one needs to be aware of before choosing this option:\n", "\n", "- The user is responsible for randomization of the training data.\n", "- A few more steps (explained in the detailed documentation) are needed to adapt the user-defined reader to a multi-GPU scenario. \n", "\n", "Please go through the code provided in the user minibatch source documentation referred to above to help create your custom reader." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Different Data Formats\n", "\n", "Having covered the different mechanisms for passing text data to CNTK model trainers, the logical next question is how different types of data are supported. The section below introduces these data formats and a way to read each of them. In addition to the CTF text format, the following four data formats are covered:\n", "\n", "1. Comma separated values (CSV)\n", "2. Image \n", "3. Speech\n", "4. Composite (combination of data types)\n", "\n", "### Comma separated values (CSV)\n", "\n", "A large amount of the data used in machine learning is in either comma-separated value (CSV) or tab-separated value (TSV) format. The following data snippet shows 3 variables (variable 1 is a date-time; variables 2 and 3 are real numbers):\n", "\n", "```\n", "2013-12-07 15:00:00,1215.0,8615.0\n", "2013-12-07 15:30:00,881.5,9140.0\n", "2013-12-07 16:00:00,81.75,9330.0\n", "```\n", "\n", "Pandas is one of the popular frameworks for reading CSV / TSV data ([details](http://pandas-docs.github.io/pandas-docs-travis/io.html)). 
We illustrate the use of this IO framework, and the conversion of the input to NumPy arrays, in the [CNTK 106 B tutorial](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_106B_LSTM_Timeseries_with_IOT_Data.ipynb). If one needs to create new features or columns, leveraging Pandas data frames can save time. The [CNTK 104 tutorial](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_104_Finance_Timeseries_Basic_with_Pandas_Numpy.ipynb) shows how this can be done. \n", "\n", "A challenge often faced by data scientists is the need to handle large files which may not fit in memory. The Pandas [chunk-by-chunk reader](http://pandas-docs.github.io/pandas-docs-travis/io.html#iterating-through-files-chunk-by-chunk) provides the capability to read smaller chunks into memory from a larger file on disk. \n", "\n", "**Note**: Using Pandas with CNTK opens the door to reading data from a variety of text, binary and SQL based data sources.\n", "\n", "### Image\n", "\n", "Use of deep learning in computer vision has transformed the field. Consequently, efficient handling of large image data sets is key to solving many problems involving images.\n", "\n", "#### Map file\n", "\n", "An ImageDeserializer is very useful both for efficiency and for reducing the amount of code the user has to write. In order to use an ImageDeserializer, one needs to create a text file with 2 columns: the first contains the path to the image file and the second indicates the image class. Here is an example of the input map file:\n", "\n", "```\n", "c:\\data\\CIFAR-10\\test\\00002.png\t8\n", "c:\\data\\CIFAR-10\\test\\00003.png\t0\n", "c:\\data\\CIFAR-10\\test\\00004.png\t6\n", "```\n", "\n", "The first column in the map file is the fully qualified path to the image and the second column indicates the class label.\n", "\n", "It is often easier to work with zip files and relative paths. 
An alternative way to create the input map file is to use the following format, where the images are bundled in a zip file. Here the `...` syntax is equivalent to `AbsolutePath` + `/zipdirectory@/`. \n", "\n", "```\n", ".../test.zip@/test/00002.png\t8\n", ".../test.zip@/test/00003.png\t0\n", ".../test.zip@/test/00004.png\t6\n", "```\n", "\n", "When you have your own data set in different directories, a map file needs to be created with the aforementioned syntax.\n", "\n", "With image-based models, one typically shifts the data by subtracting the mean value of each pixel across the data set from the corresponding pixel in every image. Refer to the `saveMean` function in the [CNTK 201A tutorial](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_201A_CIFAR-10_DataLoader.ipynb).\n", "\n", "**Note**: The memory layout of data in CNTK is row major for the Python API (based on the usage format in other toolkits) and column major for C++. Images are represented in (number of channels, image height, image width) order. If one is preparing their own data or reading data from other sources that use a different order, such as (image height, image width, number of channels), the axes need to be rearranged to the (number of channels, image height, image width) order.\n", "\n", "#### Transforms \n", "\n", "In computer vision, one often augments the data set by transforming the input images using random cropping, color adjustments, mean subtraction, and scaling. 
CNTK provides these computations as built-in convenience options in the toolkit.\n", "\n", "Please refer to the [CNTK 201B tutorial](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_201B_CIFAR-10_ImageHandsOn.ipynb) for an end-to-end use of the deserializers.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Determine the data path for testing\n", "# Check for an environment variable defined in CNTK's test infrastructure\n", "envvar = 'CNTK_EXTERNAL_TESTDATA_SOURCE_DIRECTORY'\n", "def is_test(): return envvar in os.environ\n", "\n", "if is_test():\n", "    data_path = os.path.join(os.environ[envvar],'Image','CIFAR','v0','tutorial201')\n", "    data_path = os.path.normpath(data_path)\n", "else:\n", "    data_path = os.path.join('data', 'CIFAR-10')\n", "\n", "# model dimensions\n", "image_height = 32\n", "image_width = 32\n", "num_channels = 3\n", "num_classes = 10\n", "\n", "import cntk.io.transforms as xforms\n", "#\n", "# Define the reader for both training and evaluation action.\n", "#\n", "def create_reader(map_file, mean_file, train):\n", "    print(\"Reading map file:\", map_file)\n", "    print(\"Reading mean file:\", mean_file)\n", "\n", "    if not os.path.exists(map_file) or not os.path.exists(mean_file):\n", "        raise RuntimeError(\"This tutorial depends on the 201A tutorial, please run 201A first.\")\n", "\n", "    # transformation pipeline for the features has jitter/crop only when training\n", "    transforms = []\n", "    # train uses data augmentation (translation only)\n", "    if train:\n", "        transforms += [\n", "            xforms.crop(crop_type='randomside', side_ratio=0.8)\n", "        ]\n", "    transforms += [\n", "        xforms.scale(width=image_width, height=image_height, channels=num_channels, interpolations='linear'),\n", "        xforms.mean(mean_file)\n", "    ]\n", "    # deserializer\n", "    return C.io.MinibatchSource(C.io.ImageDeserializer(map_file, C.io.StreamDefs(\n", "        features = C.io.StreamDef(field='image', 
transforms=transforms), # first column in map file is referred to as 'image'\n", "        labels = C.io.StreamDef(field='label', shape=num_classes) # and second as 'label'\n", "    )))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading map file: c:\\Data\\CNTKTestData\\Image\\CIFAR\\v0\\tutorial201\\train_map.txt\n", "Reading mean file: c:\\Data\\CNTKTestData\\Image\\CIFAR\\v0\\tutorial201\\CIFAR-10_mean.xml\n", "Reading map file: c:\\Data\\CNTKTestData\\Image\\CIFAR\\v0\\tutorial201\\test_map.txt\n", "Reading mean file: c:\\Data\\CNTKTestData\\Image\\CIFAR\\v0\\tutorial201\\CIFAR-10_mean.xml\n" ] } ], "source": [ "# Create the train and test readers\n", "reader_train = create_reader(os.path.join(data_path, 'train_map.txt'), \n", " os.path.join(data_path, 'CIFAR-10_mean.xml'), True)\n", "reader_test = create_reader(os.path.join(data_path, 'test_map.txt'), \n", " os.path.join(data_path, 'CIFAR-10_mean.xml'), False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Image reading with PIL\n", "\n", "One could also use the PIL Python package to read images. An example use of this package can be found in the [CNTK 201B tutorial](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_201B_CIFAR-10_ImageHandsOn.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Speech\n", "\n", "CNTK provides flexible and efficient readers, `HTKFeatureDeserializer` and `HTKMLFDeserializer`, for acoustic features and labels. 
At the same time, they take care of various optimizations for reading from permanent storage devices or network storage, as well as asynchronous CPU and GPU prefetching, which results in a significant speed-up of model training.\n", "\n", "CNTK consumes Acoustic Model (AM) training data in HTK/MLF format and typically expects 3 input files:\n", "\n", "- SCP file with features: the SCP file contains the mapping of utterance IDs to the corresponding feature files.\n", "\n", "- MLF (master label file) with labels: MLF is a traditional format for representing the alignment of transcriptions to features. Even though the referenced MLF file contains phoneme boundaries, they are not needed during [CTC](ftp://ftp.idsia.ch/pub/juergen/icml2006.pdf) training and are ignored. For more details on feature/label formats, refer to a copy of the HTK book, e.g. [here](http://www1.icsi.berkeley.edu/Speech/docs/HTKBook3.2/) \n", "\n", "- States list file: this file contains the list of all labels (states) in the training set. The blank label, required by CTC, is located at the end of the file at index (line) 132, assuming 0-based indexing."
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Current directory C:\\repos\\CNTK\\Manual\n", "C:\\repos\\CNTK\\Tests\\EndToEndTests\\Speech\\Data\n", "C:\\repos\\CNTK\\Tests\\EndToEndTests\\Speech\\Data\\glob_0000.scp\n" ] } ], "source": [ "# Point to a data location\n", "data_dir = os.path.join(\"..\", \"Tests\", \"EndToEndTests\", \"Speech\", \"Data\")\n", "working_dir = os.getcwd()\n", "print(\"Current directory {0}\".format(os.getcwd()))\n", "\n", "\n", "print(os.path.realpath(data_dir))\n", "# os.path.realpath() always returns a non-empty string, so test for existence explicitly\n", "if not os.path.exists(data_dir):\n", " raise RuntimeError(\"The required data directory is missing\")\n", "\n", "\n", "# The types and dimensions of features/labels are application specific.\n", "# Here we use a low-dimensional feature and a small label set to keep the training set compact.\n", "feature_dimension = 33\n", "feature = C.sequence.input((feature_dimension))\n", "\n", "label_dimension = 133\n", "label = C.sequence.input((label_dimension))\n", "\n", "train_feature_filepath = os.path.realpath(os.path.join(data_dir, \"glob_0000.scp\"))\n", "train_label_filepath = os.path.realpath(os.path.join(data_dir, \"glob_0000.mlf\"))\n", "mapping_filepath = os.path.realpath(os.path.join(data_dir, \"state_ctc.list\"))\n", "\n", "print(train_feature_filepath)\n", " \n", "if os.path.exists(train_feature_filepath) and \\\n", " os.path.exists(train_label_filepath ) and \\\n", " os.path.exists(mapping_filepath):\n", " \n", " os.chdir(data_dir) # Input data in scp requires execution from a specific directory\n", " train_feature_stream = C.io.HTKFeatureDeserializer(\n", " C.io.StreamDefs(speech_feature = C.io.StreamDef(shape = feature_dimension, \n", " scp = train_feature_filepath)))\n", "\n", " train_label_stream = \\\n", " cntk.io.HTKMLFDeserializer(mapping_filepath, \n", " C.io.StreamDefs(speech_label = \\\n", " C.io.StreamDef(shape = 
label_dimension, \n", " mlf = train_label_filepath)), \n", " phoneBoundaries = True)\n", "\n", " train_data_reader = C.io.MinibatchSource([train_feature_stream, train_label_stream], frame_mode=False)\n", "\n", " train_input_map = {feature: train_data_reader.streams.speech_feature, \n", " label: train_data_reader.streams.speech_label}\n", " os.chdir(working_dir)\n", "else:\n", " raise RuntimeError(\"Missing data file\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Composite reader\n", "\n", "Finally, there are scenarios where one may need to combine readers into a composite reader. One such example is the detection of regions in an image. In this case there are images and the regions of interest (ROIs) in each image, which are provided in a text format. In such a situation one needs to combine an ImageDeserializer with a CTFDeserializer. Here is some sample pseudo code for reference. Details of such an example can be found in [A2_RunWithPyModel.py](https://github.com/Microsoft/CNTK/blob/master/Examples/Image/Detection/FastRCNN/A2_RunWithPyModel.py).\n", "\n", "\n", "\n", "The first step is to instantiate the image deserializer:\n", "\n", "```\n", " image_source = ImageDeserializer(map_file, StreamDefs(\n", " features = StreamDef(field='image', transforms=transforms)))\n", "```\n", " \n", "The second step is to instantiate the CTF deserializers for the ROIs and their labels:\n", "```\n", " roi_source = CTFDeserializer(roi_file, StreamDefs(\n", " rois = StreamDef(field=roi_stream_name, shape=rois_dim, is_sparse=False)))\n", " \n", " label_source = CTFDeserializer(label_file, StreamDefs(\n", " roiLabels = StreamDef(field=label_stream_name, shape=label_dim, is_sparse=False)))\n", "```\n", " \n", "Finally, create a composite reader from the three deserializers:\n", "\n", "```\n", " return MinibatchSource([image_source, roi_source, label_source], \n", " max_samples=sys.maxsize, \n", " randomize=data_set == \"train\")\n", "```\n", "\n", "The speech example above follows a similar pattern: the feature and label deserializers are combined in a single `MinibatchSource`."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }