cntk.io package

CNTK IO utilities.

Base64ImageDeserializer(filename, streams)[source]

Configures the image reader that reads base64 encoded images and corresponding labels from a file of the form:

[sequenceId <tab>] <numerical label (0-based class id)> <tab> <base64 encoded image>

As with the ImageDeserializer, the sequenceId prefix is optional and can be omitted.

Parameters:
  • filename (str) – file name of the input file dataset that contains images and corresponding labels
  • streams – any dictionary-like object that contains a mapping from stream names to StreamDef objects. Each StreamDef object configures an input stream.
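For illustration, a line in this format can be assembled with plain Python (no CNTK needed); the image bytes and class id below are stand-ins:

```python
import base64

def make_base64_line(label, image_bytes, sequence_id=None):
    """Build one Base64ImageDeserializer input line:
    [sequenceId <tab>] <0-based class id> <tab> <base64 encoded image>."""
    fields = [] if sequence_id is None else [str(sequence_id)]
    fields += [str(label), base64.b64encode(image_bytes).decode("ascii")]
    return "\t".join(fields)

# hypothetical image bytes; a real file would hold encoded JPEG/PNG data
line = make_base64_line(7, b"\x89PNG...", sequence_id=0)
print(line.split("\t")[1])  # the 0-based class id, here "7"
```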
CBFDeserializer(filename, streams={})[source]

Configures the CNTK binary-format deserializer.

Parameters:
  • filename (str) – file name containing the binary data
  • streams – any dictionary-like object that contains a mapping from stream names to StreamDef objects. Each StreamDef object configures an input stream.
CTFDeserializer(filename, streams)[source]

Configures the CNTK text-format reader that reads text-based files with lines of the form:

[Sequence_Id] (Sample)+

where:

Sample=|Input_Name (Value )*
Parameters:
  • filename (str) – file name containing the text input
  • streams – any dictionary-like object that contains a mapping from stream names to StreamDef objects. Each StreamDef object configures an input stream.
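For illustration, such lines can be produced with plain Python (no CNTK required); the stream names x and y are hypothetical, and a space is used as the field separator:

```python
def ctf_line(seq_id, streams):
    """Render one CTF line: [Sequence_Id] (|Input_Name (Value )*)+.
    'streams' maps (hypothetical) stream names to lists of values."""
    parts = [str(seq_id)]
    for name, values in streams.items():
        parts.append("|" + name + " " + " ".join(str(v) for v in values))
    return " ".join(parts)

# one sample of a dense feature stream 'x' and a sparse label stream 'y'
print(ctf_line(0, {"x": [1.0, 2.0, 3.0], "y": [0, 1]}))
# 0 |x 1.0 2.0 3.0 |y 0 1
```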
HTKFeatureDeserializer(streams)[source]

Configures the HTK feature reader that reads speech data from scp files.

Parameters:streams – any dictionary-like object that contains a mapping from stream names to StreamDef objects. Each StreamDef object configures a feature stream.
HTKMLFDeserializer(label_mapping_file, streams, phoneBoundaries=False)[source]

Configures an HTK label reader that reads speech data in the HTK MLF (Master Label File) format.

Parameters:
  • label_mapping_file (str) – path to the label mapping file
  • streams – any dictionary-like object that contains a mapping from stream names to StreamDef objects. Each StreamDef object configures a label stream.
  • phoneBoundaries (bool, defaults to False) – if phone boundaries should be considered (should be set to True for CTC training, False otherwise)
INFINITELY_REPEAT = 18446744073709551615

int – constant used to specify a minibatch scheduling unit to equal the size of the full data sweep.

ImageDeserializer(filename, streams)[source]

Configures the image reader that reads images and corresponding labels from a file of the form:

<full path to image> <tab> <numerical label (0-based class id)>

or:

sequenceId <tab> path <tab> label
Parameters:
  • filename (str) – file name of the map file that associates images to classes
  • streams – any dictionary-like object that contains a mapping from stream names to StreamDef objects. Each StreamDef object configures an input stream.
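A map file in the first form can be written with plain Python; the image paths and labels below are hypothetical:

```python
import os
import tempfile

# hypothetical (path, 0-based class id) pairs for the map file
entries = [("images/cat1.jpg", 0), ("images/dog1.jpg", 1)]

map_path = os.path.join(tempfile.mkdtemp(), "train_map.txt")
with open(map_path, "w", encoding="utf-8") as f:
    for path, class_id in entries:
        f.write("%s\t%d\n" % (path, class_id))  # <full path> <tab> <label>
```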
LatticeDeserializer(lattice_index_file, streams)[source]

Configures a lattice deserializer

Parameters:
  • lattice_index_file (str) – path to the file containing the list of lattice TOC (table of contents) files
  • streams – any dictionary-like object that contains a mapping from stream names to StreamDef objects. Each StreamDef object configures an input stream.
class MinibatchData(value, num_sequences, num_samples, sweep_end)[source]

Bases: cntk.cntk_py.MinibatchData, cntk.tensor.ArrayMixin

Holds a minibatch of input data. This is never directly created, but only returned by MinibatchSource instances.

as_sequences(variable=None)[source]

Convert the value of this minibatch instance to a sequence of NumPy arrays that have their masked entries removed.

Returns:a list of NumPy arrays if dense, otherwise a SciPy CSR array
data

Retrieves the underlying Value instance.

end_of_sweep

Indicates whether the data in this minibatch comes from a sweep end or crosses a sweep boundary (and as a result includes data from different sweeps).

is_sparse

Whether the data in this minibatch is sparse.

mask

The mask object of the minibatch. In it, 2 marks the beginning of a sequence, 1 marks a sequence element as valid, and 0 marks it as invalid.
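As a sketch of how these codes can be interpreted (the mask below is constructed by hand, not obtained from a reader):

```python
import numpy as np

# hand-made mask for 2 sequences padded to length 4:
# 2 = sequence begin, 1 = valid element, 0 = invalid (padding)
mask = np.array([[2, 1, 1, 1],
                 [2, 1, 0, 0]], dtype=np.int8)

seq_lengths = (mask > 0).sum(axis=1)  # valid entries per row
num_starts = int((mask == 2).sum())   # number of sequence beginnings
print(seq_lengths.tolist(), num_starts)  # [4, 2] 2
```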

num_samples

The number of samples in this minibatch

num_sequences

The number of sequences in this minibatch

shape

The shape of the data in this minibatch as tuple.

class MinibatchSource(deserializers, max_samples=cntk.io.INFINITELY_REPEAT, max_sweeps=cntk.io.INFINITELY_REPEAT, randomization_window_in_chunks=cntk.io.DEFAULT_RANDOMIZATION_WINDOW_IN_CHUNKS, randomization_window_in_samples=0, randomization_seed=0, trace_level=cntk.logging.get_trace_level(), multithreaded_deserializer=None, frame_mode=False, truncation_length=0, randomize=True, max_errors=0)[source]

Bases: cntk.cntk_py.MinibatchSource

Parameters:
  • deserializers (a single deserializer or a list) – deserializers to be used in the composite reader
  • max_samples (int, defaults to cntk.io.INFINITELY_REPEAT) – The maximum number of input samples (not ‘label samples’) the reader can produce. After this number has been reached, the reader returns empty minibatches on subsequent calls to next_minibatch(). max_samples and max_sweeps are mutually exclusive, an exception will be raised if both have non-default values.
  • max_sweeps (int, defaults to cntk.io.INFINITELY_REPEAT) – The maximum number of sweeps over the input dataset. After this number has been reached, the reader returns empty minibatches on subsequent calls to next_minibatch(). max_samples and max_sweeps are mutually exclusive, an exception will be raised if both have non-default values.
  • randomization_window_in_chunks (int, defaults to cntk.io.DEFAULT_RANDOMIZATION_WINDOW_IN_CHUNKS) – size of the randomization window in chunks, non-zero value enables randomization. randomization_window_in_chunks and randomization_window_in_samples are mutually exclusive, an exception will be raised if both have non-zero values.
  • randomization_window_in_samples (int, defaults to 0) – size of the randomization window in samples, non-zero value enables randomization. randomization_window_in_chunks and randomization_window_in_samples are mutually exclusive, an exception will be raised if both have non-zero values.
  • randomization_seed (int, defaults to 0) – initial randomization seed value (incremented every sweep when the input data is re-randomized).
  • trace_level (an instance of cntk.logging.TraceLevel) – the output verbosity level, defaults to the current logging verbosity level given by get_trace_level().
  • multithreaded_deserializer (bool) – specifies if the deserialization should be done on a single or multiple threads. Defaults to None, which is effectively “auto” (multithreading is disabled unless ImageDeserializer is present in the deserializers list). False and True explicitly turn multithreading off/on.
  • frame_mode (bool, defaults to False) – switches the frame mode on and off. If the frame mode is enabled the input data will be processed as individual frames ignoring all sequence information (this option cannot be used for BPTT, an exception will be raised if frame mode is enabled and the truncation length is non-zero).
  • truncation_length (int, defaults to 0) – truncation length in samples, non-zero value enables the truncation (only applicable for BPTT, cannot be used in frame mode, an exception will be raised if frame mode is enabled and the truncation length is non-zero).
  • randomize (bool, defaults to True) – Enables or disables randomization; use randomization_window_in_chunks or randomization_window_in_samples to specify the randomization range
  • max_errors (int, defaults to 0) – maximum number of errors in the dataset to ignore
current_position

Gets or sets the current position of the minibatch source on the global timeline.

Parameters:
  • getter (Dictionary) – returns the minibatch position on the global timeline.
  • setter (Dictionary) – accepts a position previously returned by the getter.
get_checkpoint_state()[source]

Gets the checkpoint state of the MinibatchSource.

Returns:A dict that has the checkpoint state of the MinibatchSource
next_minibatch(minibatch_size_in_samples, input_map=None, device=None, num_data_partitions=None, partition_index=None)[source]

Reads a minibatch that contains data for all input streams. The minibatch size is specified in terms of #samples and/or #sequences for the primary input stream; a value of 0 for #samples/#sequences means unspecified. If the size is specified in terms of both #sequences and #samples, the smaller of the two is taken. An empty map is returned when the MinibatchSource has no more data to return.

Parameters:
  • minibatch_size_in_samples (int) – number of samples to retrieve for the next minibatch. Must be > 0.
  • input_map (dict) – mapping of Variable to StreamInformation which will be used to convert the returned data.
  • device (DeviceDescriptor, defaults to None) – CNTK DeviceDescriptor
  • num_data_partitions – Used for distributed training, indicates into how many partitions the source should split the data.
  • partition_index (int, defaults to None) – Used for distributed training, indicates data from which partition to take.
Returns:

A mapping of StreamInformation to MinibatchData if input_map was not specified. Otherwise, the returned value will be a mapping of Variable to MinibatchData. When the maximum number of epochs/samples is exhausted, the return value is an empty dict.

Return type:

cntk.io.MinibatchData

restore_from_checkpoint(checkpoint)[source]

Restores the MinibatchSource state from the specified checkpoint.

Parameters:checkpoint (dict) – checkpoint to restore from
stream_info(name)[source]

Gets the description of the stream with the given name. Raises an exception if there is no stream with this name, or if there is more than one.

Parameters:name (str) – stream name to fetch
Returns:the StreamInformation for the given stream name.
stream_infos()[source]

Describes the streams this minibatch source produces.

Returns:A list of instances of StreamInformation
streams

Describes the streams this minibatch source produces.

Returns:A dict mapping input names to instances of StreamInformation
class MinibatchSourceFromData(data_streams, max_samples=18446744073709551615)[source]

Bases: cntk.io.UserMinibatchSource

This wraps in-memory data as a CNTK MinibatchSource object (aka “reader”), used to feed the data into a TrainingSession.

Use this if your data is small enough to be loaded into RAM in its entirety, and the data is already sufficiently randomized.

While CNTK allows user code to iterate through minibatches by itself and feed data minibatch by minibatch through train_minibatch(), the standard way is to iterate through data using a MinibatchSource object. For example, the high-level TrainingSession interface, which manages a full training including checkpointing and cross validation, operates on this level.

A MinibatchSource created as a MinibatchSourceFromData linearly iterates through the data provided by the caller as numpy arrays or scipy.sparse.csr_matrix objects, without randomization. The data is not copied, so if you want to modify the data while being read through a MinibatchSourceFromData, please pass a copy.

Example

>>> N = 5
>>> X = np.arange(3*N).reshape(N,3).astype(np.float32) # 5 rows of 3 values
>>> s = C.io.MinibatchSourceFromData(dict(x=X), max_samples=len(X))
>>> mb = s.next_minibatch(3) # get a minibatch of 3
>>> d = mb[s.streams['x']]
>>> d.data.asarray()
array([[ 0.,  1.,  2.],
       [ 3.,  4.,  5.],
       [ 6.,  7.,  8.]], dtype=float32)
>>> mb = s.next_minibatch(3) # note: only 2 left
>>> d = mb[s.streams['x']]
>>> d.data.asarray()
array([[  9.,  10.,  11.],
       [ 12.,  13.,  14.]], dtype=float32)
>>> mb = s.next_minibatch(3)
>>> mb
{}
>>> # example of a sparse input
>>> Y = np.array([i % 3 == 0 for i in range(N)], np.float32)
>>> import scipy.sparse
>>> Y = scipy.sparse.csr_matrix((np.ones(N,np.float32), (range(N), Y)), shape=(N, 2))
>>> s = C.io.MinibatchSourceFromData(dict(x=X, y=Y)) # also not setting max_samples -> will repeat
>>> mb = s.next_minibatch(3)
>>> d = mb[s.streams['y']]
>>> d.data.asarray().todense()
matrix([[ 0.,  1.],
        [ 1.,  0.],
        [ 1.,  0.]], dtype=float32)
>>> mb = s.next_minibatch(3) # at end only 2 sequences
>>> d = mb[s.streams['y']]
>>> d.data.asarray().todense()
matrix([[ 0.,  1.],
        [ 1.,  0.]], dtype=float32)
>>> # if we do not set max_samples, then it will start over once the end is hit
>>> mb = s.next_minibatch(3)
>>> d = mb[s.streams['y']]
>>> d.data.asarray().todense()
matrix([[ 0.,  1.],
        [ 1.,  0.],
        [ 1.,  0.]], dtype=float32)
>>> # values can also be GPU-side CNTK Value objects (if everything fits into the GPU at once)
>>> s = C.io.MinibatchSourceFromData(dict(x=C.Value(X), y=C.Value(Y)))
>>> mb = s.next_minibatch(3)
>>> d = mb[s.streams['y']]
>>> d.data.asarray().todense()
matrix([[ 0.,  1.],
        [ 1.,  0.],
        [ 1.,  0.]], dtype=float32)
>>> # data can be sequences
>>> import cntk.layers.typing
>>> XX = [np.array([1,3,2], np.float32),np.array([4,1], np.float32)]  # 2 sequences
>>> YY = [scipy.sparse.csr_matrix(np.array([[0,1],[1,0],[1,0]], np.float32)), scipy.sparse.csr_matrix(np.array([[1,0],[1,0]], np.float32))]
>>> s = cntk.io.MinibatchSourceFromData(dict(xx=(XX, cntk.layers.typing.Sequence[cntk.layers.typing.tensor]), yy=(YY, cntk.layers.typing.Sequence[cntk.layers.typing.tensor])))
>>> mb = s.next_minibatch(3)
>>> mb[s.streams['xx']].data.asarray()
array([[ 1.,  3.,  2.]], dtype=float32)
>>> mb[s.streams['yy']].data.shape # getting sequences out is messy, so we only show the shape
(1, 3, 2)
Parameters:
  • data_streams – name-value pairs
  • max_samples (int, defaults to cntk.io.INFINITELY_REPEAT) – The maximum number of samples the reader can produce. If inputs are sequences, and the different streams have different lengths, then each sequence counts with the maximum length. After this number has been reached, the reader returns empty minibatches on subsequent calls to next_minibatch().
Returns:

An implementation of a cntk.io.MinibatchSource that will iterate through the data.

get_checkpoint_state()[source]

Gets the checkpoint state of the MinibatchSource.

Returns:A Dictionary that has the checkpoint state of the MinibatchSource
Return type:cntk.cntk_py.Dictionary
next_minibatch(num_samples, number_of_workers=1, worker_rank=0, device=None)[source]
restore_from_checkpoint(checkpoint)[source]

Restores the MinibatchSource state from the specified checkpoint.

Parameters:checkpoint (Dictionary) – checkpoint to restore from
stream_infos()[source]
class StreamConfiguration(name, dim, is_sparse=False, stream_alias='', defines_mb_size=False)[source]

Bases: cntk.cntk_py.StreamConfiguration

Configuration of a stream in a text format reader.

Parameters:
  • name (str) – name of this stream
  • dim (int) – dimensions of this stream. A text format reader reads data as flat arrays. If you need different shapes you can reshape() it later.
  • is_sparse (bool, defaults to False) – whether the provided data is sparse (False by default)
  • stream_alias (str, defaults to '') – name of the stream in the file
  • defines_mb_size (bool, defaults to False) – whether this stream defines the minibatch size.
StreamDef(field=None, shape=None, is_sparse=False, transforms=None, context=None, scp=None, mlf=None, broadcast=None, defines_mb_size=False, max_sequence_length=65535)[source]
Configuration of a stream for use with the builtin Deserializers. The meanings of some configuration keys have a mild dependency on the exact deserializer, and certain keys are meaningless for certain deserializers.
Parameters:
  • field (str, defaults to None) –

    this is the name of the stream

    • for CTFDeserializer the name is inside the CTF file
    • for ImageDeserializer the acceptable names are image or label
    • for HTKFeatureDeserializer and HTKMLFDeserializer only the default value of None is acceptable
  • shape (int or tuple, defaults to None) – dimensions of this stream. HTKFeatureDeserializer, HTKMLFDeserializer, and CTFDeserializer read data as flat arrays. If you need different shapes you can reshape() it later.
  • is_sparse (bool, defaults to False) – whether the provided data is sparse. False by default, unless mlf is provided.
  • transforms (list, defaults to None) – list of transforms to be applied by the Deserializer. Currently only ImageDeserializer supports transforms.
  • context (tuple, defaults to None) – left and right context to consider when reading in HTK data. Only supported by HTKFeatureDeserializer.
  • scp (str or list, defaults to None) – scp files for HTK data
  • mlf (str or list, defaults to None) – mlf files for HTK data
  • broadcast (bool, defaults to None) – whether the features in this stream should be broadcast to the whole sequence (useful in e.g. ivectors with HTK)
  • defines_mb_size (bool, defaults to False) – whether this stream defines the minibatch size.
  • max_sequence_length (int, defaults to 65535) – the upper limit on the length of consumed sequences. Sequences longer than this limit are skipped.
class StreamInformation(name, stream_id, storage_format, dtype, shape, defines_mb_size=False)[source]

Bases: cntk.cntk_py.StreamInformation

Stream information container that is used to describe streams when implementing custom minibatch source through UserMinibatchSource.

Parameters:
  • name (str) – name of the stream
  • stream_id (int) – unique ID of the stream
  • storage_format (str) – ‘dense’ or ‘sparse’
  • dtype (NumPy type) – data type
  • shape (tuple) – shape of the elements
  • defines_mb_size (bool, defaults to False) – whether this stream defines the minibatch size when there are multiple streams.
name
class UserDeserializer[source]

Bases: cntk.cntk_py.SwigDataDeserializer

UserDeserializer is the base class for all user-defined deserializers: a plug-in to MinibatchSource for reading data in custom formats. To support a custom format, implement the public methods of this class and pass an instance to MinibatchSource. Reading data through this mechanism provides the following benefits:

  • randomization of data too large to fit into RAM, through CNTK’s chunked paging algorithm
  • distributed reading - only chunks needed by a particular worker are requested
  • composability of transforms (currently composability of user deserializers is not yet supported)
  • transparent support of sequence/frame/truncated BPTT modes
  • automatic chunk and minibatch prefetch
  • checkpointing

The MinibatchSource uses the information provided by this class to build the timeline and move along it when the next minibatch is requested. The deserializer itself, however, is stateless.

get_chunk(chunk_id)[source]

Should return a dictionary of stream name -> data of the chunk, where data is csr_matrix/numpy array in sample mode, or a list of csr_matrix/numpy array in sequence mode.

Parameters:chunk_id (int) – id of the chunk to be read, 0 <= chunk_id < num_chunks
Returns:dict containing the data
num_chunks()[source]

Should return the total number of chunks.

stream_infos()[source]

Should return a list of meta information StreamInformation about all streams exposed by the deserializer.

Returns:list of StreamInformation exposed by the deserializer
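The three-method contract can be sketched without the cntk base class, over an in-memory array (a structural illustration only; a real deserializer must subclass UserDeserializer, and stream_infos() must return StreamInformation objects):

```python
import numpy as np

class ArrayChunks:
    """Structural sketch of the UserDeserializer contract: exposes
    in-memory data as fixed-size chunks of samples."""
    def __init__(self, data, chunk_size):
        self._data = data              # dict: stream name -> numpy array
        self._chunk_size = chunk_size
        n = len(next(iter(data.values())))
        self._num_chunks = (n + chunk_size - 1) // chunk_size

    def stream_infos(self):
        # a real implementation returns cntk.io.StreamInformation objects
        return list(self._data.keys())

    def num_chunks(self):
        return self._num_chunks

    def get_chunk(self, chunk_id):
        # dict of stream name -> numpy array slice for this chunk
        assert 0 <= chunk_id < self._num_chunks
        lo = chunk_id * self._chunk_size
        return {name: a[lo:lo + self._chunk_size]
                for name, a in self._data.items()}

d = ArrayChunks({"x": np.arange(10, dtype=np.float32)}, chunk_size=4)
print(d.num_chunks(), d.get_chunk(2)["x"].tolist())  # 3 [8.0, 9.0]
```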
class UserMinibatchSource[source]

Bases: cntk.cntk_py.SwigMinibatchSource

Base class of all user minibatch sources.

get_checkpoint_state()[source]

Returns a dictionary describing the current state of the minibatch source. Needs to be overridden if the state of the minibatch source is to be stored to and later restored from a checkpoint.

Returns:a dictionary that can later be passed to restore_from_checkpoint().
is_infinite()[source]

Should return true if the user has not specified any limit on the number of sweeps and samples.

next_minibatch(num_samples, number_of_workers, worker_rank, device=None)[source]

Function to be implemented by the user.

Parameters:
  • num_samples (int) – number of samples to return
  • number_of_workers (int) – number of workers in total
  • worker_rank (int) – worker for which the data is to be returned
  • device (DeviceDescriptor, defaults to None) – the device descriptor that contains the type and id of the device on which the computation is performed. If None, the default device is used.
Returns:

mapping of StreamInformation to MinibatchData

restore_from_checkpoint(state)[source]

Sets the state of the checkpoint.

Parameters:state (dict) – dictionary containing the state
stream_info(name)[source]

Gets the description of the stream with the given name. Raises an exception if there is no stream with this name, or if there is more than one.

stream_infos()[source]

Function to be implemented by the user.

Returns:list of StreamInformation instances
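One common way to honor number_of_workers and worker_rank inside a next_minibatch implementation is a strided split over the data; a minimal sketch in plain Python (the strided scheme is an assumption, not something the API mandates):

```python
def worker_indices(num_items, number_of_workers, worker_rank):
    """Indices of the items this worker should read: a simple strided
    partition, as one might use inside next_minibatch."""
    return [i for i in range(num_items) if i % number_of_workers == worker_rank]

print(worker_indices(10, 3, 0))  # [0, 3, 6, 9]
print(worker_indices(10, 3, 2))  # [2, 5, 8]
```

Every item is assigned to exactly one worker, so the union over all ranks covers the data without overlap.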
sequence_to_cntk_text_format(seq_idx, alias_tensor_map)[source]

Converts a list of NumPy arrays representing tensors of inputs into a format that is readable by CTFDeserializer.

Parameters:
  • seq_idx (int) – index of the current sequence
  • alias_tensor_map (dict) – maps alias (str) to tensor (ndarray). Tensors are assumed to have dynamic axis.
Returns:

String representation in CNTKTextReader format

Return type:

str
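A simplified, dense-only rendition of this conversion in plain Python (a sketch only; the real function also handles sparse tensors and its exact whitespace may differ):

```python
import numpy as np

def to_ctf(seq_idx, alias_tensor_map):
    """Dense-only sketch of sequence_to_cntk_text_format: one output
    line per step of the dynamic axis, all aliases side by side."""
    lines = []
    num_steps = len(next(iter(alias_tensor_map.values())))
    for t in range(num_steps):
        fields = [str(seq_idx)]
        for alias, tensor in alias_tensor_map.items():
            values = " ".join(str(v) for v in np.ravel(tensor[t]))
            fields.append("|%s %s" % (alias, values))
        lines.append(" ".join(fields))
    return "\n".join(lines)

# a 2-step sequence for a hypothetical alias 'x'
print(to_ctf(0, {"x": np.array([[1, 2], [3, 4]])}))
```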