cntk.layers.sequence module

First / higher-order functions over sequences, like Recurrence().

Delay(T=1, initial_state=0, name='')[source]

Layer factory function to create a layer that delays the input by a given number of time steps. A negative delay means looking into the future. This is provided as a layer that wraps delay() so that it can easily be used in a Sequential() expression.

Example

>>> # create example input: one sequence with 4 tensors of shape (3, 2)
>>> from cntk.layers import Sequential
>>> from cntk.layers.typing import Tensor, Sequence
>>> x = C.input_variable(**Sequence[Tensor[2]])
>>> x0 = np.reshape(np.arange(6,dtype=np.float32),(1,3,2))
>>> x0
array([[[ 0.,  1.],
        [ 2.,  3.],
        [ 4.,  5.]]], dtype=float32)
>>> # trigram expansion: augment each item of the sequence with its left and right neighbor
>>> make_trigram = Sequential([tuple(Delay(T) for T in (-1,0,1)),  # create 3 shifted versions
...                            splice])                            # concatenate them
>>> y = make_trigram(x)
>>> y(x0)
[array([[ 2.,  3.,  0.,  1.,  0.,  0.],
        [ 4.,  5.,  2.,  3.,  0.,  1.],
        [ 0.,  0.,  4.,  5.,  2.,  3.]], dtype=float32)]
>>> #    --(t-1)--  ---t---  --(t+1)--
Parameters:
  • T (int) – the number of time steps to look into the past, where negative values mean to look into the future, and 0 means a no-op (default 1).
  • initial_state – tensor or scalar representing the initial value to be used when the input tensor is shifted in time.
  • name (str, optional) – the name of the Function instance in the network
Returns:

A function that accepts one argument (which must be a sequence) and returns it delayed by T steps

Return type:

cntk.ops.functions.Function
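
As a further illustration (a sketch, not part of the original docs), a non-zero initial_state fills the vacated steps with that value instead of 0; x and x0 are as in the example above:

# sketch: delay by one step, padding with -1 instead of the default 0
delay_by_one = Delay(T=1, initial_state=-1)
y = delay_by_one(x)
y(x0)
# following the semantics described above, the expected result is
# [array([[-1., -1.],
#         [ 0.,  1.],
#         [ 2.,  3.]], dtype=float32)]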

Fold(folder_function, go_backwards=False, initial_state=0, return_full_state=False, name='')[source]

Layer factory function to create a function that runs a step function recurrently over an input sequence, and returns the final state. This is often used for embeddings of sequences, e.g. in a sequence-to-sequence model. Pseudo-code:

# pseudo-code for h = Fold(folder_function)(x)
#  x: input sequence of tensors along the dynamic axis
#  h: resulting final folder-function output tensor that no longer has a dynamic axis
h = initial_state   # h = output of previous step ("state"), and also the final result
for x_n in x:       # pseudo-code for looping over all steps of input sequence along its dynamic axis
    h = folder_function(h, x_n)  # pass previous state and new data to folder_function -> new state
# now h is the result of the final invocation of the folder function

Fold() is the same as Recurrence() except that only the final state is returned (whereas Recurrence() returns the entire state sequence). Hence, this documentation only covers the differences from Recurrence(); please see Recurrence() for detailed information on the parameters.

Commonly, the folder_function is a recurrent block such as an LSTM. However, one can pass any binary function. E.g., passing plus will sum up all items of a sequence, while element_max performs a max-pooling over all items of the sequence.

Note: CNTK’s Fold() is similar to the fold() catamorphism known from functional programming. go_backwards=False corresponds to a fold-left, and True to a fold-right, except that the folder_function signature is always that of fold-left.
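
For intuition, here is a pure-Python analogue of Fold() (an illustration only, not CNTK code):

# pure-Python analogue of h = Fold(folder_function)(x)
from functools import reduce

def fold(folder_function, xs, initial_state=0, go_backwards=False):
    items = reversed(xs) if go_backwards else xs  # fold-right = fold-left over the reversed input
    return reduce(folder_function, items, initial_state)

fold(lambda s, x: s + x, [1, 6, 4, 8, 6])  # -> 25, cf. the first column of the seq_sum example below
fold(max, [1, 6, 4, 8, 6])                 # -> 8, cf. the first column of the seq_max_pool example below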

Example

>>> from cntk.layers import *
>>> from cntk.layers.typing import *
>>> # sequence classifier. Maps a one-hot word sequence to a scalar probability value.
>>> # The recurrence is a Fold(), meaning only the final hidden state is produced.
>>> # (A Label() layer can be used to access the final hidden state by name.)
>>> sequence_classifier = Sequential([ Embedding(300),
...                                    Fold(LSTM(500)),
...                                    Dense(1, activation=sigmoid) ])
>>> # element-wise max-pooling over an input sequence
>>> x = C.input_variable(**Sequence[Tensor[2]])
>>> x0 = np.array([[ 1, 2 ],
...                [ 6, 3 ],
...                [ 4, 2 ],
...                [ 8, 1 ],
...                [ 6, 0 ]])
>>> seq_max_pool = Fold(C.element_max)
>>> y = seq_max_pool(x)
>>> y(x0)
array([[ 8.,  3.]], dtype=float32)
>>> # element-wise sum over an input sequence
>>> seq_sum = Fold(C.plus)
>>> y = seq_sum(x)
>>> y(x0)
array([[ 25.,   8.]], dtype=float32)
Parameters:
  • folder_function (Function or equivalent Python function) – This function must have N+1 inputs and N outputs, where N is the number of state variables (typically 1 for GRU and plain RNNs, and 2 for LSTMs).
  • go_backwards (bool, defaults to False) – if True then run the recurrence from the end of the sequence to the start.
  • initial_state (scalar or tensor without batch dimension; or a tuple thereof) – the initial value for the state. This can be a constant or a learnable parameter. In the latter case, if the step function has more than 1 state variable, this parameter must be a tuple providing one initial state for every state variable.
  • return_full_state (bool, defaults to False) – if True and the step function has more than one state variable, then the layer returns the final value of all state variables (a tuple); whereas if not given or False, only the final value of the first state variable is returned to the caller.
  • name (str, optional) – the name of the Function instance in the network
Returns:

A function that accepts one argument (which must be a sequence) and performs the fold operation on it

Return type:

Function

PastValueWindow(window_size, axis, go_backwards=False, name='')[source]

Layer factory function to create a function that returns a static, maskable view for N past steps over a sequence along the given ‘axis’. It returns two matrices: a value matrix, shape=(N,dim), and a validity mask, shape=(N,1).

This is used for attention modeling. CNTK presently does not support nested dynamic axes. Since attention models require nested axes (encoder hidden state vs. decoder hidden state), this layer can be used to map the encoder’s dynamic axis to a static tensor axis. The static axis has a maximum length (window_size). To account for shorter input sequences, this function also returns a validity mask of the same axis dimension. Longer sequences will be truncated.
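
To illustrate how the validity mask is meant to be used, here is a NumPy sketch (an illustration only, not CNTK code) of a masked softmax over a window of attention scores:

# NumPy sketch: softmax over attention scores, ignoring padded window positions
import numpy as np

def masked_softmax(scores, valid):
    e = np.exp(scores - scores.max()) * valid  # zero out the padded positions
    return e / e.sum()

scores = np.array([[0.5], [1.0], [2.0], [9.9]])  # the score in the padded slot must be ignored
valid  = np.array([[1.], [1.], [1.], [0.]])      # validity mask as returned by PastValueWindow
masked_softmax(scores, valid)                    # the padded position receives weight 0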

Example

>>> # create example input: one sequence with 4 tensors of shape (3, 2)
>>> from cntk.layers import Sequential
>>> from cntk.layers.typing import Tensor, Sequence
>>> x = C.input_variable(**Sequence[Tensor[2]])
>>> x0 = np.reshape(np.arange(6,dtype=np.float32),(1,3,2))
>>> x0
array([[[ 0.,  1.],
        [ 2.,  3.],
        [ 4.,  5.]]], dtype=float32)
>>> # convert dynamic-length sequence to a static-dimension tensor
>>> to_static_axis = PastValueWindow(4, axis=-2)  # axis=-2 means second last
>>> y = to_static_axis(x)
>>> value, valid = y(x0)
>>> # 'value' contains the items from the back, padded with 0
>>> value
array([[[ 4.,  5.],
        [ 2.,  3.],
        [ 0.,  1.],
        [ 0.,  0.]]], dtype=float32)
>>> # 'valid' contains a scalar 1 for each valid item, and 0 for the padded ones
>>> # E.g., when computing the attention softmax, only items with a 1 should be considered.
>>> valid
array([[[ 1.],
        [ 1.],
        [ 1.],
        [ 0.]]], dtype=float32)
Parameters:
  • window_size (int) – maximum number of items in sequences. The axis will have this dimension.
  • axis (int or Axis, optional, keyword only) – axis along which the concatenation will be performed
  • name (str, optional, keyword only) – the name of the Function instance in the network
Returns:

A function that accepts one argument, which must be a sequence. It returns a fixed-size window of the last window_size items, spliced along axis.

Return type:

Function

Recurrence(step_function, go_backwards=False, initial_state=0, return_full_state=False, name='')[source]

Layer factory function that implements a recurrent model, including the common RNN, LSTM, and GRU recurrences. This factory function creates a function that runs a step function recurrently over an input sequence, where in each step, Recurrence() passes to the step function a data input as well as the output of the previous step. The following pseudo-code represents what happens when you call a Recurrence() layer:

# pseudo-code for y = Recurrence(step_function)(x)
#  x: input sequence of tensors along the dynamic axis
#  y: resulting sequence of outputs along the same dynamic axis
y = []              # result sequence goes here
s = initial_state   # s = output of previous step ("state")
for x_n in x:       # pseudo-code for looping over all steps of input sequence along its dynamic axis
    s = step_function(s, x_n)  # pass previous state and new data to step_function -> new state
    y.append(s)

The common step functions are LSTM(), GRU(), and RNNStep(), but the step function can be any Function or Python function. The signature of a step function with a single state variable must be (h_prev, x) -> h, where h_prev is the previous state, x is the new data input, and the output is the new state. The step function will be called item by item, resulting in a sequence of the same length as the input.

Step functions can have more than one state output, e.g. LSTM(). In this case, the first N arguments are the previous state, followed by one more argument that is the data input; and its output must be a tuple of N values. In this case, the recurrence operation will, by default, return the first of the state variables (in the LSTM case, the h), while additional state variables are internal (like the LSTM’s c). If all state variables should be returned, pass return_full_state=True.
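
For example (a sketch mirroring the encoder in the RecurrenceFrom() example further below; some_sequence stands for any sequence input):

# sketch: expose both of the LSTM's state variables
lstm_full = Recurrence(LSTM(500), return_full_state=True)
h, c = lstm_full(some_sequence).outputs   # h: hidden-state sequence; c: cell-state sequence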

To provide your own step function, just use any Function (or equivalent Python function) that has a signature as described above. For example, a cumulative sum over a sequence can be computed as Recurrence(plus), where each step consists of plus(s,x_n), where s is the output of the previous call and hence the cumulative sum of all elements up to x_n. Another example is a GRU layer with projection, which could be realized as Recurrence(GRU(500) >> Dense(200)), where the projection is applied to the hidden state as fed back to the next step. F>>G is a short-hand for Sequential([F, G]).

Optionally, the recurrence can run backwards. This is useful for constructing bidirectional models.

initial_state must be a constant. To pass initial_state as a data input, e.g. for a sequence-to-sequence model, use RecurrenceFrom() instead.

Note: Recurrence() is the equivalent to what in functional programming is often called scanl().
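
As a pure-Python analogue (an illustration only, not CNTK code), Recurrence(plus) corresponds to a cumulative sum, i.e. itertools.accumulate:

# pure-Python analogue of y = Recurrence(step_function)(x)
from itertools import accumulate

def recurrence(step_function, xs, initial_state=0):
    s, ys = initial_state, []
    for x_n in xs:                 # same loop as the pseudo-code above
        s = step_function(s, x_n)
        ys.append(s)
    return ys

recurrence(lambda s, x: s + x, [3, 13, -100])  # -> [3, 16, -84], cf. the first column of the cum_sum example below
list(accumulate([3, 13, -100]))                # scanl: same cumulative sum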

Example

>>> from cntk.layers import *
>>> from cntk.layers.typing import *
>>> # a recurrent LSTM layer
>>> lstm_layer = Recurrence(LSTM(500))
>>> # a bidirectional LSTM layer
>>> # using function tuples to implement a bidirectional LSTM
>>> bi_lstm_layer = Sequential([(Recurrence(LSTM(250)),                      # first tuple entry: forward pass
...                              Recurrence(LSTM(250), go_backwards=True)),  # second: backward pass
...                             splice])                                     # splice both on top of each other
>>> bi_lstm_layer.update_signature(Sequence[Tensor[13]])
>>> bi_lstm_layer.shape   # shape reflects concatenation of both output states
(500,)
>>> tuple(str(axis.name) for axis in bi_lstm_layer.dynamic_axes)  # (note: str() needed only for Python 2.7)
('defaultBatchAxis', 'defaultDynamicAxis')
>>> # custom step function example: using Recurrence() to
>>> # compute the cumulative sum over an input sequence
>>> x = C.input_variable(**Sequence[Tensor[2]])
>>> x0 = np.array([[   3,    2],
...                [  13,   42],
...                [-100, +100]])
>>> cum_sum = Recurrence(C.plus, initial_state=C.Constant([0, 0.5]))
>>> y = cum_sum(x)
>>> y(x0)
[array([[   3. ,    2.5],
        [  16. ,   44.5],
        [ -84. ,  144.5]], dtype=float32)]
Parameters:
  • step_function (Function or equivalent Python function) – This function must have N+1 inputs and N outputs, where N is the number of state variables (typically 1 for GRU and plain RNNs, and 2 for LSTMs).
  • go_backwards (bool, defaults to False) – if True then run the recurrence from the end of the sequence to the start.
  • initial_state (scalar or tensor without batch dimension; or a tuple thereof) – the initial value for the state. This can be a constant or a learnable parameter. In the latter case, if the step function has more than 1 state variable, this parameter must be a tuple providing one initial state for every state variable.
  • return_full_state (bool, defaults to False) – if True and the step function has more than one state variable, then the layer returns all state variables (a tuple of sequences); whereas if not given or False, only the first state variable is returned to the caller.
  • name (str, optional) – the name of the Function instance in the network
Returns:

A function that accepts one argument (which must be a sequence) and performs the recurrent operation on it

Return type:

Function

RecurrenceFrom(step_function, go_backwards=False, return_full_state=False, name='')[source]

Layer factory function to create a function that runs a step function recurrently over an input sequence, with a data-dependent initial state. This layer is very similar to Recurrence(), except that the initial state is not a constant but is computed from input data. Thus, in RecurrenceFrom(), the initial state is passed to the layer function as a data input rather than, as in Recurrence(), as an initialization parameter to the factory function. This form is meant for use in sequence-to-sequence scenarios. This documentation only covers this case; for more information on the parameters, see Recurrence(). In pseudo-code:

# pseudo-code for y = RecurrenceFrom(step_function)(s,x)
#  x: input sequence of tensors along the dynamic axis
#  s: initial state for the recurrence (computed from input data elsewhere)
#  y: resulting sequence of outputs along the same dynamic axis
y = []              # result sequence goes here
for x_n in x:       # pseudo-code for looping over all steps of input sequence along its dynamic axis
    s = step_function(s, x_n)  # pass previous state and new data to step_function -> new state
    y.append(s)

The layer function returned by this factory function accepts the initial state as data argument(s). The initial state can be non-sequential data, as one would have for a plain sequence-to-sequence model, or sequential data. In the latter case, the last item is the initial state.

Example

>>> from cntk.layers import *
>>> from cntk.layers.typing import *
>>> # a plain sequence-to-sequence model in training (where label length is known)
>>> en = C.input_variable(**SequenceOver[Axis('m')][SparseTensor[20000]])  # English input sentence
>>> fr = C.input_variable(**SequenceOver[Axis('n')][SparseTensor[30000]])  # French target sentence
>>> embed = Embedding(300)
>>> encoder = Recurrence(LSTM(500), return_full_state=True)
>>> decoder = RecurrenceFrom(LSTM(500))       # decoder starts from a data-dependent initial state, hence -From()
>>> emit = Dense(30000)
>>> h, c = encoder(embed(en)).outputs         # LSTM encoder has two outputs (h, c)
>>> z = emit(decoder(h, c, sequence.past_value(fr)))   # decoder takes encoder outputs as initial state
>>> loss = C.cross_entropy_with_softmax(z, fr)
Parameters:
  • step_function (Function or equivalent Python function) – This function must have N+1 inputs and N outputs, where N is the number of state variables (typically 1 for GRU and plain RNNs, and 2 for LSTMs).
  • go_backwards (bool, defaults to False) – if True then run the recurrence from the end of the sequence to the start.
  • return_full_state (bool, defaults to False) – if True and the step function has more than one state variable, then the layer returns all state variables (a tuple of sequences); whereas if not given or False, only the first state variable is returned to the caller.
  • name (str, optional) – the name of the Function instance in the network
Returns:

A function that accepts arguments (initial_state_1, initial_state_2, ..., input_sequence), where the number of initial state variables must match the step function’s. The initial state can be a sequence, in which case its last (or first if go_backwards) item is used.

Return type:

Function

UnfoldFrom(generator_function, until_predicate=None, length_increase=1, name='')[source]

Layer factory function to create a function that implements a recurrent generator. Starting with a seed state, the UnfoldFrom() layer repeatedly applies generator_function and emits the sequence of results. It stops after a maximum number of steps, or earlier if the optional until_predicate evaluates to True for the current output. The maximum number of steps is derived from a second input, of which only the dynamic-axis information is used. This is best explained in pseudo-code:

# pseudo-code for y = UnfoldFrom(generator_function, until_predicate)(s,axis_like)
#  s:         initial state for the recurrence (computed from input data elsewhere),
#  axis_like: any input with a dynamic axis, unfold will happen along the same dynamic axis,
#  y:         resulting sequence of outputs along the same dynamic axis
y = []               # result sequence goes here
for _ in axis_like:  # pseudo-code for looping over all steps of dynamic axis of axis_like
    s = generator_function(s)  # pass previous state to generator_function -> new state
    y.append(s)
    if until_predicate(s):
        break
# now y is the output of repeatedly applying generator_function()

A typical application is the decoder of a sequence-to-sequence model: the generator function f accepts a two-valued state, with the first being an emitted word and the second being an internal recurrent state. The initial state would be a tuple (w0, h0), where w0 represents the sentence-start symbol and h0 is a thought vector that encodes the input sequence (as obtained from a Fold() operation).

A variant allows the state and the emitted sequence to be different. In that case, f returns a tuple (output value, new state), and UnfoldFrom(f)(s) would emit the sequence f(s)[0], f(f(s)[1])[0], f(f(f(s)[1])[1])[0], ....
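
A pure-Python analogue of this variant (an illustration only, not CNTK code):

# pure-Python analogue of UnfoldFrom() with separate output and state
def unfold_from(generator_function, s, max_len, until_predicate=None):
    ys = []
    for _ in range(max_len):       # in the real layer, the bound comes from the second input
        out, s = generator_function(s)
        ys.append(out)
        if until_predicate is not None and until_predicate(out):
            break
    return ys

# count down from 5, emitting squares and stopping once 0 is emitted
unfold_from(lambda s: (s * s, s - 1), 5, 10, until_predicate=lambda y: y == 0)
# -> [25, 16, 9, 4, 1, 0]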

The maximum length of the output sequence is not unlimited; it is determined by the length of the second argument to the layer function, multiplied by an optional increase factor.

Optionally, a function can be provided to denote that the end of the sequence has been reached.

Note: In the context of functional programming, the first form of this operation is known as the unfold() anamorphism.

Example

TO BE PROVIDED after signature changes.

Parameters:
  • generator_function (Function or equivalent Python function) – This function must have N inputs and an N-tuple-valued output, where N is the number of state variables. If the emitted value should be different from the state, then the function should have a tuple of N+1 outputs, where the first output is the value to emit, while the others are the state.
  • until_predicate (Function or equivalent Python function) – A function that denotes when the last element of the unfold has been emitted. It takes the same number of arguments as the generator, and returns a scalar that must be 1 for the last element of the sequence, and 0 otherwise. This is subject to the maximum length as determined by the input sequence and length_increase. If this parameter is not provided, the output length will be equal to the specified maximum length.
  • length_increase (float, defaults to 1) – the maximum number of output items is equal to the number of items of the dynamic_axis_like argument to the returned unfold() function, multiplied by this factor. For example, pass 1.5 here if the output sequence can be at most 50% longer than the input.
  • name (str, optional) – the name of the Function instance in the network
Returns:

A function that accepts two arguments (initial state and dynamic_axis_like), and performs the unfold operation on it. The initial state argument is the initial state for the recurrence. The dynamic_axis_like must be a sequence and provides a reference for the maximum length of the output sequence.

Return type:

Function