cntk.learners package
A learner tunes a set of parameters during the training process. One can use different learners for different sets of parameters. Currently, CNTK supports the following learning algorithms:
AdaDelta
AdaGrad
FSAdaGrad
Adam
MomentumSGD
Nesterov
RMSProp
SGD
Learner with a customized update function
-
IGNORE = 0
Indicates that the minibatch size is ignored in the learner’s hyper-parameter schedule.
-
class Learner(*args)
Bases: cntk.cntk_py.Learner
Abstraction for learning a subset of parameters of a learnable function using first-order gradient values. For example, momentum, AdaGrad, and RMSProp are different types of learners, each with its own algorithm for learning parameter values from first-order gradients. To instantiate a concrete learner, use the factory methods in this module.
-
parameters
The set of parameters associated with this learner.
-
reset_learning_rate(learning_rate)
Resets the learning rate. The new schedule is relative to the current number of elapsed samples/sweeps: offset 0 in the new schedule corresponds to the current count of elapsed samples/sweeps, and the schedule takes effect from the current position in the training process onwards.
Parameters:
- learning_rate (output of learning_parameter_schedule()) – learning rate to reset to
-
update(gradient_values, training_sample_count, is_sweep_end=False)
Update the parameters associated with this learner.
Parameters:
- gradient_values (dict) – maps Parameter to a NumPy array containing the first-order gradient values for the Parameter w.r.t. the training objective.
- training_sample_count (int) – number of samples in the minibatch
- is_sweep_end (bool) – a flag indicating whether it is at the end of a sweep of data
Returns: False to indicate that learning has stopped for all of the parameters associated with this learner
Return type: bool
-
-
class UnitType
Bases: enum.Enum
Deprecated since version 2.2.
Indicates whether the values in the schedule are specified on a per-sample or per-minibatch basis.
-
minibatch = 'minibatch'
Schedule contains per-minibatch values (which need to be rescaled by the learner using the actual minibatch size in samples).
-
sample = 'sample'
Schedule contains per-sample values.
-
-
class UserLearner(parameters, lr_schedule, as_numpy=True)
Bases: cntk.cntk_py.Learner
Base class of all user-defined learners. To implement your own learning algorithm, derive from this class and override update().
Certain optimizers (such as AdaGrad) require additional storage. This can be allocated and initialized during construction.
-
update(gradient_values, training_sample_count, sweep_end)
Update the parameters associated with this learner.
Parameters:
- gradient_values (dict) – maps Parameter to a NumPy array containing the first-order gradient values for the Parameter w.r.t. the training objective.
- training_sample_count (int) – number of samples in the minibatch
- sweep_end (bool) – if the data is fed by a conforming reader, this indicates whether a full pass over the dataset has just occurred.
Returns: False to indicate that learning has stopped for all of the parameters associated with this learner
Return type: bool
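The following is a minimal, standalone sketch of the kind of logic a user-defined learner might carry: extra per-parameter storage allocated at construction, and an update() that consumes a dict of gradients. It deliberately does not derive from UserLearner so that it runs without CNTK installed; the class name ToyAdaGradLearner, the dict-of-lists parameter layout, and the epsilon default are assumptions made for illustration only.

```python
import math


class ToyAdaGradLearner:
    """Standalone stand-in for a user-defined learner (not a CNTK class)."""

    def __init__(self, parameters, learning_rate, epsilon=1e-8):
        # parameters: dict mapping a name to a list of floats (hypothetical layout).
        self.parameters = parameters
        self.learning_rate = learning_rate
        self.epsilon = epsilon
        # Additional storage required by the optimizer, allocated once here.
        self.accumulators = {name: [0.0] * len(values)
                             for name, values in parameters.items()}

    def update(self, gradient_values, training_sample_count, sweep_end):
        # gradient_values maps a parameter name to its first-order gradients.
        for name, grads in gradient_values.items():
            values = self.parameters[name]
            acc = self.accumulators[name]
            for i, g in enumerate(grads):
                acc[i] += g * g  # running sum of squared gradients
                values[i] -= self.learning_rate * g / (math.sqrt(acc[i]) + self.epsilon)
        return True  # False would signal that learning has stopped


learner = ToyAdaGradLearner({"w": [1.0, -2.0]}, learning_rate=0.1)
still_learning = learner.update({"w": [0.5, -0.5]}, training_sample_count=2, sweep_end=False)
```

In a real subclass, the same accumulator would be created in the constructor after calling the base class constructor, and update() would operate on the Parameter objects passed in by CNTK.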
-
-
adadelta(parameters, lr, rho, epsilon, l1_regularization_weight=0, l2_regularization_weight=0, gaussian_noise_injection_std_dev=0, gradient_clipping_threshold_per_sample=np.inf, gradient_clipping_with_truncation=True)
Creates an AdaDelta learner instance to learn the parameters. See [1] for more information.
Parameters:
- parameters (list of parameters) – list of network parameters to tune. These can be obtained from the root operator’s parameters.
- lr (float, list, output of learning_parameter_schedule()) – a learning rate as a float, or a learning rate schedule. See also: learning_parameter_schedule()
- rho (float) – exponential smoothing factor for each minibatch.
- epsilon (float) – epsilon for sqrt.
- l1_regularization_weight (float, optional) – the L1 regularization weight per sample, defaults to 0.0
- l2_regularization_weight (float, optional) – the L2 regularization weight per sample, defaults to 0.0
- gaussian_noise_injection_std_dev (float, optional) – the standard deviation of the Gaussian noise added to parameters post update, defaults to 0.0
- gradient_clipping_threshold_per_sample (float, optional) – clipping threshold per sample, defaults to infinity
- gradient_clipping_with_truncation (bool, default True) – use gradient clipping with truncation
- use_mean_gradient (bool, optional) – use averaged gradient as input to learner. Deprecated since version 2.2: use the minibatch_size parameter to specify the reference minibatch size.
- minibatch_size (int, default None) – the minibatch size that the learner’s parameters are designed or pre-tuned for. This size is usually set to the same value as the minibatch data source’s size. CNTK automatically scales the parameters to enable an efficient model parameter update implementation while approximating the behavior of the pre-designed and pre-tuned parameters. If minibatch_size is not specified, CNTK inherits the minibatch size from the learning rate schedule; if the learning rate schedule does not specify a minibatch_size, CNTK sets it to IGNORE. Setting minibatch_size to IGNORE makes the learner apply the parameters as they are, preventing CNTK from performing any hyper-parameter scaling. See also: learning_parameter_schedule()
- epoch_size (optional, int) – number of samples as a scheduling unit for learning rate. See also: learning_parameter_schedule()
Returns: learner instance that can be passed to the Trainer
Return type: Learner
See also
[1] Matthew D. Zeiler, ADADELTA: An Adaptive Learning Rate Method.
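To make the roles of rho and epsilon concrete, here is a minimal sketch of one AdaDelta update for a single scalar parameter, following the update rule in Zeiler's paper [1]. It is not CNTK's implementation (which also takes lr, among other things); the function name adadelta_step and the default values used below are assumptions chosen for illustration.

```python
import math


def adadelta_step(x, grad, avg_sq_grad, avg_sq_delta, rho=0.95, epsilon=1e-6):
    """One AdaDelta step for a scalar parameter, as described in the paper."""
    # Exponentially smoothed average of squared gradients.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad * grad
    # Step scaled by the ratio of the two running RMS values.
    delta = -math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon) * grad
    # Exponentially smoothed average of squared updates.
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta * delta
    return x + delta, avg_sq_grad, avg_sq_delta


# One step on f(x) = x**2 (gradient 2*x), starting from x = 1 with empty state.
x1, g2, d2 = adadelta_step(1.0, 2.0, 0.0, 0.0)
```

The first step is small because the running average of squared updates starts at zero, so only epsilon contributes to the numerator.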
-
adagrad(parameters, lr, need_ave_multiplier=True, l1_regularization_weight=0, l2_regularization_weight=0, gaussian_noise_injection_std_dev=0, gradient_clipping_threshold_per_sample=np.inf, gradient_clipping_with_truncation=True)
Creates an AdaGrad learner instance to learn the parameters. See [1] for more information.
Parameters:
- parameters (list of parameters) – list of network parameters to tune. These can be obtained from the root operator’s parameters.
- lr (float, list, output of learning_parameter_schedule()) – a learning rate as a float, or a learning rate schedule. See also: learning_parameter_schedule()
- need_ave_multiplier (bool, default True)
- l1_regularization_weight (float, optional) – the L1 regularization weight per sample, defaults to 0.0
- l2_regularization_weight (float, optional) – the L2 regularization weight per sample, defaults to 0.0
- gaussian_noise_injection_std_dev (float, optional) – the standard deviation of the Gaussian noise added to parameters post update, defaults to 0.0
- gradient_clipping_threshold_per_sample (float, optional) – clipping threshold per sample, defaults to infinity
- gradient_clipping_with_truncation (bool, default True) – use gradient clipping with truncation
- use_mean_gradient (bool, optional) – use averaged gradient as input to learner. Deprecated since version 2.2: use the minibatch_size parameter to specify the reference minibatch size.
- minibatch_size (int, default None) – the minibatch size that the learner’s parameters are designed or pre-tuned for. This size is usually set to the same value as the minibatch data source’s size. CNTK automatically scales the parameters to enable an efficient model parameter update implementation while approximating the behavior of the pre-designed and pre-tuned parameters. If minibatch_size is not specified, CNTK inherits the minibatch size from the learning rate schedule; if the learning rate schedule does not specify a minibatch_size, CNTK sets it to IGNORE. Setting minibatch_size to IGNORE makes the learner apply the parameters as they are, preventing CNTK from performing any hyper-parameter scaling. See also: learning_parameter_schedule()
- epoch_size (optional, int) – number of samples as a scheduling unit for learning rate. See also: learning_parameter_schedule()
Returns: learner instance that can be passed to the Trainer
Return type: Learner
See also
[1] J. Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. The Journal of Machine Learning Research, 2011.
-
adam(parameters, lr, momentum, unit_gain=default_unit_gain_value(), variance_momentum=momentum_schedule_per_sample(0.9999986111120757), l1_regularization_weight=0, l2_regularization_weight=0, gaussian_noise_injection_std_dev=0, gradient_clipping_threshold_per_sample=np.inf, gradient_clipping_with_truncation=True, epsilon=1e-8, adamax=False)
Creates an Adam learner instance to learn the parameters. See [1] for more information.
Parameters:
- parameters (list of parameters) – list of network parameters to tune. These can be obtained from the root operator’s parameters.
- lr (float, list, output of learning_parameter_schedule()) – a learning rate as a float, or a learning rate schedule. See also: learning_parameter_schedule()
- momentum (float, list, output of momentum_schedule()) – momentum schedule. Note that this is the beta1 parameter in the Adam paper [1]. For additional information, please refer to this CNTK Wiki article.
- unit_gain – when True, momentum is interpreted as a unit-gain filter. Defaults to the value returned by default_unit_gain_value().
- variance_momentum (float, list, output of momentum_schedule()) – variance momentum schedule. Note that this is the beta2 parameter in the Adam paper [1]. Defaults to momentum_schedule_per_sample(0.9999986111120757).
- l1_regularization_weight (float, optional) – the L1 regularization weight per sample, defaults to 0.0
- l2_regularization_weight (float, optional) – the L2 regularization weight per sample, defaults to 0.0
- gaussian_noise_injection_std_dev (float, optional) – the standard deviation of the Gaussian noise added to parameters post update, defaults to 0.0
- gradient_clipping_threshold_per_sample (float, optional) – clipping threshold per sample, defaults to infinity
- gradient_clipping_with_truncation (bool, default True) – use gradient clipping with truncation
- use_mean_gradient (bool, optional) – use averaged gradient as input to learner. Deprecated since version 2.2: use the minibatch_size parameter to specify the reference minibatch size.
- epsilon (float, optional) – numerical stability constant, defaults to 1e-8
- adamax – when True, use infinity-norm variance momentum update instead of L2. Defaults to False
- minibatch_size (int, default None) – the minibatch size that the learner’s parameters are designed or pre-tuned for. This size is usually set to the same value as the minibatch data source’s size. CNTK automatically scales the parameters to enable an efficient model parameter update implementation while approximating the behavior of the pre-designed and pre-tuned parameters. If minibatch_size is not specified, CNTK inherits the minibatch size from the learning rate schedule; if the learning rate schedule does not specify a minibatch_size, CNTK sets it to IGNORE. Setting minibatch_size to IGNORE makes the learner apply the parameters as they are, preventing CNTK from performing any hyper-parameter scaling. See also: learning_parameter_schedule()
- epoch_size (optional, int) – number of samples as a scheduling unit for learning rate, momentum and variance_momentum. See also: learning_parameter_schedule()
Returns: learner instance that can be passed to the Trainer
Return type: Learner
See also
[1] D. Kingma, J. Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, 2015.
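As a reading aid for momentum (beta1), variance_momentum (beta2) and epsilon, the sketch below performs one Adam update on a scalar parameter as written in the Adam paper [1]. It is an illustration of the paper's rule under assumed default values, not CNTK's implementation: it ignores unit_gain, adamax, regularization, clipping and minibatch-size scaling, and the name adam_step is hypothetical.

```python
import math


def adam_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    x = x - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return x, m, v


# First step on f(x) = x**2 (gradient 2*x), starting from x = 1.
x1, m1, v1 = adam_step(1.0, 2.0, 0.0, 0.0, t=1)
```

After bias correction the first step has magnitude close to lr regardless of the gradient's scale, which is why epsilon only matters when the second-moment estimate is very small.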
-
default_unit_gain_value()
Returns True if, by default, momentum is applied in the unit-gain fashion.
-
fsadagrad(parameters, lr, momentum, unit_gain=default_unit_gain_value(), variance_momentum=momentum_schedule_per_sample(0.9999986111120757), l1_regularization_weight=0, l2_regularization_weight=0, gaussian_noise_injection_std_dev=0, gradient_clipping_threshold_per_sample=np.inf, gradient_clipping_with_truncation=True)
Creates an FSAdaGrad learner instance to learn the parameters.
Parameters:
- parameters (list of parameters) – list of network parameters to tune. These can be obtained from the root operator’s parameters.
- lr (float, list, output of learning_parameter_schedule()) – a learning rate as a float, or a learning rate schedule. See also: learning_parameter_schedule()
- momentum (float, list, output of momentum_schedule()) – momentum schedule. For additional information, please refer to this CNTK Wiki article.
- unit_gain – when True, momentum is interpreted as a unit-gain filter. Defaults to the value returned by default_unit_gain_value().
- variance_momentum (float, list, output of momentum_schedule()) – variance momentum schedule. Defaults to momentum_schedule_per_sample(0.9999986111120757).
- l1_regularization_weight (float, optional) – the L1 regularization weight per sample, defaults to 0.0
- l2_regularization_weight (float, optional) – the L2 regularization weight per sample, defaults to 0.0
- gaussian_noise_injection_std_dev (float, optional) – the standard deviation of the Gaussian noise added to parameters post update, defaults to 0.0
- gradient_clipping_threshold_per_sample (float, optional) – clipping threshold per sample, defaults to infinity
- gradient_clipping_with_truncation (bool, default True) – use gradient clipping with truncation
- use_mean_gradient (bool, optional) – use averaged gradient as input to learner. Deprecated since version 2.2: use the minibatch_size parameter to specify the reference minibatch size.
- minibatch_size (int, default None) – the minibatch size that the learner’s parameters are designed or pre-tuned for. This size is usually set to the same value as the minibatch data source’s size. CNTK automatically scales the parameters to enable an efficient model parameter update implementation while approximating the behavior of the pre-designed and pre-tuned parameters. If minibatch_size is not specified, CNTK inherits the minibatch size from the learning rate schedule; if the learning rate schedule does not specify a minibatch_size, CNTK sets it to IGNORE. Setting minibatch_size to IGNORE makes the learner apply the parameters as they are, preventing CNTK from performing any hyper-parameter scaling. See also: learning_parameter_schedule()
- epoch_size (optional, int) – number of samples as a scheduling unit for learning rate, momentum and variance_momentum. See also: learning_parameter_schedule()
Returns: learner instance that can be passed to the Trainer
Return type: Learner
-
learning_parameter_schedule(schedule, minibatch_size=None, epoch_size=None)
Create a learning parameter schedule.
Parameters:
- schedule (float or list) – if a float, the parameter value to be used for all samples. In case of a list [p_1, p_2, .., p_n], the i-th parameter p_i in the list is used as the value from the (epoch_size * (i-1) + 1)-th sample to the (epoch_size * i)-th sample. If the list contains pairs, i.e. [(num_epoch_1, p_1), (num_epoch_2, p_2), .., (num_epoch_n, p_n)], the i-th parameter is used as the value from the (epoch_size * (num_epoch_0 + num_epoch_1 + ... + num_epoch_(i-1)) + 1)-th sample to the (epoch_size * (num_epoch_0 + num_epoch_1 + ... + num_epoch_i))-th sample (taking num_epoch_0 = 0 as a special initialization).
- minibatch_size (int) – an integer specifying the minibatch size that the schedule is designed for. CNTK scales the schedule internally so as to match the designed effect of the schedule as closely as possible. If it is not specified, CNTK sets it to the special value IGNORE.
- epoch_size (optional, int) – number of samples as a scheduling unit. Parameters in the schedule change their values every epoch_size samples. If no epoch_size is provided, this parameter is substituted by the size of the full data sweep, in which case the scheduling unit is the entire data sweep (as indicated by the MinibatchSource) and parameters change their values on a sweep-by-sweep basis as specified by the schedule.
Returns: learning parameter schedule
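The indexing rule for list and pair schedules can be sketched in plain Python. The helper below is hypothetical (schedule_value is not a CNTK function) and assumes 0-based sample indices, an explicit epoch_size, and that the last entry stays in effect once the list is exhausted, which is how the momentum_schedule() examples in this module behave.

```python
def schedule_value(schedule, epoch_size, sample_index):
    """Return the schedule entry in effect for a 0-based sample index."""
    if not isinstance(schedule, list):
        return schedule  # a single float applies to all samples
    # Plain values are treated as (1, value): one scheduling unit each.
    pairs = [entry if isinstance(entry, tuple) else (1, entry) for entry in schedule]
    boundary = 0
    for num_epochs, value in pairs:
        boundary += num_epochs * epoch_size  # cumulative end of this entry
        if sample_index < boundary:
            return value
    return pairs[-1][1]  # past the end: keep the last value


# Value list: 0.99 for the first 1000 samples, then 0.9.
first_and_second = [schedule_value([0.99, 0.9], 1000, i) for i in (0, 999, 1000, 1001)]
```

With pairs, the boundaries accumulate: for [(999, 0.99), (888, 0.88), (0, 0.77)] and a scheduling unit of one sample, index 999 + 888 - 1 still maps to 0.88 and index 999 + 888 maps to 0.77.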
-
learning_parameter_schedule_per_sample(schedule, epoch_size=None)
Create a learning parameter schedule as if the parameter is applied to minibatches of size 1. CNTK will scale the parameters accordingly with respect to the actual minibatch size.
Parameters:
- schedule (float or list) – if a float, the parameter value to be used for all samples. In case of a list [p_1, p_2, .., p_n], the i-th parameter p_i in the list is used as the value from the (epoch_size * (i-1) + 1)-th sample to the (epoch_size * i)-th sample. If the list contains pairs, i.e. [(num_epoch_1, p_1), (num_epoch_2, p_2), .., (num_epoch_n, p_n)], the i-th parameter is used as the value from the (epoch_size * (num_epoch_0 + num_epoch_1 + ... + num_epoch_(i-1)) + 1)-th sample to the (epoch_size * (num_epoch_0 + num_epoch_1 + ... + num_epoch_i))-th sample (taking num_epoch_0 = 0 as a special initialization).
- epoch_size (optional, int) – number of samples as a scheduling unit. Parameters in the schedule change their values every epoch_size samples. If no epoch_size is provided, this parameter is substituted by the size of the full data sweep, in which case the scheduling unit is the entire data sweep (as indicated by the MinibatchSource) and parameters change their values on a sweep-by-sweep basis as specified by the schedule.
Returns: learning parameter schedule as if it is applied to minibatches of size 1.
-
learning_rate_schedule(lr, unit, epoch_size=None)
Deprecated since version 2.2.
Create a learning rate schedule (using the same semantics as training_parameter_schedule()).
Parameters:
- lr (float or list) – see parameter schedule in training_parameter_schedule().
- unit (UnitType) – see parameter unit in training_parameter_schedule(). Deprecated since version 2.2: use the minibatch_size parameter to specify the reference minibatch size instead.
- epoch_size (int) – see parameter epoch_size in training_parameter_schedule().
Returns: learning rate schedule
-
momentum_as_time_constant_schedule(momentum, epoch_size=None)
Create a momentum schedule in a minibatch-size agnostic way (using the same semantics as training_parameter_schedule() with unit=UnitType.sample).
Deprecated since version 2.2: This is for the legacy API. In the legacy API:
    # assume the desired minibatch-size-invariant constant momentum rate is: momentum_rate
    momentum_time_constant = -minibatch_size/np.log(momentum_rate)
    momentum = momentum_as_time_constant_schedule(momentum_time_constant)
The equivalent code in the latest API:
    momentum = momentum_schedule(momentum_rate, minibatch_size=minibatch_size)
Parameters:
- momentum (float or list) – see parameter schedule in training_parameter_schedule().
- epoch_size (int) – see parameter epoch_size in training_parameter_schedule().
CNTK specifies momentum in a minibatch-size agnostic way as the time constant (in samples) of a unit-gain first-order IIR filter. The value specifies the number of samples after which a gradient has an effect of 1/e = 37%.
If you want to specify the momentum per N samples (or per minibatch), use momentum_schedule().
Examples
>>> # Use a fixed time constant of 1100 for all samples
>>> m = momentum_as_time_constant_schedule(1100)

>>> # Use the time constant 1100 for the first 1000 samples,
>>> # then 1500 for the remaining ones
>>> m = momentum_as_time_constant_schedule([1100, 1500], 1000)
Returns: momentum as time constant schedule
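The relation between a time constant and a per-minibatch momentum rate, as given by the legacy formula above, can be checked with a few lines of plain Python. The helper names are hypothetical and nothing here calls CNTK; the sketch only inverts the formula momentum_time_constant = -minibatch_size/log(momentum_rate) and verifies the 1/e interpretation.

```python
import math


def time_constant_from_momentum(momentum_rate, minibatch_size):
    """Time constant (in samples) equivalent to a per-minibatch momentum rate."""
    return -minibatch_size / math.log(momentum_rate)


def momentum_from_time_constant(time_constant, minibatch_size):
    """Per-minibatch momentum rate equivalent to a time constant in samples."""
    return math.exp(-minibatch_size / time_constant)


tc = time_constant_from_momentum(0.9, 64)          # momentum 0.9 per 64 samples
per_sample = momentum_from_time_constant(tc, 1)    # same decay, per single sample
remaining = per_sample ** tc                       # weight left after tc samples, i.e. 1/e

# A time constant of 720000 samples, expressed per sample; this numerically
# matches the 0.9999986111120757 default that appears elsewhere in this module.
long_memory = momentum_from_time_constant(720000, 1)
```

Because the decay is expressed per sample, the same time constant yields a different per-minibatch rate for every minibatch size, which is what makes the specification minibatch-size agnostic.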
-
momentum_schedule(momentum, epoch_size=None, minibatch_size=None)
Create a momentum schedule (using the same semantics as learning_parameter_schedule()) which applies the momentum decay every N samples, where N is specified by the argument minibatch_size.
Parameters:
- momentum (float or list) – see parameter schedule in training_parameter_schedule().
- epoch_size (int) – see parameter epoch_size in training_parameter_schedule().
- minibatch_size (int) – an integer specifying the reference minibatch size; CNTK scales the momentum internally so as to simulate the momentum decay of the specified minibatch size while the actual minibatch sizes of the fed data can vary. In this way, momentum values can be provided in a minibatch-size agnostic way (equal decay per sample). If minibatch_size is None (default), the momentum is applied to the whole minibatch regardless of the actual minibatch sizes (not in a minibatch-size agnostic way).
Examples
>>> # Use a fixed momentum of 0.99 for all samples
>>> m = momentum_schedule(0.99)

>>> # Use the momentum value 0.99 for the first 1000 samples,
>>> # then 0.9 for the remaining ones
>>> m = momentum_schedule([0.99,0.9], 1000)
>>> m[0], m[999], m[1000], m[1001]
(0.99, 0.99, 0.9, 0.9)

>>> # Use the momentum value 0.99 for the first 999 samples,
>>> # then 0.88 for the next 888 samples, and 0.77 for the
>>> # remaining ones
>>> m = momentum_schedule([(999,0.99),(888,0.88),(0, 0.77)])
>>> m[0], m[998], m[999], m[999+888-1], m[999+888]
(0.99, 0.99, 0.88, 0.88, 0.77)
Returns: momentum schedule
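One way to read "equal decay per sample" is that a momentum given for a reference minibatch size is raised to the ratio of actual to reference size. The sketch below illustrates that reading only; how CNTK performs the scaling internally is not shown in this page, and the name rescale_momentum is hypothetical.

```python
def rescale_momentum(momentum, reference_minibatch_size, actual_minibatch_size):
    """Momentum for the actual minibatch size with the same per-sample decay."""
    return momentum ** (actual_minibatch_size / reference_minibatch_size)


same_size = rescale_momentum(0.9, 32, 32)    # unchanged when sizes match
double_size = rescale_momentum(0.9, 32, 64)  # two reference minibatches' worth of decay
half_size = rescale_momentum(0.9, 32, 16)    # half a reference minibatch's worth
```

Applying half_size twice gives the same total decay as applying the reference momentum once, which is the property a minibatch-size agnostic schedule needs.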
-
momentum_schedule_per_sample(momentum, epoch_size=None)
Create a per-sample momentum schedule (using the same semantics as momentum_schedule() but specializing in per-sample momentum schedules).
Parameters:
- momentum (float or list) – see parameter schedule in training_parameter_schedule().
- epoch_size (int) – see parameter epoch_size in momentum_schedule().
Returns: momentum schedule
-
momentum_sgd(parameters, lr, momentum, unit_gain=default_unit_gain_value(), l1_regularization_weight=0.0, l2_regularization_weight=0, gaussian_noise_injection_std_dev=0, gradient_clipping_threshold_per_sample=np.inf, gradient_clipping_with_truncation=True)
Creates a Momentum SGD learner instance to learn the parameters.
Parameters:
- parameters (list of parameters) – list of network parameters to tune. These can be obtained from the root operator’s parameters.
- lr (float, list, output of learning_parameter_schedule()) – a learning rate as a float, or a learning rate schedule. See also: learning_parameter_schedule()
- momentum (float, list, output of momentum_schedule()) – momentum schedule. For additional information, please refer to this CNTK Wiki article.
- unit_gain – when True, momentum is interpreted as a unit-gain filter. Defaults to the value returned by default_unit_gain_value().
- l1_regularization_weight (float, optional) – the L1 regularization weight per sample, defaults to 0.0
- l2_regularization_weight (float, optional) – the L2 regularization weight per sample, defaults to 0.0
- gaussian_noise_injection_std_dev (float, optional) – the standard deviation of the Gaussian noise added to parameters post update, defaults to 0.0
- gradient_clipping_threshold_per_sample (float, optional) – clipping threshold per sample, defaults to infinity
- gradient_clipping_with_truncation (bool, default True) – use gradient clipping with truncation
- use_mean_gradient (bool, optional) – use averaged gradient as input to learner. Deprecated since version 2.2: use the minibatch_size parameter to specify the reference minibatch size.
- minibatch_size (int, default None) – the minibatch size that the learner’s parameters are designed or pre-tuned for. This size is usually set to the same value as the minibatch data source’s size. CNTK automatically scales the parameters to enable an efficient model parameter update implementation while approximating the behavior of the pre-designed and pre-tuned parameters. If minibatch_size is not specified, CNTK inherits the minibatch size from the learning rate schedule; if the learning rate schedule does not specify a minibatch_size, CNTK sets it to IGNORE. Setting minibatch_size to IGNORE makes the learner apply the parameters as they are, preventing CNTK from performing any hyper-parameter scaling. See also: learning_parameter_schedule()
- epoch_size (optional, int) – number of samples as a scheduling unit for learning rate and momentum. See also: learning_parameter_schedule()
Returns: learner instance that can be passed to the Trainer
Return type: Learner
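The difference that unit_gain makes can be illustrated with a first-order filter over gradients. The sketch assumes the usual signal-processing meaning of "unit-gain": the incoming gradient is weighted by (1 - momentum) so that a constant input produces an output of the same size. The function names are hypothetical and this is not CNTK's code.

```python
def momentum_step(velocity, grad, momentum, unit_gain):
    """Smoothed gradient after one step, with or without unit-gain scaling."""
    scale = (1.0 - momentum) if unit_gain else 1.0
    return momentum * velocity + scale * grad


def steady_state(momentum, unit_gain, steps=2000):
    """Smoothed gradient after feeding a constant gradient of 1.0 many times."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum_step(velocity, 1.0, momentum, unit_gain)
    return velocity


with_unit_gain = steady_state(0.9, unit_gain=True)      # settles near 1.0
without_unit_gain = steady_state(0.9, unit_gain=False)  # settles near 1 / (1 - 0.9)
```

In practice this means a learning rate tuned with one setting is off by roughly a factor of 1 / (1 - momentum) under the other, so the two should not be mixed without retuning.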
-
nesterov(parameters, lr, momentum, unit_gain=default_unit_gain_value(), l1_regularization_weight=0, l2_regularization_weight=0, gaussian_noise_injection_std_dev=0, gradient_clipping_threshold_per_sample=np.inf, gradient_clipping_with_truncation=True)
Creates a Nesterov SGD learner instance to learn the parameters. This was originally proposed by Nesterov [1] in 1983 and then shown to work well in a deep learning context by Sutskever, et al. [2].
Parameters:
- parameters (list of parameters) – list of network parameters to tune. These can be obtained from the root operator’s parameters.
- lr (float, list, output of learning_parameter_schedule()) – a learning rate as a float, or a learning rate schedule. See also: learning_parameter_schedule()
- momentum (float, list, output of momentum_schedule()) – momentum schedule. For additional information, please refer to this CNTK Wiki article.
- unit_gain – when True, momentum is interpreted as a unit-gain filter. Defaults to the value returned by default_unit_gain_value().
- l1_regularization_weight (float, optional) – the L1 regularization weight per sample, defaults to 0.0
- l2_regularization_weight (float, optional) – the L2 regularization weight per sample, defaults to 0.0
- gaussian_noise_injection_std_dev (float, optional) – the standard deviation of the Gaussian noise added to parameters post update, defaults to 0.0
- gradient_clipping_threshold_per_sample (float, optional) – clipping threshold per sample, defaults to infinity
- gradient_clipping_with_truncation (bool, default True) – use gradient clipping with truncation
- use_mean_gradient (bool, optional) – use averaged gradient as input to learner. Deprecated since version 2.2: use the minibatch_size parameter to specify the reference minibatch size.
- minibatch_size (int, default None) – the minibatch size that the learner’s parameters are designed or pre-tuned for. This size is usually set to the same value as the minibatch data source’s size. CNTK automatically scales the parameters to enable an efficient model parameter update implementation while approximating the behavior of the pre-designed and pre-tuned parameters. If minibatch_size is not specified, CNTK inherits the minibatch size from the learning rate schedule; if the learning rate schedule does not specify a minibatch_size, CNTK sets it to IGNORE. Setting minibatch_size to IGNORE makes the learner apply the parameters as they are, preventing CNTK from performing any hyper-parameter scaling. See also: learning_parameter_schedule()
- epoch_size (optional, int) – number of samples as a scheduling unit for learning rate and momentum. See also: learning_parameter_schedule()
Returns: learner instance that can be passed to the Trainer
Return type: Learner
See also
[1] Y. Nesterov. A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k^2). Soviet Mathematics Doklady, 1983.
[2] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the Importance of Initialization and Momentum in Deep Learning. Proceedings of the 30th International Conference on Machine Learning, 2013.
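For orientation, here is a minimal sketch of Nesterov momentum in the formulation used by Sutskever et al. [2]: the gradient is evaluated at a look-ahead point rather than at the current parameter value. It is a scalar illustration under that formulation, not CNTK's implementation (unit_gain and the other options are ignored), and the name nesterov_step is hypothetical.

```python
def nesterov_step(x, velocity, grad_fn, lr, momentum):
    """One Nesterov momentum step for a scalar parameter."""
    lookahead = x + momentum * velocity                 # where momentum alone would land
    velocity = momentum * velocity - lr * grad_fn(lookahead)
    return x + velocity, velocity


def grad(x):
    # Gradient of f(x) = x**2.
    return 2.0 * x


x1, v1 = nesterov_step(1.0, 0.0, grad, lr=0.1, momentum=0.9)
x2, v2 = nesterov_step(x1, v1, grad, lr=0.1, momentum=0.9)
```

Classical momentum would evaluate grad at x instead of lookahead; that single change is the whole difference between the two methods in this formulation.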
-
rmsprop(parameters, lr, gamma, inc, dec, max, min, need_ave_multiplier=True, l1_regularization_weight=0, l2_regularization_weight=0, gaussian_noise_injection_std_dev=0, gradient_clipping_threshold_per_sample=np.inf, gradient_clipping_with_truncation=True)
Creates an RMSProp learner instance to learn the parameters.
Parameters:
- parameters (list of parameters) – list of network parameters to tune. These can be obtained from the root operator’s parameters.
- lr (float, list, output of learning_parameter_schedule()) – a learning rate as a float, or a learning rate schedule. See also: learning_parameter_schedule()
- gamma (float) – trade-off factor for current and previous gradients. A common value is 0.95. Should be in the range (0.0, 1.0)
- inc (float) – increasing factor when trying to adjust the current learning_rate. Should be greater than 1
- dec (float) – decreasing factor when trying to adjust the current learning_rate. Should be in the range (0.0, 1.0)
- max (float) – maximum scale allowed for the initial learning_rate. Should be greater than zero and greater than min
- min (float) – minimum scale allowed for the initial learning_rate. Should be greater than zero
- need_ave_multiplier (bool, default True)
- l1_regularization_weight (float, optional) – the L1 regularization weight per sample, defaults to 0.0
- l2_regularization_weight (float, optional) – the L2 regularization weight per sample, defaults to 0.0
- gaussian_noise_injection_std_dev (float, optional) – the standard deviation of the Gaussian noise added to parameters post update, defaults to 0.0
- gradient_clipping_threshold_per_sample (float, optional) – clipping threshold per sample, defaults to infinity
- gradient_clipping_with_truncation (bool, default True) – use gradient clipping with truncation
- use_mean_gradient (bool, optional) – use averaged gradient as input to learner. Deprecated since version 2.2: use the minibatch_size parameter to specify the reference minibatch size.
- minibatch_size (int, default None) – the minibatch size that the learner’s parameters are designed or pre-tuned for. This size is usually set to the same value as the minibatch data source’s size. CNTK automatically scales the parameters to enable an efficient model parameter update implementation while approximating the behavior of the pre-designed and pre-tuned parameters. If minibatch_size is not specified, CNTK inherits the minibatch size from the learning rate schedule; if the learning rate schedule does not specify a minibatch_size, CNTK sets it to IGNORE. Setting minibatch_size to IGNORE makes the learner apply the parameters as they are, preventing CNTK from performing any hyper-parameter scaling. See also: learning_parameter_schedule()
- epoch_size (optional, int) – number of samples as a scheduling unit for learning rate. See also: learning_parameter_schedule()
Returns: learner instance that can be passed to the Trainer
Return type: Learner
-
sgd
(parameters, lr, l1_regularization_weight=0, l2_regularization_weight=0, gaussian_noise_injection_std_dev=0, gradient_clipping_threshold_per_sample=np.inf, gradient_clipping_with_truncation=True)[source]¶ Creates an SGD learner instance to learn the parameters. See [1] for more information on how to set the parameters.
Parameters: - parameters (list of parameters) – list of network parameters to tune. These can be obtained by the ‘.parameters()’ method of the root operator.
- lr (float, list, output of
learning_parameter_schedule()
) – a learning rate as a float, or a learning rate schedule. See also:
learning_parameter_schedule()
- l1_regularization_weight (float, optional) – the L1 regularization weight per sample, defaults to 0.0
- l2_regularization_weight (float, optional) – the L2 regularization weight per sample, defaults to 0.0
- gaussian_noise_injection_std_dev (float, optional) – the standard deviation of the Gaussian noise added to parameters post update, defaults to 0.0
- gradient_clipping_threshold_per_sample (float, optional) – clipping threshold per sample, defaults to infinity
- gradient_clipping_with_truncation (bool, default
True
) – use gradient clipping with truncation
- use_mean_gradient (bool, optional) – use averaged gradient as input to learner.
Deprecated:: 2.2 Use the minibatch_size parameter to specify the reference minibatch size.
- minibatch_size (int, default
None
) – The minibatch size that the learner’s parameters are designed or pre-tuned for. This is usually the same as the minibatch size of the data source. CNTK automatically scales the parameters so that the model update can be implemented efficiently while approximating the behavior of the pre-designed and pre-tuned parameters. If minibatch_size is not specified, CNTK inherits the minibatch size from the learning rate schedule; if the learning rate schedule does not specify a minibatch size either, CNTK sets it to
IGNORE
. Setting minibatch_size to
IGNORE
makes the learner apply the parameters as they are, preventing CNTK from performing any hyper-parameter scaling. See also:
learning_parameter_schedule()
- epoch_size (optional, int) – number of samples as a scheduling unit for learning rate. See also:
learning_parameter_schedule()
Returns: learner instance that can be passed to the
Trainer
Return type: Learner
See also
[1] L. Bottou. Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade. Springer, 2012.
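As a rough illustration of what the clipping and regularization parameters above control, here is a minimal plain-Python sketch of a single SGD step. It does not use CNTK and is not CNTK's implementation: the helper names `clip_with_truncation` and `sgd_step` are hypothetical, "truncation" is assumed to mean clamping each gradient element to the threshold, and any per-sample scaling of the threshold or of the regularization weight is ignored.

```python
def clip_with_truncation(gradient, threshold):
    """Clamp each gradient element to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in gradient]


def sgd_step(params, gradient, lr, l2_weight=0.0, clip_threshold=float("inf")):
    """Return the parameters after one plain SGD step.

    Illustrative only: the names and the order of clipping and
    regularization are assumptions, not CNTK's internal behaviour.
    """
    clipped = clip_with_truncation(gradient, clip_threshold)
    # L2 regularization adds l2_weight * p to each gradient element.
    return [p - lr * (g + l2_weight * p) for p, g in zip(params, clipped)]


params = [1.0, -2.0, 0.5]
gradient = [0.2, -5.0, 0.0]
# The second gradient element exceeds the threshold and is clamped to -1.0.
updated = sgd_step(params, gradient, lr=0.1, clip_threshold=1.0)
```

With the default infinite threshold the clamp leaves the gradient unchanged, which matches the documented default of no clipping.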
-
training_parameter_schedule
(schedule, unit=<UnitType.minibatch: 'minibatch'>, epoch_size=None)[source]¶ Deprecated:: 2.2
Create a training parameter schedule containing either per-sample or per-minibatch (default) values.
Examples
>>> # Use a fixed value 0.01 for all samples
>>> s = training_parameter_schedule(0.01)
>>> s[0], s[1]
(0.01, 0.01)

>>> # Use 0.01 for the first 1000 samples, then 0.001 for the remaining ones
>>> s = training_parameter_schedule([0.01, 0.001], epoch_size=1000)
>>> s[0], s[1], s[1000], s[1001]
(0.01, 0.01, 0.001, 0.001)

>>> # Use 0.1 for the first 12 epochs, then 0.01 for the next 15,
>>> # followed by 0.001 for the remaining ones, with 100 samples in an epoch
>>> s = training_parameter_schedule([(12, 0.1), (15, 0.01), (1, 0.001)], epoch_size=100)
>>> s[0], s[1199], s[1200], s[2699], s[2700], s[5000]
(0.1, 0.1, 0.01, 0.01, 0.001, 0.001)
Parameters: - schedule (float or list) – if a float, the value is used for all samples. If a list, each element is used as the value for
epoch_size
samples. If the list contains pairs, the second element of each pair is used as the value for (
epoch_size
x first element) samples
- unit (
UnitType
) – one of two:
*
sample
: the returned schedule contains per-sample values
*
minibatch
: the returned schedule contains per-minibatch values.
Deprecated:: 2.2 Use the minibatch_size parameter to specify the reference minibatch size.
- epoch_size (optional, int) – number of samples as a scheduling unit. Parameters in the schedule change their values every
epoch_size
samples. If no
epoch_size
is provided, this parameter is substituted by the size of the full data sweep, in which case the scheduling unit is the entire data sweep (as indicated by the MinibatchSource) and parameters change their values on a sweep-by-sweep basis as specified by the
schedule
.
Returns: training parameter schedule
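The indexing rule described above can be sketched in plain Python. The helper `schedule_value` is hypothetical and not part of CNTK; it assumes a finite `epoch_size` and only mirrors the documented lookup behaviour, so it reproduces the values shown in the examples for this function.

```python
def schedule_value(schedule, epoch_size, sample_index):
    """Look up the scheduled value in effect at a given sample index.

    Hypothetical helper for illustration; it assumes a finite
    epoch_size and is not CNTK's implementation.
    """
    if not isinstance(schedule, list):
        return schedule  # a single float applies to all samples

    # Normalize bare values to (epoch_count, value) pairs.
    pairs = [item if isinstance(item, tuple) else (1, item) for item in schedule]

    boundary = 0
    for epoch_count, value in pairs:
        boundary += epoch_count * epoch_size
        if sample_index < boundary:
            return value
    # Past the last boundary, the final value stays in effect.
    return pairs[-1][1]


sched = [(12, 0.1), (15, 0.01), (1, 0.001)]
values = [schedule_value(sched, 100, i) for i in (0, 1199, 1200, 2699, 2700, 5000)]
```

Here the first pair covers samples 0 to 1199 (12 epochs of 100 samples), the second covers 1200 to 2699, and the last value applies from sample 2700 onwards.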
-
universal
(update_func, parameters)[source]¶ Creates a learner which uses a CNTK function to update the parameters.
Parameters: - update_func – function that takes parameters and gradients as arguments and returns a
Function
that performs the desired updates. The returned function updates the parameters by means of the
assign()
operations it contains. If
update_func
does not contain
assign()
operations, the parameters will not be updated.
- parameters (list) – list of network parameters to tune. These can be obtained from the root operator’s parameters.
Returns: learner instance that can be passed to the
Trainer
Return type: Learner
Examples
>>> import cntk as C
>>> def my_adagrad(parameters, gradients):
...     accumulators = [C.constant(0, shape=p.shape, dtype=p.dtype, name='accum') for p in parameters]
...     update_funcs = []
...     for p, g, a in zip(parameters, gradients, accumulators):
...         accum_new = C.assign(a, g * g)
...         update_funcs.append(C.assign(p, p - 0.01 * g / C.sqrt(accum_new + 1e-6)))
...     return C.combine(update_funcs)
...
>>> x = C.input_variable((10,))
>>> y = C.input_variable((2,))
>>> z = C.layers.Sequential([C.layers.Dense(100, activation=C.relu), C.layers.Dense(2)])(x)
>>> loss = C.cross_entropy_with_softmax(z, y)
>>> learner = C.universal(my_adagrad, z.parameters)
>>> trainer = C.Trainer(z, loss, learner)
>>> # now trainer can be used as any other Trainer
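As written, the `my_adagrad` example assigns `g * g` to the accumulator on each call, whereas textbook AdaGrad keeps a running sum of squared gradients. The plain-Python sketch below shows the accumulating form for a single scalar parameter. It does not use CNTK, and the helper name `adagrad_step` is hypothetical; an accumulating CNTK variant would presumably assign `a + g * g` instead, but that is an assumption, not something the original example shows.

```python
import math


def adagrad_step(param, grad, accum, lr=0.01, eps=1e-6):
    """One scalar AdaGrad step using a running sum of squared gradients."""
    accum = accum + grad * grad  # accumulate instead of overwriting
    param = param - lr * grad / math.sqrt(accum + eps)
    return param, accum


param, accum = 1.0, 0.0
for grad in (0.5, 0.5, 0.5):
    # With a constant gradient, each step is smaller than the previous one
    # because the accumulator keeps growing.
    param, accum = adagrad_step(param, grad, accum)
```

The learning rate of 0.01 and the epsilon of 1e-6 are taken from the example above; the gradient values are made up for illustration.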