Optimizer

template<typename data_t = real_t>
class elsa::ml::SGD : public elsa::ml::Optimizer<real_t>

Gradient descent (with momentum) optimizer.

Update rule for parameter $ w $ with gradient $ g $ when momentum is $ 0 $:

\[ w = w - \text{learning_rate} \cdot g \]

Update rule when momentum is larger than $ 0 $:

\[ \begin{eqnarray*} \text{velocity} &=& \text{momentum} \cdot \text{velocity} - \text{learning_rate} \cdot g \\ w &=& w + \text{velocity} \end{eqnarray*} \]

When nesterov is set to true, this rule becomes:

\[ \begin{eqnarray*} \text{velocity} & = & \text{momentum} \cdot \text{velocity} - \text{learning_rate} \cdot g \\ w & = & w + \text{momentum} \cdot \text{velocity} - \text{learning_rate} \cdot g \end{eqnarray*} \]

Public Functions

SGD(data_t learningRate = data_t(0.01), data_t momentum = data_t(0.0), bool nesterov = false)

Construct an SGD optimizer.

Parameters
  • learningRate: The learning-rate. This parameter is optional and defaults to 0.01.

  • momentum: A hyperparameter >= 0 that accelerates gradient descent in the relevant direction and dampens oscillations. This parameter is optional and defaults to 0, i.e., vanilla gradient descent.

  • nesterov: Whether to apply Nesterov momentum. This parameter is optional and defaults to false.

data_t getMomentum() const

Get momentum.

bool useNesterov() const

True if this optimizer applies Nesterov momentum, false otherwise.

Private Members

data_t momentum_

momentum parameter

bool nesterov_

True if the Nesterov momentum should be used, false otherwise.

template<typename data_t = real_t>
class elsa::ml::Adam : public elsa::ml::Optimizer<real_t>

Optimizer that implements the Adam algorithm.

Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments.

According to Kingma et al., 2014, the method is “computationally efficient, has little memory requirement, is invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data/parameters”.

Public Functions

Adam(data_t learningRate = data_t(0.001), data_t beta1 = data_t(0.9), data_t beta2 = data_t(0.999), data_t epsilon = data_t(1e-7))

Construct an Adam optimizer.

Parameters
  • learningRate: The learning-rate. This parameter is optional and defaults to 0.001.

  • beta1: The exponential decay rate for the 1st moment estimates. This parameter is optional and defaults to 0.9.

  • beta2: The exponential decay rate for the 2nd moment estimates. This parameter is optional and defaults to 0.999.

  • epsilon: A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper. This parameter is optional and defaults to 1e-7.

data_t getBeta1() const

Get beta1.

data_t getBeta2() const

Get beta2.

data_t getEpsilon() const

Get epsilon.

Private Members

data_t beta1_

exponential decay rate for the 1st moment estimates

data_t beta2_

exponential decay rate for the 2nd moment estimates

data_t epsilon_

epsilon-value for numeric stability