Optimizer

template<typename data_t = real_t>
class elsa::ml::SGD : public elsa::ml::Optimizer<real_t>

Gradient descent (with momentum) optimizer.

Update rule for parameter $ w $ with gradient $ g $ when momentum is $ 0 $:

\[ w = w - \text{learning_rate} \cdot g \]

Update rule when momentum is larger than $ 0 $:

\[ \begin{eqnarray*} \text{velocity} &=& \text{momentum} \cdot \text{velocity} - \text{learning_rate} \cdot g \\ w &=& w + \text{velocity} \end{eqnarray*} \]

When nesterov is set to true, this rule becomes:

\[ \begin{eqnarray*} \text{velocity} & = & \text{momentum} \cdot \text{velocity} - \text{learning_rate} \cdot g \\ w & = & w + \text{momentum} \cdot \text{velocity} - \text{learning_rate} \cdot g \end{eqnarray*} \]

Public Functions

SGD(data_t learningRate = data_t(0.01), data_t momentum = data_t(0.0), bool nesterov = false)

Construct an SGD optimizer.

Parameters
  • learningRate: The learning-rate. This parameter is optional and defaults to 0.01.

  • momentum: A hyperparameter >= 0 that accelerates gradient descent in the relevant direction and dampens oscillations. This parameter is optional and defaults to 0, i.e., vanilla gradient descent.

  • nesterov: Whether to apply Nesterov momentum. This parameter is optional and defaults to false.

data_t getMomentum() const

Get momentum.

bool useNesterov() const

True if this optimizer applies Nesterov momentum, false otherwise.

Private Members

data_t momentum_

momentum parameter

bool nesterov_

True if the Nesterov momentum should be used, false otherwise.

template<typename data_t = real_t>
class elsa::ml::Adam : public elsa::ml::Optimizer<real_t>

Optimizer that implements the Adam algorithm.

Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments.

According to Kingma et al., 2014, the method is “computationally efficient, has little memory requirement, is invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data/parameters”.

Public Functions

Adam(data_t learningRate = data_t(0.001), data_t beta1 = data_t(0.9), data_t beta2 = data_t(0.999), data_t epsilon = data_t(1e-7))

Construct an Adam optimizer.

Parameters
  • learningRate: The learning-rate. This parameter is optional and defaults to 0.001.

  • beta1: The exponential decay rate for the 1st moment estimates. This parameter is optional and defaults to 0.9.

  • beta2: The exponential decay rate for the 2nd moment estimates. This parameter is optional and defaults to 0.999.

  • epsilon: A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper. This parameter is optional and defaults to 1e-7.

data_t getBeta1() const

Get beta1.

data_t getBeta2() const

Get beta2.

data_t getEpsilon() const

Get epsilon.

Private Members

data_t beta1_

exponential decay rate for the 1st moment estimates

data_t beta2_

exponential decay rate for the 2nd moment estimates

data_t epsilon_

epsilon-value for numeric stability