Optimizer
- template<typename data_t = real_t> class elsa::ml::SGD : public elsa::ml::Optimizer<real_t>

Gradient descent (with momentum) optimizer.
Update rule for parameter $ w $ with gradient $ g $ when momentum is $ 0 $:

\[ w = w - \text{learning\_rate} \cdot g \]

Update rule when momentum is larger than $ 0 $:

\[ \begin{eqnarray*} \text{velocity} &=& \text{momentum} \cdot \text{velocity} - \text{learning\_rate} \cdot g \\ w &=& w + \text{velocity} \end{eqnarray*} \]

When nesterov = true, this rule becomes:

\[ \begin{eqnarray*} \text{velocity} &=& \text{momentum} \cdot \text{velocity} - \text{learning\_rate} \cdot g \\ w &=& w + \text{momentum} \cdot \text{velocity} - \text{learning\_rate} \cdot g \end{eqnarray*} \]

Public Functions
- SGD(data_t learningRate = data_t(0.01), data_t momentum = data_t(0.0), bool nesterov = false)

Construct an SGD optimizer.
Parameters:
  - learningRate: The learning rate. This parameter is optional and defaults to 0.01.
  - momentum: Hyperparameter >= 0 that accelerates gradient descent in the relevant direction and dampens oscillations. This parameter is optional and defaults to 0, i.e., vanilla gradient descent.
  - nesterov: Whether to apply Nesterov momentum. This parameter is optional and defaults to false.
- bool useNesterov() const

True if this optimizer applies Nesterov momentum, false otherwise.
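The update rules above can be reproduced with plain scalar arithmetic. The following standalone C++ sketch (the `SgdSketch` struct is a hypothetical helper introduced here for illustration, not the internal implementation of elsa::ml::SGD) applies the documented vanilla, momentum, and Nesterov updates to a single scalar parameter:

```cpp
#include <iostream>

// Illustrative sketch of the documented SGD update rules on a single scalar
// parameter. This is not elsa::ml::SGD itself, only the math it documents.
struct SgdSketch {
    double learningRate = 0.01; // documented default
    double momentum     = 0.0;  // documented default (vanilla gradient descent)
    bool   nesterov     = false;
    double velocity     = 0.0;

    void step(double& w, double g) {
        if (momentum == 0.0) {
            // w = w - learning_rate * g
            w -= learningRate * g;
        } else if (!nesterov) {
            // velocity = momentum * velocity - learning_rate * g
            // w        = w + velocity
            velocity = momentum * velocity - learningRate * g;
            w += velocity;
        } else {
            // velocity = momentum * velocity - learning_rate * g
            // w        = w + momentum * velocity - learning_rate * g
            velocity = momentum * velocity - learningRate * g;
            w += momentum * velocity - learningRate * g;
        }
    }
};

int main() {
    SgdSketch sgd;
    sgd.momentum = 0.9;
    sgd.nesterov = true;

    double w = 1.0;
    sgd.step(w, /*gradient=*/0.5); // one Nesterov momentum step
    std::cout << "w after one step: " << w << '\n';
}
```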
- template<typename data_t = real_t> class elsa::ml::Adam : public elsa::ml::Optimizer<real_t>

Optimizer that implements the Adam algorithm.
Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments.
According to Kingma et al., 2014, the method is “computationally efficient, has little memory requirement, invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data/parameters”.
Public Functions
- Adam(data_t learningRate = data_t(0.001), data_t beta1 = data_t(0.9), data_t beta2 = data_t(0.999), data_t epsilon = data_t(1e-7))

Construct an Adam optimizer.
Parameters:
  - learningRate: The learning rate. This parameter is optional and defaults to 0.001.
  - beta1: The exponential decay rate for the 1st moment estimates. This parameter is optional and defaults to 0.9.
  - beta2: The exponential decay rate for the 2nd moment estimates. This parameter is optional and defaults to 0.999.
  - epsilon: A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper. This parameter is optional and defaults to 1e-7.
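The Adam update can likewise be written out directly. The sketch below (the `AdamSketch` struct is a hypothetical helper for illustration, not the elsa::ml::Adam implementation) follows the “epsilon hat” formulation referenced above, i.e. the formula just before Section 2.1 in Kingma & Ba (2014), where the bias correction is folded into the step size and epsilon is added to the square root of the second moment:

```cpp
#include <cmath>
#include <iostream>

// Illustrative sketch of the Adam update on a single scalar parameter,
// written in the "epsilon hat" form referred to by the constructor docs.
// This is not elsa::ml::Adam itself, only the algorithm it documents.
struct AdamSketch {
    double learningRate = 0.001;  // documented default
    double beta1        = 0.9;    // decay rate for the 1st moment estimate
    double beta2        = 0.999;  // decay rate for the 2nd moment estimate
    double epsilon      = 1e-7;   // "epsilon hat", for numerical stability

    double m = 0.0; // 1st moment estimate (mean of gradients)
    double v = 0.0; // 2nd moment estimate (mean of squared gradients)
    long   t = 0;   // time step

    void step(double& w, double g) {
        ++t;
        m = beta1 * m + (1.0 - beta1) * g;      // update biased 1st moment
        v = beta2 * v + (1.0 - beta2) * g * g;  // update biased 2nd moment

        // Bias correction folded into the step size (epsilon-hat form):
        // alpha_t = learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
        double alphaT = learningRate * std::sqrt(1.0 - std::pow(beta2, t))
                        / (1.0 - std::pow(beta1, t));
        w -= alphaT * m / (std::sqrt(v) + epsilon);
    }
};

int main() {
    AdamSketch adam;
    double w = 1.0;
    for (int i = 0; i < 3; ++i)
        adam.step(w, /*gradient=*/0.5);
    std::cout << "w after three steps: " << w << '\n';
}
```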