YellowFin is a new type of optimizer for deep neural networks that is purported to achieve faster convergence and better results than state-of-the-art optimizers like Adagrad, Adam, and RMSProp in many domains. I’m going to summarize my key takeaways from the paper, describe how I got the authors’ TensorFlow implementation working in Keras, and then put it to work on a financial time series forecasting problem.

optimization in deep learning

The choice of optimizer (and associated parameters) in the process of training a deep neural network plays a large role in the rate of convergence and the overall loss of the model. An optimizer that converges slowly is computationally costly, and one that gets stuck in local minima will prevent the parameters from being tuned optimally. The standard in deep learning is the family of gradient descent optimizers.

Generally speaking, given a model with \(d\) parameters \(\theta \in \mathbb{R}^d\), you need to minimize a loss function \( J(\theta) \). We can do this by calculating the gradient of the loss function with respect to our parameters, \(\nabla_\theta J(\theta)\), and then descending towards a lower value by moving in the opposite direction of the gradient. There’s an excellent overview of gradient descent optimization algorithms by Sebastian Ruder that goes into great detail. He’s updated it as of June 2017 to include AdaMax and Nadam.
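
For reference, the vanilla gradient descent update with learning rate \(\alpha\) is just:

\[ \theta_{t+1} = \theta_t - \alpha \, \nabla_\theta J(\theta_t) \]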

YellowFin and momentum tuning

Introduced by Jian Zhang, Ioannis Mitliagkas, and Christopher Ré in the paper YellowFin and the Art of Momentum Tuning, this method is novel because it tunes the momentum parameter automatically during training.

Most state-of-the-art methods store gradients from previous time steps, decayed by some parameter, to form a ‘momentum’ term that is combined with the gradient at the current time step. However, the parameters governing the momentum term (e.g. \(\beta_1\) and \(\beta_2\) for the Adam optimizer) are static during training.
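
In the classical formulation, with momentum \(\mu\) and learning rate \(\alpha\), the update maintains a velocity term that accumulates past gradients:

\[ v_t = \mu v_{t-1} - \alpha \, \nabla_\theta J(\theta_t), \qquad \theta_{t+1} = \theta_t + v_t \]

Here are my key takeaways from the YellowFin paper: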

  • Recent literature has suggested that algorithms that perform element-wise adaptive gradient tuning yield only marginal benefits compared to vanilla stochastic gradient descent. Furthermore, models trained with these techniques may actually generalize more poorly.
  • In another recent paper, Mitliagkas et al. showed that asynchronous training introduces ‘momentum-like dynamics’ into optimization (amplifying the momentum that has already been specified in the parameters of the chosen optimizer).
  • The authors have found that tuning momentum as part of the optimizer parameter search improves convergence, but note that conducting a grid-search for the correct momentum and learning rates is challenging. Therefore, they propose automatic momentum tuning.
  • The authors demonstrate that momentum is robust to learning rate misspecification. The source of this property is that the spectral radius of the momentum operator, \(\rho(\mathbf{\mathit{A}}_t)\), is constant over some subset of the hyperparameter space. (The spectral radius of a matrix \(\rho(\mathit{X})\) is the largest absolute value of its eigenvalues.)

    In fact (for a ‘one dimensional strongly convex quadratic objective’), the spectral radius is constant at \(\rho(\mathbf{\mathit{A}}_t) = \sqrt{\mu}\) as long as

    \[ (1 - \sqrt{\mu})^2 \le \alpha h \le (1 + \sqrt{\mu})^2 \]

    where \(\mu\) is the momentum parameter, \(\alpha\) the learning rate, and \(\mathit{h}\) the curvature. This means that as long as the learning rate falls within the ‘robust region’ identified by the inequality above, the asymptotic behavior will be the same. The authors point out that this allows them to focus on freely tuning the momentum parameter for optimality, given that the above condition is satisfied.

  • The authors next extend the results regarding learning rate robustness to functions other than smooth, strongly convex ones in order to demonstrate robustness to curvature. They define a ‘generalized condition number’ (GCN), which is a measure of curvature along a scalar slice of a function. They derive optimal hyperparameters \(\mu^*\) and \(\alpha^*\), but do not provide a convergence rate guarantee; instead, they offer ‘empirical evidence to support this intuition.’
  • Using a noisy one-dimensional quadratic to model gradient descent with a noisy gradient (as in mini-batch descent), the authors develop tuning rules for the YellowFin optimizer’s learning rate \(\alpha\) and momentum \(\mu\). The rules depend on an estimate of the model’s distance to a local minimum \(\mathit{D}\), an estimate of the gradient variance \(\mathit{C}\), and the largest/smallest generalized curvatures \(\mathit{h_{max}}\) and \(\mathit{h_{min}}\).

    At each step, the authors solve what they call the ‘SingleStep’ equation, which is:

    \[ \mu_t = \operatorname*{argmin}_{\mu} \; \mu D^2 + \alpha^2 C \]

    subject to:

    \[ \mu \ge \left( \frac{\sqrt{h_{max}/h_{min}} - 1}{\sqrt{h_{max}/h_{min}} + 1} \right)^2 \]

    and:

    \[ \alpha = \frac{(1 - \sqrt{\mu})^2}{h_{min}} \]

    (A rough numerical sketch of this rule appears after this list.)

  • Using these rules, the authors demonstrate evidence of superior performance with a ResNet on CIFAR100:

[figure: CIFAR100 ResNet results from the paper]

  • As a last (but I think important) note, the authors assume a negative log-probability objective function in their calculations for the measurement functions used as inputs to SingleStep. This means it should play nicely with standard classification losses (e.g. binary and categorical cross-entropy), but probably not with others. I’ll explore this later.
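
To make the tuning rule concrete, here’s a minimal numerical sketch of SingleStep under the constraints above. It brute-forces the one-dimensional problem rather than solving it analytically as the released implementations do; the function name and toy inputs are mine:

import numpy as np

def single_step(D, C, h_min, h_max):
    """Rough numerical version of the SingleStep tuning rule.

    D     -- estimated distance to a local minimum
    C     -- estimated gradient variance
    h_min -- smallest estimated generalized curvature
    h_max -- largest estimated generalized curvature
    """
    # lower bound on sqrt(mu), from the generalized condition number
    sqrt_kappa = np.sqrt(h_max / h_min)
    x_low = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)

    # with alpha pinned at (1 - sqrt(mu))^2 / h_min, the objective
    # mu * D^2 + alpha^2 * C becomes one-dimensional in x = sqrt(mu)
    x = np.linspace(x_low, 1.0 - 1e-6, 10000)
    alpha = (1.0 - x) ** 2 / h_min
    objective = x ** 2 * D ** 2 + alpha ** 2 * C

    best = np.argmin(objective)
    return x[best] ** 2, alpha[best]  # (mu, alpha)

# toy numbers, purely to exercise the function
mu, alpha = single_step(D=1.0, C=0.1, h_min=0.5, h_max=50.0)
print(mu, alpha)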



making the TensorFlow version work in Keras

The authors of the YellowFin paper released an implementation of their work in TensorFlow at the time of publication, and another in PyTorch soon afterwards.

I was able to drop the YFOptimizer object straight into my TensorFlow projects, but noticed when I tried to use it in Keras with the TFOptimizer wrapper that the methods were not compatible.

Digging into the source code at keras/optimizers, I quickly saw that this class defines a get_updates function that requires the wrapped TensorFlow optimizer to have explicit compute_gradients and apply_gradients methods. Looking into the YellowFin source, you can see that the authors simplified the optimization process by aggregating compute_gradients and apply_gradients into a single minimize function.
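
For context, the wrapper’s get_updates boils down to something like this (paraphrased from keras/optimizers at the time of writing, not a verbatim copy):

# inside keras.optimizers.TFOptimizer (roughly paraphrased):
def get_updates(self, loss, params):
    # both calls assume the wrapped optimizer exposes the
    # standard tf.train.Optimizer interface
    grads = self.optimizer.compute_gradients(loss, params)
    opt_update = self.optimizer.apply_gradients(grads, global_step=self.iterations)
    return [opt_update]

Since YFOptimizer exposes neither method, the wrapper fails at the first line.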

It’s also worth noting that the YellowFin optimizer itself appears to be a wrapper around the tf.train.MomentumOptimizer object, which becomes an attribute of the YFOptimizer. This saves the authors a significant amount of work and lets them devote most of their code to the functions that estimate the distance to a local minimum \(\mathit{D}\), the gradient variance \(\mathit{C}\), and the largest/smallest generalized curvatures \(\mathit{h_{max}}\) and \(\mathit{h_{min}}\).
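
A minimal sketch of the kind of shim involved (the _optimizer attribute name and the method signatures are my assumptions about the implementation’s internals, not verified line-for-line against the released code):

from yellowfin import YFOptimizer

class KerasFriendlyYF(YFOptimizer):
    """Expose the methods keras.optimizers.TFOptimizer expects by
    delegating to the tf.train.MomentumOptimizer inside YFOptimizer."""

    def compute_gradients(self, loss, var_list=None, **kwargs):
        # gradient computation involves no YellowFin tuning logic, so it
        # can be forwarded straight to the wrapped momentum optimizer
        return self._optimizer.compute_gradients(loss, var_list=var_list, **kwargs)

    # apply_gradients needs similar treatment, but it also has to run
    # YellowFin's curvature/variance/distance estimators before stepping
    # the momentum optimizer, so it is omitted from this sketch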

I was able to make a slightly modified version of the YellowFin optimizer that seems to play nicely with Keras by explicitly defining the compute_gradients method and passing the global_step argument from the Keras API. I’d really like to make this a native Keras optimizer from scratch, but right now it can be used in its slightly modified state like this:

from yellowfin import YFOptimizer
from keras.optimizers import TFOptimizer

# define your optimizer
opt = TFOptimizer(YFOptimizer())

# compile a classification model (assumes `model` is already defined)
model.compile(loss = 'categorical_crossentropy',
              metrics = ['acc'],
              optimizer = opt)

I’ve tested it on GPU and CPU with Ubuntu 16.04, but if you find any issues please don’t hesitate to contact me or contribute on GitHub.



benchmark: financial time series forecasting

Starting from the CIFAR10 deep CNN script in the Keras examples repository, I was able to put together a working example of my modified YellowFin in Keras.
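
If the optimizer definition in that script still looks the way it did when I wrote this (an assumption on my part), the change is essentially a one-liner:

# in cifar10_cnn.py, replace the example's optimizer, e.g.
# opt = keras.optimizers.rmsprop(lr=0.0001, decay=1e-6)
# with the wrapped YellowFin optimizer:
from keras.optimizers import TFOptimizer
from yellowfin import YFOptimizer

opt = TFOptimizer(YFOptimizer())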

This is great, but I think it’s useful to validate new tools on data sets that aren’t part of the normal machine learning milieu. I’m playing around with some financial time series data right now (using 100 days of OHLCV data to forecast returns with hybrid recurrent/self-normalizing networks), and I’m having a lot of trouble tuning the learning rate to properly train my models. Hopefully YellowFin is the answer I’ve been looking for.