Setting hyper-parameters can seem like a black art that takes years of experience to acquire. Currently, there is no simple and universal way to set them, specifically the batch size, learning rate, momentum, and weight decay. A grid search or random search may sound like a good idea. In this blog, I'd like to share ideas I have gathered from reading papers and from my own projects.

Hyper-parameters

Batch Size

The learning rate is perhaps the most important hyper-parameter, but we choose the batch size first because a large batch size calls for a large learning rate in most circumstances.

**A general principle is: use as large a batch size as fits in your CPU and/or GPU memory.** There are several reasons:

  • Larger batch sizes permit the use of larger learning rates.
  • A constant number of iterations favors larger batch sizes.

However, small batch sizes add more regularization while large batch sizes add less, so choose the batch size while balancing the amount of regularization you need.
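As a rough sketch of choosing the batch size first and then scaling the learning rate with it (assuming PyTorch; the base values and the linear scaling rule here are illustrative, not a prescription):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for your real data.
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

base_batch_size = 64   # batch size at which base_lr is known to work
base_lr = 0.1          # learning rate tuned for base_batch_size
batch_size = 512       # as large as fits in CPU/GPU memory

# Scale the learning rate with the batch size (linear scaling rule).
lr = base_lr * batch_size / base_batch_size

loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
```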

Learning Rate

Gradient descent has two important control factors: the step size, which is controlled by the learning rate, and the direction, which is given by the gradient. So to control how "fast" and how "accurately" gradient descent converges, we can adjust these two factors.

If the learning rate is set too small, convergence takes far too long.

If the learning rate is set too large, the parameters oscillate around the minimum but never converge to it.

A learning rate that is too small slows down optimization and increases training time, while a learning rate that is too large may cause the network parameters to swing back and forth around the optimum so that the network never converges. A method proven effective in practice is to use a learning rate that decays with the number of iterations, which balances training efficiency with stability in the later stages.

So if the loss curve is jittery near the end of training, the learning rate is probably too large, and that is what causes the oscillation.

Two common decay schedules are polynomial decay and cosine decay.
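As a minimal sketch of the two schedules (the exact formulas vary across frameworks; `power`, `end_lr`, and `total_steps` are illustrative parameters):

```python
import math

def polynomial_decay(base_lr, step, total_steps, power=2.0, end_lr=0.0):
    """Decay the learning rate polynomially from base_lr down to end_lr."""
    frac = min(step, total_steps) / total_steps
    return (base_lr - end_lr) * (1.0 - frac) ** power + end_lr

def cosine_decay(base_lr, step, total_steps, end_lr=0.0):
    """Decay the learning rate along a half cosine from base_lr down to end_lr."""
    frac = min(step, total_steps) / total_steps
    return end_lr + 0.5 * (base_lr - end_lr) * (1.0 + math.cos(math.pi * frac))
```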

Next, we introduce an idea from [Cyclical Learning Rates for Training Neural Networks][1]: cyclical learning rates (CLR).

Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values. The essence of this learning rate policy comes from the observation that increasing the learning rate might have a short-term negative effect and yet achieve a longer-term beneficial effect. This observation leads to the idea of letting the learning rate vary within a range of values rather than adopting a stepwise fixed or exponentially decreasing value. That is, one sets minimum and maximum boundaries, and the learning rate cyclically varies between these bounds.
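The basic triangular policy from the paper can be sketched as follows, where `step_size` is the number of iterations in half a cycle:

```python
import math

def triangular_clr(step, step_size, lr_base, lr_max):
    """Triangular cyclical learning rate: rise linearly from lr_base to
    lr_max over step_size iterations, then fall back to lr_base."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return lr_base + (lr_max - lr_base) * max(0.0, 1.0 - x)
```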

An intuitive understanding of why CLR methods work comes from considering the loss function topology. Dauphin et al. argue that the difficulty in minimizing the loss arises from **saddle points rather than poor local minima**. Saddle points have small gradients that slow the learning process. However, increasing the learning rate allows for more rapid traversal of saddle point plateaus.

But how can we find the minimum and maximum bounds? There is a simple way to estimate reasonable boundary values with one training run of the network for a few epochs: the "LR range test". Run your model for several epochs while letting the learning rate increase linearly between low and high LR values. For example, set both the step size and maxiter to the same number of iterations; the learning rate will then increase linearly from the minimum value to the maximum value during this short run. Next, plot the accuracy versus the learning rate. Note the learning rate value where the accuracy starts to increase, and the value where the accuracy slows, becomes ragged, or starts to fall. These two learning rates are good choices for the bounds; that is, set $lr_{base}$ to the first value and $lr_{max}$ to the latter.
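A minimal sketch of such a range test, assuming a PyTorch model, data loader, and loss function (the loss is recorded here instead of the accuracy, but the idea is the same):

```python
import torch

def lr_range_test(model, loader, loss_fn, lr_min=1e-5, lr_max=1.0, num_iters=1000):
    """Increase the lr linearly over a short run and record (lr, loss) pairs.

    Plot the result: pick lr_base where the loss starts to improve and
    lr_max just before training becomes ragged or diverges.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    history, it = [], 0
    while it < num_iters:
        for x, y in loader:
            if it >= num_iters:
                break
            lr = lr_min + (lr_max - lr_min) * it / num_iters  # linear ramp
            for group in optimizer.param_groups:
                group["lr"] = lr
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            history.append((lr, loss.item()))
            it += 1
    return history
```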

The motivation for the cycle is that, late in training, a very small learning rate can leave the network oscillating inside some local minimum; suddenly increasing the learning rate lets it jump out of a region where the loss is no longer improving and explore other regions.

Those are the main strategies for setting the learning rate in gradient descent.

Momentum

Since the learning rate is regarded as the most important hyper-parameter to tune, and momentum is closely tied to it, momentum matters as well. Like the learning rate, it is valuable to set momentum as large as possible without causing instabilities during training.

A large learning rate can deal with local minima, but it fails at saddle points, and that is where momentum comes to the rescue.

The loss surface can look like the following picture. In mathematics, a saddle point or minimax point is a point on the surface of the graph of a function where the slopes (derivatives) in orthogonal directions are all zero (a critical point), but which is not a local extremum of the function.

Your first step from the very top would likely take you down, but then you'd be on a flat rice terrace. The gradient would be zero, and you'd have nowhere to go. To remedy this, we employ momentum: the algorithm remembers its last step and adds some proportion of it to the current step. This way, even if the algorithm is stuck in a flat region, or a small local minimum, it can get out and continue towards the true minimum.

In summary: when performing gradient descent, the learning rate controls how much the current gradient affects the next step, while momentum controls how much past steps affect the next step.
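A tiny worked example of that update rule, minimizing $f(w) = (w - 3)^2$ with plain SGD plus momentum (the learning rate and momentum values are illustrative):

```python
def grad(w):
    return 2.0 * (w - 3.0)   # derivative of f(w) = (w - 3)^2

w, velocity = 0.0, 0.0
lr, momentum = 0.1, 0.9

for step in range(200):
    # velocity carries a fraction of past steps; lr scales the current gradient
    velocity = momentum * velocity - lr * grad(w)
    w += velocity

print(w)  # ends up close to the minimum at w = 3
```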

Weight Decay

When training neural networks, it is common to use "weight decay," where after each update the weights are multiplied by a factor slightly less than 1. This prevents the weights from growing too large and can be seen as gradient descent on a quadratic regularization term.

But why?

Large weights can become tied to particular patterns in the input data (x), which means the model almost hard-codes certain values. The model then fits the training data well but fits the test data less well.

The idea of weight decay is simple: to prevent overfitting, every time we update a weight $w$ with the gradient $\nabla J$ with respect to $w$, we also subtract $\lambda \cdot w$ from it. This gives the weights a tendency to decay towards zero, hence the name. L2 regularization is one form of weight decay: $$ J(W; X, y) + \frac{1}{2}\lambda \|W\|^2 $$
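As a minimal numpy sketch of the two equivalent views of this update for plain SGD ($\lambda$ and the learning rate are illustrative values; `grad_data` stands in for $\nabla J$):

```python
import numpy as np

lam, lr = 1e-4, 0.1
w = np.random.randn(10)
grad_data = np.random.randn(10)   # stand-in for the gradient of J(W; X, y)

# (a) L2 regularization: add lam * w to the gradient of the data loss.
w_l2 = w - lr * (grad_data + lam * w)

# (b) Weight decay: shrink the weights directly, then take the gradient step.
w_wd = w * (1.0 - lr * lam) - lr * grad_data

# For plain SGD the two updates are identical: np.allclose(w_l2, w_wd) is True.
```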

However, this equivalence does not hold for all gradient-based algorithms; it was recently shown not to hold for adaptive gradient algorithms such as Adam.

In addition, weight decay is not the only regularization technique; other approaches such as dropout, bagging, early stopping, and parameter sharing also work very well in neural networks.

Takeaways

  1. Batch Size

Use as large a batch size as fits in your memory.

  2. Learning Rate

Perform a learning rate range test to identify a “large” learning rate.

  3. Momentum

Test with short runs of momentum values 0.99, 0.97, 0.95, and 0.9 to get the best value for momentum.

If using the 1-cycle learning rate schedule, it is better to use a cyclical momentum (CM) that starts at this maximum momentum value and decreases with increasing learning rate to a value of 0.8 or 0.85.
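A simplified sketch of one such cycle, omitting the final low-lr annihilation phase (all boundary values here are illustrative):

```python
def one_cycle(step, total_steps, lr_min=0.01, lr_max=0.1,
              mom_min=0.85, mom_max=0.95):
    """Learning rate ramps up and then back down over one cycle;
    momentum moves in the opposite direction (cyclical momentum)."""
    half = total_steps // 2
    frac = step / half if step <= half else 1.0 - (step - half) / half
    lr = lr_min + (lr_max - lr_min) * frac
    momentum = mom_max - (mom_max - mom_min) * frac
    return lr, momentum
```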

  4. Weight Decay

Run a grid search to determine the proper magnitude; it usually does not require more than one significant figure of accuracy. A more complex dataset requires less regularization, so test smaller weight decay values such as $10^{-4}$, $10^{-5}$, $10^{-6}$, and 0. A shallow architecture requires more regularization, so test larger weight decay values such as $10^{-2}$, $10^{-3}$, $10^{-4}$.
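A hypothetical sketch of such a grid search; `train_and_validate` is a placeholder name for your own training routine and is assumed to return a validation score:

```python
def train_and_validate(weight_decay):
    """Placeholder: train the model with this weight decay and return
    a validation accuracy."""
    raise NotImplementedError

candidates = [1e-2, 1e-3, 1e-4, 1e-5, 0.0]
scores = {wd: train_and_validate(wd) for wd in candidates}
best_wd = max(scores, key=scores.get)
```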

References

[1] Cyclical Learning Rates for Training Neural Networks
[2] A disciplined approach to neural network hyper-parameters