Overfit

分析机器学习和深度学习中出现的过拟合现象，从不同的角度简述常用的处理手段。

什么是过拟合?

训练数据集规模小是导致过拟合的原因，而网络足够的复杂（有能力）记住了所有的样本，然后在train sets 表现要远远好于 test sets。还有一种说法是网络拟合了噪声数据。

欠拟合的定义

A model is said to be underfit if it is unable to learn the patterns in the data properly. An underfit model doesn’t fully learn each and every example in the dataset.

如何判断过拟合

For a model that’s overfit, we have a perfect/close to perfect training set score while a poor test/validation score.

通过 learning rate 判断过拟合的方式（从图像的角度）

横向：同时跑多个模型，横向对比 train loss vs validation loss （或者 test loss）之间的差距

纵向：按照时间轴，validation loss 一开始就变得很小，随着时间增加，基本不动，就说明 model 没有学习到什么东西，只是在过拟合 train data。

从实验数据的角度：从 acc 这种绝对数值上看

If our model saw 99% accuracy on the training set but only 55% accuracy on the test set.

如何处理过拟合

处理该问题可以从数据和模型两个方面去考虑。

模型角度

简化模型，通过不断降低模型的复杂度（比如随机森林中的估计量，神经网络中的参数），最终达到一个平衡状态：模型足够简单到不产生过拟合，又足够复杂到能从数据中学习。这样操作时一个比较方便的方法是根据模型的复杂程度查看模型在所有数据集上的误差。如图 1 所示。

图 1

简化模型的另一个好处在于训练速度更加快.

数据角度

获取更多的数据
数据增强

获取更多的数据侧重获取得到的原始的训练数据集；而数据增强在图像处理中更加常见，主要是图像的变形，噪声方面进行考虑。

训练过程角度

提前终止 (early stop)

如图 1 ，当 test error 增加的时候，那么模型就应该停止了。

正则化角度

神经网络中有主要有两类实体：神经元和连接神经元的边。所以按照规范化的操作对象的不同可以分成两大类，一类是对于L 层的神经元的激活值或者说对于第 L+1 层网络神经元的输入值进行normalization 操作，比如说 batch normalization / Layer normalization 等方法都是属于这类；另一种是对于神经元之间相连的边上的权重进行规范化操作，比如说 weights normalization就属于这类。广义上讲，一般机器学习中的损失函数中加入的 L1/ L2 等正则项属于第二类。L1 正则项造成参数的稀疏性，使得大量的参数取得 0值， L2 正则项使得原始参数值有效的减小。通过这些规范化的手段改变参数值，已达到避免模型过拟合的目的。

（最初对于输入data 的normalization，是属于神经元的 normalization）

虽然上述方法分别对神经元和weights 进行了规范化，但本质上都实现了对数据的规范化，只是 scale 的参数的来源是不同的。

使用L1 or L2 在 loss function (error function) 中添加正则项，对损失函数中的weights 进行限制其变大。

神经元可以总结为三种方式：输入数据的 normalization、进入激活函数之前的 BN 和 layer normalization。对于神经元的激活值来说，不管哪种方式，其目标都是一样的，将激活值规范到均值为0，方差为1 的正太分布。

神经元的边的处理方式：weights normalization、损失函数中的正则项 L1, L2。

定义： BN针对一个minibatch的输入样本，计算均值和方差，基于计算的均值和方差来对某一层神经网络的输入X中每一个case进行归一化操作。

BN 的优点：

是一种正则化手段，加上BN 之后，学习率可以有很大的提高，可以加快模型的收敛速度
一般来说在激活函数之前比较好解释一些，效果好一些；输入激活函数之前进行了数据的归一化，防止进去到激活函数的饱和区。

所以可以得到BN 的适用场景：每个mini-batch 都比较大，数据分布比较接近。在训练之前，最好是做好了充分的shuffle，否则效果可能不太好。

BN 的不足：

高度依赖 mini-batch 的大小，当batch size 比较小的时候，效果不好。因为数据样本少，得不到有效的统计量，也可以说噪声比较大。当然是可以通过调整 batch size 的大小规避这种问题，但是有的任务要求 batch size 不能太大；并且BN 是无法应用到 online learning 中的，因为online 都是单实例更新模型，很难组织起 mini-batch 的结构。
对于相似级别的图像生成任务，BN 效果不佳对于图片分类等任务，只要能够找出关键特征，就能正确分类，这算是一种粗粒度的任务，在这种情形下通常 BN 是有积极效果的。但是对于有些输入输出都是图片的像素级别图片生成任务，比如图片风格转换等应用场景，使用 BN 会带来负面效果，这很可能是因为在 Mini-Batch 内多张无关的图片之间计算统计量，弱化了单张图片本身特有的一些细节信息。
因为输入的 Sequence 序列是不定长的，这源自同一个 Mini-Batch 中的训练实例有长有短。对于类似 RNN 这种动态网络结构，BN 使用起来不方便
训练时和推理时统计量不一致对于 BN 来说，采用 Mini-Batch 内实例来计算统计量，这在训练时没有问题，但是在模型训练好之后，在线推理的时候会有麻烦。因为在线推理或预测的时候，是单实例的，不存在 Mini-Batch。虽说实际使用并没大问题，但是确实存在训练和推理时刻统计量计算方法不一致的问题。

Layer normalization 也是一种正则化手段，更加偏向于样本进行 normalization。

BN vs LN ：

从图中看可以知道 batch是“竖”着来的，各个维度做归一化，所以与batch size有关系。 layer是“横”着来的，对一个样本，不同的神经元neuron间做归一化。显示了同一层的神经元的情况。假设这个mini-batch一共有N个样本，则Batch Normalization是对每一个维度进行归一。而Layer Normalization对于单个的样本就可以处理。

相同点： BN 和LN 都是可以很好的一直梯度消失和梯度爆炸的。

实践证明， LN 更加适合 RNN ，BN 更加适合CNN

至于各种 Normalization 的适用场景，可以简洁归纳如下：对于 RNN 的神经网络结构来说，目前只有 LayerNorm 是相对有效的；如果是 GAN 等图片生成或图片内容改写类型的任务，可以优先尝试 InstanceNorm；如果使用场景约束 BatchSize 必须设置很小，无疑此时考虑使用 GroupNorm；而其它任务情形应该优先考虑使用 BatchNorm。

参考文献：深度学习中的Normalization模型

BN 在train 和 test 阶段的不同

在train 的时候，BN 使用的是batch 样本的均值和方差；在 test 的时候，因为是 one case，这个时候的均值和方差具有很大的偏差，所以使用的是 pretrained model 中的均值和方差。

四种不同的normalization

归一化是对数据进行标准化，使数据服从均值为0，标准差为1 的高斯分布。从数学的角度就是减去均值除以方差。

Task-specific limitations. A key assumption in BN is the independence between samples appearing in each batch. While this assumption seems to hold for most convolutional networks used to classify images in conventional datasets, it falls short when employed in domains with strong correlations between samples, such as time-series prediction, reinforcement learning, and generative modeling. For example, BN requires modifications to work in recurrent networks [6], for which alternatives such as weight-normalization [27] and layer-normalization [2] were explicitly devised, without reaching the success and wide adoption of BN. Another example is Generative adversarial networks, which are also noted to suffer from the common form of BN. GAN training with BN proved unstable in some cases, decreasing the quality of the trained model [28]. Instead, it was replaced with virtual-BN [28], weight-norm [39] and spectral normalization [32]. Also, BN may be harmful even in plain classification tasks, when using unbalanced classes, or correlated instances. In addition, while BN is defined for the training phase of the models, it requires a running estimate for the evaluation phase – causing a noticeable difference between the two [19]. This shortcoming was addressed later by batch-renormalization [18], yet still requiring the original BN at the early steps of training. 在图像分类中，BN 效果比较好，并且BN 在 unbalanced class 中效果可能也不是很好；但是在风格转换（gan）中BN 一般不行，

Norm matters: efficient and accurate normalization schemes in deep networks

四种 normalization 时间轴

BN（Batch Normalization）于2015年由 Google 提出，开创了Normalization 先河；2016年出了LN（layer normalization）和IN（Instance Normalization）；2018年，Kaiming提出了GN（Group normalization），成为了ECCV 2018最佳论文提名。

正则化的本质是为了减少数据的复杂程度和减少模型的复杂程度，这样就可以设置更大的学习率，更快的进行收敛。

用更简单的语言来讲，各种Norm之间的差别只是针对的维度不同而已。

BN是在batch上，对N、H、W做归一化，而保留通道 C 的维度。BN对较小的batch size效果不好。BN适用于固定深度的前向神经网络，如CNN，不适用于RNN；
LN在通道方向上，对C、H、W归一化，主要对RNN效果明显；
IN在图像像素上，对H、W做归一化，用在风格化迁移；
GN将channel分组，然后再做归一化。

主要在 CNN中使用，适用于不变长的网络结构。

BN 的缺点

效果依赖于batch size，如果batch size 比较小，那么效果比较差
Recurrent connections in an RNN

不依赖于 batch size, 在 RNN 中效果更好。

Batch normalization 不适图像生成，因为在一个 mini-batch 中图像有不同的风格，不能把这个 batch 里面的数据都同一类标准化。

IN针对图像像素做normalization，最初用于图像的风格化迁移。

Instance Normalization is similar to layer normalization but goes one step further: it computes the mean/standard deviation and normalize across each channel in each training example. Originally devised for style transfer, the problem instance normalization tries to address is that the network should be agnostic to the contrast of the original image. Therefore, it is specific to images and not trivially extendable to RNNs. IN是 layer normalization的扩展，在图像中有三个维度（RGB），在某个维度进行layer normalization，所以只是局限于图像领域。

Instance Normalization是在样本N和通道C两个维度上滑动，对Batch中的N个样本里的每个样本n，和C个通道里的每个样本c，其组合[n, c]求对应的所有值的均值和方差，所以得到的是N⋅C个均值和方差。

GN是为了解决BN对较小的mini-batch size效果差的问题。

就像作者自己比较的说法：LN和IN只是GN的两种极端形式。我们对channel进行分组，分组数为G，即一共分G组。当G=1时，GN就是LN了；当G=C时，GN就是IN了。这也是一个有趣的比较，类似于GN是LN和IN的一个tradeoff。文章里对G的默认值是32。

Spectral Normalization

The weights of the generator are normalized using spectral normalization. weights 变化服从利普次子变化。

Spectral normalization for use in GANs was described by Takeru Miyato, et al. in their 2018 paper titled “Spectral Normalization for Generative Adversarial Networks.” Specifically, it involves normalizing the spectral norm of the weight matrix.

Spectral Normalization the spectral normalization normalized the weight matrix by their spectral norm or $\ell_2$ norm: $\hat{W} = \frac{W}{\lvert W\rvert_2}$. Experimental results show that spectral normalization improves the training of GANs with minimal additional tuning.

Note $W\in\mathbb{R}^{M\times (NWH)}$ is 2D representation of the weight tensor $W\in\mathbb{R}^{M\times N\times W\times H)}$ where $M$ is the number of output channels and $N$ is the number of input channels.

Previously, Miyato et al proposed a normalization technique called spectral normalization (SN). In a few words, SN constrains the Lipschitz constant of the convolutional filters. Spectral norm was used as a way to stabilize the training of the discriminator network. In practice, it worked very. Yet, there is one fundamental problem when training a normalized discriminator. Prior work has shown that regularized discriminators make the GAN training slower. For this reason, some workarounds consist of uneven the rate of update steps between the generator and the discriminator. In other words, we can update the discriminator a few times before updating the generator. For instance, regularized discriminators might require 5 or more update steps for 1 generator update. To solve the problem of slow learning and imbalanced update steps, there is a simple yet effective approach. It is important to note that in the GAN framework, G and D train together. In this context, Heusel et al introduced the two-timescale update rule (TTUR) in the GAN training. It consists of providing different learning rates for optimizing the generator and discriminator. The discriminator trains with a learning rate 4 times greater than G - 0.004 and 0.001 respectively. A larger learning rate means that the discriminator will absorb a larger part of the gradient signal. Hence, a higher learning rate eases the problem of slow learning of the regularized discriminator. Also, this approach makes it possible to use the same rate of updates for the generator and the discriminator. In fact, we use a 1:1 update interval between generator and discriminator. Moreover, [6] have shown that well-conditioned generators are causally related to GAN performance. Given that, Self-Attention for GANs proposed using spectral normalization for stabilizing training of the generator network as well. For G, spectral norm prevents the parameters to get very big and avoids unwanted gradients.

An Overview of Normalization Methods in Deep Learning

深度学习中的模型

Dropout 或者Dropconnect. 其理念就是在训练中随机让神经元无效（即dropout）或让网络中的连接无效（即dropconnect）。这个有点类似集成学习，提高网络模型的泛化性能，减少过拟合的问题。（类似bagging 的思想，使用不同的网络结构在不同的训练集上进行训练）.如果从集成学习角度理解dropout，那么resnet 网络是不是也有点集成学习的味道。

Dropout 的具体流程

首先随机（临时）删掉网络中一半的隐藏神经元，输入输出神经元保持不变（图3中虚线为部分临时被删除的神经元）
然后把输入x通过修改后的网络前向传播，然后把得到的损失结果通过修改的网络反向传播。一小批训练样本执行完这个过程后，在没有被删除的神经元上按照随机梯度下降法更新对应的参数（w，b）。
然后继续重复这一过程

恢复被删掉的神经元（此时被删除的神经元保持原样，而没有被删除的神经元已经有所更新）
从隐藏层神经元中随机选择一个一半大小的子集临时删除掉（备份被删除神经元的参数）。
对一小批训练样本，先前向传播然后反向传播损失并根据随机梯度下降法更新参数（w，b）（没有被删除的那一部分参数得到更新，删除的神经元参数保持被删除前的结果）。

Dropout vs Dropconnect

它们的区别大体在于，在训练过程中dropout是随机drop掉一些节点，而dropconnect则是随机drop掉一些边。

dropout 有两种实现方法： Vanilla Dropout 和 Inverted Dropout。前者是原始论文中的朴素版，后者在 Andrew Ng 的 cs231 课程中有介绍。

如果是 vanilla 版本，随机丢掉一些 input vector，总期望相对变小，为了保持 train 和 test 时候的一致性，那么test 的时候也应该缩小一些。

这样的问题在于，predict 的时候需要跟着 train 时候特定层的操作，这样很麻烦。

训练是哟金只有占比 p 的隐藏层参与训练，在测试阶段所有隐藏层都参与训练，那么得到的结果相比训练时候要大 (1/p) ，为了避免这样的情况，在测试时候需要将输出结构乘以 p 使得下一层输入保持不变

所以 inverted 版本就出来，将所有的修改都放在训练阶段，测试阶段保持不变。

在训练的时候直接将 dropout 后的权重扩大 (1/p) 倍，这样结果的 scale 保持不变，那么在测试时候不用做额外的操作。

目前主流的训练框架，实现的都是 inverted 版本，所以在预测的时候不需要费心调整，把dropout 当做透明处理就行。

pytorch 中的实现

1

torch.nn.Dropout(p=0.5, inplace=False)

p – probability of an element to be zeroed. Default: 0.5
inplace – If set to True, will do this operation in-place. Default: False

首先 $p$ 表示丢弃（drop out）的概率，当 $p$ 数值越大，那么程度越高，防止过拟合的效果越明显。

使用多种模型

Bagging：最典型的就是随机森林( Random Forest)，通过不相关的决策树在不同的数据集上进行训练，最后的每个模型使用相同的权重来“融合”。 Boosting: 在简单的网络上不断提升，应该是更加容易 overfit。

集成学习算法也可以有效的减轻过拟合。 Bagging 通过平均多个模型的结果，降低模型的方差。

如何处理欠拟合

增加网络的深度

pytorch 官方实现:https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py

主要的区别是将

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


def wide_resnet50_2(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> ResNet:
    r"""Wide ResNet-50-2 model from
    `"Wide Residual Networks" <https://arxiv.org/pdf/1605.07146.pdf>`_.
    The model is the same as ResNet except for the bottleneck number of channels
    which is twice larger in every block. The number of channels in outer 1x1
    convolutions is the same, e.g. last block in ResNet-50 has 2048-512-2048
    channels, and in Wide ResNet-50-2 has 2048-1024-2048.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    kwargs['width_per_group'] = 64 * 2
    return _resnet('wide_resnet50_2', Bottleneck, [3, 4, 6, 3],
                   pretrained, progress, **kwargs)

1
2
3
4
5
6
7
8
9


def resnet50(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> ResNet:
    r"""ResNet-50 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet50', Bottleneck, [3, 4, 6, 3], pretrained, progress,
                   **kwargs)

这里的 base_width 是用来设置 wide resnet 的。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


    def _make_layer(self, block: Type[Union[BasicBlock, Bottleneck]], planes: int, blocks: int,
                    stride: int = 1, dilate: bool = False) -> nn.Sequential:
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                            self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups,
                                base_width=self.base_width, dilation=self.dilation,
                                norm_layer=norm_layer))

总结

降低“过拟合”的方法：（1）获得更多的训练数据（2）降低模型复杂度（3）正则化方法（4）集成学习方法降低“欠拟合”风险的方法：（1）添加新特征（2）增加模型复杂度（3）减小正则化系数

文章目录