Backbones

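We first compare CNNs, Transformers, and MLPs, then cover the DLA paper and RepVGG (grouped here with the NAS series, although RepVGG itself is hand-designed rather than searched).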

CNN / Transformer / MLP

MLP -> CNN -> Transformer -> MLP

CNN

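Some comparisons report that, whether trained directly with supervision on downstream data or pre-trained and then fine-tuned, CNN models based on dilated convolution (also called atrous convolution) or dynamic convolution slightly outperform Transformer models, and that the CNN models are also faster.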

Drawbacks

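A CNN cannot capture sufficiently long-range dependencies, and this is its fundamental weakness. Techniques such as dilated convolution enlarge the receptive field, but they do not reach every position in a single step the way a Transformer can in theory.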
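To make the receptive-field point concrete, here is a minimal sketch that computes the receptive field of a stack of stride-1 convolutions. The helper name `receptive_field` is made up for illustration, and it assumes every layer uses the same kernel size and stride 1.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in input positions) of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Three plain 3x3 layers versus three layers with dilations 1, 2, 4.
print(receptive_field(3, [1, 1, 1]))  # 7
print(receptive_field(3, [1, 2, 4]))  # 15
```

Dilation makes the field grow faster, but it still grows layer by layer, whereas one self-attention layer connects every position to every other position.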

Transformer

There are two kinds of attention: self-attention and cross-attention.

The self-attention mechanism is the highlight of the original Transformer paper for feature extraction. However, self-attention keeps the shape of its input, because the query is derived from the same input X as the keys and values. To change the shape of the output, a query with a different shape (the desired output shape) has to be used, which is what cross-attention does.
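The shape argument above can be checked with a small NumPy sketch. It is a simplification under stated assumptions: single-head scaled dot-product attention with the learned query/key/value projections left out, and `latent_q` is a hypothetical set of 4 query vectors standing in for whatever produces the desired output shape.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention. q: (m, d), k and v: (n, d) -> output (m, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))        # 10 input tokens, 16 channels
latent_q = rng.normal(size=(4, 16))  # hypothetical queries with the desired output length

self_out = attention(x, x, x)          # query comes from x: output stays (10, 16)
cross_out = attention(latent_q, x, x)  # query comes from elsewhere: output is (4, 16)
print(self_out.shape, cross_out.shape)
```

The number of output rows always equals the number of query rows, so only a query that does not come from X can change the output length.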

MLP

The multilayer perceptron (MLP), a type of feedforward neural network, is one of the earliest artificial neural networks. Its structure is simple: an input layer, one or more hidden layers, and an output layer.

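Its defining feature is that information flows only forward, with no cycles or loops. If information is fed back from the output toward the input, the network is instead a recurrent neural network (RNN).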
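As a minimal illustration of "forward only", the sketch below runs an MLP forward pass in NumPy. The function name `mlp_forward`, the ReLU activation, and the layer sizes are choices made for this example, not something the notes prescribe.

```python
import numpy as np


def mlp_forward(x, weights, biases):
    """Forward pass: each layer feeds only the next one, never an earlier one."""
    h = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if i < len(weights) - 1:  # nonlinearity on hidden layers only
            h = np.maximum(h, 0.0)
    return h


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # batch of 5 inputs with 8 features
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 3))]
biases = [np.zeros(16), np.zeros(3)]
print(mlp_forward(x, weights, biases).shape)  # (5, 3)
```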

Pay Attention to MLPs

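The Google team behind ViT proposed MLP-Mixer, an architecture that uses neither convolution nor self-attention. The design is very simple, yet on ImageNet it reaches performance comparable to CNNs and ViT.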

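The core idea of MLP-Mixer is to split the image into many patches, each playing the role of a Transformer token, and then process the data with MLPs combined with matrix transposes. Each layer contains two parts: a token-mixing MLP block and a channel-mixing MLP block. The former treats each channel as a separate unit and mixes information across tokens; the latter treats each token as a separate unit and mixes information across channels.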
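The two mixing steps can be sketched in NumPy as follows. This is a simplified sketch rather than the reference implementation: biases and the learned LayerNorm scale and shift are omitted, GELU uses the tanh approximation, and the weight names (`tok_w1`, `ch_w1`, and so on) are invented for the example.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def mlp(x, w1, w2):
    return gelu(x @ w1) @ w2


def mixer_block(x, tok_w1, tok_w2, ch_w1, ch_w2):
    """x: (S, C) with S tokens (patches) and C channels; output has the same shape."""
    # Token mixing: transpose so the MLP runs along the token axis, per channel.
    y = layer_norm(x).T                # (C, S)
    x = x + mlp(y, tok_w1, tok_w2).T   # skip connection, back to (S, C)
    # Channel mixing: the MLP runs along the channel axis, per token.
    return x + mlp(layer_norm(x), ch_w1, ch_w2)


rng = np.random.default_rng(0)
S, C = 16, 32                          # e.g. 16 patches, 32 channels
x = rng.normal(size=(S, C))
out = mixer_block(
    x,
    rng.normal(size=(S, 24)) * 0.1, rng.normal(size=(24, S)) * 0.1,
    rng.normal(size=(C, 64)) * 0.1, rng.normal(size=(64, C)) * 0.1,
)
print(out.shape)  # (16, 32)
```

The transpose is the only thing that distinguishes the two MLPs: the same kind of dense layer is applied once along the token axis and once along the channel axis.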

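From a theoretical point of view, MLPs, Transformers, and convolutions are closely related mathematically. Their architectures are similar, and they differ mainly in optimization and implementation details.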

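Despite its simplicity, Mixer achieves highly competitive results. When pre-trained on large datasets (around 100M images), it comes close to state-of-the-art performance: 87.94% top-1 accuracy. When trained on moderately sized data with modern regularization techniques, it also performs strongly.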

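Overall, the experiments in Pay Attention to MLPs (the gMLP paper) suggest that self-attention is not a necessary ingredient for scaling up machine learning models. As data and compute grow, models with simple spatial interaction mechanisms such as gMLP can match the performance of Transformers, and self-attention can be removed or its role substantially reduced.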

Assessment

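MLP-Mixer's performance grows with the amount of data, and it relies mainly on large-scale data to sustain that performance. Its structural design brings no theoretical innovation and may even sacrifice interpretability and robustness.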

MLP models only come close to attention-based models in accuracy, but their compute cost is low, so they offer good value for the cost.

MLP vs. Transformers

MLPs and Transformers have similar input and output interfaces if we ignore the internal processing (hidden layers in an MLP, cross-attention modules in a Transformer), as illustrated below.

Comparing the three

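Convolution uses only local connections, so it is computationally efficient.
Self-attention uses dynamic (input-dependent) weights, so it has larger model capacity, and it also has a global receptive field.
An MLP also has a global receptive field, but it does not use dynamic weights.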

Attention-based models have a higher ceiling, but they are harder to train.

DLA

DLA (Deep Layer Aggregation)

Paper: Deep Layer Aggregation (a 2017 paper that still holds up)

Code: https://github.com/ucbdrive/dla

Even with the depth of features in a convolutional network, a layer in isolation is not enough: compounding and aggregating these representations improves inference of what and where.

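The skip connections used in ResNet backbones are not enough on their own and can be improved further. A ResNet skip connection merges features by summation; concatenation is another option. "What and where" refers to recognizing what an object is and locating where it is, as in object detection.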
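The difference between the two merge styles can be sketched in NumPy. This is an illustrative sketch, not DLA's actual aggregation node: `aggregate_concat` is a hypothetical helper that concatenates along channels and applies a 1x1 convolution (written as an `einsum`), with normalization and nonlinearity left out.

```python
import numpy as np


def aggregate_sum(a, b):
    """ResNet-style merge: element-wise sum, both inputs must have the same shape."""
    return a + b


def aggregate_concat(a, b, w):
    """Concatenate along channels, then mix with a 1x1 convolution.

    a, b: (C, H, W); w: (C_out, 2C). Returns (C_out, H, W).
    """
    stacked = np.concatenate([a, b], axis=0)        # (2C, H, W)
    return np.einsum("oc,chw->ohw", w, stacked)     # 1x1 conv = per-pixel linear map


rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8, 8))
b = rng.normal(size=(4, 8, 8))

# With w = [I, I] the concat merge reproduces the plain sum,
# so summation is one special case of what a learned concat merge can express.
w_sum = np.concatenate([np.eye(4), np.eye(4)], axis=1)  # (4, 8)
print(np.allclose(aggregate_concat(a, b, w_sum), aggregate_sum(a, b)))  # True
```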

(It is generally believed that downsampling with pooling loses spatial information.)

Intuition of the paper

Dense connections, from DenseNet, aggregate semantic information. Feature pyramids aggregate spatial information. DLA combines the two more effectively so as to better capture both what and where.

For ease of discussion, a CNN architecture is broken down into modules: a CNN consists of multiple stages, a stage consists of multiple blocks, and a block contains multiple layers.

In this work, we investigate how to aggregate layers to better fuse semantic and spatial information for recognition and localization.

IDA: iterative deep aggregation

HDA: hierarchical deep aggregation

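IDA and HDA are simply two different ways of extending the network: the former extends it horizontally, the latter vertically. Both aggregate information between layers.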

We introduce two structures for deep layer aggregation (DLA): iterative deep aggregation (IDA) and hierarchical deep aggregation (HDA).

IDA focuses on fusing resolutions and scales while HDA focuses on merging features from all modules and channels. HDA assembles its own hierarchy of tree-structured connections that cross and merge stages to aggregate different levels of representation. Our schemes can be combined to compound improvements.

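IDA does a better job of fusing feature maps of different scales and resolutions. Deeper parts of the network carry more semantic information but are coarser in spatial resolution. IDA starts at the shallowest, smallest scale and then iteratively merges deeper, larger scales. In this way, shallow features are refined as they are aggregated through successive stages.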
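The iterative pattern can be sketched as a left fold over the stage outputs. This is a rough sketch under assumptions, not the paper's code: `node` is a hypothetical stand-in for the learned aggregation node (here an average followed by ReLU), all stages are assumed to share the same channel count, and the running aggregate is brought to the next stage's resolution by 2x average pooling.

```python
import numpy as np


def downsample2x(x):
    """2x2 average pooling on a (C, H, W) array with even H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def node(a, b):
    """Hypothetical aggregation node: average the two inputs, then ReLU."""
    return np.maximum(0.5 * (a + b), 0.0)


def ida(features):
    """Start from the shallowest feature and merge each deeper one in turn."""
    agg = features[0]
    for deeper in features[1:]:
        agg = node(downsample2x(agg), deeper)  # align resolution, then aggregate
    return agg


rng = np.random.default_rng(0)
stages = [rng.normal(size=(4, 8, 8)),   # shallow: high resolution
          rng.normal(size=(4, 4, 4)),
          rng.normal(size=(4, 2, 2))]   # deep: low resolution
print(ida(stages).shape)  # (4, 2, 2)
```

Because the shallowest feature passes through every aggregation node, it is the one that gets refined the most, which matches the description above.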

HDA has a tree-like structure that combines different stages and blocks, merging shallow and deep features to produce richer feature combinations.

The figure above shows the two structures proposed in the paper.

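The DLA network combines IDA and HDA and is the core of the paper. In the figure, the red boxes mark HDA's tree-like structure and the yellow lines mark IDA's iterative connections. HDA fuses the shallow and deep representations of the input image, and IDA then refines them iteratively to produce the final output.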

DLA is a general architecture that can easily be integrated into existing CNN structures for a variety of computer vision tasks.

Experimental results

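The paper proposes IDA and HDA to fuse features across blocks and stages, combines them into the DLA module, and builds CNNs on top of DLA;
Experiments show that on classification and semantic segmentation tasks, DLA gives a clear performance gain over networks of comparable size.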

RepVGG

We present a simple but powerful architecture of convolutional neural network, which has a VGG-like inference-time body composed of nothing but a stack of 3 × 3 convolution and ReLU, while the training-time model has a multi-branch topology. Such decoupling of the training-time and inference-time architecture is realized by a structural re-parameterization technique so that the model is named RepVGG.

Approach:

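How can a network have both high accuracy and fast inference? RepVGG's answer is structural re-parameterization: train with a multi-branch structure to improve accuracy, then convert the model to a single-path structure for inference, which uses less GPU memory and runs faster.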

This is a case of wanting it both ways, and it turns out to be possible, so there is reason for optimism.

Why this works is the underlying principle. I do not fully understand it yet, so it stays on the to-do list.

Assessment

RepVGG models are fast, simple and practical ConvNets designed for the maximum speed on GPU and specialized hardware, less concerning the number of parameters. They are more parameter-efficient than ResNets but may be less favored than the mobile-regime models like MobileNets [16, 30, 15] and ShuffleNets [41, 24] for low-power devices.

These are the limitations stated by the authors.

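This is a very solid piece of work from an industry perspective. With the re-parameterization trick, multi-branch convolutions can be merged into a single convolution whose output is mathematically identical. In my own projects, small models obtained by tuning the number of blocks per stage and the width are far more accurate than ShuffleNet and MobileNet at the same inference speed.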
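The merge relies on convolution being linear in its kernel, which the following NumPy sketch checks numerically. It is deliberately simplified: batch normalization, which RepVGG also folds into each branch before merging, is omitted, and `conv2d` and `fuse_branches` are helper names written for this example rather than functions from the RepVGG repository.

```python
import numpy as np


def conv2d(x, w, pad):
    """Naive stride-1 convolution. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, _, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1], x.shape[2]
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + k, j:j + k]              # (C_in, k, k)
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return out


def fuse_branches(w3, w1):
    """Merge a 3x3 branch, a 1x1 branch, and an identity branch into one 3x3 kernel."""
    channels = w3.shape[0]
    w1_as_3 = np.pad(w1, ((0, 0), (0, 0), (1, 1), (1, 1)))  # 1x1 kernel placed at the center
    w_id = np.zeros((channels, channels, 3, 3))
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0                              # identity as a 3x3 kernel
    return w3 + w1_as_3 + w_id


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 6))
w3 = rng.normal(size=(4, 4, 3, 3))
w1 = rng.normal(size=(4, 4, 1, 1))

multi_branch = conv2d(x, w3, pad=1) + conv2d(x, w1, pad=0) + x  # training-time form
single_path = conv2d(x, fuse_branches(w3, w1), pad=1)           # inference-time form
print(np.allclose(multi_branch, single_path))  # True
```

The identity branch requires equal input and output channel counts, which is why the sketch uses a square 4-to-4 layer.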

Paper: https://arxiv.org/abs/2101.03697

GitHub: https://github.com/DingXiaoH/RepVGG

Video tutorial: https://www.bilibili.com/s/video/BV1k54y1V7Sx

Other notes

The problem being addressed is "what and where".

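DenseNet is the most representative semantic-fusion network. Its design concatenates features from different layers through skip connections, in order to better propagate features and losses.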

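FPN is the most representative spatial-fusion network. Its design uses top-down and lateral connections to equalize resolution and standardize semantics across levels.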

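Reflection: how semantic fusion and spatial fusion differ:

Semantic fusion:

It fuses channel-wise information across levels;
The features mostly differ in channel count but share the same spatial scale, so no scale alignment is needed;
It mainly preserves fine-grained (micro-level) information;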

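Spatial fusion:

It fuses feature-map information across levels;
The channel counts are the same and the spatial scales differ by a fixed ratio, so scale alignment is needed. This is what the paper means by equalizing and standardizing;
It mainly preserves coarse (macro-level) information;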
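The contrast between the two fusion styles shows up directly in the tensor shapes, as in the NumPy sketch below. The helper names are invented for the example, nearest-neighbor upsampling is an assumption for the alignment step, and the learned convolutions that normally follow each fusion are left out.

```python
import numpy as np


def semantic_fuse(a, b):
    """DenseNet-style: same spatial size, channels are concatenated."""
    assert a.shape[1:] == b.shape[1:], "spatial sizes must already match"
    return np.concatenate([a, b], axis=0)


def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) array."""
    return np.repeat(np.repeat(x, 2, axis=1), 2, axis=2)


def spatial_fuse(fine, coarse):
    """FPN-style: same channel count, coarse map is upsampled to align, then added."""
    assert fine.shape[0] == coarse.shape[0], "channel counts must match"
    return fine + upsample2x(coarse)


rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16, 16))
b = rng.normal(size=(12, 16, 16))       # different channels, same scale
fine = rng.normal(size=(8, 16, 16))
coarse = rng.normal(size=(8, 8, 8))     # same channels, half the scale

print(semantic_fuse(a, b).shape)        # (20, 16, 16)
print(spatial_fuse(fine, coarse).shape) # (8, 16, 16)
```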
