Anchor Free Object Detection

Study notes on anchor-free object detection.

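Differences between anchor-free and anchor-based

If an object is represented in 4D (two 2D coordinates, i.e. a box), the method is anchor-based; if the object is represented in 2D (a point), it is anchor-free. Rather than working with a box region, anchor-free methods focus on analysing points.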

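Why do we need anchors?

In the deep-learning era, object detection is usually modelled as classifying and regressing a set of candidate regions. In one-stage detectors these candidates are anchors generated in a sliding-window fashion; in two-stage detectors they are proposals generated by the RPN, which are still obtained by classifying and regressing anchors.

In anchor-based methods, even when each location has only one anchor, predictions are matched to targets through that anchor; in anchor-free methods they are usually matched through the point itself.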

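Why has anchor-free detection made a comeback?

No prior anchor boxes have to be placed on the image.

Multi-scale handling and sample imbalance are problems anchor-free methods must face. FPN and focal loss are widely recommended solutions, but not the only ones, and may stop being the default in the future.

The comeback is mainly credited to FPN and focal loss.

The first problem of anchor-free methods is that objects at different scales are hard to recognize. This used to be mitigated with a multi-scale image pyramid, at the cost of extra computation during training and deployment. Only when FPN (feature pyramid network) appeared in 2016 did the multi-scale problem get a compact solution.

The second problem is that too many negative samples make training difficult; anchor-based methods traditionally balance the samples through IoU-based matching. In 2017, focal loss (introduced with RetinaNet, pronounced reh-tuh-nuh) alleviated this sample imbalance to a certain extent.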

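The core idea of focal loss is to down-weight the loss of easy samples. In object detection, points that fall on the background are numerous and easy to classify. Focal loss avoids over-attending to these easy examples and focuses on hard examples. In the figure, the blue line is the ordinary cross entropy and the other lines are focal loss with different hyper-parameters; the curves become noticeably steeper. Other methods from the same period address a similar problem, for example hard negative mining.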

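Anchor-free detection is likewise split into two sub-problems: locating the object center and predicting the four box edges. For the box, the prediction is the distance from the pixel to the four edges of the ground-truth box.

Anchor-free methods generally come in two flavours:

keypoint-based: follows the approach of keypoint detection
center-based: treats points as samples, which is similar to anchor-based methods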

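Advantages of using anchors

Anchors of different sizes reduce the range of object scale and aspect-ratio variation the network has to handle
Controlling the number of anchors reduces the number of image-pyramid levels and lets the computation be adjusted as needed

Drawbacks of using anchors

The anchor hyper-parameters are hard to tune
Computing IoU for anchors during training costs a lot of time and memory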

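Summary of anchor-free algorithms

CornerNet / CornerNet-Lite: top-left corner + bottom-right corner

ExtremeNet: 4 extreme points (top, bottom, left, right) + center point

CenterNet (Keypoint Triplets): top-left corner + bottom-right corner + center point

YOLO v1 can be regarded as anchor-free; anchors were introduced starting with YOLO v2

Representative anchor-based detectors: Faster R-CNN, SSD, YOLO v2/v3

Representative anchor-free detectors: CornerNet, ExtremeNet, CenterNet

Anchor-free methods are mainly divided into dense-prediction-based and keypoint-estimation-based approaches.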

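DenseBox

DenseBox makes two main contributions: it shows that a single FCN (fully convolutional network) can detect occluded objects and objects at different scales; and it adds a few layers to the FCN for landmark localization, where fusing the landmark heatmap with the score map further improves detection performance.

Keypoint-based solutions

CornerNet

CornerNet proposes an interesting idea: describe a box by keypoints, i.e. use the top-left and bottom-right corners to determine a box, which turns detection into detecting and matching top-left and bottom-right points.

CenterNet (there are two different papers with this name, so they need to be told apart)

The motivation of CenterNet (Keypoint Triplets) is simple: CornerNet only provides edge information, whereas the most discriminative information of an object usually lies in its interior. Focusing too much on edges easily introduces many false detections. CenterNet adds detection of the center point to help filter candidate boxes.

ExtremeNet

Uses another point description (top-most, bottom-most, left-most, right-most, center); the additional center point helps verify whether a box is real.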

References

目标检测中的Anchor Free方法(一) (Anchor-Free Methods in Object Detection, Part 1)

cornernet

By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors.

Experiments show that CornerNet achieves a 42.2% AP on MS COCO, outperforming all existing one-stage detectors.

CornerNet is mainly pioneering in that it applies the keypoint idea to object detection; its results are only average.

Drawbacks of anchor-based methods

First, we typically need a very large set of anchor boxes

Second, the use of anchor boxes introduces many hyperparameters and design choices

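The figure illustrates the CornerNet pipeline in an easy-to-follow way.

It also involves more knowledge at the framework level.

The hourglass backbone architecture is worth summarizing.

corner pooling

A new type of pooling layer that helps the network locate corners better.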

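If boundary features of the same object can be detected in a certain row and a certain column, the intersection of that row and column is a corner. This is an indirect but effective way of finding corners. The procedure is:

For the first group of feature maps, take for each row the maximum over the range already scanned when sliding from right to left; for the second group, take for each column the maximum over the range already scanned when sliding from bottom to top. This suits corner detection better: in object detection a corner often lies outside the object, so it cannot be detected from local features alone; instead, all features in the row and in the column of that point have to be scanned.

The layer takes two feature maps; at each pixel it max-pools all feature vectors to the right in the first feature map and all feature vectors directly below in the second, and then adds the two pooled results.

Maxima are taken along the horizontal and vertical directions separately and then summed to give the output. Note the scan directions: bottom to top and right to left. The figure gives a more detailed illustration.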
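The scan described above can be sketched in plain Python for the top-left corner case. This is a minimal illustration on nested lists, not the CornerNet implementation; the function name `top_left_corner_pool` and the list-based feature maps are assumptions made for this example.

```python
def top_left_corner_pool(f_rows, f_cols):
    """Top-left corner pooling on two H x W feature maps (nested lists).

    f_rows is scanned right-to-left along each row, f_cols bottom-to-top
    along each column; the two running maxima are summed per pixel.
    """
    h, w = len(f_rows), len(f_rows[0])

    # Horizontal pass: running maximum from the right edge to the left.
    horiz = [[0.0] * w for _ in range(h)]
    for y in range(h):
        running = float("-inf")
        for x in range(w - 1, -1, -1):
            running = max(running, f_rows[y][x])
            horiz[y][x] = running

    # Vertical pass: running maximum from the bottom edge to the top.
    vert = [[0.0] * w for _ in range(h)]
    for x in range(w):
        running = float("-inf")
        for y in range(h - 1, -1, -1):
            running = max(running, f_cols[y][x])
            vert[y][x] = running

    # Element-wise sum of the two pooled maps.
    return [[horiz[y][x] + vert[y][x] for x in range(w)] for y in range(h)]


features = [[0, 1, 0],
            [0, 0, 2],
            [3, 0, 0]]
pooled = top_left_corner_pool(features, features)
```

With this toy input the top-left pixel collects the strongest response of its whole row and its whole column, which is exactly why a corner lying outside the object can still fire.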

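To train and test the model, defining the network structure alone is far from enough. We also need to

map the ground-truth labels (object class and location) to supervision signals (in a format similar to the network output)
build the loss function from the forward output of the network and the supervision signals from the previous step
run gradient descent on the loss to update the network parameters

In short, CornerNet uses a single convolutional network to detect the top-left and bottom-right corners of objects:

the predicted **heatmaps** decide whether each location is a corner;

the predicted corner **embeddings** are used to pair corners (the distance between the embeddings of two corners of the same object is small, and large for different objects), which determines which top-left and bottom-right corners belong to the same object;

the predicted **offsets** are used to adjust the corner positions.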

embedding vector

The network also predicts an embedding vector for each detected corner such that the distance between the embeddings of two corners from the same object is small.

That is, at prediction time CornerNet assigns an embedding vector to every corner, and the vectors of corners belonging to the same object are close to each other.

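heatmaps

Mapping a corner of a bounding box to exactly one point in the heatmap is unreasonable, or at least too strict a mapping.

Instead, a corner of the object's bounding box is mapped to a small circular region in the heatmap. In the paper this neighbourhood is called the positive location.

Each group of heatmaps has shape $C \times H \times W$, where $H \times W$ is the feature-map size. Ideally it is a binary mask in which 1 means the location is a corner; in practice the model predicts a value in (0, 1) at each location, the confidence that the location is a corner.

The radius is set according to the object size so that the resulting predicted box still satisfies $IoU > t$. For each corner there is exactly one ground-truth location and all other locations are negatives; to balance positives and negatives, the authors reduce the penalty on negatives that lie within the radius of a positive location.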

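radius computation

The radius is computed from one condition: a box (pred bbox) formed by a pair of corners inside the circles must have an IoU of at least $t$ with the gt box (the authors set $t = 0.3$ in their experiments). Three cases are considered:

(1) the pred bbox encloses the gt box, with two of its sides tangent to the circles

(2) the gt box encloses the pred bbox, with two sides tangent to the circles

(3) the pred bbox and the gt box partially overlap, each with two sides tangent to the circles

For this part see "CornerNet: 将目标检测问题视作关键点检测与配对" (CornerNet: treating object detection as keypoint detection and pairing).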
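The three cases can be worked out by solving one quadratic per case. The sketch below is derived directly from the IoU condition for an h x w ground-truth box whose corners are each moved by r along both axes; it is an illustration of the reasoning, not a copy of any reference implementation, and the name `corner_radius` is an assumption.

```python
import math


def corner_radius(height, width, min_iou):
    """Largest corner shift r such that all three cases keep IoU >= min_iou."""
    s = height + width
    area = height * width
    t = min_iou

    # Case 1: pred box encloses the gt box (both corners pushed outward by r).
    # IoU = hw / ((h + 2r)(w + 2r)) >= t  ->  4t r^2 + 2t s r - (1 - t) hw <= 0
    r_outer = (-t * s + math.sqrt((t * s) ** 2 + 4 * t * (1 - t) * area)) / (4 * t)

    # Case 2: gt box encloses the pred box (both corners pulled inward by r).
    # IoU = (h - 2r)(w - 2r) / hw >= t  ->  4 r^2 - 2 s r + (1 - t) hw >= 0
    r_inner = (s - math.sqrt(s ** 2 - 4 * (1 - t) * area)) / 4

    # Case 3: pred box is the gt box shifted diagonally by r (partial overlap).
    # IoU = I / (2hw - I) with I = (h - r)(w - r)
    #   ->  r^2 - s r + hw (1 - t) / (1 + t) >= 0
    r_shift = (s - math.sqrt(s ** 2 - 4 * area * (1 - t) / (1 + t))) / 2

    return min(r_outer, r_inner, r_shift)


radius = corner_radius(100, 100, 0.3)
```

For a square 100 x 100 box the inward case is the tightest one, so the returned radius is the one at which the shrunken box reaches an IoU of exactly 0.3.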

offset

Applies a small adjustment to the corner positions.

# Maximum number of objects that may appear in one image
max_tag_len = 128

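Loss functions

(1) focal loss

# Takes the predicted heatmaps and the supervision heatmaps as input and computes the focal loss

(2) embedding loss

This loss has two parts, $L_{pull}$ and $L_{push}$, which respectively pull together the corners of one object and push apart the corners of different objects. Here $e_{t_k}$ and $e_{b_k}$ are the embeddings of the top-left and bottom-right corners of object $k$, and $e_k$ is their average.

$L_{pull} = \frac{1}{N} \sum_{k=1}^{N} \left[ (e_{t_k} - e_k)^2 + (e_{b_k} - e_k)^2 \right]$

$L_{push} = \frac{1}{N(N-1)} \sum_{k=1}^{N} \sum_{\substack{j=1 \\ j \neq k}}^{N} \max(0, \Delta - |e_k - e_j|)$

The paper uses this loss to reduce the distance between the top-left and bottom-right embeddings of the same object's bounding box and to increase the distance between the embeddings of different objects.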
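A small numeric sketch of the two terms, assuming 1-D embeddings and a margin of 1. The function name `pull_push_loss` and the plain-list inputs are illustrative choices, not part of the CornerNet code.

```python
def pull_push_loss(tl_emb, br_emb, delta=1.0):
    """Return (L_pull, L_push) for N objects with scalar corner embeddings.

    tl_emb[k] and br_emb[k] are the top-left and bottom-right embeddings
    of object k; e_k is their average.
    """
    n = len(tl_emb)
    centers = [(t + b) / 2 for t, b in zip(tl_emb, br_emb)]

    # Pull: both corners of an object move towards their shared mean.
    pull = sum((t - c) ** 2 + (b - c) ** 2
               for t, b, c in zip(tl_emb, br_emb, centers)) / n

    # Push: means of different objects are kept at least delta apart.
    push = 0.0
    if n > 1:
        for k in range(n):
            for j in range(n):
                if j != k:
                    push += max(0.0, delta - abs(centers[k] - centers[j]))
        push /= n * (n - 1)
    return pull, push


well_separated = pull_push_loss([0.0, 5.0], [0.0, 5.0])
too_close = pull_push_loss([0.0, 0.5], [0.0, 0.5])
```

In the first call both terms vanish because each object's corners agree and the two objects are far apart; in the second call only the push term is active.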

(3) offset loss

The offset is added to the original prediction to make it more precise. It is trained with a smooth L1 loss.

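ablation study

corner pooling: corner pooling is a key component of CornerNet

Judged from the ablation, it is indeed very effective.

The authors hypothesize two reasons why detecting corners works better than detecting centers.

The center of an anchor may be harder to localize because it depends on all 4 sides of the object, whereas locating a corner depends on only 2 sides, and corner pooling introduces a sensible prior for locating corners
Corners provide a denser discretization of the box space: only $O(wh)$ corners are needed to represent $O(w^{2} h^{2})$ possible anchor boxes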

References

深度解析 CornerNet 网络结构 (An in-depth look at the CornerNet architecture)

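cornerNet-Lite

CornerNet is a classic keypoint-based detector. It has good accuracy but slow inference, about 1.1 s per image. Simply shrinking the input image speeds up inference but greatly reduces accuracy, making it much worse than YOLOv3. The paper proposes two lightweight CornerNet variants:

(saccade: a rapid eye movement, a quick glance)

CornerNet-Saccade: speeds up inference by reducing the number of pixels to process. It first obtains rough object locations from a downsized image, then crops small regions around those locations for detection. It reaches 43.2% AP at 190 ms per image.
CornerNet-Squeeze: speeds up inference by reducing the amount of processing per pixel. It merges ideas from SqueezeNet and MobileNets into the hourglass to form a new backbone, reaching 34.4% AP at 30 ms per image.

CornerNet-Saccade

Estimating Object Locations: obtain the rough locations and sizes of possible objects:

The input image is downsized to two scales, with the long side at 255 pixels and at 192 pixels; the smaller image is zero-padded so that both can be fed to the network together.
For the downsized images, 3 attention maps are predicted, for small objects (long side < 32 pixels), medium objects (32 pixels <= long side <= 96 pixels) and large objects (long side > 96 pixels). This split helps decide how much to zoom in on each region; small objects need more zoom, as the next part explains.
The attention maps come from different modules in the upsampling part of the hourglass; feature maps from modules with larger resolution are used for smaller objects (the backbone is described later). Each module's feature map goes through a Conv-ReLU module followed by a Conv-Sigmoid module to produce the attention map.

CornerNet-Squeeze

In CornerNet most of the computation time is spent in the Hourglass-104 backbone. CornerNet-Squeeze therefore combines SqueezeNet and MobileNet ideas to reduce the complexity of Hourglass-104 and designs a new lightweight hourglass network.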

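centernet

Theory

CenterNet predicts the center point of every object and, taking the center as reference, regresses the width, the height, and the center offset caused by downsampling; all three are attributes attached to the center point. Detection is thus framed as keypoint detection, which removes the large number of anchor-generated candidates that would otherwise have to be suppressed, so NMS post-processing is not needed. The network also has a single detection head, unlike FPN-based designs that need several heads, which makes it considerably faster overall.

It achieved the best speed-accuracy trade-off on COCO at the time.

CenterNet: Objects as Points is a one-stage object detector.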

There are two types of methods to regress and classify bounding box around an object in Anchor Free Object Detection approaches. i) Keypoint based approach ii) Center-based approach.

Anchor-free detection likewise has two approaches: keypoint-based and center-based

Keypoint based approach: predicts keypoints, uses them to generate the bounding box (as an attribute), and then classifies it

The former predicts predefined key points from the network which are then used to generate the bounding box around an object and to classify it. CornerNet², CenterNet: Keypoint Triplets³ and Grid-RCNN⁴ are some networks using keypoint based approaches.

center-based: best understood by studying one concrete network

The latter uses center-point or any part-point of an object to define positive and negative samples(instead of IoU..!) and from these positives, it predicts the distance to four coordinates for the generation of a bounding box. Some of the networks using this approach are FCOS⁵, DenseBox⁶, FSAF⁷, etc. They have their peculiar methods to generate positive samples and use that for regression of boxes, objectness, and class probabilities.

(peculiar: here meaning distinctive, specific to each network)

feature extractor/ backbone used

Four different feature extractors were used for the experiments: ResNet18, ResNet101⁸, Deep Layer Aggregation Networks (DLA)⁹, and Stacked Hourglass Networks¹⁰. ResNets and DLA were modified by adding Deconvolutional and Deformable Convolutional Layers.

Deconvolutional and deformable convolutional layers deserve careful study, as do the networks below.

Hourglass Module

Modified DLA -34

CenterNet: Object as Points follows the former viz. keypoint based approach for object detection. It considers the center of a box as an object as well as a key point and then uses this predicted center to find the coordinates/offsets of the bounding box.

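CenterNet converts the input image into a heatmap whose peaks correspond to object centers; the feature vector at each peak is used to predict the object's height and width, as shown in Figure 2. Inference needs only a simple forward pass, with no post-processing such as NMS.

Each peak in the figure marks the center of one object. Strictly speaking this is not a predicted heatmap but the training label: every peak has value exactly 1 and background points are almost 0, rather than each peak lying somewhere between 0 and 1. At inference, CenterNet finds the peaks of all the small hills and keeps those whose value exceeds a threshold as detections.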

In this paper, a center prediction is considered as a standard keypoint estimation problem. After passing an image through Fully Convolutional Network, the final feature map outputs heatmaps for different key points. Peaks of these output feature maps are considered as predicted centers.

This is the whole CenterNet pipeline: a CNN produces the heatmap, keypoints are extracted from it, and multiple heads handle the different sub-problems.

Heatmap Head

This head is used for the estimation of the key points given an input image. In the case of object detection, keypoints are the box center.

(When analysing code and architecture, an important angle is to track how tensor sizes change.)

To form ground truth heatmaps for loss propagation, these centers are splat using Gaussian kernels after converting them to their low-resolution equivalent (in our case division by stride R, denoted as p~). For example, if we have three classes (C=3) and the input image dimensions are 400X400, then with a given stride (R=4) we have to generate 3 heatmaps (each heatmap corresponding to a given class) of dimensions 100X100 (400 / 4).
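The example in the paragraph above (C=3, 400X400 input, R=4) can be reproduced with a short sketch. It uses a fixed Gaussian standard deviation for simplicity, which is an assumption of this example: CenterNet adapts the standard deviation to the object size. The function name `make_center_heatmaps` and the `(class, x, y)` tuple format are also illustrative.

```python
import math


def make_center_heatmaps(centers, num_classes, input_size, stride, sigma=2.0):
    """Build per-class ground-truth heatmaps of size (input_size // stride)^2.

    centers: list of (class_index, x, y) in input-image pixels.
    Overlapping Gaussians of the same class are merged with an element-wise max.
    """
    out = input_size // stride
    heatmaps = [[[0.0] * out for _ in range(out)] for _ in range(num_classes)]
    for cls, cx, cy in centers:
        # Low-resolution center: floor(p / R).
        px, py = int(cx / stride), int(cy / stride)
        for y in range(out):
            for x in range(out):
                g = math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
                heatmaps[cls][y][x] = max(heatmaps[cls][y][x], g)
    return heatmaps


hm = make_center_heatmaps([(0, 200, 120)], num_classes=3, input_size=400, stride=4)
```

The result has shape 3 x 100 x 100, with a peak of exactly 1 at the low-resolution center (x=50, y=30) of class 0 and all-zero maps for the two classes that have no object.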

Note that the number of classes here is 3, so C = 3.

Dimension Head

This head is used for the prediction of the dimensions of the boxes, viz. width and height. This is achieved by minimizing a standard L1 distance. To reduce the computational burden, they use single sized heatmaps for all object categories.

The loss here is an L1 loss, i.e. a regression task. The phrase "single heatmap for all object categories" is not very clear; in the paper it refers to one 2-channel size map shared by all categories rather than one size map per class.

Offset Head

This head is used to recover from the discretization error caused due to the downsampling of the input.

After the prediction of the center points, we have to map these coordinates to a higher dimensional input image.
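A worked example of the discretization error the offset head compensates for. The helper names `center_offset` and `recover_center` are made up for this sketch; the stride of 4 matches the output stride used elsewhere in these notes.

```python
def center_offset(cx, cy, stride):
    """Split a center into its low-resolution cell and the fractional remainder."""
    px, py = int(cx / stride), int(cy / stride)
    return (px, py), (cx / stride - px, cy / stride - py)


def recover_center(cell, offset, stride):
    """Map a low-resolution cell plus predicted offset back to input pixels."""
    return ((cell[0] + offset[0]) * stride, (cell[1] + offset[1]) * stride)


cell, offset = center_offset(203, 118, stride=4)
restored = recover_center(cell, offset, stride=4)
```

Without the offset, the cell (50, 29) would map back to (200, 116), an error of up to stride - 1 pixels per axis; with it, the original center (203, 118) is recovered.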

**NMS vs. Soft-NMS**

When the detected objects overlap heavily, Soft-NMS gives better results.

Method

NMS: $s_{i} = \begin{cases} s_{i}, & iou(M, b_{i}) < N_{t} \\ 0, & iou(M, b_{i}) \geq N_{t} \end{cases}$

NMS sets a hard threshold: when the IoU of two boxes is large, the lower-scoring box is removed from the neighbourhood outright.

Soft-NMS: $s_{i} = \begin{cases} s_{i}, & iou(M, b_{i}) < N_{t} \\ s_{i} (1 - iou(M, b_{i})), & iou(M, b_{i}) \geq N_{t} \end{cases}$

In the formulas $s$ is the confidence score. Soft-NMS no longer deletes every box whose IoU with the highest-scoring bbox exceeds the threshold; it lowers their confidence instead.
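The two rescoring rules can be compared side by side. This is a minimal sketch of the linear Soft-NMS variant given by the formula above; the function names `iou` and `rescore`, and the `(x1, y1, x2, y2)` box format, are assumptions of this example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def rescore(boxes, scores, threshold=0.5, soft=False):
    """Return the scores after hard NMS (soft=False) or linear Soft-NMS."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    while remaining:
        # M: the highest-scoring box that has not been processed yet.
        m = max(remaining, key=lambda i: scores[i])
        remaining.remove(m)
        survivors = []
        for i in remaining:
            overlap = iou(boxes[m], boxes[i])
            if overlap >= threshold:
                if soft:
                    scores[i] *= 1 - overlap   # decay instead of delete
                else:
                    scores[i] = 0.0            # hard NMS drops the box
                    continue
            survivors.append(i)
        remaining = survivors
    return scores


boxes = [(0, 0, 10, 10), (0, 0, 10, 8), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
hard = rescore(boxes, scores, soft=False)
soft = rescore(boxes, scores, soft=True)
```

The second box overlaps the first with IoU 0.8: hard NMS zeroes its score, while Soft-NMS only scales it by 1 - 0.8, so a later score threshold decides whether it survives.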

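Summary of Soft-NMS

Soft-NMS improves the separation of highly overlapping objects but weakens the separation of lightly overlapping ones
In essence it is a kind of overfit to the overlap scenario
It only pays off in scenes with highly overlapping objects; ordinary scenes rarely have them, so it may even hurt

Backbones involved

Hourglass: originally used for keypoint detection; the authors use it in stacked form, similar to a progressive or cascaded design.
ResNet: modified by adding a DCN (deformable convolution) before each upsampling layer
DLA: originally a classification network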

Additionally, the network predicts the width and height of the box for these centers and each center will have its unique box width and height. This tightly coupled property helps them to remove the Non-Maximal Suppression step in post-processing.

Pose estimation is considered a simple keypoint estimation problem.

Here instead of 80 as a value of C, k = 34 is used(In Case of COCO Dataset: 17 key points). These offsets are predicted for each keypoint directly regressing from the centers.

Inference

object detection

At inference time, peaks of the heatmaps are calculated by seeing the maximum value near the 8-pixel neighborhood in a heatmap and keeping the first 100 peaks of all the different classes independently. This operation is achieved by 3X3 MaxPool Operation on the obtained feature map.

A 3*3 max pool extracts the local maximum values from the heatmap.

The obtained peak coordinates are used to get the dimensions and offset predictions. You can get to know this part better by going through this piece of code.
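How a peak, its offset and its size prediction combine into a box can be sketched as follows. It assumes that offset and size are expressed in output-resolution units and are scaled back by the stride, which is a simplification of the affine transform used in the actual post-processing; `decode_box` is a hypothetical helper name.

```python
def decode_box(peak_x, peak_y, offset, size, stride=4):
    """Turn one heatmap peak into (x1, y1, x2, y2) in input-image pixels.

    peak_x, peak_y: integer peak location on the output feature map.
    offset: (dx, dy) predicted sub-cell offset at the peak.
    size: (w, h) predicted box size at the peak, in output-map units.
    """
    cx = (peak_x + offset[0]) * stride
    cy = (peak_y + offset[1]) * stride
    w, h = size[0] * stride, size[1] * stride
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


box = decode_box(50, 29, offset=(0.75, 0.5), size=(10, 5))
```

Here the peak at cell (50, 29) with offset (0.75, 0.5) is centered at pixel (203, 118), and the 10 x 5 size prediction becomes a 40 x 20 pixel box.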

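Pose estimation is not covered further here.

Characteristics of CenterNet:

No anchor hyper-parameters need to be tuned
Each object has exactly one positive point, and the pipeline does not use NMS.
Compared with traditional detectors (output stride 16), CenterNet outputs a higher-resolution feature map (output stride 4), so multiple feature maps are unnecessary; even at this higher resolution it remains fast.

The same idea extends to other tasks such as 3D object detection and pose estimation: each predicted point regresses additional object attributes. 3D detection, for example, regresses more parameters such as object depth, 3D box dimensions and orientation. For follow-up work along this line see CenterTrack.

Objects as Points：预测目标中心，无需NMS等后处理操作 | CVPR 2019 (Objects as Points: predicting object centers without NMS or other post-processing)

The backbones used in CenterNet deserve a careful look.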

References:

https://medium.com/visionwizard/centernet-objects-as-points-a-comprehensive-guide-2ed9993c48bc

Code walkthrough

(1) the readme folder

https://github.com/xingyizhou/CenterNet/blob/master/readme/DATA.md

Experiments were run on several datasets: Pascal VOC, KITTI and COCO. See this readme for dataset preparation.

https://github.com/xingyizhou/CenterNet/blob/master/readme/DEVELOP.md

To use another architecture, see the "New architecture" section.

https://github.com/xingyizhou/CenterNet/blob/master/readme/GETTING_STARTED.md

Mainly the scripts for training and evaluation, for different tasks and datasets. Once the environment is installed these are easy to run.

https://github.com/xingyizhou/CenterNet/blob/master/readme/INSTALL.md

The environment is very old (torch 0.4), so it is hard to match with later CUDA versions. The difficulty lies in compiling PyTorch and DCNv2.

https://github.com/xingyizhou/CenterNet/blob/master/readme/MODEL_ZOO.md

Pretrained models for the different tasks, e.g. Pascal VOC, human pose estimation, COCO

(2) models, data, exp and images contain no code and can be skipped.

(3) experiments contains scripts for the various tasks

Training and test scripts for different datasets, backbones and settings

（4）src

https://github.com/xingyizhou/CenterNet/blob/master/src/demo.py

This script is just a demo.

https://github.com/xingyizhou/CenterNet/blob/master/src/test.py

This is the test script. Worth learning from: data handling goes through dataset_factory and model (detector) handling goes through another factory, which keeps the structure logical and clear.

https://github.com/xingyizhou/CenterNet/blob/master/src/main.py

main is the entry script, i.e. the training script.

(4.1) lib

This sub-directory is the most important one.

datasets

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/datasets/dataset_factory.py

This is dataset_factory, the factory for datasets.

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/datasets/dataset_factory.py#L23


dataset_factory = {
  'coco': COCO,
  'pascal': PascalVOC,
  'kitti': KITTI,
  'coco_hp': COCOHP
}

_sample_factory = {
  'exdet': EXDetDataset,
  'ctdet': CTDetDataset,
  'ddd': DddDataset,
  'multi_pose': MultiPoseDataset
}

dataset_factory maps to the different datasets and _sample_factory maps to the different tasks
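One way two such factories can be combined is to build a class that inherits from both the dataset class and the task-specific sample class. The sketch below illustrates that pattern with stub classes; the stubs, the `describe` method and the exact shape of `get_dataset` are assumptions for illustration and do not reproduce the repository's classes.

```python
class COCO:                      # stub standing in for the real dataset class
    source = "coco"


class PascalVOC:                 # stub
    source = "pascal"


class _SampleBase:               # hypothetical helper shared by the task stubs
    def describe(self):
        return f"{self.task} samples from {self.source}"


class CTDetDataset(_SampleBase):      # stub for the detection task
    task = "ctdet"


class MultiPoseDataset(_SampleBase):  # stub for the pose task
    task = "multi_pose"


dataset_factory = {"coco": COCO, "pascal": PascalVOC}
_sample_factory = {"ctdet": CTDetDataset, "multi_pose": MultiPoseDataset}


def get_dataset(dataset, task):
    """Combine 'where the data comes from' with 'how a sample is built'."""
    class Dataset(dataset_factory[dataset], _sample_factory[task]):
        pass
    return Dataset


DatasetClass = get_dataset("coco", "ctdet")
summary = DatasetClass().describe()
```

The design keeps dataset-specific code (paths, class names, evaluation) separate from task-specific code (how targets are encoded), so any dataset can be paired with any task by name.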

https://github.com/xingyizhou/CenterNet/tree/master/src/lib/datasets/dataset

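dataset holds the handling of the COCO, KITTI and Pascal datasets.

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/datasets/dataset/coco.py

mean and std are defined here but do not appear to be used in this file; they differ between datasets. Why are the eigenvalues and eigenvectors predefined here?

This file is not only about data handling; it also contains helper functions for the evaluation results.

https://github.com/xingyizhou/CenterNet/tree/master/src/lib/datasets/sample

There are also 4 Python files under this directory, but I have not yet figured out how they are used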

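detectors

The code structure is basically the same as for datasets: a base_detector first, then detectors implemented for the different tasks, e.g. the ctdet and ddd files

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/detectors/base_detector.py

The base class. Its run function is not the run function of multi-threading or multi-processing. The base class defines the functions that subclasses need to implement: process, post_process, merge_outputs, debug, show_results, etc.

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/detectors/ddd.py

Taking the ddd file as an example, it contains pre_process, process and post_process, which is quite clean.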

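external

https://github.com/xingyizhou/CenterNet/tree/master/src/lib/external

An efficient, C-based implementation of the NMS algorithm

Python extension file formats
A common situation in development: quickly prototype in Python, then rewrite the parts with special requirements in a suitable language. For performance-critical parts, rewrite them in C/C++ and wrap them as an extension library that Python can call.
.py: Python source file
.pxd: header (declaration) file of a Python extension module written in Cython
.pyd: a compiled Python extension module, possibly written in and compiled from another language
.pyx: source file of a Python extension module written in Cython. Like a .c source file in C, it must first be compiled to a .pyd (Windows) or .so (Linux) file before it can be imported as a module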

models

https://github.com/xingyizhou/CenterNet/tree/master/src/lib/models

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/models/utils.py

Simple utils, nothing special

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/models/data_parallel.py

Implements data parallel

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/models/decode.py

Different tasks (models) have different decode functions. Some use NMS and some do not.

The network structure should be printed (not possible until the code actually runs)

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/models/losses.py

This code layout is worth copying: first write the loss as a function, then create a class that wraps it; the class only needs to implement __init__ and forward.

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/models/model.py

This is more of a dispatcher script for models than an actual architecture

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/models/model.py#L10

To import from the current package, use the . prefix (relative import)

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/models/scatter_gather.py

Mainly distributes variables to the given GPUs. Is this part of the work needed for multi-GPU training in torch?

https://github.com/xingyizhou/CenterNet/tree/master/src/lib/models/networks

These are the actual network architectures

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/models/networks/resnet_dcn.py#L130

This version of ResNet is called PoseResNet; is it mainly meant for the pose task?

DCN is applied mainly in the deconv stage, so the conv stage still uses ordinary convolutions

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/models/networks/pose_dla_dcn.py#L224

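The Tree class and the DLA class are both used to build the DLA network. Whether to study this architecture in depth depends on whether state-of-the-art models use it.

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/models/networks/msra_resnet.py

This network does not use DCN for deconv, so when DCN fails to compile, this backbone can in principle be used on its own

Dilated (atrous) convolution and deformable convolution (deformable convolutional networks)

For deformable convolution, again check whether state-of-the-art models include it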

https://github.com/xingyizhou/CenterNet/tree/master/src/lib/models/networks/DCNv2

trains

https://github.com/xingyizhou/CenterNet/tree/master/src/lib/trains

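Likewise, these correspond to the different tasks

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/trains/base_trainer.py

This is the base_trainer.

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/trains/ddd.py#L24

Taking the ddd task as an example, there are several losses, so the total is a weighted sum. This needs to be studied by debugging one concrete task and working out what each loss means.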

utils

https://github.com/xingyizhou/CenterNet/tree/master/src/lib/utils

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/utils/ddd_utils.py

Utilities for training the ddd task

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/utils/debugger.py

Nothing special found in the debugger class

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/utils/image.py

Implements image operations, e.g. affine_transform, crop, gaussian radius, gaussian2D.

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/utils/oracle_utils.py

I did not really understand what oracle is for

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/utils/post_process.py

Post-process functions for the ddd, ctdet and multi_pose tasks

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/utils/utils.py

Contains only an AverageMeter class

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/logger.py

A logger class adapted (borrowed) from elsewhere

https://github.com/xingyizhou/CenterNet/blob/master/src/lib/opts.py

Wraps all the command-line arguments

(4.2) tools

(Functionally these look more like evaluation tools)

https://github.com/xingyizhou/CenterNet/tree/master/src/tools/kitti_eval

C++ code for evaluating on KITTI

https://github.com/xingyizhou/CenterNet/tree/master/src/tools/voc_eval_lib

The package for evaluating on the VOC dataset; to be looked at when needed

https://github.com/xingyizhou/CenterNet/blob/master/src/tools/_init_paths.py

Computation of the COCO overlap metric

https://github.com/xingyizhou/CenterNet/blob/master/src/tools/convert_hourglass_weight.py

Hourglass-related. For the hourglass backbone the same rule applies: study it first if state-of-the-art models use it, otherwise postpone it

https://github.com/xingyizhou/CenterNet/blob/master/src/tools/convert_kitti_to_coco.py

Conversion script

https://github.com/xingyizhou/CenterNet/blob/master/src/tools/eval_coco.py

A wrapper around pycocotools

https://github.com/xingyizhou/CenterNet/blob/master/src/tools/get_kitti.sh

Download script for the KITTI dataset. With a good connection you can also download data_object_image_2 from KITTI yourself

https://github.com/xingyizhou/CenterNet/blob/master/src/tools/get_pascal_voc.sh

Download script for the VOC2007 dataset

https://github.com/xingyizhou/CenterNet/blob/master/src/tools/reval.py

Not understood yet

https://github.com/xingyizhou/CenterNet/blob/master/src/tools/vis_pred.py

Used to visualize prediction results

Further code details


  def _get_border(self, border, size):
    # border defaults to 128; size is the image width or height
    i = 1
    while size - border // i <= border // i:
      # While the side is at most 2 * (border // i), halve the border:
      # normally returns 128; a side of 256 or less returns 64 or smaller
      i *= 2
    return border // i


  num_objs = min(len(anns), self.max_objs)      # number of objects, capped at max_objs (100 here)



    hm = np.zeros((num_classes, output_h, output_w), dtype=np.float32) # heatmap, e.g. (80, 128, 128)
    # Note the heatmap shape: every class has its own (output_h, output_w) map
    wh = np.zeros((self.max_objs, 2), dtype=np.float32) # width/height per center, (100, 2)
    dense_wh = np.zeros((2, output_h, output_w), dtype=np.float32) # dense variant, (2, 128, 128)
    reg = np.zeros((self.max_objs, 2), dtype=np.float32) # sub-pixel offset lost by downsampling, (100, 2)
    ind = np.zeros((self.max_objs), dtype=np.int64) # flattened index of each center, length 100
    reg_mask = np.zeros((self.max_objs), dtype=np.uint8) # regression mask, length 100
    # Covers the first max_objs slots: 1 where an object exists in the image, 0 otherwise
    cat_spec_wh = np.zeros((self.max_objs, num_classes * 2), dtype=np.float32) # (100, 80 * 2)
    cat_spec_mask = np.zeros((self.max_objs, num_classes * 2), dtype=np.uint8) # (100, 80 * 2)
    
    

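inp: input, the network's input image, after data augmentation

c: center, the center coordinates of the image

s: scale, the random scaling factor

The author uses many abbreviations; spelled-out names would be easier to follow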


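import torch.nn as nn

def _nms(heat, kernel=3):
    # Padding keeps the pooled map the same size as the input heatmap
    pad = (kernel - 1) // 2

    # Max pooling with stride 1 gives each location the maximum of its neighbourhood
    hmax = nn.functional.max_pool2d(
        heat, (kernel, kernel), stride=1, padding=pad)
    # A location is kept only if it equals that local maximum
    keep = (hmax == heat).float()
    return heat * keep

hmax holds the maximum over each 8-neighbourhood, keep marks the positions of the local maxima, and heat * keep keeps the original value at the local maxima and sets everything else to 0.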
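The same peak selection can be traced without a deep-learning framework. The sketch below works on a nested-list heatmap and returns the top-K peaks; skipping zero-valued cells is an extra choice made here so that a flat background does not count as a peak, and `heatmap_peaks` is a hypothetical name.

```python
def heatmap_peaks(heat, top_k=100):
    """Return up to top_k (score, x, y) local maxima of a 2-D heatmap."""
    h, w = len(heat), len(heat[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            value = heat[y][x]
            # 3x3 window clipped at the border (the cell itself is included).
            window = [heat[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            if value > 0 and value == max(window):
                peaks.append((value, x, y))
    peaks.sort(reverse=True)   # highest score first
    return peaks[:top_k]


heat = [[0.1, 0.2, 0.1],
        [0.2, 0.9, 0.3],
        [0.1, 0.3, 0.8]]
found = heatmap_peaks(heat)
```

The 0.8 response in the corner is suppressed because the 0.9 peak lies inside its 3x3 window, which is the sense in which the max pool acts as a replacement for NMS.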

centertrack

paper: Tracking Objects as Points

Github: https://github.com/xingyizhou/CenterTrack

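Most MOT systems today are tracking-by-detection, so the overall speed is roughly detector time + tracker time. Producing the detection and the association cues from the same network does speed up the whole MOT pipeline. This section covers an MOT framework that truly unifies object detection and data association: CenterTrack.

CenterNet has three output branches

HeatMap, size (W/4, H/4, 80): locations of object centers for each class (80 classes)
Offset, size (W/4, H/4, 2): refines the HeatMap output to improve localization accuracy
Height & Width, size (W/4, H/4, 2): width and height of the box centered at the keypoint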

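Compared with CenterNet, CenterTrack has 4 extra input channels: besides the current RGB frame it takes the previous RGB frame (3 channels) and one heatmap (1 channel) that renders the object centers of the previous frame. Tracking is essentially the association of objects over time; knowing only one frame, without information from earlier frames, is not enough to track.

How is the information from these three inputs fused?

The authors use a very simple method: each input goes through a simple convolution layer, batch normalization and an activation function, and the results are added element-wise.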

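The four output maps of CenterTrack

HeatMap, size (W/4, H/4, 80): heatmap of the box center locations
Confidence, size (W/4, H/4, 1): confidence that the point is a foreground center
Height & Width, size (W/4, H/4, 2): **width and height** of the box at the point
**Displacement prediction**, size (W/4, H/4, 2): **displacement** of the box center between the previous and the current frame (somewhat like optical flow)

The displacement prediction is the position difference between the two frames.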


_network_factory = {
    'resdcn': PoseResDCN,
    'dla': DLASeg,
    'res': PoseResNet,
    'dlav0': DLASegv0,
    'generic': GenericNetwork
}

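I will not explain every network here. They share one property: each goes through a series of downsampling steps followed by some upsampling, so an input of size (W, H) yields an output feature map of size (W/4, H/4).

For data association at test time, the authors match directly by the distance between center points. This is a greedy method, not a globally optimal association such as the Hungarian algorithm. During training the authors do not restrict themselves to adjacent frames; information from up to 3 frames may be used.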
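The greedy association can be sketched as follows: each detection is moved back by its predicted displacement and matched to the nearest still-unmatched center of the previous frame, in order of confidence. The data layout, the radius parameter and the name `greedy_associate` are assumptions of this sketch rather than the CenterTrack code.

```python
import math


def greedy_associate(detections, prev_centers, radius):
    """Greedy center-distance matching.

    detections: list of (score, (x, y), (dx, dy)), where (dx, dy) is the
        predicted displacement from the previous to the current frame.
    prev_centers: dict mapping track id -> (x, y) in the previous frame.
    Returns a dict mapping detection index -> track id; detections that are
    missing from the result would start new tracks.
    """
    matches = {}
    unmatched = set(prev_centers)
    # Most confident detections choose first.
    order = sorted(range(len(detections)), key=lambda i: -detections[i][0])
    for i in order:
        score, (x, y), (dx, dy) = detections[i]
        back_x, back_y = x - dx, y - dy      # where the object should have been
        best_id, best_dist = None, radius
        for track_id in sorted(unmatched):
            tx, ty = prev_centers[track_id]
            dist = math.hypot(back_x - tx, back_y - ty)
            if dist < best_dist or (dist == best_dist and best_id is None):
                best_id, best_dist = track_id, dist
        if best_id is not None:
            matches[i] = best_id
            unmatched.remove(best_id)
    return matches


previous = {1: (10, 10), 2: (50, 50)}
current = [(0.9, (12, 11), (2, 1)),    # moved from (10, 10)
           (0.8, (52, 50), (1, 0)),    # back-projects to (51, 50)
           (0.7, (90, 90), (0, 0))]    # nothing nearby: new track
assignment = greedy_associate(current, previous, radius=5)
```

Because every detection takes the best remaining candidate immediately, one early wrong match cannot be undone later, which is the trade-off against a global assignment such as the Hungarian algorithm.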

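Results

CenterTrack achieved SOTA results on 2D/3D multi-pedestrian and multi-vehicle tracking on datasets such as MOT, KITTI and nuScenes.

CenterTrack only associates detection boxes between two consecutive frames

Summary:

In terms of loss, backbone and necks, CenterTrack is basically the same as the "3d-od" project looked at earlier. The main difference is the dataloader: CenterTrack supports tracking, so it has 3 inputs, while ordinary object detection has only 1.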

anchor-free papers

(1) anchor-free methods: CornerNet and CenterNet

YOLO divides the feature map into 7*7 squares. Object detection is carried out in each square. Because the scale of objects varies too much, the network is difficult to learn.

CornerNet transforms the position detection of an object into the detection of the key points of the top-left corner and bottom-right corner of the boundary box.

Because CornerNet is based on hourglass network, the hourglass network is redundantly calculated, which makes it difficult to train in the absence of computing resources.

CenterNet uses the midpoint of the object as the detection center. It can detect the object by adding the boundary and size information of the object.

ExtremeNet uses standard key points to estimate four poles (top, left, bottom, right) and a center point of the network detection object.

Common improvements for anchor-based methods (network level)

Optimizing the shape and size of the convolution kernel, for example atrous convolution, depthwise separable convolution, and deformable convolution

This method simplifies the learning task of convolutional networks, but makes the detection method less flexible.

The main drawbacks of anchor-based methods: IoU computation and tuning of the anchor hyper-parameters.

Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection

This paper explains many concepts and is worth summarizing.

(2) CenterNet: Keypoint Triplets for Object detection

Our approach, named CenterNet, detects each object as a triplet, rather than a pair, of keypoints, which improves both precision and recall.

In this paper, we present a low-cost yet effective solution named CenterNet, which explores the central part of a proposal, i.e., the region that is close to the geometric center, with one extra keypoint. Our intuition is that, if a predicted bounding box has a high IoU with the ground-truth box, then the probability that the center keypoint in its central region is predicted as the same class is high, and vice versa.

Two-stage approaches divide the object detection task into two stages: extract RoIs, then classify and regress the RoIs.

(Faster R-CNN can be trained end to end by introducing the RPN. The RPN generates RoIs by regressing the anchor boxes.)

One-stage approaches remove the RoI extraction process and directly classify and regress the candidate anchor boxes.

(3) Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection

Anchor-free detectors have become popular due to the proposal of FPN and Focal Loss. In this paper, we first point out that the essential difference between anchor-based and anchor-free detection is actually how to define positive and negative training samples.

anchor-free detector

Keypoint-based method: CornerNet detects an object bounding box as a pair of keypoints (top-left corner and bottom-right corner)

CenterNet extends CornerNet to a triplet rather than a pair of keypoints to improve both precision and recall

Center-based method

YOLO divides the image into an S*S grid, and the grid cell that contains the center of an object is responsible for detecting this object. FCOS regards all the locations inside the object bounding box as positives, with four distances and a novel centerness score to detect objects.

Objects as Points

We model an object as a single point - the center point of its bounding box. Our detector uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, and even pose.

In this paper, we provide a much simpler and more efficient alternative. We represent objects by a single point at their bounding box center. Other properties, such as object size, dimension, 3D extent, orientation, and pose are then regressed directly from image features at the center location. We simply feed the input image to a fully convolutional network that generates a heatmap. Peaks in this heatmap correspond to object centers. Image features at each peak predict the object's bounding box height and width.

github: https://github.com/Duankaiwen/CenterNet

Anchor-free object detection with mask attention

Non-maximum suppression is used to filter out most of the overlapping bounding boxes.

(4) retinanet

Focal Loss for Dense Object Detection

First, this is a one-stage detector. The authors argue that the problem with one-stage detectors is the large gap between positive and negative samples, i.e. severe class imbalance. Their advantage is speed.

In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples.

The detector is called RetinaNet.

This is the overall pipeline; the key components are FPN and the multi-task heads (classification subnet & regression subnet).

Let’s take a sample image and feed it to the network. First stop, FPN. Here, the image will be processed at different scales (4 levels), and at each level, it will output a feature map. The feature map from each level will be fed to the next bundle of components, i.e. Classification Subnet and Regression Subnet. Each feature map that the FPN outputs are then processed by the classification subnet and it outputs a tensor with shape (W, H, K×A). Similarly, the regression subnet will process the feature map and will output a (W, H, 4×A). Both these outputs are processed simultaneously and are sent to the loss function. The multi-task loss function in RetinaNet is made up of the modified focal loss for classification and a smooth L1 loss calculated upon 4×A channelled vector yielded by the Regression Subnet. Then the loss is backpropagated. So, this was the overall flow of the model. Next, let’s see how the model performed when compared to other Object Detection models.

Some notes on CenterNet follow

In this paper, a center prediction is considered as a standard keypoint estimation problem. After passing an image through Fully Convolutional Network, the final feature map outputs heatmaps for different key points. Peaks of these output feature maps are considered as predicted centers.
Additionally, the network predicts the width and height of the box for these centers and each center will have its unique box width and height. This tightly coupled property helps them to remove the Non-Maximal Suppression step in post-processing.
For classification, these heatmap peaks are also linked to a particular class to which it belongs to. So using these centers, dimensions, and class probabilities, object detection task is achieved.

This is the overall architecture

(1) FPN (feature pyramid network backbone): FPN augments a standard convolutional network with a top-down pathway and lateral connections so the network efficiently constructs a rich, multi-scale feature pyramid from a single resolution input image.

(2) classification subnet (only really understandable with the code alongside): conv layers forming a small FCN (fully convolutional network) + sigmoid, attached to each pyramid level

(3) box regression subnet: In parallel with the object classification subnet, we attach another small FCN to each pyramid level for the purpose of regressing the offset from each anchor box to a nearby ground-truth object.

The paper defines focal loss as $FL(p_{t}) = -\alpha_{t} (1 - p_{t})^{\gamma} \log(p_{t})$. $\alpha_{t}$ is inversely related to the number of samples of class $t$: the rarer the class, the larger its loss weight. $\gamma$ can be said to be a relaxation parameter in layman's terms (layman: a non-expert).

The larger the value of $\gamma$, the more importance is given to misclassified examples and the less loss is propagated from easy examples. $\gamma = 2$ works best. When $\gamma = 0$ the loss reduces to the standard ($\alpha$-weighted) cross entropy.

Focal loss was designed to be a remedy to class imbalance observed during dense detector training with cross-entropy loss. By class imbalance, I mean (or the authors meant) the difference in the foreground and background classes, usually on the scale of 1:1000.

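(So the class imbalance ratio is about 1:1000, somewhat different from the scale I had assumed.)

Reference paper: Focal Loss for Dense Object Detection

Proposes focal loss as a replacement for CE, which alleviates the positive/negative sample imbalance.

Builds RetinaNet on ResNet + FPN: a one-stage architecture made of a backbone and two task-specific subnets. The backbone extracts features, the first subnet does classification and the second does bbox regression.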


#Formula for Cross-Entropy Loss Function
1. -log(x) # For positives
2. -log(1-x) # For negatives
#Formula for Focal Loss (Alpha Form)
alpha = 0.25
gamma = 2
1. -alpha * (1 - x)^gamma * log(x) #For positives
2. -(1-alpha) * x^gamma * log(1-x) #For negatives
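The formulas above can be turned into a runnable comparison. This is a direct transcription of the alpha form for a single predicted probability x, using only the standard library; the function names are chosen for this example.

```python
import math


def cross_entropy(x, is_positive):
    """Binary cross entropy for one prediction x in (0, 1)."""
    return -math.log(x) if is_positive else -math.log(1 - x)


def focal_loss(x, is_positive, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for one prediction x in (0, 1)."""
    if is_positive:
        return -alpha * (1 - x) ** gamma * math.log(x)
    return -(1 - alpha) * x ** gamma * math.log(1 - x)


# An easy negative: the network already predicts a low foreground probability.
easy_ce = cross_entropy(0.1, is_positive=False)
easy_fl = focal_loss(0.1, is_positive=False)

# A hard negative: background predicted as foreground with high confidence.
hard_ce = cross_entropy(0.9, is_positive=False)
hard_fl = focal_loss(0.9, is_positive=False)
```

For the easy negative the focal loss is more than a hundred times smaller than the cross entropy, while for the hard negative it stays within a factor of two, which is how the large number of easy background points stops dominating the total loss.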

For the role of the loss on positive and negative samples, see the loss analysis in https://medium.com/visionwizard/centernet-objects-as-points-a-comprehensive-guide-2ed9993c48bc; it is very detailed, although I have not fully understood it yet.

focal loss implications on solving the class imbalance problem

From Kaiming He

Comparison of focal loss and CE loss ($α_{t}$ = 0.25, $γ$ = 4), with inputs in [0, 1].

This network also uses anchors.

Hyper-parameter tuning

(1) In general, α should be decreased slightly as γ is increased. The configuration that worked the best for the authors was with γ = 2, α = 0.25.

In the comparison at the time, it achieved the best balance between accuracy and speed.

Object detection started off as a two-phase implementation, where it detects the object in the image in the first phase (localization) and classifies it in the second (classification).

RetinaNet: The beauty of Focal Loss

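Miscellaneous

If CenterTrack really cannot be made to run because of the environment, try the following alternative instead of insisting on a single option

github: AB3DMOT

Tutorial: AB3DMOT: this one is step by step and worth trying.

The paper is already downloaded; compare the reported metrics with CenterTrack to decide whether it is worth trying

Reading path for deep-learning 3D object detection papers

A study path of classic 3D OD papers that can be read as one topic. All papers are already downloaded locally.

I recommend connectedpapers.com: enter one paper and it shows a visualization of the important related papers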

Roadmap:

（2019/07）STD: Sparse-to-Dense 3D Object Detector for Point Cloud

STD 和上面的 PointPillar 是相同的论文

（2020/02）3DSSD: Point-based 3D Single Stage Object Detector

First pass done; needs a deeper read. Mainly the network-architecture optimizations are not yet clear to me.

github：3DSSD: https://github.com/dvlab-research/3DSSD

（2020/03）Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud

After reading, this feels more like an academic piece, since it is not compared against other network architectures.

github：Point-GNN: https://github.com/WeijingShi/Point-GNN

（2020/06）AFDet: Anchor Free One Stage 3D Object Detection

Built on the PointPillars encoder, using an anchor-free approach.

（2020/06）1st Place Solution for Waymo Open Dataset Challenge – 3D Detection and Domain Adaptation

This is the same work as AFDet above. A practice-oriented paper; try it first.

Some good 3D detection projects:

OpenPCDet: https://github.com/open-mmlab/OpenPCDet

SA-SSD: https://github.com/skyhehe123/SA-SSD

3D_adapt_auto_driving: https://github.com/cxy1997/3D_adapt_auto_driving

pseudo-LiDAR_e2e: https://github.com/mileyan/pseudo-LiDAR_e2e

3D point cloud data augmentation

A paper on 3D data augmentation, worth a look

Tutorial: Part-Aware Data Augmentation for 3D Object Detection in Point Cloud

https://www.coderbridge.com/series/6c11455870b74d9f973aa73445e90dbf/posts/dbe7a5e994f5462ba741f582b22e8d17

This is a video tutorial; watch it first

(1) batch normalization

At testing stage

we do not have batch at testing stage

Benefit

BN reduces training time and makes very deep nets trainable
- Because of less Covariate Shift, we can use larger learning rates
- Less exploding/vanishing gradients (especially effective for sigmoid, tanh, etc.)
- Learning is less affected by initialization

BN is useful at both training and testing time, but it helps mostly during training.
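Since there is no batch at test time, a common solution is to normalize with running statistics collected during training. The sketch below shows that idea for a single scalar feature, without the learnable scale and shift for brevity; the class `BatchNorm1D` and its method names are made up for this illustration.

```python
import math


class BatchNorm1D:
    """Batch norm for one scalar feature, without learnable scale/shift."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def forward_train(self, batch):
        # Training: normalize with the statistics of the current batch ...
        mean = sum(batch) / len(batch)
        var = sum((v - mean) ** 2 for v in batch) / len(batch)
        # ... and update the running estimates used later at test time.
        m = self.momentum
        self.running_mean = (1 - m) * self.running_mean + m * mean
        self.running_var = (1 - m) * self.running_var + m * var
        return [(v - mean) / math.sqrt(var + self.eps) for v in batch]

    def forward_eval(self, value):
        # Testing: a single sample is normalized with the running estimates.
        return (value - self.running_mean) / math.sqrt(self.running_var + self.eps)


bn = BatchNorm1D(momentum=1.0)          # momentum 1.0: keep only the last batch
train_out = bn.forward_train([1.0, 3.0])
test_out = bn.forward_eval(2.0)
```

With momentum 1.0 the running mean equals the last batch mean of 2.0, so a test input of 2.0 is mapped to 0 even though no batch is available.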

(2) cross-entropy loss

Explains why cross-entropy suits classification

intermediate layer: extracts the feature map for proposal generation

regression layer: predicts the box parameters of all proposals

classification layer: predicts the object/background probabilities of all proposals

image pyramid: resize the original image to different sizes, obtain feature maps of different sizes, and run detection at each size.

filter pyramid: this refers to the convolution kernels

Faster R-CNN, however, uses multi-scale anchors

Multi-scale anchors centered at the same point share the same feature, which addresses scale variance without extra cost.

The loss consists of two parts: a regression term and a classification term
