Image Caption

看图说话（Image Caption）任务是结合CV和NLP两个领域的一种比较综合的任务，Image Caption模型的输入是一幅图像，输出是对该幅图像进行描述的一段文字。

其实，图像描述（Image Caption）本质上是图像信息到文本信息的翻译，通俗来讲，就是“看图说话”。

“看图说话”对人类而言很容易，但是对于机器却非常具有挑战性，它不仅需要利用模型去理解图片的内容并且还需要用自然语言去表达它们之间的关系。除此之外，模型还需要能够抓住图像的语义信息，并且生成人类可读的句子。

此外，还存在一个困扰性难题，由于模型的结构过于简单，导致机器生成的句子风格往往过于单一。所以，本文将基于传统的Image Caption实现方式，介绍如何利用深度生成模型来生成多样化图片描述。

常用的数据集

Microsoft COCO Caption数据集

Common Objects in Context (COCO) literally implies that the images in the dataset are everyday objects captured from everyday scenes. This adds some “context” to the objects captured in the scenes.

“COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features: Object segmentation, Recognition in context, Superpixel stuff segmentation, 330K images (>200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, 250,000 people with keypoints.” 原COCO数据集中约330,000张图像，人工地为每张图像都生成了至少5句标注，标注语句总共超过了约150万句

（object category 和 stuff category 的区别？）

关于 COCO 数据集更加详细的介绍

coco dataset 的数据格式

INFO

LICENSES

（上述两个对深度学习的训练不是很重要，可以忽略）

IMAGES

Note that image ids need to be unique (among other images), but they do not necessarily need to match the file name (unless the deep learning code you are using makes an assumption that they’ll be the same… developers are lazy, it wouldn’t surprise me). 注意 image 中的id 是唯一的，但是 file_name 和 id 不一定是一一对应的。

FIVE COCO ANNOTATION TYPES

COCO has five annotation types: for object detection, keypoint detection, stuff segmentation, panoptic segmentation, and image captioning. The annotations are stored using JSON. 五种不同的标注，那么就可以对应着五种不同的任务

OBJECT DETECTION (SEGMENTATION)

This is the most popular one; it draws shapes around objects in an image. It has a list of categories and annotations.

CATEGORIES

The “categories” object contains a list of categories (e.g. dog, boat) and each of those belongs to a supercategory (e.g. animal, vehicle). The original COCO dataset contains 90 categories.

ANNOTATIONS

The “annotations” section is the trickiest to understand. It contains a list of every individual object annotation from every image in the dataset. For example, if there are 64 bicycles spread out across 100 images, there will be 64 bicycle annotations (along with a ton of annotations for other object categories). Often there will be multiple instances of an object in an image. Usually this results in a new annotation item for each one.

两种标注的区别 I say “usually” because regions of interest indicated by these annotations are specified by “segmentations”, which are usually a list of polygon vertices around the object, but can also be a run-length-encoded (RLE) bit mask. Typically, RLE is used for groups of objects (like a large stack of books). I’ll explain how this works later in the article.

bounding box 格式是一个非常小的细节 The COCO bounding box format is [top left x position, top left y position, width, height].

pycocotools 下有三个模块：coco ，cocoeval，mask 和_mask

segmentation的两种格式：RLE（run-length encoding）和polygon

The first annotation:

Has a segmentation list of vertices (x, y pixel positions)
Has an area of 702 pixels (pretty small) and a bounding box of [473.07,395.93,38.65,28.67]
Is not a crowd (meaning it’s a single object)
Is category id of 18 (which is a dog)

Corresponds with an image with id 289343 (which is a person on a strange bicycle and a tiny dog)

The second annotation:

Has a Run-Length-Encoding style segmentation
Has an area of 220834 pixels (much larger) and a bounding box of [0,34,639,388]
Is a crowd (meaning it’s a group of objects)
Is a category id of 1 (which is a person)

Corresponds with an image with id 250282 (which is a vintage class photo of about 50 school children)

iscrowd=1时表示格式是RLE，iscrowd=0时表示格式是polygon：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24


"annotations": [
    {
        "segmentation": [[510.66,423.01,511.72,420.03,...,510.45,423.01]],
        "area": 702.1057499999998,
        "iscrowd": 0,
        "image_id": 289343,
        "bbox": [473.07,395.93,38.65,28.67],
        "category_id": 18,
        "id": 1768
    },
    ...
    {
        "segmentation": {
            "counts": [179,27,392,41,…,55,20],
            "size": [426,640]
        },
        "area": 220834,
        "iscrowd": 1,
        "image_id": 250282,
        "bbox": [0,34,639,388],
        "category_id": 1,
        "id": 900100250282
    }
]

KEYPOINT DETECTION FORMAT

关于这个segmentation 可以不考虑，因为没有类似的任务。

PANOPTIC SEGMENTATION

其实我的任务需要使用到的就是这种类型的segmentation

（注意对比和上面做 object detection 任务所用到的图片是不一样的。）

这部分是需要进一步调研的，重点在于如何使用代码实现相应的功能。

IMAGE CAPTIONING

Image caption annotations are pretty simple. There are no categories in this JSON file, just annotations with caption descriptions. Both of the pictures I checked actually had 4 separate captions for each image, presumably from different people. 关键信息是每张配图有 5句描述性信息。（这种量级还是很好的）

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


"annotations": [
    {
        "image_id": 289343,
        "id": 433580,
        "caption": "A person riding a very tall bike in the street."
    },
    ...
    {
        "image_id": 250282,
        "id": 511309,
        "caption": "A group of school children posing for a picture. "
    },
]

对于coco data 的操作

Class Filtering （不用像之前那样重新解析 json 脚本，可以直接使用调用 api 进行相应的实现）

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


# Define the classes (out of the 81) which you want to see. Others will not be shown.
filterClasses = ['laptop', 'tv', 'cell phone']

# Fetch class IDs only corresponding to the filterClasses
catIds = coco.getCatIds(catNms=filterClasses) 
# Get all images containing the above Category IDs
imgIds = coco.getImgIds(catIds=catIds)
print("Number of images containing all the  classes:", len(imgIds))

# load and display a random image
img = coco.loadImgs(imgIds[np.random.randint(0,len(imgIds))])[0]
I = io.imread('{}/images/{}/{}'.format(dataDir,dataType,img['file_name']))/255.0

plt.axis('off')
plt.imshow(I)
plt.show()

Now, the imgIDs variable contains all the images which contain all the filterClasses. The output of the print statement is:

如果想要展示 annotation 标注信息，那么是可以使用以下的代码

1
2
3
4
5
6


# Load and display instance annotations
plt.imshow(I)
plt.axis('off')
annIds = coco.getAnnIds(imgIds=img['id'], catIds=catIds, iscrowd=None)
anns = coco.loadAnns(annIds)
coco.showAnns(anns)

并且是可以个性化组合的

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26


# 这里是有一个去重的 function

########## ALl POSSIBLE COMBINATIONS ########
classes = ['laptop', 'tv', 'cell phone']

images = []
if classes!=None:
    # iterate for each individual class in the list
    for className in classes:
        # get all images containing given class
        catIds = coco.getCatIds(catNms=className)
        imgIds = coco.getImgIds(catIds=catIds)
        images += coco.loadImgs(imgIds)
else:
    imgIds = coco.getImgIds()
    images = coco.loadImgs(imgIds)
    
# Now, filter out the repeated images    
unique_images = []
for i in range(len(images)):
    if images[i] not in unique_images:
        unique_images.append(images[i])

dataset_size = len(unique_images)

print("Number of images containing the filter classes:", dataset_size)

上述信息的参考文献：

Create COCO Annotations From Scratch
pycocoDemo 给出了一些demo，如何调用api
master-the-coco-dataset-for-semantic-image-segmentation-part-1-of-2

{% endfold %}

Flickr8K和30K

图像数据来源是雅虎的相册网站Flickr ：数据集中图像的数量分别是8,000张和30,000张

数据集包含8,000张图像，每张图像都与五个不同的标题配对，这些标题提供了对图片中物体和事件的内容描述

1
2
3
4
5
6


#0    A child in a pink dress is climbing up a set of stairs in an entry way .
#1    A girl going into a wooden building .
#2    A little girl climbing into a wooden playhouse .
#3    A little girl climbing the stairs to her playhouse .
#4    A little girl in a pink dress going into a wooden cabin .

关于数据详细的信息图像标注

比较经典的论文

Show and Tell: A Neural Image Caption Generator

Show and Tell: A Neural Image Caption Generator

2.Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

主要的框架

特征提取模型是一种神经网络。给定一张图像，它可以提取出显著的特征，通常用固定长度的向量表示。提取出的特征是该图像的内部表征，不是人类可以直接理解的东西。

语言模型：一般而言，当一个序列已经给出了一些词时，语言模型可以预测该序列的下一个词的概率。

Show and Tell: A Neural Image Caption Generator

编码器-解码器架构：RNN 需要保证输入和输出等长, 但是 encoder-decoder 可以不等长度.

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

使用注意机制的描述模型：

编码器-解码器的一个局限性是使用了单个固定长度的表征来保存提取出的特征。

基于注意力机制的方法也已经被用于改进用于图像描述的编码器-解码器架构的表现水平——让解码器可以学习在生成描述中每个词时应该关注图像中的哪些部分。

生成的 caption 中还要考虑目标之间的关系，可以直观的表示为：

从网络结构上：可以看出整体仍是 Encoder-Decoder 结构，Encoder 部分没有做改变，在 Decoder 中引入了 attention。

2015年2月，《 Show, Attend and Tell: Neural Image Caption Generation with Visual Attention》将人类视觉系统中的 attention 机制引入深度学习

比较新的几篇论文

Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space

Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space

github vae_captioning

样化Image Caption的尝试

常规的做法

pass： code 比较难跑起来

asg2cap (Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graph, CVPR 2020)

github

Code accompanying the paper “Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs” (Chen et al., CVPR 2020, Oral). 论文

看图说话之随心所欲：细粒度可控的图像描述自动生成

有点麻烦，但是是可行的。重点看一下吧。

Vision-Language Pre-training for Image Captioning and Question Answering

可以试试，看一下code，可以尝试的。

CutMix

CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

和场景不太吻合

vmess://ew0KICAidiI6ICIyIiwNCiAgInBzIjogIjIzM3YyLmNvbV8xOC4xNjMuOTEuMTEyIiwNCiAgImFkZCI6ICIxOC4xNjMuOTEuMTEyIiwNCiAgInBvcnQiOiAiODAiLA0KICAiaWQiOiAiZDQ0NDQ0NjktZmE3OC00ODAwLWExMTEtZDE5YjhlZGIwMTlhIiwNCiAgImFpZCI6ICIyMzMiLA0KICAibmV0IjogInRjcCIsDQogICJ0eXBlIjogImh0dHAiLA0KICAiaG9zdCI6ICJ3d3cuYmFpZHUuY29tIiwNCiAgInBhdGgiOiAiIiwNCiAgInRscyI6ICIiDQp9

评价标注

Image Caption评价标准

BLEU

• 图像标注结果评价中使用最广泛，设计初衷并不是针对图像标注问题，而是针对机器翻译问题

• 分析待评价的翻译语句和参考翻译语句之间n元组的相关性

实际应用

图像标注问题如果能够得到很好的解决，那么价值是显而易见的

辅助或者替代设计师进行图片描述的书写图像检索（更细粒度的搜索）

其他的材料

Show and Tell——图像标注（Image Caption）任务技术综述

图像描述生成Image Caption知识资料全集

https://www.floydhub.com/api/v1/resources/xdVEtkPLJjMAAyD66cxNjB/data/annotations/stuff_train2017.json?content=true&rename=stuff_train2017json

文章目录