CLIP Research Notes

  1. How to Try CLIP: OpenAI’s Zero-Shot Image Classifier

Corresponding Colab notebook: Roboflow-CLIP-Zero-Shot-Classification.ipynb. Intuitively, the results are only so-so.

CLIP (Contrastive Language-Image Pre-training) is a zero-shot model, meaning it can identify an enormous range of things it has never seen before.

In traditional classifiers, the meaning of the labels is ignored (in fact, they’re often simply discarded and replaced with integers internally). By contrast, CLIP creates an encoding of its classes and is pre-trained on over 400 million text to image pairs. This allows it to leverage transformer models' ability to extract semantic meaning from text to make image classifications out of the box without being fine-tuned on custom data.

Traditionally, the label text in image classification is mapped to integers, but CLIP does not do this. The training data consists of 400 million (image, text) pairs.

In this post, we will walk through a demonstration of how to test out CLIP’s performance on your own images so you can get some hard numbers and an intuition for how well CLIP actually does on various use cases. We found that CLIP does better than our custom trained ResNet classification models on a flower classification task. It also does surprisingly well over a range of more obscure and challenging tasks (including identifying mushroom species in pictures from our camera roll and identifying breeds of dogs and cats).

Their experiments suggest the results are good; skimming the dataset behind the link, though, the amount of data seems rather small.

  2. CLIP from OpenAI: what is it and how you can try it out yourself

Corresponding Google Colab notebook: CLIP Face descrimination. Worth trying out.

Intuition

For this to work successfully, the network must learn good visual representations and good connections between visual cues and text.

How CLIP works

(1) contrastive pre-training

CLIP achieves this by reframing the problem and using contrastive pre-training. Instead of predicting label text, CLIP is trained to predict how likely an image is to correspond to a given text.

[Figure: CLIP contrastive pre-training — image and text encoders produce an N×N similarity matrix]

Input images and texts are encoded, and their vector representations are used to build a similarity matrix (I*T is an inner product). During training, we know that the values on the diagonal represent correct pairings, so their similarity must be higher than that of the other entries in the same row or column. This approach contrasts what we know goes together (diagonal values) with what we know doesn’t go together (off-diagonal values). You can see that each row is a classification task: given an input image I, predict the text. Similarly, each column is a classification task: given an input text T, predict the image. During training, OpenAI used a very large mini-batch size of 32,768 (N in the figure above).
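To make the diagonal-versus-off-diagonal idea concrete, here is a minimal PyTorch sketch of this symmetric contrastive loss. The encoders are omitted and the fixed temperature value is an assumption (CLIP actually learns it as a parameter); this illustrates the objective, not CLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over an N x N similarity matrix.

    image_features, text_features: (N, d) embeddings of a batch of N matched
    image-text pairs; row i of each tensor belongs to the same pair.
    """
    # L2-normalize so the inner product I*T is a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix: entry (i, j) compares image i with text j
    logits = image_features @ text_features.t() / temperature

    # the diagonal entries are the correct pairings
    targets = torch.arange(logits.size(0), device=logits.device)

    # each row is "given image i, pick the matching text";
    # each column is "given text j, pick the matching image"
    loss_per_image = F.cross_entropy(logits, targets)
    loss_per_text = F.cross_entropy(logits.t(), targets)
    return (loss_per_image + loss_per_text) / 2
```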

(2) create dataset classifier from label text

During inference, one takes a set of labels, creates texts based on the labels, and runs these texts through the text encoder. The text embeddings are then matched against the image representation.
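As a concrete illustration of steps (2) and (3), the sketch below uses the open-source `clip` package released by OpenAI (`pip install git+https://github.com/openai/CLIP.git`). The label list, prompt template, and image filename are placeholders for your own data.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# build prompt texts from the label set (labels and prompt template are placeholders)
labels = ["dog", "cat", "bird"]
text_tokens = clip.tokenize([f"a photo of a {c}" for c in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # logits_per_image holds the similarity of the image to each prompt
    logits_per_image, _ = model(image, text_tokens)
    probs = logits_per_image.softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```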

(3) use for zero-shot prediction

Classic classification training cares only about the predefined labels. If it is successful at finding dogs, it doesn’t care whether the image is a photo or a sketch of a dog, or which breed it is. CLIP training, coupled with a large dataset, instead makes the network learn many aspects of images and pay attention to details.

**One detail worth mentioning is that CLIP is sensitive to the words used in image descriptions.** The texts “a photo of a bird”, “a photo of a bird sitting near a bird feeder”, and “an image of a bird” all produce different probabilities when paired with the same image.

Another result shown here is that CLIP was not trained with image similarity in mind. Yet it learned useful representations that can be used in image-similarity scenarios.

The blog author used the model for image retrieval and found that it works very well.
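A minimal sketch of that image-similarity use case, assuming the same `clip` package as above; the filenames are placeholders. Gallery images are ranked by the cosine similarity of their CLIP image embeddings to a query image.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

paths = ["query.jpg", "a.jpg", "b.jpg", "c.jpg"]  # placeholder image files
with torch.no_grad():
    batch = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    emb = model.encode_image(batch)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize for cosine similarity

# rank the gallery images (rows 1..n) against the query image (row 0)
scores = emb[1:] @ emb[0]
for path, score in sorted(zip(paths[1:], scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```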

(4) zero-shot ImageNet accuracy

[Figure: zero-shot ImageNet accuracy / training-efficiency comparison]

This zero-shot learning approach, coupled with natural language supervision, is what differentiates CLIP from other vision models. By training on a wide variety of data easily accessible on the internet, without directly optimizing for the benchmark, CLIP is much more general and representative.

We can see in the figure above that CLIP reaches the accuracy of the language-model baseline after processing only about 33M training images instead of 400M, making it roughly 12 times more efficient. As a result of this methodology, CLIP can easily be applied to nearly any visual classification task and achieve great performance.

From a training-cost perspective, this efficiency is a significant advantage.

CLIP current limitations

CLIP’s authors are open about its limitations. CLIP struggles on more abstract or systematic tasks, such as counting the number of objects, and on more complex tasks, such as estimating relative distances between objects. On such datasets, CLIP is only slightly better than random guessing. CLIP also struggles with very fine-grained classification, such as telling the difference between car models, variants of aircraft, or flower species.

It struggles with abstract tasks described in text, and is also somewhat weak at fine-grained classification.

The CLIP model itself is data-hungry and expensive to train. If the pre-trained model doesn’t work well for you, it may not be feasible to train your own version.

Key takeaway: you generally cannot train it on your own dataset, because the CLIP model itself is data-hungry and expensive to train. If the pre-trained model works for your use case, use it; if not, treat it mainly as a source of intuition.

  3. Open AI CLIP: learning visual concepts from natural language supervision

From a conceptual standpoint, this article explains things quite well.

main selling point

For instance, a classifier trained on ImageNet (the largest image dataset) is only able to classify images that belong to the classes it was trained on. It doesn’t make sense to keep adding new classes to the dataset and re-training the network in the long term.

This is again about the zero-shot capability.

Contrastive learning

Definition

Contrastive learning is an approach to formulate the task of finding similar and dissimilar things for an ML model. Using this approach, one can train a machine learning model to classify between similar and dissimilar images.

You can think of Contrastive learning as a matching problem. If you were to match the picture of a cat to another similar one, you can do it easily. First, recognize the first cat, then find an image of another cat. So, you can contrast between similar and dissimilar things.

The model starts with contrastive pre-training, where image-text pairs from a batch are matched by similarity. This is done using an image encoder and a text encoder. Contrastive pre-training attempts to learn noise-invariant representations that encourage consistency between the learned representations and the original input.

They got their inspiration from VirTex, a pre-training approach that uses semantically dense captions to learn visual representations. This approach has been shown to surpass other supervised approaches such as classic high-end ImageNet networks.

The key precursor is VirTex, which verified that training on text + image jointly works better than training on image + label.

Drawbacks

  1. It doesn’t perform too well on systematic tasks, such as counting the number of objects in images.
  2. Weak generalization to images not covered in its pre-training dataset.
  3. Sensitive to wording and phrasing.

OpenAI’s DALL-E and CLIP 101: a brief introduction

CLIP is a multimodal neural network.

**Multimodal neural networks: what are they?**

Our experiences as humans are multimodal, meaning that we receive inputs from the world surrounding us in different formats (sound, image, odors, textures, etc.) that we combine using different senses (touch, sight, hearing, smell and taste) to produce learnings and to retain information.

For a deeper understanding, see the paper Multimodal Machine Learning: A Survey and Taxonomy.

In Deep Learning, it is very common to train models in only one data format (single modality).

Many earlier models were single-modality.

CLIP is trained not by using labeled image datasets but from images and their descriptions (captions) taken from the internet.

Compared with manual labeling, crawling image-caption pairs from the web is effectively label-free.

Hands-on Guide to OpenAI’s CLIP – Connecting Text To Images

CLIP builds on several existing techniques for learning visual representations from natural language supervision. It uses modern architectures such as vision and text Transformers, and relates to ICMLM, which explores masked language modelling; VirTex, which applies autoregressive language modelling; and ConVIRT, which studied the same contrastive objective that CLIP uses, but for medical imaging.

The base models used in CLIP’s implementation.

OpenAI Introduces CLIP: A Neural Network That Efficiently Learns Visual Concepts From Natural Language Supervision

CLIP builds on a vast body of work on zero-shot transfer, multimodal learning, and natural language supervision.

This sentence ties the different selling points together.

Roughly speaking, it means standing on the shoulders of giants.

CLIP is part of a group of papers revisiting learning visual representations from natural language supervision in the past year. This work uses more modern architectures like the Transformer and includes:

  1. VirTex: It explored autoregressive language modeling.
  2. ICMLM: It investigated masked language modeling.
  3. ConVIRT: It studied the same contrastive objective that the team used for CLIP but in medical imaging.

Beyond tags and entering the semantic search era on images with OpenAI CLIP

The post above includes corresponding code. It is not complete, but it conveys the approach, so it is worth a look.

Semantic search refers to the ability of search engines to consider the intent and contextual meaning of search phrases. Instead of trying to find exact matches for the word in the input phrase, semantic search captures broader context and relationships between words and retrieves results that are more closely related to the context of the search query.

This is a good definition of semantic search: consider the intent and contextual meaning of search phrases, rather than finding exact matches for the words in the input phrase. For the former, sentence-BERT is one option to consider.
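A minimal sketch of such a CLIP-based semantic image search, again assuming the `clip` package; the image list and query string are placeholders, and a real system would cache the image embeddings in a vector index.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# 1) index: embed every image in the collection once
image_paths = ["beach.jpg", "city.jpg", "forest.jpg"]  # placeholders
with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    image_emb = model.encode_image(images)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

# 2) search: embed the free-text query and rank images by cosine similarity
query = "a sunset over the ocean"
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize([query]).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores = (image_emb @ text_emb.t()).squeeze(1)
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```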

How can it be trained on your own dataset?

contrastive learning

Contrastive learning is an approach to formulate the task of finding similar and dissimilar things for an ML model. Using this approach, one can train a machine learning model to classify between similar and dissimilar images.

For details on how this is trained, see this paper:

https://arxiv.org/pdf/2006.06666.pdf

Some limitations of CLIP:

It doesn’t perform too well on systematic tasks, such as counting the number of objects in images.

Weak generalization to images not covered in its pre-training dataset.

Sensitive to wording and phrasing.

CLIP’s training approach mainly draws on VirTex.

VirTex: Learning Visual Representations from Textual Annotations (paper summary)

Goal: learn high-quality visual representations from fewer images, seeking a data-efficient alternative to classification-based pre-training.

Approach: propose VirTex, a pre-training method that uses semantically dense captions to learn visual representations. Convolutional networks are trained from scratch on COCO Captions and transferred to downstream recognition tasks, including image classification, object detection, and instance segmentation.

Results: on all tasks, whether compared against supervised or unsupervised pre-training, VirTex matches or outperforms models pre-trained on ImageNet, even when using only 1/10 as many images.

This can be seen as an unsupervised pre-training method (in the sense that the images carry no class labels).

VirTex: Learning Visual Representations from Textual Annotations

Training relies on the idea of image captioning.

Given a dataset of image-caption pairs, our goal is to learn visual representations that can be transferred to downstream visual recognition tasks. As shown in Figure 2, captions carry rich semantic information about images, including the presence of objects (cat, plate, cake); attributes of objects (orange and white cat); spatial arrangement of objects (cat near a plate); and their actions (looking at apples). Learned visual representations that capture such rich semantics should be useful for many downstream vision tasks.

These textual features actually help extract better image embeddings, e.g. attributes such as color and material.

[Figure from VirTex: an image-caption pair illustrating the semantic information captions carry]

To this end, we train image captioning models to predict captions from images. As shown in Figure 3, our model has two components: a visual backbone and a textual head. The visual backbone extracts visual features from an input image I.

The textual head accepts these features and predicts a caption C = (c0, c1, …, cT, cT+1) token by token, where c0 = [SOS] and cT+1 = [EOS] are fixed special tokens indicating the start and end of sentence. The textual head performs bidirectional captioning (bicaptioning): it comprises a forward model that predicts tokens left-to-right, and a backward model that predicts right-to-left.

[Figure 3 from VirTex: visual backbone feeding a bidirectional captioning textual head]

Image-side processing

Visual Backbone: The visual backbone is a convolutional network which computes visual features of images. It inputs raw image pixels, and outputs a spatial grid of image features. During pretraining, these features are used to predict captions. In downstream tasks, we either train linear models on features extracted from the visual backbone, or fine-tune the visual backbone end-to-end.

In principle we could use any convolutional network architecture for the visual backbone. In our experiments we use a standard ResNet-50 [2] as the visual backbone to facilitate comparison with our baseline methods (Section 4). It accepts a 224 × 224 image and produces a 7 × 7 grid of 2048-dimensional features after the final convolutional layer. During pretraining, we apply a linear projection layer to the visual features before passing them to the textual head to facilitate decoder attention over visual features. This projection layer is not used in downstream tasks.
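A minimal PyTorch sketch of this visual backbone, using torchvision's ResNet-50; the hidden size of the projection layer is a placeholder, not necessarily the value used in the paper.

```python
import torch
import torchvision

class VisualBackbone(torch.nn.Module):
    def __init__(self, hidden_dim=512):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)  # use pretrained=False on older torchvision
        # keep everything up to the last conv stage, drop the avgpool and fc head
        self.stem = torch.nn.Sequential(*list(resnet.children())[:-2])
        # linear projection used only during pretraining (dropped for downstream tasks)
        self.proj = torch.nn.Linear(2048, hidden_dim)

    def forward(self, images):                    # images: (B, 3, 224, 224)
        feats = self.stem(images)                 # (B, 2048, 7, 7) spatial grid
        feats = feats.flatten(2).transpose(1, 2)  # (B, 49, 2048): grid cells as a sequence
        return self.proj(feats)                   # (B, 49, hidden_dim), fed to the textual head
```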

Text-side processing

Textual Head: The textual head receives features from the visual backbone and predicts captions for images. It provides a learning signal to the visual backbone during pretraining. Our overall goal is not to predict high-quality captions, but instead to learn transferable visual features.

The textual head comprises two identical language models which predict captions in forward and backward directions respectively. Following recent advances in language modeling, we use Transformers [29], which use multiheaded self-attention both to propagate information along the sequence of caption tokens, as well as to fuse visual and textual features. We closely follow the transformer decoder architecture from [29], but use GELU [85] rather than ReLU, following [64, 79]. We briefly review the architecture here; refer to [29] for a more complete description.
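A condensed sketch of the bicaptioning textual head described above, using PyTorch's Transformer decoder with GELU. Positional embeddings, padding masks, and any weight sharing used in the paper are omitted for brevity, and the hyperparameters are placeholders; this is an illustration of the idea, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BicaptioningHead(nn.Module):
    """Forward and backward caption language models attending to the visual grid."""

    def __init__(self, vocab_size, hidden_dim=512, nhead=8, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=nhead,
                                           activation="gelu", batch_first=True)
        self.forward_lm = nn.TransformerDecoder(layer, num_layers)   # predicts left-to-right
        self.backward_lm = nn.TransformerDecoder(layer, num_layers)  # predicts right-to-left
        self.out = nn.Linear(hidden_dim, vocab_size)

    def _lm_loss(self, decoder, tokens, visual_feats):
        # teacher forcing: predict token t+1 from tokens <= t, cross-attending to the image grid
        inp, target = tokens[:, :-1], tokens[:, 1:]
        T = inp.size(1)
        causal = torch.triu(torch.ones(T, T, device=tokens.device), diagonal=1).bool()
        hidden = decoder(self.embed(inp), visual_feats, tgt_mask=causal)
        return F.cross_entropy(self.out(hidden).transpose(1, 2), target)

    def forward(self, tokens, visual_feats):
        # tokens: (B, T) caption as [SOS] c1 ... cT [EOS]; visual_feats: (B, 49, hidden_dim)
        fwd = self._lm_loss(self.forward_lm, tokens, visual_feats)
        bwd = self._lm_loss(self.backward_lm, tokens.flip(1), visual_feats)  # reversed caption
        return fwd + bwd
```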