Understanding attention?

Study notes following this line of work: Attention Is All You Need -> BERT (Bidirectional Encoder Representations from Transformers) -> Vision Transformer -> Swin Transformer.

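Mu Li's walkthrough of attention

Attention Is All You Need

(1) The multi-head mechanism is similar to using multiple convolution kernels in a CNN: each head plays the role of one output channel.

(2) Attention can model the relationships across a whole sentence (or image) in a single step, whereas a CNN has to stack many layers before distant pixels can interact.

Autoregression: the output at the previous time step is used as the input at the current time step, as in decoding for machine translation.

Input words are first converted into embedding vectors and summed with a positional encoding; only then are they fed into multi-head attention.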

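If the features are two-dimensional (rows are samples, columns are features):

batch norm: normalizes each feature dimension (column) across the batch

layer norm: normalizes each sample (row) across its features

If the features are three-dimensional (batch, sequence, feature), see the figure. One way to understand the preference for layer norm: in sequence-to-sequence models the sentences have different lengths, so computing the mean and variance per sample is more reasonable than computing them across the batch. More formal papers, however, mostly analyze this from the perspective of gradients.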
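The difference between the two normalizations comes down to which axis the statistics are taken over. Below is a minimal NumPy sketch, assuming no learnable scale and shift and an `eps` of 1e-5 (both are simplifications chosen here, not details from the lecture).

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature (column) with statistics taken over the batch axis.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Normalize each sample (or token) with statistics taken over its own features.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x2d = rng.normal(size=(4, 6))       # (batch, features)
bn = batch_norm(x2d)                # every column has mean ~0
ln = layer_norm(x2d)                # every row has mean ~0

x3d = rng.normal(size=(2, 5, 6))    # (batch, sequence, features)
ln3d = layer_norm(x3d)              # every token is normalized on its own
```

Because layer norm never looks across the batch, padding or sentences of different lengths do not affect the statistics of any given token.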

The transformer has an encoder-decoder structure: each encoder layer has 2 sub-layers and each decoder layer has 3 sub-layers.

After reading it: if you get it, you get it; if you don't, you still don't. hhh

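The multi-head mechanism

The multi-head mechanism is similar to using multiple convolution kernels in a CNN (it feels like having multiple output channels).

First, query, key, and value are each projected by a linear layer to a lower dimension and then passed through attention. This is repeated h times, the h outputs are concatenated, and a final linear layer projects the result back to the original 512 dimensions.

Here h = 8, so each projected head has dimension 512 / 8 = 64.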
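The project, attend, concatenate, project-back sequence can be sketched as follows. This is an illustrative NumPy version, not the paper's implementation: the weights are random, there are no biases, and the names `w_q`, `w_k`, `w_v`, `w_o` are chosen here for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, w_q, w_k, w_v, w_o):
    # q: (n_q, d_model); k, v: (n_k, d_model)
    # w_q, w_k, w_v: (h, d_model, d_head); w_o: (h * d_head, d_model)
    h, _, d_head = w_q.shape
    heads = []
    for i in range(h):
        qi, ki, vi = q @ w_q[i], k @ w_k[i], v @ w_v[i]   # project to d_head
        scores = qi @ ki.T / np.sqrt(d_head)              # (n_q, n_k)
        heads.append(softmax(scores) @ vi)                # (n_q, d_head)
    concat = np.concatenate(heads, axis=-1)               # (n_q, h * d_head)
    return concat @ w_o                                   # back to d_model

d_model, h, n = 512, 8, 10
d_head = d_model // h                                     # 64
rng = np.random.default_rng(0)
w_q = rng.normal(scale=0.02, size=(h, d_model, d_head))
w_k = rng.normal(scale=0.02, size=(h, d_model, d_head))
w_v = rng.normal(scale=0.02, size=(h, d_model, d_head))
w_o = rng.normal(scale=0.02, size=(h * d_head, d_model))

x = rng.normal(size=(n, d_model))
out = multi_head_attention(x, x, x, w_q, w_k, w_v, w_o)   # self-attention
```

The explicit loop over heads is only for readability; real implementations batch the heads into a single tensor operation.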

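Self-attention: the query, key, and value inputs are the same thing. The transformer uses attention in three ways in total: self-attention in the encoder, masked self-attention in the decoder, and encoder-decoder attention.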

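Position-wise feed-forward networks

MLP: the same MLP is applied to every token (position) independently. $\mathrm{FFN}(x) = \max(0, x W_{1} + b_{1}) W_{2} + b_{2}$. Here $x$ (512 dimensions) is mapped to 2048 dimensions by $W_{1}$ and then mapped back to 512 dimensions by $W_{2}$.

In other words, an MLP with a single hidden layer.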
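A small sketch of the formula above, assuming random weights purely for illustration. It also checks the "position-wise" property: applying the MLP to the whole sequence gives the same result as applying it to one token at a time.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to the last axis of x.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

tokens = rng.normal(size=(5, d_model))       # a sequence of 5 tokens
out_all = ffn(tokens, w1, b1, w2, b2)        # whole sequence at once
out_one = ffn(tokens[2], w1, b1, w2, b2)     # a single token on its own
```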

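Summary

Attention makes simpler (weaker) assumptions about the data, which means it needs more training time and more data to train well.

The model itself has very few hyperparameters to tune, which is also an advantage for people building on it.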

BERT: Bidirectional Encoder Representations from Transformers

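Bidirectional: compare with prior work, GPT (unidirectional context, transformer) and ELMo (bidirectional context, RNN architecture).

Two ways to use a pretrained representation:

Feature-based: use the pretrained backbone to extract features, then feed them together with the input into the task-specific model

Fine-tuning: modify the network slightly and fine-tune the pretrained weights

Experimental results: report the absolute accuracy as well as the improvement relative to other work.

Masked language model: bidirectional (uses both left-to-right and right-to-left context)

A model trained on a large amount of unlabeled data may be better than one trained on your labeled dataset (for scale, ImageNet has about 1 million labeled images).

Both the NLP and CV communities should think this over carefully.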

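Counting the trainable parameters of the transformer

The transformer takes a pair of sequences as input (one for the encoder, one for the decoder). BERT has only an encoder, so its input is a single sequence; a pair of sentences therefore has to be concatenated into one sequence.

WordPiece: to keep the vocabulary from growing too large, long and infrequent words are split into common subwords, which keeps the vocabulary small, about 30k tokens.

The result of passing a pair of sentences through the BERT model.

Two pretraining tasks: (1) cloze (masked language modeling): tokens are selected at random for prediction, and each selected token is handled in one of three ways: 80% of the time it is replaced with [MASK], 10% of the time with a random token, and 10% of the time it is left unchanged; (2) next sentence prediction (NSP), which captures sentence-level information.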
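The 80% / 10% / 10% rule can be sketched in a few lines of plain Python. The 15% selection rate is the value reported in the BERT paper and is not stated in these notes; the function name `mask_tokens`, the toy vocabulary, and the use of `None` as the "no prediction here" label are choices made for this example.

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)                    # the model must predict this token
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")           # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)                   # not selected: no loss at this position
            inputs.append(tok)
    return inputs, labels

tokens = "the cat sat on the mat".split() * 50
vocab = sorted(set(tokens))
inputs, labels = mask_tokens(tokens, vocab)
```

Keeping some selected tokens unchanged or randomized means the model cannot rely on seeing [MASK] at every position it is asked to predict, which matters because [MASK] never appears during fine-tuning.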

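Self-attention in BERT can see the complete sentence. In the transformer (encoder-decoder), by contrast, the encoder cannot see the decoder's information. BERT is better on this point; the downside is that it can no longer do machine translation the way the transformer does.

Applying BERT to downstream tasks

QA datasets: when fine-tuning BERT, use the standard version of Adam and train for more epochs (the Adam in the released code is a stripped-down version, and it trains for only 3 epochs).

BERT base has roughly 100 million learnable parameters (110M in the paper).

BERT large has roughly 300 million learnable parameters (340M in the paper).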
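These totals can be approximated with back-of-the-envelope arithmetic. The sketch below counts only the token embedding and the weight matrices of each transformer block, ignoring biases, layer norm, position embeddings, and the pooler. The configuration values (hidden size 768 / 1024, 12 / 24 layers, a vocabulary of about 30k) are taken from the BERT paper rather than from these notes, and the function name `approx_params` is made up for the example.

```python
def approx_params(vocab_size, hidden, layers):
    embedding = vocab_size * hidden      # token embedding table
    attention = 4 * hidden * hidden      # W_q, W_k, W_v and the output projection
    mlp = 2 * hidden * (4 * hidden)      # hidden -> 4 * hidden -> hidden
    return embedding + layers * (attention + mlp)

base = approx_params(30_000, 768, 12)    # 107,974,656, about 0.11 billion
large = approx_params(30_000, 1024, 24)  # 332,709,888, about 0.33 billion
```

Each block contributes about 12 * hidden^2 parameters, so the total grows linearly with depth and quadratically with width.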

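Classification problems are more common in NLP. BERT is at a disadvantage on text summarization (text generation) and translation, but it works very well for classification.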

vision transformer

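Swin Transformer: best results on detection. A multi-scale ViT.

MAE: trains a transformer in a self-supervised way.

ViTs have properties that differ from CNNs; see the paper Intriguing Properties of Vision Transformers.

The most important operation in the transformer is self-attention, which computes a similarity between every pair of input elements. The sequence length that current hardware can handle is a few thousand at most (for example, 512 in BERT).

Applying a transformer directly at the pixel level is not realistic.

Classification: 224 × 224

Detection or segmentation: 600 × 600 or 800 × 800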
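Some quick arithmetic shows why pixel-level attention is out of reach: the attention matrix has one entry per pair of elements, so its size grows with the square of the sequence length. The 16 × 16 patch size used for comparison is the one these notes mention for ViT; the helper name is made up for the example.

```python
def attention_matrix_entries(seq_len):
    # Self-attention compares every element with every other element.
    return seq_len * seq_len

pixels_cls = 224 * 224                   # 50,176 pixels for classification
pixels_det = 800 * 800                   # 640,000 pixels for detection
patches_cls = (224 // 16) ** 2           # 196 patches of 16 x 16 pixels

entries_pixels = attention_matrix_entries(pixels_cls)    # about 2.5 billion
entries_patches = attention_matrix_entries(patches_cls)  # 38,416
```

Both pixel counts are far beyond a sequence length of a few thousand, while 196 patches is comfortably below BERT's 512.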

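Use a feature map (for example, a 14 × 14 ResNet feature map) as the transformer input. There is a lot of work that combines CNNs and transformers in this way.

Use patches instead: each patch is 16 × 16 pixels, hence the title An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.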

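Transformers lack some of the inductive biases of CNNs.

Inductive bias: prior knowledge, i.e., assumptions built into the model.

Inductive biases of CNNs:

locality: nearby pixels tend to have similar content, which is why a sliding window works;
translation equivariance

$f(g(x)) = g(f(x))$, where $f$ denotes convolution and $g$ denotes translation.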
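Translation equivariance can be checked numerically. The sketch below assumes wrap-around (circular) borders so that the identity holds exactly; with zero padding it only holds away from the image border. The function names are made up for the example.

```python
import numpy as np

def circular_conv2d(image, kernel):
    # Cross-correlation with wrap-around borders:
    # out[i, j] = sum over (di, dj) of kernel[di, dj] * image[i + di, j + dj]
    out = np.zeros_like(image, dtype=float)
    kh, kw = kernel.shape
    for di in range(kh):
        for dj in range(kw):
            out += kernel[di, dj] * np.roll(image, shift=(-di, -dj), axis=(0, 1))
    return out

def translate(image, dy, dx):
    # Shift the image by (dy, dx) pixels with wrap-around.
    return np.roll(image, shift=(dy, dx), axis=(0, 1))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))

f_of_g = circular_conv2d(translate(x, 2, 3), kernel)   # f(g(x))
g_of_f = translate(circular_conv2d(x, kernel), 2, 3)   # g(f(x))
```

Shifting first and then convolving gives the same feature map as convolving first and then shifting, which is the property self-attention on patches does not have built in.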

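In general, discriminative networks perform better than generative ones.

The focus of this paper is how to treat an image like the words in NLP. It relies on patches: the image is split into $16 \times 16$ patches with a fixed order (needed later for the position embedding), after which they can be processed just like text in NLP.

Each patch is passed through a linear projection to obtain the embedded patches, which then go into the transformer encoder (shown on the right of the figure; it can be viewed as one block stacked L times, with L = 12 for the base model). The input and output of the transformer encoder have the same dimension, which makes the blocks easy to stack.

Before entering the first block, a position embedding is added to the embedded patches (it is not the natural numbers 0, 1, 2, ... drawn in the figure, but a vector of the same dimension as the patch embeddings, so the two can simply be added).

$\text{index} = 0$ is the cls token, an idea borrowed from BERT. The vector at this position is treated as the image feature and used for the downstream task, for example as the input to the MLP head for classification.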
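The pipeline from image to encoder input can be sketched as follows, assuming a 224 × 224 RGB image, 16 × 16 patches, and an embedding dimension of 768. The weights are random stand-ins for learned parameters, and the names `patchify`, `w_embed`, `cls_token`, and `pos_embed` are chosen here for illustration.

```python
import numpy as np

def patchify(image, patch):
    # image: (H, W, C) -> (num_patches, patch * patch * C), in row-major patch order
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                 # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
patches = patchify(image, 16)                      # (196, 768)

d = 768
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], d))      # linear projection
cls_token = rng.normal(scale=0.02, size=(1, d))                   # learned, index 0
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0] + 1, d))

embedded = patches @ w_embed                                      # (196, 768)
tokens = np.concatenate([cls_token, embedded], axis=0) + pos_embed  # (197, 768)
image_feature_slot = tokens[0]   # after the encoder, this position feeds the MLP head
```

The sequence length is 196 + 1 = 197: one token per patch plus the cls token.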

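The MLP is built from linear layers and performs a transformation of the feature space: it typically expands the dimension by $\times 4$ and then shrinks it by $/ 4$, so the dimension is the same before and after the MLP. (This expand-and-shrink pattern describes the MLP inside each encoder block; the classification MLP head maps the cls feature to the class scores.)

D = 768 is the dimension of each token vector.

Ablation studies

2D positional encoding: $d / 2$ dimensions encode the horizontal coordinate and $d / 2$ encode the vertical coordinate; concatenating the two gives a positional encoding of length $d$, the same length as in the 1D case.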
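A minimal sketch of this 2D scheme, assuming two embedding tables (one per axis) that would be learned in practice and are random here. The 14 × 14 grid and d = 768 follow the numbers used elsewhere in these notes; the function and variable names are made up for the example.

```python
import numpy as np

def pos_embed_2d(row_table, col_table):
    # row_table: (grid_h, d / 2) encodes the vertical coordinate
    # col_table: (grid_w, d / 2) encodes the horizontal coordinate
    gh, gw = row_table.shape[0], col_table.shape[0]
    rows = np.repeat(row_table[:, None, :], gw, axis=1)   # (gh, gw, d / 2)
    cols = np.repeat(col_table[None, :, :], gh, axis=0)   # (gh, gw, d / 2)
    # Concatenate the two halves, then flatten the grid in row-major order.
    return np.concatenate([cols, rows], axis=-1).reshape(gh * gw, -1)

grid, d = 14, 768
rng = np.random.default_rng(0)
row_table = rng.normal(scale=0.02, size=(grid, d // 2))
col_table = rng.normal(scale=0.02, size=(grid, d // 2))
pos = pos_embed_2d(row_table, col_table)                  # (196, 768)
```

The patch at row 5, column 3 sits at index 5 * 14 + 3; its first half is the column-3 embedding and its second half is the row-5 embedding.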

transformers lack some of the inductive biases

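That is, they lack some prior assumptions.

Conclusion: a standard transformer can also handle CV tasks.

How much data does a vision transformer need to train well?

ImageNet-1k: 1.2M images; ImageNet-21k: 14M images; JFT-300M: 300M images.

When trained on small and medium-sized datasets (ImageNet-1k), the vision transformer is worse than ResNet; at 14 million images the two are roughly on par; at the scale of hundreds of millions of images, the vision transformer performs better.

Transformers: big, expensive, and hard to train.

That is the impression most people have.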

Swin transformer: hierarchical vision transformer using shifted windows

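Shifted windows: windows that move. A multi-scale ViT. (ViT itself only tackled classification.)

How does a transformer process images?

Use a feature map, or use patches.

It mainly targets downstream tasks such as object detection (OD); the main benchmarks are COCO (OD) and ADE20K (segmentation).

Swin V2: pretrained at 1536 × 1536 resolution, reaching 63.1 on COCO.

For downstream tasks (such as OD or segmentation), handling multi-scale images well is important; for example, the earlier FPN (feature pyramid network) provides different receptive fields.

In U-Net (segmentation), skip connections are used to handle multi-scale information.

OD and segmentation usually take an input resolution of $800 \times 800$ or $1000 \times 1000$.

Global self-attention or local self-attention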

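Attention mechanism

The attention mechanism is modeled on human vision: when looking at an object, we focus on particular locations and concentrate on, and learn from, the most relevant information. Mathematically it takes the form of a weighted average. Attention is very popular in machine translation, speech recognition, image captioning, and text summarization, and it is a very effective tool for sequence models.

(1) Origin

Many introductions to attention start from the sequence-to-sequence model, and for good reason. Attention was proposed (or rather, came to wide notice) in order to solve a problem in sequence-to-sequence models, more precisely in sequence-to-sequence machine translation.

The classic machine translation model uses an Encoder + Decoder structure: the encoder compresses a sentence in one language, say Chinese, into a fixed-length vector (the hidden state), and the decoder then maps that hidden state into another language, say English. The encoder and decoder are usually RNN architectures such as LSTM or GRU. This design has two shortcomings for machine translation: 1). all the information in the source sentence has to be compressed into one fixed-length vector; 2). when generating the target sentence, every token of the source sentence carries the same weight.

For short texts this model works fine, but for long texts a fixed-length vector is likely to be unable to represent the sentence effectively. This is where the attention mechanism was introduced (borrowed).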

So let's talk about the intuition first. In the past, with conventional methods like TF-IDF/CountVectorizer, we used to find features from text by doing keyword extraction. Some words are more helpful in determining the category of a text than others. But in this method we sort of lose the sequential structure of the text. With LSTMs and deep learning methods, while we are able to take care of the sequence structure, we lose the ability to give higher weightage to more important words. Can we have the best of both worlds?

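This explains the benefit of attention from the text classification perspective: it preserves the sequential structure while also emphasizing the key information.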

Without attention, the decoder's input is based on two components: the initial decoder input (often set to a special token such as EOS, acting as the start word) and the last encoder hidden state. The drawback is that some information from the very first encoder cells may be lost along the way. To handle this problem, attention weights are applied to all encoder outputs.

The original RNN structure, without an attention mechanism.

After the attention weights are calculated, we have three components: the decoder input, the decoder hidden state, and (attention weights * encoder outputs). We feed them to the decoder to produce the decoder output.

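(2) The attention mechanism

In machine translation, attention measures how strongly each word of the output sequence is related to particular words of the input sequence. It has a further benefit: alignment (matching a fragment of the source text with the corresponding fragment of the translation). The alignment does not blindly pair the first output word with the first input word; the weights are learned.

A plain Seq2Seq model takes the output $h_{i}$ of the last time step $i$ as the context and uses it as the decoder's entire input. In a model with attention, the context vector (hidden vector) carries weight information for the outputs of all time steps, that is, which parts of the source sentence are important and which are not for the word currently being generated.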

To do this we start with a weight matrix(W), a bias vector(b) and a context vector u. All of them will be learned by the optimization algorithm. On this note I would like to highlight something I like a lot about neural networks - If you don’t know some params, let the network learn them. We just have to worry about creating architectures and params to tune.

These weights are learned through backpropagation.

Attention is an interface connecting the encoder and decoder that provides the decoder with information from every encoder hidden state. With this framework, the model is able to selectively focus on valuable parts of the input sequence and hence, learn the association between them. This helps the model to cope efficiently with long input sentences.

This describes the relationship between attention and the encoder/decoder very clearly.

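Computing attention, step by step:

1). Prepare the hidden states.
2). Compute a score between each encoder hidden state and the decoder state (the dot product is just one option).
3). Normalize all the scores with a softmax.
4). Multiply each encoder hidden state by its normalized score.
5). Sum the weighted vectors to produce the context vector.
6). Feed the context vector into the decoder.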
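The steps above can be sketched in NumPy, assuming dot-product scoring and toy hidden states. The function name `attention_context` and the example values are made up for illustration; step 6 (feeding the context into the decoder) is left out because it depends on the decoder architecture.

```python
import numpy as np

def attention_context(encoder_states, decoder_state):
    # Step 1: encoder_states is (T, d), decoder_state is (d,)
    scores = encoder_states @ decoder_state         # step 2: one score per encoder state
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()               # step 3: softmax normalization
    weighted = weights[:, None] * encoder_states    # step 4: scale each hidden state
    context = weighted.sum(axis=0)                  # step 5: sum into the context vector
    return context, weights

encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
decoder_state = np.array([0.0, 2.0])
context, weights = attention_context(encoder_states, decoder_state)
```

Here the second and third encoder states score 2 against the decoder state while the first scores 0, so the first state receives the smallest weight.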

The full process described above, shown as an animation:

The same process can also be expressed in terms of query, key, and value:

In mathematical form: $a_{i} = \mathrm{softmax}(f(Q, K_{i})) = \frac{\exp(f(Q, K_{i}))}{\sum_{j} \exp(f(Q, K_{j}))}$

where the function $f$ can be chosen in several ways:

$f(Q, K_{i}) = \begin{cases} Q^{T} K_{i} & \text{dot} \\ Q^{T} W K_{i} & \text{general} \\ W [Q; K_{i}] & \text{concat} \end{cases}$

The attention output is then: $\mathrm{Attention}(Q, K, V) = \sum_{i} a_{i} V_{i}$
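The three choices of $f$ can be written as interchangeable scoring functions. This is a sketch under stated assumptions: a single query vector, random inputs, and for the concat form a weight vector `w` of length 2d so that $W[Q; K_i]$ yields a scalar (the notes do not specify the shape of $W$).

```python
import numpy as np

def score_dot(q, k):
    return q @ k                          # Q^T K_i

def score_general(q, k, w):
    return q @ w @ k                      # Q^T W K_i, with w of shape (d, d)

def score_concat(q, k, w):
    return w @ np.concatenate([q, k])     # W [Q; K_i], with w of shape (2d,)

def attention(q, keys, values, score_fn):
    scores = np.array([score_fn(q, k) for k in keys])
    a = np.exp(scores - scores.max())
    a = a / a.sum()                       # a_i = softmax(f(Q, K_i))
    return a @ values, a                  # sum_i a_i V_i

rng = np.random.default_rng(0)
d = 4
q = rng.normal(size=d)
keys = rng.normal(size=(3, d))
values = rng.normal(size=(3, 5))

out_dot, a_dot = attention(q, keys, values, score_dot)
out_gen, a_gen = attention(q, keys, values,
                           lambda q, k: score_general(q, k, np.eye(d)))
w_concat = rng.normal(size=2 * d)
out_cat, a_cat = attention(q, keys, values,
                           lambda q, k: score_concat(q, k, w_concat))
```

With `w` set to the identity matrix the general form reduces to the dot form, which is a handy sanity check.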

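Advantages: 1). each word is related to the whole context; 2). parallelism: the attention computation does not depend on the result of the previous step.

Disadvantage: it cannot capture word order on its own, since it only models the relationship between a word and the words around it; this can be handled with a position embedding.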

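(3) Common types of attention

1). Soft Attention & Hard Attention: the classic model above is also called soft attention, because the hidden state $h_{i}$ of every input word takes part in computing the final weights, which makes gradient backpropagation straightforward. Its counterpart is hard attention, which selects one particular word of the input and gives it weight 1 while all others get weight 0. This approach is rather crude, and because a one-to-one correspondence between input and output is hard to establish, training is very difficult and requires many tricks. It is therefore uncommon in NLP, but hard attention is quite useful in image processing.

2). Global Attention & Local Attention: the difference is whether all encoder hidden states ($h_{i}$) take part in the computation. If they do, it is global attention; otherwise it is local attention.

3). Self-Attention: traditional attention is computed between the hidden states of the source sentence and the target sentence, and yields the dependencies between each source word and each target word. Self-attention, by contrast, is computed on one side only (the source sentence or the target sentence) and captures the dependencies among the words within that sentence; these attention weights are then multiplied with the sentence's own representations to produce a new representation of that sentence.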

In broad terms, Attention is one component of a network’s architecture, and is in charge of managing and quantifying the interdependence:

Between the input and output elements (General Attention)

Within the input elements (Self-Attention)

Code practice

Implement in PyTorch a model with self-attention: a machine translation model or a sequence labeling model.

References

Sequence to Sequence Learning with Neural Networks
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
An animated step-by-step walkthrough

The Transformer model

The transformer adopts the scaled dot-product attention: the output is a weighted sum of the values, where the weight assigned to each value is determined by the dot-product of the query with all the keys: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{n}}\right) V$
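The formula maps almost line for line onto NumPy. A minimal sketch, assuming 2D inputs (no batch or head axis) and taking $n$ to be the key dimension; the input sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q: (n_q, n), k: (n_k, n), v: (n_k, d_v)
    n = k.shape[-1]
    scores = q @ k.T / np.sqrt(n)         # (n_q, n_k)
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ v, weights           # weighted sum of the values

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 6))
out, weights = scaled_dot_product_attention(q, k, v)
```

Dividing by $\sqrt{n}$ keeps the scores from growing with the dimension, which would otherwise push the softmax into regions with very small gradients.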

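Multi-head attention is simply a combination of several self-attention structures: each head learns features in a different representation subspace. As the figure shows, two heads may learn attention with slightly different emphases, which gives the model more capacity.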

About the encoder

The encoder generates an attention-based representation with capability to locate a specific piece of information from a potentially infinitely-large context.

A stack of N=6 identical layers.

Each layer has a multi-head self-attention layer and a simple position-wise fully connected feed-forward network.

Each sub-layer adopts a residual connection and a layer normalization. All the sub-layers output data of the same dimension $d_{\text{model}} = 512$.

About the decoder

The decoder is able to retrieve information from the encoded representation.

A stack of N = 6 identical layers

Each layer has two sub-layers of multi-head attention mechanisms and one sub-layer of fully-connected feed-forward network. (Note: the encoder output supplies two of the inputs, the keys and values, to the decoder's second attention sub-layer, the encoder-decoder attention.)

Similar to the encoder, each sub-layer adopts a residual connection and a layer normalization.

The first multi-head attention sub-layer is modified to prevent positions from attending to subsequent positions, as we don’t want to look into the future of the target sequence when predicting the current position. (Note: this mask is there to keep each position from seeing later information.)
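One common way to implement this mask is to set the scores of future positions to negative infinity before the softmax, so that their weights become exactly zero. The sketch below assumes single-head self-attention without learned projections, which is a simplification made for this example.

```python
import numpy as np

def causal_mask(n):
    # True where attention is allowed: position i may look at positions j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def masked_self_attention(x):
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(causal_mask(n), scores, -np.inf)   # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                             # exp(-inf) = 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out, weights = masked_self_attention(x)
```

The first position can only attend to itself, so its output equals its own input vector.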

In a nutshell, attention in deep learning can be broadly interpreted as a vector of importance weights: in order to predict or infer one element, such as a pixel in an image or a word in a sentence, we estimate using the attention vector how strongly it is correlated with (or “attends to” as you may have read in many papers) other elements and take the sum of their values weighted by the attention vector as the approximation of the target.

In other words, attention represents the current target word as a weighted combination of the information around it.

An encoder processes the input sequence and compresses the information into a context vector (also known as sentence embedding or “thought” vector) of a fixed length. This representation is expected to be a good summary of the meaning of the whole source sequence.

A decoder is initialized with the context vector to emit the transformed output. The early work only used the last state of the encoder network as the decoder initial state.

On understanding the encoder and decoder: the encoder produces a summary of the input sequence, and the decoder uses that output as its initial state.

While the context vector has access to the entire input sequence, we don’t need to worry about forgetting. The alignment between the source and target is learned and controlled by the context vector. The weights of these shortcut connections are customizable for each output element.

Summary by category

Here is a summary of broader categories of attention mechanisms:

Name	Definition	Citation
Self-Attention(&)	Relating different positions of the same input sequence. Theoretically the self-attention can adopt any score functions above, but just replace the target sequence with the same input sequence.	Cheng2016
Global/Soft	Attending to the entire input state space.	Xu2015
Local/Hard	Attending to the part of input state space; i.e. a patch of the input image.	Xu2015; Luong2015

Below is a summary table of several popular attention mechanisms and corresponding alignment score functions:

Name	Alignment score function	Citation
Content-base attention	$\text{score}(s_t, h_i) = \text{cosine}[s_t, h_i]$	Graves2014
Additive(*)	$\text{score}(s_t, h_i) = v_a^{\top} \tanh(W_a [s_t; h_i])$	Bahdanau2015
Location-Base	$\alpha_{t,i} = \text{softmax}(W_a s_t)$ Note: This simplifies the softmax alignment to only depend on the target position.	Luong2015
General	$\text{score}(s_t, h_i) = s_t^{\top} W_a h_i$ where $W_a$ is a trainable weight matrix in the attention layer.	Luong2015
Dot-Product	$\text{score}(s_t, h_i) = s_t^{\top} h_i$	Luong2015
Scaled Dot-Product(^)	$\text{score}(s_t, h_i) = \frac{s_t^{\top} h_i}{\sqrt{n}}$ Note: very similar to the dot-product attention except for a scaling factor; where n is the dimension of the source hidden state.	Vaswani2017

Self-Attention

Self-attention, also known as intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence. It has been shown to be very useful in machine reading, abstractive summarization, or image description generation.

Soft vs Hard Attention

In the show, attend and tell paper, attention mechanism is applied to images to generate captions. The image is first encoded by a CNN to extract features. Then a LSTM decoder consumes the convolution features to produce descriptive words one by one, where the weights are learned through attention. The visualization of the attention weights clearly demonstrates which regions of the image the model is paying attention to so as to output a certain word.

This paper first proposed the distinction between “soft” vs “hard” attention, based on whether the attention has access to the entire image or only a patch:

Soft Attention: the alignment weights are learned and placed “softly” over all patches in the source image; essentially the same type of attention as in
- Pro: the model is smooth and differentiable.
- Con: expensive when the source input is large.
Hard Attention: only selects one patch of the image to attend to at a time.
- Pro: less calculation at the inference time.
- Con: the model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train. (Luong, et al., 2015)
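The contrast between the two can be sketched as follows, assuming the image has already been encoded into a handful of patch feature vectors and that a score per patch is given. In this sketch hard attention samples a single patch from the softmax distribution; the function names and toy sizes are made up for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(scores, patches):
    # Weighted average over all patches: smooth and differentiable.
    weights = softmax(scores)
    return weights @ patches, weights

def hard_attention(scores, patches, rng):
    # Pick exactly one patch; the sampling step has no gradient.
    weights = softmax(scores)
    idx = rng.choice(len(patches), p=weights)
    return patches[idx], idx

rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 4))     # 6 image patches, 4 features each
scores = rng.normal(size=6)

soft_out, soft_weights = soft_attention(scores, patches)
hard_out, idx = hard_attention(scores, patches, rng)
```

Soft attention touches every patch on every step, which is where its cost on large inputs comes from; hard attention reads one patch but needs techniques such as reinforcement learning to train through the sampling.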

Still needed: code-level applications of attention.
