Machine_translation
This post introduces the five development stages of machine translation, followed by reading notes on machine-translation papers (continuously updated).
The Development of Machine Translation
Why is it difficult?
- Machine translation is challenging given the inherent ambiguity and flexibility of human language.
- Statistical machine translation replaces classical rule-based systems with models that learn to translate from examples.
- Neural machine translation models fit a single model rather than a pipeline of fine-tuned models and currently achieve state-of-the-art results.
(1)Rule-based Machine Translation
Classical machine translation methods often involve rules for converting text in the source language to the target language. The rules are often developed by linguists and may operate at the lexical, syntactic, or semantic level. This focus on rules gives the name to this area of study: Rule-based Machine Translation, or RBMT.
The key limitations of the classical machine translation approaches are both the ***expertise*** required to develop the rules and the vast number of rules and exceptions required.
(2)Statistical Machine Translation
Statistical machine translation, or SMT for short, is the use of statistical models that learn to translate text from a source language to a target language given a large corpus of examples.
Given a sentence T in the target language, we seek the sentence S from which the translator produced T. We know that our chance of error is minimized by choosing that sentence S that is most probable given T. Thus, we wish to choose S so as to maximize $P_r(S|T)$.
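Since $P_r(T)$ does not depend on $S$, maximizing $P_r(S|T)$ is, by Bayes' rule, equivalent to the noisy-channel decomposition that the SMT literature builds on, i.e. a translation model combined with a language model:

$$ \hat{S} = \arg\max_{S} P_r(S|T) = \arg\max_{S} P_r(T|S)\, P_r(S) $$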
The approach is data-driven, requiring only a corpus of examples with both source and target language text. This means linguists are no longer required to specify the rules of translation. Although effective, statistical machine translation methods suffered from a ***narrow focus on the phrases being translated***, losing the broader nature of the target text. The hard focus on data-driven approaches also meant that methods may have ***ignored important syntax distinctions*** known by linguists. Finally, the statistical approaches required careful tuning of each module in the translation pipeline.
(3)Neural Machine Translation
The key benefit of the approach is that a single system can be trained directly on source and target text, no longer requiring the pipeline of specialized systems used in statistical machine translation.
As such, neural machine translation systems are said to be ***end-to-end systems***, as only one model is required for the translation.
(4)Encoder-Decoder Model
Multilayer Perceptron neural network models can be used for machine translation, although the models are limited by a ***fixed-length input sequence*** where the output must be the same length.
These early models have been greatly improved upon recently through the use of recurrent neural networks organized into an encoder-decoder architecture that allow for variable length input and output sequences.
The key to the encoder-decoder architecture is the ability of the model to encode the source text into an internal fixed-length representation called the context vector. Interestingly, once encoded, different decoding systems could be used, in principle, to translate the context into different languages.
The power of this model lies in the fact that it can map sequences of different lengths to each other.
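As an illustration only (not the exact architecture of any specific paper), a minimal PyTorch sketch of this "encode into a context vector, then decode" idea could look like the following; all layer sizes and names here are assumptions:

```python
# Minimal encoder-decoder sketch (toy sizes, single-layer LSTMs).
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 64, 128  # assumed toy dimensions

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len) token ids
        _, (h, c) = self.rnn(self.embed(src))
        return h, c                          # the fixed-length context (final state)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, tgt, state):           # tgt: (batch, tgt_len), teacher forcing
        output, state = self.rnn(self.embed(tgt), state)
        return self.out(output), state       # logits: (batch, tgt_len, TGT_VOCAB)

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (2, 7))    # source length 7
tgt = torch.randint(0, TGT_VOCAB, (2, 5))    # target length 5: lengths may differ
logits, _ = decoder(tgt, encoder(src))
print(logits.shape)                          # torch.Size([2, 5, 1200])
```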
(5)Encoder-Decoders with Attention
Although effective, the Encoder-Decoder architecture has problems with long sequences of text to be translated. The problem stems from the fixed-length internal representation that must be used to decode each word in the output sequence. The solution is the use of an attention mechanism that allows the model to learn where to place attention on the input sequence as each word of the output sequence is decoded.
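To make the mechanism concrete, here is a minimal sketch of one decoding step using dot-product scoring (an assumption for simplicity; the published models use learned scoring functions): every encoder state is scored against the current decoder state, the scores are normalized into attention weights, and the context vector is their weighted sum.

```python
# One attention step: score, normalize, take the weighted sum of encoder states.
import torch
import torch.nn.functional as F

batch, src_len, hid = 2, 7, 128                  # assumed toy shapes
enc_states = torch.randn(batch, src_len, hid)    # one vector per source position
dec_state = torch.randn(batch, hid)              # current decoder hidden state

scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)  # (batch, src_len)
weights = F.softmax(scores, dim=1)               # where to "place attention"
context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)   # (batch, hid)
print(weights.shape, context.shape)              # a fresh context vector per step
```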
The encoder-decoder recurrent neural network architecture with attention is currently the state-of-the-art on some benchmark problems for machine translation. This architecture is used at the heart of the Google Neural Machine Translation system, or GNMT, used in their Google Translate service.
Although effective, neural machine translation systems still suffer from some issues, such as scaling to larger vocabularies of words and the slow speed of training the models. These are the current areas of focus for large production neural translation systems, such as the Google system.
(6)Attention VS LSTM
A limitation of the LSTM architecture is that it encodes the input sequence to a fixed length internal representation. This imposes limits on the length of input sequences that can be reasonably learned and results in worse performance for very long input sequences.
This fixed-length encoding is believed to limit the performance of these networks, especially when considering long input sequences, such as very long sentences in text translation problems.
Attention removes this bottleneck: put another way, each item in the output sequence is conditioned on selected items in the input sequence. As the original attention paper describes it:
Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previously generated target words. … it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector.
This (attention) increases the computational burden of the model, but results in a more targeted and better-performing model.
Beyond machine translation, the same idea is relevant in other areas such as image description and CNN-based models: convolutional neural networks applied to computer vision problems suffer from similar limitations, where it can be difficult to learn models on very large images.
References
- introduction-neural-machine-translation
- attention-long-short-term-memory-recurrent-neural-networks
Sequence to Sequence Learning with Neural Networks: paper notes
Research motivation: standard deep neural networks require the input and output dimensionalities to be known and fixed, whereas in sequence-to-sequence problems such as speech recognition, machine translation and question answering the sequence lengths are unknown. This fixed-length requirement is a clear limitation; for variable-length inputs and outputs, a Recurrent Neural Network (RNN) is the more natural fit.
One contribution of the paper is the network architecture: the model learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector back into a variable-length sequence. Concretely, the source sequence is first mapped to a vector by an encoder LSTM, and a second decoder LSTM then produces the translated output. This is exactly the idea behind image captioning as well (a CNN turns the input image into a vector or feature map, which is then fed into an LSTM).
There is also a small but effective trick: the LSTM performs surprisingly well on long-sentence translation, which is attributed to reversing the word order of the source sequence. Although LSTMs can handle long-range dependencies, the authors found that the LSTM learns noticeably better when the source sentences are reversed: test perplexity drops from 5.8 to 4.7 and the BLEU score rises from 25.9 to 30.6.
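A tiny preprocessing sketch of this reversal trick (naive whitespace tokenization and a made-up sentence pair, purely for illustration): only the source side is reversed, the target side is left in order.

```python
# Reverse the source tokens (not the target) before feeding the encoder.
def reverse_source(pair):
    src, tgt = pair
    return list(reversed(src.split())), tgt.split()

src_rev, tgt = reverse_source(("the cat is on the mat", "le chat est sur le tapis"))
print(src_rev)  # ['mat', 'the', 'on', 'is', 'cat', 'the']
print(tgt)      # ['le', 'chat', 'est', 'sur', 'le', 'tapis']
```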
Shortcomings: the remaining aspects are fairly ordinary, or at least appear in many other papers: for example, the LSTM can mitigate the vanishing-gradient problem but not gradient explosion, so gradient clipping is applied; the model is trained with SGD without momentum; the LSTM structure used is the one from Graves' "Generating Sequences with Recurrent Neural Networks"; and so on.
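For reference, gradient clipping together with momentum-free SGD can be sketched in PyTorch as below; the model, data and clipping threshold are placeholders, not the paper's actual configuration.

```python
# Clip the global gradient norm before the optimizer step.
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                          # stand-in for the seq2seq model
opt = torch.optim.SGD(model.parameters(), lr=0.7)  # SGD without momentum

x, y = torch.randn(4, 10), torch.randn(4, 10)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # guard against exploding gradients
opt.step()
```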
Summary: overall, the model still decodes greedily; in other words, later predictions depend strongly on earlier states, so once an early prediction goes wrong, the subsequent predictions become unreliable. This is a point worth thinking about and improving.
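To see why errors propagate, a greedy decoding loop can be sketched as follows (it reuses the hypothetical encoder/decoder interface from the earlier sketch; the BOS/EOS ids and the length cap are assumptions). Each step commits to the single most likely token and feeds it back in, so a wrong early choice conditions everything that follows.

```python
# Greedy decoding: commit to argmax at every step and feed it back.
import torch

def greedy_decode(encoder, decoder, src, bos_id=1, eos_id=2, max_len=20):
    state = encoder(src)                               # fixed-length context
    token = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
    outputs = []
    for _ in range(max_len):
        logits, state = decoder(token, state)          # one step at a time
        token = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy commitment
        outputs.append(token)
        if (token == eos_id).all():                    # stop once every sentence ends
            break
    return torch.cat(outputs, dim=1)                   # (batch, decoded_len)
```

(Beam search, which keeps several hypotheses alive instead of one, is the usual way to soften this dependence.)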
Effective Approaches to Attention-based Neural Machine Translation: paper notes
The core of this paper is attention.
Attention can be viewed as an alignment model. In traditional SMT, alignment is solved with the EM algorithm; here the alignment is implicit: the alignment model is parameterized by a feedforward neural network and trained jointly with the rest of the system, so the network learns the translation model and the alignment model at the same time.
Attention comes in two flavors, hard and soft. Roughly speaking, hard attention picks from the source sentence a single specific word to align with the $t^{th}$ target word, setting $s_{t,i}$ to 1 for that word and forcing the probability of all other words to 0. Soft attention instead assigns an alignment probability to every word in the source sentence, yielding a probability distribution; the context vector is the weighted sum under this distribution, and the whole model is smooth and differentiable everywhere.
This paper proposes a new attention mechanism, local attention: when computing the context vector, instead of looking at all source hidden states, only a subset of the hidden states is considered at each step, which makes the attention more focused and gives better results. Global attention is essentially soft attention, while the local model is a blend, or compromise, between hard and soft attention, designed mainly to reduce the cost of attention: at each step a prediction function first locates a window of relevant source positions (see the sketch below).
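A rough sketch of the local, predictive-alignment idea follows: predict a center position $p_t$ in the source, keep a window of width $2D+1$ around it, and down-weight scores away from the center with a Gaussian. The layer names, sizes, and the exact combination below are illustrative assumptions that follow the general recipe described above, not the paper's reference implementation.

```python
# Local attention: predict a window center, restrict and reweight the scores.
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, src_len, hid, D = 2, 30, 128, 5           # assumed toy shapes, half-window D
enc_states = torch.randn(batch, src_len, hid)
dec_state = torch.randn(batch, hid)

w_p = nn.Linear(hid, hid, bias=False)            # position-prediction layers
v_p = nn.Linear(hid, 1, bias=False)

p_t = src_len * torch.sigmoid(v_p(torch.tanh(w_p(dec_state))))  # predicted center, (batch, 1)
positions = torch.arange(src_len).float().unsqueeze(0)          # (1, src_len)

scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)   # content scores
gauss = torch.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))   # favor positions near p_t
in_window = ((positions - p_t).abs() <= D).float()                  # drop positions outside the window
weights = F.softmax(scores, dim=1) * gauss * in_window
weights = weights / weights.sum(dim=1, keepdim=True).clamp_min(1e-9)
context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)    # (batch, hid)
print(context.shape)
```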
Soft vs. hard and global vs. local classify attention from different angles: the former concerns the probability distribution, the latter concerns the context (how much of the source is attended to).
The paper illustrates the two variants with architecture figures: one for global attention and one for local attention (figures not reproduced here).
Summary: the different kinds of attention.
Attention comes in hard and soft modes. Roughly, hard attention finds in the source sentence a single specific word that produces the $t^{th}$ target word, setting that word to 1 and all other words to 0; this is not differentiable everywhere. Soft attention (global attention) gives every word in the source sentence an alignment probability, producing a distribution, and the resulting model is differentiable everywhere. Local attention is defined relative to global attention: instead of computing a global attention, it attends over a subset of the source, a window of relevant positions, which lowers the cost of computing attention.
BLEU: paper notes
The authors propose the BLEU (Bilingual Evaluation Understudy) metric for evaluating the quality of translated sentences. Its basic assumption is:
The closer a machine translation is to a professional human translation, the better it is.
Computing BLEU therefore requires two ingredients:
- a numerical “translation closeness” metric
- a corpus of good quality human reference translations
That is, a closeness metric and a corpus of high-quality reference translations.
The computation of BLEU consists of three steps:
For clarity, the paper's notation is used: Candidate denotes the sentence being evaluated; Reference denotes a sentence translated by a professional.
(1)n-gram
These matches are position-independent. The more the matches, the better the candidate translation is.
The n-gram component is a simple statistic that counts word co-occurrences between sentences. It is a precision: it requires the words (n-grams) of the Candidate to appear in the Reference sentences.
Drawback: word order, although important, is only weakly captured. The paper's illustration:
The cat is on the mat. There is a cat on the mat.
The two sentences above have the same meaning but different word order; n-grams do capture part of the word order once $n = 2$ or $n = 3$.
(2)Modified n-gram precision
A model sometimes produces degenerate output such as the following bad case:
Candidate: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
This translation has high (unigram) precision but is clearly a bad result. The authors' solution:
To compute this, one first counts the maximum number of times a word occurs in any single reference translation. Next, one clips the total count of each candidate word by its maximum reference count, adds these clipped counts up, and divides by the total (unclipped) number of candidate words. In other words, BLEU modifies the count: each n-gram count in the candidate is clipped to the maximum number of times that n-gram occurs in any single reference translation:
$$ Count_{clip} = \min(Count, \text{Max\_Ref\_Count}) $$
The precision score $p_n$ for this step can then be written as:
$$ p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-}gram \in C} Count_{clip}(n\text{-}gram)}{\sum_{C' \in \{Candidates\}} \sum_{n\text{-}gram' \in C'} Count(n\text{-}gram')} $$
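A small Python sketch of the modified (clipped) n-gram precision for a single candidate against several references, using naive whitespace tokenization; it reproduces the 2/7 unigram precision of the "the the the …" example above.

```python
# Modified n-gram precision: clip candidate counts by the max reference count.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    cand = ngrams(candidate.lower().split(), n)
    max_ref = Counter()
    for ref in references:                       # Max_Ref_Count per n-gram
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = {g: min(c, max_ref[g]) for g, c in cand.items()}   # Count_clip
    return sum(clipped.values()) / max(sum(cand.values()), 1)

refs = ["The cat is on the mat.", "There is a cat on the mat."]
print(modified_precision("the the the the the the the.", refs, 1))  # 2/7 ≈ 0.286
```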
(3)Sentence brevity penalty
N-gram precision penalizes spurious words in the candidate that do not appear in any of the reference translations. Additionally, modified precision is penalized if a word occurs more frequently in a candidate translation than its maximum reference count. The modified n-gram precision thus already penalizes spuriously long candidates, but it does nothing about overly short ones, as in the following bad case:
Candidate: of the
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
Candidate translations longer than their references are already penalized by the modified n-gram precision measure: there is no need to penalize them again. They consider the range of reference translation lengths in the target language. Since long candidates are already handled by the n-gram precision, this component mainly deals with short candidates that have high precision but poor quality:
$$ BP = \begin{cases} 1 & c > r \\ e^{(1 - r/c)} & c \le r \end{cases} $$
Here $c$ is the number of words in the machine translation and $r$ is the number of words in the reference translation.
The overall BLEU score can then be written as:
$$ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) $$
Or, in log form:
$$ \log \mathrm{BLEU} = \min\left(1 - \frac{r}{c},\ 0\right) + \sum_{n=1}^{N} w_n \log p_n $$
In the baseline, $N = 4$ and $w_n = \frac{1}{N}$.
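Putting the pieces together, the brevity penalty and the weighted geometric mean can be sketched as below; the $p_n$ values and lengths here are made-up numbers purely to show the arithmetic (in practice they come from the corpus-level modified-precision computation).

```python
# Combine precisions p_1..p_N with the brevity penalty into a BLEU score.
import math

def bleu(p_ns, c, r):
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    w = 1.0 / len(p_ns)                          # uniform weights w_n = 1/N
    if any(p == 0 for p in p_ns):                # log(0) is undefined
        return 0.0
    return bp * math.exp(sum(w * math.log(p) for p in p_ns))

print(bleu([0.8, 0.6, 0.45, 0.3], c=18, r=20))   # c < r, so BP < 1 here
```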
(4)Conclusion
Advantages: fast to compute, easy to understand, and already widely used. Disadvantages: very short translations can sometimes score deceptively high, and the metric does not consider sentence meaning. Summary: BLEU does not aim for, and cannot achieve, one-hundred-percent accuracy; its goal is simply a fast and reasonably good automatic evaluation. Although the metric was proposed for machine translation, it is equally applicable to other NLP models.
References
1). Sequence to Sequence Learning with Neural Networks
2). Effective Approaches to Attention-based Neural Machine Translation
3). BLEU: a Method for Automatic Evaluation of Machine Translation
Author: jijeng
Last updated: 2019-06-29