Word2vec
This post introduces Word2vec's two training modes (Skip-gram and CBOW) and two optimization strategies (Hierarchical Softmax and Negative Sampling).
Two Training Modes
(1)Context-Based: Skip-Gram Model
For example, with a context window of size 2 over the sentence “The man who passes the sentence should swing the sword”, the target word “swing” produces four training samples: (“swing”, “sentence”), (“swing”, “should”), (“swing”, “the”), and (“swing”, “sword”).
(2)Continuous Bag-of-Words (CBOW)
The Continuous Bag-of-Words (CBOW) is another similar model for learning word vectors. It predicts the target word (i.e. “swing”) from source context words (i.e., “sentence should the sword”).
Because there are multiple contextual words, we average their corresponding word vectors, constructed by the multiplication of the input vector and the matrix W.
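The averaging step can be made concrete with a short sketch. The dimensions, the matrix names `W` / `W_prime`, and the word ids below are toy assumptions for illustration, not the original implementation:

```python
import numpy as np

V, N = 11, 8                        # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))         # input (context) embedding matrix
W_prime = rng.normal(size=(N, V))   # output matrix

def cbow_forward(context_ids, target_id):
    # Average the input vectors (rows of W) of the context words.
    h = W[context_ids].mean(axis=0)
    # Score every word in the vocabulary and normalize with softmax.
    scores = h @ W_prime
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Cross-entropy loss for the true target word.
    return -np.log(probs[target_id])

# e.g. predict "swing" (hypothetical id 4) from context word ids [0, 1, 2, 3]
loss = cbow_forward([0, 1, 2, 3], 4)
```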
In the “skip-gram” mode, the alternative to “CBOW”, rather than averaging the context words, each context word is used as a pairwise training example. That is, in place of one CBOW example such as [predict ‘ate’ from average(‘The’, ‘cat’, ‘the’, ‘mouse’)], the network is presented with four skip-gram examples: [predict ‘ate’ from ‘The’], [predict ‘ate’ from ‘cat’], [predict ‘ate’ from ‘the’], [predict ‘ate’ from ‘mouse’]. (Word2vec also randomly shrinks the effective window per target word, so half the time this would yield just two examples, drawn from the nearest words.)
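To make the pairwise expansion concrete, here is a small sketch that generates (target, context) training pairs from a sentence. The helper name and window size are illustrative only, and the random per-word window shrinking mentioned above is omitted:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target_word, context_word) pairs, one per training example."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs("The cat ate the mouse".split(), window=2))
# [('The', 'cat'), ('The', 'ate'), ('cat', 'The'), ('cat', 'ate'), ...]
```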
(3)Loss Function
Both skip-gram and CBOW use the cross-entropy loss:
\begin{equation} L_{\theta}=-\sum_{i=1}^{V} y_{i} \log p\left(w_{i} | w_{I}\right)=-\log p\left(w_{O} | w_{I}\right) \end{equation}
where $y_i$ is the ground-truth label (a one-hot vector over the vocabulary) and $p$ is the probability output by the network, given by the softmax: \begin{equation} p\left(w_{O} | w_{I}\right)=\frac{\exp \left(v_{w_{O}}^{\prime \top} v_{w_{I}}\right)}{\sum_{i=1}^{V} \exp \left(v_{w_{i}}^{\prime \top} v_{w_{I}}\right)} \end{equation}
where $p\left(w_{O} | w_{I}\right)$ is the probability of the output word $w_O$ given the input word $w_I$.
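A short sketch of evaluating this softmax directly, using hypothetical toy matrices `v` / `v_prime` for the input and output vectors. Note that the denominator requires a dot product with every word in the vocabulary, which is exactly the $O(V)$ cost the optimization strategies below try to avoid:

```python
import numpy as np

V, N = 10000, 100                  # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
v = rng.normal(size=(V, N))        # input vectors  v_w
v_prime = rng.normal(size=(V, N))  # output vectors v'_w

def softmax_prob(w_O, w_I):
    # p(w_O | w_I) = exp(v'_{w_O}^T v_{w_I}) / sum_w exp(v'_w^T v_{w_I})
    scores = v_prime @ v[w_I]      # O(V) dot products for the denominator
    scores -= scores.max()         # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[w_O] / exp_scores.sum()

loss = -np.log(softmax_prob(w_O=42, w_I=7))   # cross-entropy for one pair
```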
(4)Hyperparameter: window size
One heuristic is that smaller window sizes (2-15) lead to embeddings where high similarity scores between two embeddings indicate that the words are interchangeable (notice that antonyms are often interchangeable if we’re only looking at their surrounding words, e.g. good and bad often appear in similar contexts). Larger window sizes (15-50, or even more) lead to embeddings where similarity is more indicative of relatedness of the words. Empirical takeaway: a small window size yields word vectors that capture interchangeability (so, for example, bad and good end up with similar vectors), while a large window size yields vectors whose similarity reflects topical relatedness; larger windows also cost more training time.
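For reference, the window size (along with the training mode and optimization strategy) is exposed as a parameter in libraries such as Gensim. A minimal usage sketch, assuming Gensim 4.x parameter names (`vector_size` was called `size` in 3.x) and a placeholder corpus:

```python
from gensim.models import Word2Vec

# Placeholder corpus: a list of tokenized sentences.
sentences = [["the", "man", "who", "passes", "the", "sentence",
              "should", "swing", "the", "sword"]]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context window size discussed above
    sg=1,              # 1 = skip-gram, 0 = CBOW
    hs=0,              # 1 = hierarchical softmax
    negative=5,        # number of negative samples (0 disables NEG)
    min_count=1,       # keep every word in this tiny toy corpus
)
print(model.wv.most_similar("swing"))
```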
Optimization Strategies
(1)Hierarchical Softmax
Hierarchical softmax is a computational optimization of the final softmax layer: it reduces the cost from the original $O(V)$ to $O(\log_2 V)$, where $V$ is the vocabulary size.
With the vocabulary V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, $|V| = 11$, the final softmax layer can be organized as a binary tree:
Each leaf node represents one word. Then $p(\text{unseen}) = p(\text{left}) \cdot p(\text{right}) \cdot p(\text{right}) \cdot p(\text{right})$, where each branching probability is given by $\sigma(x \cdot w + b)$, a sigmoid over that node's parameters. Because the probabilities of the left and right children sum to 1, i.e. $p(\text{left}) + p(\text{right}) = 1$, the right-child probability does not need to be computed separately once the left-child probability is known.
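A minimal sketch of composing a leaf probability along such a tree path. The node parameters and the hard-coded "LRRR" path for "unseen" are hypothetical, simply mirroring the product above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def path_probability(h, nodes, directions):
    """h: hidden vector; nodes: (w, b) parameters of each internal node on the
    path; directions: 'L'/'R' choices leading down to the leaf."""
    p = 1.0
    for (w, b), d in zip(nodes, directions):
        p_left = sigmoid(h @ w + b)                    # p(left) at this node
        p *= p_left if d == "L" else (1.0 - p_left)    # p(right) = 1 - p(left)
    return p

# p("unseen") = p(left) * p(right) * p(right) * p(right) in the example above
rng = np.random.default_rng(0)
h = rng.normal(size=8)
nodes = [(rng.normal(size=8), 0.0) for _ in range(4)]
p_unseen = path_probability(h, nodes, "LRRR")
```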
(2)Negative Sampling (NEG)
1). Simple Sampling: negatives are drawn at random according to their raw frequency in the data, so high-frequency words are sampled many times while low-frequency words are rarely selected.
2). Adjusted Sampling: the frequencies are raised to a power $c$ before normalizing, which down-weights very frequent words:
\begin{equation} p(w_i) =\frac{\mathrm{freq}(w_i)^{c}}{\sum_{j=1}^{V} \mathrm{freq}(w_j)^{c}} \end{equation}
where $c = \frac{3}{4}$ is an empirical value that generally works well in practice.
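A small sketch of drawing negatives from this adjusted distribution, using made-up word counts:

```python
import numpy as np

counts = {"the": 500, "swing": 20, "sword": 15, "unseen": 2}   # toy frequencies
words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)

c = 0.75                                      # the empirical 3/4 power
p_adjusted = freq ** c / np.sum(freq ** c)    # flattened sampling distribution

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=p_adjusted)   # draw 5 negative samples
print(dict(zip(words, p_adjusted.round(3))), negatives)
```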
Author: jijeng
Last updated: 2019-09-28