An introduction to weight initialization, focusing on the Xavier and He initialization methods.

https://machinelearningmastery.com/weight-initialization-for-deep-learning-neural-networks/

Historically, weight initialization involved using small random numbers, although over the last decade, more specific heuristics have been developed that use information, such as the type of activation function that is being used and the number of inputs to the node.

Historically, weights were initialized with small random values; more recent heuristics also take into account the type of activation function and the number of inputs to the node.

Weight initialization is used to define the initial values for the parameters in neural network models prior to training the models on a dataset.

Weight initialization sets the initial values of the network parameters before training begins.

How to implement the xavier and normalized xavier weight initialization heuristics used for nodes that use the Sigmoid or Tanh activation functions.

How to implement the he weight initialization heuristic used for nodes that use the ReLU activation function.

General rule of thumb: if the activation is sigmoid or tanh, use Xavier or normalized Xavier initialization; if the activation is ReLU, use He (Kaiming) initialization.

(1)Xavier weight initialization

The xavier initialization method is calculated as a random number with a uniform probability distribution (U) between the range -(1/sqrt(n)) and 1/sqrt(n), where n is the number of inputs to the node.

Here U denotes the uniform distribution.
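As a concrete illustration, here is a minimal NumPy sketch of this rule (the fan-in n and fan-out m below are arbitrary example values, not taken from the article):

```python
import numpy as np

# Xavier (Glorot) uniform initialization:
# draw weights from U(-1/sqrt(n), 1/sqrt(n)), where n is the number
# of inputs to the node (fan-in).
n, m = 10, 20                        # example fan-in / fan-out (assumed values)
bound = 1.0 / np.sqrt(n)             # half-width of the uniform range
weights = np.random.uniform(low=-bound, high=bound, size=(n, m))
print(weights.min(), weights.max())  # all values fall inside [-bound, bound]
```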

The probability density function of the uniform distribution is

$$ f(x)=\begin{cases}\dfrac{1}{b-a} & \mathrm{for}\ a\leq x\leq b,\\[8pt] 0 & \mathrm{for}\ x<a\ \mathrm{or}\ x>b\end{cases} $$

(2)Normalized Xavier weight initialization

The normalized xavier initialization method is calculated as a random number with a uniform probability distribution (U) between the range -(sqrt(6)/sqrt(n + m)) and sqrt(6)/sqrt(n + m), where n is the number of inputs to the node (e.g. number of nodes in the previous layer) and m is the number of outputs from the layer (e.g. number of nodes in the current layer).

The "normalized" in the name reflects the fact that the range takes both the number of input nodes and the number of output nodes into account.
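A similar NumPy sketch for the normalized variant, again with arbitrary example layer sizes:

```python
import numpy as np

# Normalized Xavier (Glorot) initialization:
# draw weights from U(-sqrt(6)/sqrt(n+m), sqrt(6)/sqrt(n+m)),
# where n is the fan-in and m is the fan-out of the layer.
n, m = 10, 20                          # example fan-in / fan-out (assumed values)
bound = np.sqrt(6.0) / np.sqrt(n + m)  # half-width of the uniform range
weights = np.random.uniform(low=-bound, high=bound, size=(n, m))
```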

(3)Weight initialization for ReLU

As such, a modified version of the approach was developed specifically for nodes and layers that use ReLU activation, popular in the hidden layers of most multilayer Perceptron and convolutional neural network models.

Background: why He initialization was proposed.

The he initialization method is calculated as a random number with a Gaussian probability distribution (G) with a mean of 0.0 and a standard deviation of sqrt(2/n), where n is the number of inputs to the node:

weight = G(0.0, sqrt(2/n))

The mathematical form of He initialization.

Gaussian distribution (probability density function):

$$ f(x)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} $$
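Putting the He rule into a minimal NumPy sketch (layer sizes are arbitrary example values):

```python
import numpy as np

# He (Kaiming) initialization for ReLU layers:
# draw weights from a Gaussian G(0, sqrt(2/n)), where n is the fan-in.
n, m = 10, 20              # example fan-in / fan-out (assumed values)
std = np.sqrt(2.0 / n)     # standard deviation sqrt(2/n)
weights = np.random.normal(loc=0.0, scale=std, size=(n, m))
print(weights.std())       # should be close to std
```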

https://towardsdatascience.com/understand-kaiming-initialization-and-implementation-detail-in-pytorch-f7aa967e9138
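For comparison, PyTorch ships these heuristics as built-in initializers; a minimal sketch (the Linear layer sizes are arbitrary example values):

```python
import torch.nn as nn

layer = nn.Linear(10, 20)  # example layer sizes (assumed values)

# Xavier / Glorot uniform initialization, typically for sigmoid or tanh layers
nn.init.xavier_uniform_(layer.weight)

# He / Kaiming normal initialization, typically for ReLU layers
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
```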

This still needs to be studied more carefully.