Differences Between L1 and L2

L1 和L2 既可以作为 loss function 也可以作为 reguarization，分别介绍了一下两者，最后介绍 overfitting 的概念。

As loss function

loss function or error function 是用来衡量真实 $y$ 和生成的 $f (x)$ 之间差距的函数。在模型训练中我们一般情况下不断训练模型使得loss function不断下降（如果task要求loss function是增大，这时候一般加上符号或者转换成 1- loss fucntion，最后实现的还是loss function下降）。好的回到L1 loss function和L2 loss function.

概括：从鲁棒性（对待异常值）的角度看，L1 是比L2 具有更好的性质，因为L2 是把误差进行了平方处理，误差回放大；从稳定性角度（数据集的一个小的移动），L2 是比L1 具有更好的性质，这个是可以从实验的角度进行总结。

L1-norm loss function is also known as least absolute deviations(LAD), least absolute errors. It is basically minimizing the sum of the absolute differences between the target value Y and the estimated values (f(x)).

$S = \sum_{i = 1}^{n} | y_{i} - f (x_{i}) |$

L2-norm loss function is also known as least squares error(LSE). It is basically minimizing the sum of the square of the differences between the target value Y and the estimated values (f(x)).

$S = \sum_{i = 1}^{n} {(y_{i} - f (x_{i}))}^{2}$

比较L1-norm 和L2-norm在前两个评价指标中的表现：

对于是否具有唯一解，可以从这个角度分析：L1是可以有多种解，而L2只有一种解。 (曼哈顿距离)

使用范围：

L1 适合在稀疏数据上使用，具有特征选择功能，得到的稀疏解；L2 适合在稠密数据上使用，得到的是唯一解，没有特征选择的功能。

As regularization

从XGBoost调参指南中我们知道objective function = loss funcion + regularization. 而我们大多数情况下提及的都是loss function,常常忽略了regularization 的作用。

The regularization term controls the complexity of the model, which helps us to avoid overfitting. 对于模型训练，一开始的想法是尽量的overfitting, 因为就现在不成熟的经验而言，对于overfitting这个问题有很多处理方法，比如卷积深度神经网络中的dropout, LightGBM中的early stop 和随机采样的思想。这些方法都是可以缓解overfitting，所以可以出现overfitting。相反，如果你的模型是underfitting，那么你就微显尴尬了。好，收回到L1 and L2。

总结： L1 在系数weights 中使用，L2 在非稀疏的情况下使用更好。可以从计算效率、是否稀疏输出和特征选择进行分析。

	计算效率	稀疏结果	特征选择
L1	在稀疏解上效率比较高，在非稀疏解上效率比较低	产生稀疏解	具有特征选择的功能
L2	在非稀疏解上效率高	不产生稀疏解	没有特征选择的功能

首先理解为什么要正则化？

如果网络足够的强大，数据量不是那么充足，那么网络完全是可以通过“记住” 这些样本，然后得到很高的 training acc，但是这个 test error 会比较大，这个时候就出现了过拟合。正则化就类似减轻了数据的复杂程度。关于过拟合是可以从两个维度进行解决：网络和数据。

常常使用的 dropout 就是减弱网络复杂性的一种手段，随机减少了某些网络中的节点，那么网络就没有那么强的“记忆”功能，但是最后的loss（要求）还是不变的，所以只能去寻找数据更加“简单普遍”的规律；正则化是对数据和weights 两种类型的，前者是对于数据的操作，然后weights 更加倾向于对于网络结构的操作，使得系数更加“光滑”。

为什么L1 相比于L2 产生了更加稀疏的解？

（这里从数据分布的角度进行解读，L2 的数据来源是高斯分布；L1 可以看成是来自laplace 分布）

这里从数学角度和空间角度进行了解释。但是从更加 tuition 的角度去理解，L1 没有一个连续的导数，能产生权值为0，所以类似剔除了某些特征，产生了稀疏的权值，而L2 是具有比较连续的导数，产生了比较平滑的权值。

更加数学化的理解：

$f (x) = \frac{1}{\sqrt{2 π} σ} \exp (- \frac{(x - μ)^{2}}{2 σ^{2}})$ 如果把高斯密度函数取对数，那么就只会剩下一个平方项，这个就是L2正则项的来源，所以数据是比较稠密的。

同样，如果是数据比较稀疏，不妨假设来自 laplace 分布，下面是

其中的 $u$ 表示位置参数， $b$ 表示尺度参数, 当 $b$ 越大的时候，图像越低矮，数据越不集中，越类似均匀分布，所以有一种稀疏的感觉。这两个都是可以对标正太分布的。可以看到图像的尾部都是比较平滑，然后大多数是接近于0的。

公式： $f (x | μ, b) = \frac{1}{2 b} \exp (- \frac{| x - μ |}{b})$ 同样取对数，那么得到是L1 正则项。

所以L2 得到的是一个比较平滑的weights 适合处理稠密的向量，而L1 得到是一个稀疏的结果，因为有很大程度的被置为0.

先上公式 L1 regularization on least squares: $w^{*} = \underset{w}{\arg min} \sum_{j} {(t (x_{j}) - \sum_{i} w_{i} h_{i} (x_{j}))}^{2}$

L2 regularization on least squares:

The difference between their properties can be promptly summarized as follows:

对于第一点computational efficient的理解：平方比绝对值更容易计算，平方可以求导直接求最值，但是绝对值就无法求导。并且L1 regularization在 non-sparse cases中是 computational inefficient，但是在 sparse(0比较多) cases中是有相应的稀疏算法来进行优化的，所以是computational efficient. 对于第二点是否具有sparse solution可以从几何意义的角度解读：

The green line(L2-norm) is the unique shortest path, while the red, blue yellow(L1-norm) are all same length for the same route.

Built-in feature selection is frequently mentioned as a useful property of the L1-norm, which the L2-norm does not. This is actually a result of the L1-norm, which tends to produces sparse coefficients (explained below). Suppose the model have 100 coefficients but only 10 of them have non-zero coefficients, this is effectively saying that “the other 90 predictors are useless in predicting the target values”. L2-norm produces non-sparse coefficients, so does not have this property.

所以表格中第三点也是顺理成章的了。至此，我们区分了L1-norm vs L2-norm loss function 和L1-regularization vs L2-regularization，下面说一下 overfitting的东西。

overfitting

现在，我们的训练优化算法是一个由两项内容组成的函数：一个是损失项，用于衡量模型与数据的拟合度，另一个是正则化项，用于衡量模型复杂度。

对于过拟合有两种解读方式，一种是模型是复杂，然后是去拟合了所有的数据，没有了泛化性能；一种是从数据角度，模型去拟合了noise ，这些random的数据。然后从loss function的角度去优化的话，就是降低模型的复杂度，正则项就是降低模型复杂度的一种手段。所以从这个角度，L2 是降低模型复杂度的一种手段，也是一种减轻overfit的手段，这两者是相辅相成的。

We become “biased towards simpler models, on the basis that they are capturing something more fundamental, rather than some artifact of the specific data set”

模型复杂程度可以从两个维度考虑：一个是参数（weights）比较多，一个是参数比较大。L1 是和L2 正好可以从上述两方面 handle 问题.

With regularization: minimize (loss (model). + complexity (model))

Two main reasons that cause a model to be complex are:

total number of features (handled by L1 regularization)

The weights of features (handled by L2 regularization)

Another way of thinking about this is in the context of using gradient descent to minimize the loss function. We can follow the gradient of the loss function to the point where loss is minimized. Regularization then adds a gradient to the gradient of the unregularized loss. L1 regularization adds a fixed gradient to the loss at every value other than 0, while the gradient added by L2 regularization decreases as we approach 0. Therefore, at values of w that are very close to 0, gradient descent with L1 regularization continues to push w towards 0, while gradient descent on L2 weakens the closer you are to 0.

从 gradient descent 角度考虑 loss function。

L1 encourages weights to 0.0 if possible, resulting in more sparse weights (weights with more 0.0 values). L2 offers more nuance, both penalizing larger weights more severely, but resulting in less sparse weights. The use of L2 in linear and logistic regression is often referred to as Ridge Regression. This is useful to know when trying to develop an intuition for the penalty or examples of its usage.

Weight decay 其实就是我们常说的 L1/ L2 regularization，但为什么加入了 regularizer 就会造成 weight decay？

以 L2 regularization 为例 $w_{i}^{t + 1} ⟵ w_{i}^{t} - η \cdot \frac{\partial L^{'}}{\partial w_{i}}, where L^{'} (W) = L (W) + λ \cdot \sum_{n = 1}^{N} {(w_{n}^{t})}^{2}$ 推导出： $w_{i}^{t + 1} ⟵ w_{i}^{t} - η \cdot (\frac{\partial L}{\partial w_{i}} + 2 λ w_{i}^{t})$

$w_{i}^{t + 1} ⟵ (1 - 2 η λ) w_{i}^{t} - η \cdot \frac{\partial L}{\partial w_{i}}$

对比第一个公式和最后一个公式，使用 L2 正则时候，每次必然会将参数先乘以 $(1 - 2 η λ)$ 这一项，这会导致参数向着 0 靠近，这就是 weight decay 的本意。

L2 的缺点

L2 is not robust to outliers as square terms blows up the error differences of the outliers and the regularization term tries to fix it by penalizing the weights.

weight decay 的缺点

weight decay 在实际中并不常用，原因是他的整体效果并没有办法和类似 early-stop 这样的 regularization 优秀。

L0 范数

对于范数的一般定义, p-norrm 如下：

$$ |x|{p} :=\left(\sum{i=1}^{n}\left|x_{i}\right|^{p}\right)^{\frac{1}{p}} $$ 虽然L0严格说不属于范数，我们可以采用上述等式给出l0-norm得定义：

$$ |x|{0} :=\sqrt[0]{\sum{i=0}^{n} x_{i}^{0}} $$

0的指数和平方根严格意义上是受限条件下才成立的。因此在实际应用中，多数人给出下面的替代定义： $$ |x|{0}=#(i) \text { with } x{i} \neq 0 $$ 其表示向量中所有非零元素的个数。正是L0范数的这个属性，使得其非常适合机器学习中稀疏编码，特征选择的应用。通过最小化L0范数，来寻找最少最优的稀疏特征项。但不幸的是，L0范数的最小化问题在实际应用中是NP难问题。因此很多情况下，L0优化问题就会被relaxe为更高维度的范数问题，如L1范数，L2范数最小化问题。

从最优化问题解的平滑性来看，L1范数的最优解相对于L2范数要少，但其往往是最优解，而L2的解很多，但更多的倾向于某种局部最优解。

L0,L1,L2范数及其应用

复习总结

作为loss function L1和L2 的区别： (1) 鲁棒性（对待异常值）角度，L1 是比L2 具有更好的性质，因为后者把误差进行了平方放大。(2 ) L2 是具有唯一解L1 是有多种解（3） L1适合在稀疏矩阵上使用，具有特征选择的功能（因为是有0 存在）
关于正则项L1 是如何求导？判断导数是否为0 有两种方式，一种是令导数为0；另一种是判断一阶导数在该点左右两个导数的符号，如果符号相反，那么该点导数为0. 对于L1，就是使用这种方式进行求导。这个链接解释从导数这个角度解释了为什么L1 是容易产生稀疏解，是因为在0 点产生了极值点。对于L1 和L2 的一种解读方式，L1是来自拉普拉斯分布，L2 是来自正太分布。
过拟合问题可以从数据和模型两方面进行考虑。(1 )数据角度，使用更多的数据集 (2 ) 模型分成节点和连线(weights) 两部分。dropout 随机减少了网络中的节点，减少网络复杂程度；正则化可有作用于weights，可以作用于训练数据集，减少了noise，都是使得系数更加”光滑“。

文章目录

As loss function

As regularization

overfitting

L0 范数

复习总结