Machine Learning Series (10): $L1$ and $L2$ Regularization (Part 2)
- Video reference: "L1 and L2 regularization" intuition, part 2: why is it also called weight decay, and what exactly decays?
- Zhihu reference: L1和L2正则化详解 (based on §7.1 "Parameter Norm Penalties" of the *Deep Learning* book)
- Blog references: 花书教我明白伤痛—L1正则; 花书让我明白伤痛—参数范数惩罚
Weight Decay
-
Before regularization
Loss function: $J(W)$ (the bias $b$ is omitted, since it is not penalized)
Weight update: $W=W-\eta\cdot\nabla_{W}J(W)$
-
After $L2$ regularization
Loss function: $\hat{J}(W)=J(W)+\lambda\Vert{W}\Vert_{2}^{2}=J(W)+\dfrac{\alpha}{2}W^{T}W$, where $\lambda=\dfrac{\alpha}{2}$
Weight update: $\begin{aligned} W &= W-\eta\cdot\nabla_{W}\hat{J}(W) \newline &= W-\eta\cdot\nabla_{W}J(W)-\eta\cdot\alpha\cdot W \newline &= (1-\eta\cdot\alpha)W-\eta\cdot\nabla_{W}J(W) \end{aligned}$
-
When $0\lt\eta\cdot\alpha\lt1$, every update first scales the weights $W$ by the factor $(1-\eta\cdot\alpha)\lt1$ before taking the ordinary gradient step, so the weights shrink a little at every iteration; hence the name weight decay. The sketch below checks this equivalence numerically.
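A minimal numpy sketch of the identity above; the gradient stand-in `g`, the learning rate, and the $L2$ coefficient are all assumed values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=5)      # current weights
g = rng.normal(size=5)      # stand-in for grad_W J(W) at this point
eta, alpha = 0.1, 0.5       # learning rate and L2 coefficient (assumed)

# One step on the regularized loss: grad of J(W) + (alpha/2) W^T W is g + alpha * W
step_regularized = W - eta * (g + alpha * W)

# The same step written as "decay the weights, then take the ordinary gradient step"
step_decay = (1 - eta * alpha) * W - eta * g

assert np.allclose(step_regularized, step_decay)
```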
Shift of the Optimum
This section quantifies how the minimizer moves when the penalty is added: we express the regularized minimizer in terms of the unregularized one and compare the two.
- $W^{\ast}$ denotes the minimizer of the unregularized loss, and $\hat{W}$ the minimizer of the regularized loss.
Taylor Expansion
-
Original loss: $\begin{aligned} J(W) &\approx J(W^{\ast})+\nabla_{W}J(W^{\ast})^{T}(W-W^{\ast})+\frac{1}{2}(W-W^{\ast})^{T}H(W-W^{\ast}) \newline &= J(W^{\ast})+\frac{1}{2}(W-W^{\ast})^{T}H(W-W^{\ast}) \end{aligned}$
where $H$ is the Hessian matrix of $J$ (its second derivatives) evaluated at $W^{\ast}$; because $W^{\ast}$ is a minimizer, $\nabla_{W}J(W^{\ast})=0$ and the first-order term vanishes.
Differentiating the quadratic approximation then gives $\nabla_{W}J(W)=H(W-W^{\ast})$.
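A small numeric check of this gradient formula; the positive definite Hessian `H` and the minimizer `W_star` are randomly generated stand-ins, and for a purely quadratic loss the approximation is exact:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
H = A @ A.T + n * np.eye(n)      # symmetric positive definite Hessian (assumed)
W_star = rng.normal(size=n)      # minimizer of the quadratic (assumed)

def J(W):
    d = W - W_star
    return 0.5 * d @ H @ d       # J(W*) taken as 0 for simplicity

# Central finite differences should match the analytic gradient H (W - W*)
W = rng.normal(size=n)
eps = 1e-6
num_grad = np.array([(J(W + eps * e) - J(W - eps * e)) / (2 * eps)
                     for e in np.eye(n)])
assert np.allclose(num_grad, H @ (W - W_star), atol=1e-5)
```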
$L2$ Regularization
-
Loss function: $\hat{J}(W)=J(W^{\ast})+\frac{1}{2}(W-W^{\ast})^{T}H(W-W^{\ast})+\dfrac{\alpha}{2}W^{T}W$
so $\nabla_{W}\hat{J}(W)=H(W-W^{\ast})+\alpha\cdot W$. Substituting the regularized minimizer $\hat{W}$
gives $\begin{aligned} \nabla_{W}\hat{J}(\hat{W})&=H(\hat{W}-W^{\ast})+\alpha\cdot\hat{W}=0 \newline &\Rightarrow H\hat{W}-HW^{\ast}+\alpha\hat{W}=0 \newline &\Rightarrow (H+\alpha I)\hat{W}=HW^{\ast} \newline &\Rightarrow \hat{W}=(H+\alpha I)^{-1}HW^{\ast} \end{aligned}$
Because $H$ is real and symmetric, it has an eigendecomposition with an orthogonal eigenvector matrix: $\begin{cases} H=Q\Lambda Q^{T} \newline Q^{T}=Q^{-1},\quad Q^{T}Q=QQ^{T}=I \end{cases}$
Continuing the derivation: $\begin{aligned} \hat{W}&=(Q\Lambda Q^{T}+\alpha I)^{-1}Q\Lambda Q^{T}W^{\ast} \newline &= (Q\Lambda Q^{T}+\alpha QQ^{T})^{-1}Q\Lambda Q^{T}W^{\ast} \newline &= (Q(\Lambda+\alpha I)Q^{T})^{-1}Q\Lambda Q^{T}W^{\ast} \newline &= Q(\Lambda+\alpha I)^{-1}Q^{T}Q\Lambda Q^{T}W^{\ast} \newline &= Q(\Lambda+\alpha I)^{-1}\Lambda Q^{T}W^{\ast} \end{aligned}$
Componentwise in the eigenbasis of $H$ (i.e. for the coordinates of $Q^{T}W^{\ast}$), this reads $\hat{W}_{i}=\dfrac{\lambda_{i}}{\lambda_{i}+\alpha}\cdot W_{i}^{\ast}$
-
When $\alpha=0$, $\hat{W}_{i}=W_{i}^{\ast}$; when $\alpha>0$, each component is scaled by $\dfrac{\lambda_{i}}{\lambda_{i}+\alpha}<1$, so directions with small eigenvalues $\lambda_{i}$ (where the loss is flat) are shrunk strongly, while directions with large $\lambda_{i}$ are left nearly unchanged. Both routes to $\hat{W}$ are checked numerically below.
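A numpy sketch comparing the direct solve $(H+\alpha I)\hat{W}=HW^{\ast}$ with the eigendecomposition route; `H`, `W_star`, and `alpha` are assumed values:

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha = 4, 0.3
A = rng.normal(size=(n, n))
H = A @ A.T + np.eye(n)          # symmetric positive definite Hessian (assumed)
W_star = rng.normal(size=n)      # unregularized minimizer (assumed)

# Direct solution of (H + alpha I) W_hat = H W_star
W_hat = np.linalg.solve(H + alpha * np.eye(n), H @ W_star)

# Same solution via H = Q Lambda Q^T: scale each eigen-coordinate
# of W_star by lambda_i / (lambda_i + alpha), then rotate back
lam, Q = np.linalg.eigh(H)
W_hat_eig = Q @ (lam / (lam + alpha) * (Q.T @ W_star))

assert np.allclose(W_hat, W_hat_eig)
```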
$L1$ Regularization
-
Because the $L^{1}$ penalty admits no clean, direct algebraic solution for a fully general Hessian, we simplify further by assuming the Hessian is diagonal, $H=\mathrm{diag}([H_{1,1},\dots,H_{n,n}])$, with every $H_{i,i}\gt 0$. This assumption holds if the data of the linear regression problem has been preprocessed to remove correlations between the input features, e.g. with PCA, as the sketch below illustrates.
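A brief sketch of that claim using synthetic correlated data (all values assumed): for least squares the Hessian is $X^{T}X$, and rotating the inputs onto their principal components makes it diagonal.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # correlated features
Xc = X - X.mean(axis=0)

# Hessian of the least-squares loss 0.5 * ||X w - y||^2 is X^T X: not diagonal here
H = Xc.T @ Xc

# Rotate the inputs onto their principal components; the new Hessian is diagonal
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T
H_pca = Z.T @ Z
assert np.allclose(H_pca, np.diag(np.diag(H_pca)), atol=1e-6)
```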
-
Loss function: writing out the sum with the diagonal Hessian gives
$\begin{aligned} \hat{J}(W)&=J(W^{\ast})+\frac{1}{2}(W-W^{\ast})^{T}H(W-W^{\ast})+\alpha\Vert W\Vert_{1} \newline &=J(W^{\ast})+\sum_{i=1}^{n}\left(\frac{1}{2}H_{i,i}\cdot(W_{i}-W_{i}^{\ast})^{2}+\alpha\cdot\vert W_{i}\vert\right) \end{aligned}$
Define $f(W_{i})=\frac{1}{2}H_{i,i}\cdot(W_{i}-W_{i}^{\ast})^{2}+\alpha\cdot\vert W_{i}\vert$,
so the loss becomes $\hat{J}(W)=J(W^{\ast})+\sum_{i=1}^{n}f(W_{i})$ and each component can be minimized independently.
Substituting the minimizer $\hat{W}$ gives $\nabla_{W}\hat{J}(\hat{W})=0\Rightarrow\nabla_{W_{i}}f(\hat{W}_{i})=0$
and therefore $\begin{aligned} \nabla_{W_{i}}f(\hat{W}_{i})&=H_{i,i}\cdot(\hat{W}_{i}-W_{i}^{\ast})+\alpha\cdot\mathrm{sign}(\hat{W}_{i})=0 \newline &\Rightarrow \hat{W}_{i}=W_{i}^{\ast}-\dfrac{\alpha\cdot\mathrm{sign}(\hat{W}_{i})}{H_{i,i}} \end{aligned}$
-
If $\hat{W}_{i}>0$: $\hat{W}_{i}=W_{i}^{\ast}-\dfrac{\alpha}{H_{i,i}}>0 \Rightarrow W_{i}^{\ast}>\dfrac{\alpha}{H_{i,i}}$; so when $W_{i}^{\ast}>\dfrac{\alpha}{H_{i,i}}$, we get $\hat{W}_{i}=W_{i}^{\ast}-\dfrac{\alpha}{H_{i,i}}$;
-
If $\hat{W}_{i}<0$: $\hat{W}_{i}=W_{i}^{\ast}+\dfrac{\alpha}{H_{i,i}}<0 \Rightarrow W_{i}^{\ast}<-\dfrac{\alpha}{H_{i,i}}$; so when $W_{i}^{\ast}<-\dfrac{\alpha}{H_{i,i}}$, we get $\hat{W}_{i}=W_{i}^{\ast}+\dfrac{\alpha}{H_{i,i}}$;
-
If $\hat{W}_{i}=0$: taking $\mathrm{sign}(0)=0$ in the stationarity condition gives $\hat{W}_{i}=W_{i}^{\ast}$, so when $W_{i}^{\ast}=0$, $\hat{W}_{i}=0$;
-
More generally, when $-\dfrac{\alpha}{H_{i,i}}<W_{i}^{\ast}<\dfrac{\alpha}{H_{i,i}}$, we still have $\hat{W}_{i}=0$:
since $\nabla_{W_{i}}f(W_{i})=H_{i,i}\cdot(W_{i}-W_{i}^{\ast})+\alpha\cdot\mathrm{sign}(W_{i})$,
- for $W_{i}>0$: $\nabla_{W_{i}}f(W_{i})=H_{i,i}\cdot W_{i}-H_{i,i}\cdot W_{i}^{\ast}+\alpha>0$, because $H_{i,i}W_{i}>0$ and $\alpha-H_{i,i}W_{i}^{\ast}>0$;
- for $W_{i}<0$: $\nabla_{W_{i}}f(W_{i})=H_{i,i}\cdot W_{i}-H_{i,i}\cdot W_{i}^{\ast}-\alpha<0$, by the symmetric argument;
so $f$ decreases on $W_{i}<0$ and increases on $W_{i}>0$, and its minimum is attained at $W_{i}=0$, i.e. $\hat{W}_{i}=0$;
Putting the cases together:
$\hat{W}_{i}=\begin{cases} W_{i}^{\ast}-\dfrac{\alpha}{H_{i,i}}, & \text{if } W_{i}^{\ast}>\dfrac{\alpha}{H_{i,i}} \newline W_{i}^{\ast}+\dfrac{\alpha}{H_{i,i}}, & \text{if } W_{i}^{\ast}<-\dfrac{\alpha}{H_{i,i}} \newline 0, & \text{if } -\dfrac{\alpha}{H_{i,i}}\le W_{i}^{\ast}\le\dfrac{\alpha}{H_{i,i}} \end{cases} \;\Rightarrow\; \hat{W}_{i}=\mathrm{sign}(W_{i}^{\ast})\cdot\max\left\{\vert W_{i}^{\ast}\vert-\dfrac{\alpha}{H_{i,i}},\,0\right\}$
This is the soft-thresholding operator: it sets small components exactly to zero, which is why $L1$ regularization produces sparse solutions. A numerical check follows.
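A quick sketch verifying the soft-thresholding formula against direct numerical minimization of $f(W_{i})$; the sampled $W_{i}^{\ast}$, $H_{i,i}$, and $\alpha$ values are assumed for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
alpha = 0.5                      # L1 coefficient (assumed)

def soft_threshold(w_star, h):
    # Closed-form L1 solution: sign(W*) * max(|W*| - alpha / H_ii, 0)
    return np.sign(w_star) * max(abs(w_star) - alpha / h, 0.0)

for _ in range(5):
    w_star = rng.normal()        # W_i^* (assumed)
    h = rng.uniform(0.5, 2.0)    # H_ii > 0 (assumed)
    f = lambda w: 0.5 * h * (w - w_star) ** 2 + alpha * abs(w)
    w_num = minimize_scalar(f, bounds=(-10, 10), method="bounded").x
    assert np.isclose(soft_threshold(w_star, h), w_num, atol=1e-4)
```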
-