Machine Learning Notes on the EM Algorithm (Part 2): Deriving the EM Update Formula

Author: IT小白 | 2024-04-02 11:29:37

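Introduction

The previous section introduced latent variables and the EM algorithm. Taking the EM update formula as given, it also showed that as the iterations proceed, each new set of parameters $\theta^{(t+1)}$ has a likelihood at least as high as the earlier iterates $\theta^{(t)},\theta^{(t-1)},\cdots$, so the procedure ends up at least at a local optimum (more precisely, a stationary point of the likelihood). This section derives the EM update formula itself.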

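Review: the EM update formula

In essence, the EM algorithm finds the optimal parameters of a probabilistic model $P(\mathcal X \mid \theta)$ that involves a latent variable $\mathcal Z$.

A latent variable is an artificially defined variable. The reason for defining it is that, from the observed samples alone, it is hard to see the structure of the distribution $P(\mathcal X \mid\theta)$; the latent variable is introduced to help fit the model.

The underlying logic of the EM algorithm is still maximum likelihood estimation:
$\hat \theta = \mathop{\arg\max}\limits_{\theta} \log P(\mathcal X \mid \theta)$
The latent variable is a device defined to help fit the model, not something that is actually observed. It is therefore only introduced during the solution process, rather than written directly into the objective as in $\mathop{\arg\max}\limits_{\theta} \log P(\mathcal X,\mathcal Z \mid \theta)$.
$\mathcal X$ is the observed data: the real information obtained from the actual sample set;
$\mathcal Z$ is the unobserved data, i.e. the latent variable: it can be viewed as the structure hidden inside the sample set;
$(\mathcal X,\mathcal Z)$ is the complete data;
$\theta$ is the parameter of the probabilistic model $P(\mathcal X \mid \theta)$.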

The EM update formula is:
$\theta^{(t+1)} = \mathop{\arg\max}\limits_{\theta}\int_{\mathcal Z } \log P(\mathcal X,\mathcal Z\mid \theta) P(\mathcal Z \mid \mathcal X ,\theta^{(t)})d\mathcal Z$

This formula describes an iterative process. Each iteration consists of two steps, which alternate throughout the run:

E-step (Expectation step): treat $\int_{\mathcal Z } \log P(\mathcal X,\mathcal Z\mid \theta) P(\mathcal Z \mid \mathcal X ,\theta^{(t)})d\mathcal Z$ as the expectation of $\log P(\mathcal X,\mathcal Z\mid \theta)$ under the distribution $P(\mathcal Z \mid \mathcal X ,\theta^{(t)})$, that is:
$\mathbb E_{\mathcal Z \mid \mathcal X,\theta^{(t)}} [\log P(\mathcal X,\mathcal Z \mid \theta)]$
M-step (Maximization step): choose the parameters $\theta^{(t+1)}$ that maximize the expectation from the E-step:
$\theta^{(t+1)} = \mathop{\arg\max}\limits_{\theta} \left\{\mathbb E_{\mathcal Z \mid \mathcal X,\theta^{(t)}} [\log P(\mathcal X,\mathcal Z \mid \theta)]\right\}$

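Taking the EM update formula as given, the previous section also verified, from the maximum likelihood point of view, that the likelihood under $\theta^{(t+1)}$ is never worse than the likelihood under $\theta^{(t)}$. This is what justifies the EM formula: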
$\log P(\mathcal X \mid \theta^{(t+1)}) \geq \log P(\mathcal X \mid \theta^{(t)})$
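To make the two steps and the monotonicity claim concrete, here is a minimal sketch on a toy problem that is not part of the original notes: a mixture of two coins. The data `SESSIONS`, the fixed mixing weight `PRIOR_A`, and all function names are hypothetical choices for illustration. The closed-form M-step is specific to this model, not a general recipe.

```python
import math

# Hypothetical toy data: each entry is (heads, flips) from one session with an
# unknown coin (A or B). Which coin was used is the latent variable Z.
SESSIONS = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]
PRIOR_A = 0.5  # assumed fixed mixing weight P(Z = A)


def binom_pmf(heads, flips, p):
    return math.comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)


def log_likelihood(theta):
    """log P(X | theta), with Z summed out."""
    p_a, p_b = theta
    return sum(
        math.log(PRIOR_A * binom_pmf(h, n, p_a) + (1 - PRIOR_A) * binom_pmf(h, n, p_b))
        for h, n in SESSIONS
    )


def e_step(theta):
    """Posterior P(Z = A | x, theta^(t)) for every session."""
    p_a, p_b = theta
    resp = []
    for h, n in SESSIONS:
        joint_a = PRIOR_A * binom_pmf(h, n, p_a)
        joint_b = (1 - PRIOR_A) * binom_pmf(h, n, p_b)
        resp.append(joint_a / (joint_a + joint_b))
    return resp


def m_step(resp):
    """Closed-form argmax of E_{Z|X,theta^(t)}[log P(X, Z | theta)] for this model."""
    heads_a = sum(r * h for r, (h, n) in zip(resp, SESSIONS))
    flips_a = sum(r * n for r, (h, n) in zip(resp, SESSIONS))
    heads_b = sum((1 - r) * h for r, (h, n) in zip(resp, SESSIONS))
    flips_b = sum((1 - r) * n for r, (h, n) in zip(resp, SESSIONS))
    return heads_a / flips_a, heads_b / flips_b


def run_em(theta, iterations=10):
    """Alternate E-step and M-step, recording log P(X | theta) after each update."""
    history = [log_likelihood(theta)]
    for _ in range(iterations):
        theta = m_step(e_step(theta))
        history.append(log_likelihood(theta))
    return theta, history


theta_hat, history = run_em((0.6, 0.5))
print(theta_hat)
print(history)  # each entry should be no smaller than the one before it
```

Printing `history` is a quick way to see the inequality above at work: up to floating-point noise, the recorded log-likelihoods should never decrease.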

What remains is to derive the EM update formula itself.

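Derivation

Again start from maximum likelihood estimation and introduce the latent variable $\mathcal Z$. By the product rule $P(\mathcal X,\mathcal Z \mid \theta) = P(\mathcal Z \mid \mathcal X,\theta) P(\mathcal X \mid \theta)$, the log-likelihood of the model can be written as: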
$\log P(\mathcal X \mid \theta) = \log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{P(\mathcal Z \mid \mathcal X,\theta)} = \log P(\mathcal X,\mathcal Z \mid \theta) - \log P(\mathcal Z \mid \mathcal X,\theta)$

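Here comes a trick: introduce a probability distribution $\mathcal Q(\mathcal Z)$ over the latent variable $\mathcal Z$, then subtract and add back its logarithm $\log \mathcal Q(\mathcal Z)$. This gives: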
$\begin{aligned} \log P(\mathcal X \mid \theta) & = \log P(\mathcal X,\mathcal Z \mid \theta) - \log \mathcal Q(\mathcal Z) - \left[\log P(\mathcal Z \mid \mathcal X,\theta) - \log \mathcal Q(\mathcal Z)\right] \\ & = \log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)} - \log \frac{P(\mathcal Z \mid \mathcal X,\theta)}{\mathcal Q(\mathcal Z)} \end{aligned}$
Take the expectation of both sides of this equation under the distribution $\mathcal Q(\mathcal Z)$.

Left-hand side: the expectation of $\log P(\mathcal X \mid \theta)$ under $\mathcal Q(\mathcal Z)$ is
$\int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log P(\mathcal X \mid \theta) d\mathcal Z$
Since $\log P(\mathcal X \mid \theta)$ does not involve $\mathcal Z$, it is a constant with respect to the integral and can be pulled out:
$\log P(\mathcal X \mid \theta) \int_{\mathcal Z} \mathcal Q(\mathcal Z) d\mathcal Z$
A probability density integrates to one, $\int_{\mathcal Z}\mathcal Q(\mathcal Z) d\mathcal Z = 1$, so the expectation of $\log P(\mathcal X \mid \theta)$ under $\mathcal Q(\mathcal Z)$ is $\log P(\mathcal X \mid \theta)$ itself:
$\log P(\mathcal X \mid \theta) \int_{\mathcal Z} \mathcal Q(\mathcal Z) d\mathcal Z = \log P(\mathcal X \mid \theta) \cdot 1 = \log P(\mathcal X \mid \theta)$
Right-hand side: the expectation of $\log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)} - \log \frac{P(\mathcal Z \mid \mathcal X,\theta)}{\mathcal Q(\mathcal Z)}$ under $\mathcal Q(\mathcal Z)$ is
$\int_{\mathcal Z} \mathcal Q(\mathcal Z) \left[\log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)} - \log \frac{P(\mathcal Z \mid \mathcal X,\theta)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z$
Expanding this expression splits it into two parts:
$\int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)} d\mathcal Z + \left[- \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \frac{P(\mathcal Z \mid \mathcal X,\theta)}{\mathcal Q(\mathcal Z)} d\mathcal Z\right]$
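Look at the second term first. Together with its leading minus sign, it is exactly the relative entropy, also called the $\mathcal K\mathcal L$ divergence (Kullback-Leibler divergence), between the two distributions $\mathcal Q(\mathcal Z)$ and $P(\mathcal Z \mid \mathcal X,\theta)$:
$- \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \frac{P(\mathcal Z \mid \mathcal X,\theta)}{\mathcal Q(\mathcal Z)} d\mathcal Z = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \frac{\mathcal Q(\mathcal Z)}{P(\mathcal Z \mid \mathcal X,\theta)} d\mathcal Z = \mathcal K\mathcal L(\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X,\theta))$
In practical terms, it measures how different the distributions $\mathcal Q(\mathcal Z)$ and $P(\mathcal Z \mid \mathcal X,\theta)$ are.
The first term also has a name: the evidence lower bound (ELBO), so called because it is a lower bound on $\log P(\mathcal X \mid \theta)$:
$ELBO = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)} d\mathcal Z$
Collecting these results: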
$\begin{aligned}\int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log P(\mathcal X \mid \theta) d\mathcal Z & = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \left[\log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)} - \log \frac{P(\mathcal Z \mid \mathcal X,\theta)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z \\ \to \log P(\mathcal X \mid \theta) & = ELBO + \mathcal K\mathcal L(\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X,\theta)) \end{aligned}$
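The decomposition can be checked numerically on a tiny discrete example, where the integrals over $\mathcal Z$ become sums. The sketch below uses made-up numbers: `joint` holds hypothetical values of $P(x, z \mid \theta)$ for one fixed observation and a binary latent variable, and `q` is an arbitrary distribution chosen only for illustration.

```python
import math

# Hypothetical numbers for one fixed observation x and a binary latent variable z:
# joint[z] = P(x, z | theta)
joint = [0.12, 0.28]
evidence = sum(joint)                      # P(x | theta), with z summed out
posterior = [j / evidence for j in joint]  # P(z | x, theta)

q = [0.5, 0.5]  # an arbitrary distribution Q(z), chosen for illustration

# ELBO = sum_z Q(z) log( P(x, z | theta) / Q(z) )
elbo = sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint))
# KL(Q || posterior) = sum_z Q(z) log( Q(z) / P(z | x, theta) )
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

print(math.log(evidence), elbo + kl)  # the two values should agree
```

Changing `q` to any other valid distribution with positive entries should leave the sum `elbo + kl` unchanged, while shifting how it is split between the two terms.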
By the non-negativity of the $\mathcal K\mathcal L$ divergence:
$\mathcal K\mathcal L(\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X,\theta)) \geq 0$
it follows that $\log P(\mathcal X \mid \theta) \geq ELBO$ always holds, so $ELBO$ is a lower bound on $\log P(\mathcal X \mid \theta)$. When does equality hold? By the properties of the $\mathcal K\mathcal L$ divergence, exactly when $\mathcal Q(\mathcal Z)$ and $P(\mathcal Z \mid \mathcal X,\theta)$ are the same distribution:
$\begin{aligned}\mathcal Q(\mathcal Z)=P(\mathcal Z \mid \mathcal X,\theta) & \to \mathcal K\mathcal L(\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X,\theta)) = 0 \\ & \to \log P(\mathcal X \mid \theta) = ELBO \end{aligned}$
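The equality condition can also be seen by brute force. The sketch below, again with hypothetical numbers for $P(x, z \mid \theta)$, sweeps $\mathcal Q(z=1)$ over a grid and looks for the value that maximizes the ELBO. Under these assumptions the maximizer should coincide with the posterior, and the gap to the log evidence should vanish there.

```python
import math

joint = [0.12, 0.28]                       # hypothetical P(x, z | theta), z in {0, 1}
evidence = sum(joint)
posterior = [j / evidence for j in joint]  # P(z | x, theta)


def elbo(q1):
    """ELBO for the distribution Q(z=1) = q1, Q(z=0) = 1 - q1."""
    q = [1.0 - q1, q1]
    return sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint))


grid = [i / 100 for i in range(1, 100)]    # q1 = 0.01, ..., 0.99
best_q1 = max(grid, key=elbo)              # grid point with the largest ELBO
gap_at_best = math.log(evidence) - elbo(best_q1)

print(best_q1, posterior[1], gap_at_best)
```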
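Looking at this condition, the working guess is that $\mathcal Q(\mathcal Z)$ and $P(\mathcal Z \mid \mathcal X,\theta)$ should become closer and closer as the iterations proceed. Both describe a distribution over the latent variable $\mathcal Z$, so it would make little sense for the difference between them to keep growing.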

Under this guess, $\mathcal K\mathcal L(\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X,\theta))$ gradually approaches 0.
Our core goal, however, is still to make $\log P(\mathcal X \mid \theta)$ as large as possible. The $\mathcal K\mathcal L$ term is always $\geq 0$, but because it is being driven toward 0 it cannot take on any of the work of increasing $\log P(\mathcal X \mid \theta)$. That work has to be done by the $ELBO$.

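Based on this guess, a naive idea is:
with $\mathcal Q(\mathcal Z)$ equal to $P(\mathcal Z \mid\mathcal X,\theta)$, make the $ELBO$ as large as possible, and thereby make $\log P(\mathcal X \mid \theta)$ as large as possible.
Note that these are two separate steps: first fix $\mathcal Q(\mathcal Z)$, then maximize over $\theta$. That is: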
$\hat \theta = \mathop{\arg\max}\limits_{\theta} \log P(\mathcal X \mid \theta) = \mathop{\arg\max}\limits_{\theta} ELBO \quad(\mathcal Q(\mathcal Z) = P(\mathcal Z \mid \mathcal X,\theta))$
Substituting the expression for the $ELBO$ gives:
$\hat \theta = \mathop{\arg\max}\limits_{\theta} \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)} d\mathcal Z$
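Now substitute $\mathcal Q(\mathcal Z) = P(\mathcal Z \mid \mathcal X,\theta^{(t)})$.
Note: $P(\mathcal Z \mid \mathcal X,\theta^{(t)})$ here is the posterior distribution of the latent variable $\mathcal Z$ computed from the previous iteration's parameters. It is a fixed distribution, not the $\theta$-dependent $P(\mathcal Z \mid \mathcal X,\theta)$.
Personal understanding of why the superscript $(t)$ appears:
We want $\mathcal K\mathcal L(\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X,\theta)) = 0$, which requires $\mathcal Q(\mathcal Z) = P(\mathcal Z \mid \mathcal X,\theta)$.
Ideally, the posterior to use in the current iteration would be $P(\mathcal Z \mid \mathcal X,\theta^{(t+1)})$, but $\theta^{(t+1)}$ is exactly what this iteration is trying to compute, so $P(\mathcal Z \mid \mathcal X,\theta^{(t+1)})$ is not available yet.
We can therefore only use the best distribution over $\mathcal Z$ that is available at this point. Since each iterate is at least as good as the ones before it, that is the posterior $P(\mathcal Z \mid \mathcal X,\theta^{(t)})$ produced by the previous iteration's parameters $\theta^{(t)}$.
With $\mathcal Q(\mathcal Z) = P(\mathcal Z \mid \mathcal X,\theta^{(t)})$, the term $\mathcal K\mathcal L(\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X,\theta))$ is exactly 0 at $\theta = \theta^{(t)}$, so the $ELBO$ touches $\log P(\mathcal X \mid \theta)$ at the current parameters. For other values of $\theta$ it is generally not 0, and the $ELBO$ is only a lower bound there.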
$\hat \theta = \mathop{\arg\max}\limits_{\theta} \int_{\mathcal Z} P(\mathcal Z \mid \mathcal X,\theta^{(t)}) \log \left[\frac{P(\mathcal X,\mathcal Z \mid \theta)}{ P(\mathcal Z \mid \mathcal X,\theta^{(t)})}\right] d\mathcal Z$
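A small numerical sketch may help show what fixing $\mathcal Q(\mathcal Z) = P(\mathcal Z \mid \mathcal X,\theta^{(t)})$ does. The model below is a hypothetical one invented for illustration: a single observed flip sequence with 7 heads and 3 tails, produced either by a coin with fixed bias 0.2 ($z=0$) or by a coin with unknown bias $\theta$ ($z=1$). With $\mathcal Q$ held fixed, the gap between $\log P(x \mid \theta)$ and the ELBO is computed on a grid of $\theta$ values.

```python
import math

HEADS, TAILS = 7, 3  # hypothetical single observation x: a sequence with 7 heads, 3 tails
PRIOR = 0.5          # assumed P(z = 1); coin z=0 has a fixed bias of 0.2


def joint(theta):
    """[P(x, z=0 | theta), P(x, z=1 | theta)]; only coin z=1 depends on theta."""
    return [
        (1 - PRIOR) * 0.2**HEADS * 0.8**TAILS,
        PRIOR * theta**HEADS * (1 - theta) ** TAILS,
    ]


def log_evidence(theta):
    return math.log(sum(joint(theta)))


def posterior(theta):
    j = joint(theta)
    return [jz / sum(j) for jz in j]


def elbo(theta, q):
    return sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint(theta)))


theta_t = 0.4
q = posterior(theta_t)  # Q(Z) = P(Z | X, theta^(t)), held fixed below

grid = [i / 100 for i in range(1, 100)]
# Gap between log-likelihood and ELBO, i.e. KL(Q || P(Z | X, theta)), at each theta
gaps = [log_evidence(th) - elbo(th, q) for th in grid]
print(min(gaps), grid[gaps.index(min(gaps))])
```

On this grid the gap should be non-negative everywhere and essentially zero at `theta_t`, which is the sense in which the lower bound is tight at the current parameters.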
Expanding the logarithm:
$\hat \theta = \mathop{\arg\max}\limits_{\theta} \left\{\int_{\mathcal Z} P(\mathcal Z \mid \mathcal X,\theta^{(t)}) \log P(\mathcal X,\mathcal Z \mid \theta) d\mathcal Z - \int_{\mathcal Z} P(\mathcal Z \mid \mathcal X,\theta^{(t)}) \log P(\mathcal Z \mid \mathcal X,\theta^{(t)}) d\mathcal Z \right\}$
Look at the second term inside the braces:
$\int_{\mathcal Z} P(\mathcal Z \mid \mathcal X,\theta^{(t)}) \log P(\mathcal Z \mid \mathcal X,\theta^{(t)}) d\mathcal Z$
Here $\theta^{(t)}$ is the parameter produced by the previous iteration, so within the current iteration it is a constant. The second term therefore does not depend on $\theta$ and can be dropped without changing the $\arg\max$. Writing the maximizer as the next iterate $\theta^{(t+1)}$, we obtain:
$\theta^{(t+1)} = \hat \theta = \mathop{\arg\max}\limits_{\theta}\int_{\mathcal Z } \log P(\mathcal X,\mathcal Z\mid \theta) P(\mathcal Z \mid \mathcal X ,\theta^{(t)})d\mathcal Z$
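The last step, dropping the term that does not depend on $\theta$, can be sanity-checked with a grid search. The sketch below uses a hypothetical one-parameter model (one observed flip sequence with 7 heads and 3 tails, a fixed-bias coin for $z=0$ and a coin with bias $\theta$ for $z=1$) and assumes $\theta^{(t)} = 0.4$. The names `q_function` and `elbo` are illustrative labels for the expression with and without the constant term.

```python
import math

HEADS, TAILS, PRIOR = 7, 3, 0.5  # hypothetical setup: one observed flip sequence


def log_joint(theta):
    """[log P(x, z=0 | theta), log P(x, z=1 | theta)]; coin z=0 has fixed bias 0.2."""
    return [
        math.log(1 - PRIOR) + HEADS * math.log(0.2) + TAILS * math.log(0.8),
        math.log(PRIOR) + HEADS * math.log(theta) + TAILS * math.log(1 - theta),
    ]


def posterior(theta):
    weights = [math.exp(v) for v in log_joint(theta)]
    return [w / sum(weights) for w in weights]


q = posterior(0.4)  # P(Z | X, theta^(t)) with an assumed theta^(t) = 0.4
# Negative of the dropped term: -sum_z q(z) log q(z); it contains no theta
entropy = -sum(qz * math.log(qz) for qz in q)


def q_function(theta):
    """sum_z P(z | x, theta^(t)) log P(x, z | theta): the expression kept in the final formula."""
    return sum(qz * lj for qz, lj in zip(q, log_joint(theta)))


def elbo(theta):
    """Full lower bound: the kept expression plus the theta-independent constant."""
    return q_function(theta) + entropy


grid = [i / 100 for i in range(1, 100)]
theta_from_q = max(grid, key=q_function)
theta_from_elbo = max(grid, key=elbo)
print(theta_from_q, theta_from_elbo)  # the two maximizers should coincide
```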

This completes the derivation.

Disclaimer: this article was contributed by a community user. Please credit the source when reposting: 【wpsshop博客】