Notes on "Learning to Reweight Examples for Robust Deep Learning"

Author: 键盘狂人 | 2024-01-26 11:30:07

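[1] uses meta-learning to learn per-example weights, which is useful in class-imbalance and noisy-label settings. I used to be confused by $\epsilon_{i,t}=0$ in Eq. (7) (corresponding to line 5 of Algorithm 1, and to ex_wts_a = tf.zeros([bsize_a], dtype=tf.float32) in the code): if $\epsilon$ is known to be 0, isn't the weighted loss in Eq. (4) identically zero? Doesn't Eq. (5) then optimize nothing at all, so that $\hat\theta_{t+1}(\epsilon) \equiv \theta_t$ ? Someone raised exactly this question in an issue^[2], but closed it after figuring it out, without posting an explanation.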

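It only clicked when I saw that the code in [3] sets requires_grad=True on $\epsilon$ . In programming terms, $\epsilon$ should be read as a variable, not a constant. In mathematical terms, the gradient ( $\nabla$ ) in (5) is an operator, not a function value. That is, (5) only uses a gradient-descent step to establish a functional relationship between $\hat\theta_{t+1}$ and $\epsilon$ (in TensorFlow terms, it only builds the graph), namely $\hat\theta_{t+1}(\epsilon)$ . It does not run one SGD step from the constants $\theta_t$ and $\epsilon=0$ to produce a constant $\hat\theta_{t+1}$ .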
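The "variable, not constant" reading can be checked on a toy problem. The sketch below is not the paper's code: it assumes a scalar model $y=\theta x$ with squared-error loss, and the names grad_f and theta_hat as well as all numeric values are made up for illustration. It shows that $\hat\theta_{t+1}(\epsilon)$ evaluated at $\epsilon=0$ equals $\theta_t$ , while its slope with respect to $\epsilon$ is still non-zero.

```python
# Toy setup (illustrative assumptions, not from the paper):
# scalar model y = theta * x, per-example loss f_i(theta) = 0.5 * (theta * x_i - y_i)^2.

def grad_f(theta, x, y):
    """Derivative of the per-example squared-error loss w.r.t. theta."""
    return (theta * x - y) * x

def theta_hat(eps, theta_t, xs, ys, alpha):
    """One eps-weighted SGD step, as in Eq. (5), viewed as a function of eps."""
    g1 = sum(e * grad_f(theta_t, x, y) for e, x, y in zip(eps, xs, ys))
    return theta_t - alpha * g1

theta_t, alpha = 0.5, 0.1
xs, ys = [1.0, 2.0], [1.0, 2.0]

# Evaluated at eps = 0, the step is a no-op: theta_hat equals theta_t.
at_zero = theta_hat([0.0, 0.0], theta_t, xs, ys, alpha)

# The function still depends on eps: a central finite difference in eps[0]
# gives a non-zero slope, equal to -alpha * grad_f(theta_t, xs[0], ys[0]).
h = 1e-6
slope_0 = (theta_hat([h, 0.0], theta_t, xs, ys, alpha)
           - theta_hat([-h, 0.0], theta_t, xs, ys, alpha)) / (2 * h)

print(at_zero)   # 0.5
print(slope_0)   # approximately 0.05
```

In an autodiff framework the same effect comes from marking $\epsilon$ as differentiable before building the step, which is what requires_grad=True does in [3].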

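A notational detail: $\theta_{t+1}$ without the hat is the result of one SGD step from $\theta_t$ under (3), using the unperturbed loss, whereas $\hat\theta_{t+1}$ uses the perturbed loss of (4). Eqs. (6) and (7) in the paper appear to write $\theta_{t+1}$ where $\hat\theta_{t+1}$ is meant.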

So the overall idea is to use a clean validation set to construct a loss $J(\epsilon)$ over $\epsilon$ , and then minimize it with an optimizer, i.e. $\epsilon_t^*=\arg\min_\epsilon J(\epsilon)$ . From (4) - (6) we have:

$\begin{aligned} J(\epsilon) &= \frac{1}{M}\sum_{j=1}^M f_j^v \left(\hat\theta_{t+1}(\epsilon) \right) & (6) \\ &= \frac{1}{M}\sum_{j=1}^M f_j^v \left(\theta_t - \alpha \underbrace{\left[ \nabla_{\theta} \sum_{i=1}^n f_{i,\epsilon}(\theta) \right] \bigg|_{\theta=\theta_t}}_{g_1(\epsilon; \theta_t)} \right) & (5) \\ &= \frac{1}{M}\sum_{j=1}^M f_j^v \left(\theta_t - \alpha \left[ \nabla_{\theta} \sum_{i=1}^n \epsilon_i f_i(\theta) \right] \bigg|_{\theta=\theta_t} \right) & (4) \\ &= g_2(\epsilon; \theta_t) \end{aligned}$

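The thing to note is that the gradient expression in (5) is essentially a function, not a constant. In it, $\epsilon$ is free, while $\theta$ is pinned by $|_{\theta=\theta_t}$ and is therefore treated as a constant, hence the notation $g_1(\epsilon;\theta_t)$ . The whole $J(\epsilon)$ can then also be viewed as a function $g_2(\epsilon; \theta_t)$ .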
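To see concretely why $\epsilon=0$ does not make the gradient vanish, the chain rule can be applied to $g_2$ . This is only a sketch: it assumes every $f_i$ and $f_j^v$ is differentiable in $\theta$ , and the shorthand $\bar f^v$ is introduced here for convenience and is not notation from the paper.

$\begin{aligned} \bar f^v(\theta) &:= \frac{1}{M}\sum_{j=1}^M f_j^v(\theta) \\ \frac{\partial J}{\partial \epsilon_i} &= \left[\nabla_\theta \bar f^v(\theta)\right]^\top \bigg|_{\theta=\hat\theta_{t+1}(\epsilon)} \frac{\partial \hat\theta_{t+1}(\epsilon)}{\partial \epsilon_i} \\ &= -\alpha \left[\nabla_\theta \bar f^v(\theta)\right]^\top \bigg|_{\theta=\hat\theta_{t+1}(\epsilon)} \left[\nabla_\theta f_i(\theta)\right] \bigg|_{\theta=\theta_t} \\ \frac{\partial J}{\partial \epsilon_i}\bigg|_{\epsilon=0} &= -\alpha \left( \left[\nabla_\theta \bar f^v(\theta)\right]^\top \left[\nabla_\theta f_i(\theta)\right] \right) \bigg|_{\theta=\theta_t} \end{aligned}$

So at $\epsilon=0$ we do get $\hat\theta_{t+1}=\theta_t$ , yet the derivative with respect to $\epsilon_i$ is $-\alpha$ times the inner product of the validation gradient and the $i$-th training gradient at $\theta_t$ , which is non-zero in general.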

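Following (6), the procedure for finding $\epsilon_t^*$ is:

1. Randomly initialize $\epsilon_t^{(0)}$ ;
2. $\epsilon^{(s+1)}_t \leftarrow \epsilon^{(s)}_t - \eta \nabla_{\epsilon} J(\epsilon) \big|_{\epsilon=\epsilon^{(s)}_t}$ , i.e. the right-hand side of (7). Perhaps because $J(\epsilon)$ is formally an expression that itself contains a gradient, $\S$ 3.3 calls this "unroll the gradient graph", and the step that computes $\epsilon^{(s+1)}_t$ is presumably what it calls "backward-on-backward".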

The paper's online approximation is then:

1. $\epsilon^{(0)}_t=0$
2. $\epsilon^*_t \approx \epsilon^{(1)}_t$
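The two steps of the online approximation can be sketched on a toy problem. This is a hedged illustration rather than the paper's implementation: it assumes a scalar model $y=\theta x$ with squared-error loss, uses central finite differences in place of a real backward-on-backward pass, and the names grad_f, J, grad_J together with all data values are invented for the example.

```python
# Toy data and model are illustrative assumptions, not taken from the paper.

def grad_f(theta, x, y):
    """Derivative of f(theta) = 0.5 * (theta * x - y)^2 w.r.t. theta."""
    return (theta * x - y) * x

def J(eps, theta_t, train, val, alpha):
    """Validation loss after one eps-weighted SGD step, i.e. g_2(eps; theta_t)."""
    step = sum(e * grad_f(theta_t, x, y) for e, (x, y) in zip(eps, train))
    theta_hat = theta_t - alpha * step
    return sum(0.5 * (theta_hat * x - y) ** 2 for x, y in val) / len(val)

def grad_J(eps, theta_t, train, val, alpha, h=1e-6):
    """Central finite differences, standing in for autodiff through the SGD step."""
    grads = []
    for i in range(len(eps)):
        up, dn = list(eps), list(eps)
        up[i] += h
        dn[i] -= h
        grads.append((J(up, theta_t, train, val, alpha)
                      - J(dn, theta_t, train, val, alpha)) / (2 * h))
    return grads

theta_t, alpha, eta = 0.5, 0.1, 1.0
train = [(1.0, 1.0), (2.0, 2.0), (1.0, -3.0)]  # last label is deliberately corrupted
val = [(1.0, 1.0)]                             # small clean validation set

eps0 = [0.0] * len(train)                      # step 1: eps^(0) = 0
g = grad_J(eps0, theta_t, train, val, alpha)
eps1 = [e - eta * gi for e, gi in zip(eps0, g)]  # step 2: eps* ~ eps^(1)

print([round(e, 4) for e in eps1])  # approximately [0.025, 0.1, -0.175]
```

In this toy run the two clean examples end up with positive $\epsilon^{(1)}$ and the corrupted one with a negative value, even though the starting point was all zeros.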

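Initializing to 0 may not be the best initialization, but it does not hinder the subsequent iterative optimization. Compare LoRA^[7], which also uses all-zero initialization (for one of its two low-rank factors).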

References

[1] (ICML’18) Learning to Reweight Examples for Robust Deep Learning - paper, code
[2] gradients of noisy loss w.r.t parameter \theta #2
[3] (PyTorch reimplementation 1) TinfoilHat0/Learning-to-Reweight-Examples-for-Robust-Deep-Learning-with-PyTorch-Higher
[4] (PyTorch reimplementation 2) danieltan07/learning-to-reweight-examples
[5] facebookresearch/higher
[6] Stateful vs stateless
[7] (ICLR’22) LoRA: Low-Rank Adaptation of Large Language Models - paper, code
