
Error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation


RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 2, 72, 768]], which is output 0 of AsStridedBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

In plain language:

A tensor that autograd needs for the backward pass was modified in place after the forward pass, so its version counter is now 1 while autograd recorded it at version 0. The traceback above the error points at the operation whose gradient could not be computed; the tensor was changed either there or somewhere later.
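For intuition, here is a minimal, self-contained sketch (not the author's training code) that reproduces this class of error: a tensor saved by autograd for the backward pass is edited in place, so its version counter no longer matches the one recorded in the graph.

    import torch

    x = torch.randn(3, 4, requires_grad=True)
    y = x.exp()         # exp's backward needs its saved output y
    y += 1              # in-place add bumps y's version counter from 0 to 1
    y.sum().backward()  # RuntimeError: ... modified by an inplace operation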

The error was traced to these two lines in the training loop:

    total_loss, loss, loss_prompt, loss_pair = self._train_step(batch, step, epoch)
    total_loss.backward()

Many posts suggest adding retain_graph=True, which keeps the computation-graph buffers from the previous backward() alive instead of freeing them:

    # original code
    total_loss.backward()
    # changed to:
    total_loss.backward(retain_graph=True)
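As an aside, retain_graph=True only matters when backward() is deliberately called more than once on the same graph; it does not repair version-counter mismatches, which is consistent with the error persisting below. A minimal sketch of its intended use:

    import torch

    x = torch.randn(3, requires_grad=True)
    loss = (x * x).sum()

    loss.backward(retain_graph=True)  # keep the saved buffers for a second pass
    loss.backward()                   # would raise an error without retain_graph=True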

The same error was raised regardless, so the remaining option was to pin down the failing operation by wrapping the training step in torch.autograd.detect_anomaly():

    with torch.autograd.detect_anomaly():
        total_loss, loss, loss_prompt, loss_pair = self._train_step(batch, step, epoch)
        total_loss.backward()
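Anomaly detection can also be enabled globally with torch.autograd.set_detect_anomaly(True). Either form makes autograd record the forward-pass traceback of every operation, so the backward error points at the line that produced the problematic tensor; it slows training noticeably and is meant for debugging only.

    import torch

    # global switch, same effect as the context manager above;
    # turn it off again once the offending line has been found
    torch.autograd.set_detect_anomaly(True)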

Running it again pointed to this line as the source of the error:

    mean_token = x[:, :, -end:, :].mean(1, keepdim=True).expand(-1, num, -1, -1)

The likely cause is that this tensor is recorded in the computation graph and later modified, which produces the version mismatch. The code can be rewritten to avoid the potential in-place interaction and keep the graph intact:

    # copy x so that later modifications cannot touch the tensor recorded in the graph
    x = x.clone()
    sub_tensor = x[:, :, -end:, :]
    # copy the slice as well before reducing it
    sub_tensor_clone = sub_tensor.clone()
    mean_token = sub_tensor_clone.mean(1, keepdim=True)
    mean_token = mean_token.expand(-1, num, -1, -1)
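One detail worth noting, since the error message mentions AsStridedBackward0: .expand() returns a view that shares storage with its input. If mean_token were later written into in place, materialising the broadcast with .repeat() (or an extra .clone()) would be a possible alternative; this is a hypothetical variant, not part of the fix above.

    # hypothetical variant: .repeat() copies the data instead of creating a view
    mean_token = sub_tensor_clone.mean(1, keepdim=True).repeat(1, num, 1, 1)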

The tensor is copied in two places so that the slice no longer shares storage with the original tensor. Beyond that, check every operation in the code carefully and make sure nothing modifies a tensor in place, for example operators such as .add_() or .mul_(); a small sketch of replacing them with out-of-place equivalents follows.
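The sketch below (illustrative only; the variable names are made up) shows the same computation written with and without in-place operators:

    import torch

    x = torch.randn(4, 8, requires_grad=True)
    y = x.exp()          # y is saved by exp's backward

    # In-place edits such as y.add_(1) or y.mul_(0.5) would bump y's version
    # counter and break backward(). Out-of-place forms create new tensors:
    y = y + 1            # instead of y.add_(1)
    y = y * 0.5          # instead of y.mul_(0.5)

    y.sum().backward()   # works: the tensor saved for backward was never modified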
