
DeepSpeed full-parameter training fails with "exits with return code = -7"


Error

exits with return code = -7

Solution

Official docker run documentation: https://docs.docker.com/engine/reference/commandline/run/

https://hub.yzuu.cf/microsoft/DeepSpeed/issues/4002

https://github.com/microsoft/DeepSpeed/issues/2897

Setting the shm-size to a large number instead of the default 64MB when creating the Docker container solves the problem in my case. It appears that multi-GPU training relies on shared memory.
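To confirm whether a container is actually running with that 64MB default, you can inspect /dev/shm directly; it is a tmpfs mount, so df reports its size. This check is not from the original post; the container name is a placeholder and the sample output is only what is typically expected:

```bash
# Check the shared memory segment of a running container (name is a placeholder).
# With Docker's default settings this typically reports a 64M tmpfs.
docker exec -it my_training_container df -h /dev/shm
# Expected output looks something like:
#   Filesystem  Size  Used Avail Use% Mounted on
#   shm          64M     0   64M   0% /dev/shm
```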

I ran with this AKS cluster YAML (https://stackoverflow.com/questions/43373463/how-to-increase-shm-size-of-a-kubernetes-container-shm-size-equivalent-of-doc) or with the Docker command docker run --rm --runtime=nvidia --gpus all --shm-size 3gb imagename, and it worked.
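For the Kubernetes case referenced in the StackOverflow link, the usual equivalent of Docker's --shm-size is to mount a memory-backed emptyDir volume at /dev/shm. The sketch below is a minimal, hypothetical pod spec, not the cluster YAML mentioned in the quote; the pod name, image, and 64Gi limit are all assumptions to be replaced with your own values:

```bash
# Minimal sketch, applied via kubectl; names and the size limit are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: deepspeed-train            # hypothetical pod name
spec:
  containers:
  - name: trainer
    image: my-deepspeed-image      # placeholder image
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm          # replaces the default 64MB shm mount
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory               # tmpfs-backed volume
      sizeLimit: 64Gi              # shared memory available to the pod
EOF
```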

Also, talking with @jomayeri a bit offline, it sounds like increasing Docker shared memory might help with this as well. One way to bump that up is by passing something like --shm-size="2gb" to your docker run command. The default is pretty small and can sometimes cause issues like this.

Thank you for your advice. I checked the default Docker shm and found it's only 64M. When I changed it up to 64g, the script ran well. I also tried "deepspeed all_reduce_bench_v2.py", and it exits successfully. Appreciate your answer.
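Putting the above together, a relaunch along the following lines reflects the fix described in the quotes; 64g mirrors the value that worked for the issue author, while the image name and training command are placeholders:

```bash
# Recreate the container with a larger shared memory segment (shm-size cannot be
# changed on an existing container, so it has to be set at docker run time).
docker run --rm --runtime=nvidia --gpus all --shm-size=64g my-deepspeed-image \
    deepspeed my_training_script.py
# Inside the new container, df -h /dev/shm should now report ~64G instead of 64M.
```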
