
DeepSpeed full-parameter training fails with "exits with return code = -7"


Error

exits with return code = -7

Solution

Official docker run documentation: https://docs.docker.com/engine/reference/commandline/run/

https://hub.yzuu.cf/microsoft/DeepSpeed/issues/4002

https://github.com/microsoft/DeepSpeed/issues/2897

Setting the shm-size to a large number instead of the default 64MB when creating the Docker container solves the problem in my case. It appears that multi-GPU training relies on shared memory.
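To confirm whether a container is actually running with that 64MB default, you can inspect /dev/shm directly; it is a tmpfs mount, so df reports its size. This check is not from the original post; the container name is a placeholder and the sample output is only what is typically expected:

```bash
# Check the shared memory segment of a running container (name is a placeholder).
# With Docker's default settings this typically reports a 64M tmpfs.
docker exec -it my_training_container df -h /dev/shm
# Expected output looks something like:
#   Filesystem  Size  Used Avail Use% Mounted on
#   shm          64M     0   64M   0% /dev/shm
```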

I ran with this AKS cluster YAML (https://stackoverflow.com/questions/43373463/how-to-increase-shm-size-of-a-kubernetes-container-shm-size-equivalent-of-doc) or with the Docker command docker run --rm --runtime=nvidia --gpus all --shm-size 3gb imagename, and it worked.
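For the Kubernetes case referenced in the StackOverflow link, the usual equivalent of Docker's --shm-size is to mount a memory-backed emptyDir volume at /dev/shm. The sketch below is a minimal, hypothetical pod spec, not the cluster YAML mentioned in the quote; the pod name, image, and 64Gi limit are all assumptions to be replaced with your own values:

```bash
# Minimal sketch, applied via kubectl; names and the size limit are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: deepspeed-train            # hypothetical pod name
spec:
  containers:
  - name: trainer
    image: my-deepspeed-image      # placeholder image
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm          # replaces the default 64MB shm mount
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory               # tmpfs-backed volume
      sizeLimit: 64Gi              # shared memory available to the pod
EOF
```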

Also, talking with @jomayeri a bit offline, it sounds like increasing Docker shared memory might help with this as well. One way to bump that up is by passing something like --shm-size="2gb" to your docker run command. The default is pretty small and can sometimes cause issues like this.

Thank you for your advice. I checked the default Docker shm and found it's only 64M. When I changed it up to 64g, the script ran well. I also tried "deepspeed all_reduce_bench_v2.py", and it exits successfully. Appreciate your answer.
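Putting the above together, a relaunch along the following lines reflects the fix described in the quotes; 64g mirrors the value that worked for the issue author, while the image name and training command are placeholders:

```bash
# Recreate the container with a larger shared memory segment (shm-size cannot be
# changed on an existing container, so it has to be set at docker run time).
docker run --rm --runtime=nvidia --gpus all --shm-size=64g my-deepspeed-image \
    deepspeed my_training_script.py
# Inside the new container, df -h /dev/shm should now report ~64G instead of 64M.
```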
