
RuntimeError: Socket Timeout during multi-node, multi-GPU distributed training

Traceback (most recent call last):
  File "distribute_prune_erfnet_cluster.py", line 643, in <module>
    daemon=False, )
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/workspace/geniii-trainingcode-lane/lanenet/distribute_prune_erfnet_cluster.py", line 601, in _distributed_worker
    distributed.init_process_group(backend="gloo", init_method=dist_url, world_size=world_size, rank=global_rank)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 432, in init_process_group
    timeout=timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 503, in _new_process_group_helper
    timeout=timeout)
RuntimeError: Socket Timeout
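
The traceback shows the failure inside init_process_group, which takes a timeout and an init_method URL (dist_url here). The post does not state the cause. One possibility, and it is only an assumption, is that the nodes cannot reach the rendezvous address in dist_url. Below is a minimal standard-library sketch for checking that from a worker node. The helper names parse_tcp_url and can_reach are hypothetical, and the sketch assumes dist_url has the form tcp://host:port. It is probably best run as a standalone diagnostic against a test listener on the master node, not from inside the training script.

```python
import socket
from urllib.parse import urlparse


def parse_tcp_url(dist_url):
    """Split a URL such as 'tcp://10.0.0.1:23456' into (host, port).

    Hypothetical helper: assumes the tcp://host:port form of init_method.
    """
    parsed = urlparse(dist_url)
    if parsed.scheme != "tcp" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected tcp://host:port, got {dist_url!r}")
    return parsed.hostname, parsed.port


def can_reach(dist_url, timeout_s=5.0):
    """Return True if a plain TCP connection to the address succeeds.

    Hypothetical helper: it only tests network reachability (routing,
    firewall, something listening on the port), nothing PyTorch-specific.
    """
    host, port = parse_tcp_url(dist_url)
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

If the address is reachable and the timeout still occurs, it may be worth checking that every node uses the same world_size and a unique rank, and that all processes start within the timeout passed to init_process_group.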
