Friday, March 26, 2021

How to solve the famous `unhandled cuda error, NCCL version 2.7.8` error?

I've seen multiple issues about this error:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

but none of the suggested fixes work for me:

I've tried calling `torch.cuda.set_device(device)` manually at the beginning of every script, but that didn't help. I've tried different GPUs. I've also tried downgrading the PyTorch and CUDA versions, using different combinations of PyTorch 1.6.0, 1.7.1, 1.8.0 and CUDA 10.2, 11.0, 11.1. I am unsure what else to do. What did people do to solve this issue?
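
For reference, here is a minimal sketch of the kind of per-process device pinning described above. It is not the actual code from the project: `device_index_for_rank` and `setup_worker` are hypothetical names, the address and port are simply the values that appear in the log further down, and setting `NCCL_DEBUG=INFO` (a standard NCCL environment variable that makes NCCL print more detail) is an added assumption, not something from the original attempts.

```python
import os


def device_index_for_rank(rank: int, device_count: int) -> int:
    """Map a worker rank to a local GPU index (hypothetical helper)."""
    if device_count <= 0:
        raise ValueError("no CUDA devices visible")
    # With one process per GPU on a single node this is just the rank itself.
    return rank % device_count


def setup_worker(rank: int, world_size: int, port: str = "59264") -> int:
    """Pin this process to one GPU, then create the NCCL process group."""
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    # Rendezvous settings read by the default env:// init method.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", port)
    # Assumption: ask NCCL for more verbose logs around the failing call.
    os.environ.setdefault("NCCL_DEBUG", "INFO")

    gpu = device_index_for_rank(rank, torch.cuda.device_count())
    # Select the device before the process group and DDP wrapper are created.
    torch.cuda.set_device(gpu)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return gpu
```

Inside the function passed to `mp.spawn`, the returned index would then be used for both the model and the wrapper, for example `DistributedDataParallel(model.to(gpu), device_ids=[gpu])`, so that each rank touches only its own GPU.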


very related perhaps?


More complete error message:

('jobid', 4852)
('slurm_jobid', -1)
('slurm_array_task_id', -1)
('condor_jobid', 4852)
('current_time', 'Mar25_16-27-35')
('tb_dir', PosixPath('/home/miranda9/data/logs/logs_Mar25_16-27-35_jobid_4852/tb'))
('gpu_name', 'GeForce GTX TITAN X')
('PID', '30688')
torch.cuda.device_count()=2
opts.world_size=2
ABOUT TO SPAWN WORKERS
done setting sharing strategy...next mp.spawn
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
rank=0
mp.current_process()=<SpawnProcess name='SpawnProcess-1' parent=30688 started>
os.getpid()=30704
setting up rank=0 (with world_size=2)
MASTER_ADDR='127.0.0.1'
59264
backend='nccl'
--> done setting up rank=0
setup process done for rank=0
Traceback (most recent call last):
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 279, in <module>
    main_distributed()
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 188, in main_distributed
    spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 212, in train
    tactic_predictor = move_to_ddp(rank, opts, tactic_predictor)
  File "/home/miranda9/ultimate-utils/ultimate-utils-project/uutils/torch/distributed.py", line 162, in move_to_ddp
    model = DistributedDataParallel(model, find_unused_parameters=True, device_ids=[opts.gpu])
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

https://stackoverflow.com/questions/66807131/how-to-solve-the-famous-unhandled-cuda-error-nccl-version-2-7-8-error March 26, 2021 at 04:28AM
