I've seen multiple issues about the following error:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8 ncclUnhandledCudaError: Call to CUDA function failed.
but none of the suggested fixes works for me:
- https://github.com/pytorch/pytorch/issues/54550
- https://github.com/pytorch/pytorch/issues/47885
- https://github.com/pytorch/pytorch/issues/50921
- https://github.com/pytorch/pytorch/issues/54823
I've tried calling torch.cuda.set_device(device)
manually at the beginning of every script (roughly as in the sketch right after this paragraph), but that didn't work for me. I've tried different GPUs. I've tried downgrading the PyTorch and CUDA versions, in different combinations of PyTorch 1.6.0, 1.7.1, 1.8.0 and CUDA 10.2, 11.0, 11.1. I am unsure what else to do. What did people do to solve this issue?
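For reference, the set_device call lives in the per-rank setup and looks roughly like this. This is a minimal sketch, not my actual setup code; the setup_process name is a placeholder, and the address/port are the values that appear in the log further down:

```python
import os

import torch
import torch.distributed as dist


def setup_process(rank: int, world_size: int, backend: str = 'nccl'):
    # Rendezvous info for the default process group (same values as in the log below).
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '59264'
    # Pin this worker to its own GPU before init_process_group / any NCCL call,
    # so collectives don't all end up on cuda:0.
    torch.cuda.set_device(rank)
    dist.init_process_group(backend, rank=rank, world_size=world_size)
```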
More complete error message:
('jobid', 4852) ('slurm_jobid', -1) ('slurm_array_task_id', -1) ('condor_jobid', 4852) ('current_time', 'Mar25_16-27-35') ('tb_dir', PosixPath('/home/miranda9/data/logs/logs_Mar25_16-27-35_jobid_4852/tb')) ('gpu_name', 'GeForce GTX TITAN X') ('PID', '30688')
torch.cuda.device_count()=2
opts.world_size=2
ABOUT TO SPAWN WORKERS
done setting sharing strategy...next mp.spawn
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
rank=0
mp.current_process()=<SpawnProcess name='SpawnProcess-1' parent=30688 started>
os.getpid()=30704
setting up rank=0 (with world_size=2)
MASTER_ADDR='127.0.0.1'
59264
backend='nccl'
--> done setting up rank=0
setup process done for rank=0
Traceback (most recent call last):
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 279, in <module>
    main_distributed()
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 188, in main_distributed
    spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 212, in train
    tactic_predictor = move_to_ddp(rank, opts, tactic_predictor)
  File "/home/miranda9/ultimate-utils/ultimate-utils-project/uutils/torch/distributed.py", line 162, in move_to_ddp
    model = DistributedDataParallel(model, find_unused_parameters=True, device_ids=[opts.gpu])
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
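For context, the code path in that traceback boils down to roughly the sketch below. It is not the real main_brando.py / move_to_ddp code: the toy nn.Linear model and the hard-coded address/port are placeholders, but the mp.spawn -> init_process_group('nccl') -> DistributedDataParallel(..., device_ids=[...]) sequence is the same one that crashes.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def train(rank: int, world_size: int):
    # Same per-rank setup as in the earlier sketch: rendezvous info + pin the GPU.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '59264'
    torch.cuda.set_device(rank)
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    # Toy stand-in for the real tactic_predictor model.
    model = nn.Linear(4, 4).to(rank)

    # This is the call that raises ncclUnhandledCudaError in my runs:
    model = DistributedDataParallel(model, device_ids=[rank], find_unused_parameters=True)

    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = torch.cuda.device_count()  # 2 on this machine
    mp.spawn(fn=train, args=(world_size,), nprocs=world_size)
```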
Cross-posted on Stack Overflow: https://stackoverflow.com/questions/66807131/how-to-solve-the-famous-unhandled-cuda-error-nccl-version-2-7-8-error