My code runs for a few iterations, but before training finishes one of the processes is terminated with SIGKILL for no apparent reason:

    backend='nccl'
    rank=1
    mp.current_process()=<SpawnProcess name='SpawnProcess-2' parent=13950 started>
    os.getpid()=13987
    setting up rank=1 (with world_size=4)
    MASTER_ADDR='127.0.0.1'
    44109
    backend='nccl'
    --> done setting up rank=0
    --> done setting up rank=2
    --> done setting up rank=1
    --> done setting up rank=3
    setup process done for rank=0
    setup process done for rank=2
    setup process done for rank=1
    setup process done for rank=3
    Starting training...
    n_epoch=0
    Traceback (most recent call last):
      File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 279, in <module>
        main_distributed()
      File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 188, in main_distributed
        spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
      File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
        while not context.join():
      File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
        raise Exception(
    Exception: process 1 terminated with signal SIGKILL

I don't understand why this happens. I am not accumulating or storing anything as training progresses, so I don't think it should be a memory issue (especially since it trains fine for a few batches).
How do I even start debugging this when the error gives me no information? Any ideas?
In my research I've checked these links but none seem to help:
- How does one fix a `Exception: process 0 terminated with signal SIGSEGV` error and if the single gpu code works fine?
- https://github.com/huggingface/transformers/issues/3660
- https://discuss.pytorch.org/t/exception-process-0-terminated-with-signal-sigkill/75570/5
- https://github.com/PyTorchLightning/pytorch-lightning/issues/1590
- Python script terminated by SIGKILL rather than throwing MemoryError
- https://discuss.pytorch.org/t/torch-utils-data-dataloader-issue/92770
- https://www.reddit.com/r/pytorch/comments/mdsljr/why_does_my_pytorch_distributed_training_ddp_code/
- https://www.quora.com/unanswered/Why-does-my-PyTorch-distributed-training-DDP-code-send-a-SIGKILL-signal-on-its-own
- https://github.com/pytorch/pytorch/issues/54823
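
One way to test the memory hypothesis raised above, rather than rule it out by reasoning alone, is to log each rank's peak resident memory as training proceeds. The sketch below is only a starting point under stated assumptions: it assumes a Unix host (the standard-library `resource` module is not available on Windows), and the helper names `peak_rss_mb` and `log_memory` are hypothetical, not part of PyTorch or of the original script. It measures host RAM of the calling process only, so it does not cover GPU memory or separate DataLoader worker processes.

```python
import os
import resource
import sys


def peak_rss_mb() -> float:
    """Return the peak resident set size of the calling process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(rank: int, step: int) -> str:
    """Print and return a one-line memory report (hypothetical helper)."""
    msg = (
        f"rank={rank} pid={os.getpid()} step={step} "
        f"peak_rss={peak_rss_mb():.1f} MiB"
    )
    # flush=True so the line is written even if the process is killed soon after.
    print(msg, flush=True)
    return msg
```

Calling `log_memory(rank, step)` once per batch inside the function passed to `mp.spawn` would show whether the value climbs steadily before the kill. If it does, host memory becomes a plausible suspect; on Linux the kernel log (for example via `dmesg`) usually records when the out-of-memory killer terminates a process, which would confirm or rule this out.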