Friday, March 26, 2021

Why does my pytorch distributed training (DDP) code send a SIGKILL signal on its own?

My code runs for a few iterations, but before training finishes it gets killed with SIGKILL for some unknown reason:

backend='nccl'
rank=1
mp.current_process()=<SpawnProcess name='SpawnProcess-2' parent=13950 started>
os.getpid()=13987
setting up rank=1 (with world_size=4)
MASTER_ADDR='127.0.0.1'
44109
backend='nccl'
--> done setting up rank=0
--> done setting up rank=2
--> done setting up rank=1
--> done setting up rank=3
setup process done for rank=0
setup process done for rank=2
setup process done for rank=1
setup process done for rank=3
Starting training...

n_epoch=0
Traceback (most recent call last):
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 279, in <module>
    main_distributed()
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 188, in main_distributed
    spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
    raise Exception(
Exception: process 1 terminated with signal SIGKILL

I don't understand why it does that. I am not incrementally storing anything as training goes, so I don't think it should be a memory issue (especially since it trains fine for a few batches).

How do I even start debugging this with the error not giving me any information? Ideas?
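One plausible lead (an assumption, not confirmed by the traceback): a SIGKILL that arrives without any Python exception in the worker is often the Linux OOM killer, which can be checked with `dmesg` on the host. A minimal sketch for narrowing this down is to log each spawned worker's peak resident memory every iteration and watch whether it climbs toward the machine's limit before the kill. The `log_memory` helper below is hypothetical, using only the standard-library `resource` module:

```python
import os
import resource


def log_memory(tag: str) -> int:
    """Print and return this process's peak resident set size (RSS).

    Hypothetical debugging helper: call it at the top of every training
    iteration in each spawned worker. If the OOM killer is responsible,
    the reported peak should grow steadily before the SIGKILL.
    Note: ru_maxrss is reported in kilobytes on Linux, bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"[pid {os.getpid()}] {tag}: peak RSS = {peak} (ru_maxrss units)")
    return peak


# Example placement inside a worker's training loop:
# for i, batch in enumerate(dataloader):
#     log_memory(f"epoch {epoch}, batch {i}")
#     ...
```

Logging `os.getpid()` alongside the measurement makes it easy to match a worker's log against the PID that `dmesg` reports as killed.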


In my research I've checked these links but none seem to help:

https://stackoverflow.com/questions/66820865/why-does-my-pytorch-distributed-training-ddp-code-send-a-sigkill-signal-on-its
