I have a heavy process that I want to run in parallel using multiprocessing. However, when I run it with multiprocessing, the processes are spawned but they do not use any system resources. I followed a multiprocessing example from GeeksForGeeks.
Expected result: the dataset is processed in parallel by the multiprocess_dataset() function given below.
Code:
    # Extractor and DatasetReader are not shown in this post
    import Extractor
    import DatasetReader
    import gc
    import multiprocessing

    import numpy as np
    import pandas as pd


    def batched_extraction(sub_data_df, file_batch_name, COL_NAME):
        print('File {} started processing!'.format(file_batch_name))
        result = [Extractor(row[COL_NAME]).get_all_phrases() for idx, row in sub_data_df.iterrows()]
        sub_data_df['result'] = result
        # Write to the batch-specific file name, not the literal 'file_batch_name'
        sub_data_df.to_csv('Outputs/{}'.format(file_batch_name), index=False, encoding='utf-8-sig')
        # Free the batch before the process exits (the original deleted an undefined name, phrases)
        del result, sub_data_df
        gc.collect()


    def multiprocess_dataset(formatted_df, COL_NAME):
        # COL_NAME is passed in explicitly; it is local to main() and not visible here otherwise
        batched_num_reviews = len(formatted_df)//4
        print('Multicore process 4 batches of {} reviews each'.format(batched_num_reviews))

        # Split the DataFrame into four contiguous batches; the last one takes the remainder
        sub_df_1, sub_df_2 = formatted_df[:batched_num_reviews], formatted_df[batched_num_reviews:2*(batched_num_reviews)]
        sub_df_3, sub_df_4 = formatted_df[2*(batched_num_reviews):3*(batched_num_reviews)], formatted_df[3*(batched_num_reviews):]

        file_batch_name_1 = 'output_multiprocess_input_file_b1_{}_reviews.csv'.format(len(sub_df_1))
        file_batch_name_2 = 'output_multiprocess_input_file_b2_{}_reviews.csv'.format(len(sub_df_2))
        file_batch_name_3 = 'output_multiprocess_input_file_b3_{}_reviews.csv'.format(len(sub_df_3))
        file_batch_name_4 = 'output_multiprocess_input_file_b4_{}_reviews.csv'.format(len(sub_df_4))

        p1 = multiprocessing.Process(name="process1", target=batched_extraction, args=(sub_df_1, file_batch_name_1, COL_NAME))
        p2 = multiprocessing.Process(name="process2", target=batched_extraction, args=(sub_df_2, file_batch_name_2, COL_NAME))
        p3 = multiprocessing.Process(name="process3", target=batched_extraction, args=(sub_df_3, file_batch_name_3, COL_NAME))
        p4 = multiprocessing.Process(name="process4", target=batched_extraction, args=(sub_df_4, file_batch_name_4, COL_NAME))

        p1.start()
        p2.start()
        p3.start()
        p4.start()

        p1.join()
        p2.join()
        p3.join()
        p4.join()


    def main():
        INPUT_FILENAME = input('Enter input filename : ')
        COL_NAME = input('Enter column on which you want to process on : ')
        df = DatasetReader('Datasets/{}'.format(INPUT_FILENAME)).read_dataset()
        # replace() returns a new Series, so assign it back to the column
        df[COL_NAME] = df[COL_NAME].replace(np.nan, "EMPTY")
        # num_reviews = len(df)
        # print('Running process for {} reviews'.format(num_reviews))
        multiprocess_dataset(formatted_df=df, COL_NAME=COL_NAME)


    if __name__ == '__main__':
        main()

Actual result / stack trace when running the program:
    (phrase) viole@viole-X510UNR:~/Documents$ python3.6 program.py
    /home/viole/Documents/phrase/lib/python3.6/site-packages/allennlp/service/predictors/__init__.py:23: FutureWarning: allennlp.service.predictors.* has been depreciated. Please use allennlp.predictors.*
      "Please use allennlp.predictors.*", FutureWarning)
    /home/viole/Documents/phrase/lib/python3.6/site-packages/torch/nn/modules/container.py:434: UserWarning: Setting attributes on ParameterList is not supported.
      warnings.warn("Setting attributes on ParameterList is not supported.")
    Enter input filename : accomm_dataset.csv
    Enter column on which you want to process on : comments
    Multicore process 4 batches of 6386 reviews each
    File output_multiprocess_input_file_b1_6386_reviews.csv started processing!
    <---Expanding Contraction--->
    100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10205.12it/s]
    0%| | 0/1 [00:00<?, ?it/s]File output_multiprocess_input_file_b2_6386_reviews.csv started processing!
    <---Expanding Contraction--->
    100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10255.02it/s]
    0%| | 0/1 [00:00<?, ?it/s]File output_multiprocess_input_file_b3_6386_reviews.csv started processing!
    <---Expanding Contraction--->
    100%|████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.89it/s]
    <---Expanding Contraction--->
    100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 25115.59it/s]
    100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3446.43it/s]
    100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13706.88it/s]
    File output_multiprocess_input_file_b4_6386_reviews.csv started processing!
    100%|████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 59.78it/s]
    <---Expanding Contraction--->
    100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 26546.23it/s]
    0%| | 0/1 [00:00<?, ?it/s]<---Expanding Contraction--->
    100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3440.77it/s]
    0%| | 0/1 [00:00<?, ?it/s]Your label namespace was 'pos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary. See documentation for `non_padded_namespaces` parameter in Vocabulary.
    0%| | 0/1 [00:00<?, ?it/s]Your label namespace was 'pos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary. See documentation for `non_padded_namespaces` parameter in Vocabulary.
    100%|███████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 171.89it/s]
    100%|████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 44.97it/s]
    <---Expanding Contraction--->
    100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 27962.03it/s]
    100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2983.15it/s]
    100%|████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.82it/s]
    <---Expanding Contraction--->
    100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 19508.39it/s]
    100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2906.66it/s]
    0%| | 0/1 [00:00<?, ?it/s]Your label namespace was 'pos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary. See documentation for `non_padded_namespaces` parameter in Vocabulary.
    Your label namespace was 'pos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary. See documentation for `non_padded_namespaces` parameter in Vocabulary.

I checked my System Monitor to see resources being used -- all cores are lying idle. I tried running Extractor without multiprocessing and it seemed to work alright. Is there anything I am missing here?
Any help would be appreciated!
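
One way to narrow this down is to check how each child process ended. The following is a minimal sketch, not a diagnosis of the program above: fake_extraction, describe_exit, and run_and_report are hypothetical names, and the worker is only a stand-in for batched_extraction() that fails on purpose for one batch. It relies on the standard multiprocessing.Process members exitcode, is_alive(), terminate(), and join(timeout).

```python
import multiprocessing
import sys


def fake_extraction(batch, batch_name):
    """Hypothetical stand-in for batched_extraction(); fails on purpose for batch 'b2'."""
    if batch_name == "b2":
        # Simulated failure. An uncaught exception would show up as exit code 1 instead.
        sys.exit(3)
    word_counts = [len(text.split()) for text in batch]
    print("{} processed {} rows".format(batch_name, len(word_counts)))


def describe_exit(name, exitcode):
    """Turn a Process.exitcode value into a readable status line."""
    if exitcode is None:
        return "{}: still running (possible hang)".format(name)
    if exitcode == 0:
        return "{}: finished normally".format(name)
    if exitcode < 0:
        return "{}: killed by signal {}".format(name, -exitcode)
    return "{}: exited with error code {}".format(name, exitcode)


def run_and_report(batches, timeout=30):
    """Start one process per batch, wait up to `timeout` seconds for each, report status."""
    procs = []
    for name, batch in batches.items():
        p = multiprocessing.Process(name=name, target=fake_extraction, args=(batch, name))
        p.start()
        procs.append(p)

    report = []
    for p in procs:
        p.join(timeout)  # returns after `timeout` seconds even if the child is stuck
        report.append(describe_exit(p.name, p.exitcode))
        if p.is_alive():
            p.terminate()  # do not leave a blocked child behind
    return report


if __name__ == "__main__":
    # Made-up sample data, only for illustration
    demo = {"b1": ["good room", "nice view"], "b2": ["bad wifi"]}
    for line in run_and_report(demo, timeout=10):
        print(line)
```

If a child reports a non-zero exit code, the error happened inside the worker. If it is still alive after the timeout, the worker is blocked rather than computing, which would match idle cores in System Monitor.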
https://stackoverflow.com/questions/67238443/multiprocessing-not-working-for-same-function April 24, 2021 at 09:04AM