I am trying to repartition my dataframe, which contains around 20 million records, and save it as multiple CSV files:
df.repartition('col1','col2','col3').write.csv(path)
I would like to save it into as many CSV files as there are unique combinations of ('col1', 'col2', 'col3'), which can sometimes be around 4000.
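To get a feel for how many files to expect, the number of unique combinations can be checked up front, roughly like this (df is the dataframe from the snippet above):

# count the distinct ('col1', 'col2', 'col3') combinations actually present in the data
expected_files = df.select('col1', 'col2', 'col3').distinct().count()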
Approaches I tried:
- I tried setting the shuffle partition value to 4000 explicitly:
spark.conf.set("spark.sql.shuffle.partitions", 4000)
- I tried performing a group by and setting the number of shuffle partitions to the number of groups (see the sketch after this list):
partitioned = final_df.groupBy('col1', 'col2', 'col3').count()  # one row per unique combination
partition_no = partitioned.count()  # number of groups
spark.conf.set("spark.sql.shuffle.partitions", partition_no)
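Putting that second attempt together, the full flow looks roughly like this (final_df is my dataframe and path is the output location):

# number of unique ('col1', 'col2', 'col3') combinations, i.e. one group per desired file
partition_no = final_df.groupBy('col1', 'col2', 'col3').count().count()

# ask the shuffle to produce that many partitions
spark.conf.set("spark.sql.shuffle.partitions", partition_no)

# repartition on the same columns and write the result out as CSV
final_df.repartition('col1', 'col2', 'col3').write.csv(path)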
Both approaches yielded the same result: the number of files was smaller than the number of partitions. How can I ensure that the number of CSV files saved equals the number of partitions?
Any help is appreciated.