Thursday, March 18, 2021

Repartition by multiple columns in pyspark

I am trying to repartition my dataframe, which contains around 20 million records, and save it as multiple CSV files.

df.repartition('col1','col2','col3').write.csv(path)

I would like to save it into as many CSV files as there are unique combinations of ('col1', 'col2', 'col3'), which can sometimes be around 4000.
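
As an aside, repartition also accepts an explicit partition count together with the columns; a minimal sketch of that variant, reusing the 4000 figure above:

# Request up to 4000 hash partitions keyed by the three columns. Several
# distinct (col1, col2, col3) combinations can hash into the same partition,
# and empty partitions write no file, so the file count can still fall short.
df.repartition(4000, 'col1', 'col2', 'col3').write.csv(path)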

Approaches I tried:

  1. I tried setting the shuffle partition value to 4000 explicitly:

spark.conf.set("spark.sql.shuffle.partitions", 4000)

  2. I tried performing a groupBy and setting the shuffle partition count to the number of groups:

partitioned = final_df.groupBy('col1', 'col2', 'col3').count()

partition_no = partitioned.count()

spark.conf.set("spark.sql.shuffle.partitions", partition_no)
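
For comparison, a minimal sketch of the writer-side partitionBy pattern, which produces one output directory per unique combination regardless of the shuffle setting (assuming a directory-per-combination layout is acceptable):

# Group all rows of each (col1, col2, col3) combination into a single task,
# then let the writer split the output into one directory per combination;
# each directory should then receive a single CSV file.
(final_df
    .repartition('col1', 'col2', 'col3')
    .write
    .partitionBy('col1', 'col2', 'col3')
    .csv(path))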

Both approaches yielded the same result: the number of files written was less than the number of partitions. How can I ensure that the number of CSV files saved matches the number of partitions?

Any help is appreciated.

https://stackoverflow.com/questions/66645552/repartition-by-multiple-columns-in-pyspark March 16, 2021 at 04:57AM
