Thursday, April 29, 2021

PySpark reassign values of duplicate rows

I have a dataframe like this:

id,p1
1,A
2,NaN
3,B
4,NaN
4,NaN
2,C

Using PySpark, I want to remove all the duplicates. However, if an id has both a NaN row and a non-NaN row in the p1 column, I want to keep the non-NaN one and remove the NaN one. For example, I want to remove the first occurrence of id 2 and either of the rows for id 4. Right now I am splitting the dataframe into two dataframes like this:

id,p1
1,A
3,B
2,C

id,p1
2,NaN
4,NaN
4,NaN

I then remove the duplicates from both, and add back the rows from the second dataframe whose ids are not present in the first one. That gives me this dataframe:

id,p1
1,A
3,B
4,NaN
2,C

This is what I have so far:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test').getOrCreate()

d = spark.createDataFrame(
    [(1, "A"),
     (2, None),
     (3, "B"),
     (4, None),
     (4, None),
     (2, "C")],
    ["id", "p"]
)

d1 = d.filter(d.p.isNull())      # rows where p is null
d2 = d.filter(d.p.isNotNull())   # rows where p is not null
d1 = d1.dropDuplicates()
d2 = d2.dropDuplicates()
d3 = d1.join(d2, "id", 'left_anti')  # null rows whose id has no non-null row
d4 = d2.unionByName(d3)

Is there a more elegant way of doing this? It feels redundant, but I can't come up with a better approach. I tried using groupby but couldn't get it to work. Any ideas? Thanks.
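For reference, here is a sketch of what a groupBy-based version could look like (not code from the original post; it reuses the DataFrame d defined above). It relies on the fact that the max aggregate ignores nulls, so each id keeps a non-null p whenever one exists and falls back to null only when every row for that id is null:

from pyspark.sql import functions as F

# max() skips nulls, so each id keeps a non-null p when one is available
d4 = d.groupBy("id").agg(F.max("p").alias("p"))
d4.show()

Note that if an id had several different non-null values, this would keep only the largest one, whereas the split-and-union approach keeps all of them; whether that matters depends on the data.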

https://stackoverflow.com/questions/67327196/pyspark-reassign-values-of-duplicate-rows April 30, 2021 at 09:48AM
