I have a dataframe like this:
id,p1
1,A
2,NaN
3,B
4,NaN
4,NaN
2,C
Using PySpark, I want to remove all the duplicates. However, if an id appears both with a NaN and with a non-NaN value in the p1 column, I want to keep the non-NaN row and remove the NaN one. For example, I want to remove the first occurrence of id 2 and either one of the two id 4 rows. Right now I am splitting the dataframe into two dataframes, like this:
id,p1
1,A
3,B
2,C
id,p1
2,NaN
4,NaN
4,NaN
I remove the duplicates from both, then add back the NaN rows whose id is not already in the first dataframe. That gives me this dataframe:
id,p1
1,A
3,B
4,NaN
2,C
This is what I have so far:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test').getOrCreate()

d = spark.createDataFrame(
    [(1, "A"), (2, None), (3, "B"), (4, None), (4, None), (2, "C")],
    ["id", "p"]
)

# Split into rows with a missing p and rows with a value
d1 = d.filter(d.p.isNull())
d2 = d.filter(d.p.isNotNull())

# Drop exact duplicates within each half
d1 = d1.dropDuplicates()
d2 = d2.dropDuplicates()

# Keep only the null rows whose id has no non-null counterpart,
# then add them back to the non-null rows
d3 = d1.join(d2, "id", 'left_anti')
d4 = d2.unionByName(d3)
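To compare against the target dataframe above I sort by id before showing the result, since the row order coming out of the union is not guaranteed:

# Sort so the output is easy to compare with the target dataframe above
d4.orderBy("id").show()

# which should print something like:
# +---+----+
# | id|   p|
# +---+----+
# |  1|   A|
# |  2|   C|
# |  3|   B|
# |  4|null|
# +---+----+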
Is there a more elegant way of doing this? It feels redundant, but I can't come up with a better approach. I tried using groupBy but couldn't get it to work. Any ideas? Thanks.
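For reference, this is roughly the groupBy direction I was attempting. It seems to give the right rows for this sample because max() ignores nulls, but I am not sure it is safe in general: if an id ever had two different non-NaN values, max() would silently keep only one of them, whereas my split-and-union version keeps both.

from pyspark.sql import functions as F

# max() skips nulls, so a non-NaN value wins over NaN for the same id;
# an id with only NaN rows collapses to a single NaN row
d5 = d.groupBy("id").agg(F.max("p").alias("p"))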