Monday, May 3, 2021

pySpark dataframe event deduplicate

I have this pySpark dataframe

event_id  occurred_at          logged_at
1001      2021-05-03 11:00:00  2021-05-03 11:00:01
1001      2021-05-03 11:00:00  2021-05-03 11:00:02
1002      2021-05-03 11:00:02  2021-05-03 11:00:03
1002      2021-05-03 11:00:03  2021-05-03 11:00:03
1003      2021-05-03 11:00:04  2021-05-03 11:00:05

I would like to keep only one row per event_id, namely the one with the minimal occurred_at and logged_at. For the sample above, I would like to get:

event_id  occurred_at          logged_at
1001      2021-05-03 11:00:00  2021-05-03 11:00:01
1002      2021-05-03 11:00:02  2021-05-03 11:00:03
1003      2021-05-03 11:00:04  2021-05-03 11:00:05

How can I write a pySpark snippet to do this?

https://stackoverflow.com/questions/67377794/pyspark-dataframe-event-deduplicate May 04, 2021 at 09:48AM
