I have this PySpark DataFrame:
| event_id | occurred_at         | logged_at           |
|----------|---------------------|---------------------|
| 1001     | 2021-05-03 11:00:00 | 2021-05-03 11:00:01 |
| 1001     | 2021-05-03 11:00:00 | 2021-05-03 11:00:02 |
| 1002     | 2021-05-03 11:00:02 | 2021-05-03 11:00:03 |
| 1002     | 2021-05-03 11:00:03 | 2021-05-03 11:00:03 |
| 1003     | 2021-05-03 11:00:04 | 2021-05-03 11:00:05 |
I would like to keep only one row per event_id: the one with the minimal occurred_at and, when occurred_at is tied, the minimal logged_at. For the sample above, I would like to get:
| event_id | occurred_at         | logged_at           |
|----------|---------------------|---------------------|
| 1001     | 2021-05-03 11:00:00 | 2021-05-03 11:00:01 |
| 1002     | 2021-05-03 11:00:02 | 2021-05-03 11:00:03 |
| 1003     | 2021-05-03 11:00:04 | 2021-05-03 11:00:05 |
How can I write a PySpark snippet to do this?
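Here is a minimal sketch of the direction I am considering (the DataFrame below is just a recreation of the sample data; the approach assumes a window partitioned by event_id, ordered by occurred_at then logged_at, with row_number used to keep the earliest row per key). Is this the right way to do it?

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Recreation of the sample data (timestamps kept as strings here;
# ISO-formatted strings sort in chronological order).
df = spark.createDataFrame(
    [
        (1001, "2021-05-03 11:00:00", "2021-05-03 11:00:01"),
        (1001, "2021-05-03 11:00:00", "2021-05-03 11:00:02"),
        (1002, "2021-05-03 11:00:02", "2021-05-03 11:00:03"),
        (1002, "2021-05-03 11:00:03", "2021-05-03 11:00:03"),
        (1003, "2021-05-03 11:00:04", "2021-05-03 11:00:05"),
    ],
    ["event_id", "occurred_at", "logged_at"],
)

# Rank the rows within each event_id by occurred_at, breaking ties
# with logged_at, and keep only the first (earliest) row per key.
w = Window.partitionBy("event_id").orderBy(
    F.col("occurred_at").asc(), F.col("logged_at").asc()
)

deduped = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)

deduped.show(truncate=False)
```

I also thought about `groupBy("event_id").agg(F.min("occurred_at"), F.min("logged_at"))`, but that could combine occurred_at and logged_at taken from different rows, which is why the sketch uses a window instead.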
https://stackoverflow.com/questions/67377794/pyspark-dataframe-event-deduplicate