I have a Spark DataFrame with two columns, "ID" and "time_stamp".
Example dataframe:
**ID** **time_stamp**
1AB 2015-01-23 08:23:16
1AB 2015-01-23 08:54:40
25CD 2015-01-23 09:02:20
1AB 2015-01-23 10:15:36
1AB 2015-01-23 12:38:40
1AB 2015-01-24 08:25:16
1AB 2015-01-24 08:53:40
25CD 2015-01-24 09:01:20
1AB 2015-01-24 10:14:36
1AB 2015-01-24 12:30:40
For each ID, I want to keep the first occurrence and remove any later rows with the same ID whose timestamp is less than 3 hours after that kept row. A row with the same ID that is more than 3 hours after the last kept row should be kept, and it then becomes the reference for the rows that follow.
Expected output:
**ID** **time_stamp**
1AB 2015-01-23 08:23:16
25CD 2015-01-23 09:02:20
1AB 2015-01-23 12:38:40
1AB 2015-01-24 08:25:16
25CD 2015-01-24 09:01:20
1AB 2015-01-24 12:30:40
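To pin down the rule, here is a minimal sketch in plain Python (not Spark code). The function name `dedupe_within_window` and the tuple layout are only illustrative, and treating a gap of exactly 3 hours as "keep" is an assumption, since the description above only covers less than and greater than 3 hours.

```python
from datetime import datetime, timedelta


def dedupe_within_window(rows, window=timedelta(hours=3)):
    """Keep a row only if the last kept row with the same ID is at least `window` older.

    `rows` is an iterable of (id, timestamp) tuples. Hypothetical helper,
    written only to illustrate the rule described above.
    """
    last_kept = {}  # ID -> timestamp of the most recently kept row for that ID
    kept = []
    for row_id, ts in sorted(rows, key=lambda r: r[1]):  # process in time order
        previous = last_kept.get(row_id)
        # Assumption: a gap of exactly 3 hours counts as "keep".
        if previous is None or ts - previous >= window:
            kept.append((row_id, ts))
            last_kept[row_id] = ts
    return kept


def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


# The example data from above.
sample = [
    ("1AB", parse("2015-01-23 08:23:16")),
    ("1AB", parse("2015-01-23 08:54:40")),
    ("25CD", parse("2015-01-23 09:02:20")),
    ("1AB", parse("2015-01-23 10:15:36")),
    ("1AB", parse("2015-01-23 12:38:40")),
    ("1AB", parse("2015-01-24 08:25:16")),
    ("1AB", parse("2015-01-24 08:53:40")),
    ("25CD", parse("2015-01-24 09:01:20")),
    ("1AB", parse("2015-01-24 10:14:36")),
    ("1AB", parse("2015-01-24 12:30:40")),
]

result = dedupe_within_window(sample)
for row_id, ts in result:
    print(row_id, ts)
```

Each row is compared against the last kept row for its ID rather than against the immediately preceding row, which is why 1AB at 10:15:36 is dropped even though it is more than an hour after the 08:54:40 row.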
Thanks