pyspark - Extract IDs based on difference in timestamp- spark


posted Feb 17, 2021 in Technique[技术] by 深蓝 (71.8m points)


I have a Spark DataFrame with two columns ("time_stamp" and "ID").

Example dataframe:

      **ID**                **time_stamp**
       1AB               2015-01-23 08:23:16
       1AB               2015-01-23 08:54:40
      25CD               2015-01-23 09:02:20
       1AB               2015-01-23 10:15:36
       1AB               2015-01-23 12:38:40
       1AB               2015-01-24 08:25:16
       1AB               2015-01-24 08:53:40
      25CD               2015-01-24 09:01:20
       1AB               2015-01-24 10:14:36
       1AB               2015-01-24 12:30:40

I want to remove repeated IDs whose timestamp is less than 3 hours after that ID's first occurrence (keeping the first occurrence itself), and keep rows whose timestamp is more than 3 hours after it.

Expected output:

      **ID**                **time_stamp**
       1AB               2015-01-23 08:23:16
      25CD               2015-01-23 09:02:20
       1AB               2015-01-23 12:38:40
       1AB               2015-01-24 08:25:16
      25CD               2015-01-24 09:01:20
       1AB               2015-01-24 12:30:40

Thanks


1 Reply

深蓝 · Answer 1 · 2021-02-16T21:16:24+0000

You can use F.first over a window to get each row's time difference from the first timestamp of the same ID on the same day, and then filter on that difference:

from pyspark.sql import functions as F, Window

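result = df.withColumn(
    'lag',
    # seconds since the first row of the same ID on the same calendar day
    # (assumes time_stamp is a timestamp column, so cast('long') yields epoch seconds)
    F.col('time_stamp').cast('long') -
    F.first('time_stamp')
     .over(Window.partitionBy('ID', F.date_trunc('day', 'time_stamp'))
                 .orderBy('time_stamp'))
     .cast('long')
# keep the first row itself (lag = 0) and rows more than 3 hours after it
).filter('lag > 60*60*3 or lag = 0 or lag is null').drop('lag')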

result.show()
+----+-------------------+
|  ID|         time_stamp|
+----+-------------------+
| 1AB|2015-01-23 08:23:16|
| 1AB|2015-01-23 12:38:40|
|25CD|2015-01-23 09:02:20|
|25CD|2015-01-24 09:01:20|
| 1AB|2015-01-24 08:25:16|
| 1AB|2015-01-24 12:30:40|
+----+-------------------+
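
As a cross-check of the rule, here is a minimal pure-Python sketch (no Spark needed) that applies the same logic to the sample data from the question: group by ID and calendar day, keep the first row of each group, and keep any later row that is more than 3 hours after that first row. The names keep_rows, FMT, and rows are made up for illustration; this is not part of the PySpark API.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Sample data copied from the question
rows = [
    ("1AB", "2015-01-23 08:23:16"),
    ("1AB", "2015-01-23 08:54:40"),
    ("25CD", "2015-01-23 09:02:20"),
    ("1AB", "2015-01-23 10:15:36"),
    ("1AB", "2015-01-23 12:38:40"),
    ("1AB", "2015-01-24 08:25:16"),
    ("1AB", "2015-01-24 08:53:40"),
    ("25CD", "2015-01-24 09:01:20"),
    ("1AB", "2015-01-24 10:14:36"),
    ("1AB", "2015-01-24 12:30:40"),
]


def keep_rows(rows, gap=timedelta(hours=3)):
    """Keep each (ID, day) group's first row plus rows more than `gap` after it."""
    # Sort by timestamp so the first row seen per group is the earliest one
    parsed = sorted((datetime.strptime(ts, FMT), id_) for id_, ts in rows)
    first_seen = {}  # (ID, date) -> earliest timestamp in that group
    kept = []
    for ts, id_ in parsed:
        first = first_seen.setdefault((id_, ts.date()), ts)
        # Mirrors the Spark filter: lag = 0 or lag > 3 hours
        if ts == first or ts - first > gap:
            kept.append((id_, ts.strftime(FMT)))
    return kept


for id_, ts in keep_rows(rows):
    print(id_, ts)
```

This keeps the same six rows as the expected output in the question, listed in timestamp order. Note that, like the window in the answer, the comparison is always against the first row of the day for that ID, not against the most recently kept row.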
