Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
152 views
in Technique[技术] by (71.8m points)

python - How to calculate date difference in pyspark?

I have data like this:

df = sqlContext.createDataFrame([
    ('1986/10/15', 'z', 'null'), 
    ('1986/10/15', 'z', 'null'),
    ('1986/10/15', 'c', 'null'),
    ('1986/10/15', 'null', 'null'),
    ('1986/10/16', 'null', '4.0')],
    ('low', 'high', 'normal'))

I want to calculate the date difference between low column and 2017-05-02 and replace low column with the difference. I've tried related solutions on stackoverflow but neither of them works.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You need to cast the column low to class date and then you can use datediff() in combination with lit(). Using Spark 2.2:

from pyspark.sql.functions import datediff, to_date, lit

df.withColumn("test", 
              datediff(to_date(lit("2017-05-02")),
                       to_date("low","yyyy/MM/dd"))).show()
+----------+----+------+-----+
|       low|high|normal| test|
+----------+----+------+-----+
|1986/10/15|   z|  null|11157|
|1986/10/15|   z|  null|11157|
|1986/10/15|   c|  null|11157|
|1986/10/15|null|  null|11157|
|1986/10/16|null|   4.0|11156|
+----------+----+------+-----+

Using < Spark 2.2, we need to convert the the low column to class timestamp first:

from pyspark.sql.functions import datediff, to_date, lit, unix_timestamp

df.withColumn("test", 
              datediff(to_date(lit("2017-05-02")),
                       to_date(unix_timestamp('low', "yyyy/MM/dd").cast("timestamp")))).show()

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...