Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python 2.7 - SparkSQL on pyspark: how to generate time series?

I'm using SparkSQL on pyspark to load some PostgreSQL tables into DataFrames and then build a query that generates several time series based on start and stop columns of type date.

Suppose that my_table contains:

 start      | stop       
-------------------------
 2000-01-01 | 2000-01-05 
 2012-03-20 | 2012-03-23 

In PostgreSQL it's very easy to do that:

SELECT generate_series(start, stop, '1 day'::interval)::date AS dt FROM my_table

and it will generate this table:

 dt
------------
 2000-01-01
 2000-01-02
 2000-01-03
 2000-01-04
 2000-01-05
 2012-03-20
 2012-03-21
 2012-03-22
 2012-03-23

But how can I do that using plain SparkSQL? Will it be necessary to use UDFs or some DataFrame methods?

1 Reply

EDIT
This creates a dataframe with one row containing an array of consecutive dates:

from pyspark.sql.functions import sequence, to_date, explode, col

spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date")

+------------------------------------------+
|                  date                    |
+------------------------------------------+
| ["2018-01-01","2018-02-01","2018-03-01"] |
+------------------------------------------+

You can use the explode function to expand this array into one row per element:

spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date").withColumn("date", explode(col("date")))

+------------+
|    date    |
+------------+
| 2018-01-01 |
| 2018-02-01 |
| 2018-03-01 |
+------------+
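
Applied to the question's my_table, the same sequence-plus-explode pattern might look like the sketch below. It assumes Spark 2.4 or later, an existing SparkSession passed in as spark, and that my_table has already been registered as a table or temporary view. series_query and generate_dates are hypothetical helper names used only for illustration. The column names are backtick-quoted as a precaution in case they collide with SQL keywords.

```python
def series_query(table="my_table", start_col="start", stop_col="stop"):
    """Build a Spark SQL query that expands each (start, stop) row
    into one row per day, both endpoints included."""
    return (
        "SELECT explode(sequence(`{start}`, `{stop}`, interval 1 day)) AS dt "
        "FROM {table}"
    ).format(start=start_col, stop=stop_col, table=table)


def generate_dates(spark, table="my_table"):
    # Assumption: `spark` is a live SparkSession and `table` is visible to it,
    # e.g. after df.createOrReplaceTempView("my_table").
    return spark.sql(series_query(table))
```

With the two rows from the question, generate_dates(spark).show() should list the same nine dates as the PostgreSQL query.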

(End of edit)

Spark 2.4 and later support the sequence function:

sequence(start, stop, step) - Generates an array of elements from start to stop (inclusive), incrementing by step. The type of the returned elements is the same as the type of argument expressions.

Supported types are: byte, short, integer, long, date, timestamp.

Examples:

SELECT sequence(1, 5);

[1,2,3,4,5]

SELECT sequence(5, 1);

[5,4,3,2,1]

SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month);

[2018-01-01,2018-02-01,2018-03-01]

https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#sequence
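
On Spark versions before 2.4, where sequence is not available, the question's other idea (a UDF) is one possible fallback. The plain-Python function below is a minimal sketch of the inclusive date range that sequence produces with a step of whole days. date_series is a hypothetical name, and it only handles ascending ranges. It could presumably be wrapped with pyspark.sql.functions.udf using a return type of ArrayType(DateType()) and then exploded as shown earlier, though that wiring is an assumption and is not tested here.

```python
from datetime import date, timedelta


def date_series(start, stop, step_days=1):
    """Return every date from start to stop (inclusive), step_days apart.

    Returns an empty list when stop is earlier than start.
    """
    if step_days <= 0:
        raise ValueError("step_days must be positive")
    span = (stop - start).days
    return [start + timedelta(days=n) for n in range(0, span + 1, step_days)]


# Example with the second row of my_table from the question
march_days = date_series(date(2012, 3, 20), date(2012, 3, 23))
```

Here march_days holds the four dates from 2012-03-20 through 2012-03-23.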
