apache spark - PySpark dataframe convert unusual string format to Timestamp

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am using PySpark through Spark 1.5.0. I have an unusual String format in rows of a column for datetime values. It looks like this:

[Row(datetime='2016_08_21 11_31_08')]

Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss format into a Timestamp? Ideally something along the lines of

df = df.withColumn("date_time",df.datetime.astype('Timestamp'))

I had thought that Spark SQL functions like regexp_replace could work, but of course I need to replace _ with - in the date half and _ with : in the time part.

I was thinking I could split the column in two using substring, counting backward from the end of the time part, then run regexp_replace on each half separately and concatenate the results. But this seems like too many operations. Is there an easier way?

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:30:40+0000

Spark >= 2.2

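from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

# sc is an existing SparkContext (for example, the one in the pyspark shell)
(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    # MM = month, HH = hour (0-23), mm = minute, ss = second
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))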

## +-------------------+-------------------+
## |dt                 |parsed             |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+

Spark < 2.2

It is nothing that unix_timestamp cannot handle:

from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
    # For Spark <= 1.5, cast to double before casting to timestamp
    # See issues.apache.org/jira/browse/SPARK-11724
    .cast("double")
    .cast("timestamp"))
    .show(1, False))

## +-------------------+---------------------+
## |dt                 |parsed               |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+

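In both cases the format string should be compatible with Java SimpleDateFormat: MM is the month, HH the hour on a 24-hour clock, mm the minute, and ss the second. Spark 3.0 and later use their own datetime pattern syntax, which also accepts this pattern.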
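
To sanity-check a few sample values without a Spark cluster, the same layout can be parsed with the Python standard library. This is only a rough sketch and not Spark code: strptime uses %-directives rather than SimpleDateFormat letters, and the helper names below are hypothetical. It also shows the single capture-group rewrite that the regexp_replace idea from the question comes down to.

```python
import re
from datetime import datetime

# Plain-Python counterpart of the Spark pattern "yyyy_MM_dd HH_mm_ss".
# strptime uses %-directives, not SimpleDateFormat letters.
PY_FORMAT = "%Y_%m_%d %H_%M_%S"

# One regex with six capture groups, so "_" becomes "-" in the date
# part and ":" in the time part in a single pass.
UNDERSCORE_TS = re.compile(r"(\d{4})_(\d{2})_(\d{2}) (\d{2})_(\d{2})_(\d{2})")


def parse_underscore_timestamp(value):
    """Hypothetical helper: parse 'YYYY_MM_DD HH_MM_SS' into a datetime."""
    return datetime.strptime(value, PY_FORMAT)


def normalize_underscore_timestamp(value):
    """Hypothetical helper: rewrite the string as 'YYYY-MM-DD HH:MM:SS'."""
    return UNDERSCORE_TS.sub(r"\1-\2-\3 \4:\5:\6", value)


sample = "2016_08_21 11_31_08"
parsed = parse_underscore_timestamp(sample)
normalized = normalize_underscore_timestamp(sample)
print(parsed)      # 2016-08-21 11:31:08
print(normalized)  # 2016-08-21 11:31:08
```

In Spark the same rewrite could be written with regexp_replace, which takes Java regular expressions and refers to groups as $1, $2 and so on. Parsing with an explicit format, as in the snippets above, skips the intermediate string entirely.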
