apache spark - Split Contents of String column in PySpark Dataframe

Question

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a PySpark DataFrame which has a column containing strings. I want to split this column into words.

Code:

>>> sentenceData = sqlContext.read.load('file://sample1.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
>>> sentenceData.show(truncate=False)
+---+---------------------------+
|key|desc                       |
+---+---------------------------+
|1  |Virat is good batsman      |
|2  |sachin was good            |
|3  |but modi sucks big big time|
|4  |I love the formulas        |
+---+---------------------------+


Expected Output
---------------

>>> sentenceData.show(truncate=False)
+---+-------------------------------------+
|key|desc                                 |
+---+-------------------------------------+
|1  |[Virat,is,good,batsman]              |
|2  |[sachin,was,good]                    |
|3  |....                                 |
|4  |...                                  |
+---+-------------------------------------+

How can I achieve this?

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:06:46+0000

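Use the split function from pyspark.sql.functions. It splits a string column on a regular expression and returns an array column: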

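from pyspark.sql.functions import split

# Split on runs of whitespace; "\\s+" is the regex \s+ with the backslash escaped.
sentenceData = sentenceData.withColumn("desc", split("desc", "\\s+"))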
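
The second argument to split is a regular expression, so the backslash matters: \s+ matches runs of whitespace, while a bare s+ matches the letter "s". A minimal sketch of the difference, using Python's standard re module on two rows from the question. This only illustrates the pattern; it is not Spark itself, and the helper name split_words is made up for the example. Spark evaluates the pattern as a Java regular expression, which should behave the same way for this plain-ASCII input.

```python
import re

# Two sample rows copied from the question's DataFrame.
rows = [
    (1, "Virat is good batsman"),
    (2, "sachin was good"),
]


def split_words(text, pattern=r"\s+"):
    """Split text on every match of the regular expression `pattern`."""
    # split_words is a hypothetical helper for illustration only.
    return re.split(pattern, text)


# Intended pattern: one or more whitespace characters.
words = {key: split_words(desc) for key, desc in rows}
print(words[1])  # ['Virat', 'is', 'good', 'batsman']
print(words[2])  # ['sachin', 'was', 'good']

# Pattern with the backslash lost: splits on the letter "s" instead.
broken = split_words("sachin was good", pattern="s+")
print(broken)  # ['', 'achin wa', ' good']
```

If the desc column comes back with pieces such as "achin wa", the pattern has most likely lost its backslash somewhere between the source and the call.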
