python - Spark add new column to dataframe with value from previous row

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm wondering how I can achieve the following in Spark (PySpark).

Initial Dataframe:

+--+---+
|id|num|
+--+---+
|4 |9.0|
|3 |7.0|
|2 |3.0|
|1 |5.0|
+--+---+

Resulting Dataframe:

+--+---+-------+
|id|num|new_Col|
+--+---+-------+
|4 |9.0|  7.0  |
|3 |7.0|  3.0  |
|2 |3.0|  5.0  |
+--+---+-------+

I can generally "append" new columns to a dataframe by using something like: df.withColumn("new_Col", df.num * 10)

However, I have no idea how to achieve this "shift of rows" for the new column, so that it holds the value of a field from the previous row (as shown in the example). I also couldn't find anything in the API documentation on how to access a particular row of a DataFrame by index.

Any help would be appreciated.

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:10:24+0000

You can use the lag window function as follows:

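from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

# `sc` is the SparkContext that the pyspark shell provides.
df = sc.parallelize([(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]).toDF(["id", "num"])
# No partitionBy: a single global window ordered by id.
w = Window.orderBy(col("id"))
# lag("num") is the previous row's num; na.drop() removes the first row, which has no predecessor.
df.select("*", lag("num").over(w).alias("new_col")).na.drop().show()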

## +---+---+-------+
## | id|num|new_col|
## +---+---+-------+
## |  2|3.0|    5.0|
## |  3|7.0|    3.0|
## |  4|9.0|    7.0|
## +---+---+-------+

But there are some important issues:

If you need a global operation (not partitioned by some other column or columns), it is extremely inefficient, because Spark moves all rows into a single partition to evaluate the window.
You need a natural way to order your data.

While the second issue is almost never a problem, the first one can be a deal-breaker. If this is the case, you should simply convert your DataFrame to an RDD and compute the lag manually. See for example:

How to transform data with sliding window over time series data in Pyspark
Apache Spark Moving Average (written in Scala, but can be adjusted for PySpark. Be sure to read the comments first).
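One possible way to compute the lag manually on an RDD is sketched below. It is not taken from the linked answers. It assumes the RDD is already globally sorted, and the names lag_partition and lag_rdd are invented for this sketch. Each partition is scanned locally, and only the last value of each partition travels through the driver.

```python
def lag_partition(values, carry=None):
    """Pair each value with its predecessor inside one ordered partition.

    `carry` is the last value of the preceding partition (None for the first).
    """
    prev = carry
    for value in values:
        yield value, prev
        prev = value


def lag_rdd(sorted_rdd):
    """Manual lag(1) over an RDD that is assumed to be globally sorted."""

    # One small record per partition: (partition index, non-empty flag, last value).
    def tail(index, values):
        seen, last = False, None
        for last in values:
            seen = True
        yield index, seen, last

    tails = sorted(sorted_rdd.mapPartitionsWithIndex(tail).collect())

    # carry[i] = last value of the nearest non-empty partition before partition i.
    carry, running = {}, None
    for index, seen, last in tails:
        carry[index] = running
        if seen:
            running = last

    # Second pass: lag inside each partition, seeded with the carried value.
    return sorted_rdd.mapPartitionsWithIndex(
        lambda index, values: lag_partition(values, carry[index])
    )


# Hypothetical usage with the DataFrame from the answer:
#   nums = df.orderBy("id").rdd.map(lambda row: row["num"])
#   lag_rdd(nums).collect()
```

The RDD is traversed twice (once for the partition tails, once for the result), so caching it first may be worthwhile if it is expensive to recompute.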
