Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
python - How to select the last row, and how to access a PySpark dataframe by index?

From a PySpark SQL dataframe like

name age city
abc   20  A
def   30  B

How can I get the last row? (With df.limit(1) I can get the first row of the dataframe into a new dataframe.)

And how can I access dataframe rows by index, like row no. 12 or 200?

In pandas I can do

df.tail(1)      # for the last row
df.iloc[rowno]  # by integer position (df.ix is deprecated)
df.loc[label]   # by label

I am just curious whether there is a way to access a PySpark dataframe like this, or some alternative.

Thanks


1 Reply


How to get the last row.

A long and ugly way, which assumes that all columns are orderable:

from pyspark.sql.functions import (
    col, max as max_, struct, monotonically_increasing_id
)

last_row = (df
    .withColumn("_id", monotonically_increasing_id())
    .select(max_(struct("_id", *df.columns)).alias("tmp"))
    .select(col("tmp.*"))
    .drop("_id"))

If not all columns can be ordered, you can try:

with_id = df.withColumn("_id", monotonically_increasing_id())
i = with_id.select(max_("_id")).first()[0]

with_id.where(col("_id") == i).drop("_id")

Note: there is a `last` function in `pyspark.sql.functions` / `o.a.s.sql.functions`, but considering the description of the corresponding expressions, it is not a good choice here.

how can I access the dataframe rows by index?

You cannot. A Spark DataFrame is not accessible by index. You can add indices using zipWithIndex and filter later. Just keep in mind that this is an O(N) operation.

