Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

lazy evaluation - How to force Spark to evaluate DataFrame operations inline

According to the Spark RDD docs:

"All transformations in Spark are lazy, in that they do not compute their results right away... This design enables Spark to run more efficiently."

There are times when I need certain operations on my dataframes to happen right here and now. But because dataframe ops are "lazily evaluated" (per above), when I write these operations in the code, there is no guarantee that Spark will actually execute them inline with the rest of the code. For example:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

// Now we need to do a union RIGHT HERE AND NOW, because
// the next few lines of code require the union to have
// already taken place!
val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)

// Now do some stuff with 'unionDataFrame'...

So my workaround for this (so far) has been to run .show() or .count() immediately following my time-sensitive dataframe op, like so:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
unionDataFrame.count()  // Forces the union to execute/compute

// Now do some stuff with 'unionDataFrame'...

...which forces Spark to execute the dataframe op right then and there, inline.

This feels awfully hacky/kludgy to me. So I ask: is there a more generally-accepted and/or efficient way to force dataframe ops to happen on-demand (and not be lazily evaluated)?


1 Reply

No.

You have to call an action to force Spark to do actual work. Transformations won't trigger that effect, and that's one of the reasons to love Spark.
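
To make the difference concrete, here is a minimal sketch, assuming Spark 2.0 or later running locally. The object name, the accumulator, and the UDF are made up for illustration and are not from the question. The accumulator is only there to make the laziness visible; updates made inside transformations are not guaranteed to happen exactly once, so treat the number as indicative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical demo object, not part of the original question.
object LazyTransformationDemo {

  // Returns the accumulator value before and after the action.
  def run(spark: SparkSession): (Long, Long) = {
    val touched = spark.sparkContext.longAccumulator("rows touched")

    // The UDF bumps the accumulator whenever Spark evaluates it for a row.
    val doubleIt = udf((n: Long) => { touched.add(1L); n * 2 })

    // Transformation only: this builds a plan, nothing is computed yet.
    val doubled = spark.range(5).toDF("n").withColumn("doubled", doubleIt(col("n")))
    val before: Long = touched.value // still 0, no job has run

    // Action: collect() needs every column, so the UDF has to run now.
    doubled.collect()
    val after: Long = touched.value // now positive

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lazy-demo").getOrCreate()
    val (before, after) = run(spark)
    println(s"UDF calls before the action: $before, after the action: $after")
    spark.stop()
  }
}
```

The sketch uses collect() rather than count() as the action because the optimizer is free to skip a column that the action does not need.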


By the way, I am pretty sure that Spark knows very well when something must be done "right here and now", so you are probably focusing on the wrong point.


"Can you just confirm that count() and show() are considered 'actions'?"

You can see some of Spark's action functions in the documentation, where count() is listed. show() is not in that list, which covers RDDs, but it is an action on DataFrames too: how could you show the result without doing actual work? :)

"Are you insinuating that Spark would automatically pick up on that, and do the union (just in time)?"

Yes! :)

Spark remembers the transformations you have called, and when an action appears, it runs them at just the right time!
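
One way to see what Spark has "remembered" is to look at the plan it recorded for a DataFrame before any action runs. The sketch below is only an illustration, assuming Spark 2.0 or later (where union replaces the deprecated unionAll). The object name and the spark.range inputs are stand-ins for getSomehow() and getSomehowAlso().

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical demo object; the inputs stand in for the question's dataframes.
object RecordedPlanDemo {

  // Returns the recorded logical plan and the row count produced by the action.
  def run(spark: SparkSession): (String, Long) = {
    val someDataFrame: DataFrame = spark.range(0, 3).toDF("id")
    val someOtherDataFrame: DataFrame = spark.range(3, 6).toDF("id")

    // Transformation: Spark records a Union node in the plan, no job runs.
    val unionDataFrame: DataFrame = someDataFrame.union(someOtherDataFrame)
    val plan: String = unionDataFrame.queryExecution.logical.treeString

    // Action: the recorded plan, union included, is executed here.
    val rows: Long = unionDataFrame.count()

    (plan, rows)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("plan-demo").getOrCreate()
    val (plan, rows) = run(spark)
    println(plan)
    println(s"rows after union: $rows")
    spark.stop()
  }
}
```

Calling unionDataFrame.explain(true) prints the same information, along with the optimized and physical plans, without running the job.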


Something to remember: because Spark does actual work only when an action appears, a logical error in your transformation(s) will not show up until the action takes place!
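
A small sketch of that pitfall, again assuming Spark 2.0 or later in local mode. The deliberately buggy UDF and the object name are invented for the example.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical demo object showing when a bug in a transformation surfaces.
object DeferredErrorDemo {

  // Returns (did defining the transformation succeed?, did the action fail?).
  def run(spark: SparkSession): (Boolean, Boolean) = {
    // Buggy on purpose: integer division by zero when n == 0.
    val invert = udf((n: Long) => 10L / n)

    // Defining the transformation does not run the UDF, so nothing fails here.
    val defined = Try(spark.range(0, 3).toDF("n").withColumn("inv", invert(col("n"))))

    // The action runs the UDF on the row n == 0 and the job fails.
    val executed = defined.flatMap(df => Try(df.collect()))

    (defined.isSuccess, executed.isFailure)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("error-demo").getOrCreate()
    val (definedOk, actionFailed) = run(spark)
    println(s"transformation defined without error: $definedOk")
    println(s"action failed: $actionFailed")
    spark.stop()
  }
}
```

This applies to errors that only occur while rows are being processed. Mistakes that Spark can detect from the plan alone, such as selecting a column that does not exist, are reported as soon as the DataFrame transformation is called.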

