I am new to Spark and I'm not sure of the best way to proceed.
I need to maintain mutable, distributed state that all executors can read from and write to while looping through a DataFrame, and I would like to use Redis to hold that state.
My DataFrame has a composite key of idA and idB, plus some other columns. I want to partition by idA and do either a foreachPartition or a mapPartitions while maintaining a count of idB in Redis.
In Redis I will be building a map from idB to a count, and during my iteration I will check whether I can still process an idB, based on the business rules for its idA and whether the maxCount has been reached in Redis.
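
Roughly what I have in mind, as a sketch: repartition by idA, open one Redis connection per partition on the executor, and use an atomic hash increment for the idB counts. Here I am assuming a plain Jedis client rather than the spark-redis DataFrame API, and redis-host, the input path, and maxCountFor are all placeholders for my real setup and business rules:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import redis.clients.jedis.Jedis

val spark = SparkSession.builder().appName("redis-counts").getOrCreate()
val df = spark.read.parquet("/path/to/input") // placeholder source

// Placeholder for my per-idA business rule.
def maxCountFor(idA: String): Long = 100L

df.repartition(col("idA"))
  .foreachPartition { (rows: Iterator[Row]) =>
    // One connection per partition, created on the executor.
    val jedis = new Jedis("redis-host", 6379)
    try {
      rows.foreach { row =>
        val idA = row.getAs[String]("idA")
        val idB = row.getAs[String]("idB")
        // HINCRBY is atomic, so concurrent partitions can't race on the count.
        val count = jedis.hincrBy(s"counts:$idA", idB, 1L)
        if (count <= maxCountFor(idA)) {
          // Still under the limit: process this row.
        }
      }
    } finally {
      jedis.close()
    }
  }
```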
What is the best way to accomplish this?
Can I call spark.read.format("org.apache.spark.sql.redis").options(...).load() and df.write.format("org.apache.spark.sql.redis").save() while I am iterating through the DataFrame?
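
For reference, these are the spark-redis calls I am asking about (a sketch; the "counts" table name and the key.column option are my assumptions about how the RedisLabs spark-redis connector is configured):

```scala
// Read the current counts table from Redis into a DataFrame.
val countsDf = spark.read
  .format("org.apache.spark.sql.redis")
  .option("table", "counts")
  .option("key.column", "idB")
  .load()

// Write updated counts back to Redis.
countsDf.write
  .format("org.apache.spark.sql.redis")
  .option("table", "counts")
  .option("key.column", "idB")
  .mode("append")
  .save()
```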
Has anyone done something similar, or does anyone have suggestions on how to accomplish this?