I am new to Spark and I'm not sure of the best way to proceed.
I need to maintain mutable, distributed state that all executors can read from and write to while looping through a DataFrame, and I would like to use Redis to hold that state.
My DataFrame has a composite key of idA and idB, plus some other columns. I want to partition by idA and do either a foreachPartition or a mapPartitions while maintaining a count of idB in Redis.
In Redis I will be building a map from idB to a count, and during my iteration I will check whether I can still process an idB, based on the business rules for its idA and whether the maxCount has been reached in Redis.
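
Roughly what I have in mind, as a sketch: repartition by idA, open one Redis connection per partition on the executor, and use an atomic hash increment for the idB counts. Here I am assuming a plain Jedis client rather than the spark-redis DataFrame API, and redis-host, the input path, and maxCountFor are all placeholders for my real setup and business rules:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import redis.clients.jedis.Jedis

val spark = SparkSession.builder().appName("redis-counts").getOrCreate()
val df = spark.read.parquet("/path/to/input") // placeholder source

// Placeholder for my per-idA business rule.
def maxCountFor(idA: String): Long = 100L

df.repartition(col("idA"))
  .foreachPartition { (rows: Iterator[Row]) =>
    // One connection per partition, created on the executor.
    val jedis = new Jedis("redis-host", 6379)
    try {
      rows.foreach { row =>
        val idA = row.getAs[String]("idA")
        val idB = row.getAs[String]("idB")
        // HINCRBY is atomic, so concurrent partitions can't race on the count.
        val count = jedis.hincrBy(s"counts:$idA", idB, 1L)
        if (count <= maxCountFor(idA)) {
          // Still under the limit: process this row.
        }
      }
    } finally {
      jedis.close()
    }
  }
```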
What is the best way to accomplish this?
Can I call spark.read.format("org.apache.spark.sql.redis").options(...).load() and df.write.format("org.apache.spark.sql.redis").save() while I am iterating through the DataFrame?
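
For reference, these are the spark-redis calls I am asking about (a sketch; the "counts" table name and the key.column option are my assumptions about how the RedisLabs spark-redis connector is configured):

```scala
// Read the current counts table from Redis into a DataFrame.
val countsDf = spark.read
  .format("org.apache.spark.sql.redis")
  .option("table", "counts")
  .option("key.column", "idB")
  .load()

// Write updated counts back to Redis.
countsDf.write
  .format("org.apache.spark.sql.redis")
  .option("table", "counts")
  .option("key.column", "idB")
  .mode("append")
  .save()
```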
Has anyone done something similar, or does anyone have suggestions on how to accomplish this?