apache spark - What's the meaning of DStream.foreachRDD function?

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:28:22+0000

A DStream or "discretized stream" is an abstraction that breaks a continuous stream of data into small chunks. This is called "microbatching". Each microbatch becomes an RDD that is given to Spark for further processing. There's one and only one RDD produced for each DStream at each batch interval.
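
To make the idea of microbatching concrete, here is a toy model in plain Scala, with no Spark involved. It is only a sketch of the concept and not how Spark implements it; Event, microBatches and batchIntervalMs are names made up for this illustration. It mirrors the rule above: every interval yields exactly one batch, even when nothing arrived.

```scala
// Toy model of microbatching in plain Scala (no Spark on the classpath needed).
// Event, microBatches and batchIntervalMs are hypothetical names for illustration.
final case class Event(arrivalMs: Long, payload: String)

// Cut a sequence of timestamped events into one batch per interval.
// Intervals in which nothing arrived still yield an (empty) batch.
def microBatches(events: Seq[Event], batchIntervalMs: Long): Seq[Seq[String]] =
  if (events.isEmpty) Seq.empty
  else {
    val byInterval = events.groupBy(_.arrivalMs / batchIntervalMs)
    val lastInterval = byInterval.keys.max
    (0L to lastInterval).map { i =>
      byInterval.getOrElse(i, Seq.empty).sortBy(_.arrivalMs).map(_.payload)
    }
  }

val events = Seq(
  Event(100, "a"), Event(900, "b"), // first second
  Event(1200, "c"),                 // second second
  Event(3500, "d")                  // fourth second; the third second is empty
)
val batches = microBatches(events, batchIntervalMs = 1000)
```

In this model each inner Seq plays the role of one RDD, and the outer Seq plays the role of the DStream.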

An RDD is a distributed collection of data. Think of it as a set of pointers to where the actual data is in a cluster.

DStream.foreachRDD is an "output operation" (sometimes called an output operator) in Spark Streaming. It gives you access to the underlying RDD of each batch so that you can run actions that do something practical with the data. For example, using foreachRDD you could write each batch to a database.
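
As a rough sketch of the database case, the pattern commonly shown for this is to open one connection per partition inside foreachRDD, so the connection is created on the executor instead of being shipped from the driver. DbConnection and DbSink below are hypothetical stand-ins invented for this example (the "insert" only prints); a real client would replace them.

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical stand-in for a database client; not part of Spark.
class DbConnection extends AutoCloseable {
  def insert(record: String): Unit = println(s"INSERT $record")
  override def close(): Unit = ()
}

object DbSink {
  // Writes one partition's records over a single connection and
  // returns how many records were written.
  def writePartition(records: Iterator[String]): Int = {
    val conn = new DbConnection()
    try {
      var written = 0
      records.foreach { r => conn.insert(r); written += 1 }
      written
    } finally conn.close()
  }

  // Registers the output operation: for every batch RDD, each partition
  // is written on the executor that holds it.
  def save(lines: DStream[String]): Unit =
    lines.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        writePartition(records)
        () // foreachPartition expects a function returning Unit
      }
    }
}
```

Calling DbSink.save(stream) before ssc.start() would attach this sink to a stream; nothing is written until the streaming context is started.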

The little mind twist here is to understand that a DStream is a time-bound collection. Let me contrast this with a classical collection: Take a list of users and apply a foreach to it:

val userList: List[User] = ???
userList.foreach{user => doSomeSideEffect(user)}

This will apply the side-effecting function doSomeSideEffect to each element of the userList collection.

Now, let's say that we don't know all the users up front, so we cannot build a list of them. Instead, we have a stream of users, like people arriving at a coffee shop during the morning rush:

val userDStream: DStream[User] = ???
userDStream.foreachRDD{usersRDD =>
    usersRDD.foreach{user => serveCoffee(user)}
}

Note that:

DStream.foreachRDD gives you an RDD[User], not a single user. Going back to our coffee example, that RDD is the collection of users that arrived during one batch interval.
To access single elements of the collection, we need to operate further on the RDD. In this case, I'm using rdd.foreach to serve coffee to each user.
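
For something that can actually be run, here is a minimal sketch of the coffee shop in local mode. It assumes Spark Streaming is on the classpath and uses queueStream to fake arrivals with two pre-built batches; the User fields, CoffeeShopSketch, and the 1-second batch interval are choices made for this example, not part of the original question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

final case class User(name: String)

object CoffeeShopSketch {
  // Runs the stream for a few seconds and returns, per non-empty batch,
  // the sorted names of the users seen in that batch.
  def run(): Seq[Seq[String]] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("foreachRDD-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    val servedPerBatch = mutable.ArrayBuffer.empty[Seq[String]]
    try {
      // Two pre-built RDDs stand in for users arriving over time.
      val arrivals = mutable.Queue[RDD[User]](
        ssc.sparkContext.parallelize(Seq(User("ana"), User("bo"))),
        ssc.sparkContext.parallelize(Seq(User("cy")))
      )
      val userDStream = ssc.queueStream(arrivals, oneAtATime = true)

      userDStream.foreachRDD { usersRDD =>
        // This block runs on the driver, once per batch interval.
        // The inner foreach runs on the executors, once per user.
        usersRDD.foreach(user => println(s"coffee for ${user.name}"))
        // collect() is used here only so the sketch can report back to the driver.
        val names = usersRDD.map(_.name).collect().toSeq.sorted
        // Once the queue is drained the batches are empty; skip those.
        if (names.nonEmpty) servedPerBatch.synchronized { servedPerBatch += names }
      }

      ssc.start()
      ssc.awaitTerminationOrTimeout(6000)
    } finally ssc.stop(stopSparkContext = true, stopGracefully = false)
    servedPerBatch.synchronized(servedPerBatch.toList)
  }
}
```

Each entry in the returned sequence corresponds to one RDD handed to foreachRDD, which is the point of the first note above: the function sees a whole batch of users at a time.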

To think about execution: We might have a cluster of baristas making coffee. Those are our executors. Spark Streaming takes care of making a small batch of users (or orders) and Spark will distribute the work across the baristas, so that we can parallelize the coffee making and speed up the coffee serving.
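
The barista picture can be sketched with a plain RDD, since each batch handed to foreachRDD is just an RDD. The example below assumes a local Spark with four worker threads and treats each partition index as a "barista"; BaristaSketch and the order names are invented for the illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BaristaSketch {
  // Splits eight orders over four partitions and reports which
  // partition ("barista") handled which orders.
  def run(): Map[Int, Seq[String]] = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("barista-sketch")
    val sc = new SparkContext(conf)
    try {
      val orders = (1 to 8).map(i => s"order-$i")
      sc.parallelize(orders, numSlices = 4)
        // Tag every order with the index of the partition that processes it.
        .mapPartitionsWithIndex { (barista, it) => it.map(order => (barista, order)) }
        .collect()
        .groupBy(_._1)
        .map { case (barista, pairs) => barista -> pairs.map(_._2).toSeq }
    } finally sc.stop()
  }
}
```

On a real cluster the partitions would be spread over executors on different machines; in local mode they are simply processed by separate threads.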
