According to the Spark Structured Streaming + Kafka Integration Guide, Spark itself keeps track of the offsets and no offsets are committed back to Kafka. That means that if your Structured Streaming job fails and you restart it, all necessary information about the offsets is stored in Spark's checkpoint files. That way your application knows where it left off and continues processing the remaining data.
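To make this concrete, here is a minimal sketch of a Kafka-backed query with a checkpoint location. The broker address, topic name, and checkpoint path are placeholders I chose for illustration, not values from the guide:

```scala
import org.apache.spark.sql.SparkSession

object KafkaCheckpointExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-checkpoint-example")
      .getOrCreate()

    // The Kafka source tracks the consumed offsets itself; this query
    // never commits anything back to Kafka.
    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "my-topic")                      // placeholder topic
      .load()

    // All offset/progress information is persisted under checkpointLocation,
    // so a restarted query resumes exactly where it left off.
    val query = kafkaDf.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/my-query") // placeholder path
      .start()

    query.awaitTermination()
  }
}
```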
I have written more details about setting group.id and Spark's checkpointing of offsets in another post.
Here are the most important Kafka-specific configurations for your Spark Structured Streaming jobs:
group.id: Kafka source will create a unique group id for each query automatically. According to the code, the group.id will automatically be set to
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it (see the sketch after this list).
enable.auto.commit: Kafka source doesn't commit any offset.
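The following sketch shows how these settings play out in practice: startingOffsets replaces auto.offset.reset, and neither group.id nor enable.auto.commit is passed as a kafka.* option. It assumes an existing SparkSession named spark (as in the sketch above); broker and topic are placeholders:

```scala
// Assumes an existing SparkSession named `spark`; broker and topic are placeholders.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my-topic")
  // Replaces auto.offset.reset: "earliest", "latest", or a JSON string
  // with explicit per-partition offsets.
  .option("startingOffsets", "earliest")
  // Note: no "kafka.group.id" and no "kafka.enable.auto.commit" here;
  // Spark generates the group id and never commits offsets back to Kafka.
  .load()
```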
Therefore, in Structured Streaming it is currently not possible to define your own group.id for the Kafka consumer; Structured Streaming manages the offsets internally and does not commit them back to Kafka (not even automatically).