Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
263 views
in Technique[技术] by (71.8m points)

How spark calculates the window start time with given window interval?

Consider I have a input df with a timestamp field column and when setting window duration (with no sliding interval) as :

10 minutes

with input of time(2019-02-28 22:33:02)
window formed is as (2019-02-28 22:30:02) to (2019-02-28 22:40:02)

8 minutes

with same input of time(2019-02-28 22:33:02)
window formed is as (2019-02-28 22:26:02) to (2019-02-28 22:34:02)

5 minutes

with same input of time(2019-02-28 22:33:02)
window formed is as (2019-02-28 22:30:02) to (2019-02-28 22:35:02)

14 minutes

with input of time(2019-02-28 22:33:02)
window formed is as (2019-02-28 22:32:02) to (2019-02-28 22:46:02)


So, my question here is :

How does spark calculates the start time of a window with a given input of ts ?

question from:https://stackoverflow.com/questions/65903327/how-spark-calculates-the-window-start-time-with-given-window-interval

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This is explained in the section "Understanding How Intervals are computed" in the "Stream Processing with Apache Spark" book published by O'Reilly:

"The window intervals are aligned to the start of the second/minute/hour/day that corresponds to the next" upper time magnitude of the time unit used."

In your case you are always using minutes so the next upper time magnitude is "hour". Therefore it tries to reach the start of the hour. Your cases in more details (forget about the 2 seconds, this is just a delay in the internals):

  • 10 minutes: 22:40 + 10 + 10 -> start of the hour
  • 8 minutes: 22:34 + 8 + 8 + 8 -> start of the hour
  • 5 minutes: 22:35 + 5 + 5 + ... + 5 -> start of the hour
  • 14 minutes: 22:46 + 14 -> start of the hour

It is independent of the incoming data and its timestamp/event_time.

As an additional node, the lower window boundary is inclusive whereas the upper one is exclusive. In mathematical notations this would look like [start_time, end_time).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...