Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


database - Method of finding gaps in time series data in MySQL?

Let's say we have a database table with two columns, entry_time and value. entry_time is a timestamp, while value can be any other datatype. The records are relatively consistent, entered at roughly x-minute intervals. Sometimes, however, several intervals pass without an entry being made, producing a 'gap' in the data.

In terms of efficiency, what is the best way to find these gaps of at least duration Y (both new and old) with a query?


1 Reply


To start with, let us summarize the number of entries by hour in your table.

SELECT CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) hour,
       COUNT(*) samplecount
  FROM `table`
 GROUP BY CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME)

Now, if you log something every six minutes (ten times an hour) all your samplecount values should be ten. This expression: CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) looks hairy but it simply truncates your timestamps to the hour in which they occur by zeroing out the minute and second.
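If it helps to see the truncate-and-count idea outside the database, here is a minimal Python sketch of the same logic. The function name count_by_hour and the sample data are made up for illustration; it assumes one sample every six minutes, as in the example above.

```python
from collections import Counter
from datetime import datetime

def count_by_hour(entry_times):
    """Count entries per hour by zeroing the minute, second and microsecond."""
    return Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in entry_times
    )

# Hypothetical data: one sample every six minutes, with 09:30 and 09:36 missing.
samples = [
    datetime(2024, 1, 1, hour, minute)
    for hour in (9, 10)
    for minute in range(0, 60, 6)
    if (hour, minute) not in {(9, 30), (9, 36)}
]

counts = count_by_hour(samples)
# Hours with fewer than the expected ten samples contain at least one gap.
short_hours = sorted(h for h, n in counts.items() if n < 10)
print(short_hours)
```

Any hour whose count falls below ten is a candidate for a closer look.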

The hourly summary query is reasonably efficient, and will get you started. It's very efficient if you can put an index on your entry_time column and restrict your query to, let's say, yesterday's samples as shown here.

SELECT CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) hour,
       COUNT(*) samplecount
  FROM `table`
 WHERE entry_time >= CURRENT_DATE - INTERVAL 1 DAY
   AND entry_time < CURRENT_DATE
 GROUP BY CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME)

But it isn't much good at detecting whole hours that go by with no samples at all, because an hour without rows produces no group and so never appears in the result. It's also a little sensitive to jitter in your sampling. That is, if your top-of-the-hour sample is sometimes half a minute early (10:59:30) and sometimes half a minute late (11:00:30), your hourly summary counts will be off. So this hourly summary (or a daily summary, or a per-minute summary, and so on) is not bulletproof.
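To make the "whole hours missing" problem concrete, here is a small Python sketch, not part of the original queries. It assumes you know the time range you expected to cover; hours_without_samples and the sample data are hypothetical. The point is that an empty hour has to be found by comparing against the hours you expected, because it leaves no row to count.

```python
from datetime import datetime, timedelta

def hours_without_samples(entry_times, start, end):
    """Return every whole hour in [start, end) that contains no entry."""
    seen = {t.replace(minute=0, second=0, microsecond=0) for t in entry_times}
    missing = []
    hour = start
    while hour < end:
        if hour not in seen:
            missing.append(hour)
        hour += timedelta(hours=1)
    return missing

# Hypothetical data: nothing at all was logged between 10:00 and 13:00.
samples = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 30),
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 13, 0),
]
empty_hours = hours_without_samples(
    samples, datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 14)
)
print(empty_hours)
```

Here the 11:00 and 12:00 hours are reported, even though a grouped count over the same samples would simply skip them.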

You need a self-join query to get stuff perfectly right; it's a bit more of a hairball and not nearly as efficient.

Let's start by creating a virtual table (subquery) with numbered samples, like this. (This is a pain in older MySQL versions, which lack ROW_NUMBER(); MySQL 8.0 and some other, more expensive DBMSs make it easier. No matter.)

  SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
    FROM (
        SELECT entry_time, value
          FROM `table`
         ORDER BY entry_time
    ) c,
    (SELECT @sample:=0) s

This little virtual table gives entry_num, entry_time, value.

Next step, we join it to itself.

SELECT one.entry_num, one.entry_time, one.value,
       TIMEDIFF(two.entry_time, one.entry_time) AS `interval`
  FROM (
     /* virtual table */
  ) one
  JOIN (
     /* same virtual table */
  ) two ON (two.entry_num - 1 = one.entry_num)

This lines up the two tables next to each other, offset by a single entry, as governed by the ON clause of the JOIN.

Finally, we choose the rows from this joined table with an interval larger than your threshold; those are the times of the samples right before the missing ones.

The overall self-join query is this. I told you it was a hairball.

SELECT one.entry_num, one.entry_time, one.value,
       TIMEDIFF(two.entry_time, one.entry_time) AS `interval`
  FROM (
    SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
      FROM (
          SELECT entry_time, value
            FROM `table`
           ORDER BY entry_time
      ) c,
      (SELECT @sample:=0) s
  ) one
  JOIN (
    SELECT @sample2:=@sample2+1 AS entry_num, c.entry_time, c.value
      FROM (
          SELECT entry_time, value
            FROM `table`
           ORDER BY entry_time
      ) c,
      (SELECT @sample2:=0) s
  ) two ON (two.entry_num - 1 = one.entry_num)
 WHERE two.entry_time > one.entry_time + INTERVAL 10 MINUTE /* example threshold; substitute your own */
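For what it's worth, the pairing that the self-join performs can be sketched in a few lines of Python: sort the timestamps, pair each one with its successor, and keep the pairs that are further apart than the threshold. The name find_gaps, the ten-minute threshold, and the data below are assumptions for illustration only.

```python
from datetime import datetime, timedelta

def find_gaps(entry_times, threshold):
    """Return (time_before_gap, interval) for each pair of consecutive
    entries that are more than `threshold` apart."""
    ordered = sorted(entry_times)
    gaps = []
    # zip(ordered, ordered[1:]) plays the role of the join on entry_num - 1.
    for before, after in zip(ordered, ordered[1:]):
        interval = after - before
        if interval > threshold:
            gaps.append((before, interval))
    return gaps

# Hypothetical data: six-minute sampling with a hole after 09:12.
samples = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 6),
    datetime(2024, 1, 1, 9, 12),
    datetime(2024, 1, 1, 9, 42),
    datetime(2024, 1, 1, 9, 48),
]
gaps = find_gaps(samples, timedelta(minutes=10))
print(gaps)
```

The single result is the 09:12 sample, the last one before the hole, together with the 30-minute interval that follows it.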

If you have to do this in production on a large table, you may want to do it for a subset of your data. For example, you could do it each day for the previous two days' samples. This would be decently efficient, and would also make sure you didn't overlook any missing samples right at midnight. To do this, your little row-numbered virtual tables would look like this.

  SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
    FROM (
        SELECT entry_time, value
          FROM `table`
         WHERE entry_time >= CURRENT_DATE - INTERVAL 2 DAY
           AND entry_time < CURRENT_DATE /* the previous two days, excluding today */
         ORDER BY entry_time
    ) c,
    (SELECT @sample:=0) s
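As a rough illustration of why the two-day window catches gaps around midnight, here is a Python sketch of the same windowing. The helper names, the ten-minute threshold, and the data are hypothetical; the window is the two full days before "today", mirroring the WHERE clause above.

```python
from datetime import date, datetime, time, timedelta

def two_day_window(today):
    """Return (start, end) covering the two full days before `today`."""
    end = datetime.combine(today, time.min)
    return end - timedelta(days=2), end

def gaps_in_window(entry_times, today, threshold):
    """Find gaps longer than `threshold` among entries inside the window."""
    start, end = two_day_window(today)
    ordered = sorted(t for t in entry_times if start <= t < end)
    return [
        (before, after - before)
        for before, after in zip(ordered, ordered[1:])
        if after - before > threshold
    ]

# Hypothetical data with a gap that straddles midnight.
samples = [
    datetime(2024, 1, 1, 23, 48),
    datetime(2024, 1, 1, 23, 54),
    datetime(2024, 1, 2, 0, 30),  # 36 minutes after the previous sample
    datetime(2024, 1, 2, 0, 36),
    datetime(2024, 1, 3, 0, 6),   # today: outside the window
]
window_gaps = gaps_in_window(samples, date(2024, 1, 3), timedelta(minutes=10))
print(window_gaps)
```

A one-day window would split the 23:54 and 00:30 samples across two separate runs and miss the gap between them; the overlapping two-day window keeps them together.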
