Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

r - aggregate 1-minute data into 5-minute average data

I want to aggregate data collected every minute into 5-minute averages.

DeviceTime         Concentration
6/20/2013 11:13       
6/20/2013 11:14
6/20/2013 11:15
6/20/2013 11:16
6/20/2013 11:17
6/20/2013 11:18
6/20/2013 11:19
6/20/2013 11:20
6/20/2013 11:21
6/20/2013 11:22
6/20/2013 11:23
6/20/2013 11:24
6/20/2013 11:25
6/20/2013 11:26
6/20/2013 11:27
6/20/2013 11:28

...

The result I want is like:

DeviceTime             Concentration
6/20/2013 11:15
6/20/2013 11:20
6/20/2013 11:25
6/20/2013 11:30
6/20/2013 11:35
...

The 5-minute average is just the simple average of the concentrations over the past five minutes.

1 Reply

If your data doesn't start on a 5-minute wall-clock boundary (as in your sample data, which starts at 11:13), note that cut() creates its breakpoints from the first timestamp it finds. This usually isn't what you want, and your sample output indicates it isn't what you want here.

Here's what cut() does:

df <- read.table(header=TRUE, sep=",", stringsAsFactors=FALSE, text="
DeviceTime,Concentration
6/20/2013 11:13,1
6/20/2013 11:14,1
6/20/2013 11:15,2
6/20/2013 11:16,2
6/20/2013 11:17,2
6/20/2013 11:18,2
6/20/2013 11:19,2
6/20/2013 11:20,3
6/20/2013 11:21,3
6/20/2013 11:22,3
6/20/2013 11:23,3
6/20/2013 11:24,3
6/20/2013 11:25,4")
df$DeviceTime <- as.POSIXct(df$DeviceTime, format="%m/%d/%Y %H:%M")

cut(df$DeviceTime, breaks="5 min")
 [1] 2013-06-20 11:13:00 2013-06-20 11:13:00 2013-06-20 11:13:00
 [4] 2013-06-20 11:13:00 2013-06-20 11:13:00 2013-06-20 11:18:00
 [7] 2013-06-20 11:18:00 2013-06-20 11:18:00 2013-06-20 11:18:00
[10] 2013-06-20 11:18:00 2013-06-20 11:23:00 2013-06-20 11:23:00
[13] 2013-06-20 11:23:00

means <- aggregate(df["Concentration"], 
                   list(fiveMin=cut(df$DeviceTime, "5 mins")),
                   mean)
means
              fiveMin Concentration
1 2013-06-20 11:13:00      1.600000
2 2013-06-20 11:18:00      2.600000
3 2013-06-20 11:23:00      3.333333

Notice that the first row of means (the 11:13:00 entry) is the mean of the first 5 rows of df, which have times of 11:13 to 11:17 -- i.e., up until just before the next cut/break point of 11:18.

You'll get the same result with dplyr (i.e., @lukeA's answer) if you use cut():

library(dplyr)
df %>%
  group_by(DeviceTime = cut(DeviceTime, breaks="5 min")) %>%
  summarize(Concentration = mean(Concentration))
Source: local data frame [3 x 2]

           DeviceTime Concentration
1 2013-06-20 11:13:00      1.600000
2 2013-06-20 11:18:00      2.600000
3 2013-06-20 11:23:00      3.333333
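
If you want to stay with cut(), one option is to hand it explicit breakpoints that sit on 5-minute wall-clock boundaries instead of the string "5 min". The following is only a sketch under some assumptions: the names step, first_break, last_break and means_aligned are mine, the sample data is rebuilt in UTC so the block stands on its own, and cut() keeps its default for date-times of closing each interval on the left.

```r
# Rebuild the sample data (UTC is assumed here only to keep the sketch reproducible)
df <- data.frame(
  DeviceTime    = as.POSIXct("2013-06-20 11:13:00", tz = "UTC") + 60 * (0:12),
  Concentration = c(1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4)
)

step <- 5 * 60                       # interval length in seconds
secs <- as.numeric(df$DeviceTime)    # seconds since 1970-01-01

# First boundary at or before the data, last boundary strictly after it
first_break <- floor(min(secs) / step) * step
last_break  <- floor(max(secs) / step) * step + step
breaks <- as.POSIXct(seq(first_break, last_break, by = step),
                     origin = "1970-01-01", tz = "UTC")

# Intervals are [11:10, 11:15), [11:15, 11:20), ... and are labelled by their start
df$fiveMin <- cut(df$DeviceTime, breaks = breaks)
means_aligned <- aggregate(Concentration ~ fiveMin, data = df, FUN = mean)
means_aligned
```

With the sample data this should group 11:13 - 11:14, 11:15 - 11:19, 11:20 - 11:24 and 11:25 separately, i.e. the same grouping as wall-clock flooring, with each group labelled by the start of its interval.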

The xts package seems to break by wall clock time:

require(xts)
df.xts <- xts(df$Concentration, df$DeviceTime)
means.xts <- period.apply(df.xts, endpoints(df.xts, "mins", k=5), mean)
means.xts
                    [,1]
2013-06-20 11:14:00    1
2013-06-20 11:19:00    2
2013-06-20 11:24:00    3
2013-06-20 11:25:00    4

The time values are always the last time entry found in the 5-min window. You can round the time index up to the next 5-min boundary with align.time(), if you want to report the times of the end of the periods:

means.rounded <- align.time(means.xts, 5*60)
means.rounded
                    [,1]
2013-06-20 11:15:00    1
2013-06-20 11:20:00    2
2013-06-20 11:25:00    3
2013-06-20 11:30:00    4

You can also round down, if you want to report the times of the beginning of the periods. But you'll need to define your own function first (which I found on Cross Validated):

align.time.down = function(x,n) {
    index(x) = index(x) - n
    align.time(x,n)
}
means.rounded.down <- align.time.down(means.xts, 5*60)
means.rounded.down
                    [,1]
2013-06-20 11:10:00    1
2013-06-20 11:15:00    2
2013-06-20 11:20:00    3
2013-06-20 11:25:00    4

Another solution, which uses floor() rather than the xts package, is as follows:

df$DeviceTimeFloor <- as.POSIXct(floor(as.numeric(df$DeviceTime) / (5 * 60)) * (5 * 60), origin='1970-01-01')
meansFloor <- aggregate(Concentration ~ DeviceTimeFloor, df, mean)
meansFloor
      DeviceTimeFloor Concentration
1 2013-06-20 11:10:00             1
2 2013-06-20 11:15:00             2
3 2013-06-20 11:20:00             3
4 2013-06-20 11:25:00             4

I prefer to report the start time of the 5-minute interval, and floor() is good for this. If I were to report aggregates by hour, I would expect a timestamp of 2013-06-20 11:00:00 to contain data for the period 11:00:00 - 11:59:59, not 10:00:00 - 10:59:59.

If you prefer to report the end time of the intervals, ceiling() can be used instead of floor(). But note that timestamps 11:01 - 11:05 will be converted to (and hence grouped at) 11:05 by ceiling(). In contrast, floor() converts 11:00 - 11:04 to 11:00.

So floor() and ceiling() each group a different set of observations. The xts package will group the same set of observations as floor(), but it will report the timestamp of the last observation in each period.
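
To make the difference concrete, here is a small side-by-side sketch of the floor() and ceiling() variants on the same sample data. The column names PeriodStart and PeriodEnd and the result names means_floor and means_ceiling are made up for this illustration, and the data is rebuilt in UTC so the block runs on its own.

```r
# Rebuild the sample data (UTC is assumed here only to keep the sketch reproducible)
df <- data.frame(
  DeviceTime    = as.POSIXct("2013-06-20 11:13:00", tz = "UTC") + 60 * (0:12),
  Concentration = c(1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4)
)

step <- 5 * 60                       # interval length in seconds
secs <- as.numeric(df$DeviceTime)    # seconds since 1970-01-01

# floor(): 11:15 - 11:19 are all labelled 11:15 (start of the period)
df$PeriodStart <- as.POSIXct(floor(secs / step) * step,
                             origin = "1970-01-01", tz = "UTC")
# ceiling(): 11:16 - 11:20 are all labelled 11:20 (end of the period)
df$PeriodEnd <- as.POSIXct(ceiling(secs / step) * step,
                           origin = "1970-01-01", tz = "UTC")

means_floor   <- aggregate(Concentration ~ PeriodStart, data = df, FUN = mean)
means_ceiling <- aggregate(Concentration ~ PeriodEnd,   data = df, FUN = mean)

means_floor
means_ceiling
```

On this data the ceiling() version produces three groups labelled 11:15, 11:20 and 11:25, which matches the "average over the past five minutes" layout asked for in the question, while the floor() version produces the four groups shown earlier.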

