I am facing the common task of calculating the age (in years, months, or weeks) given the date of birth and an arbitrary date. The thing is that quite often I have to do this over many many records (>300 millions), so performance is a key issue here.
After a quick search in SO and Google I found 3 alternatives:
- A common arithmetic procedure (/365.25) (link)
- Using functions
new_interval()
and duration()
from package lubridate
(link)
- Function
age_calc()
from package eeptools
(link, link, link)
So, here's my toy code:
# Some toy birthdates
birthdate <- as.Date(c("1978-12-30", "1978-12-31", "1979-01-01",
"1962-12-30", "1962-12-31", "1963-01-01",
"2000-06-16", "2000-06-17", "2000-06-18",
"2007-03-18", "2007-03-19", "2007-03-20",
"1968-02-29", "1968-02-29", "1968-02-29"))
# Given dates to calculate the age
givendate <- as.Date(c("2015-12-31", "2015-12-31", "2015-12-31",
"2015-12-31", "2015-12-31", "2015-12-31",
"2050-06-17", "2050-06-17", "2050-06-17",
"2008-03-19", "2008-03-19", "2008-03-19",
"2015-02-28", "2015-03-01", "2015-03-02"))
# Using a common arithmetic procedure ("Time differences in days"/365.25)
(givendate-birthdate)/365.25
# Use the package lubridate
require(lubridate)
new_interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years")
# Use the package eeptools
library(eeptools)
age_calc(dob = birthdate, enddate = givendate, units = "years")
Let's talk later about accuracy and focus first on performance. Here's the code:
# Now let's compare the performance of the alternatives using microbenchmark
library(microbenchmark)
mbm <- microbenchmark(
arithmetic = (givendate - birthdate) / 365.25,
lubridate = new_interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years"),
eeptools = age_calc(dob = birthdate, enddate = givendate,
units = "years"),
times = 1000
)
# And examine the results
mbm
autoplot(mbm)
Here the results:
Bottom line: performance of lubridate
and eeptools
functions is much worse than the arithmetic method (/365.25 is at least 10 times faster). Unfortunately, the arithmetic method is not accurate enough and I cannot afford the few mistakes that this method will make.
"because of the way the modern Gregorian calendar
is constructed, there is no straightforward arithmetic
method that produces a person’s age, stated according to
common usage — common usage meaning that a person’s
age should always be an integer that increases exactly on
a birthday". (link)
As I read on some posts, lubridate
and eeptools
make no such mistakes (though, I haven't looked at the code/read more about those functions to know which method they use) and that's why I wanted to use them, but their performance does not work for my real application.
Any ideas on an efficient and accurate method to calculate the age?
EDIT
Ops, it seems lubridate
also makes mistakes. And apparently based on this toy example, it makes more mistakes than the arithmetic method (see lines 3, 6, 9, 12). (am I doing something wrong?)
toy_df <- data.frame(
birthdate = birthdate,
givendate = givendate,
arithmetic = as.numeric((givendate - birthdate) / 365.25),
lubridate = new_interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years"),
eeptools = age_calc(dob = birthdate, enddate = givendate,
units = "years")
)
toy_df[, 3:5] <- floor(toy_df[, 3:5])
toy_df
birthdate givendate arithmetic lubridate eeptools
1 1978-12-30 2015-12-31 37 37 37
2 1978-12-31 2015-12-31 36 37 37
3 1979-01-01 2015-12-31 36 37 36
4 1962-12-30 2015-12-31 53 53 53
5 1962-12-31 2015-12-31 52 53 53
6 1963-01-01 2015-12-31 52 53 52
7 2000-06-16 2050-06-17 50 50 50
8 2000-06-17 2050-06-17 49 50 50
9 2000-06-18 2050-06-17 49 50 49
10 2007-03-18 2008-03-19 1 1 1
11 2007-03-19 2008-03-19 1 1 1
12 2007-03-20 2008-03-19 0 1 0
13 1968-02-29 2015-02-28 46 47 46
14 1968-02-29 2015-03-01 47 47 47
15 1968-02-29 2015-03-02 47 47 47
See Question&Answers more detail:
os