I have a vector of character representations of dates, where the formats are mostly dmY (e.g. 27-09-2013) and dmy (e.g. 27-09-13), with occasional b or B months. Thus, parse_date_time in package lubridate, which "allows the user to specify several format-orders to handle heterogeneous date-time character representations", could be a very useful function for me.
However, it seems that parse_date_time has problems parsing dmy dates when they occur together with dmY dates. When parsing dmy alone, or dmy together with some other formats relevant to me, it works fine. This pattern was also noted in a comment to @Peyton's answer here. A quick fix was suggested, but I wish to ask whether it is possible to handle it in lubridate.
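For reference, a workaround outside lubridate could look like the sketch below. It is not necessarily the quick fix mentioned above. `parse_dmy_mixed` is a hypothetical helper name of my own. It assumes every date uses `-` as separator and that two-digit years should follow base R's `%y` convention (00 to 68 become 20xx, 69 to 99 become 19xx). Month-name formats are not covered and come back as NA.

```r
# Hypothetical helper (not part of lubridate): parse day-month-year strings
# whose year has either two or four digits, using base R only.
parse_dmy_mixed <- function(x) {
  # Start with all-NA dates; anything matching neither pattern stays NA.
  out <- rep(as.Date(NA), length(x))

  # Classify each element by the width of its year field.
  two_digit  <- grepl("^[0-9]{1,2}-[0-9]{1,2}-[0-9]{2}$", x)
  four_digit <- grepl("^[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}$", x)

  # Parse each group with the format that fits it.
  out[two_digit]  <- as.Date(x[two_digit],  format = "%d-%m-%y")
  out[four_digit] <- as.Date(x[four_digit], format = "%d-%m-%Y")
  out
}

parse_dmy_mixed(c("27-09-13", "27-09-2013"))
# [1] "2013-09-27" "2013-09-27"
```

This returns Date objects rather than the POSIXct values that parse_date_time gives, so it is only a stopgap for the dmy/dmY case.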
Here I show some examples where I try to parse dates in dmy format together with some other formats, specifying orders accordingly.
library(lubridate)
# version: lubridate_1.3.0
# regarding how date format is specified in 'orders':
# examples in ?parse_date_time
# parse_date_time(x, "ymd")
# parse_date_time(x, "%y%m%d")
# parse_date_time(x, "%y %m %d")
# these order strings are equivalent and parse the same way
# "Formatting orders might include arbitrary separators. These are discarded"
# dmy date only
parse_date_time(x = "27-09-13", orders = "d m y")
# [1] "2013-09-27 UTC"
# OK
# dmy & dBY
parse_date_time(c("27-09-13", "27 September 2013"), orders = c("d m y", "d B Y"))
# [1] "2013-09-27 UTC" "2013-09-27 UTC"
# OK
# dmy & dbY
parse_date_time(c("27-09-13", "27 Sep 2013"), orders = c("d m y", "d b Y"))
# [1] "2013-09-27 UTC" "2013-09-27 UTC"
# OK
# dmy & dmY
parse_date_time(c("27-09-13", "27-09-2013"), orders = c("d m y", "d m Y"))
# [1] "0013-09-27 UTC" "2013-09-27 UTC"
# not OK
# does order of the date components matter?
parse_date_time(c("2013-09-27", "13-09-13"), orders = c("Y m d", "y m d"))
# [1] "2013-09-27 UTC" "0013-09-27 UTC"
# no
What about the select_formats argument? I am sorry to say this, but I have a hard time understanding this section of the help file. And a search for select_formats on SO gave 0 results. Still, this section seemed relevant: "By default the formats with most formating tockens (%) are selected and %Y counts as 2.5 tockens (so that it can have priority over %y%m).". So I (desperately) tried with some additional dmy dates:
parse_date_time(c("27-09-2013", rep("27-09-13", 10)), orders = c("d m y", "d m Y"))
# not OK. Tried also 100 dmy dates.
# does order in the vector matter?
parse_date_time(c(rep("27-09-13", 10), "27-09-2013"), orders = c("d m y", "d m Y"))
# no
I then checked how the guess_formats function (also in lubridate) handled dmy together with dmY:
guess_formats(c("27-09-13", "27-09-2013"), c("dmy", "dmY"), print_matches = TRUE)
#                   dmy        dmY
# [1,] "27-09-13"   "%d-%m-%y" ""
# [2,] "27-09-2013" "%d-%m-%Y" "%d-%m-%Y"
# OK
From ?guess_formats: y also matches Y. From ?parse_date_time: y* Year without century (00–99 or 0–99). Also matches year with century (Y format). So I tried:
guess_formats(c("27-09-13", "27-09-2013"), c("dmy"), print_matches = TRUE)
#                   dmy
# [1,] "27-09-13"   "%d-%m-%y"
# [2,] "27-09-2013" "%d-%m-%Y"
# OK
Thus, guess_formats seems to be able to deal with dmy together with dmY. But how can I tell parse_date_time to do the same? Thanks in advance for any comments or help.
Update
I posted the question as a lubridate bug report, and got a rapid reply from @vitoshka: "This is a bug".