Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
717 views
in Technique[技术] by (71.8m points)

r - Parse dates in format dmy together with dmY using parse_date_time

I have a vector of character representation of dates, where formats mostly are dmY (e.g. 27-09-2013), dmy (e.g. 27-09-13), and occasionally some b or B months. Thus, parse_date_time in package lubridate that "allows the user to specify several format-orders to handle heterogeneous date-time character representations" could be a very useful function for me.

However, it seems that parse_date_time has problem parsing dmy dates when they occur together with dmY dates. When parsing dmy alone, or dmy together with some other formats relevant to me, it works fine. This pattern was also noted in a comment to @Peyton's answer here. A quick fix was suggested, but I wish to ask if it is possible to handle it in lubridate.

Here I show some examples where I try to parse dates on dmy format together with some other formats, and specifying orders accordingly.

library(lubridate)
# version: lubridate_1.3.0

# regarding how date format is specified in 'orders':
# examples in ?parse_date_time
# parse_date_time(x, "ymd")
# parse_date_time(x, "%y%m%d")
# parse_date_time(x, "%y %m %d")
# these order strings are equivalent and parses the same way
# "Formatting orders might include arbitrary separators. These are discarded"

# dmy date only
parse_date_time(x = "27-09-13", orders = "d m y")
# [1] "2013-09-27 UTC"
# OK

# dmy & dBY
parse_date_time(c("27-09-13", "27 September 2013"), orders = c("d m y", "d B Y"))
# [1] "2013-09-27 UTC" "2013-09-27 UTC"
# OK

# dmy & dbY
parse_date_time(c("27-09-13", "27 Sep 2013"), orders = c("d m y", "d b Y"))
# [1] "2013-09-27 UTC" "2013-09-27 UTC"
# OK

# dmy & dmY
parse_date_time(c("27-09-13", "27-09-2013"), orders = c("d m y", "d m Y"))
# [1] "0013-09-27 UTC" "2013-09-27 UTC"
# not OK

# does order of the date components matter?
parse_date_time(c("2013-09-27", "13-09-13"), orders = c("Y m d", "y m d"))
# [1] "2013-09-27 UTC" "0013-09-27 UTC"
# no

What about the select_formats argument? I am sorry to say this, but I have a hard time understand this section of the help file. And a search for select_formats on SO: 0 results. Still, this section seemed relevant: "By default the formats with most formating tockens (%) are selected and %Y counts as 2.5 tockens (so that it can have priority over %y%m).". So I (desperately) tried with some additional dmy dates:

parse_date_time(c("27-09-2013", rep("27-09-13", 10)), orders = c("d m y", "d m Y"))
# not OK. Tried also 100 dmy dates.

# does order in the vector matter?
parse_date_time(c(rep("27-09-13", 10), "27-09-2013"), orders = c("d m y", "d m Y"))
# no

I then checked how the guess_formats function (also in lubridate) handled dmy together with dmY:

guess_formats(c("27-09-13", "27-09-2013"), c("dmy", "dmY"), print_matches = TRUE)
#                   dmy        dmY       
# [1,] "27-09-13"   "%d-%m-%y" ""        
# [2,] "27-09-2013" "%d-%m-%Y" "%d-%m-%Y"
# OK   

From ?guess_formats: y also matches Y. From ?parse_date_time: y* Year without century (00–99 or 0–99). Also matches year with century (Y format). So I tried:

guess_formats(c("27-09-13", "27-09-2013"), c("dmy"), print_matches = TRUE)
#                   dmy       
# [1,] "27-09-13"   "%d-%m-%y"
# [2,] "27-09-2013" "%d-%m-%Y"
# OK

Thus, guess_format seems to be able to deal with dmy together with dmY. But how can I tell parse_date_time to do the same? Thanks in advance for any comments or help.

Update I posted the question on the lubridate bug report, and got a rapid reply from @vitoshka: "This is a bug".

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

It looks like a bug. I am not sure So you should contact the maintainer.

Building the package source and changing one line in this internal function ( I replace which.max by wich.min):

.select_formats <-   function(trained){
  n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%Y", names(trained))*1.5
  names(trained[ which.min(n_fmts) ]) ## replace which.max  by which.min
}

seems to correct the problem. Frankly I don't know why this works, but I guess it is a kind of ranking..

parse_date_time(c("27-09-13", "27-09-2013"), orders = c("d m y", "d m Y"))
[1] "2013-09-27 UTC" "2013-09-27 UTC"

parse_date_time(c("2013-09-27", "13-09-13"), orders = c("Y m d", "y m d"))
[1] "2013-09-27 UTC" "2013-09-13 UTC"

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...