Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

POSIX character class does not work in base R regex

I'm having some problems matching a pattern with a string of text in R.

I'm trying to get TRUE with grepl when the text is something like "lettersornumbersorspaces y lettersornumbersorspaces".

I'm using the following regex:

([:alnum:]|[:blank:])+[:blank:][yY][:blank:]([:alnum:]|[:blank:])+

When I use the regex with stringr's str_extract as follows to obtain the "address", it works as expected.

library(stringr)  # str_extract() comes from the stringr package
regex <- "([:alnum:]|[:blank:])+[:blank:][yY][:blank:]([:alnum:]|[:blank:])+"
address <- str_extract(fulltext, regex)  # fulltext holds the source text

I can see that address contains the text I need. However, when I pass the same pattern to grepl, expecting TRUE:

grepl("([:alnum:]|[:blank:])+[:blank:][yY][:blank:]([:alnum:]|[:blank:])+", address, ignore.case = TRUE)

FALSE is returned. How is this possible? It is the same regex that produced the match in str_extract. I have tried changing the other grepl arguments, but none of them makes a difference.

An example of text is: "26 de Marzo y Pareyra de la Luz"

Thanks!!


1 Reply


The ICU regex engine used by stringr accepts bare POSIX character classes in the pattern, but the base R regex flavors (both TRE, the default, and PCRE, enabled with perl=TRUE) require POSIX character classes to be placed inside a bracket expression: [:alnum:] -> [[:alnum:]].

x <- c("AZaz09 y AZaz09", "??az09 y AZ??09", "26 de Marzo y Pareyra de la Luz")
grepl("[[:alnum:][:blank:]]+[[:blank:]][yY][[:blank:]][[:alnum:][:blank:]]+", x)
## => [1] TRUE TRUE TRUE
grepl("[[:alnum:][:blank:]]+[[:blank:]][yY][[:blank:]][[:alnum:][:blank:]]+", x, perl=TRUE)
## => [1] TRUE TRUE TRUE


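When you use [:alnum:] alone in base R, it is read as an ordinary bracket expression that matches a single character from the set :, a, l, n, u, m. That is why grepl returns FALSE for the sample text: nothing in it satisfies the pattern once the classes are read as literal characters.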
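
To see the difference for yourself, here is a minimal sketch using base R's default TRE engine (perl is left at FALSE). The variable names chars, bare and fixed are mine, not from the question; fixed is just the asker's pattern with every class wrapped in an extra pair of brackets.

```r
# Single characters to probe what each form of the class matches
chars <- c("a", "m", ":", "z", "7")

# Bare form: an ordinary bracket expression over the characters : a l n u m
bare_hits <- grepl("[:alnum:]", chars)
bare_hits
## => [1]  TRUE  TRUE  TRUE FALSE FALSE

# Class inside a bracket expression: any letter or digit
class_hits <- grepl("[[:alnum:]]", chars)
class_hits
## => [1]  TRUE  TRUE FALSE  TRUE  TRUE

# The same effect on the pattern and sample text from the question
address <- "26 de Marzo y Pareyra de la Luz"
bare  <- "([:alnum:]|[:blank:])+[:blank:][yY][:blank:]([:alnum:]|[:blank:])+"
fixed <- "([[:alnum:]]|[[:blank:]])+[[:blank:]][yY][[:blank:]]([[:alnum:]]|[[:blank:]])+"

bare_result  <- grepl(bare, address, ignore.case = TRUE)
fixed_result <- grepl(fixed, address, ignore.case = TRUE)
bare_result
## => [1] FALSE
fixed_result
## => [1] TRUE
```

The alternation form works once the brackets are doubled, but merging both classes into one bracket expression, as in the answer's pattern, is shorter and easier to read.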

Pattern details:

  • [[:alnum:][:blank:]]+ - one or more alphanumeric or horizontal whitespace characters
  • [[:blank:]] - a single horizontal whitespace character (space or tab)
  • [yY] - either y or Y
  • [[:blank:]] - a single horizontal whitespace character (space or tab)
  • [[:alnum:][:blank:]]+ - one or more alphanumeric or horizontal whitespace characters
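
If you also want to pull out the matched text without stringr, one option (a sketch, not the only way) is to pair regexpr() with regmatches() in base R. The pattern is the one explained above; the name pat is only for illustration.

```r
x <- c("AZaz09 y AZaz09", "??az09 y AZ??09", "26 de Marzo y Pareyra de la Luz")
pat <- "[[:alnum:][:blank:]]+[[:blank:]][yY][[:blank:]][[:alnum:][:blank:]]+"

# regexpr() locates the first match in each element;
# regmatches() then extracts the matched substrings
matches <- regmatches(x, regexpr(pat, x))
matches
## => [1] "AZaz09 y AZaz09"                 "az09 y AZ"
## => [3] "26 de Marzo y Pareyra de la Luz"
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than the input, whereas str_extract returns NA in those positions.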
