Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
571 views
in Technique[技术] by (71.8m points)

r - Filtering observations in dplyr in combination with grepl

I am trying to work out how to filter some observations from a large dataset using dplyr and grepl . I am not wedded to grepl, if other solutions would be more optimal.

Take this sample df:

df1 <- data.frame(fruit=c("apple", "orange", "xapple", "xorange", 
                          "applexx", "orangexx", "banxana", "appxxle"), group=c("A", "B") )
df1


#     fruit group
#1    apple     A
#2   orange     B
#3   xapple     A
#4  xorange     B
#5  applexx     A
#6 orangexx     B
#7  banxana     A
#8  appxxle     B

I want to:

  1. filter out those cases beginning with 'x'
  2. filter out those cases ending with 'xx'

I have managed to work out how to get rid of everything that contains 'x' or 'xx', but not beginning with or ending with. Here is how to get rid of everything with 'xx' inside (not just ending with):

df1 %>%  filter(!grepl("xx",fruit))

#    fruit group
#1   apple     A
#2  orange     B
#3  xapple     A
#4 xorange     B
#5 banxana     A

This obviously 'erroneously' (from my point of view) filtered 'appxxle'.

I have never fully got to grips with regular expressions. I've been trying to modify code such as: grepl("^(?!x).*$", df1$fruit, perl = TRUE) to try and make it work within the filter command, but am not quite getting it.

Expected output:

#      fruit group
#1     apple     A
#2    orange     B
#3   banxana     A
#4   appxxle     B

I'd like to do this inside dplyr if possible.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I didn't understand your second regex, but this more basic regex seems to do the trick:

df1 %>% filter(!grepl("^x|xx$", fruit))
###
    fruit group
1   apple     A
2  orange     B
3 banxana     A
4 appxxle     B

And I assume you know this, but you don't have to use dplyr here at all:

df1[!grepl("^x|xx$", df1$fruit), ]
###
    fruit group
1   apple     A
2  orange     B
7 banxana     A
8 appxxle     B

The regex is looking for strings that start with x OR end with xx. The ^ and $ are regex anchors for the beginning and ending of the string respectively. | is the OR operator. We're negating the results of grepl with the ! so we're finding strings that don't match what's inside the regex.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...