r - Removing text containing non-english character

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

This is my sample dataset:

Name <- c("apple firm","苹果 firm","?pple firm")
Rank <- c(1,2,3)
data <- data.frame(Name,Rank)

I would like to delete the Name containing non-English character. For this sample, only "apple firm" should stay.

I tried to use the tm package, but it can only help me delete the non-english characters instead of the whole queries.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:21:36+0000

I would check out this related Stack Overflow post for doing the same thing in javascript. Regular expression to match non-English characters?

To translate this into R, you could do (to match non-ASCII):

res <- data[which(!grepl("[^x01-x7F]+", data$Name)),]

res
# A tibble: 1 × 2
#        Name  Rank
#       <chr> <dbl>
#1 apple firm     1

And to match non-unicode per that same SO post:

  res <- data[which(!grepl("[^u0001-u007F]+", data$Name)),]

  res
# A tibble: 1 × 2
#        Name  Rank
#       <chr> <dbl>
#1 apple firm     1

Note - we had to take out the NUL character for this to work. So instead of starting at u0000 or x00 we start at u0001 and x01.