Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - Trouble with strings with <U+0092> Unicode characters

I have a very large dataset (70k rows, 2600 columns, CSV format) that I created by web scraping. Unfortunately, at some point during pre-processing and processing, some problematic characters became encoded in an odd way, and I have trouble dealing with them.

I have strings like the following:

x = "but it doesn<U+0092>t matter"

Looking up the code, we can see that it should be the character U+0092, which actually should be ' (the data are user-generated, so they may contain all kinds of odd characters). From looking up that character, it seems that others have had problems with it too (1, 2, 3). It's labelled a control character; I'm not sure what that means, but perhaps that's why it's so hard to deal with.
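One possible explanation, stated here as an assumption rather than something the data confirms: in the Windows-1252 code page, byte 0x92 is the right single quotation mark (U+2019), so a U+0092 that "should be" an apostrophe often comes from Windows-1252 text being read as Latin-1 at some earlier step. A minimal sketch of that mapping, using only base R:

```r
# Assumption: the original text was Windows-1252, where byte 0x92 is a curly apostrophe.
cp1252_byte <- rawToChar(as.raw(0x92))

# Decode that single byte as CP1252 and re-encode it as UTF-8.
quote_char <- iconv(cp1252_byte, from = "CP1252", to = "UTF-8")

# Code point of the decoded character (U+2019, RIGHT SINGLE QUOTATION MARK).
code_point <- utf8ToInt(quote_char)
```

If that assumption holds for the dataset, replacing U+0092 with either U+2019 or a plain ' recovers the intended text.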

Most of the other questions about Unicode in R concern Unicode in a format like this: \u0092.

Just use Encoding()

Let's try:

#> x = "but it doesn<U+0092>t matter"
#> Encoding(x)
#[1] "unknown"
#> Encoding(x) = "UTF-8"
#> Encoding(x)
#[1] "unknown"
#> x
#[1] "but it doesn<U+0092>t matter"

So this does not seem to do anything.
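A likely reason, assuming x really holds the eight literal characters <U+0092> rather than one U+0092 character: R never marks an ASCII-only string with a declared encoding, so Encoding() has nothing to change. A small diagnostic sketch (the variable names are illustrative):

```r
x <- "but it doesn<U+0092>t matter"

# TRUE if every byte is below 0x80, i.e. the string is pure ASCII.
is_ascii <- all(as.integer(charToRaw(x)) < 128L)

# TRUE if the literal text "<U+0092>" is present (fixed = TRUE disables regex).
has_literal_escape <- grepl("<U+0092>", x, fixed = TRUE)

# Declaring an encoding on an ASCII-only string leaves the mark as "unknown".
Encoding(x) <- "UTF-8"
mark_after <- Encoding(x)
```

If is_ascii is FALSE for the real data, the string contains actual non-ASCII bytes and this explanation does not apply.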

Use the hack functions from these previous questions

There are a few prior questions that concern this Unicode format and try to convert it.

Oddly, the examples they give work, but mine doesn't.

#> test.string <- "This is a <U+03B1> <U+03B2> <U+03B2> <U+03B3> test <U+03B4> string."
#> Encoding(test.string)
#[1] "unknown"
#> to_true_unicode(test.string)
#[1] "This is a α β β γ test δ string."

But:

#> x2 = to_true_unicode(x)
#> x2
#[1] "but it doesn\u0092t matter"
#> cat(x2)
#but it doesnt matter
#> Encoding(x2)
#[1] "UTF-8"

So it managed to convert from the <U+....> format to the \u format, and cat() prints the string without any visible symbol for that character (or with a bugged symbol on SO).
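The to_true_unicode() helper comes from the earlier questions and is not shown here. As a rough stand-in, the sketch below uses a hypothetical function, unescape_angle_unicode(), that turns each literal <U+XXXX> token into the real character with base R only. It assumes the tokens are plain ASCII text in the string.

```r
# Hypothetical helper, not the to_true_unicode() from the earlier questions.
unescape_angle_unicode <- function(s) {
  # Locate every token such as <U+03B1>; the "+" is escaped because it is a regex quantifier.
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(tokens) {
    # Drop the leading "<U+" and the trailing ">" to keep only the hex digits.
    hex <- substr(tokens, 4, nchar(tokens) - 1)
    # Convert each hex code point to a one-character UTF-8 string.
    intToUtf8(strtoi(hex, base = 16L), multiple = TRUE)
  })
  s
}

greek <- unescape_angle_unicode("This is a <U+03B1> <U+03B2> test.")
x2_sketch <- unescape_angle_unicode("but it doesn<U+0092>t matter")
```

As with to_true_unicode(), the result for x still contains the control character U+0092, so a replacement step is needed afterwards.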

Just search and replace them manually

I only have a limited number of these problems, so I could perhaps just use search-replace to solve it. However:

#> #base-r
#> gsub(x = x, pattern = "<U+0092>", replacement = "'")
#[1] "but it doesn<U+0092>t matter"
#> #stringr/stringi
#> library(stringr)
#> str_replace(x, pattern = "<U+0092>", "'")
#[1] "but it doesn<U+0092>t matter"

So replacement does not seem to work, but it does work on the \u versions:

#> #base-r
#> gsub(x = x2, pattern = "\u0092", replacement = "'")
#[1] "but it doesn't matter"
#> #stringr/stringi
#> library(stringr)
#> str_replace(x2, pattern = "\u0092", "'")
#[1] "but it doesn't matter"

So this suggests a working method: 1) convert the <U+....> format to the \u format, then 2) use search-replace.
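A probable reason the direct replacement failed, assuming x contains the literal text <U+0092>: gsub() treats the pattern as a regular expression by default, and in a regex "U+" means "one or more U", so the pattern never matches a literal plus sign. A sketch of two ways around that in base R:

```r
x <- "but it doesn<U+0092>t matter"

# Default regex matching: "U+" is a quantifier, so nothing is replaced.
no_match <- gsub("<U+0092>", "'", x)

# fixed = TRUE treats the pattern as a plain string.
fixed_match <- gsub("<U+0092>", "'", x, fixed = TRUE)

# Alternatively, escape the plus sign inside the regex.
escaped_match <- gsub("<U\\+0092>", "'", x)
```

With stringr, wrapping the pattern in stringr::fixed() should have the same effect as fixed = TRUE.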

Unescape with stringi::stri_unescape_unicode()

Does not seem to work with either version:

#> stringi::stri_unescape_unicode(x)
#[1] "but it doesn<U+0092>t matter"
#> stringi::stri_unescape_unicode(x2)
#[1] "but it doesn\u0092t matter"

Is there some generally applicable way to deal with problems like this?

My setup

My sessionInfo is:

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C                    LC_TIME=Danish_Denmark.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.0.0

loaded via a namespace (and not attached):
[1] magrittr_1.5  tools_3.2.3   stringi_1.0-1

Running R via RStudio (0.99.893, preview) on Windows 8.1, 64-bit. Keyboard and time-units are Danish, but everything else is in English.


1 Reply


I'm not sure it will work for you, but for the same symptoms I converted the strings to ASCII:

x <- iconv(x, "", "ASCII", "byte")

Each non-ASCII byte is then shown as "<xx>", where xx is the hex code of the byte.

You can then gsub the hex codes to whatever values suit you.
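A minimal sketch of that approach, assuming the string holds a real U+0092 character stored as UTF-8 (where it is the two bytes c2 92). The example passes from = "UTF-8" explicitly instead of "" so that it does not depend on the session locale, and the exact substitution text may vary between iconv implementations.

```r
# Assumption: x contains an actual U+0092 character, encoded as UTF-8.
x <- "but it doesn\u0092t matter"

# Convert to ASCII; bytes that cannot be converted become "<xx>" hex markers.
as_bytes <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")

# U+0092 is the byte pair c2 92 in UTF-8, so replace that marker pair literally.
cleaned <- gsub("<c2><92>", "'", as_bytes, fixed = TRUE)
```

The same pattern extends to other characters by adding one gsub() call per marker sequence.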

