Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
699 views
in Technique[技术] by (71.8m points)

utf 8 - How to identify/delete non-UTF-8 characters in R

When I import a Stata dataset in R (using the foreign package), the import sometimes contains characters that are not valid UTF-8. This is unpleasant enough by itself, but it breaks everything as soon as I try to transform the object to JSON (using the rjson package).

How I can identify non-valid-UTF-8-characters in a string and delete them after that?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Another solution using iconv and it argument sub: character string. If not NA(here I set it to ''), it is used to replace any non-convertible bytes in the input.

x <- "faxE7ile"
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8",sub='') ## replace any non UTF-8 by ''
"faile"

Here note that if we choose the right encoding:

x <- "faxE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8",sub='')
facile

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...