Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

r - Convert byte Encoding to unicode

I may not be using the appropriate language in the title. If it needs editing, please feel free to edit it.

I want to take a string with "byte" substitutions for unicode characters and convert them back to unicode. Let's say I have:

x <- "bi<df>chen Z<fc>rcher hello world <c6>"

I'd like to get back:

"bißchen Zürcher hello world Æ"

I know that if I could get it to this form it would print to the console as desired:

"bi\xdfchen Z\xfcrcher \xc6"

I tried:

gsub("<([[a-z0-9]+)>", "\\x\\1", x)
## [1] "bixdfchen Zxfcrcher hello world xc6"


1 Reply


How about this:

x <- "bi<df>chen Z<fc>rcher hello world <c6>"

m <- gregexpr("<[0-9a-f]{2}>", x)
codes <- regmatches(x, m)
chars <- lapply(codes, function(x) {
    rawToChar(as.raw(strtoi(paste0("0x", substr(x,2,3)))), multiple = TRUE)
})

regmatches(x, m) <- chars

x
# [1] "bi\xdfchen Z\xfcrcher hello world \xc6"

Encoding(x) <- "latin1"
x
# [1] "bißchen Zürcher hello world Æ"

Note that you can't make an escaped character by pasting "\x" onto the front of a number. That "\x" isn't really in the string at all; it's just how R chooses to represent the byte on screen. Here we use rawToChar() to turn a number into the character we want.
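To make that point concrete, here is a small sketch comparing a real escape with a pasted-together string. The variable names are only for illustration, and the byte counts are taken with type = "bytes" because a lone 0xdf byte may not be a valid character in the current locale.

```r
# A genuine escape: the parser turns \xdf into one byte with value 0xdf.
escaped <- "\xdf"

# Pasting text together: a literal backslash, "x", "d" and "f" (four bytes).
pasted <- paste0("\\x", "df")

n_escaped <- nchar(escaped, type = "bytes")
n_pasted <- nchar(pasted, type = "bytes")

# Building the byte from its hex code gives the same bytes as the escape.
built <- rawToChar(as.raw(strtoi("df", base = 16L)))
same_bytes <- identical(charToRaw(built), charToRaw(escaped))
```

Here n_escaped is 1 while n_pasted is 4, and same_bytes is TRUE, which is why the gsub() attempt cannot produce the byte but rawToChar() can.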

I tested this on a Mac, so I had to set the encoding to "latin1" to see the correct symbols in the console. A single byte such as 0xdf on its own isn't valid UTF-8, so R needs to be told that the bytes are latin1.
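If you would rather end up with a genuine UTF-8 string than a latin1-marked one, one option (not part of the tested answer above, so treat it as a sketch) is to re-encode with iconv(). It assumes the bytes really are latin1; the variable names are illustrative.

```r
# "bi" followed by the single latin1 byte 0xdf
s <- rawToChar(as.raw(c(0x62, 0x69, 0xdf)))

# Re-encode from latin1 to UTF-8; the one byte becomes a two-byte sequence.
u <- iconv(s, from = "latin1", to = "UTF-8")

utf8_bytes <- charToRaw(u)
enc <- Encoding(u)
```

After this, utf8_bytes holds 62 69 c3 9f (the last two bytes are the UTF-8 encoding of U+00DF) and enc is "UTF-8".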

