Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - What's the difference between hex code (\x) and unicode (\u) chars?

From ?Quotes:

\xnn   character with given hex code (1 or 2 hex digits)
\unnnn Unicode character with given code (1--4 hex digits)

In the case where the Unicode character has only one or two digits, I would expect these characters to be the same. In fact, one of the examples on the ?Quotes help page shows:

"\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21"
## [1] "Hello World!"
"\u48\u65\u6c\u6c\u6f\u20\u57\u6f\u72\u6c\u64\u21"
## [1] "Hello World!"

However, under Linux, when trying to print a pound sign, I see

cat("\ua3")
## £
cat("\xa3")
## ?

That is, the \x hex code fails to display correctly. (This behaviour persisted with any locale that I tried.) Under Windows 7 both versions show a pound sign.

If I convert to integer and back then the pound sign displays correctly under Linux.

cat(intToUtf8(utf8ToInt("\xa3")))
## £

Incidentally, this doesn't work under Windows, since utf8ToInt("\xa3") returns NA.

Some \x characters return NA under Windows but throw an error under Linux. For example:

utf8ToInt("\xf0")
## Error in utf8ToInt("\xf0") : invalid UTF-8 string

("\uf0" is a valid character.)

These examples show that there are some differences between the \x and \u forms of characters, which seem to be OS-specific, but I can't see any logic in how they are defined.

What is the difference between these two character forms?


1 Reply


The escape sequence \xNN inserts the raw byte NN into a string, whereas \uNN inserts the UTF-8 bytes for the Unicode code point NN into a UTF-8 string:

> charToRaw('\xA3')
[1] a3
> charToRaw('\uA3')
[1] c2 a3
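
As a rough check of the byte-level claim, the sketch below compares the two forms and then converts one into the other. It builds the single byte with rawToChar(as.raw(...)) instead of a \x escape; that is my own assumption, made so the result does not depend on how the parser treats the escape in a given locale. The variable names are illustrative only.

```r
# A single raw byte 0xA3, the same content the literal '\xA3' is described as producing.
byte_form <- rawToChar(as.raw(0xa3))

# Code point U+00A3, written with a \u escape.
unicode_form <- "\u00a3"

charToRaw(byte_form)     # one byte
charToRaw(unicode_form)  # two bytes (the UTF-8 encoding of U+00A3)

# Reinterpret the single byte as Latin-1 and convert it to UTF-8.
# The converted string should carry the same bytes as the \u form.
converted <- iconv(byte_form, from = "latin1", to = "UTF-8")
identical(charToRaw(converted), charToRaw(unicode_form))
```

The comparison is done on raw bytes rather than on printed output, so it should not be affected by the console encoding.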

These two types of escape sequence cannot be mixed in the same string:

> '\ua3\xa3'
Error: mixing Unicode and octal/hex escapes in a string is not allowed

This is because the escape sequences also define the encoding of the string. A \uNN sequence explicitly sets the encoding of the entire string to "UTF-8", whereas \xNN leaves it in the default "unknown" (i.e. native) encoding:

> Encoding('\xa3')
[1] "unknown"
> Encoding('\ua3')
[1] "UTF-8"

This becomes important when printing strings, as they need to be converted into the appropriate output encoding (e.g., that of your console). Strings with a defined encoding can be converted appropriately (see enc2native), but those with an "unknown" encoding are simply output as-is:

  • On Linux, your console is probably expecting UTF-8 text, and as 0xA3 is not a valid UTF-8 sequence, it gives you "?".
  • On Windows, your console is probably expecting Windows-1252 text, and as 0xA3 is the correct encoding for "£", that's what you see. (When the string is \uA3, a conversion from UTF-8 to Windows-1252 takes place.)
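
The "not a valid UTF-8 sequence" point in the first bullet can be checked directly, and the same check suggests why \xf0 fails in the question. This is a minimal sketch, assuming R 3.3.0 or later for validUTF8(). The lone bytes are built with as.raw() rather than \x escapes (my choice, not something from the original post), and the variable names are illustrative.

```r
# 0xA3 on its own is a UTF-8 continuation byte with no lead byte before it.
lone_a3 <- rawToChar(as.raw(0xa3))

# 0xF0 on its own is the lead byte of a 4-byte sequence with nothing after it.
lone_f0 <- rawToChar(as.raw(0xf0))

# The \u escape yields a complete, well-formed UTF-8 sequence for U+00A3.
pound <- "\u00a3"

# Element-wise check of whether each string is well-formed UTF-8.
valid <- validUTF8(c(lone_a3, lone_f0, pound))
valid

# The well-formed string decodes to its code point (0xA3 is 163 in decimal).
utf8ToInt(pound)
```

Only the third element should pass the check, which is consistent with a UTF-8 console having nothing sensible to draw for the lone byte.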

If the encoding is set explicitly, the appropriate conversion will take place on Linux:

> s <- '\xa3'
> Encoding(s) <- 'latin1'
> cat(s)
£
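
The effect of the declaration can also be inspected without relying on what the console shows. This is a minimal sketch using enc2utf8(), again building the byte with as.raw() rather than a \x escape (an assumption on my part, to keep the snippet independent of the parser and locale). The names s and u are illustrative.

```r
s <- rawToChar(as.raw(0xa3))
Encoding(s)                # no declared encoding yet

Encoding(s) <- "latin1"    # declare how the byte should be interpreted
u <- enc2utf8(s)           # convert to UTF-8 using that declaration

Encoding(u)
charToRaw(u)               # same two bytes that the \u form carries
```

Note that setting Encoding() only changes the declaration and leaves the stored byte alone; the actual re-encoding happens in enc2utf8() (or when the string is output).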
