Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
385 views
in Technique[技术] by (71.8m points)

php - Reference: Why are my "special" Unicode characters encoded weird using json_encode?

When using "special" Unicode characters they come out as weird garbage when encoded to JSON:

php > echo json_encode(['foo' => '馬']);
{"foo":"u99ac"}

Why? Have I done something wrong with my encodings?

(This is a reference question to clarify the topic once and for all, since this comes up again and again.)

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

First of all: There's nothing wrong here. This is how characters can be encoded in JSON. It is in the official standard. It is based on how string literals can be formed in Javascript ECMAScript (section 7.8.4 "String Literals") and is described as such:

Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. [...] So, for example, a string containing only a single reverse solidus character may be represented as "u005C".

In short: Any character can be encoded as u...., where .... is the Unicode code point of the character (or the code point of half of a UTF-16 surrogate pair, for characters outside the BMP).

"馬"
"u99ac"

These two string literals represent the exact same character, they're absolutely equivalent. When these string literals are parsed by a compliant JSON parser, they will both result in the string "馬". They don't look the same, but they mean the same thing in the JSON data encoding format.

PHP's json_encode preferably encodes non-ASCII characters using u.... escape sequences. Technically it doesn't have to, but it does. And the result is perfectly valid. If you prefer to have literal characters in your JSON instead of escape sequences, you can set the JSON_UNESCAPED_UNICODE flag in PHP 5.4 or higher:

php > echo json_encode(['foo' => '馬'], JSON_UNESCAPED_UNICODE);
{"foo":"馬"}

To emphasise: this is just a preference, it is not necessary in any way to transport "Unicode characters" in JSON.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...