Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
724 views
in Technique[技术] by (71.8m points)

utf 8 - How default is the default encoding (UTF-8) in the XML Declaration?

I know that the default encoding of XML is UTF-8. All XML consumers MUST and so on and so forth. So this is not just a question whether or not XML has a default encoding.

I also know that the XML-Declarataion <?xml version="1.0" ... ?> at the beginning of the document itself is optional. And that specifying the encoding therein is optional as well.

So I ask myself if the following two XML-Declarations are two expressions for the exact same thing:

<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>

From my own current understanding I would say those are equivalent but I do not know. Has the equivalence of these two declarations been specified somewhere?

(Consider these two example lines being each the first line of an XML document, preceded by any (zero) bytes and being UTF-8 encoded)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The Short Answer

Under the very specific circumstances of a UTF-8 encoded document with no external encoding information (which I understand from the comments is what you're interested in), there is no difference between the two declarations.

The long answer is far more interesting though.

What The Spec Says

If you look at Appendix F1 of the XML specification, that explains the process that should be followed to determine the encoding when there is no external encoding information.

If the document is encoded as one of the UTF variants, the parser should be able to detect the encoding within the first 4 bytes, either from the Byte Order Mark, or the start of the XML declaration.

However, according to the spec, it should still read the encoding declaration.

In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity.

If they don't match, according to section 4.3.3:

...it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration

Encoded UTF-16, Declared UTF-8

Let's see what happens in reality when we create an XML document encoded as UTF-16 but with the encoding declaration set to UTF-8.

Opera, Firefox and Chrome all interpret the document as UTF-16, ignoring the encoding declaration. Internet Explorer (version 9 at least), displays a blank document, but no actual error.

So if you include a UTF-8 encoding declaration on your UTF-8 document and someone at a later stage converts it to UTF-16, it'll work in most browsers, but fail in IE (and, I assume, most Microsoft XML APIs). If you had left the encoding declaration off, you would have been fine.

Technically I think IE is the most accurate. The fact that it doesn't display an error as such might be explained by the fact that the error is occurring at the encoding level rather than the XML level. It is assumedly doing its best to interpret the UTF-16 characters as UTF-8, failing to find any characters that decode, and ending up passing on an empty character sequence to the XML parser.

Encoded UTF-8, Declared Otherwise

You might now think that Firefox, Chrome and Opera are just ignoring the encoding declaration altogether, but that's not always the case.

If you encode a document as UTF-8 (with a byte order marker so it's unmistakable as anything else), but set the encoding declaration to Latin1, all of the browsers will successfully decode the content as Latin1, ignoring the UTF-8 BOM.

Again this seems right to me. The fact that the BOM characters aren't valid in Latin1 just means they are silently dropped at the character decoding level.

This doesn't work for all declared encodings on a UTF-8 document though. If the declared encoding is UTF-16, we're back with Opera, Firefox and Chrome ignoring the declared encoding, while Internet Explorer returns a blank document.

Essentially, anything that makes IE return a blank document is going to make other browsers ignore the declared encoding.

Other Inconsistencies

It's also worth mentioning the importance of the Byte Order Mark. According to section 4.3.3 of the spec:

Entities encoded in UTF-16 MUST [...] begin with the Byte Order Mark

However, if you try and read a UTF-16 encoded XML document without a BOM, most browsers will nevertheless accept it as valid. Only Firefox reports it as an XML Parsing Error.

External Encoding Information

Up to now, we've been considering what happens when there is no external encoding information, but, as others have mentioned, if the document is received via HTTP or enclosed in a MIME envelope of some sort, the encoding information from those sources should take preference over the document encoding.

Most of the details for the various XML MIME types are described in RFC3023. However, the reality is somewhat different from what is specified.

First of all, text/xml with an omitted charset parameter should use a charset of US-ASCII, but that requirement has almost always been ignored. Browsers will typically use the value of the XML encoding declaration, or default to UTF-8 if there is none.

Second, if there is a UTF-8 BOM on the document, and the XML encoding declaration is either UTF-8 or not included, the document will be interpreted as UTF-8, regardless of the charset used in the Content-Type.

The only time the encoding from the Content-Type seems to take precedence is when there is no BOM and an explicit charset is specified in the Content-Type.

In any event, there are no cases (involving Content-Type) where including a UTF-8 XML encoding declaration on a UTF-8 document is any different from not having an encoding declaration at all.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...