Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
365 views
in Technique[技术] by (71.8m points)

utf 8 - PHP source code in UTF-8 files; how to interpret properly?

I build tools to analyze source code. Such tools have to read the source code files correctly, especially as regards character encodings. For example, "What is the precise string of bytes in a string literal?" (both PHP literals, and HTML text).

My perhaps erroneous understanding is that PHP source files are 8-bit character only (that is, the PHP engine reads them that way [right]?, since they are only supposed to contain 8 bit characters). But, eight bit characters in which encoding? (I presume intended to match ISO-8859-1 (-x?) [can somebody quote chapter and verse?]. That is, an umlaut is intended to be an umlaut, right? Following this, one can write PHP scripts with HTML and strings for most European nations/character sets straightforwardly.

But it is clear this is problematic with Unicode. As far as I can tell, most PHP applications deal with Unicode essentially by having strings containing UTF-8 byte sequences which can be inserted in 8-bit PHP strings. Following this, one can generate scripts whose HTML contains Unicode UTF-8 sequences, if you tell your server you are generating UTF-8 text.

For the above situations, one can read the PHP file as 8-bit character text, and this seems to me to match the language.

What puzzles me are PHP source files encoded as UTF-8 (the Joomla package has ~1800 source files, of which some 10 are UTF-8 and the rest are not). Any (non-ASCII) European characters that show correctly in a UTF-8 rendering are actually encoded as multibyte sequences. I suppose such pages served as UTF-8 will have the HTML rendered correctly. But any string comparisons for European characters or other Unicode characters that apparently render correctly in a text editor simply won't work. And string literals will not contain what they appear to contain. Do programmers use UTF-8 files because that's what editors offer? Are they doing this on purpose? Or is just an accident that doesn't matter for most work?

So, how should one read a PHP source file? (in particular, in what character encoding?) One possible answer is, always as ISO-8859-1 8 bit codes, regardless of the actual content or BOMs (I see a lot UTF-8 BOM-marked PHP files). Another answer is as UTF-8, if so marked.

[Our tools read and write arbitrary encodings. A "trivial" tool is read-file-in-one-character encoding, write identical code points in another encoding. Reading UTF-8 PHP files that way, gets us into trouble writing ISO8859-1 equivalent files, because many UTF-8 code points (e.g., the euro symbol) cannot be encoded in ISO8859-x.]

EDIT Aug 30: We now check PHP files to see if the have UTF-8 BOMs, or appear to have UTF-8 sequences that are all legal. In either of these cases, we read the file as UTF-8; otherwise we read it as ISO8859-1 by default. We now preserve the file encoding if we modify it. (Getting all this right is quite a lot of work). This seems to be a safe strategy, but that may be different than what PHP programmers are expecting.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

TL;DR

ASCII


Until PHP 5.4, the PHP interpreter didn't at all care about the charset of PHP files, as evidenced by the fact that the zend.script_encoding ini directive only appeared in that version. It always treated it as ASCII basically.

When PHP needs to identify, for example, a function name, that happens to contain characters beyond ASCII-7bit (well, any labeled entity with any label really, but you get my point...), it merely looks for a function in the symbol table with the same byte sequence - an umlaut (or whatever...) written in one way would be treated differently than an umlaut written in another way. Try it. For backwards compatibility, if zend.script_encoding is not set, this is still the default behavior. Also take note of the regex showing what is a valid identifier, which you can see is charset neutral (well... except latin letters, which are in the ASCII-7bit range), but shows you bytes instead.

This leads us also to the declare(encoding) construct. If you see THAT in a file, that's the definitive charset to honor for that particular file (ONLY). Use something else until you encounter one, and if you see more than one - honor the second one after its declare statement.

If there's none...

In a static context (i.e. when you don't know the effective ini settings), you'd need to fallback to something else (something that's user defined, ideally) when the charset is important, or otherwise just treat characters beyond ASCII-7bit as pure binary, and display them in some uniform code-point-like fashion.

In a dynamic context (e.g. if you could for example rename the file for a moment, create a temporary file at that place, with that name; have it echo the value of zend.script_encoding; restore back the normal file), you should use the zend.script_encoding value if available, and fallback to something else (just as in a static context) otherwise.

The same treatment applies to strings, HTML fragments and any other contents of a PHP file - it's just read as a binary string, except certain ASCII characters (i.e. bytes) that are important to the PHP lexer, such as the sequence "<?php" (notice that all are ASCII characters...); an apostrophe within a single quoted string; etc. - The interpreter itself doesn't care about a string's charset, and if you must display a string's contents on screen, you should use the above means to figure out the best way to do so.


Edge cases (requested in comments):

1.

Is there a restriction on what encoding are allowed?

There doesn't seem to be any list of allowed encodings anywhere, or at least I can't find one. Given that this is the successor of the --enable-zend-multibyte compile setting, UTF encodings of all flavors are sure to be in that list. Even if other (ANSI) encodings don't have an effect on PHP itself, that shouldn't deter you from using that value as a hint.

2.

How does "declare(encoding)" work if the source file is UTF-16 (null 8 bit bytes between 8 bit ascii chars for the declaration)?

zend.script_encoding is used until a declare(encoding) is encountered. If it's not set, ASCII is assumed. This shouldn't be a problem even in a UTF-16 file... right? (I don't use UTF-16) While this may be a problem for PHP files encoded as UTF-16, I think it's fair to say the vast majority of developers just don't encode their scripts in UTF-16. Their data, sure, if the application's case calls for it. But not the script itself. Most PHP files in the wild are encoded either with an ANSI encoding or UTF-8.

3.

If the .ini or the file setting is UTF-8 or otherwise, then identifiers are presumably taken only from code points in range x41-xFF, but not from code points x100 up?

I haven't tried supplying invalid UTF-8 bytes to tell you the answer to that one, nor does the manual ever state anything on the question. I would assume that PHP execution will fail with a parse error on that. Or at least it should. As far as your tool is concerned, it should report the invalid UTF-8 sequence anyway, since even if PHP allows it, that's still a QA problem.

4.

For UTF encodings, are characters in strings represented as their UTF code point (that makes no sense since PHP strings seem only have 8 bit characters)?

No. Characters in strings and non-PHP content are still treated as just a sequence of bytes, which you can confirm by looking at the output of strlen(), and seeing how it differs from mb_strlen(), which is the one that respects encoding (well... it respects the mbstring.internal_encoding setting to be exact, but still).

5.

If not, what does it mean to set the encoding to UTF something?

AFAIK, it affects lookups in the symbol table. With UTF set, umlauts written in different ways, or in different UTF flavors that end up with the same UTF code points... they would all converge on the same symbol, as opposed to without declare(encoding), where byte-by-byte comparrison is done instead. And I say "AFAIK" here, because frankly, I've never used such experiments myself... I'm a "do gooddy 'everything-as-valid-UTF-8'-er".


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...