Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

html - How to substitute non SGML characters in String using PHP?

I programmed a guestbook using PHP4 and HTML 4.01 (with the charset ISO-8859-15, i.e. latin-9). The data is stored in a MySQL database with the charset ISO-8859-1 (i.e. latin-1).

When somebody enters characters from a different charset, it seems that the browsers send the data encoded (actually I have not checked where it gets encoded, ...).

Anyway, in some cases it seems that characters are not saved encoded in the database. Thus, the validator returns an error message when I show the data within an HTML 4.01 document:

non SGML character number 146

You have used an illegal character in your text. HTML uses the standard UNICODE Consortium character repertoire, and it leaves undefined (among others) 65 character codes (0 to 31 inclusive and 127 to 159 inclusive) that are sometimes used for typographical quote marks and similar in proprietary character sets. The validator has found one of these undefined characters in your document. The character may appear on your browser as a curly quote, or a trademark symbol, or some other fancy glyph; on a different computer, however, it will likely appear as a completely different character, or nothing at all.

Your best bet is to replace the character with the nearest equivalent ASCII character, or to use an appropriate character entity. For more information on Character Encoding on the web, see Alan Flavell's excellent HTML Character Set Issues reference.

This error can also be triggered by formatting characters embedded in documents by some word processors. If you use a word processor to edit your HTML documents, be sure to use the "Save as ASCII" or similar command to save the document without formatting information.

I'm now using PHP 5.2.17 and played a bit with htmlspecialchars, but nothing worked. How can I encode those characters so that there are no more validation errors?

1 Reply

In both ISO-8859-1 and ISO-8859-15, character number 146 (0x92) is a control character from the C1 range (128-159), namely PU2 (Private Use Two).

SGML refers to ISO 8859-1 (note the space between ISO and 8859-1, where the charset names you use have a hyphen). It allows only three control characters (here: SGML in HTML):

In the HTML document character set only three control characters are allowed: Horizontal Tab, Carriage Return, and Line Feed (code positions 9, 13, and 10).

You therefore did pass an illegal character. No SGML/HTML entity exists for it that you could use as a replacement.

I suggest you validate the input that comes into your application so that it does not contain control characters. If you believe those characters originally represented something useful, like a letter that can actually be read (i.e. not a control character), it's likely that the encoding gets broken at some point while you process the data.
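As a minimal sketch of such input validation, assuming the text is in a single-byte encoding such as ISO-8859-1 or ISO-8859-15: the two helper functions below are hypothetical names, not part of PHP. They treat every byte in 0-31 (except tab, line feed and carriage return) and 127-159 as disallowed.

```php
<?php
// Hypothetical helper: removes every control byte that HTML 4.01 does not
// allow, keeping only tab (9), line feed (10) and carriage return (13).
// Assumes $text is in a single-byte encoding such as ISO-8859-1/-15.
function stripDisallowedControls($text)
{
    // No /u modifier, so the pattern matches raw bytes:
    // 0-8, 11, 12, 14-31, and 127-159 (DEL plus the C1 range).
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/', '', $text);
}

// Hypothetical helper: reports whether any such byte is present.
function hasDisallowedControls($text)
{
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/', $text) === 1;
}
```

Note that stripping is lossy: if a byte in 128-159 was meant as a printable character in another encoding, it is simply dropped. Rejecting or logging the input via hasDisallowedControls() may be the safer first step.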

From the information given in your question it's hard to say where, because you only specify the input encoding and the encoding of the database field, and those two already don't match (which should not produce the issue you're asking about, but can produce other issues). Besides those two places, there is also the database client connection charset, the output encoding and the response content encoding, all of which are unspecified in your question.
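One way to find the stage where the encoding breaks is to look at the raw bytes of the same value at each point (after the request, after reading it back from the database, right before output). The following is only a debugging sketch; describeBytes() is a hypothetical helper name.

```php
<?php
// Hypothetical debugging helper: summarizes the raw bytes of a string so the
// same value can be compared at each stage (request, database, output).
function describeBytes($label, $value)
{
    // preg_match with the /u modifier returns 1 only for valid UTF-8 input.
    $isUtf8 = preg_match('//u', $value) === 1;

    return sprintf('%s: hex=%s utf8=%s', $label, bin2hex($value), $isUtf8 ? 'yes' : 'no');
}

// Example: the same apostrophe as a Windows-1252 byte and as UTF-8.
echo describeBytes('cp1252', "\x92"), "\n";         // cp1252: hex=92 utf8=no
echo describeBytes('utf-8', "\xE2\x80\x99"), "\n";  // utf-8: hex=e28099 utf8=yes
```

Keep in mind that pure ASCII text is also valid UTF-8, so "utf8=yes" only tells you something for strings that contain bytes above 127.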

It might make sense to change your overall encoding to UTF-8 to support a wider range of characters, but that is only a suggestion, not a requirement.

Edit: The part above takes a rather strict view. It occurred to me that the input you receive may not actually be ISO-8859-1 or ISO-8859-15 but something else, such as a Windows code page, most likely Windows-1252 (cp1252, see its Wikipedia page). Where ISO-8859-1 has the C1 control range (128-159), Windows-1252 has printable characters in most positions.

The Wikipedia page also notes that most browsers treat ISO-8859-1 as Windows-1252 (CP1252). The PHP htmlentities() function is not able to deal with these characters because its translation table for HTML entities does not cover these code points (PHP 5.3, not tested against 5.4). You need to create your own translation table and use it with strtr to replace the Windows-1252 characters that are not available in ISO 8859-15:

/*
 * mappings of Windows-1252 (cp1252)  128 (0x80) - 159 (0x9F) characters:
 * @link http://en.wikipedia.org/wiki/Windows-1252
 * @link http://www.w3.org/TR/html4/sgml/entities.html
 */
$cp1252HTML401Entities = array(
    "\x80" => '&euro;',    # 128 -> euro sign, U+20AC NEW
    "\x82" => '&sbquo;',   # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&fnof;',    # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&bdquo;',   # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&hellip;',  # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&dagger;',  # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&Dagger;',  # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&circ;',    # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&permil;',  # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&Scaron;',  # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&lsaquo;',  # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&OElig;',   # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',    # 142 -> U+017D (no named entity in HTML 4.01)
    "\x91" => '&lsquo;',   # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&rsquo;',   # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&ldquo;',   # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&rdquo;',   # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&bull;',    # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&ndash;',   # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&mdash;',   # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&tilde;',   # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&trade;',   # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&scaron;',  # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&rsaquo;',  # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&oelig;',   # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',    # 158 -> U+017E (no named entity in HTML 4.01)
    "\x9F" => '&Yuml;',    # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);

# replace each cp1252 byte in $output with its entity
$outputWithEntities = strtr($output, $cp1252HTML401Entities);
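One detail worth sketching is the order of operations when the text also has to be escaped for HTML. The example below assumes you escape markup with htmlspecialchars() first and substitute the cp1252 bytes afterwards; escapeForHtml401() and the two-entry $excerpt table are hypothetical and only stand in for the full mapping.

```php
<?php
// Excerpt of the mapping for illustration only; the complete table is the
// one given in the answer.
$excerpt = array(
    "\x92" => '&rsquo;', // 146 -> right single quotation mark
    "\x96" => '&ndash;', // 150 -> en dash
);

// Hypothetical helper: escape markup first, then substitute the cp1252 bytes.
// Doing it the other way round would turn "&rsquo;" into "&amp;rsquo;".
function escapeForHtml401($text, array $map)
{
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'ISO-8859-1');
    return strtr($escaped, $map);
}

echo escapeForHtml401("It\x92s <b>1 \x96 2</b>", $excerpt), "\n";
// It&rsquo;s &lt;b&gt;1 &ndash; 2&lt;/b&gt;
```

The charset argument is passed explicitly because the default of htmlspecialchars() differs between PHP versions.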

If you want to be even safer, you can skip the named entities and use only numeric ones, which should work in very old browsers as well:

$cp1252HTMLNumericEntities = array(
    "\x80" => '&#8364;',   # 128 -> euro sign, U+20AC NEW
    "\x82" => '&#8218;',   # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&#402;',    # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&#8222;',   # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&#8230;',   # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&#8224;',   # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&#8225;',   # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&#710;',    # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&#8240;',   # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&#352;',    # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&#8249;',   # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&#338;',    # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',    # 142 -> U+017D
    "\x91" => '&#8216;',   # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&#8217;',   # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&#8220;',   # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&#8221;',   # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&#8226;',   # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&#8211;',   # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&#8212;',   # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&#732;',    # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&#8482;',   # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&#353;',    # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&#8250;',   # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&#339;',    # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',    # 158 -> U+017E
    "\x9F" => '&#376;',    # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);
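If you would rather not trust a hand-typed table, the numeric entities can probably be derived from iconv's own Windows-1252 mapping. This is only a cross-check sketch: buildCp1252NumericEntities() is a hypothetical helper, and it assumes your iconv build knows both the 'Windows-1252' and 'UCS-4BE' names.

```php
<?php
// Hypothetical cross-check: derive the numeric entities for bytes 128-159
// from iconv's Windows-1252 table instead of typing them by hand.
function buildCp1252NumericEntities()
{
    $table = array();
    for ($byte = 0x80; $byte <= 0x9F; $byte++) {
        // Convert the single byte to a 4-byte big-endian code point.
        // Bytes without a mapping make iconv fail, hence the @ and the check.
        $ucs4 = @iconv('Windows-1252', 'UCS-4BE', chr($byte));
        if ($ucs4 === false || strlen($ucs4) !== 4) {
            continue;
        }
        $parts = unpack('Ncode', $ucs4);
        // Skip bytes that an iconv build maps straight to C1 controls.
        if ($parts['code'] < 0xA0) {
            continue;
        }
        $table[chr($byte)] = '&#' . $parts['code'] . ';';
    }
    return $table;
}
```

The result can be passed to strtr in the same way as the table above, or compared against it to catch typos.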

Hope this is more helpful now. See also the Wikipedia page mentioned above for some characters that exist in both Windows-1252 and ISO 8859-15 but at different code points. You should probably consider using UTF-8 on your website.
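If you do move to UTF-8, existing data that is really Windows-1252 would need to be re-encoded once. A minimal sketch, assuming the iconv extension is available and the stored bytes are in fact Windows-1252 (cp1252ToUtf8() is a hypothetical helper name):

```php
<?php
// Hypothetical migration helper: re-encode a Windows-1252 string as UTF-8.
// Returns false if iconv cannot convert the input, for example because it
// contains a byte that the iconv build leaves undefined in Windows-1252.
function cp1252ToUtf8($text)
{
    return @iconv('Windows-1252', 'UTF-8', $text);
}

// The right single quotation mark (byte 0x92) becomes the three UTF-8
// bytes E2 80 99.
$utf8 = cp1252ToUtf8("It\x92s");
echo bin2hex($utf8), "\n"; // 4974e2809973
```

After such a conversion the page, the database connection and the HTTP response would all have to declare UTF-8 as well, otherwise the mismatch just moves to another place.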
