Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


sanitize - What's up with these Unicode combining characters and how can we filter them?

[Sample text lost to encoding damage: 18 space-separated groups of 21 characters each.]

These recently showed up in Facebook comment sections.

How can we sanitize this?



1 Reply


What's up with these Unicode characters?

That's a base character followed by a series of combining characters. Because the combining characters in question are drawn above the base character, they stack up (literally). For instance, in the case of

ก้้้้้้้้้้้้้้้้้้้้

...it's a ก (Thai character ko kai, U+0E01) followed by 20 copies of the Thai combining character mai tho (U+0E49).
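The composition described above can be checked with a short script. This is a minimal sketch in Python, assuming the standard `unicodedata` module; the helper name `describe` is made up for illustration and is not part of any tool mentioned in this answer.

```python
import unicodedata

def describe(text):
    """Return (code point label, Unicode name, is_combining) per code point."""
    rows = []
    for ch in text:
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            # A non-zero canonical combining class means the character
            # attaches to the preceding base character.
            unicodedata.combining(ch) != 0,
        ))
    return rows

# The example string: ko kai followed by 20 copies of mai tho.
sample = "\u0e01" + "\u0e49" * 20
rows = describe(sample)
for label, name, is_combining in rows[:3]:
    print(label, name, "combining" if is_combining else "base")
```

The first row is the base character and every row after it is a combining mark, which is why the rendered text towers upward.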

How can we sanitize this?

You could pre-process the text and limit the number of combining characters that can be applied to a single base character, but the effort may not be worth the reward. You'd need the Unicode character data for all current characters so you'd know which ones are combining, and you'd need to be sure to allow at least a few, because some languages are written with several diacritics on a single base. If you want to limit comments to the Latin character set, that would be an easier range check, but of course that's only an option if you're willing to limit comments to just a few languages. More information, code charts, etc. at unicode.org.
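As a rough illustration of the pre-processing idea, here is a minimal Python sketch. It assumes the character data bundled with Python's `unicodedata` module is recent enough for your input, and it treats the general categories `Mn` (nonspacing mark) and `Me` (enclosing mark) as "combining". The function name `limit_combining` and the default cap of 3 are arbitrary choices for the example, not a recommendation.

```python
import unicodedata

def limit_combining(text, max_marks=3):
    """Drop combining marks beyond max_marks in any consecutive run."""
    out = []
    run = 0  # number of combining marks seen since the last base character
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # skip the excess mark
        else:
            run = 0  # a base character starts a new run
        out.append(ch)
    return "".join(out)

stacked = "\u0e01" + "\u0e49" * 20
cleaned = limit_combining(stacked)
print(len(stacked), len(cleaned))  # 21 4
```

Ordinary text such as "e" followed by a single combining acute accent passes through unchanged, since it stays under the cap. Whether 3 is a safe cap for every language you need to support is something you would have to check.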

BTW, if you ever want to know how some text was composed: for another question just recently, I coded up a quick-and-dirty "Unicode Show Me" page on JSBin. You copy and paste the text into the text area, and it shows you all of the code points (~characters) that the text is made up of, with links to the page describing each character.

It only works for code points U+FFFF and under, because it's written in JavaScript, and handling code points above U+FFFF in JavaScript takes more work than I wanted to do for that question. A JavaScript "character" is always a 16-bit UTF-16 code unit, so a code point above U+FFFF is split across two separate JavaScript "characters" (a surrogate pair), and I didn't account for that. It's still handy for most texts...
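To make the 16-bit limitation concrete, here is a small Python sketch. Python strings are indexed by code point, so the example encodes to UTF-16 explicitly to show the code units a JavaScript string would hold. The helper `utf16_units` is a name invented for this example.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

bmp = "\u0e01"         # THAI CHARACTER KO KAI, below U+FFFF
astral = "\U0001F600"  # a code point above U+FFFF

print(len(bmp), len(utf16_units(bmp)))        # 1 1
print(len(astral), len(utf16_units(astral)))  # 1 2
print([hex(u) for u in utf16_units(astral)])  # ['0xd83d', '0xde00']
```

The second string is one code point but two code units, which is the case a tool that walks a string one 16-bit unit at a time would report as two separate "characters".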

