I'd like to remove all invalid UTF-8 characters from a string in JavaScript. I've tried with this JavaScript:
strTest = strTest.replace(/([x00-x7F]|[xC0-xDF][x80-xBF]|[xE0-xEF][x80-xBF]{2}|[xF0-xF7][x80-xBF]{3})|./g, "$1");
It seems that the UTF-8 validation regex described here (link removed) is more complete and I adapted it in the same way like:
strTest = strTest.replace(/([x09x0Ax0Dx20-x7E]|[xC2-xDF][x80-xBF]|xE0[xA0-xBF][x80-xBF]|[xE1-xECxEExEF][x80-xBF]{2}|xED[x80-x9F][x80-xBF]|xF0[x90-xBF][x80-xBF]{2}|[xF1-xF3][x80-xBF]{3}|xF4[x80-x8F][x80-xBF]{2})|./g, "$1");
Both of these pieces of code seem to be allowing valid UTF-8 through, but aren't filtering out hardly any of the bad UTF-8 characters from my test data: UTF-8 decoder capability and stress test. Either the bad characters come through unchanged or seem to have some of their bytes removed creating a new, invalid character.
I'm not very familiar with the UTF-8 standard or with multibyte in JavaScript so I'm not sure if I'm failing to represent proper UTF-8 in the regex or if I'm applying that regex improperly in JavaScript.
Edit: added global flag to my regex per Tomalak's comment - however this still isn't working for me. I'm abandoning doing this on the client side per bobince's comment.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…