regex - How to remove invalid UTF-8 characters from a JavaScript string?

Question

Welcome To Ask or Share your Answers For Others

regex - How to remove invalid UTF-8 characters from a JavaScript string?

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

regex - How to remove invalid UTF-8 characters from a JavaScript string?

I'd like to remove all invalid UTF-8 characters from a string in JavaScript. I've tried with this JavaScript:

strTest = strTest.replace(/([x00-x7F]|[xC0-xDF][x80-xBF]|[xE0-xEF][x80-xBF]{2}|[xF0-xF7][x80-xBF]{3})|./g, "$1");

It seems that the UTF-8 validation regex described here (link removed) is more complete and I adapted it in the same way like:

strTest = strTest.replace(/([x09x0Ax0Dx20-x7E]|[xC2-xDF][x80-xBF]|xE0[xA0-xBF][x80-xBF]|[xE1-xECxEExEF][x80-xBF]{2}|xED[x80-x9F][x80-xBF]|xF0[x90-xBF][x80-xBF]{2}|[xF1-xF3][x80-xBF]{3}|xF4[x80-x8F][x80-xBF]{2})|./g, "$1");

Both of these pieces of code seem to be allowing valid UTF-8 through, but aren't filtering out hardly any of the bad UTF-8 characters from my test data: UTF-8 decoder capability and stress test. Either the bad characters come through unchanged or seem to have some of their bytes removed creating a new, invalid character.

I'm not very familiar with the UTF-8 standard or with multibyte in JavaScript so I'm not sure if I'm failing to represent proper UTF-8 in the regex or if I'm applying that regex improperly in JavaScript.

Edit: added global flag to my regex per Tomalak's comment - however this still isn't working for me. I'm abandoning doing this on the client side per bobince's comment.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T01:04:02+0000

I use this simple and sturdy approach:

function cleanString(input) {
    var output = "";
    for (var i=0; i<input.length; i++) {
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}

Basically all you really want are the ASCII chars 0-127 so just rebuild the string char by char. If it's a good char, keep it - if not, ditch it. Pretty robust and if if sanitation is your goal, it's fast enough (in fact it's really fast).

Categories

regex - How to remove invalid UTF-8 characters from a JavaScript string?

regex - How to remove invalid UTF-8 characters from a JavaScript string?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags