regex - PHP - Fast way to strip all characters not displayable in browser from utf8 string

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a somewhat messy database containing the names of many institutions around the world.

I want to display them with their national characters intact, but without invalid characters, meaning those that Firefox displays as Unicode numbers.

How can I filter them out?

The database uses UTF-8 encoding, but some strings were inserted with the wrong encoding or were already garbled in the sources.

I do not want to fix the database: it is too big. I want to just filter it out ("out of sight, out of mind").

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:31:51+0000

"I want to just filter it out"

Your data has an unspecified encoding/charset, and that is a serious problem.

You can first try to convert it to UTF-8 and then strip all non-printable characters:

$str = iconv('utf-8', 'utf-8//ignore', $str);

echo preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $str);
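To make the effect of that character class concrete, here is a small sketch. The helper name `strip_unprintable` and its `$keepMarks` switch are my own additions for illustration, not part of the original snippet. It shows two side effects worth knowing: tabs and newlines count as control characters and are removed, and combining marks (category M) are removed too unless `\pM` is added to the class, which matters for names stored in decomposed form.

```php
<?php
// Hypothetical helper: same character class as above, optionally extended with \pM.
function strip_unprintable($str, $keepMarks = false)
{
    $class = $keepMarks ? '\pL\pM\pN\pP\pS\pZ' : '\pL\pN\pP\pS\pZ';
    return preg_replace('/[^' . $class . ']/u', '', $str);
}

// BEL (\x07) and TAB are control characters (category Cc), so both are removed.
// The space is a separator (category Zs) and is kept.
$a = strip_unprintable("Universit\xC3\xA9\x07 de\tParis");

// "u" followed by U+0308 COMBINING DIAERESIS (bytes CC 88): a decomposed "ü".
$decomposed = "Zu\xCC\x88rich";
$b = strip_unprintable($decomposed);        // the mark is stripped, leaving "Zurich"
$c = strip_unprintable($decomposed, true);  // the mark survives

var_dump($a === "Universit\xC3\xA9 deParis"); // bool(true)
var_dump($b === 'Zurich');                    // bool(true)
var_dump($c === $decomposed);                 // bool(true)
```

Whether to keep `\pM` depends on how the names were stored; if line breaks should survive, they would have to be added to the class explicitly as well.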

The problem is that iconv() can only try. It drops any invalid character sequence. As of PHP 5.4, however, it drops the complete string if the input is not valid in the specified input encoding.

PHP 5.3 and later already emit a warning that the input string has an invalid encoding.
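The regular-expression step has a similar failure mode of its own: with the /u modifier PCRE validates the subject first, and on malformed input preg_replace() returns NULL rather than a filtered string. A minimal sketch of how to detect that up front, assuming a helper I have named `is_valid_utf8` (hypothetical, not a built-in):

```php
<?php
// Hypothetical helper: true when $str is well-formed UTF-8.
// The empty pattern always matches a valid subject, so preg_match() returns 1;
// on a malformed subject it returns false.
function is_valid_utf8($str)
{
    return preg_match('//u', $str) === 1;
}

$good = "Universit\xC3\xA9"; // "é" encoded as UTF-8 (bytes C3 A9)
$bad  = "Universit\xE9";     // "é" as a single ISO-8859-1 byte: not valid UTF-8

var_dump(is_valid_utf8($good)); // bool(true)
var_dump(is_valid_utf8($bad));  // bool(false)

// The filter itself fails on the malformed string instead of cleaning it.
$filtered = preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $bad);
var_dump($filtered); // NULL
```

A check like this lets you leave clean rows alone and apply the more expensive byte-level cleanup only where it is needed.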

You can work around this by removing all invalid UTF-8 byte sequences first:

$str = valid_utf8_bytes($str);

echo preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $str);

/**
 * Get valid UTF-8 byte sequences.
 *
 * Copy every matching sequence to the output; when a sequence is broken,
 * drop its bytes up to the first non-matching byte.
 *
 * @param string $str
 * @return string
 */
function valid_utf8_bytes($str)
{
    $return = '';
    $length = strlen($str);
    $invalid = array_flip(array("\xEF\xBF\xBF" /* U+FFFF */, "\xEF\xBF\xBE" /* U+FFFE */));

    for ($i=0; $i < $length; $i++)
    {
        $c = ord($str[$o=$i]);

        if ($c < 0x80) $n=0; # 0bbbbbbb
        elseif (($c & 0xE0) === 0xC0) $n=1; # 110bbbbb
        elseif (($c & 0xF0) === 0xE0) $n=2; # 1110bbbb
        elseif (($c & 0xF8) === 0xF0) $n=3; # 11110bbb
        elseif (($c & 0xFC) === 0xF8) $n=4; # 111110bb
        else continue; # Does not match

        for ($j=++$n; --$j;) { # do the expected 10bbbbbb continuation bytes follow?
            if ((++$i === $length) || ((ord($str[$i]) & 0xC0) != 0x80)) {
                $i--; # step back so the non-matching byte is re-examined, not skipped
                continue 2;
            }
        }

        $match = substr($str, $o, $n);

        if ($n === 3 && isset($invalid[$match])) # test invalid sequences
            continue;

        $return .= $match;
    }
    return $return;
}
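The loop above checks only the shape of each sequence (a lead byte followed by the right number of continuation bytes), so overlong forms, UTF-16 surrogates and 5-byte sequences still pass, and PCRE may reject those later. As a stricter alternative, here is a sketch that swaps the loop for a byte-mode regular expression following the well-formed ranges of RFC 3629. The function name `keep_wellformed_utf8` is hypothetical, and unlike the loop it does not single out U+FFFE and U+FFFF.

```php
<?php
// Hypothetical alternative to the byte loop: keep only well-formed UTF-8
// sequences (no overlong forms, no surrogates, nothing above U+10FFFF).
// The pattern has no u modifier, so it matches raw bytes.
function keep_wellformed_utf8($str)
{
    $pattern = '/
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2-byte, excluding overlong C0 and C1
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlong forms
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte, straight
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, planes 1 to 3
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, planes 4 to 15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, plane 16
    /x';

    // Bytes that fit none of the alternatives are skipped between matches.
    preg_match_all($pattern, $str, $matches);
    return implode('', $matches[0]);
}

// A stray ISO-8859-1 byte (E9) is dropped; the valid "ü" (C3 BC) is kept.
$clean = keep_wellformed_utf8("Universit\xE9 Z\xC3\xBCrich");
var_dump($clean === "Universit Z\xC3\xBCrich"); // bool(true)
```

The result can then be passed to the preg_replace() filter. On very long strings preg_match_all() builds a large match array, so processing the data row by row is probably the safer choice.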
