I want to just filter it out
Your data comes with an unspecified encoding/charset. This is a huge problem.
You can first try to convert it to UTF-8 and then strip all non-printable characters:
$str = iconv('utf-8', 'utf-8//ignore', $str);
echo preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $str);
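To make the effect of that character class concrete, here is a small sketch. The wrapper name `strip_non_printable` is hypothetical and only used for illustration. It keeps letters, numbers, punctuation, symbols and separators, and it also shows that with the `/u` modifier `preg_replace()` returns `NULL` when the subject is not valid UTF-8, which is why invalid bytes have to be dealt with first.

```php
<?php
// Hypothetical wrapper for illustration; the name is not from the original answer.
function strip_non_printable($str)
{
    // Keep letters (\pL), numbers (\pN), punctuation (\pP), symbols (\pS)
    // and separators (\pZ); everything else (e.g. control characters) is removed.
    return preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $str);
}

// NUL and TAB are control characters, so they are dropped; the space is kept.
$clean = strip_non_printable("a\x00b\tc d");

// 0xFF can never occur in UTF-8, so the /u modifier makes preg_replace() fail.
$broken = strip_non_printable("\xFF");
```

Note that combining marks (`\pM`) are not in the class, so they would be stripped as well; add `\pM` if that is not wanted.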
The problem is that the iconv function can only try. It is supposed to drop any invalid character sequence. As of PHP 5.4, however, it drops the complete string if the input is not valid in the specified input encoding.
Since PHP 5.3 you already see a warning that the input string has an invalid encoding.
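Since the result depends on the PHP version, it can help to check what iconv actually returned before using it. A minimal sketch, assuming a hypothetical helper name `iconv_utf8_or_null` (not part of the original answer):

```php
<?php
// Hypothetical helper for illustration only.
function iconv_utf8_or_null($str)
{
    // @ suppresses the notice about illegal characters in the input;
    // iconv() returns false when the conversion fails.
    $converted = @iconv('UTF-8', 'UTF-8//IGNORE', $str);

    return $converted === false ? null : $converted;
}

$ok    = iconv_utf8_or_null('plain ascii');   // valid input passes through unchanged
$maybe = iconv_utf8_or_null("bad \xFF byte"); // null or a cleaned string, depending on the setup
```

A `null` result here means the whole string was rejected, which is the case the byte-level approach is meant to avoid.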
You can work around this by removing all invalid UTF-8 byte sequences first:
$str = valid_utf8_bytes($str);
echo preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $str);
/**
 * Get valid UTF-8 byte sequences.
 *
 * Takes over all matching bytes and drops an invalid sequence up to the
 * first non-matching byte.
 *
 * @param string $str
 * @return string
 */
function valid_utf8_bytes($str)
{
    $return = '';
    $length = strlen($str);
    $invalid = array_flip(array("\xEF\xBF\xBF" /* U-FFFF */, "\xEF\xBF\xBE" /* U-FFFE */));

    for ($i = 0; $i < $length; $i++)
    {
        # Remember the start offset of the sequence in $o.
        $c = ord($str[$o = $i]);

        if ($c < 0x80) $n = 0; # 0bbbbbbb
        elseif (($c & 0xE0) === 0xC0) $n = 1; # 110bbbbb
        elseif (($c & 0xF0) === 0xE0) $n = 2; # 1110bbbb
        elseif (($c & 0xF8) === 0xF0) $n = 3; # 11110bbb
        elseif (($c & 0xFC) === 0xF8) $n = 4; # 111110bb
        else continue; # Does not match

        # After ++$n, $n is the total length of the sequence in bytes.
        for ($j = ++$n; --$j;) # n bytes matching 10bbbbbb follow ?
            if ((++$i === $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                continue 2;

        $match = substr($str, $o, $n);

        if ($n === 3 && isset($invalid[$match])) # test invalid sequences
            continue;

        $return .= $match;
    }

    return $return;
}
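The lead-byte tests in the loop can be looked at in isolation. The following sketch uses a hypothetical helper, `utf8_expected_length`, that applies the same bit masks to a single byte and reports how long the sequence starting with that byte should be. It is limited to sequences of one to four bytes and is meant as an illustration, not as a replacement for the function above.

```php
<?php
// Hypothetical helper for illustration only; it is not part of the original answer.
function utf8_expected_length($byte)
{
    $c = ord($byte); // ord() looks at the first byte of the string only

    if ($c < 0x80) return 1;            // 0bbbbbbb: single-byte (ASCII)
    if (($c & 0xE0) === 0xC0) return 2; // 110bbbbb: lead byte of a 2-byte sequence
    if (($c & 0xF0) === 0xE0) return 3; // 1110bbbb: lead byte of a 3-byte sequence
    if (($c & 0xF8) === 0xF0) return 4; // 11110bbb: lead byte of a 4-byte sequence

    return 0; // continuation byte (10bbbbbb) or not a lead byte handled here
}

$ascii        = utf8_expected_length('A');
$twoBytes     = utf8_expected_length("\xC3");
$continuation = utf8_expected_length("\x80");
```

A result of `0` corresponds to the `else continue;` branch: the byte cannot start a sequence, so it is skipped.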
Fight with dragons for too long and you become a dragon yourself; gaze into the abyss for too long and the abyss will gaze back…