This should work:
if (max(array_map('ord', str_split($string))) >= 240)
The rationale is that code points up to and including U+FFFF are encoded in at most three bytes, the longest form being 1110xxxx 10xxxxxx 10xxxxxx. Higher code points are of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, i.e. the leading byte has a value of 240 (0xF0) or higher. No other byte in valid UTF-8 reaches that value, so if any such byte appears in the string, it indicates a 4-byte sequence.
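The same byte check can be wrapped in a small function that stops at the first match and also handles an empty string. This is only a sketch: the name hasFourByteChars is made up for illustration, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper, not part of any library.
// Assumes $string is valid UTF-8.
function hasFourByteChars(string $string): bool
{
    $length = strlen($string); // length in bytes, not characters
    for ($i = 0; $i < $length; $i++) {
        // A byte of 0xF0 (240) or above can only be the lead byte
        // of a 4-byte sequence in valid UTF-8.
        if (ord($string[$i]) >= 0xF0) {
            return true;
        }
    }
    return false;
}

var_dump(hasFourByteChars("h\u{E9}llo \u{20AC}")); // bool(false): 2- and 3-byte characters only
var_dump(hasFourByteChars("smile \u{1F600}"));     // bool(true): U+1F600 needs four bytes
```

Looping over the bytes avoids building an intermediate array, which may matter for long strings.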
If you want to remove long characters, this will do:
preg_replace_callback('/./u', function (array $match) {
    // Drop any character that takes four bytes in UTF-8; keep everything else
    return strlen($match[0]) >= 4 ? '' : $match[0];
}, $string)
Though there may be a more elegant regex way to express high codepoints directly.
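One way to express the high code points directly is a character class covering U+10000 through U+10FFFF, which PCRE accepts in /u mode. A minimal sketch, with function names invented for illustration:

```php
<?php
// Hypothetical helpers, not part of any library.

// True if the string contains a code point above U+FFFF.
function containsSupplementary(string $string): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $string) === 1;
}

// Removes every code point above U+FFFF.
function stripSupplementary(string $string): string
{
    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // returning the input unchanged in that case is an assumption of this sketch.
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $string) ?? $string;
}

var_dump(containsSupplementary("a\u{1F600}b")); // bool(true)
var_dump(stripSupplementary("a\u{1F600}b"));    // string(2) "ab"
```

Unlike the raw byte check, this works on code points, so it relies on the input being valid UTF-8 for the pattern to match at all.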