You can use this PCRE regular expression to check whether a string is valid UTF-8. If the regex matches, the string contains invalid byte sequences. It's portable because it works on raw bytes and doesn't rely on PCRE having been compiled with UTF-8 support (PCRE_UTF8).
$regex = '/(
[\xC0-\xC1] # Invalid UTF-8 Bytes
| [\xF5-\xFF] # Invalid UTF-8 Bytes
| \xE0[\x80-\x9F] # Overlong encoding of prior code point
| \xF0[\x80-\x8F] # Overlong encoding of prior code point
| [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
| [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
| [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
| (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
| (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
| (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
| (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
| (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';
We can test it by creating a few variations of text:
// Overlong encoding of code point 0
$text = chr(0xC0) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// 5-byte sequence (lead byte 0xF8 is never valid in UTF-8)
$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// 6-byte sequence (lead byte 0xFC is never valid in UTF-8)
$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// 2-byte lead byte without a continuation byte after it
$text = chr(0xD0) . chr(0x01);
var_dump(preg_match($regex, $text)); // int(1)
etc...
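The cases above are all invalid input, so it's also worth checking that well-formed strings produce no match. Here's a minimal sketch of that; the helper name has_invalid_utf8 is hypothetical (it's not part of the regex itself), and the pattern is repeated only so the snippet runs on its own.

```php
<?php
// Hypothetical helper: returns true when the regex finds an invalid byte.
function has_invalid_utf8(string $text): bool
{
    $regex = '/(
        [\xC0-\xC1] # Invalid UTF-8 Bytes
        | [\xF5-\xFF] # Invalid UTF-8 Bytes
        | \xE0[\x80-\x9F] # Overlong encoding of prior code point
        | \xF0[\x80-\x8F] # Overlong encoding of prior code point
        | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
        | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
        | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
        | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
        | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
        | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
        | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
        | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
    )/x';

    return preg_match($regex, $text) === 1;
}

// Well-formed sequences of each length should not match.
var_dump(has_invalid_utf8('plain ASCII'));      // bool(false)
var_dump(has_invalid_utf8("\xC3\xA9"));         // bool(false), U+00E9 in 2 bytes
var_dump(has_invalid_utf8("\xE2\x82\xAC"));     // bool(false), U+20AC in 3 bytes
var_dump(has_invalid_utf8("\xF0\x9F\x98\x80")); // bool(false), U+1F600 in 4 bytes

// A lead byte cut off at the end of the string should match.
var_dump(has_invalid_utf8("\xC3"));             // bool(true)
```

Note the test strings use double quotes so PHP expands the \x escapes into bytes, while the pattern stays single-quoted so PCRE sees the escapes itself.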
In fact, since the regex matches the invalid bytes themselves, you can also pass it to preg_replace to strip them out:
$text = preg_replace($regex, '', $text); // Remove the matched invalid UTF-8 bytes
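One thing to watch with the replacement: preg_replace returns the cleaned string instead of modifying its argument, and a single pass can leave a continuation byte behind once the bytes before it are gone (for example, "\xE0\x80\x80" loses its first two bytes and keeps the third). Below is a rough sketch that repeats the replacement until nothing matches. The function name strip_invalid_utf8 is hypothetical, and looping is my own workaround, not something the regex requires.

```php
<?php
// Hypothetical helper: strips matched bytes, repeating until the regex is quiet.
function strip_invalid_utf8(string $text): string
{
    $regex = '/(
        [\xC0-\xC1] # Invalid UTF-8 Bytes
        | [\xF5-\xFF] # Invalid UTF-8 Bytes
        | \xE0[\x80-\x9F] # Overlong encoding of prior code point
        | \xF0[\x80-\x8F] # Overlong encoding of prior code point
        | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
        | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
        | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
        | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
        | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
        | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
        | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
        | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
    )/x';

    do {
        // $count receives the number of replacements made in this pass.
        $clean = preg_replace($regex, '', $text, -1, $count);
        if ($clean === null) {
            // preg_replace returns null on failure (e.g. backtrack limit hit).
            throw new RuntimeException('preg_replace failed: ' . preg_last_error());
        }
        $text = $clean;
    } while ($count > 0);

    return $text;
}

var_dump(strip_invalid_utf8("abc\xC0\x80def") === 'abcdef');          // bool(true)
var_dump(strip_invalid_utf8("\xE0\x80\x80ok") === 'ok');              // bool(true)
var_dump(strip_invalid_utf8("caf\xC3\xA9\xFF") === "caf\xC3\xA9");    // bool(true)
```

Each extra pass removes at least one byte, so the loop always terminates. If silently dropping bytes isn't acceptable, swapping the empty replacement for a marker such as '?' is a reasonable variation.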