Character encoding, like time zones, is a constant source of problems.
What you can do is look for any "high-ASCII" characters as these are either LATIN1 accented characters or symbols, or the first of a UTF-8 multi-byte character. Telling the difference isn't going to be easy unless you cheat a bit.
To figure out what encoding is correct, you just SELECT
two different versions and compare visually. Here's an example:
SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1,
CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8
FROM users
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']')
This is made unusually complicated because the MySQL regexp engine seems to ignore things like x80
and makes it necessary to use the UNHEX()
method instead.
This produces results like this:
latin1 utf8
----------------------------------------
Bj??rn Bj?rn
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…