utf 8 - How to detect UTF-8 characters in a Latin1 encoded column - MySQL

Question

Welcome To Ask or Share your Answers For Others

utf 8 - How to detect UTF-8 characters in a Latin1 encoded column - MySQL

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

utf 8 - How to detect UTF-8 characters in a Latin1 encoded column - MySQL

I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.

At this point I simply want to check what sort of data I have stored in my tables, as that will determine what approach I should use to convert the data.

Specifically, I want to check if I have UTF-8 characters in the Latin1 columns, what would be the best way to do this? If only a few rows are affected, then I can just fix this manually.

Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?

Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters? e.g. SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name); Is this enough?

At the moment I have switched my Mysql client encoding to UTF-8.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:05:52+0000

Character encoding, like time zones, is a constant source of problems.

What you can do is look for any "high-ASCII" characters as these are either LATIN1 accented characters or symbols, or the first of a UTF-8 multi-byte character. Telling the difference isn't going to be easy unless you cheat a bit.

To figure out what encoding is correct, you just SELECT two different versions and compare visually. Here's an example:

SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1, 
       CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8 
FROM users 
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']')

This is made unusually complicated because the MySQL regexp engine seems to ignore things like x80 and makes it necessary to use the UNHEX() method instead.

This produces results like this:

latin1                utf8
----------------------------------------
Bj??rn                Bj?rn

Categories

utf 8 - How to detect UTF-8 characters in a Latin1 encoded column - MySQL

utf 8 - How to detect UTF-8 characters in a Latin1 encoded column - MySQL

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags