The Gujarati starts ?????, correct? And the Malayalam starts ???, correct? And the English should have included Bureau’s.
This is the classic case of:
- The bytes you have in the client are correctly encoded in utf8. (Bureau is plain ASCII, which is encoded identically in latin1 and utf8; but ’ is not the ASCII apostrophe.)
- You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
- The column in the table was declared CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)
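As a rough illustration of the three conditions above, this Python sketch (not MySQL itself) mimics what happens to Bureau’s: the client produces utf8 bytes, and each byte is then treated as one latin1 character. It assumes Python's cp1252 codec is a fair stand-in for MySQL's latin1; the variable names are made up for the example.

```python
# Sketch only: Python's cp1252 codec stands in for MySQL's latin1.
text = "Bureau\u2019s"  # U+2019 is the curly apostrophe, not ASCII 0x27

# What the client actually sends: correct utf8 bytes.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex(" "))  # 42 75 72 65 61 75 e2 80 99 73

# What a latin1 connection plus a latin1 column believe those bytes mean:
# every byte becomes its own character.
misread = utf8_bytes.decode("cp1252")
print(misread)
print(len(text), len(misread))  # 8 10
```

The apostrophe's three bytes (e2 80 99) turn into three separate latin1 characters, which is why one character in the client shows up as several in the table.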
The fix for the data is a "2-step ALTER".
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;
where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc.) was already on the column.
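Unfortunately, if you have a lot of columns to work with, it will take a lot of ALTERs. You can (and should) MODIFY all the necessary columns of a single table in one pair of ALTERs: the first changes them all to VARBINARY, the second changes them all back to VARCHAR with CHARACTER SET utf8.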
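If there are many columns, the pair of statements can be generated rather than typed. The following is a minimal sketch, not a tested migration tool: the helper two_step_alters and the column names title and summary are hypothetical, it reuses the same length for both steps (assuming that is "big enough"), and it does no identifier quoting, so it should only be fed trusted names. Review the output before running it.

```python
def two_step_alters(table, columns):
    """Build the pair of ALTER statements for one table.

    columns: list of (name, length, attributes) tuples, where attributes
    is whatever else was already on the column, e.g. "NOT NULL".
    """
    def clause(name, type_sql, attrs):
        # Skip the attributes part when it is empty.
        parts = (f"MODIFY COLUMN {name}", type_sql, attrs)
        return " ".join(part for part in parts if part)

    # Step 1: every column becomes VARBINARY (bytes kept, charset dropped).
    step1 = ", ".join(
        clause(name, f"VARBINARY({length})", attrs)
        for name, length, attrs in columns
    )
    # Step 2: every column becomes VARCHAR again, now labelled utf8.
    step2 = ", ".join(
        clause(name, f"VARCHAR({length}) CHARACTER SET utf8", attrs)
        for name, length, attrs in columns
    )
    return [f"ALTER TABLE {table} {step1};", f"ALTER TABLE {table} {step2};"]


# Hypothetical columns, for illustration only.
statements = two_step_alters(
    "Tbl", [("title", 200, "NOT NULL"), ("summary", 1000, "")]
)
for stmt in statements:
    print(stmt)
```

All the VARBINARY changes go into the first statement and all the VARCHAR changes into the second, so the table is rebuilt twice no matter how many columns are involved.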
The fix for the code is to establish utf8 as the connection character set; how to do that depends on the API used in PHP. The ALTERs will change the column definition.
Edit
You have VARCHAR with the wrong CHARACTER SET. Hence, you see Mojibake like àa°à??àa2. Most conversion techniques try to preserve àa°à??àa2, but that is not what you need. Instead, taking a step to VARBINARY preserves the bits while ignoring the old definition of the bits representing latin1-encoded characters. The second step again preserves the bits, but now claiming they represent utf8 characters.
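The difference between preserving characters and preserving bits can be sketched in Python. This is only an analogy for what the two ALTERs do, assuming Python's cp1252 codec stands in for MySQL's latin1; the variable names are invented for the example.

```python
# Sketch only: Python's cp1252 codec stands in for MySQL's latin1.
# What the latin1 column believes it holds for Bureau’s (Mojibake):
stored = "Bureau\u00e2\u20ac\u2122s"

# An ordinary conversion preserves the *characters*: the Mojibake survives,
# it is merely re-encoded in utf8.
converted = stored.encode("utf-8").decode("utf-8")

# Step 1 (VARBINARY): forget the latin1 label, keep the raw bytes.
raw = stored.encode("cp1252")
# Step 2 (VARCHAR ... CHARACTER SET utf8): same bytes, now read as utf8.
fixed = raw.decode("utf-8")

print(converted == stored)  # True: still Mojibake
print(fixed)                # Bureau’s
```

The bytes never change in the second path; only the label attached to them does, which is the point of going through VARBINARY.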