Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


java - How to avoid Junk/garbage characters while reading data from multiple languages?

I am parsing RSS news feeds in more than 10 different languages.

All the parsing is done in Java, and the data is stored in MySQL; my APIs, written in PHP, then serve it to the clients.

I constantly come across garbage characters when I read the data.

What I have tried:

  1. I have configured MySQL to store UTF-8 data. The database, the table, and even the column have utf8 as their default character set.
  2. When connecting to the database, I set the character set results to utf-8.

When I run the jar file manually to insert the data, the characters appear fine. But when I set up a cron job for the same jar file, the problem comes back.

In English, I mostly see problems like the English sample below. In the other vernacular languages, the characters are complete garbage and I cannot recognize a single one.

Is there anything that I am missing?

Sample garbage characters:

Gujarati :"àa°à??àa2àaμà?? àa?à?àa?àa?àa?àa°à?€àa?àa?àa? àa?àa?àa?àa?àa¨ àa?à??àa°à?€ àa¥àa?à?? àa¤à?? àa?àa3àa?à?? àaμàa3àa¤àa°!"

Malayalam : "à′¨àμ?à′aàμà′aà′?à′3à′?à′2àμ?à′?àμà′?àμà′3àμà′3 à′?àμ?à′3àμa€ à′¨à′?à′°à′?àμà′?àμ à′?àμà′±à′?àμà′?àμ"

English : Bank Board Bureaua€?s ambit to widen to financial sector PSUs



1 Reply


The Gujarati starts ?????, correct? And the Malayalam starts ???, correct? And the English should have included Bureau’s.

This is the classic case of

  • The bytes you have in the client are correctly encoded in utf8. (Bureau is encoded in the ASCII/latin1 subset of utf8, but ’ is not the ASCII apostrophe.)
  • You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
  • The column in the table was declared CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)
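How the garbage arises can be reproduced without a database. The sketch below is only an illustration, not code from the question: the class and method names are made up, and Java's ISO-8859-1 is used as a stand-in for MySQL's latin1. It takes the correctly encoded utf8 bytes of Bureau’s and decodes them as latin1, which is what the mismatched connection and column effectively do.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; nothing here comes from the original code base.
final class MojibakeDemo {

    // Decode utf8 bytes as if they were latin1, one character per byte.
    // ISO-8859-1 is an assumption standing in for MySQL's latin1.
    static String readUtf8BytesAsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "Bureau\u2019s"; // U+2019 is the curly apostrophe
        String garbled = readUtf8BytesAsLatin1(original);

        System.out.println("original length: " + original.length());
        System.out.println("garbled length:  " + garbled.length());

        // Print code points instead of glyphs so the console encoding
        // cannot distort the result a second time.
        for (int i = 0; i < garbled.length(); i++) {
            System.out.print(Integer.toHexString(garbled.charAt(i)) + " ");
        }
        System.out.println();
    }
}
```

The ASCII letters come through unchanged, while the single curly apostrophe, which is three bytes in utf8, turns into three separate latin1 characters. That is why the English sample is mostly readable and the Gujarati and Malayalam samples are not.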

The fix for the data is a "2-step ALTER".

ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;

where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc) was already on the column.

Unfortunately, if you have a lot of columns to fix, it will take a lot of ALTERs. You can (and should) MODIFY all the necessary columns of a single table in one pair of ALTERs: the first changes them all to VARBINARY, the second changes them all back to VARCHAR with CHARACTER SET utf8.
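If there are many columns, generating the pair of statements may be less error-prone than typing them. The following is a minimal sketch under stated assumptions: the class, the Col type, and the table and column names in the usage note are all hypothetical, and the lengths and extra attributes must be copied from the real column definitions. It only builds the SQL text and does not run it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that builds one pair of ALTERs per table.
final class TwoStepAlter {

    // One column to repair: its name, its length, and whatever else
    // (NOT NULL, etc) was already on the column; may be empty.
    static final class Col {
        final String name;
        final int length;
        final String extras;

        Col(String name, int length, String extras) {
            this.name = name;
            this.length = length;
            this.extras = extras;
        }
    }

    // Returns two statements: step 1 (to VARBINARY), step 2 (to utf8 VARCHAR).
    static List<String> build(String table, List<Col> cols) {
        List<String> toBinary = new ArrayList<>();
        List<String> toUtf8 = new ArrayList<>();
        for (Col c : cols) {
            String extras = c.extras.isEmpty() ? "" : " " + c.extras;
            toBinary.add("MODIFY COLUMN " + c.name
                    + " VARBINARY(" + c.length + ")" + extras);
            toUtf8.add("MODIFY COLUMN " + c.name
                    + " VARCHAR(" + c.length + ") CHARACTER SET utf8" + extras);
        }
        List<String> statements = new ArrayList<>();
        statements.add("ALTER TABLE " + table + " " + String.join(", ", toBinary) + ";");
        statements.add("ALTER TABLE " + table + " " + String.join(", ", toUtf8) + ";");
        return statements;
    }
}
```

For example, calling build with a made-up table news and columns title (255, NOT NULL) and summary (2000, no extras) yields two statements, each with two MODIFY COLUMN clauses. Review the generated SQL and try it on a copy of the table before running it on real data.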

The fix for the code is to establish utf8 as the character set of the connection; how to do this depends on the API used in PHP. The ALTERs will change the column definition.
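The same applies to the Java side that inserts the data. Assuming the jar connects through MySQL Connector/J (the question does not say which driver it uses), a minimal sketch could set the driver's characterEncoding connection property in the JDBC URL. The class name, host, and database name below are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper; assumes MySQL Connector/J is on the classpath.
final class Utf8Connection {

    // Build a JDBC URL that asks Connector/J to use UTF-8 on the connection.
    static String jdbcUrl(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database
                + "?characterEncoding=UTF-8";
    }

    // Open the connection; this throws SQLException if the driver is
    // missing or the server is unreachable.
    static Connection open(String host, String database,
                           String user, String password) throws SQLException {
        return DriverManager.getConnection(jdbcUrl(host, database), user, password);
    }
}
```

Setting the encoding in the URL keeps it explicit in the code, so it does not depend on the environment the jar is started from.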

Edit

You have VARCHAR with the wrong CHARACTER SET. Hence, you see Mojibake like àa°à??àa2. Most conversion techniques try to preserve àa°à??àa2, but that is not what you need. Instead, taking a step to VARBINARY preserves the bits while ignoring the old definition of the bits representing latin1-encoded characters. The second step again preserves the bits, but now claiming they represent utf8 characters.
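The difference between the two approaches can be shown on plain byte arrays. This is a sketch, not what MySQL does internally: the class and method names are invented, and ISO-8859-1 again stands in for latin1. One method converts the characters, which keeps the Mojibake; the other leaves the bytes alone and only changes what they are claimed to be.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of "convert the characters" versus "keep the bits".
final class ReinterpretBytes {

    // The bytes sitting in the column: correct utf8, mislabeled as latin1.
    static byte[] storedBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Character-preserving conversion: read the bytes as latin1 characters,
    // then encode those (already wrong) characters as utf8.
    static byte[] convertCharacters(byte[] stored) {
        String asLatin1 = new String(stored, StandardCharsets.ISO_8859_1);
        return asLatin1.getBytes(StandardCharsets.UTF_8);
    }

    // The VARBINARY route: the bytes are untouched, only the label changes.
    static String reinterpretAsUtf8(byte[] stored) {
        return new String(stored, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Bureau\u2019s";
        byte[] stored = storedBytes(original);

        byte[] converted = convertCharacters(stored);
        String repaired = reinterpretAsUtf8(stored);

        System.out.println("stored bytes:    " + stored.length);
        System.out.println("converted bytes: " + converted.length);
        System.out.println("repaired equals original: " + repaired.equals(original));
    }
}
```

The converted array is longer than the stored one because each garbage character is faithfully re-encoded, so the garbage becomes permanent. Reinterpreting the untouched bytes gives back the original text.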

