Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others



encoding - Recover wrongly encoded character (Java)

We ran some Java code from cron on Linux to persist thousands of records to a production database. The locale charmap on that box was "ANSI_X3.4-1968". Before persisting each record, we took the following steps:

  1. Use StringEscapeUtils.unescapeHtml4 on the text
  2. Write the String in UTF-8 format and persist it in the database

The problem is that after these steps, special characters show up as "?". Is it possible to recover the original characters? I have simulated the problem with the following steps:

  1. Change the Eclipse encoding to "ANSI_X3.4-1968"
  2. Write the following lines of code:
 

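// Needs org.apache.commons.lang3.StringEscapeUtils (or the commons-text equivalent)
String insertSpecial = StringEscapeUtils.unescapeHtml4("&times;"); // decodes to "×" (U+00D7)
System.out.println(insertSpecial);
// getBytes() with no argument uses the platform charset, here ANSI_X3.4-1968 (US-ASCII)
String uni = new String(insertSpecial.getBytes(), "UTF-8"); // This value is currently in DB
System.out.println(uni);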

Now I want to get back "×" from the String "uni". Any help will be appreciated.



1 Reply


Basically, no. The big mistake is new String(insertSpecial.getBytes(), "UTF-8"), which shows once again that character encoding is surprisingly difficult to handle.

What that piece of code does, step by step:

  1. Get the bytes of insertSpecial in the platform encoding
  2. Create a new String from those bytes, declaring them to be UTF-8 (even though they were produced in the platform encoding a moment earlier)
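
The two steps can be reproduced without touching any system setting by naming the charsets explicitly. This is only a minimal sketch: it assumes the platform charset was US-ASCII (ANSI_X3.4-1968 is another name for it), and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyRoundTrip {
    // Step 1: encode with the (assumed) platform charset, US-ASCII.
    // Characters that US-ASCII cannot represent are replaced with '?'.
    static byte[] encodeAsPlatform(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Step 2: decode those bytes while claiming they are UTF-8.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00D7"; // the multiplication sign
        byte[] bytes = encodeAsPlatform(original);
        System.out.println(Arrays.toString(bytes)); // [63], the ASCII code of '?'
        System.out.println(decodeAsUtf8(bytes));    // ?
    }
}
```

The damage happens in step 1. By the time step 2 runs, the only byte left is a plain question mark.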

I've seen this code several times, and unfortunately it only breaks things. It's completely unnecessary: a Java String has no byte encoding of its own to "convert", so even with matching charsets the round trip is a no-op at best. If the platform encoding is not UTF-8, it will most likely destroy any special characters (or even the whole String, if the platform encoding and the one given to the String constructor differ enough).
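
For contrast, here is a hedged sketch of what safe handling could look like: encode only where bytes are actually needed, and name the same charset on both sides. The class and method names are hypothetical and not part of the original code.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // Encode only at the byte boundary, and name the charset explicitly
    // instead of relying on the platform default.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same charset that produced the bytes.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00D7"; // the multiplication sign
        byte[] utf8 = toUtf8(original);
        System.out.println(utf8.length);                     // 2 (bytes 0xC3 0x97)
        System.out.println(fromUtf8(utf8).equals(original)); // true
    }
}
```

When the String goes to a database through JDBC, the driver normally does the encoding itself, so the String can usually be passed as is, with no getBytes step at all.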

The question mark is a placeholder for a character that could not be encoded, meaning the original character is gone for good.
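
To see why nothing can be recovered, note that different inputs collapse to the same stored value. The sketch below again assumes a US-ASCII platform charset, and mangle is a hypothetical helper that mirrors the conversion from the question.

```java
import java.nio.charset.StandardCharsets;

class Irreversible {
    // Same lossy conversion as in the question, with the platform charset
    // assumed to be US-ASCII.
    static String mangle(String s) {
        return new String(s.getBytes(StandardCharsets.US_ASCII), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String times = mangle("\u00D7");  // multiplication sign
        String eAcute = mangle("\u00E9"); // e with acute accent
        String literal = mangle("?");     // a genuine question mark
        // All three stored values are identical, so the original cannot be told apart.
        System.out.println(times.equals(eAcute) && eAcute.equals(literal)); // true
    }
}
```

Since a stored "?" could have come from any non-ASCII character, or from a real question mark, the only way back is to reload the records from the original source.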

Here's some reading so you won't make that mistake again: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

