Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others



java - String encoding conversion UTF-8 to SHIFT-JIS

Variables used:

  • JavaSE-6
  • No frameworks

Given the string input ピーター・ジョーンズ, which is encoded in UTF-8, I am having problems converting it to Shift-JIS without writing the data to a file.

  • Input (UTF-8 encoding): ピーター・ジョーンズ
  • Output (SHIFT-JIS encoding): ピーター・ジョーンズ (to be encoded in SHIFT-JIS)

I've tried these code snippets to convert UTF-8 strings to SHIFT-JIS:

  • stringToEncode.getBytes(Charset.forName("SHIFT-JIS"))
  • new String(unecodedString.getBytes("SHIFT-JIS"), "UTF-8")

Both code snippets return this string output: ?s?[?^?[?E?W???[???Y (SHIFT-JIS encoded)

Any ideas on how this can be resolved?



1 Reply


Internally in Java, Strings are implemented as an array of UTF-16 code units. But this is an implementation detail: it would be possible to implement a JVM that uses a different encoding internally.

(Note "encoding", "charset" and Charset are more or less synonyms.)

A String should be treated as a sequence of Unicode code points (even though in Java it is exposed as a sequence of UTF-16 code units).
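To make the difference between code points and code units concrete, here is a minimal sketch. The class and method names (CodePointDemo, codeUnits, codePoints) are made up for illustration, and the strings are written with Unicode escapes so the result does not depend on the encoding of the source file.

```java
final class CodePointDemo {
    // What String.length() reports: the number of UTF-16 code units.
    public static int codeUnits(String s) {
        return s.length();
    }

    // The number of Unicode code points in the same String.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // ピーター: four characters, all in the Basic Multilingual Plane.
        String katakana = "\u30D4\u30FC\u30BF\u30FC";
        // U+20BB7: one code point stored as a surrogate pair.
        String supplementary = "\uD842\uDFB7";

        System.out.println(codeUnits(katakana) + " " + codePoints(katakana));           // 4 4
        System.out.println(codeUnits(supplementary) + " " + codePoints(supplementary)); // 2 1
    }
}
```

For the katakana in the question the two counts agree; they only diverge for characters outside the Basic Multilingual Plane.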

If you have a String in your Java program, it is incorrect to say that it is a "UTF-8 String" or "String which is encoded in UTF-8". That does not make any sense, unless you're talking about the internal representation, which is irrelevant.

What you can have is a sequence of bytes that decode to a String if you decode it using an encoding, such as UTF-8 or Shift-JIS.

Or you can have a String that encodes to a sequence of bytes if you encode it using an encoding, such as UTF-8 or Shift-JIS.

In short, an encoding or Charset is a pair of functions, "encode" and "decode", such that:

// String -> encode -> bytes
byte[] bytes = string.getBytes(encoding);
// or using Charset
ByteBuffer byteBuffer = charset.encode(string);

// bytes -> decode -> String
String decoded = new String(bytes, encoding);
// or using Charset
String decodedFromBuffer = charset.decode(byteBuffer).toString();

So if you have a byte[] that's encoded using UTF-8:

byte[] utf8Bytes = "ピーター・ジョーズ".getBytes("UTF-8");
// utf8Bytes now contains, in hexadecimal
// e3 83 94  e3 83 bc  e3 82 bf   (ピ ー タ)
// e3 83 bc  e3 83 bb  e3 82 b8   (ー ・ ジ)
// e3 83 a7  e3 83 bc  e3 82 ba   (ョ ー ズ)

You can create a String from those bytes using:

String string = new String(utf8Bytes, "UTF-8");
// string now contains "ピーター・ジョーズ"

Then you can encode that String as Shift-JIS using:

byte[] shiftJisBytes = string.getBytes("Shift-JIS");
// shiftJisBytes now contains, in hexadecimal
// 83 73  81 5b  83 5e   (ピ ー タ)
// 81 5b  81 45  83 57   (ー ・ ジ)
// 83 87  81 5b  83 59   (ョ ー ズ)
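Putting the two steps together, the whole conversion can be done in memory with no file involved. The following is only a sketch: the class name Utf8ToShiftJis and the helpers transcode and toHex are hypothetical names, and it assumes the JVM ships the Shift_JIS charset (standard JDKs do). It uses Charset.forName rather than newer constants so it should still compile on Java SE 6.

```java
import java.nio.charset.Charset;

final class Utf8ToShiftJis {
    private static final Charset UTF_8 = Charset.forName("UTF-8");
    private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    // Decode the UTF-8 bytes to a String, then encode that String as Shift-JIS.
    public static byte[] transcode(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, UTF_8);
        return text.getBytes(SHIFT_JIS);
    }

    // Render bytes as space-separated lowercase hex so they can be inspected
    // without depending on how the console interprets them.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ピーター・ジョーズ, written with escapes to avoid source-encoding issues.
        String name = "\u30D4\u30FC\u30BF\u30FC\u30FB\u30B8\u30E7\u30FC\u30BA";
        byte[] utf8 = name.getBytes(UTF_8);
        System.out.println(toHex(transcode(utf8)));
        // prints: 83 73 81 5b 83 5e 81 5b 81 45 83 57 83 87 81 5b 83 59
    }
}
```

Checking the result as hex, rather than printing the bytes as text, sidesteps any re-encoding done by the console.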

Since those bytes represent a string encoded using Shift-JIS, trying to decode using UTF-8 will produce garbage:

String garbage = new String(shiftJisBytes, "UTF-8");
// garbage now contains "?s?[?^?[?E?W???[?Y"
// each ? shown here stands for U+FFFD, the replacement character the
// decoder emits when it is given an invalid UTF-8 sequence
// 83 73 81 5b 83 5e   (? s ? [ ? ^)
// 81 5b 81 45 83 57   (? [ ? E ? W)
// 83 87 81 5b 83 59   (? ? ? [ ? Y)
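The String constructor substitutes replacement characters silently, so a wrong charset can go unnoticed. If you would rather get an error, one option is a CharsetDecoder configured to report malformed input. This is a sketch under that assumption; StrictDecode, decodeStrict and isValid are illustrative names, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictDecode {
    // Decode bytes with the named charset, throwing instead of substituting
    // U+FFFD when the bytes are not valid in that charset.
    public static String decodeStrict(byte[] bytes, String charsetName)
            throws CharacterCodingException {
        return Charset.forName(charsetName)
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience wrapper: true if the bytes decode cleanly in that charset.
    public static boolean isValid(byte[] bytes, String charsetName) {
        try {
            decodeStrict(bytes, charsetName);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Shift-JIS bytes for ピー (83 73 81 5b).
        byte[] shiftJisBytes = { (byte) 0x83, 0x73, (byte) 0x81, 0x5b };
        System.out.println(isValid(shiftJisBytes, "UTF-8"));     // false
        System.out.println(isValid(shiftJisBytes, "Shift_JIS")); // true
    }
}
```

Note that a clean decode only shows the bytes are well-formed in that charset, not that the charset is the one the data was actually written in.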

Also remember that when you print a String to an output such as System.out, the String is converted to bytes using the platform default encoding, which varies from system to system. It looks like your system default is UTF-8.

System.out.print(string);
// equivalent to:
System.out.write(string.getBytes(Charset.defaultCharset()));

If the output is, for example, the Windows console, the console then decodes those bytes back into characters using its own code page, which is very probably a different encoding (likely CP437 or CP850), before displaying them to you.

This last part might be tripping you up.
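One way to take the default encoding out of the picture is to wrap the output stream in a PrintStream with an explicit charset name. The sketch below writes to an in-memory buffer so the bytes can be inspected; ExplicitOutputEncoding and printAs are hypothetical names. Whether the console then displays the text correctly still depends on the console's own code page, which this code does not control.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class ExplicitOutputEncoding {
    // Print text through a PrintStream that encodes with the named charset
    // and return the bytes that were actually written.
    public static byte[] printAs(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u30D4\u30FC"; // ピー
        System.out.println(printAs(text, "UTF-8").length);     // 6 (3 bytes per character)
        System.out.println(printAs(text, "Shift_JIS").length); // 4 (2 bytes per character)
        // For real console output the same constructor can wrap System.out:
        // PrintStream console = new PrintStream(System.out, true, "Shift_JIS");
    }
}
```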

