Java 8 UTF-8 encoding issue (java bug?)

Question

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

There is an inconsistency when creating a String with UTF-8 encoding.

Run this code:

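public static void encodingIssue() throws IOException {
    // The bytes 0xED 0xBB 0x9C, written as signed Java byte values
    byte[] array = new byte[3];
    array[0] = (byte) -19;
    array[1] = (byte) -69;
    array[2] = (byte) -100;

    // Decode as UTF-8, then print each resulting char as a decimal number
    String str = new String(array, "UTF-8");
    for (char c : str.toCharArray()) {
        System.out.println((int) c);
    }
}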

On Java 1.8.0_20 (and earlier Java 8 releases) we have the result:

On Java 1.7 and 1.6 we have the correct result:

Have you encountered this error? Is there a workaround for this?

This inconsistency also shows up with Shift_JIS, JIS_X0212-1990, x-IBM300, x-IBM834, x-IBM942, x-IBM942C and x-JIS0208, but UTF-8 is obviously the most urgent.

1 Reply

深蓝 · Answer 1 · 2021-10-16T23:34:30+0000

“Modified UTF-8” stores each char of a surrogate pair (or even an unpaired char in the surrogate range) as an individual three-byte character. A decoder that claims to implement standard UTF-8 but accepts such sequences is actually decoding “Modified UTF-8”, which is an error. This seems to have been fixed in Java 8.
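
To make this concrete, here is a minimal sketch (the class and method names are made up for illustration). It combines the payload bits of the three bytes from the question by hand, then asks a strict UTF-8 decoder whether it accepts them. The bit arithmetic yields the surrogate code unit U+DEDC (57052). On Java 8 and later I would expect the strict decoder to reject the sequence, while older runtimes may accept it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
final class SurrogateBytesDemo {

    /** Combines the payload bits of a 1110xxxx 10xxxxxx 10xxxxxx byte sequence. */
    static int threeByteValue(byte b0, byte b1, byte b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    /** Returns true if a UTF-8 decoder configured to report errors accepts the bytes. */
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] array = { (byte) -19, (byte) -69, (byte) -100 }; // 0xED 0xBB 0x9C
        int value = threeByteValue(array[0], array[1], array[2]);
        System.out.printf("U+%04X, surrogate: %b%n", value, Character.isSurrogate((char) value));
        System.out.println("accepted by strict decoder: " + isStrictUtf8(array));
    }
}
```

Using a decoder with `CodingErrorAction.REPORT` instead of `new String(...)` turns the silent replacement character into an exception, which makes it easier to detect data that was written as Modified UTF-8.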

You can reliably read such data using a method that is specified to use “Modified UTF-8”:

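import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.nio.ByteBuffer;

// readUTF() expects a two-byte length prefix followed by the Modified UTF-8 bytes
ByteBuffer bb = ByteBuffer.allocate(array.length + 2);
bb.putShort((short) array.length).put(array);
ByteArrayInputStream bis = new ByteArrayInputStream(bb.array());
DataInputStream dis = new DataInputStream(bis);
String str = dis.readUTF();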
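
The same idea as a self-contained sketch, in case you want to call it from several places. The class name `ModifiedUtf8Reader` and the method `decode` are my own illustrative names, not JDK API. Wrapping the `IOException` in an unchecked exception and rejecting input longer than 65535 bytes (the most the two-byte length prefix can describe) are design assumptions you may want to change.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical helper class, not part of the JDK.
final class ModifiedUtf8Reader {

    /** Decodes bytes as Modified UTF-8 by prepending the length prefix readUTF() expects. */
    static String decode(byte[] bytes) {
        if (bytes.length > 0xFFFF) {
            throw new IllegalArgumentException("readUTF() handles at most 65535 bytes");
        }
        ByteBuffer bb = ByteBuffer.allocate(bytes.length + 2);
        bb.putShort((short) bytes.length).put(bytes);
        try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(bb.array()))) {
            return dis.readUTF();
        } catch (IOException e) {
            // readUTF() signals malformed input with UTFDataFormatException, an IOException
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] array = { (byte) -19, (byte) -69, (byte) -100 };
        // Print each decoded char as a decimal number, as in the question
        for (char c : decode(array).toCharArray()) {
            System.out.println((int) c);
        }
    }
}
```

For the three bytes from the question this should produce a single char with the value 57052 (U+DEDC), since `readUTF()` does not reject surrogates encoded as three-byte sequences.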
