Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


java - ISO-8859-1 encoding and binary data preservation

I read in a comment to an answer by @Esailija to a question of mine that

ISO-8859-1 is the only encoding to fully retain the original binary data, with exact byte<->codepoint matches

I also read in this answer by @AaronDigulla that :

In Java, ISO-8859-1 (a.k.a ISO-Latin1) is a 1:1 mapping

I need some insight on this. This will fail (as illustrated here) :

// \u00F6 is ö
System.out.println(Arrays.toString("\u00F6".getBytes("utf-8")));
// prints [-61, -74]
System.out.println(Arrays.toString("\u00F6".getBytes("ISO-8859-1")));
// prints [-10]

Questions

  1. I admit I do not quite get it: why does it not get the bytes in the code above?
  2. Most importantly, where is this byte-preserving behavior of ISO-8859-1 specified? Links to the source or the JLS would be nice. Is it the only encoding with this property?
  3. Is it related to ISO-8859-1 being the "default default"?

See also this question for nice counter examples from other charsets.


1 Reply


"\u00F6" is not a byte array. It's a string containing a single char. Execute the following test instead:

public static void main(String[] args) throws Exception {
    byte[] b = new byte[] {(byte) 0x00, (byte) 0xf6};
    String s = new String(b, "ISO-8859-1"); // decoding
    byte[] b2 = s.getBytes("ISO-8859-1"); // encoding
    System.out.println("Are the bytes equal : " + Arrays.equals(b, b2)); // true
}
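As a side note on the [-10] output in the question, here is a minimal sketch (the class and method names are hypothetical, chosen only for illustration) suggesting that the encoder did produce the expected byte. Java's byte type is signed, so 0xF6 prints as -10; masking with 0xFF shows the same byte as 246, which is the code point of the char.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the original answer.
class SignedByteDemo {
    // Encodes s as ISO-8859-1 and returns each byte as an unsigned value (0..255).
    static int[] unsignedBytes(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.ISO_8859_1);
        int[] out = new int[encoded.length];
        for (int i = 0; i < encoded.length; i++) {
            out[i] = encoded[i] & 0xFF; // mask to view the signed byte as 0..255
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "\u00F6";
        System.out.println((int) s.charAt(0));                 // 246, i.e. 0xF6
        System.out.println(Arrays.toString(unsignedBytes(s))); // [246]
        System.out.println((byte) 0xF6);                       // -10, same bits shown as signed
    }
}
```

So the single byte matches the single code point; only the signed display makes it look different.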

To check that this is true for any byte, just improve the code and loop through all the bytes:

public static void main(String[] args) throws Exception {
    byte[] b = new byte[256];
    for (int i = 0; i < b.length; i++) {
        b[i] = (byte) i;
    }
    String s = new String(b, "ISO-8859-1");
    byte[] b2 = s.getBytes("ISO-8859-1");
    System.out.println("Are the bytes equal : " + Arrays.equals(b, b2));
}
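For contrast, the same round trip can be tried with other charsets. The sketch below (the class and method names are hypothetical) only compares ISO-8859-1 against UTF-8 and US-ASCII, so it does not show that ISO-8859-1 is the only charset with this property. It relies on the documented behavior that new String(byte[], Charset) replaces malformed or unmappable input with the charset's default replacement string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the original answer.
class RoundTripCheck {
    // Returns true if decoding all 256 byte values and re-encoding them
    // with the same charset gives back exactly the same bytes.
    static boolean roundTrips(Charset cs) {
        byte[] original = new byte[256];
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) i;
        }
        byte[] back = new String(original, cs).getBytes(cs);
        return Arrays.equals(original, back);
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(StandardCharsets.ISO_8859_1)); // true
        System.out.println(roundTrips(StandardCharsets.UTF_8));      // false
        System.out.println(roundTrips(StandardCharsets.US_ASCII));   // false
    }
}
```

With UTF-8 and US-ASCII, bytes from 0x80 upward do not form valid input on their own, so they are decoded to a replacement character and the original values are lost.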

ISO-8859-1 is a standard encoding. So the language used (Java, C# or whatever) doesn't matter.

Here's a Wikipedia reference that claims that every byte is covered:

In 1992, the IANA registered the character map ISO_8859-1:1987, more commonly known by its preferred MIME name of ISO-8859-1 (note the extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet. This map assigns the C0 and C1 control characters to the unassigned code values thus provides for 256 characters via every possible 8-bit value.

(emphasis mine)
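The "256 characters via every possible 8-bit value" claim can also be checked from the decoding side. This is a small sketch (hypothetical class name) that tests whether each byte value decodes to exactly one char whose code point equals the unsigned byte value, which is what the byte<->codepoint match in the question refers to.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class Latin1IdentityCheck {
    // Returns true if every byte value 0..255 decodes to a single char
    // whose code point is the same number.
    static boolean isIdentity() {
        for (int i = 0; i < 256; i++) {
            String s = new String(new byte[] {(byte) i}, StandardCharsets.ISO_8859_1);
            if (s.length() != 1 || s.charAt(0) != i) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("Byte value equals code point for all bytes: " + isIdentity());
    }
}
```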

