java - Reading a file using utf-8 that is encoded in utf-8 doesn't work, but reading the same file using "windows-1252" or "iso-8859-1" does

Question

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

What is happening here? Why does reading the file as UTF-8 print question marks in the console?

This is a minimal working example:

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
    
import static org.apache.commons.io.FileUtils.readFileToString;
import static org.apache.commons.io.FileUtils.writeStringToFile;
    
public class Main {
    
    public static void main(String... args) throws IOException {
    
        System.out.println("---------");
        System.out.println(Charset.defaultCharset());
        System.out.println("? ? ?");
        System.out.println("? ? ?");
        System.out.println("? ? ?");
    
        File inputFile  = new File(System.getProperty("user.dir") + "/input.md");
        File outputFile = new File(System.getProperty("user.dir") + "/output.md");
    
        String content, encoding;
    
        System.out.println("--------- windows-1252");
        encoding = "windows-1252";
        content = readFileToString(inputFile, encoding);
        System.out.println(content);
    
    
        System.out.println("--------- iso-8859-1");
        encoding = "iso-8859-1";
        content = readFileToString(inputFile, encoding);
        System.out.println(content);
    
    
        System.out.println("--------- utf-8");
        encoding = "utf-8";
        content = readFileToString(inputFile, encoding);
        System.out.println(content);
    
    
        writeStringToFile(outputFile, content, encoding);
    
    }
    
}

Where input.md contains: (encoded in UTF-8)

This is input.md. 'æ' 'ø' 'å'

Running the above code yields

---------
windows-1252
? ? ?
? ? ?
? ? ?
--------- windows-1252
This is file C. 'æ' 'ø' 'å'.
--------- iso-8859-1
This is file C. 'æ' 'ø' 'å'.
--------- utf-8
This is file C. '?' '?' '?'.

Why do I get ? when I read the file using UTF-8? This is especially weird since the file is encoded in UTF-8.

UPDATE: My console is set to "UTF-8":

Here is a screenshot of the hex values of each char in string extracted from the input file:

Here is a better screenshot of the hex isolated:

question from:https://stackoverflow.com/questions/65886999/reading-a-file-using-utf-8-that-is-encoded-in-utf-8-doesnt-work-but-reading-th

1 Reply

深蓝 · Answer 1 · 2021-10-06T19:19:35+0000

The code looks fine to me, and your output.md file looks OK. So this is most likely just an issue with the console output.

The Unicode characters you are experimenting with are encoded as the same single bytes in both Windows-1252 and ISO-8859-1 (æ = 0xE6, ø = 0xF8, å = 0xE5), but are encoded as multiple bytes in UTF-8 (æ = 0xC3 0xA6, ø = 0xC3 0xB8, å = 0xC3 0xA5).
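These byte values can be checked directly. A minimal sketch, assuming the JVM provides the windows-1252 charset; the class name `EncodedBytes` and the helper `hex` are made up for illustration, and the characters are written as `\u` escapes so the result does not depend on the encoding of the source file itself:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedBytes {

    // Encodes the text with the given charset and renders each byte as two hex digits.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00E6\u00F8\u00E5"; // U+00E6, U+00F8, U+00E5
        System.out.println(hex(text, Charset.forName("windows-1252"))); // E6 F8 E5
        System.out.println(hex(text, StandardCharsets.ISO_8859_1));     // E6 F8 E5
        System.out.println(hex(text, StandardCharsets.UTF_8));          // C3 A6 C3 B8 C3 A5
    }
}
```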

Reading a UTF-8 encoded file as either Windows-1252 or ISO-8859-1 will decode each byte individually, producing a string containing a separate char for each byte, and those chars will have the same numeric values as the bytes. So, you should be ending up with a string containing chars 0x00C3 0x00A6, 0x00C3 0x00B8, and 0x00C3 0x00A5. Outputting those chars to the console as Windows-1252 should be showing as Ã¦ Ã¸ Ã¥, not as æ ø å.
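The byte-by-byte decoding can be reproduced in memory, without a file. A small sketch, where `Mojibake` and `misdecode` are hypothetical names and the UTF-8 bytes are produced from a string literal rather than read from input.md:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Mojibake {

    // Encodes the text as UTF-8, then decodes those bytes with a different
    // charset, the same mismatch as reading a UTF-8 file with the wrong encoding.
    static String misdecode(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        String original = "\u00E6\u00F8\u00E5";
        String wrong = misdecode(original, Charset.forName("windows-1252"));

        System.out.println(original.length()); // 3
        System.out.println(wrong.length());    // 6, one char per UTF-8 byte
        // Each char carries the numeric value of the byte it came from.
        System.out.println(wrong.equals("\u00C3\u00A6\u00C3\u00B8\u00C3\u00A5")); // true
    }
}
```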

On the other hand, reading a UTF-8 encoded file as UTF-8 will decode the file properly, producing a string with chars 0x00E6, 0x00F8, and 0x00E5. Writing that string to a UTF-8 encoded file should produce the correct byte sequences (0xC3 0xA6, 0xC3 0xB8, and 0xC3 0xA5). Outputting that same string as Windows-1252 risks data loss in general, but here you should be seeing æ ø å as expected, since Windows-1252 does support those Unicode characters.
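To illustrate what data loss means here: `String.getBytes` substitutes `?` for any character the target charset cannot represent. A rough sketch, where `LossyEncoding` and `survives` are illustrative names and U+0142 is simply one example of a character that Windows-1252 lacks:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyEncoding {

    // True when the text comes back unchanged after encoding and decoding
    // with the same charset, i.e. the charset can represent every character.
    static boolean survives(String text, Charset charset) {
        return new String(text.getBytes(charset), charset).equals(text);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println(survives("\u00E6\u00F8\u00E5", cp1252));       // true
        System.out.println(survives("\u0142", cp1252));                   // false
        System.out.println(survives("\u0142", StandardCharsets.UTF_8));   // true

        // The unmappable character is replaced by a literal question mark.
        System.out.println(new String("\u0142".getBytes(cp1252), cp1252)); // ?
    }
}
```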

So, your results are actually backwards from what I would expect. Even though Charset.defaultCharset() is reporting Windows-1252, I suspect your console is actually using a different charset for its output.
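One way to take the JVM's own choice of output charset out of the picture is to wrap `System.out` in a `PrintStream` with an explicit encoding. This is only a sketch: `ConsoleEncoding` and `withEncoding` are hypothetical names, and whether the characters then display correctly still depends on the console decoding the bytes as UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ConsoleEncoding {

    // Wraps a stream so that printed text is encoded with the named charset
    // instead of the charset System.out was set up with at JVM startup.
    static PrintStream withEncoding(OutputStream target, String charsetName) {
        try {
            return new PrintStream(target, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        PrintStream utf8Out = withEncoding(System.out, "UTF-8");
        utf8Out.println("\u00E6 \u00F8 \u00E5"); // written as C3 A6, C3 B8, C3 A5
    }
}
```

If the text looks right through `utf8Out` but wrong through plain `System.out`, the mismatch is between the JVM's output charset and the console, not in how the file was read.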

I suggest you print out the numeric values of the individual chars of the content string to see exactly how input.md is actually being decoded by each encoding. You should be getting the char values I mentioned above.
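A possible way to do that, sketched with a hypothetical `CharDump` class. It reads input.md from the working directory as in the question, and falls back to the three sample characters as UTF-8 bytes when the file is missing (the fallback is an assumption added so the example runs on its own):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class CharDump {

    // Formats every char of the string as a four-digit hex value.
    static String charCodes(String content) {
        StringBuilder sb = new StringBuilder();
        for (char c : content.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path input = Paths.get(System.getProperty("user.dir"), "input.md");
        // Use the real file when present, otherwise sample UTF-8 bytes.
        byte[] bytes = Files.exists(input)
                ? Files.readAllBytes(input)
                : "\u00E6 \u00F8 \u00E5".getBytes(StandardCharsets.UTF_8);

        // Decode the same bytes with each charset and show the resulting chars.
        for (String name : new String[] {"windows-1252", "iso-8859-1", "utf-8"}) {
            String content = new String(bytes, Charset.forName(name));
            System.out.println(name + ": " + charCodes(content));
        }
    }
}
```

Because the output consists only of ASCII hex digits, it reads the same no matter which charset the console uses, which makes it a reliable check on what the decoding step actually produced.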
