Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

java - Reading a file using utf-8 that is encoded in utf-8 doesn't work, but reading the same file using "windows-1252" or "iso-8859-1" does

What is happening here? Why does reading the file using utf-8 output question marks in the console?

This is a minimal working example:

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
    
import static org.apache.commons.io.FileUtils.readFileToString;
import static org.apache.commons.io.FileUtils.writeStringToFile;
    
public class Main {
    
    public static void main(String... args) throws IOException {
    
        System.out.println("---------");
        System.out.println(Charset.defaultCharset());
        System.out.println("? ? ?");
        System.out.println("? ? ?");
        System.out.println("? ? ?");
    
        File inputFile  = new File(System.getProperty("user.dir") + "/input.md");
        File outputFile = new File(System.getProperty("user.dir") + "/output.md");
    
        String content, encoding;
    
        System.out.println("--------- windows-1252");
        encoding = "windows-1252";
        content = readFileToString(inputFile, encoding);
        System.out.println(content);
    
    
        System.out.println("--------- iso-8859-1");
        encoding = "iso-8859-1";
        content = readFileToString(inputFile, encoding);
        System.out.println(content);
    
    
        System.out.println("--------- utf-8");
        encoding = "utf-8";
        content = readFileToString(inputFile, encoding);
        System.out.println(content);
    
    
        writeStringToFile(outputFile, content, encoding);
    
    }
    
}

Where input.md contains: (encoded in UTF-8)

This is input.md. 'æ' 'ø' 'å'

Running the above code yields

---------
windows-1252
? ? ?
? ? ?
? ? ?
--------- windows-1252
This is file C. 'æ' 'ø' 'å'.
--------- iso-8859-1
This is file C. 'æ' 'ø' 'å'.
--------- utf-8
This is file C. '?' '?' '?'.

Why do I get ? when I read the file using UTF-8? This is especially weird since the file is encoded in UTF-8.

UPDATE: My console is set to "UTF-8".

[Screenshots omitted: the console encoding setting, and the hex values of each char in the string extracted from the input file.]

question from:https://stackoverflow.com/questions/65886999/reading-a-file-using-utf-8-that-is-encoded-in-utf-8-doesnt-work-but-reading-th

1 Reply

The code looks fine to me, and your output.md file looks OK. So this is most likely just an issue with the console output.

The Unicode characters you are experimenting with are encoded as the same single bytes in both Windows-1252 and ISO-8859-1 (æ = 0xE6, ø = 0xF8, å = 0xE5), but are encoded as multiple bytes in UTF-8 (æ = 0xC3 0xA6, ø = 0xC3 0xB8, å = 0xC3 0xA5).
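Those byte values can be checked directly. The following is a minimal sketch, not part of the original code: the class name EncodeDemo and the helper hexBytes are made up for illustration, and the characters are written as Unicode escapes so the result does not depend on how the source file itself is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {

    // Hypothetical helper: encode `text` with `charset` and return the
    // resulting bytes as space-separated two-digit hex values.
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E6, U+00F8, U+00E5 written as escapes (pure ASCII source).
        String text = "\u00E6\u00F8\u00E5";

        System.out.println(hexBytes(text, Charset.forName("windows-1252"))); // E6 F8 E5
        System.out.println(hexBytes(text, StandardCharsets.ISO_8859_1));     // E6 F8 E5
        System.out.println(hexBytes(text, StandardCharsets.UTF_8));          // C3 A6 C3 B8 C3 A5
    }
}
```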

Reading a UTF-8 encoded file as either Windows-1252 or ISO-8859-1 will decode each byte individually, producing a string containing a separate char for each byte. For the bytes involved here, those chars have the same numeric values as the bytes. So, you should be ending up with a string containing chars 0x00C3 0x00A6, 0x00C3 0x00B8, and 0x00C3 0x00A5. Outputting those chars to the console as Windows-1252 should show Ã¦ Ã¸ Ã¥, not æ ø å.
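The byte-by-byte decoding described here can be reproduced without any file. This is only a sketch under that simplification: MisdecodeDemo and decode are illustrative names, and the byte array stands in for the contents of input.md.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {

    // Hypothetical helper: decode raw bytes with the given charset,
    // which is what a file reader does with the bytes it reads.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // UTF-8 bytes for U+00E6, U+00F8, U+00E5 separated by spaces.
        byte[] utf8Bytes = "\u00E6 \u00F8 \u00E5".getBytes(StandardCharsets.UTF_8);

        String wrong = decode(utf8Bytes, Charset.forName("windows-1252"));
        String right = decode(utf8Bytes, StandardCharsets.UTF_8);

        // One char per byte: 2 + 1 + 2 + 1 + 2 = 8 chars.
        System.out.println(wrong.length()); // 8
        // One char per character: 5 chars.
        System.out.println(right.length()); // 5
        // The mis-decoded string holds 00C3 00A6, 00C3 00B8, 00C3 00A5.
        System.out.println(wrong.equals("\u00C3\u00A6 \u00C3\u00B8 \u00C3\u00A5")); // true
    }
}
```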

On the other hand, reading a UTF-8 encoded file as UTF-8 will decode the file properly, producing a string with chars 0x00E6, 0x00F8, and 0x00E5. Writing that string to a UTF-8 encoded file should produce the correct byte sequences (0xC3 0xA6, 0xC3 0xB8, and 0xC3 0xA5). Outputting that same string as Windows-1252 risks data loss in general, but you should still see æ ø å as expected, since Windows-1252 does support those Unicode characters.

So, your results are actually backwards from what I would expect. Even though Charset.defaultCharset() is reporting Windows-1252, I suspect your console is actually using a different charset for its output.
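One way such a mismatch could produce the reported output is if Java encodes its output as Windows-1252 while the console interprets the bytes as UTF-8. That is an assumption, not something confirmed here. The sketch below simulates it in memory; ConsoleMismatchDemo and shownOnConsole are made-up names, and a real console may render an undecodable byte as ? or as a replacement glyph.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleMismatchDemo {

    // Hypothetical model of a console: Java encodes `text` with `javaOut`,
    // and the console decodes the resulting bytes with `console`.
    static String shownOnConsole(String text, Charset javaOut, Charset console) {
        return new String(text.getBytes(javaOut), console);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        Charset utf8 = StandardCharsets.UTF_8;

        // File bytes C3 A6 mis-decoded as windows-1252: chars 00C3 00A6.
        String readAs1252 = new String("\u00E6".getBytes(utf8), cp1252);
        // File bytes C3 A6 decoded correctly as UTF-8: char 00E6.
        String readAsUtf8 = "\u00E6";

        // The two mistakes cancel out: the console receives C3 A6 again.
        System.out.println(shownOnConsole(readAs1252, cp1252, utf8).equals("\u00E6")); // true
        // The correct string is sent as the single byte E6, which is not
        // valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.println(shownOnConsole(readAsUtf8, cp1252, utf8).equals("\uFFFD")); // true
    }
}
```

Under that assumption, the "wrong" decodings look right on screen and the correct UTF-8 decoding looks broken, which matches the backwards results described above.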

I suggest you print out the numeric values of the individual chars of the content string to see exactly how input.md is actually being decoded by each encoding. You should be getting the char values I mentioned above.
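A small helper along these lines could do that. It is a sketch only: CharDump and charValues are illustrative names, not part of the original code. The output is plain ASCII hex, so it reads the same regardless of which charset the console uses.

```java
public class CharDump {

    // Hypothetical helper: return the UTF-16 code units of `s`
    // as space-separated four-digit hex values.
    static String charValues(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A correctly decoded string: U+00E6, space, U+00F8, space, U+00E5.
        System.out.println(charValues("\u00E6 \u00F8 \u00E5")); // 00E6 0020 00F8 0020 00E5
        // The same text after being mis-decoded one byte at a time.
        System.out.println(charValues("\u00C3\u00A6")); // 00C3 00A6
    }
}
```

In the Main class from the question, calling System.out.println(CharDump.charValues(content)) after each readFileToString call would show which of the two patterns each encoding produces.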

