The code looks fine to me, and your output.md file looks OK. So this is most likely just an issue with the console output.
The Unicode characters you are experimenting with are encoded as the same single bytes in both Windows-1252 and ISO-8859-1 (æ = 0xE6, ø = 0xF8, å = 0xE5), but are encoded as multiple bytes in UTF-8 (æ = 0xC3 0xA6, ø = 0xC3 0xB8, å = 0xC3 0xA5).
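
To check those byte values yourself, here is a minimal sketch that encodes the three characters with each charset and prints the bytes as hex. The class name EncodingBytes and the hex helper are hypothetical names used only for illustration, and the characters are written as Unicode escapes so the source file's own encoding cannot affect the result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Formats bytes as space-separated two-digit hex, e.g. "C3 A6".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E6 (ae ligature), U+00F8 (o with stroke), U+00E5 (a with ring)
        String text = "\u00E6\u00F8\u00E5";
        Charset[] charsets = {
            Charset.forName("windows-1252"),
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        for (Charset cs : charsets) {
            // One line per charset: its name, then the encoded bytes.
            System.out.println(cs.name() + ": " + hex(text.getBytes(cs)));
        }
    }
}
```

The two single-byte charsets should print three bytes, while UTF-8 should print six.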
Reading a UTF-8 encoded file as either Windows-1252 or ISO-8859-1 will decode each byte individually, producing a string containing a separate char for each byte, and for these particular bytes the chars will have the same numeric values as the bytes. So, you should be ending up with a string containing chars 0x00C3 0x00A6, 0x00C3 0x00B8, and 0x00C3 0x00A5. Outputting those chars to the console as Windows-1252 should be showing Ã¦ Ã¸ Ã¥, not æ ø å.
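
The mis-decoding described above can be reproduced without any file, as a rough sketch: take the UTF-8 bytes and decode them with the wrong single-byte charset. MisdecodeDemo and charValues are hypothetical names introduced here, not anything from your code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // Lists the numeric value of each char as four hex digits, e.g. "00C3 00A6".
    static String charValues(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 bytes of U+00E6, U+00F8, U+00E5 (six bytes in total).
        byte[] utf8 = "\u00E6\u00F8\u00E5".getBytes(StandardCharsets.UTF_8);

        // Decoding with a single-byte charset yields one char per byte.
        String asCp1252 = new String(utf8, Charset.forName("windows-1252"));
        String asLatin1 = new String(utf8, StandardCharsets.ISO_8859_1);

        System.out.println("windows-1252: " + charValues(asCp1252));
        System.out.println("ISO-8859-1:   " + charValues(asLatin1));
    }
}
```

Both lines should list six chars, matching the 0x00C3 0x00A6 style values given above.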
On the other hand, reading a UTF-8 encoded file as UTF-8 will decode the file properly, producing a string with chars 0x00E6, 0x00F8, and 0x00E5. Writing that string to a UTF-8 encoded file should be producing the correct byte sequences (0xC3 0xA6, 0xC3 0xB8, and 0xC3 0xA5). Outputting that same string as Windows-1252 risks data loss in general, but here you should be seeing æ ø å as expected, since Windows-1252 does support those Unicode characters.
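
As a small sketch of both points, the following decodes UTF-8 bytes correctly, re-encodes them, and asks a Windows-1252 encoder whether it can represent a given string. RoundTripDemo and its two helpers are hypothetical names, and U+4E2D is just an arbitrary example of a character outside Windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // True if decoding as UTF-8 and re-encoding as UTF-8 reproduces the bytes.
    static boolean survivesUtf8RoundTrip(byte[] utf8) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return Arrays.equals(utf8, decoded.getBytes(StandardCharsets.UTF_8));
    }

    // True if every char of s has a Windows-1252 representation.
    static boolean fitsInCp1252(String s) {
        return CP1252.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        byte[] utf8 = {
            (byte) 0xC3, (byte) 0xA6,   // U+00E6
            (byte) 0xC3, (byte) 0xB8,   // U+00F8
            (byte) 0xC3, (byte) 0xA5    // U+00E5
        };
        String content = new String(utf8, StandardCharsets.UTF_8);

        System.out.println("chars decoded: " + content.length());
        System.out.println("UTF-8 round trip intact: " + survivesUtf8RoundTrip(utf8));
        System.out.println("fits in Windows-1252: " + fitsInCp1252(content));
        // A character outside Windows-1252 is where the data loss would occur.
        System.out.println("U+4E2D fits in Windows-1252: " + fitsInCp1252("\u4E2D"));
    }
}
```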
So, your results are actually backwards from what I would expect. Even though Charset.defaultCharset()
is reporting Windows-1252, I suspect your console is actually using a different charset for its output.
I suggest you print out the numeric values of the individual chars of the content string to see exactly how input.md is actually being decoded by each encoding. You should be getting the char values I mentioned above.
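
One possible way to do that is sketched below. It reads the raw bytes of a file once and decodes them with each charset, printing the char values instead of the chars themselves, so the console's charset cannot distort what you see. CharDump, decode, and charValues are hypothetical names. I am assuming the file path is passed as the first argument (for example input.md); with no argument, the sketch writes a temporary stand-in file containing the three characters as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CharDump {
    // Decodes raw bytes with the given charset.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    // Lists the numeric value of each char as four hex digits.
    static String charValues(String content) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < content.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) content.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path path;
        if (args.length > 0) {
            path = Paths.get(args[0]);
        } else {
            // Assumption: no file given, so create a UTF-8 stand-in for the demo.
            path = Files.createTempFile("input", ".md");
            Files.write(path, "\u00E6\u00F8\u00E5".getBytes(StandardCharsets.UTF_8));
        }

        byte[] raw = Files.readAllBytes(path);
        Charset[] charsets = {
            Charset.forName("windows-1252"),
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        for (Charset cs : charsets) {
            String content = decode(raw, cs);
            // Only ASCII reaches the console, whatever charset it uses.
            System.out.println(cs.name() + ": " + charValues(content));
        }
    }
}
```

Because the output is plain ASCII hex, it should look the same no matter which charset the console is really using.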