UTF-8 can contain a BOM. However, it
makes no difference as to the
endianness of the byte stream. UTF-8
always has the same byte order.
If UTF-8 stored every code point in a single byte, it would make sense that endianness plays no role and that a BOM isn’t required. But code points 128 and above are stored using 2, 3, or 4 bytes (up to 6 in the original UTF-8 design), so their byte order should differ between big-endian and little-endian machines. How, then, can we claim that UTF-8 always has the same byte order?
Thank you
EDIT:
UTF-8 is byte oriented
I understand that if a two-byte UTF-8 character C consists of bytes B1 and B2 (where B1 is the first byte and B2 is the last), then with UTF-8 those two bytes are always written in the same order. So if C is written to a file on a little-endian machine LEM, B1 will be first and B2 last. Similarly, if C is written to a file on a big-endian machine BEM, B1 will still be first and B2 still last.
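To make the two-byte case concrete, here is a minimal Python sketch. It assumes C is the character é (U+00E9), picked only for illustration; the variable names are hypothetical. It suggests that the order B1, B2 comes from the encoding itself, and that endianness only shows up once the two bytes are combined into a single integer.

```python
import sys

# Assumption for illustration: C is "é" (U+00E9), a two-byte UTF-8 character.
encoded = "é".encode("utf-8")
b1, b2 = encoded[0], encoded[1]   # B1 = 0xC3, B2 = 0xA9

# The byte sequence is fixed by the encoding, so it is the same whether
# sys.byteorder reports "little" or "big".
print(sys.byteorder, [hex(b) for b in encoded])

# Byte order only matters when several bytes are read as ONE integer:
as_little = int.from_bytes(encoded, "little")   # B2 is the high byte
as_big = int.from_bytes(encoded, "big")         # B1 is the high byte
print(hex(as_little), hex(as_big))
```

A UTF-8 reader never takes the second step: it consumes B1 and B2 as two separate bytes, in the order they appear in the file.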
But what happens when C is written to a file F on LEM, and we then copy F to BEM and try to read it there? Since BEM automatically swaps bytes (B1 is now the last byte and B2 the first), how will an app running on BEM and reading F know whether F was created on BEM, so that the order of the two bytes wasn’t swapped, or whether F was transferred from LEM, in which case BEM automatically swapped the bytes?
I hope the question makes some sense.
EDIT 2:
In response to your edit: big-endian
machines do not swap bytes if you ask
them to read a byte at a time.
a) Oh, so even though character C is 2 bytes long, an app (residing on BEM) reading F will read just one byte at a time into memory (thus it will first read B1 into memory and only then B2).
b)
In UTF-8, you decide what to do with a
byte based on its high-order bits
Assume file F has two consecutive characters, C and C1 (where C consists of bytes B1 and B2, while C1 consists of bytes B3, B4 and B5). How will an app reading F know which bytes belong together simply by checking each byte's high-order bits? For example, how will it figure out that B1 and B2 taken together represent a character, and not B1, B2 and B3?
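For what it's worth, the grouping in b) can be sketched in Python as follows. This is only an illustration under assumptions: C is taken to be é (2 bytes) and C1 to be € (3 bytes), and `sequence_length` and `split_characters` are hypothetical helper names, not part of any library. It relies on the UTF-8 bit patterns: a lead byte announces the length of its sequence, and every following byte starts with `10`.

```python
def sequence_length(lead: int) -> int:
    """Return how many bytes the character starting with `lead` occupies."""
    if (lead & 0b1000_0000) == 0b0000_0000:   # 0xxxxxxx: single byte (ASCII)
        return 1
    if (lead & 0b1110_0000) == 0b1100_0000:   # 110xxxxx: lead of a 2-byte sequence
        return 2
    if (lead & 0b1111_0000) == 0b1110_0000:   # 1110xxxx: lead of a 3-byte sequence
        return 3
    if (lead & 0b1111_1000) == 0b1111_0000:   # 11110xxx: lead of a 4-byte sequence
        return 4
    # 10xxxxxx is a continuation byte and cannot start a character.
    raise ValueError(f"0x{lead:02X} is not a valid lead byte")


def split_characters(data: bytes) -> list:
    """Split a UTF-8 byte stream into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the lead byte must look like 10xxxxxx.
        if len(chunk) != n or any((b & 0b1100_0000) != 0b1000_0000 for b in chunk[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        chunks.append(chunk)
        i += n
    return chunks


# Assumed contents of F: C = "é" (B1, B2) followed by C1 = "€" (B3, B4, B5).
data = "é€".encode("utf-8")
print(split_characters(data))
```

Here B1 has the form `110xxxxx`, so exactly one more byte belongs to C, and B3 has the form `1110xxxx`, so it starts a new three-byte character rather than continuing C.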
If you believe that you're seeing
something different, please edit your
question and include
I’m not saying that. I simply didn’t understand what was going on
c) Why aren't UTF-16 and UTF-32 also byte oriented?
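To illustrate what c) is getting at, here is a small Python sketch, again assuming the character é (U+00E9) as the example. As far as I can tell, the difference is that a UTF-16 code unit is a 16-bit integer, so writing it out as bytes requires choosing a byte order, whereas a UTF-8 code unit is already a single byte.

```python
import codecs
import struct

text = "é"  # assumed example character, U+00E9

# UTF-8: the code units are single bytes, so there is one serialization.
utf8 = text.encode("utf-8")

# UTF-16: the code unit is the 16-bit integer 0x00E9, and the two bytes
# that represent it depend on the byte order chosen.
unit = 0x00E9
le = struct.pack("<H", unit)   # little-endian: low byte first
be = struct.pack(">H", unit)   # big-endian: high byte first

print(utf8.hex(), le.hex(), be.hex())

# Python's explicit-endian codecs produce the same two serializations.
print(le == text.encode("utf-16-le"), be == text.encode("utf-16-be"))

# The plain "utf-16" codec prepends a BOM so that a reader can tell which
# of the two byte orders follows.
with_bom = text.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

The same reasoning would apply to UTF-32 with 32-bit code units (`"<I"` and `">I"` in `struct` terms).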