Use an existing utility such as iconv, or whatever libraries come with the language you're using.
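As a rough illustration of the library route, here is a minimal sketch using the POSIX iconv API. The function name utf8_to_ucs2le is made up for this example, and the encoding name "UCS-2LE" is an assumption: glibc and GNU libiconv accept it, but the set of supported names varies by platform.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: converts a NUL-terminated UTF-8 string to
   little-endian UCS-2 bytes in `out`.
   Returns the number of bytes written, or (size_t)-1 on failure
   (unsupported encoding name, malformed input, a code point that
   UCS-2 cannot hold, or an output buffer that is too small). */
size_t utf8_to_ucs2le(const char *utf8, char *out, size_t out_size)
{
    /* iconv_open takes (tocode, fromcode), in that order. */
    iconv_t cd = iconv_open("UCS-2LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in = (char *)utf8;   /* iconv wants char **, it does not write to the input */
    size_t in_left = strlen(utf8);
    char *out_pos = out;
    size_t out_left = out_size;

    size_t rc = iconv(cd, &in, &in_left, &out_pos, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_size - out_left;
}
```

On some systems (macOS, for instance) you may need to link with -liconv.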
If you insist on rolling your own solution, read up on the UTF-8 format. Basically, each code point is stored as 1-4 bytes, depending on the value of the code point. The ranges are as follows:
- U+0000 — U+007F: 1 byte: 0xxxxxxx
- U+0080 — U+07FF: 2 bytes: 110xxxxx 10xxxxxx
- U+0800 — U+FFFF: 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
- U+10000 — U+10FFFF: 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Where each x is a data bit. Thus, you can tell how many bytes compose each code point by looking at the first byte: if it begins with a 0, it's a 1-byte character. If it begins with 110, it's a 2-byte character. If it begins with 1110, it's a 3-byte character. If it begins with 11110, it's a 4-byte character. If it begins with 10, it's a non-initial byte of a multibyte character. If it begins with 11111, it's an invalid character.
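The first-byte rules above can be sketched as a small classifier. The function name utf8_sequence_length and the convention of returning 0 for bytes that cannot start a sequence are choices made for this example, not part of any standard API.

```c
#include <stddef.h>

/* Returns the total length in bytes of the UTF-8 sequence introduced by
   `first`, or 0 if `first` cannot start a sequence. */
size_t utf8_sequence_length(unsigned char first)
{
    if ((first & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((first & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((first & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((first & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;  /* 10xxxxxx (non-initial byte) or 11111xxx (invalid) */
}
```

Each test masks off the data bits and compares what is left against the expected prefix, so the order of the checks does not matter.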
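Once you figure out how many bytes are in the character, it's just a matter of bit twiddling: mask off the prefix bits of each byte and shift the remaining data bits into place. Also note that UCS-2 cannot represent characters above U+FFFF, so 4-byte sequences have no UCS-2 equivalent.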
Since you didn't specify a language, here's some sample C code (error checking omitted):
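#include <stddef.h> // wchar_t

// Placeholder error value. This is an assumption: the original leaves ERROR
// undefined. U+FFFD is the Unicode replacement character.
#define ERROR ((wchar_t)0xFFFD)

wchar_t utf8_char_to_ucs2(const unsigned char *utf8)
{
    if(!(utf8[0] & 0x80)) // 0xxxxxxx: 7 data bits
        return (wchar_t)utf8[0];
    else if((utf8[0] & 0xE0) == 0xC0) // 110xxxxx: 5 + 6 data bits
        return (wchar_t)(((utf8[0] & 0x1F) << 6) | (utf8[1] & 0x3F));
    else if((utf8[0] & 0xF0) == 0xE0) // 1110xxxx: 4 + 6 + 6 data bits
        return (wchar_t)(((utf8[0] & 0x0F) << 12) | ((utf8[1] & 0x3F) << 6) | (utf8[2] & 0x3F));
    else
        return ERROR; // uh-oh: a 4-byte lead (UCS-2 can't handle code points this high), a non-initial byte, or an invalid byte
}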
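If you do want the error checking, one possible shape is sketched below. The name utf8_decode_ucs2_checked, the uint16_t output parameter, and the "return 0 on failure" convention are all assumptions made for this example. It rejects truncated input, bad continuation bytes, overlong encodings, surrogates, and anything that needs more than 16 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from at most `len` bytes at `s` into *out.
   Returns the number of bytes consumed, or 0 if the input is malformed
   or the code point does not fit in UCS-2. */
size_t utf8_decode_ucs2_checked(const unsigned char *s, size_t len, uint16_t *out)
{
    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                       /* 0xxxxxxx */
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {             /* 110xxxxx 10xxxxxx */
        if (len < 2 || (s[1] & 0xC0) != 0x80)
            return 0;                        /* truncated or bad continuation byte */
        uint16_t cp = (uint16_t)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
        if (cp < 0x80)
            return 0;                        /* overlong: fits in 1 byte */
        *out = cp;
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {             /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (len < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
            return 0;                        /* truncated or bad continuation byte */
        uint16_t cp = (uint16_t)(((s[0] & 0x0F) << 12) |
                                 ((s[1] & 0x3F) << 6) |
                                  (s[2] & 0x3F));
        if (cp < 0x800)
            return 0;                        /* overlong: fits in 2 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                        /* surrogates are not valid code points */
        *out = cp;
        return 3;
    }
    /* 4-byte lead (above U+FFFF), non-initial byte, or invalid byte */
    return 0;
}
```

Returning the byte count separately from the decoded value keeps every 16-bit result usable, which a single in-band error value cannot do, and it lets the caller advance through a buffer one code point at a time.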
Fight the dragon too long and you become a dragon yourself; gaze too long into the abyss and the abyss will gaze back…