Thank you Aya for the Ascii85 link; it contains very good ideas.
I have developed them below for our case.
UTF-8 character possibilities:
For 1-byte characters (0xxxxxxx): 96 possibilities per byte
  + UTF-8 ASCII chars (0xxxxxxx) = +2^7
  - UTF-8 control chars (000xxxxx) = -2^5
  + XML-allowed UTF-8 control chars (0x09, 0x0A, 0x0D) = +3
  - XML entity-disallowed chars (<, >, &) = -3
EDIT: This is for the XML 1.0 spec. The XML 1.1 spec allows the use of control chars except 0x00...
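A quick way to double-check the XML 1.0 count (an illustrative Python sketch, nothing normative):

    # Count the 1-byte UTF-8 chars (0xxxxxxx) usable in XML 1.0 text content
    safe = [b for b in range(0x80)                     # all 1-byte UTF-8 chars
            if (b >= 0x20 or b in (0x09, 0x0A, 0x0D))  # drop control chars, keep TAB/LF/CR
            and b not in (0x3C, 0x3E, 0x26)]           # drop <, > and &
    print(len(safe))  # 96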
For 2-byte characters (110xxxxx 10xxxxxx): 1920 possibilities per 2 bytes
  + UTF-8 2-byte chars (110xxxxx 10xxxxxx) = +2^11
  - UTF-8 illegal non-canonical chars (1100000x 10xxxxxx) = -2^7
For 3-byte characters (1110xxxx 10xxxxxx 10xxxxxx): 61440 possibilities per 3 bytes
  + UTF-8 3-byte chars (1110xxxx 10xxxxxx 10xxxxxx) = +2^16
  - UTF-8 illegal non-canonical chars (11100000 100xxxxx 10xxxxxx) = -2^11
  - Unicode reserved UTF-16 codepoints (11101101 101xxxxx 10xxxxxx) = -2^11
And I won't make the calculation for 4-byte characters; that's pointless: the number of possibilities would be negligible and there are too many illegal UTF-8 characters in that range.
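The multi-byte counts can be double-checked the same way (again, just a sketch of the arithmetic above):

    # 2-byte chars: 110xxxxx 10xxxxxx, minus the non-canonical (overlong) 1100000x forms
    two_byte = 2**11 - 2**7
    # 3-byte chars: 1110xxxx 10xxxxxx 10xxxxxx, minus the overlong 11100000 100xxxxx forms
    # and minus the UTF-16 surrogate range 11101101 101xxxxx 10xxxxxx
    three_byte = 2**16 - 2**11 - 2**11
    print(two_byte, three_byte)  # 1920 61440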
The coding possibilities in, let's say, a 3-byte space
So let's see what combinations we can make in a 3-byte (24-bit) space:
- 0xxxxxxx 0xxxxxxx 0xxxxxxx : That's 96*96*96 = 884736 possibilities
- 0xxxxxxx 110xxxxx 10xxxxxx : That's 96*1920 = 184320 possibilities
- 110xxxxx 10xxxxxx 0xxxxxxx : That's 1920*96 = 184320 possibilities
- 1110xxxx 10xxxxxx 10xxxxxx : That's 61440 possibilities
There would be other possibilities (like a 3-byte char ending or starting inside the space), but, as with the 4-byte chars, those would be difficult (for me) to evaluate and are probably negligible.
Total number of possibilities:
- A 3-byte space has 2^24 = 16777216 possibilities.
- The number of UTF-8 COMPATIBLE possibilities in that space is 884736 + 2*184320 + 61440 = 1314816.
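In code form, the same totals (illustrative only):

    one, two, three = 96, 1920, 61440
    total = one**3 + 2 * (one * two) + three
    print(total, 2**24)  # 1314816 16777216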
How much overhead does that mean?
- Usable bits in a 24-bit space: log2(16777216) = 24 (of course! That's just to frame the math.)
- Useful UTF-8 bits in that space: log2(1314816) = 20.32 useful bits.
- That means we need 24 bits of space to encode 20.32 bits of useful information, i.e. the minimum theoretical overhead is 18% overhead. Way better than Base64's 33% overhead and Ascii85's 25% overhead!
EDIT: This is for the XML 1.0 spec. With XML 1.1 (not widely supported...), the theoretical overhead is 12.55%. I managed to make a binary-safe algorithm with 14.7% overhead for XML 1.1.
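The XML 1.0 figure comes straight out of those two logarithms (a quick Python check; the XML 1.1 figure follows the same method with a larger count of allowed characters):

    from math import log2

    useful_bits = log2(1314816)   # ~20.33 useful bits per 24-bit space
    print(24 / useful_bits - 1)   # ~0.1807, i.e. the ~18% minimum overhead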
How to get near this 18% overhead?
The bad news is that we can't easily get that 18% overhead without using a big "dictionary" (i.e. long encoding sets). But it's easy to get 20%, and quite easy, though less practical, to get 19%.
Good coding length candidates:
- 6 bits can encode 5 bits with 20% overhead (2^(6*0.84) > 2^5)
- 12 bits can encode 10 bits with 20% overhead (2^(12*0.84) > 2^10)
- 24 bits can encode 20 bits with 20% overhead (2^(24*0.84) > 2^20)
- 25 bits can encode 21 bits with 19% overhead (2^(25*0.84) > 2^21)
NB: 0.84 is the average "usefulness" of a space bit (20.32/24).
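Those candidates can be checked mechanically (a sketch using the exact usefulness ratio rather than the rounded 0.84):

    from math import log2

    ratio = log2(1314816) / 24    # ~0.8469 useful bits per space bit
    for space, useful in [(6, 5), (12, 10), (24, 20), (25, 21)]:
        assert space * ratio >= useful   # enough coding room in the space
        print(space, useful, f"{space / useful - 1:.0%} overhead")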
How to build our encoding algorithm?
We need to build a "dictionary" that maps the "space possibles" (random sequences of bits whose length is 5, 10, 20 or 21 bits, depending on the coding length chosen for the algorithm - just pick one) to UTF-8-compatible sequences (UTF-8 bit sequences whose length is 6, 12, 24 or 25 bits accordingly).
The easiest starting point would be encoding 20-bit sequences into 24-bit UTF-8-compatible sequences: that's exactly the example taken above to calculate the possibilities, and it is 3 UTF-8 bytes long (so we won't have to worry about unterminated UTF-8 chars).
Note that we MUST USE the 2-byte (or longer) UTF-8 character encoding space to reach the 20% overhead. With only 1-byte-long UTF-8 characters we can only reach 25% overhead with RADIX-24. However, the 3-byte-long UTF-8 characters are not needed to reach the 20% overhead.
That's the next challenge for this question. Who wants to play? :)
A proposed algorithm, which I'll name BaseUTF-8 for XML
20 binary bits to encode: ABCDEFGHIJKLMNOPQRST
Resulting UTF-8 string named "encoded": 24 bits long
Mathematical encoding algorithm (not based on any known programming language):
If GH != 00 && NO != 00:
    encoded = 01ABCDEF 0GHIJKLM 0NOPQRST  # 20 bits to encode, 21 space bits with restrictions (no 1-byte UTF-8 char starting with 000xxxxx, i.e. no ASCII control chars)
If ABCD != 0000:
    If GH == 00 && NO == 00:  # 16 bits to encode
        encoded = 0010ABCD 01EFIJKL 01MPQRST
    Else If GH == 00:  # 18 bits to encode, 18 space bits with restrictions (no 1-byte UTF-8 ASCII control char, no non-canonical 2-byte UTF-8 char)
        encoded = 0NOFIJKL 110ABCDE 10MPQRST
    Else If NO == 00:  # 18 bits to encode
        encoded = 110ABCDE 10MPQRST 0GHFIJKL
If ABCD == 0000:  # 16 bits to encode
    encoded = 0011EFGH 01IJKLMN 01OPQRST
On "encoded" variable apply:
convert < (0x3C) to Line Feed (0x0A)
convert > (0x3E) to Cariage Return (0x0D)
convert & (0x26) to TAB (0x09)
And that's how you get only 20% overhead.
This algorithm doesn't yet provide a way to manage string termination, when the string to encode is not a multiple of 20 bits. The decoding algorithm also has to be provided, but that's quite easy (just don't forget to throw exceptions to enforce uniqueness of the decoding).
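To make the proposal concrete, here is a minimal Python transcription of the rules above (my own illustrative sketch: the helper names are made up, only one full 20-bit block is handled, and termination/decoding are left open as said):

    def _byte(pattern, b):
        """Assemble one byte from literal 0/1 digits and bit labels A..T."""
        out = 0
        for ch in pattern:
            out = (out << 1) | (int(ch) if ch in "01" else b[ord(ch) - ord("A")])
        return out

    # <, > and & would break XML, so they are swapped with control chars
    # that the encoder can never produce by itself (TAB, LF, CR).
    _SWAP = {0x3C: 0x0A, 0x3E: 0x0D, 0x26: 0x09}

    def encode_block(v):
        """Map one 20-bit integer onto 3 XML-1.0-safe UTF-8 bytes."""
        b = [(v >> (19 - i)) & 1 for i in range(20)]    # b[0]=A ... b[19]=T
        abcd, gh, no = tuple(b[0:4]), (b[6], b[7]), (b[13], b[14])
        # ABCD == 0000 is tested first: in the pseudocode above it is the
        # last assignment, so it wins when several conditions match.
        if abcd == (0, 0, 0, 0):                        # 16 bits to encode
            pats = ("0011EFGH", "01IJKLMN", "01OPQRST")
        elif gh != (0, 0) and no != (0, 0):             # 20 bits to encode
            pats = ("01ABCDEF", "0GHIJKLM", "0NOPQRST")
        elif gh == (0, 0) and no == (0, 0):             # 16 bits to encode
            pats = ("0010ABCD", "01EFIJKL", "01MPQRST")
        elif gh == (0, 0):                              # 18 bits to encode
            pats = ("0NOFIJKL", "110ABCDE", "10MPQRST")
        else:                                           # NO == 00, 18 bits
            pats = ("110ABCDE", "10MPQRST", "0GHFIJKL")
        return bytes(_SWAP.get(x, x) for x in (_byte(p, b) for p in pats))

    print(encode_block(0b01010101010101010101))  # b'U*U'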