I'm reading some data from a file as a stream of bytes, and I've encountered some Unicode strings that I'm not sure how best to handle.
Each character is using two bytes, with only the first seeming to contain actual data, so for example the string 'trust' is stored in the file as:
0x74 0x00(t) 0x72 0x00(r) ...and so on
Normally I'd just use a regex to replace the zeros with nothing and therefore remove the whitespace. However, the spaces between words within the file are implemented using 0x00 0x00, so a simple String 'replaceAll' messes it up.
I've tried playing around with the String encodings, such as 'ISO-8859-1', 'UTF-8' and 'UTF-16', but every time I end up with whitespace.
I did create a simple regex to remove the double zero hex values, which is:
new String(bytes).replaceAll("[\00]{2,}", "");
But this obviously only works for the double zero, and I'd really like to replace single zeros with nothing, and double zeros with an actual ASCII/Unicode space character.
I could have sworn that one of the Java string format settings dealt with this kind of thing, but I might be wrong. So should I work on creating a regex to strip out the zeros, or does Java actually provide the mechanisms for doing it?
Thanks
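For reference, the byte layout described above (0x74 0x00 for 't') looks like little-endian UTF-16, so here is a minimal sketch of decoding it with a charset rather than a regex. It assumes the data really is UTF-16LE with no byte order mark, and that the 0x00 0x00 pairs are meant to be word separators; the class name Utf16Decode and the method decode are hypothetical, just for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Decode {
    // Assumption: the bytes are UTF-16LE without a byte order mark.
    // Each two-byte pair becomes one char, so 0x74 0x00 decodes to 't'
    // and 0x00 0x00 decodes to the NUL character, which is then mapped to a space.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE).replace('\0', ' ');
    }

    public static void main(String[] args) {
        // 't', 'r', separator (0x00 0x00), 'u'
        byte[] bytes = {0x74, 0x00, 0x72, 0x00, 0x00, 0x00, 0x75, 0x00};
        System.out.println(decode(bytes)); // prints "tr u"
    }
}
```

If the file turns out to use a different byte order or mixes encodings, this would need adjusting, but it avoids having to tell single zeros from double zeros by hand.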