Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


tcl - Reading double byte files

I was wondering if there is a simple way in Tcl to read a double-byte file (or so I think it is called). My problem is that I get files that look fine when opened in Notepad (I'm on Win7), but when I read them in Tcl there are spaces (or rather, null characters) between each and every character.

My current workaround has been to first run a string map to remove all the null characters:

string map {\0 {}} $file

and then process the information normally, but is there a simpler way to do this, through fconfigure, encoding or another way?

I'm not familiar with encodings so I'm not sure what arguments I should use.

fconfigure $input -encoding double

of course fails because double is not a valid encoding. Same with 'doublebyte'.

I'm actually working on big text files (above 2 GB) and doing my 'workaround' on a line by line basis, so I believe that this slows the process down.


EDIT: As pointed out by @mhawke, the file is UTF-16-LE encoded and this apparently is not a supported encoding. Is there an elegant way to circumvent this shortcoming, maybe through a proc? Or would this make things more complex than using string map?


1 Reply


The input files are probably UTF-16 encoded, as is common on Windows.

Try:

% fconfigure $input -encoding unicode
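As a rough sketch of how that setting might be used when reading line by line, the following assumes a Tcl build whose `encoding names` output includes `unicode` and a little-endian machine such as a typical Windows PC, where that native-order encoding lines up with UTF-16LE. The proc name `readUtf16Lines` and the demo file name are made up for illustration. Files saved by Windows editors often start with a byte order mark, which the sketch strips from the first line.

```tcl
# Sketch: read a UTF-16 text file line by line without stripping NULs by hand.
# readUtf16Lines is a hypothetical helper, not part of Tcl.
proc readUtf16Lines {path} {
    set in [open $path r]
    # "unicode" is Tcl's native-byte-order 16-bit encoding (assumption:
    # the machine is little-endian, so this matches UTF-16LE data).
    fconfigure $in -encoding unicode
    set lines {}
    set first 1
    while {[gets $in line] >= 0} {
        if {$first} {
            # Drop a leading byte order mark (U+FEFF) if one is present.
            set line [string trimleft $line \uFEFF]
            set first 0
        }
        # For very large files, process $line here instead of collecting it.
        lappend lines $line
    }
    close $in
    return $lines
}

# Demo: write a small UTF-16 file with a BOM, then read it back.
set path [file join [pwd] utf16_demo.txt]
set out [open $path w]
fconfigure $out -encoding unicode
puts -nonewline $out "\uFEFFfirst line\nsecond line\n"
close $out

puts [readUtf16Lines $path]
file delete $path
```

With the channel decoding the bytes itself, the per-line `string map` pass should no longer be needed, which may help on multi-gigabyte inputs.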

You can get a list of encodings using:

% encoding names
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine gb2312 jis0201 euc-cn euc-jp iso8859-10 macThai iso2022-jp jis0208 macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania gb1988 iso2022-kr macTurkish macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 koi8-r iso8859-4 macCroatian ebcdic cp1250 iso8859-5 iso8859-6 macCyrillic cp1251 iso8859-7 cp1252 koi8-u macDingbats iso8859-8 cp1253 cp1254 iso8859-9 cp1255 cp850 cp932 cp1256 cp852 cp1257 identity cp1258 macJapan utf-8 shiftjis cp936 cp855 symbol cp775 unicode cp857
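Since the list differs between Tcl versions, a script can check it at run time instead of hard-coding a name. The sketch below is only an illustration: `pickUtf16Encoding` is a hypothetical helper, and the assumption that some newer Tcl builds also list an explicit `utf-16le` name does not come from the output above, which shows only `unicode`.

```tcl
# Sketch: pick a UTF-16 encoding name that this interpreter actually supports.
# pickUtf16Encoding is a hypothetical helper, not part of Tcl.
proc pickUtf16Encoding {} {
    set available [encoding names]
    # Prefer an explicit little-endian name if the build offers one
    # (assumption: newer builds may), then fall back to "unicode",
    # which uses the machine's native byte order.
    foreach candidate {utf-16le unicode} {
        if {[lsearch -exact $available $candidate] >= 0} {
            return $candidate
        }
    }
    error "no UTF-16 encoding found in: [lsort $available]"
}

puts "Using encoding: [pickUtf16Encoding]"
```

The returned name can then be passed to `fconfigure $input -encoding`. Note that the `unicode` fallback is only correct for UTF-16LE files on a little-endian machine.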
