I'm struggling a bit with Unicode file names in OS X and Python. I am trying to use filenames as input for a regular expression later in the code, but the encoding used in the filenames seems to be different from what sys.getfilesystemencoding() tells me. Take the following code:
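#!/usr/bin/env python
# coding=utf-8
import sys, os
print sys.getfilesystemencoding()
p = u'/temp/s/'
s = u'åäö'
# code points of the original (precomposed) unicode string
print 's', [ord(c) for c in s], s
# byte values after encoding with the reported file system encoding
s2 = s.encode(sys.getfilesystemencoding())
print 's2', [ord(c) for c in s2], s2
os.mkdir(p+s)
# code points of the directory name as read back from the file system
for d in os.listdir(p):
    print 'dir', [ord(c) for c in d], d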
It outputs the following:
utf-8
s [229, 228, 246] åäö
s2 [195, 165, 195, 164, 195, 182] åäö
dir [97, 778, 97, 776, 111, 776] åäö
So, the file system encoding is utf-8, but when I encode my filename åäö using that, the result is not the same as what I get back after creating a dir with the same string as its name. I expected that when I use my string åäö to create a dir and read its name back, it would use the same codes as if I had applied the encoding directly.
If we look at the code points 97, 778, 97, 776, 111, 776, they are basically ASCII characters, each followed by a combining diacritic, e.g. o + ¨ = ö, which makes two code points per letter instead of one. How can I avoid this discrepancy? Is there an encoding scheme in Python that matches this behaviour of OS X, and why is getfilesystemencoding() not giving me the right result?
Or have I messed up?
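To make the discrepancy concrete, here is a minimal sketch that reproduces the two sets of code points without touching the file system. It assumes that what OS X does to the name is Unicode canonical decomposition (NFD, or something very close to it); the standard unicodedata module is used for that, and the variable names are only illustrative. It is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata

# The precomposed string from the question, written with escapes
# so the result does not depend on the source file encoding.
s = u'\xe5\xe4\xf6'  # U+00E5, U+00E4, U+00F6

# Assumption: the file system stores names in a decomposed form like NFD.
decomposed = unicodedata.normalize('NFD', s)

# Composing again (NFC) should give back the original code points.
recomposed = unicodedata.normalize('NFC', decomposed)

print('s         ', [ord(c) for c in s])
print('decomposed', [ord(c) for c in decomposed])
print('recomposed', [ord(c) for c in recomposed])
print('same after NFC:', recomposed == s)
```

If that assumption holds, both forms are valid UTF-8 encodings of canonically equivalent text, and normalizing the names returned by os.listdir() to one form before comparing or matching them would be one way to get consistent results.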