I cannot get this to work! I have a text file from a save-game file parser that contains a bunch of UTF-8 Chinese names in byte-escape form; they look like this in source.txt:
\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89
But, no matter how I import it into Python (3 or 2), I get this string, at best:
\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89
I have tried, as other threads have suggested, to re-encode the string as UTF-8 and then decode it with unicode_escape, like so:
stringName.encode("utf-8").decode("unicode_escape")
But then it messes up the original encoding, and gives this as the string:
'æ\x89\x8eå\x8a\xa0æ\x8b\x89' (printing this string results in: æå æ)
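Here is a minimal sketch of what I think is happening, with stringName set to the literal escape text as read from the file (this is just my reproduction, not anything from the parser):

stringName = r'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'  # the literal text from source.txt

# unicode_escape turns each \xNN into the Latin-1 character with that code point,
# so the three bytes of each Chinese character are never reassembled as UTF-8
mangled = stringName.encode("utf-8").decode("unicode_escape")
print(repr(mangled))  # 'æ\x89\x8eå\x8a\xa0æ\x8b\x89'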
Now, if I manually copy and paste b plus the original string from the file and decode it, I get the correct characters. For example:
b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'.decode("utf-8")
Results in: '扎加拉'
But, I can't do this programmatically. I can't even get rid of the double slashes.
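To show the difference, here is a small sketch: when I type the b'...' literal, the Python parser itself translates the \xNN escapes into bytes, whereas the string read from the file keeps each escape as four ordinary characters:

typed = b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'    # 9 bytes; the parser resolves the escapes
read_in = r'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'  # 36 characters; the escapes are literal text

print(typed.decode('utf-8'))  # 扎加拉
print(len(read_in))           # 36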
To be clear, source.txt contains single backslashes. I have tried importing it in many ways, but this is the most common:
with open('source.txt', 'r', encoding='utf-8') as f_open:
    source = f_open.read()
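Just to illustrate the single versus double backslash point, the doubled ones only show up in the string's repr, not in the data itself:

# hypothetical check on one of the escape sequences somewhere in the file
i = source.find(r'\xe6')
print(source[i:i + 4])        # \xe6    -- printed, the backslash is single
print(repr(source[i:i + 4]))  # '\\xe6' -- repr doubles it for display only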
Okay, so I accepted the answer below (I think), but here is what works:
from ast import literal_eval
decodedString = literal_eval("b'{}'".format(stringVariable)).decode('utf-8')
I can't use it on the whole file because of other encoding issues, but extracting each name as a string (stringVariable) and then doing that works! Thank you!
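Here is the fuller sketch of how I use it on one extracted name (stringVariable is just my placeholder for whatever the parser hands back):

from ast import literal_eval

stringVariable = r'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'  # one extracted name, as read from the file

# wrap the text in b'...' so literal_eval parses it as a bytes literal,
# then decode those bytes as UTF-8
decodedString = literal_eval("b'{}'".format(stringVariable)).decode('utf-8')
print(decodedString)  # 扎加拉

(One caveat I noticed: this would break if the extracted text itself contained a single quote or a stray trailing backslash.)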
To be clearer, the original file is not made up only of these mangled UTF-8 escapes; it only uses them for certain fields. For example, here is the beginning of the file:
{'m_cacheHandles': ['s2ma\x00\x00CN\x1f\x1b"\x8d\xdb\x1fr \xbf\xd4D\x05R\x87\x10\x0b\x0f9\x95\x9b\xe8\x16T\x81b\xe4\x08\x1e\xa8U\x11',
's2ma\x00\x00CN\x1a\xd9L\x12n\xb9\x8aL\x1d\xe7\xb8\xe6\xf8\xaa\xa1S\xdb\xa5+\xd3\x82^\x0c\x89\xdb\xc5\x82\x8d\xb7\x0fv',
's2ma\x00\x00CN\x92\xd8\x17D\xc1D\x1b\xf6(\xedj\xb7\xe9\xd1\x94\x85\xc8`\x91M\x8btZ\x91\xf65\x1f\xf9\xdc\xd4\xe6\xbb',
's2ma\x00\x00CN\xa1\xe9\xab\xcd?\xd2PS\xc9\x03\xab\x13R\xa6\x85u7(K2\x9d\x08\xb8k+\xe2\xdeI\xc3\xab\x7fC',
's2ma\x00\x00CNN\xa5\xe7\xaf\xa0\x84\xe5\xbc\xe9HX\xb93S*sj\xe3\xf8\xe7\x84`\xf1Ye\x15~\xb93\x1f\xc90',
's2ma\x00\x00CN8\xc6\x13F\x19\x1f\x97AH\xfa\x81m\xac\xc9\xa6\xa8\x90s\xfdd\x06
L]z\xbb\x15\xdcI\x93\xd3V'],
'm_campaignIndex': 0,
'm_defaultDifficulty': 7,
'm_description': '',
'm_difficulty': '',
'm_gameSpeed': 4,
'm_imageFilePath': '',
'm_isBlizzardMap': True,
'm_mapFileName': '',
'm_miniSave': False,
'm_modPaths': None,
'm_playerList': [{'m_color': {'m_a': 255, 'm_b': 255, 'm_g': 92, 'm_r': 36},
'm_control': 2,
'm_handicap': 0,
'm_hero': '\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89',
All of the information before the 'm_hero' field is not UTF-8. So ShadowRanger's solution works if the file is made up only of these fake UTF-8 escapes, but it doesn't work once I have already parsed m_hero out as a string and try to convert just that value. Karin's solution does work for that, as in the sketch below.
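For anyone with the same mixed-content problem, this is roughly the shape of what works for me: pull out just the m_hero value and convert only that field, leaving the non-UTF-8 parts of the file alone (the regex here is only an illustration of the "extract each name as a string" step, not my actual parser code):

import re
from ast import literal_eval

with open('source.txt', 'r', encoding='utf-8') as f_open:
    source = f_open.read()

# illustrative only: grab the m_hero value and decode just that one field
match = re.search(r"'m_hero': '([^']*)'", source)
if match:
    hero = literal_eval("b'{}'".format(match.group(1))).decode('utf-8')
    print(hero)  # 扎加拉 for the sample above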