Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - In what world would \u00c3\u00a9 become é?

I have a JSON document, most likely improperly encoded, from a source I do not control. It contains the following strings:

d\u00c3\u00a9cor

business\u00e2\u20ac\u2122 active accounts

the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label

From this, I gather they intend \u00c3\u00a9 to become é, whose UTF-8 encoding is the bytes C3 A9. That makes some sense. For the others, I assume we are dealing with some kind of directional (curly) quotation marks.

My theory here is that this is either using some encoding I've never encountered before, or that it has been double-encoded in some way. I am fine writing some code to transform their broken input into something I can understand, as it is highly unlikely they would be able to fix the system if I brought it to their attention.
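
One way to test the double-encoding theory is to reproduce the damage on purpose. The sketch below assumes the producer encoded the text as UTF-8 and then decoded those bytes as Windows-1252 (cp1252). That pipeline is a guess, not something the source confirms, and `garble` is a hypothetical helper name.

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then misread the bytes as Windows-1252 (assumed failure mode)."""
    return text.encode("utf-8").decode("cp1252")


# é is U+00E9; its UTF-8 bytes C3 A9 read as cp1252 give U+00C3 U+00A9.
assert garble("d\u00e9cor") == "d\u00c3\u00a9cor"

# Right single quote U+2019; UTF-8 bytes E2 80 99 give U+00E2 U+20AC U+2122.
assert garble("business\u2019") == "business\u00e2\u20ac\u2122"

# Left double quote U+201C; UTF-8 bytes E2 80 9C give U+00E2 U+20AC U+0153.
assert garble("\u201cMade") == "\u00e2\u20ac\u0153Made"
```

The right curly quote (U+201D) is left out because its third UTF-8 byte, 0x9D, has no Windows-1252 mapping, so Python's strict cp1252 codec raises on it. The \u009d in the sample suggests the producer passed that byte through unchanged.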

Any ideas how to force their input to something I can understand? For the record, I am working in Python.

1 Reply

You should try the ftfy module. Current releases expose the fixer as ftfy.fix_text (older releases called it ftfy.ftfy):

>>> import ftfy
>>> print(ftfy.fix_text("d\u00c3\u00a9cor"))
décor
>>> print(ftfy.fix_text("business\u00e2\u20ac\u2122 active accounts"))
business' active accounts
>>> print(ftfy.fix_text("the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label"))
the "Made in the USA" label
>>> print(ftfy.fix_text("the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label", uncurl_quotes=False))
the “Made in the USA” label
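
If adding a dependency is not an option, the same repair can be approximated with the standard library. This is a minimal sketch, not how ftfy works internally: it assumes the whole string was UTF-8 misread as Windows-1252, with unmapped bytes such as 0x9D passed through as the identical code point. The names `to_byte` and `undo_mojibake` are hypothetical.

```python
def to_byte(ch: str) -> int:
    """Map one mis-decoded character back to the byte it (presumably) came from."""
    try:
        return ch.encode("cp1252")[0]
    except UnicodeEncodeError:
        # Assumption: bytes with no cp1252 mapping (0x81, 0x8D, 0x8F, 0x90, 0x9D)
        # were passed through as the same code point, i.e. latin-1 behaviour.
        return ch.encode("latin-1")[0]


def undo_mojibake(text: str) -> str:
    """Reverse a UTF-8-read-as-cp1252 mix-up; return the input unchanged if it does not fit."""
    try:
        raw = bytes(to_byte(ch) for ch in text)
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character had no single-byte form, or the bytes were not
        # valid UTF-8, so the text was probably not garbled in this way.
        return text


print(undo_mojibake("d\u00c3\u00a9cor"))
print(undo_mojibake("the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label"))
```

Unlike ftfy, this treats the string as all-or-nothing: a string that mixes garbled and correctly decoded non-ASCII text is returned untouched, and the curly quotes are kept rather than straightened.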
