Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Fixing invalid JSON octal escape

KISSmetrics generates invalid JSON strings that I need to parse. I'm getting tons of errors like these:

ERROR 2013-03-04 04:31:12,253 Invalid escape: line 1 column 132 (char 132): {"search engine":"Google","_n":"search engine hit","_p":"z392cpdpnm6silblq5mac8kiugq=","search terms":"happy new year animation 1920\303\2271080 hd","_t":1356390128}

ERROR 2013-03-04 04:34:19,153 Invalid escape: line 1 column 101 (char 101): {"search engine":"Google","_n":"ad campaign hit","_p":"byskpczsw6sorbmzqi0tk1uimgw=","search terms":"\331\203\330\261\330\252\331\207 \331\201\331\212\330\257\331\212\330\244\331\211 \330\256\331\212\331\204\330\247\330\255\331\211 \331\203\331\210\330\261\330\257\331\211","_t":1356483052}

My code is:

for line in lines:
    try:
        data = self.clean_data(json.loads(line))
    except ValueError, e:
        logger.error('%s: %s' % (e.message, line))

Example raw data:

{"search engine":"Google","_n":"search engine hit","_p":"kvceh84hzbhywcnlivv+hdztizw=","search terms":"military sound effects programs","_t":1356034177}

Is there any way to clean up this messy JSON and parse it? Thanks for your help.

1 Reply

Your input data contains octal escapes such as \303\227, which are not valid in JSON. Replace them with the bytes they encode, using a regular expression:

import re

invalid_escape = re.compile(r'\\[0-7]{1,3}')  # up to 3 digits for byte values up to FF

def replace_with_byte(match):
    return chr(int(match.group(0)[1:], 8))

def repair(brokenjson):
    return invalid_escape.sub(replace_with_byte, brokenjson)

This makes your input work:

>>> data1 = r"""{"search engine":"Google","_n":"search engine hit","_p":"z392cpdpnm6silblq5mac8kiugq=","search terms":"happy new year animation 1920\303\2271080 hd","_t":1356390128}"""
>>> json.loads(repair(data1))
{u'_n': u'search engine hit', u'search terms': u'happy new year animation 1920\xd71080 hd', u'_p': u'z392cpdpnm6silblq5mac8kiugq=', u'_t': 1356390128, u'search engine': u'Google'}
>>> print json.loads(repair(data1))['search terms']
happy new year animation 1920×1080 hd
>>> data2 = r"""{"search engine":"Google","_n":"ad campaign hit","_p":"byskpczsw6sorbmzqi0tk1uimgw=","search terms":"\331\203\330\261\330\252\331\207 \331\201\331\212\330\257\331\212\330\244\331\211 \330\256\331\212\331\204\330\247\330\255\331\211 \331\203\331\210\330\261\330\257\331\211","_t":1356483052}"""
>>> json.loads(repair(data2))
{u'_n': u'ad campaign hit', u'search terms': u'\u0643\u0631\u062a\u0647 \u0641\u064a\u062f\u064a\u0624\u0649 \u062e\u064a\u0644\u0627\u062d\u0649 \u0643\u0648\u0631\u062f\u0649', u'_p': u'byskpczsw6sorbmzqi0tk1uimgw=', u'_t': 1356483052, u'search engine': u'Google'}
>>> print json.loads(repair(data2))['search terms']
كرته فيديؤى خيلاحى كوردى
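
The code above targets Python 2, where chr() returns a byte and json.loads() decodes byte strings as UTF-8. On Python 3, chr() returns a Unicode code point instead, so the same function would produce mojibake. A rough sketch of one way to adapt the idea follows; the names OCTAL_RUN, decode_octal_run, repair_py3 and parse_lines are made up for illustration, and it assumes the escaped bytes are UTF-8. Like the original, it does not handle an escape that decodes to a quote, backslash or control character.

```python
import json
import re

# One or more consecutive octal escapes, e.g. \303\227.
# All names in this sketch are hypothetical, not part of the original answer.
OCTAL_RUN = re.compile(r'(?:\\[0-7]{1,3})+')

def decode_octal_run(match):
    # Turn a run such as "\303\227" into the bytes C3 97, then decode as UTF-8.
    codes = match.group(0).split('\\')[1:]
    raw = bytes(int(code, 8) for code in codes)  # raises ValueError above \377
    return raw.decode('utf-8', errors='replace')  # assumption: input is UTF-8

def repair_py3(broken_json):
    return OCTAL_RUN.sub(decode_octal_run, broken_json)

def parse_lines(lines):
    # Parse strictly first and repair only the lines that fail,
    # so valid lines are never rewritten.
    for line in lines:
        try:
            yield json.loads(line)
        except ValueError:
            yield json.loads(repair_py3(line))

sample = r'{"search terms":"happy new year animation 1920\303\2271080 hd"}'
print(next(parse_lines([sample]))["search terms"])
# prints: happy new year animation 1920×1080 hd
```

Decoding a whole run of escapes at once matters here, because a multi-byte UTF-8 character only becomes valid text once all of its bytes are together.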
