Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - efficiently replace bad characters

I often work with UTF-8 text containing characters like:

\xc2\x99

\xc2\x95

\xc2\x85

etc.

These characters confuse other libraries I work with, so they need to be replaced.

What is an efficient way to do this, rather than:

text.replace('\xc2\x99', ' ').replace('\xc2\x85', '...')


1 Reply


There are always regular expressions; just list all of the offending characters inside square brackets, like so:

import re
print(re.sub(r'[\xc2\x99]', " ", "Hello\xc2There\x99"))

This prints: 'Hello There ', with the unwanted characters replaced by spaces.
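
One caveat worth sketching: a character class matches each byte on its own, so on a byte string `[\xc2\x99]` would also replace a `\x99` that belongs to some other multi-byte character. The following is a minimal Python 3 sketch, not part of the original answer, that matches the two-byte sequence as a unit instead. It assumes the input is UTF-8 encoded `bytes` and that every character in the C1 control range (U+0080 to U+009F, encoded as `\xc2\x80` to `\xc2\x9f`) is unwanted; `strip_c1_controls` is a hypothetical name.

```python
import re

# UTF-8 encodes U+0080..U+009F as the byte 0xC2 followed by 0x80..0x9F.
# Matching the pair as a unit leaves other multi-byte characters untouched.
C1_CONTROLS = re.compile(rb'\xc2[\x80-\x9f]')


def strip_c1_controls(data, replacement=b' '):
    """Replace every UTF-8 encoded C1 control character in `data`."""
    return C1_CONTROLS.sub(replacement, data)


print(strip_c1_controls(b'Hello\xc2\x99There\xc2\x95'))  # b'Hello There '
print(strip_c1_controls(b'caf\xc3\xa9'))                 # unchanged: b'caf\xc3\xa9'
```

If only some of these characters should go, the range can be narrowed, for example to `rb'\xc2[\x85\x95\x99]'`.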

Alternatively, if each character needs a different replacement:

# remove annoying characters
chars = {
    '\xc2\x82' : ',',        # High code comma
    '\xc2\x84' : ',,',       # High code double comma
    '\xc2\x85' : '...',      # Triple dot
    '\xc2\x88' : '^',        # High caret
    '\xc2\x91' : '\x27',     # Forward single quote
    '\xc2\x92' : '\x27',     # Reverse single quote
    '\xc2\x93' : '\x22',     # Forward double quote
    '\xc2\x94' : '\x22',     # Reverse double quote
    '\xc2\x95' : ' ',
    '\xc2\x96' : '-',        # High hyphen
    '\xc2\x97' : '--',       # Double hyphen
    '\xc2\x99' : ' ',
    '\xc2\xa0' : ' ',
    '\xc2\xa6' : '|',        # Split vertical bar
    '\xc2\xab' : '<<',       # Double less than
    '\xc2\xbb' : '>>',       # Double greater than
    '\xc2\xbc' : '1/4',      # one quarter
    '\xc2\xbd' : '1/2',      # one half
    '\xc2\xbe' : '3/4',      # three quarters
    '\xca\xbf' : '\x27',     # c-single quote
    '\xcc\xa8' : '',         # modifier - under curve
    '\xcc\xb1' : ''          # modifier - under line
}
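def replace_chars(match):
    # Look up the replacement for whichever key matched
    char = match.group(0)
    return chars[char]

def clean_text(text):
    # Join all keys into one alternation so the text is scanned in a single pass
    return re.sub('(' + '|'.join(chars.keys()) + ')', replace_chars, text)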
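
For text that has already been decoded to Unicode, a lookup table passed to `str.translate` is one possible alternative to the regex. The following is a minimal Python 3 sketch of that swapped-in technique, not the answer's own method. It assumes the bad characters are single code points; the `REPLACEMENTS` table is an illustrative subset of the mapping above, and `clean_unicode` is a hypothetical name.

```python
# Map code points (ints) to replacement strings; None deletes the character.
REPLACEMENTS = {
    0x0085: '...',   # triple dot
    0x0091: "'",     # forward single quote
    0x0092: "'",     # reverse single quote
    0x0093: '"',     # forward double quote
    0x0094: '"',     # reverse double quote
    0x0096: '-',     # high hyphen
    0x0097: '--',    # double hyphen
    0x0099: ' ',
    0x00A0: ' ',     # no-break space
    0x00BD: '1/2',   # one half
    0x0328: None,    # combining modifier, dropped entirely
}


def clean_unicode(text):
    """Replace each mapped code point in one pass over a decoded string."""
    return text.translate(REPLACEMENTS)


raw = b'\xc2\x93Hi\xc2\x94\xc2\x85 add\xc2\xa0\xc2\xbd cup'
print(clean_unicode(raw.decode('utf-8')))  # "Hi"... add 1/2 cup
```

`str.translate` handles one code point at a time, so multi-character sequences would still need the regex approach.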
