python - efficiently replace bad characters

Question

Welcome To Ask or Share your Answers For Others

python - efficiently replace bad characters

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - efficiently replace bad characters

I often work with utf-8 text containing characters like:

xc2x99

xc2x95

xc2x85

etc

These characters confuse other libraries I work with so need to be replaced.

What is an efficient way to do this, rather than:

text.replace('xc2x99', ' ').replace('xc2x85, '...')

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:44:37+0000

There is always regular expressions; just list all of the offending characters inside square brackets like so:

import re
print re.sub(r'[xc2x99]'," ","Helloxc2Therex99")

This prints: 'Hello There ', with the unwanted characters replaced by spaces.

Alternately, if you have a different replacement character for each:

# remove annoying characters
chars = {
    'xc2x82' : ',',        # High code comma
    'xc2x84' : ',,',       # High code double comma
    'xc2x85' : '...',      # Tripple dot
    'xc2x88' : '^',        # High carat
    'xc2x91' : 'x27',     # Forward single quote
    'xc2x92' : 'x27',     # Reverse single quote
    'xc2x93' : 'x22',     # Forward double quote
    'xc2x94' : 'x22',     # Reverse double quote
    'xc2x95' : ' ',
    'xc2x96' : '-',        # High hyphen
    'xc2x97' : '--',       # Double hyphen
    'xc2x99' : ' ',
    'xc2xa0' : ' ',
    'xc2xa6' : '|',        # Split vertical bar
    'xc2xab' : '<<',       # Double less than
    'xc2xbb' : '>>',       # Double greater than
    'xc2xbc' : '1/4',      # one quarter
    'xc2xbd' : '1/2',      # one half
    'xc2xbe' : '3/4',      # three quarters
    'xcaxbf' : 'x27',     # c-single quote
    'xccxa8' : '',         # modifier - under curve
    'xccxb1' : ''          # modifier - under line
}
def replace_chars(match):
    char = match.group(0)
    return chars[char]
return re.sub('(' + '|'.join(chars.keys()) + ')', replace_chars, text)

Categories

python - efficiently replace bad characters

python - efficiently replace bad characters

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags