python - Unicode category for commas and quotation marks

posted Jan 31, 2022 in Technique by 深蓝 (71.8m points)

I have this helper function that gets rid of control characters in XML text:

import unicodedata

def remove_control_characters(s):
    # Replace control characters in XML text with a space,
    # and drop commas and straight double quotes.
    t = ""
    for ch in s:
        if unicodedata.category(ch)[0] == "C":
            t += " "      # any "C*" category (Cc, Cf, ...) becomes a space
        elif ch == "," or ch == '"':
            continue      # skip commas and double quotes
        else:
            t += ch
    return t

I would like to know whether there is a Unicode category that I could use to exclude quotation marks and commas.

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:23:29+0000

In Unicode, the general category of control characters is 'Cc', even though they have no name. unicodedata.category() returns the general category, as you can check for yourself in the Python console:

>>> import unicodedata
>>> unicodedata.category('\0')
'Cc'
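To see which general category a given character falls into, a quick lookup along these lines can help. This is only a minimal sketch: the helper name `describe` and the sample characters are illustrative choices, not something from the original question.

```python
import unicodedata

def describe(ch):
    # Hypothetical helper: return (code point, general category, name).
    # Control characters have no name, so a placeholder default is passed
    # to unicodedata.name() to avoid a ValueError.
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.name(ch, "<unnamed>"),
    )

# NUL, comma, straight double quote, and the two curly double quotes
samples = ["\x00", ",", '"', "\u201c", "\u201d"]
for ch in samples:
    print(describe(ch))
```

Running it on the characters you care about shows which test is needed for each: a category check or an explicit character comparison.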

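The ASCII comma (U+002C) and the straight quotation mark (U+0022) are both in category Po (other punctuation), which also contains characters such as '.', '!' and '?', so no single category isolates them; compare those two characters explicitly. Typographic quotation marks such as “ and ” are in categories Pi (initial quote) and Pf (final quote). Your example only tests the first character of the returned code, so test the full category instead:

 cat = unicodedata.category(ch)
 if cat in ("Cc", "Pi", "Pf") or ch in ',"':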
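Put together, the check above could be turned into a complete function roughly as follows. This is a sketch under stated assumptions rather than a definitive implementation: the names `clean_xml_text`, `DROP_CHARS`, and `DROP_CATEGORIES` are hypothetical, and it assumes control characters should become a space (as in the question) while commas and quotation marks are dropped.

```python
import unicodedata

# Assumption: the straight characters to drop are the comma and double quote.
DROP_CHARS = {",", '"'}
# Typographic quotes fall in Pi (initial quote) and Pf (final quote).
DROP_CATEGORIES = {"Pi", "Pf"}

def clean_xml_text(s):
    out = []
    for ch in s:
        cat = unicodedata.category(ch)
        if cat == "Cc":
            out.append(" ")   # control character -> single space
        elif ch in DROP_CHARS or cat in DROP_CATEGORIES:
            continue          # drop commas and quotation marks
        else:
            out.append(ch)
    return "".join(out)

print(clean_xml_text('a,b\x00"c" \u201cd\u201d'))
```

Note that 'Cc' also covers tab and newline, so those are replaced by spaces too; exclude them explicitly if the XML text should keep its line breaks.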
