Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - remove unicode emoji using re in python

I tried to remove the emoji from a Unicode tweet text and print the result in Python 2.7 using:

myre = re.compile(u'[\u1F300-\u1F5FF\u1F600-\u1F64F\u1F680-\u1F6FF\u2600-\u26FF\u2700-\u27BF]+', re.UNICODE)
print myre.sub('', text)

but almost all of the characters are removed from the text. I have checked several answers from other posts; unfortunately, none of them work here. Did I do anything wrong in re.compile()?

Here is an example of the output, with nearly all the characters removed:

“   '   //./” ! # # # …


1 Reply


You are not using the correct notation for non-BMP Unicode code points; you want to use \U0001FFFF, a backslash followed by a capital U and 8 hex digits:

myre = re.compile(u'['
    u'\U0001F300-\U0001F5FF'
    u'\U0001F600-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+',
    re.UNICODE)

This can be reduced to:

myre = re.compile(u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+',
    re.UNICODE)

as your first two ranges are adjacent.

Your version was specifying (with added spaces for readability):

[\u1F30 0-\u1F5F F\u1F60 0-\u1F64 F\u1F68 0-\u1F6F F \u2600-\u26FF\u2700-\u27BF]+

That's because the \uxxxx escape sequence always takes exactly 4 hex digits, not 5; the fifth digit is read as a separate literal character.

The largest of those ranges is 0-\u1F6F (so from the digit 0 through to U+1F6F, Ὧ), which covers a very large swathe of the Unicode standard, including all ASCII digits and letters.
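
To see the effect concretely, here is a small sketch. It assumes Python 3, where every string is a sequence of full code points; the sample sentence is made up for illustration:

```python
import re

# "\u" consumes exactly four hex digits, so this literal holds two
# characters: U+1F30 followed by the digit "0".
mistaken = u'\u1F300'
# "\U" consumes eight hex digits and yields the single code point U+1F300.
intended = u'\U0001F300'

print(len(mistaken), len(intended))  # 2 1 on Python 3

# The character class from the question, exactly as Python reads it.
broken = re.compile(
    u'[\u1F300-\u1F5FF\u1F600-\u1F64F\u1F680-\u1F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+',
    re.UNICODE)

# Everything from "0" (U+0030) up to U+1F6F is matched, which includes
# all ASCII digits and letters. Only lower code points such as the
# space, "," and "!" survive.
cleaned = broken.sub(u'', u'Hello, world! 123')
print(repr(cleaned))  # ', ! '
```

This matches the symptom in the question: punctuation below U+0030 and characters above the accidental ranges (such as curly quotes) are all that remain.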

The corrected expression works, provided you use a UCS-4 wide Python executable:

>>> import re
>>> myre = re.compile(u'['
...     u'\U0001F300-\U0001F64F'
...     u'\U0001F680-\U0001F6FF'
...     u'\u2600-\u26FF\u2700-\u27BF]+',
...     re.UNICODE)
>>> myre.sub('', u'Some example text with a sleepy face: \U0001f62a')
u'Some example text with a sleepy face: '

The UCS-2 equivalent is:

myre = re.compile(u'('
    u'\ud83c[\udf00-\udfff]|'
    u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
    u'[\u2600-\u26FF\u2700-\u27BF])+',
    re.UNICODE)

You can combine the two in your script with an exception handler:

try:
    # Wide UCS-4 build
    myre = re.compile(u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+',
        re.UNICODE)
except re.error:
    # Narrow UCS-2 build
    myre = re.compile(u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+',
        re.UNICODE)
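
As an alternative sketch to catching the compile error, the build type can be checked explicitly through sys.maxunicode, which is 0xFFFF on narrow builds and 0x10FFFF on wide builds and on Python 3.3 and later. The names EMOJI_RE and strip_emoji are made up for this example:

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build (or Python 3.3+): non-BMP characters are single code points.
    EMOJI_RE = re.compile(
        u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+',
        re.UNICODE)
else:
    # Narrow build: non-BMP characters are stored as surrogate pairs.
    EMOJI_RE = re.compile(
        u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+',
        re.UNICODE)


def strip_emoji(text):
    """Remove every run of characters matched by EMOJI_RE."""
    return EMOJI_RE.sub(u'', text)


print(repr(strip_emoji(u'sleepy \U0001F62A face \u2600')))
```

Either approach selects the same pattern; the explicit check just makes the reason for the two variants visible in the code.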

Of course, the regex is already out of date, as it doesn't cover emoji defined in newer Unicode releases; it appears to cover roughly the emoji defined up to Unicode 8.0 and misses, for example, U+1F91D HANDSHAKE, which was added in Unicode 9.0.
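
A quick way to check that gap, assuming Python 3 (or a wide Python 2 build) so that the \U ranges compile as written:

```python
import re

myre = re.compile(
    u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+',
    re.UNICODE)

sleepy = u'\U0001F62A'     # SLEEPY FACE, inside the covered ranges
handshake = u'\U0001F91D'  # HANDSHAKE, outside every range in the class

print(myre.sub(u'', sleepy) == u'')           # True: removed
print(myre.sub(u'', handshake) == handshake)  # True: left untouched
```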

If you need a more up-to-date regex, take one from a package that actively tries to keep up with new emoji; the emoji package specifically supports generating such a regex:

import emoji

def remove_emoji(text):
    # get_emoji_regexp() is available in emoji releases before 2.0
    return emoji.get_emoji_regexp().sub(u'', text)

At the time of writing, the package was up to date for Unicode 11.0 and had the infrastructure in place to follow future releases quickly. All your project has to do is upgrade when there is a new release. Note that get_emoji_regexp() was removed in emoji 2.0; newer releases provide emoji.replace_emoji(text, replace='') for the same purpose.



...