The regex is outdated. It appears to cover only Emoji defined up to Unicode 8.0 (U+1F91D HANDSHAKE, for example, was added in Unicode 9.0). The other approach is just a very inefficient method of force-encoding to ASCII, which is rarely what you want when just removing Emoji (and can be achieved much more easily and efficiently with text.encode('ascii', 'ignore').decode('ascii')).
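To illustrate why force-encoding to ASCII is rarely what you want, here is a small sketch; the sample string is made up for illustration. The Emoji disappears, but so does every other non-ASCII character:

```python
# Hypothetical sample: accented Latin text followed by one Emoji (U+1F600)
text = u"Caf\u00e9 na\u00efve \U0001F600"

# Force-encode to ASCII, silently dropping everything that is not ASCII
ascii_only = text.encode("ascii", "ignore").decode("ascii")

# The accented letters are gone along with the Emoji
print(ascii_only)
```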
If you need a more up-to-date regex, take one from the emoji package, which actively tries to keep up-to-date on Emoji and specifically supports generating such a regex:
import emoji

def remove_emoji(text):
    return emoji.get_emoji_regexp().sub(u'', text)
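At the time of writing, the package is up-to-date for Unicode 11.0 and has the infrastructure in place to update to future releases quickly. All your project has to do is upgrade when there is a new release. Note that get_emoji_regexp() was removed in version 2.0 of the package; on newer releases, emoji.replace_emoji(text, replace='') does the same job.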
Demo using your sample inputs:
>>> print(remove_emoji(u'??????? ????? ??????? ????'))
??????? ????? ???????
>>> print(remove_emoji(u'??Test????? ??????? A.P&T.S. ????????'))
Test????? ??????? A.P&T.S.
Note that the regex works on Unicode text: for Python 2, make sure you have decoded from str to unicode, and for Python 3, from bytes to str, first.
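As a minimal Python 3 sketch of that decoding step, assuming the input arrives as UTF-8 encoded bytes (the one-codepoint pattern is only a stand-in for a real Emoji regex):

```python
import re

raw = b"ok \xf0\x9f\x98\x80"          # UTF-8 bytes for "ok " + U+1F600
text = raw.decode("utf-8")            # bytes -> str, so the regex sees codepoints

grinning = re.compile(u"\U0001F600")  # stand-in pattern, not the full Emoji regex
print(repr(grinning.sub(u"", text)))  # 'ok '
```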
Emoji are complex beasts these days. The above will remove complete, valid Emoji. If you have 'incomplete' Emoji components, such as skin-tone codepoints (meant to be combined with specific Emoji only), removing those is much harder. The skin-tone codepoints themselves are easy (just remove those 5 codepoints afterwards), but there is a whole host of combinations built from otherwise innocent characters, such as ♀ U+2640 FEMALE SIGN or ♂ U+2642 MALE SIGN, together with variation selectors and U+200D ZERO WIDTH JOINER. These characters have specific meanings in other contexts too, so you can't just regex them out unless you don't mind breaking text that uses Devanagari, Kannada or CJK ideographs, to name just a few examples.
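To make the composition concrete, the following sketch uses the standard library's unicodedata module to list the codepoints behind a single visible Emoji. The sequence chosen here (woman shrugging, medium skin tone) is just an illustrative example, and the code assumes Python 3, where each codepoint is one character:

```python
import unicodedata

# One visible glyph, five codepoints:
# base Emoji + skin tone + joiner + gender sign + variation selector
sequence = u"\U0001F937\U0001F3FD\u200D\u2640\uFE0F"

names = [unicodedata.name(ch) for ch in sequence]
for ch, name in zip(sequence, names):
    print("U+{:04X} {}".format(ord(ch), name))
```

Only the first two codepoints are exclusive to Emoji; the joiner, the gender sign and the variation selector all appear in ordinary text as well.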
That said, the following Unicode 11.0 codepoints are probably safe to remove (based on filtering on the Emoji_Component designation in Emoji-data):
20E3 ; combining enclosing keycap
FE0F ; VARIATION SELECTOR-16
1F1E6..1F1FF ; regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF ; light skin tone..dark skin tone
1F9B0..1F9B3 ; red-haired..white-haired
E0020..E007F ; tag space..cancel tag
which can be removed by creating a new regex to match those:
import re

try:
    uchr = unichr  # Python 2
    import sys
    if sys.maxunicode == 0xffff:
        # narrow build, define alternative unichr encoding to surrogate pairs
        # as unichr(sys.maxunicode + 1) fails.
        def uchr(codepoint):
            return (
                unichr(codepoint) if codepoint <= sys.maxunicode else
                unichr(codepoint - 0x010000 >> 10 | 0xD800) +
                unichr(codepoint & 0x3FF | 0xDC00)
            )
except NameError:
    uchr = chr  # Python 3

# Unicode 11.0 Emoji Component map (deemed safe to remove)
_removable_emoji_components = (
    (0x20E3, 0xFE0F),  # combining enclosing keycap, VARIATION SELECTOR-16
    range(0x1F1E6, 0x1F1FF + 1),  # regional indicator symbol letter a..regional indicator symbol letter z
    range(0x1F3FB, 0x1F3FF + 1),  # light skin tone..dark skin tone
    range(0x1F9B0, 0x1F9B3 + 1),  # red-haired..white-haired
    range(0xE0020, 0xE007F + 1),  # tag space..cancel tag
)
emoji_components = re.compile(u'({})'.format(u'|'.join([
    re.escape(uchr(c)) for r in _removable_emoji_components for c in r])),
    flags=re.UNICODE)
then update the above remove_emoji() function to use it:
def remove_emoji(text, remove_components=False):
    cleaned = emoji.get_emoji_regexp().sub(u'', text)
    if remove_components:
        cleaned = emoji_components.sub(u'', cleaned)
    return cleaned
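If you only need to support Python 3, the same set of components can probably be written more compactly as a single character class. This is a sketch under that assumption, covering the same Unicode 11.0 ranges listed earlier; the names EMOJI_COMPONENTS and strip_emoji_components are made up for this example and do not depend on the emoji package:

```python
import re

# Python 3 only: the Emoji_Component codepoints deemed safe to remove,
# expressed as one character class instead of a long alternation.
EMOJI_COMPONENTS = re.compile(
    u"["
    u"\u20E3"                   # combining enclosing keycap
    u"\uFE0F"                   # VARIATION SELECTOR-16
    u"\U0001F1E6-\U0001F1FF"    # regional indicator symbol letters a..z
    u"\U0001F3FB-\U0001F3FF"    # light skin tone..dark skin tone
    u"\U0001F9B0-\U0001F9B3"    # red-haired..white-haired
    u"\U000E0020-\U000E007F"    # tag space..cancel tag
    u"]"
)


def strip_emoji_components(text):
    """Remove leftover Emoji component codepoints from text."""
    return EMOJI_COMPONENTS.sub(u"", text)


# An orphaned skin-tone modifier and a stray variation selector are dropped
print(strip_emoji_components(u"thumbs\U0001F3FD up\uFE0F"))  # thumbs up
```

U+200D ZERO WIDTH JOINER and the gender signs are deliberately left out of the class, for the reasons given above.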