So, this is the first question I've ever asked on here, and it's about emojis. I'm sorry.
I am making a Twitter bot in Python with the help of Tweepy and regex (I also tried python-pcre) that will analyse a tweet of a given user and record the number of times a word or emoji was used. I can do most of this just fine. My problems start with the emojis.
I was under the impression that \X (in both regex and python-pcre) matches eXtended grapheme clusters, not just individual code points. I read in another post, "What does the expression \X match when inside a RegEx?", that \X follows a set of guidelines to determine whether the next character should be added to the cluster, but will always match at least one character.
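To make that expectation concrete, here is a minimal sketch of the behaviour I understand \X to have. It uses a combining accent instead of an emoji, and the sample string is only an illustration; it assumes the third-party regex module is installed.

```python
import regex  # third-party module (pip install regex)

# Illustrative sample: 'e' + U+0301 COMBINING ACUTE ACCENT, then a plain 'a'
sample = "e\u0301a"

# Iterating over a str yields individual code points
code_points = list(sample)

# \X should match one extended grapheme cluster per match,
# so the accent stays attached to the 'e'
clusters = regex.findall(r"\X", sample)

print(len(code_points), len(clusters))
```

If the backslash is missing, the pattern is just the literal letter X, so it is worth checking that the pattern string really is r"\X".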
I tried the first and second solutions from this post: "How to extract all the emojis from text?".
The first one acted as expected. Grabs individual code-points and adds them to a list. Perfect for single code-point emojis, but I need to capture emojis with multiple code-points, and single code-point emojis.
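As an illustration of what "multiple code-points" means here, the sketch below takes one family emoji built from a zero-width-joiner sequence and lists its code points using only the standard library. The sample sequence is my own choice for illustration, not one taken from the tweets.

```python
import unicodedata

# Illustrative sample: MAN + ZWJ + WOMAN + ZWJ + GIRL, shown as one family emoji
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# Each element of a Python str is one code point, not one visible emoji
code_points = [ord(ch) for ch in family]

for ch in family:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))

print(len(family))  # number of code points in the sequence
```

A per-code-point approach would split this one visible emoji into five separate items, which is why I need cluster-level matching.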
The second solution is where I am having problems. According to the post, this function should print the emojis in a string, in clusters, separated by spaces.
import emoji
import regex

def split_count(self, text):
    emoji_list = []
    # \X matches one extended grapheme cluster per match
    data = regex.findall(r'\X', text)
    for word in data:
        # keep the cluster if any of its code points is a known emoji
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_list.append(word)
    return emoji_list
When called like:
counter = self.split_count(tweet)
print(' '.join(emoji for emoji in counter))
Should result in:
?? ???????????
However when I run it I get:
?? ?? ?? ?? ??
100% not clustered.
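To rule out the terminal simply rendering the clusters oddly, the matches can be dumped as code points. This is a small sketch with a hypothetical helper, describe_matches, which is not part of regex or emoji; it only uses the standard library.

```python
def describe_matches(matches):
    """Return, for each matched string, the list of its code points as U+XXXX."""
    return [[f"U+{ord(ch):04X}" for ch in match] for match in matches]

# Illustrative input: one two-code-point cluster and one single code point
example = ["e\u0301", "a"]

for cluster in describe_matches(example):
    # A properly clustered match shows several code points on one line
    print(" ".join(cluster))
```

If every line printed for the result of split_count contains exactly one code point, the matches really are unclustered.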
Why is this happening? This has been bugging me for a couple days now.