0 votes
190 views
in Technique [Technology] by (71.8m points)

python - Why does my code return random letter tokens instead of word tokens?

I'm an absolute beginner with Python, and I am very stuck on this part. I tried creating a function to preprocess my texts/data for topic modeling. It works perfectly when I run it as standalone code, but it does not return what I expect when I run it as a function. I would appreciate any help!

  • The code I'm using is very basic, and probably inefficient, but it's for an introductory class, so a really basic approach is the way to go for me!

code:

def clean (data):
    data_prep = []
    for data in data:
        tokenized_words = nltk.word_tokenize (data)
        text_words = [token.lower() for token in tokenized_words if token.isalnum()]
        text_words = [word for word in text_words if word not in stop_words]
        text_joined = " ".join(textwords)
        data_prep.append(text_joined)
        
    return data_prep

The outputs are really random, like "j", ",", "i". I was using a .txt file as my data, converted from a .csv file.

Edit:

I've adjusted my code based on the mistakes pointed out, and it is now:

def clean (data):
    data_prep = []
    for row in data:
        tokenized_words = nltk.word_tokenize (data)
        text_words = [token.lower() for token in tokenized_words if token.isalnum()]
        text_words = [word for word in text_words if word not in stop_words]
        text_joined = " ".join(text_words)
        data_prep.append(text_joined)
    return data_prep

Results: it now returns tokenized sentences, seemingly repeated in a loop.

What is my mistake this time?

(see attached screenshot of the output)



1 Reply

0 votes
by (71.8m points)

I don't have enough reputation to comment, so I will post this as an answer instead. It seems you are unnecessarily looping through all of your data twice: once in your outer for loop (for row in data), and then again in your list comprehensions ([token.lower() for token in tokenized_words if token.isalnum()]), because nltk.word_tokenize(data) tokenizes all of the data, not just the current row. In other words, your code should stop returning the same sentences multiple times once you stop tokenizing the whole dataset on every pass through the loop, for example by getting rid of the outermost for loop.
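For reference, here is a minimal sketch of the per-row fix implied above: keep the loop, but tokenize only the current row. It assumes data is a list of strings and that stop_words is NLTK's English stopword list (the original post never shows how stop_words is defined, so that part is an assumption):

import nltk
from nltk.corpus import stopwords

# Assumes the required NLTK data has been downloaded beforehand,
# e.g. nltk.download('punkt') and nltk.download('stopwords').
stop_words = set(stopwords.words('english'))

def clean(data):
    data_prep = []
    for row in data:
        # Tokenize only the current row, not the whole dataset
        tokenized_words = nltk.word_tokenize(row)
        # Keep alphanumeric tokens, lowercased
        text_words = [token.lower() for token in tokenized_words if token.isalnum()]
        # Drop stopwords
        text_words = [word for word in text_words if word not in stop_words]
        data_prep.append(" ".join(text_words))
    return data_prep

Called on a small list such as clean(["This is the first document.", "And this, the second one!"]), it should return something like ['first document', 'second one'] with NLTK's default English stopword list.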

