0 votes
190 views
in Technique [Technology] by (71.8m points)

python - Why does my code return random letter tokens instead of word tokens?

I'm an absolute beginner with Python, and I am very stuck on this part. I tried creating a function to preprocess my texts/data for topic modeling. It works perfectly when I run it as standalone code, but it does not return what I expect when I run it as a function. I would appreciate any help!

  • The code I'm using is very basic, and probably inefficient, but it's for an introductory class, so a really basic approach is the way to go for me!

code:

def clean (data):
    data_prep = []
    for data in data:
        tokenized_words = nltk.word_tokenize (data)
        text_words = [token.lower() for token in tokenized_words if token.isalnum()]
        text_words = [word for word in text_words if word not in stop_words]
        text_joined = " ".join(textwords)
        data_prep.append(text_joined)
        
    return data_prep

The outputs are really random, like "j", ",", "i". I was using a .txt file as my data, converted from a .csv file.

Edit:

I've adjusted my code based on the mistakes pointed out, and it is now:

def clean (data):
    data_prep = []
    for row in data:
        tokenized_words = nltk.word_tokenize (data)
        text_words = [token.lower() for token in tokenized_words if token.isalnum()]
        text_words = [word for word in text_words if word not in stop_words]
        text_joined = " ".join(text_words)
        data_prep.append(text_joined)
    return data_prep

Results: it now returns tokenized sentences, seemingly repeated in a loop.

What is my mistake this time?

(see attached screenshot of the output)



1 Reply

0 votes
by (71.8m points)

I don't have enough reputation to comment, so I will post this as an answer instead. It seems you are unnecessarily looping through all of your data twice: once in your outer for loop (for row in data), and then again in your list comprehensions ([token.lower() for token in tokenized_words if token.isalnum()]), because nltk.word_tokenize(data) tokenizes all of the data, not just the current row. In other words, your code should stop returning the same sentences multiple times once you stop tokenizing the whole dataset on every pass through the loop, for example by getting rid of the outermost for loop.
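For reference, here is a minimal sketch of the per-row fix implied above: keep the loop, but tokenize only the current row. It assumes data is a list of strings and that stop_words is NLTK's English stopword list (the original post never shows how stop_words is defined, so that part is an assumption):

import nltk
from nltk.corpus import stopwords

# Assumes the required NLTK data has been downloaded beforehand,
# e.g. nltk.download('punkt') and nltk.download('stopwords').
stop_words = set(stopwords.words('english'))

def clean(data):
    data_prep = []
    for row in data:
        # Tokenize only the current row, not the whole dataset
        tokenized_words = nltk.word_tokenize(row)
        # Keep alphanumeric tokens, lowercased
        text_words = [token.lower() for token in tokenized_words if token.isalnum()]
        # Drop stopwords
        text_words = [word for word in text_words if word not in stop_words]
        data_prep.append(" ".join(text_words))
    return data_prep

Called on a small list such as clean(["This is the first document.", "And this, the second one!"]), it should return something like ['first document', 'second one'] with NLTK's default English stopword list.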

