Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python 3.x - How to create corpus from pandas data frame to operate with NLTK

Here is my problem:

  1. I have a CSV file containing a data set of articles with the columns ID, CATEGORY, TITLE, BODY.

  2. In Python, I read the file into a pandas DataFrame like this:

    import pandas as pd
    df = pd.read_csv('my_file.csv')
    
  3. Now I need to somehow transform this df into a corpus object, let's call it my_corpus. But how exactly can I do that? I assume I need to use:

    from nltk.corpus.reader import CategorizedCorpusReader
    my_corpus = some_nltk_function(df) # <- what is the function?
    
  4. In the end, I want to use NLTK methods to analyze the corpus. For example:

    import nltk
    my_corpus.fileids() # <- I expect values from column ID
    my_corpus.categories() # <- I expect values from column CATEGORY
    my_corpus.words(categories='cat_A') # <- I expect values from column TITLE and BODY
    my_corpus.sents(categories=['cat_A', 'cat_B', 'cat_C']) # <- I expect values from column TITLE and BODY
    

Please advise.



1 Reply


I think you need to do two things.

First, you need to write each row of your DataFrame df to its own corpus file. The following function should do that for you:

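import os

def CreateCorpusFromDataFrame(corpusfolder, df):
    """Write one text file per row of df, named <CATEGORY>_<ID>.txt."""
    os.makedirs(corpusfolder, exist_ok=True)  # create the folder if it is missing
    for index, r in df.iterrows():
        fname = str(r['CATEGORY']) + '_' + str(r['ID']) + '.txt'
        # Mode 'w' (not 'a') so that re-running does not append duplicate text
        with open(os.path.join(corpusfolder, fname), 'w', encoding='utf-8') as corpusfile:
            corpusfile.write(str(r['BODY']) + " " + str(r['TITLE']))

CreateCorpusFromDataFrame('yourcorpusfolder/', df)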
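
With this approach the file name is the only place where CATEGORY and ID survive, so it may be worth checking that a name can be split back into both parts by the pattern later given to cat_pattern. Below is a minimal sketch using only the standard library. The helpers corpus_filename and split_corpus_filename are hypothetical names introduced for illustration, and the check assumes that IDs never contain an underscore or a slash.

```python
import re

# Same pattern as the one passed to cat_pattern when reading the corpus
CAT_PATTERN = r'(.*)_.*'

def corpus_filename(category, doc_id):
    """Hypothetical helper: build '<CATEGORY>_<ID>.txt'."""
    doc_id = str(doc_id)
    # An underscore in the ID would be absorbed into the category by the
    # greedy group; a slash would create a subfolder.
    if '_' in doc_id or '/' in doc_id:
        raise ValueError(f"ID {doc_id!r} must not contain '_' or '/'")
    return f"{category}_{doc_id}.txt"

def split_corpus_filename(fname):
    """Hypothetical helper: recover (category, ID) from a corpus file name."""
    category = re.match(CAT_PATTERN, fname).group(1)
    doc_id = fname[len(category) + 1:-len('.txt')]
    return category, doc_id

name = corpus_filename('cat_A', 17)
print(name)                         # cat_A_17.txt
print(split_corpus_filename(name))  # ('cat_A', '17')
```

Because the group in the pattern is greedy, a category such as cat_A that itself contains an underscore still comes back intact; only the part after the last underscore is treated as the ID.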

Second, you need to read the files back from yourcorpusfolder and then do whatever NLTK processing you require:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# File names look like <CATEGORY>_<ID>.txt; the greedy group takes everything
# before the last underscore as the category.
my_corpus = CategorizedPlaintextCorpusReader('yourcorpusfolder/',
                                             r'.*\.txt', cat_pattern=r'(.*)_.*')
my_corpus.fileids()  # file names such as 'cat_A_1.txt' (CATEGORY and ID combined, not the bare ID)
my_corpus.categories()  # values from column CATEGORY
my_corpus.words(categories='cat_A')  # words from BODY and TITLE of the cat_A articles
my_corpus.sents(categories=['cat_A', 'cat_B'])  # sentences from BODY and TITLE
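
If the file ids should stay closer to the ID column, one possible alternative (not what the code above does) is to name the files <ID>.txt and hand the categories to the reader explicitly through its cat_map argument, which maps each file id to a list of categories. The sketch below assumes the files were written as <ID>.txt; build_cat_map and open_corpus are hypothetical helper names, and rows stands in for df.to_dict('records').

```python
def build_cat_map(rows):
    """Map '<ID>.txt' to a one-element list holding the row's CATEGORY."""
    # Values must be lists: a bare string would be read as a sequence of
    # one-character categories.
    return {f"{row['ID']}.txt": [str(row['CATEGORY'])] for row in rows}

def open_corpus(corpus_folder, cat_map):
    """Open the folder as a categorized corpus using an explicit mapping."""
    # Imported here so that build_cat_map can be tried without NLTK installed.
    from nltk.corpus.reader import CategorizedPlaintextCorpusReader
    return CategorizedPlaintextCorpusReader(corpus_folder, r'.*\.txt',
                                            cat_map=cat_map)

# Stand-in for df.to_dict('records')
rows = [
    {'ID': 1, 'CATEGORY': 'cat_A', 'TITLE': 't1', 'BODY': 'b1'},
    {'ID': 2, 'CATEGORY': 'cat_B', 'TITLE': 't2', 'BODY': 'b2'},
]
cat_map = build_cat_map(rows)
print(cat_map)  # {'1.txt': ['cat_A'], '2.txt': ['cat_B']}
```

With this variant, fileids() would return names like '1.txt', so the ID is recovered by dropping the extension, and categories may contain any characters because they no longer have to be parsed out of the file name.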
