Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Using a CSR Matrix in a multivariate random forest classification model

I am trying to utilize a CSR matrix as a variable to enhance my model. This matrix is derived from analyzing tf-idf metrics from string values in a pandas dataframe.

The series that the CSR matrix is derived from has 7325 records. After the CSR Matrix is generated it has a shape of (7325, 4927). I am not clear on the matrix format or what that 4927 represents.
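On the shape itself: in scikit-learn's text vectorizers, each row of the output corresponds to one input document and each column to one term in the learned vocabulary. So 4927 would be the number of distinct tokens found across the 7325 names. A minimal sketch on a made-up three-name corpus (the names are hypothetical, not taken from the original data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical owner names standing in for the real column.
names = ["smith john", "smith family trust", "acme holdings llc"]

count_vect = CountVectorizer()
counts = count_vect.fit_transform(names)
matrix = TfidfTransformer().fit_transform(counts)

n_docs, n_terms = matrix.shape
vocabulary = sorted(count_vect.vocabulary_)  # one entry per matrix column

print(matrix.shape)  # (number of names, number of distinct tokens)
print(vocabulary)
```

Here `count_vect.vocabulary_` maps each token to its column index, so its length always matches the second dimension of the matrix.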

Basically, I am trying to use the matrix as one variable in a multivariate random forest classification model. I have tried converting the matrix to a dataframe and then combining the matrix dataframe with two other series to create a new dataframe containing all the variables to plug into the model.


pd.DataFrame(pd.DataFrame(matrix), df['var1'], df['var2'])

but the resulting dataframe is not what I expected. The matrix data isn't in the table. Furthermore, var2 becomes the x-axis (the columns) and var1 becomes the y-axis (the index). This does not happen if I just join the var1 and var2 series in a separate dataframe.
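One likely explanation, offered as a sketch rather than a confirmed diagnosis: the second and third positional arguments of `pd.DataFrame` are `index` and `columns`, not additional data. When the first argument is already a DataFrame, pandas reindexes it to those labels, which leaves NaN wherever the labels do not match. The toy frame and the names `features`, `var1` and `var2` below are illustrative only.

```python
import pandas as pd

features = pd.DataFrame({"f0": [0.1, 0.2], "f1": [0.3, 0.4]})
var1 = pd.Series([10, 11], name="var1")
var2 = pd.Series([5, 6], name="var2")

# Positional arguments are (data, index, columns): var1 becomes the row
# labels and var2 the column labels, so none of the original cells survive.
wrong = pd.DataFrame(features, var1, var2)
print(wrong)

# Column-wise concatenation keeps the features and appends both series.
combined = pd.concat([features, var1, var2], axis=1)
print(combined)
```

`pd.concat(..., axis=1)` aligns on the index, so it behaves as expected only when all inputs share the same row labels.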

[![enter image description here][1]][1]

I can convert the matrix to a dataframe with a shape of (7325,1) just fine by

pd.DataFrame(matrix)

The other series each have a shape of (7325,). I don't know if this has something to do with it.
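The `(7325, 1)` shape suggests that pandas did not unpack the sparse matrix into one column per term. If a DataFrame is wanted without densifying, pandas provides `pd.DataFrame.sparse.from_spmatrix`. A minimal sketch, with a small random matrix standing in for the tf-idf output:

```python
import pandas as pd
from scipy import sparse

# Small random CSR matrix as a stand-in for the tf-idf matrix.
matrix = sparse.random(6, 4, density=0.5, format="csr", random_state=0)

sparse_df = pd.DataFrame.sparse.from_spmatrix(matrix)
print(sparse_df.shape)   # same (rows, columns) as the matrix
print(sparse_df.dtypes)  # every column uses a sparse dtype
```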

I generate the matrix via a tf-idf analysis of a string variable of parcel owner names. It involves tokenizing the string variable and assigning a weight to every token in the string. I am able to pass the CSR matrix directly to an sklearn RandomForestClassifier model and it works fine. I am now trying to add variables to the model:


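from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Stem each whitespace-separated token, then rejoin into one string,
# because CountVectorizer expects strings rather than lists of tokens.
stemmer = PorterStemmer()
df['String_variable'] = df['String_variable'].apply(
    lambda x: ' '.join(stemmer.stem(y) for y in x.split()))

# Raw term counts: one row per record, one column per vocabulary term.
count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['String_variable'])

# Reweight the counts by tf-idf; the result is a SciPy CSR matrix.
transformer = TfidfTransformer().fit(counts)
matrix = transformer.transform(counts)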


  [1]: https://i.stack.imgur.com/C5eDS.png
question from:https://stackoverflow.com/questions/65942508/using-a-csr-matrix-in-a-multivariate-random-forest-classification-model


1 Reply


Per hpaulj's comments I converted the matrix to a pandas dataframe via .todense().

x = pd.DataFrame(matrix.todense())

x['Var1'] = df['Var1']
x['Var2'] = df['Var2']

I was then able to plug it directly into the model trainer.
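Densifying works at this size, but a 7325 x 4927 float64 array takes roughly 290 MB. An alternative sketch, assuming `Var1` and `Var2` are numeric, keeps everything sparse with `scipy.sparse.hstack`; `RandomForestClassifier` accepts sparse input directly. The toy data and labels below are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins: a sparse text-feature matrix plus two numeric columns.
matrix = sparse.random(8, 5, density=0.4, format="csr", random_state=0)
df = pd.DataFrame({"Var1": np.arange(8.0), "Var2": np.arange(8.0)[::-1]})
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # hypothetical class labels

# Stack the extra columns to the right of the sparse matrix.
extra = sparse.csr_matrix(df[["Var1", "Var2"]].to_numpy())
X = sparse.hstack([matrix, extra], format="csr")

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(X.shape)
```

Categorical variables would need to be encoded numerically first (for example with `OneHotEncoder`) before being stacked this way.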

