I have written two pieces of code to explore text messages and build models that predict whether a message is spam or not.
In both SVC models I have used TfidfVectorizer with max_df = 5 and have added a new column, the length of each document.
This is the first piece of code, and it returns a ROC AUC score of 0.85:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def spam_or_not():
    v = TfidfVectorizer(max_df=5).fit(X_train)
    # Document lengths, to be appended as an extra feature.
    l_train = [len(x) for x in X_train]
    l_test = [len(x) for x in X_test]
    x_train_text = v.transform(X_train)
    # add_feature returns the sparse feature matrix with the length column appended.
    x_train = add_feature(x_train_text, l_train)
    x_test_text = v.transform(X_test)
    x_test = add_feature(x_test_text, l_test)
    clf = SVC(C=10000)
    clf.fit(x_train, y_train)
    y_predict = clf.predict(x_test)
    return roc_auc_score(y_test, y_predict)
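For reference, add_feature appends the given values as an extra column to the sparse feature matrix. A minimal sketch of such a helper, assuming it is built on scipy.sparse.hstack (the actual implementation may differ), looks like this:

from scipy.sparse import csr_matrix, hstack

def add_feature(X, feature_to_add):
    # Append the new feature as an extra column on the right of sparse matrix X.
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')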
This is the second piece of code, and it gives a score of 0.95:
def spam_or_not():
    length_X_train = list(map(len, X_train))
    length_X_test = list(map(len, X_test))
    vect = TfidfVectorizer(min_df=5).fit(X_train)
    X_train_vectorized = vect.transform(X_train)
    X_test_vectorized = vect.transform(X_test)
    x_train = add_feature(X_train_vectorized, length_X_train)
    x_test = add_feature(X_test_vectorized, length_X_test)
    clf = SVC(C=10000)
    clf.fit(x_train, y_train)
    y_predict = clf.predict(x_test)
    return roc_auc_score(y_test, y_predict)
Both pieces of code look the same to me, yet they give very different results. If someone could point out the difference between them, it would be really helpful.