Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Two similar codes providing different answers

I have written two pieces of code that explore text messages and build models to predict whether a message is spam.

In both SVC models, I used TfidfVectorizer, set max_df = 5, and added a new column containing the length of the document.

This is the first piece of code; it returns a ROC AUC score of 0.85:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

def spam_or_not():
    # X_train, X_test, y_train, y_test and add_feature are defined elsewhere.
    v = TfidfVectorizer(max_df=5).fit(X_train)
    l_train = [len(x) for x in X_train]
    l_test = [len(x) for x in X_test]
    x_train_text = v.transform(X_train)
    # add_feature returns a sparse feature matrix with the extra feature appended.
    x_train = add_feature(x_train_text, l_train)
    x_test_text = v.transform(X_test)
    x_test = add_feature(x_test_text, l_test)
    clf = SVC(C=10000)
    clf.fit(x_train, y_train)
    y_predict = clf.predict(x_test)
    return roc_auc_score(y_test, y_predict)

This is the second piece of code; it gives a score of 0.95:

def spam_or_not():
    length_X_train = list(map(len, X_train))
    length_X_test = list(map(len, X_test))
    vect = TfidfVectorizer(min_df=5).fit(X_train)
    X_train_vectorized = vect.transform(X_train)
    X_test_vectorized = vect.transform(X_test)
    x_train = add_feature(X_train_vectorized, length_X_train)
    x_test = add_feature(X_test_vectorized, length_X_test)
    clf = SVC(C=10000)
    clf.fit(x_train, y_train)
    y_predict = clf.predict(x_test)
    return roc_auc_score(y_test, y_predict)

Both pieces of code look the same to me, yet they give very different results. Could someone point out the difference between them?



1 Reply


I can spot at least one difference:

v = TfidfVectorizer(max_df=5).fit(X_train)

versus

vect = TfidfVectorizer(min_df=5).fit(X_train)
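  • The first sets the maximum document frequency to 5, so every term that appears in more than 5 documents is ignored (the frequent terms are dropped).
  • The second sets the minimum document frequency to 5, so every term that appears in fewer than 5 documents is ignored (the rare terms are dropped).

The two vectorizers therefore build almost entirely different vocabularies: only terms that occur in exactly 5 documents survive both filters. The resulting TF-IDF feature matrices differ accordingly, hence the very different results.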
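
To make the effect concrete, here is a minimal pure-Python sketch of the document-frequency filter. It is not scikit-learn's implementation: the helper names (document_frequency, filter_vocabulary) and the toy corpus are hypothetical, and tokens are split on whitespace, which is simpler than TfidfVectorizer's default tokenizer. It only mirrors the documented meaning of integer max_df and min_df thresholds.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() so a term repeated within one document is counted once
        df.update(set(doc.lower().split()))
    return df


def filter_vocabulary(docs, min_df=1, max_df=None):
    """Keep terms whose document frequency lies in [min_df, max_df]."""
    df = document_frequency(docs)
    upper = len(docs) if max_df is None else max_df
    return {term for term, count in df.items() if min_df <= count <= upper}


# Hypothetical toy corpus: "free" occurs in 6 documents,
# "prize" in 5, and "hello" in 1.
docs = [
    "free prize hello",
    "free prize",
    "free prize",
    "free prize",
    "free prize",
    "free",
]

vocab_max = filter_vocabulary(docs, max_df=5)  # analogous to max_df=5
vocab_min = filter_vocabulary(docs, min_df=5)  # analogous to min_df=5

print(sorted(vocab_max))              # ['hello', 'prize']
print(sorted(vocab_min))              # ['free', 'prize']
print(sorted(vocab_max & vocab_min))  # ['prize']
```

On real data you could compare the two fitted vectorizers in the same way, for example by inspecting the size of each one's vocabulary_ attribute.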

