I have written two pieces of code to explore text messages and build models that predict whether a message is spam or not.
In both SVC models I have used TfidfVectorizer with max_df = 5 and have added a new column, the length of each document.
This is the first piece of code, and it returns a ROC AUC score of 0.85:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def spam_or_not():
    v = TfidfVectorizer(max_df=5).fit(X_train)
    # Document lengths, to be appended as an extra feature.
    l_train = [len(x) for x in X_train]
    l_test = [len(x) for x in X_test]
    x_train_text = v.transform(X_train)
    # add_feature returns the sparse feature matrix with the length column appended.
    x_train = add_feature(x_train_text, l_train)
    x_test_text = v.transform(X_test)
    x_test = add_feature(x_test_text, l_test)
    clf = SVC(C=10000)
    clf.fit(x_train, y_train)
    y_predict = clf.predict(x_test)
    return roc_auc_score(y_test, y_predict)
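For reference, add_feature appends the given values as an extra column to the sparse feature matrix. A minimal sketch of such a helper, assuming it is built on scipy.sparse.hstack (the actual implementation may differ), looks like this:

from scipy.sparse import csr_matrix, hstack

def add_feature(X, feature_to_add):
    # Append the new feature as an extra column on the right of sparse matrix X.
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')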
This is the second piece of code, and it gives a score of 0.95:
def spam_or_not():
    length_X_train = list(map(len, X_train))
    length_X_test = list(map(len, X_test))
    vect = TfidfVectorizer(min_df=5).fit(X_train)
    X_train_vectorized = vect.transform(X_train)
    X_test_vectorized = vect.transform(X_test)
    x_train = add_feature(X_train_vectorized, length_X_train)
    x_test = add_feature(X_test_vectorized, length_X_test)
    clf = SVC(C=10000)
    clf.fit(x_train, y_train)
    y_predict = clf.predict(x_test)
    return roc_auc_score(y_test, y_predict)
Both pieces of code look the same to me, yet they give very different results. If someone could point out the difference between them, it would be really helpful.