python - How to retain column headers of data frame after Pre-processing in scikit-learn

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a pandas data frame with some rows and columns, and each column has a header. As long as I keep doing data manipulation in pandas, the column headers are retained. But if I use one of scikit-learn's data pre-processing features, I lose all my headers and the frame is converted to a plain matrix of numbers.

I understand why this happens: scikit-learn returns a numpy ndarray as output, and an ndarray, being just a matrix, has no column names.

But here is the thing. When I am building a model on my dataset, even after the initial pre-processing and a first model, I might need more data manipulation before trying another model for a better fit. Without the column headers this is difficult, because I might not know the index of a particular variable, whereas it is easy to remember a variable name or look it up with df.columns.

How can I overcome that?

EDIT1: Editing with sample data snapshot.

    Pclass  Sex  Age  SibSp  Parch     Fare  Embarked
0        3    0   22      1      0   7.2500         1
1        1    1   38      1      0  71.2833         2
2        3    1   26      0      0   7.9250         1
3        1    1   35      1      0  53.1000         1
4        3    0   35      0      0   8.0500         1
5        3    0  NaN      0      0   8.4583         3
6        1    0   54      0      0  51.8625         1
7        3    0    2      3      1  21.0750         1
8        3    1   27      0      2  11.1333         1
9        2    1   14      1      0  30.0708         2
10       3    1    4      1      1  16.7000         1
11       1    1   58      0      0  26.5500         1
12       3    0   20      0      0   8.0500         1
13       3    0   39      1      5  31.2750         1
14       3    1   14      0      0   7.8542         1
15       2    1   55      0      0  16.0000         1

The above is the pandas data frame. When I run the following on it, the column headers are stripped:

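from sklearn import preprocessing
# Note: preprocessing.Imputer was removed in scikit-learn 0.22; newer versions use sklearn.impute.SimpleImputer
X_imputed = preprocessing.Imputer().fit_transform(X_train)
X_imputed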

The result is a numpy array, so the column names are gone:

array([[  3.        ,   0.        ,  22.        , ...,   0.        ,
          7.25      ,   1.        ],
       [  1.        ,   1.        ,  38.        , ...,   0.        ,
         71.2833    ,   2.        ],
       [  3.        ,   1.        ,  26.        , ...,   0.        ,
          7.925     ,   1.        ],
       ..., 
       [  3.        ,   1.        ,  29.69911765, ...,   2.        ,
         23.45      ,   1.        ],
       [  1.        ,   0.        ,  26.        , ...,   0.        ,
         30.        ,   2.        ],
       [  3.        ,   0.        ,  32.        , ...,   0.        ,
          7.75      ,   3.        ]])

So I want to retain the column names when I run scikit-learn pre-processing on my pandas data frame.

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:44:36+0000

scikit-learn does strip the column headers in most cases, so just add them back afterward. In your example, with X_imputed as the sklearn.preprocessing output and X_train as the original data frame, you can restore the column headers (and the original row index, so that rows still line up with X_train) with:

import pandas as pd
X_imputed_df = pd.DataFrame(X_imputed, columns=X_train.columns, index=X_train.index)
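
For anyone who wants to try this end to end, here is a minimal sketch of the same idea. It assumes a recent scikit-learn, where the question's preprocessing.Imputer has been replaced by sklearn.impute.SimpleImputer. The small X_train below is a hypothetical stand-in built from a few rows of the sample data, not the asker's real frame. The last step relies on set_output, which as far as I know only exists in scikit-learn 1.2 and later, so it is guarded.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for X_train: a few rows from the question's sample data.
X_train = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 3],
        "Sex": [0, 1, 1, 0],
        "Age": [22.0, 38.0, 26.0, np.nan],
        "Fare": [7.25, 71.2833, 7.925, 8.4583],
    },
    index=[0, 1, 2, 5],
)

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X_train)  # plain NumPy array, headers are gone

# Rewrap the array, reusing both the column names and the row index.
X_imputed_df = pd.DataFrame(
    X_imputed, columns=X_train.columns, index=X_train.index
)

# On scikit-learn 1.2 or newer, the transformer can be asked to return
# a DataFrame directly, which keeps the headers without manual rewrapping.
if hasattr(imputer, "set_output"):
    X_direct = imputer.set_output(transform="pandas").fit_transform(X_train)
```

Rewrapping with columns=X_train.columns only works when the transformer returns the same columns in the same order. A step that adds or drops columns (one-hot encoding, feature selection, or an imputer dropping an all-NaN column) needs its own list of output names.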
