It is a bit strange to encode continuous data such as Salary. It makes no sense unless you have binned the salaries into ranges/categories. If I were you, I would do:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_features = ['Age','Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])
From here you can chain it with a classifier, e.g.:
from sklearn.linear_model import LogisticRegression
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])
Use it like so:
clf.fit(X_train, y_train)
This applies the preprocessor and then passes the transformed data to the predictor.
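To see the whole pipeline run end to end, here is a minimal sketch on made-up data. The toy values, the binned `Age` strings, the `y_train` labels and the `X_new` row are all assumptions for illustration; only the column names and the pipeline steps come from the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one missing Salary and one missing Country.
X_train = pd.DataFrame({
    "Salary": [30_000, 52_000, np.nan, 61_000, 45_000, 80_000],
    "Age": pd.Series(["20-30", "30-40", "30-40", "40-50", "20-30", "40-50"], dtype=object),
    "Country": pd.Series(["France", "Spain", "Germany", np.nan, "France", "Spain"], dtype=object),
})
y_train = np.array([0, 1, 0, 1, 0, 1])  # made-up binary target

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["Salary"]),
    ("cat", categorical_transformer, ["Age", "Country"]),
])
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(solver="lbfgs")),
])

# fit() learns the imputation, scaling and encoding from X_train only.
clf.fit(X_train, y_train)

# 'Italy' was never seen during fit; handle_unknown='ignore' encodes it
# as all zeros in the Country columns instead of raising an error.
X_new = pd.DataFrame({
    "Salary": [58_000],
    "Age": pd.Series(["30-40"], dtype=object),
    "Country": pd.Series(["Italy"], dtype=object),
})
pred = clf.predict(X_new)
print(pred)
```

Because the preprocessing lives inside the pipeline, the same fitted imputers, scaler and encoder are reused at prediction time, so nothing has to be transformed by hand.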
If we want to select columns by data type on the fly, we can modify our preprocessor to use a column selector based on dtypes:
from sklearn.compose import make_column_selector as selector
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, selector(dtype_include="number")),
    ('cat', categorical_transformer, selector(dtype_include="category"))])
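A quick way to check which columns a selector will pick is to call it on the DataFrame directly. This is a small sketch on a hypothetical two-row frame. It uses the pandas dtype alias `"number"` for numeric columns, and it assumes the categorical columns have been cast to the pandas `category` dtype, since plain string columns are not matched by `dtype_include="category"`.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical frame; Age and Country are explicitly stored as 'category'.
df = pd.DataFrame({
    "Salary": [30_000.0, 52_000.0],
    "Age": pd.Series(["20-30", "30-40"], dtype="category"),
    "Country": pd.Series(["France", "Spain"], dtype="category"),
})

# A selector is a callable that returns the matching column names.
numeric_cols = selector(dtype_include="number")(df)
categorical_cols = selector(dtype_include="category")(df)

print(numeric_cols)      # ['Salary']
print(categorical_cols)  # ['Age', 'Country']
```

If a column you expect is missing from the result, casting it first (for example with `df["Country"].astype("category")`) is usually the fix.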
Using GridSearch
from sklearn.model_selection import GridSearchCV
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
    'classifier__solver': ['lbfgs', 'sag']}
grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train, y_train)
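The keys of the parameter grid are built from the step names joined by double underscores, and they are case sensitive. A sketch of how to list the valid names before running a search, using a cut-down version of the pipeline (the reduced steps here are an assumption made to keep the example short):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Reduced pipeline with the same step names as above, for illustration only.
preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[("imputer", SimpleImputer())]), ["Salary"]),
])
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression()),
])

# get_params() returns every tunable name, including nested ones.
names = set(clf.get_params())

print("preprocessor__num__imputer__strategy" in names)  # True
print("classifier__C" in names)                         # True
print("Classifier__solver" in names)                    # False: wrong case
```

A key that is not in this set makes the search fail with an invalid-parameter error, so checking the names first saves a failed run.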