It is a bit strange to encode a continuous variable such as Salary. It only makes sense if you have binned the salaries into ranges/categories. If I were you, I would do:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
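from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV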
# numeric columns: fill missing values with the median, then standardize
numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
# categorical columns: fill missing values with a constant label, then one-hot encode
categorical_features = ['Age', 'Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
# route each group of columns to its own transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
From here you can chain it with a classifier, e.g.:
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])
Use it like so:
clf.fit(X_train, y_train)
This will apply the preprocessor and then pass the transformed data to the classifier.
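To see the whole thing run end to end, here is a minimal, self-contained sketch. The toy data, the string age bands ("20s", "30s", ...) and the unseen country code "ES" are all invented for illustration; they are not from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (hypothetical): one missing Salary, Age given as bands.
X_train = pd.DataFrame({
    "Salary": [30000.0, 45000.0, np.nan, 80000.0, 52000.0, 61000.0],
    "Age": ["20s", "30s", "30s", "40s", "20s", "40s"],
    "Country": ["FR", "DE", "FR", "DE", "DE", "FR"],
})
y_train = np.array([0, 0, 1, 1, 0, 1])

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["Salary"]),
    ("cat", categorical_transformer, ["Age", "Country"])])
clf = Pipeline(steps=[("preprocessor", preprocessor),
                      ("classifier", LogisticRegression(solver="lbfgs"))])

# fit() fits the preprocessor first, then the classifier on its output
clf.fit(X_train, y_train)

# 1 scaled Salary column + 3 Age bands + 2 countries = 6 columns
Xt = clf.named_steps["preprocessor"].transform(X_train)
print(Xt.shape)

# predict() reuses the fitted preprocessing: the missing Salary is imputed
# and the unseen country "ES" is encoded as all zeros (handle_unknown="ignore")
X_new = pd.DataFrame({"Salary": [np.nan], "Age": ["30s"], "Country": ["ES"]})
pred = clf.predict(X_new)
print(pred)
```

The point is that you never call the imputers, scaler or encoder yourself; the same fitted steps are replayed on any new data passed to predict.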
Updates:
If we want to select columns by data type on the fly, we can modify the preprocessor to use a column selector based on dtypes:
from sklearn.compose import make_column_selector as selector
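preprocessor = ColumnTransformer(
    transformers=[
        # "number" matches all numeric dtypes
        ('num', numeric_transformer, selector(dtype_include="number")),
        # matches the pandas "category" dtype only; plain string columns are not selected
        ('cat', categorical_transformer, selector(dtype_include="category"))])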
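A quick way to check what each selector picks up is to call it directly on a DataFrame, since it simply returns a list of column names. A small sketch, with a made-up frame where Age is already a pandas categorical and Country is still plain strings:

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical frame for illustration only
df = pd.DataFrame({
    "Salary": [30000.0, 45000.0, 80000.0],
    "Age": pd.Categorical(["20s", "30s", "40s"]),
    "Country": ["FR", "DE", "FR"],
})

num_cols = selector(dtype_include="number")(df)
cat_cols = selector(dtype_include="category")(df)
print(num_cols)  # ['Salary']
print(cat_cols)  # ['Age'] -- Country is plain strings, so it is skipped

# Converting the string column makes the "category" selector see it
df["Country"] = df["Country"].astype("category")
cat_cols_after = selector(dtype_include="category")(df)
print(cat_cols_after)  # ['Age', 'Country']
```

So if your categorical columns are stored as plain strings, convert them with astype("category") first, otherwise they will silently be left out of the transformer.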
Using GridSearchCV:
# keys follow <step>__<sub-step>__<parameter>, using the step names given above (case-sensitive)
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
    'classifier__solver': ['lbfgs', 'sag'],
}
grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train,y_train)
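After fitting, the winning combination and its cross-validated score can be read off the search object. Below is a rough, self-contained sketch on synthetic data; the data, the smaller grid and cv=3 are assumptions made only to keep the example fast. It also uses clf.get_params() to check that the grid keys are spelled exactly like the pipeline step names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (hypothetical): the label depends on Salary only
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "Salary": rng.normal(50000, 15000, n),
    "Country": rng.choice(["FR", "DE", "ES"], n),
})
y = (X["Salary"] > X["Salary"].median()).astype(int)

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                            ("scaler", StandardScaler())]), ["Salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"])])
clf = Pipeline(steps=[("preprocessor", preprocessor),
                      ("classifier", LogisticRegression(solver="lbfgs"))])

param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__C": [0.1, 1.0, 10],
    "classifier__solver": ["lbfgs", "sag"],
}

# Every grid key must be a valid pipeline parameter name
valid_keys = clf.get_params()
missing = [k for k in param_grid if k not in valid_keys]
print(missing)  # [] when all keys match

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)   # best parameter combination found
print(grid_search.best_score_)    # mean cross-validated accuracy of that combination
n_candidates = len(grid_search.cv_results_["params"])  # 2 * 3 * 2 combinations
```

By default GridSearchCV refits the best pipeline on the full data, so grid_search.predict(...) can be used directly afterwards.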