I want to use sklearn.compose.ColumnTransformer
consistently (not parallel, so, the second transformer should be executed only after the first) for intersecting lists of columns in this way:
log_transformer = p.FunctionTransformer(lambda x: np.log(x))
df = pd.DataFrame({'a': [1,2, np.NaN, 4], 'b': [1,np.NaN, 3, 4], 'c': [1 ,2, 3, 4]})
compose.ColumnTransformer(n_jobs=1,
transformers=[
('num', impute.SimpleImputer() , ['a', 'b']),
('log', log_transformer, ['b', 'c']),
('scale', p.StandardScaler(), ['a', 'b', 'c'])
]).fit_transform(df)
So, I want to use SimpleImputer
for 'a'
, 'b'
, then log
for 'b'
, 'c'
, and then StandardScaler
for 'a'
, 'b'
, 'c'
.
But:
- I get array of
(4, 7)
shape.
- I still get
Nan
in a
and b
columns.
So, how can I use ColumnTransformer
for different columns in the manner of Pipeline
?
UPD:
pipe_1 = pipeline.Pipeline(steps=[
('imp', impute.SimpleImputer(strategy='constant', fill_value=42)),
])
pipe_2 = pipeline.Pipeline(steps=[
('imp', impute.SimpleImputer(strategy='constant', fill_value=24)),
])
pipe_3 = pipeline.Pipeline(steps=[
('scl', p.StandardScaler()),
])
# in the real situation I don't know exactly what cols these arrays contain, so they are not static:
cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']
proc = compose.ColumnTransformer(remainder='passthrough', transformers=[
('1', pipe_1, cols_1),
('2', pipe_2, cols_2),
('3', pipe_3, cols_3),
])
proc.fit_transform(df).T
Output:
array([[ 1. , 2. , 42. , 4. ],
[ 1. , 24. , 3. , 4. ],
[-1.06904497, -0.26726124, nan, 1.33630621],
[-1.33630621, nan, 0.26726124, 1.06904497],
[-1.34164079, -0.4472136 , 0.4472136 , 1.34164079]])
I understood why I have cols duplicates, nans
and not scaled values, but how can I fix this in the correct way when cols are not static?
UPD2:
A problem may arise when the columns change their order. So, I want to use FunctionTransformer
for columns selection:
def select_col(X, cols=None):
return X[cols]
ct1 = compose.make_column_transformer(
(p.OneHotEncoder(), p.FunctionTransformer(select_col, kw_args=dict(cols=['a', 'b']))),
remainder='passthrough'
)
ct1.fit(df)
But get this output:
ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed
How can I fix it?
See Question&Answers more detail:
os