Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
414 views
in Technique[技术] by (71.8m points)

machine learning - scikit-learn impute mean of feature within groups of nominal value in another feature

I want to impute the mean of a feature but only calculate the mean based off other examples that have the same category/nominal value in another column and I was wondering if this was possible using scikit-learn's Imputer class? It would just make it easier to add into a pipeline that way.

For example:

Using the Titanic dataset from kaggle: source

How would I go about imputing the mean fare per pclass. The thinking behind it being that people in different classes would have large differences in cost between tickets.

Update: After discussion with some people, the phrase I should have used was "imputing the mean within class".

I've looked into Vivek's comment below and will construct a generic pipeline function when I get time to do what I want :) I have a good idea of how to do it and will post as an answer when it's finished.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

So below is a pretty simple approach to my question that was just meant to handle the means of things. A more robust implementation would probably involve utilising the Imputer class from scikit learn which would mean it could also do mode, median, etc. and would be better at dealing with sparse/dense matrices.

This is based on Vivek Kumar's comment on the original question which suggested splitting the data into stacks and imputing it that way then re-assembling them.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class WithinClassMeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, replace_col_index, class_col_index = None, missing_values=np.nan):
        self.missing_values = missing_values
        self.replace_col_index = replace_col_index
        self.y = None
        self.class_col_index = class_col_index

    def fit(self, X, y = None):
        self.y = y
        return self

    def transform(self, X):
        y = self.y
        classes = np.unique(y)
        stacks = []

        if len(X) > 1 and len(self.y) = len(X):
            if( self.class_col_index == None ):
                # If we're using the dependent variable
                for aclass in classes:
                    with_missing = X[(y == aclass) & 
                                        (X[:, self.replace_col_index] == self.missing_values)]
                    without_missing = X[(y == aclass) & 
                                            (X[:, self.replace_col_index] != self.missing_values)]

                    column = without_missing[:, self.replace_col_index]
                    # Calculate mean from examples without missing values
                    mean = np.mean(column[without_missing[:, self.replace_col_index] != self.missing_values])

                    # Broadcast mean to all missing values
                    with_missing[:, self.replace_col_index] = mean

                    stacks.append(np.concatenate((with_missing, without_missing)))
            else:
                # If we're using nominal values within a binarised feature (i.e. the classes
                # are unique values within a nominal column - e.g. sex)
                for aclass in classes:
                    with_missing = X[(X[:, self.class_col_index] == aclass) & 
                                        (X[:, self.replace_col_index] == self.missing_values)]
                    without_missing = X[(X[:, self.class_col_index] == aclass) & 
                                            (X[:, self.replace_col_index] != self.missing_values)]

                    column = without_missing[:, self.replace_col_index]
                    # Calculate mean from examples without missing values
                    mean = np.mean(column[without_missing[:, self.replace_col_index] != self.missing_values])

                    # Broadcast mean to all missing values
                    with_missing[:, self.replace_col_index] = mean
                    stacks.append(np.concatenate((with_missing, without_missing)))

            if len(stacks) > 1 :
                # Reassemble our stacks of values
                X = np.concatenate(stacks)

        return X

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...