Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
148 views
in Technique[技术] by (71.8m points)

python - quickly drop dataframe columns with only one distinct value

Is there a faster way to drop columns that only contain one distinct value than the code below?

cols=df.columns.tolist()
for col in cols:
    if len(set(df[col].tolist()))<2:
        df=df.drop(col, axis=1)

This is really quite slow for large dataframes. Logically, this counts the number of values in each column when in fact it could just stop counting after reaching 2 different values.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can use Series.unique() method to find out all the unique elements in a column, and for columns whose .unique() returns only 1 element, you can drop that. Example -

for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)

A method that does not do inplace dropping -

res = df
for col in df.columns:
    if len(df[col].unique()) == 1:
        res = res.drop(col,axis=1)

Demo -

In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])

In [155]: for col in df.columns:
   .....:     if len(df[col].unique()) == 1:
   .....:         df.drop(col,inplace=True,axis=1)
   .....:

In [156]: df
Out[156]:
   1
0  2
1  3
2  2

Timing results -

In [166]: %paste
def func1(df):
        res = df
        for col in df.columns:
                if len(df[col].unique()) == 1:
                        res = res.drop(col,axis=1)
        return res

## -- End pasted text --

In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})

In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop

In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop

In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop

The fastest method still seems to be the method using unique and looping through the columns.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...