I see there are two ways you might want to do this.
For the entire DataFrame
This method replaces every value that occurs infrequently across the entire DataFrame with NaN. No loop is needed; the built-in pandas functions do the counting, which keeps things fast.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])
threshold = 10  # Values occurring fewer than this many times are replaced with NaN.
value_counts = df.stack().value_counts()  # Counts pooled over the entire DataFrame
to_remove = value_counts[value_counts < threshold].index
df = df.replace(to_remove, np.nan)
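To make the pooled counting concrete, here is a minimal sketch on a tiny hand-made frame. The `toy` data and the threshold of 2 are made up for illustration and are not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: across both columns, 1 occurs 4 times,
# 2 occurs 3 times, and 3 occurs once.
toy = pd.DataFrame({"A": [1, 1, 2, 3], "B": [1, 1, 2, 2]})
threshold = 2

counts = toy.stack().value_counts()              # counts pooled over all columns
rare = counts[counts < threshold].index.tolist() # only the value 3 is this rare
cleaned = toy.replace(rare, np.nan)              # rare values become NaN

print(cleaned)
```

Only the single 3 in column A is blanked out. The lone 2 in column A survives because 2 is frequent enough once column B is counted as well.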
Column-by-column
This method counts occurrences within each column separately, so a value is replaced with NaN only in the columns where it occurs infrequently.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])
threshold = 10  # Values occurring fewer than this many times are replaced with NaN.
for col in df.columns:
    value_counts = df[col].value_counts()  # Counts for this column only
    to_remove = value_counts[value_counts < threshold].index
    # Assign the result back; calling replace(..., inplace=True) on df[col]
    # may act on a temporary copy and leave df unchanged.
    df[col] = df[col].replace(to_remove, np.nan)
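If you would rather not write the loop yourself, one possible variant pushes the per-column work into `DataFrame.apply`. This is only a sketch: the helper name `drop_rare` and the `toy` data are hypothetical, and it uses `Series.map` plus `Series.where` in place of `replace`.

```python
import pandas as pd

def drop_rare(col: pd.Series, threshold: int) -> pd.Series:
    """Blank out values that occur fewer than `threshold` times in `col`."""
    # Map each entry to the number of times its value occurs in this column.
    freq = col.map(col.value_counts())
    # Keep entries that are frequent enough; the rest become NaN.
    return col.where(freq >= threshold)

# Hypothetical toy data: 2 occurs once in column A but twice in column B.
toy = pd.DataFrame({"A": [1, 1, 2, 3], "B": [1, 1, 2, 2]})
cleaned = toy.apply(drop_rare, threshold=2)

print(cleaned)
```

Here both the 2 and the 3 in column A are blanked, while column B is left untouched, because each column is judged on its own counts.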