Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
409 views
in Technique[技术] by (71.8m points)

python - Why use pandas qcut return ValueError: Bin edges must be unique?

I have dataset :

recency;frequency;monetary
21;156;41879955
13;88;16850284
8;74;79150488
2;74;26733719
9;55;16162365
...;...;...

detail raw data -> http://pastebin.com/beiEeS80 and i put into DataFrame and here is my complete code :

df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])
df['recency'] = df['recency'].astype(float)
df['frequency'] = df['frequency'].astype(float)
df['monetary'] = df['monetary'].astype(float)

df['recency'] = pd.qcut(df['recency'].values, 5).codes + 1
df['frequency'] = pd.qcut(df['frequency'].values, 5).codes + 1
df['monetary'] = pd.qcut(df['monetary'].values, 5).codes + 1

but it's return error

df['frequency'] = pd.qcut(df['frequency'].values, 5).codes + 1
ValueError: Bin edges must be unique: array([   1.,    1.,    2.,    4.,    9.,  156.])

How to solve this ?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I ran this in Jupyter and placed the exampledata.txt to the same directory as the notebook.

Please note that the first line:

df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])

loads the colums 'userid' when it isn't defined in the data file. I removed this column name.

Solution

import pandas as pd

def pct_rank_qcut(series, n):
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).argmax()
    return series.rank(pct=1).apply(f)

datas = pd.read_csv('./exampledata.txt', delimiter=';')

df = pd.DataFrame(datas, columns=['recency', 'frequency', 'monetary'])

df['recency'] = df['recency'].astype(float)
df['frequency'] = df['frequency'].astype(float)
df['monetary'] = df['monetary'].astype(float)

df['recency'] = pct_rank_qcut(df.recency, 5)
df['frequency'] = pct_rank_qcut(df.frequency, 5)
df['monetary'] = pct_rank_qcut(df.monetary, 5)

Explanation

The problem you were seeing was a result of pd.qcut assuming 5 bins of equal size. In the data you provided, 'frequency' has more than 28% number 1's. This broke qcut.

I provided a new function pct_rank_qcut that addresses this and pushes all 1's into the first bin.

    edges = pd.Series([float(i) / n for i in range(n + 1)])

This line defines a series of percentile edges based on the desired number of bins defined by n. In the case of n = 5 the edges will be [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

    f = lambda x: (edges >= x).argmax()

this line defines a helper function to be applied to another series in the next line. edges >= x will return a series equal in length to edges where each element is True or False depending on whether x is less than or equal to that edge. In the case of x = 0.14 the resulting (edges >= x) will be [False, True, True, True, True, True]. By the taking the argmax() I've identified the first index where the series is True, in this case 1.

    return series.rank(pct=1).apply(f)

This line takes the input series and turns it into a percentile ranking. I can compare these rankings to the edges I've created and that's why I use the apply(f). What's returned should be a series of bin numbers numbered 1 to n. This series of bin numbers is the same thing you were trying to get with:

pd.qcut(df['recency'].values, 5).codes + 1

This has consequences in that the bins are no longer equal and that bin 1 borrows completely from bin 2. But some choice had to be made. If you don't like this choice, use the concept to build your own ranking.

Demonstration

print df.head()

   recency  frequency  monetary
0        3          5         5
1        2          5         5
2        2          5         5
3        1          5         5
4        2          5         5

Update

pd.Series.argmax() is now deprecated. Simply switch to pd.Series.values.argmax()() to update!

def pct_rank_qcut(series, n):
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).values.argmax()
    return series.rank(pct=1).apply(f)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...