Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
216 views
in Technique[技术] by (71.8m points)

python - Multi-Index Sorting in Pandas

I have a multi-index DataFrame created via a groupby operation. I'm trying to do a compound sort using several levels of the index, but I can't seem to find a sort function that does what I need.

Initial dataset looks something like this (daily sales counts of various products):

         Date Manufacturer Product Name Product Launch Date  Sales
0  2013-01-01        Apple         iPod          2001-10-23     12
1  2013-01-01        Apple         iPad          2010-04-03     13
2  2013-01-01      Samsung       Galaxy          2009-04-27     14
3  2013-01-01      Samsung   Galaxy Tab          2010-09-02     15
4  2013-01-02        Apple         iPod          2001-10-23     22
5  2013-01-02        Apple         iPad          2010-04-03     17
6  2013-01-02      Samsung       Galaxy          2009-04-27     10
7  2013-01-02      Samsung   Galaxy Tab          2010-09-02      7

I use groupby to get a sum over the date range:

> grouped = df.groupby(['Manufacturer', 'Product Name', 'Product Launch Date']).sum()
                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPad         2010-04-03              30
             iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

So far so good!

Now the last thing I want to do is sort each manufacturer's products by launch date, but keep them grouped hierarchically under Manufacturer - here's all I am trying to do:

                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPod         2001-10-23              34
             iPad         2010-04-03              30
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

When I try sortlevel() I lose the nice per-company hierarchy I had before:

> grouped.sortlevel('Product Launch Date')
                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
Apple        iPad         2010-04-03              30
Samsung      Galaxy Tab   2010-09-02              22

sort() and sort_index() just fail:

grouped.sort(['Manufacturer','Product Launch Date'])
KeyError: u'no item named Manufacturer'

grouped.sort_index(by=['Manufacturer','Product Launch Date'])
KeyError: u'no item named Manufacturer'

Seems like a simple operation, but I can't quite figure it out.

I'm not tied to using a MultiIndex for this, but since that's what groupby() returns, that's what I've been working with.

BTW the code to produce the initial DataFrame is:

data = {
  'Date': ['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01', '2013-01-02', '2013-01-02', '2013-01-02', '2013-01-02'],
  'Manufacturer' : ['Apple', 'Apple', 'Samsung', 'Samsung', 'Apple', 'Apple', 'Samsung', 'Samsung',],
  'Product Name' : ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab', 'iPod', 'iPad', 'Galaxy', 'Galaxy Tab'], 
  'Product Launch Date' : ['2001-10-23', '2010-04-03', '2009-04-27', '2010-09-02','2001-10-23', '2010-04-03', '2009-04-27', '2010-09-02'],
  'Sales' : [12, 13, 14, 15, 22, 17, 10, 7]
}
df = DataFrame(data, columns=['Date', 'Manufacturer', 'Product Name', 'Product Launch Date', 'Sales'])
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

A hack would be to change the order of the levels:

In [11]: g
Out[11]:
                                               Sales
Manufacturer Product Name Product Launch Date
Apple        iPad         2010-04-03              30
             iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

In [12]: g.index = g.index.swaplevel(1, 2)

Sortlevel, which (as you've found) sorts the MultiIndex levels in order:

In [13]: g = g.sortlevel()

And swap back:

In [14]: g.index = g.index.swaplevel(1, 2)

In [15]: g
Out[15]:
                                               Sales
Manufacturer Product Name Product Launch Date
Apple        iPod         2001-10-23              34
             iPad         2010-04-03              30
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

I'm of the opinion that sortlevel should not sort the remaining labels in order, so will create a github issue. :) Although it's worth mentioning the docnote about "the need for sortedness".

Note: you could avoid the first swaplevel by reordering the order of the initial groupby:

g = df.groupby(['Manufacturer', 'Product Launch Date', 'Product Name']).sum()

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...