I use Python pandas to group and aggregate across data frames, but I would now like to aggregate specific pairs of rows (all n-choose-2 combinations). Here is the example data, where I would like to look at all pairs of genes in mygenes:
import pandas
import itertools
mygenes = ['ABC1', 'ABC2', 'ABC3', 'ABC4']
df = pandas.DataFrame({'Gene': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
                       'case1': [0, 1, 1, 0, 0],
                       'case2': [1, 1, 1, 0, 1],
                       'control1': [0, 0, 1, 1, 1],
                       'control2': [1, 0, 0, 1, 0]})
>>> df
   Gene  case1  case2  control1  control2
0  ABC1      0      1         0         1
1  ABC2      1      1         0         0
2  ABC3      1      1         1         0
3  ABC4      0      0         1         1
4  ABC5      0      1         1         0
The final product should look like this (applying np.sum by default is fine):
                case1  case2  control1  control2
'ABC1', 'ABC2'      1      2         0         1
'ABC1', 'ABC3'      1      2         1         1
'ABC1', 'ABC4'      0      1         1         2
'ABC2', 'ABC3'      2      2         1         0
'ABC2', 'ABC4'      1      1         1         1
'ABC3', 'ABC4'      1      1         2         1
The set of gene pairs can easily be obtained with itertools (itertools.combinations(mygenes, 2)), but I can't figure out how to aggregate specific rows based on their values. Can anyone advise? Thank you.
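For reference, here is a minimal sketch of one way the pairwise aggregation described above might be done. It assumes the Gene values are unique, so each gene can be looked up by label after set_index. The names indexed, pairs, rows and result are illustrative only and do not come from any library. This is one possible approach, not necessarily the fastest for large gene lists.

```python
import itertools
import pandas

mygenes = ['ABC1', 'ABC2', 'ABC3', 'ABC4']
df = pandas.DataFrame({'Gene': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
                       'case1': [0, 1, 1, 0, 0],
                       'case2': [1, 1, 1, 0, 1],
                       'control1': [0, 0, 1, 1, 1],
                       'control2': [1, 0, 0, 1, 0]})

# Index by gene name so rows can be selected by label
# (assumes each gene appears exactly once).
indexed = df.set_index('Gene')

# All unordered pairs of the genes of interest (n choose 2).
pairs = list(itertools.combinations(mygenes, 2))

# For each pair, select its two rows and sum them column-wise.
rows = [indexed.loc[list(pair)].sum() for pair in pairs]

# One row per pair, labelled by a two-level index (gene A, gene B).
result = pandas.DataFrame(rows, index=pandas.MultiIndex.from_tuples(pairs))
print(result)
```

To use a different aggregation, .sum() could be swapped for another column-wise reducer such as .max() or .mean().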