I have a data set that looks something like this:
   id1  id2 size
1 5400 5505    7
2 5033 5458    1
3 5452 2873   24
4 5452 5213    2
5 5452 4242   26
6 4823 4823    4
7 5505 5400   11
Here id1 and id2 identify nodes in a graph, and size is a value assigned to the directed edge from id1 to id2. The data set is fairly large (a little over 2 million rows). What I would like to do is sum the size column, grouped by unordered node pairs of id1 and id2. For example, the first row has id1=5400 and id2=5505, and there is another row in the data frame where id1=5505 and id2=5400. In the grouped data, the size values of these two rows would be summed into a single row. In other words, I want to summarize the data by grouping on the (unordered) set of (id1,id2).

I've found a way to do this using apply with a custom function that checks for the reversed column pair in the full data set, but it is excruciatingly slow. Does anyone know of another way to do this, perhaps with plyr or with something in the base packages, that would be more efficient?
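To make the desired grouping concrete, here is a minimal base R sketch of one possible approach, not necessarily the fastest one. It assumes the data frame is called df (a name chosen here for illustration) and that the ids are numeric. The helper columns lo and hi are also introduced only for this example. They hold the smaller and larger id of each row, so both directions of an edge get the same key.

```r
# Sample rows copied from the question; `df` is an assumed name.
df <- data.frame(
  id1  = c(5400, 5033, 5452, 5452, 5452, 4823, 5505),
  id2  = c(5505, 5458, 2873, 5213, 4242, 4823, 5400),
  size = c(7, 1, 24, 2, 26, 4, 11)
)

# Build an order-independent key: `lo` is the smaller id, `hi` the larger.
# pmin/pmax are vectorized, so no row-wise apply is needed.
df$lo <- pmin(df$id1, df$id2)
df$hi <- pmax(df$id1, df$id2)

# Sum `size` within each unordered pair.
result <- aggregate(size ~ lo + hi, data = df, FUN = sum)
print(result)
```

On the sample rows, the two edges between 5400 and 5505 collapse into one row with size 18 (7 + 11), and every other pair keeps its original size.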