`groupby` is super lazy. Here's an illuminating demo. Let's group three `a`-values and four `b`-values, and print out what's happening:
>>> from itertools import groupby
>>> def letters():
...     for letter in 'a', 'a', 'a', 'b', 'b', 'b', 'b':
...         print('yielding', letter)
...         yield letter
...
### Going through the groups WITHOUT looking at their members
Let's roll:
>>> groups = groupby(letters())
>>>
Nothing got printed yet! So until now, `groupby` did nothing. What a lazy bum. Let's ask it for the first group:
>>> next(groups)
yielding a
('a', <itertools._grouper object at 0x05A16050>)
So `groupby` tells us that this is a group of `a`-values, and we could go through that `_grouper` object to get them all. But wait, why did "yielding a" get printed only once? Our generator is yielding three of them, isn't it? Well, that's because `groupby` is lazy. It did read one value to identify the group, because it needs to tell us what the group is about, i.e., that it's a group of `a`-values. And it offers us that `_grouper` object so we can get all the group's members if we want to. But we didn't ask to go through the members, so the lazy bum didn't go any further. It simply didn't have a reason to. Let's ask for the next group:
>>> next(groups)
yielding a
yielding a
yielding b
('b', <itertools._grouper object at 0x05A00FD0>)
Wait, what? Why "yielding a" when we're now dealing with the second group, the group of `b`-values? Well, because `groupby` previously stopped after the first `a`, since that was enough to give us all we had asked for. But now, to tell us about the second group, it has to find the second group, and for this it keeps asking our generator until it sees something other than `a`. Note that "yielding b" is again only printed once, even though our generator yields four of them. Let's ask for the third group:
>>> next(groups)
yielding b
yielding b
yielding b
Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    next(groups)
StopIteration
Ok, so there is no third group, and thus `groupby` raises a `StopIteration` so the consumer (e.g., a loop or list comprehension) knows to stop. But before that, the remaining "yielding b" lines get printed, because `groupby` got off its lazy butt and walked over the remaining values in the hope of finding a new group.
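
If you'd rather count than read printouts, here is a small sketch of the same experiment. The `pulled` list and `pulls_per_step` are names made up for this illustration; the generator appends to `pulled` instead of printing, so we can see how many values `groupby` has requested after each step.

```python
from itertools import groupby

pulled = []  # every value the source generator has handed out so far

def letters():
    for letter in 'aaabbbb':
        pulled.append(letter)
        yield letter

groups = groupby(letters())
pulls_per_step = [len(pulled)]   # creating the groupby object pulls nothing

key_a, _ = next(groups)          # identify the first group, ignore its members
pulls_per_step.append(len(pulled))

key_b, _ = next(groups)          # skips the unread 'a's, then finds 'b'
pulls_per_step.append(len(pulled))

leftover = next(groups, None)    # no third group; the default avoids StopIteration
pulls_per_step.append(len(pulled))

print(pulls_per_step)  # [0, 1, 4, 7]
```

The jumps from 0 to 1 to 4 to 7 are exactly the "yielding" lines in the transcript above: one value to name the first group, three more to reach the first `b`, and the last three to discover there is nothing else.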
### Going through the groups WITH looking at their members
Let's try again, and this time ask for the members:
>>> groups = groupby(letters())
>>> key, members = next(groups)
yielding a
>>> key
'a'
Again, `groupby` asked our generator for just a single value, in order to identify the group so it can tell us that it's an `a`-group. But this time, we'll also ask for the group members:
>>> list(members)
yielding a
yielding a
yielding b
['a', 'a', 'a']
Aha! There are the remaining "yielding a". Also, already the first "yielding b"! Even though we didn't even ask for the second group yet! But of course `groupby` has to go this far because we asked for the group members, so it has to keep looking until it gets a non-member. Let's get the next group:
>>> key, members = next(groups)
>>>
Wait, what? Nothing got printed at all? Is `groupby` sleeping? Wake up! Oh wait... that's right... it already found out that the next group is `b`-values. Let's ask for all of them:
>>> list(members)
yielding b
yielding b
yielding b
['b', 'b', 'b', 'b']
Now the remaining three "yielding b" happen, because we asked for them so `groupby` has to get them.
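
In everyday code, the same idea boils down to: grab each group's members while that group is still the current one. A minimal sketch (the names `grouped` and `sizes` are just for illustration):

```python
from itertools import groupby

# Materialize each group inside the loop, before groupby moves on.
grouped = [(key, list(members)) for key, members in groupby('aaabbbb')]
print(grouped)  # [('a', ['a', 'a', 'a']), ('b', ['b', 'b', 'b', 'b'])]

# If only the group sizes matter, count the members instead of storing them.
sizes = {key: sum(1 for _ in members) for key, members in groupby('aaabbbb')}
print(sizes)  # {'a': 3, 'b': 4}
```

Either way, the members get consumed before the next group is requested, so nothing is lost.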
### Why doesn't it work to get the group members afterwards?
Let's try it your initial way with `list(groupby(...))`:
>>> groups = list(groupby(letters()))
yielding a
yielding a
yielding a
yielding b
yielding b
yielding b
yielding b
>>> [list(members) for key, members in groups]
[[], ['b']]
Note that not only is the first group empty, but also, the second group only has one element (you didn't mention that). (Edit: That has changed by now, see the comments under the answer.)
Why?
Again: `groupby` is super lazy. It offers you those `_grouper` objects so you can go through each group's members. But if you don't ask to see the group members and instead just ask for the next group to be identified, then `groupby` just shrugs and is like "Ok, you're the boss, I'll just go find the next group".
What your `list(groupby(...))` does is it asks `groupby` to identify all groups. So it does that. But if you then at the end ask for the members of each group, then `groupby` is like "Dude... I'm sorry, I offered them to you but you didn't want them. And I'm lazy, so I don't keep things around for no good reason. I can give you the last member of the last group, because I still remember that one, but for everything before that... sorry, I just don't have them anymore, you should've told me that you wanted them".
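
To make the "I don't keep things around" part concrete, here is a rough pure-Python sketch of the idea. To be clear, `simple_groupby` is a hypothetical stand-in written for this answer, not how `itertools.groupby` is actually implemented, and it ignores the `key` argument entirely. The point is only that the outer iterator and every group read from one shared source, and that only a single "current" value is ever remembered.

```python
def simple_groupby(iterable):
    """Simplified, hypothetical stand-in for itertools.groupby (no key function)."""
    source = iter(iterable)
    done = object()                 # sentinel: the source is exhausted
    current = next(source, done)    # runs on the first next(), not at creation

    while current is not done:
        key = current

        def members(key=key):
            nonlocal current
            # Hand out values from the shared source while they match the key.
            while current is not done and current == key:
                yield current
                current = next(source, done)

        group = members()
        yield key, group
        # The caller asked for the next group: throw away unread members.
        for _ in group:
            pass


# Reading members while the group is current works fine.
live = [(key, list(members)) for key, members in simple_groupby('aaabbbb')]
print(live)   # [('a', ['a', 'a', 'a']), ('b', ['b', 'b', 'b', 'b'])]

# Identifying all groups first leaves nothing to read afterwards.
stale = [list(members) for key, members in list(simple_groupby('aaabbbb'))]
print(stale)  # [[], []]
```

In this sketch every stale group comes back empty. What the real `groupby` hands back for the very last group has differed between Python versions (see the edit note above), so don't rely on that leftover either way.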
P.S. In all of this, of course "lazy" really means "efficient". Not something bad but something good!