Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - A recipe to group/aggregate data?

I have some data stored in a list that I would like to group based on a value.

For example, if my data is

data = [(1, 'a'), (2, 'x'), (1, 'b')]

and I want to group it by the first value in each tuple to get

result = [(1, 'ab'), (2, 'x')]

how would I go about it?

More generally, what's the recommended way to group data in Python? Is there a recipe that can help me?


1 Reply


The go-to data structure to use for all kinds of grouping is the dict. The idea is to use something that uniquely identifies a group as the dict's keys, and store all values that belong to the same group under the same key.

As an example, your data could be stored in a dict like this:

{1: ['a', 'b'],
 2: ['x']}

The integer that you're using to group the values is used as the dict key, and the values are aggregated in a list.

We use a dict because it maps keys to values in O(1) time on average, which makes grouping both efficient and simple. The general structure of the code is the same for every grouping task: iterate over your data and gradually fill a dict with grouped values. Using a defaultdict instead of a regular dict makes this even easier, because a missing key is automatically initialized with an empty list the first time it is accessed.

import collections

groupdict = collections.defaultdict(list)
for value in data:
    group = value[0]
    value = value[1]
    groupdict[group].append(value)

# result:
# {1: ['a', 'b'],
#  2: ['x']}

Once the data is grouped, all that's left is to convert the dict to your desired output format:

result = [(key, ''.join(values)) for key, values in groupdict.items()]
# result: [(1, 'ab'), (2, 'x')]
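
For comparison, here is a minimal sketch of the same grouping with a plain dict, using dict.setdefault in place of defaultdict. It uses the data from the question and tuple unpacking in the loop; treat it as an illustration of the alternative rather than a recommendation to prefer it.

```python
data = [(1, 'a'), (2, 'x'), (1, 'b')]

groupdict = {}
for group, value in data:
    # setdefault returns the list stored under `group`,
    # inserting a new empty list first if the key is missing.
    groupdict.setdefault(group, []).append(value)

result = [(key, ''.join(values)) for key, values in groupdict.items()]
print(result)  # [(1, 'ab'), (2, 'x')]
```

The groups appear in the order their keys were first seen, because dicts preserve insertion order (guaranteed since Python 3.7).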

The Grouping Recipe

The following section will provide recipes for different kinds of inputs and outputs, and show how to group by various things. The basis for everything is the following snippet:

import collections

groupdict = collections.defaultdict(list)
for value in data:  # input
    group = ???  # group identifier
    value = ???  # value to add to the group
    groupdict[group].append(value)

result = groupdict  # output

Each of the commented lines can (and in the case of the ??? placeholders, must) be customized for your use case.
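
One way to make the recipe reusable is to wrap it in a function. The sketch below is only an illustration: the name group_by and its key and transform parameters are hypothetical and not part of any library.

```python
import collections


def group_by(data, key, transform=lambda value: value):
    """Group the items of `data` by key(item), storing transform(item).

    `group_by`, `key` and `transform` are illustrative names.
    """
    groupdict = collections.defaultdict(list)
    for value in data:                        # input
        group = key(value)                    # group identifier
        groupdict[group].append(transform(value))  # value to add to the group
    return dict(groupdict)                    # output


words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']
by_letter = group_by(words, key=lambda word: word[0])
by_length = group_by(words, key=len, transform=str.upper)

print(by_letter)  # {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
print(by_length)  # {5: ['APPLE'], 7: ['AVOCADO'], 6: ['BANANA', 'CHERRY'], 9: ['BLUEBERRY']}
```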

Input

The format of your input data dictates how you iterate over it.

In this section, we're customizing the for value in data: line of the recipe.

  • A list of values

    More often than not, all the values are stored in a flat list:

    data = [value1, value2, value3, ...]
    

    In this case we simply iterate over the list with a for loop:

    for value in data:
    
  • Multiple lists

    If you have multiple lists with each list holding the value of a different attribute like

    firstnames = [firstname1, firstname2, ...]
    middlenames = [middlename1, middlename2, ...]
    lastnames = [lastname1, lastname2, ...]
    

    use the zip function to iterate over all lists simultaneously:

    for value in zip(firstnames, middlenames, lastnames):
    

    This will make value a tuple of (firstname, middlename, lastname).

  • Multiple dicts or a list of dicts

    If you want to combine multiple dicts like

    dict1 = {'a': 1, 'b': 2}
    dict2 = {'b': 5}
    

    first put them all in a list:

    dicts = [dict1, dict2]
    

    And then use two nested loops to iterate over all (key, value) pairs:

    for dict_ in dicts:
        for value in dict_.items():
    

    In this case, the value variable will take the form of a 2-element tuple like ('a', 1) or ('b', 2).
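
As a rough illustration of the input variants above, the following sketch runs the recipe once over zipped lists and once over a list of dicts. The sample names and numbers are made up for the example.

```python
import collections

# Multiple lists: zip yields one (firstname, lastname) tuple per position.
firstnames = ['Ada', 'Alan', 'Ada']
lastnames = ['Lovelace', 'Turing', 'Yonath']

by_firstname = collections.defaultdict(list)
for value in zip(firstnames, lastnames):
    by_firstname[value[0]].append(value[1])

# Multiple dicts: two nested loops visit every (key, value) pair.
dict1 = {'a': 1, 'b': 2}
dict2 = {'b': 5}
dicts = [dict1, dict2]

by_key = collections.defaultdict(list)
for dict_ in dicts:
    for value in dict_.items():
        by_key[value[0]].append(value[1])

print(dict(by_firstname))  # {'Ada': ['Lovelace', 'Yonath'], 'Alan': ['Turing']}
print(dict(by_key))        # {'a': [1], 'b': [2, 5]}
```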

Grouping

Here we'll cover various ways to extract group identifiers from your data.

In this section, we're customizing the group = ??? line of the recipe.

  • Grouping by a list/tuple/dict element

    If your values are lists or tuples like (attr1, attr2, attr3, ...) and you want to group them by the nth element:

    group = value[n]
    

    The syntax is the same for dicts, so if you have values like {'firstname': 'foo', 'lastname': 'bar'} and you want to group by the first name:

    group = value['firstname']
    
  • Grouping by an attribute

    If your values are objects like datetime.date(2018, 5, 27) and you want to group them by an attribute, like year:

    group = value.year
    
  • Grouping by a key function

    Sometimes you have a function that returns a value's group when it's called. For example, you could use the len function to group values by their length:

    group = len(value)
    
  • Grouping by multiple values

    If you wish to group your data by more than a single value, you can use a tuple as the group identifier. For example, to group strings by their first letter and their length:

    group = (value[0], len(value))
    
  • Grouping by something unhashable

    Because dict keys must be hashable, you will run into problems if you try to group by something that can't be hashed. In such a case, you have to find a way to convert the unhashable value to a hashable representation.

    1. sets: Convert sets to frozensets, which are hashable:

      group = frozenset(group)
      
    2. dicts: Dicts can be represented as sorted (key, value) tuples, provided the keys can be compared with each other and the values are themselves hashable:

      group = tuple(sorted(group.items()))
      

Modifying the aggregated values

Sometimes you will want to modify the values you're grouping. For example, if you're grouping tuples like (1, 'a') and (1, 'b') by the first element, you might want to remove the first element from each tuple to get a result like {1: ['a', 'b']} rather than {1: [(1, 'a'), (1, 'b')]}.

In this section, we're customizing the value = ??? line of the recipe.

  • No change

    If you don't want to change the value in any way, simply delete the value = ??? line from your code.

  • Keeping only a single list/tuple/dict element

    If your values are lists like [1, 'a'] and you only want to keep the 'a':

    value = value[1]
    

    Or if they're dicts like {'firstname': 'foo', 'lastname': 'bar'} and you only want to keep the first name:

    value = value['firstname']
    
  • Removing the first list/tuple element

    If your values are lists like [1, 'a', 'foo'] and [1, 'b', 'bar'] and you want to discard the first element of each list to get a group like [['a', 'foo'], ['b', 'bar']], use the slicing syntax:

    value = value[1:]
    
  • Removing/Keeping arbitrary list/tuple/dict elements

    If your values are lists like ['foo', 'bar', 'baz'] or dicts like {'firstname': 'foo', 'middlename': 'bar', 'lastname': 'baz'} and you want to delete or keep only some of these elements, start by creating a set of the indices or keys you want to keep or delete. For example:

    indices_to_keep = {0, 2}
    keys_to_delete = {'firstname', 'middlename'}
    

    Then choose the appropriate snippet from this list:

    1. To keep list elements: value = [val for i, val in enumerate(value) if i in indices_to_keep]
    2. To delete list elements: value = [val for i, val in enumerate(value) if i not in indices_to_delete]
    3. To keep dict elements: value = {key: val for key, val in value.items() if key in keys_to_keep}
    4. To delete dict elements: value = {key: val for key, val in value.items() if key not in keys_to_delete}
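
A short sketch of two of these snippets applied to the sample values mentioned above. The variable names list_value, dict_value, kept and remaining are chosen only for this example.

```python
indices_to_keep = {0, 2}
keys_to_delete = {'firstname', 'middlename'}

list_value = ['foo', 'bar', 'baz']
dict_value = {'firstname': 'foo', 'middlename': 'bar', 'lastname': 'baz'}

# Keep only the list elements whose index is in indices_to_keep.
kept = [val for i, val in enumerate(list_value) if i in indices_to_keep]

# Drop the dict entries whose key is in keys_to_delete.
remaining = {key: val for key, val in dict_value.items()
             if key not in keys_to_delete}

print(kept)       # ['foo', 'baz']
print(remaining)  # {'lastname': 'baz'}
```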

Output

Once the grouping is complete, we have a defaultdict filled with lists. But the desired result isn't always a (default)dict.

In this section, we're customizing the result = groupdict line of the recipe.

  • A regular dict

    To convert the defaultdict to a regular dict, simply call the dict constructor on it:

    result = dict(groupdict)
    
  • A list of (group, value) pairs

    To get a result like [(group1, value1), (group1, value2), (group2, value3)] from the dict {group1: [value1, value2], group2: [value3]}, use a list comprehension:

    result = [(group, value) for group, values in groupdict.items()
                               for value in values]
    
  • A nested list of just values

    To get a result like [[value1, value2], [value3]] from the dict {group1: [value1, value2], group2: [value3]}, use dict.values:

    result = list(groupdict.values())
    
  • A flat list of just values

    To get a result like [value1, value2, value3] from the dict {group1: [value1, value2], group2: [value3]}, flatten the dict with a list comprehension:

    result = [value for values in groupdict.values() for value in values]
    
  • Flattening iterable values

    If your values are lists or other iterables like

    groupdict = {group1: [[list1_value1, list1_value2], [list2_value1]]}
    

    and you want a flattened result like

    result = {group1: [list1_value1, list1_value2, list2_value1]}
    

    you have two options:

    1. Flatten the lists with a dict comprehension:

      result = {group: [x for iterable in values for x in iterable]
                                for group, values in groupdict.items()}
      
    2. Avoid creating a list of iterab
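
To see the output conversions from this section side by side, here is a small sketch that applies them to the grouped data from the original question. The variable names on the left-hand side are illustrative.

```python
import collections

groupdict = collections.defaultdict(list)
for group, value in [(1, 'a'), (2, 'x'), (1, 'b')]:
    groupdict[group].append(value)

# A regular dict
as_dict = dict(groupdict)

# A list of (group, value) pairs
as_pairs = [(group, value) for group, values in groupdict.items()
                           for value in values]

# A nested list of just values
nested_values = list(groupdict.values())

# A flat list of just values
flat_values = [value for values in groupdict.values() for value in values]

print(as_dict)        # {1: ['a', 'b'], 2: ['x']}
print(as_pairs)       # [(1, 'a'), (1, 'b'), (2, 'x')]
print(nested_values)  # [['a', 'b'], ['x']]
print(flat_values)    # ['a', 'b', 'x']
```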

