Thursday, 19 May 2022

Dictionary Comprehensions and groupby

Dictionary comprehensions have been part of Python (2.7 and 3.0) since 2008-2010, which is more or less when my long Python hiatus started, so I did not discover them until recently. You won't need them as often as list comprehensions, but every now and then you'll get a chance to make your code cuter thanks to them. You can check several use cases here
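
For anyone who has not met them yet, here is a minimal sketch of the syntax next to a list comprehension. The `words` list and the variable names are made up for illustration only.

```python
# Hypothetical sample data, only for illustration.
words = ["apple", "fig", "banana"]

# List comprehension: builds a list.
lengths_list = [len(word) for word in words]

# Dictionary comprehension: same idea, but each item is a key: value pair.
lengths_by_word = {word: len(word) for word in words}

# An optional condition filters the items, just like in a list comprehension.
long_words = {word: len(word) for word in words if len(word) > 3}

print(lengths_list)
print(lengths_by_word)
print(long_words)
```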

I've recently combined them with itertools.groupby, and it seems worth posting here. Let's say I have several cities that I want to group into a dictionary keyed by country.


import itertools
import json


class City:
    def __init__(self, name, country):
        self.name = name
        self.country = country

cities = [
    City("Toulouse", "France"),
    City("Prague", "Czech Republic"),
    City("Paris", "France"),
    City("Lisbon", "Portugal"),
    City("Porto", "Portugal")
]

A first approach to group them in a dictionary would be something like this:


def traditional_approach(cities):
    countries = {}
    for city in cities:
        if city.country not in countries:
            countries[city.country] = [city]
        else:
            countries[city.country].append(city)
    return countries

countries = traditional_approach(cities)
print(json.dumps(countries, indent=4, default=lambda x: x.__dict__))

Using itertools.groupby and a dictionary comprehension we have this:


def group_by_approach(cities):
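    key_fn = lambda city: city.country
    # groupby only merges consecutive items, so sort by the same key first;
    # sorted() leaves the caller's list untouched
    sorted_cities = sorted(cities, key=key_fn)
    city_groups = itertools.groupby(sorted_cities, key_fn)   # iterator of (country, group) pairs
    return {key: list(group) for key, group in city_groups}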

countries = group_by_approach(cities)
print(json.dumps(countries, indent=4, default=lambda x: x.__dict__))

Maybe the second version is not clearer than the first one, but it looks so nice :-) Notice that itertools.groupby has a huge gotcha if you are used to SQL GROUP BY or .NET LINQ GroupBy: you first have to sort your list by the same key that you'll later use to group. The Python documentation explains:

It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.
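
A small sketch of what that means in practice, using plain country names in the same unsorted order as the cities list above (the variable names are only illustrative). The nasty part is that, inside a dictionary comprehension, the repeated key does not raise any error: the later group just replaces the earlier one.

```python
import itertools

# Country names in the same (unsorted) order as the cities list above.
countries = ["France", "Czech Republic", "France", "Portugal", "Portugal"]

# Without sorting, "France" shows up as two separate groups.
unsorted_groups = [
    (key, len(list(group))) for key, group in itertools.groupby(countries)
]

# In a dict comprehension the second "France" group silently replaces the first.
lossy = {key: list(group) for key, group in itertools.groupby(countries)}

# Sorting first gives exactly one group per country.
sorted_groups = [
    (key, len(list(group))) for key, group in itertools.groupby(sorted(countries))
]

print(unsorted_groups)
print(lossy)
print(sorted_groups)
```

So with unsorted input the dictionary would end up with a single French city instead of two, and nothing would warn you about it.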

Update 2022/08/17. I found out today that there's another way to do this, leveraging the dictionary's setdefault method:


def traditional_approach_2(cities):
    countries = {}
    for city in cities:
        countries.setdefault(city.country, []).append(city)
    return countries
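
In case the one-liner looks too magical, here is a minimal sketch of what setdefault does on its own, with plain strings instead of City objects (the names `first` and `second` are just for illustration):

```python
countries = {}

# Key missing: setdefault stores the default ([]) under the key and returns it.
first = countries.setdefault("France", [])
first.append("Toulouse")

# Key present: the default is ignored and the already stored list is returned.
second = countries.setdefault("France", [])
second.append("Paris")

print(countries)
print(first is second)  # both names point to the same list
```

One thing to keep in mind is that the default argument (the empty list) is created on every call, even when the key already exists and it ends up being discarded.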
