Saturday, 2 September 2023

Python groupby Gotchas

I already posted last year about itertools.groupby. There's a gotcha with this very useful callable (itertools.groupby is a class, not a function) that occasionally bites me, and I hope that writing about it will help me avoid it.

First, a bit of theory. As I've just mentioned, itertools.groupby is a callable class, not a function, so invoking it gives us an instance of itertools.groupby, which is an iterator (an iterable that returns itself from iter()). Iterating it yields tuples of a key and an itertools._grouper object. These _grouper objects are also iterators.
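A quick way to check these claims in an interpreter is sketched below. The sample string and the variable names are mine, and _grouper is a CPython implementation detail, so the type name may differ on other implementations.

```python
import itertools

# Hypothetical sample data, just for illustration.
grouped = itertools.groupby("aabbbc")

# groupby is a class, and calling it returns an instance of that class.
is_class = isinstance(itertools.groupby, type)
is_instance = isinstance(grouped, itertools.groupby)

# An iterator returns itself from iter().
groupby_is_iterator = iter(grouped) is grouped

# Each step yields a (key, group) tuple; the group is an iterator too.
key, group = next(grouped)
group_type_name = type(group).__name__  # expected to be '_grouper' on CPython
group_is_iterator = iter(group) is group

print(is_class, is_instance, groupby_is_iterator)
print(key, group_type_name, group_is_iterator)
```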

I mainly use groupby to generate dictionaries. Let's say I have a list of cities that I want to group by country.


from dataclasses import dataclass

@dataclass
class City:
    name: str
    country: str

cities = [
    City("Toulouse", "France"),
    City("Prague", "Czech Republic"),
    City("Paris", "France"),
    City("Lisbon", "Portugal"),
    City("Porto", "Portugal")
]



This is the correct way to do it. Notice how, while iterating the groupby object to create the dictionary, I convert each _grouper object (holding the grouped cities) to a list.


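import itertools

key_fn = lambda x: x.country
# groupby only groups consecutive items, so sort by the same key first
cities.sort(key=key_fn)

# "group" avoids shadowing the outer "cities" list inside the comprehension
country_to_cities: dict[str, list[City]] = {country: list(group)
    for country, group in itertools.groupby(cities, key_fn)
}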
french_cities = country_to_cities["France"]
print(french_cities)
# [City(name='Toulouse', country='France'), City(name='Paris', country='France')]


And now let's see two wrong ways to do it.
First case: I iterate the groupby object to create the dictionary, but forget to convert the grouped cities to a list. One would think that this is fine, even better, because we are being lazy and not traversing the grouped cities yet. But no, we have a big problem: when we later try to iterate that iterable of cities (converting it to a list), we find that it's EMPTY!


# gotcha 1
from collections.abc import Iterable

country_to_cities: dict[str, Iterable[City]] = {country: cities
    for country, cities in itertools.groupby(cities, key_fn)
}
french_cities = country_to_cities["France"]
# BIG PROBLEM HERE, it's EMPTY!
print(list(french_cities)) # []

There's a second case where we hit the same problem. Let's say that, for some reason, we convert the groupby object to a list that we'll iterate later. Notice that this time we do remember to convert the cities to a list when creating the dictionary, but we get an EMPTY list anyway.


# gotcha 2
grouped_cities = list(itertools.groupby(cities, key_fn))
country_to_cities = {country: list(cities)
    for country, cities in grouped_cities
}
french_cities = country_to_cities["France"]
# BIG PROBLEM HERE, it's EMPTY!
print(list(french_cities)) # []

The explanation for this odd behaviour is that the groupby object and its _grouper objects share the same underlying iterator. When you advance the groupby object to the next key (even if you only care about the keys), it skips past whatever items remain in the previous group, so that _grouper has nothing left to yield. If you try to iterate that _grouper later on, you get no items.
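A small experiment makes the sharing visible. This is only a sketch with a made-up input string: I take one item from the first group, advance the groupby object, and then go back to the first group.

```python
import itertools

# Hypothetical sample data: three runs, "AAA", "BB" and "C".
grouped = itertools.groupby("AAABBC")

key_a, group_a = next(grouped)
first_a = next(group_a)        # consume a single item from the "A" group

# Advancing the groupby object skips the rest of the "A" run.
key_b, group_b = next(grouped)

leftover_a = list(group_a)     # nothing left in the previous group
items_b = list(group_b)        # the current group is still intact

print(key_a, first_a, leftover_a)  # A A []
print(key_b, items_b)              # B ['B', 'B']
```

The two remaining "A" items are not handed to group_a once the groupby object has moved on, which is exactly what happens to every group in the two gotchas above.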

This is explained in the documentation, and the pseudo-implementation shown there makes it clear. I say pseudo-implementation because, if that were the real implementation, the _grouper objects would be generator objects, and I've verified with inspect.isgenerator() that they are not.
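The check I mean looks roughly like this (a sketch run on CPython, with throwaway sample data; the helper generator is only there for comparison):

```python
import inspect
import itertools

# Grab the first group from a throwaway groupby object.
_, group = next(itertools.groupby([1, 1, 2]))

def sample_generator():
    # Hypothetical generator function, used only as a point of comparison.
    yield 1

group_is_generator = inspect.isgenerator(group)
real_generator_is_generator = inspect.isgenerator(sample_generator())

print(group_is_generator)           # False on CPython
print(real_generator_is_generator)  # True
```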

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list:
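A sketch of that pattern, adapted with my own sample data and names rather than copied from the documentation: each group is turned into a list inside the loop, while it is still the current group.

```python
import itertools

# Hypothetical sample data, grouped by first letter.
data = ["apple", "avocado", "banana", "blueberry", "cherry"]
keyfunc = lambda word: word[0]

groups = []
uniquekeys = []
for key, group in itertools.groupby(sorted(data, key=keyfunc), keyfunc):
    groups.append(list(group))  # store the group while it is still visible
    uniquekeys.append(key)

print(uniquekeys)  # ['a', 'b', 'c']
print(groups)      # [['apple', 'avocado'], ['banana', 'blueberry'], ['cherry']]
```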
