Monday, 18 September 2023

Sorting and Comparison

Comparison for sorting/ordering (that is, determining if one value is less than, equal to or greater than another value) in Python works slightly different from what I was used to in other languages. For example in C# and Kotlin the design is almost the same, so let's see that first.

Classes in C# can implement the IComparable interface to define how to sort them. Additionally you can define comparer objects (implementing the IComparer interface), so that you can define different comparison strategies for a same class. The CompareTo or Compare methods of those interfaces return an integer to indicate if one value is less, equal or greater. Basically the same goes for Kotlin, where you have the Comparable interface and the Comparator interface, which methods also return an integer.

So When we want to sort/order a collection in C# (with the order method) we have 2 options: either our collection is made up of IComparable objects, or we provide an IComparer object. There's another sorting method, OrderBy, that receives a function that for each element in the collection returns a "key", that is an IComparer object (this is a bit odd to me, it would seem more natural returning an IComparer).

The philosophy in Kotlin is almost the same (but it allows both sorting in place, with sort, and returning a new collection with sorted). So Collections have a sort method that can be used for sorting a collection of Comparable items, or any collection by providing a Comparer object. We also have a sortBy method that receives a "selector" function that for each object in the collection returns a Comparable object. As for operator overloading, the different comparison operators work on Comparable objects invoking the compareTo method.

In Python we make objects comparable by defining the Rich Comparison operators. Well, that makes a lot of methods, but indeed for sorting your objects you only need to define the __lt__ method, as explained in the documentation.

The sort routines use < when making comparisons between two objects. So, it is easy to add a standard sort order to a class by defining an __lt__() method

There are some advanced cases (not just sorting comparisons) where defining the additional Rich methods is useful. Apart from its usage in numpy, I've read somewhere that SqlAlchemy does some use of them.

Same as in kotlin we can sort in place (using the sort method) or return a new collection (using the sorted function). If the collection contains comparable objects (implementing the __lt__ method) that's all, else we'll have to provide a key function (aka selector). This function receives a collection item and must return a comparable object (so it's like the sortBy - selector in Kotlin). I've never needed nothing more than this in Python, but if you compare it with C# - Kotlin where you can provide a Comparer object (apart from the key-selector function), it could seem like we are missing something. It would seem as if a Comparer object/function could better deal with complex comparison scenarios (for example compare based on several properties and several combinations of values and ranges in those properties). No, not really, but let's clarify this.

Let's say I have a Country class that I want to compare by default based on its population. It's just a matter of adding a __lt__ method to the class:



class Country:
    def __init__(self, name: str, population: int):
        self.name = name
        self.population = population
    
    def __lt__(self, other: "Country"):
        return self.population < other.population

countries = [
    Country("Germany", 80),
    Country("Spain", 48),
    Country("France", 68),
    Country("Russia", 150),
    Country("Portugal", 10),
    Country("China", 1000),
]

print("standard sorting (by population):")
sorted_countries = sorted(countries)
print([country.name for country in sorted_countries])

#standard sorting (by population):
#['Portugal', 'Spain', 'France', 'Germany', 'Russia', 'China']
       

Now let's say I want to sort it by name. OK, it's just using as key function one that returns the name.



print("sorting by name:")
sorted_countries = sorted(countries, key=lambda x: x.name)
print([country.name for country in sorted_countries])

#sorting by name:
#['China', 'France', 'Germany', 'Portugal', 'Russia', 'Spain']


And now the interesting part. I want to sort it by name, but giving the maximum weight to the France name, so it's not just a matter of returning as sorting key one property of the object. This could seem more natural to be managed with a sort of Comparer object/function, but we can do just the same with a key/selector function that returns a comparable object that implements the comparison logic that we would have set in the Comparer object. Let's see:


class FrenchNameCentricKey:
    """Think of this Key as a Comparer object that compares Country objects in an "odd" way"""
    def __init__(self, country: Country):
        self.country = country

    def __lt__(self, key: "FrenchNameCentricKey") -> bool:
        if self.country.name == "France" and key.country.name == "France":
            return False
        if self.country.name == "France":
            return False
        if key.country.name == "France":
            return True
        return self.country.name < key.country.name
        
print("French centric sorting:")
sorted_countries = sorted(countries, key=FrenchNameCentricKey)
print([country.name for country in sorted_countries])       

#French centric sorting:
#['China', 'Germany', 'Portugal', 'Russia', 'Spain', 'France']  	

The thing to bear in mind is that a comparer object/function receives 2 objects to compare, while a key-selector function returns a "comparable" object. This "comparable" object will get its comparison method invoked passing it over another instance of another "comparable" object.

In the past Python used compare functions (following the 1, 0, -1 school of thought) rather than key/selector functions. Because of that Python provides a cmp_to_key function that transforms a compare function into a key function. At first the idea seemed like magic to me, but with what I've explained above the implementation seems pretty clear to me now (notice that it returns a class not a function).



def cmp_to_key(mycmp):
    """Convert a cmp= function into a key= function"""
    class K(object):
        __slots__ = ['obj']
        def __init__(self, obj):
            self.obj = obj
        def __lt__(self, other):
            return mycmp(self.obj, other.obj) < 0
        def __gt__(self, other):
            return mycmp(self.obj, other.obj) > 0
        def __eq__(self, other):
            return mycmp(self.obj, other.obj) == 0
        def __le__(self, other):
            return mycmp(self.obj, other.obj) <= 0
        def __ge__(self, other):
            return mycmp(self.obj, other.obj) >= 0
        __hash__ = None
    return K