Sunday 13 October 2019

IEqualityComparer and GetHashCode

I've always found this a bit confusing, but it has not been until this week that I decided to try to understand the reasons behind it. Some methods in Linq to Objects (System.Linq.Enumerable) take an IEqualityComparer as parameter, for example Distinct and Contains. Both methods have an overload that requires no EqualityComparer. In that case the default EqualityComparer for the class is used, that means that if the class does not implement System.IEquatable and it's a reference type, we'll end up using reference equality. So using Distinct or Contains when we are concerned about references pointing to the same object is pretty straight forward.

When we want to compare based on some property in the object, (the Id, the Name...) it's obvious that we need to provide the comparison "mechanism". In javascript for functions like find, findIndex, includes... we pass a function expecting two values and returning true or false. That's what in principle I would be expecting that should be needed in .Net, a delegate, but no, we are forced to provide an IEqualityComparer, that has an Equals method and also a GetHashCode method. The reason for this is that the comparison logic in Distinct and Contains not only uses the Equals method (in that case it could have been designed to just receive a delegate), but also the HashCode. It's well explained here. Hash functions are not perfect, they have collisions (the hash of 2 different values can be the same), but we need to make sure that for 2 objects, if IEqualityComparer.Equals is true, the GetHashCode for both objects is also the same (but the reverse does not need to be true). With this, and based on some explanations I've read (I've taken a fast look into the source code, but got a bit lost) we can think of the Distinct method being implemented sort of like this:

Each unique element found so far is stored in a HashTable, where the key is the EqualityComparer.GetHashCode. Now, when checking if another element is unique, rather than comparing it with all items from the start of the collection, for the part of the collection going from the start to this current element we can just check against the hash table (much faster than checking element by element), if we don't find it there, we'll have to continue comparing item by item against the remaining part of the collection.

So this GetHashCode is clearly a way to speed up those methods where the collection is going to be iterated multiple times (like Distinct). However, for methods like Contains, not sure how it comes into play (probably it compares hashCodes first before resorting to Equals?). The odd thing is that methods similar to Contains, like First or Any, do not use an IEqualityComparer, but just a delegate...

No comments:

Post a Comment