Rank aggregation is a problem with many important applications and naive approaches to it go wrong in subtle ways.
Let’s say that your national Quidditch league is dominated by five major wizard sports newspapers. Yes, the ones with moving images and everything. Every week after the games, each of them publishes a ranking of the star players. For now, let’s suppose that the set of players under investigation is always the same, as the problem becomes a bit more complicated otherwise.
The Athletic Wizard: Alicia Spinnet, Ginny Weasley, Gwendolyn Morgan, Robin Higgy, Debbie Muntz
The Daily Prophet: Alicia, Ginny, Robin, Gwendolyn, Debbie
Quidditch News: Robin, Ginny, Gwendolyn, Debbie, Alicia
Seeker Weekly: Gwendolyn, Ginny, Robin, Debbie, Alicia
The Quibbler: Debbie, Ginny, Robin, Gwendolyn, Alicia
As you can see, there’s quite a bit of disagreement and personal taste involved. You didn’t get to watch all of the games, but you’d like to make a decision on who the best players were, by somehow aggregating the opinions of the popular newspapers. An easy option would be to pretend that each newspaper votes for the player that they rank #1, and ignore the rest. As Alicia Spinnet is the only player getting two nominations for best player, she should win best player, right?
Upon closer inspection, Alicia seems very controversial, loved by two but hated by five of the newspaper. Ginny, on the other hand, didn’t stand out as best player to anybody, but she was uniformly considered runner-up. There should be some way to account for this. It would be nice if we would have a method of finding an optimal ranking that maximizes some sort of agreement with the opinions we are trying to aggregate.
Kendall’s Tau distance
One of the most interesting ways to measure disagreement between rankings is the Tau statistic introduced by Kendall. It essentially measures the number of pairwise disagreements between two rankings. Since you can think of it as the number of flips you need to perform on a ranking to turn it into the other, it is sometimes called bubble-sort distance.
While the closely-related Tau correlation coefficient is implemented in Scipy as
scipy.stats.kendalltau, let’s code it ourselves in a simpler way.