Three Points or Two? Comparing Ranking Systems in League Soccer.
- Robert Gregory
- Mar 4
- 9 min read
Introduction
Now that I’ve analysed the relative merits of different tie-breakers by comparing their correlations with primary performance indicators, I thought I’d go one step further and turn the microscope on those primary indicators themselves. From the Football League’s first season in 1888-’89 until its eighty-second in 1980-’81, it operated on the twin assumptions that a draw was half as good as a win, and that no one game in a round-robin competition should count for more or less than another. Each fixture was valued at two points, to be awarded to the victor if there was one and split evenly between the contestants in the case of a draw. However, the 1981-’82 season saw the abandonment of these assumptions with the introduction of a new scoring system. In an attempt to encourage attacking play and thus make the competition more exciting for spectators, the League decided to value a win at three points. This experiment was judged such a success that foreign leagues gradually came to adopt the three-point system themselves, and FIFA made it the international standard in 1995. The extent to which this move has worked in achieving its desired effect has been studied elsewhere with varying conclusions; but as in my study of goal difference and goal average, what I am interested in is the integrity of the different ranking systems. Which system is most effective in ranking teams from best to worst accurately?
This is, of course, a loaded question. In order to attempt to answer it, one first has to come up with a definition of “best,” and whichever definition one chooses will imply its own scoring system. I confess that I was biased on this point before I began my study. Reasoning from my own first principles about the matter, I cannot conceive a way that the two-point system could be improved upon. Awarding an extra point for a decided game plays havoc with the principle of a balanced schedule, and in so doing undermines an idea which typically goes unstated in discussions about sporting competition but which I suspect most of us would assume as axiomatic: that the definition of success in a sport, perhaps in any field of endeavour, should be generalizable from a small scale to a large one. When the decision was taken to value a win at three points, it changed not only the method of measurement of success but the nature of the thing being measured in a way that made League football in some sense a different game from knockout or friendly matches. Imagine for a moment that there is no such thing as a league competition, that all football matches apart from cup-ties are friendlies whose competitive context is entirely self-contained. Imagine that what are now league fixtures are simply exhibition matches arranged on the basis of mutual back-scratching, as indeed they were until the clubs decided to award points for them. How would you measure a team’s performance in such matches? You might use goal difference, or goal average, or devise some other metric based on the number of goals it scored and conceded. You might go by wins and losses and forget about draws entirely, as the League considered doing at first. But if you were basing your assessment on match results and counting every result, the logical way to do so would surely be to take an equally weighted average of those results.
You might, I suppose, consider a team that won two-fifths of its games and lost the other three-fifths more exciting to watch than a team that drew all its games, but you would not consider it more efficient. Yet this is precisely what league competitions around the world now ask us to do. Under the two-point system, the measure of a team’s success was functionally equivalent to the way one would intuitively assess its performance if there were no such thing as a league table. When the scoring system was changed, an unintuitive measure of success, specific to the league itself, was introduced. Once one abandons the principle of equal weighting, there is no a priori reason for weighting a win at three times a draw instead of four or five, or for not awarding bonus points for goals scored as the North American Soccer League once did. The question is no longer “which team is the best at winning games?” but “which team is the best at scoring points under a system in which some games are weighted more heavily than others after the fact?”
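The comparison between the two hypothetical teams above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function name points_per_game and the five-game records are assumptions chosen to match the fractions in the text, not data from the study.

```python
from fractions import Fraction


def points_per_game(wins, draws, losses, win_value):
    """Average points per game when a win is worth win_value and a draw 1."""
    games = wins + draws + losses
    return Fraction(win_value * wins + draws, games)


# Two hypothetical five-game records (wins, draws, losses).
win_or_lose = (2, 0, 3)  # wins two-fifths of its games, loses the rest
always_draw = (0, 5, 0)  # draws every game

for win_value in (2, 3):
    decisive = points_per_game(*win_or_lose, win_value)
    drawing = points_per_game(*always_draw, win_value)
    print(f"{win_value} points for a win: {decisive} vs {drawing} per game")
```

With two points for a win, the team that draws every game averages 1 point per game against 4/5 for the other; with three points for a win, the order flips, 1 against 6/5, although neither team's results have changed.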
Nevertheless, one might argue, the latter question is just as valid as the former. As long as the rules remain the same for all the contestants, the competition can in some sense be said to be fair. Even if the standard be arbitrarily decided, the consistency of its application when it has been decided may be held to be its own justification. A defender of the three-point system might point out that every game is at least potentially worth the same number of points; and that if the two teams involved in a particular game fail to reach a decision, neither has any cause for complaint about losing a point between them because of it. They knew beforehand that this was a possibility. According to this way of thinking, the question of which ranking system is the fairest is what Angrist and Pischke, authors of Mostly Harmless Econometrics, call a Fundamentally Unidentified Question, or FUQ’d.
Methods
But before one gives up completely, there are questions one can ask. Most obviously, one can ask oneself which scoring system gives results that correlate most strongly with some other metric of performance. Here, my previous study on goal difference and goal average means that most of the work has already been done. Using the same correlations derived from the same dataset, I can change the research question from “which secondary performance metric correlates most strongly with a given primary metric?” to “which primary metric correlates most strongly with a given secondary metric?” The latter question feels weaker than the former even to me, for the simple reason that primary metrics are primary and secondary metrics are secondary. Considering primary metrics in their relation to secondary ones does not alter this fact, but it can nonetheless tell one something.
Secondly, one can ask oneself which system is more consistent with itself. If a given scoring system is a reliable indicator of a team’s strength, then a team’s points tally in one sample of its games should be strongly correlated with its tally in another sample. The difficulty here is in choosing the samples. Combining records from different seasons would not work, because even if summer transfers could not alter the composition of a team between one season and the next, promotion and relegation would change the composition of any division of any league in which they operated. Dividing a single season into first and second halves, as the Argentinian national championship did for a while, seems more promising; but the way the first half develops may strengthen or weaken a team’s incentive to do well in the second half, thus rendering the correlations unreliable. In order to avoid these issues, I decided to divide every team’s season into home and away fixtures.
One can gain additional information by studying the cross-correlations between the rankings the two systems provide. If, in addition to a team’s two-point rank having greater consistency across home and away games than its three-point rank, its two-point rank in home games is shown to correlate more strongly with its three-point rank in away games than does its three-point rank in home games, this finding not only supports my sense that the two-point system provides the more accurate assessment of a team’s ability and achievements but also undermines the subjectivist defence of the three-point system. Such a finding would call into question whether there is such a thing as being good at scoring points under the three-point system, as distinguished from being good at scoring points under a two-point system, at all.
For this study, I use the final League tables of every Premier League season from 1992 to 2022. For anyone who doesn’t know, Equations 1 to 5 define the variables being studied.
1) GD = F - A
2) GR = F/A
F and A are a team’s goals for and against respectively. GD stands for Goal Difference, GR for Goal Ratio (another term for Goal Average). These are the secondary performance indicators against which our primary indicators are correlated in the first part of the study. The primary indicators are given by:
3) P2 = 2W + D
4) P3 = 3W + D
W and D indicate the number of games won and drawn respectively. P2 is therefore the number of points a team would have won under the old two-point system. P3 is the number of points it did win under the three-point system that is now established. Breaking a team’s season down into home and away games gives us the following:
5) P(i, j) = i*Wj + Dj
The variables i and j indicate the point-scoring system (2 or 3) and the location (home or away, denoted respectively by H and A). Wj and Dj are the numbers of games won and drawn at location j.
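The points definitions above translate directly into code. The sketch below is a minimal illustration: the function names and the sample home and away record are hypothetical, invented to show the arithmetic rather than taken from any League table.

```python
def points(win_value, wins, draws):
    """Points tally when a win is worth win_value points and a draw 1."""
    return win_value * wins + draws


def split_points(record):
    """Return P(i, j) for i in {2, 3} and each location j in the record.

    record maps a location label ("H" or "A") to a (wins, draws) pair.
    """
    return {
        (i, j): points(i, wins, draws)
        for i in (2, 3)
        for j, (wins, draws) in record.items()
    }


# Hypothetical season: 12 wins and 4 draws at home, 7 wins and 6 draws away.
record = {"H": (12, 4), "A": (7, 6)}
tallies = split_points(record)

# Home and away tallies add up to the full-season totals P2 and P3.
total_wins = sum(wins for wins, _ in record.values())
total_draws = sum(draws for _, draws in record.values())
p2 = points(2, total_wins, total_draws)
p3 = points(3, total_wins, total_draws)
print(tallies, p2, p3)
```

Because both formulas are linear in wins and draws, a team's home and away tallies under either system always sum to its full-season tally under that system.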
As in my study on secondary performance indicators, the bounded nature of points tallies and the non-continuous nature of goal differences make the Pearson product-moment correlation coefficient unsuitable. The strength of the relationship between two variables X and Y is therefore estimated using the Spearman rank correlation coefficient, ρ.
6) ρ = C[R(X), R(Y)] / [σR(X) σR(Y)]
This is the Pearson coefficient applied to the ranks of variables within their sets. The numerator of the fraction is the covariance between the ranks of the variables and the denominator is the product of the standard deviations of these ranks.
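For readers who want to see the rank correlation spelled out, here is a minimal pure-Python sketch of the formula above. It assumes that tied values share their average rank, which is a common convention but not one the text specifies, and the function names are my own. It is not necessarily how the coefficients in this study were computed.

```python
from statistics import mean


def ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            result[k] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Pearson correlation of the ranks of x and y.

    The 1/n factors in the covariance and the standard deviations cancel,
    so plain sums are used. Undefined (division by zero) if all values
    in x or in y are tied.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    covariance = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread_x = sum((a - mx) ** 2 for a in rx) ** 0.5
    spread_y = sum((b - my) ** 2 for b in ry) ** 0.5
    return covariance / (spread_x * spread_y)
```

Passing, for instance, a list of teams' P2 tallies and the matching list of their goal differences to spearman would give one coefficient of the kind discussed in the results.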
Again, you can see a more detailed explanation in the aforementioned study.
Results and Conclusion
The method of comparing primary performance metrics by correlating them with secondary ones gives no clear winner. The summary table below shows that P2 has had a slightly higher average correlation with goal difference than P3, but that this is reversed when the points tallies are correlated with goal average. In neither case is the difference perceptible before the third decimal place, and in the case of goal difference it does not become visible until the fifth. Whichever tie-breaker is used, P3 has a higher minimum correlation, a higher maximum correlation, and a narrower range in correlations. P2, on the other hand, has a lower interquartile range; and when correlated with goal difference, it has had a higher correlation than P3 in 18 of 30 seasons. When P3 and P2 are correlated with goal average, each has had the higher correlation the same number of times. A correlogram showing the coefficients from which these findings were derived can be seen in the study on tie-breakers.

Studying the consistency of the two systems indicates that if there is a difference, it favours the two-point system. During the sample period, the correlation between P2H and P2A has been higher than that between P3H and P3A in 17 of 30 seasons. On average, it is also higher than either of the cross-correlations between P2 and P3, which are themselves higher than the internal correlation of P3. The same pattern exists in the average ranks of the correlations relative to each other, and has been observable within a season 8 times. P3 has had a similar advantage only twice. The internal correlation of P2 has been the highest correlation within a season 10 times, compared with only 3 for P3; and the lowest 3 times against P3’s 12. The only summary statistic in which P3 has an advantage is the highest observed value of its internal correlation.


The extent to which this matters depends on one’s point of view. One can take these things too far. I have heard that a team’s expected goals for and against in previous matches are stronger predictors of future goals for and against than are the goals it has actually scored and conceded. I haven’t done my own research as to the truth of that assertion; but even if it is true, its truth does not change the fact that the objective of football is not to create a high number of high-probability goal chances but to convert the chances one does fashion.
Similarly, the conclusion that performance according to a two-point system is more consistent than performance according to a three-point system, even if accepted, does not by itself mean that the two-point system is intrinsically superior. It all depends on whether one views a League’s scoring system as a measurement of performance or a definition thereof. If one takes the latter view, the consistency of different scoring systems may be beside the point. But if one seeks to scientifically measure performance, the evidence of this study is at least relevant to the question of how it can be measured most accurately. From this point of view, the theoretical arguments in favour of the two-point system advanced earlier have been given some empirical support.