A stick that’s regularly used to beat me is that my work uses numbers from football-data.co.uk. I don’t know the whole history of which data companies have supplied the data, but I do know that the numbers are taken from sportinglife.com. I first chose to use them because it was the only data set I was aware of, although over time I’ve been variously told that the data is screwy for a whole host of reasons, chiefly that the football-data.co.uk numbers appear to count blocked shots as shots on target.

However, for all of the talk there’s only been, to my knowledge, a single study from outside of this site to look into whether this matters – this one by Ted Knutson, where TSRs were found to be “close to the same”, whilst shots on target for were counted at a 63-70% higher rate on football-data compared with Opta numbers.

Further to that I compared TSRs from the two data sources over a period of four seasons, and showed that they correlate so well they’re essentially the same (R^2 = 0.99). As such the discussion around whether total shots data is useable has, at least in my mind, been put to bed (if an R^2 of 0.99 isn’t good enough for you then I frankly don’t care).

The discussion about whether the same is true for shots on target continues, however. Michael Caley was kind enough to send me his shot totals for the last 4.5 Premiership seasons, so we can go ahead and do a comparison. The totals don’t include penalties as shots, but they’re a close enough proxy. Firstly I’ll say that as of this year football-data is using Opta’s shot on target numbers (with minor differences – I don’t think penalties are counted as shots by Opta). Thus, unsurprisingly, there’s a very strong correlation this season between STRs generated using numbers from football-data and numbers from Opta:

There’s not a lot of value in that graph, but I’m not sure that everyone was aware that the numbers being reported were essentially the same this year. We know now.

Anyway, as I mainly use historical data to project what is likely to happen in the future, it’s important to look at the correlation in prior years. So from now on I’m going to focus on the shots on target data I have in both datasets from the last 4 completed Premiership seasons and, as before, Opta’s shots on target have penalties removed. If we compare STRs generated using numbers from football-data and the raw, penalties-not-included, Opta numbers this is what we see:

So the R^2 is 0.93. Not as strong as for TSR, but still high.
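For anyone wanting to repeat this comparison, here is a minimal sketch of the calculation. It assumes STR is shots on target for divided by the sum of shots on target for and against (the same construction as TSR), and the totals are made up purely for illustration; the function names are hypothetical, not taken from either data source.

```python
def shots_on_target_ratio(sot_for, sot_against):
    """Assumed definition: a team's share of all shots on target in its matches."""
    return sot_for / (sot_for + sot_against)


def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


# Made-up (for, against) shots-on-target totals for four hypothetical team-seasons.
# The first list stands in for football-data, the second for Opta.
fd_totals = [(300, 200), (250, 250), (200, 300), (280, 220)]
opta_totals = [(180, 120), (150, 150), (115, 185), (170, 130)]

fd_str = [shots_on_target_ratio(f, a) for f, a in fd_totals]
opta_str = [shots_on_target_ratio(f, a) for f, a in opta_totals]

print(round(r_squared(fd_str, opta_str), 3))
```

Because STR is a ratio, a source that counts more shots on target across the board (blocked shots included, say) can still produce very similar STRs, provided the inflation is roughly the same for and against.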

N.b., from this point in the post I’m going to use the terminology ‘this year’ to refer to the season in which a metric/number of points were recorded, and ‘next year’ to refer to the succeeding season.

Firstly, how well does a team’s STR this season correlate with how many points it has scored this season?

So STRs generated using Opta data are more strongly correlated with the number of points a team scores. Intuitively this makes sense – if the f-d numbers are including blocked shots that have zero chance of resulting in a goal, those shots aren’t going to give us any more information about the number of points a team will score that season.

Next, let’s compare the STR a team recorded this year against the STR it recorded next year, to see how repeatable the metric is.

The f-d numbers are the more repeatable – they’ll regress less towards the mean from one season to the next.
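To make “regress less towards the mean” concrete, here is a rough sketch under a simple assumption: that the best linear guess at next year’s STR shrinks this year’s value towards the league mean by a factor equal to the year-to-year correlation r. The correlations used are hypothetical, not the figures from this comparison, and the function name is mine.

```python
def regress_to_mean(value, league_mean, r):
    """Shrink a value towards the league mean.

    r is the year-to-year correlation: r = 1 keeps the value unchanged,
    r = 0 returns the league mean.
    """
    return league_mean + r * (value - league_mean)


# A team with an STR of 0.60 in a league where the mean STR is 0.50,
# under two hypothetical year-to-year correlations.
print(regress_to_mean(0.60, 0.50, 0.8))  # more repeatable metric, less shrinkage
print(regress_to_mean(0.60, 0.50, 0.6))  # less repeatable metric, more shrinkage
```

The more repeatable a metric is, the more of this year’s deviation from average you’d expect to carry over into next year.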

How about how well STR this year correlates to the number of points a team will score next year?

Again the f-d numbers come out on top but it’s close enough that it’s probably prudent to declare a draw.

Finally, given that we know how much each metric regresses towards the mean from season to season, let’s see how well these values would predict the number of points scored by a given team the following season.

The lower the standard deviation, the more accurate the prediction. However, that’s kind of academic as, whilst the football-data numbers have the edge, it’s basically a wash.
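One way to put a number on that accuracy is sketched below, assuming it is measured as the standard deviation of the gap between predicted and actual points. That reading of the method is an assumption, and the points totals are made up for illustration.

```python
from statistics import pstdev


def prediction_error_sd(predicted, actual):
    """Population standard deviation of the errors (actual minus predicted)."""
    errors = [a - p for p, a in zip(predicted, actual)]
    return pstdev(errors)


# Made-up predicted and actual points for four hypothetical teams.
predicted_points = [70, 55, 40, 62]
actual_points = [75, 50, 42, 60]

print(prediction_error_sd(predicted_points, actual_points))
```

Note that a standard deviation only measures the spread of the errors, so a predictor that was wrong by the same amount for every team would still score zero here; checking the mean error alongside it guards against that.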

In short, the STR values are similar between the two data sources. The STRs generated from the Opta numbers in a given season are more strongly correlated with the number of points scored in that season. The STRs generated from the football-data numbers regress less towards the mean from this season to next, whilst both predict the number of points a team will score next season equally well.

I’ll leave it at that; the evidence is there for each of you to make up your own minds. I think they’re close enough to be useful, and don’t see a reason there to move from my original (and oft repeated) stance – if someone repeats one of my studies (and I think that’s easy to do – I outline my methods more thoroughly than nigh on anyone) using another data source and reaches conclusions that are different to the ones I reach, then I’ll re-evaluate using that data. As of today I’m not aware of anyone who has put in the effort to do so. Until then I’m going to point people in the direction of this post and the one focussing on TSR.