Classifying Strikers In The SPFL With Principal Component Analysis And K-Means Clustering

Modern Fitba will be disappearing in the next few days, so I wanted to put some of the articles I wrote on there here.

In my first article discussing K-means clustering, I used midfielders in the SPFL Premiership last season to demonstrate how we can use clustering methods to identify styles of play and find players in the league who play like each other. Even casual fans of the game are likely aware that there can be different roles for midfielders. We can use our clustering techniques to help confirm our preconceived notions regarding midfielders’ roles, but if you “USE YOUR EYES, MATE”, you can already see some pattern in midfield playing styles.

While there are typically four or five midfielders on the pitch playing different roles, unless you are a desperate Tony Mowbray at Love Street about to lose his job, there are usually only one or two strikers on the pitch at a time. With fewer personnel, it might not be as easy to discern the style of play a striker employs as it is for other positions on the pitch. Putting the ball into the net is clearly job number one nearly all the time for strikers, but how they do that can vary. Does your manager want your striker to press the back line, be heavily involved in the build-up, play off the last man’s shoulder, etc.? Two strikers might have similar goal scoring stats or even similar advanced xG stats, but if they aren’t doing it in the same way, one might not necessarily be a good replacement for the other in your favorite team’s lineup.

This table shows the percentage of explained variance for the first six principal components of the principal component analysis.

To try and find which strikers playing in the Premiership this season are playing similarly, I again looked to use K-means clustering. To do this I followed a similar methodology to the one I used for midfielders, with a few key changes. After my article on midfield clusters, Modern Fitba’s favorite Dutchman, Ortec’s Head of Data Bertus Talsma, suggested reducing the number of dimensions used in the clustering. In my midfield clustering article, I used 43 metrics to come up with the clusters. This time around, I took those 43 metrics and used them for a Principal Component Analysis. Often shortened to PCA, this is a dimensionality-reduction method that is often used to lower the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set. In short, using PCA we can reduce our variables to a handful of components and see which of them explain the most variance.
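To make the idea concrete, here is a minimal sketch of PCA on a tiny, made-up table of striker metrics. It is not the code or data behind this article: the six rows of numbers are hypothetical, the function names are my own, and it uses plain Python with power iteration where a real analysis would more likely lean on a statistics library. It standardizes each metric, finds the first two principal components, and reports the share of variance each one explains.

```python
import math
import random

def standardize(rows):
    """Rescale every metric (column) to mean 0 and standard deviation 1."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(z):
    """Covariance matrix of data that has already been centred."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(m, v):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_components(cov, k, iters=500, seed=0):
    """Find the k largest eigenpairs by power iteration with deflation."""
    rng = random.Random(seed)
    d = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _component in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _step in range(iters):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigval = sum(a * b for a, b in zip(v, mat_vec(work, v)))
        pairs.append((eigval, v))
        # Remove the component just found so the next pass finds the next one.
        work = [[work[i][j] - eigval * v[i] * v[j] for j in range(d)]
                for i in range(d)]
    return pairs

# Hypothetical per-90 numbers for six strikers (NOT real SPFL data):
# shots, aerial duels won, dribbles, forward passes.
players = [
    [3.1, 4.0, 1.2, 5.0],
    [2.8, 3.6, 1.0, 6.1],
    [1.9, 1.2, 3.4, 9.8],
    [2.2, 1.5, 3.9, 10.5],
    [3.5, 4.4, 0.8, 4.2],
    [1.5, 0.9, 3.1, 8.9],
]

z = standardize(players)
cov = covariance(z)
components = top_components(cov, k=2)

# Share of the total variance captured by each component.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = [eigval / total_variance for eigval, _ in components]

# Each player's position on the two new axes; these feed the clustering step.
scores = [[sum(a * b for a, b in zip(row, vec)) for _, vec in components]
          for row in z]

for i, share in enumerate(explained, start=1):
    print(f"PC{i}: {share:.1%} of variance")
```

The `scores` list holds two numbers per player in place of the original four metrics, which is the same kind of reduction described above, only at toy scale.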

In my previous article, I used the elbow method to determine the optimal number of clusters for midfielders. This time around, it was more appropriate to use the silhouette method.

Once we have these new, reduced variables we can perform the same K-means clustering that we did with midfielders, only this time using the two new measurements from our PCA that explain the most variance between our players. We then determine the optimal number of clusters to use and find that, for SPFL strikers this season, it is three. So let’s see where strikers who have played at least 990 minutes this season in Scotland fall.
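For anyone curious how the cluster count can be chosen, the sketch below runs K-means on a set of two-dimensional points and picks the number of clusters with the highest mean silhouette score. Everything in it is an assumption for illustration: the twelve points are invented stand-ins for PCA scores, the function names are mine, and the restart count and range of k values tried are arbitrary choices, not the settings used for the article.

```python
import math
import random

def kmeans(points, k, n_init=25, iters=100, seed=0):
    """Lloyd's algorithm with random restarts; returns the lowest-inertia labels."""
    rng = random.Random(seed)
    best_labels, best_inertia = None, float("inf")
    for _restart in range(n_init):
        centers = rng.sample(points, k)
        for _step in range(iters):
            # Assign every point to its nearest centre.
            labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
            new_centers = []
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                # Keep the old centre if a cluster ends up empty.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                    if members else centers[c])
            if new_centers == centers:
                break
            centers = new_centers
        inertia = sum(math.dist(p, centers[lab]) ** 2
                      for p, lab in zip(points, labels))
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels

def mean_silhouette(points, labels):
    """Average silhouette score; a point alone in its cluster scores 0."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            continue
        a = sum(own) / len(own)  # mean distance to the player's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in clusters if other != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical (PC1, PC2) scores for twelve strikers (NOT real data).
pca_scores = [
    (0.0, 0.0), (0.5, 0.4), (-0.4, 0.3), (0.2, -0.5),
    (10.0, 0.0), (10.4, 0.5), (9.6, -0.3), (10.2, 0.4),
    (5.0, 9.0), (5.3, 8.6), (4.7, 9.4), (5.1, 9.2),
]

silhouettes = {k: mean_silhouette(pca_scores, kmeans(pca_scores, k))
               for k in range(2, 6)}
best_k = max(silhouettes, key=silhouettes.get)

for k, score in silhouettes.items():
    print(f"k={k}: mean silhouette {score:.3f}")
print("Chosen number of clusters:", best_k)
```

A silhouette score close to 1 means players sit much nearer to their own cluster than to the next closest one, so the k with the highest average is a reasonable pick.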

The three clusters of strikers who have appeared in at least 990 Premiership minutes of play so far this season.

In our three clusters, some interesting patterns emerge. The thing that sticks out to me right away is that every striker at Celtic and Rangers outside of Patryk Klimala is located in cluster 2. There are a few others in the cluster with the strikers plying their trade in Glasgow, such as Billy McKay, Stevie May, Eamonn Brophy, and Bruce Anderson. This tells us that strikers at Rangers and Celtic are often playing in a similar fashion. Players in this cluster seem to mostly be all-around strikers. Many of them have a high number of possession regains in an opponent’s half, indicating they are frequently pressing. They attempt a large share of the shots their teams take. They complete a lot of passes, and often those passes are in their opponent’s half, in the final third. They are all-around players who use their many skills to change games frequently.

While Patryk Klimala isn’t clustered with other Celtic players or their rivals at Rangers, many of the players in his cluster certainly make sense to the eye to be put together. Often these players are the first outlet for their teams on a counter, as they frequently receive the first pass of a possession. Seeing former Aberdeen striker Sam Cosgrove in cluster 1, it is probably no surprise that players in this cluster are often winning aerial duels, attacking duels, and taking plenty of shots via header.

In our third cluster, strikers are attempting a high number of forward passes and taking on a lot of defenders via the dribble. They also attempt a fair number of high crosses to the box. Finally, they are winning defensive duels more than their peers in the Premiership, which could indicate they are needed to help their team out defensively more often.

So we have clusters helping to identify strikers in Scotland. Great. We can imagine a scenario where a high profile striker will likely be moving on from his club in Scotland. Let’s call him Moredouard. We could use clustering to try and find a replacement for him, but if his replacements are at a rival club that will never do transfer business with us or are, with no offense meant, Stevie May or Billy McKay, we may need to look outside Scotland to find a suitable replacement.

Luckily our friends at Ortec let us have a look at some data for leagues at a similar level to the Scottish Premiership to try and find some players that might play similarly to the player we are trying to replace. If we add strikers that appeared in Major League Soccer last season and in the Danish Superligaen and Belgian First Division this season, we have a larger sample to cluster and a better chance of finding suitable replacements.

The four clusters of strikers that have appeared in at least a third of available minutes in their respective leagues.

Like we did with the strikers in Scotland, we use PCA and K-means clustering to find similar players. For our larger sample of international strikers, the optimal number of clusters was four, and it gives us the results above. If you could decode my very tricky word play, cluster 4 contains some players that our algorithm thinks play similarly to Odsonne Edouard and Alfredo Morelos. A cursory glance at surface-level statistics shows that goals go along with a style similar to that of the best strikers in Scotland. Fashion Sakala Junior is in the Edouard/Morelos cluster and has 13 goals in Belgium this season for KV Oostende. Mauro Manotas scored 13 goals as well for Houston in MLS before moving to Tijuana in Liga MX.

Of course, there is a world of football outside of Glasgow in Scotland, and Celtic and Rangers aren’t the only clubs in the Premiership that could be looking to replace a young, talented striker this summer. Kevin Nisbet seemed out the door at Easter Road in January before a deal fell through late, but you would imagine interest in the Scottish striker would pick back up at season’s end. Suitors for Nisbet might like the idea of getting the “Scottish Higuain” (emphasis to note the tongue slightly in cheek nature of comment), but Hibs can look to clustering to try and find someone to fill Nisbet’s shoes. He might now be out of Hibs’ price range, but Makhtar Gueye, another KV Oostende player in Belgium, is in the same cluster as Nisbet, fits in with his playing style, and has knocked in 11 goals this season.

Clubs of all sizes can simplify their transfer process with some basic data and a few lines of code. In a scenario like this, you would want to confirm with in-person scouting that a player the algorithm spits out really does fit in with your team, but with something like clustering and principal component analysis you can narrow a large population of players down to a few names to dig deep on.

Scottish Football Graduating to “Advanced” Expected Goals

Congratulations Scotland, you have passed Intro to Expected Goals and are now moving on to the advanced class. Most following me know that expected goals are the likelihood that a goal will be scored on a shot. Expected goals is now a term that more and more Scottish football fans are familiar with, understand, and can discuss coherently. Sure, there is the occasional “Yer Da” still yelling about “Goals and Points being the only stat that matters!”, but compared to three years ago, football analytics literacy has grown considerably in Scotland.

However, now that many have the basics down, we need to have a talk about expected goals. On Twitter last week, I noticed there was discussion about the usage of xG, in particular summing xG totals for individual matches and saying things like “(this team) should have scored 2 goals because they had an xG of 2.” Let me first throw myself at the mercy of the metaphorical court: I have created a few different visualizations where a summed xG total for an individual match was present. Those totals are still on the xG maps I publish each week for the SPFL.

I chose to sum xG on the graphics I have posted to try and ease Scottish football fans into xG. With that being said, there are some issues with summing xG for individual matches. Danny Page covers the issues pretty comprehensively in an article he wrote. Danny points out that if you sum the xG, you will miss out on the variance that can occur in a single match. In his article, he says:

Arsenal won 0–3 with a xG scoreline of 0.39–1.49. In these cases, some may say “The right team won” because the xG and real life scorelines match. However, these values are only adding expected goals. But something is missing. Only adding independent probabilities misses half of the story: variance.

A good situation to think of here is a shot with an xG of 0.05. That shot may go in, and it has gone in before, but it is not likely. The instances where it does go in are the variance Danny is talking about, but generally it is not a shot that is going to lead to goals often. But let’s say that a team has ten of those 0.05 xG shots, compared to a team that has one 0.50 xG shot. The second team’s shot is much more likely to go in than any of the first team’s shots, but summing the xG in this situation, they would both have an xG total of 0.50.
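To put rough numbers on that example, here is a small sketch that compares the two teams, treating every shot as an independent chance (a simplifying assumption, and the function names are my own). It shows the summed xG, the chance of scoring at least once, and the variance of the goal count for each team.

```python
def prob_at_least_one_goal(shot_xgs):
    """Chance that at least one shot scores, treating shots as independent."""
    p_none = 1.0
    for xg in shot_xgs:
        p_none *= 1.0 - xg
    return 1.0 - p_none

def goal_variance(shot_xgs):
    """Variance of the goal count: the sum of p * (1 - p) over independent shots."""
    return sum(xg * (1.0 - xg) for xg in shot_xgs)

team_a = [0.05] * 10  # ten low-quality chances
team_b = [0.50]       # one big chance

for name, shots in (("Ten 0.05 shots", team_a), ("One 0.50 shot", team_b)):
    print(f"{name}: summed xG {sum(shots):.2f}, "
          f"P(at least one goal) {prob_at_least_one_goal(shots):.3f}, "
          f"variance {goal_variance(shots):.3f}")
```

Both teams sum to 0.50 xG, but the team with ten small chances scores at least once only about 40% of the time (1 - 0.95^10), against 50% for the single big chance. In exchange, it can occasionally score two or more, which is why its goal count has the higher variance (0.475 against 0.25).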

The xG graphic that will be out each week for each match, borrowing heavily from Danny Page and his xG simulator. This one is Ross County v. Motherwell on November 4th.

Sometimes those lower xG shots will lead to a win, thus the idea of variance. Typically, when you sum xG over the course of a season, the variance tends to even out. However, anything can happen in one game. Therefore, Danny puts forth that rather than summing xG totals in a single match and drawing conclusions from that, it would be better to use win percentages based on the xG of each team’s shots, along with the likelihood of each goal difference for that match based on the xG output. So that is what we are going to do.

 

To do this, we will take the xG of each shot for a team in a match and run them through a Monte Carlo simulation 1,000 times. This is similar to what I do to come up with the numbers for B.U.R.L.E.Y. for the season. With these simulations, we get 1,000 simulated results from the shots in a particular match and can see how often each team would typically win or draw, what the most common scoreline would be, and the typical points per game from that xG performance. In addition to seeing the sum of the xG for a match, we will see the team that was most likely to win and what the score would typically be from a match with that xG output.
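A simulation along these lines can be sketched in a few lines of Python. This is an illustration under my own assumptions, not Danny’s simulator or the code behind the graphics: each shot is treated as an independent chance that goes in with probability equal to its xG, the function and key names are made up, and the shot lists reuse the ten-shots-versus-one example from earlier in the article.

```python
import random
from collections import Counter

def simulate_match(home_xgs, away_xgs, n_sims=1000, seed=None):
    """Replay a match n_sims times, scoring each shot with probability xG."""
    rng = random.Random(seed)
    results = Counter()
    scorelines = Counter()
    for _ in range(n_sims):
        home_goals = sum(rng.random() < xg for xg in home_xgs)
        away_goals = sum(rng.random() < xg for xg in away_xgs)
        scorelines[(home_goals, away_goals)] += 1
        if home_goals > away_goals:
            results["home"] += 1
        elif home_goals < away_goals:
            results["away"] += 1
        else:
            results["draw"] += 1
    return {
        "home_win": results["home"] / n_sims,
        "draw": results["draw"] / n_sims,
        "away_win": results["away"] / n_sims,
        "most_common_score": scorelines.most_common(1)[0][0],
        # Three points for a win, one for a draw, averaged over all simulations.
        "home_ppg": (3 * results["home"] + results["draw"]) / n_sims,
        "away_ppg": (3 * results["away"] + results["draw"]) / n_sims,
    }

# Hypothetical shot lists: ten 0.05 xG shots against a single 0.50 xG shot.
home_shots = [0.05] * 10
away_shots = [0.50]

summary = simulate_match(home_shots, away_shots, n_sims=1000, seed=42)
for key, value in summary.items():
    print(key, value)
```

Passing a `seed` makes a run repeatable. With only 1,000 simulations the percentages will wobble by a point or two between seeds, so more runs would be needed for steadier numbers.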

My xG graphic for St. Johnstone v. Celtic on November 4th.

Using Danny’s xG simulator and taking all the graphics he came up with as a template, I will now be producing these graphics for every SPFL match. Henceforth, these graphics will accompany the xG maps we have been producing each week and will hopefully give some further insight into expected goals. As this is now the “advanced class”, please feel free to let me know if you have questions or comments about this!

This article was written with the aid of StrataData, which is property of Stratagem Technologies. StrataData powers the StrataBet Sports Trading Platform, in addition to StrataBet Premium Recommendations.