Classifying Strikers In The SPFL With Principal Component Analysis And K-Means Clustering

Modern Fitba will be disappearing in the next few days, so I wanted to repost some of the articles I wrote for it here.

In my first article discussing K-means clustering, I used midfielders in the SPFL Premiership last season to demonstrate how we can use clustering methods to identify styles of play and find players in the league who play like each other. Even casual fans of the game are likely aware that there can be different roles for midfielders. We can use our clustering techniques to help confirm our preconceived notions regarding midfielders’ roles, but if you “USE YOUR EYES, MATE”, you can already see some type of pattern in midfield playing styles.

While there are typically four or five midfielders on the pitch playing different roles, there are usually only one or two strikers on the pitch at a time, unless you are a desperate Tony Mowbray at Love Street about to lose his job. With fewer personnel, it might not be as easy to discern the style of play a striker employs as it is for other positions on the pitch. Putting the ball into the net is clearly job number one nearly all the time for strikers, but how they do that can vary. Does your manager want your striker to press the back line, be heavily involved in the build-up, play off the last man’s shoulder, etc.? Two strikers might have similar goal scoring stats or even similar advanced xG stats, but if they aren’t doing it in the same way, one might not necessarily be a good replacement for the other in your favorite team’s lineup.

This table shows the percentage of explained variance for the first six principal components of the principal component analysis.

To try and find which strikers in the Premiership this season are playing similarly, I again looked to K-means clustering. I followed a similar methodology to the one I used for midfielders, with a few key changes. After my article on midfield clusters, Modern Fitba’s favorite Dutchman, Ortec’s Head of Data Bertus Talsma, suggested reducing the number of dimensions used in the clustering. In my midfield clustering article, I used 43 metrics to come up with the clusters. This time around, I took those 43 metrics and used them for a Principal Component Analysis. Often shortened to PCA, this is a dimensionality-reduction method that lowers the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set. In short, PCA lets us condense our many variables into a few new composite ones that capture most of the variation between players.
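As a rough illustration of that step, here is a minimal sketch in Python with scikit-learn. The tooling is my assumption for the example, not necessarily what was used for the article, and the data is randomly generated rather than real Ortec data. It standardizes 43 made-up metrics, runs a PCA, prints the share of variance each of the first six components explains, and keeps the first two for clustering.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 60 strikers x 43 metrics.
# Three hidden "style" factors plus noise, purely for illustration.
rng = np.random.default_rng(0)
n_players, n_metrics = 60, 43
latent = rng.normal(size=(n_players, 3))
loadings = rng.normal(size=(3, n_metrics))
metrics = latent @ loadings + 0.5 * rng.normal(size=(n_players, n_metrics))

# Standardize so metrics measured on large scales do not dominate the PCA.
scaled = StandardScaler().fit_transform(metrics)

# Fit the PCA and project every player onto the first six components.
pca = PCA(n_components=6)
components = pca.fit_transform(scaled)

# Share of total variance explained by each component, largest first.
explained = pca.explained_variance_ratio_
for i, share in enumerate(explained, start=1):
    print(f"PC{i}: {share:.1%} of variance")

# Keep only the first two components as the inputs for clustering.
reduced = components[:, :2]
```

Each principal component is a weighted blend of all 43 original metrics, so looking at the largest weights in `pca.components_` is one way to see which metrics drive each new measurement.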

In my previous article, I used the elbow method to determine the optimal number of clusters for midfielders. This time around, it was more appropriate to use the silhouette method.

Once we have these new, reduced variables, we can perform the same K-means clustering that we did with midfielders, only this time using the two new measurements from our PCA that explain the most variance between our players. Checking how many clusters best fit the data, we find that the optimal number for SPFL strikers this season is three. So let’s see where strikers who have played at least 990 minutes this season in Scotland fall.
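For anyone who wants to try this at home, here is a hedged sketch of picking the number of clusters with the silhouette method and then running K-means on two PCA measurements. It uses Python with scikit-learn, which is an assumption on my part, and the two-dimensional scores are synthetic blobs rather than the real striker data. The candidate range of two to six clusters is also just an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-D "PCA scores": three well separated groups of 20 players each.
rng = np.random.default_rng(1)
centers = np.array([[-4.0, 0.0], [4.0, 0.0], [0.0, 6.0]])
scores_2d = np.vstack(
    [c + rng.normal(scale=0.6, size=(20, 2)) for c in centers]
)

# Silhouette method: fit K-means for each candidate k and record the
# average silhouette score (higher means tighter, better separated clusters).
silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scores_2d)
    silhouettes[k] = silhouette_score(scores_2d, labels)

# The k with the highest average silhouette is taken as the optimal number.
best_k = max(silhouettes, key=silhouettes.get)

# Final clustering with the chosen k; one cluster label per player.
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(scores_2d)
print("chosen number of clusters:", best_k)
```

K-means starts from random centers, so fixing `random_state` and using several restarts via `n_init` keeps the cluster assignments reproducible from run to run.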

The three clusters of strikers who have appeared in at least 990 Premiership minutes of play so far this season.

In our three clusters, some interesting patterns emerge. The thing that sticks out to me right away is that every striker at Celtic and Rangers outside of Patryk Klimala is located in cluster 2. There are a few others, such as Billy McKay, Stevie May, Eamonn Brophy, and Bruce Anderson, in the cluster with the strikers plying their trade in Glasgow. This tells us that strikers at Rangers and Celtic are often playing in a similar fashion. Players in this cluster seem to mostly be all-around strikers. Many of them have a high number of possession regains in the opponent’s half, indicating they are frequently pressing. They attempt a large share of the shots their teams take. They complete a lot of passes, and often those passes come in the opponent’s half and final third. They are all-around players who use their many skills to change games frequently.

While Patryk Klimala isn’t clustered with other Celtic players or their rivals at Rangers, many of the players in his cluster certainly make sense to the eye as a group. These players are often the first outlet for their teams on a counter, as they frequently receive the first pass of a possession. Seeing former Aberdeen striker Sam Cosgrove in cluster 1, it is probably no surprise that players in this cluster are often winning aerial duels, attacking duels, and taking plenty of shots via header.

In our third cluster, the strikers are attempting a high number of forward passes and taking on a lot of defenders via the dribble. They also attempt a fair number of high crosses to the box. Finally, they are winning defensive duels more than their peers in the Premiership, which could indicate that they are asked to help out on defense more often.

So we have clusters helping to identify strikers in Scotland. Great. We can imagine a scenario where a high profile striker will likely be moving on from his club in Scotland. Let’s call him Moredouard. We could use clustering to try and find a replacement for him, but if his replacements are at a rival club that will never do transfer business with us or are, with no offense meant, Stevie May or Billy McKay, we may need to look outside Scotland to find a suitable replacement.

Luckily, our friends at Ortec let us have a look at some data for leagues at a similar level to the Scottish Premiership to try and find some players that might play similarly to the player we are trying to replace. If we add strikers that appeared in Major League Soccer last season and in the Danish Superligaen and Belgian First Division this season, we have a larger sample to cluster and a better chance of finding suitable replacements.

The four clusters of strikers that have appeared in at least a third of available minutes in their respective leagues.

As we did with the strikers in Scotland, we use PCA and K-means clustering to find similar players. For our larger sample of international strikers, the optimal number of clusters was four, which gives us the results above. If you could decode my very tricky word play earlier, you will see that cluster 4 contains some players that our algorithm thinks play similarly to Odsonne Edouard and Alfredo Morelos. A cursory glance at surface-level statistics shows that goals tend to come with playing in a similar style to the best strikers in Scotland. Fashion Sakala Junior is in the Edouard/Morelos cluster and has 13 goals in Belgium this season for KV Oostende. Mauro Manotas scored 13 goals as well for Houston in MLS before moving to Tijuana in Liga MX.
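Turning cluster labels into a shortlist can be sketched in a few lines of Python. Everything here is hypothetical: the player names, leagues, PCA scores, cluster labels, and the `similar_players` helper are made up for illustration. Ranking cluster-mates by their distance to the target in PCA space is also my own addition for the example, not something taken from the analysis above.

```python
import numpy as np

# Hypothetical players, leagues, PCA scores and cluster labels (not real data).
players = ["Target", "Striker A", "Striker B", "Striker C", "Striker D"]
leagues = ["Scotland", "Belgium", "MLS", "Denmark", "Scotland"]
pca_scores = np.array(
    [[1.0, 2.0], [1.2, 1.8], [3.0, 2.5], [-2.0, 0.5], [0.9, 2.1]]
)
clusters = np.array([4, 4, 4, 1, 4])


def similar_players(target, exclude_leagues=()):
    """Return (name, league, distance) for players in the target's cluster,
    skipping the excluded leagues, sorted from most to least similar."""
    idx = players.index(target)
    same_cluster = clusters == clusters[idx]
    # Euclidean distance from every player to the target in PCA space.
    distances = np.linalg.norm(pca_scores - pca_scores[idx], axis=1)
    candidates = [
        (players[i], leagues[i], float(distances[i]))
        for i in np.flatnonzero(same_cluster)
        if i != idx and leagues[i] not in exclude_leagues
    ]
    return sorted(candidates, key=lambda row: row[2])


# Look outside Scotland for replacements for the hypothetical target.
shortlist = similar_players("Target", exclude_leagues=("Scotland",))
for name, league, distance in shortlist:
    print(f"{name} ({league}): distance {distance:.2f}")
```

Excluding a league is a stand-in for the rival-club problem: a filter like this keeps the shortlist to players a club could realistically sign.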

Of course, there is a world of Scottish football outside of Glasgow, and Celtic and Rangers aren’t the only clubs in the Premiership that could be looking to replace a young, talented striker this summer. Kevin Nisbet seemed out the door at Easter Road in January before a deal fell through late, but you would imagine interest in the Scottish striker will pick back up at season’s end. Suitors for Nisbet might like the idea of getting the “Scottish Higuain” (emphasis to note the tongue slightly in cheek nature of the comment), but Hibs can look to clustering to try and find someone to fill Nisbet’s shoes. He might now be out of Hibs’ price range, but there is another KV Oostende player in Belgium in the same cluster as Nisbet who fits his playing style and has knocked in 11 goals this season: Makhtar Gueye.

Clubs of all sizes can simplify their transfer process with some basic data and a few lines of code. In a scenario like this, you would want to confirm with in-person scouting that a player the algorithm spits out really does fit in with your team, but with something like clustering and principal component analysis you can narrow a large population of players down to a few names to dig deep on.