Soccer is the biggest game in the world. With over 250 million active players, followed by four billion fans, it’s a huge, competitive industry.
While the game is played on the pitch, a massive data industry sits behind it, looking for the secrets to victory in player actions. The biggest teams hire secretive data consultants for an edge in this multi-billion-dollar business.
Now, three researchers at Curtin University have developed a tool to level the playing field.
Graduate student Jordan Makins is a member of Curtin’s High Performance Intelligent Systems (HPIS) research group. With guidance from Assoc. Prof. Aneesh Krishna and Dr Sajib Mistry, he set out to make an open-source tool for analysing player actions during soccer matches.
The 2003 Michael Lewis book, Moneyball, popularised sports analytics. It followed the data-driven approach of the Oakland Athletics baseball team. Since then, many sports have turned to data to get an advantage in games, but soccer has been resistant to this approach.
“Soccer is the most popular game in the world, but soccer analytics are so complicated, not a lot of work is done in this space,” says Sajib.
The sport is set on a constantly evolving field where many actions occur at once. Any computer analysis needs to cut through all that noise to the actions that lead to winning soccer matches.
“The big teams like Barcelona and Liverpool invest a lot in analytics, but that knowledge is proprietary. That means others don’t have access to it. Our findings are open-source, so people may use them however they like.”
Supercomputing was vital for analysing the field. Each second of game time was logged using event streams. Soccer experts logged these streams, keeping track of every action, from passes to dribbles and goals. The trio divided the field into 26 zones and used the logs to analyse plays.
“We were able to aggregate the different action types per zone and assign them values. We used a random forest feature selector to do this, then fed it into a deep neural network. This is the first time this approach has been used in soccer analytics,” says Jordan.
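To make the aggregation step concrete, here is a minimal sketch in Python of counting action types per zone. It is an illustration only, not the team's code: the event format, the `aggregate_by_zone` helper and the three action types are assumptions, and the article does not describe how the 26 zones are laid out, so each event here simply carries a zone number.

```python
from collections import Counter

N_ZONES = 26  # zone count from the article; the zone layout itself is assumed
ACTION_TYPES = ["pass", "dribble", "shot"]  # hypothetical subset of action types


def aggregate_by_zone(events):
    """Count each action type per zone and flatten into one feature vector."""
    counts = Counter((e["zone"], e["action"]) for e in events)
    # One slot per (zone, action type) pair, in a fixed order.
    return [counts[(z, a)] for z in range(N_ZONES) for a in ACTION_TYPES]


# Made-up events, for illustration only.
events = [
    {"zone": 0, "action": "pass"},
    {"zone": 0, "action": "pass"},
    {"zone": 12, "action": "dribble"},
    {"zone": 25, "action": "shot"},
]
features = aggregate_by_zone(events)
```

Each match (or passage of play) becomes one long vector of counts, which is the kind of input a feature selector can then work on.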
A random forest is a machine learning model that uses many decision trees to sort data by values. Random forest sorting is already used in genetics. The team read work in this field and decided to combine it with a deep learning system to trial a novel soccer analytics approach.
“If you think of a DNA strand, it’s a long, complex vector of values, much like our soccer data. The Forest Deep Neural Network (FDNN) was used in a recent genetics paper, so we figured it could be a good crossover,” says Jordan.
“It’s a great technique for selecting features. In our soccer analytics, certain actions we wanted to value were nested in a lot of constant, changing noise. This feature selector helped us cut through that noise and find the actions that were really valuable to us,” says Sajib.
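The idea of forest-based feature selection can be sketched in a few lines of Python. This is a simplified stand-in, not the team's FDNN: each "tree" here is a single-split decision stump trained on a bootstrap sample with a random subset of features, and the data is synthetic, with only feature 0 carrying signal. All function names are hypothetical. In a full pipeline, the selected features would then be passed to a deep neural network, which is not shown.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_stump(rows, labels, feature_ids):
    """Return (feature, impurity decrease) of the best single split."""
    parent, n = gini(labels), len(rows)
    best = (None, 0.0)
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            if parent - child > best[1]:
                best = (f, parent - child)
    return best


def forest_importances(rows, labels, n_trees=200, seed=0):
    """Sum impurity decreases per feature over many bootstrapped stumps."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = max(1, int(n_features ** 0.5))  # features considered per stump
    totals = [0.0] * n_features
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feats = rng.sample(range(n_features), k)
        f, gain = best_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        if f is not None:
            totals[f] += gain
    total = sum(totals)
    return [t / total for t in totals] if total else totals


def select_top(importances, k):
    """Indices of the k most important features, in ascending order."""
    ranked = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:k])


# Synthetic data: only feature 0 decides the label, the rest is noise.
rng = random.Random(42)
rows = [[rng.random() for _ in range(4)] for _ in range(60)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]

importances = forest_importances(rows, labels)
selected = select_top(importances, 1)
```

The noisy features pick up only small importance scores, so ranking by importance surfaces the one feature that actually drives the outcome.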
With a large amount of data coming in and two years’ worth of European games to analyse, the team needed access to a graphics processing unit (GPU) supercomputer. Through Curtin’s partnership with Pawsey Supercomputing Centre, they were able to gain access to the Topaz GPU cluster.
“It was important we had access to this kind of high-performance computing (HPC) power. If you think about the fluid nature of this data, we needed to process it as close to real-time as possible. Eventually we’d like to train the software to analyse a live soccer game,” says Sajib.
Pawsey helped Jordan upskill to work on the Topaz cluster. After completing introductory supercomputing and parallel data coding modules, he applied the knowledge to his research project.
“I would remote connect from my PC, so I didn’t need to go to the Pawsey building. I would run my Python code from home on my PC connected to the cloud and I would get the superior performance of the Pawsey supercomputer,” says Jordan.
A Compute Unified Device Architecture (CUDA) job script that featured in Pawsey’s training module also became the template for Jordan’s soccer analytics code:
#Default loaded compiler module is gcc module
module load cuda
“The script I used was very similar to [the CUDA script]. The only difference was I would log on for roughly a 16-hour block when I was coding.”
“Curtin’s partnership with Pawsey trains our students and researchers to use HPC resources. That training is very thorough. It’s carefully updated every year and goes through a step-by-step process. It also gives us access to computing power for this kind of exploratory research.” – Assoc. Prof. Aneesh Krishna.
The open-source model is usable by anyone. This means soccer teams no longer need huge budgets to hire consultancy firms to make data-driven decisions for their players. It levels the playing field in the sporting world, while exploring powerful new data analytics techniques.
“In the realm of machine learning, this is a newer ensemble algorithm. It’s the first time somebody’s combined these two widely tested algorithms. The mass adoption of a new algorithm is always an exciting concept. In the same way we got the random forest feature selector from a genetics study, other fields could use this approach to identify valuable data within a lot of noise.”
The model assigns values to players based on their actions. Unlike other models, it considers defensive and midfield actions alongside goals, so defensive players are valued too. With these player values, future research could explore in-game, data-driven decision making. It could help managers pick players on a limited budget and analyse opponent tactics in near-real-time.
“There are still a lot of analytics left to do and we’re looking at making all of this open-source, so everyone has access,” says Sajib.
The team are using Pawsey supercomputing to democratise soccer data and give everyone a fair go.