In 2011, a software developer named Sarah Rudd got fed up with conventional soccer stats. It was easy to count how many passes a player attempted and completed, but not all passes are equal. “We know passing percentage is a terrible metric for evaluating how good of a passer you are,” she told me. What were those actions really worth?
At a conference at Harvard that fall, Rudd, who’d started an analytics blog after talking to Seattle Sounders owner Adrian Hanauer about “Moneyball, but for soccer,” presented a better way to value actions. She split the game into 37 different states, determined by factors like ball location and defensive organization, and calculated how likely each state was to eventually lead to a goal or a turnover. Each time a player moved the ball from one state to another, such as by dribbling up the wing or playing a line-breaking pass, he got credit for the change in his team’s chances of scoring. Suddenly a pass wasn’t just complete or incomplete: it had an expected goal value.
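To make the idea concrete, here is a minimal sketch of that kind of state-based valuation. It is not Rudd's model: it uses three made-up states instead of 37, and every transition probability is invented for illustration. The names `TRANSITIONS`, `goal_probabilities` and `action_credit` are hypothetical.

```python
# Toy version of a state-based model: each state has some chance of moving to
# another state, ending in a goal, or ending in a turnover.
# All states and probabilities here are invented for illustration.
TRANSITIONS = {
    "own_half":    {"own_half": 0.50, "midfield": 0.30, "turnover": 0.20},
    "midfield":    {"own_half": 0.20, "midfield": 0.30, "final_third": 0.25, "turnover": 0.25},
    "final_third": {"midfield": 0.20, "final_third": 0.30, "goal": 0.10, "turnover": 0.40},
}


def goal_probabilities(transitions, sweeps=500):
    """Estimate the chance that each state eventually ends in a goal."""
    value = {state: 0.0 for state in transitions}
    value["goal"] = 1.0      # a goal is worth a full goal
    value["turnover"] = 0.0  # in this toy, a turnover ends the team's chance
    # Repeatedly replace each state's value with the weighted average of the
    # values of the states it can move to, until the numbers settle.
    for _ in range(sweeps):
        for state, row in transitions.items():
            value[state] = sum(p * value[nxt] for nxt, p in row.items())
    return value


def action_credit(value, before, after):
    """Credit for an action is the change in the team's scoring chance."""
    return value[after] - value[before]


VALUES = goal_probabilities(TRANSITIONS)
```

Calling `action_credit(VALUES, "midfield", "final_third")` returns a positive number for a pass that moves the ball forward, and `action_credit(VALUES, "midfield", "turnover")` returns a negative one for giving it away.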
The concept Rudd hit on is sometimes called “possession value,” and it’s an important building block of soccer analytics. Over the last few years, as data has become more widely available, more people have started trying to measure the vast, muddled majority of the game that happens in between shots. We’ve seen a boom in various approaches to the same problem Rudd was curious about: How much does any action on the pitch change the likelihood of scoring (and, for some models, conceding)?
“The work Rudd did really covers all the important parts of the philosophy of possession value models,” said Javier Fernández, the former head of sports analytics at FC Barcelona. “The fundamental question that everyone wants to solve is, ‘How do we model the state of the game at any time, and what can we get from that about future reward?’ ”
These days, there’s a whole periodic table of different possession value models. But why are there so many — and what exactly is each showing us?
The simplest kind of possession value model is location-based: the average probability of going on to score from wherever the ball is. Passing or dribbling toward the opponent’s goal typically improves the team’s chance of scoring, so players who progress the ball add value with their actions, while players who pass backward or turn the ball over lose value. That’s the idea behind Karun Singh’s popular expected threat (xT) model.
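A location-only model can be sketched in a few lines. The example below is loosely in the spirit of expected threat, not a reproduction of Singh's model: it uses four coarse zones rather than a fine grid, and the shooting, scoring and movement probabilities are all invented. The names `SHOOT_PROB`, `GOAL_IF_SHOT`, `MOVE_TO`, `zone_values` and `move_value` are hypothetical.

```python
# Chance that a player in each zone shoots rather than moves the ball (invented).
SHOOT_PROB = {"defense": 0.00, "midfield": 0.01, "attack": 0.10, "box": 0.50}
# Chance that a shot from each zone scores (invented).
GOAL_IF_SHOT = {"defense": 0.00, "midfield": 0.02, "attack": 0.05, "box": 0.20}
# Where the ball ends up when it is moved instead of shot (invented).
# Each row sums to less than 1; the remainder is possession lost.
MOVE_TO = {
    "defense":  {"defense": 0.40, "midfield": 0.45},
    "midfield": {"defense": 0.20, "midfield": 0.40, "attack": 0.25},
    "attack":   {"midfield": 0.25, "attack": 0.35, "box": 0.15},
    "box":      {"attack": 0.30, "box": 0.30},
}


def zone_values(sweeps=200):
    """Value of having the ball in each zone: shoot now, or move and inherit
    the value of wherever the ball lands."""
    value = {zone: 0.0 for zone in MOVE_TO}
    for _ in range(sweeps):
        updated = {}
        for zone, moves in MOVE_TO.items():
            shoot = SHOOT_PROB[zone]
            from_shot = shoot * GOAL_IF_SHOT[zone]
            from_move = (1 - shoot) * sum(p * value[dest] for dest, p in moves.items())
            updated[zone] = from_shot + from_move
        value = updated
    return value


def move_value(value, origin, destination):
    """A pass or carry is worth the difference between the two zone values."""
    return value[destination] - value[origin]


ZONE_VALUES = zone_values()
```

With these made-up inputs, `move_value(ZONE_VALUES, "midfield", "attack")` is positive and `move_value(ZONE_VALUES, "attack", "midfield")` is negative, which is the whole logic of rewarding progression and penalizing backward passes.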
But location isn’t everything. A pass from the center circle to the top of the box might be a valuable through ball that sends a striker in one-on-one with the keeper, or it might be a worthless lob into a crowd of defenders. “With just x-y locations, it’s really hard to tell, ‘Is this actually a productive pass to make?’ ” Rudd said. For more accurate possession values, you need context about what teammates and opponents are doing away from the ball.
The problem is that most soccer data doesn’t tell you what’s happening off the ball. To guess at what they can’t see, some models use possession history features, such as how fast the ball has been moving upfield, as proxies for defensive disorganization. “There could be a slight bias” when using proxies, said Club Brugge data scientist Jan Van Haaren, who helped develop the VAEP model with a machine learning group at KU Leuven. “But I still think it’s better than using no context at all.”
Such machine learning models can consider more information about the game than just location. They can tell the difference between a pass and a carry, and they can measure the value of an action like a take-on that doesn’t progress the ball. On the other hand, it’s harder to interpret why a tree-based model values a situation the way that it does, and the more context a model tries to account for, the more data it needs. “We knew there was going to be a lot of things about the possession we wanted to measure. But we needed a lot of data,” said Matthias Kullowatz, who designed American Soccer Analysis’s goals added (g+) model. “If you only have one instance of a right back having the ball at the corner flag, you’re going to get a potentially crappy estimate of the value.”
Weirdly, one of the hardest parts of possession value is establishing what a “possession” is. Fernández’s state-of-the-art tracking data EPV model defines it as the window between a kickoff and a goal for either team (or the end of a half). Most other models, like g+, understand “possession” the way fans do: a sequence of ball control by one team that ends with a goal or turnover. But it’s not always easy to say which team is in control of the ball, and the idea that a team loses all chance of scoring after a turnover can produce drastic values that don’t make soccer sense. For example, a midfielder whose final pass from the top of the box is blocked might get hammered by the model for throwing away a valuable possession, even though his team is still in good position to recover the ball and score.
Nils Mackay, who published his own early model while still a student, dealt with these problems by changing what Stats Perform’s PV model estimates: instead of the probability of scoring on the current possession, it now estimates the probability of scoring in the next 10 seconds. “This way you never have a cutoff of one event being the end of something,” he said. “When we did that we saw that the numbers aligned with intuition a lot more. Players can still be penalized harshly if they lose the ball, but often it won’t be as harsh if there’s still some value in what they did.”
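The difference between the two windows is easiest to see on a blocked pass followed by a quick recovery. The sketch below is an assumption about how such labels could be computed, not Stats Perform's implementation; the `Event` fields and both function names are hypothetical, and the event sequence is invented.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time: float           # seconds since kickoff
    team: str             # team performing the action
    is_goal: bool = False


def scores_on_same_possession(events, index):
    """Possession-style label: did the team score before the other team touched the ball?"""
    start = events[index]
    for later in events[index + 1:]:
        if later.team != start.team:
            return False  # any opponent touch ends the possession
        if later.is_goal:
            return True
    return False


def scores_within_window(events, index, window=10.0):
    """Time-window label: did the team score within `window` seconds after this event?"""
    start = events[index]
    for later in events[index + 1:]:
        if later.time - start.time > window:
            break
        if later.is_goal and later.team == start.team:
            return True
    return False


# Invented sequence: team A's pass is blocked, A wins the ball back and scores.
SEQUENCE = [
    Event(time=100.0, team="A"),                # pass from the top of the box
    Event(time=101.0, team="B"),                # defender blocks it
    Event(time=103.0, team="A"),                # A recovers the loose ball
    Event(time=106.0, team="A", is_goal=True),  # A scores
]
```

For the first event in `SEQUENCE`, `scores_on_same_possession` returns `False` because the block ends the possession, while `scores_within_window` returns `True` because the goal arrives six seconds later. A model trained on the second kind of label has no hard cutoff at the turnover.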
The equally tricky flip side of the turnover problem is how to value shots. You might think that just as a winger increases her team’s chance of scoring from 5 percent to 30 percent by crossing the ball to a striker in front of the goal, the striker sends it to 100 percent when she finds the net or 0 percent when she misses. But this approach leads to large, volatile shot values similar to the fickle difference between goals and expected goals. When the KU Leuven team introduced a version of VAEP that uses xG-like shot values, players’ numbers became much more stable.
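A little arithmetic shows why outcome-based shot values swing so much. The sketch below is illustrative only and is not the VAEP implementation: the shots, their pre-shot possession values and their xG figures are invented, and the function names are hypothetical.

```python
# Each shot is (possession value just before the shot, xG of the shot). Invented numbers.
SHOTS = [(0.30, 0.35), (0.10, 0.08), (0.25, 0.40)]

# The same three shots with two different sets of results.
HOT_STREAK = [True, False, True]
COLD_STREAK = [False, False, False]


def total_outcome_value(shots, outcomes):
    """Value each shot as 1 for a goal or 0 for a miss, minus the pre-shot value."""
    return sum((1.0 if scored else 0.0) - pre for (pre, _), scored in zip(shots, outcomes))


def total_xg_value(shots):
    """Value each shot as its xG minus the pre-shot value, ignoring the result."""
    return sum(xg - pre for pre, xg in shots)
```

Under the outcome-based approach, the same three shots add up to very different totals depending on which ones happened to go in: compare `total_outcome_value(SHOTS, HOT_STREAK)` with `total_outcome_value(SHOTS, COLD_STREAK)`. The xG-style total from `total_xg_value(SHOTS)` is the same either way, because it depends only on the quality of the chances.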
Future possession value models will see more of the game. StatsBomb is reworking its OBV model using broadcast freeze frames to capture off-ball positions. Knowing the ball carrier’s surroundings won’t just make possession values more accurate; it will enable kinds of measurement that event data can’t support. Fernández’s tracking data EPV model breaks possession value into component models for finer-grained values and a better look at players’ decision-making. It can even assign a goal value to actions like dummy runs that manipulate the defense but never touch the ball.
What are all these possession value models good for? At clubs, one of the most valuable applications is recruiting. Liverpool famously used possession value to help scout its squad that conquered Europe on a modest budget. Tracking data models may offer some tactical insights, too. “If you can say to one player, when the opponent plays a 4-4-2 the two CMs are leaving space in this particular situation, sometimes that’s enough,” Fernández said. “A coach once told me that if you can tell a player just one thing that can improve his game, he’s going to love you forever.”
As for Rudd, her possession value work helped land her a job at Arsenal, where until recently she was head of analytics. But it also offered the intrinsic reward of getting a little closer to pinning down this impossible sport. “I remember reading a comment that said, ‘We can’t quantify Santi Cazorla’s ability to split two defenders,’” she said. “And I was like, ‘Oh yeah?’ ”