# Predictive Modeling in Tennis

Before we begin, it’s important to remember that I’m only going to talk about a few ways to approach this problem; there are alternative routes you can take to do predictive modeling. Be warned, long post ahead! That being said, let’s break this down into two parts. First, determining an individual player’s skill at a point in time. Second, given player one’s skill and player two’s skill at that point in time, determining the probability that each player wins the match.

Now some background on tennis. If you know how tennis is played feel free to skip this part. Otherwise you may want to go to the Wikipedia page on tennis for a deeper understanding. Tennis is played on different surfaces: typically hard (concrete, asphalt, etc.), clay, grass, and, historically, carpet and wood. The surface a match is played on has a great effect on the speed of the ball, the bounce, and multiple other factors. You can read more about surfaces here. Tennis is played either between two players (singles) or four players (doubles). For the sake of simplicity (and because I don’t have doubles players data) we shall only look at singles. In singles play the players alternate between being on serve and on return. On serve the player, unsurprisingly, serves the ball to the other player, whereas the other player is on return, ready to send it back. After each game finishes (games are discussed later), the players swap roles. The swapping also occurs more frequently during tiebreaks, but if you want to find out more about this click here.

Measuring skill in games is quite a challenge; there are many different models for doing so, of which we shall implement a few. Before getting into that though, let us look at some basics. The underlying logic is that two players of equal skill ought to, assuming everything else remains equal and a large number of games (let’s say 100) are played, each win about 50% of the games. So how do we determine skill? We can do so by looking at wins versus losses. However, winning against a player that loses all their matches is not much of an accomplishment. Hence, more weight should be given to difficult wins than to easy wins. Now we start getting into the territory of models like Elo, where the number of points (which corresponds to your skill) increases more when you beat someone far better than you than when you beat the lowest ranked player.

That being said, I believe that you can break things down even further in tennis: you can use skill to determine the probability a player has of winning a mere point! In order to do so effectively we’re going to break skill down even further. First, we need to split it up into three categories: skill on clay, hard, and grass courts (hard will include carpet, which, as of 2009, has been phased out by the ATP and WTA). Second, we will measure two different skills instead of just one! What do I mean? Well, typically for Elo you would have one number representing your skill. In this instance we are going to have two numbers: one for skill on serve and one for skill on return. The reason for this is that there is significant variance amongst players with respect to their ability on serve and on return. For example, Milos Raonic, a Canadian tennis player currently ranked 11th in the world, has consistently, for the past 3 years, been the #1 player in first serve points won. However, he struggles on return.
In order to standardize our measurement of performance on serve and return, we shall take the percentage of points won on serve and percentage of points won on return in a given match as an indicator.
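To make the Elo idea above concrete, here is a minimal sketch of the textbook Elo update for a single rating per player. This is the standard formula, not the serve/return model this series builds; the scale constant 400 and the step size `k=32` are conventional defaults and the function names are my own, so treat them as assumptions.

```python
def elo_expected(rating_a, rating_b):
    """Expected score (between 0 and 1) of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return the new (rating_a, rating_b) after one result.

    score_a is 1.0 if A won and 0.0 if A lost.
    k is an assumed step size; it is not a value taken from this post.
    """
    expected_a = elo_expected(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating pool stays constant.
    return rating_a + delta, rating_b - delta


# An underdog beating a favourite gains far more than the reverse.
upset_gain = elo_update(1400.0, 1800.0, 1.0)[0] - 1400.0
routine_gain = elo_update(1800.0, 1400.0, 1.0)[0] - 1800.0
print(round(upset_gain, 2), round(routine_gain, 2))
```

Nothing stops `score_a` from being a fraction, such as a share of points won, but how performance on serve and return feeds into the ratings is left to the later posts on the individual models.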

Now we have an idea of how our skill model will work. We take inputs of performance on serve and performance on return in a given match on a given court versus a player with given skills. We then run those inputs through a model and churn out new skill ratings for both players. It’s quite simple really! We’ll look at the math far more in depth in upcoming posts when we discuss the different models.

Let’s take a brief look at scoring in tennis. Again, if you know how tennis is played feel free to skip this part. Otherwise you may want to go to the Wikipedia page on tennis for a deeper understanding. Anyway, you score a point in tennis by hitting the ball within the limits of the court and having your opponent fail to return it to your side of the court (within its limits). Points within a game are called at uneven intervals: love, 15, 30, and 40 correspond to 0, 1, 2, and 3 points (love is just the name for zero). In order to win a game a player must score four points, i.e. win a point from 40, while leading by at least two points. If both players reach 40 (40-40) the game is tied, which is called deuce. In that situation the game continues until one player gets a two-point advantage, in which case (s)he takes the game. Sets are won when a player reaches 6 or, sometimes, 7 games. 7 games must be won in instances where your opponent has reached 5 or 6 games. Essentially you are trying to get a two-game advantage in order to win the set. There are also special tie-breaker rules that you can read more about here if you’re interested. Matches of tennis are usually played as best of three sets, although Grand Slams (Wimbledon, Roland Garros (French Open), the Australian Open, and the US Open) are played as best of five sets (for men’s tennis).

Markov chain for a tennis game (source: Wolfram.com)

Ultimately our goal is to determine the outcome of a match at some point in time. In order to do so, as discussed before, we will use our model of skill to help calculate the outcome of an individual point. Here the math starts to come into play and gets a bit more complicated. On a basic level, let’s say that we want the probability that a player wins a point. Let that be represented by p. Similarly, the probability that the player loses the point is 1 – p. Let us extend that again and again for each point until the tennis game is completed. What we get is a simple Markov chain model of probabilities (see picture above). We further extend that model until we reach a set winner and so on. It may seem quite complex and large at the moment, but keep in mind that it’s just simple probability extended to different cases. I draw on O’Malley’s model to determine the probabilities. Let’s quickly go into it (math warning):

Let p represent the probability of a player winning a point on serve.

Let q represent the probability of a player winning a point on return.

Let A and B represent the coefficient matrices derived by O’Malley.

Probability of winning a game:

$G(p)=p^4(15-4p-\frac{10p^2}{1-2p(1-p)})$

Let’s understand how we got there by simplifying the original equation:

$G(p)=P(winning\; game)\\=\sum_{i=0}^{\infty}P(winning\; game\; while\; losing\; i\; points)\\=p^4+4p^4(1-p)+10p^4(1-p)^2+20p^3(1-p)^3\cdot p^2\sum_{i=3}^{\infty}[2p(1-p)]^{i-3}\\=p^4+4p^4(1-p)+10p^4(1-p)^2+\frac{20p^5(1-p)^3}{1-2p(1-p)}\\=p^4(15-4p-\frac{10p^2}{1-2p(1-p)})$

Whoa. Calm down, let’s walk through this big scary equation step by step and see that it’s actually not so bad. We’re looking for all the different ways a player can win a game. A player can win a game by winning 4 points without losing a single point ($p^4$), winning 4 points and losing 1 ($4p^4(1-p)$), winning 4 points and losing 2 ($10p^4(1-p)^2$), or winning from deuce (winning 3 points, losing 3, and then coming out ahead)… which is slightly more complex. Before we do that, just a note: the 1, 4, 10, and 20 are the numbers of orderings of points that lead to each outcome (think binomial coefficients if you took data management or some equivalent in high school or uni/college). For the first three cases the final point must be won by the winner, so only the earlier points can be reordered, which gives $\binom{3}{0}=1$, $\binom{4}{1}=4$, and $\binom{5}{2}=10$; on the way to deuce all six points can be reordered, giving $\binom{6}{3}=20$. If you are not familiar with this you may want to look up how it works online; I will not explain it here. First, to get to deuce you must win 3 points and lose 3 ($20p^3(1-p)^3$). But then you are in a situation where you must gain a 2-point advantage in order to win the game, which is where we’re left with the remainder of the equation. Essentially it’s the probability of winning two points in a row ($p^2$) multiplied by a sum over the number of times the players first split two points and return to deuce, each of which has probability $2p(1-p)$: $\sum_{i=3}^{\infty}[2p(1-p)]^{i-3}$. That sum is a geometric series equal to $\frac{1}{1-2p(1-p)}$. Simplifying all that gives us the final equation.
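If you would rather check the game formula numerically than algebraically, here is a small sketch that evaluates the closed form for $G(p)$ and compares it with a direct walk through the Markov chain of point scores. The function names are my own, and the deuce shortcut $p^2/(p^2+(1-p)^2)$ is just the geometric series from above written differently.

```python
from functools import lru_cache


def game_closed_form(p):
    """G(p): probability of winning a game given point-win probability p."""
    return p**4 * (15 - 4 * p - 10 * p**2 / (1 - 2 * p * (1 - p)))


def game_markov(p):
    """Same probability, computed by recursing over point scores (a, b)."""

    @lru_cache(maxsize=None)
    def win(a, b):
        if a == 3 and b == 3:
            # Deuce: win two in a row, lose two in a row, or split and repeat.
            return p**2 / (p**2 + (1 - p) ** 2)
        if a == 4:
            return 1.0  # reached four points before deuce was possible
        if b == 4:
            return 0.0
        # Win the next point with probability p, otherwise lose it.
        return p * win(a + 1, b) + (1 - p) * win(a, b + 1)

    return win(0, 0)


for p in (0.5, 0.55, 0.62, 0.7):
    print(p, round(game_closed_form(p), 6), round(game_markov(p), 6))
```

The two columns should agree to floating-point precision. The loop also shows how a modest edge on each point becomes a much larger edge over a whole game.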

Phew, that was complicated. Wait, there’s more! What about the probability of winning a tie breaker for a set? Well in that case…

$TB(p,q)=\sum_{i=1}^{28}A[i,1]\,p^{A[i,2]}(1-p)^{A[i,3]}q^{A[i,4]}(1-q)^{A[i,5]}d(p,q)^{A[i,6]},\quad d(p,q)=pq\,[1-(p(1-q)+(1-p)q)]^{-1}$

Yeah… more scary equations. You should know that this equation uses O’Malley’s matrix of coefficients (for example, A[i,j] is the element in row i and column j of the 28 by 6 matrix). The concept behind it is still very similar to what we discussed earlier about the basic probabilities. Take some time to go through it and see if you can understand it for yourself, bearing in mind what we just learned above. Basically it sums over all the different point score combinations in the tiebreaker (there are 28 different point combinations for a tiebreak, hence the 28×6 matrix and the sum up to 28).
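The coefficient matrix A is not reproduced in this post, so the sketch below does not use it. Instead it swaps in a plain recursion over tiebreak scores, which should describe the same Markov chain. It assumes a first-to-7, win-by-two tiebreak with the usual serve order (one point, then blocks of two); the `serves_first` flag and the function name are my own additions.

```python
from functools import lru_cache


def tiebreak_prob(p, q, serves_first=True):
    """Probability of winning a tiebreak; assumes 0 < p < 1 and 0 < q < 1.

    p: probability the player wins a point on their own serve
    q: probability the player wins a point on return
    serves_first: whether the player serves the first point (an assumption
    of this sketch, not something the post specifies)
    """

    @lru_cache(maxsize=None)
    def win(a, b):
        if a == 6 and b == 6:
            # From 6-6 each player serves once per pair of points: win both
            # (p*q), lose both ((1-p)*(1-q)), or split and start over.
            return p * q / (p * q + (1 - p) * (1 - q))
        if a == 7:
            return 1.0
        if b == 7:
            return 0.0
        n = a + b  # points played so far
        # Serve order: first server takes point 0, then blocks of two.
        first_server_serves = ((n + 1) // 2) % 2 == 0
        player_serves = first_server_serves == serves_first
        w = p if player_serves else q
        return w * win(a + 1, b) + (1 - w) * win(a, b + 1)

    return win(0, 0)


print(round(tiebreak_prob(0.7, 0.4), 4))
```

The 6-6 case is the same quantity as $pq[1-(p(1-q)+(1-p)q)]^{-1}$, since $1-(p(1-q)+(1-p)q)=pq+(1-p)(1-q)$.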

So let us extend that. There are 21 different possible outcomes (combinations) of game scores that make up a set, so the probability of winning a set is:

$S(p,q) = \sum_{i=1}^{21}B[i,1]\,G(p)^{B[i,2]}(1-G(p))^{B[i,3]}G(q)^{B[i,4]}(1-G(q))^{B[i,5]}\left(G(p)G(q)+\left[G(p)(1-G(q))+(1-G(p))G(q)\right]TB(p,q)\right)^{B[i,6]}$

It’s quite complex to just stare at it, but broken down it’s fairly simple and is basically the Markov chain of the possible scores for a set where the player would win using O’Malley’s coefficients.
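As with the tiebreak, the matrix B is not listed here, so this sketch walks the chain of game scores directly rather than using O’Malley’s coefficients. It takes the three ingredients of the set formula as plain numbers: `hold` for $G(p)$, `brk` for $G(q)$, and `tb` for $TB(p,q)$. Those parameter names, the `serves_first` flag, and the sample values at the bottom are illustrative assumptions.

```python
from functools import lru_cache


def set_prob(hold, brk, tb, serves_first=True):
    """Probability of winning a tiebreak set.

    hold: probability of winning a game on own serve, i.e. G(p)
    brk:  probability of winning a game on return, i.e. G(q)
    tb:   probability of winning the tiebreak played at 6-6, i.e. TB(p, q)
    serves_first: whether the player serves the first game (an assumption)
    """

    @lru_cache(maxsize=None)
    def win(a, b):
        if a == 6 and b == 6:
            return tb
        if a == 7 or (a == 6 and b <= 4):
            return 1.0  # set won 6-0 .. 6-4 or 7-5
        if b == 7 or (b == 6 and a <= 4):
            return 0.0  # set lost
        # Serve alternates every game.
        player_serves = ((a + b) % 2 == 0) == serves_first
        w = hold if player_serves else brk
        return w * win(a + 1, b) + (1 - w) * win(a, b + 1)

    return win(0, 0)


# Illustrative inputs only: hold 85% of service games, break 25% of the time.
print(round(set_prob(0.85, 0.25, 0.55), 4))
```

From 5-5 this recursion reproduces the last factor of the set formula: win both games, or split them and win the tiebreak.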

We’re left with fairly simple formulas to determine the probability of winning a 3-set or 5-set match.

3-set:

$M_{3}(p,q)=S(p,q)^2[1+2(1-S(p,q))]$

5-set:

$M_{5}(p,q)=S(p,q)^3[1+3(1-S(p,q))+6(1-S(p,q))^2]$

So again, these formulas follow the basic principles we discussed before. Let us examine the 3-set formula as an example. $S(p,q)^2$ represents the probability of the player winning two sets (which means (s)he has won the match). That term is then multiplied by a sum over the ways the match can unfold: the winner either drops no set (the 1) or drops exactly one set, which can be either the first or the second set of the match (hence the 2). This is reflected in $2(1-S(p,q))$, where $1-S(p,q)$ is the probability of the player that won the match losing a set. It works the same way for five sets, except that the winner can drop up to two sets, which adds the $6(1-S(p,q))^2$ term. Phew, hopefully all that made sense. These are the basics with which we shall formulate our model for predicting tennis matches!
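The match formulas are short enough to type in directly. The sketch below does that and cross-checks them against a generic best-of-n count, where the winner takes the final set and the dropped sets can fall anywhere before it. Like the formulas above, it treats the set probability `s` as the same for every set; the function names are my own.

```python
from math import comb


def match_best_of_3(s):
    """M_3: win two sets, dropping zero or one along the way."""
    return s**2 * (1 + 2 * (1 - s))


def match_best_of_5(s):
    """M_5: win three sets, dropping zero, one, or two along the way."""
    return s**3 * (1 + 3 * (1 - s) + 6 * (1 - s) ** 2)


def match_best_of(n_sets, s):
    """Generic version: the last set is a win, earlier sets are reordered."""
    need = n_sets // 2 + 1  # sets required to win the match
    return sum(
        comb(need - 1 + lost, lost) * s**need * (1 - s) ** lost
        for lost in range(need)
    )


for s in (0.5, 0.6, 0.7):
    print(s, round(match_best_of_3(s), 5), round(match_best_of_5(s), 5))
```

For a set probability of 0.6 this gives 0.648 over three sets and 0.68256 over five, which is why the longer format favours the stronger player.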

Now before I conclude, a couple of things to note. There are so many different factors you can take into account when trying to predict tennis matches, such as a player’s height, weight, the temperature/weather conditions (and the player’s performance in those conditions), the time of year, etc. However, in my model I focused on what appeared to be the most impactful variables for a player’s skill, i.e. their serve and return abilities. Again, please check out O’Malley’s work on the subject; in addition to some further insights, you can also find the table of coefficients he uses. I highly recommend it.

Cheers!