Excel Blog

7 MIN READ

How We Built our Soccer Tournament Predictor in Excel

Former Employee

Jun 01, 2018

How We Built our Soccer Tournament Predictor in Excel

Hundreds of millions of users have created truly amazing solutions in Excel. Ranging from very traditional use cases to quite unexpected scenarios, Excel has been among the most favorite productivity tools for those users. Today we share a model we have built ourselves to help predict the future of a sports tournament. Read on to learn more.

First, let me preface this, I am not a die-hard soccer (football) fan. I grew up in Canada where our Men’s soccer team is ranked a dismal 73^rd (our Women’s team is great and ranked #4), so I do not have any skin in this game. But I am a massive sports fan and love seeing how people try to find logic and reason in sports, and for that reason I have decided to try and do that myself. However, I have always been frustrated by not being able to dig into people’s models. In many cases it is proprietary and in others it relies on file dependencies or libraries that I wouldn’t have access to. I wanted to accomplish two tasks by building this model; first, I wanted to build it entirely in the Excel grid using only formulas. Second, I wanted anyone who sought to be able to see the nitty-gritty details of the model to be able to do so and tinker with it to their hearts content.

Here are the templates:

Model Summary

The model is based off a couple of different calculations, basically taking two inputs. The first is on the Round Robin Predictions tab and is the performance indicator. In Column E, you can input how you think a team will perform (better or worse) from how we expect them to perform over the tournament. I.e. if you think Switzerland will perform better than we have them ranked, increase their percentage, and if you think England will underperform (as they usually do) then input a negative percentage in the appropriate row.

The second input is found on the This Year’s Matches tab and allows you to predict the individual goals of a match. This will override our predictions for that match and I encourage you to use it as a ledger to see what it would take for your team to make it into the knockout stage.

After that, the model will take the top two teams in every group and put them into the Playoff Predictions tab where you can see who might advance through each round.

Finally, you should remember, this model is built off a function called =RAND(). This function spits out a new random number whenever any cell is changed, which means the model is never going to be exactly the same as your friends. That is okay. We run many iterations of the model and it is stable.

I would advise against using the model to make bets on who will win and who will not. I take no responsibility for your losses (I will happily take credit for your gains). Have fun!

Detailed Construction of the Model

We started with two datasets, the first being a list of every international soccer game played over the last 146 years, shortened to include just the last 40 years. This is a publicly available data set for anyone to use (https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017). The second data set is obtained from http://eloratings.net and is the top 100 international football teams rated by Elo.[1]

Using the Elo ratings, we broke out the top 100 teams into ten different groups, the top ten teams were assigned a score of 90, with teams 11-20 being assigned a score of 80 and so on and so forth. This enables us to get a good idea of how a team ranked in one category would perform against a team of another category. We then cross reference this data with the last 40 years’ worth of games and eliminated all games where a top 100 team did not play against another top 100 team. We also eliminated all games that were won by 7 or more goals eliminating around 2% of games. This leaves us with about 12,000 games remaining.

We then calculated the matches played per group of ten when playing against a different caliber group, as well as the goals scored per side when two calibers of teams played each other. This was then divided to give us an average goals per group depending on which caliber of team they played. This is shown on the Appendix3 – Goals Per Group Sheet.

Okay so now the prediction part begins. Which requires a little bit of statistics.

Soccer is known to follow a Poisson Distribution[2]. This essentially means that every goal scored is independent of all the other goals scored throughout a match. This is a large assumption because it ignores things like momentum, necessity, etc. but when predicting anything you will have to make assumptions.

We are going to do a little math. Don’t worry, it is not difficult and as someone who was (and is) bad at math, I promise, it will be okay.

The formula for the Poisson Distribution is quite straightforward:

The components of the equation are as follows:

P(k) is the probability of scoring K goals with K being any number. In our model K = 0,1,2,3….9.
Delta, or lowercase δ is the average goals scored per team (which is different because of how we broke up the teams) is to the power of K which is 0,1,2,3…9
e is a constant (often referred to as Euler’s Number) and equals 2.71828 which is to the power of negative δ
This is all divided by k! (! Is factorial). A factorial is when you multiply K by an increment one less than itself. If K is 6, then K! is 6*5*4*3*2*1.

Every number outputted will be less than one, as it is a probability.

If you are like me and that makes some sense, but you do not really understand it unless there is a picture associated with it:

On the Y-axis we have the probability of any team scoring goals which are put on the X-axis. The δ (the average goals scored) we used is 1.2. This is an average of Columns E and F on Appendix2- Past Results. The picture makes sense, more soccer teams need to score one goal versus no goals otherwise most games would end in a 0-0 draw.

Knowing this we generate a random number (=RAND()) and match it to the result of the Poisson Distribution, this gives us the number of predicted goals scored per match. We then repeat this process 50 times per team per game. Since this is Excel after all, this is as straightforward as just dragging formulas across cells.

This helps us predict the round robin phase of the tournament. The top two teams from every group will qualify for the knockout stage, but we need to be able to figure out who those are, which is why we need to predict goals for and goals against as they are the second metric after wins, losses and draws used to determine who will qualify.

Now that we know who will qualify for the knockout stage, we no longer care about goals scored. It doesn’t matter, one team in every knockout stage game will win by at least one goal. There can be no ties.

We switch to predicting the percentage chance that a team will progress through the knockout stage. This is done by using a portion of the Elo formula, the win expectancy. We do this because we do not really care about the teams future rating, we just care about how they are expected to preform right now.

The W_e in the Elo formula is the win expectancy and dr is the difference in ratings between two teams, plus 100 for the home team. This is easy in this case as there is only one home team. That is why there is an inconsistent formula error on Appendix 4 – Odds sheet. Remember when I said you could really dig in? Ya, I meant it.

Our model, unsurprisingly so, gives Brazil winning the tournament. However, this is FAR from a sure thing. The gold medal game is expected to be between Brazil and Germany where Germany has a 49.42% chance of winning and Brazil a 50.58% chance of winning, this equates to a coin toss. There is also a scenario where Brazil meets Germany in the first knockout round. In order for this to happen, Germany will have to come second in their group and Brazil will have to come first.

This model will not be perfect, it is called a model after all. Other people have added travel time from Point A to Point B as an indicator, or if a team is playing at home or away to created better models. This makes some logical sense. The closer you play to your home field, the more of your fans will come to see you play, the more relaxed you might be. So, Germany might be given the edge if they play Brazil, and Brazil could go home heartbroken again. This sounds reasonable until you remember (and how can one forget) when Germany beat Brazil in Brazil by a score of 7 to 1. See? All models work, until the script is torn up and thrown away, that is why we play the game after all. It is all a glorious act in futility, because as much as I want this model to be correct, I also love an upset and an underdog story. I think only because, as mentioned at the top, I have no skin in this game.

To stay connected to Excel and its community, read Excel blog posts, and send us ideas and suggestions for the next version of Excel through UserVoice. You can also follow Excel on Facebook and Twitter.

Michael Molyneaux, PM Manager, Excel

[1] This is a metric developed by Dr. Arpad Elo and was initially used to calculate the ratings of chess players but has since been used in everything from Soccer to the NHL to E-Sports (https://en.wikipedia.org/wiki/Elo_rating_system). Although Elo is not used as the official way to rank teams by FIFA, a 2009 study of the metric showed that if implemented correctly, Elo has the best success at predicting winners (http://lasek.rexamine.com/football_rankings.pdf) .

[2] A Poisson Distribution “expresses the probability of a given number of events (goals) occurring in a fixed interval of time or space if these events occur with a known constant rate and independent of the time since the last event” (https://en.wikipedia.org/wiki/Poisson_distribution

Updated Jun 12, 2018

Version 6.0

blog

excel

Michael Molyneaux

Former Employee

Joined March 02, 2018

View Profile

Excel Blog

Subscribe to the Excel Blog to get the latest product announcements and updates