Friday, September 13, 2013

Tuesday Oct 29 2013: Bulls at Miami, who will win?

The 2013-2014 NBA season kicks off Tuesday, October 29th, and the spotlight will be on the return of two injured megastars: Derrick Rose's Bulls visit Miami, and Kobe Bryant's Lakers host the Clippers.

On this blog we love drama, we love high-intensity games, but we also like tough questions. Such as: who will win those two games? And while we're at it, who will come out victorious in the other 1228 games of the season?

In this post I will present my efforts to predict the outcome of a game based on metrics related to the two opposing teams.

The data

The raw data consisted of all regular-season NBA games (no play-offs, no pre-season) since the 1999-2000 season. That’s right, we’re talking about 16774 games here. For each game I pulled information about the home team, the road team and who won the game.

The metrics

After pulling the raw data, the next step was to create all the metrics related to the home and road teams’ performance up until the game whose outcome I want to predict. Because a team can be heavily restructured over the off-season, each new season starts from scratch and no results are carried over from one season to the next.

Simple Metrics
The simple metrics I pulled were essentially the home, road and total victory percentages for both the home team and the road team. Say the Dallas Mavericks, who have won 17 of their 25 home games and 12 of their 24 road games, visit the Phoenix Suns, who have won 12 of their 24 home games and 13 of their 24 road games. I would compute the following metrics:

  • Dallas home win percentage: 17 / 25 = 68.0%
  • Dallas road win percentage: 12 / 24 = 50.0%
  • Dallas total win percentage: 29 / 49 ~ 59.2%
  • Phoenix home win percentage: 12 / 24 = 50.0%
  • Phoenix road win percentage: 13 / 24 ~ 54.2%
  • Phoenix total win percentage: 25 / 48 ~ 52.1%
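
The arithmetic behind these six numbers can be sketched in a few lines of Python. This is only an illustration: the function names and the dictionary layout are my own choices, not the code used for the analysis.

```python
def win_pct(wins, games):
    """Share of games won, as a fraction between 0 and 1 (None if no games)."""
    return wins / games if games else None

def simple_metrics(record):
    """record maps "home" and "road" to (wins, games played) tuples."""
    home_w, home_g = record["home"]
    road_w, road_g = record["road"]
    return {
        "home": win_pct(home_w, home_g),
        "road": win_pct(road_w, road_g),
        "total": win_pct(home_w + road_w, home_g + road_g),
    }

# Records taken from the Dallas / Phoenix example above.
dallas = {"home": (17, 25), "road": (12, 24)}
phoenix = {"home": (12, 24), "road": (13, 24)}

for name, record in (("Dallas", dallas), ("Phoenix", phoenix)):
    metrics = simple_metrics(record)
    print(name, {key: f"{value:.1%}" for key, value in metrics.items()})
```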

Discounted simple metrics
However, these statistics seemed a little too simplistic. A lot can happen in the course of a season. Stars can get injured or return from a long injury. A reshuffled roster might struggle to play together at first before really hitting its stride. So I included some new metrics with time discounting: a win early in the season shouldn’t weigh as heavily as one in the previous game. In the non-discounted world we kept track, for home games and road games separately, of the number of wins and number of losses, incrementing one or the other by 1 depending on whether the team won or lost. We do exactly the same here with a discount factor:

new_winning_performance = discount_factor * old_winning_performance + new_game_result

new_game_result is 1 if they won the game, 0 if they lost.

When setting the discount factor to 1 (no discounting), we are actually counting the number of wins and are back in the simple metrics framework.

To view the impact of discounting, let us walk through an example:
Let’s assume a team won its first 3 games then lost the following 3, and let us apply a discount factor of 0.9.

  • After winning the first game, the team’s performance is 1 (0 * 0.9 + 1)
  • After winning the second game, the team’s performance is 1.9 (1 * 0.9 + 1)
  • After winning the third game, the team’s performance is 2.71 (1.9 * 0.9 + 1)
  • After losing the fourth game, the team’s performance is 2.44 (2.71 * 0.9 + 0)
  • After losing the fifth game, the team’s performance is 2.20 (2.44 * 0.9 + 0)
  • After losing the sixth game, the team’s performance is 1.98 (2.20 * 0.9 + 0)

But now consider a team who lost their first three games before winning the next three:

  • After losing the first game, the team’s performance is 0 (0 * 0.9 + 0)
  • After losing the second game, the team’s performance is 0 (0 * 0.9 + 0)
  • After losing the third game, the team’s performance is 0 (0 * 0.9 + 0)
  • After winning the fourth game, the team’s performance is 1 (0 * 0.9 + 1)
  • After winning the fifth game, the team’s performance is 1.9 (1 * 0.9 + 1)
  • After winning the sixth game, the team’s performance is 2.71 (1.9 * 0.9 + 1)

Although both teams are 3-3, the sequence of wins and losses now matters: the team that finished on three wins sits at 2.71, while the team that finished on three losses sits at 1.98. A team might start winning more if a big star returns from injury, or due to a coaching change or some other reason, so our metrics should reflect important trend changes such as those. Unsure of what the discounting factor should be, I computed the metrics for various values.
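
The two walk-throughs above can be reproduced with a short Python sketch of the update rule. The function name and the 1/0 encoding of results are my own illustrative choices; only the recursion itself comes from the formula above.

```python
def discounted_performance(results, discount_factor=0.9):
    """Apply new = discount_factor * old + result after each game.

    results: sequence of 1 (win) or 0 (loss), oldest game first.
    Returns the list of performance values, one per game played.
    """
    performance = 0.0
    history = []
    for result in results:
        performance = discount_factor * performance + result
        history.append(performance)
    return history

wins_first = discounted_performance([1, 1, 1, 0, 0, 0])
losses_first = discounted_performance([0, 0, 0, 1, 1, 1])

print([round(p, 2) for p in wins_first])
print([round(p, 2) for p in losses_first])
```

Calling the function with discount_factor=1 simply counts wins, which is the simple-metrics case.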

Discounted opponent-adjusted metrics
A third type of metric I explored was one where the strength of the opponent is incorporated to compute a team’s current performance. In the above calculations, a win was counted as 1 and a loss as 0, no matter the opponent. But why should that be? Just like in chess with the Elo rating system, couldn’t we give more credit to a team beating a really tough opponent (like the Bulls snapping the Heat’s 27 consecutive wins), and be harsher when losing to a really weak team?
These new metrics were computed the same way as previously (with a discounting factor) but using the opponent’s performance instead of 0/1.

Let’s look at an example. The Thunder are playing at home against the Kings. The Thunder (pretty good team) have a current home win performance of 5.7 and a current home loss performance of 2.3. This leads to a “home win percentage” of 5.7 / (5.7 + 2.3) = 71%. The Kings (pretty bad team) have a current road win performance of 1.9 and a current road loss performance of 6.1. This leads to a “road win percentage” of 1.9 / (1.9 + 6.1) = 24%.

If the Thunder win:

  • The Thunder’s home win performance is now: 0.9 * 5.7 + 0.24 = 5.37
  • The Kings’ road loss performance is now: 0.9 * 6.1 + (1 - 0.71) = 5.78

If the Thunder lose:

  • The Thunder’s home loss performance is now: 0.9 * 2.3 + (1 - 0.24) = 2.83
  • The Kings’ road win performance is now: 0.9 * 1.9 + 0.71 = 2.42

If you win, you get credit based on your opponent’s win percentage. If you lose, you get penalized according to your opponent’s losing percentage (hence the “1 -” in the above formulas). The worst teams hurt you most if you lose to them.
As seen from the example between a very good team and a very bad one, winning does not guarantee that your win performance will increase (the Thunder’s drops from 5.7 to 5.37 after beating the Kings) and losing does not guarantee that your loss performance will increase (the Kings’ drops from 6.1 to 5.78). The purpose is not to have a strictly increasing function if you win, it is to get, at a given point in time, an up-to-date indicator of a team’s home and road performance.
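
Here is a minimal Python sketch of the four updates in the Thunder vs. Kings example. The function names are hypothetical, and the sketch only covers the counters shown in the example; it makes no claim about how the counter that receives no credit or penalty is treated after a game.

```python
DISCOUNT = 0.9  # discount factor used in the example

def pct(win_perf, loss_perf):
    """Performance-based win percentage."""
    return win_perf / (win_perf + loss_perf)

def credit_for_win(own_win_perf, opponent_win_pct, discount=DISCOUNT):
    """Winner's new win performance: credit equals the opponent's win percentage."""
    return discount * own_win_perf + opponent_win_pct

def penalty_for_loss(own_loss_perf, opponent_win_pct, discount=DISCOUNT):
    """Loser's new loss performance: penalty equals the opponent's losing percentage."""
    return discount * own_loss_perf + (1 - opponent_win_pct)

thunder_home = {"win": 5.7, "loss": 2.3}  # home games only
kings_road = {"win": 1.9, "loss": 6.1}    # road games only

thunder_pct = pct(thunder_home["win"], thunder_home["loss"])  # about 71%
kings_pct = pct(kings_road["win"], kings_road["loss"])        # about 24%

# If the Thunder win
thunder_win_after_win = credit_for_win(thunder_home["win"], kings_pct)
kings_loss_after_loss = penalty_for_loss(kings_road["loss"], thunder_pct)

# If the Thunder lose
thunder_loss_after_loss = penalty_for_loss(thunder_home["loss"], kings_pct)
kings_win_after_win = credit_for_win(kings_road["win"], thunder_pct)

print(round(thunder_win_after_win, 2), round(kings_loss_after_loss, 2))
print(round(thunder_loss_after_loss, 2), round(kings_win_after_win, 2))
```

The sketch uses the unrounded percentages (71.25% and 23.75%) rather than 0.71 and 0.24, which gives the same values once rounded to two decimals.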

The models

For all the models, the training data was all games except those of the most recent 2012-2013 season, which we will use to benchmark our models.

Very simple models were first used: how about always picking the home team to win? Or the team with the best record? Or compare the home team’s home percentage to the road team’s road percentage?
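
These three rules are simple enough to write down directly. The sketch below is illustrative only: the dictionary keys are hypothetical names, and sending ties to the home team is my assumption, since the post does not say how ties are handled. The sample game reuses the Dallas at Phoenix records from the Simple Metrics section.

```python
def pick_home_team(game):
    """Always pick the home team."""
    return "home"

def pick_best_record(game):
    """Pick the team with the better overall win percentage (ties: home, an assumption)."""
    return "home" if game["home_total_pct"] >= game["road_total_pct"] else "road"

def pick_home_vs_road(game):
    """Compare the home team's home percentage to the road team's road percentage."""
    return "home" if game["home_home_pct"] >= game["road_road_pct"] else "road"

# Dallas (road team) visiting Phoenix (home team).
game = {
    "home_total_pct": 25 / 48,  # Phoenix overall
    "road_total_pct": 29 / 49,  # Dallas overall
    "home_home_pct": 12 / 24,   # Phoenix at home
    "road_road_pct": 12 / 24,   # Dallas on the road
}

for rule in (pick_home_team, pick_best_record, pick_home_vs_road):
    print(rule.__name__, "->", rule(game))
```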

I then looked into logistic regression models in an attempt to link all the above-mentioned metrics to our outcome of interest: “did the home team win the game?”. Logistic models are commonly used for binary 0/1 outcomes.
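
To make the idea concrete, here is a sketch of how a fitted logistic model turns two winning percentages into a prediction. The coefficients below are made-up placeholders, not the values fitted in this analysis, and the 0.5 cut-off is an assumption on my part.

```python
import math

def home_win_probability(home_pct, road_pct, intercept, beta_home, beta_road):
    """Logistic model: P(home team wins) = 1 / (1 + exp(-z))."""
    z = intercept + beta_home * home_pct + beta_road * road_pct
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only (NOT fitted values).
INTERCEPT, BETA_HOME, BETA_ROAD = 0.4, 3.0, -3.0

# Phoenix (52.1% overall) hosting Dallas (59.2% overall).
p = home_win_probability(0.521, 0.592, INTERCEPT, BETA_HOME, BETA_ROAD)
prediction = "home" if p > 0.5 else "road"
print(f"P(home win) = {p:.3f}, predicted winner: {prediction}")
```

A positive intercept plays the role of home-court advantage: with two equally strong teams, the model still leans towards the home team.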

I then looked into machine learning methods, starting with classification and regression trees (CART). Without going into the details, a decision tree will try to link the regressor variables to the outcome variable by a succession of if/else statements. For instance, I might have tracked over a two-week vacation period whether my children decided to play outside or not in the afternoon, and also kept note of the weather conditions (sky, temperature, humidity,...). The resulting tree might look something like:

If it rains without wind tomorrow, I would therefore expect them to be outside again!
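
Written as code, a tree of this kind is nothing more than nested if/else statements. The splits and the humidity threshold below are invented for illustration; they are chosen only so that rain without wind leads to playing outside, as in the sentence above.

```python
def will_play_outside(sky, windy, humidity):
    """A made-up decision tree: each branch is one if/else split."""
    if sky == "sunny":
        # Hypothetical split on humidity (threshold chosen arbitrarily).
        return humidity <= 70
    elif sky == "overcast":
        return True
    else:  # rain
        return not windy

print(will_play_outside(sky="rain", windy=False, humidity=80))
```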

Finally, I also used random forest models. For those not familiar with random forests, the statistical joke behind the name is that a forest is simply a whole bunch of decision trees. Multiple trees like the one above are “grown”. To make a prediction for a set of values (rain, no wind), I would look at the outcome predicted by each tree (play, play, no play, play….) and pick the most frequent prediction.
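
The voting step can be sketched as follows. The four toy "trees" are hand-written stand-ins; in a real random forest each tree is grown on a bootstrap sample of the data with a random subset of variables considered at each split, and that part is not shown here.

```python
from collections import Counter

def forest_predict(trees, observation):
    """Ask every tree for a prediction and return the most frequent one."""
    votes = [tree(observation) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Four toy trees (hypothetical), each a function of the observation.
trees = [
    lambda obs: "play" if not obs["windy"] else "no play",
    lambda obs: "play" if obs["sky"] != "sunny" else "no play",
    lambda obs: "no play" if obs["sky"] == "rain" else "play",
    lambda obs: "play",
]

observation = {"sky": "rain", "windy": False}
print(forest_predict(trees, observation))
```

For this observation the individual votes are play, play, no play, play, so the forest predicts "play".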

The results

As mentioned previously, the models established were then put to the test on the 2012-2013 NBA season.

As most of the models outlined above use metrics that require some historical data, I can’t predict the first game of the season, since I have no observations of past games for the two teams (yes, I do realize that the title of the post was a little misleading :-) ). I only included games for which I had at least 10 previous home games for the home team and 10 previous road games for the road team.

Home team
Let’s start with the very naive approach of always going for the home team. If you had used this approach in 2013, you would have correctly guessed 60.9% of all games.

Best overall percentage
How about going with the team with the best absolute record? We get a bump to 66.3% of all games correctly predicted in 2013.

Home percentage VS Road percentage
How about comparing the home team’s home percentage with the road team’s road percentage? 65.3% of games are correctly guessed this way. Quite surprisingly, this method provides a worse result than simply comparing overall records.

Logistic regressions
Many different models were tested here but I won’t detail each. Correct guesses range from 61.0% to 66.6%. The best-performing model was the simplest one, which only included the intercept and the overall winning percentage for the home team and the road team. The inclusion of the intercept explains the minor improvement over the 66.3% observed in the “Best overall percentage” section.
Very surprisingly, deriving sophisticated metrics discounting old performances in order to get more accurate readings on a team’s performance did not prove to be predictive.

Decision tree
Results with a decision tree were in the neighborhood of the logistic regression models, with 65.8% of games correctly predicted.

Interestingly, the final model only looks at two variables and splits: whether the home team's home performance percentage (adjusted with a discounting factor of 0.5) is greater than 0.5 or not, and whether the road team's performance incorporating opponent strength (discounting factor of 1, so no discounting actually) is greater than 0.25 or not.

Random Forests
All our hopes for a significant improvement in predictive power rested on the random forests. Unfortunately we obtain 66.2%, right where we were with the decision tree, the regression model and, more shamefully, the model comparing the two teams’ overall win-loss records!
When looking at which variables were most important, home team and road team overall percentages came up in the first two positions.


The results are slightly disappointing from a statistical point of view. From the simplest to the most advanced techniques, none is able to break the 70% threshold of correct predictions. I will want to revisit the analyses and try to clear that bar. It is rather surprising to see how much overall standings matter compared to metrics that are more time-sensitive. The idea that the first few games of the season matter to predict the last few games of the season is a great insight. One interpretation is that good teams will eventually lose a few games in a row, even against bad teams, but that should not be taken too seriously. The same goes for bad teams winning a few games against good teams: they remain bad teams intrinsically.

From an NBA fan perspective, the results are beautiful. It shows you why this game is so addictive and generates so much emotion and tension. Even when great teams face terrible opponents, no win is guaranteed. Upsets are extremely common and every game can potentially become one for the ages!