Saturday, April 28, 2012

Who will be the 2012 NBA champs?

As of today, and after a shortened but game-packed season, the 2012 NBA playoffs are finally underway!

Let's take a look at the playoff bracket:

Some very interesting match-ups ahead!

But in addition to trying to catch as many games as possible, I also wanted to take a stab at predicting who would become the 2012 NBA champions of course!

Just as in the previous posts, I will start off with a very simple model and work from there to improve its reliability.

So, what simple model can we establish to predict the outcome of a match-up between two teams? Let's simply consider the number of victories each team obtained during the season and derive a probability of winning a game from there.

Let us assume team 1 won 50 games and team 2 won 40 games. I would then roughly estimate that team 1's probability of winning a game against team 2 is 50 / (50 + 40) ~ 56%.

As in many other sports, each match-up is a best-of-seven, meaning that the first team to win 4 games gets to advance to the next round. Therefore, if team 1 has a probability p of winning a game against team 2, team 1's probability of winning the match-up is:

P(win match-up) = p^4 (1 + 4 (1 - p) + 10 (1 - p)^2 + 20 (1 - p)^3)

Indeed, team 1 needs to win 4 games hence the p^4, and team 2 can win anywhere from 0 to 3 games with probability (1 - p). Depending on the number of games team 2 wins, the number of arrangements varies, yielding the 1, 4, 10 and 20. In the previous example, team 1 has an overall probability of winning the series of 62%, despite having won 25% more games during the season. This very simple framework can help explain the many upsets that are regularly witnessed.
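
Here is a minimal Python sketch of the two steps above: the crude per-game probability from season wins, then the best-of-seven formula. The function names are mine, chosen for illustration, and this is not the script used for the tables below.

```python
from math import comb


def game_win_prob(wins_a: int, wins_b: int) -> float:
    """Crude per-game probability that team A beats team B, from season win totals."""
    return wins_a / (wins_a + wins_b)


def series_win_prob(p: float, wins_needed: int = 4) -> float:
    """Probability of winning a first-to-`wins_needed` series with per-game probability p."""
    q = 1.0 - p
    # The opponent wins k games (k = 0 .. wins_needed - 1) before our clinching win.
    # comb(wins_needed - 1 + k, k) counts the orderings of the games played before
    # the clincher, which gives the 1, 4, 10 and 20 of the formula for a best-of-seven.
    return sum(
        comb(wins_needed - 1 + k, k) * p**wins_needed * q**k
        for k in range(wins_needed)
    )


p = game_win_prob(50, 40)
print(f"per-game: {p:.3f}, series: {series_win_prob(p):.3f}")
```

With 50 and 40 wins this gives the roughly 56% per-game and 62% series probabilities quoted above.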

So now that we can figure out the probability of team 1 winning its first match-up, we can go to the next step and figure out the probability that, once it has advanced, it will win the next series against the winner of the team 3 - team 4 match-up:

P(team 1 wins second round) = P(team 1 beats team 3) * P(team 3 beats team 4) + P(team 1 beats team 4) * P(team 4 beats team 3)
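
As a rough sketch of how this propagates through a bracket, here is a four-team example in Python. The team labels and win totals are hypothetical, the helper names are mine, and multiplying by the first-round probability to get the chance of surviving both rounds is my reading of the formula above rather than the actual script.

```python
from math import comb


def series_win_prob(p, wins_needed=4):
    """Best-of-seven (first to 4) series probability from per-game probability p."""
    q = 1.0 - p
    return sum(comb(wins_needed - 1 + k, k) * p**wins_needed * q**k
               for k in range(wins_needed))


def beats(wins, a, b):
    """Probability that team a wins a series against team b, using season wins only."""
    return series_win_prob(wins[a] / (wins[a] + wins[b]))


def second_round_prob(wins, t1, t3, t4):
    """P(t1 wins its second-round series), given that t1 reached the second round."""
    return (beats(wins, t1, t3) * beats(wins, t3, t4)
            + beats(wins, t1, t4) * beats(wins, t4, t3))


def survive_two_rounds(wins, t1, t2, t3, t4):
    """P(t1 beats t2 in round one, then beats the winner of t3 vs t4)."""
    return beats(wins, t1, t2) * second_round_prob(wins, t1, t3, t4)


# Hypothetical regular-season win totals, for illustration only.
wins = {"T1": 50, "T2": 40, "T3": 45, "T4": 42}

probs = {
    "T1": survive_two_rounds(wins, "T1", "T2", "T3", "T4"),
    "T2": survive_two_rounds(wins, "T2", "T1", "T3", "T4"),
    "T3": survive_two_rounds(wins, "T3", "T4", "T1", "T2"),
    "T4": survive_two_rounds(wins, "T4", "T3", "T1", "T2"),
}
for team, prob in probs.items():
    print(team, f"{prob:.3f}")
```

A useful sanity check is that the four probabilities add up to 1, since exactly one of the four teams comes out of this part of the bracket.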

I started this script this morning when none of the games had been played yet, and obtained the following probabilities for each team of becoming the NBA champions:

NBA team Champion Probability
CHI 15.0%
SAS 14.4%
OKC 11.3%
MIA 10.5%
IND 6.7%
LAL 5.8%
MEM 5.4%
ATL 5.0%
LAC 4.7%
BOS 4.3%
DEN 3.7%
ORL 3.3%
NYK 2.7%
DAL 2.6%
UTA 2.4%
PHI 2.2%

I re-ran the script after the results of today's first four games (good thing I'm single) and obtained the following similar results:

NBA team Champion Probability
CHI 17.9%
OKC 14.1%
SAS 13.6%
MIA 13.0%
LAL 5.2%
MEM 5.0%
ORL 4.5%
ATL 4.4%
LAC 4.4%
IND 4.1%
BOS 3.8%
DEN 3.4%
UTA 2.2%
NYK 1.6%
DAL 1.5%
PHI 1.2%

Winning the first game did bump up the teams by a couple of percentage points, and Oklahoma City has now pulled in front of San Antonio.

But wait! Before you run to the closest sports betting bar to put down all your money on the Bulls, you should be aware that the current model doesn't account for certain external events such as... Derrick Rose tearing his ACL in the Bulls' first game. The Bulls played great this year without Derrick in the lineup, but his absence is definitely going to hurt their chances...

Another surprise is the relatively low probabilities for the top teams of the season. Looking at the top 4 contenders, their cumulative probability of winning the title is barely over 50%. Again, this explains some of the regular surprises we see every now and again (every other year theoretically ;-) ).

There are many other caveats with this simple model. Homecourt advantage is not taken into account, nor is the fact that certain teams, having already secured their playoff position, rested their star players and lost games that were of no importance.

Nevertheless, I will try to continue improving the model and address the current limitations. Naturally, I will also regularly post updated probabilities as the playoffs progress.

Now back to the replay of Kevin Durant's shot...

Thursday, April 26, 2012

Predicting France's next president?

We're quite literally in the middle of the French Elections, a perfect opportunity to try to predict the new president 10 days ahead of time!

Before we jump into the model, a few words on the French system, thankfully much simpler than the US one!

French elections 101

The election is a two-step process. During the first step, called the "first round", all candidates are eligible, and each French voter casts his vote for one of them.

After this first round, the two candidates having received the most votes go to the "second round" and are the only two eligible candidates at this point. This second round takes place exactly two weeks after the first round. Today we are right between the two rounds, and the two remaining candidates are current president Nicolas Sarkozy seeking his second term (left picture), and François Hollande (right picture).


The polls have been pretty much spot on in predicting that Nicolas and François would battle in the second round, with François Hollande having a slight advantage in first round votes.

So, can we predict who will win the second round?

I looked at historical results for the past six presidential elections (1974, 1981, 1988, 1995, 2002 and 2007), recording for each candidate first round and second round percentage of votes.

The model aims at computing the probability of becoming president for the candidate receiving the most votes in the first round.

Now out of the six past elections, the first round vote leader won only 3 elections with the challenger winning the other 3. So looking at the difference in first round percentage votes is not sufficient.

Based on various theories on elections, it is also important to consider the percentage of votes received by other eliminated candidates with close affinities to the round-two candidates. Now with only 6 observations, it is difficult to introduce many variables, but I decided to add one more in addition to the first round delta in percentage votes between the two leading candidates. This second variable is the delta between the percentage votes of the two leading candidates' closest neighbours on the political scale. Let me explain based on an example:

Let us rank the 1995 first round candidates by left-right political affinity:

Candidate            First Round %       Sum closest two
Arlette Laguiller             5.30                  8.66
Robert Hue                    8.66                 28.60
Lionel Jospin                23.30                 11.98
Dominique Voynet              3.32                 41.87
Edouard Balladur             18.57                 24.16
Jacques Chirac               20.84                 23.31
Philippe de Villiers          4.74                 35.84
Jean-Marie Le Pen            15.00                  5.02
Jacques Cheminade             0.28                 15.00

For each candidate I then computed the sum of the first round percentages of the two candidates immediately to the left and to the right on the political scale.

And the variable I introduce is the delta of this sum metric between the first round leader and the runner-up. So in 1995, the first round leader was Lionel Jospin, his first round percentage delta with second vote leader Jacques Chirac was 2.46 (23.30 - 20.84), and the "closest candidate delta" was -11.33 (11.98 - 23.31).
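
For reference, here is a small Python sketch of how the two variables can be computed from the 1995 table. The percentages are the ones listed above; the function and variable names are mine, and a candidate at either end of the scale is assumed to have only one neighbour, which is what the table implies.

```python
# First round percentages from the 1995 table, ordered left to right politically.
first_round_1995 = [
    ("Arlette Laguiller", 5.30),
    ("Robert Hue", 8.66),
    ("Lionel Jospin", 23.30),
    ("Dominique Voynet", 3.32),
    ("Edouard Balladur", 18.57),
    ("Jacques Chirac", 20.84),
    ("Philippe de Villiers", 4.74),
    ("Jean-Marie Le Pen", 15.00),
    ("Jacques Cheminade", 0.28),
]


def neighbour_sums(ordered):
    """For each candidate, sum the shares of the candidates immediately left and right."""
    shares = [share for _, share in ordered]
    sums = {}
    for i, (name, _) in enumerate(ordered):
        left = shares[i - 1] if i > 0 else 0.0                 # no neighbour at the far left
        right = shares[i + 1] if i < len(shares) - 1 else 0.0  # none at the far right
        sums[name] = round(left + right, 2)
    return sums


shares = dict(first_round_1995)
sums = neighbour_sums(first_round_1995)

leader, runner_up = "Lionel Jospin", "Jacques Chirac"
first_round_delta = round(shares[leader] - shares[runner_up], 2)
closest_candidate_delta = round(sums[leader] - sums[runner_up], 2)
print(first_round_delta, closest_candidate_delta)
```

This reproduces the "Sum closest two" column and the two deltas used for 1995.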

Model results

With these variables, I built a quick logistic model to estimate the probability of the first round leader to win the second round as a function of "first round delta" and "closest candidate delta".

Applying the model to the 2012 first round results indicates that the president for the next five years will be....

Nicolas Sarkozy !

Now, despite the small number of observations, I decided to exclude one of them which could be seen as an outlier. Indeed, in 2002 the extreme right party created a monumental surprise by reaching the second round. The second round became a right VS extreme right instead of the usual right VS left battle. And that year Jacques Chirac won the second round with an unprecedented 82% of votes whereas the values usually reside in the 45%-55% range.

Excluding that observation, the model was a perfect fit for the five remaining observations and predicted that the president for the next five years will be....

Nicolas Sarkozy !

Wait until May 6th to criticize...

Now, I could not agree more with the criticism this approach deserves for fitting three parameters (two variables plus the intercept) when we only have five or six observations.

But the objective here is not to publish in a stats journal, just to play around with the data. And all the polls indicate that François Hollande will be the next president. So in 10 days we'll see if this method that predicts Nicolas Sarkozy isn't as faulty as it would initially appear...

Monday, April 23, 2012

NBA player rankings

So in addition to boardgames, I am also a big NBA fan, and luckily NBA and stats mix really well.
A topic that has often been covered is how to rank players? Which player has the greatest impact? Which is the greatest player of all time (well, that one's easy ;-) )?

The debates rage because the questions are so vague and open to interpretation. What does it mean for a player to be a better basketball player than another? Does it mean having better stats? If I score more, rebound more, assist more, steal more, turnover less, clearly I am better than you? Another approach is to compare win/loss records when the player plays or sits out, although for most players it will be difficult to have a good sample size for the "sitting out" observations. A new metric that has emerged and solves this "sitting out" problem is the plus/minus statistic, which keeps track of the score before and after a player enters the game. So say a player enters the game with game tied at 10, and leaves it with his team ahead by 5. That's +5 for him. He re-enters the game with his team ahead by 10, and leaves it (without coming back) with his team only ahead by 2. That's -8. With the earlier +5 that's an overall -3 for that player in that game.
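
The plus/minus bookkeeping in that example can be sketched in a few lines of Python. Representing each stint as a (lead at entry, lead at exit) pair is my own choice for illustration.

```python
def plus_minus(stints):
    """Sum the change in the team's lead over each (lead_at_entry, lead_at_exit) stint."""
    return sum(exit_lead - entry_lead for entry_lead, exit_lead in stints)


# The example above: enters with the game tied and leaves up 5,
# then re-enters up 10 and leaves up 2.
stints = [(0, 5), (10, 2)]
print(plus_minus(stints))  # -3
```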

Today I wanted to look at a different approach, a more statistical approach. I have no clue where it will lead me, but after different trials and errors and tweaking here and there I hope to come up with a new interesting way to rank players.

Ultimately, what we care most about is wins. Sure it's great to score 100 in a game, but if you lose that game that's just wasted effort. So the idea is to find a relationship between a player's efforts and the impact they have on the game. In other words, how do a player's stats in a game change the probability of winning the game?

In terms of the data, I looked at the past six seasons (not including 2011-2012), and for each player looked at his stats with the game outcome for all games played. I only considered players with at least 50 wins and 50 losses, playoffs not included.

As our variable of interest is a probability (of winning the game), we naturally turn towards a logistic regression. We are not directly modeling the probability as a linear combination of the covariates but rather the log odds: log(P(win) / (1 - P(win))). The interpretation of the coefficients will not be entirely straightforward but will still allow us to rank players. Which player has the greatest coefficient, and has the greatest impact on the log odds and thus the probability of winning the game?
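
To make the log odds interpretation concrete, here is a minimal Python sketch of a one-covariate logistic model. The intercept and slope are hypothetical values picked for illustration, not coefficients fitted on the NBA data.

```python
from math import exp, log


def log_odds(p):
    """Log odds of a probability p: log(p / (1 - p))."""
    return log(p / (1 - p))


def win_probability(intercept, slope, x):
    """Inverse of the log odds: P(win) when log odds = intercept + slope * x."""
    return 1 / (1 + exp(-(intercept + slope * x)))


# Hypothetical coefficients for one player, with points scored as the covariate.
intercept, slope = -0.5, 0.04

for points in (10, 11, 20):
    p = win_probability(intercept, slope, points)
    print(points, round(p, 3), round(log_odds(p), 3))
```

Going from 10 to 11 points adds exactly the slope to the log odds, whatever the starting point, while the change in probability itself depends on where you start. Ranking players by coefficient amounts to ranking them by that per-unit change in log odds.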

Well it depends on our covariates. Since we do want to find an easy way to rank, it's best to only consider one covariate.


Let's naively only consider points scored. How does scoring an extra point improve the log odds?
The top 5 impactful players are (in order): Calvin Booth, Greg Ostertag, Antonio Davis, Anderson Varejao and Bruce Bowen.

All metrics

Points might be too restrictive, since a player can have an impact without scoring. So let's consider (points + rebounds + steals + assists + blocks - turnovers), referred to from here onwards as "all metrics", as the covariate.
The top 5 impactful players are (in order): Bruce Bowen, Calvin Booth, Eddie Griffin, Kevin Durant, Antonio Davis.

Minutes played

If we were to suspect that a player's impact is difficult to track with simple metrics only, let us take minutes played as a proxy for everything observed and not observed (good defense, good picks...)
In this last case, the top 5 impactful players are (in order): Kevin Durant, Othella Harrington, Gerald Wallace, Eddie Griffin, Zach Randolph.

Where are the superstars?

It's interesting to see the same names come up, and to notice that aside from Kevin Durant, none of the players have superstar status.

Talking of superstars, where are they in the rankings?

Out of the 479 players considered, here is how some superstars ranked respectively for points, all metrics, and minutes played:
Kobe Bryant: 363, 333, 479
LeBron James: 247, 101, 475
Kevin Garnett: 396, 416, 478



As I was mentioning, I am discovering these results in almost real time with you, and am still a little unclear how to interpret them myself. There are a lot of things that could hurt the analysis, namely the fact that the coefficients are here interpreted as "change in log odds for an additional unit increase in the covariate". But an additional point for Kobe isn't exactly the same thing as an additional point for Eddie Griffin.

We also have a case pointed out in "Superfreakonomics" about ranking good surgeons. Looking at patient death rate, for instance, can be misleading because of selection bias. People with more critical conditions will go see the better surgeons, and precisely because of those critical conditions they raise those surgeons' death rates. Bad doctors only seeing healthy patients will have impeccable track records. Similarly in basketball, it could be argued that when the game is on the line you will go to your superstars, who will have to play exceptionally well to win the game, whereas you might put all your bench in the game when the game has already been won for a while.

There is definitely room for improvement, but I will continue to explore this approach to try to identify lesser known players that have strong yet unnoticeable impacts on the game.
Stay tuned!

Thursday, April 19, 2012

Introduction to Dominion

My first post on this brand new blog devoted to Statistics is going to be about a board game I recently discovered, Dominion.

And, surprisingly, there will be no statistics. Why? First, because this site is not necessarily going to be stats only (this is the first post so still hard to figure what the exact trend in topics is going to be). Secondly, because I think that there will be quite a few posts over time on Dominion-related statistics, so I felt it would make sense to introduce the basic concepts of the game first and in a separate post for easy reference.

A very quick primer on Dominion (Disclaimer: Rules are not 100% accurate and sometimes purposely simplified)

Dominion was released in 2008 and was an instant hit in sales and awards, receiving the prestigious 2009 Spiel des Jahres award.

The game is a deck-building game played only with cards. That is to say, players start with a pre-defined set of cards and turn after turn will lose, gain, buy cards. Whoever has the most victory point cards at the end of the game is the winner.

A player's cards can either be in his drawing deck, his hand or his discard pile. Typically, the player draws 5 cards from the deck to constitute his hand, then plays on his turn with the cards in his hand, and after his turn discards his hand (and cards purchased) in the discard pile. When the drawing deck is empty, the discard pile is reshuffled and becomes the new drawing deck.

There are essentially three types of cards:
  • treasure cards: These are your money, and allow you to buy cards, including other treasure cards. There are three treasure cards:
    • coppers that cost nothing to buy, and are worth 1
    • silver that cost 3 to buy, and are worth 2
    • gold that cost 6 to buy and are worth 3
  • action cards: There is a very wide variety of these cards, which allow players to attack other players, defend yourself against other attacks, buy more cards, draw more cards, the list is very long!
  • victory cards: These are the cards that you ultimately care about to win the game. Again, there are three types of victory cards:
    • estates that cost 2 to buy, worth 1 victory point
    • duchies that cost 5 to buy, worth 3 victory points
    • provinces that cost 8 to buy, worth 6 victory points
On his turn a player plays action cards, then makes purchases. The winner is the player with the most victory points at the end of the game, which is reached when the last Province is purchased. What is very peculiar with Dominion is that when you make a purchase, you keep the money spent, and both money and card(s) purchased go in the discard pile. This explains why silver and gold cost more to buy than they are worth when played. They should be purchased in order to increase the average value of your hand.

What do I mean by that last sentence? Recall that your hand consists of five cards. If I only have coppers (and go after coppers like crazy since they are free to buy), my best hand for purchase will be 5 coppers which will not allow me to buy many interesting cards, let alone Provinces that cost eight. When I start buying silvers and golds (which do not cause a negative profit since I keep the money spent to buy those cards), my best hands can get quite valuable. If I now have a hand with 3 coppers, a silver and a gold, I can buy my first Province!
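
As a quick illustration of that hand-value arithmetic, here is a small Python sketch. The card values and the Province cost come from the primer above; the deck compositions are hypothetical, and the helper names are mine.

```python
COIN_VALUE = {"copper": 1, "silver": 2, "gold": 3}
PROVINCE_COST = 8


def hand_value(hand):
    """Total buying power of a hand; non-treasure cards count for nothing."""
    return sum(COIN_VALUE.get(card, 0) for card in hand)


def average_hand_value(deck, hand_size=5):
    """Expected buying power of a random hand: hand size times the mean card value."""
    return hand_size * hand_value(deck) / len(deck)


hand = ["copper"] * 3 + ["silver", "gold"]
print(hand_value(hand), hand_value(hand) >= PROVINCE_COST)

# Two hypothetical ten-card decks, for illustration only.
coppers_only = ["copper"] * 10
mixed = ["copper"] * 6 + ["silver"] * 2 + ["gold"] * 2
print(average_hand_value(coppers_only), average_hand_value(mixed))
```

The coppers-only deck never gets past 5 per hand, while swapping a few coppers for silvers and golds pushes the average hand up to Province territory.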

So as the game progresses, you actually dread coppers that clutter your deck. Each one makes up 20% of your hand, but has low buying power.

Enough on the Dominion basic rules, hopefully this should be enough to appreciate the upcoming posts!