Tuesday, August 20, 2013

Boston: City of Champions

This is what I saw the other day at the Boston Airport (minus a hundred other people taking their shoes off and laptops out of carry-ons):



Banners for every championship won by a Boston team (Celtics, Bruins, Patriots, Red Sox) hang from the ceiling right before security screening. My daughter noticed the banners too and asked if they were ordered by color. I replied that they were actually ordered by date, according to when each championship was won.

But taking a second look at the ordered banners showed that her interpretation was not very far off the mark.



5 red banners, 3 yellow, 11 green, 2 yellow, 5 green, 3 blue, 2 red, 1 green, 1 yellow... Banners of the same color tend to sit next to each other (with some slight variation depending on the relative ordering of the third Patriots Super Bowl championship and a Red Sox title). So the natural question is whether any conclusions can be drawn from the fact that the colors seem to be grouped together.
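As a quick sanity check on that tally, here is a minimal Python sketch that expands the run lengths into a banner sequence and counts the color groups. The single-letter color codes and the helper name count_clusters are my own shorthand for illustration.

```python
from itertools import groupby

# Run lengths as read off the banners, in hanging order.
# Letter codes are shorthand introduced here: R=red, Y=yellow, G=green, B=blue.
runs = [("R", 5), ("Y", 3), ("G", 11), ("Y", 2), ("G", 5),
        ("B", 3), ("R", 2), ("G", 1), ("Y", 1)]

# Expand the runs into one entry per banner.
banners = [color for color, n in runs for _ in range(n)]

def count_clusters(seq):
    """Number of maximal runs of identical consecutive items."""
    return sum(1 for _ in groupby(seq))

print(len(banners), count_clusters(banners))  # prints: 33 9
```

So the 33 banners fall into 9 color groups, which means 8 color changes along the row.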

It could very well be that this is all purely coincidental: with only four teams capable of winning championships, you would expect the same team at some point to win two titles without one of the other three winning in between. But would the groupings be so obvious? An alternative hypothesis is that every so often one of the teams dominates its sport and, for a certain period of time, wins far more than the other three Boston teams. When a team wins in a given year, it has a much greater probability of winning the next year than in a random year. Color clusters would then be proxies for team dominance during a certain era.


First Approach

To determine which of the two explanations is more likely, I ran a few simulations. By a few I mean a million. I considered 17 Celtics championships, 7 Red Sox championships, 6 Bruins championships and 3 Patriots championships, then randomly sampled without replacement. Our measure of clustering is simply the number of clusters observed. The minimum is 4, when each team wins all its titles without interruption (17 green, 7 red, 6 yellow and 3 blue, or 6 yellow, 3 blue, 17 green and 7 red, or...). The maximum is 33, when teams keep interrupting each other:
G-Y-G-Y-G-Y-G-Y-G-Y-G-Y-G-B-G-B-G-B-G-R-G-R-G-R-G-R-G-R-G-R-G-R-G.
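
The sampling scheme just described can be sketched in a few lines of Python. This is a reconstruction under my own assumptions, not the code used for the post: the function names, the fixed seed and the reduced number of runs (10,000 instead of a million) are all illustrative choices.

```python
import random
from itertools import groupby

# Title counts from the text: Celtics (G), Red Sox (R), Bruins (Y), Patriots (B).
TITLES = {"G": 17, "R": 7, "Y": 6, "B": 3}

def count_clusters(seq):
    """Number of maximal runs of identical consecutive items."""
    return sum(1 for _ in groupby(seq))

def simulate_cluster_counts(n_sims, seed=0):
    """Shuffle the 33 banners n_sims times and record the cluster count each time.

    Shuffling the full list is equivalent to drawing every banner from the
    urn without replacement.
    """
    rng = random.Random(seed)  # seed is an arbitrary choice for reproducibility
    banners = [team for team, n in TITLES.items() for _ in range(n)]
    counts = []
    for _ in range(n_sims):
        rng.shuffle(banners)
        counts.append(count_clusters(banners))
    return counts

counts = simulate_cluster_counts(10_000)
print(min(counts), sum(counts) / len(counts), max(counts))
```

Raising n_sims to 1,000,000 should approximate the experiment described here, at the cost of a longer run time.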

In Boston history we have 9 clusters, which is definitely on the low side. The histogram for a million simulations yields:




The average and median number of clusters were both greater than 22. Not only is our observation of 9 (red vertical line) much lower, but not a single one of our 1000000 simulations yielded a value less than 10! Talk about a p-value!


Second Approach

We definitely simplified the true dynamics of championship winning by treating the titles as an urn of championships from which we drew. A second approach would be to look year by year at the probability of a team winning a championship.

Based on our observations of the last 114 years (assuming a start date in 1900), the Celtics won 17 times, the Red Sox 7 times, the Bruins 6 times and the Patriots 3 times. We can therefore use these empirical probabilities to generate what-if scenarios where each year the probability of any team winning is independent of the past (no team dominance).
With this simple approach I assume that at most a single team can win in any given year. Starting in 1900, I flip a very biased five-sided coin where Celtics come up with probability 17/114, Red Sox 7/114, Bruins 6/114, Patriots 3/114 and Nobody 81/114. I then look again at how many clusters are obtained.
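
Here is a minimal Python sketch of that biased five-sided coin. Again this is my own illustration rather than the original code: the names, the seed and the 10,000 runs are assumptions, and so is the choice to drop title-less years before counting clusters, which mirrors the fact that only championships get a banner.

```python
import random
from itertools import groupby

N_YEARS = 114
# Weights out of 114 from the text; None stands for a year with no Boston title.
OUTCOMES = ["Celtics", "Red Sox", "Bruins", "Patriots", None]
WEIGHTS = [17, 7, 6, 3, 81]

def simulate_history(rng):
    """Draw one winner (or None) per year, independently of earlier years.

    Returns (number of championships, number of clusters).
    """
    years = rng.choices(OUTCOMES, weights=WEIGHTS, k=N_YEARS)
    # Assumption: years without a title hang no banner, so they do not
    # break up a cluster.
    champions = [team for team in years if team is not None]
    clusters = sum(1 for _ in groupby(champions))
    return len(champions), clusters

rng = random.Random(0)  # seed is an arbitrary choice for reproducibility
results = [simulate_history(rng) for _ in range(10_000)]

mean_titles = sum(n for n, _ in results) / len(results)
print(mean_titles)  # should land near 33, since the weights match the real counts
```

Unlike the urn, the number of championships now varies from one simulated history to the next, so each run returns both the number of titles and the number of clusters.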



In the first approach we sampled without replacement from an urn with 33 championships. In our new approach, however, we could get no championship at all or a championship every single year! We can't just compare the number of clusters, and should instead look at a normalized measure: # transitions / # championships. In real life the ratio is 8/33 ~ 0.2424. In our new sample of 1000000 simulations we were able to generate a few cases that outperformed real life, where 8 transitions were also observed but with 34 and even 36 championships. Some very low ratios were also observed when the number of championships won was only about half the real-life total (4 transitions and 19 championships, this was huge Celtics dominance!). That being said, only 39 out of 1000000 simulations yielded an equal or lower ratio. Here's a plot of the distribution of the transitions-to-championships ratio, with the red line indicating what we observed in Boston:




So no matter how simple our approaches, we have definitely put forward some evidence of sports dominance by Boston's teams (and luckily for our analysis, Boston's teams did not dominate at the same time too often, or that would have created some high-frequency alternating between two teams, thus breaking our clustering metric).

I don't believe my three-year-old grasped all the subtleties involved here, but I definitely hope all those bright colors will get her interested in stats!