Thursday, January 28, 2016

X-files: The stats are out there

Everybody knows the TV series X-files. We all saw at least one episode in the '90s.
I was never a huge fan, but I definitely watched my share of episodes back then. So when I heard the series was kicking off again with the same two stars, David Duchovny and Gillian Anderson, thirteen years after the last episode, I thought I'd take a closer look at how the original series performed, and whether it had stopped at a level of popularity sufficient to warrant a follow-up. (It's important to note that this is not a prequel or a reboot: it simply continues where we left off.)

Here's the evolution of the popularity as measured by IMDB rating of each episode:

The first striking feature is the volatility in episode rating. In an old post I looked at similar plots for other popular TV series and the ratings were much more stable (over 80% of Friends' episodes were rated between 8.0 and 8.5). But here we can easily have a 6.2 episode follow an 8.9 one.
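One way to put numbers on that volatility is sketched below. It is only an illustration: the function name `rating_volatility`, the choice of metrics (standard deviation, share of episodes inside a rating band, largest jump between consecutive episodes) and the two rating lists are all assumptions made for the example, not the actual IMDB data behind the plot.

```python
from statistics import pstdev

def rating_volatility(ratings, low=8.0, high=8.5):
    """Summarize how spread out a list of episode ratings is."""
    # Population standard deviation of the ratings.
    spread = pstdev(ratings)
    # Fraction of episodes whose rating falls inside [low, high].
    in_band = sum(low <= r <= high for r in ratings) / len(ratings)
    # Largest change between two consecutive episodes.
    max_jump = max(abs(b - a) for a, b in zip(ratings, ratings[1:]))
    return spread, in_band, max_jump

# Made-up ratings for illustration only (NOT real IMDB data).
stable = [8.1, 8.3, 8.2, 8.4, 8.0, 8.3]
volatile = [8.9, 6.2, 8.4, 7.1, 9.0, 7.5]

s_spread, s_band, s_jump = rating_volatility(stable)
v_spread, v_band, v_jump = rating_volatility(volatile)
print(f"stable:   sd={s_spread:.2f}, in band={s_band:.0%}, max jump={s_jump:.1f}")
print(f"volatile: sd={v_spread:.2f}, in band={v_band:.0%}, max jump={v_jump:.1f}")
```

A series like the "stable" list keeps every episode inside the 8.0 to 8.5 band, while the "volatile" list shows the kind of 8.9-to-6.2 drop described above.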

As I previously mentioned, I was never a huge fan, but I do remember that X-files episodes typically fell into one of two buckets: those focusing on the pretty complex alien conspiracy, and the one-off independent episodes where Mulder and Scully investigated a weird murder somewhere. It could be that the volatility is partially explained by these two different episode types. Unfortunately, classifying episodes according to this definition and comparing their ratings is no simple task...

We can gain a different perspective by looking at Nielsen data, indicating how many viewers tuned in to each episode and what share of market the episodes had each week.

Rating points:


The storyline is somewhat different here. It would appear X-files started rather modestly in popularity, reached a peak during the 4th/5th season, and then slowly trended back down, which most likely caused the end of the series in 2002.

The two new episodes have IMDB ratings of 8.6 and 8.7. It will definitely be interesting to see whether those ratings, relatively high for X-files episodes, are authentic reflections of episode quality or just come from avid fans who have been waiting 14 years for Mulder and Scully to reunite!

Also to note: as I was writing this post, the new season was originally classified as a new series starting at season one, but it has now been merged into the original series as season 10.

Wednesday, January 20, 2016

How retro is Star Wars Episode 7: The Force Awakens?

Essentially all the reviews I have read of the latest Star Wars installment, The Force Awakens, discuss how well J.J. Abrams was able to rekindle the spirit of the original trilogy (Episodes 4, 5, 6).

It's sufficiently retro that even George Lucas spoke out against it for that reason!

Of course, as mentioned in the excerpt from Christopher Orr's review, we do see many familiar faces we had grown attached to in the first trilogy and missed in the prequel trilogy. But is bringing some of our old friends back into the script enough to recreate the environment we last saw over 20 years ago? Or was J.J. Abrams really able to go beyond the quick solution and put his heart and mind into reviving the magic?

There are many ways to approach the question, but I wanted one that would rely more on statistics than on a degree in film studies. My angle was to perform a clustering analysis on the seven movies, relying solely on the dialogue. I found the scripts online. The formatting differed from movie to movie, so it was slightly painful to get everything into a suitable format...

I then characterized each movie by the frequency of words, and looked at which movies were most similar to each other in terms of what is said. A few comments:
  • frequency of occurrence was normalized by number of words so comparisons were fair (episodes 5, 6 and 7 have much less dialogue than episodes 1 through 4)
  • common English words, typically referred to as stopwords in text analysis ("and", "I", "in", "was"...), were removed
  • planet and character names were also removed (Han Solo doesn't even exist in episodes 1-3)
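
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code actually used for this post: the stopword list is a tiny placeholder, the Euclidean distance and average linkage are my own choices for the example (the distance and linkage used for the real analysis are not specified here), and the four "scripts" are made-up toy strings.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"and", "i", "in", "was", "the", "a", "to", "of", "is", "you"}

def word_profile(text, excluded=frozenset()):
    """Relative word frequencies after dropping stopwords and excluded names."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS and w not in excluded]
    counts = Counter(words)
    total = sum(counts.values())
    # Dividing by the total makes long and short scripts comparable.
    return {w: c / total for w, c in counts.items()} if total else {}

def distance(p, q):
    """Euclidean distance between two frequency profiles (an assumed metric)."""
    vocab = set(p) | set(q)
    return sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in vocab) ** 0.5

def hierarchical_clusters(profiles, n_clusters):
    """Average-linkage agglomerative clustering, stopped at n_clusters."""
    clusters = [[name] for name in profiles]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        pairs = [(x, y) for x in a for y in b]
        return sum(distance(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Find the two closest clusters (i < j) and merge them.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy "scripts" for illustration only, not real movie dialogue.
scripts = {
    "A": "the jedi council and the senate trade vote jedi",
    "B": "jedi senate council vote trade federation jedi",
    "C": "the rebel base and the empire fleet attack",
    "D": "rebel fleet attack the empire base now",
}
profiles = {name: word_profile(text) for name, text in scripts.items()}
clusters = hierarchical_clusters(profiles, 2)
print(clusters)
```

Character and planet names would be passed through the `excluded` argument, for example `word_profile(text, excluded={"han", "solo"})`.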
The results from the hierarchical clustering are plotted here:

What does this tell us? Well, it seems that from a word-frequency analysis alone, one could almost reconstruct the trilogies! Episodes 1, 2 and 3 were lumped together on the left, and episodes 4, 5 and 6 together on the right. As for episode 7, the algorithm added it to the original trilogy cluster.

What does this mean? It would appear that, purely from a dialogue perspective, J.J. Abrams and the other writers did a very impressive job of maintaining the original look and feel of the first trilogy by using the same lexical field. As an example, in Episodes 1, 2 and 3, the word "jedi" appears in the dialogue approximately 11.9 times for every 1000 (non-stop) words. In contrast, that value is 2.6 for Episodes 4, 5 and 6. For The Force Awakens, the value is 2.4. This is one of the many signals that led the clustering algorithm to find Episode 7 "closer" to the first trilogy (Episodes 4, 5 and 6) than to the prequels.
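
For readers who want to reproduce a figure like "11.9 per 1000 non-stop words", here is a rough sketch of the arithmetic. The helper name `rate_per_thousand`, the short stopword list and the sample sentence are all hypothetical; the numbers quoted above came from the full scripts, not from this snippet.

```python
import re

# Placeholder stopword list for the example; not the list used for the post.
STOPWORDS = {"and", "i", "in", "was", "the", "a", "to", "of", "is", "you"}

def rate_per_thousand(word, text, stopwords=STOPWORDS):
    """Occurrences of `word` per 1000 non-stopword tokens in `text`."""
    tokens = [w for w in re.findall(r"[a-z']+", text.lower())
              if w not in stopwords]
    if not tokens:
        return 0.0
    return 1000 * tokens.count(word.lower()) / len(tokens)

# Made-up sentence: 4 non-stopword tokens, 2 of which are "jedi".
sample = "The Jedi and the jedi council was in session"
rate = rate_per_thousand("jedi", sample)
print(rate)
```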

From the modest statistical perspective of the dialogue alone, kudos to J.J. Abrams for a well-executed "retro" style!

Stay tuned: this post will be updated quite regularly over the next few years...