Wednesday, January 20, 2016

How retro is Star Wars Episode 7: The Force Awakens?

Essentially all the reviews I have read of the latest Star Wars installment, The Force Awakens, discuss how well J.J. Abrams was able to rekindle the spirit of the original trilogy (Episodes 4, 5, 6).


It's sufficiently retro that even George Lucas spoke out against it for that reason!

Of course, as mentioned in Christopher Orr's review excerpt, we do see many familiar faces we had grown attached to in the original trilogy and missed in the prequel trilogy. But can bringing some of our old friends back into the script be enough to recreate the environment we last saw over 20 years ago? Or was J.J. Abrams really able to go beyond the quick fix and put his heart and mind into reviving the magic?


There are many ways to approach the question, but I wanted one that would rely more on statistics than on a degree in film studies. My angle was to perform a clustering analysis on the seven movies, relying solely on the dialogue. I found the scripts at http://www.imsdb.com/scripts/. Formatting differed from movie to movie, so it was slightly painful to get everything into a suitable format...

I then characterized each movie by the frequency of words, and looked at which movies were most similar to each other in terms of what is said. A few comments:
  • frequency of occurrence was normalized by the number of words so that comparisons were fair (episodes 5, 6 and 7 have much less dialogue than episodes 1 through 4)
  • common English words typically referred to as stopwords in text analysis ("and", "I", "in", "was"...) were removed
  • planet and character names were also removed (Han Solo doesn't even exist in episodes 1-3)
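The three preprocessing steps above can be sketched in a few lines of Python. This is only an illustration, not the code used for the post: the stopword list, the name list, the tokenizing rule and the function name word_frequencies are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; the post does not say
# which list was actually used.
STOPWORDS = {"and", "i", "in", "was", "the", "a", "to", "of", "is", "you"}

# Assumption: a hypothetical handful of character and planet names to drop.
NAMES = {"han", "solo", "luke", "leia", "tatooine"}


def word_frequencies(dialogue, per=1000):
    """Return occurrences per `per` remaining words after filtering."""
    # Lowercase and split into alphabetic tokens (a simplistic tokenizer).
    tokens = re.findall(r"[a-z']+", dialogue.lower())
    # Drop stopwords and names before counting.
    kept = [t for t in tokens if t not in STOPWORDS and t not in NAMES]
    if not kept:
        return {}
    counts = Counter(kept)
    # Normalize by the number of remaining words so that movies with
    # different amounts of dialogue can be compared.
    return {word: per * n / len(kept) for word, n in counts.items()}


sample = "Luke, the Force is strong in you. The Jedi trust the Force."
freqs = word_frequencies(sample)
```

Here freqs maps each remaining word to its rate per 1000 non-stop words, which is the same unit used for the "jedi" figures quoted further down.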
The results from the hierarchical clustering are plotted here:

What does this tell us? Well, it seems that from a word-frequency analysis alone, one could almost reconstruct the trilogies! Episodes 1, 2 and 3 were lumped together on the left, and episodes 4, 5 and 6 together on the right. As for episode 7, the algorithm added it to the original trilogy cluster.
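For readers curious about how such a grouping can be computed, here is a minimal sketch of agglomerative (bottom-up) hierarchical clustering on word-frequency profiles. The post does not state which distance or linkage was used, so cosine distance and average linkage are assumptions, as are the function names. The four profiles "A" to "D" are made-up toy values, not numbers from the scripts.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse word -> frequency dicts."""
    dot = sum(u.get(w, 0.0) * v.get(w, 0.0) for w in set(u) | set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / (norm_u * norm_v)  # assumes non-empty profiles


def agglomerate(profiles):
    """Average-linkage clustering; returns the list of merges in order."""
    clusters = {name: [name] for name in profiles}  # cluster key -> members
    merges = []

    def linkage(a, b):
        # Average pairwise distance between members of clusters a and b.
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        total = sum(cosine_distance(profiles[x], profiles[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Find the two closest clusters and merge them.
        a, b = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        merges.append((tuple(clusters[a]), tuple(clusters[b]), linkage(a, b)))
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
    return merges


# Toy profiles (hypothetical values per 1000 words, for illustration only).
profiles = {
    "A": {"jedi": 12.0, "senate": 5.0, "force": 3.0},
    "B": {"jedi": 11.0, "senate": 6.0, "force": 2.0},
    "C": {"jedi": 2.5, "rebel": 7.0, "force": 4.0},
    "D": {"jedi": 2.4, "rebel": 6.0, "force": 5.0},
}
merges = agglomerate(profiles)
```

Each entry of merges records which two groups were joined and at what distance, which is the information a dendrogram displays. With these toy profiles, A joins B first, then C joins D, and the two pairs are merged last.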

So what can we conclude? Well, it would appear that, purely from a dialogue perspective, J.J. Abrams and the other writers did a very impressive job of maintaining the look and feel of the original trilogy by using the same lexical field. As an example, in Episodes 1, 2 and 3, the word "jedi" appears in the dialogue approximately 11.9 times for every 1000 (non-stop) words. In contrast, that value is 2.6 for Episodes 4, 5 and 6. For The Force Awakens, the value is 2.4. This is one of the many signals which led the clustering algorithm to find Episode 7 "closer" to the original trilogy (Episodes 4, 5 and 6) than to the prequels (Episodes 1, 2 and 3).

From the modest statistical perspective of the dialogue, kudos to J.J. Abrams for a well-executed "retro" style!

Stay posted, this post will be updated quite regularly in the next few years...





