Datamining the next series to watch - part 1


I tried to kind of automate the process of deciding which movie or series to watch next and I ended up writing this instead.

Lots of stuff out there

Winter came and is still coming hard here in Germany, and to fight the daily boredom and the long nights, what better than a series with 13 seasons of 50-minute episodes? (¬‿¬)
One thing though, it's ultra boring to find which one of those bazillion series to watch. There are reviews, recommendations, pseudo random generators and even gurus you can call that tell you what to do. 

I tried to go the reviews & recommendations way, but in the end, if I don't like the series, it doesn't matter that some guy on a remote island commented "cool series fam 10/10 would recommend". So there goes half a week of jerking around to find out what to watch.


So what's the dealy yo?

After several unsuccessful hours looking for what I would waste thousands of hours of my life on, I decided to watch nothing and went out to discover the world, travel and take pilates hiking courses while eating vegan food.

Just kidding, I found another way.


What makes a good series?

When you watch a series you actually hear stuff and see stuff, and sometimes read stuff if you didn't take that language class back in the day.
  • Watching stuff means that I can analyze video and images: cool story bro, but I only have a Pentium 2 with 512 MB of RAM, which means I can't unleash deep learning mayhem on 1080p videos
  • Hearing stuff means that I can analyze audio: sure, will do, but audio libraries are a PITA to use. I'm on it though
  • Reading stuff: Bingo! NLP textual analysis LDA Semantic High Learning Deep Learning In between learning (buzzwords for some google juice). So yeah, I went the subtitle way to stay as close as possible to the content I am analyzing, and as far as possible from some subjective representation of it
Sweet, let's go for the subtitles. A subtitle file is just numbers, timestamps and some text (almost like a parking ticket). It looks like this (from South Park):

As you can see, it contains information about the duration of somebody talking, when they talk and also what they say. This simple text file has so much data it would make any advertising company (aka social network) so happy! I wish there was something like it but in real life, so you can listen to people while listening to some Kanye West or some Drake.
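To make that structure concrete, here's a minimal sketch of reading such a file with nothing but the standard library. It assumes the common SubRip (.srt) layout: a counter line, a "start --> end" timestamp line, then one or more text lines, with blank lines between entries. The sample text and the names `parse_srt` and `to_timedelta` are made up for illustration; this is not the South Park excerpt and not necessarily how my own code does it.

```python
import re
from datetime import timedelta

# Made-up sample in SubRip (.srt) layout; NOT the excerpt from the post.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:05,000 --> 00:00:09,250
This is a second line
spread over two rows.
"""

TIMESTAMP = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


def to_timedelta(stamp):
    """Convert 'HH:MM:SS,mmm' into a timedelta."""
    h, m, s, ms = (int(x) for x in TIMESTAMP.match(stamp.strip()).groups())
    return timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms)


def parse_srt(text):
    """Split subtitle text into a list of cues (index, start, end, text)."""
    cues = []
    # Drop a possible byte-order mark, then split on blank lines.
    for block in re.split(r"\n\s*\n", text.lstrip("\ufeff").strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed or empty entries
        start, end = (to_timedelta(part) for part in lines[1].split("-->"))
        cues.append({
            "index": int(lines[0]),
            "start": start,
            "end": end,
            "text": " ".join(lines[2:]),
        })
    return cues


cues = parse_srt(SAMPLE_SRT)
```

Each cue then tells you when somebody talks (`start`), for how long (`end - start`) and what they say (`text`).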

Planning mayhem

For starters, here's a "slaughter plan", as the Germans call it:
  • DONE Find the data 
  • DONE Get the data
  • DONE Clean the data
  • DONE Preprocess the data
  • DONE Store the data
  • DONE Query the data
  • DONE Define what I want from the data
  • DONE Do some stuff with the data
Notice the Emacs org-mode goodness going on.
It's always useful to plan the steps at a high level; it's the only way you'll achieve something.

Executing the master plan

Let me save you some time here and tell you that there are a million sources for subtitles online but the best ones are:


The second one is the most entertaining one since they no scope any ad blocker out there to display ads so that you can meet sensational Russian women. ʕ•ᴥ•ʔ
So ok, which series should I choose?
Just like any IT nerd, I went on reddit and asked for help. Let's start with the best of breed and ask: what are the worst TV series according to you?
I got a lot of comments on that one, which already shows that it's also a tough choice to decide which series NOT to watch. Oh boy.

Data Dictatorship

I had to decide, so I went with the subtitles for Bones, Breaking Bad, Criminal Minds, Dexter, Game of Thrones, House MD, House of Cards, How I Met Your Mother, Louie, Mr. Robot, Sex and the City, South Park, Suits, The Big Bang Theory, The Vampire Diaries and The Walking Dead. (I promise I'll do Star Trek too, in a second sprint)
This small sample was to try and find out how those series are structured, and this is what I'll be showing in part 1 of this little data exploration.

Booting up Eclipse

I had to start somewhere so I obviously decided I'll install java. Joke aside, here's some code that'll get you the subtitles for a series for just one website. The imports are not so difficult to figure out. You go ahead and code the rest young padawan!


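    import io
    import os
    import zipfile
    import requests

    def get_subtitle(series_name, series_id, seasons):
        for i in range(seasons):
            season = str(i + 1)
            # The site's base URL is left blank here on purpose
            download_link = '' + series_id + '-' + season + '-en.html'
            r = requests.get(download_link)
            z = zipfile.ZipFile(io.BytesIO(r.content))  # unzip in memory
            directory = series_name + '/Season ' + season
            if not os.path.exists(directory):
                os.makedirs(directory)
            z.extractall(directory)  # one folder per season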


I got the data, but data from the internet is as dangerous as unprotected sex. So I had to be cautious and check for stuff like encoding, formatting, extensions, file (☞゚∀゚)☞ size, language and stuff like that.
But that is part of the data cleaning process which is boring and which NOBODY EVER tells you about in school. You always get that sequoia quechua plant data set which I call a unicorn dataset. You will never get any other dataset that is so clean from the interwebz. 
So, several HOURS of frustration later, we did it guys. The data is clean and ready to be analyyyyyzed. 
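For the curious, here's a rough sketch of what a few of those checks (extension, file size, encoding, basic formatting) could look like with the standard library only. The thresholds, the list of encodings and the function names are assumptions for illustration, not the exact rules I ended up with, and language detection is left out because it needs an extra library.

```python
import os

# Assumed values; tune them for your own files.
ALLOWED_EXTENSIONS = {".srt"}
MIN_BYTES = 1000
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252")


def decode_subtitle(raw):
    """Try the candidate encodings in order and return (text, encoding)."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so it never fails (last resort).
    return raw.decode("latin-1"), "latin-1"


def check_subtitle(filename, raw):
    """Return a list of problems for one file; an empty list means it passed."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append("unexpected extension: " + ext)
    if len(raw) < MIN_BYTES:
        problems.append("file too small")
    text, _ = decode_subtitle(raw)
    if "-->" not in text:
        problems.append("no timestamp arrows found")
    return problems
```

You'd call `check_subtitle(name, open(path, "rb").read())` on every downloaded file and throw away (or fix by hand) anything that comes back with a non-empty list.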

Cool bro, now what?

Let's define the goals of the analysis. This step requires that you already looked at the data and also that you have 2 neurons dedicated to creativity and open-mindedness. 
I don't half-ass my work so here are the features I define as important:


• Talk time: How much talk is there in a series?
• Talk frequency: How frequently do they talk and when?
• Episode/Movie duration: How long was the thing?
• Idle time: Slideshow time (only pictures on screen)
• Number of words
• Number of sentences
• Most used words
• Word length
• Sentence length
• Vocabulary richness: How much $$ the vocabulary owns
• Time to read: How long would it take you to read the subs?
• SMOG grade: How old do you have to be to understand it?
• Topic modeling: What are the topics?
• Summary: Ain't nobody got time for that
• Polarity: Doesn't involve bears
• Word usage: How are the words used?
• Sentence beginnings: How are the sentences structured?
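A few of the simpler features in that list can be sketched in plain Python. Two assumptions here: "vocabulary richness" is taken to be the type-token ratio (unique words divided by total words), and sentences are split naively on ., ! and ?. The real project leans on the libraries listed further down; `text_features` is just an illustrative name.

```python
import re
from collections import Counter


def text_features(text, top_n=3):
    """Compute a handful of basic text statistics for one subtitle text."""
    # Naive sentence split on ., ! and ? (assumption; a real tokenizer does better).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not words or not sentences:
        return None  # nothing to measure
    counts = Counter(words)
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        # Type-token ratio: assumed definition of "vocabulary richness".
        "vocabulary_richness": len(counts) / len(words),
        "most_used": counts.most_common(top_n),
    }


features = text_features("I like pie. I like cake! Do you?")
```

For that toy sentence you get 8 words, 3 sentences and a richness of 0.75, since 6 of the 8 words are distinct.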
In order to do all of this stuff, I headed over to the awesome Python community and I was staggered by how much GOOD OL Stuff is out there. The dataset is just begging to be analyzed.
Here's a short recap of all the libraries involved in the project:
• Python has libraries, lots of them, using Python 2.7 (problem?)
• Parse subtitles: pysrt
• Parse text: textblob, sumy (text summarizer), re
• Analyze text: gensim (topic modelling), spacy (kick ass nltk), readability etc
• Various data manipulation: pandas, numpy
• Display and plots: Wordcloud, matplotlib

• Download: requests, BeautifulSoup, zipfile    


Don't worry, this will be available on my Github repo I guess someday, but it's the most hackathon code you'll ever have the privilege to load on your V8 engine. 

Enough chit chat, the results please 

My friends know, I like to talk a lot when I feel good, and right now, I feel (ノ◕ヮ◕)ノ*:・゚✧ delighted ✧゚・: *ヽ(◕ヮ◕ヽ). 
Ok, so I tested on that series called DEXTER, I bet you know it right? 
So here's how it looks:


The average talk time for a Dexter episode is 30 minutes, and the average duration of an episode is 50 minutes. The talk density, defined as talk time divided by duration (30/50), is 0.6, which means 60% of the series is talking. You can see that the talk time goes up the further we get into the plot, but not by much. I'm still digging into the data, but I suspect this is because they repeat a lot of stuff from the previous episode.
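Here's a minimal sketch of that calculation, assuming each subtitle entry is a (start, end) pair in minutes and that overlapping entries should only be counted once. The function names and the choice to pass the duration in by hand are assumptions for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs so no talk is counted twice."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def talk_stats(intervals, duration):
    """Return talk time, idle time and talk density for one episode."""
    talk = sum(end - start for start, end in merge_intervals(intervals))
    return {
        "talk_time": talk,
        "idle_time": duration - talk,
        "talk_density": talk / duration,
    }


# Toy numbers that reproduce the 30/50 ratio above.
stats = talk_stats([(0, 10), (5, 20), (30, 40)], duration=50)
```

With those toy intervals the talk time is 30, the idle time is 20 and the density is 0.6.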

Next series! Sex and the City. 

Ok so I don't like this series at all, and you'll know exactly why in the second part of the article. In the meantime, here is the time distribution:

The average talk time is 15 minutes and the episode duration is around 25 minutes. The talk density for this one is 15/25, which gives us 60% too! So, Sex and the City shows us that size doesn't matter, since it has the same density as Dexter, and both series have some arguable success for different demographics. 

Let's mess with everyone's favorite now: Game of Thrones. 

Nice curves! Notice how one is going down and the other is going up. I think there was something sketchy that came with the subtitles at episode 0 of Season 5. They started counting in Python! Yay!

Sweet, we have some preliminary results and the program works well. But this is just the tip of the ice cream cone.

Let's dig deeper! 

Let's see what that looks like with movies. I picked out the top 10 4chan movies, and here are the results:

One thing we can derive from the above stuff is: It really doesn't matter if there is a lot of talk (Pulp Fiction, Fight Club..) or almost none (Space Odyssey, Lord of the Rings, Blade Runner..), there is a lot more to a good series/movie than just the words. 


Let's look at the Star Wars episodes. The times differ from the ones you'll find online because I don't take into account the credits and other stuff.

It's cool to know that even if the duration stays roughly the same, the talk time goes down! DOWN! DOWN! There's a lot of stuff that could explain this behavior. I like to think that during the last decade, our visual cortex grew bigger while other stuff got compressed. This means, now more than ever, that we are visual creatures.
PS: I didn't like the last Episode that much. 

If you're still reading, here's a potato for your patience. 

It goes deeper

Up to now, it was some minor stuff. Now it gets interesting. What if, instead of watching the stuff, we only read the subtitles? How much life-hours would we waste instead of helping the other countries or advancing research?

So here it is. 

Useless knowledge: The average person reads 250 words per minute, which means you already wasted 7 minutes.
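The arithmetic behind that is a one-liner. A tiny sketch, assuming a flat 250 words per minute for everybody (the function name is made up):

```python
def reading_time_minutes(word_count, words_per_minute=250):
    """Estimate how long it takes to read a text, in minutes."""
    return word_count / words_per_minute


# 7 minutes at 250 words per minute corresponds to 1750 words.
minutes = reading_time_minutes(1750)
```

Feed it the word count of a full season of subtitles and you get the "time to read" feature from the list above.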

In order to calculate these values, I took ALL the subtitles available for a given series. Mr. Robot has only 1 season, for example.

This doesn't speak for itself, so let's put it into context like Steve Jobs. A good example would be to see how thick a subtitles book would be. Here's some maths:

Two things before we continue, I don't use INCHES and my multiplications and additions are commutative. Not like that one dude explaining how (°ロ°)☝ 3 + 3 + 3 + 3 + 3 is not the same as 5 + 5 + 5 lol.

So there you have it: if you took the whole Game of Thrones series, printed out the subtitles, and sold it as a book, it wouldn't be thicker than 4 stacked Playboy magazines (this dude was kind enough to measure the thickness)
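If you want to redo the book maths yourself, here's a hedged sketch. It assumes a double-sided print, and you have to bring your own words-per-page and paper thickness. The function name, the parameters and the numbers in the example call are placeholders, not the values behind my chart.

```python
import math


def book_thickness_mm(word_count, words_per_page, sheet_thickness_mm):
    """Estimate the thickness of a printed book, ignoring the cover."""
    pages = math.ceil(word_count / words_per_page)
    sheets = math.ceil(pages / 2)  # two pages per sheet of paper
    return sheets * sheet_thickness_mm


# Placeholder numbers, only there to show the call.
thickness = book_thickness_mm(60000, words_per_page=300, sheet_thickness_mm=0.1)
```

No inches were harmed in the making of this function.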

Each red line represents a full pocket-sized book by the way. You can meditate on the other things in your free time.


Sweet, now that we did some weird stuff, let's plot some eye candy and do some word clouds. Those clouds were generated with an online JavaScript thingie. Analyzing them is up to you, but here you go, only for the first seasons:

aaaaaand ༼ つ ◕_◕ ༽つ

My writing-as-therapy session ends here. I still have a lot I want to share, but it's not well formatted yet, so stay tuned if you want to know what makes a good series/movie. I found some characteristics using a mix of sound/video and text. If I turn out to be too lazy to write it up, just send me a message and we can talk code like ultra nerds.


I did some NLP stuff, but it gets really complicated to plot it and show it in a user-friendly way. My JavaScript skills go about as far as console.log, and don't even get me started on React.

But here's a little preview for part 2 though, that will be available on this blag:


  • Every point of view is unique. Counting stars, or simply AVERAGING review counts or ratings, is fundamentally flawed for series, even if your SVMs and KNNs go their way. This approach works for politics. ლ(´ڡ`ლ)
  • Them feels, not them bits 
  • Closer to the content aka DEEPER!
  • Exploring hidden paths
Note: If you plan to aggressively throw the dataset into an RNN and generate a series or movie and record it, first of all you're awesome and second, I'd be glad to help you with the scenario and set up! 


Thanks for reading, you're awesome!