Datamining the next series to watch - part 1
Lots of stuff out there
I tried to go the reviews & recommendation route, but in the end, if I don't like a series, it doesn't matter that some guy on a remote island commented "cool series fam 10/10 would recommend". So there went half a week of jerking around trying to figure out what to watch.
So what's the dealy yo?
After several unsuccessful hours looking for what I would waste thousands of hours of my life on, I decided to watch nothing and went out to discover the world, travel and take pilates hiking courses while eating vegan food.
Just kidding, I found another way.
What makes a good series?
- Watching stuff means I can analyze video and images: cool story bro, I have a Pentium 2 machine with 512 MB of RAM, which means I can't unleash deep learning mayhem on 1080p videos
- Hearing stuff means I can analyze audio: sure, will do, but audio libraries are a PITA to use; I'm on it though
- Reading stuff: Bingo! NLP Textual Analysis LDA Semantic High Learning Deep Learning In-between Learning (buzzwords for some Google juice). So yeah, I went the subtitle way, to stay as close as possible to the content I am analyzing, and as far as possible from some subjective representation of it
As you can see, it contains information about when somebody talks, for how long, and also what they say. This simple text file holds so much data it would make any advertising company (aka social network) very happy! I wish there were something like it in real life, so you could listen to people while listening to some Kanye West or some Drake.
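For the curious, squeezing that data out of a subtitle file takes only a few lines of Python. Here's a minimal sketch of parsing one cue's timing line (the timestamp format is standard .srt; the helper names are mine):

```python
from datetime import datetime

def parse_timestamp(ts):
    # .srt timestamps look like "00:03:12,500" (hours:minutes:seconds,millis)
    t = datetime.strptime(ts, '%H:%M:%S,%f')
    return t.hour * 3600 + t.minute * 60 + t.second + t.microsecond / 1e6

def cue_talk_time(timing_line):
    # A cue's timing line looks like "00:03:12,500 --> 00:03:15,000"
    start, end = timing_line.split(' --> ')
    return parse_timestamp(end) - parse_timestamp(start)

# That cue above is 2.5 seconds of somebody talking
duration = cue_talk_time('00:03:12,500 --> 00:03:15,000')
```

Sum those durations over a whole file and you have the episode's talk time.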
- DONE Find the data
- DONE Get the data
- DONE Clean the data
- DONE Preprocess the data
- DONE Store the data
- DONE Query the data
- DONE Define what I want from the data
- DONE Do some stuff with the data
Executing the master plan
Let me save you some time here and tell you that there are a million sources for subtitles online, but the best ones are:
Booting up Eclipse
import os
import io
import zipfile
import requests

def get_subtitle(series_name, series_id, seasons):
    # Download and unpack one zip of English subtitles per season
    for i in range(seasons):
        season = str(i + 1)
        download_link = ('http://fr.tvsubtitles.net/download-'
                         + series_id + '-' + season + '-en.html')
        r = requests.get(download_link)
        z = zipfile.ZipFile(io.BytesIO(r.content))
        directory = series_name + '/Season ' + season
        if not os.path.exists(directory):
            os.makedirs(directory)
        z.extractall(directory)
Cool bro, now what?
Enough chit chat, the results please
The average talk time for a Dexter episode is 30 minutes, and the average episode duration is 50 minutes. The talk density, defined as talk time over episode duration, is 30/50 = 0.6, which means 60% of the series is talking. You can notice that the talk time goes up as the plot advances, but not by much. I'm still digging into the data, but I'm pretty sure this is because they repeat a lot of stuff from previous episodes.
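In code, the density is just summed cue durations over episode runtime. A quick sketch, with toy numbers standing in for Dexter's real cue list:

```python
def talk_density(cue_durations_min, episode_minutes):
    # Total talk time (minutes) divided by episode runtime (minutes)
    return sum(cue_durations_min) / episode_minutes

# Toy example: 600 cues of 3 seconds each = 30 minutes of talk
# in a 50-minute episode, i.e. Dexter's 0.6 density
density = talk_density([3 / 60] * 600, 50)
```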
Next series! Sex and the City.
Ok so I don't like this series at all, and you'll know exactly why in the second part of the article. In the meantime, here is the time distribution:
The average talk time is 15 minutes and the episode duration is around 25 minutes. The talk density for this one is 15/25, which gives us 60% too! So Sex and the City shows us that size doesn't matter: it has the same density as Dexter, and both series have had some arguable success with different demographics.
Let's mess with everyone's favorite now: Game of Thrones.
Nice curves! Notice how one is going down and the other is going up. I think there was something sketchy that came with the subtitles at episode 0 of Season 5. They started counting in Python! Yay!
Sweet, we have some preliminary results and the program works well. But this is just the tip of the ice cream cone.
Let's dig deeper!
Let's see how that looks with movies. I picked out the top 10 4chan movies, and here are the results:
One thing we can derive from the above: it really doesn't matter whether there is a lot of talk (Pulp Fiction, Fight Club...) or almost none (2001: A Space Odyssey, Lord of the Rings, Blade Runner...); there is a lot more to a good series or movie than just the words.
Let's look at the Star Wars episodes. The times differ from the ones you'll find online because I don't take the credits and other stuff into account.
It's cool to see that even though the duration stays roughly the same, the talk time goes down! DOWN! DOWN! There is lots of stuff that could explain this behavior. I like to think that over the last decades, our visual cortex grew bigger while other stuff got compressed. This means, now more than ever, that we are visual creatures.
PS: I didn't like the last Episode that much.
If you're still reading, here's a potato for your patience.
It goes deeper
Up to now, this was minor stuff. Now it gets interesting. What if, instead of watching the stuff, we only read the subtitles? How many life-hours would we waste instead of helping other countries or advancing research?
So here it is.
Useless knowledge: The average person reads 250 words per minute, which means you already wasted 7 minutes.
To calculate these values, I took ALL the subtitles available for each series. Mr. Robot only has one season, for example.
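The reading-time numbers come from a simple conversion: count the words in all the subtitle files and divide by 250 words per minute. A sketch (the word count here is illustrative, not any series' real total):

```python
def reading_hours(total_words, wpm=250):
    # Hours needed to read all subtitle text at the average reading speed
    return total_words / wpm / 60

# e.g. 150,000 subtitle words = 600 minutes = 10 hours of reading
hours = reading_hours(150000)
```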
This doesn't speak for itself, so let's put it into context like Steve Jobs would. A good example would be to see how thick a book of the subtitles would be. Here comes some math:
Two things before we continue, I don't use INCHES and my multiplications and additions are commutative. Not like that one dude explaining how (°ロ°)☝ 3 + 3 + 3 + 3 + 3 is not the same as 5 + 5 + 5 lol.
So here you have it: if you took the whole Game of Thrones series, printed out the subtitles, and sold it as a book, it wouldn't be thicker than 4 stacked Playboy magazines (this dude was kind enough to measure the thickness).
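The thickness estimate is plain arithmetic: words, then pages, then millimetres. The per-page figures below are my own assumptions (roughly 300 words per pocket-book page, and about 0.1 mm per sheet of paper), not measured values:

```python
def book_thickness_mm(total_words, words_per_page=300, sheet_mm=0.1):
    # Pages of text, printed on both sides of each sheet,
    # times the thickness of one sheet
    pages = total_words / words_per_page
    sheets = pages / 2
    return sheets * sheet_mm

# e.g. 600,000 words -> 2,000 pages -> 1,000 sheets -> about 100 mm
thickness = book_thickness_mm(600000)
```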
Each red line represents a full pocket-sized book, by the way; you can meditate on the other things in your free time.
aaaaaand ༼ つ ◕_◕ ༽つ
My writing-as-therapy hits its end here. I still have a lot I want to share, but it's not well formatted yet, so stay tuned if you want to know what makes a good series/movie. I found some characteristics using a mix of sound, video, and text. If I'm too lazy, just send me a message and we can talk code like ultra nerds.
But here's a little preview for part 2, which will be available on this blag:
- Every point of view is unique; counting stars or simply AVERAGING review counts or ratings is fundamentally flawed for series, even if your SVMs and KNNs go their way. This approach works for politics. ლ(´ڡ`ლ)
- Them feels, not them bits
- Closer to the content aka DEEPER!
- Exploring hidden paths
Note: If you plan to aggressively throw the dataset into an RNN, generate a series or movie, and record it: first of all, you're awesome, and second, I'd be glad to help you with the scenario and setup!
Thanks for reading, you're awesome!