I tried to go the reviews & recommendations way, but in the end, if I don't like the series, it doesn't matter that some guy on a remote island commented "cool series fam 10/10 would recommend". So there goes half a week jerking around to find out what to watch.
After several unsuccessful hours looking for what I would waste thousands of hours of my life on, I decided to watch nothing and went out to discover the world, travel and take pilates hiking courses while eating vegan food.
Just kidding, I found another way.
Let me save you some time here and tell you that there are a million sources for subtitles online but the best ones are:
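    import io, os, zipfile
    import requests

    def get_subtitle(series_name, series_id, seasons):
        # series_id is the site's id for the series, passed as a string
        for i in range(seasons):
            season = str(i + 1)
            # One zip archive of English subtitles per season
            download_link = 'http://fr.tvsubtitles.net/download-' + series_id + '-' + season + '-en.html'
            r = requests.get(download_link)
            z = zipfile.ZipFile(io.BytesIO(r.content))
            directory = series_name + '/Season ' + season
            if not os.path.exists(directory):
                os.makedirs(directory)  # assumed: the original snippet is cut off after the if
            z.extractall(directory)  # assumed: unpack the season's subtitles into that folder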
• Download: requests, BeautifulSoup, zipfile
The average talk time for a Dexter episode is 30 minutes, and the average duration of an episode is 50 minutes. The talk density, defined as talk time divided by episode duration, is 30/50 = 0.6, which means 60% of the series is talking. You can notice that the talk time goes up as the plot advances, but not by much. I'm still digging into the data, but I'm sure this is due to the fact that they repeat a lot of stuff from the previous episode.
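For the curious, here is a minimal sketch of how talk time could be pulled out of an .srt file. It assumes "talk time" means the summed on-screen duration of every subtitle cue, which the post doesn't spell out, so `talk_seconds` and `talk_density` are hypothetical names and not necessarily how the numbers above were produced. Python 3 is assumed too.

```python
import re

# Matches an SRT timing line such as "00:01:02,500 --> 00:01:05,000"
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(h, m, s, ms):
    """Convert the four captured timestamp fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def talk_seconds(srt_text):
    """Sum the on-screen duration of every subtitle cue (assumed definition)."""
    total = 0.0
    for match in TIMING.finditer(srt_text):
        fields = match.groups()
        start = to_seconds(*fields[:4])
        end = to_seconds(*fields[4:])
        total += max(0.0, end - start)  # ignore cues with broken timings
    return total


def talk_density(talk_minutes, episode_minutes):
    """Talk time divided by episode duration, as defined in the post."""
    return talk_minutes / episode_minutes


# Two made-up cues, only here to exercise the parser
sample = """1
00:00:01,000 --> 00:00:04,000
First placeholder line.

2
00:00:10,500 --> 00:00:12,500
Second placeholder line.
"""

print(talk_seconds(sample))   # 5.0
print(talk_density(30, 50))   # 0.6
```

Overlapping cues get counted twice with this approach, so treat the result as a rough upper bound.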
Next series! Sex and the City.
Ok so I don't like this series at all, and you'll know exactly why in the second part of the article. In the meantime, here is the time distribution:
The average talk time is 15 minutes and the episode duration is around 25 minutes. The talk density for this one is 15/25, which gives us 60% too! So, Sex and the City shows us that size doesn't matter, since it has the same density as Dexter, and both series have some arguable success for different demographics.
Let's mess with everyone's favorite now: Game of Thrones.
Nice curves! Notice how one is going down and the other is going up. I think there was something sketchy that came with the subtitles at episode 0 of Season 5. They started counting in Python! Yay!
Sweet, we have some preliminary results and the program works well. But this is just the tip of the ice cream cone.
Let's dig deeper!
Let's see what that looks like with movies. I picked out the top 10 4chan movies, and here are the results:
One thing we can derive from the above stuff is: It really doesn't matter if there is a lot of talk (Pulp Fiction, Fight Club..) or almost none (Space Odyssey, Lord of the Rings, Blade Runner..), there is a lot more to a good series/movie than just the words.
If you're still reading, here's a potato for your patience.
It goes deeper
Up to now, it was some minor stuff. Now it gets interesting. What if, instead of watching the stuff, we only read the subtitles? How many life-hours would we waste instead of helping other countries or advancing research?
So here it is.
Useless knowledge: The average person reads 250 words per minute, which means you already wasted 7 minutes.
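If you want to play with that number yourself, here is a rough sketch of the reading-time arithmetic. The only figure taken from the post is the 250 words per minute; `count_words` and `reading_minutes` are hypothetical helpers, and the word counting is a simple assumption (split on whitespace, skip cue numbers and timing lines), not necessarily how the original counts were made.

```python
import re

WORDS_PER_MINUTE = 250  # reading speed quoted in the post


def count_words(srt_text):
    """Count words in SRT text, skipping cue numbers and timing lines."""
    words = 0
    for line in srt_text.splitlines():
        line = line.strip()
        if not line or line.isdigit() or "-->" in line:
            continue
        line = re.sub(r"<[^>]+>", "", line)  # drop formatting tags like <i>
        words += len(line.split())
    return words


def reading_minutes(word_count, wpm=WORDS_PER_MINUTE):
    """Minutes needed to read word_count words at wpm words per minute."""
    return word_count / wpm


# Two made-up cues, only here to exercise the counter
sample = """1
00:00:01,000 --> 00:00:04,000
<i>First placeholder line.</i>

2
00:00:10,500 --> 00:00:12,500
Second line, a bit longer.
"""

print(count_words(sample))     # 8
print(reading_minutes(1750))   # 7.0
```

One known blind spot: a subtitle line made only of digits gets mistaken for a cue number and skipped.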
In order to calculate these values, I took ALL the subtitles possible for a given format. Mr Robot has only 1 season for example.
This doesn't speak for itself, so let's put it into context like Steve Jobs. A good example would be to see how thick a subtitles book would be. Here is some math:
Two things before we continue, I don't use INCHES and my multiplications and additions are commutative. Not like that one dude explaining how (°ロ°)☝ 3 + 3 + 3 + 3 + 3 is not the same as 5 + 5 + 5 lol.
So there you have it: if you took the whole Game of Thrones series, printed out the subtitles, and sold it as a book, it wouldn't be thicker than 4 stacked Playboy magazines (this dude was kind enough to measure the thickness).
Each red line represents a full pocket-sized book, by the way. You can meditate on the other things in your free time.
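The book-thickness idea boils down to words → pages → sheets → millimetres. Here is a hedged sketch of that chain. Careful: the default words per page and sheet thickness below are illustrative guesses, not the values behind the figures in this post, and `book_thickness_mm` is a hypothetical helper. No inches were harmed.

```python
import math


def book_thickness_mm(word_count, words_per_page=300, sheet_thickness_mm=0.1):
    """Estimate how thick a printed book of word_count words would be.

    words_per_page and sheet_thickness_mm are assumed, illustrative
    defaults; plug in your own measurements.
    """
    pages = math.ceil(word_count / words_per_page)
    sheets = math.ceil(pages / 2)  # each sheet is printed on both sides
    return sheets * sheet_thickness_mm


# 60,000 words is an arbitrary example, not a count from the dataset
print(book_thickness_mm(60000))
```

Swap in a real word count and a measured paper thickness to get a number worth arguing about.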
aaaaaand ༼ つ ◕_◕ ༽つ
My writing-as-therapy session hits its end here. I still have a lot I want to share, but it's not well formatted yet, so stay tuned if you want to know what makes a good series/movie. I found some characteristics using a mix of sound/video and text. If I'm too lazy, just send me a message and we can talk code like ultra nerds.
Here's a little preview of part 2 though, which will be available on this blag:
Note: If you plan to aggressively throw the dataset into an RNN and generate a series or movie and record it, first of all you're awesome and second, I'd be glad to help you with the scenario and set up!
Thanks for reading, you're awesome!