I've been offline for almost 2 years now and tried a few times to get my blog up and running again without much motivation. But now I seem to have gotten a grip on the whole shebang. Previously hosted on GitHub Pages with Publii, the blog is now a self-hosted Ghost instance. I prefer this because I can write my posts from wherever I want without needing any credential management or git installation or whatever. So far the experience has been smooth and rather enjoyable. I just need a browser and my credentials and I'm good to go.

The installation is self-hosted and rather standard: Nginx serving the Ghost app with some SSL kung fu provided by Let's Encrypt. I'm not using any analytics or vanity metrics (Google, Matomo, etc.) because, once again, it is my personal blog and, to be honest, the only metric that matters to me is having sustainable fun and thinking that somewhere someone reads this and enjoys it.

The instance the blog is hosted on is not the beefiest and doesn't have RGB's for more FPS, but I tried some load testing with locust.io and it should handle a few thousand concurrent connections without a problem.

Smooth sailing so far.

I am obsessed with performance and measuring things ( ͡° ͜ʖ ͡°) so I keep a close eye on the server load and the nginx/server logs. Once you take a closer look, though, you realise that what is happening online is comparable to the Wild West.

My logs are full of bogus requests, port scans and unsolicited pen-tests that seem to be just a set of scripts scouring the internet to recruit more machines for silly botnets or whatnot.

Here are some requests:

Removed the IP's because I'm not a snitch

 [23/Apr/2020:21:13:32 +0000] "9\xCD\xC3V\x8C&\x12Dz/\xB7\xC0t\x96C\xE2" 400 182 "-" "-"
 [24/Apr/2020:01:18:45 +0000] "POST /boaform/admin/formPing HTTP/1.1" 400 182 "-" "polaris botnet"
 [25/Apr/2020:14:07:37 +0000] "POST /cgi-bin/mainfunction.cgi HTTP/1.1" 404 152 "-" "XTC BOTNET"
 [28/Apr/2020:21:08:16 +0000] "GET / HTTP/1.1" 400 280 "http://www.virus-respirators.com" "https://www.virus-schutzmasken.de"
 [20/Apr/2020:21:32:38 +0000] "GET / HTTP/1.1" 302 37 "-" "Mozilla/5.0 zgrab/0.x (compatible; Researchscan/t13rl; +http://researchscan.comsys.rwth-aachen.de)"
 [21/Apr/2020:00:14:08 +0000] "GET / HTTP/1.0" 200 612 "-" "masscan/1.0 (https://github.com/robertdavidgraham/masscan)"
 [21/Apr/2020:06:08:42 +0000] "GET /2014/license.txt HTTP/1.1" 301 75 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
 

Obviously, one can learn a lot by reading and seeing what is going on.

There are many ways to deal with these requests: either you block them or you mess with them. Messing with them would mean sending them bogus files to chew on or redirecting them to Rick Astley's nice tune. I currently don't have time for the latter options, as this approach can be a liability. In Germany, if you hit someone who breaks into your house you can still be liable. It wouldn't be far-fetched that if you respond to an automated "GET wp-login.php" with a coronavirus.zip zip bomb or a free WinRAR license key you could get into trouble. So I went the boring route and decided to just block them. (pinkie promise)  ( ͡° ͜ʖ ͡°)

Blocking IP's

The main gist of how it's done is simple. Someone asks my server for access to a link and my server answers with different things called status codes. Sometimes it's "I don't have this file", aka the 404 status code. Sometimes it's "I can hear you but I can't understand you", aka the 400 status code. These status codes alone already give up a whole lot of information; essentially they are a source of information about my server that could reveal potential weak points. That's why these scanners cast a very broad net in the hope of finding at least one weak point. But none of it happens without leaving a trace.

Every time someone asks my server something, their request gets written into a log file, in this case the nginx access log.

The log file contains things like the IP address, what the request was and what the server responded. Using this and some bash utilities, one can filter out good and bad requests and send the bad requests straight to jail, i.e. ban them.
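
For a quick feel of what's coming in, and assuming nginx's default combined log format (client IP in field 1, status code in field 9), a couple of one-liners go a long way:

awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn          # tally of status codes
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head   # noisiest IP's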

A status quo of jailable IP's

Since we need to start somewhere, I wrote some sample rules by hand. I filtered the logs by 301 and 404 status codes and just pasted some of the contents into a file called rules:

contents of rules

wp-login
home.asp
htmlV
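
If you'd rather mine candidates than eyeball the log, something along these lines surfaces the most-probed paths (again assuming the default log format, with the request path in field 7 and the status code in field 9):

awk '$9 == 404 || $9 == 301 {print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20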

With the help of this blog post, https://linux-audit.com/blocking-ip-addresses-in-linux-with-iptables/, I set up ipset and a blacklist set to start waving the ban hammer.
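
Roughly, that boils down to creating the set and hooking it into iptables so that anything in the set gets dropped (the set name blacklist and the hash size are my own choices):

ipset create blacklist hash:ip hashsize 4096
iptables -I INPUT -m set --match-set blacklist src -j DROP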

First of all, using the mentioned rules file, I can search the logs for any line that matches one of these rules:

grep -f rules /var/log/nginx/access.log

And thus extract a list of IP's to send to jail:

grep -f rules /var/log/nginx/access.log | awk '{print $1}' | sort -u > banned.ips

Using the same tools, I can join the log file and the banned list to find out why a given IP was blocked. That's handy for auditing or something like that, because I would expect that unsolicited vulnerability scans should be illegal. How would you explain going door to door, trying out different keys or checking the knobs and windows of random houses? My blog is my garden, so don't mess it up.
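
For instance, to replay what each banned IP was poking at (banned.ips is the file from the previous step, and the unescaped dots in the pattern are good enough for a quick look):

while read -r ip; do
  echo "== $ip =="
  grep "^$ip " /var/log/nginx/access.log | head -3
done < banned.ips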

And we're ready to block them all at once.

for ip in $(cat banned.ips); do ipset add blacklist $ip; done

This way we set up an initial IP jail for people scanning and doing weird shit.

Automating Jail aka Keeping it going

We now need a way to keep this running in the background for new IP's and new requests. Automating this is easy: we just need to pipe until it works. Here's a SaaS in a bash one-liner:

tail -f /var/log/nginx/access.log | grep --line-buffered -f rules | awk '{ system("ipset add blacklist " $1) }'

What this does is constantly watch the nginx log, and as soon as something matches the rules, you're right: jail.


This works so well that, while testing it with my local connection, I got kicked out of the server. The only way back in was to set up a hotspot with my phone, SSH into the server and remove my own IP from the blacklist. I used to be a programmer until I shot myself in the IP, I guess.
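
A cheap insurance against locking yourself out is a whitelist set that iptables checks before the blacklist; the IP below is a placeholder, put your own there:

ipset create whitelist hash:ip
ipset add whitelist 198.51.100.42
iptables -I INPUT 1 -m set --match-set whitelist src -j ACCEPT   # sits above the blacklist DROP rule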

Now throw this in systemd, don't forget to persist the blacklist, and you should be good to go.
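
A minimal sketch of that, assuming the one-liner lives in a script at /usr/local/bin/jail.sh (the script path and the unit name are my own inventions, nothing official):

ipset save blacklist > /etc/ipset.blacklist    # persist the set; load it back after a reboot with: ipset restore < /etc/ipset.blacklist

And a tiny unit file, e.g. /etc/systemd/system/jail.service:

[Unit]
Description=Ban IP's that match the nginx log rules
After=network.target nginx.service

[Service]
ExecStart=/usr/local/bin/jail.sh
Restart=always

[Install]
WantedBy=multi-user.target

Then systemctl enable --now jail.service and the banning keeps running in the background.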

The most important thing then is to set up a clean rule set so as not to screw it all up, but this exercise is left to the reader. Or just use fail2ban lol. But I like to be in charge and learn along the way. This is just a blog and not a bank website.

There are many ways to create a rule set, either manually or automatically. For the time being I will do it manually because it's fun to read the crap that is posted as User Agents or Referers. By searching for my server's vulnerabilities, maybe they are exposing theirs? Fun times.

A fun and easy way to do it is to use the sitemap generated by Ghost as input for a rule set we could match any other query against. This way, if a random entity hits the webserver with a funky query: jail.
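
A rough sketch of that idea, assuming Ghost's sitemap.xml is an index pointing at sitemap-posts.xml, sitemap-pages.xml and friends, and with example.com standing in for the real domain:

curl -s https://example.com/sitemap-posts.xml https://example.com/sitemap-pages.xml \
  | grep -oP '(?<=<loc>)[^<]+' | sed 's|https://example.com||' | sort -u > allowed.paths

awk '{print $7}' /var/log/nginx/access.log | sort -u | grep -v -x -F -f allowed.paths

Static assets and legitimate query strings would show up in that output too, so it's a starting point for the rule set rather than something to feed straight into the jail.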

Automating the ruleset using Deep Learning (an aside)

(the SaaS just increased its value tenfold)

Essentially, this is a classification problem: what is a good request and what is a bad request? What if you could block the request before it even hits your webserver? Some kind of sentiment analysis for nginx logs. That would be really cool.

So how would you do this with the least amount of effort? I thought about using Prodigy, which is a good annotation tool and would allow me to train models on the fly, but a price of almost 60 döner kebabs without being able to try it out on my own data is something I do not sign up for.

Ultimately I stumbled upon AutoKeras, which is a kind of AutoML tool. But in order to use it, I need to find a way to create and annotate the dataset without too much hassle. This needs a bit of fiddling and ultimately some more testing.

To get to a model quickly, I decided to use a cool service called MonkeyLearn. I am in no way affiliated with them, I just believe this is how ML should be, at least in the explorative phase: you need to be able to get to a model fast and try it out even faster.

It kinda worked but I won't dwell on it as this post is already getting too long. Maybe in part 2.

Analytics

Ok now what if you wanted to do analytics on your logs, some basic stuff like number of hits, when they happen and so on?
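
Even before reaching for anything fancy, plain awk already gets you surprisingly far. For instance, hits per day straight from the access log (the timestamp is field 4 in the default format, and the log is already in chronological order):

awk '{print substr($4, 2, 11)}' /var/log/nginx/access.log | uniq -c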

One simple route would be to use an ELK stack, but that's a bit overkill for my blog and, sincerely, it's not my favorite. I've worked a ton with that stack, so it's time to try something leaner and different that could still scale.

This will be covered in the second part :)