My blog was previously hosted on GitHub Pages with Publii, but it is now a self-hosted Ghost instance. I prefer this because I can write my posts from wherever I want without needing any credential management or git installation or whatever. Plus, the Ghost editor is really good and doesn't get in my way. For the rest of my notes there's Emacs. So far the experience has been smooth and rather enjoyable. I just need a browser and my credentials and I'm good to go.
If I ever want to switch back to a static blog, I just have to hook into the Content API that Ghost exposes and it's all good.
The installation is self-hosted and rather standard: Nginx serving the Ghost application with some SSL kung fu provided by Let's Encrypt. The instance the blog is hosted on is not the beefiest and doesn't have RGBs for more FPS, but I did some load testing with locust.io and it should handle a few thousand concurrent connections without a problem.
Smooth sailing so far.
I am concerned with performance and measuring things, so I keep a close eye on the server load and the Nginx/server logs. This is also a benefit of self-hosting: you have access to that stuff! If you host on GitHub Pages or some other platform, you have to use JS for analytics, which means GDPR banners, notifications, pop-ups, and everyone knows that's bloated.
So I needed an analytics solution that was simple and still covered the most important (vanity) metrics.
Why not use GoAccess/CowAccess then, you might ask? Well, I had some issues with filtering out irrelevant metrics and running my own queries on the logs. I want to be able to process the logs freely, for example: "get me a list of all IPs that request a PHP resource, grouped by date" or "get me a list of all unique user agents for a given referrer". At some point this led me to not trust the numbers. So GoAccess is good if you want overview metrics, but not the specific ones you want to dig into.
Hint: it's not yet another ELK or Grafana/Prometheus stack :)
Here is what it looks like:
There are a lot of other advantages to this solution that I will expand on later. So far I'm pretty happy with this hands-off approach.
What I like when I design my systems is that they should be simple to maintain without sacrificing too much flexibility. Every component should be responsible for one thing, without too many dependencies. Here are the components you will need for this project:
- Unix tools (grep, awk/gawk, tail, ...)
- SQLite
- A self-hosted Metabase installation
The way this works is as follows: you create a service that continuously monitors the Nginx logs, formats them, and writes them to an SQLite database on the fly. The SQLite database (a single file) is then mounted as a data source in the Metabase runtime, which serves as a UI for the analytics. That's it.
The reason I chose Metabase instead of Grafana is that Metabase supports SQLite out of the box while Grafana seemingly doesn't want to support it (asking people to vote on an issue just to close it and open a new one really grinds my gears). I'm not going to set up and manage a MySQL database for something as simple as querying logs at such a small scale. And to be honest, if my logs ever reach 140 terabytes I will have other kinds of problems (aka "suffering from success" - DJ Khaled), so take that last statement with a grain of salt.
Rant aside, the analytics that you want to run are only limited by your SQL skills.
Preparation - Housekeeping
One of the first things I like to do when setting up Nginx is to tweak the log format, especially the date format. The date in the default Nginx logs is in a non-ISO format, so we change that to avoid some unnecessary work later:
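A minimal sketch of such a log format, using Nginx's built-in `$time_iso8601` variable in place of `$time_local` (the format name `iso_combined` is just a placeholder, the rest mirrors the default combined format):

```nginx
# inside the http {} block of /etc/nginx/nginx.conf
log_format iso_combined '$remote_addr - $remote_user [$time_iso8601] '
                        '"$request" $status $body_bytes_sent '
                        '"$http_referer" "$http_user_agent"';

access_log /var/log/nginx/access.log iso_combined;
```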
Tailing the logs
The first step of the process is to get access to the Nginx logs, extract the important fields, run some anonymisation, and produce output that is useful enough.
Using tail and awk you can do something like this:
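A sketch of what that could look like, wrapped in a function so the live tail can be hooked up separately (the log path is an assumption):

```shell
# extract_fields reads nginx access-log lines on stdin and prints the
# client IP and timestamp, pipe-separated
extract_fields() {
  awk '{
    ip = $1                     # field 1: client IP
    ts = $4                     # field 4: bracketed ISO timestamp
    gsub(/[\[\]]/, "", ts)      # strip the brackets
    print ip "|" ts
  }'
}

# follow the live log (runs until interrupted):
#   tail -n 0 -F /var/log/nginx/access.log | extract_fields
```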
This will pick up every new log line and process it through awk, extracting field $1 (the IP) and field $4 (the timestamp). This gives us a way of parsing the log file and effectively putting every part of the line into its own variable, which is necessary for the next step.
Inserting into SQLite
The next step is to insert the parsed log data into an SQLite database. For this, you create an SQLite database and define its schema as follows:
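Something along these lines works; the database and table names (`tasks.db`, `todo`) are taken from the query output shown further down, while the exact column set is a guess at what you would extract from each line:

```shell
# create the database (a plain file) and a table for the parsed fields;
# everything is TEXT on purpose -- see below
sqlite3 tasks.db <<'SQL'
CREATE TABLE IF NOT EXISTS todo (
  ip          TEXT,
  timestamp   TEXT,
  request     TEXT,
  status_code TEXT,
  referrer    TEXT,
  user_agent  TEXT
);
SQL
```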
Every field is a text field because at this point it really doesn't matter. You can tweak it to your liking.
Connecting Awk and SQLite
I had some trouble piping the output of awk into SQLite, mainly due to buffering issues which I couldn't wrap my head around. The issue was solved when I used gawk instead of awk. Why it works is a mystery to me to this day and I'd be more than happy to find out.
This makes it possible to run queries like the following on the logs:
```
root@funnybretzel-prod:~# sqlite3 tasks.db "select datetime(timestamp), referrer, status_code from todo;"
2020-05-21 14:57:58|https://funnybretzel.com/ghost/|200
2020-05-21 14:58:21|https://funnybretzel.com/ghost/|200
2020-05-21 14:58:25|https://funnybretzel.com/ghost/|200
2020-05-21 14:58:39|https://funnybretzel.com/ghost/|200
2020-05-21 15:02:54|https://funnybretzel.com/ghost/|200
2020-05-21 15:04:38|https://funnybretzel.com/ghost/|200
2020-05-21 15:09:01|-|404
2020-05-21 15:09:01|-|499
2020-05-21 15:11:41|-|200
2020-05-21 15:11:41|-|200
2020-05-21 15:13:18|https://funnybretzel.com/ghost/|200
2020-05-21 15:13:27|https://funnybretzel.com/ghost/|200
2020-05-21 15:13:34|https://funnybretzel.com/ghost/|200
2020-05-21 15:13:38|https://funnybretzel.com/ghost/|200
```
So now you have a simple bash script that writes every new log line into an SQLite database. It would be nice to have this utility running continuously as a service.
Creating a service
I decided to take the easy route and create a systemd service as follows:
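A minimal unit file along these lines should do; the service name `loggingsqlite` matches the systemctl commands below, while the script path is a placeholder for wherever you put the tail/gawk script:

```ini
# /etc/systemd/system/loggingsqlite.service
[Unit]
Description=Stream nginx access logs into SQLite
After=network.target nginx.service

[Service]
# placeholder path -- point this at your own tail/gawk script
ExecStart=/usr/local/bin/ngx-to-sqlite.sh
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```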
For more information on how to set up a service, check https://www.freedesktop.org/software/systemd/man/systemd.service.html. It would also be good to add a logging mechanism here in case we fail to parse some convoluted lines. What I do with the lines that cannot be parsed is write them to another file and simply block the IPs.
The last step with the service is then to start everything:
```shell
# start the service
systemctl start loggingsqlite
# make the service start on reboot; this will cause a few duplicates in our
# database but that is fine for the moment, just need to offset tail -f -n=1 or so
systemctl enable loggingsqlite
# check on the status
systemctl status loggingsqlite
```
Getting started with Metabase is very easy if you decide to use Docker.
You mount the SQLite database into the Metabase Docker instance (you can mount many of them) and start the service. The rest, like setting up SSL and configuring Nginx, is covered here in depth.
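A sketch of the Docker invocation, using the official `metabase/metabase` image; in the Metabase admin UI you then add a SQLite data source pointing at the in-container path, here `/data/tasks.db` (the host path and port mapping are assumptions):

```shell
# run Metabase and mount the SQLite file into the container
docker run -d --name metabase \
  -p 3000:3000 \
  -v /root/tasks.db:/data/tasks.db \
  metabase/metabase
```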
The cool thing about this: the SQLite database is updated in real time, and I can add as many SQLite databases as I want. The process is the same: find the log file on my server, "tailgawk" it into SQLite and import it into Metabase.
After that, it's set and forget. You can do a lot more with Metabase and I am still figuring it out.
I now have a Google-less, JS-less, reliable and effective way to process my blog logs and learn from my data, not just a subset of it. The logs contain the truth, and now it's just a matter of time to figure it out and do some fun stuff like A/B testing, dynamically blocking IPs, or implementing rate limiting. This is all possible because I can add more databases (for example an Orders/Conversions dashboard) and even write in a semi-automatic fashion to a database which is then used as an input for scripts running on the server.
Hope this helps you set up your own version of it. If you have trouble, just mail me and I'll help you out. I am currently figuring out how to filter and detect bots, and also writing a pyinfra script for this setup since it's proven very useful on another project :)