Moving away from Google Analytics

Published: 2016-01-03
Last modified: 2016-01-03

I am proud of the fact that this website uses very little JavaScript. None of it is essential, and the JavaScript that is used exists to provide supplementary functionality to the content on each page. Things like dynamically generated tables of contents provide convenience for the user, but the content is available without JavaScript. This is good, and I was happy to provide such a well-made site in the wake of the embarrassments that now populate the Web. However, there was one outlier: the Google Analytics code included in the header of every page.

Google Analytics sucks. It sucks for a number of reasons: from a technical perspective, it is actually quite marvelous, though fundamentally flawed for reasons I will get to later. The other reason it sucks is that large scale user tracking and privacy infringement is antisocial, malicious, and quite evil. Whether or not user tracking is immoral, whether or not it is even useful, and whether or not Big Data is completely misguided would be an interesting topic of discussion for elsewhere. Suffice to say it should be something that one opts in to, not something that one opts out of. When I browse websites, I block Google Analytics, so including it on my site was rather hypocritical.

For a small personal website like this, analytics isn't a crucial operational component; it's just nice to have. I can use it to monitor page hits, which makes me feel nice knowing that people visit my site. I can use it to see roughly what visitors are interested in and roughly what kind of people they might be. Up until now, I have used Google Analytics because it was convenient, but it had always made me uncomfortable to sell out my readers to ~~Big Brother~~Google, so I have finally removed it. Now, I use my web server logs and FOSS software for my analytics needs.

Getting analytics data

For analytics, you need data. Google Analytics gets its data from the JavaScript code embedded in webpages, as well as any and all applicable data from other parts of the Google hivemind. I now get my data from my web server logs.

What information does the server have access to? In order to get a webpage, the client must request the page from the server. The server therefore has access to all of the information in the request headers, such as who sent the request, when the request was received, and what was requested.

The request headers contain a lot of auxiliary information, including the user agent and referrer fields. These headers can be useful for analytics and potentially damaging to one's privacy, but they can be forged or omitted, so their value depends on whether or not they can be corroborated with other, more trustworthy data.

What subset of this information is available in the server logs depends on your server and configuration. A server that keeps logs merely for diagnostic purposes might only keep the time of each request, the page that was requested, and the IP that requested it, whereas a server depending on them for analytics might log everything and send it to a database instead of a log file on the server.

Analytics with server logs

With data in hand, the analysis can be done however you want. On the simple side, one might simply plot the number of requests per day to see whether there are any interesting patterns, such as peaks on weekends. On the other extreme, one could throw state of the art machine learning algorithms at the data to identify individuals and their browsing habits, like Google no doubt does.

For me, some basic reporting is sufficient. Two programs I have found and tried are visitors and GoAccess. visitors is old: it doesn't recognize Chrome's User Agent, for example, and its reports are visually simple. GoAccess is newer and primarily intended for live monitoring, though it can generate reports too. Both work well.

After playing with both, I settled on a short script for generating reports:

#!/bin/bash

scp -r user@server.example.com:/var/log/webserver logs
find logs -name "*.gz" | parallel gunzip
cat logs/* | goaccess -p ~/.goaccessrc -a > report.html

Some analytics for this site

Before I finish, I'd like to share some general insights I got from this site's analytics.

First, from Google Analytics. There was a peak in visits in the latter half of December, with an average of about 30 per day. Before that, the average was around 7 per day. Of the roughly 300 visits in the past month, almost all go to / (the homepage). There were 250 Chrome users and 40 Firefox users.

Next, from my server logs. There was a fairly even distribution of visits in November and December, averaging roughly 60 per day. There were a total of 3000 visits to /, and 150 visits to /robots.txt (these requests failed, as this site uses /index.xml instead). There were 2000 web crawlers, 1500 Firefox users, and 100 Chrome users.

There are huge discrepancies in the data from Google Analytics and my server logs. In particular, Google Analytics only tracked 10% of the actual visits to my site. This brings me to my earlier point, Google Analytics not only sucks morally, it sucks technically as well.

I'm not qualified to give deep insights about this, but I suspect the reason is JavaScript blocking. Web crawlers and bots did not show up in Google Analytics because they do not execute JavaScript, but many human users (or visitors that appear to be human users) did not show up either. As the average user grows more and more aware of privacy issues and begins blocking tracking ads and scripts, Google Analytics and other forms of JavaScript tracking are just going to suck more and more.

P.S.: For you bots and script kiddies poking at phpMyAdmin and cgi-bin, this site is completely static. I wish you good luck trying to exploit the HTML files.