Google Analytics sucks. It sucks for a number of reasons: from a technical perspective, it is actually quite marvelous, though fundamentally flawed for reasons I will get to later. The other reason it sucks is that large scale user tracking and privacy infringement is antisocial, malicious, and quite evil. Whether or not user tracking is immoral, whether or not it is even useful, and whether or not Big Data is completely misguided would be an interesting topic of discussion for elsewhere. Suffice to say it should be something that one opts in to, not something that one opts out of. When I browse websites, I block Google Analytics, so including it on my site was rather hypocritical.
For a small personal website like this, analytics isn't a crucial operational
component; it's just nice to have. I can use it to monitor page hits, which
makes me feel nice knowing that people visit my site. I can use it to see
roughly what visitors are interested in and roughly what kind of people they
might be. Up until now, I have used Google Analytics because it was convenient,
but it had always made me uncomfortable to sell out my readers to
Brother, so I have finally removed it. Now, I use my web
server logs and FOSS software for my analytics needs.
What information does the server have access to? In order to get a webpage, the client must request the page from the server. The server therefore has access to all of the information in the request headers, such as who sent the request, when the request was received, and what was requested.
The request headers contain a lot of auxiliary information, including the user agent and referrer fields. These headers can be useful for analytics and potentially damaging to one's privacy, but they can be forged or omitted, so their value depends on whether or not they can be corroborated with other, more trustworthy data.
What subset of this information is available in the server logs depends on your server and configuration. A server that keeps logs merely for diagnostic purposes might only keep the time of each request, the page that was requested, and the IP that requested it, whereas a server depending on them for analytics might log everything and send it to a database instead of a log file on the server.
With data in hand, the analysis can be done however you want. On the simple side, one might simply plot the number of requests per day to see whether there are any interesting patterns, such as peaks on weekends. On the other extreme, one could throw state of the art machine learning algorithms at the data to identify individuals and their browsing habits, like Google no doubt does.
For me, some basic reporting is sufficient. Two programs I have found and tried are visitors and GoAccess. visitors is old: it doesn't recognize Chrome's User Agent, for example, and its reports are visually simple. GoAccess is newer and primarily intended for live monitoring, though it can generate reports too. Both work well.
After playing with both, I settled on a short script for generating reports:
#!/bin/bash scp -r firstname.lastname@example.org:/var/log/webserver logs find logs -name "*.gz" | parallel gunzip cat logs/* | goaccess -p ~/.goaccessrc -a > report.html
Before I finish, I'd like to share some general insights I got from this site's analytics.
First, from Google Analytics. There was a peak in visits in the latter half of
December, with an average of about 30 per day. Before that, the average was
around 7 per day. Of the roughly 300 visits in the past month, almost all go to
/ (the homepage). There were 250 Chrome users and 40 Firefox
Next, from my server logs. There was a fairly even distribution of visits in
November and December, averaging roughly 60 per day. There were a total of 3000
/, and 150 visits to
requests failed, as this site uses
/index.xml instead). There were
2000 web crawlers, 1500 Firefox users, and 100 Chrome users.
There are huge discrepancies in the data from Google Analytics and my server logs. In particular, Google Analytics only tracked 10% of the actual visits to my site. This brings me to my earlier point, Google Analytics not only sucks morally, it sucks technically as well.
P.S.: For you bots and script kiddies poking at phpMyAdmin and cgi-bin, this site is completely static. I wish you good luck trying to exploit the HTML files.