I've been mulling over the idea of writing my own analytics package for a while now, and finally decided to sit down and do it. NA is a simple web analytics package that runs on the target web server, feeds directly off the server's web traffic, and requires no tracking JavaScript to be placed on the site's pages.
I'm early in the process, so any ideas and feedback are welcome.
Building analytics from access logs is massively unreliable. I have a few domains that answer HTTP traffic but only serve "Welcome to nginx". When I tail the access logs I see thousands of requests from obvious bots, but also traffic that seems to be pretending to be something it isn't.
There are bots that pretend to be phones browsing around, or IE, or Firefox.
Just remember that relying purely on access-log analytics won't return accurate results.
I wouldn't call it massively unreliable from what I've seen so far.
For one, it is offset by the fact that the site it is used on is Ajax'ed, so it is fairly simple to filter out clients that poll just the container page and never follow up by querying any actual content. For two, I'm getting good at detecting bots simply by filtering on User-Agent strings and IPs. So it is far from the utter disaster you describe.
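For illustration, a minimal sketch of the kind of filter I mean, in Python (the User-Agent tokens and IP addresses here are made up, not my actual lists):

    # Flag obvious bots by User-Agent substring and source IP.
    BOT_UA_TOKENS = ("bot", "crawler", "spider", "curl", "wget")   # hypothetical tokens
    BLOCKED_IPS = {"203.0.113.7", "198.51.100.23"}                 # hypothetical addresses

    def looks_like_bot(user_agent, ip):
        ua = (user_agent or "").lower()
        return ip in BLOCKED_IPS or any(token in ua for token in BOT_UA_TOKENS)

It's a blunt instrument, but together with the container-page check it cuts out most of the noise I've seen so far.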
I'm curious as to why you chose to stick code into your web application to explicitly insert log entries into a database rather than using the built-in logs that the server provides.
After all, the web logs that Apache spits out match your schema almost identically. They're perfectly parseable, and they can be processed in batches at intervals (possibly from a different server) rather than forcing an insert on every page load.
By doing things this way, you essentially remove the ability to scale the application that's being logged. You're doing at least one database insert per page load, destroying any gains you might have seen from caching the content you're displaying. And of course, you can't cache anyway, since you need to run the PHP script for the page in question so that it fires the database insert to register a hit.
So yeah, I'd suggest dropping that step and just having a worker swing by every once in a while, pull up the logfile, see what's new and bulk insert it into your database. Preferably with said database and said worker living on another machine so that they don't bog down your production box.
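Something like this, as a rough sketch (the paths, the one-column schema and the offset bookkeeping are all invented for illustration; real code would also need to handle log rotation, which shrinks the file and invalidates the saved offset):

    import sqlite3

    LOG_PATH = "/var/log/apache2/access.log"   # hypothetical log location
    OFFSET_PATH = "/var/tmp/na_access.offset"  # remembers how far we've read

    def read_offset():
        try:
            with open(OFFSET_PATH) as f:
                return int(f.read().strip())
        except (FileNotFoundError, ValueError):
            return 0

    def run_once(db):
        offset = read_offset()
        with open(LOG_PATH) as f:
            f.seek(offset)
            lines = f.readlines()
            new_offset = f.tell()
        # One bulk insert for the whole batch instead of an INSERT per page load.
        db.execute("CREATE TABLE IF NOT EXISTS hits (raw_line TEXT)")
        db.executemany("INSERT INTO hits (raw_line) VALUES (?)",
                       [(line.rstrip("\n"),) for line in lines])
        db.commit()
        with open(OFFSET_PATH, "w") as f:
            f.write(str(new_offset))

    if __name__ == "__main__":
        run_once(sqlite3.connect("analytics.db"))  # schedule this via cron

Run it from cron every few minutes and the production box never sees a single analytics insert.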
I hear you, thanks. It's a pet project, and switching to a cron job or just tailing the log file is a trivial change, just as you said. Doing it directly from the page is simply easier for now.
Parsing Apache logs isn't hard. There's a well-defined spec for them. Of course you have to keep track of stuff like escaping, but that's pretty trivial.
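For the common "combined" format, a single regex gets you most of the way. A sketch, assuming Python (the alternation `(?:[^"\\]|\\.)*` is what copes with escaped quotes inside the quoted fields):

    import re

    # Apache "combined" log format:
    # %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
    COMBINED = re.compile(
        r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
        r'\[(?P<time>[^\]]+)\] '
        r'"(?P<request>(?:[^"\\]|\\.)*)" '    # handles \" escaping inside the field
        r'(?P<status>\d{3}) (?P<size>\d+|-) '
        r'"(?P<referer>(?:[^"\\]|\\.)*)" '
        r'"(?P<agent>(?:[^"\\]|\\.)*)"'
    )

    def parse_line(line):
        m = COMBINED.match(line)
        return m.groupdict() if m else None

    parse_line('127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] '
               '"GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.08"')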
I am really interested in your project. I am working on a service where I think analytics will be a big benefit to my customers. I am struggling with whether to roll my own or use a 3rd party service. Both have pros and cons. I think in the end rolling my own will be the way to go, since I want to show users how many people visited their site (and how many people contacted them) in a very simple way. Basically just show them the two numbers.
Another advantage of "native" analytics is capturing downloads of all file types. Google Analytics does not know about your PDF downloads (unless there is something I don't know about), and for me that is quite important.
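If you're parsing your own logs, counting those is just a matter of looking at the request path. A tiny sketch (the extension list is hypothetical; use whatever counts as a download for you):

    from urllib.parse import urlparse

    DOWNLOAD_EXTS = (".pdf", ".zip", ".mp3")   # hypothetical download extensions

    def is_download(request):
        # "request" is the quoted field from the log, e.g. "GET /files/report.pdf HTTP/1.1"
        parts = request.split()
        if len(parts) < 2:
            return False
        return urlparse(parts[1]).path.lower().endswith(DOWNLOAD_EXTS)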
I hope you figure out how to block referrer spam. I use both GA and awstats as provided by my webhost, and the numbers are wildly different. The referrer list in awstats is almost useless since I get hundreds of hits from viagra-selling sites etc., despite the fact that I don't publish the referrer list publicly anywhere.
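One crude server-side defence is a referrer domain blocklist; a sketch (the domains below are placeholders, obviously):

    from urllib.parse import urlparse

    SPAM_REFERRER_DOMAINS = {"cheap-pills.example", "seo-offers.example"}  # hypothetical

    def is_referrer_spam(referer):
        if not referer or referer == "-":
            return False
        domain = urlparse(referer).netloc.lower().split(":")[0]
        return any(domain == d or domain.endswith("." + d)
                   for d in SPAM_REFERRER_DOMAINS)

It's whack-a-mole, though; there are community-maintained referrer-spam blocklists that could save some of the manual curation.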
tl;dr - reinventing the wheel is the only way to build a better wheel :)
Seriously though - I have used some of these in the past (settling, however, on Mint, which is not open-source in the common sense), and none were perfect. I am a programmer, not a sysadmin, and it is more interesting for me to write something than to mess with package dependencies, default installation paths and whatnot. Digging through someone else's code and trying to bend it to my needs is not the best pastime either. On the other hand, writing analytics from scratch gives me a chance to build exactly what I need, especially when it comes to the reporting function and the UI. In fact, in the UI/UX department I am pretty sure I can do better than the existing O/S packages.
What are the disadvantages of 3rd party services? Are you really giving away secrets to 3rd party services?
Also, what's the realistic percentage of users who disable JavaScript nowadays? Does AdBlock really block JavaScript? Does it really impact 3rd party services that use JavaScript to record the visit, like, say, Google Analytics?
Yes. Right now we are looking at installing an affiliate plugin on a Magento site with Varnish caching. Instead of inventing a new Magento/Varnish plugin, we are planning to use a JavaScript-driven solution.
> Javascript dependency - not relying on browser-side code to capture a page visit means that NA can correctly account for clients using NoScript and AdBlock.
> Privacy - capturing the analytics data directly on the web server means not sharing it with other parties. For some people it is not an issue, but for others, me included, it is.