Website analytics tools are complicated—and usually expensive. Sure, Google Analytics is free but it requires you to donate your user’s behavior data to help Google’s ad business. Yuck! Luckily, around the time GDPR was introduced, a handful of developers got to work and created some privacy-focused alternatives.
Fathom is the most popular of these new tools but others like Plausible, Umami, and Offen also exist. There’s one problem, though; these new offerings are extremely basic compared the behemoths like Google Analytics—only providing simple stats like views, popular pages, and screen sizes.
These limitations are fine for my personal projects, however, so I could have used one, but the simplicity got me wondering if I could just build it? Long story short, I spent a weekend making analytics.schier.co and pointed my website at it.
Architecting the data
The hardest part of writing analytics software is maintaining query performance at scale. Luckily, my website maxes out at ~1000 daily visitors so scale isn’t an issue. Raw page view events can simply be stored and queried from a regular Postgres table.
The simplest solution requires just a single page view table, but I created two more tables for user accounts and websites since I plan to add a login system later. In the meantime, it’s fine to manually create these records by logging into the database. 🤓
Here’s what the schema looks like:
The only thing left is to massage the data into a format usable by the front-end. Note that the following SQL queries are specific to Postgres, but MySQL has similar features with a slightly different syntax.
Sending the tracking data
Data gets into the page view table via HTTP request. Like most analytics tools, the tracked website embeds a snippet on every page that executes a script to generate and send the required parameters. Here’s an example of one of those requests.
If you’d like to inspect the snippet in detail, it can be viewed on GitHub. Now let’s get to the fun part: querying.
Querying chart data in buckets
Postgres is great at generating bucketed time series results which must be done in order to have a count for every point on the chart. The following SQL generates bucketed data by flooring
created_at to the nearest hour and using it in a
GROUP BY aggregation.
-- Count total and unique views, bucketed by hour SELECT COUNT(id) AS count_total, COUNT(DISTINCT sid) AS count_unique, TO_TIMESTAMP(FLOOR((EXTRACT('epoch' FROM created_at) / 3600)) * 3600) AT TIME ZONE 'UTC' AS bucket FROM analytics_pageviews WHERE website_id = $1 AND created_at BETWEEN $2 AND $3 GROUP BY bucket; -- Example Result: -- 16, 13, 2021-07-21 06:00:00.000000 -- 21, 16, 2021-07-21 07:00:00.000000 -- 12, 11, 2021-07-21 08:00:00.000000 -- 11, 10, 2021-07-21 09:00:00.000000 -- ...
Querying popularity lists
Perhaps the most useful feature of a analytics tools is seeing the most popular pages, countries, and screen sizes. Here’s what the data looks like in the dashboard.
To limit the number of expensive queries sent to the database, popular pages, countries, and screen sizes can be combined into a single query using
GROUPING SETS. Grouping Sets allow aggregation into multiple groups within a single query! That’s three queries in one! 👏
-- Aggregate popular screens, pages, countries SELECT screen_size, path, host, country_code, COUNT(id) AS count_total, COUNT(DISTINCT sid) AS count_unique FROM analytics_pageviews WHERE country_code != '' AND website_id = $1 AND created_at BETWEEN $2 AND $3 GROUP BY GROUPING SETS ( (screen_size), (country_code), (path, host), () ) ORDER BY count_unique DESC LIMIT 50;
Querying live users
Finally, it’s time to tackle the “live” users query. Most analytics tools define a live user as someone who’s viewed a page recently. The following query does just that by counting distinct sessions IDs in the last 5 minutes.
-- Count unique session IDs from the past 5 minutes SELECT COUNT(DISTINCT sid) FROM analytics_pageviews WHERE website_id = $1 AND created_at > (NOW() - INTERVAL '5 minutes') LIMIT 20;
Putting it all together
Was this really worth it?
I’ve been using this solution for over a year now and yes, it was worth it. I love having only provides the necessary information and can be customized in any way imaginable! 💄
My DIY solution won’t last forever, though. A simple load test showed that a ~30x in views would start slowing down those complex aggregate queries. But, moving to something like TimescaleDB Continuous Aggregates should be a straightforward migration and would scale much further.
So, should you build your own analytics tool? Heck ya! It’s a small-scope project and a great way to learn about data visualization, advanced SQL, and real-time updates! Maybe don’t use it for your next VC-funded startup but it’s a great way to track all those side-projects.
Let me know if you have any questions! Oh, and the source can be found at github.com/gschier/analytics. 👋