🪴 Home-Grown Web Analytics

Feb 5, 2023 • 5 min read

Website analytics tools are complicated—and usually expensive. Sure, Google Analytics is free but it requires you to donate your user’s behavior data to help Google’s ad business. Yuck! Luckily, around the time GDPR was introduced, a handful of developers got to work and created some privacy-focused alternatives.

Fathom is the most popular of these new tools but others like Plausible, Umami, and Offen also exist. There’s one problem, though; these new offerings are extremely basic compared the behemoths like Google Analytics—only providing simple stats like views, popular pages, and screen sizes.

These limitations are fine for my personal projects, however, so I could have used one, but the simplicity got me wondering if I could just build it? Long story short, I spent a weekend making analytics.schier.co and pointed my website at it.

Overview of DIY Analytics Dashboard

Architecting the data

The hardest part of writing analytics software is maintaining query performance at scale. Luckily, my website maxes out at ~1000 daily visitors so scale isn’t an issue. Raw page view events can simply be stored and queried from a regular Postgres table.

The simplest solution requires just a single page view table, but I created two more tables for user accounts and websites since I plan to add a login system later. In the meantime, it’s fine to manually create these records by logging into the database. 🤓

Here’s what the schema looks like:

Database Schema Structure

The only thing left is to massage the data into a format usable by the front-end. Note that the following SQL queries are specific to Postgres, but MySQL has similar features with a slightly different syntax.

Sending the tracking data

Data gets into the page view table via HTTP request. Like most analytics tools, the tracked website embeds a snippet on every page that executes a script to generate and send the required parameters. Here’s an example of one of those requests.

Tracking request taken from Chrome DevTools

If you’d like to inspect the snippet in detail, it can be viewed on GitHub. Now let’s get to the fun part: querying.

Querying chart data in buckets

Postgres is great at generating bucketed time series results which must be done in order to have a count for every point on the chart. The following SQL generates bucketed data by flooring created_at to the nearest hour and using it in a GROUP BY aggregation.

-- Count total and unique views, bucketed by hour
SELECT COUNT(id)              AS count_total,
       COUNT(DISTINCT sid)    AS count_unique,
       TO_TIMESTAMP(FLOOR((EXTRACT('epoch' FROM created_at) / 3600)) * 3600)
           AT TIME ZONE 'UTC' AS bucket
FROM analytics_pageviews
WHERE website_id = $1 
    AND created_at BETWEEN $2 AND $3
GROUP BY bucket;

-- Example Result:
-- 16, 13, 2021-07-21 06:00:00.000000
-- 21, 16, 2021-07-21 07:00:00.000000
-- 12, 11, 2021-07-21 08:00:00.000000
-- 11, 10, 2021-07-21 09:00:00.000000
-- ...

Querying popularity lists

Perhaps the most useful feature of a analytics tools is seeing the most popular pages, countries, and screen sizes. Here’s what the data looks like in the dashboard.

Overview of popularity lists

To limit the number of expensive queries sent to the database, popular pages, countries, and screen sizes can be combined into a single query using GROUPING SETS. Grouping Sets allow aggregation into multiple groups within a single query! That’s three queries in one! 👏

-- Aggregate popular screens, pages, countries
SELECT screen_size,
       path,
       host,
       country_code,
       COUNT(id)           AS count_total,
       COUNT(DISTINCT sid) AS count_unique
FROM analytics_pageviews
WHERE country_code != ''
    AND website_id = $1
    AND created_at BETWEEN $2 AND $3
GROUP BY GROUPING SETS (
    (screen_size),
    (country_code),
    (path, host),
    ()
)
ORDER BY count_unique DESC
LIMIT 50;

Querying live users

Finally, it’s time to tackle the “live” users query. Most analytics tools define a live user as someone who’s viewed a page recently. The following query does just that by counting distinct sessions IDs in the last 5 minutes.

-- Count unique session IDs from the past 5 minutes
SELECT COUNT(DISTINCT sid) FROM analytics_pageviews
WHERE website_id = $1 
    AND created_at > (NOW() - INTERVAL '5 minutes')
LIMIT 20;

Putting it all together

The only piece left to mention is the dashboard, which periodically fetches those three queries from a Go backend and displays the data using React, TailwindCSS, and visx charts.

Railway dashboard showing the Postgres and Go services

It’s automatically deployed to Railway and can be found at analytics.schier.co.

Was this really worth it?

I’ve been using this solution for over a year now and yes, it was worth it. I love having only provides the necessary information and can be customized in any way imaginable! 💄

My DIY solution won’t last forever, though. A simple load test showed that a ~30x in views would start slowing down those complex aggregate queries. But, moving to something like TimescaleDB Continuous Aggregates should be a straightforward migration and would scale much further.

So, should you build your own analytics tool? Heck ya! It’s a small-scope project and a great way to learn about data visualization, advanced SQL, and real-time updates! Maybe don’t use it for your next VC-funded startup but it’s a great way to track all those side-projects.

Let me know if you have any questions! Oh, and the source can be found at github.com/gschier/analytics. 👋