Unleash your data capabilities (with Snowplow Analytics)

Here at Gousto we are very customer and product focused, tailoring our service to customers all the time. For example, last year we were the first recipe kit delivery company to introduce a shorter lead time, enabling our customers to receive their boxes 3 days after ordering rather than 10 days as before. This year there is much more to come - I’m sure you will notice it.

Such initiatives are heavily data-driven - to make the shorter lead time viable, for example, we need to forecast demand very accurately. But beyond that, we strongly believe that data is the voice of our customers, so (almost) all our decisions are based on and validated by data, whether they are made in marketing, operations or the food team.

To empower all our teams to produce a tailor-made product, we’ve been working hard to set up a state-of-the-art analytics pipeline that provides them with very granular and reliable data.

The core of this analytics pipeline is Snowplow, an open-source event analytics platform which I believe is a game-changer in the data analytics world. The platform leverages the latest (AWS) technologies, is cost effective [1] and has a lively community [2].

“Snowplow is the event data pipeline, enabling you to track all events across all channels and action the data in real-time.”

Why not just use Google Analytics?

Before Snowplow, we were using Google Analytics (GA) for web and marketing analytics, our transactional database for cohort/retention analytics, and email campaigns were barely analysed beyond Mailchimp’s default reports. All those data sources were quite disconnected. For example, it was difficult (if not impossible) to answer questions like:

  • “What is the retention of customers who signed up through a particular marketing channel using mobile devices?”
  • “How much time do high-value customers spend on our website?”
  • “What proportion of customers visited our website during a TV ad but signed up later?”

The main issue is that GA does not give you access to user-level data (all reports are sampled) unless you pay for a premium account, which is quite expensive. But even then, it’s tricky to connect or integrate transactional data into GA in a way that would let you answer the questions above. You need to do some hacking, which can easily escalate into a super mess.

To crack this nut, I started looking into solutions that would enable us to link all those data sources and build a unified customer journey from the very first visit. Frankly, there are quite a few such tools/platforms, offering various functionalities with different business models and focuses, e.g. Segment, Keen IO, Kissmetrics, Mixpanel, Heap Analytics. I’m not going to compare Snowplow with these other tools/platforms; instead, here are the main differentiators that made me choose Snowplow.

  1. You own the data - you are not locked in to a particular provider or database; it’s really up to you where you want to store the data. The data is stored in S3 by default, and Redshift, AWS’s scalable columnar database, works out of the box.
  2. Open-sourced and highly customisable - you can easily add custom trackers, webhooks, enrichments, data models or metrics that suit your business.
  3. Runs in AWS - all our services run in AWS anyway, so it was really convenient for us to integrate Snowplow into existing infrastructure.
  4. Cost effective - there are no additional costs besides AWS resources for which you pay as you go. Also, you can control AWS costs by optimising AWS resources.
  5. Suitable for online and ‘offline’ - a lot of the tools mentioned above focus mainly on tracking events on websites and/or native mobile apps, but Snowplow, thanks to its customisability, lets you easily plug in 3rd-party data sources, e.g. deliveries, phone calls, email campaigns etc.
  6. Unified log - all the events are stored in one place (Redshift), which guarantees us a single source of truth (see the example query after this list).
  7. Strong community - contributors on the online forum [2] are very responsive and technical. They also run regular meetups across the world, where engineers typically share their experience and results.
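
To make points 5 and 6 more concrete, here is a minimal sketch of the kind of query a unified log enables: web behaviour and transactional data side by side in one Redshift query. The production.orders table and its columns are hypothetical placeholders for your own transactional schema; snowplow_atomic.events is the same events table used later in this post.

-- Hypothetical sketch: join Snowplow events with a transactional orders table.
SELECT DATE(o.created_at) AS order_date,
       COUNT(DISTINCT o.order_id) AS orders,
       COUNT(DISTINCT e.domain_userid) AS visitors_with_events,
       SUM(CASE WHEN e.dvce_type = 'Mobile' THEN 1 ELSE 0 END) AS mobile_page_views
FROM production.orders o                  -- hypothetical transactional table
LEFT JOIN snowplow_atomic.events e
       ON e.user_id = o.user_id           -- user_id is set by the tracker once a visitor is identified
      AND e.event = 'page_view'
GROUP BY 1
ORDER BY 1;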

That’s great, but how does it work?

The data pipeline has a modular architecture, allowing you to choose which parts you want to implement. Its central concept is the event, which is not limited to page views or clickstream data, but can be pretty much anything. In our case, for example, besides the obvious events such as a customer signing up or placing an order, we can collect events when a customer receives a box or calls our customer care team, etc.
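
As a quick illustration, once events land in Redshift they all share the same canonical structure, so you can see the mix of event types you are collecting with one simple query (the snowplow_atomic schema name matches the query later in this post; the 1-day window is just an example):

-- Count events of each type collected over the last day.
-- 'page_view' and 'page_ping' come from the JavaScript tracker;
-- 'struct' and 'unstruct' cover custom structured and self-describing events.
SELECT event,
       COUNT(*) AS event_count
FROM snowplow_atomic.events
WHERE collector_tstamp >= DATEADD(day, -1, GETDATE())
GROUP BY 1
ORDER BY 2 DESC;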

Each event is processed through the pipeline in multiple stages and stored in one or more data stores. The main stages are outlined below.

  1. Trackers & webhooks are elements (pieces of software) capturing/triggering events when they occur and send them to the collector. They are meant to be easily integrated into your website (JavaScript), mobile applications (iOS/Android SDKs) or your backend code (Python/Java/Go etc.). And 3rd party webhooks can be used as well. See the list of available trackers and webhooks here and here.
  2. Collector is a backend code (‘server’) receiving events from trackers. There are three different collectors at the moment. They are all implemented as AWS resources in an auto-scaling group, sinking events in S3 or Kinesis. Clojure collector and Scala stream collector both enables tracking across domain, while Cloudfront collector is not, but it is much quicker to set up.
  3. Validation stage performs basic validation against your defined event ontologies (self-describing JSONs). The good thing is that data that fails validation is kept in an error log. This means that no data is dropped during the processing stage so you can fix errors later.
  4. Enrichment stage processes events and add some extra information to them. There are some out of the box configurable enrichments, e.g. observing location information from IP addresses, extracting device information from user-agent and parsing referrer data for tracking sources of site traffic.
  5. Modeling process has three-fold function. It enables you to: 1. join events with other existing datasets such as transactional, marketing, financial data etc., 2. to aggregate event data into smaller datasets such as sessions, visitors etc. and 3. to apply business logic such as user segmentation, marketing attribution etc.
  6. Storage stage enables you to store processed, enriched and modelled data to a variety of data stores. The data is stored into S3 by default, but Snowplow also support Redshift, PostgreSQL and ElasticSearch out of the box.
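
To give a flavour of what the enrichment and storage stages produce, here is the sketch referenced in point 4 above: the enriched fields end up as ordinary columns on the events table in Redshift, so geography from the IP lookup, device type from the user-agent and referrer medium from the referrer parser can all be queried directly. The 7-day window is just an illustration.

-- Enriched fields are plain columns on the events table, so slicing traffic
-- by geography, device and referrer medium is a simple GROUP BY.
SELECT geo_country,                          -- added by the IP lookups enrichment
       dvce_type,                            -- added by the user-agent enrichment
       refr_medium,                          -- added by the referrer parsing enrichment
       COUNT(DISTINCT domain_userid) AS visitors
FROM snowplow_atomic.events
WHERE event = 'page_view'
  AND collector_tstamp >= DATEADD(day, -7, GETDATE())
GROUP BY 1, 2, 3
ORDER BY 4 DESC;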

The pipeline can be run in batch mode or as real-time stream processing. As the real-time pipeline is not production-ready yet, we are still running ours in batch mode. However, we are already looking into implementing the real-time pipeline in parallel with the current one. I will write about it in a future post.

What about some real use cases?

This is my first post on Snowplow, with the purpose of covering the basics and getting you ready for more technical Snowplow material in the future. But let me give you a simple example of how to measure the conversion rate (in a better way) anyway. As a teaser.

The conversion rate is calculated as the number of conversions divided by the number of new unique visitors during a particular time period. The number of conversions might simply be the number of new customers in your transactional database, while the number of new visitors is a bit more vague.

Commonly, the number of new unique visitors reported by GA is used. The conversion rate based on this metric actually gives you the conversion rate of first-time visitors only. But not all visitors convert the first time they visit your website, right? For example, a lot of visitors come to your website when they see your TV ad, but they don’t convert immediately: exploring your service might take longer than the ad break lasts, and they don’t want to miss the show they are watching, so they come back some other time and convert then.

A more appropriate denominator for the conversion rate would therefore be the number of non-converted visitors. So let’s have a look at how it can be worked out from Snowplow data with the SQL query below.

WITH events AS (
    -- All page views on your website, with the timestamp of each visitor's previous event
    SELECT domain_userid,
           network_userid,
           collector_tstamp,
           page_urlpath,
           dvce_type,
           user_id,
           LAG(collector_tstamp, 1) OVER (PARTITION BY domain_userid ORDER BY collector_tstamp) AS last_event
    FROM snowplow_atomic.events
    WHERE event = 'page_view'
      AND page_urlhost = 'YOUR_WEBSITE_HOST'
),

final AS (
    -- Flag the start of a new session after 1 hour (3600 s) of inactivity
    SELECT *,
           CASE WHEN DATE_PART('EPOCH', collector_tstamp) - DATE_PART('EPOCH', last_event) >= 3600
                  OR last_event IS NULL
                THEN 1 ELSE 0 END AS is_new_session
    FROM events
),

sessions_events AS (
    -- Running sum of session starts gives each visitor's session number
    SELECT *,
           SUM(is_new_session) OVER (PARTITION BY domain_userid ORDER BY collector_tstamp
                                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS user_session_id
    FROM final
),

sessions AS (
    -- Collapse events into one row per session
    SELECT DISTINCT
           domain_userid,
           user_session_id,
           MIN(collector_tstamp) OVER (PARTITION BY domain_userid, user_session_id) AS started_at,
           MAX(collector_tstamp) OVER (PARTITION BY domain_userid, user_session_id) AS ended_at,
           FIRST_VALUE(page_urlpath) OVER (PARTITION BY domain_userid, user_session_id ORDER BY collector_tstamp
                                           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS landing_page,
           LAST_VALUE(page_urlpath) OVER (PARTITION BY domain_userid, user_session_id ORDER BY collector_tstamp
                                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS bouncing_page,
           CASE WHEN MAX(user_id) OVER (PARTITION BY domain_userid, user_session_id) IS NULL THEN 0 ELSE 1 END AS identified,
           SUM(CASE WHEN page_urlpath ILIKE '%YOUR_WELCOME_PAGE%' THEN 1 ELSE 0 END)
               OVER (PARTITION BY domain_userid, user_session_id) AS welcome_num,
           MIN(CASE WHEN page_urlpath ILIKE '%YOUR_WELCOME_PAGE%' THEN collector_tstamp ELSE NULL END)
               OVER (PARTITION BY domain_userid, user_session_id) AS welcome_at,
           MAX(dvce_type) OVER (PARTITION BY domain_userid, user_session_id) AS device_type
    FROM sessions_events
),

sessions_conv AS (
    -- A session belongs to a non-converted visitor if none of the visitor's
    -- previous sessions was identified or reached the welcome (confirmation) page
    SELECT domain_userid,
           user_session_id,
           started_at,
           ended_at,
           device_type,
           CASE WHEN SUM(identified) OVER (PARTITION BY domain_userid ORDER BY user_session_id
                                           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) > 0
                  OR SUM(welcome_num) OVER (PARTITION BY domain_userid ORDER BY user_session_id
                                            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) > 0
                THEN 0 ELSE 1 END AS not_converted_visitor,
           welcome_num
    FROM sessions
)

-- Daily conversion rate per device type: sessions that converted
-- (reached the welcome page while still non-converted) over all sessions
SELECT DATE(started_at) AS date,
       device_type,
       COUNT(1) AS sessions,
       SUM(CASE WHEN not_converted_visitor >= 1 AND welcome_num >= 1 THEN 1 ELSE 0 END) AS conversions,
       CAST(SUM(CASE WHEN not_converted_visitor >= 1 AND welcome_num >= 1 THEN 1 ELSE 0 END) AS FLOAT) / COUNT(1) * 100 AS conversion_rate
FROM sessions_conv
GROUP BY 1, 2
ORDER BY 1, 2

Note that this conversion rate will be lower than one based on the number of new unique visitors, as the latter will always be smaller than the number of non-converted visitors. However, remember that in the end the most important thing is the trend, no matter which metric you use.

This query can be used as a basis for digging deeper into conversions and other web metrics. For example, you can figure out how many times customers visit your website before they convert and break the conversion rate down by the number of visits (see the sketch below). Moreover, you can work out which channels visitors came through before they converted, which leads you to marketing attribution, a very hot topic nowadays, and much more. I’m going to write more about all of this in future posts, so stay tuned.
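
As a hedged sketch of the first idea, the sessions_conv CTE above already contains everything needed: user_session_id is a running session counter per visitor, so for each converting session user_session_id - 1 is the number of visits that came before it. Keeping the WITH clause above unchanged, the final SELECT could be replaced with something like this:

-- Break conversions down by how many sessions a visitor had before converting.
SELECT user_session_id - 1 AS prior_sessions,
       COUNT(1) AS conversions
FROM sessions_conv
WHERE not_converted_visitor = 1
  AND welcome_num >= 1
GROUP BY 1
ORDER BY 1;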

Dejan
Senior Data Scientist


  1. This pricing calculator is based on the old AWS pricing model. The costs should be even lower nowadays.
  2. They used to use Google Groups, but they moved to Discourse recently.