Snowplow: evolve your analytics stack with your business



  • Snowplow: evolve your analytics stack with your business

    Snowplow Meetup San Francisco, Feb 2017

  • Our businesses are constantly evolving

    Our digital products (apps and platforms) are constantly developing

    The questions we ask of our data are constantly changing

    It is critical that our analytics stack can evolve with our business

  • Self-describing data + event data modeling = an analytics stack that evolves with your business

    How Snowplow users evolve their analytics stacks with their business

  • Self-describing data: overview

  • Event data varies widely by company

  • As a Snowplow user, you can define your own events and entities


    Events: Build castle, Form alliance, Declare war
    Entities (contexts): Player, Game, Level, Currency

    Events: View product, Buy product, Deliver product
    Entities (contexts): Product, Customer, Basket, Delivery van
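The event/entity split above can be made concrete. A minimal sketch, using Snowplow's self-describing JSON convention (schema URI plus data payload) with a made-up `com.acme` vendor and hypothetical schema URIs and field names:

```python
# Hypothetical sketch: a "view product" event plus its entities (contexts),
# each wrapped as a self-describing JSON (schema URI + data payload).
# The com.acme vendor and all field names are made up for illustration.

view_product_event = {
    "schema": "iglu:com.acme/view_product/jsonschema/1-0-0",
    "data": {"product_sku": "SP-001"},
}

# Entities travel alongside the event as a list of contexts:
contexts = [
    {
        "schema": "iglu:com.acme/product/jsonschema/1-0-0",
        "data": {"sku": "SP-001", "name": "T-shirt"},
    },
    {
        "schema": "iglu:com.acme/customer/jsonschema/1-0-0",
        "data": {"customer_id": "c-42"},
    },
]
```

The same event type can be reused across entities; it is the attached contexts that carry the who/what/where.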

  • You then define a schema for each event and entity

    {
      "$schema": "",
      "description": "Schema for a fighter context",
      "self": {
        "vendor": "com.ufc",
        "name": "fighter_context",
        "format": "jsonschema",
        "version": "1-0-1"
      },
      "type": "object",
      "properties": {
        "FirstName": { "type": "string" },
        "LastName": { "type": "string" },
        "Nickname": { "type": "string" },
        "FacebookProfile": { "type": "string" },
        "TwitterName": { "type": "string" },
        "GooglePlusProfile": { "type": "string" },
        "HeightFormat": { "type": "string" },
        "HeightCm": { "type": ["integer", "null"] },
        "Weight": { "type": ["integer", "null"] },
        "WeightKg": { "type": ["integer", "null"] },
        "Record": { "type": "string", "pattern": "^[0-9]+-[0-9]+-[0-9]+$" },
        "Striking": { "type": ["number", "null"], "maxdecimal": 15 },
        "Takedowns": { "type": ["number", "null"], "maxdecimal": 15 },
        "Submissions": { "type": ["number", "null"], "maxdecimal": 15 },
        "LastFightUrl": { "type": "string" },
        "LastFightEventText": { "type": "string" },
        "NextFightUrl": { "type": "string" },
        "NextFightEventText": { "type": "string" },
        "LastFightDate": { "type": "string", "format": "timestamp" }
      },
      "additionalProperties": false
    }

    Upload the schema to Iglu

  • Then send data into Snowplow as self-describing JSONs

    1. Validation  2. Dimension widening  3. Data modeling


    {
      "schema": "iglu:com.israel365/temperature_measure/jsonschema/1-0-0",
      "data": {
        "timestamp": "2016-11-16 19:53:21",
        "location": "Berlin",
        "temperature": 3,
        "units": "Centigrade"
      }
    }

    {
      "$schema": "",
      "description": "Schema for a temperature measure event",
      "self": {
        "vendor": "com.israel365",
        "name": "temperature_measure",
        "format": "jsonschema",
        "version": "1-0-0"
      },
      "type": "object",
      "properties": {
        "timestamp": { "type": "string" },
        "location": { "type": "string" }
      }
    }


    Schema reference
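To make the validation step concrete, here is a minimal hand-rolled check in Python that handles only top-level `type` and `required` keywords; a real pipeline would use a full JSON Schema validator rather than this sketch:

```python
# Minimal validation sketch: check required fields and top-level property
# types against a simplified schema. Real pipelines use a full JSON Schema
# validator; this only illustrates the idea.

TYPES = {"string": str, "integer": int, "number": (int, float), "object": dict}

def validate(instance, schema):
    """Return a list of error messages; an empty list means the instance passes."""
    errors = []
    for field in schema.get("required", []):
        if field not in instance:
            errors.append(f"missing required field: {field}")
    for field, rule in schema.get("properties", {}).items():
        if field in instance and not isinstance(instance[field], TYPES[rule["type"]]):
            errors.append(f"{field}: expected {rule['type']}")
    return errors

schema = {
    "properties": {"timestamp": {"type": "string"},
                   "location": {"type": "string"},
                   "temperature": {"type": "number"}},
    "required": ["timestamp", "location"],
}

good = {"timestamp": "2016-11-16 19:53:21", "location": "Berlin", "temperature": 3}
bad = {"timestamp": "2016-11-16 19:53:21", "location": 42}  # wrong type

assert validate(good, schema) == []
assert validate(bad, schema) == ["location: expected string"]
```

Rejected events can then be routed to a bad-rows stream instead of silently corrupting the warehouse.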


  • The schemas can then be used in a number of ways

    Validate the data (important for data quality)

    Load the data into tidy tables in your data warehouse

    Make it easy / safe to write downstream data processing applications (e.g. for real-time users)

  • Event data modeling: overview

  • What is event data modeling?

    1. Validation  2. Dimension widening  3. Data modeling


    Event data modeling is the process of using business logic to aggregate over event-level data to produce 'modeled' data that is easier to query.

  • Unmodeled data (event 1 … event n) is immutable, unopinionated, hard to consume, and not contentious

    Modeled data is mutable and opinionated, easy to consume, and may be contentious
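The unmodeled-to-modeled step can be sketched in a few lines of Python; the event names and fields here are illustrative, not Snowplow's canonical columns:

```python
from collections import defaultdict

# Sketch: raw, immutable event-level rows (unmodeled data)...
events = [
    {"user_id": "u1", "event": "view_product"},
    {"user_id": "u1", "event": "buy_product"},
    {"user_id": "u2", "event": "view_product"},
]

# ...aggregated with business logic into a modeled per-user table.
# The modeled table is opinionated (e.g. what counts as a purchase) and
# easy to query, but must be recomputed whenever that logic changes.
modeled = defaultdict(lambda: {"views": 0, "purchases": 0})
for e in events:
    if e["event"] == "view_product":
        modeled[e["user_id"]]["views"] += 1
    elif e["event"] == "buy_product":
        modeled[e["user_id"]]["purchases"] += 1
```

Because the raw events are kept, the aggregation can always be rerun with different logic over the full history.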

  • In general, event data modeling is performed on the complete event stream

    Late arriving events can change the way you understand earlier arriving events

    If we change our data models, this gives us the flexibility to recompute historical data based on the new model

  • The evolving event data pipeline

  • How do we handle pipeline evolution?


    What is being tracked will change over time

    What questions are being asked of the data will change over time

    Businesses are not static, so event pipelines should not be either




    [Pipeline diagram: collection → processing → data warehouse, feeding data exploration, predictive modeling, real-time dashboards, and real-time data-driven applications (bidder, voucher engine, comms channels, smart car / home)]
  • Push example: new source of event data

    If data is self-describing, it is easy to add additional sources

    Self-describing data is good for managing bad data and pipeline evolution

    "I'm an email send event, and I have information about the recipient (email address, customer ID) and the email (id, tags, variation)"

  • Pull example: new business question




  • Answering the question: 3 possibilities

    1. Existing data model supports answer: it is possible to answer the question with existing modeled data

    2. Need to update data model: the data collected already supports the answer, but additional computation (additional logic) is required in the data modeling step

    3. Need to update data model and data collection: need to extend event tracking, then update data models to incorporate the additional data (and potentially additional logic)

  • Self-describing data and the ability to recompute data models are essential to enable pipeline evolution

    Self-describing data lets you:

    Update existing events and entities in a backwards-compatible way, e.g. add optional new fields

    Update existing events and entities in a backwards-incompatible way, e.g. change field types, remove fields, add compulsory fields

    Add new event and entity types

    Recomputing data models on the entire data set lets you:

    Add new columns to existing derived tables, e.g. add a new audience segmentation

    Change the way existing derived tables are generated, e.g. change sessionization logic

    Create new derived tables
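A sketch of why adding an optional field is the backwards-compatible case (a 1-0-0 → 1-0-1 bump in Snowplow's SchemaVer convention), using deliberately simplified string-only schemas; the `country` field is made up:

```python
# Sketch: evolving a schema by adding an OPTIONAL field is backwards
# compatible -- events captured under 1-0-0 still validate against 1-0-1.
# Schemas are simplified (string-typed fields only); "country" is made up.

schema_1_0_0 = {
    "properties": {"location": {"type": "string"}},
    "required": ["location"],
}
schema_1_0_1 = {
    "properties": {"location": {"type": "string"},
                   "country": {"type": "string"}},  # new optional field
    "required": ["location"],                       # unchanged
}

def valid(instance, schema):
    """Simplified check: required fields present, known fields are strings."""
    return (all(f in instance for f in schema["required"])
            and all(isinstance(v, str)
                    for f, v in instance.items() if f in schema["properties"]))

old_event = {"location": "Berlin"}

# Old data remains valid under both versions -> a safe, additive change.
assert valid(old_event, schema_1_0_0) and valid(old_event, schema_1_0_1)

# Making "country" required instead would be backwards incompatible
# (a major-version change): historical events would fail validation.
```

This is exactly why backwards-incompatible changes (removed fields, changed types, new compulsory fields) force either a new major schema version or a recompute of historical data.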

  • Questions?