© 2019 Snowflake Inc. All Rights Reserved
Continuous Data Pipelines
JORDAN UDE, SENIOR SALES ENGINEER
WORLD SUMMIT - CHICAGO
Data Arrives Continuously
Freshness is Critical for Decision Making

VOLUME VELOCITY VARIETY
● Clickstream ● App logs ● Unstructured data
● Breaking news ● Consumer tastes ● IoT / device-generated
● PoS transactions ● App logs ● Device events
Continuous Pipeline Objectives
● Provide built-in, simple-to-use transform capabilities (the "T" in ELT)
  ○ Reduce query latency by ingesting and transforming data quickly as it arrives
  ○ Provide Change Data Capture (CDC) for tables
  ○ Simplify transformations using SQL; no external scheduler/orchestrator required
  ○ Make within-Snowflake data pipelines easier to build and manage
● Better support telematics scenarios
  ○ More sources of data with lower ingestion latencies
  ○ Better management of ingestion pipelines, including stages
● Incrementally deliver value
  ○ Ingestion: Snowpipe with auto-ingest
  ○ Change tracking on staging tables: streams
  ○ Scheduled transforms on new data: tasks
DATA INGESTION (SNOWPIPE)
1. Low Latency
2. Semi-Structured Data
3. Ecosystem Integration

CHANGE DATA CAPTURE (STREAMS)
1. Simple Change Tracking
2. Metadata Driven Solution

IN-DATABASE TRANSFORMS (TASKS)
1. Repeatable Process
2. Low Latency
3. SQL Based Transforms

Data Pipelines Flow

REAL WORLD EXAMPLE (TPC-DI)
1. Industry Recognized Test
2. Complete Low Latency Data Flow
3. Complexity Matching Real World Needs
Purpose: ingest data into Snowflake continuously, with no code and no external orchestrator
Mechanism: files stream into Snowpipe, are parsed, and rows are inserted
Steps:
1. Create a Snowpipe with a staging bucket in the cloud and a target table in Snowflake
2. Stream data into the bucket, chunked into files
3. Auto-ingest listens for bucket notifications and loads the data
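A minimal sketch of those steps in SQL, using hypothetical names (an external stage click_stage over an S3 bucket and a single-VARIANT landing table clicks_raw; credentials / storage integration omitted). The bucket's event notifications must separately be wired to the queue that Snowflake creates for the pipe.

-- external stage over the cloud bucket that files stream into
create or replace stage click_stage
  url = 's3://my-clickstream-bucket/events/'
  file_format = (type = json);

-- landing table: one VARIANT column holds each raw JSON record
create or replace table clicks_raw (payload variant);

-- pipe with auto-ingest: loads new files as bucket notifications arrive
create or replace pipe clicks_pipe
  auto_ingest = true
as
  copy into clicks_raw
  from @click_stage;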
Snowpipe w/ Auto Ingest
Continuous Ingestion
[Architecture diagram: files land in S3, Azure Blob, or GCS; Snowpipe (w/ auto-ingest) loads them into a staging table in Snowflake; table streams on the staging table feed tasks that populate Target Table 1, Target Table 2, and Target Table 3.]
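A few commands that are handy when operating such a pipe, shown against the hypothetical clicks_pipe from the sketch above:

-- list pipes; the notification_channel column holds the queue to wire bucket events to
show pipes;

-- check execution state and pending file count for one pipe
select system$pipe_status('clicks_pipe');

-- catch up on files that landed before notifications were configured (or were missed)
alter pipe clicks_pipe refresh;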
Recap: DATA INGESTION (SNOWPIPE) delivers <1 Min Ingestion, All Data Types, Third Party Data Flows
Continuous Data Pipelines
[Same architecture diagram as above: bucket → Snowpipe (w/ auto-ingest) → staging table → table streams → tasks → target tables, annotated with the questions below.]
● How do I know what changed in the staging table?
● Where do I keep the business logic for my transformations?
● How do I run this on a schedule?
● How do I make this run reliably in the cloud?
● How do I ensure exactly once semantics?
Table streams: Change Data Capture (CDC)
● Provide the set of changes made to the underlying table since the stream was last consumed
  ○ Covers all transactions not previously consumed
● Consumed in a DML statement: designed for transforms (the "T" in ELT)
● Consuming advances the "offset" on commit, somewhat like a forward-only cursor
● Multiple streams are possible on a given table
● Intended to be consumed within the retention period: O(hours), up to a day or whatever the retention period is
Table streams: changes since offset

create or replace stream stage_changes on table stage;

insert into clicks (...)
select ... from stage_changes where metadata$action = 'INSERT';
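Filled out with hypothetical shapes (a VARIANT payload column on the staging table, typed columns on clicks), the same pattern looks like this; the stream's offset only advances when the consuming DML commits:

-- staging and target tables (hypothetical shapes)
create or replace table stage (payload variant);
create or replace table clicks (click_id string, url string, ts timestamp_ntz);

-- the stream records inserts/updates/deletes on STAGE from this point forward
create or replace stream stage_changes on table stage;

-- consuming the stream in DML moves the offset forward on commit,
-- so the next read sees only rows that have not yet been processed
insert into clicks (click_id, url, ts)
select payload:click_id::string,
       payload:url::string,
       payload:ts::timestamp_ntz
from stage_changes
where metadata$action = 'INSERT';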
Recap: CHANGE DATA CAPTURE (STREAMS) delivers Simple Creation, Zero Maintenance, Aggregate Change Tracking
Tasks
● Executes a DML statement or stored procedure
  ○ Run on a schedule; or
  ○ Run after a predecessor task completes (a pipeline scenario realized as a tree of tasks)
● Tasks can be used with table streams or by themselves
  ○ A table stream is available for use in the DML/sproc: insert, update, delete, merge
  ○ Optionally add a condition that the stream has data: when
● Use cases include (see the sketch after this list)
  ○ Shred JSON/XML from a staging table into multiple tables (using table streams)
  ○ Keep aggregates up to date and generate reports periodically
  ○ alter pipe … refresh to catch up on auto-ingest
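As a minimal sketch, reusing the hypothetical stage_changes stream and clicks table from the stream example above, plus an assumed warehouse transform_wh and aggregate table click_counts, a scheduled task and a dependent child task might look like this:

-- scheduled root task: fires every minute, but only when the stream has data
create or replace task load_clicks_task
  warehouse = transform_wh
  schedule = '1 MINUTE'
  when system$stream_has_data('STAGE_CHANGES')
as
  insert into clicks (click_id, url, ts)
  select payload:click_id::string, payload:url::string, payload:ts::timestamp_ntz
  from stage_changes
  where metadata$action = 'INSERT';

-- child task: runs only after the loader completes, keeping an aggregate current
create or replace task refresh_click_counts_task
  warehouse = transform_wh
  after load_clicks_task
as
  create or replace table click_counts as
  select url, count(*) as clicks from clicks group by url;

-- tasks are created suspended; resume the child first, then the root
alter task refresh_click_counts_task resume;
alter task load_clicks_task resume;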
BEYOND SCHEDULED DML
Tasks allow scheduled execution of DML, but they enable a lot more than managed cron:
● Streams ensure exactly-once semantics for new/changed data (CDC): the offset is updated on commit, so you see only unprocessed rows.
● Data is not lost even if a task's execution window is missed; the next execution of a task that uses a stream automatically catches up on whatever was missed due to an error.
● Streams + Tasks provide a clean separation between rapid ingestion into a staging table and transactional transforms on the ingested data.
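To confirm that a task is running and catching up after a missed window, recent runs can be inspected with the INFORMATION_SCHEMA task history table function; the task name below is the hypothetical one from the earlier sketch:

select name, state, scheduled_time, completed_time, error_message
from table(information_schema.task_history(task_name => 'LOAD_CLICKS_TASK'))
order by scheduled_time desc;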
Recap: IN-DATABASE TRANSFORMS (TASKS) deliver Captures Business Logic, Process Rows Exactly Once, Complete Integration with Streams
TPC.org
The TPC is a non-profit corporation founded to define transaction processing and database benchmarks and to disseminate objective, verifiable TPC performance data to the industry.
TPC-DS
TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
TPC-H
The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications.
TPC-DI
The TPC-DI benchmark combines and transforms data extracted from an On-Line Transaction Processing (OLTP) system along with other sources of data, and loads it into a data warehouse. The source and destination data models, data transformations, and implementation rules have been designed to be broadly representative of modern data integration requirements.
TPC-DI in SF Using Snowflake Data Pipeline Features
[Diagram: source data is ingested with COPY / Snowpipe / Kafka and transformed with Streams / Tasks / Procs.]
TPC-DI DEPENDENCIES
Arrow pointing to a task indicates a dependency on that task
[Dependency diagram over: DIM_DATE, DIM_TIME, DIM_TAX_RATE, DIM_STATUS_TYPE, DIM_TRADE_TYPE, DIM_INDUSTRY, DIM_BROKER, DIM_COMPANY, DIM_FINANCIAL, DIM_CUSTOMER, DIM_ACCOUNT, DIM_SECURITY, DIM_TRADE, FACT_CASH_BALANCES, FACT_HOLDINGS, FACT_WATCHES, FACT_MARKET_HISTORY, FACT_PROSPECT]
TPC-DI PROCESSING: JOBS
Batch 1 - Historical Load
Batch 2 - Incremental Load 1
Batch 3 - Incremental Load 2
The Historical Load has more sources than the Incremental. Code is shared between Historical and Incremental where possible.
TASK PREDECESSORS - HISTORICAL
Arrow pointing to a task indicates a predecessor link to that task; a dotted line with a ball point indicates a task condition.
[Task graph over: CONTROL_TSK, REFERENCE_TSK, DIM_BROKER_TSK, DIM_COMPANY_TSK, DIM_FINANCIAL_TSK, DIM_CUSTOMER_TSK, DIM_ACCOUNT_TSK, DIM_SECURITY_TSK, DIM_TRADE_TSK, FACT_CASH_BALANCES_TSK, FACT_HOLDINGS_TSK, FACT_WATCHES_TSK, FACT_MARKET_HISTORY_TSK, FACT_PROSPECT_TSK, and streams DIM_SECURITY_STM, DIM_ACCOUNT_STM]
TASK PREDECESSORS - INCREMENTAL
Arrow pointing to a task indicates a predecessor link to that task.
[Task graph over: DIM_CUSTOMER_TSK, DIM_ACCOUNT_TSK, DIM_TRADE_TSK, FACT_CASH_BALANCES_TSK, FACT_HOLDINGS_TSK, FACT_WATCHES_TSK, FACT_MARKET_HISTORY_TSK, FACT_PROSPECT_TSK, and streams CUSTOMER_STG_STM, DAILYMARKET_STG_STM]
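Purely as a schematic fragment (the real predecessor links are the ones drawn in the diagrams above; the warehouse and procedure names are hypothetical), a predecessor link plus a stream-based task condition are written like this:

-- scheduled task gated on its staging stream having data
create or replace task DIM_CUSTOMER_TSK
  warehouse = di_wh                                    -- hypothetical warehouse
  schedule = '5 MINUTE'
  when system$stream_has_data('CUSTOMER_STG_STM')     -- task condition on a stream
as
  call load_dim_customer();                            -- hypothetical stored procedure

-- dependent task: runs only after its predecessor completes
create or replace task DIM_ACCOUNT_TSK
  warehouse = di_wh
  after DIM_CUSTOMER_TSK                               -- predecessor link (illustrative)
as
  call load_dim_account();                             -- hypothetical stored procedure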
Summary
DATA INGESTION (SNOWPIPE): <1 Min Ingestion, All Data Types, Reuse Existing Data Flows
CHANGE DATA CAPTURE (STREAMS): Simple Creation, Zero Maintenance, Aggregate Change Tracking
IN-DATABASE TRANSFORMS (TASKS): Captures Business Logic, Process Rows Exactly Once, Complete Integration with Streams
Data Pipelines Flow
REAL WORLD EXAMPLE (TPC-DI): Full and Incremental Load Processes, Fully Automated, All Within Snowflake Service
Thank You