© 2019 Snowflake Inc. All Rights Reserved
Continuous Data Pipelines
JORDAN UDE, SENIOR SALES ENGINEER
WORLD SUMMIT - CHICAGO
Data Arrives Continuously
Freshness is Critical for Decision Making

VOLUME VELOCITY VARIETY
● Clickstream ● App logs ● Unstructured data
● Breaking news ● Consumer tastes ● IoT / device-generated
● PoS transactions ● App logs ● Device events
Continuous Pipeline Objectives
● Provide built-in, simple-to-use transform capabilities (the "T" in ELT)
  ○ Reduce query latency by ingesting and transforming data quickly as it arrives
  ○ Provide Change Data Capture (CDC) for tables
  ○ Simplify transformations using SQL; no external scheduler/orchestrator required
  ○ Make within-Snowflake data pipelines easier to build and manage
● Better support telematics scenarios
  ○ More sources of data with lower ingestion latencies
  ○ Better management of ingestion pipelines, including stages
● Incrementally deliver value
  ○ Ingestion: Snowpipe with auto-ingest
  ○ Change tracking on staging tables: streams
  ○ Scheduled transforms on new data: tasks
DATA INGESTION (SNOWPIPE)
1. Low Latency
2. Semi-Structured Data
3. Ecosystem Integration

CHANGE DATA CAPTURE (STREAMS)
1. Simple Change Tracking
2. Metadata Driven Solution

IN-DATABASE TRANSFORMS (TASKS)
1. Repeatable Process
2. Low Latency
3. SQL Based Transforms

Data Pipelines Flow

REAL WORLD EXAMPLE (TPC-DI)
1. Industry Recognized Test
2. Complete Low Latency Data Flow
3. Complexity Matching Real World Needs
Purpose: ingest data into Snowflake continuously, with no code and no external orchestrator
Mechanism: files stream into Snowpipe, are parsed, and rows are inserted
Steps:
1. Create a Snowpipe with a staging bucket in the cloud and a target table in Snowflake
2. Stream data into the bucket, chunked into files
3. Auto-ingest listens for bucket notifications and loads the data
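A minimal sketch of those steps in SQL, using hypothetical names (an external stage click_stage over an S3 bucket and a single-VARIANT landing table clicks_raw; credentials / storage integration omitted). The bucket's event notifications must separately be wired to the queue that Snowflake creates for the pipe.

-- external stage over the cloud bucket that files stream into
create or replace stage click_stage
  url = 's3://my-clickstream-bucket/events/'
  file_format = (type = json);

-- landing table: one VARIANT column holds each raw JSON record
create or replace table clicks_raw (payload variant);

-- pipe with auto-ingest: loads new files as bucket notifications arrive
create or replace pipe clicks_pipe
  auto_ingest = true
as
  copy into clicks_raw
  from @click_stage;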
Snowpipe w/ Auto Ingest
Continuous Ingestion
[Architecture diagram: files land in S3, Azure Blob, or GCS; Snowpipe (w/ auto-ingest) loads them into a staging table in Snowflake; table streams on the staging table feed tasks that populate Target Table 1, Target Table 2, and Target Table 3.]
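A few commands that are handy when operating such a pipe, shown against the hypothetical clicks_pipe from the sketch above:

-- list pipes; the notification_channel column holds the queue to wire bucket events to
show pipes;

-- check execution state and pending file count for one pipe
select system$pipe_status('clicks_pipe');

-- catch up on files that landed before notifications were configured (or were missed)
alter pipe clicks_pipe refresh;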
Recap: DATA INGESTION (SNOWPIPE) delivers <1 Min Ingestion, All Data Types, Third Party Data Flows
Continuous Data Pipelines
[Same architecture diagram as above: bucket → Snowpipe (w/ auto-ingest) → staging table → table streams → tasks → target tables, annotated with the questions below.]
● How do I know what changed in the staging table?
● Where do I keep the business logic for my transformations?
● How do I run this on a schedule?
● How do I make this run reliably in the cloud?
● How do I ensure exactly once semantics?
Table streams: Change Data Capture (CDC)
● Provide the set of changes made to the underlying table since the stream was last consumed
  ○ Covers all transactions not previously consumed
● Consumed in a DML statement: designed for transforms (the "T" in ELT)
● Consuming advances the "offset" on commit, somewhat like a forward-only cursor
● Multiple streams are possible on a given table
● Intended to be consumed within the retention period: O(hours), up to a day or whatever the retention period is
Table streams: changes since offset

create or replace stream stage_changes on table stage;

insert into clicks (...)
select ... from stage_changes where metadata$action = 'INSERT';
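Filled out with hypothetical shapes (a VARIANT payload column on the staging table, typed columns on clicks), the same pattern looks like this; the stream's offset only advances when the consuming DML commits:

-- staging and target tables (hypothetical shapes)
create or replace table stage (payload variant);
create or replace table clicks (click_id string, url string, ts timestamp_ntz);

-- the stream records inserts/updates/deletes on STAGE from this point forward
create or replace stream stage_changes on table stage;

-- consuming the stream in DML moves the offset forward on commit,
-- so the next read sees only rows that have not yet been processed
insert into clicks (click_id, url, ts)
select payload:click_id::string,
       payload:url::string,
       payload:ts::timestamp_ntz
from stage_changes
where metadata$action = 'INSERT';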
Recap: CHANGE DATA CAPTURE (STREAMS) delivers Simple Creation, Zero Maintenance, Aggregate Change Tracking
Tasks
● Executes a DML statement or stored procedure
  ○ Run on a schedule; or
  ○ Run after a predecessor task completes (a pipeline scenario realized as a tree of tasks)
● Tasks can be used with table streams or by themselves
  ○ A table stream is available for use in the DML/sproc: insert, update, delete, merge
  ○ Optionally add a condition that the stream has data: when
● Use cases include (see the sketch after this list)
  ○ Shred JSON/XML from a staging table into multiple tables (using table streams)
  ○ Keep aggregates up to date and generate reports periodically
  ○ alter pipe … refresh to catch up on auto-ingest
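As a minimal sketch, reusing the hypothetical stage_changes stream and clicks table from the stream example above, plus an assumed warehouse transform_wh and aggregate table click_counts, a scheduled task and a dependent child task might look like this:

-- scheduled root task: fires every minute, but only when the stream has data
create or replace task load_clicks_task
  warehouse = transform_wh
  schedule = '1 MINUTE'
  when system$stream_has_data('STAGE_CHANGES')
as
  insert into clicks (click_id, url, ts)
  select payload:click_id::string, payload:url::string, payload:ts::timestamp_ntz
  from stage_changes
  where metadata$action = 'INSERT';

-- child task: runs only after the loader completes, keeping an aggregate current
create or replace task refresh_click_counts_task
  warehouse = transform_wh
  after load_clicks_task
as
  create or replace table click_counts as
  select url, count(*) as clicks from clicks group by url;

-- tasks are created suspended; resume the child first, then the root
alter task refresh_click_counts_task resume;
alter task load_clicks_task resume;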
BEYOND SCHEDULED DML
Tasks allow scheduled execution of DML, but they enable a lot more than managed cron:
● Streams ensure exactly-once semantics for new/changed data (CDC): the offset is updated on commit, so you see only unprocessed rows.
● Data is not lost even if a task's execution window is missed; the next execution of a task that uses a stream automatically catches up on whatever was missed due to an error.
● Streams + Tasks provide a clean separation between rapid ingestion into a staging table and transactional transforms on the ingested data.
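To confirm that a task is running and catching up after a missed window, recent runs can be inspected with the INFORMATION_SCHEMA task history table function; the task name below is the hypothetical one from the earlier sketch:

select name, state, scheduled_time, completed_time, error_message
from table(information_schema.task_history(task_name => 'LOAD_CLICKS_TASK'))
order by scheduled_time desc;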
Recap: IN-DATABASE TRANSFORMS (TASKS) deliver Captures Business Logic, Process Rows Exactly Once, Complete Integration with Streams
TPC.org
The TPC is a non-profit corporation founded to define transaction processing and database benchmarks and to disseminate objective, verifiable TPC performance data to the industry.
TPC-DS
TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
TPC-H
The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications.
TPC-DI
The TPC-DI benchmark combines and transforms data extracted from an On-Line Transaction Processing (OLTP) system along with other sources of data, and loads it into a data warehouse. The source and destination data models, data transformations, and implementation rules have been designed to be broadly representative of modern data integration requirements.
TPC-DI in SF Using Snowflake Data Pipeline Features
[Diagram: source data is ingested with COPY / Snowpipe / Kafka and transformed with Streams / Tasks / Procs.]
TPC-DI DEPENDENCIES
Arrow pointing to a task indicates a dependency on that task
[Dependency diagram over: DIM_DATE, DIM_TIME, DIM_TAX_RATE, DIM_STATUS_TYPE, DIM_TRADE_TYPE, DIM_INDUSTRY, DIM_BROKER, DIM_COMPANY, DIM_FINANCIAL, DIM_CUSTOMER, DIM_ACCOUNT, DIM_SECURITY, DIM_TRADE, FACT_CASH_BALANCES, FACT_HOLDINGS, FACT_WATCHES, FACT_MARKET_HISTORY, FACT_PROSPECT]
TPC-DI PROCESSING: JOBS
Batch 1 - Historical Load
Batch 2 - Incremental Load 1
Batch 3 - Incremental Load 2
The Historical Load has more sources than the Incremental. Code is shared between Historical and Incremental where possible.
TASK PREDECESSORS - HISTORICAL
Arrow pointing to a task indicates a predecessor link to that task; a dotted line with a ball point indicates a task condition.
[Task graph over: CONTROL_TSK, REFERENCE_TSK, DIM_BROKER_TSK, DIM_COMPANY_TSK, DIM_FINANCIAL_TSK, DIM_CUSTOMER_TSK, DIM_ACCOUNT_TSK, DIM_SECURITY_TSK, DIM_TRADE_TSK, FACT_CASH_BALANCES_TSK, FACT_HOLDINGS_TSK, FACT_WATCHES_TSK, FACT_MARKET_HISTORY_TSK, FACT_PROSPECT_TSK, and streams DIM_SECURITY_STM, DIM_ACCOUNT_STM]
TASK PREDECESSORS - INCREMENTAL
Arrow pointing to a task indicates a predecessor link to that task.
[Task graph over: DIM_CUSTOMER_TSK, DIM_ACCOUNT_TSK, DIM_TRADE_TSK, FACT_CASH_BALANCES_TSK, FACT_HOLDINGS_TSK, FACT_WATCHES_TSK, FACT_MARKET_HISTORY_TSK, FACT_PROSPECT_TSK, and streams CUSTOMER_STG_STM, DAILYMARKET_STG_STM]
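Purely as a schematic fragment (the real predecessor links are the ones drawn in the diagrams above; the warehouse and procedure names are hypothetical), a predecessor link plus a stream-based task condition are written like this:

-- scheduled task gated on its staging stream having data
create or replace task DIM_CUSTOMER_TSK
  warehouse = di_wh                                    -- hypothetical warehouse
  schedule = '5 MINUTE'
  when system$stream_has_data('CUSTOMER_STG_STM')     -- task condition on a stream
as
  call load_dim_customer();                            -- hypothetical stored procedure

-- dependent task: runs only after its predecessor completes
create or replace task DIM_ACCOUNT_TSK
  warehouse = di_wh
  after DIM_CUSTOMER_TSK                               -- predecessor link (illustrative)
as
  call load_dim_account();                             -- hypothetical stored procedure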
Summary
DATA INGESTION (SNOWPIPE): <1 Min Ingestion, All Data Types, Reuse Existing Data Flows
CHANGE DATA CAPTURE (STREAMS): Simple Creation, Zero Maintenance, Aggregate Change Tracking
IN-DATABASE TRANSFORMS (TASKS): Captures Business Logic, Process Rows Exactly Once, Complete Integration with Streams
Data Pipelines Flow
REAL WORLD EXAMPLE (TPC-DI): Full and Incremental Load Processes, Fully Automated, All Within Snowflake Service
Thank You