www.mapflat.com
Test strategies for data processing pipelines
Lars Albertsson, independent consultant (Mapflat)
Øyvind Løkling, Schibsted Products & Technology
Who's talking?
● Swedish Institute of Computer Science (test & debug tools)
● Sun Microsystems (large machine verification)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup) (data integrations)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling, productivity)
● Schibsted Products & Tech (data processing & modelling)
● Mapflat (independent data engineering consultant)
Agenda
● Data applications from a test perspective
● Testing stream processing products
● Testing batch processing products
● Data quality testing
Main focus is functional, regression testing.
Prerequisites: backend development testing, basic data experience.
Test value
For data-centric applications, in this order:
● Productivity
  ○ Move fast without breaking things
● Fast experimentation
  ○ 10% good ideas, 90% bad
● Data quality
  ○ Challenging, and more important than
● Technical quality
  ○ Technical failure => ops hassle, stale data
Streaming data product anatomy
[Diagram: services publish to a pub/sub unified log; ingress topics feed stream processing jobs and pipelines; egress topics feed services, DBs, export, and business intelligence]
Test concepts
[Diagram: a test framework (e.g. JUnit, Scalatest) drives the test harness, integrated with IDEs and build tools. The test fixture surrounds the system under test (SUT) and its 3rd party components (e.g. DBs). Test input enters through a seam, and a test oracle checks the output.]
Test scopes
[Diagram: nested scopes — unit, class, component, integration, system / acceptance]
● Pick a stable seam
● Small scope
  ○ Fast?
  ○ Easy, simple?
● Large scope
  ○ Real app value?
  ○ Slow, unstable?
● Maintenance cost
  ○ Pick few SUTs
Data-centric application properties
● Output = function(input, code)
  ○ No external factors => deterministic
● Pipeline and job endpoints are stable
  ○ Correspond to business value
● Internal abstractions are volatile
  ○ Reslicing in different dimensions is common
Data-centric app test properties
● Output = function(input, code)
  ○ Perfect for test!
  ○ Avoid: external service calls, wall clock
● Pipeline/job edges are suitable seams
  ○ Focus on large tests
● Internal seams => high maintenance, low value
  ○ Omit unit tests, mocks, dependency injection!
● Long pipelines cross teams
  ○ Need for end-to-end tests, but a culture challenge
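The "avoid wall clock" advice amounts to passing time in as ordinary input, so that output stays a pure function of (input, code). A minimal sketch — the `sessionize` function and its 30-minute cutoff are hypothetical, not from the slides:

```python
from datetime import datetime, timedelta

def sessionize(events, now):
    """Keep events active within the last 30 minutes, relative to an
    explicit `now` parameter rather than datetime.utcnow()."""
    cutoff = now - timedelta(minutes=30)
    return [e for e in events if e["ts"] >= cutoff]

# Deterministic test: `now` is fixed, so the result never depends on
# when the test happens to run.
now = datetime(2016, 2, 6, 13, 0)
events = [
    {"user": "a", "ts": now - timedelta(minutes=5)},
    {"user": "b", "ts": now - timedelta(hours=2)},
]
live = sessionize(events, now)
```

The same injection pattern applies to random seeds and external service responses.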
Suitable seams, streaming
[Diagram: the streaming anatomy again — pub/sub topics, jobs, pipelines, egress services, DBs, export, business intelligence — with suitable seams marked at the edges of jobs and pipelines]
Streaming SUT, example harness
[Diagram: 1. IDE / Gradle runs 2. Scalatest, which starts 3. Docker fixture containers (Kafka topics, DB) and 4. Spark Streaming jobs, pushes 5. test input to Kafka, and 6. lets the test oracle poll the database; 7. Scalatest reports results. IDE, CI, and debug integration.]
Test lifecycle
1. Start fixture containers
2. Await fixture ready
3. Allocate test case resources
4. Start jobs
5. Push input data to Kafka
6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
7. While (moreTests) { Goto 3 }
8. Tear down fixture
For absence tests, send dummy sync messages at the end.
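Step 6 of the lifecycle can be sketched as a small polling helper. This is a minimal illustration, not the talk's actual harness; the fake `poll` function stands in for a real egress database query:

```python
import time

def await_output(poll, done, timeout_s=30.0, interval_s=0.001):
    """Poll the egress database until the oracle predicate `done` is
    satisfied, or raise on timeout (lifecycle step 6)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rows = poll()
        if done(rows):
            return rows
        time.sleep(interval_s)
    raise TimeoutError("SUT produced no matching output before timeout")

# A fake database that becomes ready after a few polls stands in for
# the real egress DB in this sketch.
state = {"polls": 0}
def poll():
    state["polls"] += 1
    return ["row"] if state["polls"] >= 3 else []

rows = await_output(poll, done=lambda r: len(r) == 1)
```

The timeout keeps a broken SUT from hanging the suite, and the short sleep keeps latency low without busy-waiting.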
Input generation
● Input & output are denormalised & wide
● Fields are frequently changed
  ○ Additions are compatible
  ○ Modifications are incompatible => new, similar data type
● Static test input, e.g. JSON files
  ○ Unmaintainable
● Input generation routines
  ○ Robust to changes, reusable
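An input generation routine can be as simple as defaults plus overrides. The `MobileAdImpressions` type is from the slides; the field names here are made up for illustration:

```python
def mobile_ad_impression(**overrides):
    """Generate one valid record with sensible defaults. Tests
    override only the fields they care about, so adding a field to
    the schema means touching only this routine, not every test."""
    record = {
        "user_id": "user-1",
        "ad_id": "ad-1",
        "timestamp": "2016-02-06T13:00:00Z",
        "country": "se",
        "device": "mobile",
    }
    record.update(overrides)
    return record

swedish = mobile_ad_impression()
norwegian = mobile_ad_impression(country="no", user_id="user-2")
```

This is what makes generated input robust to field additions, unlike static JSON files in version control.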
Test oracles
● Compare with expected output
● Check only the fields relevant for the test
  ○ Robust to field changes
  ○ Reusable for new, similar types
● Tip: use lenses
  ○ JSON: JsonPath (Java), Play JSON (Scala)
  ○ Case classes: Monocle
● Express invariants for each data type
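A field-subset oracle plus per-type invariants might look as follows. This is a plain-dict sketch standing in for the lens libraries the slide names; the example record and invariants are hypothetical:

```python
def check_relevant(record, expected):
    """Oracle comparing only the fields under test, so it keeps
    working when unrelated fields are added to the type."""
    mismatches = {k: (record.get(k), v)
                  for k, v in expected.items() if record.get(k) != v}
    assert not mismatches, "field mismatches: %r" % mismatches

def check_invariants(record):
    """Invariants that must hold for every record of this type."""
    assert record["country"].islower()
    assert record["timestamp"].endswith("Z")

output = {"user_id": "user-1", "country": "se",
          "timestamp": "2016-02-06T13:00:00Z", "extra_field": 42}
check_relevant(output, {"country": "se"})  # passes despite extra_field
check_invariants(output)
```

Exact whole-record comparison would break here as soon as `extra_field` appeared; the subset oracle does not.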
Batch data product anatomy
[Diagram: services and DBs feed a pub/sub unified log; ingress imports land as datasets in a data lake on cluster storage; ETL jobs and pipelines derive further datasets; egress exports to services, DBs, and business intelligence]
Datasets
● Pipeline equivalent of objects
● Dataset class == homogeneous records, open-ended
  ○ Compatible schema
  ○ E.g. MobileAdImpressions
● Dataset instance = dataset class + parameters
  ○ Immutable
  ○ Finite set of homogeneous records
  ○ E.g. MobileAdImpressions(hour="2016-02-06T13")
Directory datasets
hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/
  _SUCCESS
  part-00000.json
  part-00001.json
● The path encodes the privacy level (red), dataset class (pageviews), schema version (v1), and instance parameters in Hive convention (country=se/year=2015/...)
● _SUCCESS is the seal; the part-* files are the partitions
● Some tools, e.g. Spark, understand the Hive name conventions
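Building such paths from a dataset class and its instance parameters is mechanical, which is one reason generated (rather than hand-written) dataset locations stay maintainable. A minimal sketch — the helper name and the fixed `hdfs://red/` prefix are illustrative assumptions:

```python
def dataset_path(dataset_class, version, **params):
    """Build a directory-dataset path in Hive name convention,
    mirroring hdfs://red/pageviews/v1/country=se/year=2015/...
    Keyword order is preserved (Python 3.7+), so parameters appear
    in the order given."""
    partition = "/".join("%s=%s" % (k, v) for k, v in params.items())
    return "hdfs://red/%s/v%d/%s/" % (dataset_class, version, partition)

path = dataset_path("pageviews", 1, country="se", year=2015, month=11, day=4)
```

A workflow manager target (e.g. Luigi's `HdfsTarget`) can then be constructed directly from such a path.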
Batch processing
● outDatasets = code(inDatasets)
● Components that scale up
  ○ Spark, (Flink, Scalding, Crunch)
● And scale down
  ○ Local mode
  ○ Most jobs fit in one machine

Gradual refinement
1. Wash
  - time shuffle, dedup, ...
2. Decorate
  - geo, demographic, ...
3. Domain model
  - similarity, clusters, ...
4. Application model
  - recommendations, ...
Workflow manager
● Dataset "build tool"
● Run a job instance when
  ○ input is available
  ○ output is missing
● Backfills for previous failures
● DSL describes dependencies
● Includes ingress & egress
● Suggested: Luigi / Airflow
DSL DAG example (Luigi)
[DAG: Actions (hourly) and UserDB feed a time shuffle / user decorate job producing ClientActions (daily); a form sessions job produces ClientSessions; an A/B compare job joins ClientSessions with A/B tests into the A/B session evaluation. Boxes are dataset instances; the classes below are job (aka Task) classes.]

class ClientActions(SparkSubmitTask):
    hour = DateHourParameter()

    def requires(self):
        return [Actions(hour=self.hour - timedelta(hours=h))
                for h in range(0, 24)] + \
               [UserDB(date=self.hour.date)]
    ...

class ClientSessions(SparkSubmitTask):
    hour = DateHourParameter()

    def requires(self):
        return [ClientActions(hour=self.hour - timedelta(hours=h))
                for h in range(0, 3)]
    ...

class SessionsABResults(SparkSubmitTask):
    hour = DateHourParameter()

    def requires(self):
        return [ClientSessions(hour=self.hour),
                ABExperiments(hour=self.hour)]

    def output(self):
        return HdfsTarget("hdfs://production/red/ab_sessions/v1/" +
                          "{:year=%Y/month=%m/day=%d/hour=%H}".format(self.hour))
    ...

● Expressive, embedded DSL - a must for ingress, egress
  ○ Avoid weak DSL tools: Oozie, AWS Data Pipeline
Batch processing testing
● Omit collection in the SUT
  ○ Technically different
● Avoid clusters
  ○ Slow tests
● Seams
  ○ Between jobs
  ○ End of pipelines
[Diagram: data lake → jobs → pipeline → artifact of business value, e.g. a service index]
Batch test frameworks - don't
● Spark, Scalding, Crunch variants
● Seam == internal data structure
  ○ Omits I/O - a common bug source
● Vendor lock-in
  ○ When switching batch frameworks, you need tests for protection
  ○ A test rewrite is an unnecessary burden
[Diagram: test input → job → test output, with the framework's seam inside the job]
Testing single job
[Diagram: standard Scalatest harness; f() generates input under file://test_input/, the job runs in local mode, p() verifies output under file://test_output/]
1. Generate input
2. Run in local mode
3. Verify output
Don't commit test data files - expensive to maintain. Generate / verify with code.
Runs well in CI / from the IDE.
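The three steps can be sketched end to end for a trivial file-to-file job. This is an illustrative stand-in (the `count_by_country` job and its record layout are invented), not the talk's Scalatest harness:

```python
import json
import tempfile
from pathlib import Path

def count_by_country(in_path, out_path):
    """The job under test: a function from an input file to an output
    file, runnable in local mode against plain file paths."""
    counts = {}
    for line in Path(in_path).read_text().splitlines():
        rec = json.loads(line)
        counts[rec["country"]] = counts.get(rec["country"], 0) + 1
    Path(out_path).write_text(json.dumps(counts))

with tempfile.TemporaryDirectory() as d:
    inp, out = Path(d) / "input.json", Path(d) / "output.json"
    # 1. Generate input with code (nothing committed to the repo)
    inp.write_text("\n".join(json.dumps({"country": c})
                             for c in ["se", "se", "no"]))
    # 2. Run the job in local mode
    count_by_country(inp, out)
    # 3. Verify output fields relevant to the test
    result = json.loads(out.read_text())
```

Because input and output go through real file I/O, the test exercises the serialisation edges that internal-data-structure seams skip.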
Testing pipelines - two options
A: Test jobs with a sequence of jobs
[Diagram: standard Scalatest harness; f() generates input under file://test_input/, a custom multi-job run, p() verifies output under file://test_output/]
  + Runs in CI
  + Runs in IDE
  + Quick setup
  - Multi-job maintenance
B: Customised workflow manager setup
  + Tests workflow logic
  + More authentic
  - Workflow manager setup for testability
  - Difficult to debug
  - Dataset handling with Python
● Both can be extended with egress DBs
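Option A — chaining a sequence of jobs in a custom runner — might look like this. The two jobs (a dedup "wash" and a geo "decorate", echoing the gradual-refinement steps) are invented for illustration:

```python
def wash(records):
    """Job 1: deduplicate records on (user, ts)."""
    seen, out = set(), []
    for r in records:
        key = (r["user"], r["ts"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def decorate(records, geo):
    """Job 2: decorate each record with the user's country."""
    return [dict(r, country=geo[r["user"]]) for r in records]

def run_pipeline(records, geo):
    """Custom multi-job runner: chains the jobs across the same
    seams the workflow manager would use in production."""
    return decorate(wash(records), geo)

raw = [{"user": "a", "ts": 1}, {"user": "a", "ts": 1}, {"user": "b", "ts": 2}]
out = run_pipeline(raw, geo={"a": "se", "b": "no"})
```

The runner is quick to set up and runs in CI and the IDE; the cost, as the slide notes, is keeping it in sync with the real workflow DAG.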
Testing with cloud services
● PaaS components do not work locally
  ○ Cloud providers should provide fake implementations
  ○ Exceptions: Kubernetes, Cloud SQL, (S3)
● Integrate the PaaS service as a fixture component
  ○ Distribute access tokens, etc.
  ○ Pay $ or $$$
Quality testing variants
● Functional regression
  ○ Binary, key to productivity
● Golden set
  ○ Extreme inputs => obvious output
  ○ No regressions tolerated
● (Saved) production data input
  ○ Individual regressions ok
  ○ Weighted sum must not decline
  ○ Beware of privacy
Obtaining quality metrics
● Processing tool (Spark/Hadoop) counters
  ○ Odd code path => bump counter
● Dedicated quality assessment pipelines
  ○ Reuse test oracle invariants in production
[Diagram: job counters and a quality assessment job both push metrics to a DB]
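The counter pattern — bump a counter on each odd code path instead of failing the job — can be sketched in plain Python; in Spark the equivalent would be accumulators. The parser and its record format are invented for illustration:

```python
from collections import Counter

counters = Counter()

def parse_record(line):
    """Parse one 'user,country' line; bump a counter on each odd
    code path rather than aborting the job. The counters become
    quality metrics pushed to a DB after the run."""
    try:
        user, country = line.split(",")
    except ValueError:
        counters["unparseable_record"] += 1
        return None
    if country not in {"se", "no", "dk"}:
        counters["unknown_country"] += 1
    return {"user": user, "country": country}

records = [parse_record(l) for l in ["a,se", "garbage", "b,xx"]]
parsed = [r for r in records if r is not None]
```

The job still produces output for the good records; the counters quantify how much input fell off the happy path.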
Quality testing in the process
● Binary, self-contained
  ○ Validate in CI (on each code Δ)
● Relative vs history (dataset Δ)
  ○ E.g. detect large drops
  ○ Precondition for publishing a dataset
● Push aggregates to a DB
  ○ Standard ops: monitor, alert
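A relative-vs-history publishing gate can be a few lines. The function name, the averaging baseline, and the 10% threshold are illustrative choices, not from the slides:

```python
def ok_to_publish(metric, history, max_drop=0.10):
    """Precondition for publishing a dataset: the quality metric may
    fluctuate, but a large drop against recent history blocks the
    publish step (and triggers an alert instead)."""
    if not history:
        return True  # no baseline yet; nothing to compare against
    baseline = sum(history) / len(history)
    return metric >= baseline * (1 - max_drop)

history = [0.95, 0.96, 0.94]  # quality metric from recent runs
```

Unlike a binary CI check, this gate tolerates individual regressions while refusing to publish a dataset whose aggregate quality has collapsed.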
Computer program anatomy
[Diagram: input data (files, HID, devices) is read into a process — RAM holding variables, functions, execution paths, lookup structures — which writes output data (files, a window)]
Data pipeline = yet another program
Don't veer from best practices
● Regression testing
● Design: separation of concerns, modularity, etc.
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: global state, hard-coded locations, duplication, ...
In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers, document the "golden path"
Top anti-patterns
1. Test as an afterthought, or testing in production
   Data processing applications are well suited for testing!
2. Static test input in version control
3. Exact expected output as test oracle
4. Unit testing volatile interfaces
5. Using mocks & dependency injection
6. Tool-specific test frameworks
7. Calling services / databases from (batch) jobs
8. Using wall clock time
9. Embedded fixture components
Further resources. Questions?
http://www.slideshare.net/lallea/data-pipelines-from-zero-to-solid
http://www.mapflat.com/lands/resources/reading-list
http://www.slideshare.net/mathieu-bastian/the-mechanics-of-testing-large-data-pipelines-qcon-london-2016
http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-2015
https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf
Unified log
● Replicated, append-only log
● Pub / sub with history
● Decoupled producers and consumers
  ○ In source/deployment
  ○ In space
  ○ In time
● Recovers from link failures
● Replay on transformation bug fix
Credits: Confluent
Stream pipelines
● Parallelised jobs
● Read / write to Kafka
● View egress
  ○ Serving index
  ○ SQL / cubes for analytics
● Stream egress
  ○ Services subscribe to topics
  ○ E.g. REST post for export
Credits: Apache Samza
The data lake
Unified log + snapshots
● Immutable datasets
● Raw, unprocessed
● Source of truth from the batch processing perspective
● Kept as long as permitted
● Technically homogeneous
  ○ Except for raw imports
[Diagram: the data lake lives in cluster storage]
Batch pipelines
● Things will break
  ○ Input will be missing
  ○ Jobs will fail
  ○ Jobs will have bugs
● Datasets must be rebuilt
● Determinism, idempotency
● Backfill missing / failed
● Eventual correctness
[Diagram: cluster storage data lake holding pristine, immutable datasets; intermediate datasets; and derived, regenerable datasets]
Batch job
● Job == function([input datasets]): [output datasets]
● No orthogonal concerns
  ○ Invocation
  ○ Scheduling
  ○ Input / output location
● Testable
● No other input factors, no side-effects
● Ideally: atomic, deterministic, idempotent
● Necessary for audit
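The "job == function([input datasets]): [output datasets]" property can be sketched directly. This toy job (joining pageview counts onto user records) is invented; the point is that with no side-effects or hidden inputs, repeated runs give identical output:

```python
def job(in_datasets):
    """A batch job as a pure function from a list of input datasets
    to a list of output datasets: no invocation, scheduling, or
    location concerns, and no side-effects."""
    pageviews, users = in_datasets
    by_user = {}
    for pv in pageviews:
        by_user[pv["user"]] = by_user.get(pv["user"], 0) + 1
    enriched = [dict(u, views=by_user.get(u["user"], 0)) for u in users]
    return [enriched]

pageviews = [{"user": "a"}, {"user": "a"}, {"user": "b"}]
users = [{"user": "a", "country": "se"}, {"user": "b", "country": "no"}]
out1 = job([pageviews, users])
out2 = job([pageviews, users])  # deterministic: identical output
```

Determinism is exactly what makes backfills and rebuilds safe: rerunning the job over the same immutable inputs regenerates the same dataset.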
Data pipelines
● Form teams that are driven by business cases & need
● Forward-oriented -> filters implicitly applied
● Beware of: duplication, tech chaos/autonomy, privacy loss
Data platform, pipeline chains
● Common data infrastructure
● Productivity, privacy, end-to-end agility, complexity
● Beware: producer-consumer disconnect
Ingress / egress representation
Larger variation:
● Single file
● Relational database table
● Cassandra column family, other NoSQL
● BI tool storage
● BigQuery, Redshift, ...
Egress datasets are also atomic and immutable. E.g. write a full DB table / CF, switch the service to use it, and never change it.
Egress datasets
● Serving
  ○ Precomputed user query answers
  ○ Denormalised
  ○ Cassandra, (many)
● Export & analytics
  ○ SQL (single node / Hive, Presto, ...)
  ○ Workbenches (Zeppelin)
  ○ (Elasticsearch, proprietary OLAP)
● BI / analytics tool needs change frequently
  ○ Prepare to redirect pipelines
Deployment
[Diagram: the Hg/git repo holds the Luigi DSL, jars, and config, packaged as my-pipe-7.tar.gz and pushed to HDFS; workers run "pip install my-pipe-7.tar.gz"; the Luigi daemon dispatches work to the workers]
● All that a pipeline needs, installed atomically
● Recommended: redundant cron schedule at higher frequency + backfill (Luigi range tools)
  * 10 * * * bin/my_pipe_daily --backfill 14