www.mapflat.com
Test strategies for data processing pipelines
Lars Albertsson, independent consultant (Mapflat)
Øyvind Løkling, Schibsted Products & Technology
Who's talking?
● Swedish Institute of Computer Science (test & debug tools)
● Sun Microsystems (large machine verification)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup) (data integrations)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling, productivity)
● Schibsted Products & Tech (data processing & modelling)
● Mapflat (independent data engineering consultant)
Agenda
● Data applications from a test perspective
● Testing stream processing products
● Testing batch processing products
● Data quality testing
Main focus is functional, regression testing.
Prerequisites: backend development testing, basic data experience.
Test value
For data-centric applications, in this order:
● Productivity
  ○ Move fast without breaking things
● Fast experimentation
  ○ 10% good ideas, 90% bad
● Data quality
  ○ Challenging, and more important than
● Technical quality
  ○ Technical failure => ops hassle, stale data
Streaming data product anatomy
[Diagram: services publish to a pub/sub unified log; ingress topics feed stream processing jobs and pipelines; egress topics feed services, DBs, export, and business intelligence]
Test concepts
[Diagram: a test framework (e.g. JUnit, Scalatest) drives the test harness, integrated with IDEs and build tools. The test fixture surrounds the system under test (SUT) and its 3rd party components (e.g. DBs). Test input enters through a seam, and a test oracle checks the output.]
Test scopes
[Diagram: nested scopes — unit, class, component, integration, system / acceptance]
● Pick a stable seam
● Small scope
  ○ Fast?
  ○ Easy, simple?
● Large scope
  ○ Real app value?
  ○ Slow, unstable?
● Maintenance cost
  ○ Pick few SUTs
Data-centric application properties
● Output = function(input, code)
  ○ No external factors => deterministic
● Pipeline and job endpoints are stable
  ○ Correspond to business value
● Internal abstractions are volatile
  ○ Reslicing in different dimensions is common
Data-centric app test properties
● Output = function(input, code)
  ○ Perfect for test!
  ○ Avoid: external service calls, wall clock
● Pipeline/job edges are suitable seams
  ○ Focus on large tests
● Internal seams => high maintenance, low value
  ○ Omit unit tests, mocks, dependency injection!
● Long pipelines cross teams
  ○ Need for end-to-end tests, but a culture challenge
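The "avoid wall clock" advice amounts to passing time in as ordinary input, so that output stays a pure function of (input, code). A minimal sketch — the `sessionize` function and its 30-minute cutoff are hypothetical, not from the slides:

```python
from datetime import datetime, timedelta

def sessionize(events, now):
    """Keep events active within the last 30 minutes, relative to an
    explicit `now` parameter rather than datetime.utcnow()."""
    cutoff = now - timedelta(minutes=30)
    return [e for e in events if e["ts"] >= cutoff]

# Deterministic test: `now` is fixed, so the result never depends on
# when the test happens to run.
now = datetime(2016, 2, 6, 13, 0)
events = [
    {"user": "a", "ts": now - timedelta(minutes=5)},
    {"user": "b", "ts": now - timedelta(hours=2)},
]
live = sessionize(events, now)
```

The same injection pattern applies to random seeds and external service responses.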
Suitable seams, streaming
[Diagram: the streaming anatomy again — pub/sub topics, jobs, pipelines, egress services, DBs, export, business intelligence — with suitable seams marked at the edges of jobs and pipelines]
Streaming SUT, example harness
[Diagram: 1. IDE / Gradle runs 2. Scalatest, which starts 3. Docker fixture containers (Kafka topics, DB) and 4. Spark Streaming jobs, pushes 5. test input to Kafka, and 6. lets the test oracle poll the database; 7. Scalatest reports results. IDE, CI, and debug integration.]
Test lifecycle
1. Start fixture containers
2. Await fixture ready
3. Allocate test case resources
4. Start jobs
5. Push input data to Kafka
6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
7. While (moreTests) { Goto 3 }
8. Tear down fixture
For absence tests, send dummy sync messages at the end.
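Step 6 of the lifecycle can be sketched as a small polling helper. This is a minimal illustration, not the talk's actual harness; the fake `poll` function stands in for a real egress database query:

```python
import time

def await_output(poll, done, timeout_s=30.0, interval_s=0.001):
    """Poll the egress database until the oracle predicate `done` is
    satisfied, or raise on timeout (lifecycle step 6)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rows = poll()
        if done(rows):
            return rows
        time.sleep(interval_s)
    raise TimeoutError("SUT produced no matching output before timeout")

# A fake database that becomes ready after a few polls stands in for
# the real egress DB in this sketch.
state = {"polls": 0}
def poll():
    state["polls"] += 1
    return ["row"] if state["polls"] >= 3 else []

rows = await_output(poll, done=lambda r: len(r) == 1)
```

The timeout keeps a broken SUT from hanging the suite, and the short sleep keeps latency low without busy-waiting.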
Input generation
● Input & output are denormalised & wide
● Fields are frequently changed
  ○ Additions are compatible
  ○ Modifications are incompatible => new, similar data type
● Static test input, e.g. JSON files
  ○ Unmaintainable
● Input generation routines
  ○ Robust to changes, reusable
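An input generation routine can be as simple as defaults plus overrides. The `MobileAdImpressions` type is from the slides; the field names here are made up for illustration:

```python
def mobile_ad_impression(**overrides):
    """Generate one valid record with sensible defaults. Tests
    override only the fields they care about, so adding a field to
    the schema means touching only this routine, not every test."""
    record = {
        "user_id": "user-1",
        "ad_id": "ad-1",
        "timestamp": "2016-02-06T13:00:00Z",
        "country": "se",
        "device": "mobile",
    }
    record.update(overrides)
    return record

swedish = mobile_ad_impression()
norwegian = mobile_ad_impression(country="no", user_id="user-2")
```

This is what makes generated input robust to field additions, unlike static JSON files in version control.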
Test oracles
● Compare with expected output
● Check only the fields relevant for the test
  ○ Robust to field changes
  ○ Reusable for new, similar types
● Tip: use lenses
  ○ JSON: JsonPath (Java), Play JSON (Scala)
  ○ Case classes: Monocle
● Express invariants for each data type
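A field-subset oracle plus per-type invariants might look as follows. This is a plain-dict sketch standing in for the lens libraries the slide names; the example record and invariants are hypothetical:

```python
def check_relevant(record, expected):
    """Oracle comparing only the fields under test, so it keeps
    working when unrelated fields are added to the type."""
    mismatches = {k: (record.get(k), v)
                  for k, v in expected.items() if record.get(k) != v}
    assert not mismatches, "field mismatches: %r" % mismatches

def check_invariants(record):
    """Invariants that must hold for every record of this type."""
    assert record["country"].islower()
    assert record["timestamp"].endswith("Z")

output = {"user_id": "user-1", "country": "se",
          "timestamp": "2016-02-06T13:00:00Z", "extra_field": 42}
check_relevant(output, {"country": "se"})  # passes despite extra_field
check_invariants(output)
```

Exact whole-record comparison would break here as soon as `extra_field` appeared; the subset oracle does not.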
Batch data product anatomy
[Diagram: services and DBs feed a pub/sub unified log; ingress imports land as datasets in a data lake on cluster storage; ETL jobs and pipelines derive further datasets; egress exports to services, DBs, and business intelligence]
Datasets
● Pipeline equivalent of objects
● Dataset class == homogeneous records, open-ended
  ○ Compatible schema
  ○ E.g. MobileAdImpressions
● Dataset instance = dataset class + parameters
  ○ Immutable
  ○ Finite set of homogeneous records
  ○ E.g. MobileAdImpressions(hour="2016-02-06T13")
Directory datasets
hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/
  _SUCCESS
  part-00000.json
  part-00001.json
● The path encodes the privacy level (red), dataset class (pageviews), schema version (v1), and instance parameters in Hive convention (country=se/year=2015/...)
● _SUCCESS is the seal; the part-* files are the partitions
● Some tools, e.g. Spark, understand the Hive name conventions
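Building such paths from a dataset class and its instance parameters is mechanical, which is one reason generated (rather than hand-written) dataset locations stay maintainable. A minimal sketch — the helper name and the fixed `hdfs://red/` prefix are illustrative assumptions:

```python
def dataset_path(dataset_class, version, **params):
    """Build a directory-dataset path in Hive name convention,
    mirroring hdfs://red/pageviews/v1/country=se/year=2015/...
    Keyword order is preserved (Python 3.7+), so parameters appear
    in the order given."""
    partition = "/".join("%s=%s" % (k, v) for k, v in params.items())
    return "hdfs://red/%s/v%d/%s/" % (dataset_class, version, partition)

path = dataset_path("pageviews", 1, country="se", year=2015, month=11, day=4)
```

A workflow manager target (e.g. Luigi's `HdfsTarget`) can then be constructed directly from such a path.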
Batch processing
● outDatasets = code(inDatasets)
● Components that scale up
  ○ Spark, (Flink, Scalding, Crunch)
● And scale down
  ○ Local mode
  ○ Most jobs fit in one machine

Gradual refinement
1. Wash
  - time shuffle, dedup, ...
2. Decorate
  - geo, demographic, ...
3. Domain model
  - similarity, clusters, ...
4. Application model
  - recommendations, ...
Workflow manager
● Dataset "build tool"
● Run a job instance when
  ○ input is available
  ○ output is missing
● Backfills for previous failures
● DSL describes dependencies
● Includes ingress & egress
● Suggested: Luigi / Airflow
DSL DAG example (Luigi)
[DAG: Actions (hourly) and UserDB feed a time shuffle / user decorate job producing ClientActions (daily); a form sessions job produces ClientSessions; an A/B compare job joins ClientSessions with A/B tests into the A/B session evaluation. Boxes are dataset instances; the classes below are job (aka Task) classes.]

class ClientActions(SparkSubmitTask):
    hour = DateHourParameter()

    def requires(self):
        return [Actions(hour=self.hour - timedelta(hours=h))
                for h in range(0, 24)] + \
               [UserDB(date=self.hour.date)]
    ...

class ClientSessions(SparkSubmitTask):
    hour = DateHourParameter()

    def requires(self):
        return [ClientActions(hour=self.hour - timedelta(hours=h))
                for h in range(0, 3)]
    ...

class SessionsABResults(SparkSubmitTask):
    hour = DateHourParameter()

    def requires(self):
        return [ClientSessions(hour=self.hour),
                ABExperiments(hour=self.hour)]

    def output(self):
        return HdfsTarget("hdfs://production/red/ab_sessions/v1/" +
                          "{:year=%Y/month=%m/day=%d/hour=%H}".format(self.hour))
    ...

● Expressive, embedded DSL - a must for ingress, egress
  ○ Avoid weak DSL tools: Oozie, AWS Data Pipeline
Batch processing testing
● Omit collection in the SUT
  ○ Technically different
● Avoid clusters
  ○ Slow tests
● Seams
  ○ Between jobs
  ○ End of pipelines
[Diagram: data lake → jobs → pipeline → artifact of business value, e.g. a service index]
Batch test frameworks - don't
● Spark, Scalding, Crunch variants
● Seam == internal data structure
  ○ Omits I/O - a common bug source
● Vendor lock-in
  ○ When switching batch frameworks, you need tests for protection
  ○ A test rewrite is an unnecessary burden
[Diagram: test input → job → test output, with the framework's seam inside the job]
Testing single job
[Diagram: standard Scalatest harness; f() generates input under file://test_input/, the job runs in local mode, p() verifies output under file://test_output/]
1. Generate input
2. Run in local mode
3. Verify output
Don't commit test data files - expensive to maintain. Generate / verify with code.
Runs well in CI / from the IDE.
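The three steps can be sketched end to end for a trivial file-to-file job. This is an illustrative stand-in (the `count_by_country` job and its record layout are invented), not the talk's Scalatest harness:

```python
import json
import tempfile
from pathlib import Path

def count_by_country(in_path, out_path):
    """The job under test: a function from an input file to an output
    file, runnable in local mode against plain file paths."""
    counts = {}
    for line in Path(in_path).read_text().splitlines():
        rec = json.loads(line)
        counts[rec["country"]] = counts.get(rec["country"], 0) + 1
    Path(out_path).write_text(json.dumps(counts))

with tempfile.TemporaryDirectory() as d:
    inp, out = Path(d) / "input.json", Path(d) / "output.json"
    # 1. Generate input with code (nothing committed to the repo)
    inp.write_text("\n".join(json.dumps({"country": c})
                             for c in ["se", "se", "no"]))
    # 2. Run the job in local mode
    count_by_country(inp, out)
    # 3. Verify output fields relevant to the test
    result = json.loads(out.read_text())
```

Because input and output go through real file I/O, the test exercises the serialisation edges that internal-data-structure seams skip.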
Testing pipelines - two options
A: Test jobs with a sequence of jobs
[Diagram: standard Scalatest harness; f() generates input under file://test_input/, a custom multi-job run, p() verifies output under file://test_output/]
  + Runs in CI
  + Runs in IDE
  + Quick setup
  - Multi-job maintenance
B: Customised workflow manager setup
  + Tests workflow logic
  + More authentic
  - Workflow manager setup for testability
  - Difficult to debug
  - Dataset handling with Python
● Both can be extended with egress DBs
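Option A — chaining a sequence of jobs in a custom runner — might look like this. The two jobs (a dedup "wash" and a geo "decorate", echoing the gradual-refinement steps) are invented for illustration:

```python
def wash(records):
    """Job 1: deduplicate records on (user, ts)."""
    seen, out = set(), []
    for r in records:
        key = (r["user"], r["ts"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def decorate(records, geo):
    """Job 2: decorate each record with the user's country."""
    return [dict(r, country=geo[r["user"]]) for r in records]

def run_pipeline(records, geo):
    """Custom multi-job runner: chains the jobs across the same
    seams the workflow manager would use in production."""
    return decorate(wash(records), geo)

raw = [{"user": "a", "ts": 1}, {"user": "a", "ts": 1}, {"user": "b", "ts": 2}]
out = run_pipeline(raw, geo={"a": "se", "b": "no"})
```

The runner is quick to set up and runs in CI and the IDE; the cost, as the slide notes, is keeping it in sync with the real workflow DAG.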
Testing with cloud services
● PaaS components do not work locally
  ○ Cloud providers should provide fake implementations
  ○ Exceptions: Kubernetes, Cloud SQL, (S3)
● Integrate the PaaS service as a fixture component
  ○ Distribute access tokens, etc.
  ○ Pay $ or $$$
Quality testing variants
● Functional regression
  ○ Binary, key to productivity
● Golden set
  ○ Extreme inputs => obvious output
  ○ No regressions tolerated
● (Saved) production data input
  ○ Individual regressions ok
  ○ Weighted sum must not decline
  ○ Beware of privacy
Obtaining quality metrics
● Processing tool (Spark/Hadoop) counters
  ○ Odd code path => bump counter
● Dedicated quality assessment pipelines
  ○ Reuse test oracle invariants in production
[Diagram: job counters and a quality assessment job both push metrics to a DB]
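The counter pattern — bump a counter on each odd code path instead of failing the job — can be sketched in plain Python; in Spark the equivalent would be accumulators. The parser and its record format are invented for illustration:

```python
from collections import Counter

counters = Counter()

def parse_record(line):
    """Parse one 'user,country' line; bump a counter on each odd
    code path rather than aborting the job. The counters become
    quality metrics pushed to a DB after the run."""
    try:
        user, country = line.split(",")
    except ValueError:
        counters["unparseable_record"] += 1
        return None
    if country not in {"se", "no", "dk"}:
        counters["unknown_country"] += 1
    return {"user": user, "country": country}

records = [parse_record(l) for l in ["a,se", "garbage", "b,xx"]]
parsed = [r for r in records if r is not None]
```

The job still produces output for the good records; the counters quantify how much input fell off the happy path.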
Quality testing in the process
● Binary, self-contained
  ○ Validate in CI (on each code Δ)
● Relative vs history (dataset Δ)
  ○ E.g. detect large drops
  ○ Precondition for publishing a dataset
● Push aggregates to a DB
  ○ Standard ops: monitor, alert
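A relative-vs-history publishing gate can be a few lines. The function name, the averaging baseline, and the 10% threshold are illustrative choices, not from the slides:

```python
def ok_to_publish(metric, history, max_drop=0.10):
    """Precondition for publishing a dataset: the quality metric may
    fluctuate, but a large drop against recent history blocks the
    publish step (and triggers an alert instead)."""
    if not history:
        return True  # no baseline yet; nothing to compare against
    baseline = sum(history) / len(history)
    return metric >= baseline * (1 - max_drop)

history = [0.95, 0.96, 0.94]  # quality metric from recent runs
```

Unlike a binary CI check, this gate tolerates individual regressions while refusing to publish a dataset whose aggregate quality has collapsed.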
Computer program anatomy
[Diagram: input data (files, HID, devices) is read into a process — RAM holding variables, functions, execution paths, lookup structures — which writes output data (files, a window)]
Data pipeline = yet another program
Don't veer from best practices
● Regression testing
● Design: separation of concerns, modularity, etc.
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: global state, hard-coded locations, duplication, ...
In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers, document the "golden path"
Top anti-patterns
1. Test as an afterthought, or testing in production
   Data processing applications are well suited for testing!
2. Static test input in version control
3. Exact expected output as test oracle
4. Unit testing volatile interfaces
5. Using mocks & dependency injection
6. Tool-specific test frameworks
7. Calling services / databases from (batch) jobs
8. Using wall clock time
9. Embedded fixture components
Further resources. Questions?
http://www.slideshare.net/lallea/data-pipelines-from-zero-to-solid
http://www.mapflat.com/lands/resources/reading-list
http://www.slideshare.net/mathieu-bastian/the-mechanics-of-testing-large-data-pipelines-qcon-london-2016
http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-2015
https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf
Unified log
● Replicated, append-only log
● Pub / sub with history
● Decoupled producers and consumers
  ○ In source/deployment
  ○ In space
  ○ In time
● Recovers from link failures
● Replay on transformation bug fix
Credits: Confluent
Stream pipelines
● Parallelised jobs
● Read / write to Kafka
● View egress
  ○ Serving index
  ○ SQL / cubes for analytics
● Stream egress
  ○ Services subscribe to topics
  ○ E.g. REST post for export
Credits: Apache Samza
The data lake
Unified log + snapshots
● Immutable datasets
● Raw, unprocessed
● Source of truth from the batch processing perspective
● Kept as long as permitted
● Technically homogeneous
  ○ Except for raw imports
[Diagram: the data lake lives in cluster storage]
Batch pipelines
● Things will break
  ○ Input will be missing
  ○ Jobs will fail
  ○ Jobs will have bugs
● Datasets must be rebuilt
● Determinism, idempotency
● Backfill missing / failed
● Eventual correctness
[Diagram: cluster storage data lake holding pristine, immutable datasets; intermediate datasets; and derived, regenerable datasets]
Batch job
● Job == function([input datasets]): [output datasets]
● No orthogonal concerns
  ○ Invocation
  ○ Scheduling
  ○ Input / output location
● Testable
● No other input factors, no side-effects
● Ideally: atomic, deterministic, idempotent
● Necessary for audit
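The "job == function([input datasets]): [output datasets]" property can be sketched directly. This toy job (joining pageview counts onto user records) is invented; the point is that with no side-effects or hidden inputs, repeated runs give identical output:

```python
def job(in_datasets):
    """A batch job as a pure function from a list of input datasets
    to a list of output datasets: no invocation, scheduling, or
    location concerns, and no side-effects."""
    pageviews, users = in_datasets
    by_user = {}
    for pv in pageviews:
        by_user[pv["user"]] = by_user.get(pv["user"], 0) + 1
    enriched = [dict(u, views=by_user.get(u["user"], 0)) for u in users]
    return [enriched]

pageviews = [{"user": "a"}, {"user": "a"}, {"user": "b"}]
users = [{"user": "a", "country": "se"}, {"user": "b", "country": "no"}]
out1 = job([pageviews, users])
out2 = job([pageviews, users])  # deterministic: identical output
```

Determinism is exactly what makes backfills and rebuilds safe: rerunning the job over the same immutable inputs regenerates the same dataset.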
Data pipelines
● Form teams that are driven by business cases & need
● Forward-oriented -> filters implicitly applied
● Beware of: duplication, tech chaos/autonomy, privacy loss
Data platform, pipeline chains
● Common data infrastructure
● Productivity, privacy, end-to-end agility, complexity
● Beware: producer-consumer disconnect
Ingress / egress representation
Larger variation:
● Single file
● Relational database table
● Cassandra column family, other NoSQL
● BI tool storage
● BigQuery, Redshift, ...
Egress datasets are also atomic and immutable. E.g. write a full DB table / CF, switch the service to use it, and never change it.
Egress datasets
● Serving
  ○ Precomputed user query answers
  ○ Denormalised
  ○ Cassandra, (many)
● Export & analytics
  ○ SQL (single node / Hive, Presto, ...)
  ○ Workbenches (Zeppelin)
  ○ (Elasticsearch, proprietary OLAP)
● BI / analytics tool needs change frequently
  ○ Prepare to redirect pipelines
Deployment
[Diagram: the Hg/git repo holds the Luigi DSL, jars, and config, packaged as my-pipe-7.tar.gz and pushed to HDFS; workers run "pip install my-pipe-7.tar.gz"; the Luigi daemon dispatches work to the workers]
● All that a pipeline needs, installed atomically
● Recommended: redundant cron schedule at higher frequency + backfill (Luigi range tools)
  * 10 * * * bin/my_pipe_daily --backfill 14