What Crimean War gunboats teach us about the need for schema registries
Alexander Dean, Snowplow Analytics
Introducing myself: Alexander Dean
Co-founder and technical lead at Snowplow, the open-source event analytics pipeline
Weekend writer of Unified Log Processing
Co-author at Snowplow of Iglu, our open-source schema registry system
One-time undergraduate historian at Cambridge University
In 1855 Britain’s workshops built 120 new gunboats for the Royal Navy in just 90 days. An astonishing feat of engineering in a war known for its modern methods
This was made possible by industrial standardization
Those of us working at data-sophisticated companies need a new piece of industrial standardization of our own, in the form of schema registries
Back to 1855, and a grand alliance including Britain and the Ottoman Empire is at war with Russia
Britain’s Royal Navy badly needed new gunboats for the war’s Baltic campaign. The catch? The Royal Navy needed 120 of these gunboats…
… in just 90 days!
The challenge for John Penn was building the engine sets in time
Marine engineer John Penn was a huge engine/propeller innovator and the major supplier to the RN during its sail-to-steam transition
Assembling the gunboats was straightforward…
… the challenge for Penn was to build an additional 90 sets of engines (he had 30 already) in record time
Penn sent the parts from two engines to Britain’s best workshops for copying
(Diagram: sample parts A through Z are distributed to the workshops, each of which produces copies – Part A #1 through #90, Part B #1 through #90, … Part Z #1 through #90)
Penn then assembled the finished parts into 90 new engine sets
(Diagram: the copied parts A–Z, #1 through #90, are assembled into Engine #1 through Engine #90)
Britain’s early industrial workshops were the micro-services of their day
Photograph copyright Les Chatfield, licensed under CC BY 2.0
But how did Penn know that parts from disparate workshops would assemble?
There were no formal communications between individual workshops
Penn had to be confident that engine parts from opposite ends of the country would assemble into a functional engine, sight unseen
How?
The answer is the Whitworth thread
Sir Joseph Whitworth was an English engineer and entrepreneur who pioneered high-precision machinery
In 1841 he devised the world’s first national screw thread standard, replacing individual companies’ in-house “standards”
The Whitworth thread (later becoming the British Standard Whitworth) was a key enabler of mass-production techniques
So Penn’s workshops did have a common contract: the Whitworth thread
This was the first use of mass production techniques in marine engineering
“The orders were executed with unfailing regularity, and he [John Penn] actually completed ninety sets of engines of 60 horsepower in ninety days – a feat which made the great Continental Powers stare with wonder, and which was possible only because the Whitworth standards of measurement and of accuracy and finish were by that time thoroughly recognised and established throughout the country.”
The Times obituary for Sir Joseph Whitworth, 24 January 1887
Today, our “workshops” are our company’s data processing jobs
Content personalization
Fraud detection
A/B testing
Management reporting
Real-time marketing
Data archival
And like Penn’s workshops, our jobs do not communicate directly
(Diagram: jobs produce to and consume from shared channels – Kafka topics or Kinesis streams, HDFS or Amazon S3 – never calling each other directly)
Schemas serve as the common contract between producers and consumers
(Diagram: events conforming to schema S flow through a Kafka topic or Kinesis stream from the producing job to the consuming jobs)
Producing job commits to emitting events in a given schema, S
Consuming jobs expect to read events represented in schema S
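This producer/consumer contract can be sketched in a few lines of Python, with a hand-rolled check standing in for a real schema validator (the field names and schema S are invented for illustration):

```python
# Minimal sketch of the producer/consumer contract around a shared
# schema S. Field names are hypothetical; a real pipeline would use a
# proper schema technology rather than this hand-rolled check.
SCHEMA_S = {"videoId": str, "positionSeconds": int}  # the shared contract

def produce(event: dict) -> dict:
    """Producer commits to emitting events that satisfy schema S."""
    assert set(event) == set(SCHEMA_S), "unexpected or missing fields"
    return event

def consume(event: dict) -> int:
    """Consumer relies on schema S: it reads fields without guessing."""
    for field, field_type in SCHEMA_S.items():
        assert isinstance(event[field], field_type), f"bad type for {field}"
    return event["positionSeconds"]

evt = produce({"videoId": "vid-123", "positionSeconds": 42})
print(consume(evt))  # -> 42
```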
So, what is a schema?
Schema is the Greek word for shape
A schema is a declaration that a set of data records follow a pre-defined shape
There are some widely used schema technologies
Often thought of as data serialization systems
But these four technologies all support at least one schema language to describe your schemas (not all data serialization systems have this)
Apache Avro, Apache Thrift, Protocol Buffers, JSON Schema
Some key attributes of schema technologies
Types – a syntax to precisely define the types of the properties within our given schema, e.g. strings, integers, arrays, structs, timestamps
Validation rules – a syntax to define contracts that go beyond types, e.g. “longitude is a floating point in the range -180 to +180”
Code generation – for bindings to your schemas in a given programming language
Data encodings – one or more ways of representing the instances of the schema, e.g. a compact binary format versus human-readable JSON
Schema evolution – making it easy to consume different versions of a schema as it changes over time, e.g. backfilling for a new field with a default
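The first two attributes can be seen in a minimal, hypothetical JSON Schema for a video_played event (field names invented for illustration), combining type declarations with a validation rule on the longitude property:

```json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "description": "Schema for a video_played event",
  "type": "object",
  "properties": {
    "videoId": { "type": "string" },
    "positionSeconds": { "type": "integer", "minimum": 0 },
    "longitude": { "type": "number", "minimum": -180, "maximum": 180 }
  },
  "required": ["videoId", "positionSeconds"],
  "additionalProperties": false
}
```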
Interest in schema technology is growing steadily
* Excluded “Thrift” given the search term is so generic
So we now have schemas for our data – where do we store them?
Co-locating the schema with the record (or even a file of records) doesn’t scale:
The schema definition is often larger than the record!
(Diagram: on the Kafka topic or Kinesis stream, every record carries a full copy of its schema)
Instead, we want to store a pointer to the schema with each record
(Diagram: records on the Kafka topic or Kinesis stream carry only a pointer into a central store of schemas)
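Snowplow’s self-describing JSON takes exactly this approach: the record carries an Iglu URI pointing at its schema, not the schema itself. A minimal sketch (the vendor and field names here are hypothetical):

```python
import json

# A self-describing event: the payload carries a pointer (an Iglu URI)
# to its schema rather than the schema definition itself.
# Vendor and field names are hypothetical.
event = {
    "schema": "iglu:com.acme/video_played/jsonschema/1-0-0",
    "data": {"videoId": "vid-123", "positionSeconds": 42},
}

wire = json.dumps(event)           # what actually travels on the stream
decoded = json.loads(wire)
print(decoded["schema"])           # consumer reads the pointer first...
print(decoded["data"]["videoId"])  # ...then the payload it describes
```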
We call this central source of truth for our schemas a schema registry
There are two widely used open source schema registries
Confluent Schema Registry – an integral part of the Confluent Platform for Kafka-based data pipelines
https://github.com/confluentinc/schema-registry
Iglu – an integral part of the Snowplow open source event data pipeline
https://github.com/snowplow/iglu
The architectures are broadly similar: a RESTful API in front of an underlying storage mechanism
Confluent Schema Registry uses Kafka as the underlying storage mechanism
Iglu uses Postgres or a statically hosted website as the storage mechanism
The Confluent Schema Registry is closely tied to Avro and Kafka
Supports Avro only, with first class support for Avro’s schema evolution
Uses Kafka as the underlying storage mechanism
Distributed system with a single master architecture
Assigns a registry-unique ID (monotonically increasing) to each registered schema
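Those registry-assigned IDs are what travel with each Kafka message. Confluent’s wire format prefixes the Avro payload with a zero “magic byte” followed by the schema ID as a 4-byte big-endian integer; a sketch of that framing (payload bytes here are illustrative, not real Avro):

```python
import struct

# Confluent wire format: one zero "magic byte", then the registry-
# assigned schema ID as a 4-byte big-endian integer, then the
# Avro-encoded payload.
MAGIC_BYTE = 0

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prepend the 5-byte Confluent header to an Avro-encoded record."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes) -> tuple:
    """Split a framed message back into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    assert magic == MAGIC_BYTE, "not a Confluent-framed message"
    return schema_id, message[5:]

framed = frame(42, b"\x02\x06foo")  # payload bytes are illustrative
print(unframe(framed))              # -> (42, b'\x02\x06foo')
```

A consumer reads the 5-byte header, fetches schema 42 from the registry (caching it locally), and only then decodes the payload.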
Iglu takes a slightly different approach
Supports multiple schema technologies (Thrift, JSON Schema, Avro, etc.)
PostgreSQL storage option is single-node, but a schema registry can be hosted statically (e.g. on S3 with CloudFront) for performance
Used heavily in Snowplow but intended to be general-purpose (with Scala and Objective-C client libraries)
Iglu uses semantic URIs to address schemas
iglu:com.channel2.vod/video_played/jsonschema/1-0-0
iglu – the schema URI protocol
com.channel2.vod – the vendor of this schema
video_played – the name of this schema
jsonschema – the schema technology
1-0-0 – the schema version
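A semantic URI is simple enough to pull apart with a regular expression. A minimal sketch (the exact character classes are an assumption, not Iglu’s official grammar; versions follow SchemaVer’s MODEL-REVISION-ADDITION shape, e.g. 1-0-0):

```python
import re

# Minimal sketch of parsing an Iglu semantic URI into its four parts.
# The character classes are illustrative, not Iglu's official grammar.
IGLU_URI = re.compile(
    r"^iglu:"
    r"(?P<vendor>[A-Za-z0-9_.-]+)/"
    r"(?P<name>[A-Za-z0-9_-]+)/"
    r"(?P<format>[A-Za-z0-9_-]+)/"
    r"(?P<version>\d+-\d+-\d+)$"   # SchemaVer: MODEL-REVISION-ADDITION
)

def parse_iglu_uri(uri: str) -> dict:
    match = IGLU_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid Iglu URI: {uri}")
    return match.groupdict()

print(parse_iglu_uri("iglu:com.channel2.vod/video_played/jsonschema/1-0-0"))
# -> {'vendor': 'com.channel2.vod', 'name': 'video_played',
#     'format': 'jsonschema', 'version': '1-0-0'}
```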
The semantic URIs let us share schemas across company boundaries
iglu:com.channel2.vod/video_played/jsonschema/1-0-0
(Diagram: schema resolution searches for the video_played schema across multiple registries – the channel2.com registry, a partner vendor registry, and Iglu Central, our public registry for commonly used schemas)
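Resolution amounts to trying a prioritized list of registries until one can serve the schema. A sketch of how the Iglu URI maps onto candidate lookup URLs (the registry hostnames below, apart from iglucentral.com, are hypothetical):

```python
# Sketch of Iglu-style schema resolution: derive one candidate URL per
# registry, to be tried in priority order. Apart from iglucentral.com,
# the registry hostnames are hypothetical.
REGISTRIES = [
    "http://registry.channel2.com/schemas",        # own registry first
    "http://iglu.partner-vendor.example/schemas",  # partner registry
    "http://iglucentral.com/schemas",              # public Iglu Central
]

def candidate_urls(iglu_uri: str) -> list:
    """Turn an Iglu URI into the lookup URLs to try, in priority order."""
    path = iglu_uri[len("iglu:"):]  # vendor/name/format/version
    return [f"{base}/{path}" for base in REGISTRIES]

for url in candidate_urls("iglu:com.channel2.vod/video_played/jsonschema/1-0-0"):
    print(url)
```

A real resolver would fetch each URL in turn, cache hits, and fail only once every registry has been exhausted.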
To sum up
Like Sir Joseph Whitworth’s screw thread, we need standards to allow our decoupled data processing jobs to interact efficiently
Schemas provide the required contract between our data producers and our data consumers
Our schemas need a home – we can house them in a schema registry
Tagging schemas with semantic URIs lets us share schemas across company boundaries, as Whitworth’s screw thread was shared across workshops
Thank you! Questions? Twitter: @alexcrdean
Email: [email protected]
Iglu: https://github.com/snowplow/iglu