What Crimean War gunboats teach us about the need for schema registries Alexander Dean, Snowplow Analytics



Introducing myself

Alexander Dean

Co-founder and technical lead at Snowplow, the open-source event analytics pipeline

Weekend writer of Unified Log Processing

Co-author at Snowplow of Iglu, our open-source schema registry system

One-time undergraduate historian at Cambridge University


In 1855 Britain’s workshops built 120 new gunboats for the Royal Navy in just 90 days

An astonishing feat of engineering in a war known for its modern methods

This was made possible by industrial standardization

Working at data-sophisticated companies, we need a new piece of industrial standardization of our own, in the form of schema registries


Back to 1855, and a grand alliance including Britain and the Ottoman Empire is at war with Russia


Britain’s Royal Navy badly needed new gunboats for the war’s Baltic campaign

The catch? The Royal Navy needed 120 of these gunboats…

… in just 90 days!


The challenge for John Penn was building the engine sets in time

Marine engineer John Penn was a major innovator in engines and propellers, and the main supplier to the Royal Navy during its transition from sail to steam

Assembling the gunboats was straightforward…

… the challenge for Penn was to build an additional 90 sets of engines (he had 30 already) in record time


Penn sent the parts from two engines to Britain’s best workshops for copying

[Diagram: the parts of two engines (Part A, Part B, … Part Z), each copied 45 times by the workshops, giving copies #1 to #45 and #46 to #90 of every part]


Penn then assembled the finished parts into 90 new engine sets

[Diagram: copies #1 to #90 of Part A, Part B, … Part Z assembled into Engine #1, Engine #2, … Engine #90]


Britain’s early industrial workshops were the micro-services of their day

Photograph copyright Les Chatfield, licensed under CC BY 2.0


But how did Penn know that parts from disparate workshops would assemble?

There were no formal communications between individual workshops

Penn had to be confident that engine parts from opposite ends of the country would assemble into a functional engine, sight unseen

How?


The answer is the Whitworth thread

Sir Joseph Whitworth was an English engineer and entrepreneur who pioneered high-precision machinery

In 1841 he devised the world’s first national screw thread standard, replacing individual companies’ in-house “standards”

The Whitworth thread (later becoming the British Standard Whitworth) was a key enabler of mass-production techniques


So Penn’s workshops did have a common contract: the Whitworth thread

The Whitworth thread


This was the first use of mass production techniques in marine engineering

“The orders were executed with unfailing regularity, and he [John Penn] actually completed ninety sets of engines of 60 horsepower in ninety days – a feat which made the great Continental Powers stare with wonder, and which was possible only because the Whitworth standards of measurement and of accuracy and finish were by that time thoroughly recognised and established throughout the country.”

The Times obituary for Sir Joseph Whitworth, 24 January 1887


Today, our “workshops” are our company’s data processing jobs

Content personalization

Fraud detection

A/B testing

Management reporting

Real-time marketing

Data archival


And like Penn’s workshops, our jobs do not communicate directly

[Diagram: jobs produce to and consume from Kafka topics or Kinesis streams and HDFS or Amazon S3, never from each other]


Schemas serve as the common contract between producers and consumers

[Diagram: one job produces events in schema S to a Kafka topic or Kinesis stream; two jobs consume them]

Producing job commits to emitting events in a given schema, S

Consuming jobs expect to read events represented in schema S
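The contract can be sketched in a few lines of Python. This is only an illustration: the schema `SCHEMA_S`, its fields, and the helper names are hypothetical, and a plain list stands in for the Kafka topic or Kinesis stream.

```python
# Hypothetical schema S: field name -> expected Python type.
SCHEMA_S = {"event_id": str, "user_id": str, "amount": float}


def conforms(record, schema):
    """True if the record has exactly the schema's fields, with the right types."""
    return record.keys() == schema.keys() and all(
        isinstance(record[name], expected) for name, expected in schema.items()
    )


def produce(stream, record):
    """Producing job: only emit events that honour schema S."""
    if not conforms(record, SCHEMA_S):
        raise ValueError(f"record does not conform to schema S: {record!r}")
    stream.append(record)


def consume(stream):
    """Consuming job: read events, relying on them being in schema S."""
    return [(event["user_id"], event["amount"]) for event in stream]


stream = []  # stand-in for a Kafka topic or Kinesis stream
produce(stream, {"event_id": "e1", "user_id": "u1", "amount": 9.99})
```

The consumer never talks to the producer; it only depends on S, much as Penn depended on the thread standard rather than on any one workshop.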


So, what is a schema?

Schema is the Greek word for shape

A schema is a declaration that a set of data records follows a pre-defined shape


There are some widely used schema technologies

Often thought of as data serialization systems

But these four technologies all support at least one schema language to describe your schemas (not all data serialization systems have this)

Apache Thrift JSON Schema


Some key attributes of schema technologies

Types – a syntax to precisely define the types of the properties within our given schema, e.g. strings, integers, arrays, structs, timestamps

Validation rules – a syntax to define contracts that go beyond types, e.g. “longitude is a floating-point number in the range -180 to +180”

Code generation – for bindings to your schemas in a given programming language

Data encodings – one or more ways of representing the instances of the schema, e.g. a compact binary format versus human-readable JSON

Schema evolution – making it easy to consume different versions of a schema as it changes over time, e.g. backfilling for a new field with a default
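Three of these attributes (types, validation rules and schema evolution) can be sketched without any particular schema technology. Everything below is hypothetical: the `Field` class, the `LOCATION_V1` and `LOCATION_V2` schemas and the `read` helper are illustrations, not the API of Avro, Thrift or JSON Schema.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

_REQUIRED = object()  # sentinel meaning "this field has no default"


@dataclass(frozen=True)
class Field:
    py_type: type                                # types
    rule: Optional[Callable[[Any], bool]] = None  # validation rule beyond the type
    default: Any = _REQUIRED                     # used for schema evolution


LOCATION_V1 = {
    "longitude": Field(float, rule=lambda v: -180.0 <= v <= 180.0),
}
# Version 2 adds a field with a default, so version 1 records stay readable.
LOCATION_V2 = {**LOCATION_V1, "source": Field(str, default="unknown")}


def read(record, schema):
    """Validate a record against a schema, backfilling defaults for missing fields."""
    out = {}
    for name, field in schema.items():
        if name in record:
            value = record[name]
        elif field.default is not _REQUIRED:
            value = field.default
        else:
            raise ValueError(f"missing required field {name!r}")
        if not isinstance(value, field.py_type):
            raise TypeError(f"{name!r} should be {field.py_type.__name__}")
        if field.rule is not None and not field.rule(value):
            raise ValueError(f"{name!r} fails its validation rule: {value!r}")
        out[name] = value
    return out


old_record = {"longitude": -0.12}            # written under version 1
upgraded = read(old_record, LOCATION_V2)     # read under version 2
```

Real schema technologies express the same ideas declaratively and add code generation and data encodings on top.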


Interest in schema technology is growing steadily

* Excluded “Thrift” given the search term is so generic


So we now have schemas for our data – where do we store them?

Co-locating the schema with the record (or even a file of records) doesn’t scale:

The schema definition is often larger than the record!

[Diagram: a Kafka topic or Kinesis stream in which every record carries its own copy of the schema]
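To put rough numbers on the size problem, here is a small sketch comparing the JSON-encoded size of a schema, one record, and a short pointer to the schema. The schema, the record and the pointer string are all made up for illustration.

```python
import json

# Hypothetical JSON Schema-style definition for a "video played" event.
VIDEO_PLAYED_SCHEMA = {
    "type": "object",
    "properties": {
        "video_id": {"type": "string", "maxLength": 64},
        "position_secs": {"type": "integer", "minimum": 0},
    },
    "required": ["video_id", "position_secs"],
    "additionalProperties": False,
}

record = {"video_id": "v42", "position_secs": 17}
pointer = "video_played/1"  # hypothetical reference to the schema


def size(value):
    """Size in bytes of the value when encoded as UTF-8 JSON."""
    return len(json.dumps(value).encode("utf-8"))


schema_bytes = size(VIDEO_PLAYED_SCHEMA)
record_bytes = size(record)
pointer_bytes = size(pointer)
print(schema_bytes, record_bytes, pointer_bytes)
```

Even for this tiny event the schema is several times the size of the record, while the pointer is smaller than both.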


Instead, we want to store a pointer to the schema with each record

[Diagram: records in a Kafka topic or Kinesis stream, each pointing to some store of schemas]
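One way to picture this is an envelope around each record that names its schema, plus a lookup against the store. This is a minimal sketch: the `schema` and `data` envelope keys, the pointer format and the in-memory `SCHEMA_STORE` are assumptions made for illustration.

```python
# Hypothetical store of schemas, keyed by pointer.
SCHEMA_STORE = {
    "video_played/1": {"video_id": str, "position_secs": int},
}


def tag(pointer, data):
    """Wrap a record with a pointer to its schema instead of the schema itself."""
    return {"schema": pointer, "data": data}


def schema_for(envelope):
    """Follow the pointer back to the full schema definition."""
    return SCHEMA_STORE[envelope["schema"]]


stream = [tag("video_played/1", {"video_id": "v42", "position_secs": 17})]

for envelope in stream:
    schema = schema_for(envelope)
    fields_match = envelope["data"].keys() == schema.keys()
```

Every record stays small, and any consumer that can reach the store can still recover the full schema.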


We call this central source of truth for our schemas a schema registry

Our schema registry


There are two widely used open source schema registries

Confluent Schema Registry – an integral part of the Confluent Platform for Kafka-based data pipelines

https://github.com/confluentinc/schema-registry

Iglu – an integral part of the Snowplow open source event data pipeline

https://github.com/snowplow/iglu


The architectures are broadly similar

[Diagram: a RESTful API in front of an underlying storage mechanism]

Confluent Schema Registry uses Kafka as the underlying storage mechanism

Iglu uses Postgres or a statically hosted website as the storage mechanism


The Confluent Schema Registry is closely tied to Avro and Kafka

Supports Avro only, with first class support for Avro’s schema evolution

Uses Kafka as the underlying storage mechanism

Distributed system with a single master architecture

Assigns a registry-unique ID (monotonically increasing) to each registered schema
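The ID assignment can be illustrated with a toy in-memory registry. This is not Confluent's implementation or API: the class and method names are hypothetical, a dictionary replaces Kafka as storage, and returning the existing ID when the same schema is registered twice is a design choice of this sketch.

```python
import itertools
import json


class InMemoryRegistry:
    """Toy registry that hands out monotonically increasing schema IDs."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._id_by_schema = {}
        self._schema_by_id = {}

    def register(self, schema):
        """Return the ID for a schema, assigning a new one on first sight."""
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._id_by_schema:
            schema_id = next(self._next_id)
            self._id_by_schema[canonical] = schema_id
            self._schema_by_id[schema_id] = canonical
        return self._id_by_schema[canonical]

    def get(self, schema_id):
        """Look a schema up by the ID stored alongside a record."""
        return json.loads(self._schema_by_id[schema_id])


registry = InMemoryRegistry()
first_id = registry.register({"type": "record", "name": "VideoPlayed"})
second_id = registry.register({"type": "record", "name": "VideoPaused"})
repeat_id = registry.register({"name": "VideoPlayed", "type": "record"})
```

A small integer ID is then all a producer needs to attach to each record.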


Iglu takes a slightly different approach

Supports multiple schema technologies (Thrift, JSON Schema, Avro etc)

PostgreSQL storage option is single-node, but a schema registry can be hosted statically (e.g. on S3 with CloudFront) for performance

Used heavily in Snowplow but intended to be general-purpose (with Scala and Objective-C client libraries)


Iglu uses semantic URIs to address schemas

iglu:com.channel2.vod/video_played/jsonschema/1-0-0

The schema URI protocol is “iglu”

The vendor of this schema

The name of this schema

Schema technology

Schema version
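Splitting such a URI into its labelled parts is straightforward. The sketch below is not taken from the Iglu client libraries: the `SchemaKey` type, the `parse` function and the character sets allowed by the regular expression are assumptions, including the guess that a version is always three dash-separated numbers as in `1-0-0`.

```python
import re
from typing import NamedTuple


class SchemaKey(NamedTuple):
    vendor: str
    name: str
    format: str   # the schema technology, e.g. "jsonschema"
    version: str


# Assumed grammar: iglu:<vendor>/<name>/<format>/<version>
IGLU_URI = re.compile(
    r"^iglu:"
    r"(?P<vendor>[A-Za-z0-9_.\-]+)/"
    r"(?P<name>[A-Za-z0-9_\-]+)/"
    r"(?P<format>[A-Za-z0-9_\-]+)/"
    r"(?P<version>[0-9]+-[0-9]+-[0-9]+)$"
)


def parse(uri):
    """Split an Iglu-style schema URI into vendor, name, format and version."""
    match = IGLU_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid schema URI: {uri!r}")
    return SchemaKey(**match.groupdict())


key = parse("iglu:com.channel2.vod/video_played/jsonschema/1-0-0")
```

Because every part is explicit, the URI tells a consumer who owns the schema and which version it is without any registry-specific ID.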


The semantic URIs let us share schemas across company boundaries

iglu:com.channel2.vod/video_played/jsonschema/1-0-0

Schema resolution: search for the schema across multiple registries

[Diagram: video_played is looked up in Iglu Central (our public registry for commonly used schemas), a partner vendor registry and the channel2.com registry]
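Resolution across registries can be sketched as trying each registry in turn until one has the schema. The code is illustrative only: registries are plain dictionaries, the search order is an assumption, and `resolve` and `SchemaNotFound` are hypothetical names rather than the Iglu client API.

```python
class SchemaNotFound(LookupError):
    """Raised when no configured registry holds the requested schema."""


def resolve(uri, registries):
    """Return (registry name, schema) from the first registry that has the URI."""
    for registry_name, schemas in registries:
        if uri in schemas:
            return registry_name, schemas[uri]
    raise SchemaNotFound(uri)


URI = "iglu:com.channel2.vod/video_played/jsonschema/1-0-0"

# Hypothetical registry contents; only channel2.com hosts its own schema.
REGISTRIES = [
    ("Iglu Central", {}),
    ("Partner vendor registry", {}),
    ("channel2.com registry", {URI: {"type": "object"}}),
]

found_in, schema = resolve(URI, REGISTRIES)
```

Because the URI names the vendor and schema rather than a registry-local ID, the same reference works whichever registry ends up serving it.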


To sum up

Like Sir Joseph Whitworth’s screw thread, we need standards to allow our decoupled data processing jobs to interact efficiently

Schemas provide the required contract between our data producers and our data consumers

Our schemas need a home – we can house them in a schema registry

Tagging schemas with semantic URIs lets us share schemas across company boundaries, as Whitworth’s screw thread was shared across workshops


Thank you! Questions?

Twitter: @alexcrdean

Email: [email protected]

Iglu: https://github.com/snowplow/iglu