Data Governance and Schema Registry for Software Developers


Simplify Governance of Streaming Data

Gwen Shapira

Product Manager, Apache Kafka™ PMC

Attend the whole series!

Simplify Governance for Streaming Data in Apache Kafka
Date: Thursday, April 6, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Gwen Shapira, Product Manager, Confluent

Using Apache Kafka to Analyze Session Windows
Date: Thursday, March 30, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Michael Noll, Product Manager, Confluent

Monitoring and Alerting Apache Kafka with Confluent Control Center
Date: Thursday, March 16, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Nick Dearden, Director, Engineering and Product

Data Pipelines Made Simple with Apache Kafka
Date: Thursday, March 23, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Ewen Cheslack-Postava, Engineer, Confluent

What’s New in Apache Kafka 0.10.2 and Confluent 3.2
Date: Thursday, March 9, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Clarke Patterson, Senior Director, Product Marketing

https://www.confluent.io/online-talk/online-talk-series-five-steps-to-production-with-apache-kafka/

What is Data Governance?

• Data Access
• Master Data Management
• Data Agility
• Data Reliability
• Data Quality

To Grow a Successful Data Platform, You Need Some Guarantees

If you can’t trust your data,

what can you trust?


Schemas Are a Key Part of This

1. Schemas are enforced contracts between your components
2. Schemas need to be agile but compatible
3. You need tools and processes that let you move fast and not break things

Cautionary tale: Uber

Producers (Service 1, Service 2) write events that flow through S3 and EMR into Redshift:

Event produced: {created_at: "Dec 1, 2015"}
Table defined: CREATE ( created_at: STRING )
ETL parsing: parseDate(created_at, "%M %d, %yyyy")

Cautionary tale: Uber (continued)

Later, a producer switches to epoch timestamps without telling anyone:

Event produced: {created_at: 1491368429}
Table still defined: CREATE ( created_at: STRING )
ETL still parsing: parseDate(created_at, "%M %d, %yyyy"), which now fails for every event
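The failure mode above can be sketched in a few lines of Python. This uses Python's own date-format codes rather than the slide's `%M %d, %yyyy`, and the field name is from the slide's hypothetical example, not Uber's actual pipeline:

```python
from datetime import datetime

def parse_created_at(value):
    # The ETL job assumes a fixed, human-readable date format.
    return datetime.strptime(value, "%b %d, %Y")

# The original producer format parses fine:
print(parse_created_at("Dec 1, 2015"))  # 2015-12-01 00:00:00

# After a producer quietly switches to epoch seconds, the same
# ETL code raises, and every downstream job breaks:
try:
    parse_created_at("1491368429")
except ValueError as err:
    print("pipeline broken:", err)
```

Nothing in this pipeline could reject the incompatible change at write time; the breakage only surfaces far downstream, which is exactly the problem a schema contract prevents.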

The right way to do things

Decoupled services communicating through well-defined topics: MessageCustomer, RoomGifts, BookRoom, LoyalCustomers

The right way to do things

Decoupled services communicating through well-defined topics: MessageCustomer, RoomGifts, BookRoom, LoyalCustomers

A new BeachPromo service can plug into the same topics later:

BeachPromo: If (property.location == beach && customer.level == platinum)
  sendMessage({topic: RoomGifts, customer_id: customer.id, room: property.roomnum, gift: seaShell})
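A runnable sketch of the BeachPromo logic, with a stubbed send_message in place of a real Kafka producer; all the names come from the slide's hypothetical hotel example:

```python
sent = []

def send_message(event):
    # Stand-in for a real Kafka producer send.
    sent.append(event)

def beach_promo(property, customer):
    # Gift only platinum customers staying at a beach property.
    if property["location"] == "beach" and customer["level"] == "platinum":
        send_message({"topic": "RoomGifts",
                      "customer_id": customer["id"],
                      "room": property["roomnum"],
                      "gift": "seaShell"})

beach_promo({"location": "beach", "roomnum": 101},
            {"id": 42, "level": "platinum"})
print(sent)  # one RoomGifts event
```

The point of the pattern: BeachPromo only needs to know the RoomGifts topic's schema, not anything about the services that consume it.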

There is special value in standardizing parts of the schema

{ "type": "record",
  "name": "Event",
  "fields": [
    { "name": "headers", "type": {
        "name": "headers",
        "type": "record",
        "fields": [
          { "name": "source_app", "type": "string" },
          { "name": "host_app", "type": "string" },
          { "name": "destination_app", "type": ["null", "string"] },
          { "name": "timestamp", "type": "long" }
        ] } },
    { "name": "body", "type": "bytes" } ] }

Standard headers allowed analyzing patterns of communication.
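As a sketch of why standard headers pay off, here is a minimal pure-Python pass that aggregates communication patterns from events sharing the header layout above. The event values are invented for illustration, and only a subset of the headers is used:

```python
from collections import Counter

events = [
    {"headers": {"source_app": "BookRoom", "destination_app": "RoomGifts", "timestamp": 1}, "body": b""},
    {"headers": {"source_app": "BookRoom", "destination_app": "RoomGifts", "timestamp": 2}, "body": b""},
    {"headers": {"source_app": "BeachPromo", "destination_app": "RoomGifts", "timestamp": 3}, "body": b""},
]

# Because every event carries the same header fields, one generic job
# can compute who talks to whom without understanding any of the bodies.
edges = Counter((e["headers"]["source_app"], e["headers"]["destination_app"])
                for e in events)
print(edges.most_common(1))  # [(('BookRoom', 'RoomGifts'), 2)]
```

This only works because the headers are standardized: the job never has to special-case individual topics.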

In Request Response World – APIs between services are key

In Async Stream Processing World – Message Schemas ARE the API

…except they stick around for a lot longer


Let’s assume you see why you need schema compatibility…

How do we make it work for us?

Schema Registry

Producing apps (App 1, App 2) serialize against schemas registered in the Schema Registry before writing to a Kafka topic; example consumers such as Elastic, Cassandra, and a Streams app deserialize against the same registered schemas. The registry lets you:

• Define the expected fields for each Kafka topic
• Automatically handle schema changes (e.g. new fields)
• Prevent incompatible changes
• Support multi-datacenter environments

Key Points:

• Schema Registry is efficient, due to caching
• Schema Registry is transparent
• Schema Registry is highly available
• Schema Registry prevents bad data at the source
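The efficiency point deserves a sketch: a serializer only talks to the registry the first time it sees a schema, then serves the schema ID from a local cache. A minimal illustration; RegistryClient here is a hypothetical in-memory stand-in, not the real Confluent client API:

```python
class RegistryClient:
    """Hypothetical stand-in for a remote Schema Registry."""
    def __init__(self):
        self.schemas = {}   # schema text -> id
        self.calls = 0      # simulated network round-trips

    def register(self, schema):
        self.calls += 1
        return self.schemas.setdefault(schema, len(self.schemas) + 1)

class CachingSerializer:
    """Caches schema IDs so repeated sends cost no registry traffic."""
    def __init__(self, registry):
        self.registry = registry
        self.cache = {}

    def schema_id(self, schema):
        # Only the first lookup per schema hits the registry.
        if schema not in self.cache:
            self.cache[schema] = self.registry.register(schema)
        return self.cache[schema]

registry = RegistryClient()
ser = CachingSerializer(registry)
for _ in range(1000):
    ser.schema_id('{"type": "string"}')
print(registry.calls)  # 1
```

A thousand messages, one round-trip: this is why putting the registry on the produce path adds essentially no per-message overhead.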

Big Benefits

• Schema management is part of your entire app lifecycle, from dev to prod
• Single source of truth for all apps and data stores
• No more “bad data” randomly breaking things
• Increased agility: add, remove, and modify fields with confidence
• Teams can share and discover data
• Smaller messages on the wire and in storage

But the Best Part is…

Schemas travel end to end with the data: a JDBC source connector reads an RDBMS and registers the schema, Kafka with Schema Registry carries it, and an HDFS sink connector uses the same schema to land the data in Hadoop / Hive.

Everyone ♥ Schema Registry

Compatibility tips!

• Do you want to upgrade producers without touching consumers?
  • You want *forward* compatibility.
  • You can add fields.
  • You can delete fields with defaults.
• Do you want to update the schema in a DWH but still read old messages?
  • You want *backward* compatibility.
  • You can delete fields.
  • You can add fields with defaults.
• Both?
  • You want *full* compatibility.
  • Default all things!
  • Except the primary key, which you never touch.
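The tips above all reduce to one resolution question: can a reader schema decode data written with a writer schema? Here is a toy model in plain Python, where a "schema" is just a map from field name to default value (None meaning no default); real Avro resolution covers far more, but the add/delete/default rules fall out the same way:

```python
def can_read(reader, writer):
    """Toy Avro-style check: every reader field must either exist in the
    writer schema or have a default to fall back on."""
    return all(name in writer or default is not None
               for name, default in reader.items())

v1 = {"id": None, "name": None}
v2 = {"id": None, "name": None, "email": "unknown"}  # added field WITH default
v3 = {"id": None}                                    # deleted the "name" field

# Backward compatibility: the NEW schema reads OLD data.
print(can_read(v2, v1))  # True  (added field has a default)
print(can_read(v3, v1))  # True  (deleting fields is backward compatible)

# Forward compatibility: the OLD schema reads NEW data.
print(can_read(v1, v2))  # True  (old reader ignores the extra field)
print(can_read(v1, v3))  # False ("name" was deleted and has no default)
```

Note the asymmetry in the last case: deleting "name" was fine backward but breaks forward, which is exactly why the slide says you may only delete fields with defaults when you need forward compatibility.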

App Lifecycle

Pipeline: Development -> Source Repository -> Pull Request -> CI -> Staging / Test ("nightly build") -> Production

Where to check schema compatibility: optional during development ("let's try"), a good idea in the source repository and on pull requests ("let's do it"), a definite YES in CI, almost too late in staging/test, and a last resort in production.

Finding Schemas

1. Write Schemas -> Generate Code

2. Get Schemas from registry -> Generate code

3. Write code -> Generate Schemas
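Option 3 can be sketched with stdlib reflection: derive an Avro-style record schema from an annotated class. This is a toy generator for illustration, not what any Confluent tooling actually does:

```python
from dataclasses import dataclass, fields

# Toy mapping from Python annotations to Avro primitive types.
AVRO_TYPES = {int: "long", str: "string", float: "double",
              bool: "boolean", bytes: "bytes"}

@dataclass
class Customer:
    id: int
    name: str
    level: str

def to_avro_schema(cls):
    # Walk the dataclass fields and emit an Avro record schema.
    return {
        "type": "record",
        "name": cls.__name__,
        "fields": [{"name": f.name, "type": AVRO_TYPES[f.type]}
                   for f in fields(cls)],
    }

print(to_avro_schema(Customer))
```

Each approach has a trade-off: schema-first (1 and 2) makes the registry the source of truth, while code-first (3, as above) keeps the schema next to the code but makes it easier to change accidentally.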


Summary!

1. Data quality is critical
2. Schemas are critical
3. Schema Registry is used to maintain quality from dev to prod
4. Schema Registry is in Confluent Open Source


Get Started with Apache Kafka Today!

https://www.confluent.io/downloads/

THE place to start with Apache Kafka!

• Thoroughly tested and quality assured
• More extensible developer experience
• Easy upgrade path to Confluent Enterprise

New York City | May 8, 2017

30% off conference fee, code = kafpcf17

Presented by

Thank You!
