Data Governance and Schema Registry for Software Developers


Simplify Governance of Streaming Data

Gwen Shapira

Product Manager, Apache Kafka™ PMC

Attend the whole series!

Simplify Governance for Streaming Data in Apache Kafka
Date: Thursday, April 6, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Gwen Shapira, Product Manager, Confluent

Using Apache Kafka to Analyze Session Windows
Date: Thursday, March 30, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Michael Noll, Product Manager, Confluent

Monitoring and Alerting Apache Kafka with Confluent Control Center
Date: Thursday, March 16, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Nick Dearden, Director, Engineering and Product

Data Pipelines Made Simple with Apache Kafka
Date: Thursday, March 23, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Ewen Cheslack-Postava, Engineer, Confluent

What’s New in Apache Kafka 0.10.2 and Confluent 3.2
Date: Thursday, March 9, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Clarke Patterson, Senior Director, Product Marketing

https://www.confluent.io/online-talk/online-talk-series-five-steps-to-production-with-apache-kafka/

What is Data Governance?

• Data Access
• Master Data Management
• Data Agility
• Data Reliability
• Data Quality

To Grow a Successful Data Platform, You Need Some Guarantees

If you can’t trust your data,

what can you trust?


Schemas Are a Key Part of This

1. Schemas are enforced contracts between your components
2. Schemas need to be agile but compatible
3. You need tools and processes that let you move fast and not break things

Cautionary tale: Uber

Producers (Service 1, Service 2) write events that flow through S3 and EMR into Redshift:

Event produced: {created_at: "Dec 1, 2015"}
Table defined: CREATE ( created_at: STRING )
ETL parsing: parseDate(created_at, "%M %d, %yyyy")

Cautionary tale: Uber (continued)

Later, a producer switches to epoch timestamps without telling anyone:

Event produced: {created_at: 1491368429}
Table still defined: CREATE ( created_at: STRING )
ETL still parsing: parseDate(created_at, "%M %d, %yyyy"), which now fails for every event
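The failure mode above can be sketched in a few lines of Python. This uses Python's own date-format codes rather than the slide's `%M %d, %yyyy`, and the field name is from the slide's hypothetical example, not Uber's actual pipeline:

```python
from datetime import datetime

def parse_created_at(value):
    # The ETL job assumes a fixed, human-readable date format.
    return datetime.strptime(value, "%b %d, %Y")

# The original producer format parses fine:
print(parse_created_at("Dec 1, 2015"))  # 2015-12-01 00:00:00

# After a producer quietly switches to epoch seconds, the same
# ETL code raises, and every downstream job breaks:
try:
    parse_created_at("1491368429")
except ValueError as err:
    print("pipeline broken:", err)
```

Nothing in this pipeline could reject the incompatible change at write time; the breakage only surfaces far downstream, which is exactly the problem a schema contract prevents.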

The right way to do things

Decoupled services communicating through well-defined topics: MessageCustomer, RoomGifts, BookRoom, LoyalCustomers

The right way to do things

Decoupled services communicating through well-defined topics: MessageCustomer, RoomGifts, BookRoom, LoyalCustomers

A new BeachPromo service can plug into the same topics later:

BeachPromo: If (property.location == beach && customer.level == platinum)
  sendMessage({topic: RoomGifts, customer_id: customer.id, room: property.roomnum, gift: seaShell})
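A runnable sketch of the BeachPromo logic, with a stubbed send_message in place of a real Kafka producer; all the names come from the slide's hypothetical hotel example:

```python
sent = []

def send_message(event):
    # Stand-in for a real Kafka producer send.
    sent.append(event)

def beach_promo(property, customer):
    # Gift only platinum customers staying at a beach property.
    if property["location"] == "beach" and customer["level"] == "platinum":
        send_message({"topic": "RoomGifts",
                      "customer_id": customer["id"],
                      "room": property["roomnum"],
                      "gift": "seaShell"})

beach_promo({"location": "beach", "roomnum": 101},
            {"id": 42, "level": "platinum"})
print(sent)  # one RoomGifts event
```

The point of the pattern: BeachPromo only needs to know the RoomGifts topic's schema, not anything about the services that consume it.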

There is special value in standardizing parts of the schema

{ "type": "record",
  "name": "Event",
  "fields": [
    { "name": "headers", "type": {
        "name": "headers",
        "type": "record",
        "fields": [
          { "name": "source_app", "type": "string" },
          { "name": "host_app", "type": "string" },
          { "name": "destination_app", "type": ["null", "string"] },
          { "name": "timestamp", "type": "long" }
        ] } },
    { "name": "body", "type": "bytes" } ] }

Standard headers allowed analyzing patterns of communication.
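As a sketch of why standard headers pay off, here is a minimal pure-Python pass that aggregates communication patterns from events sharing the header layout above. The event values are invented for illustration, and only a subset of the headers is used:

```python
from collections import Counter

events = [
    {"headers": {"source_app": "BookRoom", "destination_app": "RoomGifts", "timestamp": 1}, "body": b""},
    {"headers": {"source_app": "BookRoom", "destination_app": "RoomGifts", "timestamp": 2}, "body": b""},
    {"headers": {"source_app": "BeachPromo", "destination_app": "RoomGifts", "timestamp": 3}, "body": b""},
]

# Because every event carries the same header fields, one generic job
# can compute who talks to whom without understanding any of the bodies.
edges = Counter((e["headers"]["source_app"], e["headers"]["destination_app"])
                for e in events)
print(edges.most_common(1))  # [(('BookRoom', 'RoomGifts'), 2)]
```

This only works because the headers are standardized: the job never has to special-case individual topics.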

In Request Response World – APIs between services are key

In Async Stream Processing World – Message Schemas ARE the API

…except they stick around for a lot longer


Let’s assume you see why you need schema compatibility…

How do we make it work for us?

Schema Registry

Producing apps (App 1, App 2) serialize against schemas registered in the Schema Registry before writing to a Kafka topic; example consumers such as Elastic, Cassandra, and a Streams app deserialize against the same registered schemas. The registry lets you:

• Define the expected fields for each Kafka topic
• Automatically handle schema changes (e.g. new fields)
• Prevent incompatible changes
• Support multi-datacenter environments

Key Points:

• Schema Registry is efficient, due to caching
• Schema Registry is transparent
• Schema Registry is highly available
• Schema Registry prevents bad data at the source
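The efficiency point deserves a sketch: a serializer only talks to the registry the first time it sees a schema, then serves the schema ID from a local cache. A minimal illustration; RegistryClient here is a hypothetical in-memory stand-in, not the real Confluent client API:

```python
class RegistryClient:
    """Hypothetical stand-in for a remote Schema Registry."""
    def __init__(self):
        self.schemas = {}   # schema text -> id
        self.calls = 0      # simulated network round-trips

    def register(self, schema):
        self.calls += 1
        return self.schemas.setdefault(schema, len(self.schemas) + 1)

class CachingSerializer:
    """Caches schema IDs so repeated sends cost no registry traffic."""
    def __init__(self, registry):
        self.registry = registry
        self.cache = {}

    def schema_id(self, schema):
        # Only the first lookup per schema hits the registry.
        if schema not in self.cache:
            self.cache[schema] = self.registry.register(schema)
        return self.cache[schema]

registry = RegistryClient()
ser = CachingSerializer(registry)
for _ in range(1000):
    ser.schema_id('{"type": "string"}')
print(registry.calls)  # 1
```

A thousand messages, one round-trip: this is why putting the registry on the produce path adds essentially no per-message overhead.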

Big Benefits

• Schema management is part of your entire app lifecycle, from dev to prod
• Single source of truth for all apps and data stores
• No more “bad data” randomly breaking things
• Increased agility: add, remove, and modify fields with confidence
• Teams can share and discover data
• Smaller messages on the wire and in storage

But the Best Part is…

Schemas travel end to end with the data: a JDBC source connector reads an RDBMS and registers the schema, Kafka with Schema Registry carries it, and an HDFS sink connector uses the same schema to land the data in Hadoop / Hive.

Everyone ♥ Schema Registry

Compatibility tips!

• Do you want to upgrade producers without touching consumers?
  • You want *forward* compatibility.
  • You can add fields.
  • You can delete fields with defaults.
• Do you want to update the schema in a DWH but still read old messages?
  • You want *backward* compatibility.
  • You can delete fields.
  • You can add fields with defaults.
• Both?
  • You want *full* compatibility.
  • Default all things!
  • Except the primary key, which you never touch.
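The tips above all reduce to one resolution question: can a reader schema decode data written with a writer schema? Here is a toy model in plain Python, where a "schema" is just a map from field name to default value (None meaning no default); real Avro resolution covers far more, but the add/delete/default rules fall out the same way:

```python
def can_read(reader, writer):
    """Toy Avro-style check: every reader field must either exist in the
    writer schema or have a default to fall back on."""
    return all(name in writer or default is not None
               for name, default in reader.items())

v1 = {"id": None, "name": None}
v2 = {"id": None, "name": None, "email": "unknown"}  # added field WITH default
v3 = {"id": None}                                    # deleted the "name" field

# Backward compatibility: the NEW schema reads OLD data.
print(can_read(v2, v1))  # True  (added field has a default)
print(can_read(v3, v1))  # True  (deleting fields is backward compatible)

# Forward compatibility: the OLD schema reads NEW data.
print(can_read(v1, v2))  # True  (old reader ignores the extra field)
print(can_read(v1, v3))  # False ("name" was deleted and has no default)
```

Note the asymmetry in the last case: deleting "name" was fine backward but breaks forward, which is exactly why the slide says you may only delete fields with defaults when you need forward compatibility.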

App Lifecycle

Pipeline: Development -> Source Repository -> Pull Request -> CI -> Staging / Test ("nightly build") -> Production

Where to check schema compatibility: optional during development ("let's try"), a good idea in the source repository and on pull requests ("let's do it"), a definite YES in CI, almost too late in staging/test, and a last resort in production.

Finding Schemas

1. Write Schemas -> Generate Code

2. Get Schemas from registry -> Generate code

3. Write code -> Generate Schemas
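Option 3 can be sketched with stdlib reflection: derive an Avro-style record schema from an annotated class. This is a toy generator for illustration, not what any Confluent tooling actually does:

```python
from dataclasses import dataclass, fields

# Toy mapping from Python annotations to Avro primitive types.
AVRO_TYPES = {int: "long", str: "string", float: "double",
              bool: "boolean", bytes: "bytes"}

@dataclass
class Customer:
    id: int
    name: str
    level: str

def to_avro_schema(cls):
    # Walk the dataclass fields and emit an Avro record schema.
    return {
        "type": "record",
        "name": cls.__name__,
        "fields": [{"name": f.name, "type": AVRO_TYPES[f.type]}
                   for f in fields(cls)],
    }

print(to_avro_schema(Customer))
```

Each approach has a trade-off: schema-first (1 and 2) makes the registry the source of truth, while code-first (3, as above) keeps the schema next to the code but makes it easier to change accidentally.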


Summary!

1. Data quality is critical
2. Schemas are critical
3. Schema Registry is used to maintain quality from dev to prod
4. Schema Registry is in Confluent Open Source


Get Started with Apache Kafka Today!

https://www.confluent.io/downloads/

THE place to start with Apache Kafka!

• Thoroughly tested and quality assured
• More extensible developer experience
• Easy upgrade path to Confluent Enterprise

New York City | May 8, 2017

30% off conference fee, code = kafpcf17

Presented by

Thank You!
