Data Evolution on HBase with Kiji

DESCRIPTION

Data changes over time, often requiring carefully planned changes to database tables and application code. KijiSchema integrates best practices for serialization, schema design and evolution, and metadata management common in NoSQL storage solutions. In particular, KijiSchema provides strong guarantees for schema evolution and validates reads and writes issued by application code. We'll look at how you can take advantage of KijiSchema in your HBase applications, especially if you're new to HBase.

The Kiji Project is a modular, open-source framework that enables developers and analysts to collect, analyze and use data in real-time applications. Developers are using Kiji to build:

• Product and content recommendation systems

• Risk analysis and fraud monitoring

• Customer profile and segmentation applications

• Energy usage analytics & reporting

Page 1: Data Evolution on HBase (with Kiji)

Data Evolution on HBase with Kiji

Page 2: Data Evolution on HBase (with Kiji)

Who am I?

Page 3: Data Evolution on HBase (with Kiji)

How do we store data in HBase?

• HBase provides us with a single value type: byte[]

• In an application it’s necessary to store various data types in a cell: e.g. Java primitives, Java objects…

• The description of data we store in an HBase cell is the schema.

• Write application code or a library to convert our data to/from byte[].
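As a minimal sketch of that last bullet, a hand-written codec might look like the following. The class name PetCodec and the byte layout are hypothetical, chosen only for illustration; nothing here comes from HBase or Kiji.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical hand-rolled codec. The implicit "schema" is the layout
 * [int nameLength][name bytes as UTF-8][int age], known only to this class.
 */
final class PetCodec {
    private PetCodec() {}

    /** Encodes a pet into the byte[] that would be written to an HBase cell. */
    static byte[] encode(String name, int age) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + nameBytes.length + Integer.BYTES);
        buf.putInt(nameBytes.length);
        buf.put(nameBytes);
        buf.putInt(age);
        return buf.array();
    }

    /** Reads the name back out of a cell value. */
    static String decodeName(byte[] cell) {
        ByteBuffer buf = ByteBuffer.wrap(cell);
        byte[] nameBytes = new byte[buf.getInt()];
        buf.get(nameBytes);
        return new String(nameBytes, StandardCharsets.UTF_8);
    }

    /** Skips over the name and reads the age that follows it. */
    static int decodeAge(byte[] cell) {
        ByteBuffer buf = ByteBuffer.wrap(cell);
        int nameLength = buf.getInt();
        buf.position(buf.position() + nameLength);
        return buf.getInt();
    }
}
```

Anyone who wants to read the cell has to know this layout, which is exactly the problem the next slides describe.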

Page 4: Data Evolution on HBase (with Kiji)

What about when we want to get our data back from HBase?

• HBase is unaware of what we put in a cell.

• What’s in the byte[]?

    \x0010312985\x00 column=B:B:D, timestamp=1381493621000, value=\x0B\x00\xB0\x9A\xA4\x9B\xB3P\x02\x80\xF6\xC4\xD5\xE6'\x00\x00<<b>

• Already wrote a library to serialize/deserialize this data, so everything’s great, right?

Page 5: Data Evolution on HBase (with Kiji)

Sometime soon…

• The data structure has changed

• Ok, so we update our library with the changes.

• What happens when we try to read back old data?

• Raise an exception?

• Write a bunch of if (…) else if (…) code to determine the correct format?
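A rough sketch of where the if/else route leads, assuming a hypothetical format that prefixes every cell with a version byte. The class VersionedPetCodec, its layouts, and the "animal" fallback are all invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical codec with a leading version byte: v1 = [1][name], v2 = [2][name][kind]. */
final class VersionedPetCodec {
    private VersionedPetCodec() {}

    static byte[] encodeV1(String name) {
        return write(1, name);
    }

    static byte[] encodeV2(String name, String kind) {
        return write(2, name, kind);
    }

    /** Every new format version adds another branch here, forever. */
    static String decode(byte[] cell) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(cell))) {
            int version = in.readByte();
            if (version == 1) {
                // Old records have no kind, so the reader has to make one up.
                return in.readUTF() + " (animal)";
            } else if (version == 2) {
                String name = in.readUTF();
                String kind = in.readUTF();
                return name + " (" + kind + ")";
            } else {
                throw new IllegalArgumentException("Unknown format version: " + version);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the version byte followed by each field as a length-prefixed string. */
    private static byte[] write(int version, String... fields) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(version);
            for (String field : fields) {
                out.writeUTF(field);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

It works, but the compatibility rules live in hand-written branches rather than in a schema that tooling can check.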

Page 6: Data Evolution on HBase (with Kiji)

Instead, use a serialization library with evolvable records

• Examples:

• Avro

• Thrift

• Protocol Buffers

• Have some notion of compatible changes to help us avoid common pitfalls.

Page 7: Data Evolution on HBase (with Kiji)

A little bit about Avro

• Datum structure defined by schema

• Rules for compatible and incompatible schema changes.

• Backward-compatibility, Forward-compatibility

• Assumes a linear evolution of schema

• Reality: Schema evolution is more complicated.

Page 8: Data Evolution on HBase (with Kiji)

Ideal vs. Reality

Ideal (linear): Schema v1 → Schema v2 → Schema v3 → Schema v4

Reality (branching): Schema v1; Schema v2a and Schema v2b; Schema v3a; Schema v4aa and Schema v4ab

Page 9: Data Evolution on HBase (with Kiji)

A little (more) about Avro

• Schemas can be defined in JSON or IDL format.

• Specific & Generic API

• Avro Schema Example (IDL):

    record Pet {
      string name;
      int age;
      string owner_name;
    }

Page 10: Data Evolution on HBase (with Kiji)

(even more) about Avro

• You don't have to use compiled record classes.

• Can use the generic API (GenericDatumReader) to deserialize records with a specified schema.

• Makes it easier to migrate data when you do have to make an incompatible schema change. (sleep on it)

Page 11: Data Evolution on HBase (with Kiji)

Adding New Fields

• Old Schema:

    record Pet {
      string name; // Ms. Kitty
    }

• New Schema:

    record Pet {
      string name; // kitten
      string kind = "animal";
    }

Page 12: Data Evolution on HBase (with Kiji)

Remove Fields

• Old Schema:

    record Pet {
      string name;
      string kind;
    }

• New Schema (kind removed):

    record Pet {
      string name;
    }

What happens when an old reader reads this new record? Can’t find kind.

Page 13: Data Evolution on HBase (with Kiji)

Remove Fields

• Old Schema:

    record Pet {
      string name = "";
      string kind = "animal";
    }

• New Schema (kind removed):

    record Pet {
      string name = "";
    }

When an ‘old’ reader encounters a new record, the default value will be used.

Protip: Always provide default field values so you don’t kill your kittens.
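The two "Remove Fields" cases can be sketched with a toy resolver. This is a simplified model of the rule described above, not Avro's API: the class ToyResolver and its map-based records are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of record resolution; real Avro works on schemas and binary data. */
final class ToyResolver {
    private ToyResolver() {}

    /**
     * Builds the record the reader sees.
     *
     * @param written        fields present in the stored record
     * @param readerFields   field names declared by the reader's schema
     * @param readerDefaults defaults declared by the reader's schema, by field name
     */
    static Map<String, Object> resolve(
            Map<String, Object> written,
            List<String> readerFields,
            Map<String, Object> readerDefaults) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (String name : readerFields) {
            if (written.containsKey(name)) {
                result.put(name, written.get(name));
            } else if (readerDefaults.containsKey(name)) {
                // Field missing from the stored record: fall back to the reader's default.
                result.put(name, readerDefaults.get(name));
            } else {
                throw new IllegalStateException("Can't find " + name + " and it has no default");
            }
        }
        // Stored fields the reader does not declare are simply ignored.
        return result;
    }
}
```

With a default of "animal" for kind, an old reader resolves a new record that only has name; without the default, the same call throws.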

Page 14: Data Evolution on HBase (with Kiji)

Type Promotion

• Old Schema:

    record PetOwner {
      int ownerId;
      string name;
    }

• New Schema:

    record PetOwner {
      long ownerId;
      string name;
    }

Detailed specification: http://avro.apache.org/docs/1.7.5/spec.html#Schema+Resolution
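The numeric promotions listed in the linked Schema Resolution section can be written out as a small lookup. This sketch covers only those numeric rules, and the class Promotion is a hypothetical helper, not part of Avro.

```java
/** Toy check for Avro's numeric type promotions (writer type -> reader type). */
final class Promotion {
    private Promotion() {}

    /** True if data written as writerType can be read as readerType. */
    static boolean readable(String writerType, String readerType) {
        if (writerType.equals(readerType)) {
            return true;
        }
        switch (writerType) {
            case "int":
                return readerType.equals("long")
                    || readerType.equals("float")
                    || readerType.equals("double");
            case "long":
                return readerType.equals("float") || readerType.equals("double");
            case "float":
                return readerType.equals("double");
            default:
                return false;
        }
    }
}
```

Promotion only goes one way: a "long" reader can read "int" data, but an "int" reader cannot read "long" data.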

Page 15: Data Evolution on HBase (with Kiji)

Enter KijiSchema

• Human-friendly table layouts

• Uses Avro for serialization

• Supports primitive types and complex records

• Schema is stored as part of the datum

• Provides schema validation & audit trail

Page 16: Data Evolution on HBase (with Kiji)

KijiTables >> Plain tables

• Layout defined using JSON or DDL

• Formatted row keys

• Schemas stored in metadata table

• Schema Validation on read & write

• Basically, an enhanced HBase table

Page 17: Data Evolution on HBase (with Kiji)

Example Table Definition

    CREATE TABLE foo WITH DESCRIPTION 'some data'
    ROW KEY FORMAT (pet_id LONG)
    WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
      MAXVERSIONS = INFINITY,
      TTL = FOREVER,
      INMEMORY = false,
      COMPRESSED WITH SNAPPY,
      FAMILY info WITH DESCRIPTION 'basic information' (
        metadata CLASS com.mycompany.avro.Pet
      )
    );

• Visit www.kiji.org for more details

Page 18: Data Evolution on HBase (with Kiji)

Beyond Client-side Validation

• Server-side Schema validation

• Ensure your reader is able to read stored records.

• Ensure you make compatible schema changes

• Ensure you don’t accidentally introduce a new schema

Page 19: Data Evolution on HBase (with Kiji)

Schema Validation

• Three available modes:

• Developer

• Strict

• None (Not recommended in most cases, but still possible.)

Page 20: Data Evolution on HBase (with Kiji)

Developer Mode

• Don’t need an ALTER statement to write with a new schema

• New schemas are automatically registered on write

• Incompatible writers are rejected at run-time, so still safe

• Convenience when developing

Page 21: Data Evolution on HBase (with Kiji)

Strict Mode

• New schemas must be registered with an ALTER statement.

• Incompatible readers and writers are rejected at registration time.

• Production-safe

Page 22: Data Evolution on HBase (with Kiji)

ALTER Examples

• ALTER TABLE t SET VALIDATION = STRICT;

• ALTER TABLE t ADD WRITER SCHEMA "long" FOR COLUMN info:foo;

    Fails with: In column: 'info:foo' Reader schema: "int" is incompatible with writer schema: "long".

Page 23: Data Evolution on HBase (with Kiji)

Avoid Common Pitfalls w/ KijiSchema Validation

1. Record has string field with default value

2. Field removed (compatible)

3. New field with same name added but different type (compatible from perspective of 2)

4. Incompatible between 1 and 3!
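The four steps above can be sketched with a toy compatibility check. It shows why comparing a new schema only against its immediate predecessor is not enough. The class PitfallDemo and its simplified rule (same-named fields must have the same type, missing fields need a reader default) are assumptions for illustration, not KijiSchema's actual validator.

```java
import java.util.List;

/** Toy pairwise schema compatibility check for the remove-then-re-add pitfall. */
final class PitfallDemo {
    private PitfallDemo() {}

    /** A record field: name, type name, and whether the schema declares a default. */
    static final class Field {
        final String name;
        final String type;
        final boolean hasDefault;

        Field(String name, String type, boolean hasDefault) {
            this.name = name;
            this.type = type;
            this.hasDefault = hasDefault;
        }
    }

    /** True if a reader using readerFields can decode records written with writerFields. */
    static boolean canRead(List<Field> readerFields, List<Field> writerFields) {
        for (Field reader : readerFields) {
            Field writer = find(writerFields, reader.name);
            if (writer == null) {
                if (!reader.hasDefault) {
                    return false; // nothing stored and nothing to fall back on
                }
            } else if (!writer.type.equals(reader.type)) {
                return false; // same name, different type
            }
        }
        return true;
    }

    /** True if each schema can read data written with the other. */
    static boolean mutuallyCompatible(List<Field> a, List<Field> b) {
        return canRead(a, b) && canRead(b, a);
    }

    private static Field find(List<Field> fields, String name) {
        for (Field field : fields) {
            if (field.name.equals(name)) {
                return field;
            }
        }
        return null;
    }
}
```

With v1 = {name, kind: string with default}, v2 = {name}, and v3 = {name, kind: int with default}, the pairs v1/v2 and v2/v3 both pass, while v1/v3 fails. Data written with v1 may still be in the table when v3 readers arrive.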

Page 24: Data Evolution on HBase (with Kiji)

• Apache v2 Licensed Open Source

• Includes KijiSchema as well as components for writing

• MapReduce

• Hive Adapter

• Scalding flows for data science

• REST API supporting on-demand computation for real-time web applications

Page 25: Data Evolution on HBase (with Kiji)

BentoBox

• Complete development environment for Kiji & HBase

• Single process Hadoop & HBase cluster

• We accept community contributions

• Try it today: www.kiji.org

• User Mailing List

• Developer Mailing List

Page 26: Data Evolution on HBase (with Kiji)

Questions?

Adam Kunicki

@ramblingpolak on Twitter