Data Evolution on HBase (with Kiji)

Who am I?

How do we store data in HBase?

• HBase provides us with a single value type: byte[]

• In an application it’s necessary to store various data types in a cell: e.g. Java primitives, Java objects…

• The description of data we store in an HBase cell is the schema.

• Write application code or a library to convert our data to/from byte[].
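A hand-rolled converter of this kind might look like the minimal Python sketch below. The byte layout (2-byte name length, UTF-8 name, 4-byte age) and the function names `encode_pet` and `decode_pet` are assumptions made for illustration; they are not part of HBase or Kiji.

```python
import struct
from typing import Tuple

def encode_pet(name: str, age: int) -> bytes:
    """Serialize a pet as: 2-byte name length, UTF-8 name, 4-byte signed age."""
    raw_name = name.encode("utf-8")
    return struct.pack(">H", len(raw_name)) + raw_name + struct.pack(">i", age)

def decode_pet(data: bytes) -> Tuple[str, int]:
    """Reverse of encode_pet; only works if the bytes use the same layout."""
    (name_len,) = struct.unpack_from(">H", data, 0)
    name = data[2:2 + name_len].decode("utf-8")
    (age,) = struct.unpack_from(">i", data, 2 + name_len)
    return name, age

# The bytes are what would be stored in the HBase cell.
cell_value = encode_pet("Ms. Kitty", 3)
print(decode_pet(cell_value))
```

The layout lives only in this code, so every reader of the cell has to agree on it out of band.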

What about when we want to get our data back from HBase?

• HBase is unaware of what we put in a cell.

• What’s in the byte[]?

    \x0010312985\x00 column=B:B:D, timestamp=1381493621000, value=\x0B\x00\xB0\x9A\xA4\x9B\xB3P\x02\x80\xF6\xC4\xD5\xE6'\x00\x00<<b>

• Already wrote a library to serialize/deserialize this data, so everything’s great, right?

Sometime soon…

• The data structure has changed

• Ok, so we update our library with the changes.

• What happens when we try to read back old data?

• Raise an exception?

• Write a bunch of if (…) else if (…) code to determine the correct format?
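To make the failure concrete, here is a small Python sketch in which a cell written with an old layout is read by an updated decoder. The layouts (v1: name only; v2: name plus a 4-byte age) and the function names are hypothetical, chosen only to illustrate the problem.

```python
import struct

def encode_v1(name: str) -> bytes:
    """Old writer: 2-byte name length followed by the UTF-8 name."""
    raw = name.encode("utf-8")
    return struct.pack(">H", len(raw)) + raw

def decode_v2(data: bytes):
    """Updated reader: expects the name to be followed by a 4-byte age."""
    (name_len,) = struct.unpack_from(">H", data, 0)
    name = data[2:2 + name_len].decode("utf-8")
    (age,) = struct.unpack_from(">i", data, 2 + name_len)  # missing in v1 cells
    return name, age

old_cell = encode_v1("Ms. Kitty")
try:
    decode_v2(old_cell)
    outcome = "decoded"
except struct.error:
    # The old cell is too short for the new layout.
    outcome = "failed"
print(outcome)
```

Nothing in the bytes says which layout was used, which is why the reader ends up guessing with if/else checks.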

Instead, use a serialization library with evolvable records

• Examples:

• Avro

• Thrift

• Protocol Buffers

• Have some notion of compatible changes to help us avoid common pitfalls.

A little bit about Avro

• Datum structure defined by schema

• Rules for compatible and incompatible schema changes.

• Backward-compatibility, Forward-compatibility

• Assumes a linear evolution of schema

• Reality: Schema evolution is more complicated.

Ideal vs. Reality

• Ideal (linear): Schema v1, Schema v2, Schema v3, Schema v4

• Reality (branching): Schema v1; Schema v2a, Schema v2b; Schema v3a; Schema v4aa, Schema v4ab

A little (more) about Avro

• Schemas can be defined in JSON or IDL format.

• Specific & Generic API

• Avro Schema Example (IDL):

    record Pet {
      string name;
      int age;
      string owner_name;
    }

(even more) about Avro

• You don't have to use compiled record classes.

• Can use the generic API (GenericDatumReader) to deserialize records with a specified schema.

• Makes it easier to migrate data when you do have to make an incompatible schema change. (sleep on it)

Adding New Fields

• Old Schema:

    record Pet {
      string name; // Ms. Kitty
    }

• New Schema:

    record Pet {
      string name; // kitten
      string kind = "animal";
    }

Remove Fields

• Old Schema:

    record Pet {
      string name;
      string kind;
    }

• New Schema (kind removed):

    record Pet {
      string name;
    }

What happens when an old reader reads this new record? Can’t find kind.

Remove Fields

• Old Schema:

    record Pet {
      string name = "";
      string kind = "animal";
    }

• New Schema (kind removed):

    record Pet {
      string name = "";
    }

When an ‘old’ reader encounters a new record, the default value will be used.

Protip: Always provide default field values so you don’t kill your kittens.
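The effect of defaults can be sketched in plain Python. This is a toy model of the idea, not Avro's actual resolution code: schemas are represented as dicts from field name to default value, and `NO_DEFAULT` and `resolve` are hypothetical names introduced for the example.

```python
NO_DEFAULT = object()  # sentinel: the reader schema declares no default

def resolve(record: dict, reader_fields: dict) -> dict:
    """Project a decoded record onto the reader's fields.

    reader_fields maps field name -> default value, or NO_DEFAULT.
    """
    resolved = {}
    for field, default in reader_fields.items():
        if field in record:
            resolved[field] = record[field]
        elif default is not NO_DEFAULT:
            resolved[field] = default  # fall back to the reader's default
        else:
            raise KeyError(f"no value and no default for field {field!r}")
    return resolved

new_record = {"name": "kitten"}  # written after 'kind' was removed
old_reader_with_defaults = {"name": "", "kind": "animal"}
old_reader_without_defaults = {"name": NO_DEFAULT, "kind": NO_DEFAULT}

print(resolve(new_record, old_reader_with_defaults))
```

Calling `resolve(new_record, old_reader_without_defaults)` raises `KeyError`, which corresponds to the "Can't find kind" case.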

Type Promotion

• Old Schema:

    record PetOwner {
      int ownerId;
      string name;
    }

• New Schema:

    record PetOwner {
      long ownerId;
      string name;
    }

Detailed specification: http://avro.apache.org/docs/1.7.5/spec.html#Schema+Resolution
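The numeric promotions listed in the specification can be written down as a small lookup table. The sketch below covers only those numeric promotions (int to long, float, or double; long to float or double; float to double); the function name `reader_can_read` is an assumption for illustration and not an Avro API.

```python
# Writer type -> reader types it may be promoted to.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
}

def reader_can_read(reader_type: str, writer_type: str) -> bool:
    """True if data written as writer_type can be read as reader_type."""
    if reader_type == writer_type:
        return True
    return reader_type in PROMOTIONS.get(writer_type, set())

print(reader_can_read("long", "int"))  # widening int -> long is allowed
print(reader_can_read("int", "long"))  # narrowing long -> int is not
```

Promotion only widens: a new reader using long can read old int data, but an old int reader cannot read new long data.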

Enter KijiSchema

• Human-friendly table layouts

• Uses Avro for serialization

• Supports primitive types and complex records

• Schema is stored as part of the datum

• Provides schema validation & audit trail

KijiTables >> Plain tables

• Layout defined using JSON or DDL

• Formatted row keys

• Schemas stored in metadata table

• Schema Validation on read & write

• Basically, an enhanced HBase table

Example Table Definition

    CREATE TABLE foo WITH DESCRIPTION 'some data'
    ROW KEY FORMAT (pet_id LONG)
    WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
      MAXVERSIONS = INFINITY,
      TTL = FOREVER,
      INMEMORY = false,
      COMPRESSED WITH SNAPPY,
      FAMILY info WITH DESCRIPTION 'basic information' (
        metadata CLASS com.mycompany.avro.Pet
      )
    );

• Visit www.kiji.org for more details

Beyond Client-side Validation

• Server-side Schema validation

• Ensure your reader is able to read stored records.

• Ensure you make compatible schema changes

• Ensure you don’t accidentally introduce a new schema

Schema Validation

• Three available modes:

• Developer

• Strict

• None (Not recommended in most cases, but still possible.)

Developer Mode

• Don’t need an ALTER statement to write with a new schema

• New schemas are automatically registered on write

• Incompatible writers are rejected at run-time, so still safe

• Convenience when developing

Strict Mode

• New schemas must be registered with an ALTER statement.

• Incompatible readers and writers are rejected at registration time.

• Production-safe

ALTER Examples

• ALTER TABLE t SET VALIDATION = STRICT;

• ALTER TABLE t ADD WRITER SCHEMA "long" FOR COLUMN info:foo;

    In column: 'info:foo' Reader schema: "int" is incompatible with writer schema: "long".

Avoid Common Pitfalls w/ KijiSchema Validation

1. Record has string field with default value

2. Field removed (compatible)

3. New field with same name added but different type (compatible from perspective of 2)

4. Incompatible between 1 and 3!
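The four steps above can be replayed with a toy compatibility check. This is a simplified sketch, not KijiSchema's validator: schemas are dicts from field name to (type, has_default), type promotion is ignored, and `can_read` is a hypothetical helper.

```python
def can_read(reader: dict, writer: dict) -> bool:
    """reader/writer map field name -> (type, has_default)."""
    for field, (reader_type, has_default) in reader.items():
        if field in writer:
            if writer[field][0] != reader_type:
                return False  # same name, different type
        elif not has_default:
            return False      # field missing and no default to fall back on
    return True

v1 = {"name": ("string", True), "kind": ("string", True)}  # step 1
v2 = {"name": ("string", True)}                            # step 2: kind removed
v3 = {"name": ("string", True), "kind": ("int", True)}     # step 3: kind re-added as int

print(can_read(v2, v1), can_read(v3, v2), can_read(v3, v1))
```

Each neighbouring pair passes, but a v3 reader cannot read v1 data, so a new schema needs to be checked against every schema still in use rather than only the most recent one.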

• Apache v2 Licensed Open Source

• Includes KijiSchema as well as components for writing

• MapReduce

• Hive Adapter

• Scalding flows for data science

• REST API supporting on-demand computation for real-time web applications

BentoBox

• Complete development environment for Kiji & HBase

• Single process Hadoop & HBase cluster

• We accept community contributions

• Try it today: www.kiji.org

• User Mailing List: https://groups.google.com/a/kiji.org/forum/?fromgroups#!forum/user

• Developer Mailing List: https://groups.google.com/a/kiji.org/forum/?fromgroups#!forum/dev

Questions?

Adam Kunicki

[email protected] @ramblingpolak on Twitter
