Kite SDK introduction for Portland Big Data

Kite SDK: It’s for developers

Ryan Blue, Software Engineer

Resources

©2014 Cloudera, Inc. All rights reserved.

• Kite guide
  • http://tiny.cloudera.com/KiteGuide

• Dataset overview and intro
  • http://tiny.cloudera.com/Datasets

• Command-line tutorial
  • http://tiny.cloudera.com/KiteCLI

• Kite repository and examples
  • https://github.com/kite-sdk/kite
  • https://github.com/kite-sdk/kite-examples


Agenda


• Kite background
• Kite data

What problem does Kite solve?


• Accessibility for getting started
  • Easy to get started, without being an expert
  • Use before understanding

• Save time for experienced developers
  • Off-the-shelf tools for common tasks
  • Quickly iterate and test configurations

Kite Datasets: Motivation


• Focus on using data, not managing files
• Developers shouldn’t have to maintain data files
• Use through configuration, not code
• Need consistency across the platform



[Diagram, three panels.
 Panel 1: an Application (user code) sits on a Database (provided), which sits on Data files (maintained by the database).
 Panel 2: a second Application is added that works on Data files and HBase directly, with no database layer (user code).
 Panel 3: Applications use a Kite Data layer that sits over Data files and HBase (maintained by Kite).]

Kite Datasets: Goals


• Think in terms of data: datasets, views, records
• Describe the data and its layout, and Kite does the right thing
• Should work consistently across the platform
• Reliable

Kite Datasets: Compatibility


Project      HDFS (Avro)   HDFS (Parquet)   HBase
Kite         1.0           1.0              1.0
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *

* depends on common HBase encoding format

Current compatibility (0.15.0)


Project      HDFS (Avro)   HDFS (Parquet)   HBase
Kite         1.0           1.0              1.0
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *

* depends on common HBase encoding format


Datasets


• A collection of records or entities
  • Like a Hive or relational table
  • Generic, reflected, or generated objects

• Identified by URI
  • dataset:hdfs:/data/ratings
  • dataset:hive:/data/ratings
  • dataset:hbase:zk1/ratings

ratings = Datasets.load("dataset:hive:/data/ratings")
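The dataset URIs above share one shape: a "dataset" scheme wrapping a second URI that names the storage system and a location. The sketch below only illustrates that structure with java.net.URI from the standard library. The class name DatasetUriSketch and the split method are hypothetical, and this is not Kite's own URI parser.

```java
import java.net.URI;

/** Illustration only: splits a dataset URI into its parts. Not Kite's parser. */
final class DatasetUriSketch {

    /** Returns {storage scheme, location}, e.g. {"hive", "/data/ratings"}. */
    static String[] split(String datasetUri) {
        URI outer = URI.create(datasetUri);
        if (!"dataset".equals(outer.getScheme())) {
            throw new IllegalArgumentException("Not a dataset URI: " + datasetUri);
        }
        // Everything after "dataset:" is itself a URI that names the storage system.
        URI inner = URI.create(outer.getSchemeSpecificPart());
        return new String[] { inner.getScheme(), inner.getSchemeSpecificPart() };
    }

    public static void main(String[] args) {
        String[] uris = {
            "dataset:hdfs:/data/ratings",
            "dataset:hive:/data/ratings",
            "dataset:hbase:zk1/ratings"
        };
        for (String uri : uris) {
            String[] parts = split(uri);
            System.out.println(parts[0] + " -> " + parts[1]);
        }
    }
}
```

Because only the storage scheme changes between the three URIs, application code that loads a dataset by URI can stay the same when the storage backend changes.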

Dataset configuration, JSON


• Schema (Avro)
  • Record fields, like a table definition

• Partition strategy
  • Layout or key definition from record fields

Configuring partitioning


• Partition strategy

[
  { "source" : "timestamp", "type" : "year" },
  { "source" : "timestamp", "type" : "month" },
  { "source" : "timestamp", "type" : "day" }
]

datasets/
└── ratings/
    ├── year=1997/
    │   ├── month=09/
    │   │   ├── day=20/
    │   │   ├── ...
    │   │   └── day=30/
    │   ├── month=10/
    │   │   ├── day=01/
    │   │   ├── ...
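To make the year/month/day layout above concrete, here is a minimal sketch of how a record's timestamp could map to a partition directory. It assumes the timestamp is epoch milliseconds interpreted in UTC, which the slides do not state. The class PartitionPathSketch and its method are hypothetical and are not part of Kite.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

/** Illustration only: mimics the directory naming shown above. Not Kite code. */
final class PartitionPathSketch {

    /** Builds "year=YYYY/month=MM/day=DD" from epoch milliseconds (assumed UTC). */
    static String partitionPath(long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
            t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    public static void main(String[] args) {
        long ts = Instant.parse("1997-09-20T10:15:30Z").toEpochMilli();
        // One directory per day, nested under month and year.
        System.out.println("datasets/ratings/" + partitionPath(ts));
    }
}
```

Each entry in the partition strategy contributes one directory level, in the order listed, so records from the same day land in the same directory.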

Configuring key building


• Partition strategy for HBase

[
  { "source" : "email", "type" : "hash", "buckets" : 32 },
  { "source" : "email", "type" : "identity" }
]

(22, "[email protected]")

\x80\x00\x00\[email protected]\x00\x00
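The key above has two parts: a hash bucket for the email followed by the email itself, so rows spread across 32 buckets while each row stays addressable by email. The sketch below shows that idea only. The hash function (String.hashCode masked to non-negative, then modulo the bucket count) is an assumption, so its bucket numbers need not match the ones on the slide, and the byte encoding of the row key is not reproduced. The class name and the sample email are hypothetical.

```java
/** Illustration only: hash-bucket plus identity key parts. Not Kite's encoding. */
final class HashKeySketch {

    /** Assumed hash: non-negative hashCode modulo the number of buckets. */
    static int bucket(String value, int buckets) {
        return (value.hashCode() & Integer.MAX_VALUE) % buckets;
    }

    /** Key parts in partition-strategy order: (hash bucket, identity value). */
    static Object[] key(String email, int buckets) {
        return new Object[] { bucket(email, buckets), email };
    }

    public static void main(String[] args) {
        // "user@example.com" is a placeholder address for illustration.
        Object[] key = key("user@example.com", 32);
        System.out.println("(" + key[0] + ", \"" + key[1] + "\")");
    }
}
```

Putting the bucket first spreads sequential or similar emails across the key space, while the identity part keeps keys unique.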




• Partition strategy
  • Layout or key definition from record fields

• Column mapping (HBase)
  • Where to store record fields

Mapping example

Schema:

{
  "type" : "record",
  "name" : "User",
  "fields" : [
    { "name" : "email", "type" : "string" },
    ...
  ]
}

HBase table:

family                 name                counts   prefs
row key                last        first   visits   flash
[email protected]      Lightyear   Buzz    315      true

Column mapping:

[
  { "source": "email", "type": "key" },
  ...
]

Mapping example

Schema:

{
  "type" : "record",
  "name" : "User",
  "fields" : [
    { "name" : "lastName", "type" : "string" },
    ...
  ]
}

HBase table:

family                 name                counts   prefs
row key                last        first   visits   flash
[email protected]      Lightyear   Buzz    315      true

Column mapping:

[
  { "source": "lastName", "type": "column", "family": "name", "qualifier": "last" },
  ...
]
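As a rough illustration of what the two mappings above describe, the sketch below routes a record's email field to the row key and its lastName field to the cell name:last. Only those two mappings appear on the slides, so the others are left out. The class ColumnMappingSketch, its methods, and the sample email are hypothetical. Kite applies mappings from the JSON configuration, not from code like this.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: applies the two mappings shown above to one record. */
final class ColumnMappingSketch {

    // { "source": "email", "type": "key" }
    static final String KEY_FIELD = "email";

    // { "source": "lastName", "type": "column", "family": "name", "qualifier": "last" }
    static final Map<String, String> COLUMNS =
        Collections.singletonMap("lastName", "name:last");

    /** The row key is the value of the field mapped with type "key". */
    static String rowKey(Map<String, String> record) {
        return record.get(KEY_FIELD);
    }

    /** Returns "family:qualifier" -> value for every column-mapped field present. */
    static Map<String, String> cells(Map<String, String> record) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> mapping : COLUMNS.entrySet()) {
            String value = record.get(mapping.getKey());
            if (value != null) {
                out.put(mapping.getValue(), value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> user = new LinkedHashMap<>();
        user.put("email", "buzz@example.com"); // placeholder address
        user.put("lastName", "Lightyear");
        System.out.println(rowKey(user) + " " + cells(user));
    }
}
```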

Command-line demo?


1. Describe your data

   dataset obj-schema org.movielens.Rating --jar app.jar \
     --output rating.avsc

2. Describe your layout

   dataset partition-config ts:year ts:month ts:day \
     --schema rating.avsc --output ymd.json

3. Create a dataset

   dataset create ratings --schema rating.avsc \
     --partition-by ymd.json

Command-line tool


• Executable jar download
  • Inspects the environment

• Must be used on-cluster
  • Classpath for HBase, Hive, etc.

• Debugging: debug=true ./dataset -v <command>

• Requires MAPRED_HOME variable on CDH5


Questions


Ryan Blue: [email protected]

Mailing list: [email protected]

Maven parent POM


• Automatic Kite and Hadoop dependencies
• Inherit from kite-app-parent-cdh4
• CDH4 only, CDH5 support in 0.16.0

<parent>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-app-parent-cdh4</artifactId>
  <version>0.15.0</version>
</parent>

Maven Plugin


• Maven plugin manages datasets for an application
• Configured by app-parent POM
• Handles create, update, etc. in Maven goals

MapReduce


• DatasetKeyInputFormat
• DatasetKeyOutputFormat
• Values are always null

View eventsBeforeToday = Datasets
    .load("dataset:hive:/data/events")
    .toBefore("timestamp", startOfToday());

DatasetKeyInputFormat.configure(mrJob).readFrom(eventsBeforeToday);
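The snippet above calls startOfToday(), which the slides do not define. One possible helper is sketched below, assuming the timestamp field holds epoch milliseconds and that "today" means the current UTC day. Both are assumptions, and the class name StartOfTodaySketch is hypothetical.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Illustration only: one way to compute the upper bound used by toBefore. */
final class StartOfTodaySketch {

    /** Midnight (UTC) of the day containing the given instant, in epoch millis. */
    static long startOfDay(Instant now) {
        return now.truncatedTo(ChronoUnit.DAYS).toEpochMilli();
    }

    /** Midnight (UTC) of the current day, in epoch millis. */
    static long startOfToday() {
        return startOfDay(Instant.now());
    }

    public static void main(String[] args) {
        System.out.println(startOfToday());
    }
}
```

Taking the instant as a parameter in startOfDay keeps the boundary calculation testable with a fixed time.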

Crunch


• CrunchDatasets.asSource
• CrunchDatasets.asTarget

PCollection<Event> events = getPipeline().read(
    CrunchDatasets.asSource(eventsBeforeToday));

• Handle-existing support in 0.16.0
• Configure dependencies with Kite parent POM

DatasetSink


• Write to HDFS Avro and HBase
  • http://tiny.cloudera.com/DatasetSink

• Proxy user support
• Automatic partitioning

agent.sinks.name.type = org.apache.flume.sink.kite.DatasetSink
agent.sinks.name.kite.repo.uri = repo:hdfs:/datasets
agent.sinks.name.kite.dataset.name = events
agent.sinks.name.auth.proxyUser = cloudera
