Kite SDK: It’s for developers
Ryan Blue, Software Engineer
Resources
©2014 Cloudera, Inc. All rights reserved.
• Kite guide
  • http://tiny.cloudera.com/KiteGuide
• Dataset overview and intro
  • http://tiny.cloudera.com/Datasets
• Command-line tutorial
  • http://tiny.cloudera.com/KiteCLI
• Kite repository and examples
  • https://github.com/kite-sdk/kite
  • https://github.com/kite-sdk/kite-examples
Agenda
• Kite background
• Kite data
What problem does Kite solve?
• Accessibility for getting started
  • Easy to get started, without being an expert
  • Use before understanding
• Save time for experienced developers
  • Off-the-shelf tools for common tasks
  • Quickly iterate and test configurations
Kite Datasets: Motivation
• Focus on using data, not managing files
• Developers shouldn’t have to maintain data files
• Use through configuration, not code
• Need consistency across the platform
[Diagram: an application (user code) sits on top of a database (provided). The data files underneath are maintained by the database.]
[Diagram: two applications. One uses a database over its data files; the other works with data files and HBase directly, all from user code.]
[Diagram: three applications. One uses a database over its data files, one works with data files and HBase directly, and one goes through Kite Data, where the data files and HBase storage are maintained by Kite.]
Kite Datasets: Goals
• Think in terms of data: datasets, views, records
• Describe the data and its layout, and Kite does the right thing
• Should work consistently across the platform
• Reliable
Kite Datasets: Compatibility
Project      HDFS (Avro)   HDFS (Parquet)   HBase
Kite         1.0           1.0              1.0
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *

* depends on common HBase encoding format
Current compatibility (0.15.0)
Project      HDFS (Avro)   HDFS (Parquet)   HBase
Kite         1.0           1.0              1.0
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *

* depends on common HBase encoding format
[Diagram: an application goes through Kite Data; the underlying data files and HBase storage are maintained by Kite.]
Datasets
• A collection of records or entities
  • Like a Hive or relational table
  • Generic, reflected, or generated objects
• Identified by URI
  • dataset:hdfs:/data/ratings
  • dataset:hive:/data/ratings
  • dataset:hbase:zk1/ratings
ratings = Datasets.load("dataset:hive:/data/ratings")
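The URI forms above share one shape: the `dataset:` prefix, then a storage scheme, then a location. The sketch below pulls those two parts apart using only the JDK. The class `DatasetUriSketch` and its methods are hypothetical names made up for this illustration; this is not how Kite itself resolves URIs.

```java
import java.net.URI;

// Hypothetical helper for illustration; this is not Kite's own URI handling.
final class DatasetUriSketch {

    /** Returns the storage scheme (for example "hive") named by a dataset URI. */
    static String storageScheme(String datasetUri) {
        return inner(datasetUri).getScheme();
    }

    /** Returns everything after the storage scheme, such as "/data/ratings". */
    static String location(String datasetUri) {
        return inner(datasetUri).getSchemeSpecificPart();
    }

    // A dataset URI wraps a second URI: "dataset:" followed by "<storage>:<location>".
    private static URI inner(String datasetUri) {
        URI outer = URI.create(datasetUri);
        if (!"dataset".equals(outer.getScheme())) {
            throw new IllegalArgumentException("Not a dataset URI: " + datasetUri);
        }
        return URI.create(outer.getSchemeSpecificPart());
    }
}
```

For example, `DatasetUriSketch.storageScheme("dataset:hbase:zk1/ratings")` returns `"hbase"` and `DatasetUriSketch.location(...)` on the same input returns `"zk1/ratings"`.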
Dataset configuration, JSON
• Schema (Avro)
  • Record fields, like a table definition
• Partition strategy
  • Layout or key definition from record fields
Configuring partitioning
• Partition strategy

[
  { "source" : "timestamp", "type" : "year" },
  { "source" : "timestamp", "type" : "month" },
  { "source" : "timestamp", "type" : "day" }
]

datasets/
└── ratings/
    ├── year=1997/
    │   ├── month=09/
    │   │   ├── day=20/
    │   │   ├── ...
    │   │   └── day=30/
    │   ├── month=10/
    │   │   ├── day=01/
    │   │   ├── ...
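To make the link between the strategy and the directory layout concrete, here is a minimal sketch that derives a `year=/month=/day=` path from a record timestamp. It assumes the timestamp is epoch milliseconds and that partitions are computed in UTC; both are assumptions for this example, and `PartitionPathSketch` is a hypothetical class, not Kite's partitioner.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Illustrative only: mimics the year/month/day directory layout shown above.
// It is not Kite's partitioner implementation.
final class PartitionPathSketch {

    /** Builds "year=YYYY/month=MM/day=DD" from epoch milliseconds, in UTC (assumed). */
    static String partitionPath(long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
            t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }
}
```

A record stamped on 20 September 1997 (UTC) would land under `year=1997/month=09/day=20`, matching the first leaf directory in the tree.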
Configuring key building
• Partition strategy for HBase

[
  { "source" : "email", "type" : "hash", "buckets" : 32 },
  { "source" : "email", "type" : "identity" }
]

(22, "[email protected]")
\x80\x00\x00\[email protected]\x00\x00
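The hash entry spreads keys across 32 buckets, and the identity entry keeps the email itself, so the key is a (bucket, email) pair. A rough sketch of the bucketing step follows. It uses `hashCode` with `Math.floorMod`, which is only one plausible choice; the hash function Kite actually applies may differ, so the bucket numbers from this sketch need not match the 22 shown above. `HashBucketSketch` is a hypothetical name.

```java
// Illustrative only: one common way to turn a field value into a bucket number.
// The hash function Kite actually uses may differ, so bucket values need not match.
final class HashBucketSketch {

    /** Maps a value to a bucket in [0, buckets) using its hashCode. */
    static int bucket(Object value, int buckets) {
        // Math.floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(value.hashCode(), buckets);
    }
}
```

Prefixing the row key with a bucket number is what keeps sequential or skewed values from piling onto a single HBase region.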
Dataset configuration, JSON
• Schema (Avro)
  • Record fields, like a table definition
• Partition strategy
  • Layout or key definition from record fields
• Column mapping (HBase)
  • Where to store record fields
Mapping example

{
  "type" : "record",
  "name" : "User",
  "fields" : [
    { "name" : "email", "type" : "string" },
    ...
  ]
}

family            name                    counts    prefs
row key           last         first      visits    flash
[email protected]  Lightyear    Buzz       315       true

[
  { "source" : "email", "type" : "key" },
  ...
]
Mapping example

{
  "type" : "record",
  "name" : "User",
  "fields" : [
    { "name" : "lastName", "type" : "string" },
    ...
  ]
}

family            name                    counts    prefs
row key           last         first      visits    flash
[email protected]  Lightyear    Buzz       315       true

[
  { "source" : "lastName", "type" : "column",
    "family" : "name", "qualifier" : "last" },
  ...
]
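One way to picture the mapping: each record field is routed to a `family:qualifier` column. The sketch below does that with plain Java maps. Only the `lastName` entry comes from the slide; the field names `firstName`, `visits`, and `flash` are guesses based on the example table, and `ColumnMappingSketch` is a hypothetical class, not a Kite API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a plain-Java picture of what a column mapping expresses.
// Field names other than "lastName" are assumed from the example table.
final class ColumnMappingSketch {

    /** Field name -> "family:qualifier" for the example User record. */
    static Map<String, String> mapping() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("lastName", "name:last");
        m.put("firstName", "name:first");   // assumed field name
        m.put("visits", "counts:visits");   // assumed field name
        m.put("flash", "prefs:flash");      // assumed field name
        return m;
    }

    /** Lays one record's mapped field values out as cells keyed by column. */
    static Map<String, Object> toCells(Map<String, Object> record) {
        Map<String, Object> cells = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : mapping().entrySet()) {
            if (record.containsKey(e.getKey())) {
                cells.put(e.getValue(), record.get(e.getKey()));
            }
        }
        return cells;
    }
}
```

The `email` field is absent from this map on purpose: in the example it is mapped as the key, so it ends up in the row key rather than in a column.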
Command-line demo?
1. Describe your data
   dataset obj-schema org.movielens.Rating --jar app.jar \
     --output rating.avsc
2. Describe your layout
   dataset partition-config ts:year ts:month ts:day \
     --schema rating.avsc --output ymd.json
3. Create a dataset
   dataset create ratings --schema rating.avsc \
     --partition-by ymd.json
Command-line tool
• Executable jar download
  • Inspects the environment
• Must be used on-cluster
  • Classpath for HBase, Hive, etc.
• Debugging: debug=true ./dataset -v <command>
• Requires MAPRED_HOME variable on CDH5
Questions
Ryan Blue: [email protected]
Mailing list: [email protected]
Maven parent POM
• Automatic Kite and Hadoop dependencies
• Inherit from kite-app-parent-cdh4
• CDH4 only, CDH5 support in 0.16.0

<parent>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-app-parent-cdh4</artifactId>
  <version>0.15.0</version>
</parent>
Maven Plugin
• Maven plugin manages datasets for an application
• Configured by app-parent POM
• Handles create, update, etc. in Maven goals
MapReduce
• DatasetKeyInputFormat
• DatasetKeyOutputFormat
• Values are always null

View eventsBeforeToday = Datasets
    .load("dataset:hive:/data/events")
    .toBefore("timestamp", startOfToday());

DatasetKeyInputFormat.configure(mrJob).readFrom(eventsBeforeToday);
Crunch
• CrunchDatasets.asSource
• CrunchDatasets.asTarget

PCollection<Event> events = getPipeline().read(
    CrunchDatasets.asSource(eventsBeforeToday));

• Handle-existing support in 0.16.0
• Configure dependencies with Kite parent POM
DatasetSink
• Write to HDFS Avro and HBase
• http://tiny.cloudera.com/DatasetSink
• Proxy user support
• Automatic partitioning

agent.sinks.name.type = org.apache.flume.sink.kite.DatasetSink
agent.sinks.name.kite.repo.uri = repo:hdfs:/datasets
agent.sinks.name.kite.dataset.name = events
agent.sinks.name.auth.proxyUser = cloudera