Hw09 Clouderas Distribution For Hadoop


Cloudera’s Distribution for Hadoop

Oct 2, 2009

Todd Lipcon

(todd@cloudera.com)

What is CDH?

What’s a Distribution?

- How many of you get your Apache httpd from apache.org?
- Pretty much everyone uses Linux distributions to get software
- CDH is a Hadoop distribution in the same way that Ubuntu is a Linux distribution


What is CDH?

- Apache Hadoop and its ecosystem, packaged up and easier to install
- RPM, Debian, and tarball installs
- Better Linux citizenship
- Maintained and tested patch series on top of upstream
- Ecosystem compatibility guarantees

What’s in CDH?

CDH - Included Packages

- Apache Hadoop (MR, HDFS, and Common)
- Apache Pig
- Apache Hive
- Cloudera Desktop
- HBase and ZooKeeper (contributed by HBase team)
- ... more to come

Installation Options

- APT and Yum repositories
  - apt-get install hadoop
  - yum install hadoop
  - hadoop-conf-pseudo package to get started
- tarball
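As a rough illustration of the two repository routes above, the following sketch picks the install command for whichever package manager is present. The function name `install_command` and its parameters are hypothetical, it only builds the command rather than running it, and it assumes the Cloudera repository has already been configured on the machine.

```python
import shutil
from typing import List, Optional

SUPPORTED_MANAGERS = ("apt-get", "yum")


def install_command(package: str = "hadoop",
                    manager: Optional[str] = None) -> List[str]:
    """Build (but do not run) the install command for a CDH package.

    `install_command` is a hypothetical helper for illustration; the
    package names passed in are the ones mentioned on the slide.
    """
    if manager is None:
        # Use the first supported package manager found on PATH.
        manager = next(
            (m for m in SUPPORTED_MANAGERS if shutil.which(m)), None
        )
    if manager not in SUPPORTED_MANAGERS:
        raise RuntimeError(
            "no supported package manager found; use the tarball install"
        )
    return [manager, "install", package]
```

For example, `install_command("hadoop-conf-pseudo", manager="yum")` builds the command for the getting-started configuration package on a Yum-based system.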

CDH on Amazon EC2

- hadoop-ec2 launch-cluster todd-cluster 20
- Support for HDFS on EBS volumes (better performance than S3)
- Cloudera Desktop automatically installed and launched
- Great if your data is already on EBS or S3
- Soon to come: VMware (vCloud) and Rackspace

Linux citizenship

- Hadoop should act like other software you’re used to
  - Configuration using alternatives in /etc
  - Logs in /var/log
  - Start/stop with init.d services

Patches in CDH

- Get bug fixes early
- Backport “Safe” new features
  - Sqoop, MRUnit
  - Fair Scheduler on 0.18
  - /metrics servlet
  - S3 fixes
  - etc...
- Backport “Really Safe” performance patches

What exactly am I getting?

- Hadoop in CDH is still licensed under Apache 2.0
- Read the changelog: ...hadoop-0.20/cloudera/CHANGES.cloudera.txt
- Read the patches: ...hadoop-0.20/cloudera/patches/
- Build it yourself: ...hadoop-0.20/cloudera/do-release-build

Is this a fork?


No way!

- All functionality patches are submitted upstream (some build-system patches only apply to our build)
- We employ 2 committers full-time, plus several contributors
- We regularly meet and work with other community members from Yahoo!, Facebook, etc.

My one commercial plug...gotta pay the bills

- We provide paid support for CDH
- Someone to call if your cluster is down
- Access to knowledgeable Hadoop engineers
- Configuration and tuning help
- Process design reviews
- Prioritize patches you need (and hot fixes for critical issues)
- </salesman>

Versions of CDH

- Debian versioning scheme
- stable
  - no new features, lots of “soak time”
  - comparable to RHEL 5, Ubuntu LTS, or Debian stable
  - recommended for critical production deployments
- testing
  - considered usable - testing, not untested!
  - has whiz-bang features and newer versions
  - recommended for shops who like the bleeding edge, or for those in PoC/dev stage

- CDH1 (stable)
  - Released March ’09
  - Hadoop 0.18.3, Hive 0.3, Pig 0.2
  - Will become oldstable this winter
- CDH2 (testing)
  - Released June ’09
  - Hadoop 0.18.3, Hadoop 0.20.1, Pig 0.5, Hive 0.4, HBase 0.20
  - Can install 0.18 and 0.20 at the same time
  - Will become stable this winter

CDH2 Package Versioning

hadoop-0.18-0.18.3+65-1.cloudera.noarch.rpm

A hadoop package based on Apache Hadoop 0.18.3 with 65 patches

hadoop-0.20-0.20.0+4.4-1.cloudera.noarch.rpm

A hadoop package based on Apache Hadoop 0.20.0 with 4 patches in testing, 4 security/critical fixes
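To make the naming scheme concrete, here is a minimal sketch that splits the two file names above into their parts. The names `parse_cdh_rpm` and `CdhPackage` are hypothetical, and the pattern is inferred only from these two examples. Treating the trailing `-1` as a package release number is an assumption based on common RPM naming. The slide does not say which of the two `4`s in `4.4` counts testing patches and which counts security/critical fixes, so the sketch leaves the counts unlabeled.

```python
import re
from typing import NamedTuple, Tuple

# Pattern inferred from the two file names on the slide (an assumption,
# not an official grammar): <name>-<upstream>+<patches>-<release>.cloudera.<arch>.rpm
PACKAGE_RE = re.compile(
    r"^(?P<name>.+?)-(?P<upstream>\d+(?:\.\d+)*)"
    r"\+(?P<patches>\d+(?:\.\d+)*)-(?P<release>\d+)"
    r"\.cloudera\.(?P<arch>\w+)\.rpm$"
)


class CdhPackage(NamedTuple):
    name: str                      # e.g. "hadoop-0.18"
    upstream: str                  # Apache Hadoop version, e.g. "0.18.3"
    patch_counts: Tuple[int, ...]  # e.g. (65,) or (4, 4), left unlabeled
    release: int                   # assumed to be the package release number
    arch: str                      # e.g. "noarch"


def parse_cdh_rpm(filename: str) -> CdhPackage:
    """Split a CDH2-style RPM file name into its components."""
    match = PACKAGE_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized package name: {filename!r}")
    counts = tuple(int(part) for part in match.group("patches").split("."))
    return CdhPackage(
        name=match.group("name"),
        upstream=match.group("upstream"),
        patch_counts=counts,
        release=int(match.group("release")),
        arch=match.group("arch"),
    )
```

Calling `parse_cdh_rpm("hadoop-0.18-0.18.3+65-1.cloudera.noarch.rpm")` yields an upstream version of `0.18.3` and a single patch count of 65, matching the description on the slide.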

Where do I get CDH?

http://archive.cloudera.com/

Questions?
