Moscow, November 16th, 2011
The Hadoop Ecosystem
Kai Voigt, Cloudera Inc.
©2011 Cloudera, Inc. All Rights Reserved.
Cloudera
                         Hadoop                                          Linux
Licence                  Apache                                          GPL and others
Distribution Vendor      Cloudera                                        Red Hat
Free Distribution        Cloudera's Distribution Including Hadoop (CDH)  Fedora Core
Commercial Distribution  Cloudera Enterprise                             Red Hat Enterprise Linux (RHEL)
Hadoop Core
HDFS
MapReduce
HDFS
• Hadoop Distributed File System
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Write Once, Read Many Times
• Java API
• Command Line Tool
MapReduce
• Two Phases of Functional Programming
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Java API
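The two phases can be tried out without a cluster. The following is a minimal single-process sketch in plain Python, not the Hadoop Java API. The names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and word counting is an assumed example job, not one taken from the slides.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit one (word, 1) pair for every word in a line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, which the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: fold all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because each call to `map_phase` and `reduce_phase` depends only on its own arguments, a real framework can run many of them in parallel and rerun any that fail.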
Hadoop Core
[Diagram: HDFS and MapReduce, each with a Java interface]
HDFS-FUSE
/mnt/hdfs/
HDFS-FUSE
HDFS
HDFS-FUSE Examples
$ mount
...
fuse on /mnt/hdfs type fuse (rw,nosuid,nodev,user_id=0,group_id=0,default_permissions,allow_other)
$ cp /boot/vmlinuz-* /mnt/hdfs/user/cloudera/
$ hadoop fs -ls vmlinuz-*
-rw-r--r-- 3 cloudera supergroup 2107004 2011-11-08 16:14 /user/cloudera/vmlinuz-2.6.18-274.7.1.el5
Sqoop
RDBMS
Sqoop
HDFS
Sqoop
• Import & Export
• ODBC, JDBC Data Sources
• CSV Files in HDFS
Sqoop Examples
$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City ...
$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
...
Hive
MapReduce
Hive
SQL
Hive
• Data Warehouse System for Hadoop
• Data Aggregation
• Ad-Hoc Queries
• SQL-like Language (HiveQL)
• Developed at Facebook
Hive Examples
CREATE TABLE newmovie (id INT, name STRING, year INT, numratings INT, avgrating FLOAT);
INSERT OVERWRITE TABLE newmovie
SELECT id, name, year, COUNT(1), AVG(rating)
FROM movie JOIN movierating
ON movie.id = movierating.movieid
GROUP BY id, name, year;
Pig
MapReduce
Pig
Script
Pig
• Data Warehouse System for Hadoop
• Data Aggregation
• Ad-Hoc Queries
• High-Level Scripting Language (Pig Latin)
• Developed at Yahoo
Pig Examples
movierating = LOAD 'movierating' AS (userid, movieid, rating:INT);
groupmr = GROUP movierating BY movieid;
ratings = FOREACH groupmr GENERATE group AS movieid, COUNT(movierating.rating) AS numratings, AVG(movierating.rating) AS avgrating;
movie = LOAD 'movie' AS (id, name, year);
mr = JOIN movie BY id, ratings BY movieid;
result = FOREACH mr GENERATE id, name, year, numratings, avgrating;
STORE result INTO 'ratedmovie';
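The Hive and Pig examples both produce one row per rated movie with its rating count and average rating. A rough plain-Python equivalent, useful for checking the logic on a small sample, might look like the sketch below. It assumes `movie` rows are `(id, name, year)` tuples and `movierating` rows are `(userid, movieid, rating)` tuples, as in the Pig `LOAD` statements. `rate_movies` is a hypothetical helper name, and everything runs in memory rather than on MapReduce.

```python
from collections import defaultdict


def rate_movies(movies, ratings):
    """Join movies with their ratings and aggregate count and average.

    movies:  iterable of (id, name, year) tuples
    ratings: iterable of (userid, movieid, rating) tuples
    Returns a list of (id, name, year, numratings, avgrating) tuples.
    """
    # GROUP BY movieid: collect every rating under its movie id.
    by_movie = defaultdict(list)
    for _userid, movieid, rating in ratings:
        by_movie[movieid].append(rating)

    result = []
    for movie_id, name, year in movies:
        scores = by_movie.get(movie_id)
        if scores:  # inner join: movies without any rating are dropped
            result.append((movie_id, name, year, len(scores), sum(scores) / len(scores)))
    return result
```

The grouping step corresponds to `GROUP ... BY movieid` in Pig, and the `if scores` check plays the role of the inner join.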
The Story So Far
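[Diagram: the components covered so far and the interface each one offers]
RDBMS (SQL) <-> Sqoop <-> HDFS
Hive (SQL) and Pig (Script) -> MapReduce (Java)
FS (Posix) -> FUSE -> HDFS (Java)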
HBase
• Low Latency
• Random Reads And Writes
• Distributed Key/Value Store
• Simple API
  – PUT
  – GET
  – DELETE
  – SCAN
HBase Data Model
Key: RowID, Column name, Timestamp
RowID             Column name  Timestamp  Value
com.apple.www     Size         yesterday  1234
com.apple.www     Content      yesterday  <html>...
com.cloudera.www  Size         yesterday  2345
com.cloudera.www  Content      yesterday  <html>...
com.cloudera.www  Size         today      3456
com.cloudera.www  Content      today      <html>...
com.facebook.www  Size         yesterday  4567
com.facebook.www  Content      yesterday  <html>...
com.yahoo.www     Size         today      5678
com.yahoo.www     Content      today      <html>...
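The table above can be read as a map from a composite key to a value. The toy model below sketches that idea in plain Python. It is an illustration only, not the HBase client API: `TinyTable` and its methods are hypothetical names, and the integers 1 and 2 stand in for the "yesterday" and "today" timestamps.

```python
class TinyTable:
    """Toy in-memory model of a (row, column, timestamp) -> value store."""

    def __init__(self):
        self._cells = {}  # (row, column, timestamp) -> value

    def put(self, row, column, timestamp, value):
        """Store one version of a cell."""
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column):
        """Return the value with the highest timestamp, or None if absent."""
        versions = [(ts, value) for (r, c, ts), value in self._cells.items()
                    if r == row and c == column]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]

    def delete(self, row, column):
        """Remove every version of a cell."""
        for key in [k for k in self._cells if k[0] == row and k[1] == column]:
            del self._cells[key]

    def scan(self):
        """Yield (key, value) pairs in sorted key order."""
        for key in sorted(self._cells):
            yield key, self._cells[key]


table = TinyTable()
table.put("com.cloudera.www", "Size", 1, 2345)  # 1 stands in for "yesterday"
table.put("com.cloudera.www", "Size", 2, 3456)  # 2 stands in for "today"
table.put("com.apple.www", "Size", 1, 1234)
```

Writing row IDs as reversed domain names, as the slide does, keeps pages from the same domain next to each other when the keys are scanned in sorted order.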
HBase Flow
GET/PUT/DELETE
MEMORY
HDFS Logfile
HBase Examples
hbase> create 'mytable', 'mycf'
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase> scan 'mytable'
hbase> disable 'mytable'
hbase> drop 'mytable'
Flume
• Many Servers with many Log Files
  – Webserver
  – Mailserver
  – Syslog
• Store all Logs in One Place
  – Manageable
  – Extensible
  – Reliable
Flume Architecture
Log
Flume Node
Log
Flume Node
...
HDFS
Flume Sources and Sinks
• Local Files
• HDFS
• Stdin, Stdout
• IRC
• IMAP
Whirr
• Automatic Cluster Setup in the Cloud
  – Amazon
  – Rackspace
Whirr Example
$ cat hadoop.properties
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,7 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
$ bin/whirr launch-cluster --config hadoop.properties
$ . ~/.whirr/myhadoopcluster/hadoop-proxy.sh
$ export HADOOP_CONF_DIR=~/.whirr/myhadoopcluster
$ bin/whirr destroy-cluster --config hadoop.properties
Oozie Concept
• crond for Hadoop
• Job Flow Control
  – Branching
  – Serial
  – Loops
• Triggered
  – Time
  – Data
[Diagram: a job flow connecting Job 1, Job 2, Job 3, Job 4 and Job 5]
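The serial and branching parts of a job flow amount to running jobs in dependency order. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9 or later). It is not Oozie's workflow format or API, and the dependencies between the five jobs are hypothetical, since the slide text does not say which job waits for which.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each job maps to the set of jobs it waits for.
workflow = {
    "job1": set(),
    "job2": {"job1"},          # branch: job2 and job3 both follow job1
    "job3": {"job1"},
    "job4": {"job2", "job3"},  # join: job4 waits for both branches
    "job5": {"job4"},          # serial: job5 follows job4
}


def run_order(dependencies):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(workflow)
```

A plain dependency graph like this one cannot express loops or time and data triggers, so it covers only part of what the bullets list.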
Oozie Features
• Component Independent
  – MapReduce
  – Hive
  – Pig
  – Sqoop
  – Streaming
Mahout
• Machine Learning Library for Hadoop
  – Regression
  – Classification
  – Recommendations
  – Pattern Mining
Mahout Use Cases
• Yahoo: Spam Detection
• Foursquare: Recommendations
• SpeedDate.com: Recommendations
• Adobe: User Targeting
• Amazon: Personalization Platform
CDH4u2
• Cloudera's Distribution Including Hadoop
• http://www.cloudera.com/download/
• Linux Packages
  – Red Hat
  – Debian
  – Tar Archive
• Virtual Machines
• Cloud Installation with Whirr
CDH Components
Hadoop Hive
Pig HBase
Zookeeper Flume
Sqoop Whirr
Hue Oozie
FUSE-DFS Mahout
Thank you!
• Kai Voigt
• kai@cloudera.com
• http://www.cloudera.com/