Introducing Hadoop
Mastering Hadoop Map-reduce for Data Analysis

Shashank Tiwari
blog: shanky.org | twitter: @tshanky | st@treasuryofideas.com

Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners.

SDEC2011 Introducing Hadoop

What is Hadoop

HDFS Architecture

Namenode/Datanode, JobTracker/TaskTracker

MapReduce

ZooKeeper (ZK) Namespace

Essential HBase Schema

Multi-dimensional View

A Map/Hash View

{
  "row_key_1" : {
    "name" : {
      "first_name" : "Jolly",
      "last_name" : "Goodfellow"
    },
    "location" : { "zip" : "94301" }
  }
}

Architectural View (HBase)

The Persistence Mechanism

The underlying file format

Installing & Setting up Hadoop

• Required software: Java 1.6.x, ssh + sshd

• Download

• Install

• Configure: single-node, pseudo-distributed, or cluster

Download

• Source: http://hadoop.apache.org/

• Version: 0.20.203.x (current stable), 0.20.x (previous stable)

• Includes: Hadoop Common (common utilities), HDFS, MapReduce

Install

• Extract: tar zxvf hadoop-0.20.203.0rc1.tar.gz

• Move & Create Symbolic Link: ln -s hadoop-0.20.203.0 hadoop

• On Windows

• http://developer.yahoo.com/hadoop/tutorial/module3.html
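The extract-and-link steps above can be wrapped in a small helper. This is a minimal sketch, not part of the original slides: the function name install_hadoop and the destination directory are hypothetical, and it assumes the tarball unpacks into a directory named hadoop-0.20.203.0, as the ln -s command on the slide suggests.

```shell
# Hypothetical helper: unpack a Hadoop tarball into a destination directory
# and point a stable "hadoop" symlink at the versioned directory.
# Usage: install_hadoop <tarball> <dest-dir> <versioned-dir-name>
install_hadoop() {
  tarball=$1
  dest=$2
  version_dir=$3

  mkdir -p "$dest" || return 1
  # Extract inside the destination rather than the current directory.
  tar -xzf "$tarball" -C "$dest" || return 1
  # -n replaces an existing "hadoop" link instead of descending into it.
  (cd "$dest" && ln -sfn "$version_dir" hadoop)
}

# Example call (paths are assumptions; adjust to your download location):
# install_hadoop hadoop-0.20.203.0rc1.tar.gz "$HOME/opt" hadoop-0.20.203.0
```

Keeping a version-independent hadoop link means later configuration paths do not change when a newer release is unpacked next to the old one.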

Configure -- single-node

• Edit conf/hadoop-env.sh and set JAVA_HOME

• Default configuration is single-node

• Run bin/hadoop with no arguments to list the command options

• Reference: http://hadoop.apache.org/common/docs/r0.20.203.0/single_node_setup.html
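The JAVA_HOME step might look like the following. It is a sketch under stated assumptions: the JDK path is a placeholder, and the HADOOP_HOME default of ./hadoop is only a guess at where the install lives.

```shell
# Assumption: Hadoop lives in ./hadoop unless HADOOP_HOME is already set.
HADOOP_HOME="${HADOOP_HOME:-$PWD/hadoop}"
# Assumption: placeholder JDK path; use the location of your own Java 1.6 install.
JDK_PATH="${JAVA_HOME:-/usr/lib/jvm/java-6-sun}"

# No-op on a real install; only lets this sketch run in an empty scratch directory.
mkdir -p "$HADOOP_HOME/conf"

# Append the setting so the Hadoop scripts can locate Java.
echo "export JAVA_HOME=$JDK_PATH" >> "$HADOOP_HOME/conf/hadoop-env.sh"

# Show what was written.
grep '^export JAVA_HOME=' "$HADOOP_HOME/conf/hadoop-env.sh"
```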

Configure -- pseudo-distributed

• Edit: conf/core-site.xml (configure the HDFS NameNode address)

• Edit: conf/hdfs-site.xml (configure HDFS replication factor)

• Edit: conf/mapred-site.xml (configure MapReduce JobTracker daemon)

• Enable ssh to localhost (without passphrase)

• Reference: http://hadoop.apache.org/common/docs/r0.20.203.0/single_node_setup.html
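The three edits above can be sketched as a script. The property names and values follow the single-node setup guide referenced on this slide; the helper names write_site and setup_local_ssh are hypothetical, the HADOOP_HOME default is an assumption, and the script overwrites any existing site files.

```shell
# Assumption: Hadoop lives in ./hadoop unless HADOOP_HOME is already set.
HADOOP_HOME="${HADOOP_HOME:-$PWD/hadoop}"
CONF="$HADOOP_HOME/conf"
mkdir -p "$CONF"   # no-op on a real install

# write_site <file> <property-name> <value>: hypothetical helper that
# (over)writes a one-property Hadoop site file.
write_site() {
  cat > "$CONF/$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# NameNode address, replication factor of 1, and JobTracker address.
write_site core-site.xml   fs.default.name    hdfs://localhost:9000
write_site hdfs-site.xml   dfs.replication    1
write_site mapred-site.xml mapred.job.tracker localhost:9001

# Passphrase-less ssh to localhost. Defined but not called automatically,
# because it touches ~/.ssh; run it yourself after reviewing it.
setup_local_ssh() {
  ssh-keygen -t rsa -P '' -f "$HOME/.ssh/id_rsa" &&
  cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys" &&
  chmod 600 "$HOME/.ssh/authorized_keys"
}
```

After calling setup_local_ssh, running ssh localhost should log in without prompting for a passphrase.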

Start Hadoop

• Format HDFS: bin/hadoop namenode -format

• Start all daemons: bin/start-all.sh

• Verify logs

• Browse the web interface:

• Namenode: http://localhost:50070/

• JobTracker: http://localhost:50030/
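One way to check the daemons without opening a browser is sketched below. The helper name check_ui is hypothetical, and the sketch assumes curl is installed; the ports are the ones listed on the slide.

```shell
# check_ui <label> <url>: hypothetical helper that reports whether a web
# interface answers. Prints "<label>: up" or "<label>: down".
check_ui() {
  if curl -s -f -o /dev/null --max-time 5 "$2"; then
    echo "$1: up"
  else
    echo "$1: down"
  fi
}

# jps ships with the JDK and lists running Java processes; after start-all.sh
# the list should include NameNode, DataNode, SecondaryNameNode, JobTracker
# and TaskTracker.
if command -v jps >/dev/null 2>&1; then
  jps
fi

check_ui NameNode   http://localhost:50070/
check_ui JobTracker http://localhost:50030/
```

If a daemon is reported as down, the files under the logs directory are the next place to look.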

Take Hadoop for a test-drive

• Run examples (hadoop-examples-0.20.203.0.jar)

• Grep using regular expressions

• Copy files to HDFS: bin/hadoop fs -put bin input

• Run the grep example to find text beginning with ‘start’ in the input files

• Verify output on HDFS: bin/hadoop fs -cat output/*

• Copy output to local filesystem & verify: bin/hadoop fs -get output output && cat output/*
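The test-drive steps above, including the grep job itself, can be sketched as one sequence. The regular expression is illustrative rather than taken from the slides, the run wrapper is a hypothetical convenience, and the commands assume the current directory is the Hadoop install directory.

```shell
# Assumptions: run from the Hadoop install directory; the regular expression
# is illustrative and not taken from the slides.
EXAMPLES_JAR="hadoop-examples-0.20.203.0.jar"
PATTERN='start[a-z.-]*'

# run: print the command, then execute it only if bin/hadoop is present,
# so the sequence can be previewed outside a Hadoop install.
run() {
  echo "+ $*"
  if [ -x bin/hadoop ]; then
    "$@"
  fi
}

# Copy the local bin directory into HDFS as "input".
run bin/hadoop fs -put bin input
# grep example arguments: <input dir> <output dir> <regex>
run bin/hadoop jar "$EXAMPLES_JAR" grep input output "$PATTERN"
# Quote the glob so HDFS, not the local shell, expands it.
run bin/hadoop fs -cat 'output/*'
```

The example job writes each matched string together with the number of times it occurred, so the output is a short list of counts rather than the matching lines themselves.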

Configure -- cluster

• References:

• http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html (official documentation)

• http://developer.yahoo.com/hadoop/tutorial/module7.html (Managing a Hadoop Cluster. Source: YDN)

• http://wiki.datameer.com/display/DAS1/Hadoop+Cluster+Configuration+Tips

Questions?

• blog: shanky.org | twitter: @tshanky

• email: st@treasuryofideas.com