
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop


Page 1: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

1

From Zero to Hadoop

Speaker Name | Title

DO NOT USE PUBLICLY PRIOR TO 10/23/12

Page 2: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

2

Agenda

• Hadoop Ecosystem Overview
• Hadoop Core Technical Overview
  • HDFS
  • MapReduce
• Hadoop in the Enterprise
  • Cluster Planning
  • Cluster Management with Cloudera Manager

Page 3: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

3

What Are All These Things?

Hadoop Ecosystem Overview

Page 4: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

4

Hadoop Ecosystem

[Diagram: the CDH stack. Pipeline stages across the top: INGEST, STORE, EXPLORE, PROCESS, ANALYZE, SERVE. Ingest connectors: Sqoop and Flume; integration via FUSE-DFS (file), WebHDFS / HttpFS (REST), and ODBC / JDBC (SQL) to BI, ETL, and RDBMS tools. Storage: HDFS (Hadoop DFS) and HBase. Resource management & coordination: YARN and ZooKeeper. Batch processing: MapReduce, MapReduce2, Hive, Pig, Mahout, DataFu. Real-time access & compute: Impala and HBase. User interface: Hue; workflow management: Oozie; cloud: Whirr; metadata: Hive Metastore. Management software & technical support subscription options: Cloudera Navigator (Audit v1.0, Access v1.0, Lineage, Lifecycle, Explore) and Cloudera Manager (Core, required; RTD; RTQ; BDR).]

Page 5: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

5

Sqoop

Performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver

Page 6: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

6

Flume NG

[Diagram: multiple clients feeding into a tier of Flume agents.]

A streaming data collection and aggregation system for massive volumes of data from sources such as RPC services, Log4j, and syslog.

Page 7: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

HBase

7

• A low-latency, distributed, non-SQL database built on HDFS

• A “columnar database”: data is grouped by column family rather than stored row by row
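The column-family layout above can be sketched as a plain Python data structure (a toy illustration of the storage model only; this is not the HBase API, and the row keys and column names are made up):

```python
from collections import defaultdict

# Toy model of HBase's layout: row key -> {"family:qualifier": value}.
# Sparse rows cost nothing: absent columns simply are not stored.
table = defaultdict(dict)

def put(row, family, qualifier, value):
    table[row][f"{family}:{qualifier}"] = value

def get(row, column):
    # Missing cells return None instead of raising, as a lookup miss.
    return table[row].get(column)

put("user#1001", "info", "name", "Ada")
put("user#1001", "info", "city", "London")
put("user#1002", "info", "name", "Grace")

print(get("user#1001", "info:name"))  # Ada
print(get("user#1001", "info:zip"))   # None (sparse row, no such cell)
```

Real HBase additionally versions each cell with a timestamp and sorts rows by key, which is what makes low-latency range scans possible.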

Page 8: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

8

Hive

• Relational database abstraction using a SQL-like dialect called HiveQL
• Statements are executed as one or more MapReduce jobs

SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;

Page 9: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

9

Pig

• High-level scripting language for executing one or more MapReduce jobs
• Created to simplify authoring of MapReduce jobs
• Can be extended with user-defined functions

emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';

Page 10: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

10

Oozie

A workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster

Page 11: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

11

ZooKeeper

• ZooKeeper is a distributed consensus engine
• Provides well-defined concurrent access semantics:
  • Leader election
  • Service discovery
  • Distributed locking / mutual exclusion
  • Message board / mailboxes

Page 12: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

12

Mahout

A machine learning library with algorithms for:
• Recommendation based on users' behavior
• Clustering of related documents into groups
• Classification based on existing categorized documents
• Frequent item-set mining (e.g., shopping cart contents)

Page 13: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

13

Hadoop Security

• Authentication is secured by MIT Kerberos v5 and integrated with LDAP

• Provides identity, authentication, and authorization

• Useful for multi-tenant or secure environments

Page 14: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

14

Only the Good Parts

Hadoop Core Technical Overview

Page 15: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

15

Components of HDFS

• NameNode – holds all metadata for HDFS
  • Needs to be a highly reliable machine:
    • RAID drives – typically RAID 10
    • Dual power supplies
    • Dual network cards, bonded
    • The more memory the better – typically 36 GB to 64 GB
• Secondary NameNode – provides checkpointing for the NameNode; the same hardware as the NameNode should be used

Page 16: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

16

Components of HDFS – Contd.

• DataNodes – hardware will depend on the specific needs of the cluster
  • No RAID needed; JBOD (just a bunch of disks) is used
  • Typical ratio is:
    • 1 hard drive
    • 2 cores
    • 4 GB of RAM
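The 1 drive : 2 cores : 4 GB rule of thumb above can be turned into a quick sanity check. A minimal sketch (the function name and the example node specs are illustrative, not from the slide):

```python
# Check a candidate DataNode spec against the slide's
# 1 drive : 2 cores : 4 GB RAM rule of thumb.
def balanced_spec(drives, cores, ram_gb):
    """Return True if the node matches the 1:2:4 ratio exactly."""
    return cores == 2 * drives and ram_gb == 4 * drives

# A 12-drive node would want 24 cores and 48 GB of RAM:
print(balanced_spec(drives=12, cores=24, ram_gb=48))  # True
print(balanced_spec(drives=12, cores=16, ram_gb=48))  # False (CPU-light)
```

In practice the ratio is a starting point, not a law; IO-heavy workloads lean toward more spindles and CPU-heavy ones toward more cores.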

Page 17: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

17

HDFS Architecture Overview

[Diagram: NameNode on Host 1, Secondary NameNode on Host 2, and DataNodes on Hosts 3, 4, 5, …, n.]

Page 18: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

18

HDFS Block Replication

Block Size = 64 MB
Replication Factor = 3

[Diagram: a file split into blocks 1–5; each block is stored on three of the five DataNodes, e.g. Node 1 holds blocks 2, 3, 4; Node 2 holds 2, 4, 5; Node 3 holds 1, 3, 5; Node 4 holds 1, 2, 5; Node 5 holds 1, 3, 4.]

Page 19: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

19

MapReduce – Map

• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).

• map() produces one or more intermediate values along with an output key from the input.

[Diagram: a Map Task emits (key, values) pairs; the Shuffle Phase groups intermediate values by key; the Reduce Task produces the final (key, values).]

Page 20: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

20

MapReduce – Reduce

• After the map phase is over, all the intermediate values for a given output key are combined together into a list.

• reduce() combines those intermediate values into one or more final values for that same output key.

[Diagram: the same map → shuffle → reduce flow as on the previous slide.]

Page 21: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

21

MapReduce – Shuffle and Sort
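The map, shuffle, and reduce phases described on the previous slides can be sketched in plain Python (an illustration of the data flow only, no Hadoop involved; the word-count job is the classic example):

```python
from collections import defaultdict

def map_fn(_key, line):
    # Map phase: emit one (word, 1) pair per word in the input line.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle phase: group all intermediate values by key,
    # as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce phase: collapse the value list into one final value per key.
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [p for i, line in enumerate(lines) for p in map_fn(i, line)]
result = dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
print(result["the"])  # 3
print(result["fox"])  # 2
```

In real Hadoop the shuffle also sorts keys and moves data across the network between map and reduce tasks, which is typically the most expensive part of a job.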

Page 22: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

22

How It Works In The Real World

Hadoop In the Enterprise

Page 23: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

24

Networking

• One of the most important things to consider when setting up a Hadoop cluster
• Typically a top-of-rack switch is used with Hadoop, together with a core switch
• Be careful not to oversubscribe the backplane of the switch!
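The oversubscription warning above is simple arithmetic: compare the worst-case traffic the rack's nodes can generate against the uplink capacity out of the rack. A back-of-envelope sketch (all node counts and link speeds here are hypothetical, not from the slide):

```python
# Oversubscription ratio for a top-of-rack switch:
# worst-case rack-generated traffic divided by uplink capacity.
# A ratio above 1.0 means the uplinks can become a bottleneck
# during shuffle-heavy MapReduce jobs.
def oversubscription(nodes, node_gbps, uplinks, uplink_gbps):
    return (nodes * node_gbps) / (uplinks * uplink_gbps)

# 20 nodes at 1 Gb/s behind two 10 Gb/s uplinks: exactly line rate.
print(oversubscription(nodes=20, node_gbps=1, uplinks=2, uplink_gbps=10))  # 1.0

# Doubling the node count without adding uplinks: 2:1 oversubscribed.
print(oversubscription(nodes=40, node_gbps=1, uplinks=2, uplink_gbps=10))  # 2.0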

Page 24: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

25

Hadoop Typical Data Pipeline

[Diagram: data sources feed original source data into Hadoop via Sqoop and Flume; within Hadoop, Pig, Hive, and MapReduce jobs run over HDFS, orchestrated by Oozie; result or calculated data is exported via Sqoop to the data warehouse and data marts.]

Page 25: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

26

Hadoop Use Cases

Industry        Advanced Analytics use case      Data Processing use case
Web             Social Network Analysis          Clickstream Sessionization
Media           Content Optimization             Clickstream Sessionization
Telco           Network Analytics                Mediation
Retail          Loyalty & Promotions Analysis    Data Factory
Financial       Fraud Analysis                   Trade Reconciliation
Federal         Entity Analysis                  SIGINT
Bioinformatics  Genome Mapping                   Sequencing Analysis

Page 26: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

27

Hadoop in the Enterprise

[Diagram: logs, files, web data, and relational databases flow into Hadoop; operators use management tools, engineers use IDEs, analysts use BI / analytics tools, and business users use enterprise reporting; Hadoop connects to the enterprise data warehouse and serves customers through a web application.]

Page 27: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

28

Cloudera Manager
End-to-End Administration for CDH

1. Manage – easily deploy, configure & optimize clusters
2. Monitor – maintain a central view of all activity
3. Diagnose – easily identify and resolve issues
4. Integrate – use Cloudera Manager with existing tools

Page 28: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

29

Install A Cluster In 3 Simple Steps

1. Find Nodes – enter the names of the hosts which will be included in the Hadoop cluster; click Continue.

2. Install Components – Cloudera Manager automatically installs the CDH components on the hosts you specified.

3. Assign Roles – verify the roles of the nodes within your cluster; make changes as necessary.

Cloudera Manager Key Features

Page 29: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

30

View Service Health & Performance

Page 30: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

31

Monitor & Diagnose Cluster Workloads

Page 31: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

32

Visualize Health Status With Heatmaps

Page 32: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

33

Rolling Upgrades

Page 33: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

34

?

Page 34: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

35