
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop


Page 1: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

1

From Zero to Hadoop

Speaker Name | Title

DO NOT USE PUBLICLY PRIOR TO 10/23/12

Page 2: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

2

Agenda

• Hadoop Ecosystem Overview
• Hadoop Core Technical Overview
  • HDFS
  • MapReduce
• Hadoop in the Enterprise
  • Cluster Planning
  • Cluster Management with Cloudera Manager

Page 3: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

3

What Are All These Things?

Hadoop Ecosystem Overview

Page 4: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

4

Hadoop Ecosystem

[Diagram: the CDH stack. Pipeline stages across the top: INGEST, STORE, EXPLORE, PROCESS, ANALYZE, SERVE. Ingest connectors: Sqoop and Flume; integration via FUSE-DFS (file), WebHDFS / HttpFS (REST), and ODBC / JDBC (SQL) to BI, ETL, and RDBMS tools. Storage: HDFS (Hadoop DFS) and HBase. Resource management & coordination: YARN and ZooKeeper. Batch processing: MapReduce, MapReduce2, Hive, Pig, Mahout, DataFu. Real-time access & compute: Impala and HBase. User interface: Hue; workflow management: Oozie; cloud: Whirr; metadata: Hive Metastore. Management software & technical support subscription options: Cloudera Navigator (Audit v1.0, Access v1.0, Lineage, Lifecycle, Explore) and Cloudera Manager (Core, required; RTD; RTQ; BDR).]

Page 5: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

5

Sqoop

Performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver

Page 6: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

6

Flume NG

[Diagram: multiple clients feeding into a tier of Flume agents.]

A streaming data collection and aggregation system for massive volumes of data from sources such as RPC services, Log4j, and syslog.

Page 7: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

HBase

7

• A low-latency, distributed, non-SQL database built on HDFS

• A “columnar database”: data is grouped by column family rather than stored row by row
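The column-family layout above can be sketched as a plain Python data structure (a toy illustration of the storage model only; this is not the HBase API, and the row keys and column names are made up):

```python
from collections import defaultdict

# Toy model of HBase's layout: row key -> {"family:qualifier": value}.
# Sparse rows cost nothing: absent columns simply are not stored.
table = defaultdict(dict)

def put(row, family, qualifier, value):
    table[row][f"{family}:{qualifier}"] = value

def get(row, column):
    # Missing cells return None instead of raising, as a lookup miss.
    return table[row].get(column)

put("user#1001", "info", "name", "Ada")
put("user#1001", "info", "city", "London")
put("user#1002", "info", "name", "Grace")

print(get("user#1001", "info:name"))  # Ada
print(get("user#1001", "info:zip"))   # None (sparse row, no such cell)
```

Real HBase additionally versions each cell with a timestamp and sorts rows by key, which is what makes low-latency range scans possible.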

Page 8: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

8

Hive

• Relational database abstraction using a SQL-like dialect called HiveQL
• Statements are executed as one or more MapReduce jobs

SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;

Page 9: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

9

Pig

• High-level scripting language for executing one or more MapReduce jobs
• Created to simplify authoring of MapReduce jobs
• Can be extended with user-defined functions

emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';

Page 10: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

10

Oozie

A workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster

Page 11: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

11

ZooKeeper

• ZooKeeper is a distributed consensus engine
• Provides well-defined concurrent access semantics:
  • Leader election
  • Service discovery
  • Distributed locking / mutual exclusion
  • Message board / mailboxes

Page 12: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

12

Mahout

A machine learning library with algorithms for:
• Recommendation based on users' behavior
• Clustering of related documents into groups
• Classification based on existing categorized documents
• Frequent item-set mining (e.g., shopping cart contents)

Page 13: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

13

Hadoop Security

• Authentication is secured by MIT Kerberos v5 and integrated with LDAP

• Provides identity, authentication, and authorization

• Useful for multi-tenant or secure environments

Page 14: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

14

Only the Good Parts

Hadoop Core Technical Overview

Page 15: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

15

Components of HDFS

• NameNode – holds all metadata for HDFS
  • Needs to be a highly reliable machine:
    • RAID drives – typically RAID 10
    • Dual power supplies
    • Dual network cards, bonded
    • The more memory the better – typically 36 GB to 64 GB
• Secondary NameNode – provides checkpointing for the NameNode; the same hardware as the NameNode should be used

Page 16: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

16

Components of HDFS – Contd.

• DataNodes – hardware will depend on the specific needs of the cluster
  • No RAID needed; JBOD (just a bunch of disks) is used
  • Typical ratio is:
    • 1 hard drive
    • 2 cores
    • 4 GB of RAM
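The 1 drive : 2 cores : 4 GB rule of thumb above can be turned into a quick sanity check. A minimal sketch (the function name and the example node specs are illustrative, not from the slide):

```python
# Check a candidate DataNode spec against the slide's
# 1 drive : 2 cores : 4 GB RAM rule of thumb.
def balanced_spec(drives, cores, ram_gb):
    """Return True if the node matches the 1:2:4 ratio exactly."""
    return cores == 2 * drives and ram_gb == 4 * drives

# A 12-drive node would want 24 cores and 48 GB of RAM:
print(balanced_spec(drives=12, cores=24, ram_gb=48))  # True
print(balanced_spec(drives=12, cores=16, ram_gb=48))  # False (CPU-light)
```

In practice the ratio is a starting point, not a law; IO-heavy workloads lean toward more spindles and CPU-heavy ones toward more cores.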

Page 17: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

17

HDFS Architecture Overview

[Diagram: NameNode on Host 1, Secondary NameNode on Host 2, and DataNodes on Hosts 3, 4, 5, …, n.]

Page 18: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

18

HDFS Block Replication

Block Size = 64 MB
Replication Factor = 3

[Diagram: a file split into blocks 1–5; each block is stored on three of the five DataNodes, e.g. Node 1 holds blocks 2, 3, 4; Node 2 holds 2, 4, 5; Node 3 holds 1, 3, 5; Node 4 holds 1, 2, 5; Node 5 holds 1, 3, 4.]

Page 19: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

19

MapReduce – Map

• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).

• map() produces one or more intermediate values along with an output key from the input.

[Diagram: a Map Task emits (key, values) pairs; the Shuffle Phase groups intermediate values by key; the Reduce Task produces the final (key, values).]

Page 20: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

20

MapReduce – Reduce

• After the map phase is over, all the intermediate values for a given output key are combined together into a list.

• reduce() combines those intermediate values into one or more final values for that same output key.

[Diagram: the same map → shuffle → reduce flow as on the previous slide.]

Page 21: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

21

MapReduce – Shuffle and Sort
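The map, shuffle, and reduce phases described on the previous slides can be sketched in plain Python (an illustration of the data flow only, no Hadoop involved; the word-count job is the classic example):

```python
from collections import defaultdict

def map_fn(_key, line):
    # Map phase: emit one (word, 1) pair per word in the input line.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle phase: group all intermediate values by key,
    # as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce phase: collapse the value list into one final value per key.
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [p for i, line in enumerate(lines) for p in map_fn(i, line)]
result = dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
print(result["the"])  # 3
print(result["fox"])  # 2
```

In real Hadoop the shuffle also sorts keys and moves data across the network between map and reduce tasks, which is typically the most expensive part of a job.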

Page 22: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

22

How It Works In The Real World

Hadoop In the Enterprise

Page 23: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

24

Networking

• One of the most important things to consider when setting up a Hadoop cluster
• Typically a top-of-rack switch is used with Hadoop, together with a core switch
• Be careful not to oversubscribe the backplane of the switch!
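The oversubscription warning above is simple arithmetic: compare the worst-case traffic the rack's nodes can generate against the uplink capacity out of the rack. A back-of-envelope sketch (all node counts and link speeds here are hypothetical, not from the slide):

```python
# Oversubscription ratio for a top-of-rack switch:
# worst-case rack-generated traffic divided by uplink capacity.
# A ratio above 1.0 means the uplinks can become a bottleneck
# during shuffle-heavy MapReduce jobs.
def oversubscription(nodes, node_gbps, uplinks, uplink_gbps):
    return (nodes * node_gbps) / (uplinks * uplink_gbps)

# 20 nodes at 1 Gb/s behind two 10 Gb/s uplinks: exactly line rate.
print(oversubscription(nodes=20, node_gbps=1, uplinks=2, uplink_gbps=10))  # 1.0

# Doubling the node count without adding uplinks: 2:1 oversubscribed.
print(oversubscription(nodes=40, node_gbps=1, uplinks=2, uplink_gbps=10))  # 2.0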

Page 24: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

25

Hadoop Typical Data Pipeline

[Diagram: data sources feed original source data into Hadoop via Sqoop and Flume; within Hadoop, Pig, Hive, and MapReduce jobs run over HDFS, orchestrated by Oozie; result or calculated data is exported via Sqoop to the data warehouse and data marts.]

Page 25: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

26

Hadoop Use Cases

Industry        Advanced Analytics use case      Data Processing use case
Web             Social Network Analysis          Clickstream Sessionization
Media           Content Optimization             Clickstream Sessionization
Telco           Network Analytics                Mediation
Retail          Loyalty & Promotions Analysis    Data Factory
Financial       Fraud Analysis                   Trade Reconciliation
Federal         Entity Analysis                  SIGINT
Bioinformatics  Genome Mapping                   Sequencing Analysis

Page 26: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

27

Hadoop in the Enterprise

[Diagram: logs, files, web data, and relational databases flow into Hadoop; operators use management tools, engineers use IDEs, analysts use BI / analytics tools, and business users use enterprise reporting; Hadoop connects to the enterprise data warehouse and serves customers through a web application.]

Page 27: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

28

Cloudera Manager
End-to-End Administration for CDH

1. Manage – easily deploy, configure & optimize clusters
2. Monitor – maintain a central view of all activity
3. Diagnose – easily identify and resolve issues
4. Integrate – use Cloudera Manager with existing tools

Page 28: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

29

Install A Cluster In 3 Simple Steps

1. Find Nodes – enter the names of the hosts which will be included in the Hadoop cluster; click Continue.

2. Install Components – Cloudera Manager automatically installs the CDH components on the hosts you specified.

3. Assign Roles – verify the roles of the nodes within your cluster; make changes as necessary.

Cloudera Manager Key Features

Page 29: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

30

View Service Health & Performance

Page 30: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

31

Monitor & Diagnose Cluster Workloads

Page 31: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

32

Visualize Health Status With Heatmaps

Page 32: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

33

Rolling Upgrades

Page 33: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

34

?

Page 34: Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

35