Big Data Hadoop: A Tour of the Zoo

Hadoop bootcamp getting started

Page 1: Hadoop bootcamp getting started

Big Data

Hadoop: A Tour of the Zoo

Page 2: Hadoop bootcamp getting started

Content

▪ Setup

▪ Introduction

▪ Who Uses it

▪ How does it work

▪ What was new in Hadoop 2

▪ Eco-System

▪ Hadoop Distributions

▪ Practical: meeting the shell

▪ Demo: Pig and Hive

Page 3: Hadoop bootcamp getting started

Setup

1. Go to http://hortonworks.com/products/hortonworks-sandbox/#install and download version 2.2.4 of the sandbox

2. Install VirtualBox

https://www.virtualbox.org/wiki/Downloads

3. Remarks:

Chrome might corrupt the download; try Safari, Firefox, or IE instead.

Page 4: Hadoop bootcamp getting started

Introduction: Goals of This Workshop

▪ Internal working

▪ HDFS

▪ Eco System

▪ Shell

▪ Upcoming workshops

Page 5: Hadoop bootcamp getting started

Who Uses It

▪ Yahoo

▪ Pinterest

▪ Port of Rotterdam

▪ Spotify

Page 6: Hadoop bootcamp getting started

Who Uses It: Yahoo

Page 7: Hadoop bootcamp getting started

Who Uses It: Yahoo

Page 8: Hadoop bootcamp getting started

Who Uses It: Yahoo

Page 9: Hadoop bootcamp getting started

Who Uses It: Yahoo

Page 10: Hadoop bootcamp getting started

Who Uses It: Pinterest

Page 11: Hadoop bootcamp getting started

Who Uses It: Pinterest

Page 12: Hadoop bootcamp getting started

Who Uses It: Pinterest

Page 13: Hadoop bootcamp getting started

Who Uses It: Port of Rotterdam

Page 14: Hadoop bootcamp getting started

Who Uses It: Port of Rotterdam

Page 15: Hadoop bootcamp getting started

Who Uses It: Port of Rotterdam

Page 16: Hadoop bootcamp getting started

Who Uses It: Spotify

Page 17: Hadoop bootcamp getting started

Who Uses It: Spotify

Page 18: Hadoop bootcamp getting started

Who Uses It: Spotify

▪ Reporting to record labels and rights holders

▪ Creating top lists of the most popular music right now

▪ Ad analysis

▪ Intelligent radio and discovery features

Page 19: Hadoop bootcamp getting started

How Does It Work

▪ HDFS: Intro

▪ Assumptions and Goals

▪ Key Features

▪ How does it work

▪ MapReduce

▪ Eco System

Page 20: Hadoop bootcamp getting started

How Does It Work: HDFS

Page 21: Hadoop bootcamp getting started

How Does It Work: Assumptions And Goals

Page 22: Hadoop bootcamp getting started

How Does It Work: Assumptions And Goals

Page 23: Hadoop bootcamp getting started

How Does It Work: Assumptions And Goals

Page 24: Hadoop bootcamp getting started

How Does It Work: Assumptions And Goals

Page 25: Hadoop bootcamp getting started

How Does It Work: Assumptions And Goals

Page 26: Hadoop bootcamp getting started

How Does It Work: Assumptions And Goals

Page 27: Hadoop bootcamp getting started

How Does It Work: Key Features

Rack Awareness

Minimal Data Motion
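
As a rough illustration of rack awareness, the sketch below mimics the default HDFS placement for three replicas: the first on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. The cluster layout and the `place_replicas` helper are hypothetical, and the real policy also weighs things such as node load and free space.

```python
def place_replicas(writer, topology):
    """Pick three nodes for one block, given {rack: [nodes]} and the writer node."""
    writer_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    # Second replica: first node found on any other rack.
    remote_rack = next(rack for rack in topology if rack != writer_rack)
    second = topology[remote_rack][0]
    # Third replica: a different node on the same remote rack.
    third = next(node for node in topology[remote_rack] if node != second)
    return [writer, second, third]

# Hypothetical two-rack cluster, two nodes per rack.
topology = {
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
}
replicas = place_replicas("node1", topology)
print(replicas)  # ['node1', 'node3', 'node4']
```

Spreading replicas over two racks survives the loss of a whole rack, while keeping two of them on the same rack limits cross-rack traffic during the write.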

Page 28: Hadoop bootcamp getting started

How Does It Work: Key Features

Utilities

Rollback

Page 29: Hadoop bootcamp getting started

How Does It Work: Key Features

Standby NameNode

Operability

Page 30: Hadoop bootcamp getting started

How Does It Work: HDFS Architecture

Page 31: Hadoop bootcamp getting started

How Does It Work: Client Reading Files

Page 32: Hadoop bootcamp getting started

How Does It Work: MapReduce - server roles

Page 33: Hadoop bootcamp getting started

How Does It Work: MapReduce

1. Input: AB CA / Ab Bc / AC cD

2. Splits:

split 1: ABCA

split 2: AbBc

split 3: ACcd

3. MAP (one mapper per split):

split 1: A, 1 | A, 1 | B, 1 | C, 1

split 2: A, 1 | B, 1 | B, 1 | C, 1

split 3: A, 1 | C, 1 | C, 1 | D, 1

4. Reducers: A, 4 | B, 3 | C, 4 | D, 1

5. Output:

A, 4

B, 3

C, 4

D, 1
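
The flow on this slide can be sketched in plain Python. This is a single-process illustration, not Hadoop code: the function names `map_split`, `shuffle`, and `reduce_counts` are made up for the example, and keys are upper-cased on the assumption that the slide counts letters case-insensitively.

```python
from collections import defaultdict

def map_split(split):
    """Map phase: emit a (key, 1) pair for every letter in one split."""
    return [(letter.upper(), 1) for letter in split]

def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    """Reduce phase: sum the values collected for each key."""
    return {key: sum(values) for key, values in sorted(grouped.items())}

splits = ["ABCA", "AbBc", "ACcd"]
mapped = [pair for split in splits for pair in map_split(split)]
result = reduce_counts(shuffle(mapped))
print(result)  # {'A': 4, 'B': 3, 'C': 4, 'D': 1}
```

On a real cluster each split is mapped on a different node and the shuffle moves data over the network; here all three phases run in one process for readability.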

Page 34: Hadoop bootcamp getting started

Hadoop 1 vs 2

Page 35: Hadoop bootcamp getting started

Hadoop 1 vs 2: Federation

Page 36: Hadoop bootcamp getting started

Hadoop 1 vs 2: High Availability

Page 37: Hadoop bootcamp getting started

Eco-System

HDFS

YARN

MAPREDUCE

TEZ

SPARK

HBASE

HIVE

HCATALOG

PIG

Mahout

SQOOP

Flume

Zookeeper

ORC

Crunch

Oozie

DRILL

STORM

DATA

Curator

KAFKA

Page 38: Hadoop bootcamp getting started

Eco-System: Pig

● Map Reduce

● Directed Acyclic Graph

● Analyze

● Pig Latin

Page 39: Hadoop bootcamp getting started

Eco-System: Hive

● HiveQL (SQL)

● Map Reduce

● Analyze

● Schema on Read
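
Schema on read means the raw data is stored untouched and a schema is applied only when a query runs. The plain-Python sketch below illustrates the idea; it is not Hive code, and the sample rows, column names, and `read_with_schema` helper are invented for the example.

```python
import csv
import io

# Raw data is stored as-is; nothing validates it at write time.
raw = "1,alice,34\n2,bob,27\n"

# The schema (column name and type) is supplied only when reading.
schema = [("id", int), ("name", str), ("age", int)]

def read_with_schema(text, schema):
    """Parse delimited text and cast each field according to the schema."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows

rows = read_with_schema(raw, schema)
print(rows[0])  # {'id': 1, 'name': 'alice', 'age': 34}
```

The same raw text could be read again later with a different schema, which is the flexibility this approach trades against up-front validation.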

Page 40: Hadoop bootcamp getting started

Eco-System: HCatalog

● Part of Hive

● REST services

● Table Management

● Relational View

Page 41: Hadoop bootcamp getting started

Eco-System: ORC

● Store hive data

● Metadata in file

● File Format

● Compression

Optimized Row Columnar

Page 42: Hadoop bootcamp getting started

Eco-System

HDFS

MAPREDUCE

HIVE

HCATALOG

PIG

ORC

Page 43: Hadoop bootcamp getting started

Eco-System: Mahout

● Data Mining

● MapReduce

● Machine Learning

● Distributed

Page 44: Hadoop bootcamp getting started

Eco-System: Zookeeper

● Ordered

● Centralized Service

● Reliability

● Fast

Page 45: Hadoop bootcamp getting started

Eco-System: Curator

● Recipes

● Simplifies Zookeeper

● Made by Netflix

Page 46: Hadoop bootcamp getting started

Eco-System: YARN

● Resource Manager

● MapReduce 2.0

● Multiple Data Processing Options

Yet Another Resource Negotiator

Page 47: Hadoop bootcamp getting started

Eco-System: YARN

Page 48: Hadoop bootcamp getting started

Eco-System

HDFS

YARN

MAPREDUCE

HIVE

HCATALOG

PIG

Mahout

Zookeeper

ORC

Curator

Page 49: Hadoop bootcamp getting started

Eco-System: Oozie

● Oozie Workflow

● Workflow Scheduler

● Oozie Coordinator

● Oozie Bundle

Page 50: Hadoop bootcamp getting started

Eco-System: Tez

● Dataflow Graph

● Improves Map Reduce

● Dynamically Reconfigure

Page 51: Hadoop bootcamp getting started

Eco-System: Sqoop

● Data Imports / Exports RDBMS

● Java POJO

Page 52: Hadoop bootcamp getting started

Eco-System: HBase

● Fast

● Fault Tolerant

● Usable

● Use Cases

Page 53: Hadoop bootcamp getting started

Eco-System

HDFS

YARN

MAPREDUCE

TEZ

HBASE

HIVE

HCATALOG

PIG

Mahout

SQOOP

Zookeeper

ORC

Oozie

Curator

Page 54: Hadoop bootcamp getting started

Eco-System: Crunch

● Developer Focused

● Pipeline

● Flexible Data Model

Page 55: Hadoop bootcamp getting started

Eco-System: Drill

● Schema-Free JSON Model

● Query Any Datastore

● SQL (SQL:2003 syntax)

Page 56: Hadoop bootcamp getting started

Eco-System: Storm

● No Data-Loss

● Stream Processing

● Scalable

Page 57: Hadoop bootcamp getting started

Eco-System: Flume

● Buffer Incoming Data

● Stream Data

● Guarantee Data Delivery

● Scalable

Page 58: Hadoop bootcamp getting started

Eco-System

HDFS

YARN

MAPREDUCE

TEZ

HBASE

HIVE

HCATALOG

PIG

Mahout

SQOOP

Flume

Zookeeper

ORC

Crunch

Oozie

DRILL

STORM

DATA

Curator

Page 59: Hadoop bootcamp getting started

Eco-System: Spark

● More than Map/Reduce

● Memory

● Java, Scala, Python and R

● SQL

Page 60: Hadoop bootcamp getting started

Eco-System: Spark

Page 61: Hadoop bootcamp getting started

Eco-System: Spark

Page 62: Hadoop bootcamp getting started

Eco-System: Kafka

● Scalable

● Fast

● Durable

● Distributed By Design

Page 63: Hadoop bootcamp getting started

Eco-System

HDFS

YARN

MAPREDUCE

TEZ

SPARK

HBASE

HIVE

HCATALOG

PIG

Mahout

SQOOP

Flume

Zookeeper

ORC

Crunch

Oozie

DRILL

STORM

DATA

Curator

KAFKA

Page 64: Hadoop bootcamp getting started

Hadoop Distributions

Page 65: Hadoop bootcamp getting started

Hands on - HDFS

Page 66: Hadoop bootcamp getting started

HDFS: Many ways of input

HDFS

Page 67: Hadoop bootcamp getting started

HDFS Client

Very POSIX (UNIX)-like

hadoop fs -put

-get

-mkdir

-ls

-cp

-mv

-rm

-chmod

...
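
A minimal sketch of driving these commands from Python. The `hadoop_fs` helper is hypothetical and only builds the argument list, and all paths are placeholders. Uncomment the `subprocess.run` line to execute the commands on a machine where the `hadoop` client is installed, such as the sandbox.

```python
import subprocess

def hadoop_fs(option, *args):
    """Build the argument list for one `hadoop fs` subcommand."""
    return ["hadoop", "fs", option, *args]

# Placeholder paths; adjust them to your own files and directories.
commands = [
    hadoop_fs("-mkdir", "/tmp/demo"),
    hadoop_fs("-put", "local.txt", "/tmp/demo/"),
    hadoop_fs("-ls", "/tmp/demo"),
    hadoop_fs("-get", "/tmp/demo/local.txt", "copy.txt"),
    hadoop_fs("-rm", "/tmp/demo/local.txt"),
]

for cmd in commands:
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # needs a working Hadoop client
```

Passing the command as a list rather than a single string avoids shell quoting problems when a path contains spaces.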

Page 68: Hadoop bootcamp getting started

Hands on - HDFS: Objectives

▪ Working with the HDFS Client

▪ Find where blocks are stored
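
One way to approach the second objective is `hdfs fsck` with the `-files -blocks -locations` options, which reports the blocks of a file and the DataNodes holding them. The sketch below only assembles the command; the `fsck_command` helper and the path are placeholders, not part of the workshop material.

```python
import shlex

def fsck_command(path):
    """Build an `hdfs fsck` call that lists a file's blocks and their DataNodes."""
    return ["hdfs", "fsck", path, "-files", "-blocks", "-locations"]

cmd = fsck_command("/tmp/demo/local.txt")
print(shlex.join(cmd))  # hdfs fsck /tmp/demo/local.txt -files -blocks -locations
```

Run the printed command in the sandbox shell and compare the number of reported locations per block with the replication factor of the file.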

Page 69: Hadoop bootcamp getting started

Hands on - Pig & Hive Preview

Page 70: Hadoop bootcamp getting started

Upcoming Workshops

▪ September: A visit to the animal farm: Babe eh … Pig

▪ October: Bee a master - Hive

▪ November: Streaming (Storm & Flume)

Page 71: Hadoop bootcamp getting started

Questions or Suggestions?