Unleash your cluster with YARN Ferran Galí i Reniu @ferrangali

About me

@ferrangali

Data

Sensors

Smartphones

User behavior

Social Networks

Text

Numbers

Images

Videos

Big Data

100 MB/s

2 TB ≈ 5.5 hours

The Big Data problem

100 MB/s

2 TB = 30 min

The Big Data problem
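The two slides above contrast how long a scan takes at a fixed per-disk throughput, presumably on one disk versus on several disks read in parallel. A rough sketch of that arithmetic, assuming decimal units (1 TB = 1,000,000 MB) and perfectly parallel reads; the function name and the disk counts are hypothetical:

```python
def read_time_hours(size_tb: float, mb_per_s: float, disks: int = 1) -> float:
    """Hours needed to scan size_tb terabytes when `disks` drives are read in parallel."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = size_mb / (mb_per_s * disks)  # aggregate throughput grows with the disk count
    return seconds / 3600


if __name__ == "__main__":
    for disks in (1, 2, 4, 8, 16):  # hypothetical disk counts
        print(f"{disks:>2} disk(s): {read_time_hours(2, 100, disks):.2f} h")
```

The scan time falls in proportion to the number of disks, which is the motivation for spreading data over many nodes.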

HDFS

Node Node Node Node Node Node Node Node

HDFS - Hadoop Distributed File System

Hardware

Storage

HDFS

$> hadoop fs -ls
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2015-06-11 11:27 dir
-rw-r--r--   1 hadoop supergroup    2198927 2015-06-10 17:22 file1.txt

$> hadoop fs -ls dir
Found 2 items
-rw-r--r--   1 hadoop supergroup    2198927 2015-06-10 17:22 dir/file2.txt
-rw-r--r--   1 hadoop supergroup    2198927 2015-06-10 17:22 dir/file3.txt

$> hadoop fs -cat dir/file3.txt
line1
line2
line3
line4
line5
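Each entry printed by hadoop fs -ls has the same columns: permissions, replication factor ("-" for directories), owner, group, size in bytes, modification date and time, and path. A small sketch of turning such a listing into records, for example to post-process it in a script; the Entry type and parse_ls function are hypothetical helpers, not part of Hadoop:

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    permissions: str
    replication: str  # "-" for directories
    owner: str
    group: str
    size: int  # bytes
    modified: str
    path: str

    @property
    def is_dir(self) -> bool:
        return self.permissions.startswith("d")


def parse_ls(output: str) -> List[Entry]:
    """Parse the text printed by `hadoop fs -ls` into Entry records."""
    entries = []
    for line in output.splitlines():
        fields = line.split(None, 7)  # at most 8 fields, so the path keeps any spaces
        if len(fields) != 8:
            continue  # skips blank lines and the "Found N items" header
        perms, repl, owner, group, size, date, time, path = fields
        entries.append(Entry(perms, repl, owner, group, int(size), f"{date} {time}", path))
    return entries


LISTING = """Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2015-06-11 11:27 dir
-rw-r--r--   1 hadoop supergroup    2198927 2015-06-10 17:22 file1.txt
"""

if __name__ == "__main__":
    for entry in parse_ls(LISTING):
        print("dir " if entry.is_dir else "file", entry.size, entry.path)
```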

MapReduce

Node Node Node Node Node Node Node Node

HDFS - Hadoop Distributed File System

Hardware

Storage

Processing

MapReduce Job

MapReduce Job

MapReduce Job

Data Pipeline

Application

MapReduce Job

Split Split Split Split

Map Map Map Map

Reduce Reduce Reduce

Write Write Write

map() {
  // Your code here
}

reduce() {
  // Your code here
}
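The job above can be simulated in a few lines: the user supplies only map and reduce, and the framework takes care of the splits, the grouping by key between the two phases, and the final write. A minimal in-memory sketch using word count as the example; map_fn, reduce_fn and run_job are hypothetical names, and a real Hadoop job would be written against the Hadoop API and run on the cluster:

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    """User code for the map phase: emit (word, 1) for every word."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """User code for the reduce phase: sum the counts of one word."""
    return word, sum(counts)


def run_job(splits: List[List[str]]) -> Dict[str, int]:
    # Map: each split is processed independently (in parallel on a real cluster).
    mapped = [pair for split in splits for line in split for pair in map_fn(line)]

    # Shuffle: group every emitted value by key so one reducer sees all of them.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)

    # Reduce + write: one output record per key.
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job([["yarn hdfs"], ["YARN yarn"]]))
```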

Data Pipeline

Node Node Node Node Node Node Node Node

HDFS - Hadoop Distributed File System

Hardware

Storage

Processing Job Job

Application

MapReduce 1.0 Architecture

Node JobTracker

Node TaskTracker

Node TaskTracker

Node TaskTracker

Node TaskTracker

Map Map Map Map

Map Map Map Reduce

Reduce Reduce Reduce Reduce

Application

Map

Reduce
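In MapReduce 1.0 every TaskTracker is configured with a fixed number of map slots and reduce slots, and a slot of one kind cannot run a task of the other kind. A minimal sketch of how that can leave capacity idle, assuming the 7 map and 5 reduce slots that the slot grid above appears to show; the function name and the pending task counts are hypothetical:

```python
from typing import Tuple


def schedule_slots(map_slots: int, reduce_slots: int,
                   pending_maps: int, pending_reduces: int) -> Tuple[int, int, int]:
    """Return (running maps, running reduces, idle slots) under fixed slot types."""
    running_maps = min(map_slots, pending_maps)
    running_reduces = min(reduce_slots, pending_reduces)
    # Slots are typed: a free reduce slot cannot pick up a waiting map task.
    idle = (map_slots - running_maps) + (reduce_slots - running_reduces)
    return running_maps, running_reduces, idle


if __name__ == "__main__":
    # Map phase of a job: many map tasks waiting, no reduce task ready yet.
    print(schedule_slots(map_slots=7, reduce_slots=5, pending_maps=20, pending_reduces=0))
```

During the map phase in this example the reduce slots stay idle although map tasks are still waiting.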

Limitations

Map Reduce Map Map Reduce Map Map Reduce Map

MapReduce Job

Iterative

Graph Algorithms
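Iterative and graph algorithms fit the model poorly because every iteration has to be expressed as a separate MapReduce job, whose output is written to storage and read back by the next one. A toy sketch of breadth-first search done that way, assuming an unweighted graph stored as an adjacency list in which every node appears as a key; the function names are hypothetical and everything runs in memory, so this only counts the jobs a cluster would have to launch:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

INF = float("inf")
Graph = Dict[str, List[str]]


def bfs_pass(graph: Graph, dist: Dict[str, float]) -> Dict[str, float]:
    """One MapReduce-style pass: relax every edge once."""
    # Map: each node re-emits its own distance and proposes dist + 1 to its neighbours.
    emitted: List[Tuple[str, float]] = []
    for node, neighbours in graph.items():
        emitted.append((node, dist[node]))
        for neighbour in neighbours:
            emitted.append((neighbour, dist[node] + 1))

    # Shuffle + reduce: keep the smallest proposed distance per node.
    grouped = defaultdict(list)
    for node, d in emitted:
        grouped[node].append(d)
    return {node: min(ds) for node, ds in grouped.items()}


def bfs(graph: Graph, source: str) -> Tuple[Dict[str, float], int]:
    """Run passes until nothing changes; return the distances and the number of jobs."""
    dist: Dict[str, float] = {node: (0 if node == source else INF) for node in graph}
    jobs = 0
    while True:
        new_dist = bfs_pass(graph, dist)  # on a cluster: a full job with its own read and write
        jobs += 1
        if new_dist == dist:
            return dist, jobs
        dist = new_dist


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
    print(bfs(chain, "a"))
```

The number of jobs grows with the depth of the graph, and each one pays the full start-up, shuffle and write cost.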

YARN

Node Node Node Node Node Node Node Node

HDFS - Hadoop Distributed File System

YARN

Hardware

Storage

Resource Manager

YARN - Yet Another Resource Negotiator

Processing

Cores

Memory

YARN Architecture

ResourceManager

NodeManager NodeManager

x8 x8

x8 x8

Application: x6, 1 core, 1024MB

ApplicationMaster

Container Container Container Container Container Container

Map Map Map Map Map Map

Reduce Reduce Reduce Reduce Reduce Reduce
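The sequence above can be read as: the application asks for 6 containers of 1 core and 1024 MB each, the ResourceManager finds room for them on the NodeManagers, and the containers then run the map and reduce tasks. A simplified sketch of that bookkeeping, assuming "x8 x8" means 8 cores and 8 GB per NodeManager; the Node class, the allocate function and the "most free memory first" placement rule are hypothetical and much simpler than what a real YARN scheduler does:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cores: int
    free_mb: int
    containers: List[str] = field(default_factory=list)


def allocate(nodes: List[Node], app: str, cores: int, mb: int) -> Optional[Node]:
    """Place one container on the node with the most free memory that can fit it."""
    candidates = [n for n in nodes if n.free_cores >= cores and n.free_mb >= mb]
    if not candidates:
        return None  # the request stays pending until resources are released
    node = max(candidates, key=lambda n: n.free_mb)
    node.free_cores -= cores
    node.free_mb -= mb
    node.containers.append(app)
    return node


if __name__ == "__main__":
    cluster = [Node("nm1", 8, 8192), Node("nm2", 8, 8192)]  # assumed capacities
    for _ in range(6):  # Application: x6, 1 core, 1024MB
        allocate(cluster, "application", cores=1, mb=1024)
    for node in cluster:
        print(node.name, len(node.containers), node.free_cores, node.free_mb)
```

Because containers are sized in cores and memory rather than typed as map or reduce, the same container can run either kind of task, or a task of a different framework.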

YARN Architecture

ResourceManager

NodeManager NodeManager

x8 x8

x8 x8

Application (etl): x6, 1 core, 1024MB

Application 2 (query): x4, 2 cores, 2048MB

ApplicationMaster ApplicationMaster

Container Container Container Container Container Container Container Container Container Container

scheduler

etl: weight 1

query: weight 2
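The weights above look like queue weights in a fair scheduler: when both queues have work waiting, each one is entitled to a share of the cluster proportional to its weight. A minimal sketch of that proportion, assuming this reading of the slide; fair_shares is a hypothetical helper, and a real scheduler also handles idle queues, minimum shares and preemption:

```python
from typing import Dict


def fair_shares(total: float, weights: Dict[str, float]) -> Dict[str, float]:
    """Split `total` units of a resource in proportion to each queue's weight."""
    weight_sum = sum(weights.values())
    return {queue: total * weight / weight_sum for queue, weight in weights.items()}


if __name__ == "__main__":
    # 16 cores is an assumed cluster size (two nodes with 8 cores each).
    print(fair_shares(16, {"etl": 1, "query": 2}))
```

With weights 1 and 2, the query queue is entitled to twice the resources of the etl queue whenever both are busy.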

New Paradigms

Node Node Node Node Node Node Node Node

HDFS - Hadoop Distributed File System

YARN

Hardware

Storage

Resource Manager

Processing

Application

Batch

In Memory / Streaming

Interactive SQL

...

Improved Data Pipelines

Map Reduce Map Map Reduce Map Map Reduce Map

Map Reduce Map Map Reduce Map Reduce

Application: MapReduce Job, MapReduce Job, MapReduce Job

Application: MapReduce Job, Spark Job, MapReduce Job

Application: MapReduce Job, Spark Job

Trovit

+70 MapReduce Jobs adding business value

Multi-tenant cluster executing +7000 jobs per day

Search engine

Business Intelligence

Mailing

Push Notifications

Online Media Buying

Trovit

Challenges

Maintain

Try new paradigms

Fine tune

Trovit

Data Analysis with SQL on Hadoop

Hive SQL M/R

Sqoop on MySQL

Challenges

Impala Interactive

Machine Learning

Trovit

Near Real Time on a Storm cluster

Separated Cluster

Challenges

Storm on YARN

Questions?

Thank You

Ferran Galí i Reniu

@ferrangali

Icons made by Freepik from Flaticon, licensed under CC BY 3.0