
Page 1: Hadoop scalability

Apache Hadoop

Foundations of Scalability

Konstantin V. Shvachko November, 2013

Page 2: Hadoop scalability

Author

WANdisco, Chief Architect - NonStop Hadoop - Free Training

Founder of AltoStor and AltoScale

Worked on Hadoop and HDFS at Yahoo! and eBay since 2005

Data structures and algorithms for large-scale distributed storage systems

Apache Hadoop committer and member of PMC

2

Page 3: Hadoop scalability

Computing

The history of computing started a long time ago

Fascination with numbers - A vast universe with simple, strict rules - Computing devices crunch numbers

The Internet - A universe of words with fuzzy rules - A different type of computing: understanding the meaning of things - Human thinking - Errors and deviations are a part of the study

The domain of computing is changing, and so does computing itself

Computer History Museum, San Jose

3

Page 4: Hadoop scalability

Words vs. Numbers

In 1997 IBM built the Deep Blue supercomputer

- Played chess against the world champion G. Kasparov

- The human race was defeated - Strict rules of chess - Fast, deep analysis of the current position

From Big Numbers to Big Data

In 2011 IBM built the Watson computer to play Jeopardy!

- Questions and hints in human terms - Natural language processing - Reborn as a diagnostics machine: oncology

4

Page 5: Hadoop scalability

Big Data

Computations that need the power of many computers - Large datasets: hundreds of TBs, PBs - Or use of thousands of CPUs in parallel - Or both

Cluster as a computer

What is a PB?
1 KB = 1000 Bytes
1 MB = 1000 KB
1 GB = 1000 MB
1 TB = 1000 GB
1 PB = 1000 TB
???? = 1000 PB

5

Page 6: Hadoop scalability

Examples – Science

Fundamental physics: Large Hadron Collider (LHC) - Smashing high-energy protons at nearly the speed of light - 1 PB of event data per second, most of it filtered out - 15 PB of data per year - 160 PB of disk + 90 PB of tape storage

Math: Big Numbers - The 2-quadrillionth (10¹⁵) digit of π is 0 - Pure CPU workload: 12 days of cluster time - 208 years of CPU-time on a cluster with 7600 CPU cores

Healthcare - Patient records, Sensors, Drug design - Genome

Page 7: Hadoop scalability

Examples – Web

Search engine - Webmap - Map of the Internet - 2008 @ Yahoo, 1500 nodes, 5 PB raw storage - Internet Search Index - Traditional Big Data applications

Behavioural Analysis - Recommendation engine: You may buy this too - Intelligence: fraud detection - Sentiment analysis: who will win elections - Matching interests: you should like him / her

7

Page 8: Hadoop scalability

The Sorting Problem

Classic in-memory sorting - Complexity: number of comparisons

External sorting - Cannot load all data in memory - 16 GB RAM vs. 200 GB file - Complexity: + disk IOs (bytes read or written)

Distributed sorting - Cannot load data on a single server - 12 drives * 2 TB = 24 TB of disk space vs. a 200 TB data set - Complexity: + network transfers

Turns into a Big Data problem as the data set grows

Algorithm     Worst        Average      Space
Bubble Sort   O(n²)        O(n²)        In-place
Quicksort     O(n²)        O(n log n)   In-place
Merge Sort    O(n log n)   O(n log n)   Double

8

Page 9: Hadoop scalability

Hadoop

Need a lot of computers

How do we make them work together?

Page 10: Hadoop scalability

Hadoop

Apache Hadoop is an ecosystem of tools for processing “Big Data” - Started in 2005 by D. Cutting and M. Cafarella - Scaled by the Yahoo! Hadoop team from a few nodes to thousands (a 4,000-node cluster)

Consists of two main components, providing a unified cluster view: 1. HDFS – a distributed file system

• A file system API connecting thousands of drives

2. MapReduce – a framework for distributed computations • Splitting jobs into parts executable on one node • Scheduling and monitoring of job execution

Today it is used everywhere, becoming a standard for distributed computing

Hadoop is an open source project

A reliable, scalable, high performance distributed computing system

10

Page 11: Hadoop scalability

Hadoop: Architecture Principles

Linear scalability: more nodes can do more work within the same time - Linear in data size - Linear in compute resources

Move computation to data - Minimize expensive data transfers - Data are large, programs are small

Reliability and Availability: commodity hardware - A drive fails every 3 years on average => probability of it failing today is about 1/1000 - How many drives fail per day on a 1,000-node cluster with 10 drives per node? About 10.

Sequential data processing: avoid random reads / writes

Simple computational model - hides complexity in efficient execution framework

11

Page 12: Hadoop scalability

The Hadoop Family

Ecosystem of tools for processing BigData

12

HDFS – Distributed file system
YARN, MapReduce – Computational framework
Zookeeper – Distributed coordination
HBase – Key-value store
Pig – Dataflow language, SQL
Hive – Data warehouse, SQL
Oozie – Complex job workflows
BigTop – Packaging and testing

Page 13: Hadoop scalability

MapReduce

Distributed Computation

Page 14: Hadoop scalability

MapReduce

MapReduce - 2004 Jeffrey Dean, Sanjay Ghemawat. Google. - “MapReduce: Simplified Data Processing on Large Clusters”

Parallel Computational Model - Examples of computational models

• Turing or Post machines; programming languages – C++, Java • Finite automata, lambda calculus

- Split large input data into small enough pieces, process in parallel

Distributed Execution Framework - Compilers, interpreters - Scheduling, Processing, Coordination - Failure recovery

14

Page 15: Hadoop scalability

Functional Programming

Map a higher-order function - applies a given function to each element of a list - returns the list of results

Map( f(x), X[1:n] ) -> [ f(X[1]), …, f(X[n]) ]

Example. Map( x2, [0,1,2,3,4,5] ) = [0,1,4,9,16,25]

15

Page 16: Hadoop scalability

Functional Programming: reduce

Map a higher-order function - applies a given function to each element of a list - returns the list of results

Map( f(x), X[1:n] ) -> [ f(X[1]), …, f(X[n]) ]

Example. Map( x2, [0,1,2,3,4,5] ) = [0,1,4,9,16,25]

Reduce / fold a higher-order function - Iterates given function over a list of elements - Applies function to previous result and current element - Return single result

Example. Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15

16

Page 17: Hadoop scalability

Functional Programming

Map a higher-order function - applies a given function to each element of a list - returns the list of results

Map( f(x), X[1:n] ) -> [ f(X[1]), …, f(X[n]) ]

Example. Map( x2, [0,1,2,3,4,5] ) = [0,1,4,9,16,25]

Reduce / fold a higher-order function - Iterates given function over a list of elements - Applies function to previous result and current element - Return single result

Example. Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15

Reduce( x * y, [0,1,2,3,4,5] ) = ?

17

Page 18: Hadoop scalability

Functional Programming

Map a higher-order function - applies a given function to each element of a list - returns the list of results

Map( f(x), X[1:n] ) -> [ f(X[1]), …, f(X[n]) ]

Example. Map( x2, [0,1,2,3,4,5] ) = [0,1,4,9,16,25]

Reduce / fold a higher-order function - Iterates given function over a list of elements - Applies function to previous result and current element - Return single result

Example. Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15

Reduce( x * y, [0,1,2,3,4,5] ) = 0

18

Page 19: Hadoop scalability

Example: Sum of Squares

Composition of - a map followed by - a reduce applied to the results of the map

Example. - Map( x², [1,2,3,4,5] ) = [1,4,9,16,25] - Reduce( x + y, [1,4,9,16,25] ) = ((((1 + 4) + 9) + 16) + 25) = 55

Map is easily parallelizable - Compute x² for 1,2,3 on one node and for 4,5 on another

Reduce is notoriously sequential - Needs all squares on one node to compute the total sum.

Square Pyramid Number: 1 + 4 + … + n² = n(n+1)(2n+1) / 6

19
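As a plain (non-Hadoop) illustration of the map-then-reduce composition above, here is a minimal Java sketch using streams; the class name SumOfSquares and the sample list are purely illustrative:

import java.util.Arrays;
import java.util.List;

public class SumOfSquares {
    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4, 5);
        int sum = xs.stream()
                    .map(x -> x * x)          // Map( x², [1..5] ) -> [1, 4, 9, 16, 25]
                    .reduce(0, Integer::sum); // Reduce( x + y, ... ) folds the squares
        System.out.println(sum);              // prints 55
    }
}

The map step could run on different elements in parallel (e.g., with xs.parallelStream()), while the reduce step needs all partial results brought together.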

Page 20: Hadoop scalability

Computational Model

Map-Reduce is a Parallel Computational Model

Map-Reduce algorithm = job

Operates with key-value pairs: (k, V) - Primitive types, Strings or more complex Structures

Map-Reduce job input and output are collections of pairs (k, V)

MR Job is defined by 2 functions

map: (k1, v1) → (k2, v2)

reduce: (k2, v2) → (k3, v3)

20

Page 21: Hadoop scalability

Job Workflow

[Diagram: the words “dogs”, “like”, “cats” are mapped to consonant/vowel counts (C, 3)(V, 1), (C, 2)(V, 2), (C, 3)(V, 1); the reduce step sums them into (C, 8) and (V, 4)]

21

Page 22: Hadoop scalability

The Algorithm

Map(null, word)
  nC = Consonants(word)
  nV = Vowels(word)
  Emit("Consonants", nC)
  Emit("Vowels", nV)

Reduce(key, n1, n2, …)
  nRes = n1 + n2 + …
  Emit(key, nRes)

22
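A minimal sketch of this algorithm with the Hadoop 2.x Java MapReduce API, assuming line-oriented text input; the class names and the word-splitting rule are illustrative, not from the slides:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LetterCount {

    public static class LetterMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String word : line.toString().toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                long vowels = word.chars().filter(c -> "aeiou".indexOf(c) >= 0).count();
                long consonants = word.length() - vowels;
                ctx.write(new Text("Consonants"), new LongWritable(consonants));
                ctx.write(new Text("Vowels"), new LongWritable(vowels));
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable n : counts) sum += n.get();  // nRes = n1 + n2 + ...
            ctx.write(key, new LongWritable(sum));
        }
    }
}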

Page 23: Hadoop scalability

Computation Framework

Two virtual clusters: HDFS and MapReduce - Physically tightly coupled - Designed to work together

The Hadoop Distributed File System - Reliable storage layer - View data as files and directories

MapReduce as Computation Framework - Job scheduling - Resource management - Lifecycle coordination - Task execution module

Job is executed on a cluster of computers

[Diagram: a NameNode and a JobTracker manage the cluster; each worker node runs a DataNode storing blocks and a TaskTracker executing tasks]

23

Page 24: Hadoop scalability

HDFS Architecture Principles

The name space is a hierarchy of files and directories

Files are divided into blocks (typically 128 MB)

Namespace (metadata) is decoupled from data - Fast namespace operations, not slowed down by data streaming

Single NameNode keeps the entire name space in RAM

DataNodes store data blocks on local drives

Blocks are replicated on 3 DataNodes for redundancy and availability

24
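A minimal sketch of how these principles surface in the HDFS Java client API: metadata operations go to the NameNode, while the file bytes stream to DataNodes. The path is hypothetical, and the block size and replication values shown are the Hadoop 2 defaults made explicit:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.blocksize", "134217728");   // 128 MB blocks (the Hadoop 2 default)
        conf.set("dfs.replication", "3");         // 3 replicas per block (default)
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            // create() registers the file with the NameNode; the bytes below
            // are streamed to DataNodes, block by block.
            out.writeBytes("hello HDFS\n");
        }
    }
}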

Page 25: Hadoop scalability

MapReduce Framework

Job Input is a file or a set of files in a distributed file system (HDFS) - Input is split into blocks of roughly the same size - Blocks are replicated to multiple nodes - Block holds a list of key-value pairs

Map task is scheduled to one of the nodes containing the block - Map task input is node-local - Map task result is node-local

Map task results are grouped: one group per reducer - Each group is sorted by key

Reduce task is scheduled to a node - Reduce task transfers the targeted groups from all mapper nodes - Computes and stores results in a separate HDFS file

Job Output is a set of files in HDFS, with #files = #reducers

25
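A minimal driver sketch that wires these pieces together, reusing the illustrative LetterCount classes from the earlier sketch; the input/output paths are hypothetical, and with one reduce task the job writes a single output file (part-r-00000):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LetterCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "letter-count");
        job.setJarByClass(LetterCountDriver.class);
        job.setMapperClass(LetterCount.LetterMapper.class);
        job.setReducerClass(LetterCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setNumReduceTasks(1);                                          // one reducer -> one output file
        FileInputFormat.addInputPath(job, new Path("/data/words"));        // job input in HDFS
        FileOutputFormat.setOutputPath(job, new Path("/data/words-out"));  // part-r-00000, ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}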

Page 26: Hadoop scalability

MapReduce Example: Mean

Mean

Input: large text file

Output: average length of words in the file µ

Example: µ(dogs, like, cats) = 4

µ = (1/n) ∑ xi

26

Page 27: Hadoop scalability

Mean Mapper

Map input is the set of words w in the partition - Key = null Value = w

Map computes - Number of words in the partition

- Total length of the words ∑length(w)

Map output - <“count”, #words> - <“length”, #totalLength>

Map(null, w)
  Emit("count", 1)
  Emit("length", length(w))

27
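A minimal Hadoop sketch of the Mean mapper, assuming line-oriented text input; the class name and the whitespace-based word splitting are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MeanMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String w : line.toString().split("\\s+")) {
            if (w.isEmpty()) continue;
            ctx.write(new Text("count"), ONE);                            // one word seen
            ctx.write(new Text("length"), new LongWritable(w.length()));  // its length
        }
    }
}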

Page 28: Hadoop scalability

Single Mean Reducer

Reduce input - <key, value>, where - key = “count”, “length” - value is an integer

Reduce computes - Total number of words: N = sum of all “count” values - Total length of words: L = sum of all “length” values

Reduce Output - <“count”, N> - <“length”, L>

The result - µ = L / N

Reduce(key, n1, n2, …)
  nRes = n1 + n2 + …
  Emit(key, nRes)

Analyze()
  read("part-r-00000")
  print("mean = " + L/N)

28
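The reducer is the same summing reducer sketched earlier; what follows is a minimal sketch of the Analyze step, reading the single reducer's output file from HDFS. The output path is hypothetical, and tab is the default key/value separator of TextOutputFormat:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AnalyzeMean {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path("/data/words-out/part-r-00000");    // single reducer output
        long n = 0, l = 0;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(out)))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] kv = line.split("\t");                 // "count<TAB>N" or "length<TAB>L"
                if (kv[0].equals("count"))  n = Long.parseLong(kv[1]);
                if (kv[0].equals("length")) l = Long.parseLong(kv[1]);
            }
        }
        System.out.println("mean = " + (double) l / n);         // mu = L / N
    }
}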

Page 29: Hadoop scalability

MapReduce Implementation

Single master JobTracker shepherds the distributed herd of TaskTrackers 1. Job scheduling and resource allocation 2. Job monitoring and job lifecycle coordination 3. Cluster health and resource tracking

Job is defined - Program: myJob.jar file - Configuration: job.xml - Input, output paths

JobClient submits the job to the JobTracker - Calculates and creates splits based on the input - Writes myJob.jar and job.xml to HDFS

29

Page 30: Hadoop scalability

MapReduce Implementation

JobTracker divides the job into tasks: one map task per split. - Assigns a TaskTracker for each task, collocated with the split

TaskTrackers execute tasks and report status to the JobTracker - TaskTracker can run multiple map and reduce tasks - Map and Reduce Slots

Failed attempts reassigned to other TaskTrackers

Job execution status and results reported back to the client

Scheduler lets many jobs run in parallel

30

Page 31: Hadoop scalability

Example: Standard Deviation

Standard deviation

Input: large text file

Output: standard deviation σ of word lengths

Example: σ(dogs, like, cats) = 0

How many jobs?

σ = sqrt( (1/n) ∑ (xi − µ)² )

31

Page 32: Hadoop scalability

Standard Deviation: Hint

σ² = (1/n) ∑ (xi − µ)²
   = (1/n) ∑ (xi² − 2µ·xi + µ²)
   = (1/n) ∑ xi² − 2µ · (1/n) ∑ xi + µ²
   = (1/n) ∑ xi² − µ²

So a single job that computes both ∑ xi and ∑ xi² in one pass is enough.

32

Page 33: Hadoop scalability

Standard Deviation Mapper

Map input is the set of words w in the partition - Key = null Value = w

Map computes - Number of words in the partition - Total length of the words ∑length(w) - The sum of squared lengths ∑length(w)²

Map output - <“count”, #words> - <“length”, #totalLength> - <“squared”, #sumLengthSquared>

Map(null, w)
  Emit("count", 1)
  Emit("length", length(w))
  Emit("squared", length(w)²)

33

Page 34: Hadoop scalability

Standard Deviation Reducer

Reduce input - <key, value>, where - key = “count”, “length”, “squared” - value is an integer

Reduce computes - Total number of words: N = sum of all “count” values - Total length of words: L = sum of all “length” values - Sum of length squares: S = sum of all “squared” values

Reduce Output - <“count”, N> - <“length”, L> - <“squared”, S>

The result - µ = L / N - σ = sqrt(S / N − µ²)

Reduce(key, n1, n2, …)
  nRes = n1 + n2 + …
  Emit(key, nRes)

Analyze()
  read("part-r-00000")
  print("mean = " + L/N)
  print("std.dev = " + sqrt(S/N − (L*L)/(N*N)))

34
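A minimal sketch of the final arithmetic, given the three sums N (count), L (length) and S (squared) produced by the reducer; the sample values correspond to “dogs like cats” from the earlier example:

public class AnalyzeStdDev {
    static double mean(long n, long l)           { return (double) l / n; }
    static double stdDev(long n, long l, long s) {
        double mu = mean(n, l);
        return Math.sqrt((double) s / n - mu * mu);   // sigma = sqrt(S/N - mu²)
    }
    public static void main(String[] args) {
        // "dogs like cats": N = 3 words, L = 12, S = 48  ->  mean 4.0, std.dev 0.0
        System.out.println("mean = " + mean(3, 12) + ", std.dev = " + stdDev(3, 12, 48));
    }
}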

Page 35: Hadoop scalability

Combiner, Partitioner

Combiners perform local aggregation before the shuffle & sort phase - An optimization that reduces data transfers during shuffle - In the Mean example a combiner collapses each mapper's output to just two pairs (“count” and “length”)

Partitioners assign intermediate (map) key-value pairs to reducers - Responsible for dividing up the intermediate key space - Not used with single Reducer

[Diagram: Input → Map → Combiner → Partitioner → Shuffle & sort → Reduce → Output]

35
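A minimal sketch of plugging a combiner and a partitioner into the job driver, reusing the illustrative SumReducer from the earlier sketch; HashPartitioner is Hadoop's default partitioner, set explicitly here only to show the hook:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WithCombiner {
    static void configure(Job job) {
        job.setCombinerClass(LetterCount.SumReducer.class);  // local aggregation after each map task
        job.setPartitionerClass(HashPartitioner.class);      // assigns intermediate keys to reducers
        job.setNumReduceTasks(4);                            // the partitioner only matters with >1 reducer
    }
}

Reusing the reducer as the combiner is safe here because summation is associative and commutative.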

Page 36: Hadoop scalability

Distributed Sorting

Sort a dataset, which cannot be entirely stored on one node.

Input: - Set of files. 100 byte records. - The first 10 bytes of each record is the key and the rest is the value.

Output: - Ordered list of files: f1, … fN

- Each file fi is sorted, and - If i < j, then for any keys k ∈ fi and r ∈ fj, k ≤ r - Concatenation of the files in the given order must form a completely sorted record set

36

Page 37: Hadoop scalability

Naïve MapReduce Sorting

If the output could be stored on one node

The input to any Reducer is always sorted by key - Shuffle sorts Map outputs

One identity Mapper and one identity Reducer would do the trick - Identity: <k,v> → <k,v>

[Diagram: Input → Map → Shuffle → Reduce → Output; the words “dogs”, “like”, “cats” pass through an identity map and come out of the single reducer sorted as “cats”, “dogs”, “like”]

37
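A minimal sketch of this naive sort as a Hadoop job, relying on the fact that the base Mapper and Reducer classes behave as identity functions; the paths are hypothetical, and KeyValueTextInputFormat assumes a tab-separated key and value per line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NaiveSort {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "naive-sort");
        job.setJarByClass(NaiveSort.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);  // <key, value> per input line
        job.setMapperClass(Mapper.class);                        // identity map
        job.setReducerClass(Reducer.class);                      // identity reduce
        job.setNumReduceTasks(1);                                // one sorted output file
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/data/unsorted"));
        FileOutputFormat.setOutputPath(job, new Path("/data/sorted"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}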

Page 38: Hadoop scalability

Sorting with Multiple Maps

Multiple identity Mappers and one identity Reducer – same result - Does not work for multiple Reducers

[Diagram: several identity Map tasks feed one Reduce task through the shuffle; “dogs”, “like”, “cats” again come out sorted]

38

Page 39: Hadoop scalability

Sorting: Generalization

Define a hash function, such that - h: k → [1,N] - Preserves the order: k ≤ s → h(k) ≤ h(s) - h(k) is a fixed-size prefix of string k (the first 2 bytes)

Identity Mapper

With a specialized Partitioner - Computes the hash of the key, h(k), and assigns <k,v> to reducer R_h(k)

Identity Reducer - Number of reducers is N: R1, …, RN

- Inputs to Ri are all pairs whose key satisfies h(k) = i - Ri is an identity reducer, which writes its output to HDFS file fi - The choice of hash function guarantees that keys from fi are less than keys from fj if i < j

The algorithm was implemented to win Gray’s Terasort Benchmark in 2008

39
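A minimal sketch (not the actual TeraSort code) of such an order-preserving partitioner: the reducer index is derived from the first two bytes of the key, so all keys sent to reducer i sort before keys sent to reducer j when i < j. It assumes keys of at least 2 bytes, which holds for the 10-byte keys above, and a roughly uniform key distribution for balanced reducers:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PrefixPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReducers) {
        byte[] k = key.getBytes();
        // Interpret the first two bytes as an unsigned integer in [0, 65536).
        int prefix = ((k[0] & 0xFF) << 8) | (k[1] & 0xFF);
        // Scale the prefix down to a reducer number; the mapping is monotone,
        // so the key order is preserved across reducers.
        return prefix * numReducers / 65536;
    }
}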

Page 40: Hadoop scalability

Storage

Scalability Challenges

Page 41: Hadoop scalability

Single NameNode of HDFS

Scheduled downtime dominates Unscheduled - OS maintenance - Configuration changes

Reasons for Unscheduled Downtime - 60 incidents in 500 days on 30,000 nodes - 24 were full GC pauses – the majority - System bugs / bad applications / insufficient resources - See “Data Availability and Durability with HDFS”

Lack of Availability due to Performance Problems - A handful of nodes can saturate NameNode

Why is High Availability important?

41

Page 42: Hadoop scalability

Hadoop-2 Active-Standby Architecture

Single Active NameNode shares its journal with the Standby NameNode via shared storage: NFS or QJM (Quorum Journal Manager)

Provides failover to the Standby when the Active NameNode fails

42

Page 43: Hadoop scalability

WANdisco Active-Active Architecture

Multiple equal-role NameNodes share namespace state via Coordination Engine

Proposals and agreements yield coordinated updates

Fully replicated NameNodes available for reads and writes

43

Page 44: Hadoop scalability

WANdisco: Scaling Across Data Centers

Wide Area Network replication

Metadata – online

Data – offline

Continuous availability, and Disaster Recovery over a WAN

44

Page 45: Hadoop scalability

What is Apache HBase

Table: big, sparse, loosely structured - Collection of rows, sorted by row keys - Rows can have arbitrary number of columns

Table is split Horizontally into Regions - Dynamic Table partitioning - Region Servers serve regions to applications

Columns grouped into Column families - Vertical partition of tables

Distributed Cache: - Regions are loaded in nodes’ RAM - Real-time access to data

A distributed key-value store for real-time access to semi-structured data

45

[Diagram: an HBase Master, NameNode, and JobTracker coordinate worker nodes, each running a DataNode, a TaskTracker, and a RegionServer]
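A minimal sketch of real-time access through the HBase client API (the newer Connection/Table API rather than the 2013-era HTable); the table, column family, and row key names are illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Put: row key "user42", column family "info", column "name".
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            // Get: the RegionServer holding the region for "user42" answers directly.
            byte[] name = table.get(new Get(Bytes.toBytes("user42")))
                               .getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}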

Page 46: Hadoop scalability

HBase Challenge

Failure of a Region Server requires failover - Its regions are reassigned to other Region Servers - Clients fail over and reconnect to the new servers

Regions in high demand - Many client connections to one server introduce bottleneck

Good idea to replicate popular regions on multiple Region Servers - Open Problem: consistent updates

Solution: Coordinated updates

46

Page 47: Hadoop scalability

Giraffa File System

Challenge: RAM limits the namespace size

Giraffa is a distributed, highly available file system

Utilizes features of HDFS and HBase

New open source project in experimental stage

A distributed highly scalable file system using HDFS and HBase

47

Page 48: Hadoop scalability

Giraffa Requirements

Availability – the primary goal - Load balancing of metadata traffic - Same data streaming speed to / from DataNodes - Continuous Availability: No SPOF

Cluster operability and management - The cost of running a larger cluster should be the same as for a smaller one

More files & more data

48

                     HDFS         Federated HDFS   Giraffa
Space                25 PB        120 PB           1 EB = 1000 PB
Files + blocks       200 million  1 billion        100 billion
Concurrent Clients   40,000       100,000          1 million

Page 49: Hadoop scalability

Giraffa Architecture

[Diagram: the Namespace Service is an HBase Namespace Table of rows (path, attrs, block[], DN[][]) with a Block Management Processor; a Block Management Layer (BM) runs over the DataNodes; applications access it through a client-side NamespaceAgent]

1. Giraffa client gets files and blocks from HBase

2. Block Manager handles block operations

3. Stream data to or from DataNodes

49

Page 50: Hadoop scalability

Contact: Samantha Leggat | t: 925.396.1194 | [email protected]

WANdisco, Bishop Ranch 8, 5000 Executive Pkwy, Suite 270, San Ramon, CA 94583

Thank you