Hadoop Fundamentals

Page 1

Hadoop Fundamentals

Page 2

Agenda

• What is Hadoop

• Revisiting Big Data

• Examples of Hadoop in Action

• Limitations of Hadoop

• Big Data and Cloud

• Hadoop Architecture

Page 3

What is Hadoop?

• Open Source Project

• Written in Java

• Uses Google’s MapReduce and Google File System as its foundation

• Optimized to handle:

• Massive amounts of data through parallelism

• A variety of data (structured, unstructured, and semi-structured)

• Using inexpensive commodity hardware

• Great Performance

• Reliability provided through replication

• Not suited for OLTP or OLAP workloads

• Hadoop is used for Big Data; it complements OLTP and OLAP

Page 4

What is Big Data?

• With all the devices available to collect data (RFID readers, microphones, cameras, sensors, and so on), there is an explosion of data worldwide

• Big data is large collections of data (also known as datasets) that may be unstructured, and grow so large and quickly that it is difficult to manage with regular database or statistical tools

Important statistics

• > 2 billion internet users & 7.3 billion active cellphones

• Twitter processes 7 TB of data every day; Facebook processes 500 TB

• Approximately 80% of this data is unstructured

There is a need for fast, reliable, deep data insight; hence the relevance of Hadoop

Page 5

Some of the Open Source Projects related to Hadoop

• Eclipse – popular IDE donated by IBM to the open source community

• Lucene – Text search engine developed in Java

• HBase – Hadoop database

• Hive – provides data warehousing tools to extract, transform, and load data, and then query the stored data

• Pig – High level language that generates MapReduce code to analyze large datasets

Page 6

Examples of Hadoop in Action

• 2011 – Watson, a supercomputer developed by IBM, participated in the quiz show Jeopardy!

• Approximately 2 million pages of text were loaded, using Hadoop to distribute the workload of storing the information in memory

• Use of advanced search and analysis

China Mobile

• Uses a Hadoop cluster for data mining on Call Data Records

• China Mobile was producing 5-8 TB of data

• Used Hadoop to process 10 times the data at 1/5th of the cost

Page 7

Examples of Hadoop in Action

New York Times

• Hosted all public domain articles from 1851 to 1922

• 11 million files converted into 1.5 TB of PDF

• One employee ran the conversion job for 24 hours on a 100-instance Amazon EC2 Hadoop cluster

Yahoo

• Largest production user, with an application running on a Hadoop cluster of approximately 10,000 Linux machines

• Also the largest contributor to the Hadoop open source project

Page 8

Limitations of Hadoop

• Not good for processing transactions (random access)

• Not good when work cannot be parallelized

• Not good for low latency data access

• Not good for processing lots of small files

• Not good for intensive calculations with little data

Page 9

Big Data Solutions and the Cloud

• Big Data solutions are more than just Hadoop:

• Add business intelligence / analytics functionality

• Derive information from data in motion

• Big Data solutions and the cloud are a perfect fit:

• The cloud allows you to set up a cluster of systems in minutes, and it’s relatively inexpensive

Page 10

Hadoop Architecture

Terminology Review

• Node – a computer, typically non-enterprise commodity hardware

• Rack – a collection of 30–40 nodes that are physically stored close to each other and connected to the same network switch

• Network bandwidth between any two nodes in the same rack is greater than bandwidth between two nodes on different racks

• A Hadoop Cluster is a collection of racks

Page 11

Pre-Hadoop 2.2 Architecture

• Distributed File System:

• Hadoop Distributed File System (HDFS)

• GPFS – FPO

• MapReduce Engine:

• Framework for performing calculations on the data in the file system

• Has a built-in resource manager and scheduler

Pre-Hadoop 2.2 MapReduce is called MapReduce 1.0; it has its own resource management and scheduling

Page 12

Hadoop Distributed File System

• A distributed file system that provides high-performance access to data across Hadoop clusters.

• Key tool for managing pools of big data and supporting big data analytics applications.

Page 13

Hadoop Distributed File System

• HDFS runs on top of the existing file system:

• Not POSIX compliant

• Designed to tolerate a high component failure rate:

• Reliability through replication

• Designed to handle very large files:

• Large streaming data access patterns

• No random access

• Uses blocks to store a file or parts of a file (a small client-side sketch follows)
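
To make the access model concrete, here is a minimal sketch of writing and then reading a file through HDFS's Java client API (org.apache.hadoop.fs.FileSystem). The NameNode URI and file path are illustrative assumptions, not values from these slides.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address; substitute your cluster's URI.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/user/demo/sample.txt");

            // Write once: the file is streamed into HDFS blocks.
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("hello hdfs");
            }

            // Read many: streaming access; no random in-place updates.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
        }
    }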

Page 14

Basic Features of HDFS

• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system’s data

• Replication – Each data block replicated many times (typically 3)

• Failure – failure is the norm rather than the exception

• Highly fault-tolerant:

• Name node constantly checks data nodes

• Detection of faults and quick automatic recovery from them is the goal

• High throughput

• Streaming access to file system data

• Can be built out of commodity hardware

Page 15

Types of Nodes

[Diagram: a Client communicates with the Name Node and the Job Tracker; the cluster contains multiple Data Nodes, each paired with a Task Tracker.]

Page 16

Name Nodes & Data Nodes

• Name Node maintains metadata about the files

• Data Nodes:

• Store actual data

• Files are divided into blocks

• Each block is replicated N times (default N=3)

Master/slave architecture

An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.

There are a number of DataNodes, usually one per node in the cluster.

The DataNodes manage storage attached to the nodes that they run on.

HDFS exposes a file system namespace and allows user data to be stored in files.

A file is split into one or more blocks, and these blocks are stored in DataNodes.

DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.

Page 17

Job Tracker and Task Tracker

• Job Tracker is the Master Node:

• Receives the user’s job

• Decides on how many tasks will run (number of mappers)

• Decides on where to run each mapper (concept of locality)

• Task Tracker is the Slave Node:

• Receives the task from Job Tracker

• Runs the task until completion (either map or reduce task)

• Always in communication with the Job Tracker, reporting progress

Page 18

Fault Tolerance

• Failure is the norm rather than the exception

• An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data.

• Since there are a huge number of components and each component has a non-trivial probability of failure, some component is always non-functional.

• Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS

Page 19

Data Characteristics

Streaming data access

Applications need streaming access to data

Batch processing rather than interactive user access.

Large data sets and files: gigabytes to terabytes in size

High aggregate data bandwidth

Scale to hundreds of nodes in a cluster

Tens of millions of files in a single instance

Write-once-read-many: a file, once created, written, and closed, need not be changed; this assumption simplifies data coherency

A map-reduce application or web-crawler application fits perfectly with this model

Page 20

Data Replication

HDFS is designed to store very large files across machines in a large cluster.

Each file is a sequence of blocks.

All blocks in the file except the last are of the same size.

Blocks are replicated for fault tolerance.

Block size and replicas are configurable per file.

The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.

A BlockReport contains all the blocks on a DataNode (a sketch of querying block locations follows)
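
As a small illustration of blocks and replicas, the sketch below asks the Namenode where each block of a file lives, via the Java FileSystem API; the path is a hypothetical example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/big.dat"));

            // One BlockLocation per block; getHosts() names the DataNodes
            // that hold replicas of that block.
            for (BlockLocation loc :
                    fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + loc.getOffset()
                        + " length=" + loc.getLength()
                        + " hosts=" + String.join(",", loc.getHosts()));
            }
        }
    }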

Page 21

HDFS file blocks

• Not the same as the operating system’s file blocks:

• An HDFS block is made up of multiple operating system blocks

• Default for Hadoop is 64 MB

• Recommended is 128 MB (this is BigInsights’ block size)

• The size of a file can be larger than any single disk in the cluster:

• Blocks for a single file are spread across multiple nodes in the cluster

• If a chunk of a file is smaller than the HDFS block size:

• Only the needed space is used

• Blocks work well with replication:

• Fault tolerance and availability with commodity hardware

Page 22

HDFS Replication

• Blocks of data are replicated to multiple nodes

• Allows for node failure without data loss

• Replication can be done on many more nodes by:

• Changing Hadoop configuration

• Setting the replication factor for each file (see the sketch below)
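
Here is a minimal sketch of both options in Java, with hypothetical paths and values: dfs.replication changes the default used for new files created by this client, while the FileSystem API can set replication per file at creation time or afterwards.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Change the default replication factor (normally 3) for
            // files this client creates.
            conf.set("dfs.replication", "5");
            FileSystem fs = FileSystem.get(conf);

            // Per-file settings at creation time: overwrite flag, buffer
            // size, replication factor, and a 128 MB block size.
            Path p = new Path("/user/demo/big.dat");
            fs.create(p, true, 4096, (short) 5, 128L * 1024 * 1024).close();

            // The replication factor of an existing file can be changed too.
            fs.setReplication(p, (short) 2);
        }
    }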

Page 23

MapReduce Framework

• Based on technology from Google

• Processes huge datasets for certain kinds of distributable problems using a large number of nodes

• A MapReduce program consists of map and reduce functions:

• Map phase – divides data into smaller subsets that are distributed over different nodes

• Reduce phase – master node collects all the returned data and combines it into some sort of output that can be used again

• Allows for distributed processing of map and reduce operations:

• Tasks run in parallel

Page 24

Dataflow in MapReduce

Page 25

Features of MapReduce

• Fine grained Map and Reduce tasks:

• Improved load balancing

• Faster recovery from failed tasks

• Automatic re-execution on failure:

• In a large cluster, some nodes are always slow or flaky

• Framework re-executes failed tasks

• Locality optimizations:

• With large data, bandwidth to data is a problem

• Map-Reduce + HDFS is a very effective solution

• Map-Reduce queries HDFS for locations of input data

• Map tasks are scheduled close to the inputs when possible

Page 26

Key Value Pairs

• Mappers and Reducers are user-provided code (functions)

• Just need to obey the Key-Value pairs interface

• Mappers:

• Consume <key, value> pairs

• Produce <key, value> pairs

• Reducers:

• Consume <key, <list of values>>

• Produce <key, value>

• Shuffling and Sorting:

• Hidden phase between mappers and reducers

• Groups all identical keys from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>

Page 27

Word Count Example

• Mapper:

• Input: value: lines of input text

• Output: key: word, value: 1

• Reducer:

• Input: key: word, value: set of counts

• Output: key: word, value: sum

• Launching program (a Java sketch follows this list):

• Defines this job

• Submits job to cluster
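
Below is a minimal sketch of this launching program together with the mapper and reducer, using Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce classes from Hadoop 2.x). It closely follows the classic WordCount example shipped with Hadoop; input and output paths are assumed to come from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: consumes (offset, line) pairs, produces (word, 1) pairs.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: consumes (word, [1, 1, ...]), produces (word, sum).
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Launching program: defines the job and submits it to the cluster.
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }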

Page 28

Example - Word Count Dataflow

Page 29

Example – Color Count

[Diagram: Color Count dataflow. Job: count the number of each color in a data set. Input blocks on HDFS feed four Map tasks; each parses a record and produces (k, v) pairs such as (color, 1). After shuffling and sorting on the key, three Reduce tasks consume (k, [1,1,1,1,1,1..]) and produce (k’, v’) totals such as (color, 100). The output file has three parts (Part0001, Part0002, Part0003), probably on three different machines.]

Page 30

How Many Maps and Reduces

• Maps:

• Usually as many as the number of HDFS blocks being processed; this is the default

• Else the number of maps can be specified as a hint

• The number of maps can also be controlled by specifying the minimum split size

• The actual sizes of the map inputs are computed by:

• max(min(block_size, data / #maps), min_split_size)

• Reduces:

• Unless the amount of data being processed is small:

• 0.95 * num_nodes * mapred.tasktracker.tasks.maximum (a worked example follows)
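
A worked example, using assumed numbers rather than anything from these slides: with 10 GB (10,240 MB) of input and the default 64 MB block size, the default is 10,240 / 64 = 160 map tasks; raising the minimum split size to 128 MB halves that to 80. For reduces, a 10-node cluster with mapred.tasktracker.tasks.maximum = 2 gives 0.95 * 10 * 2 ≈ 19 reduce tasks.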

Page 31

Types of Nodes

• Name Node:

• Only one Name Node in a cluster

• Stores metadata for the data nodes

• Manages the file system namespace and metadata:

• Data does not go through the name node

• Data is not stored in the name node

• Single point of failure:

• Good idea to mirror the name node

• Do not use inexpensive commodity hardware

• Has a large memory requirement:

• File system metadata is maintained in RAM to serve read requests

Page 32

Types of Nodes

Data Node

• Many per Hadoop Cluster

• Manages the blocks with data and serves them to clients

• Blocks from different files can be stored on the same DataNode

• Periodically reports to the NameNode the list of blocks it stores

• Suitable for inexpensive commodity hardware – replication provided at software level

Page 33

Types of Nodes

Job Tracker

• Manages the MapReduce jobs in the cluster

• One per Hadoop Cluster

• Receives job requests submitted by the client

• Schedules and monitors MapReduce jobs on TaskTrackers:

• Attempts to direct a task to the TaskTracker where the data resides

• Monitors for any failing tasks that need to be rescheduled

Page 34

Types of Nodes

TaskTracker

• Many per Hadoop Cluster

• Executes the MapReduce operations:

• Runs the MapReduce tasks in JVMs

• Has a set of slots used to run tasks

• Communicates with JobTracker

• Reads blocks from DataNodes

Page 35

Hadoop 2.2 Architecture

Provides YARN

• Referred to as MapReduce V2

• Resource Manager and Scheduler external to any framework

• DataNodes still exist

• JobTracker and TaskTrackers no longer exist

• Not required to run YARN with Hadoop 2.2:

• Still supports MapReduce V1

Page 36

YARN

Two main ideas

• Provide generic scheduling and resource management:

• Support more than just MapReduce

• Support more than just batch processing

• More efficient scheduling and workload management:

• No more balancing between map slots and reduce slots

Page 37

Hadoop High Availability

[Diagram: an HA deployment with three JournalNodes, two NameNodes (active and standby), and five DataNodes.]