HADOOP - files.meetup.com/2808892/HadoopBigData_Presentation.pdf

Beginner's Guide

HADOOP

Agenda

● What is Hadoop?
● What is Big Data?
● Architecture
● MapReduce
● Querying Data
● Examples of Hadoop in Action

What is Hadoop?

● Open Source project
● Written in Java
● Optimised to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using commodity hardware
● Great performance

What is Hadoop?

● Reliability provided through replication
● Not for OLTP, not for OLAP/DSS
● Good for Big Data

What is Big Data?

● RFID readers
● 2 billion Internet users
● 4.7 billion mobile phones
● 7 TB of data processed by Twitter every day
● 10 TB of data processed by Facebook every day
● About 80% of data is unstructured

Architecture

● HDFS
● MapReduce
● Types of Nodes
● Topology Awareness
● Writing a File to HDFS
● HDFS CLI

Architecture

● Two main components
– Distributed File System
– MapReduce Engine

HDFS

● HDFS runs on top of an existing file system
● Designed to handle very large files with streaming data access
● Uses blocks to store a file or parts of a file

HDFS – Blocks

● File blocks
– 64 MB to 128 MB in size
– One HDFS block is backed by many OS blocks

HDFS – Blocks

● Advantages of blocks
– Fixed size, easy to calculate how many fit on a disk
– A file can be larger than any disk in the cluster
– If a file or a chunk of a file is smaller than the block, only the needed space is used
– Fits well with replication
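The block arithmetic above can be sketched in a few lines of Python. This is an illustration only, not HDFS code: the function names `split_into_blocks` and `blocks_per_disk` are hypothetical, and the 64 MB default is taken from the block sizes quoted on the previous slide.

```python
from typing import List

MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = 64 * MB) -> List[int]:
    """Return the size in bytes of each block needed to store a file."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only occupies the space it actually needs.
        sizes.append(remainder)
    return sizes


def blocks_per_disk(disk_size: int, block_size: int = 64 * MB) -> int:
    """Fixed-size blocks make it trivial to work out how many fit on a disk."""
    return disk_size // block_size


sizes = split_into_blocks(200 * MB)
print(len(sizes), sizes[-1] // MB)  # prints "4 8"
```

A 200 MB file therefore needs three full 64 MB blocks plus one 8 MB block, and those four blocks may live on different disks.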

HDFS – Replication

● Blocks are replicated to multiple nodes
● Allows node failure without data loss
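A toy model shows why replication tolerates node failure. The sketch below places each block on several distinct nodes and then checks which blocks become unreachable when nodes fail. The round-robin placement and the replication factor of 3 are assumptions made for illustration; real HDFS placement is topology-aware and more involved, and `place_replicas` and `lost_blocks` are hypothetical names.

```python
from typing import Dict, List, Set


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block_id: [nodes[(block_id + i) % len(nodes)] for i in range(replication)]
        for block_id in range(num_blocks)
    }


def lost_blocks(placement: Dict[int, List[str]], failed: Set[str]) -> List[int]:
    """A block is lost only if every node holding a replica has failed."""
    return [block for block, holders in placement.items()
            if all(node in failed for node in holders)]


placement = place_replicas(5, ["dn1", "dn2", "dn3", "dn4"])
print(lost_blocks(placement, {"dn1"}))                # prints "[]"
print(lost_blocks(placement, {"dn1", "dn2", "dn3"}))  # prints "[0, 4]"
```

With three replicas per block, data is lost only when all three holders of the same block fail together.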

MapReduce Engine

● Technology from Google
● A MapReduce program consists of map and reduce functions
● A MapReduce job is divided into tasks that run in parallel

Types of Nodes

● HDFS Nodes
– NameNode
– DataNode
● MapReduce Nodes
– JobTracker
– TaskTracker

Types of Nodes

● NameNode
– Only 1 per Hadoop Cluster
– Manages the filesystem namespace and metadata
– Single point of failure
– Large memory requirements

Types of Nodes

● DataNode
– Many per Hadoop Cluster
– Manages blocks with data and serves them to the clients
– Periodically reports to the NameNode the list of blocks it stores
– Commodity hardware
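The division of labour between the two HDFS node types can be sketched as follows: the NameNode holds the namespace (which blocks make up each file) and learns where blocks live from the reports DataNodes send. This is a minimal in-memory model, not the actual NameNode; the class `NameNodeSketch` and its methods are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class NameNodeSketch:
    """Toy model of NameNode metadata: the namespace plus block locations."""

    def __init__(self) -> None:
        # File path -> ordered list of block ids (the filesystem namespace).
        self.namespace: Dict[str, List[int]] = {}
        # Block id -> DataNodes known to hold it (built from block reports).
        self.block_locations: Dict[int, Set[str]] = defaultdict(set)

    def add_file(self, path: str, block_ids: Iterable[int]) -> None:
        self.namespace[path] = list(block_ids)

    def receive_block_report(self, datanode: str, block_ids: Iterable[int]) -> None:
        """A report lists every block the DataNode stores, so it replaces
        whatever was previously recorded for that node."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path: str) -> List[List[str]]:
        """Return, for each block of the file, the DataNodes a client can read from."""
        return [sorted(self.block_locations.get(block_id, set()))
                for block_id in self.namespace[path]]


nn = NameNodeSketch()
nn.add_file("/logs/a.txt", [1, 2])
nn.receive_block_report("dn1", [1, 2])
nn.receive_block_report("dn2", [2])
print(nn.locate("/logs/a.txt"))  # prints "[['dn1'], ['dn1', 'dn2']]"
```

Because all of this metadata sits in one place, the model also hints at why the NameNode needs a lot of memory and is a single point of failure.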

Types of Nodes

● JobTracker Node
– 1 per Hadoop Cluster
– Receives job requests submitted by clients
– Schedules and monitors MapReduce jobs on TaskTrackers

Types of Nodes

● TaskTracker Node
– Many per Hadoop Cluster
– Executes MapReduce operations

Topology Awareness

Writing a File to HDFS




HDFS Command Line

● File System Shell

MapReduce

● Map Operation
● Reduce Operation
● Submitting a MR job
● The Shuffle
● Data Types
● Fault tolerance
● Scheduling / Task Execution

Map Operation

Reduce Operation

Submitting a MR Job

Data Types

● Key / Value
● Lists

Data Types

● Simple data flow example
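One way to picture the data flow is a word count written in plain Python, with no Hadoop involved. The map function emits key/value pairs, the shuffle groups values into a list per key, and the reduce function collapses each list. The function names and the use of the line index as the input key are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(_key: int, line: str) -> Iterator[Tuple[str, int]]:
    """Map: (line index, line) -> a (word, 1) pair for every word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share a key into one list."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: (word, [1, 1, ...]) -> (word, total count)."""
    return key, sum(values)


def run_job(lines: List[str]) -> Dict[str, int]:
    mapped = (pair for index, line in enumerate(lines)
              for pair in map_fn(index, line))
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


print(run_job(["the quick fox", "the lazy dog the end"])["the"])  # prints "3"
```

In a real cluster the map calls run as separate tasks on different nodes and the shuffle moves data across the network, but the key/value and list shapes are the same.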

Fault Tolerance

● Task Failure
– If a child task fails, its JVM reports the failure to the TaskTracker.
– If a child task hangs, it is killed, and the JobTracker reschedules the task on another machine.
– If the task continues to fail, the job is failed.
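The retry behaviour described above can be sketched as a loop that tries the task on a different machine after each failure and gives up after a fixed number of attempts. The limit of 4 attempts, the `JobFailed` exception, and `run_task_with_retries` are assumptions for this illustration, not Hadoop's scheduler code.

```python
from typing import Callable, List, Optional


class JobFailed(Exception):
    """Raised when a task has exhausted all of its attempts."""


def run_task_with_retries(task: Callable[[str], str], machines: List[str],
                          max_attempts: int = 4) -> str:
    """Run `task`, rescheduling it on another machine after every failure."""
    if not machines:
        raise ValueError("at least one machine is required")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        # Pick a different machine for each new attempt.
        machine = machines[attempt % len(machines)]
        try:
            return task(machine)
        except Exception as error:  # the failure is reported back
            last_error = error
    # The task kept failing, so the whole job is marked as failed.
    raise JobFailed(f"task failed {max_attempts} times") from last_error


def flaky(machine: str) -> str:
    if machine == "tt1":
        raise RuntimeError("disk error on tt1")
    return f"ok on {machine}"


print(run_task_with_retries(flaky, ["tt1", "tt2"]))  # prints "ok on tt2"
```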

Fault Tolerance

● TaskTracker Failure
– The JobTracker receives no heartbeat
– The TaskTracker is removed from the pool
● JobTracker Failure
– Single point of failure; the job fails
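Heartbeat-based failure detection can be modelled with a table of last-seen timestamps. In this sketch, a tracker whose last heartbeat is older than a timeout is dropped from the pool. The class name `HeartbeatMonitor` and the 600-second timeout used in the example are assumptions, and time is passed in explicitly so the behaviour is easy to follow.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Toy model of how a JobTracker could notice a dead TaskTracker."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, tracker: str, now: float) -> None:
        """Record that `tracker` reported in at time `now`."""
        self.last_seen[tracker] = now

    def expire(self, now: float) -> List[str]:
        """Remove and return the trackers that have been silent for too long."""
        dead = sorted(tracker for tracker, seen in self.last_seen.items()
                      if now - seen > self.timeout_seconds)
        for tracker in dead:
            del self.last_seen[tracker]
        return dead

    def pool(self) -> List[str]:
        """Trackers still considered alive and eligible for new tasks."""
        return sorted(self.last_seen)


monitor = HeartbeatMonitor(timeout_seconds=600)
monitor.heartbeat("tt1", now=0)
monitor.heartbeat("tt2", now=0)
monitor.heartbeat("tt1", now=500)
print(monitor.expire(now=700))  # prints "['tt2']"
print(monitor.pool())           # prints "['tt1']"
```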

Scheduling

● FIFO
– Each job uses the whole Hadoop Cluster
● Fair
– Jobs are placed in pools
● Capacity
– Hadoop simulates for each user a separate MapReduce cluster with FIFO scheduling
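To make the idea of pools concrete, the sketch below divides a fixed number of task slots evenly among pools, never giving a pool more than it asks for and handing the leftover to pools that still have demand. It is a simplified illustration of fair sharing, not the algorithm of Hadoop's Fair Scheduler, and `fair_share` is a hypothetical name.

```python
from typing import Dict


def fair_share(total_slots: int, demand: Dict[str, int]) -> Dict[str, int]:
    """Hand out slots one at a time, round-robin, to pools that still want more."""
    allocation = {pool: 0 for pool in demand}
    remaining = total_slots
    active = {pool for pool, wanted in demand.items() if wanted > 0}
    while remaining > 0 and active:
        for pool in sorted(active):
            if remaining == 0:
                break
            allocation[pool] += 1
            remaining -= 1
        # Pools whose demand is satisfied stop competing for slots.
        active = {pool for pool in active if allocation[pool] < demand[pool]}
    return allocation


print(fair_share(10, {"etl": 2, "reports": 10, "adhoc": 10}))
# prints "{'etl': 2, 'reports': 4, 'adhoc': 4}"
```

Under FIFO, by contrast, the first job in the queue could take all 10 slots while the others wait.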

Task Execution

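● Speculative Execution
– Job execution time is sensitive to slow-running tasks, so a slow task can be run speculatively as a duplicate on another node
● JVM Reuse
– The same JVM can be reused for multiple tasks through configuration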

Querying Data

● Pig
● Hive

Pig

● Developed by Yahoo!
● Pig is a platform for analysing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

Pig

● Two Components
– Language (Pig Latin)
– Execution environment
● Two Execution environments
– Local (single JVM): pig -x local
– Distributed system: pig -x mapreduce (or simply pig)

Pig

● Running Pig
– Script: pig script-name.pig
– Grunt: pig (launches the command line tool)
– Embedded: calling Pig from Java

HIVE

● Hive is a data warehouse system for Hadoop that facilitates easy data summarisation, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.

● Provides a SQL-like language, HiveQL

HIVE

● Running Hive
– Interactive: hive
– Script: hive -f my-script
– Inline: hive -e 'SELECT * FROM MyTable'
