Beginners Guide HADOOP

HADOOP - files.meetup.comfiles.meetup.com/2808892/HadoopBigData_Presentation.pdf · HDFS – Blocks Advantages of blocks – Fixed size, easy to calculate how many fit on a disk –


Beginners Guide

HADOOP


Agenda

● What is Hadoop?
● What is Big Data?
● Architecture
● MapReduce
● Querying Data
● Examples of Hadoop in Action


What is Hadoop?


● Open Source project
● Written in Java
● Optimised to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using commodity hardware
● Great performance


● Reliability provided through replication
● Not for OLTP, not for OLAP/DSS
● Good for Big Data


What is Big Data?

● RFID readers
● 2 billion Internet users
● 4.7 billion mobile phones
● 7 TB of data processed by Twitter every day
● 10 TB of data processed by Facebook every day
● About 80% of data is unstructured


Architecture

● HDFS
● MapReduce
● Types of Nodes
● Topology Awareness
● Writing a File to HDFS
● HDFS CLI


● Two main components
– Distributed File System
– MapReduce Engine


HDFS

● HDFS runs on top of an existing file system
● Designed to handle very large files with streaming data access
● Uses blocks to store a file or parts of a file


HDFS – Blocks

● File blocks
– 64 MB to 128 MB
– 1 HDFS block is supported by many OS blocks
● Advantages of blocks
– Fixed size, easy to calculate how many fit on a disk
– A file can be larger than any disk in the cluster
– If a file or a chunk of a file is smaller than the block, only the needed space is used
– Fits well with replication
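The block arithmetic can be sketched in a few lines of Python. This is an illustration only: the 64 MB block size is one of the values mentioned on the slide, and the function name split_into_blocks is made up for this example.

```python
BLOCK_SIZE_MB = 64  # assumed block size; the slides give 64 MB to 128 MB


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block needed to store a file."""
    if file_size_mb <= 0:
        return []
    full_blocks = int(file_size_mb // block_size_mb)
    remainder = file_size_mb - full_blocks * block_size_mb
    sizes = [block_size_mb] * full_blocks
    if remainder > 0:
        sizes.append(remainder)  # the last block only uses the space it needs
    return sizes


blocks = split_into_blocks(200)
print(len(blocks), blocks)  # 4 [64, 64, 64, 8]
```

A 200 MB file needs three full blocks plus one 8 MB block, and each of those blocks may live on a different disk.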


HDFS – Replication

● Blocks are replicated to multiple nodes
● Allows node failure without data loss
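A toy model shows why replication tolerates node failure. Everything here is an assumption made for illustration: the replication factor of 3, the round-robin placement (real HDFS placement is more involved), and the function names.

```python
REPLICATION_FACTOR = 3  # assumed value for this sketch


def place_replicas(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for start, block_id in enumerate(block_ids):
        placement[block_id] = {
            nodes[(start + i) % len(nodes)] for i in range(replication)
        }
    return placement


def surviving_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    failed = set(failed_nodes)
    return {b for b, holders in placement.items() if holders - failed}


nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(["blk_0", "blk_1", "blk_2"], nodes)
print(surviving_blocks(placement, ["node2"]) == set(placement))  # True
```

With three replicas per block, losing a single node never removes the last copy of any block.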


MapReduce Engine

● Technology from Google
● A MapReduce program consists of map and reduce functions
● A MapReduce job is divided into tasks that run in parallel
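The idea can be sketched in plain Python with a word count. This is not Hadoop's API: map_fn, reduce_fn, and run_job are hypothetical names, and a thread pool stands in for the cluster's parallel tasks.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)


def run_job(lines, workers=4):
    # Map phase: each line is an independent task, so tasks can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, lines))

    # Group the intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            grouped[word].append(count)

    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


print(run_job(["big data", "big cluster"]))  # {'big': 2, 'data': 1, 'cluster': 1}
```

Because the map function looks at one record at a time and the reduce function at one key at a time, the framework is free to spread both phases over many machines.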


Types of Nodes

● HDFS Nodes
– NameNode
– DataNode
● MapReduce Nodes
– JobTracker
– TaskTracker


● NameNode
– Only 1 per Hadoop Cluster
– Manages the filesystem namespace and metadata
– Single point of failure
– Large memory requirements


● DataNode
– Many per Hadoop Cluster
– Manages blocks with data and serves them to the clients
– Periodically reports to the NameNode the list of blocks it stores
– Commodity hardware
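The division of labour between the NameNode and the DataNodes can be pictured with a small in-memory model. The class and method names below are hypothetical and greatly simplified; they only illustrate that the NameNode keeps metadata while block reports tell it where the data lives.

```python
from collections import defaultdict


class NameNodeSketch:
    """Toy model of NameNode metadata (hypothetical class, not Hadoop's API)."""

    def __init__(self):
        self.file_blocks = {}                    # file path -> ordered block ids
        self.block_locations = defaultdict(set)  # block id -> DataNodes holding it

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Replace what is known about one DataNode with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return, per block of the file, the DataNodes a client can read from."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


nn = NameNodeSketch()
nn.add_file("/logs/a.txt", ["blk_1", "blk_2"])
nn.block_report("dn1", ["blk_1"])
nn.block_report("dn2", ["blk_1", "blk_2"])
print(nn.locate("/logs/a.txt"))  # [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn2'])]
```

All of this state sits in memory on a single machine in the sketch, which mirrors why the slide lists large memory requirements and a single point of failure for the NameNode.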


● JobTracker Node
– 1 per Hadoop Cluster
– Receives job requests submitted by clients
– Schedules and monitors MapReduce jobs on TaskTrackers


● TaskTracker Node
– Many per Hadoop Cluster
– Executes MapReduce operations


Topology Awareness


Writing a File to HDFS


HDFS Command Line

● File System Shell


MapReduce

● Map Operation
● Reduce Operation
● Submitting a MR job
● The Shuffle
● Data Types
● Fault tolerance
● Scheduling / Task Execution


Map Operation


Reduce Operation


Submitting a MR Job


Data Types

● Key / Value
● Lists
● Simple data flow example
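One possible data flow, sketched in Python with type hints to make the key/value and list shapes visible. The input format ("year,temperature" lines) and all function names are assumptions chosen for this example; the sort-then-group step stands in for the shuffle.

```python
from itertools import groupby
from typing import Iterable, Iterator


def map_fn(offset: int, line: str) -> Iterator[tuple[str, int]]:
    """(K1, V1) -> list(K2, V2): parse 'year,temperature' into a pair."""
    year, temp = line.split(",")
    yield year, int(temp)


def reduce_fn(year: str, temps: Iterable[int]) -> Iterator[tuple[str, int]]:
    """(K2, list(V2)) -> list(K3, V3): keep the maximum per year."""
    yield year, max(temps)


def run(lines: list[str]) -> list[tuple[str, int]]:
    # Input records: the key is the record's position, the value is the line text.
    intermediate = [pair for i, line in enumerate(lines) for pair in map_fn(i, line)]
    # Sorting by key brings equal keys together so each key's values form one list.
    intermediate.sort(key=lambda kv: kv[0])
    output = []
    for year, group in groupby(intermediate, key=lambda kv: kv[0]):
        output.extend(reduce_fn(year, [temp for _, temp in group]))
    return output


print(run(["1950,0", "1950,22", "1949,111", "1949,78"]))
# [('1949', 111), ('1950', 22)]
```

Every stage consumes and produces key/value pairs; the only place a list appears is the reducer's input, where all values for one key arrive together.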


Fault Tolerance

● Task Failure
– If a child task fails, the JVM reports the failure to the TaskTracker
– If a child task hangs, it is killed, and the JobTracker reschedules the task on another machine
– If the task continues to fail, the job is failed
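The retry behaviour can be sketched as a loop over attempts. The attempt limit of 4 and the names run_with_retries and flaky are assumptions for illustration, not values or APIs taken from the slides.

```python
MAX_ATTEMPTS = 4  # assumed retry limit for this sketch


def run_with_retries(task, machines, max_attempts=MAX_ATTEMPTS):
    """Try a task on successive machines; fail the job after max_attempts."""
    errors = []
    for attempt in range(max_attempts):
        machine = machines[attempt % len(machines)]
        try:
            return task(machine)
        except Exception as exc:  # the failed attempt is recorded, then rescheduled
            errors.append((machine, str(exc)))
    raise RuntimeError(f"job failed after {max_attempts} attempts: {errors}")


def flaky(machine):
    """Hypothetical task that only fails on one bad machine."""
    if machine == "m1":
        raise IOError("disk error")
    return f"done on {machine}"


print(run_with_retries(flaky, ["m1", "m2"]))  # done on m2
```

Moving the retry to a different machine is what lets a job survive a single bad node, while the attempt limit stops a genuinely broken task from retrying forever.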


● TaskTracker Failure
– JobTracker receives no heartbeat
– JobTracker removes the TaskTracker from the pool
● JobTracker Failure
– Single point of failure: the job fails
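Heartbeat-based failure detection can be sketched as a table of last-seen times. The class name, the method names, and the timeout values are all assumptions for this illustration; time is passed in explicitly so the example is deterministic.

```python
class JobTrackerSketch:
    """Toy heartbeat bookkeeping (hypothetical class, not Hadoop's API)."""

    def __init__(self, timeout_seconds=600.0):  # assumed timeout for illustration
        self.timeout = timeout_seconds
        self.last_heartbeat = {}  # tracker name -> time of last heartbeat

    def heartbeat(self, tracker, now):
        self.last_heartbeat[tracker] = now

    def expire_trackers(self, now):
        """Remove trackers that have been silent for longer than the timeout."""
        expired = [
            t for t, seen in self.last_heartbeat.items() if now - seen > self.timeout
        ]
        for tracker in expired:
            del self.last_heartbeat[tracker]
        return expired

    def pool(self):
        return sorted(self.last_heartbeat)


jt = JobTrackerSketch(timeout_seconds=10)
jt.heartbeat("tt1", now=0)
jt.heartbeat("tt2", now=0)
jt.heartbeat("tt1", now=8)
print(jt.expire_trackers(now=12), jt.pool())  # ['tt2'] ['tt1']
```

Nothing in the sketch watches the JobTracker itself, which is the sense in which it remains a single point of failure.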


Scheduling

● FIFO
– Each job uses the whole Hadoop Cluster
● Fair
– Jobs are placed in pools
● Capacity
– Hadoop simulates for each user a separate MapReduce cluster with FIFO scheduling
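The difference between the FIFO and Fair policies can be sketched as two ways of handing out task slots. This is a deliberately simplified model with made-up function names; the real schedulers consider much more than a slot count, and the Capacity policy is not modelled here.

```python
def fifo_allocation(jobs, slots):
    """FIFO: the oldest job gets every slot it can use before the next job gets any."""
    allocation = {}
    for name, demand in jobs:  # jobs are listed in submission order
        granted = min(demand, slots)
        allocation[name] = granted
        slots -= granted
    return allocation


def fair_allocation(pools, slots):
    """Fair: hand out slots one at a time, cycling over pools that still want more."""
    allocation = {name: 0 for name, _ in pools}
    remaining = dict(pools)
    while slots > 0 and any(remaining.values()):
        for name in allocation:
            if slots > 0 and remaining[name] > 0:
                allocation[name] += 1
                remaining[name] -= 1
                slots -= 1
    return allocation


demand = [("A", 10), ("B", 4)]
print(fifo_allocation(demand, 10))  # {'A': 10, 'B': 0}
print(fair_allocation(demand, 10))  # {'A': 6, 'B': 4}
```

Under FIFO the second job waits until the first releases slots; under the fair policy a small job gets its share immediately.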


Task Execution

● Speculative execution
– Job execution time is sensitive to slow-running tasks
● JVM Reuse
– Use the same JVM through configuration


Querying Data

● Pig
● Hive


Pig

● Developed by Yahoo!
● Pig is a platform for analysing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.


● Two Components
– Language (Pig Latin)
– Execution environment
● Two Execution environments
– Local (single JVM): pig -x local
– Distributed system: pig -x mapreduce (or simply pig)


● Running Pig
– Script: pig script-name.pig
– Grunt: pig (launches the command-line tool)
– Embedded: calling Pig from Java


HIVE

● Hive is a data warehouse system for Hadoop that facilitates easy data summarisation, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.

● Provides a SQL-like language, HiveQL


● Running Hive
– Interactive: hive
– Script: hive -f my-script
– Inline: hive -e 'SELECT * FROM MyTable'