Beginners Guide
HADOOP
Agenda
● What is Hadoop?
● What is Big Data?
● Architecture
● MapReduce
● Querying Data
● Examples of Hadoop in Action
What is Hadoop?
● Open Source project
● Written in Java
● Optimised to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using commodity hardware
● Great performance
● Reliability provided through replication
● Not for OLTP, not for OLAP/DSS
● Good for Big Data
What is Big Data?
● RFID readers
● 2 billion Internet users
● 4.7 billion mobile phones
● 7 TB of data processed by Twitter every day
● 10 TB of data processed by Facebook every day
● About 80% of data is unstructured
Architecture
● HDFS
● MapReduce
● Types of Nodes
● Topology Awareness
● Writing a File to HDFS
● HDFS CLI
Architecture
● Two main components
– Distributed File System
– MapReduce Engine
HDFS
● HDFS runs on top of an existing file system
● Designed to handle very large files with streaming data access
● Uses blocks to store a file or parts of a file
HDFS – Blocks
● File blocks
– 64 MB to 128 MB
– One HDFS block is backed by many OS blocks
● Advantages of blocks
– Fixed size, so it is easy to calculate how many fit on a disk
– A file can be larger than any disk in the cluster
– If a file or a chunk of a file is smaller than the block, only the needed space is used
– Fits well with replication
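The block arithmetic above can be sketched in a few lines of Python. The 64 MB block size comes from the slide; the function name `split_into_blocks` and the 200 MB file are illustrative assumptions, not part of Hadoop.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks needed to store a file."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only occupies the space it actually needs.
        sizes.append(remainder)
    return sizes


# Hypothetical 200 MB file: three full blocks plus one partial block.
blocks = split_into_blocks(200 * MB)
print(len(blocks), [size // MB for size in blocks])
```

Because every block except the last has the same size, the number of blocks that fit on a disk is a simple division.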
HDFS – Replication
● Blocks are replicated to multiple nodes
● Allows node failure without data loss
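A minimal sketch of why replication tolerates node failure. The round-robin placement below is a deliberate simplification and not the placement policy HDFS uses; the replication factor of 3, the node names and both function names are assumptions for illustration.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for start, block in enumerate(block_ids):
        placement[block] = [
            nodes[(start + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def lost_blocks(placement, failed_nodes):
    """Return the blocks whose replicas all sit on failed nodes."""
    failed = set(failed_nodes)
    return [block for block, holders in placement.items() if set(holders) <= failed]


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNodes
placement = place_replicas(["b0", "b1", "b2"], nodes)
print(lost_blocks(placement, ["dn1"]))                # one node down
print(lost_blocks(placement, ["dn1", "dn2", "dn3"]))  # three nodes down
```

With three replicas per block, a block is lost only when every node holding one of its replicas fails at the same time.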
MapReduce Engine
● Technology from Google
● A MapReduce program consists of map and reduce functions
● A MapReduce job is divided into tasks that run in parallel
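The idea of map and reduce functions split into parallel tasks can be imitated on a single machine. This is only a sketch in plain Python, not the Hadoop API: the word-count job, the thread pool standing in for a cluster, and the names `map_task`, `reduce_task` and `run_job` are all assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def reduce_task(word, counts):
    """Reduce: combine all the values emitted for one key."""
    return word, sum(counts)


def run_job(splits):
    # Map tasks are independent of each other, so they can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_task, splits))
    # Group the intermediate values by key before reducing.
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            grouped[word].append(count)
    return dict(reduce_task(word, counts) for word, counts in grouped.items())


# Two hypothetical input splits, each a list of lines.
print(run_job([["big data big"], ["data on hadoop"]]))
```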
Types of Nodes
● HDFS Nodes
– NameNode
– DataNode
● MapReduce Nodes
– JobTracker
– TaskTracker
● NameNode
– Only 1 per Hadoop Cluster
– Manages the filesystem namespace and metadata
– Single point of failure
– Large memory requirements
● DataNode
– Many per Hadoop Cluster
– Manages blocks with data and serves them to the clients
– Periodically reports to the NameNode the list of blocks it stores
– Commodity hardware
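The periodic block report can be pictured as the NameNode rebuilding a map from block to DataNodes. The class `NameNodeSketch` and its methods are hypothetical names for this sketch and do not mirror Hadoop's internal classes.

```python
from collections import defaultdict


class NameNodeSketch:
    """Toy model of the block map a NameNode keeps in memory."""

    def __init__(self):
        # block id -> set of DataNode ids that hold a replica
        self.block_locations = defaultdict(set)

    def receive_block_report(self, datanode, block_ids):
        # A report lists every block the DataNode stores, so entries from
        # its previous report are dropped before the new ones are added.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block in block_ids:
            self.block_locations[block].add(datanode)

    def locate(self, block):
        """Return the DataNodes a client could read this block from."""
        return sorted(self.block_locations.get(block, ()))


namenode = NameNodeSketch()
namenode.receive_block_report("dn1", ["b0", "b1"])
namenode.receive_block_report("dn2", ["b1"])
print(namenode.locate("b1"))
```

Keeping this whole map in memory is one reason the NameNode needs a lot of RAM.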
● JobTracker Node
– 1 per Hadoop Cluster
– Receives job requests submitted by clients
– Schedules and monitors MapReduce jobs on TaskTrackers
● TaskTracker Node
– Many per Hadoop Cluster
– Executes MapReduce operations
Topology Awareness
Writing a File to HDFS
HDFS Command Line
● File System Shell
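A few common File System Shell commands, assembled from Python so they can be printed or passed to `subprocess.run`. The `hadoop fs` actions used here (`-mkdir`, `-put`, `-ls`, `-cat`) are standard shell commands; the helper `fs_shell` and the paths are hypothetical.

```python
import shlex


def fs_shell(action, *paths):
    """Build the argument list for one `hadoop fs` command."""
    return ["hadoop", "fs", f"-{action}", *paths]


commands = [
    fs_shell("mkdir", "/user/demo"),                       # create a directory
    fs_shell("put", "local.txt", "/user/demo/local.txt"),  # copy a local file into HDFS
    fs_shell("ls", "/user/demo"),                          # list the directory
    fs_shell("cat", "/user/demo/local.txt"),               # print the file contents
]

for argv in commands:
    # Only printed here; on a machine with Hadoop installed, each list
    # could be executed with subprocess.run(argv, check=True).
    print(shlex.join(argv))
```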
MapReduce
● Map Operation
● Reduce Operation
● Submitting a MR job
● The Shuffle
● Data Types
● Fault tolerance
● Scheduling / Task Execution
Map Operation
Reduce Operation
Submitting a MR Job
Data Types
● Key / Value
● Lists
● Simple data flow example
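One way to picture the key/value and list types is to follow a tiny data set through map, shuffle and reduce. The records and the function names are invented for this sketch: map turns (offset, line) into a list of (year, temperature) pairs, the shuffle groups them into (year, list of temperatures), and reduce emits (year, maximum).

```python
from itertools import groupby
from operator import itemgetter


def map_fn(offset, line):
    """(K1, V1) -> list of (K2, V2): parse one 'year,temperature' record."""
    year, temperature = line.split(",")
    return [(year, int(temperature))]


def shuffle(pairs):
    """list of (K2, V2) -> list of (K2, list of V2), sorted by key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def reduce_fn(year, temperatures):
    """(K2, list of V2) -> list of (K3, V3): keep the maximum per year."""
    return [(year, max(temperatures))]


# Hypothetical input: (byte offset, line) pairs.
records = [(0, "1950,22"), (8, "1949,11"), (16, "1950,-3"), (25, "1949,30")]
intermediate = [pair for offset, line in records for pair in map_fn(offset, line)]
grouped = shuffle(intermediate)
output = [pair for year, temps in grouped for pair in reduce_fn(year, temps)]
print(intermediate, grouped, output, sep="\n")
```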
Fault Tolerance
● Task Failure
– Child task fails: the JVM reports the error to the TaskTracker.
– Child task hangs: it is killed. The JobTracker reschedules the task on another machine.
– If the task continues to fail, the job is failed.
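The retry behaviour above can be sketched as a loop that moves a failing task to another machine and gives up after a fixed number of attempts. The limit of four attempts, the machine names and the function `run_task_with_retries` are assumptions for illustration, not Hadoop code.

```python
def run_task_with_retries(task, machines, max_attempts=4):
    """Run `task` on successive machines; fail the job after max_attempts."""
    errors = []
    for attempt in range(max_attempts):
        # Each new attempt is scheduled on the next machine in the list.
        machine = machines[attempt % len(machines)]
        try:
            return task(machine), errors
        except Exception as exc:
            errors.append((machine, str(exc)))
    raise RuntimeError(f"job failed after {max_attempts} attempts: {errors}")


def flaky_task(machine):
    """Hypothetical task that always fails on node1."""
    if machine == "node1":
        raise IOError("disk error")
    return f"done on {machine}"


result, errors = run_task_with_retries(flaky_task, ["node1", "node2"])
print(result, errors)
```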
● TaskTracker Failure
– JobTracker receives no heartbeat
– JobTracker removes the TaskTracker from the pool
● JobTracker Failure
– Single point of failure: the job fails
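Heartbeat-based failure detection can be sketched as a table of last-seen timestamps. The class name, the method names and the 600 second timeout are assumptions made for this example.

```python
class JobTrackerSketch:
    """Toy model of TaskTracker liveness tracking."""

    def __init__(self, timeout_seconds=600):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = {}  # tracker id -> time of last heartbeat

    def heartbeat(self, tracker, now):
        self.last_heartbeat[tracker] = now

    def expire_trackers(self, now):
        """Remove and return the trackers that have been silent too long."""
        expired = [
            tracker
            for tracker, seen in self.last_heartbeat.items()
            if now - seen > self.timeout_seconds
        ]
        for tracker in expired:
            del self.last_heartbeat[tracker]
        return expired


job_tracker = JobTrackerSketch(timeout_seconds=10)
job_tracker.heartbeat("tt1", now=0)
job_tracker.heartbeat("tt2", now=0)
job_tracker.heartbeat("tt1", now=8)       # tt2 stays silent
print(job_tracker.expire_trackers(now=12))
```

Tasks that were running on an expired tracker would then be rescheduled on the trackers that remain in the pool.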
Scheduling
● FIFO
– Each job uses the whole Hadoop cluster
● Fair
– Jobs are placed in pools
● Capacity
– Hadoop simulates for each user a separate MapReduce cluster with FIFO scheduling
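The difference between FIFO and fair scheduling comes down to how task slots are divided. The two functions below are a simplified sketch with invented names; a real fair scheduler also handles weights, minimum shares and preemption, which are left out here.

```python
def fifo_allocation(jobs, slots):
    """FIFO: the oldest job in the queue gets every slot."""
    return {jobs[0]: slots} if jobs else {}


def fair_allocation(pools, slots):
    """Fair: split the slots evenly across pools.

    Slots that do not divide evenly go to the first pools in the list.
    """
    if not pools:
        return {}
    share, extra = divmod(slots, len(pools))
    return {
        pool: share + (1 if index < extra else 0)
        for index, pool in enumerate(pools)
    }


print(fifo_allocation(["job1", "job2"], slots=10))
print(fair_allocation(["alice", "bob", "carol"], slots=10))
```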
Task Execution
● Speculative
– Job execution time is sensitive to slow-running tasks, so a duplicate of a slow task is launched on another node
● JVM Reuse
– The same JVM can be reused for several tasks, enabled through configuration
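A sketch of how slow-running tasks might be picked for speculative execution: compare each task's progress with the average and flag the stragglers. The 0.2 threshold, the task names and the function `pick_speculative_tasks` are assumptions, not the scheduler's actual rule.

```python
def pick_speculative_tasks(progress, threshold=0.2):
    """Return the tasks whose progress trails the average by more than threshold.

    `progress` maps a task id to a completion fraction between 0.0 and 1.0.
    """
    if not progress:
        return []
    average = sum(progress.values()) / len(progress)
    return sorted(
        task for task, done in progress.items() if average - done > threshold
    )


# Hypothetical map tasks: m3 is far behind the others.
print(pick_speculative_tasks({"m1": 0.9, "m2": 0.95, "m3": 0.3}))
```

A duplicate of each flagged task would be started elsewhere, and whichever copy finishes first is used.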
Querying Data
● Pig
● Hive
Pig
● Developed by Yahoo!
● Pig is a platform for analysing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
● Two Components
– Language (PigLatin)
– Execution environment
● Two Execution environments
– Local (single JVM): pig -x local
– Distributed system: pig -x mapreduce (or just pig)
● Running Pig
– Script: pig script-name.pig
– Grunt: pig (launches the command-line tool)
– Embedded: calling Pig from Java
HIVE
● Hive is a data warehouse system for Hadoop that facilitates easy data summarisation, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
● Provides a SQL-like language, HiveQL
● Running Hive
– Interactive: hive
– Script: hive -f my-script
– Inline: hive -e 'SELECT * FROM MyTable'