Hadoop: The Pile of Data
View Big Data and Hadoop Course at: http://www.edureka.co/big-data-and-hadoop
For more details, please contact us:
US: 1800 275 9730 (toll free)
India: +91 88808 62004
Email us: [email protected]
For queries: post on Twitter @edurekaIN with #askEdureka, or on Facebook /edurekaIN
Slide 2
Objectives
At the end of this module, you will be able to understand:
» What the Hadoop framework is
» What Big Data is
» Hadoop core components
» When to use Hadoop
» Processing of:
• Unstructured data
• Semi-structured data
• Structured data
Slide 3
What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
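The "simple programming model" referred to above is MapReduce: you write a map function and a reduce function, and the framework handles distribution, scheduling, and fault tolerance. The slides contain no code, but the canonical word-count job below is a minimal sketch of what that model looks like in Java:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a JAR, a job like this runs with `hadoop jar wordcount.jar WordCount <input> <output>`; the framework splits the input across the cluster and moves the computation to wherever the data blocks live.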
Slide 4
Hadoop Key Characteristics

Hadoop features:
» Reliable
» Economical
» Flexible
» Scalable
Slide 5
Hadoop Design Principles
Facilitate the storage and processing of large and/or rapidly growing data sets
» Structured and unstructured data
» Simple programming models
Scale-Out rather than Scale-Up
Bring Code to Data rather than data to code
High scalability and availability
Use commodity hardware
Fault-tolerance
Slide 6
Hadoop – It’s about Scale and Structure
RDBMS                                                   | Aspect       | Hadoop
Structured                                              | Data types   | Multi and unstructured
Limited, no data processing                             | Processing   | Processing coupled with data
Standards & structured                                  | Governance   | Loosely structured
Required on write                                       | Schema       | Required on read
Reads are fast                                          | Speed        | Writes are fast
Software license                                        | Cost         | Support only
Known entity                                            | Resources    | Growing, complexities, wide
OLTP, complex ACID transactions, operational data store | Best fit use | Data discovery; processing unstructured data; massive storage/processing
Slide 7
What is Big Data?

Lots of data (terabytes or petabytes).

Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.

The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
[Word cloud: terms associated with Big Data: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile]
Slide 8
IBM's Definition of Big Data
Big Data characteristics (http://www-01.ibm.com/software/data/bigdata/): the four Vs

VOLUME | VELOCITY | VARIETY | VERACITY

Example data sources: web logs, images, videos, audios, sensor data

[Slide figure: a sample of structured, numeric data]
Min | Max | Mean | SD
4.3 | 7.9 | 5.84 | 0.83
2.0 | 4.4 | 3.05 | 0.43
0.1 | 2.5 | 1.20 | 0.76
Slide 9
Hadoop Ecosystem

Hadoop 1.0
» Data ingestion: Flume (unstructured or semi-structured data), Sqoop (structured data)
» HDFS (Hadoop Distributed File System)
» MapReduce Framework
» HBase
» Pig Latin (data analysis), Hive (DW system), Mahout (machine learning)
» Apache Oozie (workflow)

Hadoop 2.0
» Data ingestion: Flume (unstructured or semi-structured data), Sqoop (structured data)
» HDFS (Hadoop Distributed File System)
» YARN (cluster resource management)
» MapReduce Framework, plus other YARN frameworks (MPI, GRAPH)
» HBase
» Pig Latin (data analysis), Hive (DW system), Mahout (machine learning)
» Apache Oozie (workflow)
Slide 10
Hadoop 2.x Core Components

HDFS (Storage)
» Master: NameNode
» Slave: DataNode
» Secondary NameNode

YARN (Processing)
» Master: Resource Manager
» Slave: Node Manager
Slide 11
Hadoop 2.x Core Components (Contd.)

A Hadoop 2.x cluster runs HDFS and YARN side by side:
» Master node: NameNode (HDFS) and Resource Manager (YARN)
» Slave nodes: each runs a DataNode (HDFS) and a Node Manager (YARN)
Slide 12
Main Components of HDFS

NameNode:
» Master of the system
» Maintains and manages the blocks present on the DataNodes

DataNodes:
» Slaves deployed on each machine, providing the actual storage
» Responsible for serving read and write requests from clients
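To make that division of labor concrete, here is a minimal sketch using Hadoop's Java FileSystem API (not from the slides; the hdfs://localhost:9000 URI and /demo path are placeholders). The client consults the NameNode only for metadata and block locations; the bytes themselves stream to and from the DataNodes:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS is normally picked up from core-site.xml;
    // the localhost URI here is just a placeholder.
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/hello.txt");

    // Write: the NameNode allocates blocks, DataNodes store the replicas.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the blocks are fetched from whichever DataNodes hold them.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
    fs.close();
  }
}
```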
Slide 13
When to use Hadoop
Slide 14
Data Size and Data Diversity
» You have different types of data: structured, semi-structured, and unstructured
» The data set is huge, i.e. several terabytes or petabytes
» You are not in a hurry for answers (Hadoop favors batch throughput over low latency)
Slide 15
Future Planning
To implement Hadoop on your data, you should first understand the complexity of the data and the rate at which it is going to grow.
This calls for cluster planning: you may begin with a small or medium cluster sized for the data available at present (GBs or a few TBs), and scale the cluster up later as your data grows.
Slide 16
Multiple Frameworks for Big Data
Hadoop can be integrated with multiple analytics tools to get the best out of it, such as machine learning, R, Python, Spark, MongoDB, etc.
Slide 17
Lifetime Data Availability
When you want your data to stay live and available indefinitely, Hadoop's scalability and replication make that achievable.
Slide 18
Processing Unstructured Data: Image Processing
Demo
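The demo itself is not reproduced in this transcript. As a hypothetical sketch of how image processing on Hadoop is commonly set up: small binary files are first packed into a SequenceFile of (filename, raw bytes) pairs, and a mapper then decodes each image. Everything below (the packing convention, the class and field names) is an assumption for illustration, not the demo from the session:

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes the images were packed into a SequenceFile as
// (filename, raw bytes) pairs beforehand; emits (filename, "WxH").
public class ImageSizeMapper
    extends Mapper<Text, BytesWritable, Text, Text> {

  @Override
  protected void map(Text fileName, BytesWritable bytes, Context context)
      throws IOException, InterruptedException {
    BufferedImage img = ImageIO.read(
        new ByteArrayInputStream(bytes.copyBytes()));
    if (img != null) {  // skip files ImageIO cannot decode
      context.write(fileName,
          new Text(img.getWidth() + "x" + img.getHeight()));
    }
  }
}
```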
Slide 19
Processing Semi-structured Data: XML Processing
Demo
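Again, the actual demo is only named here. One simplified approach, assuming each input line holds one self-contained XML record (the <dept> tag and class name are hypothetical), is a mapper that extracts a field and emits it for counting by a summing reducer:

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes one self-contained XML record per input line, e.g.
// <employee><dept>sales</dept><salary>50000</salary></employee>
public class XmlFieldMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final Pattern DEPT =
      Pattern.compile("<dept>(.*?)</dept>");

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Matcher m = DEPT.matcher(value.toString());
    if (m.find()) {
      // Count one record per department.
      context.write(new Text(m.group(1)), new IntWritable(1));
    }
  }
}
```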
Slide 20
Processing Structured Data: CSV Processing
Demo
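The CSV demo is likewise only named in the transcript. Below is a minimal sketch of schema-on-read CSV processing, assuming rows of the form id,product,price (all names hypothetical); paired with a summing reducer it would total the price column per product:

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes rows like "id,product,price"; emits (product, price)
// so a summing reducer can total sales per product.
public class CsvPriceMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length >= 3) {
      try {
        context.write(new Text(fields[1]),
            new DoubleWritable(Double.parseDouble(fields[2])));
      } catch (NumberFormatException e) {
        // Skip the header row and malformed lines.
      }
    }
  }
}
```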
Slide 21
Questions?
Twitter @edurekaIN, Facebook /edurekaIN; use #askEdureka for questions