
Hadoop : The Pile of Big Data


Page 1: Hadoop : The Pile of Big Data

www.edureka.co/big-data-and-hadoop

Hadoop : The Pile of Big Data

View Big Data and Hadoop Course at: http://www.edureka.co/big-data-and-hadoop

For more details please contact us:
US : 1800 275 9730 (toll free)
INDIA : +91 88808 62004
Email Us : [email protected]

For Queries:
Post on Twitter @edurekaIN: #askEdureka
Post on Facebook /edurekaIN

Page 2: Hadoop : The Pile of Big Data


Objectives

At the end of this module, you will be able to understand:

• What is the Hadoop framework
• What is Big Data
• Hadoop core components
• When to use Hadoop
• Processing of:
  » Unstructured data
  » Semi-structured data
  » Structured data

Page 3: Hadoop : The Pile of Big Data


What is Hadoop?

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is an open-source data-management framework with scale-out storage and distributed processing.
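The "simple programming model" referred to here is MapReduce. The slides do not include code, but the map/shuffle/reduce flow can be sketched with a small local Python simulation of a word count (plain Python, not an actual Hadoop job; all names here are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum all counts that were emitted for one word.
    return (word, sum(counts))

def run_job(lines):
    # Simulate the shuffle/sort step Hadoop performs between map and reduce:
    # sort all pairs by key so each key's values arrive grouped together.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (c for _, c in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

counts = run_job(["Hadoop stores Big Data", "Hadoop processes Big Data"])
```

In a real cluster the mapper and reducer run in parallel on many machines, and the framework handles the shuffle; the local loop above only mirrors the logical data flow.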

Page 4: Hadoop : The Pile of Big Data


Hadoop Key Characteristics

• Reliable
• Economical
• Flexible
• Scalable

Page 5: Hadoop : The Pile of Big Data


Hadoop Design Principles

Facilitate the storage and processing of large and/or rapidly growing data sets

» Structured and unstructured data
» Simple programming models

Scale-Out rather than Scale-Up

Bring Code to Data rather than data to code

High scalability and availability

Use commodity hardware

Fault-tolerance

Page 6: Hadoop : The Pile of Big Data


Hadoop – It’s about Scale and Structure

| RDBMS                                                    | Attribute    | HADOOP                                                                   |
| Structured                                               | Data Types   | Multi- and Unstructured                                                  |
| Limited, No Data Processing                              | Processing   | Processing Coupled with Data                                             |
| Standards & Structured                                   | Governance   | Loosely Structured                                                       |
| Required on Write                                        | Schema       | Required on Read                                                         |
| Reads are Fast                                           | Speed        | Writes are Fast                                                          |
| Software License                                         | Cost         | Support Only                                                             |
| Known Entity                                             | Resources    | Growing, Complex, Wide                                                   |
| OLTP, Complex ACID Transactions, Operational Data Store  | Best Fit Use | Data Discovery, Processing Unstructured Data, Massive Storage/Processing |
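The "schema on write" vs. "schema on read" contrast can be made concrete with a small Python sketch (illustrative only; the function and variable names are invented for this example, not taken from any Hadoop API):

```python
import csv
import io

raw = "alice,30\nbob,notanumber\n"

def load_validated(text):
    # Schema-on-write (RDBMS style): the schema is enforced at load time,
    # so a malformed row makes the whole load fail.
    rows = []
    for name, age in csv.reader(io.StringIO(text)):
        rows.append({"name": name, "age": int(age)})  # raises ValueError on bad data
    return rows

def query_ages(text):
    # Schema-on-read (Hadoop style): raw bytes are stored as-is, and the
    # schema is applied only when a query runs; bad rows are skipped.
    for name, age in csv.reader(io.StringIO(text)):
        try:
            yield name, int(age)  # schema applied here, at read time
        except ValueError:
            continue
```

With `raw` above, `load_validated` rejects the whole file because of the bad row, while `query_ages` still returns the parseable records.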

Page 7: Hadoop : The Pile of Big Data


What is Big Data?

Lots of Data (Terabytes or Petabytes)

Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database-management tools or traditional data-processing applications.

The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

[Word cloud of terms associated with Big Data: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile]

Page 8: Hadoop : The Pile of Big Data


IBM’s Definition of Big Data

IBM characterizes Big Data along four dimensions (http://www-01.ibm.com/software/data/bigdata/):

• VOLUME
• VELOCITY
• VARIETY
• VERACITY

Examples of such data: web logs, images, videos, audio, sensor data.

[Slide graphic: a small sample-statistics table]
Min  Max  Mean  SD
4.3  7.9  5.84  0.83
2.0  4.4  3.05  0.43
0.1  2.5  1.20  0.76

Page 9: Hadoop : The Pile of Big Data


Hadoop Ecosystem

Hadoop 1.0:

• HDFS (Hadoop Distributed File System)
• MapReduce Framework
• Pig (Pig Latin) – Data Analysis
• Hive – DW System
• HBase
• Mahout – Machine Learning
• Apache Oozie – Workflow
• Sqoop – ingestion of Structured Data
• Flume – ingestion of Unstructured or Semi-structured Data

Hadoop 2.0:

• HDFS (Hadoop Distributed File System)
• YARN – Cluster Resource Management
• MapReduce Framework and Other Frameworks (MPI, Graph) running on YARN
• Pig (Pig Latin) – Data Analysis
• Hive – DW System
• HBase
• Mahout – Machine Learning
• Apache Oozie – Workflow
• Sqoop – ingestion of Structured Data
• Flume – ingestion of Unstructured or Semi-structured Data

Page 10: Hadoop : The Pile of Big Data


Hadoop 2.x Core Components

• HDFS (Storage)
  » NameNode (Master)
  » Secondary NameNode (Master)
  » DataNode (Slave)
• YARN (Processing)
  » Resource Manager (Master)
  » Node Manager (Slave)

Page 11: Hadoop : The Pile of Big Data


Hadoop 2.x Core Components ( Contd.)

In a Hadoop 2.x cluster, the HDFS and YARN layers run side by side: the NameNode (HDFS) and the Resource Manager (YARN) run on master machines, while every slave machine runs a DataNode (HDFS) alongside a Node Manager (YARN).

Page 12: Hadoop : The Pile of Big Data


Main Components of HDFS

NameNode:

» Master of the system
» Maintains and manages the blocks present on the DataNodes

DataNodes:

» Slaves deployed on each machine; they provide the actual storage
» Responsible for serving read and write requests from clients
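The division of labor described above, metadata on the NameNode and block data on the DataNodes, can be illustrated with a toy in-memory model (a deliberately simplified sketch; class names, the block size, and the round-robin placement policy are all invented for illustration and do not reflect HDFS internals):

```python
class DataNode:
    """Slave: holds actual block bytes."""
    def __init__(self, name):
        self.name, self.blocks = name, {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    """Master: holds only metadata (file -> blocks, block -> replica locations)."""
    def __init__(self, datanodes, replication=2):
        self.datanodes, self.replication = datanodes, replication
        self.block_map = {}  # filename -> list of (block_id, [DataNode, ...])

    def write(self, filename, data, block_size=4):
        blocks = []
        for i in range(0, len(data), block_size):
            block_id = f"{filename}_blk{i // block_size}"
            # Toy round-robin placement of each block on `replication` DataNodes.
            targets = [self.datanodes[(i // block_size + r) % len(self.datanodes)]
                       for r in range(self.replication)]
            for dn in targets:
                dn.store(block_id, data[i:i + block_size])
            blocks.append((block_id, targets))
        self.block_map[filename] = blocks

    def read(self, filename):
        # Clients consult the NameNode for locations, then read blocks
        # directly from a replica (here: the first one).
        return "".join(replicas[0].blocks[block_id]
                       for block_id, replicas in self.block_map[filename])

dns = [DataNode(f"dn{i}") for i in range(3)]
nn = NameNode(dns)
nn.write("log.txt", "hello hdfs!")
```

The key point the sketch captures: the NameNode never stores file contents, only the map from files to block locations, which is why losing it (without a Secondary NameNode checkpoint) is so serious.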

Page 13: Hadoop : The Pile of Big Data


When to use Hadoop

Page 14: Hadoop : The Pile of Big Data


Data Size and Data Diversity

• You have different types of data: structured, semi-structured and unstructured
• The data set is huge in size, i.e. several Terabytes or Petabytes
• You are not in a hurry for answers

Page 15: Hadoop : The Pile of Big Data


Future Planning

To implement Hadoop on your data, you should first understand the level of complexity of the data and the rate at which it is going to grow.

You therefore need cluster planning: it may begin with building a small or medium cluster, sized for the data (in GBs or a few TBs) available at present, which you then scale up in the future as your data grows.

Page 16: Hadoop : The Pile of Big Data


Multiple Frameworks for Big Data

Hadoop can be integrated with multiple analytic tools to get the best out of it: machine-learning tools, R, Python, Spark, MongoDB, etc.

Page 17: Hadoop : The Pile of Big Data


Lifetime Data Availability

When you want your data to stay live and available for the long term, Hadoop's scalability makes this achievable.

Page 18: Hadoop : The Pile of Big Data


Processing Unstructured Data: Image Processing

Demo

Page 19: Hadoop : The Pile of Big Data


Processing Semi-structured Data: XML Processing

Demo
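The XML demo itself is not reproduced in these notes, but the core idea of a map step over semi-structured records can be sketched locally in Python (the document structure, tag names, and function name below are invented for illustration, not taken from the demo):

```python
import xml.etree.ElementTree as ET

xml_doc = """<orders>
  <order id="1"><item>disk</item><qty>4</qty></order>
  <order id="2"><item>cpu</item><qty>2</qty></order>
</orders>"""

def map_orders(doc):
    # A map step that flattens each <order> element into an (item, qty) record,
    # turning semi-structured XML into tabular key/value output.
    root = ET.fromstring(doc)
    for order in root.iter("order"):
        yield order.findtext("item"), int(order.findtext("qty"))

records = list(map_orders(xml_doc))
```

In a real Hadoop job the same parsing logic would sit inside a mapper, with an input format that splits the file on record boundaries rather than lines.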

Page 20: Hadoop : The Pile of Big Data


Processing Structured Data: CSV Processing

Demo
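As with the other demos, the CSV-processing code is not included in these notes; a minimal local sketch of the typical pattern, a per-key aggregation over structured rows, might look like this in Python (the column names and sample data are invented for illustration):

```python
import csv
import io
from collections import defaultdict

csv_data = "city,temp\nDelhi,31\nPune,24\nDelhi,29\n"

def average_by_city(text):
    # Mapper-style parse of each CSV row into (city, temp), followed by a
    # reducer-style aggregation: the average temperature per city.
    sums = defaultdict(lambda: [0.0, 0])
    for row in csv.DictReader(io.StringIO(text)):
        entry = sums[row["city"]]
        entry[0] += float(row["temp"])
        entry[1] += 1
    return {city: total / n for city, (total, n) in sums.items()}

averages = average_by_city(csv_data)
```

On Hadoop, the parse would run in mappers over file splits and the per-city averaging in reducers, but the logical computation is the same.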

Page 21: Hadoop : The Pile of Big Data


Questions

Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions
