Hadoop: The Pile of Data
View Big Data and Hadoop Course at: http://www.edureka.co/big-data-and-hadoop
For more details, please contact us:
US: 1800 275 9730 (toll free)
India: +91 88808 62004
Email us: [email protected]
For queries: post on Twitter @edurekaIN with #askEdureka, or on Facebook /edurekaIN
Slide 2
Objectives
At the end of this module, you will be able to understand:
» What the Hadoop framework is
» What Big Data is
» Hadoop core components
» When to use Hadoop
» Processing of:
• Unstructured data
• Semi-structured data
• Structured data
Slide 3
What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
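The "simple programming model" referred to above is MapReduce: you write a map function and a reduce function, and the framework handles distribution, scheduling, and fault tolerance. The slides contain no code, but the canonical word-count job below is a minimal sketch of what that model looks like in Java:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a JAR, a job like this runs with `hadoop jar wordcount.jar WordCount <input> <output>`; the framework splits the input across the cluster and moves the computation to wherever the data blocks live.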
Slide 4
Hadoop Key Characteristics

Hadoop features:
» Reliable
» Economical
» Flexible
» Scalable
Slide 5
Hadoop Design Principles
Facilitate the storage and processing of large and/or rapidly growing data sets
» Structured and unstructured data
» Simple programming models
Scale-Out rather than Scale-Up
Bring Code to Data rather than data to code
High scalability and availability
Use commodity hardware
Fault-tolerance
Slide 6
Hadoop – It’s about Scale and Structure
RDBMS                                                   | Aspect       | Hadoop
Structured                                              | Data types   | Multi and unstructured
Limited, no data processing                             | Processing   | Processing coupled with data
Standards & structured                                  | Governance   | Loosely structured
Required on write                                       | Schema       | Required on read
Reads are fast                                          | Speed        | Writes are fast
Software license                                        | Cost         | Support only
Known entity                                            | Resources    | Growing, complexities, wide
OLTP, complex ACID transactions, operational data store | Best fit use | Data discovery; processing unstructured data; massive storage/processing
Slide 7
What is Big Data?

Lots of data (terabytes or petabytes).

Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.

The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
[Word cloud: terms associated with Big Data: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile]
Slide 8
IBM's Definition of Big Data
Big Data characteristics (http://www-01.ibm.com/software/data/bigdata/): the four Vs

VOLUME | VELOCITY | VARIETY | VERACITY

Example data sources: web logs, images, videos, audios, sensor data

[Slide figure: a sample of structured, numeric data]
Min | Max | Mean | SD
4.3 | 7.9 | 5.84 | 0.83
2.0 | 4.4 | 3.05 | 0.43
0.1 | 2.5 | 1.20 | 0.76
Slide 9
Hadoop Ecosystem

Hadoop 1.0
» Data ingestion: Flume (unstructured or semi-structured data), Sqoop (structured data)
» HDFS (Hadoop Distributed File System)
» MapReduce Framework
» HBase
» Pig Latin (data analysis), Hive (DW system), Mahout (machine learning)
» Apache Oozie (workflow)

Hadoop 2.0
» Data ingestion: Flume (unstructured or semi-structured data), Sqoop (structured data)
» HDFS (Hadoop Distributed File System)
» YARN (cluster resource management)
» MapReduce Framework, plus other YARN frameworks (MPI, GRAPH)
» HBase
» Pig Latin (data analysis), Hive (DW system), Mahout (machine learning)
» Apache Oozie (workflow)
Slide 10
Hadoop 2.x Core Components

HDFS (Storage)
» Master: NameNode
» Slave: DataNode
» Secondary NameNode

YARN (Processing)
» Master: Resource Manager
» Slave: Node Manager
Slide 11
Hadoop 2.x Core Components (Contd.)

A Hadoop 2.x cluster runs HDFS and YARN side by side:
» Master node: NameNode (HDFS) and Resource Manager (YARN)
» Slave nodes: each runs a DataNode (HDFS) and a Node Manager (YARN)
Slide 12
Main Components of HDFS

NameNode:
» Master of the system
» Maintains and manages the blocks present on the DataNodes

DataNodes:
» Slaves deployed on each machine, providing the actual storage
» Responsible for serving read and write requests from clients
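To make that division of labor concrete, here is a minimal sketch using Hadoop's Java FileSystem API (not from the slides; the hdfs://localhost:9000 URI and /demo path are placeholders). The client consults the NameNode only for metadata and block locations; the bytes themselves stream to and from the DataNodes:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS is normally picked up from core-site.xml;
    // the localhost URI here is just a placeholder.
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/hello.txt");

    // Write: the NameNode allocates blocks, DataNodes store the replicas.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the blocks are fetched from whichever DataNodes hold them.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
    fs.close();
  }
}
```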
Slide 13
When to use Hadoop
Slide 14
Data Size and Data Diversity
» You have different types of data: structured, semi-structured, and unstructured
» The data set is huge, i.e. several terabytes or petabytes
» You are not in a hurry for answers (Hadoop favors batch throughput over low latency)
Slide 15
Future Planning
To implement Hadoop on your data, you should first understand the complexity of the data and the rate at which it is going to grow.
This calls for cluster planning: you may begin with a small or medium cluster sized for the data available at present (GBs or a few TBs), and scale the cluster up later as your data grows.
Slide 16
Multiple Frameworks for Big Data
Hadoop can be integrated with multiple analytics tools to get the best out of it, such as machine learning, R, Python, Spark, MongoDB, etc.
Slide 17
Lifetime Data Availability
When you want your data to stay live and available indefinitely, Hadoop's scalability and replication make that achievable.
Slide 18
Processing Unstructured Data: Image Processing
Demo
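The demo itself is not reproduced in this transcript. As a hypothetical sketch of how image processing on Hadoop is commonly set up: small binary files are first packed into a SequenceFile of (filename, raw bytes) pairs, and a mapper then decodes each image. Everything below (the packing convention, the class and field names) is an assumption for illustration, not the demo from the session:

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes the images were packed into a SequenceFile as
// (filename, raw bytes) pairs beforehand; emits (filename, "WxH").
public class ImageSizeMapper
    extends Mapper<Text, BytesWritable, Text, Text> {

  @Override
  protected void map(Text fileName, BytesWritable bytes, Context context)
      throws IOException, InterruptedException {
    BufferedImage img = ImageIO.read(
        new ByteArrayInputStream(bytes.copyBytes()));
    if (img != null) {  // skip files ImageIO cannot decode
      context.write(fileName,
          new Text(img.getWidth() + "x" + img.getHeight()));
    }
  }
}
```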
Slide 19
Processing Semi-structured Data: XML Processing
Demo
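Again, the actual demo is only named here. One simplified approach, assuming each input line holds one self-contained XML record (the <dept> tag and class name are hypothetical), is a mapper that extracts a field and emits it for counting by a summing reducer:

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes one self-contained XML record per input line, e.g.
// <employee><dept>sales</dept><salary>50000</salary></employee>
public class XmlFieldMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final Pattern DEPT =
      Pattern.compile("<dept>(.*?)</dept>");

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Matcher m = DEPT.matcher(value.toString());
    if (m.find()) {
      // Count one record per department.
      context.write(new Text(m.group(1)), new IntWritable(1));
    }
  }
}
```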
Slide 20
Processing Structured Data: CSV Processing
Demo
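The CSV demo is likewise only named in the transcript. Below is a minimal sketch of schema-on-read CSV processing, assuming rows of the form id,product,price (all names hypothetical); paired with a summing reducer it would total the price column per product:

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes rows like "id,product,price"; emits (product, price)
// so a summing reducer can total sales per product.
public class CsvPriceMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length >= 3) {
      try {
        context.write(new Text(fields[1]),
            new DoubleWritable(Double.parseDouble(fields[2])));
      } catch (NumberFormatException e) {
        // Skip the header row and malformed lines.
      }
    }
  }
}
```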
Slide 21
Questions?
Twitter @edurekaIN, Facebook /edurekaIN; use #askEdureka for questions