By: Shrikant Gawande (Cloudera Certified)
What is Big Data?
An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.
The NYSE generates about one terabyte of new trade data per day, which is analyzed to determine trends for optimal trades.
Facebook Example
Facebook users spend 10.5 billion minutes (almost 20,000 years) online on the social network.
An average of 3.2 billion likes and comments are posted on Facebook every day.
Twitter Example
Twitter has over 500 million registered users.
The USA's 141.8 million accounts represent 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK and Indonesia.
79% of US Twitter users are more likely to recommend brands they follow.
67% of US Twitter users are more likely to buy from brands they follow.
57% of all companies that use social media for
business use Twitter.
Hadoop is being used across industries
Industries using Hadoop
Source: Karmasphere
Why Learn Big Data?
What Big Companies Have to Say
Data Volume Is Growing Exponentially
Estimated Global Data Volume:
2011: 1.8 ZB
2015: 7.9 ZB
The world's information doubles every
two years
Over the next 10 years:
The number of servers worldwide
will grow by 10x
Amount of information managed by
enterprise data centers will grow by
50x
The number of “files” enterprise data centers handle will grow by 75x
Source: http://www.emc.com/leadership/programs/digital-universe.htm, which was based on the 2011 IDC Digital Universe Study
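As a rough consistency check on the figures above, the doubling claim can be turned into arithmetic: start from the 2011 estimate and double it every two years. This is only a sketch. The function name `projected_volume_zb` is hypothetical, and a constant doubling period is an assumption made for illustration.

```python
def projected_volume_zb(start_zb, start_year, target_year, doubling_years=2.0):
    """Project data volume, assuming it doubles every `doubling_years` years."""
    elapsed_years = target_year - start_year
    return start_zb * 2 ** (elapsed_years / doubling_years)


# 2011 estimate from the slide: 1.8 ZB. Four years means two doublings.
estimate_2015 = projected_volume_zb(1.8, 2011, 2015)
print(f"Projected 2015 volume: {estimate_2015:.1f} ZB")
```

Two doublings of 1.8 ZB give 7.2 ZB, which is in the same range as the 7.9 ZB estimate quoted for 2015.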
IBM’s Definition
IBM’s Definition – Big Data Characteristics
http://www-1.ibm.com/software/data/bigdata/
A collection of large and complex data sets which are difficult to process using common database
management tools or traditional data processing applications.
Big Data is the amount of data that is beyond the storage and the processing capabilities of a single
physical machine.
Data that has very large volume, comes from a variety of sources in a variety of formats, and arrives at great velocity is normally referred to as Big Data.
It is more unstructured data than structured data.
A Traditional Approach Under Pressure
Why Big Data?
ERP
CRM data (a few TBs)
Enterprise data
What data we have been adding in the last 3-4 years
Customer Experience
Click Streams
Online Campaign
Banner Ads – capturing every click 100 n TBs
User Entered data
Search – In product search
Social media – to understand general sentiments
Industry | Use Case | Types of Data
Financial Services | New Account Risk Screens | Text, Server Logs
Financial Services | Trading Risk | Server Logs
Financial Services | Insurance Underwriting | Geographic, Sensor, Text
Telecom | Call Detail Records (CDR) | Machine, Geographic
Telecom | Infrastructure Investment | Machine, Server Logs
Telecom | Real-Time Bandwidth Allocation | Server Logs, Text, Social
Retail | 360 Degree View of Customer | Clickstream, Text
Retail | Localized, Personal Promotion | Geographic
Retail | Website Optimization | Clickstream
Manufacturing | Supply Chain and Logistics | Sensor
Manufacturing | Assembly Line Quality Assurance | Sensor
Manufacturing | Crowdsourced Quality Assurance | Social
Healthcare | Use Genomic Data in Medical Trials | Structured
Healthcare | Monitor Patient Vitals in Real-Time | Sensor
Pharmaceuticals | Recruit and Retain Patients for Drug Trials | Social, Clickstream
Pharmaceuticals | Improve Prescription Adherence | Social, Unstructured, Geographic
Oil and Gas | Unify Exploration and Production Data | Sensor, Unstructured, Geographic
Oil and Gas | Monitor Rig Safety in Real Time | Sensor, Unstructured
Common Business Applications
How can we find products that customers are interested in BUT DON’T BUY?
Leveraging ALL Business Data
How to Extract Insights from 9 TB of Web Logs?
How do you make sense of this?
What did users do when they came to our website?
Which products did they view?
Which products were seen but not purchased? Why? Can we make new offerings based on past data?
In the first line, the user has viewed a product with a particular ID.
The visitor views a 2nd product. We want to do this not just for one customer but for all customers.
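The question of which products were seen but not purchased can be sketched in a few lines on a single machine. The slides do not show the actual log format, so this sketch assumes hypothetical pre-parsed records of the form (visitor_id, action, product_id). The function name `viewed_not_purchased` and the sample data are also made up for illustration.

```python
from collections import defaultdict


def viewed_not_purchased(events):
    """Return {visitor_id: products the visitor viewed but never purchased}."""
    viewed = defaultdict(set)
    purchased = defaultdict(set)
    for visitor_id, action, product_id in events:
        if action == "view":
            viewed[visitor_id].add(product_id)
        elif action == "purchase":
            purchased[visitor_id].add(product_id)

    result = {}
    for visitor_id, products in viewed.items():
        leftover = products - purchased.get(visitor_id, set())
        if leftover:
            result[visitor_id] = leftover
    return result


# Hypothetical parsed log records: (visitor_id, action, product_id)
sample_events = [
    ("v1", "view", "p100"),
    ("v1", "view", "p200"),
    ("v1", "purchase", "p100"),
    ("v2", "view", "p100"),
]
print(viewed_not_purchased(sample_events))
```

This works for a small file held in memory. Doing the same over terabytes of logs for every customer is the kind of workload the rest of the deck motivates Hadoop for.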
Hidden Treasure
Insight into data can provide Business Advantage.
Some key early indicators can mean fortunes to a business.
More Precise Analysis with more data
New offerings to the customer
Limitations of Existing Data Analytics Architecture
Solution: A Combined Storage Computer Layer
Differentiating factors
Some of the Hadoop Users
Why DFS?
What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing
of large data sets across clusters of commodity computers using a simple
programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
Hadoop Key Characteristics
Hadoop History
Hadoop Eco-System
Hadoop Core Components
HDFS – Hadoop Distributed File System (Storage)
Self-healing, high-bandwidth clustered storage
Distributed across “nodes”
Natively redundant
NameNode tracks locations
MapReduce (Processing)
Splits a task across processors “near” the data and assembles results
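The split-then-assemble idea behind MapReduce can be illustrated with a single-process Python sketch. This is not the Hadoop API: the word-count task, the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the in-memory "chunks" standing in for data blocks on different nodes are all assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for each word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: assemble the final result by summing the values per key."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(chunks):
    """Run map on every chunk independently, then shuffle and reduce."""
    mapped = []
    for chunk in chunks:  # on a real cluster each chunk is processed near its data
        mapped.extend(map_phase(chunk))
    return reduce_phase(shuffle(mapped))


# Each string stands in for a block stored on a different node (hypothetical data).
word_counts = run_job(["big data big", "data hadoop"])
print(word_counts)
```

Because each chunk is mapped independently, the map step can run in parallel on the machines that already hold the data. Only the grouped intermediate pairs need to move across the network.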
Hadoop Core Components (contd.)
HDFS Architecture
Main Components of HDFS
NameNode
master of the system
maintains and manages the blocks which are
present on the DataNodes
DataNodes
slaves which are deployed on each machine and
provide the actual storage
responsible for serving read and write requests
for the clients
NameNode and DataNode
NameNode Metadata
Meta-data in Memory
• The entire metadata is in main memory
• No demand paging of FS meta-data
Types of Metadata
• List of files
• List of Blocks for each file
• List of DataNodes for each block
• File attributes, e.g. access time, replication factor
A Transaction Log
• Records file creations, file deletions, etc.
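The metadata types listed above can be pictured as a few in-memory maps plus an append-only log. The following is a toy model, not HDFS code: the class `ToyNameNode`, its method names, and the block and DataNode identifiers are hypothetical, and persistence of the log to disk is left out.

```python
class ToyNameNode:
    """Toy in-memory model of the metadata a NameNode keeps (illustration only)."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of DataNode names
        self.file_attrs = {}       # file path -> attributes such as replication factor
        self.transaction_log = []  # append-only record of namespace changes

    def create_file(self, path, block_ids, replication=3):
        self.file_blocks[path] = list(block_ids)
        self.file_attrs[path] = {"replication": replication}
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set())
        self.transaction_log.append(("create", path))

    def report_block(self, block_id, datanode):
        """Record that a DataNode holds a replica of the given block."""
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def delete_file(self, path):
        for block_id in self.file_blocks.pop(path, []):
            self.block_locations.pop(block_id, None)
        self.file_attrs.pop(path, None)
        self.transaction_log.append(("delete", path))

    def locate(self, path):
        """Return each block of the file with the DataNodes that hold it."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


namenode = ToyNameNode()
namenode.create_file("/logs/web.log", ["blk_1", "blk_2"])
namenode.report_block("blk_1", "datanode-a")
namenode.report_block("blk_1", "datanode-b")
namenode.report_block("blk_2", "datanode-c")
print(namenode.locate("/logs/web.log"))
```

Keeping all of these maps in main memory is what makes lookups fast. It also means the amount of metadata is bounded by the memory of the NameNode machine.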
Storage: NameNode and DataNodes
Processing: JobTracker and TaskTrackers
Hadoop courses and their fees across major training institutes
Hadoop Course fee at Cloudera
Cloudera Hadoop Training :
Hadoop Course fee at HortonWorks and Edureka
$ 2,795 = Rs. 1,73,290
My Contact Details:
Thank You …