
By: Shrikant Gawande (Cloudera Certified)

What is Big Data?

Every 30 minutes of flying time, an airline jet collects 10 terabytes of sensor data.

The NYSE generates about one terabyte of new trade data per day, used to perform stock-trading analytics that determine trends for optimal trades.

Facebook users spend 10.5 billion minutes (almost 20,000 years) online on the social network.

An average of 3.2 billion likes and comments are posted on Facebook every day.

Facebook Example

Twitter Example

Twitter has over 500 million registered users.

The USA's 141.8 million accounts represent 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK, and Indonesia.

79% of US Twitter users are more likely to recommend brands they follow.

67% of US Twitter users are more likely to buy from brands they follow.

57% of all companies that use social media for business use Twitter.

Hadoop is being used across industries

Industries using Hadoop

Source: Karmasphere

Why Learn Big Data?

What Big Companies Have To Say…

Data Volume Is Growing Exponentially

Estimated Global Data Volume:
2011: 1.8 ZB
2015: 7.9 ZB
The world's information doubles every two years.

Over the next 10 years:
The number of servers worldwide will grow by 10x.
The amount of information managed by enterprise data centers will grow by 50x.
The number of “files” enterprise data centers handle will grow by 75x.

Source: http://www.emc.com/leadership/programs/digital-universe.htm, which was based on the 2011 IDC Digital Universe Study

IBM’s Definition

IBM’s definition – Big Data Characteristics

http://www-1.ibm.com/software/data/bigdata/

A collection of large and complex data sets which are difficult to process using common database management tools or traditional data processing applications.

Big Data is the amount of data that is beyond the storage and processing capabilities of a single physical machine.

Data that has an extra-large volume, comes from a variety of sources in a variety of formats, and arrives with great velocity is normally referred to as Big Data.

It is more often unstructured data than structured data.

A Traditional Approach Under Pressure

Why Big Data?

Enterprise data: ERP and CRM data (a few TBs)

What Data We Have Been Adding in the Last 3-4 Years

Customer experience
Click streams
Online campaigns
Banner ads – capturing every click (100s of TBs)
User-entered data
Search – in-product search
Social media – to understand general sentiment

Industry Use Cases and Types of Data

Financial Services
New Account Risk Screens: Text, Server Logs
Trading Risk: Server Logs
Insurance Underwriting: Geographic, Sensor, Text

Telecom
Call Detail Records (CDR): Machine, Geographic
Infrastructure Investment: Machine, Server Logs
Real-Time Bandwidth Allocation: Server Logs, Text, Social

Retail
360 Degree View of Customer: Clickstream, Text
Localized, Personal Promotion: Geographic
Website Optimization: Clickstream

Manufacturing
Supply Chain and Logistics: Sensor
Assembly Line Quality Assurance: Sensor
Crowd-sourced Quality Assurance: Social

Healthcare
Use Genomic Data in Medical Trials: Structured
Monitor Patient Vitals in Real Time: Sensor

Pharmaceuticals
Recruit and Retain Patients for Drug Trials: Social, Clickstream
Improve Prescription Adherence: Social, Unstructured, Geographic

Oil and Gas
Unify Exploration and Production Data: Sensor, Unstructured, Geographic
Monitor Rig Safety in Real Time: Sensor, Unstructured

Common Business Applications

How can we find products that customers are interested in BUT DON'T BUY?

Leveraging ALL Business Data

How to Extract Insights from 9 TBs of Web Logs?

How do you make sense of this?

Leveraging ALL Business Data

How to Extract Insights from 9 TBs of Web Logs?

What did users do when they came to our web site?
Which products did they view?
Which products were seen but not purchased? Why? New offerings based on past data?
In the first line, the user has viewed some product with a particular ID.

Leveraging ALL Business Data

How to Extract Insights from 9 TBs of Web Logs? (Contd.)

The visitor views a 2nd product. We want to do this not just for one customer but for all customers (a minimal sketch of this kind of clickstream analysis follows below).
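As a toy illustration (not part of the original deck), the sketch below assumes a hypothetical clickstream record format "visitorId,action,productId" and answers the "viewed but not purchased" question for each visitor in plain Java; on a real 9 TB log the same grouping logic would be expressed as a distributed job rather than an in-memory loop.

```java
import java.util.*;

// Minimal sketch: hypothetical clickstream records "visitorId,action,productId",
// where action is VIEW or BUY. Finds products each visitor viewed but never bought.
public class ViewedNotBought {
    public static void main(String[] args) {
        List<String> log = Arrays.asList(          // stand-in for lines read from a web log
            "v1,VIEW,p100",
            "v1,VIEW,p200",
            "v1,BUY,p200",
            "v2,VIEW,p100"
        );

        Map<String, Set<String>> viewed = new HashMap<>();   // visitor -> viewed products
        Map<String, Set<String>> bought = new HashMap<>();   // visitor -> bought products

        for (String line : log) {
            String[] f = line.split(",");
            String visitor = f[0], action = f[1], product = f[2];
            Map<String, Set<String>> target = action.equals("BUY") ? bought : viewed;
            target.computeIfAbsent(visitor, k -> new HashSet<>()).add(product);
        }

        // For every visitor, report products that were viewed but never purchased.
        for (Map.Entry<String, Set<String>> e : viewed.entrySet()) {
            Set<String> notBought = new HashSet<>(e.getValue());
            notBought.removeAll(bought.getOrDefault(e.getKey(), Collections.emptySet()));
            System.out.println(e.getKey() + " viewed but did not buy: " + notBought);
        }
    }
}
```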

Hidden Treasure

Insight into data can provide a business advantage.
Some key early indicators can mean fortunes for a business.
More precise analysis is possible with more data.
New offerings to the customer.

Limitations of Existing Data Analytics Architecture

Solution: A Combined Storage and Compute Layer

Differentiating factors

Some of the Hadoop Users

Why DFS?

What is Hadoop?

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.

Hadoop Key Characteristics

Hadoop History

Hadoop Eco-System

Hadoop Core Components

HDFS – Hadoop Distributed File System (Storage)
Distributed across “nodes”
Natively redundant
NameNode tracks locations

MapReduce (Processing)
Splits a task across processors, “near” the data, and assembles results (see the sketch after this list)
Self-healing, high-bandwidth, clustered storage
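To make the storage/processing split concrete, here is a minimal, conventional word-count job using the standard Hadoop MapReduce API (a generic illustration, not code from the deck): HDFS stores the input blocks across DataNodes, mappers are scheduled near those blocks, and reducers assemble the per-word totals. The input and output paths are placeholders supplied on the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: mappers run near the HDFS blocks, reducers assemble results.
public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);          // emit (word, 1) for every token
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));  // total occurrences of each word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);    // pre-aggregate on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It would typically be packaged into a jar and submitted with something like: hadoop jar wordcount.jar WordCount /input /output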

Hadoop Core Components (contd.)

HDFS Architecture

Main Components of HDFS

NameNode
The master of the system
Maintains and manages the blocks which are present on the DataNodes

DataNodes
Slaves which are deployed on each machine and provide the actual storage
Responsible for serving read and write requests for the clients

NameNode and DataNode

NameNode Metadata

Meta-data in Memory

• The entire metadata is in main memory

• No demand paging of FS meta-data

Types of Metadata

• List of files

• List of Blocks for each file

• List of DataNodes for each block

• File attributes, e.g. access time, replication factor

A Transaction Log

• Records file creations, file deletions, etc. (a client-side view of this metadata is sketched below)
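As a small client-side sketch (an assumed example, not from the deck), the snippet below uses the standard HDFS Java API to ask the NameNode for exactly the metadata listed above: file attributes, the block list for each file, and the DataNodes holding each block. The directory /user/demo is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists files under an HDFS directory along with the metadata the NameNode keeps:
// file attributes, block list, and the DataNodes that store each block.
public class ListHdfsMetadata {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle to the cluster

        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            // File attributes held in the NameNode's in-memory metadata
            System.out.printf("%s  len=%d  repl=%d  accessTime=%d%n",
                    status.getPath(), status.getLen(),
                    status.getReplication(), status.getAccessTime());

            if (status.isFile()) {
                // Block list for the file and the DataNodes holding each block
                BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
                for (BlockLocation b : blocks) {
                    System.out.println("  block offset=" + b.getOffset()
                            + " hosts=" + String.join(",", b.getHosts()));
                }
            }
        }
        fs.close();
    }
}
```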

Storage: NameNode and DataNode. Processing: JobTracker and TaskTracker.

(Diagram: daemons distributed across hosts H1, H2, H3, H4)

Poll - 01

Poll - 02

Poll - 03

Poll - 04

Poll - 05

Hadoop Courses and Their Fees Across Major Training Institutes…

Hadoop Course fee at Cloudera

Cloudera Hadoop Training:

Hadoop Course fee at HortonWorks and Edureka

$2,795 = Rs. 1,73,290

My Contact Details:

Thank You …
