Introduction: Cloud Computing and Big Data - Hadoop
Presented by: Nagarjuna D.N, SAP CTL, AT&T, Bengaluru
Date: 14-07-2015
Overview
• Cloud Computing Evolution
• Why is Cloud Computing needed?
• Cloud Computing Models
• Cloud Solutions
• Cloud Job Opportunities
• Criteria for Big Data
• Big Data Challenges
• Technologies to process Big Data: Hadoop
• Hadoop History and Architecture
• Hadoop Eco-System
• Hadoop Real-Time Use Cases
• Hadoop Job Opportunities
• Hadoop and SAP HANA Integration
• Summary
Internet of Things (IoT)
Big Data: "One of the reasons is Cloud Computing!"
Cloud Computing (an evolution of the Internet, hidden from the end user)
• Infrastructure is maintained elsewhere, with shared computing resources (servers, storage, and networking) all delivered over the Internet.
• The Cloud delivers a hosting environment that is immediate, flexible, scalable, secure, and available, and that saves corporations money, time, and resources.
Cloud Computing (Cont….)
• In addition, the platform provides on-demand services: always on, anywhere, anytime.
• "Pay-for-what-you-use": billed on a metered basis.
• It is based on utility computing and virtualization.
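To make "pay-for-what-you-use" concrete, here is a minimal metered-billing sketch. All rates and usage figures are hypothetical, chosen only for illustration; real cloud pricing varies by provider and region.

```python
# Metered "pay-for-what-you-use" billing sketch.
# Rates and usage figures are hypothetical, for illustration only.

RATES = {
    "vm_hour": 0.10,           # $ per VM-hour of compute
    "storage_gb_month": 0.02,  # $ per GB-month of storage
    "egress_gb": 0.09,         # $ per GB of outbound traffic
}

def monthly_bill(vm_hours, storage_gb, egress_gb):
    """Compute a metered bill: you pay only for what you actually used."""
    return round(
        vm_hours * RATES["vm_hour"]
        + storage_gb * RATES["storage_gb_month"]
        + egress_gb * RATES["egress_gb"],
        2,
    )

# A server used only 200 hours this month costs far less than
# owning hardware that sits idle the rest of the time.
print(monthly_bill(vm_hours=200, storage_gb=50, egress_gb=10))  # → 21.9
```

The key property is that an unused resource costs nothing, which is exactly what the capital-versus-demand charts in the next slides illustrate.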
Cloud Computing History
Traditional Infrastructure Model
[Chart: forecasted infrastructure demand, capital vs. time]

Acceptable Surplus
[Chart: provisioned capacity tracking forecasted demand with a small surplus]

Actual Infrastructure Model
[Chart: actual infrastructure demand, capital vs. time]

Unacceptable Surplus
[Chart: provisioned capacity far above actual demand]

Unacceptable Deficit
[Chart: actual demand exceeding provisioned capacity]

Utility Infrastructure Model (Concept of Cloud Computing)
[Chart: capacity scaling with actual infrastructure demand]
Cloud Flavors (Service Models)
• IaaS – Infrastructure as a Service
• PaaS – Platform as a Service
• SaaS – Software as a Service
SaaS Examples
IaaS Examples
PaaS Examples
Cloud Deployment Models
• Public Cloud
• Private Cloud
• Hybrid Cloud
• Community Cloud
Cloud Distribution Examined
Enterprise Cloud Solutions
1. Test / Development / QA Platform
o Use cloud infrastructure servers as a test and development platform
2. Disaster Recovery
o Keep images of servers on cloud infrastructure, ready to go in case of a disaster
3. Cloud File Storage
o Backup or archive company data to cloud file storage
4. Load Balancing
o Use cloud infrastructure for overflow management during peak usage times
Enterprise Cloud Solutions (cont)
5. Overhead Control
o Lower overhead costs and make bids more competitive
6. Distributed Network Control and Cost Reporting
o Create an individual private network (VPC) for each subsidiary or contract
7. Rapid Deployment
o Turn up servers immediately to fulfill project timelines
8. Functional IT Labor Shift
o Refocus IT labor expense on revenue-producing activities
Preparing for the Future: Cloud IT Jobs
A sampling of IT skills likely to be in demand in the future:
o Functional application development and support
  i.e. Oracle, SAP, SQL, linking hardware to software
o Leveraging data to make strategic business decisions
  i.e. Business Intelligence: applying sales forecasts to inventory and manufacturing decisions
o Mobile apps: Android, iPhone, Windows Mobile
o Wi-Fi engineers: USF to include broadband communications (LTE replaces GSM/CDMA)
o Optical engineers: optical offers the highest bandwidth today (PON, CWDM, DWDM)
o Virtualization specialists: economies of scale require virtualization (server, storage, client…)
o IP engineers
o Network security specialists
o Web developers
o Social media developers
o Business Intelligence application development and support
IT Cloud infrastructure
"Big Data: Big Thing"
• Big Data is exactly like a Rubik's cube.
• Just like a Rubik's cube, Big Data has many different solutions.
• Take five Rubik's cubes, mix them up the same way, and give them to five different experts.
• They will each solve the cube in seconds.
• But if you watch closely, you will notice that even though the final outcome is the same, the route taken to solve the cube is not.
• Every expert will start at a different place (color) and will try to solve it with a different method.
• It is nearly impossible for two experts to take exactly the same route.
Beginning Big Data
Big Data Definition in general
• Big Data is a collection of data sets that are large and complex in nature.
• They constitute both structured and unstructured data, which grow so large and so fast that they are not manageable by traditional relational database systems (e.g., an RDBMS).
Big Data Technically
i. Volume: petabytes or zettabytes.
ii. Velocity: batch or real-time (stream) processing.
iii. Variety: structured, semi-structured & unstructured. It is estimated that 80% of the world's data is unstructured, and the rest is semi-structured and structured.
iv. Veracity: the quality of the data being captured can vary greatly.
Fig. Big Data, based on Doug Laney's 3Vs model.
Variety of Data
1. Structured Data: data that is identifiable because it is organized in a structure (a standard, defined format).
   E.g.: databases, data warehouses & electronic spreadsheets.
2. Semi-Structured Data: data that is neither raw data nor typed data in a conventional database system.
   E.g.: wiki pages, tweets, Facebook data & instant messages.
3. Unstructured Data: data that doesn't have a standard, defined structure.
   E.g.: data files, audio files, video, graphics & multimedia.
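The three varieties can be illustrated in a few lines of Python using the standard csv and json modules. The sample records are invented for the example; the point is that structured data fits a fixed schema, semi-structured data describes its own (irregular) structure, and unstructured data is just raw bytes.

```python
import csv, json, io

# Structured: rows conform to a fixed, predefined schema.
structured = io.StringIO("id,name,amount\n1,Asha,250\n2,Ravi,175\n")
rows = list(csv.DictReader(structured))
print(rows[0]["name"])  # → Asha

# Semi-structured: self-describing, but fields vary record to record.
semi = '{"user": "ravi", "tags": ["hadoop", "cloud"], "geo": null}'
tweet = json.loads(semi)
print(tweet["tags"])  # → ['hadoop', 'cloud']

# Unstructured: no schema at all; just raw content to interpret later.
unstructured = b"\x89PNG..."  # e.g. the first bytes of an image file
print(len(unstructured))  # → 7
```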
Traditional Data v/s Big Data
Attributes         | Traditional Data            | Big Data
Volume             | Gigabytes to terabytes      | Petabytes to zettabytes
Organization       | Centralized                 | Distributed
Structure          | Structured                  | Semi-structured & unstructured
Data model         | Strict schema-based         | Flat schema
Data relationships | Complex interrelationships  | Almost flat, with few relationships
Criteria of Big Data
1. 72 hours of video are uploaded to YouTube every minute, and over 3 billion hours of video are watched every month.
2. Radio Frequency ID (RFID) systems generate up to 1,000 times more data than conventional bar-code systems.
3. 340 million tweets are sent every day, amounting to 7 TB of data.
4. The social networking site Facebook processes over 10 TB of data every day.
5. Over 5 billion people use cell phones to call, send SMS, email, browse the Internet, and interact via social networking sites.
6. The Square Kilometre Array radio telescope project is expected to receive 700 TB of data per second.
Challenges with Big Data
1. Scaling is costly.
2. A strategy must be in place before you hit the limit of a single computer.
3. Most enterprises respond to scalability needs only when they start facing poor response times and low throughput.
4. Adding hardware to an existing system is manpower-intensive and hence error-prone.
5. Mixed data types, structured and unstructured, make scaling even harder.
Exploring Big Data for business insights
Big Data solutions with Hadoop
Organizations Adopted Big Data
How are Organizations using Big Data Technology?
Feb 14th, 2011: Watson is IBM's supercomputer built using Big Data technology. It is not connected to the Internet while playing, and it processes information much like a human brain.
Tools typically used in Big Data Scenarios
Technology to process Big Data: Hadoop (an open-source software framework written in Java)
• Open-source software: it's free to download, though more and more commercial distributions of Hadoop are becoming available.
• Framework: everything you need to develop and run software applications is provided: programs, connections, etc.
• Distributed storage: the Hadoop framework breaks big data into blocks, which are stored on clusters of commodity hardware.
• Processing power: Hadoop processes large amounts of data concurrently across multiple low-cost computers for fast results.
• Hadoop is a distributed file system plus a processing framework, not a database. It is designed for information in many forms.
• An open-source project started by Doug Cutting, then an employee of Yahoo. "Hadoop" is the name of his son's toy elephant.
• Hosted by the Apache Software Foundation: Apache Hadoop.
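The "distributed storage" bullet above can be sketched as follows: a file is cut into fixed-size blocks, and each block is copied to several DataNodes. A toy 8-byte block size and invented node names are used so the sketch runs instantly; real HDFS defaults to 128 MB blocks and 3 replicas, and its actual placement policy is more sophisticated than this round-robin.

```python
# Sketch of HDFS-style block splitting and replication.
# Block size and node names are toy values for illustration;
# real HDFS defaults to 128 MB blocks and 3 replicas.

BLOCK_SIZE = 8          # bytes (HDFS default: 128 MB)
REPLICATION = 3
NODES = ["node1", "node2", "node3", "node4", "node5"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Break a file into fixed-size blocks, as the NameNode plans it."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    for b in range(len(blocks)):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"a big file that will not fit on one machine")
print(len(blocks))                # → 6 blocks of up to 8 bytes each
print(place_replicas(blocks)[0])  # → ['node1', 'node2', 'node3']
```

Because every block lives on three different machines, losing any one machine loses no data, which is the basis of the fault tolerance described later.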
Hadoop Creation History
Hadoop Architecture
Hadoop core has two major components (daemons):
1. HDFS
   a. NameNode
   b. Secondary NameNode
   c. DataNode
2. MapReduce Engine (distributed data processing framework)
   a. JobTracker
   b. TaskTracker
What components make up Hadoop?
• Hadoop Common – the libraries and utilities used by other Hadoop modules.
• Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across multiple machines without prior organization.
• MapReduce – a software programming model for processing large sets of data in parallel.
• YARN – resource management framework for scheduling and handling resource requests from distributed applications. (YARN is an acronym for Yet Another Resource Negotiator.)
Hadoop Architecture
[Diagram: the master node runs the JobTracker (MapReduce) and the NameNode (HDFS); each slave node runs a TaskTracker and a DataNode.]
[Diagram: HDFS topology: nodes are grouped into racks, racks form a cluster, and clusters reside in a data center.]
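The rack/cluster topology above matters because HDFS places replicas rack-aware: by default the first replica goes on the writer's node, and the second and third on two nodes of a different rack, so data survives the loss of an entire rack. A minimal sketch, with invented rack and node names:

```python
# Sketch of HDFS's default rack-aware replica placement:
# replica 1 on the writer's node, replicas 2 and 3 on two nodes
# of a different rack. Rack and node names are invented.

RACKS = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
}

def rack_of(node):
    """Find which rack a node belongs to."""
    return next(r for r, nodes in RACKS.items() if node in nodes)

def place(writer_node):
    """Choose three replica locations for a block written at writer_node."""
    local_rack = rack_of(writer_node)
    remote_rack = next(r for r in RACKS if r != local_rack)
    remote = RACKS[remote_rack]
    # 1st replica local; 2nd and 3rd on two nodes of one remote rack.
    return [writer_node, remote[0], remote[1]]

replicas = place("n2")
print(replicas)                             # → ['n2', 'n4', 'n5']
print(len({rack_of(n) for n in replicas}))  # → 2 (spans two racks)
```

Spanning two racks balances safety (a whole rack can fail) against write cost (only one copy crosses the inter-rack link).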
MapReduce Example
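The example slide here is an image in the original deck. The canonical MapReduce example is word count; below is a plain-Python simulation of the three phases (map, shuffle, reduce). This is a sketch of the programming model only, not the Hadoop Java API: in real Hadoop, map tasks run on many nodes, the framework performs the shuffle and sort, and reduce tasks aggregate per key.

```python
from collections import defaultdict

# MapReduce word count, simulated in a single process.

def mapper(line):
    """Map phase: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(word, counts):
    """Reduce phase: sum the counts for one word."""
    return (word, sum(counts))

lines = ["hadoop stores big data", "hadoop processes big data in parallel"]
mapped = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(w, c) for w, c in shuffle(mapped).items())
print(result["hadoop"], result["big"], result["parallel"])  # → 2 2 1
```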
Benefits of Hadoop
• Scalable: new nodes can be added without needing to change data formats.
• Cost-effective: Hadoop brings massively parallel computing to commodity hardware.
• Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources.
• Fault-tolerant: when you lose a node, the system redirects work to another location of the data and continues processing without missing a heartbeat.
• Programming languages: Java (default) / Python.
• Last but not least: it's free (open source).
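The fault-tolerance bullet above can be made concrete: because each block is replicated on several DataNodes, the loss of a single node never blocks a read. A minimal sketch, with invented block and node names:

```python
# Fault-tolerance sketch: every block has replicas on several nodes,
# so reads survive the loss of any single node. Names are invented.

replicas = {
    "block0": {"node1", "node2", "node3"},
    "block1": {"node2", "node3", "node4"},
}
live_nodes = {"node1", "node2", "node3", "node4"}

def read_block(block):
    """Return a live node holding the block, as the NameNode would."""
    candidates = replicas[block] & live_nodes
    if not candidates:
        raise IOError(f"all replicas of {block} lost")
    return sorted(candidates)[0]

live_nodes.discard("node2")  # a DataNode dies...
print(read_block("block0"))  # → node1 (work is redirected)
print(read_block("block1"))  # → node3 (two replicas still remain)
```

In real HDFS the NameNode additionally re-replicates the under-replicated blocks onto healthy nodes, restoring the replication factor in the background.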
Hadoop is not Suitable for All Kinds of Applications
Hadoop is not suitable to:
• perform real-time, stream-based processing where data is processed immediately upon its arrival.
• perform online access where low latency is required.
Hadoop Eco-System
Real-Time Hadoop Use Cases
1. Risk Modeling (How can banks understand customers & markets?)
2. Customer churn analysis (Why do companies really lose customers?)
3. Ad Targeting (How can companies increase campaign efficiency?)
4. Point-of-sale transaction analysis (How do retailers target promotions guaranteed to make you buy?)
5. Search quality (What's in your search?)
Hadoop Job Opportunities
Apache Hadoop & SAP HANA Integration(Future Generation Technologies)
In Real-Time Business
Resources
Summary
o Cloud Computing
o Big Data
o Apache Hadoop
o Hadoop and SAP HANA integration
Thank You
More Details
Nagarjuna D.N
[email protected]
[email protected]
More Cloud Solutions Architect Skills:
• Amazon Cloud (Amazon Web Services)
• MongoDB (NoSQL Database)
• Play Framework (Web Application Framework)
• Domain/ SSL Certificate setup
• Apache Hadoop, Apache Pig, Apache Hive
Your Valuable Feedback, Please
• Especially tell me where I must improve!