What Is Hadoop And Why Deploy It In the Cloud?

Breaking points of traditional approach What if you could handle big data?


Page 1: Breaking points of traditional approach What if you could handle big data?

What Is Hadoop And Why Deploy It In the Cloud?

Page 2

Agenda
• What Is Hadoop?
• Why Deploy To the Cloud?
• Microsoft’s Solution
• How Do I Get Started?

Page 3

Breaking points of traditional approach

Staging

1. Increasing data volumes
• 50x data growth, 2010-2020
• 40 ZB digital universe in 2020
• 1 trillion web pages

Page 4

Breaking points of traditional approach

Staging

1. Increasing data volumes

2. Real-time data
• 204M emails sent every minute
• 340M tweets sent every day
• 231B US e-commerce in 2012

Page 5

Breaking points of traditional approach

Staging

1. Increasing data volumes

2. Real-time data

3. New data types
• 15x machine-generated data, 2020
• 1.3M hours on Skype per hour
• 2.4M Facebook content per minute

Page 6

Breaking points of traditional approach

Staging

1. Increasing data volumes

2. Real-time data

3. New data types

4. Cloud-born data
• $100B spend on cloud
• 50% of large orgs have hybrid by 2017
• 40% of CRM sold is SaaS

Page 7

What if you could handle big data?

[Figure: data volume (megabytes, gigabytes, terabytes, petabytes) plotted against data complexity (variety and velocity)]

ERP/CRM: Payables, Payroll, Inventory, Contacts, Deal Tracking, Sales Pipeline
Web 2.0: Web Logs, Digital Marketing, Search Marketing, Recommendations, Advertising, Mobile, Collaboration, eCommerce
Big Data: Log files, Spatial & GPS coordinates, Data market feeds, eGov feeds, Weather, Text/image, Click stream, Wikis/blogs, Sensors/RFID/devices, Social sentiment, Audio/video

Page 8

Introducing Apache Hadoop
• Apache Open Source Project
• Highly scalable distributed file system (HDFS)
• Distributed processing on data nodes

Page 9

Data volume

Hadoop stores files in a distributed file system
• Storage and computation are distributed across many servers
• Files can be spread out over multiple nodes

Hadoop can store very large amounts of data
• Combined storage resource can grow with demand from a few nodes to thousands of nodes
• Scales out linearly
• Very large files are supported, including those larger than the capacity of a single node
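The idea of spreading one file over several nodes can be illustrated with a small sketch. This is not HDFS code: the function name `place_blocks`, the node names, and the round-robin placement are assumptions made for illustration, and the 128 MB block size and three replicas are assumed values rather than figures from the slides. Real HDFS uses its own rack-aware placement policy.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MB)
REPLICATION = 3                 # assumed number of copies kept per block


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return one entry per block: the list of nodes holding its replicas."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    placement = []
    for block in range(num_blocks):
        # Round-robin placement: a simplification, not HDFS's actual policy.
        replicas = [nodes[(block + r) % len(nodes)] for r in range(replication)]
        placement.append(replicas)
    return placement


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
layout = place_blocks(file_size=1024 * 1024 * 1024, nodes=nodes)  # a 1 GB file
print(len(layout))  # number of blocks the file was split into
print(layout[0])    # nodes holding the first block
```

Because the file is addressed block by block, no single node has to hold all of it, which is why a file can be larger than any one node's capacity. Adding nodes to the list adds capacity without changing the logic.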


Page 10

Data variety

Hadoop stores files (non-relational store)
• Files could have a variety of semi-structured or unstructured data
• Previously, these files may not have been seen as providing value or insights
• Today, new business questions and insights are being uncovered through data science

Sentiment: Understand how your customers feel about your brand and products, right now

Clickstream: Capture and analyze website visitors’ data trails and optimize your website

Sensors: Discover patterns in data streaming automatically from remote sensors and machines

Geographic: Analyze location-based data to manage operations where they occur

Server logs: Research logs to diagnose process failures and prevent security breaches

Unstructured: Understand patterns in files across millions of web pages, emails, and documents

Page 11

Data velocity

Hadoop can stream live data and process it in real time
• Hadoop can act as scalable event stream ingestion
• Hadoop can do near-real-time in-stream processing

[Figure: applications and devices send incoming data over HTTP; data input → event broker → stream processing → outgoing]

Page 12

Hadoop is a platform with a portfolio of projects
• Governed by the Apache Software Foundation (ASF)
• Comprises core services of MapReduce, HDFS, and YARN
• In addition to the core, includes functions across:
  • Data services, which allow you to manipulate and move data (Hive, HBase, Pig, Flume, Sqoop)
  • Operational services, which help manage the cluster (Ambari, Falcon, and Oozie)

[Figure: Hadoop platform components]
Governance and integration (data workflow, lifecycle and governance): Falcon, Sqoop, Flume, NFS, WebHDFS
Data access: Batch (MapReduce); Script (Pig); SQL (Hive/Tez, HCatalog); NoSQL (HBase, Accumulo); Stream (Storm); Search (Solr); Others (Spark, in-memory, ISV engines)
YARN: data operating system
Data management: HDFS (Hadoop Distributed File System), nodes 1 to N
Security (authentication, authorization, accounting, data protection): Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
Operations: Provision, manage, and monitor (Ambari, ZooKeeper); Scheduling (Oozie)

Page 13

A Hadoop distribution is a package of projects
• Tested for consistency across entire package

[Figure: component versions by HDP release]
Components: Knox, Tez, Pig, Hive and HCatalog, Phoenix, Accumulo, Storm, Mahout, Solr, Falcon, Sqoop, Flume, Ambari, Oozie, ZooKeeper, HBase, Hadoop and YARN
Categories: Data management, Data access, Governance and integration, Operations, Security

HDP 2.0 October 2013 2.2.0 0.12.0 0.12.0 0.96.1 0.8.0 1.4.4 1.3.0 1.4.4 3.3.2 3.4.5 .0.4.0

HDP 1.3 May 2013 1.1.2 011.0 0.11.0 0.94.6 0.7.0 1.4.3 1.3.1 1.2.5 3.3.2 3.4.5 .0.4.0

HDP 2.1 April 2014 0.4.0 0.12.1 0.13.0 0.98.0 4.0.0 1.5.1 0.9.1 0.9.0 4.7.2 0.5.0 1.4.4 1.4.0 1.5.1 4.0.0 3.4.5 .0.4.02.4.0

Page 14

With many contributors
• 80 committers to Hadoop core project

Page 15

Business applications of Hadoop

Retail: 360° view of the customer; Analyze brand sentiment; Localized, personalized promotions; Website optimization; Optimal store layout

Financial services: New account risk screens; Fraud prevention; Trading risk; Maximize deposit spread; Insurance underwriting; Accelerate loan processing

Telecom: Call detail records (CDRs); Infrastructure investment; Next product to buy (NPTB); Real-time bandwidth allocation; New product development

Utilities, oil, and gas: Smart meter stream analysis; Slow oil well decline curves; Optimize lease bidding; Compliance reporting; Proactive equipment repair; Seismic image processing

Public sector: Analyze public sentiment; Protect critical networks; Prevent fraud and waste; Crowd-source reporting for repairs to infrastructure; Fulfill open records requests

Manufacturing: Supplier consolidation; Supply chain and logistics; Assembly line quality assurance; Proactive maintenance; Crowd-source quality assurance

Healthcare: Genomic data for medical trials; Monitor patient vitals; Reduce re-admittance rates; Store medical research data; Recruit cohorts for pharmaceutical trials

Page 16

New analytic applications from new data

[Table: industry use cases by data type]
Columns (data types): Sentiment and web; Clickstream and behavior; Machine and sensor; Geographic; Server logs; Structured and unstructured

Financial services
• New account risk screens ✔ ✔
• Trading risk ✔
• Insurance underwriting ✔ ✔ ✔

Telecom
• Call detail records (CDR) ✔ ✔
• Infrastructure investment ✔ ✔
• Real-time bandwidth allocation ✔ ✔ ✔

Retail
• 360° view of the customer ✔ ✔ ✔
• Localized, personalized promotions ✔
• Website optimization ✔

Manufacturing
• Supply chain and logistics ✔
• Assembly line quality assurance ✔
• Crowd-sourced quality assurance ✔

Healthcare
• Use genomic data in medical trials ✔ ✔ ✔
• Monitor patient vitals in real-time

Pharmaceuticals
• Recruit and retain patients for drug trials ✔ ✔
• Improve prescription adherence ✔ ✔ ✔ ✔

Oil and gas
• Unify exploration and production data ✔ ✔ ✔ ✔
• Monitor rig safety in real-time ✔ ✔ ✔

Government
• ETL offload/federal budgetary pressures ✔ ✔
• Sentiment analysis for government programs ✔

Page 17

Agenda
• What Is Hadoop?
• Why Deploy To the Cloud?
• Microsoft’s Solution
• How Do I Get Started?

Page 18

Challenges with implementing Hadoop
• Up-front HW costs
• Capacity planning
• Hadoop expertise

Page 19

Why Cloud + Big Data?

Speed, Scale, Economics
• Always Up, Always On
• Open and flexible
• Time to value
• Data of all Volume, Variety, Velocity
• Massive Compute and Storage
• Deployment expertise

Page 20

Why Hadoop in the Cloud?
• No HW costs ($0)
• Unlimited scale
• Pay what you need
• Deployed in minutes

Page 21

Scenarios For Deploying Hadoop As Hybrid

[Figure: on-premises Hadoop (software, appliances) connected to the cloud]
• Cloud
• Cloud: Develop/POC
• Cloud: Bursting
• Cloud: Backup/archive

Page 22

Agenda
• What Is Hadoop?
• Why Deploy To the Cloud?
• Microsoft’s Solution
• How Do I Get Started?

Page 23

Introducing Azure HDInsight

Page 24

Microsoft contributions to Hadoop
• Hadoop on Windows
• Hadoop 2.2 and 2.4
• 80% data compression with ORC
• Hive 100x Query Speed Up
• 30,000+ code line contributions
• HDFS in Cloud (Azure)
• REEF for Machine Learning
• 10,000+ engineering hours
• Committers to Hadoop

Page 25

Microsoft + Hortonworks

Promoting Open Hadoop
• Engineering alignment
• Corporate alignment
• Field alignment

Page 26

HDInsight Built for Windows or Linux

Customer Choice
• Managed & supported by Microsoft
• Familiarity of Windows
• Re-use common tools, documentation, and samples from the Hadoop/Linux ecosystem
• Add Hadoop projects that were authored on Linux to HDInsight
• Easier transition from on-premises to cloud

Page 27

HDInsight Supports Hive

SQL-like queries on Hadoop data in HDInsight
• HDInsight provides an easy-to-use graphical query interface for Hive
• HiveQL is a SQL-like language (subset of SQL)
• Hive structures include well-understood database concepts such as tables, rows, columns, partitions
• Compiled into MapReduce jobs that are executed on Hadoop
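To make "compiled into MapReduce jobs" concrete, here is a minimal pure-Python sketch of how a grouped count could be split into map, shuffle, and reduce phases. It does not use Hive or Hadoop; the table name `page_views`, the sample rows, and the function names are hypothetical, and the plan Hive actually generates for a query is more involved.

```python
from collections import defaultdict

# Hypothetical rows of a "page_views" table: (page, visitor)
rows = [("home", "a"), ("cart", "b"), ("home", "c"), ("home", "a"), ("cart", "d")]


def map_phase(rows):
    """Emit one (key, 1) pair per row, as a mapper would for COUNT(*)."""
    for page, _visitor in rows:
        yield page, 1


def shuffle(pairs):
    """Group values by key; the framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Conceptually: SELECT page, COUNT(*) FROM page_views GROUP BY page
counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```

On a cluster the map and reduce functions would run in parallel on many nodes, each working on the part of the data stored near it; the sketch runs them in one process only to show the data flow.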

Dramatic performance gains with Stinger/Tez
• Stinger is a Microsoft, Hortonworks and OSS driven initiative to bring interactive queries to Hive
• Brings query execution engine technology from Microsoft SQL Server to Hive
• Performance gains up to 100x
• Microsoft contribution to Apache code

[Figure: sample query run time by release]
• Hive 10: 1400s
• HDP 1.3 / Hive 11: 44.3s (32x speedup)
• HDP 2.0 (Hadoop 2.0): 35.1s (40x speedup)
• HDP 2.1: 15s (100x speedup)

Page 28

HDInsight Supports HBase

[Figure: HBase on a Hadoop cluster. A Name Node and Job Tracker with four Data Nodes and Task Trackers; an HMaster (coordination) with four Region Servers]

NoSQL database on data in HDInsight
• Columnar, NoSQL database
• Runs on top of the Hadoop Distributed File System (HDFS)
• Provides flexibility in that new columns can be added to column families at any time
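The point about adding columns at any time can be shown with a toy in-memory model. This is a sketch of the data model only, not the HBase client API: the class `ToyTable`, its methods, and the sample rows are hypothetical.

```python
from collections import defaultdict


class ToyTable:
    """Toy column-family table: row key -> family -> qualifier -> value."""

    def __init__(self, families):
        self.families = set(families)  # column families are declared up front
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Any qualifier is accepted: a new column needs no schema change.
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family, qualifier):
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)


table = ToyTable(["profile"])
table.put("user1", "profile", "name", "Ada")
table.put("user2", "profile", "name", "Grace")
table.put("user2", "profile", "twitter", "@grace")  # new column, only on this row

print(table.get("user2", "profile", "twitter"))
print(table.get("user1", "profile", "twitter"))  # this row never stored that column
```

Only the column family is fixed in advance. Rows that never store a given column simply have no entry for it, which is what makes sparse, evolving data cheap to hold.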

Page 29

HDInsight Supports Mahout

Machine learning library
• A library of machine learning algorithms to execute on data in HDFS
• Algorithms are not dependent on size of data and can scale with large datasets
• Library includes: Collaborative Filtering, Classification, Clustering, Dimensionality Reduction, Topic Models

Page 30

HDInsight Supports Storm

Stream analytics for near-real-time processing
• Consumes millions of real-time events from a scalable event broker (e.g., Apache Kafka, Azure Event Hubs)
• Performs time-sensitive computation
• Outputs to persistent stores, dashboards, or devices
• Customizable with Java + .NET
• Deeply integrated with Visual Studio
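One common form of time-sensitive computation over an event stream is counting events in fixed time windows. The following is a plain-Python sketch of that idea, not Storm code: the function name, the event tuples, and the 10-second window are assumptions for illustration, and a real topology would process events as they arrive rather than from a list.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key) over fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Integer division assigns each event to the window it falls in.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return counts


# Hypothetical events: (seconds since start, source)
events = [(0, "sensor-a"), (3, "sensor-b"), (9, "sensor-a"),
          (10, "sensor-a"), (14, "sensor-b")]
counts = tumbling_window_counts(events, window_seconds=10)
for (window_start, key), n in sorted(counts.items()):
    print(window_start, key, n)
```

The per-window counts are the kind of result that would then be written to a persistent store or pushed to a dashboard.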

[Figure: stream processing pipeline]
Event producers: applications, devices, sensors, web and social
Collection: field gateways, cloud gateways (web APIs)
Event queuing system: Event Hubs, Kafka/RabbitMQ/ActiveMQ
Transformation (stream processing): Apache Storm on HDInsight, Azure Stream Analytics
Long-term storage (via storage adapters): HDFS, Azure DBs, Azure storage, HBase
Presentation and action: search and query, data analytics (Excel), web/thick client dashboards, live dashboards, devices to take action

Page 31

HDInsight Supports Spark

In-memory processing on multiple workloads
• Single execution model for multiple tasks (SQL queries, streaming, machine learning, and graph)
• Processing up to 100x faster
• Developer friendly (Java, Python, Scala)
• BI tool of choice (Power BI, Tableau, Qlik, SAP)
• Notebook experience (Jupyter/iPython, Zeppelin)

Components: Spark SQL; Spark Streaming; Machine Learning (MLlib); Graph (GraphX)

Page 32

HDInsight Allows You To Add Hadoop Projects

Add Hadoop Projects to HDInsight
• Modify HDInsight clusters with custom script
• Add Apache Hadoop projects to HDInsight
• Documented for Spark, R, Giraph, Solr

Page 33

Microsoft Makes Hadoop Easier

Deep Visual Studio Integration
• Debug Hive jobs through YARN logs or troubleshoot Storm topologies
• Visualize Hadoop clusters, tables, and storage
• Submit Hive queries, Storm topologies (C# or Java spouts/bolts)
• IntelliSense

Page 34

Introducing Azure HDInsight

Page 35

Why Microsoft Azure?

[Figure: Azure data services alongside on-premises Hadoop (software, appliances)]
Azure services: Azure Storage, HDInsight, Data Factory, ML, Stream Analytics, Database, DocumentDB, Search, Event Hubs

Azure Facts
• >4 trillion objects in Azure
• 300,000-1M+ requests per second
• Double compute and storage every 6 months

Page 36

No hardware challenges

HDInsight in the Cloud bypasses hardware costs ($0 HW costs)
• Hardware acquisition
• Hardware maintenance
• Performance tuning

HDInsight in the Cloud bypasses capacity planning (unlimited scale)
• Spin up any number of Hadoop nodes on-demand
• Go from tens of nodes to thousands of nodes

Page 37

Deployed in minutes

HDInsight in the Cloud bypasses deployment expertise
• Hadoop is non-trivial to install and get running across multiple nodes
• Education gap in IT community regarding Hadoop

HDInsight is deployed in minutes
• Spin up any number of Hadoop nodes on-demand
• Up and running in a few clicks (and within minutes)

Page 38

Mission Critical, Enterprise Ready

Managed Hadoop, Backed By An SLA
• Three nines of availability: 99.9% uptime
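As a rough aid for reading the 99.9% figure, the arithmetic below converts an availability percentage into the downtime it still permits. The function name and the choice of a 365-day year and a 30-day month are assumptions for illustration; how an actual SLA measures uptime is defined by its own terms, not by this sketch.

```python
def allowed_downtime_hours(availability_percent, period_hours):
    """Hours of downtime permitted in a period at the given availability."""
    return period_hours * (1 - availability_percent / 100)


per_year = allowed_downtime_hours(99.9, 365 * 24)   # a 365-day year
per_month = allowed_downtime_hours(99.9, 30 * 24)   # a 30-day month

print(round(per_year, 2), "hours per year")
print(round(per_month * 60, 1), "minutes per 30-day month")
```

Under these assumptions, three nines leaves under nine hours of downtime a year, or well under an hour in a 30-day month.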

HDInsight Auto Replicates Data
• Automatic geo-replication of data
• Data only replicates within the same geopolitical boundary (i.e., country, region)

Page 39

Maintenance done for you

Minimal IT resources for upgrades/patching
• OS patching and security updates done automatically
• O/S upgrades
• O/S patching

Minimal IT resources to update Hadoop versions
• Hadoop versions are released rapidly throughout the year
• Always be on the latest version of Hadoop with no effort

HDInsight adds latest version of Hadoop for you
• Oct 2013: HDInsight on Hadoop 1.1.2
• April 2014: HDInsight on Hadoop 2.2
• June 2014: HDInsight on Hadoop 2.4

Page 40

Low Cost

HDInsight is billed by usage
• Billed for usage
• Clusters can be deleted when no longer used

No additional price for support
• Azure Support includes Hadoop support
• What usually costs thousands of dollars per node is included

Page 41

Introducing Azure HDInsight

Page 42

Bringing Hadoop to a billion people

Scalable, manageable, trusted
• 1 billion Microsoft Office users
• Office 365 is our fastest-growing commercial product ever

Excel as the BI tool for everyone
• Connect to HDInsight
• Analyze
• Visualize

Power BI for collaboration & new experiences
• Share
• Ask
• Access

Page 43

Making advanced analytics accessible to Hadoop

Microsoft Azure Machine Learning

[Figure: Azure Machine Learning workflow]
Data sources (cloud and desktop): HDInsight, SQL DB, blobs & tables, SQL Server VM
ML Studio workspace: historical data, test model types, run & refine, results; easily make changes
Microsoft Azure Portal: publish API in minutes
ML API service consumers: web, devices, applications, dashboards

Page 44

“What excites me about what I’m doing with HDInsight is the ability to accelerate discovery to the point that we may be able to find treatments for cancer.”

Wu Feng
Professor of Computer Science
Virginia Tech

Virginia Tech captures data from DNA sequencers that generate 15 PB of genome data each year. Rather than building a supercomputing center costing millions of dollars, Virginia Tech uses Azure and pays only for the compute it uses.

Page 45

BlackBall uses HDInsight to collect point-of-sale (POS) data and new types of data such as customer feedback via social media.

“Before, we thought that people would choose cold drinks and desserts in hot weather. But contrary to our assumptions, in certain outlets we saw an opposite trend.”

Andrew Cheong
Senior Manager
BlackBall

Page 46

Agenda
• What Is Hadoop?
• Why Deploy To the Cloud?
• Microsoft’s Solution
• How Do I Get Started?

Page 47

Get Started
• Read documentation: http://azure.microsoft.com/en-us/documentation/services/hdinsight/
• Learning Map: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-learn-map/
• Microsoft Virtual Academy: http://www.microsoftvirtualacademy.com/training-courses/getting-started-with-microsoft-big-data
• Channel 9 Data Exposed Show: http://channel9.msdn.com/Shows/Data-Exposed
• Try 30 day trial: http://azure.microsoft.com/en-us/pricing/free-trial/

Page 48

© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.