Transcript
Page 1: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Page 1 Hortonworks © 2014

Distilling Hadoop Patterns of Use Shaun Connolly, Hortonworks @shaunconnolly

March 25, 2014

Page 2: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Our Mission: Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop

Our Commitment

Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process

Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind

Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills

Headquarters: Palo Alto, CA
Employees: 300+ and growing

Reseller Partners

Page 3: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Data Continues to Grow Sharply

2020: Digital universe = 40 Zettabytes

2012: Digital universe = 20 Zettabytes

1 Zettabyte (ZB) = 1 billion Terabytes (TB)

2014: 31% of enterprises managing more than 1 Petabyte

Social Networks

Machine Generated

Documents, Emails

OLTP, ERP, CRM Systems

Geolocation Data

Sensor Data

Web Logs, Click Streams

85% of growth from new types of data, with machine-generated data increasing 15x

Sources: IDC and IDG Enterprise
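
As a quick check of the unit conversion on this slide, the sketch below works in decimal (SI) units. That choice is an assumption, since the slide does not say whether it means decimal or binary prefixes, and the constant and function names are illustrative only.

```python
# Decimal (SI) byte units; assumed, since the slide does not specify.
TB = 10**12  # terabyte
PB = 10**15  # petabyte
ZB = 10**21  # zettabyte


def convert(value, from_unit, to_unit):
    """Convert a quantity from one byte unit to another."""
    return value * from_unit / to_unit


# 1 ZB expressed in TB: one billion, matching the slide.
tb_per_zb = convert(1, ZB, TB)

# The slide's 2020 figure (40 ZB) expressed in petabytes.
pb_in_2020 = convert(40, ZB, PB)
```

Seen this way, an enterprise managing 1 Petabyte holds one millionth of a Zettabyte.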

Page 4: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Cameras and microphones widely deployed

New routes to market via intelligent objects

Content and services via connected products

Everything has a URL

Remote sensing of objects and environment

Augmented reality

Situational decision support

Building and infrastructure management

Over 50% of Internet connections are things:
2011: 15+ billion permanent, 50+ billion intermittent
2020: 30+ billion permanent, >200 billion intermittent

Source: Gartner Keynote at Hadoop Summit 2013

Page 5: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Harnessing Big Data is transformational to business models

Enables the move from post-transaction, reactive analysis of subsets of data stored in silos to a world of pre-transaction, interactive insights across all data that impacts both the top and bottom lines

Page 6: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Modern Data Architecture with Hadoop

SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data

DATA SYSTEMS: Repositories (RDBMS, EDW, MPP) and ENTERPRISE HADOOP (Data Access; Data Management; Governance & Integration; Security; Operations)

APPLICATIONS: Statistical Analysis; BI / Reporting, Ad Hoc Analysis; Interactive Web & Mobile Applications; Enterprise Applications

OPERATIONS TOOLS: Provision, Manage & Monitor

DEV & DATA TOOLS: Build & Test

Page 7: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

MDA Unlocks New Approach to Insight

Current Approach: Apply schema on write; dependent on IT

SQL: Single Query Engine; Repeatable Linear Process

Augment with Hadoop: Apply schema on read; support range of access patterns to data stored in HDFS

Enterprise Hadoop: Multiple Query Engines; Iterative Process: Explore, Transform, Analyze

Determine list of questions

Design solutions

Collect structured data

Ask questions from list

Detect additional questions

Batch, Interactive, Real-time, Streaming

Page 8: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Schema-on-Write vs. Schema-on-Read

Standard Digital Camera
- Zoom & focus first
- Capture limited set of pixels
- Crop around the focused area

Lytro Lightfield Camera
- Capture entire lightfield
- Infinite zoom & focus
- Crop any captured areas
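
The camera analogy can be made concrete with a minimal Python sketch. The event records, field names, and the `project` helper are hypothetical, invented for illustration; nothing here is Hadoop-specific. With schema-on-write, the columns are chosen at load time and the rest is lost. With schema-on-read, the raw records are kept, so a question nobody planned for can still be answered later.

```python
import json

# Hypothetical raw click events, stored exactly as they arrived.
RAW_EVENTS = [
    '{"user": "a1", "page": "/home", "ms": 120, "referrer": "search"}',
    '{"user": "b2", "page": "/cart", "ms": 340}',
]


def project(raw_lines, columns):
    """Parse each raw JSON line and keep only the requested columns."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({col: record.get(col) for col in columns})
    return rows


# Schema-on-write: the schema is fixed at load time; only this table is kept.
warehouse_table = project(RAW_EVENTS, ["user", "page"])

# Schema-on-read: the raw lines are retained, so a new schema can be applied
# later, here to a field the original table never captured.
referrers = project(RAW_EVENTS, ["user", "referrer"])
```

The `warehouse_table` rows cannot answer a question about referrers, while the retained raw events can.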

Page 9: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

MDA Uses Commodity Compute + Storage

Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure

[Chart: Fully Loaded Cost per Raw TB of Data (min – max cost), on an axis from $0 to $180,000, comparing Cloud Storage, Hadoop, NAS, Engineered System, EDW/MPP, and SAN]

Page 10: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

MDA Optimizes Data Warehouse

Current Reality (EDW workload: Analytics 20%, ETL Process 30%, Operations 50%)
- EDW at capacity; some usage from low value workloads
- Older transformed data archived, unavailable for ongoing exploration
- Source data often discarded

Augment with Hadoop (EDW workload: Analytics 50%, Operations 50%; Hadoop: parse, cleanse, apply structure, transform)
- Free up EDW resources from low value tasks
- Keep 100% of source data and historical data for ongoing exploration
- Mine data for value after loading it because of schema-on-read
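
The "parse, cleanse, apply structure, transform" work that moves off the EDW can be sketched in a few lines of plain Python. The log format, field names, and helper functions below are assumptions made for illustration, and a real deployment would run this kind of logic on a Hadoop processing engine instead of in a single process.

```python
from collections import Counter

# Hypothetical raw web-log lines: timestamp, client IP, path, HTTP status.
RAW_LOG = [
    "2014-03-25T10:00:01 10.0.0.1 /home 200",
    "2014-03-25T10:00:02 10.0.0.2 /cart 500",
    "malformed line",
    "2014-03-25T10:00:03 10.0.0.1 /home 200",
]


def parse(line):
    """Parse one line and apply structure; return None if it is malformed."""
    parts = line.split()
    if len(parts) != 4 or not parts[3].isdigit():
        return None
    timestamp, ip, path, status = parts
    return {"ts": timestamp, "ip": ip, "path": path, "status": int(status)}


def successful_hits_per_path(lines):
    """Cleanse (drop malformed lines), then count status-200 hits per path."""
    records = [rec for rec in map(parse, lines) if rec is not None]
    return dict(Counter(r["path"] for r in records if r["status"] == 200))


summary = successful_hits_per_path(RAW_LOG)
```

Only the small `summary` result would need loading into the warehouse; the raw lines stay available for later exploration.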

Page 11: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Integrating with Existing Investments

APPLICATIONS: BusinessObjects BI

DATA SYSTEM: RDBMS, EDW, MPP, HANA

SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)

OPERATIONAL TOOLS

DEV & DATA TOOLS

INFRASTRUCTURE

Page 12: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Powering the Modern Data Architecture

   

Enables deep insight across a large, broad, diverse set of data at efficient scale

Multi-Use Data Platform: Store all data in one place, process in many ways

[Diagram: applications 1 through n running Batch, Interactive, Real-time, and Streaming workloads on YARN: Data Operating System]

Data Lake that contains ALL data; raw sources and any processed data over extended periods of time.

Page 13: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

How Hadoop?

“Hadoop can be used to create a ‘data lake’ – an integrated repository of data from internal and external data sources... Data combined from multiple silos can help your organization find answers to complex questions that no one has previously dared ask or known how to ask.”

-- Forrester

Page 14: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

The Common Journey with Hadoop

[Diagram: SCALE and SCOPE axes, with stages labeled "New Analytic Apps (new types of data, LOB-driven)", "More data and analytic apps", and "A Modern Data Architecture" (RDBMS, MPP, EDW; Governance & Integration, Security, Operations, Data Access, Data Management)]

Page 15: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Unlock Value in New Types of Data

1. Social: Understand how people are feeling and interacting – right now

2. Clickstream: Capture and analyze website visitors’ data trails and optimize your website

3. Sensor/Machine: Discover patterns in data streaming from remote sensors and machines

4. Geographic: Analyze location-based data to manage operations where they occur

5. Server Logs: Diagnose process failures and prevent security breaches

6. Unstructured (text, video, pictures, etc.): Understand patterns in files across millions of web pages, emails, and documents

Value

+ Online archive: Data that was once purged or moved to tape can be stored in Hadoop to discover long-term trends and previously hidden value

Page 16: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

20 Business Applications of Hadoop

Industry | Use Case | Type of Data
Financial Services | New Account Risk Screens | Text, Server Logs
Financial Services | Trading Risk | Server Logs
Financial Services | Insurance Underwriting | Geographic, Sensor, Text
Telecom | Call Detail Records (CDRs) | Machine, Geographic
Telecom | Infrastructure Investment | Machine, Server Logs
Telecom | Real-time Bandwidth Allocation | Server Logs, Text, Social
Retail | 360° View of the Customer | Clickstream, Text
Retail | Localized, Personalized Promotions | Geographic
Retail | Website Optimization | Clickstream
Manufacturing | Supply Chain and Logistics | Sensor
Manufacturing | Assembly Line Quality Assurance | Sensor
Manufacturing | Crowdsourced Quality Assurance | Social
Healthcare | Use Genomic Data in Medical Trials | Structured
Healthcare | Monitor Patient Vitals in Real-Time | Sensor
Pharmaceuticals | Recruit and Retain Patients for Drug Trials | Social, Clickstream
Pharmaceuticals | Improve Prescription Adherence | Social, Unstructured, Geographic
Oil & Gas | Unify Exploration & Production Data | Sensor, Geographic & Unstructured
Oil & Gas | Monitor Rig Safety in Real-Time | Sensor, Unstructured
Government | ETL Offload in Response to Federal Budgetary Pressures | Structured
Government | Sentiment Analysis for Government Programs | Social

Page 17: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

360° Customer View for Home Supply Retailer

Problem: Disjoint customer engagement across all channels
- Data repositories on website traffic, POS transactions and in-home services exist in separate silos
- Unable to perform analytics on customer buying behavior across all channels
- Limited ability for targeted marketing to specific segments

Solution: Unified system of engagement via “golden record”
- Golden record enables targeted marketing capabilities: customized coupons, promotions and emails
- Deep visibility into all customers and all market segments
- Unlocks rich, informed cross-sell & up-sell opportunities

Creating Opportunity
Data: Clickstream, Unstructured, Structured

Retail: Major home improvement retailer
- >$74B in revenue
- >300K employees
- >2,200 stores

Page 18: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Monetize Anonymous & Aggregate Banking Data

Problem: Unable to unlock valuable cross-sell banking data
- Bank possesses data that indicates larger macro-economic trends, which can be monetized in secondary markets
- Data sets are isolated in legacy silos controlled by LOBs
- Regulations and company policies protect customer privacy
- IT challenged by joining data while guaranteeing anonymity

Solution: Create cross-LOB data lake of de-identified data
- Mortgage bankers, consumer bankers, credit card group and treasury bankers have access to the same cross-sell data
- Single point of security & privacy for de-identification, masking, encryption, authentication and access control
- Interoperability with SAS, Red Hat & Splunk

Creating Opportunity
Data: Structured, Clickstream, Social & Unstructured

Banking: One of the largest US banks
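
One common way to join data across lines of business without exposing identities is to replace the customer identifier with a keyed hash. The sketch below illustrates that general technique with Python's standard `hmac` module; it is not the bank's actual method, and the key, field names, and records are hypothetical. Keyed hashing alone is pseudonymization and does not by itself guarantee anonymity.

```python
import hashlib
import hmac

# Hypothetical secret; a real system would keep this in a key manager.
SECRET_KEY = b"demo-key-not-for-production"


def pseudonymize(customer_id, key=SECRET_KEY):
    """Return a stable keyed hash (HMAC-SHA256) of a customer identifier."""
    return hmac.new(key, customer_id.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record, id_field="customer_id", drop_fields=("name",)):
    """Drop direct identifiers and replace the ID with its pseudonym."""
    cleaned = {k: v for k, v in record.items() if k not in drop_fields}
    cleaned[id_field] = pseudonymize(record[id_field])
    return cleaned


# Records from two hypothetical lines of business for the same customer.
mortgage = deidentify({"customer_id": "C-1001", "name": "Pat Doe", "balance": 250000})
card = deidentify({"customer_id": "C-1001", "name": "Pat Doe", "spend": 1800})
```

Because the same key yields the same pseudonym, the `mortgage` and `card` records can still be joined in the data lake without carrying the name or the original ID.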

Page 19: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Optimize High-Tech Manufacturing

Problem: Ineffective root cause analysis on product defects
- 200 million digital storage devices manufactured yearly
- >10K faulty devices returned by customers every month
- Limited data available for root cause analysis means that diagnosing problems is highly manual (physical inspections)
- Subset of sensor data from QA testing retained 3-12 months

Solution: Created sensor data lake for 10x quality improvement
- Repository holds 24 months of data for each device
- Manufacturing dashboard allows >1,000 employees to search data, with results returned in less than 1 second
- Quality improved 10x: rate down to ~1K faulty devices / month

Improving Efficiency
Data: Sensor

Manufacturing: Digital Storage Devices
- >$15B in revenue
- >85K employees
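
The 10x claim follows directly from the figures on this slide. The sketch below treats ">10K" and "~1K" as exactly 10,000 and 1,000 returns per month, which is a simplifying assumption, and the function name is illustrative.

```python
DEVICES_PER_YEAR = 200_000_000  # from the slide


def annual_return_rate(faulty_per_month, devices_per_year=DEVICES_PER_YEAR):
    """Fraction of a year's output returned as faulty."""
    return faulty_per_month * 12 / devices_per_year


before = annual_return_rate(10_000)  # roughly 0.06% of devices
after = annual_return_rate(1_000)    # roughly 0.006% of devices
improvement = before / after         # roughly a tenfold reduction
```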

Page 20: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Think Pigabyte, Not Petabyte

Page 21: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Enabling Hadoop for the Enterprise Journey

1. Capabilities: Ensure enterprise capabilities are delivered in 100% open source to benefit all

2. Integration: Interoperable with existing data center investments

3. Skills: Leverage your existing skills: development, analytics, operations

[Diagram: Scale and Scope axes, with stages labeled "New Analytic Apps (new types of data, LOB-driven)", "More data and analytic apps", and "A Modern Data Architecture" (RDBMS, MPP, EDW; Governance & Integration, Security, Operations, Data Access, Data Management)]

Page 22: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Try Hadoop Today…

Download the Hortonworks Sandbox
- Learn Hadoop
- Build Your Analytic App
- Try Hadoop 2

Get Involved
- San Jose, CA: June 3 - 5, 2014
- Amsterdam: April 2 - 3, 2014

Page 23: Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics

Questions? @shaunconnolly

