Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics


Shaun Connolly's presentation at SAS Global Conference


  • Page 1 Hortonworks 2014 Distilling Hadoop Patterns of Use Shaun Connolly, Hortonworks @shaunconnolly March 25, 2014
  • Page 2 Our Mission: Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop. Our Commitment: Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process. Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind. Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills. Headquarters: Palo Alto, CA. Employees: 300+ and growing. Reseller Partners.
  • Page 3 Data Continues to Grow Sharply. 2020: Digital universe = 40 Zettabytes. 2012: Digital universe = 20 Zettabytes. 1 Zettabyte (ZB) = 1 billion Terabytes (TB). 2014: 31% of enterprises managing more than 1 Petabyte. Data types: Social Networks; Machine Generated; Documents, Emails; OLTP, ERP, CRM Systems; Geolocation Data; Sensor Data; Web Logs, Click Streams. 85% of growth from new types of data, with machine-generated data increasing 15x. Sources: IDC and IDG Enterprise
  • Page 4 Cameras and microphones widely deployed; New routes to market via intelligent objects; Content and services via connected products; Everything has a URL; Remote sensing of objects and environment; Augmented reality; Situational decision support; Building and infrastructure management. Over 50% of Internet connections are things: 2011: 15+ billion permanent, 50+ billion intermittent; 2020: 30+ billion permanent, >200 billion intermittent. Source: Gartner Keynote at Hadoop Summit 2013
  • Page 5 Harnessing Big Data is transformational to business models. Enables the move from post-transaction, reactive analysis of subsets of data stored in silos to a world of pre-transaction, interactive insights across all data that impacts both the top and bottom lines
  • Page 6 Modern Data Architecture with Hadoop (diagram). DATA SYSTEMS, APPLICATIONS. Repositories, ROOMS. Statistical Analysis; BI / Reporting, Ad Hoc Analysis; Interactive Web & Mobile Applications; Enterprise Applications. EDW, MPP, RDBMS. ENTERPRISE HADOOP: Governance & Integration, Security, Operations, Data Access, Data Management. SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data. OPERATIONS TOOLS: Provision, Manage & Monitor. DEV & DATA TOOLS: Build & Test.
  • Page 7 MDA Unlocks New Approach to Insight. Current Approach: SQL; Single Query Engine; Repeatable Linear Process; Apply schema on write; Dependent on IT. Steps: Determine list of questions; Design solutions; Collect structured data; Ask questions from list; Detect additional questions. Augment with Hadoop: Enterprise Hadoop; Multiple Query Engines; Iterative Process: Explore, Transform, Analyze; Apply schema on read; Support range of access patterns to data stored in HDFS: Batch, Interactive, Real-time, Streaming.
  • Page 8 Schema-on-Write vs. Schema-on-Read. Standard Digital Camera: Zoom & focus first; Capture limited set of pixels; Crop around the focused area. Lytro Lightfield Camera: Capture entire lightfield; Infinite zoom & focus; Crop any captured areas.
  • Page 9 MDA Uses Commodity Compute + Storage. Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure. Chart: Fully Loaded Cost per Raw TB of Data (min-max cost), axis from $0 to $180,000, comparing Cloud Storage, Hadoop, NAS, Engineered System, EDW/MPP and SAN.
  • Page 10 MDA Optimizes Data Warehouse. Current Reality: Analytics 20%, ETL Process 30%, Operations 50%. EDW at capacity; some usage from low value workloads. Older transformed data archived, unavailable for ongoing exploration. Source data often discarded. Augment with Hadoop: Operations 50%, Analytics 50%. Hadoop: Parse, cleanse, apply structure, transform. Free up EDW resources from low value tasks. Keep 100% of source data and historical data for ongoing exploration. Mine data for value after loading it because of schema-on-read.
  • Page 11 Integrating with Existing Investments (diagram). APPLICATIONS; DATA SYSTEM; SOURCES. RDBMS, EDW, MPP. Emerging Sources (Sensor, Sentiment, Geo, Unstructured). HANA; BusinessObjects BI. OPERATIONAL TOOLS; DEV & DATA TOOLS. Existing Sources (CRM, ERP, Clickstream, Logs). INFRASTRUCTURE.
  • Page 12 Powering the Modern Data Architecture. Enables deep insight across a large, broad, diverse set of data at efficient scale. Multi-Use Data Platform: Store all data in one place, process in many ways (1 to n): Batch, Interactive, Real-time, Streaming. Data Lake that contains ALL data; raw sources and any processed data over extended periods of time. YARN: Data Operating System
  • Page 13 How Hadoop? Hadoop can be used to create a data lake, an integrated repository of data from internal and external data sources... Data combined from multiple silos can help your organization find answers to complex questions that no one has previously dared ask or known how to ask. -- Forrester
  • Page 14 The Common Journey with Hadoop (diagram). Axes: SCALE, SCOPE. More data and analytic apps; New Analytic Apps; New types of data; LOB-driven. A Modern Data Architecture: RDBMS, MPP, EDW; Governance & Integration, Security, Operations, Data Access, Data Management.
  • Page 15 Unlock Value in New Types of Data. 1. Social: Understand how people are feeling and interacting right now. 2. Clickstream: Capture and analyze website visitors' data trails and optimize your website. 3. Sensor/Machine: Discover patterns in data streaming from remote sensors and machines. 4. Geographic: Analyze location-based data to manage operations where they occur. 5. Server Logs: Diagnose process failures and prevent security breaches. 6. Unstructured (text, video, pictures, etc.): Understand patterns in files across millions of web pages, emails, and documents. Value + Online archive: Data that was once purged or moved to tape can be stored in Hadoop to discover long-term trends and previously hidden value.
  • Page 16 20 Business Applications of Hadoop, listed as Industry: Use Case (Type of Data). Financial Services: New Account Risk Screens (Text, Server Logs); Trading Risk (Server Logs); Insurance Underwriting (Geographic, Sensor, Text). Telecom: Call Detail Records (CDRs) (Machine, Geographic); Infrastructure Investment (Machine, Server Logs); Real-time Bandwidth Allocation (Server Logs, Text, Social). Retail: 360 View of the Customer (Clickstream, Text); Localized, Personalized Promotions (Geographic); Website Optimization (Clickstream). Manufacturing: Supply Chain and Logistics (Sensor); Assembly Line Quality Assurance (Sensor); Crowdsourced Quality Assurance (Social). Healthcare: Use Genomic Data in Medical Trials (Structured); Monitor Patient Vitals in Real-Time (Sensor). Pharmaceuticals: Recruit and Retain Patients for Drug Trials (Social, Clickstream); Improve Prescription Adherence (Social, Unstructured, Geographic). Oil & Gas: Unify Exploration & Production Data (Sensor, Geographic & Unstructured); Monitor Rig Safety in Real-Time (Sensor, Unstructured). Government: ETL Offload in Response to Federal Budgetary Pressures (Structured); Sentiment Analysis for Government Programs (Social).
  • Page 17 360 Customer View for Home Supply Retailer. Creating Opportunity. Data: Clickstream, Unstructured, Structured. Retail: Major home improvement retailer; >$74B in revenue; >300K employees; >2,200 stores. Problem: Disjoint customer engagement across all channels. Data repositories on website traffic, POS transactions and in-home services exist in separate silos. Unable to perform analytics on customer buying behavior across all channels. Limited ability for targeted marketing to specific segments. Solution: Unified system of engagement via golden record. Golden record enables targeted marketing capabilities: customized coupons, promotions and emails. Deep visibility into all customers and all market segments. Unlocks rich, informed cross-sell & up-sell opportunities.
  • Page 18 Monetize Anonymous & Aggregate Banking Data. Creating Opportunity. Data: Structured, Clickstream, Social & Unstructured. Banking: One of the largest US banks. Problem: Unable to unlock valuable cross-sell banking data. Bank possesses data that indicates larger macro-economic trends, which can be monetized in secondary markets. Data sets are isolated in legacy silos controlled by LOBs. Regulations and company policies protect customer privacy. IT challenged by joining data while guaranteeing anonymity. Solution: Create cross-LOB data lake of de-identified data. Mortgage bankers, consumer bankers, credit card group and treasury bankers have access to the same cross-sell data. Single point of security & privacy for de-identification, masking, encryption, authentication and access control. Interoperability with SAS, Red Hat & Splunk.
  • Page 19 Optimize High-Tech Manufacturing. Improving Efficiency. Data: Sensor. Manufacturing: Digital Storage Devices; >$15B in revenue; >85K employees. Problem: Ineffective root cause analysis on product defects. 200 million digital storage devices manufactured yearly. >10K faulty devices returned by customers every month. Limited data available for root cause analysis means that diagnosing problems is highly manual (physical inspections). Subset of sensor data from QA testing retained 3-12 months. Solution: Created sensor data lake for 10x quality improvement. Repository holds 24 months of data for each device. Manufacturing dashboard allows >1,000 employees to search data, with results returned in less than 1 second. Quality improved 10x: rate down to ~1K faulty devices / month.
  • Page 20 Think Pigabyte, Not Petabyte
  • Page 21 Enabling Hadoop for the Enterprise Journey. 1. Capabilities: Ensure enterprise capabilities are delivered in 100% open source to benefit all. 2. Integration: Interoperable with existing data center investments. 3. Skills: Leverage your existing skills: development, analytics, operations. Diagram: Scale, Scope. More data and analytic apps; New Analytic Apps; New types of data; LOB-driven. A Modern Data Architecture: RDBMS, MPP, EDW; Governance & Integration, Security, Operations, Data Access, Data Management.
  • Page 22 Try Hadoop Today. Get Involved. Download the Hortonworks Sandbox: Learn Hadoop; Build Your Analytic App; Try Hadoop 2. San Jose, CA, June 3 - 5, 2014: REGISTER NOW. Amsterdam, April 2 - 3, 2014: REGISTER NOW.
  • Page 23 Questions? @shaunconnolly
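
The schema-on-write versus schema-on-read contrast from Pages 7 and 8 can be illustrated with a small sketch. This is not taken from the slides: the event records, field names, and both helper functions are hypothetical, and plain Python lists stand in for a warehouse table and for raw files in HDFS. The point it tries to show is that a schema fixed at load time discards fields nobody planned for, while raw data kept as-is can still answer a question that was not on the original list.

```python
import json

# Hypothetical raw click events, stored exactly as they arrived.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart", "ms": 340, "coupon": "SPRING"}',
    '{"user": "u1", "page": "/cart", "ms": 95}',
]


def schema_on_write(raw_lines, columns):
    """Apply a fixed schema at load time; fields outside `columns` are lost."""
    return [{c: json.loads(line).get(c) for c in columns} for line in raw_lines]


def schema_on_read(raw_lines, columns):
    """Leave raw lines untouched and project the requested columns per query."""
    for line in raw_lines:
        record = json.loads(line)
        yield {c: record.get(c) for c in columns}


# Schema-on-write: the table was designed before anyone asked about coupons.
warehouse = schema_on_write(RAW_EVENTS, ["user", "page"])

# Schema-on-read: a new question is answered directly from the raw data.
coupons = [
    row["coupon"]
    for row in schema_on_read(RAW_EVENTS, ["user", "coupon"])
    if row["coupon"] is not None
]
```

In this sketch `warehouse` has no coupon column at all, so the coupon question would require reloading the source, whereas `coupons` is computed from the same raw lines without any reload.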