154
The Journey to Becoming a Data-Driven Enterprise Pivotal Big Data Roadshow 2015

Driving Real Insights Through Data Science

  • Upload
    pivotal

  • View
    1.653

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Driving Real Insights Through Data Science

The Journey to Becoming a Data-Driven Enterprise

Pivotal Big Data Roadshow 2015

Page 2: Driving Real Insights Through Data Science

2© Copyright 2015 Pivotal. All rights reserved.

Where we’re going today…

3 Great Keynotes• Journey to a Data-driven Enterprise

• Data Science Use Cases

• Streaming Data and Predictive Analytics

Stock Inference Demo and Architecture Overview

Intensive hands-on training sessions

Page 3: Driving Real Insights Through Data Science

3© Copyright 2015 Pivotal. All rights reserved.

Today’s Agenda

12:00 PM – 12:45 PM - Check-In & Lunch

12:45 PM – 1:00 PM – Welcome and Agenda Review

1:00 PM – 1:PM AM – How Pivotal’s Tools Help Drive Value from Data Science

1:20 PM – 2:20 PM – Accelerating the Generation of New Insights – R&D Use Case

Review and Demo

2:20 PM – 2:30 PM – Coffee Break

2:30 PM – 3:00 PM – Manufacturing Use Case Review and Demo

3:00 PM – 3:15 PM – Closing Remarks

Page 4: Driving Real Insights Through Data Science

© Copyright 2015 Pivotal. All rights reserved.

MASHING BIG DATA WITH BIG MACHINES IS ‘BEAUTIFUL, DESIRABLE, INVESTABLE’ - IT COULD TRANSFORM GE'S BUSINESS -

AND THE ECONOMY.

JEFF IMMELT, CEO, GE

”JEFF IMMELT, CEO, GE

Page 5: Driving Real Insights Through Data Science

© Copyright 2015 Pivotal. All rights reserved.

THE POWER OF 1

RX

IncreasingFreight Utilization Rail

PredictiveMaintenance Healthcare

PredictiveDiagnostics Power

Driving Outcomes That Matter

One Percent Improvement Equals$27B

Industry Value byReducing System

Inefficiency

$63BIndustry Value byReducing Process

Inefficiency

$66BIndustry Value with

Efficiency ImprovementsIn Gas-fired Power

Plant FleetsSource: General Electric

Page 6: Driving Real Insights Through Data Science

© Copyright 2015 Pivotal. All rights reserved.

DATA-DRIVEN ENTERPRISE JOURNEY

STORE• Structured

• Unstructured

• High Volume

• High Velocity

ANALYZE• Predictive Analytics

• Machine Learning

• Advance Data Science

• Realtime Analytics

DEVELOP• Advanced Analytic Pipelines

• Realtime Analytical Applications

• Global Scale Data-Driven Applications

• Enterprise, Consumer, and Mobile

INNOVATE• Agile Dev Expertise

• DevOps

• Microservice

• Continuous Delivery

• Closed Loop Applications

AGILE DEVELOPMENT

BIG DATA PREDICTIVE ANALYTICS

CLOUD NATIVE PLATFORM

Page 7: Driving Real Insights Through Data Science

8© Copyright 2015 Pivotal. All rights reserved.

0% of CIOs think their IT infrastructure is fully prepared for big data (3)

30% of companies have deployed advanced analytics, 11% big data analysis (4)

44% of new applications failed to meet performance expectations (5)

2X90% of companies allocate at least 2X more cloud capacity than needed to ensure performance (6)

But…

80% of CEOs thinking data mining and analysis are strategically important (1)

4% of companies use analytics effectively (2)

(1) 2015 PWC CEO Survey; (2)2013 Baine and Company - The Value of Big Data; (3) 2014 IT Infrastructure Conversation - IBM; (4) Ernest and Young - 2014 Enterprise IT Trends and Investments; (5) 2014 Riverbed Tecnologies - The Transformers; (6) 2014 ElasticHosts CIO Study

LARGE ENTERPRISE BIG DATA TROUBLE

Page 8: Driving Real Insights Through Data Science

9© Copyright 2015 Pivotal. All rights reserved.

BIG DATACHASM

70%of data

generated by customers

80%of data stored

3%prepared for

analysis

0.5%being

analyzed

<0.5%being

operationalized

9

THE DATA DIVIDE

Page 9: Driving Real Insights Through Data Science

10© Copyright 2015 Pivotal. All rights reserved.

Software Is Eating The World

Data Is Fueling Software

SOFTWARE IS EATING THE WORLD

Page 10: Driving Real Insights Through Data Science

11© Copyright 2015 Pivotal. All rights reserved.

WE CHOSE PIVOTAL BECAUSE WE BELIEVE IT PROVIDES A 360-DEGREE VIEW OF THE PROCESS.

FROM A DATA SCIENCE AND DATA TECHNOLOGY PERSPECTIVE, IT MEANS DELIVERING BEST-IN-

CLASS DATA TECHNOLOGIES AND ENABLING THEM ON THEIR PLATFORM.

Page 11: Driving Real Insights Through Data Science

12© Copyright 2015 Pivotal. All rights reserved.

ACROSS INDUSTRIES

Page 12: Driving Real Insights Through Data Science

13© Copyright 2015 Pivotal. All rights reserved.

THE NEW DATA IMPERATIVES

ConvergedData & Cloud

OpenData-DrivenApps

Page 13: Driving Real Insights Through Data Science

14© Copyright 2015 Pivotal. All rights reserved.

THE BIG DATA PROBLEM

Fragmentation ConstraintsComplexity

Page 14: Driving Real Insights Through Data Science

15© Copyright 2015 Pivotal. All rights reserved.

• Remove Lock-in

• Leverage Ecosystem

• Co-innovate

GUIDING PRINCIPLES IN THE NEW ERA

OPEN AGILE CLOUD-READY

• Shorten innovation cycles

• Reduce TCO

• Improve TTM

• Solve business problems

• Avoid lock-in

• Appropriate security

Page 15: Driving Real Insights Through Data Science

16© Copyright 2015 Pivotal. All rights reserved.

JOURNEY TO A DATA-DRIVEN ENTERPRISE

Deploy analytic apps and automate at scale

Perform advanced analyticsDiscover insights

Modernize data infrastructure

Page 16: Driving Real Insights Through Data Science

17© Copyright 2015 Pivotal. All rights reserved.

Deploy analytic apps and automate at scale

Perform advanced analyticsDiscover insights

Modernize data infrastructure

DATA-DRIVEN COMPANIES:

USE MODERN DATA INFRASTRUCTURE

Page 17: Driving Real Insights Through Data Science

18© Copyright 2015 Pivotal. All rights reserved.

MODERNIZE DATA INFRASTRUCTURE

Elastic, Scale-outstorage and processing

Flexible data types and pipelining

ETL on demand: low operational costExpanded use cases

Higher quality analyticsLowered storage/processing cost

Less fragmented ecosystemReduced vendor lock-in

REQUIREMENTS BENEFITS

Cloud friendly and open-source based

Page 18: Driving Real Insights Through Data Science

19© Copyright 2015 Pivotal. All rights reserved.

Modernize data infrastructure

Deploy analytic apps and automate at scale

Perform advanced analyticsDiscover insights

DATA-DRIVEN COMPANIES:

STRATEGICALLY USE ADVANCED ANALYTICS

Page 19: Driving Real Insights Through Data Science

20© Copyright 2015 Pivotal. All rights reserved.

ADVANCED ANALYTICS

Leverage existing skills and toolsRapid time to insights

Internet of Things use casesRapid time to insights

Solve business problemsPredictive insights: proactive execution

REQUIREMENTS BENEFITS

Machine learning and advanced analytics

010101010101010100101010101010101100101010

SQL- compliant batch and interactive queries

Massive stream processing

0101010101010101001010

1010101010110010101010

10101010

Page 20: Driving Real Insights Through Data Science

21© Copyright 2015 Pivotal. All rights reserved.

Modernize data infrastructure

Perform advanced analyticsDiscover insights

DATA-DRIVEN COMPANIES:

INNOVATE AT SCALE

Deploy analytic apps and automate at scale

Page 21: Driving Real Insights Through Data Science

22© Copyright 2015 Pivotal. All rights reserved.

ANALYTIC APPS AND AUTOMATION AT SCALE

Reduced time to actionLow ‘analytics app-dev’ integration cost

Reduced time to insightsFlexible ingestion: low operating cost

High performance: low operating costTransactional safety: business critical ops

REQUIREMENTS BENEFITS

Low-latency, distributed in-memory transactions

Resilient, scale-out messaging and object storage

Agile analytic app-dev with enterprise PaaSPaaS

Page 22: Driving Real Insights Through Data Science

23© Copyright 2015 Pivotal. All rights reserved.

JOURNEY TO A DATA-DRIVEN ENTERPRISE

Deploy analytic apps and automate at scale

Perform advanced analyticsDiscover insights

Modernize data infrastructure

Pivotal Data Sciencehelps you move from BI to

Data Science

Pivotal Labs helps you move to an agile development of apps at scale

Pivotal Data Engineeringhelps you move from data administration to data engineering

Page 23: Driving Real Insights Through Data Science

26© Copyright 2015 Pivotal. All rights reserved.

PIVOTAL BIG DATA SUITE

Page 24: Driving Real Insights Through Data Science

27© Copyright 2015 Pivotal. All rights reserved.

Open sourcing all Pivotal Big Data Suite components including:

WORLD’S FIRST OPEN SOURCED BIG DATA PORTFOLIOBUILDING ON SUCCESS OF CLOUD FOUNDRY FOUNDATION

BUILT FOR ENTERPRISES

Pivotal GemFireApache Geode

Apache HAWQPivotal HDB

PivotalGreenplum Database

Page 25: Driving Real Insights Through Data Science

28© Copyright 2015 Pivotal. All rights reserved.

BUILT FOR ENTERPRISES

Value added features: enterprise grade performance + robustness without lock-in• Advanced Query Optimization in analytics

• WAN replication and continuous query in transactional processing

Flexible Deployment models: align to business objectives and needs• Balance cost objectives with policy and compliance requirements

• Leverage Pivotal’s pre-integration + certification on supported configurations

Enterprise grade support: one throat to choke for the suite• Focus on business problems – not on lifecycle management

• Expert support on Big Data Suite means reduced business risk

Page 26: Driving Real Insights Through Data Science

29© Copyright 2015 Pivotal. All rights reserved.

• Common core for Hadoop ecosystem

• Rapidly accelerated certifications, ecosystem development and enterprise-grade quality

OpenDataPlatform.org

OPEN

Page 27: Driving Real Insights Through Data Science

30© Copyright 2015 Pivotal. All rights reserved.

AGILE

Deploy analytic apps and automate at scale

Perform advanced analyticsDiscover insights

Modernize data infrastructure

Spring XD Spark Pivotal HD & Open Data Platform

Pivotal Greenplum Database

Pivotal HDB Rabbit MQ

Redis

Pivotal GemFire Pivotal BDS on PCF

Pivotal Cloud Foundry

Page 28: Driving Real Insights Through Data Science

31© Copyright 2015 Pivotal. All rights reserved.

CLOUD-READY

COMMODITY HARDWARE APPLIANCE HYBRID CLOUDCLOUD

IaaS IaaS

PAAS

Page 29: Driving Real Insights Through Data Science

32© Copyright 2015 Pivotal. All rights reserved.

DATA-DRIVEN ENTERRPRISE JOURNEY WITH PIVOTAL BIG DATA SUITE

STORE• Structured

• Unstructured

• High Volume

• High Velocity

ANALYZE• Predictive Analytics

• Machine Learning

• Advance Data Science

• Realtime Analytics

DEVELOP• Advanced Analytic Pipelines

• Realtime Analytical Applications

• Global Scale Data-Driven Applications

• Enterprise, Consumer, IoT, and Mobile

INNOVATE• Agile Dev Expertise

• DevOps

• Microservices

• Continuous Delivery

• Closed Loop Applications

AGILE DEVELOPMENT

BIG DATA PREDICTIVE ANALYTICS

CLOUD NATIVE PLATFORM

Spring XD

Spark

Pivotal HD & Open Data Platform

Spring XD

Pivotal Greenplum Database

Pivotal HDB

Spring XD

Pivotal GemFire

Rabbit MQ

Spring Cloud

Pivotal BDS on PCF

Pivotal Cloud Foundry

Pivotal LabsData ScienceData Engineering

Page 30: Driving Real Insights Through Data Science

35© Copyright 2015 Pivotal. All rights reserved.

FOR FURTHER INFO, CHECKOUT…

• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data

• Pivotal Blog @ http://blog.pivotal.io

• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal

• Pivotal Academy @ https://pivotal.biglms.com

Or reach out to your local Pivotal Account Executive…

Page 31: Driving Real Insights Through Data Science

36© Copyright 2015 Pivotal. All rights reserved. 36© Copyright 2013 Pivotal. All rights reserved.

Pivotal Data Science Overview and Use CasesPivotal Big Data Roadshow

Page 32: Driving Real Insights Through Data Science

37© Copyright 2015 Pivotal. All rights reserved.

DATA SCIENCE?

App Development

Analytics

Business IntelligenceReporting

Visualization

Dashboards

Insights Big Data

Machine LearningStatistics

MathematicsTime Series

Algorithms

Databases

Software

Modeling

Queries

Real-Time

Sensors

Predictive Models

ETL

Research

Hadoop

Distributed Computing

MapReduce

SQL

In-Memory

OLAP

Text Mining

Unstructured Data

Open Source

Decision Science

Ad Hoc Queries

Hacking

In-Database Analytics

Internet of Things

Data Cleansing

Sentiment

Page 33: Driving Real Insights Through Data Science

38© Copyright 2015 Pivotal. All rights reserved.

• ETL

• Unstructured

• Data Cleansing

• Sensors

Data Related

• Algorithms

• Mathematics

• Statistics

• Econometrics

• Predictive Modeling

• Machine Learning

• Text Mining

• Sentiment

• Map Reduce

Fields of Study & Techniques

• Dashboards

• Insights

• Visualization

• Ad Hoc Queries

• Reporting

Business Intelligence

• Software

• In-Database Analysis

• Distributed Computing

• Hadoop

• Open Source

Implementation

• Big Data

• Decision Science

• Internet of Things

• Real-Time

• Hacking

• In-Memory

Industry Buzzwords

Page 34: Driving Real Insights Through Data Science

39© Copyright 2015 Pivotal. All rights reserved.

What is Data Science?

The use of statistical and machine learning techniques on big multi-structured data in a distributed computing environment to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.

DRIVE AUTOMATED, LOW-LATENCY ACTIONS IN RESPONSE TO EVENTS OF INTEREST

Page 35: Driving Real Insights Through Data Science

40© Copyright 2015 Pivotal. All rights reserved.

Gene Sequencing

Smart Grids

COST TO SEQUENCE ONE GENOMEHAS FALLEN FROM $100M IN 2001 TO $10K IN 2011TO $1K IN 2014

READING SMART METERSEVERY 15 MINUTES IS

3000X MOREDATA INTENSIVE

Stock Market

Social Media

FACEBOOK UPLOADS250 MILLION

PHOTOS EACH DAY

Billions of Data Points

Oil Exploration

Video Surveillance

OIL RIGS GENERATE

25000DATA POINTS PER SECOND

Medical Imaging

Mobile Sensors

Page 36: Driving Real Insights Through Data Science

41© Copyright 2015 Pivotal. All rights reserved.

What is Big Data Analytics?

DescriptiveAnalytics

WHAT HAPPENED?DiagnosticAnalytics

WHY DID IT HAPPENED?

BI

Data Science

PredictiveAnalytics

WHAT WILL HAPPEN?

PrescriptiveAnalytics

HOW CAN WE MAKE IT HAPPEN?

Hindsight

Insight

Foresight

Complexity

Value ofAnalytics

($)

Page 37: Driving Real Insights Through Data Science

42© Copyright 2015 Pivotal. All rights reserved.

P L A T F O R M

Data Science Toolkit

KEY TOOLS KEY LANGUAGES

SQL

Page 38: Driving Real Insights Through Data Science

43© Copyright 2015 Pivotal. All rights reserved.

Scalable, In-Database ML

• Open Source https://github.com/madlib/madlib• Works on Greenplum DB, HAWQ and PostgreSQL• In active development by Pivotal• Downloads and Docs: http://madlib.net/

Page 39: Driving Real Insights Through Data Science

44© Copyright 2015 Pivotal. All rights reserved.

Functions

Supervised LearningRegression Models• Cox Proportional Hazards Regression• Elastic Net Regularization• Generalized Linear Models• Linear Regression• Logistic Regression• Marginal Effects• Multinomial Regression• Ordinal Regression• Robust Variance, Clustered Variance• Support Vector MachinesTree Methods• Decision Tree• Random ForestOther Methods• Conditional Random Field• Naïve Bayes

Unsupervised Learning• Association Rules (Apriori)• Clustering (K-means) • Topic Modeling (LDA)

StatisticsDescriptive• Cardinality Estimators• Correlation• SummaryInferential• Hypothesis TestsOther Statistics• Probability Functions

Other Modules• Conjugate Gradient• Linear Solvers• PMML Export• Random Sampling• Term Frequency for Text

Time Series• ARIMA

Aug 2015

Data Types and Transformations• Array Operations• Dimensionality Reduction (PCA)• Encoding Categorical Variables• Matrix Operations• Matrix Factorization (SVD, Low Rank)• Norms and Distance Functions• Sparse Vectors

Model Evaluation• Cross Validation

Predictive Analytics Library

Page 40: Driving Real Insights Through Data Science

45© Copyright 2015 Pivotal. All rights reserved.

A single address for everything analyticsAnalytics with Pivotal

Time-to-Insights

FORECASTING CLUSTERING

REGRESSION

CLASSIFICATION

OPTIMIZATION

Page 41: Driving Real Insights Through Data Science

46© Copyright 2015 Pivotal. All rights reserved.

Smart Systems = Sensors + Digital Brain + Actuators

Problem Formulation

Modeling Step

Data StepApplication Step

Data Science forBuilding Models

Sensors & Actuators

Data Lake

Page 42: Driving Real Insights Through Data Science

47© Copyright 2015 Pivotal. All rights reserved. 47© Copyright 2013 Pivotal. All rights reserved.

Data Science Use Cases

Page 43: Driving Real Insights Through Data Science

48© Copyright 2015 Pivotal. All rights reserved. 48© Copyright 2013 Pivotal. All rights reserved.

Financial Services

Page 44: Driving Real Insights Through Data Science

49© Copyright 2015 Pivotal. All rights reserved.

Identifying and Pricing Cross-Sell Opportunities

CUSTOMER

A global financial services provider

BUSINESS PROBLEM

Identify cross-sell opportunities between two business arms of a financial institution.

CHALLENGES

Integration of large-scale data originating from multiple data warehouses. Developing predictive models to identify novel cross-sell opportunities within the financial institution. Evaluate the identified cross-sell opportunities by their revenue potential.

SOLUTIONS

Fast integration of data in Pivotal Greenplum Database.

Predictive models and evaluation of profitability:– Association rule. – Logistic regression for each product

offered.– Estimation of revenue opportunity.

On-demand reporting and visualization via custom dashboards connected to in-database models.

Identified multi-million dollar opportunities for the bank.

Page 45: Driving Real Insights Through Data Science

50© Copyright 2015 Pivotal. All rights reserved.

Financial Compliance

BUSINESS PROBLEMEnsure compliance with Dodd-Frank and Basel Committee regulationsIdentify underlying risk and fraud while reducing the compliance department’s overburdened

Emails Chats Trades

Transactions Policy Securities

Phone Calls Watch Lists …

Financial complianceData Lake

Data integration

Data clean up Modeling Classification

and ranking

Analyst user interfacesFeedback

Analytics

Analyst feedbackData integration: e.g., append trade information with email and chat communications

Data cleanup: e.g., identify newsletters and spam emails

Modeling: • Predictive modeling to flag

messages and trades• Graph and cohort analysis

Analyst feedbackReviewed fraud instances included in periodic model refreshes

SOLUTION A data lake platform coupled with cutting edge data

science techniques Flexible user interface to promote an adaptive,

continuously learning compliance framework

Page 46: Driving Real Insights Through Data Science

51© Copyright 2015 Pivotal. All rights reserved. 51© Copyright 2013 Pivotal. All rights reserved.

Telco & Mobile

Page 47: Driving Real Insights Through Data Science

52© Copyright 2015 Pivotal. All rights reserved.

Subscriber Micro-SegmentationCUSTOMER

A major telco with cable & VOD, internet, and phone business unitsBUSINESS PROBLEM

Better understand aggregated subscriber behavior to drive business strategy using newly available data sources

CHALLENGES

▪ Large quantities of deep packet inspection data and set top box data that had not been analyzed before

▪ Needed to incorporate internet usage and TV consumption information into pre-existing subscriber segments

SOLUTION

▪ Generated new subscriber segments that incorporated features based on consumption of TV and internet services across a variety of devices

▪ Crossed new segments with existing segments to generate new micro-segments for cross-sell/upsell and new product development opportunities

Customized Micro-Segments

Page 48: Driving Real Insights Through Data Science

53© Copyright 2015 Pivotal. All rights reserved.

Newly Identified Behavior-Based Segments S

ubsc

riber

s

Moderates

OTT & Data Heavyweights

Portable OTT Entertainment Seekers

iPhone Heavy

Android Heavy

iPad Heavy

In-Home OTT Entertainment Seekers

In-Home Native Content Seekers

VOD Heavy

TV Heavy

Page 49: Driving Real Insights Through Data Science

54© Copyright 2015 Pivotal. All rights reserved.

Opportunities for Data-Driven Decisions in Pharma

Page 50: Driving Real Insights Through Data Science

55© Copyright 2015 Pivotal. All rights reserved.

Data driven drugs: From discovery to delivery

RICH DATA SOURCES• Molecular data

• Cellular drug screens• Animal models

• Clinical data including notes, images, markers (e.g. genomics, lab results)

• Sensor and assay data• Internal and partner/purchased

external data• Contact center data• Patient registries, public and federal

data, clinical partnerships

Clinical Trials

Manufacturing

Marketing

Distribution and surveillance

Drug discovery + development

Page 51: Driving Real Insights Through Data Science

56© Copyright 2015 Pivotal. All rights reserved.

A pipeline of sensors and opportunities for optimizing outputInternet of Things in Manufacturing

Input materials Mix Incubate Filter Centrifuge Final Product

0 20 40 60 80 100 120 140 160 180 200

0

5

10

15

20

25

30

Sensors

High-Content Screens

TEM

PTIME

Abs

orba

nce

Elution volume

Velo

city

TimeAutomated raw

materials mixing

Page 52: Driving Real Insights Through Data Science

57© Copyright 2015 Pivotal. All rights reserved.

Vaccine Potency PredictionCUSTOMER A major pharmaceutical company

BUSINESS PROBLEMPredict potency and antigen levels of live virus vaccines based on manufacturing sensor data and manual data collected throughout the process.

CHALLENGES Customer’s data model was not optimal for

running analytical queries Manual data quality issues Data capture was performed with varying

consistency due to high cost associated with manual data collection

SOLUTION Introduced a new data model to make data

accessible and enable analytics (including LIMS and DeltaV)

Built automated outlier detection/correction methods to address manual data entry quality issues

Devised imputation methods to deal with data completeness issues

Built predictive models with high accuracy

Page 53: Driving Real Insights Through Data Science

58© Copyright 2015 Pivotal. All rights reserved.

http://blog.pivotal.io/data-science-pivotalCheck out the Pivotal Data Science Blog!

Page 54: Driving Real Insights Through Data Science

59© Copyright 2015 Pivotal. All rights reserved.

FOR FURTHER INFO…

• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data

• Pivotal Blog @ http://blog.pivotal.io

• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal

• Pivotal Academy @ https://pivotal.biglms.com

• Or reach out to your local Pivotal Account Executive…

Page 55: Driving Real Insights Through Data Science

60© Copyright 2015 Pivotal. All rights reserved.

Pivotal Data Science Labs: Packaged Services

• Analytics Roadmap

• Prioritized Opportunities

• Architectural Recommendations

• Hands-on training

• Hosted data on

Pivotal Data stack

• Results review &assessment

• On-site MPP analytics training

• Analytics tool-kit

• Quick insight (2 weeks)

• Prof. services

• Data science model building

• Ready-to-deploymodel(s)

• Prof. services

• Data sciencemodel building

• Ready-to-deploymodel(s)

LAB PRIMER(2-Week Roadmapping)

LAB 600(6-Week Lab)

LAB 1200(12-Week Lab)

LAB 100(Analytics Bundle)

DATA JAM(Internal DS Contest)

Page 56: Driving Real Insights Through Data Science

61© Copyright 2015 Pivotal. All rights reserved.

Data Streaming and Predictive AnalyticsUsing Pivotal Big Data Suite

Page 57: Driving Real Insights Through Data Science

62© Copyright 2015 Pivotal. All rights reserved.

Converging Trends

InnovationNew Data

New Processes

New Insights

The Journey to the Data-Driven Enterprise

Data Scienceand Machine

LearningBig DataIoT, Mobile Apps,

Social Media

Page 58: Driving Real Insights Through Data Science

63© Copyright 2015 Pivotal. All rights reserved.

HDFS

Data Lake

Ingest Store Analytics

Hard to changeLabor intensive

Inefficient

Coding basedNo real-time informationBased on expensive ETL

Migrating from a Reactive, Static and Constrained Model…

Page 59: Driving Real Insights Through Data Science

64© Copyright 2015 Pivotal. All rights reserved.

HDFSData Lake Expert System / Machine Learning

In-Memory Real-Time

Data

Continuous LearningContinuous Improvement

Continuous Adapting

Data Stream Pipeline

Multiple Data SourcesReal-Time ProcessingStore Everything

To Pro-Active, Self-Improving, Machine Learning Systems

Page 60: Driving Real Insights Through Data Science

65© Copyright 2015 Pivotal. All rights reserved.

New York Times Research: http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

“50-80% OF THE TIME ON DATA

SCIENCE PROJECTS IS SPENT ON DATA WRANGLING

Page 61: Driving Real Insights Through Data Science

66© Copyright 2015 Pivotal. All rights reserved.

Data FeedsStream Processing

Expert Systems Machine Learning

Historical Data

Business ValueSmart

Decisions

Still…

HDFS

Data Lake

Page 62: Driving Real Insights Through Data Science

67© Copyright 2015 Pivotal. All rights reserved.

Ingest Transform SinkSpringXD

GemFire

Data Stream Needs an Agile, Scalable and Fast Solution

HAWQ GPDBDataLake

Page 63: Driving Real Insights Through Data Science

68© Copyright 2015 Pivotal. All rights reserved.

Ingest Transform SinkSpringXD

DistributedComputing

In-Memory Real-Time Data

Spring XD Orchestrates and Automates all the Steps on Data Stream Pipelining

Expert System / Machine Learning

ExtensibleOpen-SourceFault-TolerantHorizontally Scalable HAWQ GPDB

DataLake

Page 64: Driving Real Insights Through Data Science

69© Copyright 2015 Pivotal. All rights reserved.

INGEST / SINK PROCESS ANALYZE

• No coding required

• Dozens of built-in connectors

• Seamless integration with Kafka, Sqoop

• Create new connectors easily using Spring

• Call Spark, Reactor or RxJava

• Built-in configurable filtering, splitting and transformation

• Out-of-box configurable jobs for batch processing

• Import and invoke PMML jobs easily

• Call Python, R, Madlib and other tools

• Built-in configurable counters and gauges

Spring XDState of the Art Data Pipeline Automation

Page 65: Driving Real Insights Through Data Science

70© Copyright 2015 Pivotal. All rights reserved.

Ingest Transform SinkSpringXD

DistributedComputing

GemFire Provides Scalable, Low-Latency Data Access, Storage and Event Processing

Expert System / Machine Learning

GemFire

ExtensibleOpen-SourceFault-TolerantHorizontally Scalable HAWQ GPDB

DataLake

Page 66: Driving Real Insights Through Data Science

71© Copyright 2015 Pivotal. All rights reserved.

GemFire

• In-Memory Enterprise Data Grid• Horizontally Scalable, Consistent,

Highly Available • Event handling• Continuous Queries• Enterprise Data Geo Distribution

In-memory Real Time Data

Page 67: Driving Real Insights Through Data Science

72© Copyright 2015 Pivotal. All rights reserved.

Ingest Transform SinkSpringXD

DistributedComputing

Pivotal Provides SQL Based Advanced Analytics

Expert System / Machine Learning

GemFire

ExtensibleOpen-SourceFault-TolerantHorizontally Scalable

DataLake HAWQ GPDB

Page 68: Driving Real Insights Through Data Science

73© Copyright 2015 Pivotal. All rights reserved.

HAWQ

• Massively Parallel Processing RDBMS on HADOOP

• ANSI SQL on Hadoop• Extremely high

performance for analytics (not like Hive)

• Stores all data directly on HDFS

• Open-Source

Advanced SQL analytics in Hadoop

Combining SQL with Hadoop is key for analytics

SQL remains #1 choice for Data Science

Page 69: Driving Real Insights Through Data Science

74© Copyright 2015 Pivotal. All rights reserved.

Ingest Transform SinkSpringXD

Developers and Data Scientists Can Focus onthe Business Value of Data

GemFire

ExtensibleOpen-SourceFault-TolerantHorizontally Scalable

DataLake HAWQ GPDB

Page 70: Driving Real Insights Through Data Science

75© Copyright 2015 Pivotal. All rights reserved.

Data Streaming Reference ArchitectureData Feeds Transactional Apps Analytic Apps

Data Stream Pipeline

DistributedComputing Real-Time Data Expert Systems &

Machine LearningAdvancedAnalytics

HDFSData Lake

Page 71: Driving Real Insights Through Data Science

76© Copyright 2015 Pivotal. All rights reserved.

Data Streaming Reference ArchitectureData Feeds Transactional Apps Analytic Apps

Data Stream Pipeline

HDFSData Lake

GemFire HAWQ GPDB

SpringXD

Page 72: Driving Real Insights Through Data Science

77© Copyright 2015 Pivotal. All rights reserved.

“SO WE ARE MOVING TO A WORLD WHERE THE

MACHINES WE WORK WITH ARE NOT JUST INTELLIGENT; THEY ARE BRILLIANT.THEY ARE SELF-

AWARE, THEY ARE PREDICTIVE, REACTIVE AND SOCIAL. IT'S A WORLD WHERE INFORMATION

ITSELF BECOMES INTELLIGENT AND COMES TO US AUTOMATICALLY WHEN WE NEED IT WITHOUT

HAVING TO LOOK FOR IT.

”MARCO ANNUNZIATA, GE

Page 73: Driving Real Insights Through Data Science

78© Copyright 2015 Pivotal. All rights reserved.

DemoPowered by Pivotal Big Data Suite

Page 74: Driving Real Insights Through Data Science

79© Copyright 2015 Pivotal. All rights reserved.

It's all about DATA

Data SourcesLook for patterns

Prediction

Page 75: Driving Real Insights Through Data Science

Transform Sink

SpringXD

ExtensibleOpen-SourceFault-TolerantHorizontally ScalableCloud-Native

Machine Learning

Enrich Filter

Split

Dashboard

Indicators

1

2

Predict

3

Real data

Simulator

/Stocks

/TechIndicators

/Predictions

Page 76: Driving Real Insights Through Data Science

81© Copyright 2015 Pivotal. All rights reserved. 81

Page 77: Driving Real Insights Through Data Science
Page 78: Driving Real Insights Through Data Science

91© Copyright 2015 Pivotal. All rights reserved.

“THE REAL OPPORTUNITY FOR

CHANGE...SURPASSING THE MAGNITUDE OF THE CONSUMER INTERNET...IS THE INDUSTRIAL

INTERNET, AN OPEN, GLOBAL NETWORK THAT CONNECTS PEOPLE, DATA AND MACHINES.

”JEFF IMMELT, CEO, GE

Page 79: Driving Real Insights Through Data Science

100© Copyright 2015 Pivotal. All rights reserved.

FOR FURTHER INFO, CHECKOUT…

• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data

• Pivotal Blog @ http://blog.pivotal.io

• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal

• Pivotal Academy @ https://pivotal.biglms.com

• Or reach out to your local Pivotal Account Executive…

Page 80: Driving Real Insights Through Data Science

BUILT FOR THE SPEED OF BUSINESS

Page 81: Driving Real Insights Through Data Science

102© Copyright 2015 Pivotal. All rights reserved. 102© Copyright 2013 Pivotal. All rights reserved.

Accelerating the Generation of New Insights

October 27, 2015

Sarah Aerni, Data ScienceAntonio Petrole, Data Engineering

Page 82: Driving Real Insights Through Data Science

BUILT FOR THE SPEED OF BUSINESS

Page 83: Driving Real Insights Through Data Science

104© Copyright 2015 Pivotal. All rights reserved.

Gene Sequencing

Smart Grids

COST TO SEQUENCE ONE GENOMEHAS FALLEN FROM $100M IN 2001 TO $10K IN 2011TO $1K IN 2014

READING SMART METERSEVERY 15 MINUTES IS

3000X MOREDATA INTENSIVE

Stock Market

Social Media

FACEBOOK UPLOADS250 MILLION

PHOTOS EACH DAY

Technology to process and store data is needed in all industries

Oil Exploration

Video Surveillance

OIL RIGS GENERATE

25000DATA POINTS PER SECOND

Medical Imaging

Mobile Sensors

Page 84: Driving Real Insights Through Data Science

105© Copyright 2015 Pivotal. All rights reserved.

What is Big Data Analytics?

DescriptiveAnalytics

WHAT HAPPENED?DiagnosticAnalytics

WHY DID IT HAPPEN?

BI

Data Science

PredictiveAnalytics

WHAT WILL HAPPEN?

PrescriptiveAnalytics

HOW CAN WE MAKE IT HAPPEN?

Hindsight

Insight

Foresight

Complexity

Value ofAnalytics

($)

Page 85: Driving Real Insights Through Data Science

106© Copyright 2015 Pivotal. All rights reserved.

Opportunities for Data-Driven Decisions in Pharma

Page 86: Driving Real Insights Through Data Science

107© Copyright 2015 Pivotal. All rights reserved.

Data driven drugs: From discovery to delivery

RICH DATA SOURCES• Molecular data

• Cellular drug screens• Animal models

• Clinical data including notes, images, markers (e.g. genomics, lab results)

• Sensor and assay data• Internal and partner/purchased

external data• Contact center data• Patient registries, public and federal

data, clinical partnerships

Clinical Trials

Manufacturing

Marketing

Distribution and surveillance

Drug discovery + development

Page 87: Driving Real Insights Through Data Science

108© Copyright 2015 Pivotal. All rights reserved.

A pipeline of sensors and opportunities for optimizing outputInternet of Things in Manufacturing

Input materials Mix Incubate Filter Centrifuge Final Product

0 20 40 60 80 100 120 140 160 180 200

0

5

10

15

20

25

30

Sensors

High-Content Screens

TEM

PTIME

Abs

orba

nce

Elution volume

Velo

city

TimeAutomated raw

materials mixing

Page 88: Driving Real Insights Through Data Science

109© Copyright 2015 Pivotal. All rights reserved.

Vaccine Potency PredictionCUSTOMER A major pharmaceutical company

BUSINESS PROBLEMPredict potency and antigen levels of live virus vaccines based on manufacturing sensor data and manual data collected throughout the process.

CHALLENGES Customer’s data model was not optimal for

running analytical queries Manual data quality issues Data capture was performed with varying

consistency due to high cost associated with manual data collection

SOLUTION Introduced a new data model to make data

accessible and enable analytics (including LIMS and DeltaV)

Built automated outlier detection/correction methods to address manual data entry quality issues

Devised imputation methods to deal with data completeness issues

Built predictive models with high accuracy

Page 89: Driving Real Insights Through Data Science

110© Copyright 2015 Pivotal. All rights reserved.

Interpreting the utility of a measure obtained during manufacturing based on model outcomes

Sample model insights

Some features may reveal tunable parameters to alter potency, others may simply be markers

Features consistently absent from models may be uninformative for predicting potency

Opportunities to provide real-time feedback on data entry errors and predicted potency outcomes

Assayed value Duration of a step

Pot

ency

Pot

ency

Correlation=0.45 Correlation=0.38

Page 90: Driving Real Insights Through Data Science

111© Copyright 2015 Pivotal. All rights reserved.

Need for new environments to process big data?HDFS STORAGE AND MPP

ARCHITECTURES DISTRIBUTE STORAGE AND PREVENT DATA MOVEMENT

VARIETY/VELOCITY

DISTRIBUTED COMPUTATION FOR PARALLELIZATION

PETABYTES OF DATA

OPEN-SOURCE LIBRARY FOR MACHINE LEARNING AT SCALE AND FRAMEWORK

TO ACCESS COMMON LANGUAGES

RAPIDLY EVOLVING FIELD OF DATA SCIENCE AND

TOOLS

SQL ENGINE AND ODBC/JDBC CONNECTIONS TO HADOOP

MANY EXISTING LIBRARIES, TOOLS AND EXPERTISE

FLEXIBLE

SCALABLE

ENABLING

ACCESSIBLE

Page 91: Driving Real Insights Through Data Science

112© Copyright 2015 Pivotal. All rights reserved.

Multiple tools with a single, simple goal: Distributed storage with in-place computation

PivotalHadoop

Pivotal Greenplum Database

HAWQ

Page 92: Driving Real Insights Through Data Science

113© Copyright 2015 Pivotal. All rights reserved.

Multiple tools with a single, simple goal: Distributed storage with in-place computation

Think of it as multiple PostGreSQL servers

Segments/Workers

Master

Rows are distributed across segments by a particular field (or randomly)

PivotalHadoop

Pivotal Greenplum Database

HAWQ

Page 93: Driving Real Insights Through Data Science

114© Copyright 2015 Pivotal. All rights reserved.

Identifying duplicates: counting with groupingOpportunities for performance improvements

– Sorting and re-sorting is required in many pipelines

– Single-threaded processes create bottlenecks in speed and need to move data

https://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-1-Map_and_Dedup.pdf

Solution: Leverage Pivotal’s distributed MPP environment by using common database functions

Page 94: Driving Real Insights Through Data Science

115© Copyright 2015 Pivotal. All rights reserved.

Identifying duplicates: counting with grouping

Reference genome

Mapped reads

Page 95: Driving Real Insights Through Data Science

116© Copyright 2015 Pivotal. All rights reserved.

Identifying duplicates: counting with grouping

Duplicates

13

11115

13

select locus, count(*) from readsgroup by locusReference genome

Mapped reads

Page 96: Driving Real Insights Through Data Science

117© Copyright 2015 Pivotal. All rights reserved.

Reference genome

Mapped reads

Counting numbers of reads mapped to featuresselect exon, count(*) from reads JOIN refseqON(<reads overlap exon>)group by exon

5

17

12

Page 97: Driving Real Insights Through Data Science

118© Copyright 2015 Pivotal. All rights reserved.

Multiple tools with a single, simple goal: Distributed storage with in-place computation

Think of it as distributed file system with very large blocks of data

Schema on read allows flexibility for a variety of datasetsCompute using a variety of paradigms (e.g. MapReduce)

PivotalHadoop

Pivotal Greenplum Database

HAWQ

Name Node

Data Node 1

Data Node 2

Data Node 3

Data Node 4

1 2 3 2 3 1 1 2

Page 98: Driving Real Insights Through Data Science

119© Copyright 2015 Pivotal. All rights reserved.

Multiple tools with a single, simple goal: Distributed storage with in-place computation

SQL compliantWorld-class query optimizerInteractive queryHorizontal scalabilityRobust data managementCommon Hadoop formatsDeep analytics

PivotalHadoop

Pivotal Greenplum Database

HAWQ

Think of it as distributed PostGreSQL (GPDB) on Hadoop• SQL compliant• World-class query optimizer• Interactive query• Horizontal scalability• Robust data management• Common Hadoop formats• Deep analytics

Page 99: Driving Real Insights Through Data Science

120© Copyright 2015 Pivotal. All rights reserved.

A single address for everything analyticsAnalytics with Pivotal

Time-to-Insights

FORECASTING CLUSTERING

REGRESSION

CLASSIFICATION

OPTIMIZATION

Page 100: Driving Real Insights Through Data Science

121© Copyright 2015 Pivotal. All rights reserved.

P L A T F O R M

Data Science Toolkit

KEY TOOLS KEY LANGUAGES

SQL

Page 101: Driving Real Insights Through Data Science

122© Copyright 2015 Pivotal. All rights reserved.

Historically data was studied in silosBRCA dataset

Treatments

Protein Assays

Imaging

VariantsPatient History &Follow-Ups

Gene Expression

Copy NumberVariation

miRNA

Page 102: Driving Real Insights Through Data Science

123© Copyright 2015 Pivotal. All rights reserved.

GenomicsData Center

ResearcherComputing

ClusterUnnecessary data movementNetwork usage

Need for new environment: Data movementComputation and storage in a single location

Page 103: Driving Real Insights Through Data Science

124© Copyright 2015 Pivotal. All rights reserved.

In-database genome-wide association study

NetworkInterconnect

MasterSevers

SegmentSevers

SQL & RCOVARIATES GENOTYPESIndiv Covariates

1 2 10

1 F 23 18

2 M 39 41

3 M 50 23

N F 19 24

SNP1 2 MAA

CC

TT

AT

CG

TT

AA

GG

TC

TT CG

TC

Page 104: Driving Real Insights Through Data Science

125© Copyright 2015 Pivotal. All rights reserved.

In-database genome-wide association study

NetworkInterconnect

MasterSevers

SegmentSevers

SNP1 SNP2 SNPM

SQL & RIndiv Covariates

1 2 10

1 F 23 18

2 M 39 41

3 M 50 23

N F 19 24

Indiv SNP

Geno

1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG

N M TC

COVARIATES GENOTYPES

Page 105: Driving Real Insights Through Data Science

126© Copyright 2015 Pivotal. All rights reserved.

In-database genome-wide association study

NetworkInterconnect

MasterSevers

SegmentSevers

SNP1 SNP2 SNPM

Pval1 Pval2 PvalM

SQL & RIndiv Covariates

1 2 10

1 F 23 18

2 M 39 41

3 M 50 23

N F 19 24

Indiv SNP

Geno

1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG

N M TC

COVARIATES GENOTYPES

Page 106: Driving Real Insights Through Data Science

127© Copyright 2015 Pivotal. All rights reserved.

In-database genome-wide association study

NetworkInterconnect

MasterSevers

SegmentSevers

SNP1 SNP2 SNPM

Pval1 Pval2 PvalM

SQL & RIndiv Covariates

1 2 10

1 F 23 18

2 M 39 41

3 M 50 23

N F 19 24

Indiv SNP

Geno

1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG

N M TC

COVARIATES GENOTYPES

Page 107: Driving Real Insights Through Data Science

128© Copyright 2015 Pivotal. All rights reserved.

In-database genome-wide association study

NetworkInterconnect

MasterSevers

SegmentSevers

SNP1 SNP2 SNPM

Pval1 Pval2 PvalM

SQL & RIndiv Covariates

1 2 10

1 F 23 18

2 M 39 41

3 M 50 23

N F 19 24

Indiv SNP

Geno

1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG

N M TC

SNP P-value1 2.34x10-21

2 0.3953 7.15x10-17

M 0.000142

COVARIATES GENOTYPES RESULTS

Page 108: Driving Real Insights Through Data Science

129© Copyright 2015 Pivotal. All rights reserved.

In-database genome-wide association study

NetworkInterconnect

MasterSevers

SegmentSevers

SNP1 SNP2 SNPM

Pval1 Pval2 PvalM

SQL & RIndiv Covariates

1 2 10

1 F 23 18

2 M 39 41

3 M 50 23

N F 19 24

Indiv SNP

Geno

1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG

N M TC

SNP P-value1 2.34x10-21

2 0.3953 7.15x10-17

M 0.000142

COVARIATES GENOTYPES RESULTS

• In-database computation of 1 million loci for thousands of individuals in seconds

• Results are easily manipulated and explored

Page 109: Driving Real Insights Through Data Science

130© Copyright 2015 Pivotal. All rights reserved.

In-database genome-wide association study

NetworkInterconnect

MasterSevers

SegmentSevers

SNP1 SNP2 SNPM

Pval1 Pval2 PvalM

LOR1 LOR2 LORM

SQL & RIndiv Covariates

1 2 10

1 F 23 18

2 M 39 41

3 M 50 23

N F 19 24

Indiv SNP

Geno

1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG

N M TC

SNP P-value1 2.34x10-21

2 0.3953 7.15x10-17

M 0.000142

COVARIATES GENOTYPES RESULTS

• In-database computation of 1 million loci for thousands of individuals in seconds

• Results are easily manipulated and explored

Page 110: Driving Real Insights Through Data Science

131© Copyright 2015 Pivotal. All rights reserved.

Procedural Languages in Big Data Science HAWQ & PL/X can take advantage of “data

parallel” tasks by performing analyses in parallel – embarrassingly parallel tasks

– Little/no effort required to break up the problem into parallel tasks

– No dependency (or communication) between tasks

Examples of ‘data parallel’ problems:– Counting words in documents– Genome-Wide Association Study– Studying network anomalies

Sample Implementations by PDL– Digital image processing– Bayesian Inference with MCMC– Parallel Bagged Decision Trees

Doc1 Doc2 DocM

Stem1 Stem2 StemM

SQL & R

Count1 Count2 CountM

NetworkInterconnect

MasterSevers

SegmentSevers

Page 111: Driving Real Insights Through Data Science

132© Copyright 2015 Pivotal. All rights reserved.

Finding Causal Variants in LupusCustomer

Biotech Company

Business Problem

The customer wants to establish internal data science capabilities: building a culture and acquiring hardware and people to support it

Challenges

Customer needs to establish a culture around sharing and analyzing datasets for value

Current in-house technology is unable to support large-scale analysis (e.g. unable to analyze genomics datasets)

Need to learn new paradigms for analyzing data at-scale

Solution

Train customer employees on our solution stack, provide one-on-one consulting and run a hackathon

Greatly reduced computation time enabling implementation and results during a 30 hour period

– Module previously requiring 30min took only 5sec in using SQL in database

Novel scientific discovery on untouched data– Dataset previously untouched for 2 years due to

limited resources– Mine for statistically significant associations

between ~400,000 variants for 1000 phenotypes

Page 112: Driving Real Insights Through Data Science

133© Copyright 2015 Pivotal. All rights reserved.

Processing images and building integrated models at scale

Page 113: Driving Real Insights Through Data Science

134© Copyright 2015 Pivotal. All rights reserved.

Image Computation Framework

Hadoop Sequenc

eFile

Thousands of Images

Image Pre-Processing

Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]

Feature Generation

HDFS

Map reduce

Map reduce

One sequence file

Page 114: Driving Real Insights Through Data Science

135© Copyright 2015 Pivotal. All rights reserved.

Image Computation Framework

Hadoop Sequenc

eFile

Thousands of Images

Image Pre-Processing

Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]

Feature Generation

HDFS HAWQ/GPDB

Map reduce

Map reduce

Join to additional datasets

ProteomicsMedical HistoryVariants

Additional Datasets

Build Models at-Scale

SQLOne sequence file

Page 115: Driving Real Insights Through Data Science

136© Copyright 2015 Pivotal. All rights reserved.

Image Computation Framework

Hadoop Sequenc

eFile

Thousands of Images

One sequence file

Image Pre-Processing

Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]

Feature Generation

HDFS HAWQ/GPDB

Map reduce

Map reduce

Raw Pixels

img1 [rgb1-rgbK]

img2 [rgb1-rgbK]

imgN [rgb1-rgbK]

Map reduce

Join to additional datasets

ProteomicsMedical HistoryVariants

Additional Datasets

Build Models at-Scale

SQL

Page 116: Driving Real Insights Through Data Science

137© Copyright 2015 Pivotal. All rights reserved.

Image Computation Framework

Hadoop Sequenc

eFile

Thousands of Images

One sequence file

Image Pre-Processing

Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]

Feature Generation

HDFS HAWQ/GPDB

Map reduce

Map reduce

Raw Pixels

img1 [rgb1-rgbK]

img2 [rgb1-rgbK]

imgN [rgb1-rgbK]

Map reduce PL/X

SQL

Join to additional datasets

Feature Generation

Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]

ProteomicsMedical HistoryVariants

Additional Datasets

Build Models at-Scale

SQL

Page 117: Driving Real Insights Through Data Science

138© Copyright 2015 Pivotal. All rights reserved.

Representing an image in HAWQHAWQ enables rapid processing of multiple or extremely large images in parallel without memory limitations

Source Image:Col

Row

0 1 2012

0 00 10 21 01 11 22 02 12 2

col

row

intsy

Structured:

Page 118: Driving Real Insights Through Data Science

139© Copyright 2015 Pivotal. All rights reserved.

Translating image processing to simple SQL

Function Distribution of pixel intensities

SQL SELECT intsy, count(*) FROM tbl GROUP BY intsy

Output 150, 5215, 4

HAWQ enables rapid processing of multiple or extremely large images in parallel without memory limitations

No data movement required

Simple SQL queries for data exploration0 00 10 21 01 11 22 02 12 2

Source Image:Col

Row

0 1 2012

col

row

intsy

Structured:

Page 119: Driving Real Insights Through Data Science

140© Copyright 2015 Pivotal. All rights reserved.

Image Processing PipelineFor Object Counting

Original

Image name # Cells

Tma_001.jpg 359

Tma_002.jpg 1892

Tma_003.jpg 871

… …

SmoothingAverage over

window of pixels

ThresholdingSelect pixels under intensity threshold

CleanupMin/max over

window of pixels

Object DetectionConnected

components

Object CountingSelect components

with size filter

Page 120: Driving Real Insights Through Data Science

141© Copyright 2015 Pivotal. All rights reserved.

Image Computation Framework

Hadoop Sequenc

eFile

Thousands of Images

One sequence file

Image Pre-Processing

Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]

Feature Generation

HDFS HAWQ/GPDB

Map reduce

Map reduce

Raw Pixels

img1 [rgb1-rgbK]

img2 [rgb1-rgbK]

imgN [rgb1-rgbK]

Map reduce PL/X

SQL

Join to additional datasets

Feature Generation

Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]

ProteomicsMedical HistoryVariants

Additional Datasets

Build Models at-Scale

SQL

Page 121: Driving Real Insights Through Data Science

142© Copyright 2015 Pivotal. All rights reserved.

A Drug-Centric Data Lake to Enable Drug DiscoveryCustomer A major pharmaceutical company

Business ProblemIdentifying promising drug targets leveraging and integrating the vast datasets available will reduce time and cost to bring a new product to market

Challenges Data for drugs screens across multiple modalities

cannot be easily integrated Current environments cannot support the growing

data form high-content screens Researchers are unable to leverage the entirety of

datasets, instead work with aggregates or summaries

Solution Proved that current customer models can be ported,

sped up and scaled in the Pivotal environment

Created richer models integrating multiple types of data (genomics, images, etc)

Improve models using the raw, most-granular data

Demonstrate how the availability of tools and data enables scientists to interrogate models and derive a deeper, actionable insight

Page 122: Driving Real Insights Through Data Science

143© Copyright 2015 Pivotal. All rights reserved.

http://blog.pivotal.io/data-science-pivotalCheck out the Pivotal Data Science Blog!

Page 123: Driving Real Insights Through Data Science

144© Copyright 2015 Pivotal. All rights reserved.

FOR FURTHER INFO, CHECKOUT…

• Pivotal Blog @ http://blog.pivotal.io

• Pivotal Academy @ https://pivotal.biglms.com

Page 124: Driving Real Insights Through Data Science

145© Copyright 2015 Pivotal. All rights reserved. 145© Copyright 2013 Pivotal. All rights reserved.

Driving Insights from Data Lakes Manufacturing Demo

Matthew Ross & Antonio PetrolePivotal Data Engineering

Page 125: Driving Real Insights Through Data Science

146© Copyright 2015 Pivotal. All rights reserved.

Internet of What????

Page 126: Driving Real Insights Through Data Science

147© Copyright 2015 Pivotal. All rights reserved.

Industrial Internet of Things?

Page 127: Driving Real Insights Through Data Science

148© Copyright 2015 Pivotal. All rights reserved.

IoT Goes MainstreamAccording to Gartner, Inc. (a technology research and advisory corporation), there will be nearly 26 billion devices on the Internet of Things by 2020.

Page 128: Driving Real Insights Through Data Science

149© Copyright 2015 Pivotal. All rights reserved.

GE Doubles DownGE invests in IIoT cloud and creates Predix cloud built upon Pivotal Cloud Foundry. GE estimates that connecting these industrial machines to the IoT could boost global GDP by $10 trillion to $15 trillion in 20 years. McKinsey Global Institute research holds that IoT in general could add $6.2 billion to the global economy by 2025.

Page 129: Driving Real Insights Through Data Science

150© Copyright 2015 Pivotal. All rights reserved.

Converging Trends

InnovationNew Data New Processes New Insights

The Journey to the Data-Driven Enterprise

Data Scienceand Machine

LearningBig Data

Internet of Things

Page 130: Driving Real Insights Through Data Science

151© Copyright 2015 Pivotal. All rights reserved.

IoT Key WorkflowsData Flow Management

Reliable Infrastructure

Enterprise Level Tooling

● High Availability

● Fault Tolerant

● Scalable

● Data Overload

● Normalization

● Multiple Sources and Destinations

● Workflow Orchestration

● Admin Tooling

● Developer Enablement

Page 131: Driving Real Insights Through Data Science

152© Copyright 2015 Pivotal. All rights reserved.

IoT Key WorkflowsData Flow Management

Reliable Infrastructure

Enterprise Level Tooling

● High Availability

● Fault Tolerant

● Scalable

● Data Overload

● Normalization

● Multiple Sources and Destinations

● Workflow Orchestration

● Admin Tooling

● Developer Enablement

Page 132: Driving Real Insights Through Data Science

153© Copyright 2015 Pivotal. All rights reserved.

Data Flow Management-Data Overload● The ability to stream

and process massive amounts of data

● Must have a platform that can handle that much data without losing or corrupting any of it

Page 133: Driving Real Insights Through Data Science

154© Copyright 2015 Pivotal. All rights reserved.

Data Flow Management-Data Normalization and Cleansing

● Organizing Fields to fit into a relational structure

● Adding extra fields or removing unneeded ones

Page 134: Driving Real Insights Through Data Science

155© Copyright 2015 Pivotal. All rights reserved.

Data Flow Management-Multiple Sources and Destinations

● Stream data from multiple different sources

● Persist it so multiple different destinations

● Process multiple different data formats

Page 135: Driving Real Insights Through Data Science

156© Copyright 2015 Pivotal. All rights reserved.

IoT Key WorkflowsData Flow Management

Reliable Infrastructure

Enterprise Level Tooling

● High Availability

● Fault Tolerant

● Scalable

● Data Overload

● Normalization

● Multiple Sources and Destinations

● Workflow Orchestration

● Admin Tooling

● Developer Enablement

Page 136: Driving Real Insights Through Data Science

157© Copyright 2015 Pivotal. All rights reserved.

Reliable Infrastructure-High Availability and Fault Tolerance

● Resources are available under any conditions

● System stays up even if some resources go down

Page 137: Driving Real Insights Through Data Science

158© Copyright 2015 Pivotal. All rights reserved.

Reliable Infrastructure-Scalability

● Must be able to handle more traffic demand at any time

● Easily process big data workloads

Page 138: Driving Real Insights Through Data Science

159© Copyright 2015 Pivotal. All rights reserved.

IoT Key WorkflowsData Flow Management

Reliable Infrastructure

Enterprise Level Tooling

● High Availability

● Fault Tolerant

● Scalable

● Data Overload

● Normalization

● Multiple Sources and Destinations

● Workflow Orchestration

● Admin Tooling

● Developer Enablement

Page 139: Driving Real Insights Through Data Science

160© Copyright 2015 Pivotal. All rights reserved.

Enterprise Level Tooling-Workflows and Tooling

● Manage complex flows of data

● Provide rich User Interface Applications

Page 140: Driving Real Insights Through Data Science

161© Copyright 2015 Pivotal. All rights reserved.

Enterprise Level Tooling-Developer Enablement

● Full Featured APIs

● Extreme module customization if needed

Page 141: Driving Real Insights Through Data Science

162© Copyright 2015 Pivotal. All rights reserved.

Are you making the most out of your data?

Page 142: Driving Real Insights Through Data Science

163© Copyright 2015 Pivotal. All rights reserved.

Bringing it all together

Page 143: Driving Real Insights Through Data Science

164© Copyright 2015 Pivotal. All rights reserved. 164© Copyright 2013 Pivotal. All rights reserved.

Reporting is nice, but being able to take action is what drives the value of a platform

Page 144: Driving Real Insights Through Data Science

165© Copyright 2015 Pivotal. All rights reserved.

Predictive Analytics

Proactive Monitoring

Reactive Maintenance

Page 145: Driving Real Insights Through Data Science

166© Copyright 2015 Pivotal. All rights reserved.

ETL vs Streaming● Data is loaded in large

batches● Typically happens once a

day● Analysis can only be done

once data is transformed and persisted to data warehouse

Page 146: Driving Real Insights Through Data Science

167© Copyright 2015 Pivotal. All rights reserved.

ETL vs Streaming● Continuous, data streams

are “listening” for data being emitted from sensors

● Data can be analyzed in stream

● Can be integrated into data driven applications

Page 147: Driving Real Insights Through Data Science

168© Copyright 2015 Pivotal. All rights reserved.

Reactive Maintenance● Alert is sent out to someone

on factory floor● Worker must receive alert, and

be able to react● Equipment is down for

however long it takes for worker to receive notification and complete repairs.

Page 148: Driving Real Insights Through Data Science

169© Copyright 2015 Pivotal. All rights reserved.

Proactive Maintenance Workflow● Manager has Dashboard with

all gauges of the system● Historical Log of Recent Alerts

are visible on the Dashboard● Has the ability to dispatch a

worker to investigate a specific line or robot

Page 149: Driving Real Insights Through Data Science

170© Copyright 2015 Pivotal. All rights reserved.

Predictive Analytics Workflow● Run Machine Learning and

Data Science models

● Incorporate Business Intelligence Tools

● Persist data to Data Lake and run advanced queries

Page 150: Driving Real Insights Through Data Science

171© Copyright 2015 Pivotal. All rights reserved.

Data Streaming Needs an Agile, Scalable and Fast Solution

Data Lake

Data Ingestion

Business Intelligence

Real Time Analytics

Mobile App

Page 151: Driving Real Insights Through Data Science

172© Copyright 2015 Pivotal. All rights reserved.

Ingest Transform SinkSpringXD

Spring XD Orchestrates and Automates all the Steps on Data Stream Pipelining

HDB PHD

DataLake

ExtensibleOpen-SourceFault-TolerantHorizontally Scalable

Page 152: Driving Real Insights Through Data Science

173© Copyright 2015 Pivotal. All rights reserved.

INGEST / SINK PROCESS ANALYZE

• No coding required

• Dozens of built-in connectors

• Seamless integration with Kafka, Sqoop

• Create new connectors easily using Spring

• Call Spark, Reactor or RxJava

• Built-in configurable filtering, splitting and transformation

• Out-of-box configurable jobs for batch processing

• Import and invoke PMML jobs easily

• Call Python, R, Madlib and other tools

• Built-in configurable counters and gauges

Spring XDState of the Art Data Pipeline Automation

Page 153: Driving Real Insights Through Data Science

174© Copyright 2015 Pivotal. All rights reserved.

Pivotal HDBHadoop Native SQL

• Exceptional Hadoop Native SQL Performance

• No compatibility risks to SQL developers or SQL BI tools and applications

• Support query roll-ups, dynamic partitions and joins

• Massive MPP scalability to petabytes

• On premise or on the cloud

• Scale your cluster out, not up

• World class parallel loading and unloading

• Fast performance for complex and advanced data analytics

• Integrated with MADLib for advanced machine learning

• Powerful Cost-based Query Optimizer

Page 154: Driving Real Insights Through Data Science

BUILT FOR THE SPEED OF BUSINESS