38
1 © Copyright 2012 EMC Corporation. All rights reserved. OpenChorus: Building a Tool-Chest for Big Data Science Milind Bhandarkar Chief Scientist, Machine Learning Platforms EMC Greenplum

OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

1 © Copyright 2012 EMC Corporation. All rights reserved.

OpenChorus: Building a Tool-Chest for Big Data Science

Milind Bhandarkar Chief Scientist, Machine Learning Platforms EMC Greenplum

Page 2: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

2 © Copyright 2012 EMC Corporation. All rights reserved.

Agenda

! Tools for Data Science ! Data Science Workflow

! Greenplum OpenChorus

! How Chorus Works

Page 3: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

3 © Copyright 2012 EMC Corporation. All rights reserved.

Data Science Tools: Abundance of Riches

! Proliferation of tools ! Languages & Libraries

–  R, Matlab, Python – SciPy, NLTK, Madlib, Mahout

! Frameworks –  Graphlab, Pregel (Giraffe), Mesos, CEP

! Platforms/Data Stores –  MPP Databases, Hadoop, NoSQL (Hbase, Cassandra,

MongoDB), SciDB

Page 4: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

4 © Copyright 2012 EMC Corporation. All rights reserved.

Choice of Tool(s)?

! Hammer ? –  Hadoop ought to be sufficient for most tasks –  “If all you have have a hammer, throw away everything

that is not a nail” – Jimmy Lin (http://arxiv.org/abs/1209.2191)

–  Operational complexity / learning curve not worth efficiency

! Tool-Chest ? –  Use the right tool for the right job –  How to reduce complexity

Page 5: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

5 © Copyright 2012 EMC Corporation. All rights reserved.

Page 6: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

6 © Copyright 2012 EMC Corporation. All rights reserved.

Data Science Workload

! http://www.dataists.com/2010/09/a-taxonomy-of-data-science/

! Obtain

! Scrub

! Explore

! Model !  Interpret

Page 7: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

7 © Copyright 2012 EMC Corporation. All rights reserved.

Obtain

! Corpus needs to be usable & sufficient ! Possibly from multiple independent sources

! Needs to be automated for streams

! Needs to have efficient ingestion for one-time data

Page 8: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

8 © Copyright 2012 EMC Corporation. All rights reserved.

Scrub

! Raw data is always messy –  Missing data, inconsistent data, charsets(!) –  NY, New York, NYC, Big Apple etc

! Growing Dictionaries

!  Join with Crowdsourcing –  Mechanical Turk etc

Page 9: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

9 © Copyright 2012 EMC Corporation. All rights reserved.

Explore

! Visualize, Clustering, Dimensionality reduction –  Feature correlations (scatter plots) –  Single feature histograms

! Challenge: How not to lose these observations

Page 10: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

10 © Copyright 2012 EMC Corporation. All rights reserved.

Model

! Find correlation of past data and outcome –  Find good training set –  Label the training set –  Derive model parameters –  Apply model, and validate

! Ensemble Models: Model of models

Page 11: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

11 © Copyright 2012 EMC Corporation. All rights reserved.

Interpret

! Models are built for prediction and interpretation ! Check that there are no “surprises”

! Reason about models

!  Improve models

Page 12: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

12 © Copyright 2012 EMC Corporation. All rights reserved.

Data Science Data Flow

! Raw Data (Timed, Partitioned, Crowdsourced, De-duped etc)

! Derived data (simple aggregates, other statistics)

! Models (Feature weights, decision trees)

!  Indexes

Page 13: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

13 © Copyright 2012 EMC Corporation. All rights reserved.

Data Diversity

! Natural Language Text, and Annotations !  (Bags of words) : Concept

! Graphs (sparse matrices)

! Dense Matrices

! Location (proximity)

Page 14: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

14 © Copyright 2012 EMC Corporation. All rights reserved.

•  Share drives •  Wiki •  Content mgmt app

Too Many Tools for One Data Science Project

Data Exploration and Sharing

Analytics Tools

Communications

Content and File Management

•  Data marts •  Excel spreadsheets •  Flat files

•  Data mining •  BI/Visualization •  Data integration

•  Emails •  Meetings •  IMs

Process Documentation

•  Some project plans, no up-to-date team collaborative documentation

Page 15: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

15 © Copyright 2012 EMC Corporation. All rights reserved.

High Cost of Knowledge Sharing

! Data science process breaks when organization structure changes

! Very difficult knowledge transfer

! No “insurance policy” for the data science intellectual assets

Page 16: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

16 © Copyright 2012 EMC Corporation. All rights reserved.

Delayed Time-to-Market

1. Find the data

2. Get access To data

4. Move to sandbox

5. Analysis Finally!

6. Operationalize the model

3. Learn about the data

6-9 Months

Page 17: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

17 © Copyright 2012 EMC Corporation. All rights reserved.

Greenplum Chorus

! Collaborative analytics ! Powerful extensibility ! The freedom of open source

Greenplum’s Social Platform for Collaborative Data Science

Page 18: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

18 © Copyright 2012 EMC Corporation. All rights reserved.

Chorus Enables Collaborative Data Science

!  Collaborate within projects, share data, content, and findings across teams

! Make projects more transparent

!  Iterate faster for accelerated insights with real-time social collaboration

Self-Services Provisioning

Analysis & Modeling

Publish & Share

Data Discovery & Exploration

Collaboration

Page 19: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

19 © Copyright 2012 EMC Corporation. All rights reserved.

Powerful Extensibility

!  Integrated development environment for analytics

! Expand insights with simple access to third-party data

! Fusion with leading analytics and visualization tools

Page 20: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

20 © Copyright 2012 EMC Corporation. All rights reserved.

The Freedom of Open Source

www.openchorus.org

! Modify and extend to any environment

! Promotes an ecosystem of applications, startups, and data scientists community

Page 21: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

21 © Copyright 2012 EMC Corporation. All rights reserved.

How Chorus Works

Page 22: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

22 © Copyright 2012 EMC Corporation. All rights reserved.

Cho

rus

Wor

kspa

ce

Activities & Insights

Data

Work Files

How Chorus Works

Greenplum DB

Sandbox

Source Data

EDW

DB

GPDB

GPDB External

Table

Non-GPDB

Chorus View

Summary of next few slides, with

animation built in

Page 23: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

23 © Copyright 2012 EMC Corporation. All rights reserved.

Data Exploration Search and Data Discovery

!  Automatic indexing of meta-data, work files, comments, and insights

!  Quickly find data across the enterprise regardless of location

Page 24: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

24 © Copyright 2012 EMC Corporation. All rights reserved.

Data Exploration Data Preview and Visualization !  Data preview for instant

understanding !  Quick and easy data

visualizations –  Visualize data for faster insight

into datasets –  No need to export to third-party

applications like R –  Not a replacement for advanced

visualization tools

Page 25: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

25 © Copyright 2012 EMC Corporation. All rights reserved.

Data Exploration Living Data Dictionary !  Bring everything about the

data to the data –  Attach documents –  Ask questions –  Add comments

!  Build a living data dictionary –  Everything is current –  No more spreadsheets

Page 26: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

26 © Copyright 2012 EMC Corporation. All rights reserved.

Cho

rus

Wor

kspa

ce

Workspace – Streamlines Collaboration !  Chorus includes unlimited

workspaces, each representing individual project

!  Streamlines complex user-user and user-data interactions

Page 27: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

27 © Copyright 2012 EMC Corporation. All rights reserved.

Cho

rus

Wor

kspa

ce

Multi-level Secure Collaboration !  Authentication

–  Integrates with LDAP and AD for password management

!  Application access control –  User roles: Admin vs. general

user –  Workspace types: Public or

private

!  Data access control –  Chorus enforces database rules

and permissions

Page 28: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

28 © Copyright 2012 EMC Corporation. All rights reserved.

Cho

rus

Wor

kspa

ce

Activities & Insights

Data

Work Files

Data – Dataset Types 1.  Source Dataset

–  Pointer to the source data –  Both internal and external data –  Support both native connectivity

for GPDB and flat files –  Use GPDB External Tables for Non-

GPDB databases and Hadoop

2.  Sandbox Dataset –  Copy of the source data to be used

for analytics –  Data generated from analytics

Page 29: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

29 © Copyright 2012 EMC Corporation. All rights reserved.

Cho

rus

Wor

kspa

ce

Activities & Insights

Data

Work Files

Data – Sandbox !  Container of all the analytics

data !  Ease of self-service

provisioning of sandboxes –  Free up IT bandwidth –  Minimize data proliferation to

uncontrolled/unmanaged data marts

Greenplum DB

Sandbox

Page 30: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

30 © Copyright 2012 EMC Corporation. All rights reserved.

Data – Populating Sandbox Import data easily from anywhere: !  Directly from sources !  Through Chorus View !  Flat file import

Cho

rus

Wor

kspa

ce

Activities & Insights

Data

Work Files

Greenplum DB

Sandbox

Chorus View

Source Data

Non-GPDB EDW

GPDB

GPDB External

Table

Page 31: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

31 © Copyright 2012 EMC Corporation. All rights reserved.

Data – Chorus View Utility !  Single-view GUI utility for

exploring, filtering, aggregating, and moving the desired data from sources to sandbox

!  Data exploration and visualization prior to bringing the data into sandbox

!  Derive variation of the basic source datasets without bringing the data into sandbox

Cho

rus

Wor

kspa

ce

Activities & Insights

Data

Work Files

Greenplum DB

Sandbox

Chorus View

Page 32: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

32 © Copyright 2012 EMC Corporation. All rights reserved.

Sandbox Dataset

Data - Chorus View Chorus View

Select a.userid, a.customer_name, a.gender, a.customer_state, b.ipaddress, b.device, From customers AS a INNER JOIN weblog_2012q1 as b ON a.userid = b.userid

Source Dataset

GPDB External Table

EDW GPDB

GPDB External Table

Page 33: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

33 © Copyright 2012 EMC Corporation. All rights reserved.

Data – Automated Data Services

!  Subscribe to receive automatic updates

–  Schedule imports from multiple data sources

–  Define and share data sets within the data science team

–  Removes manual data refresh activities

Page 34: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

34 © Copyright 2012 EMC Corporation. All rights reserved.

Cho

rus

Wor

kspa

ce

Work Files

!  Work files are non-data assets –  SQL query statements with

code editor interface –  Execution of in-database

analytics, ex: MADLib, PL/R –  Third-party tool files –  PowerPoint, Word doc, etc.

!  Analytics asset management with version, compare, and archive work files

Activities & Insights

Data

Work Files

Greenplum DB

Sandbox

Chorus View

Page 35: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

35 © Copyright 2012 EMC Corporation. All rights reserved.

Integration with Analytics Tools

!  Third-party tools –  Execute in-database analytics

functions (ex: MADLib, R) from Chorus work files

–  Publish and execute Alpine Miner Workflow from Chorus native interface

–  Data preparation for analysis using SAS and other analytics tools

!  Code-design UI for SQL

Page 36: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

36 © Copyright 2012 EMC Corporation. All rights reserved.

Insight and Data Sharing

!  Post comments and ask questions on any analytics artifacts

!  Share and publish any activities or insights

!  Promote fast iteration on data and ideas

Page 37: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

37 © Copyright 2012 EMC Corporation. All rights reserved.

Cho

rus

Wor

kspa

ce

Activities and Insights

!  Build a living library of activities and insights

–  Define, publish, and share new insights

–  Discover and learn from existing insights

!  Iterate faster, model less

Activities & Insights

Data

Work Files

Greenplum DB

Sandbox

Chorus View

Page 38: OpenChorus: Building a Tool-Chest for Big Data Science - Building a... · Data – Chorus View Utility ! Single-view GUI utility for exploring, filtering, aggregating, and moving

38 © Copyright 2012 EMC Corporation. All rights reserved.