Optimizing the Data Supply Chain for Data Science 61 Broadway Suite 1105 New York, NY 10006 [email protected] http://www.vital.ai Marc Hadfield CEO, Vital A.I.

Page 1: Optimizing the Data Supply Chain for Data Science

Optimizing the Data Supply Chain for Data Science

Marc Hadfield, CEO, Vital A.I.

Page 2: Optimizing the Data Supply Chain for Data Science

about: vital ai

Software Applications: Artificial Intelligence, Machine Learning, Data Science.

Software Vendor & Consulting Services

Page 3: Optimizing the Data Supply Chain for Data Science

agenda

• Data Models
• How A.I., Data Science, & Data Governance relate
• Data Supply Chain (DSC) & the Data Product
• Problem: the “Telephone Game” across the DSC
• Architecture Transition from Data Warehouse to DSC
• Data Models and DSC; a Framework for Solutions
• Examples
• Collaboration & Visualization

note: general methodology, with some specific examples from Vital AI implementations.

Page 4: Optimizing the Data Supply Chain for Data Science

takeaways:

• The Data Supply Chain is a supply chain to deliver Data Products

• Data Models can capture the implicit meaning of data (and that is the goal!)

• Data Models can help negotiate the implicit differences across the DSC

• Data Models offer a means to collaborate on data standards (meaning) across the DSC partners

Page 5: Optimizing the Data Supply Chain for Data Science

about data models:

Semantic Models

Page 6: Optimizing the Data Supply Chain for Data Science

big data:

volume, velocity, variety, veracity

variety: data models

“Product”: different meaning in Manufacturing vs. Retail context

Healthcare, same entity: “Patient”, “InsuredPerson”, “BillableEntity”

Page 7: Optimizing the Data Supply Chain for Data Science

example:

Class: Person
Property: birthday

• Standardized Unique Global Identifier (URI)
• data type: date
• relationship with property: age
• allowed range of values (can’t be born in the future)
• typical (average/expected) value… (Birthdays in Wikipedia vs. Customer Database)
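A minimal sketch of how such a property definition might be carried in code, assuming Python and invented names (`DateProperty`, `age_in_years`, the example.org URI, and the 1900 lower bound are all placeholders, not part of any Vital AI API):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DateProperty:
    """Hypothetical description of one date-valued property in a data model."""
    uri: str                  # globally unique identifier for the property
    name: str
    min_value: date           # earliest plausible value (an assumption)
    allow_future: bool = False

    def validate(self, value: date, today: date) -> Optional[str]:
        """Return None if the value is acceptable, else a reason string."""
        if value < self.min_value:
            return f"{self.name} {value} is before {self.min_value}"
        if not self.allow_future and value > today:
            return f"{self.name} {value} is in the future"
        return None


def age_in_years(birthday: date, today: date) -> int:
    """Derived property: age follows from birthday, so it need not be stored."""
    before_birthday = (today.month, today.day) < (birthday.month, birthday.day)
    return today.year - birthday.year - (1 if before_birthday else 0)


BIRTHDAY = DateProperty(
    uri="http://example.org/model#hasBirthday",   # placeholder URI
    name="birthday",
    min_value=date(1900, 1, 1),                   # assumed lower bound
)
```

The point of the sketch is that range checks and the birthday/age relationship live with the property definition rather than in each consuming job.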

Page 8: Optimizing the Data Supply Chain for Data Science

about: vital ai tech

Vital AI Development Kit (VDK)

VitalSigns: Data Modeling & Code Generation

VitalService: Common API for Databases, Machine Learning, Apache Spark, Data Transforms

Page 9: Optimizing the Data Supply Chain for Data Science

about: vital ai tech

VitalServiceQuery

ExecutableQuery

Query Generator

Common Query API:
• Relational DB (SQL)
• Graph DB (SPARQL)
• Key/Value Store
• NoSQL DB
• Document DB
• Apache Spark
• Hive (Hadoop)
• Predictive Models (a query for an unknown value)

Goal: Build A.I. applications across a variety of infrastructure with a consistent API & Models.
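To illustrate the idea of one query rendered for several backends, here is a small sketch in Python. It is not the VitalService API: `PropertyEquals`, `to_sql`, `to_sparql`, and the example.org namespace are hypothetical, and the table-per-class, column-per-property mapping is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PropertyEquals:
    """Backend-neutral query: instances of a class whose property equals a value."""
    class_name: str
    property_name: str
    value: str


def to_sql(q: PropertyEquals) -> Tuple[str, Tuple[str, ...]]:
    """Render as parameterized SQL (assumes table = class, column = property).

    Class and property names are taken from the trusted data model, not user input.
    """
    return (f"SELECT * FROM {q.class_name} WHERE {q.property_name} = ?", (q.value,))


def to_sparql(q: PropertyEquals, ns: str = "http://example.org/model#") -> str:
    """Render as SPARQL (class and property become URIs in an assumed namespace)."""
    literal = q.value.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"SELECT ?s WHERE {{ ?s a <{ns}{q.class_name}> ; "
        f'<{ns}{q.property_name}> "{literal}" . }}'
    )
```

Application code would build `PropertyEquals` once; which renderer runs depends on where the data happens to live.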

Page 10: Optimizing the Data Supply Chain for Data Science

example data:

Message → hasSender → Person:Sender

Message → hasRecipient → Person:Recipient

Page 11: Optimizing the Data Supply Chain for Data Science

example “MetaQL” query:

    GRAPH {
        value segments: ["mydata"]
        ARC {
            node_constraint { Message.class }
            constraint { "?person1 != ?person2" }
            ARC_AND {
                ARC {
                    edge_constraint { Edge_hasSender.class }
                    node_constraint { Person.props().emailAddress.equalTo("[email protected]") }
                    node_constraint { Person.class }
                    node_provides { "person1 = URI" }
                }
                ARC {
                    edge_constraint { Edge_hasRecipient.class }
                    node_constraint { Person.class }
                    node_provides { "person2 = URI" }
                }
            }
        }
    }

“Person” may have subtypes, like Student or Employee.
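For readers unfamiliar with graph queries, the pattern above can be read as the following in-memory Python sketch. This is only an analogy under stated assumptions: the classes and the `match` function are invented here, edges are flattened into object fields, and none of it is MetaQL or VitalService code.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Person:
    uri: str
    email: str


@dataclass(frozen=True)
class Student(Person):      # a subtype still satisfies a Person constraint
    pass


@dataclass(frozen=True)
class Message:
    uri: str
    sender: Person                     # stands in for the hasSender edge
    recipients: Tuple[Person, ...]     # stands in for hasRecipient edges


def match(messages: List[Message], sender_email: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (message, person1, person2) URIs for each sender/recipient pair."""
    for m in messages:
        if not isinstance(m.sender, Person) or m.sender.email != sender_email:
            continue
        for r in m.recipients:
            # mirrors the "?person1 != ?person2" constraint
            if isinstance(r, Person) and r.uri != m.sender.uri:
                yield (m.uri, m.sender.uri, r.uri)
```

Because the constraint is on the class `Person`, a `Student` sender matches without the query having to name the subtype.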

Page 12: Optimizing the Data Supply Chain for Data Science

a.i. and data quality

Page 13: Optimizing the Data Supply Chain for Data Science

data models & machine learning:

using the meaning of classes and properties, automatically generate predictive models.

predictive model features: birthday, zip code, …
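One way to picture "using the meaning of properties" for feature generation, sketched in Python under assumptions: the field names, the `MODEL` mapping, and the two feature rules are invented for illustration, and the age calculation is deliberately approximate.

```python
from datetime import date
from typing import Any, Callable, Dict, Optional

# Hypothetical registry: semantic type of a property -> how to turn it into features.
FEATURE_RULES: Dict[str, Callable[[Any, date], Dict[str, float]]] = {
    # approximate age in whole years (ignores leap days)
    "birthday": lambda v, today: {"age_years": float((today - v).days // 365)},
    # coarse region indicator from the first three digits
    "zip_code": lambda v, today: {f"zip3_{v[:3]}": 1.0},
}

# Assumed mapping from raw field name to semantic type, as a data model would record it.
MODEL: Dict[str, Optional[str]] = {"dob": "birthday", "postal": "zip_code", "nickname": None}


def features(record: Dict[str, Any], today: date) -> Dict[str, float]:
    """Build a feature dict using only fields whose meaning the model knows."""
    out: Dict[str, float] = {}
    for field, value in record.items():
        semantic_type = MODEL.get(field)
        if semantic_type in FEATURE_RULES and value is not None:
            out.update(FEATURE_RULES[semantic_type](value, today))
    return out
```

A field the model does not describe (here `nickname`) contributes nothing, which is the conservative default.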

Page 14: Optimizing the Data Supply Chain for Data Science

data governance = defining the meaning of data = feature (pre)engineering

critical aspect of data science

Page 15: Optimizing the Data Supply Chain for Data Science

Progression of Analytics:

Page 16: Optimizing the Data Supply Chain for Data Science

where a.i. happens

Progression of Analytics:

Page 17: Optimizing the Data Supply Chain for Data Science

Garbage In = Garbage Out

= Bad A.I.

data governance required for Good A.I.

Page 18: Optimizing the Data Supply Chain for Data Science

one more point on data governance…

think outside the box (data warehouse)

Page 19: Optimizing the Data Supply Chain for Data Science

data governance: data in motion

inside data warehouse vs. outside data warehouse

Page 20: Optimizing the Data Supply Chain for Data Science

supply chain

Page 21: Optimizing the Data Supply Chain for Data Science

supply chain:

product

Page 22: Optimizing the Data Supply Chain for Data Science

data supply chain:

data product

Page 23: Optimizing the Data Supply Chain for Data Science

data product:

Retail Recommendations…
Shipping/Logistics Optimization…
Compliance, Auditing, Security, Fraud Detection…

Page 24: Optimizing the Data Supply Chain for Data Science

why data supply chain?

Partner DW Your DW

“No matter who you are, most of the smartest people work for someone else.” (Bill Joy)

Page 25: Optimizing the Data Supply Chain for Data Science

why data supply chain?

Partner DW Your DW

“No matter who you are, most of the data works for someone else.” (Bill Joy, revised: “data” replaces “smartest people”)

Page 26: Optimizing the Data Supply Chain for Data Science

data supply chain

Partner DW

Your DW

why not ETL?

Page 27: Optimizing the Data Supply Chain for Data Science

Partner DW

Page 28: Optimizing the Data Supply Chain for Data Science

Extract…

not quite as expected…

Page 29: Optimizing the Data Supply Chain for Data Science

Transform…

a bit extreme…

Page 30: Optimizing the Data Supply Chain for Data Science

Load…

a bit messy…

Page 31: Optimizing the Data Supply Chain for Data Science

Clean…

a lot of manual effort…

Page 32: Optimizing the Data Supply Chain for Data Science

… your imported data

Page 33: Optimizing the Data Supply Chain for Data Science

Your DW

Partner DW

Why?

Page 34: Optimizing the Data Supply Chain for Data Science

what goes wrong?

telephone game…

Page 35: Optimizing the Data Supply Chain for Data Science

You

Partner

Model “A”

Model “B”

Implicit Model

Page 36: Optimizing the Data Supply Chain for Data Science

Resolution: Make explicit the implicit. Align Data Models.

Reason: Implicit assumptions in the data. ETL can’t see the forest for the trees (or it’s very difficult with missing assumptions).

Page 37: Optimizing the Data Supply Chain for Data Science

Example: Internet of Things

Predictive Analytics

“Nest for Office Buildings”

Office Tower with Building Management System (BMS) containing 100,000 monitored points (temperature, energy usage of chiller, fan speed, etc.) with significant missing data, errors, and noise.

Reconciliation of data to produce predictive models to minimize energy usage.

Rules for data correctness.

Page 38: Optimizing the Data Supply Chain for Data Science

Sensor Data Validation:

Source data had temperature values of “0” (zero), which meant either that the temperature was 0 degrees or that the sensor had an error.

The Data Model “knows” that it’s rarely 0 degrees in July (many standard deviations from the expected value), and that the temperature on a day in December can be compared to weather data for reasonableness.

If the Data Model also knows the maintenance schedule for the sensors, then it “knows” when to expect 0 error values and can exclude them.

Missing Maintenance Assumptions. Fill in secondary (weather) data for validation.
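The zero-reading rule described above might look roughly like this in Python. This is a sketch under assumptions, not the deployed logic: `SensorModel`, `classify_reading`, the three labels, and the 15-degree tolerance are all invented, and only the ambiguous zero case is examined.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SensorModel:
    """What the data model is assumed to know about one temperature sensor."""
    maintenance: List[Tuple[datetime, datetime]]   # scheduled outage windows
    max_weather_gap: float = 15.0                  # assumed tolerance, in degrees


def classify_reading(
    value: float,
    when: datetime,
    model: SensorModel,
    outdoor_temp: Optional[float],
) -> str:
    """Label a reading 'ok', 'maintenance', or 'suspect'."""
    if value != 0:
        return "ok"          # this sketch only resolves the ambiguous zero
    # A zero during a scheduled outage is an expected error code.
    if any(start <= when <= end for start, end in model.maintenance):
        return "maintenance"
    # Otherwise compare against secondary (weather) data for reasonableness.
    if outdoor_temp is not None and abs(outdoor_temp - value) <= model.max_weather_gap:
        return "ok"
    return "suspect"
```

Readings labeled `maintenance` or `suspect` would be excluded before model training; a zero in December with cold weather outside passes as a plausible real value.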

Page 39: Optimizing the Data Supply Chain for Data Science

how did we get here?

Architecture Review: a quick step back…

What is a Data Supply Chain architecture?

Page 40: Optimizing the Data Supply Chain for Data Science

“traditional” data warehouse:

ETL within the organization.

Data Governance across the organization.

DW

Page 41: Optimizing the Data Supply Chain for Data Science

tech co. “agile” data warehouse:

storage

compute

HDFS

Spark

DataSets

Jobs

Batch/Streaming
Build Predictive Models
Realtime: Spark/Storm

hadoop cluster

Page 42: Optimizing the Data Supply Chain for Data Science

enterprise: data lake

storage

compute

HDFS

Spark

X(save $)

“Data Swamp”

Page 43: Optimizing the Data Supply Chain for Data Science

aside: Data Lake

better analogy: Scriptorium

library, manuscript copying, & book distribution.

but not as pithy as “Lake”…

Page 44: Optimizing the Data Supply Chain for Data Science

tech co. microservices (micro-SOA):

storage

compute

service

“Composed” App

external: social data, weather API

independent clusters, local data expertise

optimize development processes, scale up.

Page 45: Optimizing the Data Supply Chain for Data Science

microservices example:

Amazon: product search uses 170 independent microservices

including services for predicting customer characteristics, getting product images, etc.

http://www.infoworld.com/article/2903144/application-development/how-to-succeed-with-microservices-architecture.html

Netflix: similar architecture

Page 46: Optimizing the Data Supply Chain for Data Science

Data Supply Chain:

storage

compute

service

Data Product

“ETL”

Owner “A” Owner “B”

optimize development processes, scale up.

independent clusters, local data, ownership

Page 47: Optimizing the Data Supply Chain for Data Science

Interaction Points:

Data Product

service

compute ETL

Owner “A” Owner “B”

Page 48: Optimizing the Data Supply Chain for Data Science

Data Lineage: Cloudera Navigator

…within a Data Warehouse

trace back jobs that produced every data field.

Page 49: Optimizing the Data Supply Chain for Data Science

Data Supply Chain with Provenance:

include provenance data directly in the imported dataset.
use it in rules to interpret the data.

entity-123 | hasSource | datasource-A
entity-123 | name | “John Doe”

Data Warehouse B
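A sketch of how a rule could use the `hasSource` triple when interpreting a value, assuming Python. The second entity, `datasource-B`, and its "LAST, FIRST" name convention are invented for illustration; only the `entity-123` triples come from the slide.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

triples: List[Triple] = [
    ("entity-123", "hasSource", "datasource-A"),
    ("entity-123", "name", "John Doe"),
    ("entity-456", "hasSource", "datasource-B"),   # hypothetical second source
    ("entity-456", "name", "DOE, JANE"),
]


def normalize_name(raw: str, source: str) -> str:
    """Interpret a name according to the (assumed) convention of its source."""
    if source == "datasource-B":                   # assumed to store "LAST, FIRST"
        last, first = [part.strip() for part in raw.split(",", 1)]
        return f"{first.title()} {last.title()}"
    return raw


def names_by_entity(data: List[Triple]) -> Dict[str, str]:
    """Look up each entity's source, then apply the matching interpretation rule."""
    source = {s: o for s, p, o in data if p == "hasSource"}
    return {s: normalize_name(o, source.get(s, "")) for s, p, o in data if p == "name"}
```

Because provenance travels with the data, the rule does not depend on knowing which job or file the row arrived in.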

Page 50: Optimizing the Data Supply Chain for Data Science

Interaction Points: Data Models

Data Product

service

compute

ETL

Data Models: Gatekeepers & Transform

Owner “A” Owner “B”

Page 51: Optimizing the Data Supply Chain for Data Science

Data Supply Chain using Models:

storage

compute

service

Data Product

ETL

Owner “A” Owner “B”

Model Server

Data Models: focus of data governance

Page 52: Optimizing the Data Supply Chain for Data Science

Semantic Data Models:

Make explicit the meaning of data

Transformation and Validation Rules leverage the Model and Meaning.

Such Rules may be packaged with the Model, and managed together.

Protect against implicit assumptions

Page 53: Optimizing the Data Supply Chain for Data Science

Example: Financial Services

A B C

Service Provider

Reconciliation of Corporate Structure across thousands of organizations.

Compliance Rules barring communication between “researchers” and “traders”.

Rules to infer if “Mary” is a “researcher” or “trader”.

Conflicting concepts of “Branch-Office”, “Direct-Report”, etc. across the globe.

Page 54: Optimizing the Data Supply Chain for Data Science

Example: Hospital Group

A B C

Data Analytics

Reconciliation across Patient Records, Insurance, & Billing for Patient Predictive Analytics.

Rules for identity: “same person”
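A “same person” rule can be sketched as follows in Python. The matching criteria (normalized name, birthday, and ZIP code all equal) are an assumption chosen for brevity; real identity rules are typically fuzzier, and `Record`, `same_person`, and `group_by_person` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Record:
    system: str        # e.g. "patient", "insurance", "billing"
    record_id: str
    full_name: str
    birthday: date
    zip_code: str


def _norm(name: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences vanish."""
    return " ".join(name.lower().split())


def same_person(a: Record, b: Record) -> bool:
    """Assumed identity rule: normalized name, birthday, and ZIP code all agree."""
    return (_norm(a.full_name) == _norm(b.full_name)
            and a.birthday == b.birthday
            and a.zip_code == b.zip_code)


def group_by_person(records: List[Record]) -> List[List[Record]]:
    """Greedy grouping: each record joins the first group whose head it matches."""
    groups: List[List[Record]] = []
    for r in records:
        for g in groups:
            if same_person(g[0], r):
                g.append(r)
                break
        else:
            groups.append([r])
    return groups
```

Keeping the rule in one function tied to the model means Patient Records, Insurance, and Billing are all reconciled by the same definition of identity.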

Page 55: Optimizing the Data Supply Chain for Data Science

Data Models:

OWL: Semantic Ontology Model (W3C Standard, Various Standards for Rules)

VitalSigns: Generate Code

validation, transformation, …

VitalSigns: Versioning, Dependencies, Exchange, Storage, Change Management (Semantic “Diff”)

Page 56: Optimizing the Data Supply Chain for Data Science

Example: Personally Identifiable Information

Data Governance determines that “Profession” and “ZipCode” cannot be used together. (Maybe a single “Dentist” in a small town…)

Within a single Data Warehouse we can bar these data elements from being combined. But:

Microservice A provides value of “Profession”
Microservice B provides value of “ZipCode”

How to enforce that these two microservices cannot be combined?

Page 57: Optimizing the Data Supply Chain for Data Science

Example: Personally Identifiable Information

Validation code enforcing data usage:

    Person person123 = get_person_details("entity-123")

    // this call works:
    person123.profession = get_profession(person123)

    // this call blocks because of data model validation:
    // person123 already has the "profession" property
    person123.zipcode = get_zipcode(person123)
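A runnable approximation of that behavior, assuming Python and an invented `DataModelViolation` exception. In the approach described here the check would come from generated model code; this hand-written class only imitates the effect and is not VitalSigns output.

```python
from typing import Any, FrozenSet, Set


class DataModelViolation(Exception):
    """Raised when an assignment would break a data-governance rule."""


class Person:
    # Assumed rule from the data model: these properties may not co-occur.
    FORBIDDEN_TOGETHER: Set[FrozenSet[str]] = {frozenset({"profession", "zipcode"})}

    def __init__(self, uri: str) -> None:
        object.__setattr__(self, "uri", uri)

    def __setattr__(self, name: str, value: Any) -> None:
        # Properties already set on the object, plus the one being assigned.
        present = set(vars(self)) | {name}
        for combo in self.FORBIDDEN_TOGETHER:
            if combo <= present:
                raise DataModelViolation(f"cannot combine {sorted(combo)}")
        object.__setattr__(self, name, value)
```

Usage: `p = Person("entity-123"); p.profession = "Dentist"` succeeds, while a following `p.zipcode = "10006"` raises `DataModelViolation` regardless of which microservice supplied the value.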

Page 58: Optimizing the Data Supply Chain for Data Science

Semantic Data Models:

Gatekeepers

Externally Managed.
Active not Passive, more like “code”.
Defining what should exist, not cataloguing what exists.
Can decide when to be tolerant or strict.

Page 59: Optimizing the Data Supply Chain for Data Science

Collaborative Conversations:

Infrastructure DevOps

Data Scientists

Business + Domain Experts

Developers

Semantic Model

Page 60: Optimizing the Data Supply Chain for Data Science

Collaborative Conversations:

Business + Domain Experts

Semantic Model

Business + Domain Experts

Semantic Model

Partner A Partner B

Model Alignment

What Concepts to combine, not what Tables to combine (that comes later).
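Concept-level alignment can be written down before any table mapping exists. A minimal Python sketch, with an invented `ALIGNMENT` table and hypothetical property names on both sides:

```python
from typing import Any, Dict

# Assumed concept-level alignment between two partners' models.
# Keys are Partner A property names; values are the agreed Partner B equivalents.
ALIGNMENT: Dict[str, str] = {
    "Patient.birthDate": "InsuredPerson.dateOfBirth",
    "Patient.postalCode": "InsuredPerson.zip",
}


def translate(record_a: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite a Partner A record into Partner B terms; reject unaligned concepts."""
    unaligned = sorted(k for k in record_a if k not in ALIGNMENT)
    if unaligned:
        raise KeyError(f"no agreed meaning for: {unaligned}")
    return {ALIGNMENT[k]: v for k, v in record_a.items()}
```

Rejecting unaligned properties, rather than passing them through, forces the missing concept back into the conversation between the partners' domain experts.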

Page 61: Optimizing the Data Supply Chain for Data Science

Authoring Tool: OWL IDE Protege

Page 62: Optimizing the Data Supply Chain for Data Science

Visualization: Semantic Data

Page 63: Optimizing the Data Supply Chain for Data Science

Visualization: WebVOWL

Page 64: Optimizing the Data Supply Chain for Data Science

in conclusion, takeaways:

• The Data Supply Chain is a supply chain to deliver Data Products

• Data Models can capture the implicit meaning of data (and that is the goal!)

• Data Models can help negotiate the implicit differences across the DSC

• Data Models offer a means to collaborate on data standards (meaning) across the DSC partners

Page 65: Optimizing the Data Supply Chain for Data Science

Questions?

Marc Hadfield
CEO, Vital A.I.
[email protected]
