Optimizing the Data Supply Chain for Data Science
61 Broadway Suite 1105 New York, NY 10006
[email protected] http://www.vital.ai
Marc Hadfield, CEO, Vital A.I.
about: vital ai
Software Applications: Artificial Intelligence, Machine Learning, Data Science.
Software Vendor & Consulting Services
agenda
• Data Models
• How A.I., Data Science, & Data Governance relate
• Data Supply Chain & the Data Product
• Problem: the “Telephone Game” across the DSC
• Architecture Transition from Data Warehouse to DSC
• Data Models and DSC; a Framework for Solutions
• Examples
• Collaboration & Visualization
note: general methodology, with some specific examples from Vital AI implementations.
takeaways:
• The Data Supply Chain is a supply chain to deliver Data Products
• Data Models can capture the implicit meaning of data (and that is the goal!)
• Data Models can help negotiate the implicit differences across the DSC
• Data Models offer a means to collaborate on data standards (meaning) across the DSC partners
about data models:
Semantic Models
big data:
volume, velocity, variety, veracity
variety: data models
“Product”: different meaning in Manufacturing vs Retail context
Healthcare, same entity: “Patient”, “InsuredPerson”, “BillableEntity”
example:
Class: Person
Property: birthday
• Standardized Unique Global Identifier (URI)
• data type: date
• relationship with property: age
• allowed range of values (can’t be born in the future)
• typical (average/expected) value…
(Birthdays in Wikipedia vs Customer Database)
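The birthday property above can be sketched as a small model definition. This is a minimal Python sketch and not VitalSigns output: the `DateProperty` class, the example URI, and the `age_on` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DateProperty:
    """A date-valued property: a global identifier plus a range rule."""
    uri: str                    # hypothetical URI, for illustration only
    allow_future: bool = False  # allowed range of values

    def validate(self, value: date, today: date) -> None:
        # Data type check: the property only accepts dates.
        if not isinstance(value, date):
            raise TypeError(f"{self.uri}: expected a date")
        # Range check: a birthday cannot be in the future.
        if not self.allow_future and value > today:
            raise ValueError(f"{self.uri}: {value} is in the future")


BIRTHDAY = DateProperty(uri="http://example.org/model#birthday")


def age_on(birthday: date, today: date) -> int:
    """Relationship with the 'age' property: age is derived from birthday."""
    BIRTHDAY.validate(birthday, today)
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)
```

Because the rule lives with the property definition, every consumer of the model applies the same check instead of re-deriving it.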
about: vital ai tech
Vital AI Development Kit (VDK)
VitalSigns: Data Modeling & Code Generation
VitalService: Common API for Databases, Machine Learning, Apache Spark, Data Transforms
about: vital ai tech
VitalServiceQuery
ExecutableQuery
Query Generator
Common Query API:
• Relational DB (SQL)
• Graph DB (SPARQL)
• Key/Value Store
• NOSQL DB
• Document DB
• Apache Spark
• Hive (Hadoop)
• Predictive Models (a query for an unknown value)
Goal: Build A.I. applications across variety of infrastructure with consistent API & Models.
example data:
nodes: Message, Person:Sender, Person:Recipient
edges: hasSender (Message, Person:Sender), hasRecipient (Message, Person:Recipient)
example “MetaQL” query:
GRAPH {
    value segments: ["mydata"]
    ARC {
        node_constraint { Message.class }
        constraint { "?person1 != ?person2" }
        ARC_AND {
            ARC {
                edge_constraint { Edge_hasSender.class }
                node_constraint { Person.props().emailAddress.equalTo("[email protected]") }
                node_constraint { Person.class }
                node_provides { "person1 = URI" }
            }
            ARC {
                edge_constraint { Edge_hasRecipient.class }
                node_constraint { Person.class }
                node_provides { "person2 = URI" }
            }
        }
    }
}
“Person” may have subtypes, like Student or Employee.
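One way to read the query together with the subtype note: a class constraint on Person should also match instances of its subclasses. The sketch below evaluates the same pattern over in-memory Python objects. It is not MetaQL or the VitalService API; the `Person`, `Student`, `Message`, and `find_pairs` names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    uri: str
    email: str


@dataclass(frozen=True)
class Student(Person):  # a subtype of Person
    pass


@dataclass(frozen=True)
class Message:
    uri: str
    sender: Person     # stands in for the hasSender edge
    recipient: Person  # stands in for the hasRecipient edge


def find_pairs(messages, sender_email):
    """Return (message, person1, person2) URIs for messages sent from
    sender_email where sender and recipient are different people."""
    results = []
    for m in messages:
        # isinstance() accepts subtypes, mirroring a class constraint on Person.
        if not (isinstance(m.sender, Person) and isinstance(m.recipient, Person)):
            continue
        if m.sender.email != sender_email:
            continue
        if m.sender.uri == m.recipient.uri:  # "?person1 != ?person2"
            continue
        results.append((m.uri, m.sender.uri, m.recipient.uri))
    return results
```

A `Student` sender matches without any change to the query, which is the practical benefit of querying against the model rather than against tables.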
a.i. and data quality
data models & machine learning:
using the meaning of classes and properties, automatically generate predictive models.
predictive model features: birthday, zip code, …
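As a rough illustration of deriving features from the meaning of properties, the sketch below maps each property's declared kind to a feature encoding. The `PERSON_MODEL` dictionary, the kind names, and `to_features` are hypothetical; this is not how VitalSigns generates predictive models.

```python
from datetime import date

# Hypothetical model fragment: property name -> semantic kind.
PERSON_MODEL = {"birthday": "date", "zipcode": "category"}


def to_features(record, model, today):
    """Turn a record into numeric features using the model's property kinds."""
    features = {}
    for prop, kind in model.items():
        value = record.get(prop)
        if value is None:
            continue  # missing values produce no feature
        if kind == "date":
            # A date becomes an elapsed-time feature.
            features[f"{prop}_age_days"] = (today - value).days
        elif kind == "category":
            # A categorical value becomes a one-hot indicator.
            features[f"{prop}={value}"] = 1.0
    return features
```

The point of the sketch is that the encoding choice follows from the model, so defining the model well does part of the feature engineering in advance.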
data governance = defining the meaning of data = feature (pre)engineering
critical aspect of data science
Progression of Analytics:
where a.i. happens
Progression of Analytics:
Garbage In = Garbage Out
= Bad A.I.
data governance required for Good A.I.
one more point on data governance…
think outside the box (data warehouse)
data governance: data in motion
inside data warehouse vs. outside data warehouse
supply chain:
product
data supply chain:
data product
data product:
• Retail Recommendations…
• Shipping/Logistics Optimization…
• Compliance, Auditing, Security, Fraud Detection…
why data supply chain?
Partner DW Your DW
“No matter who you are, most of the smartest people work for someone else.” (Bill Joy)
why data supply chain?
Partner DW Your DW
“No matter who you are, most of the smartest data works for someone else.” (Bill Joy, revised)
data supply chain
Partner DW
Your DW
why not ETL?
Extract…
not quite as expected…
Transform…
a bit extreme…
Clean…
a lot of manual effort…
what goes wrong?
telephone game…
You
Partner
Model “A”
Model “B”
Implicit Model
Resolution: Make explicit the implicit. Align Data Models.
Reason: Implicit assumptions in the data. ETL can’t see the forest for the trees (or it’s very difficult with missing assumptions).
Example: Internet of Things
Predictive Analytics
“Nest for Office Buildings”
Office Tower with Building Management System (BMS) containing 100,000 monitored points (temperature, energy usage of chiller, fan speed, etc.) with significant missing data, errors, and noise.
Reconciliation of data to produce predictive models to minimize energy usage.
Rules for data correctness.
Sensor Data Validation:
Source data had temperature values of “0” (zero), which meant either that the temperature was 0 degrees or that the sensor had an error.
The Data Model “knows” that it’s rarely 0 degrees in July (such a reading is many standard deviations from the expected value), and that a 0 on a day in December can be compared to weather data for reasonableness.
If the Data Model also knows the maintenance schedule for the sensors, then it “knows” when to expect “0” error values and can exclude them.
Missing Maintenance Assumptions. Fill in secondary (weather) data for validation.
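The validation logic described above can be sketched as a single rule. This is a minimal Python sketch under stated assumptions: the `Reading` fields, the `is_plausible` function, and the `tolerance` default are hypothetical, and the source does not give temperature units or thresholds.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Set


@dataclass(frozen=True)
class Reading:
    day: date
    sensor_temp: float              # value reported by the building sensor
    weather_temp: Optional[float]   # secondary (weather) data for the same day


def is_plausible(reading: Reading, maintenance_days: Set[date],
                 tolerance: float = 5.0) -> bool:
    """Decide whether a reading should be kept for modeling."""
    if reading.sensor_temp != 0:
        return True  # only the ambiguous "0" value needs a decision
    # A zero during scheduled maintenance is an expected error value.
    if reading.day in maintenance_days:
        return False
    # Without weather data there is no evidence that the zero is real.
    if reading.weather_temp is None:
        return False
    # Keep a zero only if the weather source agrees it was near zero.
    return abs(reading.weather_temp - reading.sensor_temp) <= tolerance
```

A zero in July with warm weather data is rejected, while a zero in December with matching weather data is kept; the maintenance schedule removes the remaining expected error values.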
how did we get here?
Architecture Review: a quick step back…
What is a Data Supply Chain architecture?
“traditional” data warehouse:
ETL within the organization.
Data Governance across the organization.
DW
tech co. “agile” data warehouse:
storage
compute
HDFS
Spark
DataSets
Jobs
Batch/Streaming
Build Predictive Models
Realtime: Spark/Storm
hadoop cluster
enterprise: data lake
storage
compute
HDFS
Spark
X(save $)
“Data Swamp”
aside: Data Lake
better analogy: Scriptorium
library, manuscript copying, & book distribution.
but not as pithy as “Lake”…
tech co. microservices (micro-SOA):
storage
compute
service
“Composed” App
external: social data, weather API
independent clusters, local data expertise
optimize development processes, scale up.
microservices example:
Amazon: product search uses 170 independent microservices
including services for predicting customer characteristics, getting product images, etc.
http://www.infoworld.com/article/2903144/application-development/how-to-succeed-with-microservices-architecture.html
Netflix: similar architecture
Data Supply Chain:
storage
compute
service
Data Product
“ETL”
Owner “A” Owner “B”
optimize development processes, scale up.
independent clusters, local data, ownership
Interaction Points:
Data Product
service
compute ETL
Owner “A” Owner “B”
Data Lineage: Cloudera Navigator
…within a Data Warehouse
trace back jobs that produced every data field.
Data Supply Chain with Provenance:
Include provenance data directly in the imported dataset. Use it in rules to interpret the data.
entity-123 | hasSource | datasource-A
entity-123 | name | “John Doe”
Data Warehouse B
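To make the provenance idea concrete, here is a minimal Python sketch that carries the source alongside the data and uses it in a rule. The triple layout follows the example above; `index_by_subject`, `TRUSTED_SOURCES`, and `trusted_name` are hypothetical names, and the trust rule itself is an assumption for illustration.

```python
from collections import defaultdict

# (subject, predicate, object) triples, as in the example above.
TRIPLES = [
    ("entity-123", "hasSource", "datasource-A"),
    ("entity-123", "name", "John Doe"),
]

# Hypothetical governance decision: which sources a rule may rely on.
TRUSTED_SOURCES = {"datasource-A"}


def index_by_subject(triples):
    """Group triples into one property dictionary per entity."""
    entities = defaultdict(dict)
    for subject, predicate, obj in triples:
        entities[subject][predicate] = obj
    return dict(entities)


def trusted_name(entity):
    """Return the name only if the entity's provenance is a trusted source."""
    if entity.get("hasSource") in TRUSTED_SOURCES:
        return entity.get("name")
    return None
```

Because the source travels with the record, the receiving side can apply different rules per source without consulting a separate lineage system.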
Interaction Points: Data Models
Data Product
service
compute ETL
Data Models: Gatekeepers & Transform
Owner “A” Owner “B”
Data Supply Chain using Models:
storage
compute
service
Data Product
ETL
Owner “A” Owner “B”
Model Server
Data Models: focus of data governance
Semantic Data Models:
Make explicit the meaning of data
Transformation and Validation Rules leverage the Model and Meaning.
Such Rules may be packaged with the Model, and managed together.
Protect against implicit assumptions
Example: Financial Services
A B C
Service Provider
Reconciliation of Corporate Structure across 1,000’s of organizations.
Compliance Rules barring communication between “researchers” and “traders”.
Rules to infer if “Mary” is a “researcher” or “trader”.
Conflicting concepts of “Branch-Office”, “Direct-Report”, etc. across the Globe.
Example: Hospital Group
A B C
Data Analytics
Reconciliation across Patient Records, Insurance, & Billing for Patient Predictive Analytics.
Rules for identity: “same person”
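A “same person” rule can be sketched as an explicit function over records from the three systems. This is a deliberately simple Python sketch: the `Record` fields, `normalize`, and the exact-match rule in `same_person` are assumptions for illustration, and real identity resolution usually needs fuzzier matching.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    source: str      # e.g. "patient", "insurance", "billing"
    name: str
    birthday: date
    zipcode: str


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(name.lower().split())


def same_person(a: Record, b: Record) -> bool:
    """Hypothetical identity rule: same normalized name, birthday, and zip code."""
    return (
        normalize(a.name) == normalize(b.name)
        and a.birthday == b.birthday
        and a.zipcode == b.zipcode
    )
```

Writing the rule down in one place makes the identity assumption explicit, so partners can review and change it instead of each system applying its own.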
Data Models:
OWL: Semantic Ontology Model (W3C Standard, Various Standards for Rules)
VitalSigns: Generate Code (validation, transformation, …)
VitalSigns: Versioning, Dependencies, Exchange, Storage, Change Management (Semantic “Diff”)
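The idea of a semantic “diff” between model versions can be illustrated with a comparison at the level of classes and properties rather than text lines. The sketch below is an assumption-laden Python illustration, not the VitalSigns implementation: models are represented as plain dictionaries of `{class_name: {property_name: type_name}}`, and `model_diff` is a hypothetical helper.

```python
def model_diff(old, new):
    """Compare two model versions given as {class: {property: type}}."""
    changes = []
    for cls in sorted(set(old) | set(new)):
        if cls not in new:
            changes.append(("removed class", cls))
            continue
        if cls not in old:
            changes.append(("added class", cls))
            continue
        old_props, new_props = old[cls], new[cls]
        for prop in sorted(set(old_props) | set(new_props)):
            name = f"{cls}.{prop}"
            if prop not in new_props:
                changes.append(("removed property", name))
            elif prop not in old_props:
                changes.append(("added property", name))
            elif old_props[prop] != new_props[prop]:
                changes.append(("changed type", name))
    return changes
```

A change list in these terms is something domain experts can review, which is harder with a line-based diff of a generated file.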
Example: Personally Identifiable Information
Data Governance determines that “Profession” and “ZipCode” cannot be used together. (Maybe a single “Dentist” in a small town…)
Within a single Data Warehouse we can bar these data elements from being combined. But:
Microservice A provides value of “Profession”
Microservice B provides value of “ZipCode”
How to enforce that these two microservices cannot be combined?
Example: Personally Identifiable Information
Validation code enforcing data usage:
    Person person123 = get_person_details("entity-123");

    // this call works:
    person123.profession = get_profession(person123);

    // this call blocks because of data model validation:
    // person123 already has the "profession" property
    person123.zipcode = get_zipcode(person123);
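A runnable version of the same idea, as a minimal Python sketch: the record object itself refuses a forbidden combination, regardless of which microservice supplied each value. The `GovernedRecord` class, `DataUsageError`, and the `FORBIDDEN_COMBINATIONS` list are hypothetical names, not generated VitalSigns code.

```python
class DataUsageError(Exception):
    """Raised when setting a property would violate a data usage rule."""


# Governance rule: these properties must not appear together on one record.
FORBIDDEN_COMBINATIONS = [frozenset({"profession", "zipcode"})]


class GovernedRecord:
    def __init__(self, uri):
        # Bypass our own __setattr__ to create the value store.
        object.__setattr__(self, "_values", {"uri": uri})

    def __setattr__(self, name, value):
        present = set(self._values) | {name}
        for combo in FORBIDDEN_COMBINATIONS:
            # Block the write if it would complete a forbidden combination.
            if name in combo and combo <= present:
                raise DataUsageError(
                    f"cannot combine {sorted(combo)} on {self._values['uri']}"
                )
        self._values[name] = value

    def __getattr__(self, name):
        # Called only when normal lookup fails; serve values from the store.
        if name == "_values":
            raise AttributeError(name)
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError(name) from None
```

Usage mirrors the pseudocode above: setting `profession` succeeds, and a later attempt to set `zipcode` on the same record raises `DataUsageError`. Placing the rule in the shared model type means it applies even when the two values come from independently owned services.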
Semantic Data Models:
• Gatekeepers
• Externally Managed.
• Active not Passive, more like “code”.
• Defining what should exist, not cataloguing what exists.
• Can decide when to be tolerant or strict.
Collaborative Conversations:
Infrastructure DevOps
Data Scientists
Business + Domain Experts
Developers
Semantic Model
Collaborative Conversations:
Business + Domain Experts
Semantic Model
Business + Domain Experts
Semantic Model
Partner A Partner B
Model Alignment
What Concepts to combine, not what Tables to combine (that comes later).
Authoring Tool: OWL IDE Protégé
Visualization: Semantic Data
Visualization: WebVOWL
in conclusion, takeaways:
• The Data Supply Chain is a supply chain to deliver Data Products
• Data Models can capture the implicit meaning of data (and that is the goal!)
• Data Models can help negotiate the implicit differences across the DSC
• Data Models offer a means to collaborate on data standards (meaning) across the DSC partners
Questions?
Marc Hadfield, CEO, Vital A.I.
[email protected]