Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
[email protected]@datatron.com
Machine Learning Model Life Cycle Management
In Production - #aiops #mlops
datatron
Who are we?
�2
Harish DoddiCEO
• Lyft Surge Pricing Model• Twitter’s Distributed Photo
Storage - 12 PB• Architected Snapchat Stories -
scaled from 0 to billions
Jerry XuCTO
• Lyft ETA Machine Learning Model - replaced Google
• Twitter’s Manhattan - replaced Cassandra
• Founding team of Windows Azure
Team: Previously worked at places like Amazon, Microsoft, and AnacondaHeadquarters: 350 Townsend Street, Suite 204, San Francisco, CA
datatron
Today’s Enterprises Data Science Lifecycle
�3
Development
Data Science
Initial AnalysisModel development using multiple frameworksTraining and Experimenting
Production
Engineering DevOps
DeploymentWorkflow processRoll back strategyA/B Testing
Post-Production
DevOps
Model Performance MonitoringInfrastructure MonitoringScalingModel Governance
datatron�4
Production and Post-Production
Deployment, Monitoring & Governance
Data Aggregation
Discovery & Analysis
Training & Experimenting
datatron
Lesson 1
Need for single centralized production teamand platform
�5
datatron�6
Fraud Data Science
MarketingData Science
Credit Risk Data Science
Production infrastructure
Production infrastructure
Production infrastructure
Financial Institutions Production Environment
datatron
Hidden Technical Debt in Machine Learning Systems
�7
Google PaperHidden Technical Debt in Machine Learning Systems
Machine learning offers a fantastically powerful toolkit for building complex systems quickly. … it
is remarkably easy to incur massive ongoing maintenance costs at the system level when
applying machine learning.
Boundary erosionEntanglementHidden feedback loopsUndeclared consumersData dependenciesChanges in the external worldSystem-level anti-patterns
datatron�8
Fraud Data Science
Horizontal DEVOPS AI team
MarketingData Science
Credit Risk Data Science
Production Model Deployment
Centralized Production AI Platform like Datatron
datatron
Lesson 2
Start adopting best practices and standardization early
�9
datatron�10
Old Way
Data Science
End User
Machine Learning
Engineering DevOps
• Teams operate in silos, don’t speak the same language• Errors due to lack of communication• Engineering has to write stand-alone scripts
Production Model
Teams face cross-functional
inefficiencies
datatron
Central devops platform for AI
models streamlines the process and
reduces the inefficiencies
�11
Data Science
End User
platformdatatron
DevOps
Production Model
New Way
datatron
Lesson 3
Data Science Universe shouldn’t be limited
�12
datatron�13
Model Containerization
Data Scientist Universe is un-limited
Team
Upcoming Frameworks
New Way
Team Team
Model
TensorFlowModel
scikit-learnModel
Apache Spark
No Model Containerization
Data Scientist Universe is limited
Old Way
TeamTeam Team
Model
datatron�14
datatron
Framework Agnostic
LibraryAgnostic
LanguageAgnostic
Infrastructure Agnostic
SageMaker
datatron
Lesson 4
Models may go wrong, you need to monitor them continuously
�15
datatron�16
South Park and Alexa
datatron�17
RecommendationModel
Features
Business is continuously losing value
Garbage
Old Way
RecommendationModel
Features
Recommendation
Minimize the continuous loss
Alert!
New Way
Alert!
datatron�18
New Way
# o
f Fra
ud
Tra
nsa
ctio
ns
Jan Feb Mar Apr May Jun
Reality
Model Prediction
Anomalies Detected
Production Model Monitoring
Allows You To Act Preemptively
Old Way
TOOLATE
# o
f Fra
ud
Tra
nsa
ctio
ns
Post-mortem reports
datatron
Model Performance monitoring• Confusion Matrix• Gain and Lift charts• Kolomogorov Smirnov chart• Area Under the ROC curve• Gini Coefficient• Concordant – Discordant ratio• Root Mean Squared Error (RMSE)• etcModel Timeout monitoringInfrastructure monitoringOrganization KPI monitoringAnomoly monitoring
�19
Monitoring for Machine Learning Models
datatron
Lesson 5
You either NEVER deploy a model, or you have to do it over and over again
�20
datatron
ML model is a continuously optimizing process
�21
Concept driftNew concept comes up
Model Building
Model Deployment
Model Monitoring
Model Management
datatron
Connecting Machine Learning to Software world
�22
Before Now
Software deployment once a 1 or 2 years
Software deployment every day
Future
Machine Learning models will deploy
very frequent and fast
Machine Learning models deploy
very slow
BUT
datatron
Lesson 6
Model Governance should be automated as much as possible
�23
datatron�24
Log
Comments
Versions
Model Versioning
New Way
Production Model Governance
Each Model Action Is Logged
Old Way
Comments
Version Log
LogLog
Log
Time for an audit
datatron
Lesson 7
Be prepared, your number of model will grow
�25
datatron�26
Deployment Learning: 1 model vs Multiple models
datatron
Value Proposition: Cost per order of magnitude
�27
Without Model Management
# of models
With Model Management
As the number of models increases, the cost also increases
As the number of models increases, the cost significantly decreases
# of models
Cost per model
Cost per model
datatron
Lesson 8
Senior people are required at later stages
�28
datatron
Software Development vs ML Model Development
�29
EvolutionTestingImplementationDesignRequirements
Monitoring and
Optimization
Deploy to Production
Training / Testing
Data Preparation
Requirements
Senior People
Senior People
datatron
Lesson 9
Your real work starts AFTER you deploy the model to production
�30
datatron
Enterprise AI Life Cycle
�31
Exploration Training Deploy
datatron
Model performance Model latency Infrastructure monitoring
Feature distribution Model result
Model routing Challenger KPI based selection
Model timeout Fall back strategy Alerting
Split traffic Shadowing Multi-Armed Bandit Optimization
Blue-Green deployment Rollback Canary
Enterprise AI Life Cycle After Deployment
�32
Deploy A/B Testing SLA
Model Selection
Anomaly Detection
Monitoring
Replay End-to-end tracing Logging
Troubleshooting
Cluster management integration Security check
IT Env Integration
Versioning History Approval process
Auditing
Tensorflow, sklearn, H2O, SAS, SparkML etc.
ML Frameworks
Support