Microsoft Azure Machine Learning: Anatomy of a Machine Learning Service

Sharat Chikkerur, Senior Software Engineer, Microsoft (on behalf of the AzureML team)



Microsoft Azure Machine Learning (AzureML)

• AzureML is a cloud-hosted tool for creating and deploying machine learning models
• Browser-based, zero-installation, and cross-platform

• Describe workflows graphically

• Workflows are versioned and support reproducibility

• Models can be programmatically retrained

• Models can be deployed to Azure as a scalable web service
  • Can scale to 1000+ endpoints × 200 response containers per service

• Supports versioning, collaboration & monetization

Outline

• Distinguishing features (functional components) of AzureML

• Architectural components of AzureML

• Implementation details

• Lessons learned

Distinguishing features

MLStudio: Graphical authoring environment

AzureML Entities

• Workspaces
• Experiments
• Graphs
• Datasets
• Assets
• Actions
• Web services

Versioning

• Each run of an experiment is versioned
  • You can go back in time and examine historical results

• Intermediate results are cached across experiments in a workspace
  • Each dataset has a unique source transformation

Collaboration

• Workspaces can be shared between multiple users
  • Two users cannot, however, edit the same experiment simultaneously

• Any experiment can be pushed to the common AzureML Gallery
  • Allows experiments, models, and transforms to be shared easily with the AzureML user community

External Language Support

• Full-fidelity support for R, Python, and SQL (via SQLite)
  • AzureML datasets are marshalled transparently

• R models marshalled into AzureML models

• Scripts available as part of operationalized web services

• Code isolation
  • External language modules are executed within Drawbridge containers

• “Batteries included”
  • R 3.1.0 with ~500 packages, Anaconda Python 2.7 with ~120 packages
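To make the marshalling contract concrete, here is a minimal user-script sketch for the Execute Python Script module, assuming the classic azureml_main entry-point convention (inputs arrive as pandas DataFrames; a tuple of DataFrames is returned). The income column is hypothetical.

```python
import numpy as np
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    """Entry point invoked by the Execute Python Script module.

    Inputs arrive as pandas DataFrames (marshalled in from AzureML
    DataTables); the returned tuple is marshalled back the same way.
    """
    df = dataframe1.dropna()                   # simple cleaning step
    df["log_income"] = np.log1p(df["income"])  # 'income' is a hypothetical column
    return (df,)
```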

Operationalization

• An experiment to be operationalized must first be converted into a “scoring” experiment

• Training and scoring experiments are “linked”

• A successful scoring experiment can be published as a web service
  • Published web services are automatically managed, scaled out, and load-balanced

• Web services come in two flavors (an example request follows below)
  • Request/Response (RRS): low-latency endpoint for scoring a single row at a time
  • Batch (BES): endpoint for scoring a collection of records from Azure storage
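As an illustration of the Request/Response flavor, a minimal client sketch assuming the classic AzureML RRS request shape (JSON body with column names and row values, bearer-token authentication). The URL, API key, and column names are placeholders to be copied from a real service dashboard.

```python
import json
import urllib.request

URL = "https://<region>.services.azureml.net/workspaces/<ws>/services/<id>/execute?api-version=2.0&details=true"
API_KEY = "<api-key>"   # placeholder: copy from the web service dashboard

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["age", "income"],   # must match the scoring schema
            "Values": [["39", "77516"]],        # one row = one prediction
        }
    },
    "GlobalParameters": {},                     # web-service parameters, if any
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer " + API_KEY},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))              # scored labels/probabilities
```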

Monetization

• The Data Marketplace (http://datamarket.azure.com) allows users to monetize data and models

• Supports
  • Web services published through AzureML
  • Standalone web services

• Integration
  • Python/R modules can query external web services (including Marketplace APIs), allowing functional composition

Architectural components

Component services

• Studio (UX)

• Experimentation Service (ES)
  • Composed of micro-services

• Job Execution Service (JES)

• Single Node Runtime (SNR)

• Request/Response Service (RRS)

• Batch execution service (BES)

[Diagram: User → Studio (UX) → ES → JES → SNR; RRS and BES serve published web services]

Studio (UX)

• Primary UX layer
  • Single-page application

• Asset palette
  • Datasets
  • Algorithms
  • Trained models
  • External language modules

• Experiment canvas
  • DAG consisting of modules

• Module properties
  • Parameters

• Action bar
  • Commands to ES


Experimentation Service (ES)

• Primary backend
  • Orchestrates all component services
  • Handles events to/from the UX

• Programmatic access
  • RESTful API (the UX communicates this way)

• Features
  • Experiment introspection
  • Experiment manipulation/creation

• Composed of micro-services
  • UX, assets, authentication, packing, etc.


Job Execution Service (JES)

• Primary job scheduler

• Dependency tracking
  • The experiment DAG defines dependencies between modules
  • A topological sort determines the order of execution

• Parallel execution
  • Different experiments can be executed in parallel
  • Modules at the same depth in the DAG can be scheduled in parallel (see the sketch below)

• Note: JES itself does not execute task payloads; tasks are dispatched to a task queue
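A minimal sketch of the level-by-level scheduling idea: a Kahn-style topological sort that groups modules with no mutual dependencies, so each level can be dispatched to the task queue in parallel. Module names and the edge representation are illustrative, not the actual JES implementation.

```python
from collections import defaultdict

def schedule_levels(modules, edges):
    """Group a module DAG into levels; modules within one level have
    no mutual dependencies and can be dispatched in parallel."""
    indegree = {m: 0 for m in modules}
    children = defaultdict(list)
    for src, dst in edges:                 # edge: src's output feeds dst
        indegree[dst] += 1
        children[src].append(dst)

    level = [m for m in modules if indegree[m] == 0]
    levels = []
    while level:
        levels.append(level)
        nxt = []
        for m in level:
            for c in children[m]:
                indegree[c] -= 1
                if indegree[c] == 0:
                    nxt.append(c)
        level = nxt
    return levels

# Example: a reader feeds two cleaning modules, which feed a trainer.
print(schedule_levels(
    ["reader", "clean_a", "clean_b", "train"],
    [("reader", "clean_a"), ("reader", "clean_b"),
     ("clean_a", "train"), ("clean_b", "train")]))
# [['reader'], ['clean_a', 'clean_b'], ['train']]
```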


Single Node Runtime (SNR)

• Executes tasks dispatched from JES
  • Consumes tasks from a queue
  • Each task consists of an input specification along with module parameters
  • Stateless: data required for execution is copied over (see the worker-loop sketch below)

• Each SNR contains a copy of the runtime + modules
  • Runtime: DataTables, array implementation, IO, base classes, etc.
  • Modules: machine learning algorithms

• SNR pool is shared across the deployment
  • The size of the pool can be scaled based on demand
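A toy sketch of a stateless worker loop in the spirit of an SNR: pull a task from the queue, copy its inputs from shared storage, run the module, and write outputs back, retaining nothing between tasks. Task field names and the dict-based storage are illustrative.

```python
import queue

def snr_worker(task_queue, modules, storage):
    """Stateless worker loop: every task carries references to its
    inputs and outputs in shared storage, so any worker can run it."""
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return
        inputs = [storage[ref] for ref in task["input_refs"]]   # copy data locally
        outputs = modules[task["module"]](*inputs, **task["params"])
        for ref, value in zip(task["output_refs"], outputs):
            storage[ref] = value                                # nothing kept locally

# Tiny demo with one "scale" module task.
q, store = queue.Queue(), {"in1": [1, 2, 3]}
q.put({"module": "scale", "input_refs": ["in1"],
       "params": {"factor": 10}, "output_refs": ["out1"]})
snr_worker(q, {"scale": lambda xs, factor: ([x * factor for x in xs],)}, store)
print(store["out1"])   # [10, 20, 30]
```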


Machine learning algorithms

• Sources of machine learning module assets

• Microsoft Research
  • Infer.NET (http://research.microsoft.com/en-us/um/cambridge/projects/infernet/)
  • Vowpal Wabbit (http://hunch.net)

• Open source
  • LibSVM
  • Pegasos
  • OpenCV
  • R
  • Scikit-learn


Category / Sub-category / Module (Reference)

Supervised
  Binary classification
    • Averaged perceptron (Freund & Schapire, 1999)
    • Bayes point machine (Herbrich, Graepel, & Campbell, 2001)
    • Boosted decision tree (Burges, 2010)
    • Decision jungle (Shotton et al., 2013)
    • Locally deep SVM (Jose & Goyal, 2013)
    • Logistic regression (Duda, Hart, & Stork, 2000)
    • Neural network (Bishop, 1995)
    • Online SVM (Shalev-Shwartz et al., 2011)
    • Vowpal Wabbit (Langford et al., 2007)
  Multiclass classification
    • Decision forest (Criminisi, 2011)
    • Decision jungle (Shotton et al., 2013)
    • Multinomial regression (Andrew & Gao, 2007)
    • Neural network (Bishop, 1995)
    • One-vs-all (Rifkin & Klautau, 2004)
    • Vowpal Wabbit (Langford et al., 2007)
  Regression
    • Bayesian linear regression (Herbrich et al., 2001)
    • Boosted decision tree regression (Burges, 2010)
    • Linear regression, batch and online (Bottou, 2010)
    • Decision forest regression (Criminisi, 2011)
    • Random-forest-based quantile regression (Criminisi, 2011)
    • Neural network regression (Bishop, 1995)
    • Ordinal regression (McCullagh, 1980)
    • Poisson regression (Nelder & Wedderburn, 1972)
  Recommendation
    • Matchbox recommender (Stern et al., 2009)
Unsupervised
  Clustering
    • K-means clustering (Jain, 2010)
  Anomaly detection
    • One-class SVM (Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 2001)
    • PCA-based anomaly detection (Duda et al., 2000)
Feature selection
  Filter
    • Filter-based feature selection (Guyon & Elisseeff, 2003)
Text analytics
  Topic modeling
    • Online LDA using Vowpal Wabbit (Hoffman, Blei, & Bach, 2010)

Request/Response Service (RRS) and Batch Execution Service (BES)

• RRS
  • Handles RESTful requests for a single prediction
  • Requests may execute the full graph
    • Can include data transformations before and after prediction
    • A distinguishing feature compared to other web services
  • Models and required datasets in the graph are compiled into a static package
  • Executes in-memory and on a single machine
  • Can scale based on the volume of requests

• BES
  • Optimized for batch requests; similar to the training workflow


Implementation details

Implementation details: Data representation

• “DataTable”
  • Similar to an R/pandas dataframe
  • Column-major organization with sliced and random access

• Has a rich schema
  • Names: allow re-ordering
  • Purpose: weights, features, labels, etc.

• Stored as compressed 2D tiles (see the column-store sketch below)
  • “Wide” tiles enable streaming access
  • “Narrow” tiles enable full column access

• Interoperability
  • Can be marshalled in/out as an R/pandas dataframe
  • Can be egressed as CSV, TSV, SQL
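A toy column-major table illustrating why this layout supports both access patterns: full-column (“narrow”) reads and row-slice (“wide”) reads, plus marshalling out to pandas. This sketches the idea behind DataTable, not its actual tile format.

```python
import pandas as pd

class ColumnTable:
    """Toy column-major table: each column is stored contiguously, so
    full-column reads and horizontal row slices are both cheap."""
    def __init__(self, dataframe):
        self.schema = list(dataframe.columns)
        self.columns = {c: dataframe[c].to_numpy() for c in dataframe}

    def column(self, name):          # "narrow" access: one full column
        return self.columns[name]

    def rows(self, start, stop):     # "wide" access: a horizontal slice
        return {c: v[start:stop] for c, v in self.columns.items()}

    def to_pandas(self):             # marshal back out, as AzureML does for R/Python
        return pd.DataFrame(self.columns, columns=self.schema)

t = ColumnTable(pd.DataFrame({"age": [39, 50, 38], "label": [0, 1, 0]}))
print(t.column("age"), t.rows(0, 2))
```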

[Diagram: tiled storage layout — index blocks (Index 1–3) point to compressed data blocks (Block 1–3)]

Implementation details: Modules

• Functional units in an experiment graph

• Encapsulate: data sources & sinks, models, algorithms, scripts

• Categories
  • Data ingress
    • Supported sources: CSV, TSV, ARFF, LibSVM, SQL, Hive
    • Type guessing for CSV, TSV (allows override)
  • Data manipulation
    • Cleaning missing values, SQL transformation, R & Python scripts
  • Modeling
    • Machine learning algorithms
      • Supervised: binary classification, multiclass classification, linear regression, ordinal regression, recommendation
      • Unsupervised: PCA, k-means
  • Optimization
    • Parameter sweep

Implementation details: Module ports & parameters

• Ports
  • Define input and output contracts
  • Allow multiple input formats per port
    • I/O handling is done externally to the module through pluggable port handlers
  • Allow the UX to validate inputs at design time

• Parameters (a declaration sketch follows below)
  • Strongly typed
  • Support conditional parameters
  • Can be marked as “web service” parameters, substituted at query time
  • Support ranges (for parameter sweeps)
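A sketch of what typed port and parameter declarations might look like, mirroring the concepts above (accepted formats per port, strong typing, sweep ranges, web-service substitution). The dataclasses and the example module are hypothetical; the real module contract is internal to AzureML.

```python
from dataclasses import dataclass

@dataclass
class Port:
    name: str
    accepted_formats: tuple          # several input formats per port

@dataclass
class Parameter:
    name: str
    dtype: type
    default: object = None
    sweep_range: tuple = None        # (low, high) for parameter sweeps
    web_service: bool = False        # substituted at query time if True

# Hypothetical declaration for a "Train Logistic Regression" module.
train_lr_inputs = [Port("dataset", ("DataTable", "CSV", "ARFF"))]
train_lr_params = [
    Parameter("l2_weight", float, default=1.0, sweep_range=(1e-4, 10.0)),
    Parameter("iterations", int, default=100),
    Parameter("random_seed", int, default=42, web_service=True),
]
print(train_lr_params[0])
```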

Implementation details: Testing

• Standard tests
  • UX tests
  • Web services penetration testing
  • Services integration tests

• AzureML-specific tests
  • Module properties tests
  • Schema propagation tests
  • E2E experiment tests
  • Operationalized experiment tests
  • “Runners” tests

• Machine learning tests
  • Accuracy tests
  • Fuzz testing (boundary-value testing)
  • Golden-values tests (a sketch follows below)
  • Auto-generated tests
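A toy golden-values test: with captured seeds and fixed settings, retraining must reproduce the blessed metric within a small tolerance. The experiment stand-in and the numbers are illustrative.

```python
import random

def run_reference_experiment(seed):
    """Stand-in for retraining a reference experiment; deterministic
    given the seed, like an AzureML graph with captured random seeds."""
    rng = random.Random(seed)
    return 0.87 + rng.uniform(-1e-4, 1e-4)   # pretend AUC

def test_golden_auc():
    # Same seed + same settings must reproduce the recorded metric.
    assert abs(run_reference_experiment(seed=42) - 0.87) < 1e-3

test_golden_auc()
```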

Lessons learned

Lesson: Data wrangling is important

• More time is spent on data wrangling than on model building
  • “A data scientist spends nearly 80% of the time cleaning data” – NY Times (http://nyti.ms/1t8IzfE)

• Data manipulation modules are very popular
  • By internal ranking, the “Execute R Script” and “SQL Transform” modules are more popular than the machine learning modules

• It is hard to anticipate all data pre-processing needs
  • Need to provide custom processing support:

• SQL Transform

• Execute R script

• Execute Python script

Lesson: Make big data possible, but small data efficient

• Distributed machine learning comes with a large overhead (Zaharia et al. 2010)

• Typical data science workflows enable exploration with small amounts of data
  • This should be effortless and intuitive

• AzureML approach: “Make big data possible, but small data efficient”
  • Make sure all experiment graphs can handle large datasets
  • Support ingress of large data: SQL, Azure storage
  • Support features to pre-process big data (see the sketch after this list)
    • Feature selection
    • Feature hashing
    • Learning by counts: reduces high-dimensional data to lower-dimensional historic counts/rates

• Support streaming algorithms for big data (e.g. “Train Vowpal Wabbit”)
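A minimal sketch of the learning-by-counts idea: replace a high-cardinality categorical value with per-class historic counts and a smoothed log-odds rate, collapsing a huge one-hot space into a few dense columns. The smoothing and feature layout are illustrative choices.

```python
import math
from collections import defaultdict

def count_features(train_pairs, smoothing=1.0):
    """Build a count-based featurizer from (value, label) pairs, where
    label is 0/1. Each categorical value maps to dense statistics."""
    pos = defaultdict(float)
    neg = defaultdict(float)
    for value, label in train_pairs:            # one pass over training data
        (pos if label == 1 else neg)[value] += 1.0

    def featurize(value):
        p, n = pos[value], neg[value]
        log_odds = math.log((p + smoothing) / (n + smoothing))
        return [p, n, log_odds]                 # dense 3-dim encoding
    return featurize

f = count_features([("nyc", 1), ("nyc", 1), ("sf", 0), ("nyc", 0)])
print(f("nyc"))   # [2.0, 1.0, log(3/2)]
```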

Lesson: Feature gaps are inevitable

• Cannot cover all possible pre-processing scenarios

• Cannot provide all algorithms

• Support for scripting (R, Python, SQL)
  • Allow custom data manipulation
  • Allow users to bring in external libraries
  • Allow users to call into other web services
  • Isolate user code
  • Support during operationalization

• Support custom modules
  • Allow users to author first-class “modules”
  • Allow users to mix custom modules into the workflow

Lesson: Data science workflows should be reproducible

• Data science workflows are iterative, exploratory, and collaborative
  • Need to provide a way to version and capture the workflow, settings, inputs, etc.
  • Make it easy to repeat the same experiment

• Reproducibility
  • Capture random number seeds as part of the experiment
  • The same settings should produce the same results
  • Re-running parts of the graph should be efficient

• “Determinism”
  • Modules are tagged as deterministic (e.g. SQL transform) or non-deterministic (e.g. Hive query)
  • A graph can also be labeled as deterministic or non-deterministic

• Caching (a sketch follows below)
  • Outputs from deterministic modules are cached to make re-runs efficient
  • Only changed parts of the graph are re-executed
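A sketch of the caching idea: key a module run by a hash of its identity, parameters, and inputs, reuse the output on re-runs, and bypass the cache for non-deterministic modules. The hashing scheme here is illustrative, not AzureML's.

```python
import hashlib
import pickle

_cache = {}

def run_cached(module_fn, inputs, params, deterministic=True):
    """Run a module, serving the result from the cache when the same
    deterministic module is re-run with identical inputs and params."""
    if not deterministic:                       # e.g. a Hive query: always re-run
        return module_fn(*inputs, **params)
    key = hashlib.sha256(
        pickle.dumps((module_fn.__name__, params, inputs))
    ).hexdigest()
    if key not in _cache:                       # only changed parts re-execute
        _cache[key] = module_fn(*inputs, **params)
    return _cache[key]

def normalize(xs, scale=1.0):
    return [x / scale for x in xs]

print(run_cached(normalize, ([2.0, 4.0],), {"scale": 2.0}))  # computed
print(run_cached(normalize, ([2.0, 4.0],), {"scale": 2.0}))  # served from cache
```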

Summary

• AzureML provides distinguishing features
  • Visual authoring
  • Versioning and reproducibility
  • Collaboration

• Architecture
  • Multiple scalable services

• Implementation details
  • Extensible data format that interoperates with R & Python
  • Modules provide a way to package data & code

• Lessons learned
  • Data wrangling is important
  • Allow user code to mitigate feature gaps
  • Support big data but make small data efficient

Logistics: Getting access to AzureML

• http://azure.com/ml

• https://studio.azureml.net
  • Guest access without sign-in
  • Free access with sign-in ($200 credit)
  • Paid access with an Azure subscription

• https://manage.windowsazure.com
  • Manage endpoints, storage accounts, and workspaces


Developing a predictive model is hard

Challenges

• Data processing
  • Different sources, formats, schemas
  • Missing values, noisy data

• Modeling
  • Modeling choice
  • Feature engineering
  • Parameter tuning
  • Tracking & collaboration

• Deployment & retraining
  • Productionizing/deployment of the model
  • Replication, scaling out

Solutions

• Data processing
  • Languages: SQL, R, Python
  • Frameworks: dplyr, pandas
  • Stacks: Hadoop, Spark, MapReduce

• Modeling
  • Libraries: Weka, VW, MLlib, LibSVM
  • Feature engineering: gensim, NLTK
  • Tuning: Spearmint, Whetlab
  • Tracking & collaboration: IPython notebooks + GitHub

• Deployment & retraining
  • Machine learning web services

Implementation details: Schema propagation

• Schema is associated with datasets/learners
  • Dataset attributes
  • Required columns for learners, etc.

• Design-time validation
  • Module execution has latency overhead
  • Schema is computed and propagated before executing module code

• Method: pre-determined schema calculus (a sketch follows below)
  • Each module class has a well-defined modification of the schema
  • One-off modules are encoded as exceptions
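A sketch of a pre-determined schema calculus: each module kind declares how it transforms an input schema, so schemas for the whole graph can be computed at design time without executing any module code. The rule set here is illustrative, not AzureML's actual calculus.

```python
def propagate_schema(graph, source_schemas, rules):
    """Compute the output schema of every module in the graph using
    per-module-kind rules; graph entries are in topological order."""
    schemas = dict(source_schemas)
    for module, kind, upstream in graph:
        schemas[module] = rules[kind](schemas[upstream])
    return schemas

rules = {
    "select": lambda cols: [c for c in cols if c != "id"],  # drops a column
    "score":  lambda cols: cols + ["Scored Label"],         # appends predictions
}
print(propagate_schema(
    [("select1", "select", "reader"), ("score1", "score", "select1")],
    {"reader": ["id", "age", "income"]},
    rules))
# {'reader': [...], 'select1': ['age', 'income'],
#  'score1': ['age', 'income', 'Scored Label']}
```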

JES / SNR interaction

[Diagram: JES front end and workers dispatch through a Jobs queue and a Tasks queue to SNR front end and workers; transient jobs/tasks state and the user workspace sit in shared storage alongside the Experimentation Service]

• Stateless design, easy scalability, failover simplicity

• Optimistic concurrency, scheduling/locking overhead

• Separate shared storage, holding transient job/tasks state

• Task cache management to speed up execution and facilitate iterative experimentation

• Throttling to limit the resource usage per customer/workspace

• Plugin architecture for task handlers and schedulers
