
ML INFRASTRUCTURE PART 3
Connectivity

Third Quarter 2019 — Algorithmia Research


About This Series

Since 2014, Algorithmia has accelerated the deployment and adoption of machine learning (ML) for many of the world's largest enterprises. In addition to our enterprise product, algorithmia.com currently serves more than 8,000 different models to more than 90,000 developers and processes millions of requests every day.

As we've scaled through the years and serviced requests from our customers, we've learned a lot about best practices for scaling machine learning infrastructure. We're passing that knowledge to ML developers to help them along the path to maturity and empower data science teams to achieve more.


Preparing data for ML pipelines is challenging when end-to-end data and analytic architectures are not refined to interoperate with underlying analytic platforms. New architectural patterns can help, but data engineers must understand end-to-end ML workflows to properly apply them.

Gartner, Preparing and Architecting for Machine Learning: 2018 Update, Carlton Sapp, 14 September 2018


ML Infrastructure

Machine learning (ML) toolsets, languages, and processes are evolving quickly. The infrastructure to support ML must be able to adapt as data scientists experiment with new and better solutions. At the same time, organizations must be able to connect a variety of systems into a platform that delivers consistent results.

ML architecture can be broken into four distinct functional groups: Data and Data Management Systems, Training Platforms and Frameworks, Serving and Life Cycle Management, and the External Systems with which they all interact. This paper examines the infrastructure needed to connect those groups into a system capable of productionizing ML at scale.

Data and Data Management Systems

These are the data used in ML projects for training and scoring, along with their related systems. In almost all cases, this infrastructure is in place before the buildout of the rest of an organization's ML architecture.


Most data management systems include built-in authentication, role-based access controls, and data views. In more advanced cases, the organization will have a data-as-a-service engine that allows for querying data through a unified interface. Even in the simplest cases, ML projects likely rely on a variety of data formats and data stores from many different vendors. For example, one model might train on images from a cloud-based Amazon S3 bucket, while another pulls rows from on-premises PostgreSQL and SQL Server databases, while a third interprets streaming transactional data from a Kafka pipeline.
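To make that variety concrete, the sketch below shows one platform reading from all three of the stores just described. It is illustrative only: it assumes the boto3, psycopg2, and kafka-python client libraries, and every bucket, connection string, and topic name is a placeholder.

    # Sketch: one ML platform reading from three unrelated data stores.
    # All bucket, connection, and topic names are placeholders.
    import boto3
    import psycopg2
    from kafka import KafkaConsumer

    # Model A trains on images from a cloud-based Amazon S3 bucket.
    s3 = boto3.client("s3")
    image_bytes = s3.get_object(Bucket="training-images", Key="cats/0001.jpg")["Body"].read()

    # Model B pulls rows from an on-premises PostgreSQL database.
    conn = psycopg2.connect("host=pg.internal dbname=features user=ml_reader")
    with conn.cursor() as cur:
        cur.execute("SELECT user_id, age, balance FROM features LIMIT 1000")
        rows = cur.fetchall()

    # Model C interprets streaming transactional data from a Kafka pipeline.
    consumer = KafkaConsumer("transactions", bootstrap_servers="kafka.internal:9092")
    for message in consumer:
        print(message.value)  # hand each record to the scoring model here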

Training Platforms and Frameworks

Training platforms and frameworks are the wide variety of tools used for model building and training, each of which should ultimately generate model files and dependencies that the serving infrastructure can run and manage.
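As a minimal illustration of that contract, the following sketch trains a scikit-learn model (the framework choice is ours, not a requirement) and emits the two artifacts serving needs: a serialized model file and a pinned dependency list. The versions shown are examples only.

    # Sketch: the handoff from training to serving, reduced to one model
    # artifact plus a pinned dependency list. Versions are illustrative.
    import joblib
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Whatever platform produced them, serving only needs these two files.
    joblib.dump(model, "model.joblib")
    with open("requirements.txt", "w") as f:
        f.write("scikit-learn==0.21.3\njoblib==0.13.2\n")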

] Figure 3.1: ML architecture functional groups. Data management systems (with ETL pipelines and a feature store) feed training platforms and frameworks such as Cloudera Workbench, TensorFlow, scikit-learn, PyTorch, Keras, SageMaker, Dataiku, H2O.ai, and Azure ML; models from each training system flow into serving (model portfolio, orchestration, CLI, dependency management, languages, versions, pipelining, compute, hardware, model evaluation, governance, monitoring), which in turn connects to external systems. Source: Algorithmia, 2019


Within training platforms, tooling options are nearly limitless. Dataiku, Amazon SageMaker, Azure ML Studio, Cloudera Data Science Workbench, and dozens of other commercial training platforms compete with home-grown solutions, any of which might be the right solution for a given team and job. Given the highly specialized nature of training tools, freedom of choice is paramount.

Serving and Life Cycle Management

The services that allow data scientists to deliver trained models into production and maintain them include everything needed to:

● INGEST and containerize models and dependencies;

● CATALOG models to make them discoverable;

● SERVE models in a scalable environment (a minimal sketch follows this list);

● INTEGRATE into DevOps alerting, logging, and system health monitoring;

● MANAGE and govern the entire life cycle in compliance with performance and regulatory / governance specifications.
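The sketch below illustrates only the SERVE step of that list, wrapping a serialized model in an HTTP endpoint with Flask. The route, port, and payload shape are our own assumptions; a production platform layers cataloging, monitoring, and governance on top.

    # Sketch: the SERVE step only, exposing a serialized model over HTTP.
    # Flask, the route, and the payload shape are assumptions.
    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("model.joblib")  # artifact ingested from training

    @app.route("/v1/models/churn/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]  # e.g. [[0.1, 0.2, ...]]
        return jsonify({"predictions": model.predict(features).tolist()})

    if __name__ == "__main__":
        app.run(port=8080)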

External Systems

Machine learning does not exist in isolation. A wide variety of applications external to the ML process need to consume model output, log and audit model behavior, or otherwise monitor or integrate with data, training, and production systems.


Connectivity Approaches

As discussed in The Roadmap to Machine Learning Maturity whitepaper, ML-focused projects generate value only after they connect these distinct functional areas into a workflow. Data is useful only after models interpret it, and model inference generates value when external apps consume it. The path toward integration generally falls into one of two categories:

Approach 1: Horizontally Integrated

● PRO: Fastest path to automating existing processes.

● CON: Fragile integrations, ongoing software development and maintenance, vendor lock-in.

The quickest way to develop an ML platform is by supporting only a subset of solutions from each of the functional groups. By limiting the available options, it is faster and easier to integrate each component into a horizontal platform, often hardcoding the handoff from one component to the next.
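The sketch below caricatures such a hardcoded handoff: a single script in which the storage bucket, training framework, feature layout, and deployment target are all baked in, so swapping any one component means rewriting the pipeline. All names are placeholders.

    # Sketch (anti-pattern): a horizontally integrated nightly job in which
    # every component is hardcoded. All identifiers are placeholders.
    import boto3
    import joblib
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    def nightly_pipeline():
        s3 = boto3.client("s3")
        # Storage: one bucket in one cloud...
        s3.download_file("acme-training-data", "daily.csv", "/tmp/daily.csv")
        # ...training: one framework and one fixed schema...
        df = pd.read_csv("/tmp/daily.csv")
        model = LogisticRegression().fit(df.drop(columns=["label"]), df["label"])
        # ...deployment: one serving location.
        joblib.dump(model, "/tmp/model.joblib")
        s3.upload_file("/tmp/model.joblib", "acme-models", "churn/latest.joblib")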

This is where many companies building DIY systems begin—automating existing processes and tightly integrating current tools. For these organizations, horizontal integration offers the fastest path to in-house production. It requires no additional workforce training and simply adds speed to workflows already in place.

] Figure 3.2: Horizontally integrated systems build hardcoded connectors between specific components.

Unfortunately, this commits an organization to full-time software development. Rather than training models and adding business value, organizations spend resources building and maintaining brittle integrations, creating new projects for any new tools and services, and ultimately, attempting to compete with commercial platforms with far larger budgets.

Many commercial platforms also pursue a horizontally integrated strategy, largely to block competitors. By mandating a training platform, storage solution, or deployment infrastructure, commercial vendors can increase the value of their customer relationships and reduce churn. Compared to DIY solutions, these vendors generally offer more points of integration—but only with non-competitive components, such as frameworks or languages. These solutions bet that what they provide will be good enough, and that the ease of an all-in-one solution will be worth vendor lock-in and lack of choice as the customer matures.

In theory, horizontally integrated systems can be quite efficient and easy to use, since they follow a simple path. In practice, resource constraints (for DIY systems) or lack of competition (for commercial systems) often leads to sub-par experiences throughout the system.

Regardless of whether the platform was purchased off the shelf or built in-house, organizations will ultimately encounter complications in their upgrade path as they attempt to stay on top in a fast-moving environment. Commercial cloud platform customers are beholden to a massive company's rollout schedule, while DIY shops are forced to wait for compatibility among multiple vendor roadmaps.

Approach 2: Loosely Coupled, Tightly Integrated

● PRO: Flexibility to select the best tools for any job, which helps future-proof ML investment.

● CON: More up-front development time or software licensing costs.

In ML, agility is essential because infrastructure that works today is guaranteed to be outdated in six months.

Fortunately, each component of the ML system is fairly self-contained, and the interactions of those components remain fairly consistent (sketched as interfaces after this list):

● Data informs all systems through queries.

● Training systems export model files and dependencies.

● Serving and life cycle management systems return inferences to applications and model pipelines, and export logs to systems of record.

● External systems call models, trigger events, and capture and modify data.
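One way to make those contracts explicit is to encode them as narrow interfaces, as in the hypothetical sketch below; the class and method names are ours, but each mirrors one interaction from the list above.

    # Hypothetical sketch: the interactions above as narrow interfaces. Any
    # implementation can be swapped as long as the contract holds.
    from abc import ABC, abstractmethod
    from typing import Any, Iterable, Tuple

    class DataSource(ABC):
        @abstractmethod
        def query(self, statement: str) -> Iterable[Any]:
            """Data informs all systems through queries."""

    class TrainingSystem(ABC):
        @abstractmethod
        def export(self) -> Tuple[str, str]:
            """Return paths to the model file and its dependency spec."""

    class ServingSystem(ABC):
        @abstractmethod
        def infer(self, payload: Any) -> Any:
            """Return inferences to applications and model pipelines."""

        @abstractmethod
        def export_logs(self) -> Iterable[str]:
            """Export logs to systems of record."""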

The infrastructure and workflows within each component are quite complex, but the connective tissue binding them together does not need to be.

Without the ability or necessity to do better, many systems simply don't, providing poor documentation, sub-par UX, or lackluster performance.

An architecture that allows each system to evolve independently can help organizations choose the right components for today without sacrificing the flexibility to rethink those choices tomorrow. To enable this loosely coupled, best-of-breed approach, a deployment platform must support three kinds of connectivity: Publish/Subscribe, Data Connectors, and RESTful APIs.

Publish/Subscribe

Publish/subscribe (pub/sub) is an asynchronous, message-oriented notification pattern. In a pub/sub model, one system acts as a publisher, sending events to a message broker. Through the message broker, subscriber systems explicitly enroll in a channel, and the broker forwards and verifies delivery of publisher notifications, which subscribers can then use as event triggers.

The pub/sub pattern is highly scalable, but in the context of ML, its most important feature is flexibility. By abstracting communications between publishers and subscribers, each side operates independently. This reduces the overhead of integrating any number of systems and allows publishers and subscribers to be swapped at any time, with no impact on performance. Since ML infrastructure is diverse, evolves quickly, and has a wide variety of demand cycles, the flexibility of pub/sub's loose coupling provides an excellent fit for most high-level communications tasks.
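As a concrete illustration of the pattern, the sketch below wires a publisher and a subscriber through AWS SNS and SQS using boto3. The topic ARN, queue URL, and region are placeholders, and the "model run" is reduced to a print statement.

    # Sketch: pub/sub with AWS SNS (broker) and SQS (subscriber queue) via
    # boto3. Topic ARN, queue URL, and region are placeholders.
    import boto3

    REGION = "us-east-1"
    TOPIC = "arn:aws:sns:us-east-1:123456789012:training-data-updated"
    QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/retrain-queue"

    # Publisher: the data system announces a change, knowing nothing about
    # who is subscribed.
    boto3.client("sns", region_name=REGION).publish(
        TopicArn=TOPIC,
        Message='{"dataset": "s3://training-images/cats/", "event": "updated"}',
    )

    # Subscriber: an event listener polls its queue and triggers a model run
    # per notification. Either side can be swapped independently.
    sqs = boto3.client("sqs", region_name=REGION)
    messages = sqs.receive_message(QueueUrl=QUEUE, WaitTimeSeconds=10)
    for msg in messages.get("Messages", []):
        print("trigger model run for:", msg["Body"])
        sqs.delete_message(QueueUrl=QUEUE, ReceiptHandle=msg["ReceiptHandle"])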


There are myriad technologies and services available to manage pub/sub systems—Amazon SNS, Azure Service Bus, Google Pub/Sub, Kafka, and many more. A deployment and management system must be designed to interact with these systems. Algorithmia's AI Layer provides configurable event listeners that allow users to trigger actions based on input from pub/sub systems.

^ Figure 3.3: A publisher sends notifications through a message broker (SNS) to subscribers (SQS queues and an AI Layer event listener): a change in source data triggers a run of a specific model in response, while also potentially triggering actions in other subscribed systems.

Data Connectors

While the model is the engine of any machine learning system, data is both the fuel and the driver. Data feeds the model during training, influences the model in production, then retrains the model in response to drift. As data changes, so does its interaction with the model, and to support that iterative process, an ML deployment and management system must integrate with every relevant data store.



Connecting to Cloud Data

From consumer apps to the enterprise, the cloud is the new default for data storage, and a common location for training data. Any model deployment and management platform should include connectors to the most popular cloud-based data storage services. The AI Layer, for example, includes support for multiple blob storage services from Amazon, Microsoft, Google, and others.

Connecting to Databases

An enormous amount of enterprise data is stored in databases, most of which remain on-premises. A model deployment platform should be extensible to allow developers to connect to a wide variety of databases.

^ Figure 3.4: The AI Layer includes pre-built integrations to popular cloud-based data sources.



Connecting to Other Sources

A massive amount of training data is stored in filesystems as images, delimited data files, or other file formats. While a deployment platform could mandate that data scientists upload these files to an approved cloud storage bucket and then connect directly (as do horizontally integrated platforms from cloud providers), this introduces an unnecessary step and ties users to storage solutions that can increase costs and might become deprecated in the future.

To offset these risks, deployment platforms should offer Web- and API-based tools to import and host data from filesystems. This hosted data store should support grouping of assets (similar to folders in a filesystem) and full permissioning at both the asset and group levels. Models and algorithms should be able to call all hosted data via a URI.
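As a sketch of what URI-addressable hosted data can look like, the snippet below uses the Algorithmia Python client's data API; the API key, collection, and file names are placeholders, and other platforms' clients will differ.

    # Sketch: URI-addressable hosted data via the Algorithmia Python client.
    # The API key, collection, and file names are placeholders.
    import Algorithmia

    client = Algorithmia.client("YOUR_API_KEY")

    # Group assets like folders, with permissioning at the collection level.
    labels = client.dir("data://.my/training_labels")
    if not labels.exists():
        labels.create()

    # Upload once; afterwards any model can fetch the asset by its URI.
    client.file("data://.my/training_labels/labels.csv").putFile("labels.csv")
    csv_text = client.file("data://.my/training_labels/labels.csv").getString()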

RESTful APIs

ML model output is consumed in many different ways. Applications written in a variety of languages call models directly. Other models, written in yet another set of languages, ingest output as part of multi-model pipelines.

Because of the variety of requesting platforms and the unpredictability of those requests, loose coupling is again the most elegant answer, and RESTful APIs are its most elegant implementation, thanks to the five required REST constraints (a sample request follows the list):

1. Uniform Interface

● DEFINITION: All requests adhere to a common format.

● BENEFIT: Requests from different systems are formatted identically, dramatically reducing the burden of supporting disparate platforms.

2. Client–Server

● DEFINITION: The server only interacts with the client through requests.

● BENEFIT: The client and the server are decoupled black boxes, allowing clients to change and servers to evolve without breaking the relationship.

3. Stateless

● DEFINITION: All necessary information must be included within a request rather than relying on information from previous requests or other data.

● BENEFIT: Statelessness greatly expands flexibility and horizontal scalability, while also enabling comprehensive auditing.

4. Layered System

● DEFINITION: The REST client is agnostic to any layers between itself and the server.

● BENEFIT: This allows for performance- or security-related infrastructure to sit between the client and server, as needed.

5. Cacheable

● DEFINITION: Developers can declare certain responses to be cacheable.

● BENEFIT: Cacheability reduces latency and increases scalability.
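Put together, a model call under these constraints is a single self-describing request, as in the sketch below. The endpoint, token scheme, and payload shape are illustrative assumptions, not any specific product's API.

    # Sketch: one stateless, uniform model call. The URL, auth scheme, and
    # payload shape are illustrative assumptions.
    import requests

    response = requests.post(
        "https://api.example.com/v1/algo/acme/churn_model/1.0.0",
        headers={
            "Authorization": "Bearer YOUR_TOKEN",  # self-authenticating (stateless)
            "Content-Type": "application/json",    # one common format (uniform)
        },
        json={"features": [[0.1, 0.2, 0.3]]},
        timeout=30,
    )
    response.raise_for_status()
    print(response.json())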

Management APIs

A deployment and management system should expose not just its models via API but also its management functions. Every point of entry and exit to the system, as well as all management commands, should be available via API. This will enable a variety of integrations with external systems and ensure that no application is truly incompatible.
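The sketch below shows what such a management surface might look like from a client's perspective. These endpoints are hypothetical stand-ins, not a documented interface; the point is that creating, versioning, and publishing a model are all plain API calls.

    # Sketch: management functions as plain API calls. Endpoints are
    # hypothetical stand-ins, not a documented interface.
    import requests

    BASE = "https://api.example.com/v1/algorithms"
    HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

    # Create a model record, register a new version, then publish it.
    requests.post(BASE, headers=HEADERS, json={"name": "churn_model"}, timeout=30)
    requests.put(
        f"{BASE}/churn_model/versions/1.0.1",
        headers=HEADERS,
        json={"artifact": "s3://acme-models/churn/1.0.1.joblib"},
        timeout=30,
    )
    requests.post(
        f"{BASE}/churn_model/publish",
        headers=HEADERS,
        json={"version": "1.0.1"},
        timeout=30,
    )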

The most requested integration for any deployment platform is likely Jupyter Notebooks—the standard interface for data science across a number of frameworks. Jupyter notebooks are used to document and visualize work and administer many functions of data science workbenches. Extending a notebook to include deployment is a natural closing of the loop that allows data scientists to remain in their preferred environment while seeing a project through to the finish.
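A closing notebook cell might then ship the model the data scientist just trained, as in the hedged sketch below; it reuses the hypothetical management endpoint from the previous section and assumes a `model` object trained in earlier cells.

    # Sketch: the last cell of a notebook, shipping the model trained above.
    # The endpoint is hypothetical; `model` comes from earlier cells.
    import io

    import joblib
    import requests

    buffer = io.BytesIO()
    joblib.dump(model, buffer)
    buffer.seek(0)

    requests.put(
        "https://api.example.com/v1/algorithms/churn_model/versions/1.0.2",
        headers={"Authorization": "Bearer YOUR_TOKEN"},
        files={"artifact": ("model.joblib", buffer)},
        timeout=60,
    )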

] Figure 3.5: Sample deployment to the AI Layer from a Jupyter Notebook (Connect, Deploy, Manage).


What’s Next?

In the previous chapters in this series, we identified seven challenges of the ML life cycle and discussed what it takes to deploy ML models. In subsequent chapters, we'll continue to examine what you need to maintain the ML life cycle, whether you build the architecture yourself, use a third-party solution, or work with a service provider. Topics will include:

● Serving & Scaling

● Management & Governance


About Algorithmia

Algorithmia helps organizations extend human potential with the AI Layer, a machine learning operating system.

The AI Layer empowers organizations to:

● Deploy models from a variety of frameworks, languages, and platforms.

● Connect popular data sources, orchestration engines, and step functions.

● Scale model inference on multiple infrastructure providers.

● Manage the ML life cycle with tools to iterate, audit, secure, and govern.

To learn more about how Algorithmia can help your company accelerate its ML journey, visit our website at algorithmia.com.


Copyright © 2019 Algorithmia, Inc. All Rights Reserved. WP7-190710-v.1.3