Transcript

Deploying Data Science with Docker and AWS

Audience: Cambridge AWS Meetup Group

Presenter: Matt McDonnell, Data Scientist at Metail

Date: 9th June 2016

Context

Lots of event stream data

Many AWS components

Outputs:

• Business Intelligence
• Bespoke Analysis
• Productionised Science

What?

Goal: Moving laptop analyses onto a server

Turn:

<types> run_analysis.sh <presses enter>

… analysis script retrieves data from DB, Looker, web, etc. …

… runs analysis …

… outputs results as csv, png, etc. to local hard disk …

<gets back command prompt>

Into:

Automated process running on a server

Why?

• Production scheduled task, e.g. Firm Wide Metrics daily processing

• Make use of more powerful Amazon Web Services (AWS) cloud resources for large scale analysis

• Ease of deployment for Data Science analysts

• Build consistent development environment

How?

• Containerize applications and runtime using Docker to produce images

• Store images on AWS Elastic Container Registry (ECR)

• Run images either locally or on Amazon Elastic Container Service (ECS)

• Use AWS Lambda functions to trigger scheduled tasks (or react to events)

What is Docker?

“Docker containers wrap up a piece of software in a complete filesystem that contains everything it needs to run: code, runtime, system tools, system libraries – anything you can install on a server. This guarantees that it will always run the same, regardless of the environment it is running in.” -- https://www.docker.com/what-docker

Public code: store Dockerfile on GitHub, use Travis to automatically build image on DockerHub

Private code: private Dockerfile, build locally, push image to AWS Elastic Container Registry
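The private-code path above (build locally, push to ECR) can be sketched as a short script. This is a sketch, not the talk's actual tooling: the repository URI is a placeholder, and it assumes boto3 credentials are configured and the Docker CLI is on the PATH.

```python
"""Sketch: build a private image locally and push it to AWS ECR."""
import base64
import subprocess


def parse_ecr_token(authorization_token):
    # ECR's GetAuthorizationToken returns a base64-encoded "user:password" string.
    user, _, password = base64.b64decode(authorization_token).decode().partition(":")
    return user, password


def push_image(repo_uri, tag="latest"):
    # Local import so the sketch is readable without boto3 installed.
    import boto3
    ecr = boto3.client("ecr")
    auth = ecr.get_authorization_token()["authorizationData"][0]
    user, password = parse_ecr_token(auth["authorizationToken"])
    registry = auth["proxyEndpoint"]
    # Log the Docker daemon in to the private registry, then push.
    subprocess.check_call(["docker", "login", "-u", user, "-p", password, registry])
    subprocess.check_call(["docker", "push", "{0}:{1}".format(repo_uri, tag)])


if __name__ == "__main__":
    # Hypothetical repository URI -- substitute your account ID and region.
    push_image("123456789012.dkr.ecr.eu-west-1.amazonaws.com/pyanalysis")
```

The build itself (`docker build -t <repo_uri> .`) happens before this, against the private Dockerfile.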

Example application: retrieve market data

PyAnalysis: application code built on the PCR image

https://github.com/mattmcd/PyAnalysis

PCR: Python Component Runtime Base Docker image

https://github.com/mattmcd/PCR
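The PCR/PyAnalysis split is a base-image pattern: the runtime lives in one image, the application layers on top. A minimal application Dockerfile along these lines might look like the following sketch (the paths, requirements file, and entry script are illustrative, not the actual repo contents):

```dockerfile
# Application image layered on the runtime base image.
FROM mattmcd/pcr:latest

# Copy the analysis code into the image.
COPY . /opt/pyanalysis
WORKDIR /opt/pyanalysis

# Bake Python dependencies into the image for a consistent environment.
RUN pip install -r requirements.txt

# Default command: the same script you would run by hand on a laptop.
CMD ["./run_analysis.sh"]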

Where? Amazon Web Services Cloud

• Elastic Container Service (ECS)
  • Defines the task that runs the container
  • Runs tasks on a cluster of EC2 nodes

• EC2 instance set up to act as a node
  • Needs to be an AWS ECS-optimized AMI

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/launch_container_instance.html

• Needs an IAM Role that has:
  • The AmazonEC2ContainerServiceforEC2Role policy attached
  • Policies to allow access to any AWS resources needed, e.g. S3

• Lambda function to trigger the ECS task
  • cron equivalent by using CloudWatch scheduled events

EC2 Instance Security Group

The EC2 instance used by ECS can be locked down: there is no need to SSH into it, so no inbound ports are needed

EC2 Instance AMI

Use the latest available Amazon ECS-Optimized AMI: it has Docker and the ECS Container Agent already installed

EC2 Instance Details

Enable Auto-assign Public IP so ECS can connect, and assign a custom IAM Role as a hook for access permissions

EC2 Instance IAM Role

Attach the AmazonEC2ContainerServiceforEC2Role policy and any extra access policies for containers on the instance

ECS Task

ECS task retrieves image and runs it
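Before ECS can run anything, a task definition has to point at the image in ECR. A minimal registration sketch with boto3 follows; the family name, container name, image URI, and resource limits are all illustrative placeholders.

```python
"""Sketch: register an ECS task definition for the pushed image."""


def make_container_definition(name, image, memory_mib=512, cpu_units=256):
    """Build the container definition dict that register_task_definition expects."""
    return {
        "name": name,
        "image": image,
        "memory": memory_mib,  # hard memory limit in MiB
        "cpu": cpu_units,      # CPU shares (1024 = one vCPU)
        "essential": True,     # stop the task if this container exits
    }


def register_task(family, container_definition):
    import boto3  # local import: only needed when actually talking to AWS
    ecs = boto3.client("ecs")
    return ecs.register_task_definition(
        family=family,
        containerDefinitions=[container_definition],
    )


if __name__ == "__main__":
    container = make_container_definition(
        "pyanalysis",
        "123456789012.dkr.ecr.eu-west-1.amazonaws.com/pyanalysis:latest",
    )
    register_task("pyanalysis-task", container)
```

Registering the same family again creates a new revision; running the family name alone uses the latest revision.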

Lambda function

Use the lambda-canary blueprint as a basis for cron job equivalents

Lambda function

cron job equivalent via CloudWatch scheduled event
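The scheduled-event wiring can also be done from boto3 rather than the console. The sketch below creates a rule and targets the Lambda function with it; the rule name and Lambda ARN are placeholders, and the schedule fires daily at 06:00 UTC.

```python
"""Sketch: a cron-equivalent CloudWatch Events rule targeting a Lambda function."""


def daily_cron(hour, minute=0):
    # AWS cron expressions have six fields: minute hour day-of-month month day-of-week year.
    return "cron({0} {1} * * ? *)".format(minute, hour)


def schedule_lambda(rule_name, lambda_arn, schedule_expression):
    import boto3  # local import: only needed when calling AWS
    events = boto3.client("events")
    rule = events.put_rule(Name=rule_name, ScheduleExpression=schedule_expression)
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "run-analysis-target", "Arn": lambda_arn}],
    )
    return rule["RuleArn"]


if __name__ == "__main__":
    # Placeholder ARN -- substitute your account, region, and function name.
    schedule_lambda(
        "daily-analysis",
        "arn:aws:lambda:eu-west-1:123456789012:function:runAnalysisTask",
        daily_cron(6),
    )
```

CloudWatch Events also needs permission to invoke the function (the console or `lambda add-permission` sets this up).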

Lambda function

Simple Lambda function to run task on ECS
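A minimal handler along these lines is enough to start the task; the cluster and task definition names below are placeholders, not the talk's actual configuration.

```python
"""Sketch: Lambda handler that starts the ECS task on a schedule."""

CLUSTER = "default"                  # name of the ECS cluster
TASK_DEFINITION = "pyanalysis-task"  # family name; ECS runs the latest revision


def extract_task_arns(response):
    """Pull the started task ARNs out of the RunTask response for logging."""
    return [task["taskArn"] for task in response.get("tasks", [])]


def lambda_handler(event, context):
    import boto3  # available by default in the AWS Lambda Python runtime
    ecs = boto3.client("ecs")
    response = ecs.run_task(
        cluster=CLUSTER,
        taskDefinition=TASK_DEFINITION,
        count=1,
    )
    # Return the task ARNs so each invocation is easy to audit in the logs.
    return extract_task_arns(response)
```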

Lambda function IAM role

AWS will create a default IAM Role for the Lambda function; you need to add ecs:RunTask permission so it can run the container
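The extra statement attached to the Lambda role might look like the following sketch; `Resource` is left open here, but restricting it to the specific task definition ARN is safer.

```python
"""Sketch: IAM policy statement granting the Lambda role ecs:RunTask."""
import json

RUN_TASK_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ecs:RunTask"],
            "Resource": "*",  # or the specific task definition ARN
        }
    ],
}

if __name__ == "__main__":
    # Paste into the IAM console as an inline policy, or attach via the iam client.
    print(json.dumps(RUN_TASK_POLICY, indent=2))
```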

Demo / Q&A

Blog posts

• ‘Scheduled Downloads using AWS EC2 and Docker — Medium’ http://bit.ly/1TO9a1h (me)

• ‘Better Together: Amazon ECS and AWS Lambda’ http://amzn.to/1UkitEF (not me)

Code samples

• https://github.com/mattmcd/PyAnalysis

• https://github.com/mattmcd/PCR

Docker images

• mattmcd/pyanalysis

• mattmcd/pcr

Me

• Twitter @mattmcd

• Email [email protected] or [email protected]