© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Tim Secor - Manager, Developer Productivity
8/11/2016
Continuous Integration
with ECS and Docker
Topics
• Who is Okta
• Okta Engineering: How do we work, how do we ship
our code?
• The Challenge of the Developer Productivity Team
• A CI System with Amazon EC2 Container Service and
Docker
Okta: Connect Everything
• Connects all users, devices,
applications, and organizations
• SSO, Adaptive MFA,
Provisioning, Universal Directory,
Mobility
• The broadest and deepest
application network
Leader: Okta
Magic Quadrant
Leader: Okta
Forrester Wave
What We Do
We believe that connecting
everything will make organizations
more productive and more secure.
What We Believe
We Make Customers Successful
© Okta and/or its affiliates. All rights reserved. Okta Confidential
Millions of people use Okta every day
Thousands of enterprises use Okta to connect to Adobe’s Creative Cloud
Thousands of Enterprise Customers
• Ed, Gov, Non-Profit
• Services
• Media
• Consumer
• Technology
• Manufacturing, Energy
• Finance
• Cloud
• Health
Okta IT & Platform products
Okta Application Network
• Universal Directory: Extensible Profiles, Attribute Transformations,
Directory Integration and AD Password Management
• Single Sign On: Secure SSO for All Your Web Apps, On-prem and Cloud,
with Flexible Policy, from Any Device
• Adaptive MFA: Contextual Access Policies, Modern Factors, Adaptive
Authentication, Integrations for Apps and VPNs
• Provisioning: Lifecycle Management, Cloud & On-prem App Integration,
Mastering from Apps, Directory Provisioning, Rules, Workflow, Reporting
• Mobility Management: Tight User Identity Integration, Device Based
Contextual Access, Light-weight Management
The most reliable IDaaS available
Never taken offline for upgrades
Redundant and scalable
[Diagram: nodes labeled A, B, C in each of two datacenters, DC1 and DC2]
okta.com/trust
A Platform Architecture For Scale
[Diagram: load balancers, app servers, and data tier, with labels A, B, C]
Global Datacenters
Okta Engineering: How Do We Work, How Do We Ship
Our Code?
• 200 engineers, split into teams with embedded
specialists
• 1-week sprints, with weekly deploys to production
• Capability to do more than one hotfix per day at
customers’ request or for bugs found in CI or pre-prod
• Every merge to master is a potential release candidate
Okta Engineering: How Do We Test Our Code?
• Every topic branch goes through the same testing rigor
as a release candidate.
• Passing automated tests is enforced at commit time.
• Largest repo: 30K tests, takes 60 minutes (22 parallel
runs)
• Smallest repo: 100 tests, 5 minutes
• The Developer Productivity team is responsible for
supporting engineering.
Challenge of Developer Productivity Team
• Developer experience: developers expect fast turnaround time and
reliable results.
• Quality: we need to run all the tests required to guarantee quality.
• Cost: we need to run an infrastructure that is as cost-effective as
possible.
• Cloud First: we aim to use cloud services first, wherever possible.
Vision
• Clean testing environments: isolate test environments from
others, for parallel and serial runs
• Dynamic worker scaling: workers should survive the loss of their
build server; the worker pool should scale quickly; the number of
workers should not affect the memory footprint of the build server
• Spot instances for cost: run our services at cheaper rates, since we
have many short-lived tasks and can handle a few failures
• Versioned testing: enable testing of infrastructure changes in
topic branches
• Improved queuing system: centralized, with good visibility; should
survive build server reboots; shouldn’t be tied to specific workers
or build servers; re-queues lost tasks
• Less infrastructure flakiness: push testing and creation of test
machines to developers
• The correct privileges, to maintain security: launch tasks in secure
environments
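One way to picture the queuing goals listed under Vision (centralized, not tied to specific workers, re-queuing of lost tasks) is a lease-based queue, the same idea as a visibility timeout in Simple Queue Service. The following is a minimal in-memory sketch, not the system described in this talk; `LeaseQueue`, its method names, and the injectable `clock` are hypothetical.

```python
import time
from collections import deque


class LeaseQueue:
    """In-memory queue where a received task must be acknowledged before its lease expires."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock
        self.pending = deque()   # tasks waiting for a worker
        self.in_flight = {}      # task -> lease expiry time

    def put(self, task):
        self.pending.append(task)

    def receive(self):
        """Hand out the next task, first re-queuing any task whose lease expired."""
        self._requeue_expired()
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.in_flight[task] = self.clock() + self.lease_seconds
        return task

    def ack(self, task):
        """Worker finished the task; it will not be handed out again."""
        self.in_flight.pop(task, None)

    def _requeue_expired(self):
        now = self.clock()
        for task, expiry in list(self.in_flight.items()):
            if expiry <= now:
                # The worker holding this task is presumed lost.
                del self.in_flight[task]
                self.pending.append(task)
```

Because the queue, not the worker or the build server, owns the task state, a worker that disappears (for example a reclaimed spot instance) only delays its task by one lease period.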
EC2 Container Service and Docker
• Amazon Web Services + Java app tailored to Okta
process
• Immutable and Disposable build workers—created for
one-time use, destroyed when job is done
• Near ZERO cost on weekends, scales with load
• EC2 Container Service allows us to maximize usage of
EC2 instances
• Same containers for multiple types and numbers of
builds
• Same Machine Image can run multiple docker images
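The "near ZERO cost on weekends, scales with load" behavior can be pictured as sizing the worker fleet from the work that is queued. Here is a rough sketch, assuming each job declares a memory reservation and all instances are the same size; `desired_instances` and its parameters are hypothetical and are not taken from the Java app mentioned above.

```python
import math


def desired_instances(job_memory_mb, instance_memory_mb, max_instances):
    """Return how many identical instances are needed to hold the queued jobs.

    This is a lower bound based on total memory; it ignores fragmentation,
    so a real scheduler may need slightly more.
    """
    if instance_memory_mb <= 0:
        raise ValueError("instance_memory_mb must be positive")
    total = sum(job_memory_mb)
    needed = math.ceil(total / instance_memory_mb)
    return min(needed, max_instances)
```

With an empty queue the function returns 0, which is the weekend case: no queued jobs, no instances to pay for.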
Docker Update
• Update Dockerfile and our CI system builds the new image,
uploading it to our repository
• Update task definition for cluster updates
Dockerfile
FROM docker.aue1d.saasure.com/okta-base:2.0
MAINTAINER Okta
RUN useradd -d /home/container_user -m -s /bin/bash container_user
# Install wget, tar, hostname
RUN yum install -y wget tar hostname
# Install Java 8
RUN yum install -y java-1.8.0-oracle-1.8.0_31
RUN mkdir -p /opt/sage
RUN mkdir -p /var/log/sage
RUN chown container_user /var/log/sage
ADD conf/* /opt/sage/conf/
ADD core/target/core-*.jar /opt/sage/sage.jar
EXPOSE 8882 8883
USER container_user
CMD java $OKTA_SAGE_JAVA_ARGS -jar /opt/sage/sage.jar server /opt/sage/conf/sage.yml
Docker Security Conventions
Container repository - only allow containers from internal repository
Security scanning of containers - JFrog Xray
Process monitoring on Docker host - cAdvisor from Google
Secrets or any form of config NEVER baked into containers
Start from minimal, audited base OS
Run container as non-privileged user with user namespaces (Docker 1.10+)
Monitor alas.aws.amazon.com for critical updates
Docker Source Conventions
3 categories of container definitions
1. “Library” definitions used as the basis for building other images
2. Third-party service definitions e.g. Zookeeper or Elasticsearch
3. Internal service definitions
Repo per internal service
• Dockerfile in same repo => image versioned with code
• Docker compose for running dependent services
• Pegged versions (no builds)
Single repo for library and third-party service definitions
Docker Build Conventions
Integration tests run against code running in container
Build owns creating immutable version and publishing to
artifact server
Strict rules around “FROM” clause
• Must point at internal artifact server
• Must be tagged following SEMVER-SHORT_SHA convention
• Never allow a missing tag or the “latest” tag, so that builds stay repeatable
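The “FROM” rules above lend themselves to an automated check at build time. The sketch below is one possible way to do it, not Okta's actual tooling: `check_from_lines`, the registry name passed in, and the exact tag pattern (MAJOR.MINOR.PATCH followed by a short hex SHA) are assumptions made for illustration. It does not handle flags such as `--platform` before the image name.

```python
import re

# Assumed tag shape: MAJOR.MINOR.PATCH-<7 to 40 hex chars>, e.g. 1.4.2-a1b2c3d.
TAG_PATTERN = re.compile(r"^\d+\.\d+\.\d+-[0-9a-f]{7,40}$")


def check_from_lines(dockerfile_text, internal_registry):
    """Return a list of human-readable violations for every FROM line."""
    problems = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        parts = line.split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        image = parts[1]
        if not image.startswith(internal_registry + "/"):
            problems.append(f"line {number}: image is not from {internal_registry}")
        # Look for the tag only after the final '/', so a registry port is not mistaken for one.
        name = image.rsplit("/", 1)[-1]
        if ":" not in name:
            problems.append(f"line {number}: missing tag")
            continue
        tag = name.rsplit(":", 1)[1]
        if tag == "latest":
            problems.append(f"line {number}: 'latest' tag is not allowed")
        elif not TAG_PATTERN.match(tag):
            problems.append(f"line {number}: tag '{tag}' is not SEMVER-SHORT_SHA")
    return problems
```

A build step could fail whenever the returned list is non-empty, which keeps the convention enforced rather than merely documented.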
Logging and monitoring
• Logging
• All output streams pipe to STDOUT/STDERR of the running process
• Log forwarding is provided by underlying host
• Log entries contain
• Host
• Container Id
• Image name & version
• Request Id
• Metrics
• Host level, generic container metrics provided by host
• App level metrics published directly to well defined endpoints
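A log line that follows the convention above could look like the output of this sketch. It writes one JSON object per line to STDOUT and leaves forwarding to the host. The function name `log_event` and the JSON field names are hypothetical; the talk only lists which pieces of context each entry contains.

```python
import json
import sys
from datetime import datetime, timezone


def log_event(message, *, host, container_id, image, request_id, stream=sys.stdout):
    """Write one JSON log line carrying the context fields listed above."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "host": host,
        "container_id": container_id,
        "image": image,          # image name and version in one string
        "request_id": request_id,
        "message": message,
    }
    stream.write(json.dumps(entry) + "\n")
    stream.flush()
    return entry
```

Keeping every entry on a single line makes it straightforward for a host-level forwarder to ship entries without multi-line parsing.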
Amazon EC2 Container Service
• ECS Under The Hood
Amazon EC2 Container Service Host Management
Userdata installs:
• Slave terminator – T-800
• Base docker images an option
• Credentials – from s3
• Splunk Forwarder – logging
• Cluster target
• Cache – code and libs
Amazon EC2 Container Service
Identity and Access Management separation per service
• Either one service per cluster, or use the new Identity and Access
Management functionality for Elastic Container Service
Sharing the docker daemon to allow running docker within
docker
Pre-fetching large data blobs and making them available
on the hosts is an option
Multiple containers: MySQL, Redis, kinesalite
Task Definitions
{
  "taskDefinitionArn": "arn:aws:ecs:us-east-1:262205085595:task-definition/base-container-box-task:1",
  "containerDefinitions": [
    {
      "memory": 15000,
      "essential": true,
      "mountPoints": [
        {
          "containerPath": "/usr/bin/docker",
          "sourceVolume": "docker_daemon",
          "readOnly": null
        },
        {
          "containerPath": "/var/run/docker.sock",
          "sourceVolume": "docker_socket",
          "readOnly": null
        }
      ]
    }
  ],
  "volumes": [
    {
      "host": {
        "sourcePath": "/var/run/docker.sock"
      },
      "name": "docker_socket"
    },
    {
      "host": {
        "sourcePath": "/usr/bin/docker"
      },
      "name": "docker_daemon"
    }
  ],
  "family": "base-container-box-task"
}
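Each `sourceVolume` in a container's `mountPoints` has to match the `name` of an entry in `volumes`; in the task definition above, that pairing is what exposes the host's Docker binary and socket to the container. A small check like the following can catch a mismatch before the definition is registered. The helper `undeclared_volumes` is hypothetical, and the dictionary repeats only the fields shown on the slide.

```python
def undeclared_volumes(task_definition):
    """Return sourceVolume names used by mountPoints but missing from 'volumes'."""
    declared = {volume["name"] for volume in task_definition.get("volumes", [])}
    missing = []
    for container in task_definition.get("containerDefinitions", []):
        for mount in container.get("mountPoints", []):
            if mount["sourceVolume"] not in declared:
                missing.append(mount["sourceVolume"])
    return missing


task_definition = {
    "family": "base-container-box-task",
    "containerDefinitions": [
        {
            "memory": 15000,
            "essential": True,
            "mountPoints": [
                {"containerPath": "/usr/bin/docker", "sourceVolume": "docker_daemon"},
                {"containerPath": "/var/run/docker.sock", "sourceVolume": "docker_socket"},
            ],
        }
    ],
    "volumes": [
        {"name": "docker_socket", "host": {"sourcePath": "/var/run/docker.sock"}},
        {"name": "docker_daemon", "host": {"sourcePath": "/usr/bin/docker"}},
    ],
}

print(undeclared_volumes(task_definition))  # prints []
```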
Clean Testing Environments
• Docker images
• Nearly instant machine refresh
• Easy for users to create and upload images that have
been tested to work locally
• Efficient Machine use
• Amazon EC2 Container Service with EC2 Container
Registry and private repository backend
Dynamic Worker Scaling
[Diagram labels: Simple Queue Service, Lambda, Simple Notification
Service, Lambda, Scaling, Bin Packing, EC2 Container Service]
Dynamic Worker Scaling
Lambda allocates jobs using bin packing
This is one of the changes we had to make in order to use
EC2 Container Service for long-running tasks, rather than
services spread across many stateless instances
Disconnects unneeded nodes from the cluster, allowing
them to self-terminate when they are idle
[Diagram: bin packing vs. spreading tasks across instances]
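To make the bin packing idea concrete, here is a minimal best-fit sketch: each job goes to the fullest instance that still has room, so lightly used instances stay empty and can be released. This is an illustration under simplifying assumptions (memory is the only resource, and instances that receive no job are treated as releasable); the function `bin_pack` and its data shapes are hypothetical, not the Lambda function from the talk.

```python
def bin_pack(jobs, instances):
    """Assign jobs to instances, filling the fullest instance that still has room.

    jobs: dict of job name -> memory needed (MB)
    instances: dict of instance id -> free memory (MB)
    Returns (placements, unplaced, unused) where unused lists instances that
    received no job in this pass.
    """
    free = dict(instances)
    placements = {}
    unplaced = []
    # Largest jobs first, so big jobs are not crowded out by small ones.
    for job, needed in sorted(jobs.items(), key=lambda item: -item[1]):
        candidates = [i for i, mem in free.items() if mem >= needed]
        if not candidates:
            unplaced.append(job)
            continue
        # Best fit: the instance with the least free memory that still fits.
        target = min(candidates, key=lambda i: free[i])
        free[target] -= needed
        placements[job] = target
    used = set(placements.values())
    unused = [i for i in instances if i not in used]
    return placements, unplaced, unused
```

A spreading scheduler would instead pick the instance with the most free memory, which keeps every instance partly busy and prevents any of them from going idle.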
Versioned Jobs With EC2 Container Service
• Versioned build and test scripts can now be run in
versioned docker containers, using versioned task
definitions
• Creates extreme flexibility
• CloudFormation allows us to stand up whole new
clusters with all different versions in a matter of minutes
for long term testing
EC2 Container Service + Docker Problems
• Docker containers not launching
• EC2 Container Service agent failing
• Docker containers stopping
• Incompatibility with certain services
• Docker OS availability
• Cleanup
• Image size
EC2 Container Service Feature Requests
• Elastic Load Balancer
• Dynamic port mapping to containers
• Fail health based on HTTP return code
• Different health endpoint for adding vs removing
• Bin packing scheduler
• Could provide better cost management reporting and tools
• Ability to mark container instances as un-schedulable
• Remove sharp edges around the stopped state
• Give Auto Scaling Groups ability to set Elastic Compute Cloud instance
“shutdown behavior”
• Periodic cleanup process in Elastic Container Service to deregister stopped
instances
EC2 Container Service Takeaways
• /etc/ecs/ecs.config
• ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION for forensics (default 1hr)
• ECS_LOGLEVEL=debug
• Beware of running services in same cluster that use the same ports
• Tune Elastic Load Balancer health check
• Docker 1.10 for security enhancements
• Canary & Blue/Green separate service attached to same Elastic Load Balancer
• Rollback is trivial
• Elastic Container Service is incredibly easy to get up and running
• The ecosystem is changing quickly, we are moving cautiously
• Holding off on stateful services in Docker
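The agent settings mentioned in these takeaways live in `/etc/ecs/ecs.config` as plain `KEY=value` lines. As a small sketch, the file content could be generated at host bootstrap like this. The function `render_ecs_config`, the cluster name `ci-workers`, and the `6h` retention value are hypothetical examples. `ECS_CLUSTER` is the agent variable that selects which cluster the host joins.

```python
def render_ecs_config(cluster, cleanup_wait="1h", log_level="info"):
    """Return the text of an ecs.config file for the ECS container agent."""
    settings = {
        "ECS_CLUSTER": cluster,
        # How long stopped task containers are kept before cleanup (useful for forensics).
        "ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION": cleanup_wait,
        "ECS_LOGLEVEL": log_level,
    }
    return "".join(f"{key}={value}\n" for key, value in settings.items())


print(render_ecs_config("ci-workers", cleanup_wait="6h", log_level="debug"), end="")
```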
Amazon Web Services
Elastic Compute Cloud
Simple Queue Service
Lambda
EC2 Container Service
Simple Storage Service
Relational Database Service
Kinesis
EC2 Spot Instances
EC2 Container Registry
CloudFormation
Simple Notification Service
CloudWatch
CloudTrail
Expand Use
• Use EC2 Container Service for more services
• Allow Developers to control their test suites and Docker
images more directly
• Developer Environments
• Use docker for local long running services
• Use a VM running the same version OS
• Remote updates to keep it in line with CI
• Aim to enable running CI containers right out of the box
Result: Happy Engineering Team
• Developers can write more tests, more quickly.
• Happy devs, timely build/test status feedback.
• Happy quality team, all tests are run at each commit.
• Happy ops team, release candidate produced quickly.
• Happy management, infra budget is under control.