Netflix and Containers: Not Stranger Things

Page 1: Netflix and Containers: Not Stranger Things

Andrew Spyker (@aspyker) - Engineering Manager

Page 2: Netflix and Containers: Not Stranger Things

About Netflix

● 86.7M members
● A few thousand employees
● 190+ countries
● > ⅓ of NA internet download traffic
● 500+ Microservices
● Many 10’s of thousands of VM’s
● 3 regions across the world

Page 3: Netflix and Containers: Not Stranger Things

Netflix has an elastic, cloud native, immutable microservice architecture using full devops built on VM’s!

Why are we messing around with containers?

Page 4: Netflix and Containers: Not Stranger Things

Technical motivating factors for containers

● Simpler management of compute resources

● Simpler deployment packaging artifacts for compute jobs

● Need for a consistent local developer environment

Page 5: Netflix and Containers: Not Stranger Things

Sampling of realized container benefits

Media Encoding - encoding research development time
● VM’s platform to container platform - 1 month vs. 1 week

Continuous Integration Testing
● Build all Netflix codebases in hours
● Saves development 100’s of hours of debugging

Edge Re-architecture using NodeJS
● Focus returns to app development
● Simplifies, speeds test and deployment

Page 6: Netflix and Containers: Not Stranger Things

Batch applications

Page 7: Netflix and Containers: Not Stranger Things

Multi-tenant (cgroups/Mesos) historically used for batch

Linux cgroups

Page 8: Netflix and Containers: Not Stranger Things

What do batch users want?

● Simple shared resources, run till done, job files

● NOT
  ○ EC2 Instance sizes, autoscaling, AMI OS’s

● WHY
  ○ Offloads resource management ops, Simpler

Page 9: Netflix and Containers: Not Stranger Things

Introducing Titus

[Diagram: Titus for Batch. Layers: Job Management; Resource Management & Optimization; Container Execution; Integration. Upstream Systems: Workflow, Data Analysis, Adhoc.]

Page 10: Netflix and Containers: Not Stranger Things

Netflix Batch Job Examples

● Algorithm Model Training (with GPU’s)

Page 11: Netflix and Containers: Not Stranger Things

Netflix Batch Job Examples

● Media Encoding

● Digital Watermarking

Page 12: Netflix and Containers: Not Stranger Things

Netflix Batch Job Examples

● Open Connect CDN Reporting

● Adhoc Reporting

Page 13: Netflix and Containers: Not Stranger Things

Lessons Learned from Batch

● Docker helped generalize use cases
● Scheduling required (GPU, elastic)
● Initially ignored failures (with retries)
● Time sensitive batch came later
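The "ignored failures (with retries)" lesson can be illustrated with a minimal sketch. It assumes a batch task is an ordinary callable and that any exception counts as a retryable failure; `run_with_retries` and `make_flaky_task` are hypothetical names for illustration, not part of Titus.

```python
def run_with_retries(task, max_attempts=3):
    """Call task() until it succeeds or attempts run out.

    Returns (result, attempts_used). Hypothetical helper, not a Titus API.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as err:  # assumption: every failure is retryable
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def make_flaky_task(failures_before_success):
    """Build a task that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise OSError("simulated container failure")
        return "done"

    return task


if __name__ == "__main__":
    print(run_with_retries(make_flaky_task(2), max_attempts=3))
```

Blind retries like this suit run-till-done jobs; they say nothing about deadlines, which is one reason time sensitive batch needed more than this.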

Page 14: Netflix and Containers: Not Stranger Things

Current Container Usage - Batch
● 1000’s of container hosts (g2, m4, r3 instances)
● 1000’s containers / hour average
● Large spikes of CI testing and Digital Watermarking

From day of 10/26
● 47K containers
● Bursts of 1000 containers in a minute
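As a rough cross-check of these figures, the sketch below converts the one-day total into average rates and compares them with the quoted burst. It treats "47K" as exactly 47,000 and assumes a 24-hour day; the function names are illustrative only.

```python
def average_rates(containers_per_day):
    """Return (average per hour, average per minute) for a daily total."""
    per_hour = containers_per_day / 24
    per_minute = per_hour / 60
    return per_hour, per_minute


def burst_factor(burst_per_minute, containers_per_day):
    """How many times the average per-minute rate a burst represents."""
    _, per_minute = average_rates(containers_per_day)
    return burst_per_minute / per_minute


if __name__ == "__main__":
    per_hour, per_minute = average_rates(47_000)
    print(round(per_hour))                        # about 1958 per hour
    print(round(burst_factor(1000, 47_000), 1))   # about 30.6x the average
```

Under those assumptions the day averages just under 2000 containers per hour, in line with "1000’s containers / hour", and a burst of 1000 in a minute is roughly 30 times the average per-minute rate.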

Page 15: Netflix and Containers: Not Stranger Things

Service applications

Page 16: Netflix and Containers: Not Stranger Things

Why Services in containers?

[Figure: "Theory" vs. "Reality" panels; label: Developer]

Page 17: Netflix and Containers: Not Stranger Things

Opportunities to evolve our baking

● Java focused supported AMI, baking works well!

● However, wanted to allow
  ○ other stacks to evolve independent of OS updates
  ○ simplified builds (vs. Java and OS based tooling)
  ○ reliable smaller instances for dynamic languages
  ○ ability to develop locally with same image

Page 18: Netflix and Containers: Not Stranger Things

Services are just long running batch?

[Diagram: Titus with Services added alongside Batch. Layers: Job Management; Resource Management & Optimization; Container Execution; Integration. Inputs: Service Apps, Batch.]

Page 19: Netflix and Containers: Not Stranger Things

Nope, not that easy - Titus Details

[Architecture diagram. Component labels:
● Titus UI, Titus API, Rhea
● Titus Master: Job Management & Scheduler, Fenzo
● Cassandra, Zookeeper, S3, Docker Registry
● Mesos Master, EC2 Autoscaling API
● Titus Agent: Mesos agent, Titus executor, docker, metrics agent, logging agent, zfs, Pod & VPC network drivers, AWS metadata proxy, containers
● Integration: AWS VM’s, CI/CD]

Page 20: Netflix and Containers: Not Stranger Things

Services more complex

● Services resize constantly and run forever
  ○ Autoscaling
  ○ Hard to upgrade underlying hosts

● Require IPC integration
  ○ Routable IPs, service discovery
  ○ Ready for traffic vs. just started/stopped

● Existing well defined dev, deploy, runtime & ops tools

Page 21: Netflix and Containers: Not Stranger Things

Real networking is hard

Page 22: Netflix and Containers: Not Stranger Things

Multi-tenant

Need IP per container - in VPC

Using security groups

Using IAM roles

Considering network resource isolation

Page 23: Netflix and Containers: Not Stranger Things

Enabling VPC Networking

[Diagram: a Titus EC2 host VM running Task 0 through Task 3. Each task is shown as a pod (pod root, app) attached through veth<id>.
Host interfaces: eth1 / ENI1 (SecGrp=A), eth2 / ENI2 (SecGrp=X), eth3 / ENI3 (SecGrp=Y,Z).
Task labels: "No IP, SecGrp A", "SecGrp X" (two tasks), "SecGrp Y,Z"; addresses IP 1, IP 2, IP 3.
Other components: Linux Policy Based Routing + Traffic Control; Titus EC2 Metadata Proxy; 169.254.169.254 (non-routable IP) handled by IPTables NAT.]

Page 24: Netflix and Containers: Not Stranger Things

Solutions

● VPC Networking driver
  ○ Supports ENI’s - full IP functionality
  ○ Scheduled security groups
  ○ Support traffic control (resource isolation)

● EC2 Metadata proxy
  ○ Adds container “node” identity
  ○ Delivers IAM roles
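The metadata proxy idea can be sketched as a lookup from the calling container to its own identity and IAM role. This is a simplified illustration, not the Titus implementation: it assumes the proxy identifies callers by source IP and that it answers two standard EC2 metadata paths. `ContainerIdentity` and `MetadataProxy` are hypothetical names, and returning the task ID as the instance ID is an assumption about what container "node" identity means here.

```python
from dataclasses import dataclass

# Standard EC2 instance metadata paths (the role-list path returns the role name).
ROLE_LIST_PATH = "/latest/meta-data/iam/security-credentials/"
INSTANCE_ID_PATH = "/latest/meta-data/instance-id"


@dataclass(frozen=True)
class ContainerIdentity:
    task_id: str   # hypothetical: identity the container sees as its "node"
    iam_role: str  # role assigned to this container, not the host's role


class MetadataProxy:
    """Answers metadata requests per container instead of per host (sketch)."""

    def __init__(self, containers_by_ip):
        self._containers_by_ip = dict(containers_by_ip)

    def handle(self, source_ip, path):
        """Return (status, body) for a request from a container IP."""
        identity = self._containers_by_ip.get(source_ip)
        if identity is None:
            return 404, "unknown container"
        if path == ROLE_LIST_PATH:
            return 200, identity.iam_role
        if path == INSTANCE_ID_PATH:
            return 200, identity.task_id
        # Anything else is withheld so the host's own metadata stays protected.
        return 403, "path not exposed to containers"


if __name__ == "__main__":
    proxy = MetadataProxy({"10.0.0.5": ContainerIdentity("task-123", "encoderRole")})
    print(proxy.handle("10.0.0.5", ROLE_LIST_PATH))
```

A real proxy would also have to vend temporary credentials for the role; that step is omitted here.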

Page 25: Netflix and Containers: Not Stranger Things

Reuse existing infrastructure services

[Diagram: Netflix Cloud Infrastructure (VM’s + Containers). An app on the Cloud Platform (metrics, IPC, health) runs on EC2 VMs in a VPC, managed by the AWS AutoScaler. Shared services: Atlas, Eureka, Edda.]

Page 26: Netflix and Containers: Not Stranger Things

Enable them for containers

[Diagram: Netflix Cloud Infrastructure (VM’s + Containers). Left: an app on the Cloud Platform (metrics, IPC, health) on EC2 VMs in a VPC, managed by the AWS AutoScaler. Right: an app on the Cloud Platform (metrics, IPC, health) in containers, plus batch containers, managed by Titus Job Control. Shared services: Atlas, Eureka, Edda.]

Page 27: Netflix and Containers: Not Stranger Things

Spinnaker

Page 28: Netflix and Containers: Not Stranger Things

Deploy based on new image tags

Page 29: Netflix and Containers: Not Stranger Things

Basic resource requirements

IAM Roles & Sec Groups per container

Deploy Strategies

Same as VM’s

Page 30: Netflix and Containers: Not Stranger Things

Easily see health & discovery

Page 31: Netflix and Containers: Not Stranger Things
Page 32: Netflix and Containers: Not Stranger Things
Page 33: Netflix and Containers: Not Stranger Things

Secure Multi-tenancy

Common to VM’s and tiered security needed
● Protect the reduced host IAM role, allow containers to have specific IAM roles
● Needed to support same security groups in container networking as VM’s

User namespacing
● Docker 1.10 - Introduced User Namespaces
  ○ Didn’t work w/ shared networking NS
● Docker 1.11 - Fixed shared networking NS’s
  ○ But, namespacing is per daemon, not per container, as hoped
● Waiting on Linux
● Considering mass chmod / ZFS clones

Page 34: Netflix and Containers: Not Stranger Things

Titus Advanced Scheduling

● Support for AZ balancing
● Multiple instance types selected based on workload
● Elastic underlying common resource pool
  ○ Bin packing managed transparently across all apps
● Hard and soft constraints
● Resource affinity and task affinity
● Capacity guarantees (critical tier)

Page 35: Netflix and Containers: Not Stranger Things

Fenzo - Keep resource scheduling extensible

Fenzo - Extensible Scheduling Library

Features:
● Heterogeneous resources & tasks
● Autoscaling of Mesos cluster
  ○ Multiple instance types
● Plugin-based scheduling objectives
  ○ Bin packing, etc.
● Plugin-based constraints evaluator
  ○ Resource affinity, task locality, etc.
● Scheduling actions visibility

Page 36: Netflix and Containers: Not Stranger Things

Current Container Usage - Service

● Still small ~ 2000 long running containers

● NodeJS Device UI Scripts Apps
● Stream Processing Jobs - Flink
● Various Internal Dashboards

Page 37: Netflix and Containers: Not Stranger Things

Questions?

Andrew Spyker (@aspyker) - Engineering Manager