ED BUKOSKI • SENIOR SOFTWARE ENGINEER • NETFLIX • @EBUKOSKI

Bitbucket in the AWS Cloud At Netflix

Baking Stash in the AWS Cloud at Netflix

Agenda

• Bitbucket in AWS

• Lessons Learned

• Q & A

Pre-Cloud

• Traditional DC

• N-Tier Web Applications

• Relational Databases

• Java, Linux, Apache

Cloud

• Cloud infrastructure - AWS

• NoSQL

• High Availability

• Scale

• Performance

Bitbucket At Netflix

• Total LOC: 1,610,528,526

• Total Commits: 5,554,379

• Builds / Day: 7,132

• Jenkins Slaves: 335

• Developers: 950

Bitbucket in AWS

Cloud Infrastructure

Metrics, Monitoring, Backups

Load Testing

Bitbucket in AWS

Cloud Infrastructure

Amazon Machine Image (AMI)

Elastic Compute Cloud (EC2) Instance

Elastic Block Storage Volume (EBS)

Relational Database (RDS)

Elastic Load Balancer (ELB)

Cloud Infrastructure: Amazon Machine Image (AMI)

1. Stock Atlassian AMI

• nginx, Stash, PostgreSQL

• setups & configs

2. Custom AMI

• Apache/Tomcat

• Organizational

• Metrics Sidecars

How to Make?

Cloud Infrastructure

EC2 Instance: c3.8xlarge, 32 vCPU, 60 GB RAM, 2 x 320 GB SSD ephemeral

EBS Volume: General Purpose (gp2), 1 TB, Stash Home

RDS: db.m3.xlarge, 4 vCPU, 16 GB RAM, 100 GB storage

ELB, DNS

Auto Scaling Group?
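
The sizing on this slide can be written down as data, which makes it easier to review and reuse. The following is only a sketch: the class name `StashStack`, the helper `run_instances_params`, the device name `/dev/sdf`, and the AMI id are hypothetical, and 1 TB is approximated as 1000 GiB. It builds the keyword arguments one might pass to boto3's `ec2.run_instances` but does not call AWS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StashStack:
    """Sizing taken from the slide; field names are illustrative."""
    instance_type: str = "c3.8xlarge"
    home_volume_type: str = "gp2"
    home_volume_gib: int = 1000  # roughly the 1 TB Stash home volume
    db_instance_class: str = "db.m3.xlarge"
    db_storage_gib: int = 100


def run_instances_params(stack, ami_id, home_device="/dev/sdf"):
    """Build keyword arguments for ec2.run_instances (nothing is launched here)."""
    return {
        "ImageId": ami_id,
        "InstanceType": stack.instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "BlockDeviceMappings": [
            {
                "DeviceName": home_device,  # assumed device name
                "Ebs": {
                    "VolumeSize": stack.home_volume_gib,
                    "VolumeType": stack.home_volume_type,
                    # Keep the Stash home volume if the instance goes away.
                    "DeleteOnTermination": False,
                },
            }
        ],
    }


if __name__ == "__main__":
    print(run_instances_params(StashStack(), "ami-00000000"))
```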

Bitbucket in AWS

Metrics, Monitoring, Backups

Easy, drop-in metrics

• Completely stand-alone WAR

• System (CPU, memory, etc.), HTTP sessions, JDBC

• Built-in charting

Prana

Sidecar “platform”

Standalone JVM

Logstash, Elasticsearch, Kibana

• Ship and index log files

• Visualize with Kibana

• Response times, error rates
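
To make "response times, error rates" concrete, here is a small sketch of the aggregation a dashboard would show once log lines are indexed. It is not how Logstash or Kibana compute anything. The line format (`<method> <path> <status> <millis>ms`) and the function name `summarize` are assumptions made for the example; real access logs would need their own pattern.

```python
import re
from statistics import median

# Hypothetical log line format: '<method> <path> <status> <millis>ms'
LINE = re.compile(r"^(?P<method>\S+) (?P<path>\S+) (?P<status>\d{3}) (?P<ms>\d+)ms$")


def summarize(lines):
    """Return request count, 5xx error rate, and median response time in ms."""
    durations = []
    errors = 0
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not look like requests
        durations.append(int(match["ms"]))
        if int(match["status"]) >= 500:
            errors += 1
    total = len(durations)
    if total == 0:
        return {"requests": 0, "error_rate": 0.0, "median_ms": None}
    return {
        "requests": total,
        "error_rate": errors / total,
        "median_ms": median(durations),
    }


if __name__ == "__main__":
    sample = [
        "GET /projects 200 120ms",
        "POST /scm/repo.git 500 900ms",
        "GET /login 200 80ms",
    ]
    print(summarize(sample))
```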

Bitbucket DIY Backup Scripts

1. EBS Snapshots

2. Database backups*

RDS Database

*EBS Instance only

custom scripting

Maintenance mode: short (< 30 s)
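
The backup flow above can be sketched as a small orchestration function. This is an illustration under assumptions, not the Bitbucket DIY Backup scripts themselves: `run_backup` and its four callables are hypothetical names, and the caller supplies whatever actually locks Stash and starts the snapshots. Both EBS and RDS snapshot requests return before the snapshot finishes copying, which is one plausible reason the maintenance window can stay short.

```python
import time


def run_backup(lock, unlock, snapshot_home, snapshot_db, clock=time.monotonic):
    """Start both snapshots inside a maintenance window.

    Returns the length of the window in seconds. The unlock step runs
    even if a snapshot call raises, so the application is not left in
    maintenance mode.
    """
    started = clock()
    lock()
    try:
        snapshot_home()  # e.g. start an EBS snapshot of the home volume
        snapshot_db()    # e.g. start an RDS snapshot of the database
    finally:
        unlock()
    return clock() - started


if __name__ == "__main__":
    seconds = run_backup(
        lock=lambda: print("enter maintenance mode"),
        unlock=lambda: print("leave maintenance mode"),
        snapshot_home=lambda: print("snapshot home volume"),
        snapshot_db=lambda: print("snapshot database"),
    )
    print(f"maintenance window: {seconds:.3f}s")
```

Passing the steps in as callables keeps the ordering testable without touching AWS.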

Load Testing

load testing goals

client side measurements

1. Load testing scripts

2. Bake an AMI (Aminator)

3. Create Launch Configs and ASG

4. Scale up ASG to generate load
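
Step 1 and the "client side measurements" goal can be sketched as a tiny load driver that each instance in the load-generating ASG would run. This is an assumption-laden sketch, not the scripts used in the talk: `run_load` and `timed` are hypothetical names, and `operation` stands in for whatever the client does against the server (a clone or fetch, for example).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(operation):
    """Run one operation; return (elapsed seconds, succeeded?)."""
    start = time.perf_counter()
    succeeded = True
    try:
        operation()
    except Exception:
        succeeded = False
    return time.perf_counter() - start, succeeded


def run_load(operation, requests=20, workers=4):
    """Run `operation` concurrently and report client-side latency figures."""
    if requests < 1:
        raise ValueError("requests must be at least 1")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: timed(operation), range(requests)))
    latencies = sorted(seconds for seconds, _ in results)
    failures = sum(1 for _, succeeded in results if not succeeded)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = (95 * len(latencies) + 99) // 100
    return {
        "requests": requests,
        "failures": failures,
        "p95_seconds": latencies[rank - 1],
        "max_seconds": latencies[-1],
    }


if __name__ == "__main__":
    print(run_load(lambda: time.sleep(0.01), requests=20, workers=4))
```

Scaling the ASG up then multiplies the number of such drivers, and their reports can be compared against the server-side metrics.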

Load Testing

Scale up ASG

Server Metrics

Client Metrics

Lessons Learned

RDS Manual Snapshot Limit

• Max 50 snapshots

• Snapshot error -> kept Stash in maintenance mode

• Backup script made more resilient to errors

• Janitor Monkey: Janitor cleans unused AWS infrastructure; new rule to clean old RDS snapshots
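
A cleanup rule for old snapshots might look like the selection logic below. This is a sketch of the idea only, not Janitor Monkey's implementation: the function name `snapshots_to_delete` and the thresholds (14 days, keep at most 40) are assumptions chosen to stay under the 50-snapshot limit. The input is a list of `(identifier, created_at)` pairs, which could be built from the `DBSnapshotIdentifier` and `SnapshotCreateTime` fields that boto3's `describe_db_snapshots` returns.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_delete(snapshots, now, max_age_days=14, keep_at_most=40):
    """Pick snapshot identifiers to remove.

    A snapshot is selected if it is older than `max_age_days`, or if it
    falls outside the `keep_at_most` newest snapshots.
    """
    newest_first = sorted(snapshots, key=lambda item: item[1], reverse=True)
    cutoff = now - timedelta(days=max_age_days)
    selected = []
    for position, (identifier, created_at) in enumerate(newest_first):
        if created_at < cutoff or position >= keep_at_most:
            selected.append(identifier)
    return selected


if __name__ == "__main__":
    now = datetime(2016, 1, 31, tzinfo=timezone.utc)
    snaps = [(f"stash-{day}", now - timedelta(days=day)) for day in range(60)]
    print(len(snapshots_to_delete(snaps, now)), "snapshots selected for deletion")
```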

Volume mount dupe disaster

Context: populate test with prod data (prod on 3.5ish, test on 3.8)

• Snapshot, attach to new instance

• Started test instance (prod configs!)

• Test connected to prod database

• Two Stashes connected to the same database

• Test Stash migrated the prod Stash database

• Tables mismatched with code

• Prod immediately failed hard

Recovery

• Shut down test

• Ad-hoc SQL to stabilize database

• Did not restore database from backups

• Analyze Liquibase code -> roll-back script

• Upgrade: roll back, then roll forward
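
One possible guard against starting a test instance with prod configs is a pre-start check on the database URL. The talk does not describe such a check; this is a hypothetical sketch. It assumes the copied home directory contains a properties file with a `jdbc.url` entry, and the names `read_properties`, `jdbc_host`, and `assert_not_prod`, as well as the host name in the example, are invented for illustration.

```python
from urllib.parse import urlparse


def read_properties(text):
    """Parse simple key=value lines, ignoring blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def jdbc_host(jdbc_url):
    """Extract the host from a URL such as jdbc:postgresql://host:5432/db."""
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError(f"not a JDBC URL: {jdbc_url!r}")
    return urlparse(jdbc_url[len("jdbc:"):]).hostname


def assert_not_prod(properties_text, environment, prod_hosts):
    """Refuse to continue if a non-prod instance points at a prod database."""
    host = jdbc_host(read_properties(properties_text)["jdbc.url"])
    if environment != "prod" and host in prod_hosts:
        raise RuntimeError(f"{environment} instance is configured for prod database {host}")
    return host


if __name__ == "__main__":
    config = "jdbc.url=jdbc:postgresql://test-db.example.internal:5432/stash\n"
    print(assert_not_prod(config, "test", {"prod-db.example.internal"}))
```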

Bitbucket in AWS - Takeaways

Embrace cloud infrastructure

Include monitoring and metrics

Learn from our mistakes

External resources

netflix.github.io

Thank you!

ED BUKOSKI • SENIOR SOFTWARE ENGINEER • NETFLIX • @EBUKOSKI