124
Advanced Topics in Continuous Deployment Mike Brittain Engineering Director, Etsy @mikebrittain mikebrittain.com/talks

Advanced Topics in Continuous Deployment

Embed Size (px)

DESCRIPTION

Like what you've read? We're frequently hiring for a variety of engineering roles at Etsy. If you're interested, drop me a line or send me your resume: [email protected]. http://www.etsy.com/careers

Citation preview

Page 1: Advanced Topics in Continuous Deployment

Advanced Topics in Continuous Deployment

Mike Brittain

Engineering Director, Etsy

@mikebrittain mikebrittain.com/talks

Page 2: Advanced Topics in Continuous Deployment

- Config flags - this one goes to eleven.

Today’s TOPICs

credit: photobookgirl (flickr)

Page 3: Advanced Topics in Continuous Deployment

- Config flags - this one goes to eleven. - Automated deploys - never settle for anything less

Today’s TOPICs

credit: photobookgirl (flickr)

Page 4: Advanced Topics in Continuous Deployment

- Config flags - this one goes to eleven. - Automated deploys - never settle for anything less - Release management - who needs it?!?

Today’s TOPICs

credit: photobookgirl (flickr)

Page 5: Advanced Topics in Continuous Deployment

- Config flags - this one goes to eleven. - Automated deploys - never settle for anything less - Release management - who needs it?!? - Deploying schema changes - ‘cause everybody asks

Today’s TOPICs

credit: photobookgirl (flickr)

Page 6: Advanced Topics in Continuous Deployment

www. .com

Page 7: Advanced Topics in Continuous Deployment

20 Million Items listed 60+ Million Monthly unique visitors 200 Countries with annual transactions !

175+ Committers, everyone deploys

Items by anjaysdesigns, betwixxt, OneStarLeatherGoods, mediumcontrol, TheDesignPallet

Page 8: Advanced Topics in Continuous Deployment

Linux, Apache, MySQL, PHPArchitecture Stack

Memcache, Gearman, Redis, node.js, Postgresql, Solr, Java, Apache Traffic Server, Hadoop, HBase

credit: Saire Elizabeth (flickr)

Git, Jenkins

Chef, Ruby, Python

Page 9: Advanced Topics in Continuous Deployment

@mikebrittain

DEPLOYMENTS PER DAYAPP CODE CONFIG FILES

Page 10: Advanced Topics in Continuous Deployment

1st Day Assignment Put your face on etsy.com/about

Page 11: Advanced Topics in Continuous Deployment

2nd day Complete tax, insurance, and

benefits forms.

credit: ktpupp (flickr)

Page 12: Advanced Topics in Continuous Deployment

I’m not telling you to go do this. (But you can go do this.)

Page 13: Advanced Topics in Continuous Deployment

credit: photobookgirl (flickr)

- Config flags - this one goes to eleven.

Today’s TOPICs

Page 14: Advanced Topics in Continuous Deployment

Code Deploy ≠ Feature Release

(Most deploys are gated by config flags)

Page 15: Advanced Topics in Continuous Deployment

$cfg[‘new_search’] = array('enabled' => 'off'); $cfg[‘sign_in’] = array('enabled' => 'on'); $cfg[‘checkout’] = array('enabled' => 'on'); $cfg[‘homepage’] = array('enabled' => 'on');

Page 16: Advanced Topics in Continuous Deployment

$cfg[‘new_search’] = array('enabled' => 'off'); !

Page 17: Advanced Topics in Continuous Deployment

$cfg[‘new_search’] = array('enabled' => 'off'); !

// Meanwhile... !

!

!

!

!

# old and boring search $results = do_grep();

Page 18: Advanced Topics in Continuous Deployment

$cfg[‘new_search’] = array('enabled' => 'off'); !

// Meanwhile... !

if ($cfg[‘new_search’] == ‘on’) { # New and fancy search $results = do_solr(); } else { # old and boring search $results = do_grep(); }

Page 19: Advanced Topics in Continuous Deployment

$cfg[‘new_search’] = array('enabled' => 'on'); $cfg[‘new_search’] = array('enabled' => 'off');

“Config Flags,” “Feature Flags,” “Feature Toggles”

Page 20: Advanced Topics in Continuous Deployment

$cfg[‘new_search’] = array('enabled' => 'on'); $cfg[‘new_search’] = array('enabled' => 'off'); !

// or... !

$cfg[‘new_search’] = array('enabled' => 'staff'); !

Page 21: Advanced Topics in Continuous Deployment

$cfg[‘new_search’] = array('enabled' => 'on'); $cfg[‘new_search’] = array('enabled' => 'off'); !

// or... !

$cfg[‘new_search’] = array('enabled' => 'staff'); !

// or... !

$cfg[‘new_search’] = array('enabled' => 'users', 'user_list' => 'mike,john,kellan');

etsy.com/prototypes

Page 22: Advanced Topics in Continuous Deployment

$cfg[‘new_search’] = array('enabled' => 'on'); $cfg[‘new_search’] = array('enabled' => 'off'); !

// or... !

$cfg[‘new_search’] = array('enabled' => 'staff'); !

// or... !

$cfg[‘new_search’] = array('enabled' => 'users', 'user_list' => 'mike,john,kellan'); // or... !

$cfg[‘new_search’] = array('enabled' => '1%'); !

Page 23: Advanced Topics in Continuous Deployment

$cfg[‘new_search’] = array('enabled' => 'on'); $cfg[‘new_search’] = array('enabled' => 'off'); !

// or... !

$cfg[‘new_search’] = array('enabled' => 'staff'); !

// or... !

$cfg[‘new_search’] = array('enabled' => 'users', 'user_list' => 'mike,john,kellan'); // or... !

$cfg[‘new_search’] = array('enabled' => '1%'); $cfg[‘new_search’] = array('enabled' => '5%'); !

Page 24: Advanced Topics in Continuous Deployment

$cfg[‘new_search’] = array('enabled' => 'on'); $cfg[‘new_search’] = array('enabled' => 'off'); !

// or... !

$cfg[‘new_search’] = array('enabled' => 'staff'); !

// or... !

$cfg[‘new_search’] = array('enabled' => 'users', 'user_list' => 'mike,john,kellan'); // or... !

$cfg[‘new_search’] = array('enabled' => '1%'); $cfg[‘new_search’] = array('enabled' => '5%'); $cfg[‘new_search’] = array('enabled' => '11%');

“This one goes to eleven.”

Page 25: Advanced Topics in Continuous Deployment

$cfg[‘new_search’] = array('enabled' => 'on'); $cfg[‘new_search’] = array('enabled' => 'off'); !

// or... !

$cfg[‘new_search’] = array('enabled' => 'staff'); !

// or... !

$cfg[‘new_search’] = array('enabled' => 'users', 'user_list' => 'mike,john,kellan'); // or... !

$cfg[‘new_search’] = array('enabled' => '1%'); $cfg[‘new_search’] = array('enabled' => '5%'); $cfg[‘new_search’] = array('enabled' => '11%');

https://github.com/etsy/feature

Page 26: Advanced Topics in Continuous Deployment

Validate in production, hidden from public.Product managers still “release” features.

Page 27: Advanced Topics in Continuous Deployment

Small incremental changes to the application New classes, methods, controllers Graphics, stylesheets, templates Copy/content changes

App deploys

Turning flags on, off, or % ramp up

Config deploys

Page 28: Advanced Topics in Continuous Deployment

http://www.flickr.com/photos/flyforfun/2694158656/

Page 29: Advanced Topics in Continuous Deployment

http://www.flickr.com/photos/flyforfun/2694158656/

OperatorConfig flags

Metrics

Page 30: Advanced Topics in Continuous Deployment

Latent bugs and security holes Traffic management, load shedding Adding and removing infrastructure !

Tweaking config flags or releasing patches.

“Operating” the site

Page 31: Advanced Topics in Continuous Deployment

Favorites “on”

Page 32: Advanced Topics in Continuous Deployment

Favorites “off”

Page 33: Advanced Topics in Continuous Deployment

Favorites “on”

Page 34: Advanced Topics in Continuous Deployment

Promote “dev flags” to “feature flags”

Page 35: Advanced Topics in Continuous Deployment

// Feature flag $cfg[‘mobilized_pages’] = array('enabled' => 'on'); !

// Dev flags $cfg[‘mobile_templates_seller_tools’] = array('enabled' => 'on'); $cfg[‘mobile_templates_account_tools’] = array('enabled' => 'on'); $cfg[‘mobile_templates_member_profile’] = array('enabled' => 'on'); $cfg[‘mobile_templates_search’] = array('enabled' => 'off'); $cfg[‘mobile_templates_activity_feed’] = array('enabled' => 'off'); !

... !

if ($cfg[‘mobilized_pages’] == ‘on’ && $cfg[‘mobile_templates_search’] == ‘on’) { // ... // ... }

Page 36: Advanced Topics in Continuous Deployment

// Feature flags $cfg[‘search’] = array('enabled' => 'on'); $cfg[‘developer_api’] = array('enabled' => 'on'); $cfg[‘seller_tools’] = array('enabled' => 'on'); !

$cfg[‘the_entire_web_site’] = array('enabled' => 'on');

Page 37: Advanced Topics in Continuous Deployment
Page 38: Advanced Topics in Continuous Deployment

// Feature flags $cfg[‘search’] = array('enabled' => 'on'); $cfg[‘developer_api’] = array('enabled' => 'on'); $cfg[‘seller_tools’] = array('enabled' => 'on'); !

$cfg[‘the_entire_web_site’] = array('enabled' => 'on'); $cfg[‘the_entire_web_site_no_really_i_mean_it’] = array('enabled' => 'on');

Page 39: Advanced Topics in Continuous Deployment

Go add one config flag to your codebase.

Page 40: Advanced Topics in Continuous Deployment

credit: photobookgirl (flickr)

- Automated deploys - never settle for anything less

Today’s TOPICs

Page 41: Advanced Topics in Continuous Deployment

Interpreted language, text files.

Opcode cache (Opcache or APC)

~100 servers (web, gearman, api)

Rsync (push, not pull)

Avoid restarts

PHP & Apache

Page 42: Advanced Topics in Continuous Deployment

Lots of remote orchestration (ssh and dsh)

Push code from a git clone to production network.

Splay to a few boxes, each splays to more.

Stage files in a temp location on prod boxes.

Local rsync (using dsh) into live docroot.

Keeping things fast

Page 43: Advanced Topics in Continuous Deployment

100+ files opened per request.

Flushing opcode cache (or graceful restart).

Mostly harmless.

What can go wrong with this?

Page 44: Advanced Topics in Continuous Deployment

“Screwed Users”

Page 45: Advanced Topics in Continuous Deployment

Two document roots (“yin” and “yang”)

Symbolic link to the right one

Opcache has to use path name, not inode

Atomic deploys

http://codeascraft.com/2013/07/01/atomic-deploys-at-etsy/

Page 46: Advanced Topics in Continuous Deployment

Binaries, not text files

Requires restarts

Requires search index and cache warming

Rsync (push, not pull)

Solr and JVM

Page 47: Advanced Topics in Continuous Deployment

Take boxes out of rotation, deploy, bring back up

Beware capacity management

Multiple versions running for extended period

Rollbacks are a pain (esp. when in mixed-state)

Rolling restarts

Page 48: Advanced Topics in Continuous Deployment

“Fraught with danger.”

Page 49: Advanced Topics in Continuous Deployment

One live cluster, one dark cluster

Deploy to dark cluster (indexes, pre-warm, restarts)

Define search clusters in app config

Switch cluster traffic via config deploy

“Flip” and “Flop”

Page 50: Advanced Topics in Continuous Deployment

Start with a shell script. Yours will be a unique snowflake.

Page 51: Advanced Topics in Continuous Deployment

credit: photobookgirl (flickr)

- Release management - who needs it?!?

Today’s TOPICs

Page 52: Advanced Topics in Continuous Deployment

We have a one-button deploy tool, We manage deploys in an IRC channel.

Page 53: Advanced Topics in Continuous Deployment
Page 54: Advanced Topics in Continuous Deployment

#push

Page 55: Advanced Topics in Continuous Deployment

Thank you.

Mike Brittain

Engineering Director, Etsy

@mikebrittain mikebrittain.com/talks

Page 56: Advanced Topics in Continuous Deployment

We have a one-button deploy tool, We manage deploys in an IRC channel.

I mean, seriously… what else do you want to know?!?

Page 57: Advanced Topics in Continuous Deployment

Keep real people in the loop

Queue, with max batch size of seven.

Automated deployment run by humans

Page 58: Advanced Topics in Continuous Deployment

Pssst… Want to see how it works?

Page 59: Advanced Topics in Continuous Deployment

4 people in this deploy.

“I’ve pushed my changes to master.”

“Everyone has checked in.”

Page 60: Advanced Topics in Continuous Deployment

Build QA and Pre-prod

Build progress

Status in #push

Git SHA1 in for each env.

Date, username, deploy log, changeset, link to dashboard from time of deploy

Page 61: Advanced Topics in Continuous Deployment
Page 62: Advanced Topics in Continuous Deployment

Reporting what’s going on in Deployinator, and who triggered

Status from build cluster

Page 63: Advanced Topics in Continuous Deployment

Pre-prod (“princess”) has been deployed. !

SHA1 of the change Time it took to deploy Link to changeset in GitHub Log of the deploy script

Page 64: Advanced Topics in Continuous Deployment

Btw, there are three bots talking in channel at this point. O_o

Page 65: Advanced Topics in Continuous Deployment

Queuing for next deploy

Humans talk to other humans from time to time.

Page 66: Advanced Topics in Continuous Deployment

Talking to pushbot. !

Pushbot knows some Spanish… because, ya know, why not?

Page 67: Advanced Topics in Continuous Deployment
Page 68: Advanced Topics in Continuous Deployment

Link to test results for CI environment, along with how long the tests took.Alerting by name.

Page 69: Advanced Topics in Continuous Deployment

8 minutes have elapsed… We’ve built and tested our release in the CI environment (“QA”). !

QA build failed our 5 min. SLA for tests.

Page 70: Advanced Topics in Continuous Deployment

“Try” is our pre-commit testing cluster.

Page 71: Advanced Topics in Continuous Deployment

Bots help reinforce our values. This is especially helpful for new people on the team.

Page 72: Advanced Topics in Continuous Deployment
Page 73: Advanced Topics in Continuous Deployment

Still 8 minutes elapsed… Pre-prod has been deployed and tested. !

This ran in parallel with our QA build and tests.

Page 74: Advanced Topics in Continuous Deployment
Page 75: Advanced Topics in Continuous Deployment

Cross-traffic: In a separate channel (#config), our app configs files were deployed to pre-prod.

Page 76: Advanced Topics in Continuous Deployment
Page 77: Advanced Topics in Continuous Deployment
Page 78: Advanced Topics in Continuous Deployment
Page 79: Advanced Topics in Continuous Deployment
Page 80: Advanced Topics in Continuous Deployment

Cross-traffic: Ops team deployed a configuration change.

Page 81: Advanced Topics in Continuous Deployment
Page 82: Advanced Topics in Continuous Deployment

Code is live Link to dashboard.

Page 83: Advanced Topics in Continuous Deployment
Page 84: Advanced Topics in Continuous Deployment

13 minutes elapsed… Code is now in production with public traffic.

Page 85: Advanced Topics in Continuous Deployment

Who committed code in the last deploy? And how many lines did each of them change?

Page 86: Advanced Topics in Continuous Deployment
Page 87: Advanced Topics in Continuous Deployment
Page 88: Advanced Topics in Continuous Deployment

Handoff for the next deploy.

Page 89: Advanced Topics in Continuous Deployment

Entire app deploy took 15 minutes. !

4 people running the deployment 8 committers Config deploy and Chef change deployed in parallel.

Page 90: Advanced Topics in Continuous Deployment

Optimal queue size

Normalized communication

Improved visibility

Historical record is ideal for post-mortems

Organic evolution

Page 91: Advanced Topics in Continuous Deployment

Hold up the queue (.hold)

Work the issue with the people available in #push

Additional help always available in #sysops

Buddy-system for off-hours deploys

Ops-on-call, dev-on-call

When something goes wrong?

Page 92: Advanced Topics in Continuous Deployment

You won’t need an IRC channel.

Page 93: Advanced Topics in Continuous Deployment

credit: photobookgirl (flickr)

- Deploying schema changes…. with code?

Today’s TOPICs

Page 94: Advanced Topics in Continuous Deployment

credit: photobookgirl (flickr)

- Deploying schema changes…. with code?

Today’s TOPICs

(a.k.a Don’t do this!)

Page 95: Advanced Topics in Continuous Deployment

~15-20 minutes THURSDAYS!

Code deploys Schema changes

Page 96: Advanced Topics in Continuous Deployment

credit: photobookgirl (flickr)

- Deploying schema changes

Today’s TOPICs

- Managing versions across services

Page 97: Advanced Topics in Continuous Deployment

Our web application is largely monolithic.

Etsy.com, Support & Back-office tools, Developer API, Gearman (async work)

Page 98: Advanced Topics in Continuous Deployment

Etsy.com, Support & Back-office tools, Developer API, Gearman (async work)

PHP, Apache, Memcache

Our web application is largely monolithic.

Page 99: Advanced Topics in Continuous Deployment

External “services” are not deployed with the main application.

e.g. Databases, Search, Photo storage, Payments

Page 100: Advanced Topics in Continuous Deployment

e.g. Databases, Search, Photo storage, Payments

MYSQL (schema changes)

SOLR, JVM (rolling restarts)

PROXY CACHE, FILERS, AMAZON S3

(specialized infra.)

PCI (controlled access)

External “services” are not deployed with the main application.

Page 101: Advanced Topics in Continuous Deployment

For every config flag, there are two states we can support — present and future.

Page 102: Advanced Topics in Continuous Deployment

... or past and present.

For every config flag, there are two states we can support — present and future.

Page 103: Advanced Topics in Continuous Deployment

$cfg[‘new_search’] = array('enabled' => 'off'); !

// Meanwhile... !

if ($cfg[‘new_search’] == ‘on’) { # New and fancy search $results = do_solr(); } else { # old and boring search $results = do_grep(); }

Page 104: Advanced Topics in Continuous Deployment

“Non-Breaking Expansions”Expose new version in a service interface;

support multiple versions in the consumer.

Page 105: Advanced Topics in Continuous Deployment

Example: Changing a Database SchemaMerging “users” and “users_prefs”

Page 106: Advanced Topics in Continuous Deployment

RULE OF THUMB Prefer ADDs over ALTERs (non-breaking expansion)C

Page 107: Advanced Topics in Continuous Deployment

!

1. Write to both versions 2. Backfill historical data 3. Read from new version 4. Cut-off writes to old version

Page 108: Advanced Topics in Continuous Deployment

0. Add new version to schema 1. Write to both versions 2. Backfill historical data 3. Read from new version 4. Cut-off writes to old version

Page 109: Advanced Topics in Continuous Deployment

0. Add new version to schemaSchema change to add prefs columns to “users” table. !

“write_prefs_to_user_prefs_table” => “on” “write_prefs_to_users_table” => “off” “read_prefs_from_users_table” => “off”

Page 110: Advanced Topics in Continuous Deployment

1. Write to both versionsWrite code for writing prefs to the “users” table. !

“write_prefs_to_user_prefs_table” => “on” “write_prefs_to_users_table” => “on” “read_prefs_from_users_table” => “off”

Page 111: Advanced Topics in Continuous Deployment

2. Backfill historical dataOffline process to sync existing data from “user_prefs” to new columns in “users”

Page 112: Advanced Topics in Continuous Deployment

3. Read from new versionData validation tests. Ensure consistency both internally and in production. !

“write_prefs_to_user_prefs_table” => “on” “write_prefs_to_users_table” => “on” “read_prefs_from_users_table” => “staff”

!

Page 113: Advanced Topics in Continuous Deployment

3. Read from new versionData validation tests. Ensure consistency both internally and in production. !

“write_prefs_to_user_prefs_table” => “on” “write_prefs_to_users_table” => “on” “read_prefs_from_users_table” => “1%”

!

Page 114: Advanced Topics in Continuous Deployment

3. Read from new versionData validation tests. Ensure consistency both internally and in production. !

“write_prefs_to_user_prefs_table” => “on” “write_prefs_to_users_table” => “on” “read_prefs_from_users_table” => “5%”

!

Page 115: Advanced Topics in Continuous Deployment

3. Read from new versionData validation tests. Ensure consistency both internally and in production. !

“write_prefs_to_user_prefs_table” => “on” “write_prefs_to_users_table” => “on” “read_prefs_from_users_table” => “11%”

!

“This one goes to eleven.”

Page 116: Advanced Topics in Continuous Deployment

3. Read from new versionData validation tests. Ensure consistency both internally and in production. !

“write_prefs_to_user_prefs_table” => “on” “write_prefs_to_users_table” => “on” “read_prefs_from_users_table” => “on” // same as 100% !

!!

!

Page 117: Advanced Topics in Continuous Deployment

4. Cut-off writes to old versionAfter running on the new table for a significant amount of time, we can cut off writes to the old table. !

“write_prefs_to_user_prefs_table” => “off” “write_prefs_to_users_table” => “on” “read_prefs_from_users_table” => “on”

!

Page 118: Advanced Topics in Continuous Deployment

“Branch by Astraction”

Controller Controller

Users Model

“users” (old) “user_prefs” “users”

old schema new schema

(Abstraction)

http://paulhammant.com/blog/branch_by_abstraction.html http://continuousdelivery.com/2011/05/make-large-scale-changes-incrementally-with-branch-by-abstraction/

Page 119: Advanced Topics in Continuous Deployment

Avoid temptation of putting logic into DB

Async worker queue (Gearman)

Get good at alerting on data inconsistencies

Easier to scale out app servers that DBs

Shards limit complexity

About our database design…

Page 120: Advanced Topics in Continuous Deployment

No longer valid for the business

No longer stable, valid, or trusted code

Impacting performance or readability

We can afford to spend time

Clean up old config flags?

Page 121: Advanced Topics in Continuous Deployment

Decouple schema changes from app code. Aim for simplicity.

Page 122: Advanced Topics in Continuous Deployment

Start small. (We did.)

Automated tests and production monitoring.

Have a story around maintaining quality.

“We can always go back to the old way.”

Demonstrate value to leadership.

Page 123: Advanced Topics in Continuous Deployment

Go write your own story.

Page 124: Advanced Topics in Continuous Deployment

Thank you.

Mike Brittain

Engineering Director, Etsy

@mikebrittain mikebrittain.com/talks