Advanced Topics in Continuous Deployment
Mike Brittain
Engineering Director, Etsy
@mikebrittain mikebrittain.com/talks
Today’s Topics
- Config flags - this one goes to eleven.
- Automated deploys - never settle for anything less
- Release management - who needs it?!?
- Deploying schema changes - ’cause everybody asks
credit: photobookgirl (flickr)
www. .com
20 Million items listed
60+ Million monthly unique visitors
200 Countries with annual transactions
175+ Committers, everyone deploys
Items by anjaysdesigns, betwixxt, OneStarLeatherGoods, mediumcontrol, TheDesignPallet
Architecture Stack: Linux, Apache, MySQL, PHP
Memcache, Gearman, Redis, node.js, Postgresql, Solr, Java, Apache Traffic Server, Hadoop, HBase
credit: Saire Elizabeth (flickr)
Git, Jenkins
Chef, Ruby, Python
@mikebrittain
Deployments per day (chart): app code, config files
1st day assignment: Put your face on etsy.com/about
2nd day: Complete tax, insurance, and benefits forms.
credit: ktpupp (flickr)
I’m not telling you to go do this. (But you can go do this.)
credit: photobookgirl (flickr)
Today’s Topics
- Config flags - this one goes to eleven.
Code Deploy ≠ Feature Release
(Most deploys are gated by config flags)
$cfg['new_search'] = array('enabled' => 'off');
$cfg['sign_in'] = array('enabled' => 'on');
$cfg['checkout'] = array('enabled' => 'on');
$cfg['homepage'] = array('enabled' => 'on');
$cfg['new_search'] = array('enabled' => 'off');

// Meanwhile...

if ($cfg['new_search']['enabled'] == 'on') {
    # New and fancy search
    $results = do_solr();
} else {
    # Old and boring search
    $results = do_grep();
}
$cfg['new_search'] = array('enabled' => 'on');
$cfg['new_search'] = array('enabled' => 'off');
“Config Flags,” “Feature Flags,” “Feature Toggles”
// or...
$cfg['new_search'] = array('enabled' => 'staff');

// or...
$cfg['new_search'] = array('enabled' => 'users', 'user_list' => 'mike,john,kellan');

etsy.com/prototypes

// or...
$cfg['new_search'] = array('enabled' => '1%');
$cfg['new_search'] = array('enabled' => '5%');
$cfg['new_search'] = array('enabled' => '11%');

“This one goes to eleven.”
https://github.com/etsy/feature
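The flag values shown above (on, off, staff, a user list, a percentage) could be evaluated by one small function. What follows is a minimal sketch under stated assumptions, not the API of the etsy/feature library: the function name `feature_enabled`, the `is_staff` and `name` fields on the user, and the md5-based bucketing are all hypothetical choices made for illustration.

```php
<?php
// Hypothetical helper (not the etsy/feature API): decide whether a flag
// is enabled for a given user, based on the 'enabled' setting in $cfg.
function feature_enabled(array $cfg, string $name, array $user): bool
{
    // Unknown or malformed flags are treated as off.
    if (!isset($cfg[$name]['enabled'])) {
        return false;
    }
    $setting = $cfg[$name]['enabled'];
    $userName = isset($user['name']) ? $user['name'] : '';

    if ($setting === 'on') {
        return true;
    }
    if ($setting === 'off') {
        return false;
    }
    if ($setting === 'staff') {
        // Assumption: the user record carries an is_staff field.
        return !empty($user['is_staff']);
    }
    if ($setting === 'users') {
        // user_list is a comma-separated string, as on the slide.
        $list = isset($cfg[$name]['user_list']) ? $cfg[$name]['user_list'] : '';
        $allowed = array_map('trim', explode(',', $list));
        return in_array($userName, $allowed, true);
    }
    if (preg_match('/^(\d+)%$/', $setting, $m)) {
        // Hash flag name + user name into a stable bucket from 0 to 99,
        // so the same user keeps the same answer as the percentage ramps up.
        $bucket = hexdec(substr(md5($name . ':' . $userName), 0, 6)) % 100;
        return $bucket < (int) $m[1];
    }
    return false;
}

// Usage: gate the search code path on the flag.
$cfg['new_search'] = array('enabled' => 'users', 'user_list' => 'mike,john,kellan');
$user = array('name' => 'mike', 'is_staff' => false);
$backend = feature_enabled($cfg, 'new_search', $user) ? 'solr' : 'grep';
```

Bucketing on a hash of the flag name and user means that ramping from 1% to 5% to 11% only ever adds users; nobody who already saw the feature loses it mid-ramp.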
Validate in production, hidden from public.
Product managers still “release” features.
App deploys
- Small incremental changes to the application
- New classes, methods, controllers
- Graphics, stylesheets, templates
- Copy/content changes
Config deploys
- Turning flags on, off, or % ramp up
http://www.flickr.com/photos/flyforfun/2694158656/
Operator
Config flags
Metrics
“Operating” the site
- Latent bugs and security holes
- Traffic management, load shedding
- Adding and removing infrastructure
Tweaking config flags or releasing patches.
Favorites “on”
Favorites “off”
Favorites “on”
Promote “dev flags” to “feature flags”
// Feature flag
$cfg['mobilized_pages'] = array('enabled' => 'on');

// Dev flags
$cfg['mobile_templates_seller_tools'] = array('enabled' => 'on');
$cfg['mobile_templates_account_tools'] = array('enabled' => 'on');
$cfg['mobile_templates_member_profile'] = array('enabled' => 'on');
$cfg['mobile_templates_search'] = array('enabled' => 'off');
$cfg['mobile_templates_activity_feed'] = array('enabled' => 'off');

...

if ($cfg['mobilized_pages']['enabled'] == 'on' &&
    $cfg['mobile_templates_search']['enabled'] == 'on') {
    // ...
    // ...
}
// Feature flags
$cfg['search'] = array('enabled' => 'on');
$cfg['developer_api'] = array('enabled' => 'on');
$cfg['seller_tools'] = array('enabled' => 'on');

$cfg['the_entire_web_site'] = array('enabled' => 'on');
$cfg['the_entire_web_site_no_really_i_mean_it'] = array('enabled' => 'on');
Go add one config flag to your codebase.
credit: photobookgirl (flickr)
Today’s Topics
- Automated deploys - never settle for anything less
PHP & Apache
- Interpreted language, text files.
- Opcode cache (Opcache or APC)
- ~100 servers (web, gearman, api)
- Rsync (push, not pull)
- Avoid restarts
Keeping things fast
- Lots of remote orchestration (ssh and dsh)
- Push code from a git clone to production network.
- Splay to a few boxes, each splays to more.
- Stage files in a temp location on prod boxes.
- Local rsync (using dsh) into live docroot.
What can go wrong with this?
- 100+ files opened per request.
- Flushing opcode cache (or graceful restart).
- Mostly harmless.
“Screwed Users”
Atomic deploys
- Two document roots (“yin” and “yang”)
- Symbolic link to the right one
- Opcache has to use path name, not inode
http://codeascraft.com/2013/07/01/atomic-deploys-at-etsy/
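The two-docroot idea can be illustrated with a symlink that is swapped in a single rename. This is only a sketch of the general technique, assuming a POSIX filesystem; the function names and paths are hypothetical and it is not a description of Etsy's actual tooling. The opcode cache caveat from the slide still applies and is not handled here.

```php
<?php
// Hypothetical sketch: "current" is a symlink pointing at one of two docroots.

// Return the docroot that is NOT live, i.e. the one to deploy new code into.
function idle_docroot(string $link, string $yin, string $yang): string
{
    $live = is_link($link) ? readlink($link) : null;
    return ($live === $yin) ? $yang : $yin;
}

// Point the symlink at $target. The new link is created under a temporary
// name and renamed over the old one, so readers never see a missing link.
function activate_docroot(string $link, string $target): void
{
    $tmp = $link . '.tmp';
    if (is_link($tmp)) {
        unlink($tmp); // leftover from an interrupted earlier run
    }
    if (!symlink($target, $tmp) || !rename($tmp, $link)) {
        throw new RuntimeException("could not point $link at $target");
    }
}
```

A deploy would then copy the new code into `idle_docroot(...)`, and only call `activate_docroot(...)` once the copy has finished.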
Solr and JVM
- Binaries, not text files
- Requires restarts
- Requires search index and cache warming
- Rsync (push, not pull)
Rolling restarts
- Take boxes out of rotation, deploy, bring back up
- Beware capacity management
- Multiple versions running for extended period
- Rollbacks are a pain (esp. when in mixed-state)
“Fraught with danger.”
“Flip” and “Flop”
- One live cluster, one dark cluster
- Deploy to dark cluster (indexes, pre-warm, restarts)
- Define search clusters in app config
- Switch cluster traffic via config deploy
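Defining both clusters in app config and naming one of them as live might look like the sketch below. The config keys, host names, and the function `live_search_cluster` are assumptions made for illustration; the talk does not show the real config layout.

```php
<?php
// Hypothetical lookup: return connection details for whichever search
// cluster the config currently marks as live.
function live_search_cluster(array $cfg): array
{
    $name = $cfg['search']['live_cluster'];
    if (!isset($cfg['search']['clusters'][$name])) {
        throw new InvalidArgumentException("unknown search cluster: $name");
    }
    return $cfg['search']['clusters'][$name];
}

// Both clusters are always defined; only 'live_cluster' changes.
$cfg['search'] = array(
    'live_cluster' => 'flip',
    'clusters' => array(
        'flip' => array('host' => 'search-flip.example.com', 'port' => 8983),
        'flop' => array('host' => 'search-flop.example.com', 'port' => 8983),
    ),
);

$cluster = live_search_cluster($cfg);
```

Switching traffic is then a one-line config deploy that changes `live_cluster` to `'flop'`, and going back is the same change in reverse.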
Start with a shell script. Yours will be a unique snowflake.
credit: photobookgirl (flickr)
Today’s Topics
- Release management - who needs it?!?
We have a one-button deploy tool. We manage deploys in an IRC channel.
#push
Thank you.
Mike Brittain
Engineering Director, Etsy
@mikebrittain mikebrittain.com/talks
We have a one-button deploy tool. We manage deploys in an IRC channel.
I mean, seriously… what else do you want to know?!?
Automated deployment run by humans
- Keep real people in the loop
- Queue, with max batch size of seven.
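A queue with a maximum batch size of seven can be modeled in a few lines. The class below is a hypothetical sketch of that rule only; it is not how the push bot in the IRC channel is actually implemented, and the class and method names are invented for illustration.

```php
<?php
// Hypothetical push queue: people join, and the batch at the head of the
// queue deploys together. A batch holds at most MAX_BATCH people.
class PushQueue
{
    const MAX_BATCH = 7;

    /** @var array[] list of batches, each a list of usernames */
    private $batches = array();

    // Add a user to the last batch if it has room, else start a new batch.
    // Returns the zero-based position of the batch the user landed in.
    public function join(string $user): int
    {
        $last = count($this->batches) - 1;
        if ($last < 0 || count($this->batches[$last]) >= self::MAX_BATCH) {
            $this->batches[] = array();
            $last++;
        }
        $this->batches[$last][] = $user;
        return $last;
    }

    // Remove and return the batch at the head of the queue
    // (an empty array when nobody is waiting).
    public function next(): array
    {
        $batch = array_shift($this->batches);
        return $batch === null ? array() : $batch;
    }
}
```

Capping the batch keeps each deploy small enough that the people in it can reason about which change caused a problem.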
Pssst… Want to see how it works?
4 people in this deploy.
“I’ve pushed my changes to master.”
“Everyone has checked in.”
Build QA and Pre-prod
Build progress
Status in #push
Git SHA1 for each env.
Date, username, deploy log, changeset, link to dashboard from time of deploy
Reporting what’s going on in Deployinator, and who triggered it
Status from build cluster
Pre-prod (“princess”) has been deployed.
- SHA1 of the change
- Time it took to deploy
- Link to changeset in GitHub
- Log of the deploy script
Btw, there are three bots talking in channel at this point. O_o
Queuing for next deploy
Humans talk to other humans from time to time.
Talking to pushbot.
Pushbot knows some Spanish… because, ya know, why not?
Link to test results for CI environment, along with how long the tests took.
Alerting by name.
8 minutes have elapsed… We’ve built and tested our release in the CI environment (“QA”).
QA build failed our 5 min. SLA for tests.
“Try” is our pre-commit testing cluster.
Bots help reinforce our values. This is especially helpful for new people on the team.
Still 8 minutes elapsed… Pre-prod has been deployed and tested.
This ran in parallel with our QA build and tests.
Cross-traffic: In a separate channel (#config), our app config files were deployed to pre-prod.
Cross-traffic: Ops team deployed a configuration change.
Code is live. Link to dashboard.
13 minutes elapsed… Code is now in production with public traffic.
Who committed code in the last deploy? And how many lines did each of them change?
Handoff for the next deploy.
Entire app deploy took 15 minutes.
- 4 people running the deployment
- 8 committers
- Config deploy and Chef change deployed in parallel.
Optimal queue size
Normalized communication
Improved visibility
Historical record is ideal for post-mortems
Organic evolution
When something goes wrong?
- Hold up the queue (.hold)
- Work the issue with the people available in #push
- Additional help always available in #sysops
- Buddy-system for off-hours deploys
- Ops-on-call, dev-on-call
You won’t need an IRC channel.
credit: photobookgirl (flickr)
Today’s Topics
- Deploying schema changes… with code?
(a.k.a Don’t do this!)
Code deploys: ~15-20 minutes
Schema changes: THURSDAYS!
credit: photobookgirl (flickr)
Today’s Topics
- Deploying schema changes
- Managing versions across services
Our web application is largely monolithic.
Etsy.com, Support & Back-office tools, Developer API, Gearman (async work)
PHP, Apache, Memcache
External “services” are not deployed with the main application.
e.g. Databases, Search, Photo storage, Payments
- Databases: MySQL (schema changes)
- Search: Solr, JVM (rolling restarts)
- Photo storage: proxy cache, filers, Amazon S3 (specialized infra.)
- Payments: PCI (controlled access)
For every config flag, there are two states we can support — present and future.
... or past and present.
$cfg['new_search'] = array('enabled' => 'off');

// Meanwhile...

if ($cfg['new_search']['enabled'] == 'on') {
    # New and fancy search
    $results = do_solr();
} else {
    # Old and boring search
    $results = do_grep();
}
“Non-Breaking Expansions”
Expose new version in a service interface; support multiple versions in the consumer.
Example: Changing a Database Schema
Merging “users” and “user_prefs”
Rule of thumb: Prefer ADDs over ALTERs (non-breaking expansion)
0. Add new version to schema
1. Write to both versions
2. Backfill historical data
3. Read from new version
4. Cut-off writes to old version
0. Add new version to schema
Schema change to add prefs columns to “users” table.
'write_prefs_to_user_prefs_table' => 'on'
'write_prefs_to_users_table' => 'off'
'read_prefs_from_users_table' => 'off'
1. Write to both versions
Write code for writing prefs to the “users” table.
'write_prefs_to_user_prefs_table' => 'on'
'write_prefs_to_users_table' => 'on'
'read_prefs_from_users_table' => 'off'
2. Backfill historical data
Offline process to sync existing data from “user_prefs” to new columns in “users”
3. Read from new version
Data validation tests. Ensure consistency both internally and in production.
'write_prefs_to_user_prefs_table' => 'on'
'write_prefs_to_users_table' => 'on'
'read_prefs_from_users_table' => 'staff'
// then ramp the read flag:
'read_prefs_from_users_table' => '1%'
'read_prefs_from_users_table' => '5%'
'read_prefs_from_users_table' => '11%'
“This one goes to eleven.”
'read_prefs_from_users_table' => 'on' // same as 100%
4. Cut-off writes to old version
After running on the new table for a significant amount of time, we can cut off writes to the old table.
'write_prefs_to_user_prefs_table' => 'off'
'write_prefs_to_users_table' => 'on'
'read_prefs_from_users_table' => 'on'
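Steps 1 through 4 can be walked through in code by gating each read and write on the three flags. The class below is a sketch under heavy assumptions: plain PHP arrays stand in for the “users” and “user_prefs” tables, the flags only take 'on' or 'off' (no staff or percentage ramp), and the class and method names are hypothetical.

```php
<?php
// In-memory stand-ins for the two tables; a real version would issue SQL.
class PrefsStore
{
    public $users = array();       // user_id => array('prefs' => ...), new home
    public $user_prefs = array();  // user_id => prefs, old home
    private $flags;

    public function __construct(array $flags)
    {
        $this->flags = $flags;
    }

    public function setFlag(string $name, string $value): void
    {
        $this->flags[$name] = $value;
    }

    private function on(string $name): bool
    {
        return isset($this->flags[$name]) && $this->flags[$name] === 'on';
    }

    // Step 1: write to whichever versions are switched on.
    public function savePrefs(int $userId, array $prefs): void
    {
        if ($this->on('write_prefs_to_user_prefs_table')) {
            $this->user_prefs[$userId] = $prefs;
        }
        if ($this->on('write_prefs_to_users_table')) {
            $this->users[$userId]['prefs'] = $prefs;
        }
    }

    // Step 2: copy rows written before dual writes began.
    // Returns the number of rows copied.
    public function backfill(): int
    {
        $copied = 0;
        foreach ($this->user_prefs as $userId => $prefs) {
            if (!isset($this->users[$userId]['prefs'])) {
                $this->users[$userId]['prefs'] = $prefs;
                $copied++;
            }
        }
        return $copied;
    }

    // Step 3: read from the new version only when the read flag is on.
    public function loadPrefs(int $userId): ?array
    {
        if ($this->on('read_prefs_from_users_table')) {
            return $this->users[$userId]['prefs'] ?? null;
        }
        return $this->user_prefs[$userId] ?? null;
    }
}
```

Because every step is a flag change, each one can be reverted with a config deploy: turning the read flag back off returns reads to the old table, which is still being written until step 4.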
“Branch by Abstraction”
Diagram: two Controllers call the Users Model (the abstraction), which maps to either the old schema (“users” (old) and “user_prefs”) or the new schema (“users”).
http://paulhammant.com/blog/branch_by_abstraction.html
http://continuousdelivery.com/2011/05/make-large-scale-changes-incrementally-with-branch-by-abstraction/
About our database design…
- Avoid temptation of putting logic into DB
- Async worker queue (Gearman)
- Get good at alerting on data inconsistencies
- Easier to scale out app servers than DBs
- Shards limit complexity
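Alerting on data inconsistencies needs some way to detect them first. As a sketch, a check between the old and new copies of the prefs data could compare rows by user ID; the function name and the array-based input format are assumptions, and a production check would read from the database in batches.

```php
<?php
// Hypothetical consistency check: return the user IDs whose prefs in the
// old copy are missing from, or differ from, the new copy.
function find_pref_mismatches(array $oldRows, array $newRows): array
{
    $mismatched = array();
    foreach ($oldRows as $userId => $prefs) {
        // Loose array comparison: same keys and values, in any order.
        if (!array_key_exists($userId, $newRows) || $newRows[$userId] != $prefs) {
            $mismatched[] = $userId;
        }
    }
    return $mismatched;
}
```

An empty result means the two copies agree for every user in the old table; a non-empty result is what an alert or a dashboard metric would be driven from.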
Clean up old config flags?
- No longer valid for the business
- No longer stable, valid, or trusted code
- Impacting performance or readability
- We can afford to spend time
Decouple schema changes from app code. Aim for simplicity.
Start small. (We did.)
Automated tests and production monitoring.
Have a story around maintaining quality.
“We can always go back to the old way.”
Demonstrate value to leadership.
Go write your own story.
Thank you.
Mike Brittain
Engineering Director, Etsy
@mikebrittain mikebrittain.com/talks