Herding cats in the Cloud

HERDING CATS IN THE CLOUD: MAINTAINING OPERATIONAL SANITY IN A CLOUDY, DEVOPS WORLD

Dewey Sasser

Consulting Cloud Architect

Aligned Software

ABOUT THIS TALK

Public clouds can give developers unprecedented levels of power

“With Great Power Comes Great Responsibility”

You must structure your development and production deployment process to use this power well

How do we do this? Experience from a large deployment

ABOUT DEWEY

Distributed Application Developer for 20 years

Doing build/release/software process for about that long

Accidentally doing devops out of self-defense

Wandered into operations about 5 years ago

Built some private cloud for dev

Built some private cloud for prod

Started architecting using public cloud for everything

ABOUT THE COMPANY

Company Policy: don’t talk for the company

Therefore, these slides don't mention The Company.

There is no information here that is not otherwise publicly available.

Whoever it is, I don't speak for them

Major Gaming Company, multiple AAA titles

History in MMOs

All in on mobile now

ROADMAP

What we Did

What we're Doing

What we might do Next

WE'RE COMING FROM...

Traditionally MMOs in colo

Windows (ugh!) based servers

All in cloud now: mobile, cloud, Docker, MongoDB, Phoenix Servers, Chaos Monkey, (...other popular buzzwords)

GOALS

100% uptime: players want to play

No more: Patch days, "Down for maintenance"

Profit ( = revenue – cost)

SCALE

$100ks of monthly spend

Many hundreds of instances

Around 500TB of monthly transfer

Peak to 12k tps (for a single title)

Around 1 PB of storage

Approximately 5 billion I/Os monthly

USAGE/LOAD PATTERN

Traditional SaaS assumes starting small and scaling. Scaling quickly is a problem, but a good problem.

Games are weird

Peak usage is release day, it tails off after that

You must be able to scale out of the gate. Users who cannot use it the first day will often never be back!

PLATFORMS

Swarm pattern

Pods of services

Python/NGINX

Batch Processing pattern

Vertica

Elastic Map/Reduce

Work Queue (Kafka)

NoSQL (MongoDB – ugh!)

Gaming Platform

CoreOS/Docker

Strong Phoenix Server pattern

PROCESS/SOCIAL APPROACH

Must be (people) scalable

Working on 3 new games at any one time

Still supporting old games

Supporting services for the larger company

Don't create a bottleneck

“I'm waiting for a VM”. Bad process. No biscuit.

There are too many controls to get least privilege right!

Validation, not prevention (WHAT???)

POLICIES ARE GREAT, BUT...

They change over time

Are hard to get exactly right up front

Always have exceptions

The space of AWS permissions is HUGE. Permutations are deadly.

So...measure what you care about.

What you care about will change over time.

Trust...and verify

POWER TO THE PEOPLE (OR DEVELOPERS)

Don't gate productivity on fine points of arbitrary policies

Keep responsibility with dev team

domain expertise

put the pain where the control is

Stuff gets automated!!!

APPROACH

Cloud Environment

Multiple accounts (~ 2 dozen right now)

1 central services account

1 account per title

All environments in different VPCs (Dev, QA, Perf, Staging, Prod)

DEV TEAMS RESPONSIBLE FOR...

Developing, validating, deploying and running their games

Responding to production issues

PRODUCTION cost control

CENTRAL "CLOUD SERVICES" TEAM

“Owns”

Metrics, Monitoring, Alerting

Enables use of central services & good practices

Composable components used by the teams

Native packaging -- make it easy

Manages good practices

Their job is to be cloud experts

But they're not the only ones in the company

LOTS of conversation!

Automates everything non-project specific

New account creation, ...

OWNERSHIP/RESPONSIBILITY

Clearly align authority and responsibility.

If a Dev is getting up in the middle of the night to fix something, they have to have full power to fix it.

On a related note, that means the teams get approval control over a great deal

GREAT, HOW?

CRITICAL TOOL: RULES & WORKFLOWS

Custom developed rules/workflow system

Rules are small, stateless snippets of Python code that trigger workflows

Rules can be company public and extensible by pull request

Workflows are potentially long-running, stateful operations that trigger a list of changes.

Workflows can also be company public, but with tighter controls around changes.

Changes can be reviewed manually or automatically.
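The rule/workflow/change split described above could look roughly like the sketch below. This is not the custom system from the talk: every name in it (`Change`, `Workflow`, `untagged_instance_rule`, `evaluate`, `auto_review`, the `warn_then_stop` workflow and its two steps) is hypothetical, and resources are plain dicts rather than real cloud API objects.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Change:
    """One proposed change; nothing is applied until it is approved."""
    resource_id: str
    action: str
    approved: bool = False


# A rule is stateless: it looks at one resource and names a workflow, or None.
Rule = Callable[[dict], Optional[str]]


def untagged_instance_rule(resource: dict) -> Optional[str]:
    """Hypothetical rule: fire when an instance has no Owner tag."""
    if resource.get("type") == "instance" and "Owner" not in resource.get("tags", {}):
        return "warn_then_stop"
    return None


@dataclass
class Workflow:
    """Stateful and potentially long running: remembers which step comes next."""
    name: str
    resource_id: str
    step: int = 0
    steps: tuple = ("notify_owner", "stop_instance")  # assumed steps

    def advance(self) -> Optional[Change]:
        """Emit the next proposed change, or None when the workflow is done."""
        if self.step >= len(self.steps):
            return None
        change = Change(self.resource_id, self.steps[self.step])
        self.step += 1
        return change


def evaluate(rules: list, resources: list) -> list:
    """Run every rule over every resource and start the workflows they name."""
    workflows = []
    for resource in resources:
        for rule in rules:
            name = rule(resource)
            if name is not None:
                workflows.append(Workflow(name, resource["id"]))
    return workflows


def auto_review(change: Change) -> Change:
    """Automatic review approves only notifications; the rest wait for a human."""
    return Change(change.resource_id, change.action,
                  approved=(change.action == "notify_owner"))
```

Keeping rules as pure functions is what makes them safe to accept by pull request; the state, and therefore the risk, lives in the workflow and in the review step.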

CRITICAL TOOL: RULES & WORKFLOWS

Runtime is HIGHLY privileged – keep it tight!

This tool can destroy the world – but it actually keeps it running.

(you have everything automated to recreate the world, right?)

USER ACCESS CONTROL

Automate user management/creation from source in Git

Define membership rules as intersection of desired group and account characteristics (MFA anyone?)

Rules/Workflow enforces MFA. Central team doesn't have to.

Remove your MFA, get demoted to “User”
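A minimal sketch of that membership rule, assuming the desired level per user comes from a file tracked in Git and the MFA status comes from the account. The function names and the dict shapes are hypothetical; the level names are the ones used in this deck.

```python
# Access levels named in this deck; "User" is the level you fall back to.
ACCESS_LEVELS = frozenset(
    {"User", "ReadOnly", "Finance", "Developer", "DevOps", "FullAdmin"}
)
FALLBACK_LEVEL = "User"


def effective_level(desired_level: str, has_mfa: bool) -> str:
    """Intersect what was requested with what the account state allows."""
    if desired_level not in ACCESS_LEVELS:
        raise ValueError(f"unknown access level: {desired_level!r}")
    return desired_level if has_mfa else FALLBACK_LEVEL


def plan_memberships(desired: dict, mfa_status: dict) -> dict:
    """Map each user to the level they should hold right now.

    desired:    user -> requested level (assumed to be read from Git)
    mfa_status: user -> True if an MFA device is registered (assumed input)
    """
    return {
        user: effective_level(level, mfa_status.get(user, False))
        for user, level in desired.items()
    }
```

For example, `plan_memberships({"alice": "DevOps", "bob": "Developer"}, {"alice": True})` keeps alice in DevOps and drops bob to User until he registers an MFA device.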

USER ACCESS CONTROL

Don't try for least privilege – you won't get it right and it will be different tomorrow

There are a small number of access levels and people are sorted into those levels per account

User

ReadOnly (Manager)

Finance

Developer

DevOps

FullAdmin

USER ACCESS CONTROL NG

Federation? Yes, but there are issues

SSO? Likewise

We'll probably go to a SAML-based federated MFA gateway

We might go to AD-based access

NETWORK ACCESS CONTROL

VPN into the cloud

Bastion hosts

Private VPCs

Shared root keys

Yup, shared.

No user management on individual nodes

Cattle, not cats

COST CONTROL

It's a thing.

It's a really BIG THING!

COST CONTROL

Tagging policy

Owner (who to go to)

Environment (Dev, Prod, QA, …)

Project (Cost Center – DO NOT USE THIS FOR AUTOMATION!)

Enforce tagging by rules/workflow process

Measure compliance, escalate to GM

Kill off instances that don't comply

With lots of warning

Now tools will give good data

CloudHealth (there are others)
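The tagging policy above can be checked mechanically. The sketch below is one way to do it, not the actual rule from the talk: the instance dict shape, the `first_warned` field, and the 7-day grace period are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = ("Owner", "Environment", "Project")
GRACE_PERIOD = timedelta(days=7)  # assumed value for "lots of warning"


def missing_tags(tags: dict) -> list:
    """Return the required tags that are absent or empty."""
    return [name for name in REQUIRED_TAGS if not tags.get(name)]


def next_action(instance: dict, now: datetime) -> str:
    """Decide what to do with one instance: 'ok', 'warn', or 'terminate'."""
    if not missing_tags(instance.get("tags", {})):
        return "ok"
    first_warned = instance.get("first_warned")  # hypothetical bookkeeping field
    if first_warned is None or now - first_warned < GRACE_PERIOD:
        return "warn"
    return "terminate"


def compliance_rate(instances: list) -> float:
    """Fraction of instances carrying every required tag (1.0 if none exist)."""
    if not instances:
        return 1.0
    compliant = sum(1 for i in instances if not missing_tags(i.get("tags", {})))
    return compliant / len(instances)
```

The compliance rate is the number to report upward; `next_action` is what a rule would hand to a workflow so that termination only ever follows a documented warning.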

WHAT YOU CARE ABOUT WITH COSTS (AWS SPECIFIC)

Reserved Instances

Go for about 80% of always on – Leave room to optimize

Periodically review it and move RIs

Turn off developer systems overnight – small but significant.

Stay on current generation (instance type and OS)

Better performance/$, results in lower $

Pay attention to traffic – inter AZ as well as outbound.

Compression!

Do cost estimates based on loads – have guidelines
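As a rough worked example of the first two points, assuming "always on" means the lowest instance count seen over the review period and that a stopped developer system costs nothing while it is off (both are simplifications, and the function names are hypothetical):

```python
def reserved_instance_target(hourly_counts: list, coverage_percent: int = 80) -> int:
    """Reserve about coverage_percent of the always-on baseline.

    hourly_counts: running-instance counts sampled over the review period.
    The baseline is taken as the minimum count, i.e. what never turns off.
    """
    if not hourly_counts:
        return 0
    always_on = min(hourly_counts)
    # Integer arithmetic avoids float rounding; the result rounds down.
    return always_on * coverage_percent // 100


def overnight_savings_fraction(hours_off_per_day: int) -> float:
    """Fraction of on-demand instance hours saved by a nightly shutdown."""
    if not 0 <= hours_off_per_day <= 24:
        raise ValueError("hours_off_per_day must be between 0 and 24")
    return hours_off_per_day / 24
```

With counts that never drop below 50 instances, the target is 40 reserved instances, which leaves the other 10 free to be resized or retired. Turning a dev system off for 12 hours a night halves its instance hours.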

ACTUALLY HERDING THE CATS

Devops Working Group

Senior engineers

No managers: If you can't put hands on a keyboard to fix something going wrong, this is not the place for you

Things are brought up, opinions are formed. Don’t attribute to individuals.

Discuss cross-cutting needs

GREAT place for the central cloud team to mine for new work

ACTUALLY HERDING THE CATS

Central Cloud Team

Is ½ service organization and ½ cloud owner

Be nice, or the cats will go away and ignore you.

The cats are your scouts and your customers. Listen to them so you know what's important.

RESULTS

PROs

Maximizes velocity, agility

Scalable

Can try out different working patterns

CONs

Inconsistent

Have to be careful about responsibilities

You always have some weeds in the garden

You're always trying to keep up with developers

But at least you know it

And you're not in the way

RESOURCES

AWS Enterprise Support

Expensive, but good

Cloud based services – lots of options here

OpEx, not CapEx (except for RIs?)

Metrics (Librato)

Cost Exploration (CloudHealth)

TOOLS

Automated Rules/Workflow

GitHub Enterprise – it's GitHub that makes your security geeks happy.

Docker

Quay (Private Docker Hub)

Jenkins (for network Cron)

Chef (not much use any more)

NEXT STEPS

Cloud Services Liaisons

Send a member of cloud central to each team's sprint planning

"Lunch and Learn"

Goes both ways -- NOT just the cloud central team

More Policy Automation!!!

LESSONS LEARNED

Start when you're small – fixing the problem after the fact is much harder

Automate everything, even when you don't “have” to – it makes things easier to change

Have a Central Services Team to deal with cross-cutting concerns

Put the power in the hands of people who can make things better

QUESTIONS?

PHOTO CREDITS

• https://www.flickr.com/photos/pelican/6180235561

• https://www.youtube.com/watch?v=puijCrETsrY

• https://www.flickr.com/photos/afu007/2398217277

• https://www.flickr.com/photos/jurvetson/5419597546

• https://commons.wikimedia.org/wiki/File:Catch_cats_3.JPG

• https://pixabay.com/en/photos/pet/?cat=industry

• https://commons.wikimedia.org/wiki/File:White_Cat_and_a_mouse.jpg

• https://www.flickr.com/photos/dan4th/2839915202

• https://pixabay.com/en/cat-annoyed-mauzen-teeth-stress-1370024/

• https://et.wikipedia.org/wiki/Pilt:PR_Siriuksen_EeroCurl_ACS_ds_09_24_1.JPG

• https://commons.wikimedia.org/wiki/File:PR_Siriuksen_EeroCurl_ACS_ds_09_24_2.JPG

• https://commons.wikimedia.org/wiki/File:Tunnel_cat_(6414878527).jpg

• https://www.flickr.com/photos/petsadviser-pix/8652859754

• https://commons.wikimedia.org/wiki/File:Antu_mongodb.svg

• https://www.flickr.com/photos/michael-broad/4642745499

• http://maxpixel.freegreatpicture.com/Cat-Animal-Kennel-Cats-Eyes-Cute-Cat-Animals-269047

• http://maxpixel.freegreatpicture.com/Cat-Kitty-Kitten-Cute-Pipe-Curious-Tube-Feline-568593

• https://www.flickr.com/photos/santamonicamtns/16613805934

• http://maxpixel.freegreatpicture.com/Surprise-Kitten-Kittens-Cat-Money-Animals-Pet-602944

• https://www.pexels.com/photo/animals-cat-pets-7792/

• https://commons.wikimedia.org/wiki/File:Cat_into_the_box.jpg

• https://commons.wikimedia.org/wiki/File:White_cat_over_water_2012.jpg

• https://pixabay.com/en/black-cat-reading-white-paper-33843/

• https://pixabay.com/en/photos/hidden/

• https://www.flickr.com/photos/editor/1195653047

• https://en.wikipedia.org/wiki/File:Exponential_Decay_Function.png

• Other Photos by Chris Williams, Dewey Sasser, and Jennifer Moore

All photos found by Google Images marked for commercial reuse, or by personal permission
