Herding cats in the Cloud

  • HERDING CATS IN THE CLOUD: MAINTAINING OPERATIONAL SANITY IN A CLOUDY, DEVOPS WORLD

    Dewey Sasser

    Consulting Cloud Architect

    Aligned Software

  • ABOUT THIS TALK

    Public Clouds can give developers unprecedented levels of power

    With Great Power Comes Great Responsibility

    You must structure your development and production deployment process to use this power well

    How do we do this? Experience from a large deployment

  • ABOUT DEWEY

    Distributed Application Developer for 20 years

    Doing build/release/software process for about that long

    Accidentally doing devops out of self-defense

    Wandered into operations about 5 years ago

    Built some private cloud for dev

    Built some private cloud for prod

    Started architecting using public cloud for everything

  • ABOUT THE COMPANY

    Company Policy: don't talk for the company

    Therefore, these slides don't mention The Company.

    There is no information here that is not otherwise publicly available.

    Whoever it is, I don't speak for them

    Major Gaming Company, multiple AAA titles

    History in MMOs

    All in on mobile now

  • ROADMAP

    What we Did

    What we're Doing

    What we might do Next

  • WE'RE COMING FROM...

    Traditionally MMOs in colo

    Windows (ugh!) based servers

    All in cloud now: mobile, cloud, Docker, MongoDB, Phoenix Servers, Chaos Monkey, (...other popular buzzwords)

  • GOALS

    100% uptime: players want to play

    No more: Patch days, "Down for maintenance"

    Profit (= revenue - cost)

  • SCALE

    $100ks of monthly spend

    Many hundreds of instances

    Around 500TB of monthly transfer

    Peak to 12k tps (for a single title)

    Around 1 PB of storage

    Approximately 5 billion I/Os monthly

  • USAGE/LOAD PATTERN

    Traditional SaaS assumes starting small and scaling. Scaling quickly is a problem, but a good problem.

    Games are weird

    Peak usage is on release day; it tails off after that

    You must be able to scale out of the gate. Users that cannot use it the first day will often never be back!

  • PLATFORMS

    Swarm pattern

    Pods of services

    Python/NGINX

    Batch Processing pattern

    Vertica

    Elastic Map/Reduce

    Work Queue (Kafka)

    NoSQL (MongoDB, ugh!)

    Gaming Platform

    CoreOS/Docker

    Strong Phoenix Server pattern

  • PROCESS/SOCIAL APPROACH

    Must be (people) scalable

    Working on 3 new games at any one time

    Still supporting old games

    Supporting services for the larger company

    Don't create a bottleneck

    "I'm waiting for a VM." Bad process. No biscuit.

    There are too many controls to get least privilege right!

    Validation, not prevention (WHAT???)

  • POLICIES ARE GREAT, BUT...

    They change over time

    Are hard to get exactly right up front

    Always have exceptions

    The space of AWS permissions is HUGE. Permutations are deadly.

    So...measure what you care about.

    What you care about will change over time.

    Trust...and verify

  • POWER TO THE PEOPLE (OR DEVELOPERS)

    Don't gate productivity on fine points of arbitrary policies

    Keep responsibility with dev team

    domain expertise

    put the pain where the control is

    Stuff gets automated!!!

  • APPROACH

    Cloud Environment

    Multiple accounts (~ 2 dozen right now)

    1 central services account

    1 account per title

    All environments in different VPCs (Dev, QA, Perf, Staging, Prod)

  • DEV TEAMS RESPONSIBLE FOR...

    Developing, validating, deploying and running their games

    Responding to production issues

    PRODUCTION cost control

  • CENTRAL "CLOUD SERVICES" TEAM

    Owns

    Metrics, Monitoring, Alerting

    Enables use of central services & good practices

    Composable components used by the teams

    Native packaging -- make it easy

    Manages good practices

    Their job is to be cloud experts

    But they're not the only ones in the company

    LOTS of conversation!

    Automates everything non-project specific

    New account creation, ...

  • OWNERSHIP/RESPONSIBILITY

    Clearly align authority and responsibility.

    If a Dev is getting up in the middle of the night to fix something, they have to have full power to fix it.

    On a related note, that means the teams get approval control over a great deal

  • GREAT, HOW?

  • CRITICAL TOOL: RULES & WORKFLOWS

    Custom developed rules/workflow system

    Rules are small, stateless snippets of Python code that trigger workflows

    But can be company public and extensible by pull request

    Workflows are potentially long running, stateful operations that trigger a list of changes.

    Can also be company public, but tighter controls around changes.

    Changes can be reviewed manually or automatically.
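
The rule side of this split might look roughly like the sketch below. The slides do not show the real system, so every name here (`WorkflowRequest`, `missing_owner_rule`, `run_rules`, the `notify-and-escalate` workflow, and the dictionary shape of a resource) is an assumption made for illustration. The point is only that a rule inspects state and returns a request, and never changes anything itself.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class WorkflowRequest:
    """A request to start a (hypothetical) workflow for one resource."""
    workflow: str
    resource_id: str
    reason: str


# A rule is a stateless function: it looks at one resource and returns
# either a workflow request or None. It performs no changes itself.
Rule = Callable[[dict], Optional[WorkflowRequest]]


def missing_owner_rule(resource: dict) -> Optional[WorkflowRequest]:
    """Trigger a workflow when a resource has no Owner tag."""
    if "Owner" not in resource.get("tags", {}):
        return WorkflowRequest(
            workflow="notify-and-escalate",  # hypothetical workflow name
            resource_id=resource["id"],
            reason="missing Owner tag",
        )
    return None


def run_rules(rules: Iterable[Rule], resources: Iterable[dict]) -> List[WorkflowRequest]:
    """Apply every rule to every resource and collect the triggered workflows."""
    rule_list = list(rules)
    requests = []
    for resource in resources:
        for rule in rule_list:
            request = rule(resource)
            if request is not None:
                requests.append(request)
    return requests
```

Because a rule in this shape has no state and no side effects, it is cheap to review in a pull request; the stateful, privileged part stays in the workflow runtime.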

  • CRITICAL TOOL: RULES & WORKFLOWS

    Runtime is HIGHLY privileged: keep it tight!

    This tool can destroy the world, but it actually keeps it running.

    (you have everything automated to recreate the world, right?)

  • USER ACCESS CONTROL

    Automate user management/creation from source in Git

    Define membership rules as intersection of desired group and account characteristics (MFA anyone?)

    Rules/Workflow enforces MFA. Central team doesn't have to

    Remove your MFA, get demoted to User
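
A minimal sketch of the "intersection" idea, assuming the desired state in Git is a simple user-to-level mapping and that the set of users with MFA enabled can be read from the account. The function names and data shapes are hypothetical; only the demotion to `User` comes from the slides.

```python
# "User" is the level the slides say you are demoted to without MFA.
FALLBACK_LEVEL = "User"


def effective_level(desired_level: str, has_mfa: bool) -> str:
    """Return the level a user actually gets in an account.

    Membership is the intersection of what the Git source asks for and
    what the account state allows: without MFA, only "User" is granted.
    """
    if not has_mfa:
        return FALLBACK_LEVEL
    return desired_level


def reconcile(desired: dict, mfa_enabled: set) -> dict:
    """Map each user in the desired-state mapping to their effective level."""
    return {
        user: effective_level(level, user in mfa_enabled)
        for user, level in desired.items()
    }
```

Run periodically, a check like this restores the higher level as soon as MFA is re-enabled, with no ticket to the central team.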

  • USER ACCESS CONTROL

    Don't try for least privilege: you won't get it right and it will be different tomorrow

    There are a small number of access levels and people are sorted into those levels per account

    User

    ReadOnly (Manager)

    Finance

    Developer

    DevOps

    FullAdmin

  • USER ACCESS CONTROL NG

    Federation? Yes, but there are issues

    SSO? Likewise

    We'll probably go to a SAML-based federated MFA gateway

    We might go to AD-based access

  • NETWORK ACCESS CONTROL

    VPN into the cloud

    Bastion hosts

    Private VPCs

    Shared root keys

    Yup, shared.

    No user management on individual nodes

    Cattle, not cats

  • COST CONTROL

    It's a thing.

    It's a really BIG THING!

  • COST CONTROL

    Tagging policy

    Owner (who to go to)

    Environment (Dev, Prod, QA, ...)

    Project (Cost Center: DO NOT USE THIS FOR AUTOMATION!)

    Enforce tagging by rules/workflow process

    Measure compliance, escalate to GM

    Kill off instances that don't comply

    With lots of warning

    Now tools will give good data

    CloudHealth (there are others)
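
The enforcement loop above could be sketched as follows. The three tag names come from the slide; the grace periods, action names, and function names are assumptions for illustration, since the slides only say "with lots of warning".

```python
from datetime import datetime, timedelta

REQUIRED_TAGS = ("Owner", "Environment", "Project")

# Hypothetical grace periods; the slides give no actual numbers.
ESCALATE_AFTER = timedelta(days=7)
TERMINATE_AFTER = timedelta(days=14)


def missing_tags(tags: dict) -> list:
    """Return the required tags that are absent or empty."""
    return [name for name in REQUIRED_TAGS if not tags.get(name)]


def next_action(tags: dict, first_seen_noncompliant: datetime, now: datetime) -> str:
    """Pick an action based on how long an instance has been non-compliant."""
    if not missing_tags(tags):
        return "ok"
    age = now - first_seen_noncompliant
    if age >= TERMINATE_AFTER:
        return "terminate"
    if age >= ESCALATE_AFTER:
        return "escalate-to-gm"
    return "warn"
```

The decision is a pure function of tags and timestamps, so compliance can be measured and reported without touching any instance; only the workflow that acts on `terminate` needs elevated privileges.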

  • WHAT YOU CARE ABOUT WITH COSTS (AWS SPECIFIC)

    Reserved Instances

    Go for about 80% of always-on. Leave room to optimize

    Periodically review it and move RIs

    Turn off developer systems overnight: small but significant.

    Stay on current generation (instance type and OS)

    Better performance/$, results in lower $

    Pay attention to inter-AZ traffic as well as outbound.

    Compression!

    Do cost estimates based on loads; have guidelines
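
Two of these guidelines reduce to simple arithmetic. The sketch below assumes a flat hourly on-demand price and a 30-day month; the function names and the example numbers are hypothetical, and only the 80% figure comes from the slide.

```python
def reserved_instance_target(always_on_instances: int, coverage_percent: int = 80) -> int:
    """Number of always-on instances to cover with Reserved Instances.

    Integer arithmetic rounds down, so the uncovered remainder leaves
    room to optimize.
    """
    return always_on_instances * coverage_percent // 100


def overnight_shutdown_savings(hourly_cost: float, instances: int,
                               hours_off_per_day: float, days: int = 30) -> float:
    """On-demand cost avoided by stopping developer instances part of each day."""
    return hourly_cost * instances * hours_off_per_day * days
```

For example, under these assumptions 10 developer instances at $0.50 per hour, stopped 12 hours a night, avoid `overnight_shutdown_savings(0.5, 10, 12)` = $1,800 a month.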

  • ACTUALLY HERDING THE CATS

    Devops Working Group

    Senior engineers

    No managers: If you can't put hands on a keyboard to fix something going wrong, this is not the place for you

    Things are brought up, opinions are formed. Don't attribute to individuals.

    Discuss cross-cutting needs

    GREAT place for the central cloud team to mine for new work

  • ACTUALLY HERDING THE CATS

    Central Cloud Team

    Is a service organization and the cloud owner

    Be nice, or the cats will go away and ignore you.

    The cats are your scouts and your customers. Listen to them

    so you know what's important.

  • RESULTS

    PROs

    Maximizes velocity, agility

    Scalable

    Can try out different working patterns

    CONs

    Inconsistent

    Have to be careful about responsibilities

    You always have some weeds in the garden

    You're always trying to keep up with developers

    But at least you know it

    And you're not in the way

  • RESOURCES

    AWS Enterprise Support

    Expensive, but good

    Cloud-based services: lots of options here

    OpEx, not CapEx (except for RIs?)

    Metrics (Librato)

    Cost Exploration (CloudHealth)

  • TOOLS

    Automated Rules/Workflow

    GitHub Enterprise: it's GitHub that makes your security geeks happy.

    Docker

    Quay (Private Docker Hub)

    Jenkins (for network Cron)

    Chef (not much use any more)

  • NEXT STEPS

    Cloud Services Liaisons

    Send a member of cloud central to each team's sprint planning

    "Lunch and Learn"

    Goes both ways -- NOT just the cloud central team

    More Policy Automation!!!

  • LESSONS LEARNED

    Start when you're small: fixing the problem after the fact is much harder

    Automate everything, even when you don't have to: it makes things easier to change

    Have a Central Services Team to deal with cross-cutting concerns

    Put the power in the hands of people who can make things better

  • QUESTIONS?

  • PHOTO CREDITS

    https://www.flickr.com/photos/pelican/6180235561

    https://www.youtube.com/watch?v=puijCrETsrY

    https://www.flickr.com/photos/afu007/2398217277

    https://www.flickr.com/photos/jurvetson/5419597546

    https://commons.wikimedia.org/wiki/F