Learning to Scale OpenStack: An Update from the Rackspace Public Cloud

An update from the Rackspace Public Cloud

Learning to Scale OpenStack

Rainya Mosher and Jesse Keating, Deployment Engineering

@rainyamosher @iamjkeating

Introductions, welcome to the talk.

The Rackspace Public Cloud

6 Public Regions

3 Pre-Production Regions

10s of Thousands of nodes

Growing continually

Frequent deployments

Staying aligned with upstream

#rackstackatl

A review of the Rackspace Public Cloud sets the context for the conversation.

We could not deploy code in a reasonable window of time

We did not have confidence in the code we were deploying

We could not keep up with upstream

Our Old Challenges

This is our third summit presenting on this topic. Here is a brief review of some of the scale issues we were facing back at the Havana Summit in Portland.

Our window of time is 30 minutes of perceived downtime and 4-hour deploy windows.

Code coverage wasn't great, and lots of errors were discovered in production.

Upstream moved very fast, and we couldn't keep up with all the testing downstream.

Deploys taking 6+ hours

Deploys often failed the first time

Migrations were an unknown factor

Deploys roughly 2 months behind upstream

Old Challenges Met

Deploys take an hour, as short as 10 minutes

Deploys rarely fail the first time

Migrations tested upstream and timed downstream

Still up to 2 months behind

Here is a comparison of how we met some of our challenges

Our deploys are much faster, some as short as 10 minutes total in our largest environment with 3 minutes of API interruption

Deploys are now more reliable

Migration data is known ahead of time (and bad ones blocked upstream)

We still haven't solved keeping up with upstream. Many factors there.
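Timing a migration downstream can be as simple as running it against a copy of production data and comparing the elapsed time with the downtime budget. The sketch below is only an illustration: `time_migration` and its arguments are hypothetical names, and the migration is assumed to be wrapped in a plain Python callable.

```python
import time


def time_migration(migration, budget_seconds):
    """Run `migration` (any callable) once and time it.

    Returns (elapsed_seconds, fits_budget). The callable stands in for
    whatever actually applies the schema change to a test copy of the
    database; that part is an assumption of this sketch.
    """
    start = time.perf_counter()
    migration()
    elapsed = time.perf_counter() - start
    return elapsed, elapsed <= budget_seconds


if __name__ == "__main__":
    # 30 minutes of perceived downtime is the budget mentioned earlier.
    elapsed, ok = time_migration(lambda: time.sleep(0.1), 30 * 60)
    print(f"migration took {elapsed:.2f}s, fits budget: {ok}")
```

A migration that does not fit the budget is one to flag before the deploy window, not during it.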

It is by riding a bicycle that you learn the contours of a country best, since you have to sweat up the hills and coast down them.
~ Ernest Hemingway

We are also learning the contours of OpenStack, by being the largest public cloud operator. We get to sweat up the hills and coast back down.

Scaling Services

Scaling Deployments

Scaling Frequency

Our New Challenges

Some of our new challenges involve scaling more than just deploying bits onto nodes as fast as we can.

Scaling Services, Scaling Deployments, Scaling Frequency.

While we are trying to be a thought leader and front runner, collaboration is the key to success. The developer, operator, and testing communities need to be aware of these scaling challenges.

Scaling Services

As the size of our cloud grows, and the features of our cloud grow, the services we use need to scale along with them. Here we will walk through two scaling scenarios that highlight the challenge.

Scaling Glance

Scheduled Images feature went live

Glance saw much more usage

Glance servers became saturated

Builds and snapshots slowed down, eventually piling up faster than could be consumed

Resolved by:

Scaling number of glance-api nodes

Scaling size of glance-api nodes

Scaling use of glance-bypass feature

Glance is an interesting case. Our Glance acts as a middleman between the hypervisors (HVs) and Swift. As Glance got used more, the bottleneck emerged, partly due to our own configuration and partly due to the nature of Glance.

Once we resolve the Glance issues, Swift could be the next bottleneck. Care will be needed to make sure we don't just kick performance problems down the line to the next group.
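The pile-up described on this slide is a plain rate mismatch: work arrives faster than the glance-api tier can drain it. Here is a minimal sketch of that arithmetic. The function names and every number are made up for illustration; nothing here is measured from our environment.

```python
import math


def backlog_after(arrivals_per_min, per_node_rate, nodes, minutes):
    """Queued requests after `minutes`, assuming constant rates.

    All rates are illustrative assumptions, not measured Glance numbers.
    """
    drain = per_node_rate * nodes
    # The backlog only grows when work arrives faster than it is drained.
    return max(0, (arrivals_per_min - drain) * minutes)


def nodes_needed(arrivals_per_min, per_node_rate, headroom=0.25):
    """Smallest node count that keeps up, with some spare headroom."""
    target = arrivals_per_min * (1 + headroom)
    return math.ceil(target / per_node_rate)


if __name__ == "__main__":
    # 120 requests/min arriving, 8 nodes handling 10 requests/min each.
    print(backlog_after(120, 10, 8, 60))   # backlog after one hour
    print(nodes_needed(120, 10))           # nodes for 25% headroom
```

Adding nodes fixes the mismatch at this tier only. The same calculation then needs repeating for whatever sits behind it, which is the point about Swift above.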

Scaling Nova Cells

Performance Cells went live

More and more cells added to regions

Nova cells service became a single funnel, slowing down the exchange of data

Eventually our single nova-cells service could not consume messages faster than they were being produced

Resolved by:

Scaling number of nova-cells services

Optimizing instance healing calls

Optimizing database usage from cells service

Nova cells is responsible for the interaction between the global cell and all the child cells. Doing this with just a single instance was never going to scale; we just ran out of runway before the pain hit.

Through collaboration with upstream, we are now more able to scale out nova-cells as our cell counts grow.

How do we anticipate where our growth will hurt and proactively scale to match?

These challenges will repeat. New bottlenecks will be found and new resource limits will be discovered. Staying ahead of the pain is key. We will not be the only ones to experience this, so we are looking for collaboration on how best to manage this kind of scale.
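One rough way to stay ahead of the pain is to fit a trend to recent utilization readings and estimate when it reaches a limit. The sketch below assumes weekly samples and straight-line growth. Both are assumptions, and `weeks_until_saturation` is a hypothetical helper, not a tool we run.

```python
def weeks_until_saturation(samples, limit=1.0):
    """Estimate weeks until utilization reaches `limit`.

    `samples` is a list of weekly utilization readings (0.0 to 1.0),
    oldest first. A least-squares straight line is fitted, which is an
    assumption: real growth may be anything but linear.
    Returns None when there is too little data or no upward trend.
    """
    n = len(samples)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    slope_den = sum((x - mean_x) ** 2 for x in xs)
    slope = slope_num / slope_den
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Week index at which the fitted line crosses the limit.
    crossing = (limit - intercept) / slope
    # Weeks remaining after the most recent sample, never negative.
    return max(0.0, crossing - (n - 1))


if __name__ == "__main__":
    print(weeks_until_saturation([0.5, 0.6, 0.7, 0.8]))
```

A projection like this only says when a known resource runs out. It does not find the bottleneck nobody is measuring yet, which is where shared operator experience helps.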

Scaling Deployments

Our next scale challenge involves deployments.

We made great strides around Havana. What have we been doing since?

Higher Form Orchestration

Pre-staging content outside of deploy window

Increased tolerance of downed hosts

Targeted bring up of services

API first, then computes

More deployment options

Factonly

Cellonly

No migrations

Reduced complexity

Single entry point: bin/deploy

Single orchestration system: Ansible

Orchestration has been our theme around deployments. We continue to iterate on the parts of the deployment causing the most pain, always making improvements for the next time.

Walk through each block and explain why the change was made
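The blocks on this slide can be pictured as a small planner that turns a set of options into an ordered list of phases. The sketch below is not our bin/deploy or our Ansible playbooks. The option names, phase names, and the meaning given to the fact-only mode are all assumptions made for illustration, and the cell-only mode is left out to keep it short.

```python
from dataclasses import dataclass


@dataclass
class DeployOptions:
    """Hypothetical switches, loosely modeled on the modes listed above."""
    fact_only: bool = False      # assumed meaning: gather facts, then stop
    run_migrations: bool = True  # False corresponds to "No migrations"
    max_down_hosts: int = 0      # unreachable hosts tolerated before aborting


def build_plan(opts, unreachable_hosts=()):
    """Return the ordered phase names for one deploy run."""
    if len(unreachable_hosts) > opts.max_down_hosts:
        raise RuntimeError("too many unreachable hosts to proceed")
    plan = ["gather_facts"]
    if opts.fact_only:
        return plan
    plan.append("stop_services")
    if opts.run_migrations:
        plan.append("migrate_databases")
    # Targeted bring-up: API first, then computes.
    plan += ["start_api", "start_computes"]
    return plan


if __name__ == "__main__":
    opts = DeployOptions(run_migrations=False, max_down_hosts=2)
    print(build_plan(opts, unreachable_hosts=["compute-17"]))
```

Pre-staging does not appear as a phase because it happens outside the deploy window. Skipping migrations removes a whole phase, which is what makes the shorter deploys possible.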

We still treat OpenStack as a legacy software deployment. As a community we need to treat it more like a cloud application, but that requires collaboration!

Even with the improvements, we still treat OpenStack like a legacy application: upgrading in place, not utilizing load balancers, stopping everything to migrate databases, preventing mixed versions, etc. There are many things that are preventing us from getting to zero downtime, and that's where we can all work together!

Scaling Frequency

A third scale challenge is frequency. This is the scale of doing things much more often.

It never gets easier, you just go faster.
~ Greg LeMond

A very relevant quote. Unlike bicycling, though, when you do something more often in the DevOps world it does tend to get easier. There are still challenges to going faster!

Scaling Change

New features coming

New configurations coming

Accommodate without interrupting customer experience

Change faster, change frequently, on an ever growing fleet of systems

Resolved by:

Understanding change before it happens

Scheduling changes to not conflict

Dedicating release iterations to risky change on top of known good code

Custom deploy modes per change type

Change comes from many sources. These changes need to be distributed to the environments, but with as little customer impact as possible. If we can't deploy changes often enough, we fall behind upstream, we fall behind our features, and we have larger deployments to consume. A snowball effect.

Our work on creating multiple new release pipelines, improving our deployment methods, and moving our tests upstream has enabled us to move faster, but not fast enough.
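Scheduling changes so they do not conflict comes down to checking whether two change windows overlap in the same place. A minimal sketch follows, assuming each change is a `(name, region, start, end)` tuple. That shape, the region names, and `find_conflicts` itself are hypothetical.

```python
from datetime import datetime
from itertools import combinations


def find_conflicts(changes):
    """Return pairs of change names whose windows overlap in one region.

    `changes` is a list of (name, region, start, end) tuples; the shape
    is a hypothetical one chosen for this sketch.
    """
    conflicts = []
    for a, b in combinations(changes, 2):
        same_region = a[1] == b[1]
        # Two half-open windows overlap when each starts before the other ends.
        overlap = a[2] < b[3] and b[2] < a[3]
        if same_region and overlap:
            conflicts.append((a[0], b[0]))
    return conflicts


if __name__ == "__main__":
    day = datetime(2014, 5, 13)
    changes = [
        ("code deploy", "region-a", day.replace(hour=1), day.replace(hour=5)),
        ("config push", "region-a", day.replace(hour=4), day.replace(hour=6)),
        ("config push", "region-b", day.replace(hour=4), day.replace(hour=6)),
    ]
    print(find_conflicts(changes))
```

Comparing every pair is quadratic in the number of changes, which is fine for a release calendar but would need sorting by start time for anything much larger.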

Customer Experience is our most important measurement of how fast we can scale.

This is our limit. We absolutely have to make this better. This is a global need, throughout the community of developers, operators, and testers.

The Next Iteration

A quick look at what we've got cooking for the Juno cycle

Leverage object model in Icehouse for mixed-version services

Implement Nova conductor service

Investigate read-only states

Zero Perceived Downtime

In Icehouse, Nova made great strides toward live upgrades with the object model and conductor, which give us the ability to run multiple versions of OpenStack at the same time. Notably, we could run a newer nova-api against an older version in the rest of the environment, and shield nova-compute from migrations. This could allow us to roll the update through without downtime of the API, and with less interruption to the computes.

Investigate putting API nodes in read-only during migrations to satisfy some requests and queue others
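The read-only idea can be sketched as a gate in front of the API: reads pass straight through, while writes are held and replayed once the migration finishes. This is an illustration of the idea under investigation, not an existing Nova feature. `ReadOnlyGate`, the handler signature, and the response strings are all hypothetical.

```python
from collections import deque


class ReadOnlyGate:
    """Serve reads immediately; hold writes while a migration runs."""

    READ_METHODS = {"GET", "HEAD"}

    def __init__(self, handler):
        self.handler = handler   # callable(method, path) -> response
        self.read_only = False
        self.pending = deque()

    def request(self, method, path):
        if self.read_only and method not in self.READ_METHODS:
            # Hold the write for later instead of rejecting it.
            self.pending.append((method, path))
            return "202 queued"
        return self.handler(method, path)

    def finish_migration(self):
        """Leave read-only mode and replay queued writes in arrival order."""
        self.read_only = False
        results = []
        while self.pending:
            results.append(self.handler(*self.pending.popleft()))
        return results


if __name__ == "__main__":
    gate = ReadOnlyGate(lambda method, path: f"{method} {path} ok")
    gate.read_only = True
    print(gate.request("GET", "/servers"))
    print(gate.request("POST", "/servers"))
    print(gate.finish_migration())
```

The open questions are the ones this sketch skips: whether reads are safe while the schema is changing, and what a caller should be told about a write that has been queued but not yet applied.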

Can we give Glance its own pipeline and deployment capability, independent of Nova or other services?

How do we combat the exponential growth of service version combinations?

Does this actually make the whole pipeline any faster?

Individual Service Deployment Pipelines

This is an ongoing conversation. If we allow each service to work independently, what does that do to the version test matrix? Can we reliably validate anything? While individual projects/services might go faster, does that allow the entire pipeline to go faster? This ties into the discussions happening now at the design summit about cross project interactions.
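The size of the version test matrix is the product of the number of versions each service may be running, so it grows exponentially with the number of independently deployed services. A short sketch of that arithmetic follows; the service list and version counts are illustrative assumptions.

```python
from itertools import product
from math import prod


def version_combinations(versions_per_service):
    """Number of distinct combinations across independently versioned services."""
    return prod(len(v) for v in versions_per_service.values())


def enumerate_matrix(versions_per_service):
    """List every combination as a dict of service -> version."""
    names = list(versions_per_service)
    return [dict(zip(names, combo))
            for combo in product(*versions_per_service.values())]


if __name__ == "__main__":
    # Two candidate versions per service, purely for illustration.
    services = {
        "nova": ["current", "next"],
        "glance": ["current", "next"],
        "swift": ["current", "next"],
    }
    print(version_combinations(services))
    print(enumerate_matrix(services)[0])
```

With k versions in flight for each of n services the matrix has k to the power n entries, so each service that gets its own pipeline multiplies the validation work rather than adding to it.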

Creating not just ephemeral environments, but production ones as well

Upgrades are easy, initial setups are a lot harder

Validation is critical

Developers and Operators need to collaborate on this use case when services are being designed

Fully Automated Environments

Yeah, we need them. Setting them up is hard, let's work together to make them easier.

The ops meetups are great for collaborating on the issues at hand.

I have always struggled to achieve excellence. One thing that cycling has taught me is that if you can achieve something without a struggle it's not going to be satisfying.
~ Greg LeMond

We do a lot of things that are hard, but if it wasn't hard, it wouldn't be as satisfying. That's what keeps us coming back.

Scaling is more than just tossing code on nodes. There are a lot more considerations to take into account.

The development, operator, and tester communities need to collaborate more on where the painful parts are, particularly at scale, and work together on solutions.

RACKSPACE HOSTING | 5000 WALZEM ROAD | SAN ANTONIO, TX 78218

US SALES: 1-800-961-2888 | US SUPPORT: 1-800-961-4454 | WWW.RACKSPACE.COM

RACKSPACE HOSTING | RACKSPACE US, INC. | RACKSPACE AND FANATICAL SUPPORT ARE SERVICE MARKS OF RACKSPACE US, INC. REGISTERED IN THE UNITED STATES AND OTHER COUNTRIES. | WWW.RACKSPACE.COM