Moving a Running OpenStack Cloud to a New Data Center

Introduction
Matt Fischer
• [email protected]
• IRC: mfisch
• Twitter: @openmfisch
Craig DeLatte
• [email protected]
• IRC: cdelatte

Background
• OpenStack in two national data centers
• Hundreds of nodes per data center
• Running lots of business critical applications

Why Move?
• Data centers physically out of space for expansion
• Separation of environments for performance and robustness
• Allow us to control our own full hardware stack
• Redesign the network layout

Mission Impossible?
• We aren’t allowed in the data centers
• We weren’t allowed to make network switch changes
• We don’t use the corporate change management system
• We don’t set schedules or priorities for other groups
• Customer VMs are pets

First Technical Planning Session

“You want to do what?”
“You mean like physically move the boxes?”

Hardware Plan
• What do we “need” to accomplish:
  – Network layout change
  – Upgrade firmware across the board
• What do we “want” to accomplish:
  – Burn-in testing to eliminate hardware issues
  – Fix server hardware layout

Physical Node Move Steps
• After the node has been cleared of running services
  – Don’t forget to wipe your boot drive
  – Physically move the server
  – Swap NICs
  – Re-cable
  – Upgrade firmware
• Re-IP/Update MAC

Hardware Hurdles
• Standing up infrastructure is hard. Things to consider:
  – Firmware upgrades
  – Hardware config
  – Burn-in testing

“Infrastructure automation is a prerequisite for a project like this”

We Designed For Automation
• Full node server build automation with PXE/Cobbler/Puppet
• Hardware load balancer automation with Ansible
• Network switch automation with Ansible
• API quiescing with xinetd
• Guest VM live-migration
• Tooling to manage and move virtual routers

Load Balancers
• Software load balancers (haproxy) managed by puppet already
• Hardware load balancer without automation
  – config done by hand
  – validation is looking for a green dot in GUI
• We automated the A10 deploy and post-deploy validation with ansible

Switches
• Switch config before automation
  – Required approval from three teams
  – Configs pasted in from a Wiki
  – Could take days or weeks
• Automated Juniper switch deployment
  – Done with ansible+Jenkins
  – Follows code review process
  – Network Engineering team using gerrit!

Caveats
• Some expensive pieces of hardware lack full API support or documentation.
• Ansible or Puppet support may be missing
• You may be the first one to ask your vendor rep about their automation story

“This should be no more disruptive to APIs than a normal weekly deploy or to guests than a live-migration”

General Node Move Process
• Drop DNS TTL
• Evacuate Node/Quiesce Traffic
• Wipe Drive
• Power off box
• Physically move the node*
• Update DNS Record
• Build box with PXE
• Test new node
• Update load balancers/nova/ceph config

Traffic Quiescing
• API services have a special health check port
  – Utilizes xinetd and socat
  – Used by haproxy and A10
• Dropping a file into place marks the node as disabled, but doesn’t interrupt active connections.
• This also works for internal services like mysql and rabbitMQ
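
The file-based disable switch can be sketched roughly as below. This is only an illustration in Python, not the xinetd/socat setup from the talk: the flag file path, the function name and the choice of HTTP status codes are all assumptions.

```python
import os

# Hypothetical flag file; the talk does not name the real path.
DISABLE_FLAG = "/etc/healthcheck/disabled"


def health_response(flag_path: str = DISABLE_FLAG) -> str:
    """Build the HTTP response a health check port could return.

    If the flag file exists, the node reports itself as unavailable so the
    load balancer stops sending it new connections. Nothing here stops the
    service, so connections that are already open keep running.
    """
    if os.path.exists(flag_path):
        status, body = "503 Service Unavailable", "disabled\n"
    else:
        status, body = "200 OK", "ok\n"
    return (
        f"HTTP/1.1 {status}\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n"
        "\r\n"
        f"{body}"
    )
```

A per-connection wrapper would print this string to stdout. Draining a node is then `touch` on the flag file, and re-enabling it is deleting the file; no service restart is involved.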

Ordering - First Production Move

Ordering - Second Production Move

Puppet Master / Build Server
• Puppet master moved via “brain transplant”
• Automated this process with ansible
• The puppet master also handles PXE boot via cobbler
  – First box to be moved
• Wanted to avoid inter-DC PXE booting, but had emergency procedures

Load Balancer + VIP Move
[Diagram, labels only: haproxy; backup haproxy node; VIP 1.2.3.4; api.twc.net; API services; API calls; Old DC; New DC]

Load Balancer: Move Node + Test
[Diagram, labels only: haproxy (two instances); VIP 1.2.3.4; api.twc.net; API services; API calls; Old DC; New DC; test API calls; VIP 5.6.7.8]

Load Balancer: Move DNS & Wait
[Diagram, labels only: haproxy (two instances); VIP 1.2.3.4; API services; running API connections; Old DC; New DC; API calls; VIP 5.6.7.8; api.twc.net]

Load Balancer: Final State
[Diagram, labels only: haproxy; backup haproxy node; VIP 5.6.7.8; api.twc.net; API services; API calls; Old DC; New DC]

Keystone
• Quiesce traffic & wait for connections to drop
• Stop services
• Power off box
• Rebuild new box in new DC
• Test new box before adding to API cluster
• No impact - done during the day
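
The "wait for connections to drop" step could be scripted along these lines. This is a hedged sketch: the talk does not say how connections were counted, so `count_connections` is a hypothetical caller-supplied probe, and the timeout and poll interval are arbitrary defaults.

```python
import time
from typing import Callable


def wait_for_drain(
    count_connections: Callable[[], int],
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until no connections remain or the timeout expires.

    count_connections is a hypothetical probe, for example one that counts
    established sockets on the service port. Returns True if the node
    drained and False if it timed out with connections still open.
    """
    waited = 0.0
    while True:
        if count_connections() == 0:
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s
```

Only stop services once this returns True. A False result means someone should look at the remaining connections before the box is powered off.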

Control - Routers
• Router moves are the most customer impacting part of this process.
• Some customers have a lot of FIPs per router
• Evacuating all routers at once is a bad idea, although that’s what we did.
• Moved all routers before stopping OpenStack services.

Control - API Services + RabbitMQ
• Quiesce connections on this node
• But what about connections to RabbitMQ?
  – Stop OpenStack on this node
  – Restart OpenStack on other control nodes
  – Restart nova-compute on other nodes
• Stop Rabbit/Stop mysql
• Power down
• Rebuild, Test, Add to API Cluster

Compute
• Basic Plan: build, evac, move. Repeat.
• ansible tooling for live migration
  – canary VM with ping check
• But… Live-migration is not guaranteed to work
  – Limit your parallelization
  – Bigger and busier VMs may never live-migrate
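
Limiting parallelism while still recording which migrations failed might look like the sketch below. It is not the ansible tooling from the talk: `migrate` is a hypothetical caller-supplied function that starts one live migration, waits for it, and returns True on success. The canary ping check is not modelled here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def migrate_all(
    vms: Iterable[str],
    migrate: Callable[[str], bool],
    max_parallel: int = 2,
) -> Tuple[List[str], List[str]]:
    """Live-migrate VMs with at most max_parallel migrations in flight.

    Returns (migrated, failed). Failures are collected rather than raised,
    since some VMs may need a different approach.
    """
    vm_list = list(vms)
    migrated: List[str] = []
    failed: List[str] = []

    def attempt(vm: str) -> bool:
        try:
            return migrate(vm)
        except Exception:
            # Treat any error as a failed migration instead of aborting the run.
            return False

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves input order, so results line up with vm_list.
        for vm, ok in zip(vm_list, pool.map(attempt, vm_list)):
            (migrated if ok else failed).append(vm)
    return migrated, failed
```

The `failed` list is the useful output: those are the VMs to retry at a quieter time or to handle by hand.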

“...uh, what about our petabytes of data?”

Swift
• Power off, move, then rebuild node
  – Leave data drives alone during rebuild, only incrementally migrate data
• Add in nodes to accept data
  – Ensure all routes are in place to new networks

Cephmon
• Attempts to virtualize cephmon IPs failed
  – Alerts it may be a security breach
• Multiple steps to get the right cephmon IPs
  – Instance boot drive
    • Nova stop/start or nova resize
  – Attached volumes
    • Live-migration

Ceph OSDs
• Bring up some new nodes
• Add New OSDs to crushmap
• Data migrated to new nodes
• Remove old OSDs from crushmap
• Power off some old nodes
• Physically move some old nodes to new DC
• Repeat...
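
The repeat loop above can be written out as a plan generator. This sketch assumes that each batch of moved old nodes becomes the next round's "new nodes", which the slide implies but does not state. The node names, batch size and step wording are illustrative, and nothing here talks to Ceph.

```python
from typing import List, Sequence


def rolling_osd_plan(
    old_nodes: Sequence[str],
    seed_nodes: Sequence[str],
    batch_size: int,
) -> List[str]:
    """List the steps of a rolling OSD move as plain strings.

    seed_nodes are the nodes already available in the new DC for round one.
    Each round drains up to batch_size old nodes, which are then moved and
    reused as the incoming nodes of the following round (an assumption).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    incoming = list(seed_nodes)
    remaining = list(old_nodes)
    plan: List[str] = []
    while incoming:
        names = ", ".join(incoming)
        plan.append(f"bring up {names} in the new DC and add their OSDs to the crushmap")
        plan.append("wait for data to migrate to the new nodes")
        outgoing, remaining = remaining[:batch_size], remaining[batch_size:]
        if outgoing:
            names = ", ".join(outgoing)
            plan.append(f"remove OSDs on {names} from the crushmap")
            plan.append(f"power off {names} and move them to the new DC")
        incoming = outgoing
    return plan
```

Printing the plan before starting makes the number of rounds visible, and with it how often data will be reshuffled.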

Data Migrations
Can you spot the issue?

Watch Your Bottlenecks

“I can promise you there will be problems.”

Issues
• Networking
  – ACLs
  – Incorrect cabling
  – Bottlenecks
• Software
  – VTEP address overlap
  – keepalived VIPs

Issues (cont.)
• Vendors
  – Bugs bugs bugs…
• Deployment process
  – Running different levels of deployments until the move was complete
• Customers
  – Gaining customer buy-in is a chess match

Delays
• Vendors
  – Found multiple issues with PXE booting, VLAN, and LACP
• Space
  – Needed to build a new data hall
  – Not to mention a new data center

Customer Issues
• Actual
  – VTEP overlap
  – Oops we upgraded OVS
  – File descriptor limits on qemu processes
• Perceived
  – High latency reports
    • App owners released a new campaign targeted to millions of customers

“If you are going to do this...”

If You’re Going to Do This...
• Our cloud has lots of interdependencies; tracking these was key.
  – Caching DNS on load balancers
• Which things in your system are still configured using IP addresses?
  – Galera, ceph, haproxy
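
One way to start answering the IP address question is to scan configuration text for IPv4 literals. The sketch below is an assumption about how such an audit could be done, not something described in the talk. It will also flag four-part version strings that happen to parse as addresses.

```python
import ipaddress
import re
from typing import List, Tuple

# Loose pattern for dotted quads; ipaddress does the real validation.
_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_ipv4_literals(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, address) for each IPv4 literal found in text."""
    hits: List[Tuple[int, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _CANDIDATE.finditer(line):
            try:
                ipaddress.IPv4Address(match.group())
            except ValueError:
                continue  # out-of-range octets and similar non-addresses
            hits.append((lineno, match.group()))
    return hits
```

Running this over the rendered Galera, ceph and haproxy configs gives a list of places that will need a re-IP step, as opposed to a DNS update, when a node moves.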

If You’re Going to Do This...
• What resources are protected by VLAN specific ACLs in your company?
  – DNS, LDAP/AD
• Do you have maintenance plans and automation for each of your nodes?

If You’re Going to Do This...
• Communicate with customers, but don’t over-communicate.
  – Most get nervous if they know too much.
• Don’t get overly aggressive with your timeline
• Practice Practice Practice
  – Production was our 3rd time, not our 1st
  – We made improvements to the process every time.

Summary