How Adobe Built An OpenStack Cloud
Jun Park (Ph.D, MBA), Solutions Architect At Adobe
Arghya Banerjee, Sr. Systems Engineer At Adobe
OpenStack Mitaka Summit At Tokyo, Oct 2015
Swiss Cheese Model
From Wikipedia
If aligned, flaws would allow an accident to occur
Flaws In Defense layers
Two More Factors That Complicate Things
Spacetime Continuum - Einstein
Interactions, Higgs Field & Boson
From Wikipedia
From Youtube
Our Template To Analyze
Time
Components
Dependencies
In Red: Bugs or Issues
In Green: Fix or Stable
OpenStack Survey, May 2015
The most common architecture: Ubuntu + KVM + OVS + Ceph
Adobe OpenStack Architecture
VM1: eth0, eth1
VM2: eth0, eth1
VM3: eth0, eth1
Private Networks: VxLAN-based
External Provider Networks: VLAN-based
Adobe Network Firewall
Adobe Corporate Networks
Storage: Ceph RBD
What Happened At Networking?
May '15, Jul '14, Apr '14
Ubuntu 14.04 Trusty Released With OVS 2.0.1
Bug Report With OVS 2.0.1 In Ubuntu 14.04
Cherry-Pick On OVS 2.0.2 In Ubuntu 14.04.2
Ubuntu 14.04
Open vSwitch (OVS)
Bug Fix In All OVS 2.x
Jun '13
This Bug Introduced With OVS Mega Flow
Aug '14
OVS 2.3.0, OVS 2.1.3, OVS 2.0.2 Released
A New Bug: OVS Sporadically Crashes When Adding A Port (https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1336555 and 1449012)
OVS 2.0.1 Released: Mega Flow, Multiprocessing
Dec '13
Enhancement Patch Not Yet Integrated (e.g., 270 secs to 3 secs For 25K rules)
Neutron Security Group O(N^2) Issue
Restarting agents re-establishes entire flows
Fix ready, not added
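One way to read the O(N^2) security group issue above: when a group's rule references the group itself, every port ends up with one expanded rule per other member. The sketch below only does that arithmetic. The function name and the per-member expansion model are assumptions for illustration, and the slide does not say how its 25K rules were composed.

```python
def expanded_rule_count(members: int, rules_per_group: int = 1) -> int:
    """Total expanded rules across all ports of a self-referencing group.

    Assumed model: each port carries one rule per *other* member for
    every group rule, so the total grows quadratically with membership.
    """
    return members * (members - 1) * rules_per_group


# Print how quickly the total grows with group size.
for n in (10, 50, 160):
    print(n, expanded_rule_count(n))
```

Under this model a group of roughly 160 members is already in the 25K range, which is why rebuilding every rule on an agent restart becomes expensive.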
What Happened At Networking?
May '15, Nov '14
Cherry-Pick Onto OVS 2.0.2 In Ubuntu 14.04
Ubuntu 14.04
OpenStack Summits
A New Bug: OVS Sporadically Crashes When Adding A Port (https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1336555 and 1449012)
OVS 2.0.1 Released: Mega Flow, Multiprocessing
Dec '13
OVS
Paris: Juno
Vancouver: Kilo
• Some companies reverted from OVS to Linux Bridge!
• Some pundits spread FUD about Neutron!
Atlanta: Icehouse
May '14, Apr '14
Ubuntu 14.04 Trusty Released With OVS 2.0.1
What Happened At Storage?
July '15, Apr '14
Ubuntu 14.04 Trusty Released With Ceph Firefly 0.79
Ubuntu 14.04 Updates With Ceph Firefly 0.80.10
Ubuntu 14.04
Ceph Failover Instability With Firefly
Hammer?
Ceph Operational Instability, Cinder Scalability Issue
Enhancement Solution Not Yet Integrated (e.g., APIs Stacked Up -> Multiprocessing)
Cinder
Cinder is stuck when Ceph is stuck (e.g., use local drive for copying an image)
May '14
What Happened At Data Node?
July '15, Apr '14
Ubuntu 14.04 Trusty Released With Kernel…
Ubuntu 14.04
Kernel
XFS Deadlock Bug
Kernel Memory Bug, Security Issue
Security Patch
KVM Security Issue
May '14, Nov '14
Bug Fix
Dec '13
Ubuntu 14.04 Trusty Released With Kernel…
May '15
Check List
Networks
• Understand OVS and find a stable OVS
• Cherry-pick for Neutron scalability: firewall rules
• Our own out-of-band rate limiting on networks, e.g., 200 Mbps
• Set up the right MTU size on the OVS structure
• Turn off GRO/LRO on hosts
Storage
• Decouple the storage system from OpenStack API services
• Cinder scalability
• Ceph stability: Hammer, reconfigure towards optimal
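The MTU and GRO/LRO items in the network checklist can be made concrete with a small sketch. It assumes VXLAN over IPv4 with no outer VLAN tag, which gives 50 bytes of encapsulation overhead, and it only builds `ethtool` command strings rather than running them. The helper names and the interface name in the usage line are hypothetical, not taken from the deployment.

```python
# Encapsulation overhead for VXLAN over IPv4, in bytes:
# inner Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN (8).
VXLAN_OVERHEAD_IPV4 = 14 + 20 + 8 + 8


def tenant_mtu(physical_mtu: int) -> int:
    """Largest VM-side MTU that fits in one underlay packet."""
    return physical_mtu - VXLAN_OVERHEAD_IPV4


def offload_commands(interfaces):
    """Build (not run) ethtool commands that turn GRO and LRO off."""
    return [f"ethtool -K {nic} gro off lro off" for nic in interfaces]


print(tenant_mtu(1500))
print(offload_commands(["eth0"]))
```

Either the VM-side MTU is lowered to fit, or the underlay MTU is raised so that tenants can keep a standard 1500-byte MTU. Which of the two this deployment chose is not stated in the slides.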
How To Test at Scale
• Emulate the future production environment
  - Create hundreds of VMs, inject workloads, and destroy all
  - Recycle this entire test over and over again
  - Findings: dead tokens stacked up
• Each component scalability
  - Neutron: OVS
  - Cinder: Ceph
  - Nova: KVM
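The create, inject, destroy, repeat cycle described above can be sketched as a loop that records what is left behind after every pass. `FakeCloud` is a hypothetical in-memory stand-in, not an OpenStack client; against a real cloud the same loop would call the actual APIs and check for residue such as the stacked-up dead tokens mentioned in the findings.

```python
import itertools


class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud API client."""

    def __init__(self):
        self.vms = {}
        self._ids = itertools.count(1)

    def create_vm(self):
        vm_id = next(self._ids)
        self.vms[vm_id] = "ACTIVE"
        return vm_id

    def inject_workload(self, vm_id):
        self.vms[vm_id] = "BUSY"

    def delete_vm(self, vm_id):
        del self.vms[vm_id]


def run_cycle(cloud, vm_count):
    """Create VMs, load them, destroy them; return how many are left over."""
    ids = [cloud.create_vm() for _ in range(vm_count)]
    for vm_id in ids:
        cloud.inject_workload(vm_id)
    for vm_id in ids:
        cloud.delete_vm(vm_id)
    return len(cloud.vms)


def soak(cloud, cycles, vm_count):
    """Repeat the whole cycle and collect the leftover count per cycle."""
    return [run_cycle(cloud, vm_count) for _ in range(cycles)]


print(soak(FakeCloud(), cycles=3, vm_count=200))
```

Any nonzero or growing value in the returned list points at something accumulating across cycles, which is the kind of problem a single pass would not reveal.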
Have We Done Enough?
4?
3?
It's not that I'm so smart, it's just that I stay with problems longer.
- Albert Einstein
New Efforts In OpenStack
• OpenStack Product Working Group: Link up between contributors and users
• Governance/DefCore Committee: Defining OpenStack Core
• Large Deployment Team: Operational issues for large deployments
• Open Virtual Network (OVN): In-kernel Conntrack, DPDK, etc. Will run atop OVS
APPENDIX
USE CASE: Mesos Cluster
Possible Combinations
Containers, VMs, Bare Metals, Containers In Containers, VMs
Mesos Cluster Via Heat
Marathon, ZooKeeper
VM1: mesos master
VM2: mesos slave1
VM3: mesos slave2
http server http server
Host1 Host2 Host3
-> Ubuntu-mesos image available via diskimage-builder
-> Post configuration for master
-> Starting services
-> Ubuntu-mesos image
-> Post configuration for slave using mesos master IP
-> Starting services
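The master and slave steps above map naturally onto a Heat template in which each slave's post-configuration receives the master's IP. The sketch below builds such a template as a Python dictionary and prints it as JSON. It is not the template used in the talk: the flavor, the resource names, the placeholder `user_data` scripts, and the choice of a single template for all three servers are assumptions. Only the `ubuntu-mesos` image name comes from the slide.

```python
import json


def mesos_stack_template(image="ubuntu-mesos", flavor="m1.small", slaves=2):
    """Build a HOT-style template: one master plus N slaves (illustrative)."""
    resources = {
        "mesos_master": {
            "type": "OS::Nova::Server",
            "properties": {
                "image": image,
                "flavor": flavor,  # assumed flavor name
                # Placeholder script: master post-configuration, start services.
                "user_data": "#!/bin/bash\n# configure master, start services\n",
            },
        }
    }
    for i in range(1, slaves + 1):
        resources[f"mesos_slave{i}"] = {
            "type": "OS::Nova::Server",
            "properties": {
                "image": image,
                "flavor": flavor,
                "user_data": {
                    # str_replace substitutes the master's address at deploy time.
                    "str_replace": {
                        "template": "#!/bin/bash\n# point slave at MASTER_IP, start services\n",
                        "params": {
                            "MASTER_IP": {"get_attr": ["mesos_master", "first_address"]}
                        },
                    }
                },
            },
        }
    return {"heat_template_version": "2015-04-30", "resources": resources}


print(json.dumps(mesos_stack_template(), indent=2))
```

Referencing the master through `get_attr` also gives Heat the dependency order, so the slaves are created only after the master's address is known.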
Mesos Cluster with Marathon
Marathon
Mesos Slave2
http server
Mesos Master With ZooKeeper
Request to run a micro-service via REST API
Mesos Slave1
http server
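The request to run a micro-service through Marathon's REST API can be sketched as a POST to the `/v2/apps` endpoint. The block below only constructs the request and prints it; nothing is sent. The host name, app id, command, and resource sizes are hypothetical values chosen for illustration.

```python
import json
import urllib.request


def marathon_app_request(marathon_url, app_id, cmd, instances=1, cpus=0.1, mem=32):
    """Build (but do not send) a POST that asks Marathon to run an app."""
    payload = {
        "id": app_id,
        "cmd": cmd,
        "instances": instances,
        "cpus": cpus,
        "mem": mem,
    }
    return urllib.request.Request(
        url=f"{marathon_url}/v2/apps",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and app definition: two small HTTP server instances.
req = marathon_app_request(
    "http://mesos-master.example:8080",
    "/http-server",
    "python3 -m http.server $PORT0",
    instances=2,
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would submit it to a reachable Marathon instance, which then schedules the instances onto the Mesos slaves.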
Master + 2 slaves: Heat Stacks
Topology of Slave2
Marathon: Two Apps on Slave1
App Running On Slave
Mesos UI
Heat Template
Time
Components
Dependencies
Adobe OpenStack Architecture
VM1: eth0, eth1
External Provider Networks: VLAN-based
Adobe Network Firewall
Adobe Corporate Networks
Linux Bridge
Open vSwitch
bond0
Physical VLANs
Set of Images
Glance API Server
Image1: Ubuntu Trusty
Volume1: Ubuntu Trusty
Copy-On-Write (COW) Ceph Volume
Snapshot1: Ubuntu Trusty
New Volume1 for VM1
New Volume2 for VM2
New Volume3 for VM3
Cinder API Server
Base Volume For All Three VMs
Individual COW Volumes
Volume Management in OpenStack
1. Copy
2. Snapshot
3. Volumes
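The copy, snapshot, and volumes flow relies on copy-on-write: the three VM volumes share one base snapshot and store only the blocks they change. The toy model below illustrates that behavior in plain Python. It is not the Ceph or Cinder API, and the class names and block contents are made up for the example.

```python
class Snapshot:
    """Read-only set of blocks captured from the base volume."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)

    def read(self, index):
        return self._blocks.get(index, b"")


class CowVolume:
    """Clone that shares its parent's blocks until it writes to them."""

    def __init__(self, parent):
        self._parent = parent
        self._own = {}

    def read(self, index):
        # Prefer the clone's private copy; otherwise fall through to the parent.
        if index in self._own:
            return self._own[index]
        return self._parent.read(index)

    def write(self, index, data):
        # Only written blocks consume space in the clone.
        self._own[index] = data

    def private_blocks(self):
        return len(self._own)


base = Snapshot({0: b"boot", 1: b"rootfs"})
vm1, vm2, vm3 = (CowVolume(base) for _ in range(3))
vm1.write(1, b"rootfs+app")
print(vm1.read(1), vm2.read(1), vm1.private_blocks(), vm2.private_blocks())
```

A write in one VM's volume stays private to it, while untouched blocks continue to be served from the shared base, so creating a new volume does not require copying the whole image.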