Showcasing OpenStack in Telefónica I+D Digital Operations & Deployment TGR/gCTO

Telefónica's bet on the private cloud


Introduction

• Showcase deployment of Red Hat OpenStack within the Digital Service Node (dSN) infrastructure in Telefónica I+D

• History & Motivations

• How we did it

• Problems we found

What

What is OpenStack?

• OSS for creating private and public clouds

• Manages large pools of resources

• Compute, Storage, Networking

• User-friendly Web interface, REST APIs, CLI

• Cloud services: LBaaS, DNSaaS, FWaaS, DBaaS (Trove), BigData-aaS (Sahara), etc.

What is OpenStack?

• Reuses lots of OSS projects:

• Linux, KVM, Open vSwitch, MySQL, MongoDB, Apache, memcached, Python, etc.

• Builds missing pieces as OSS:

• Keystone, Glance, Cinder, Ceilometer, Nova, Neutron, etc.

Releases

• A new release every 6 months

• Development release (unreleased):

• Kilo

• Currently supported releases:

• Juno (stable, security fixes, 2014.2), Icehouse (security fixes only, 2014.1)

Who

Who makes OpenStack?

• Built by a thriving community of developers

• In collaboration with users

• Lots of contributors, like:

• AT&T, Canonical, Cisco, Dell, Ericsson, HP, IBM, Intel, Rackspace, Red Hat, NEC, NetApp, Novell, VMware, Yahoo!, etc.

Who uses OpenStack?

• World-wide:

• Cisco, CERN, PayPal, Wells Fargo, Wikimedia Foundation, SWITCH, Canonical, IBM, Intel, Mirantis, Rackspace, HP, Red Hat, SUSE, etc.

• Spain:

• Telefónica I+D, BTACTIC SCCL, Spanish National Research Council, BBVA

• Source: http://www.openstack.org/user-stories/

OpenStack in Telefónica’s dSN

Digital Service Node (dSN)

• Digital services from Telefónica deployed here

• Main datacenter in Madrid

• Most digital services from Telefónica have been migrated to the dSN during 2013

• New digital services developed within Telefónica I+D are deployed in the dSN

Why OpenStack

• IaaS platform on top of existing infrastructure

• Previous bets on other technologies failed:

• Joyent, CloudStack, vCloud Director, Tashi, …

• Aligned with Telefónica’s Technological Plan:

• Open & Open Source

• Strong industry support

Why OpenStack

• Well suited for DevOps:

• Automated, repeatable deployments, CI & CD

• OpenStack API allows developers, testers, integrators, deployment engineers, etc. to deploy software in a consistent manner

• Development, integration, pre-production and production
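As a rough illustration of the "consistent manner" point above, the sketch below builds the same Compute API "create server" request for any environment, changing only the endpoint. The endpoint URLs, tenant ID, image and flavor references are hypothetical placeholders; the slides do not list real values, and nothing is sent over the network.

```python
import json

# Hypothetical endpoints, one per environment (not taken from the slides).
ENDPOINTS = {
    "pre-production": "https://nova.pre.example.invalid:8774/v2",
    "production": "https://nova.pro.example.invalid:8774/v2",
}


def build_boot_request(environment, tenant_id, name, image_ref, flavor_ref):
    """Return (url, body) for a Compute v2 'create server' call.

    The request body is identical in every environment; only the URL differs.
    """
    url = "{0}/{1}/servers".format(ENDPOINTS[environment], tenant_id)
    body = json.dumps(
        {"server": {"name": name, "imageRef": image_ref, "flavorRef": flavor_ref}},
        sort_keys=True,
    )
    return url, body


if __name__ == "__main__":
    for env in sorted(ENDPOINTS):
        print(build_boot_request(env, "TENANT_ID", "web01", "IMAGE_ID", "FLAVOR_ID"))
```

Because the body does not depend on the environment, the same automation can target pre-production first and production later.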

Why Red Hat OpenStack

• Red Hat is a trusted partner and provider

• Red Hat Enterprise Linux is the reference OS within Telefónica’s Technological Plan for production services

• Strong commercial and technical support from Red Hat

• Helps us meet our SLAs

Why Red Hat OpenStack

• Maturity of OpenStack was in question:

• Telefónica relied on Red Hat’s support, technical know-how and expertise in Linux, OpenStack and OSS

• Red Hat Professional Services were key to deploying OpenStack within Telefónica’s dSN:

• Strong engineering and technical background

• Key contributor to OpenStack

• Backporting of fixes from newer releases

Initial requirements

• Two separate OpenStack clusters in the dSN:

• Pre-production: integration tests, load-tests, before digital services hit production

• Production: digital services in production

• Based on Red Hat OpenStack 4.0 (Havana)

• Meets reference architecture defined by Telefónica’s Cloud division

History

• PoC in June 2012: OpenStack vs. CloudStack

• First serious deployment in May 2013 using RHOS 3.0 (Grizzly)

• First production deployment in December 2013 using RHOS 3.0 (Grizzly)

• Second production deployment in February 2014 using RHOS 4.0 (Havana)

How we did it

• A team of several engineers from Telefónica I+D

• Plus professional services from Red Hat

• Analysed the Telefónica I+D requirements

• Created a deployment plan

• And executed it

Problems we faced

• Architecture

• Manual deployment

• Migration from Quantum to Neutron

• Manual workarounds and patches

Architecture

dSN peculiarities

• Multiple external networks (Internet, Management and Internodos):

• Each external network requires a dedicated Neutron L3 agent

• Keystone integration with dSN OpenLDAP

• Several layers of firewalls

• Hardware load-balancers

• NetApp storage (exposes Cinder API and NFS)
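The "one L3 agent per external network" constraint above can be sketched as one agent configuration file per network. This is an assumption-laden sketch, not the configuration used in the dSN: the option names (`gateway_external_network_id`, `external_network_bridge`, `handle_internal_only_routers`) are the ones commonly documented for the Havana-era L3 agent, while the network UUIDs and bridge names are hypothetical placeholders.

```python
import configparser
import io

# (network name from the slides, hypothetical network UUID, hypothetical bridge)
EXTERNAL_NETWORKS = [
    ("Internet", "<internet-net-uuid>", "br-ex-internet"),
    ("Management", "<management-net-uuid>", "br-ex-mgmt"),
    ("Internodos", "<internodos-net-uuid>", "br-ex-internodos"),
]


def render_l3_agent_ini(network_id, bridge, handle_internal_only_routers):
    """Render an l3_agent.ini fragment for a single external network."""
    parser = configparser.ConfigParser()
    parser["DEFAULT"] = {
        "gateway_external_network_id": network_id,
        "external_network_bridge": bridge,
        "handle_internal_only_routers": str(handle_internal_only_routers),
    }
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Only the first agent is allowed to host routers without an external gateway.
configs = {
    name: render_l3_agent_ini(net_id, bridge, index == 0)
    for index, (name, net_id, bridge) in enumerate(EXTERNAL_NETWORKS)
}

if __name__ == "__main__":
    for name, text in configs.items():
        print("# L3 agent for", name)
        print(text)
```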

Architecture

• Adapted and verified by Red Hat Professional Services to meet dSN peculiarities

• Reference HA architecture has changed since first deployment:

• Active-Active components not in Pacemaker

• Neutron is a SPoF and a bottleneck

High Availability

• Infrastructure is fully redundant:

• Firewalls, load-balancers, routers, etc.

• OpenStack API working in Active-Active mode

• MySQL, MongoDB, QPID, Neutron, Ceilometer, Heat deployed in Active-Passive mode and managed by Pacemaker

Pacemaker

• Initial HA deployment was suboptimal:

• Only Active-Passive services in Pacemaker

• Complex set of dependencies and constraints

• Non-critical services could prevent critical ones from failing over automatically (e.g. MongoDB)

Pacemaker’s failcount

• We didn’t know about it:

• Repeated start-up failures of a resource end up disabling it on that node (failcount set to INFINITY)

• Learned about it the hard way:

    # pcs resource failcount show mongoDB
    Failcounts for mongoDB
     10.26.238.227: INFINITY
    # crm_failcount -r mongoDB -G
    scope=status  name=fail-count-mongoDB value=INFINITY
    # crm_failcount -r mongoDB -U 10.26.238.227 -v 0
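A check like the following could flag this condition before it surprises anyone. It is a minimal sketch that parses text in the format shown in the command output above and lists the nodes where the failcount has reached INFINITY. The function names are hypothetical, and the only output layout it knows about is the one on this slide.

```python
def parse_failcounts(output):
    """Map node -> failcount from 'pcs resource failcount show' style output."""
    counts = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "Failcounts for <resource>" header.
        if ":" not in line or line.startswith("Failcounts for"):
            continue
        node, _, value = line.rpartition(":")
        counts[node.strip()] = value.strip()
    return counts


def blocked_nodes(output):
    """Return the nodes on which the resource can no longer start."""
    return sorted(
        node
        for node, value in parse_failcounts(output).items()
        if value.upper() == "INFINITY"
    )


SAMPLE = """Failcounts for mongoDB
 10.26.238.227: INFINITY
"""

if __name__ == "__main__":
    print(blocked_nodes(SAMPLE))
```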

Manual deployment

Deployment

• Automated installation via Satellite & Foreman:

• Unattended network-based installs (PXE, Kickstart)

• Integrates ISC DHCP, TFTP, Cobbler, Puppet, Pulp, Candlepin, etc.

• HA reference architecture not supported

• Custom Puppet classes developed specifically for installing OpenStack in Telefónica I+D

Compute nodes

• Completely automated

• Custom Puppet for deploying compute nodes:

• Satellite installs base OS over the network

• Foreman finishes configuration with Puppet

Controller nodes

• Mostly manual process

• Complex set up:

• SSL/TLS

• Active-Passive services using Pacemaker

• Active-Active services using HW LBs

Controller nodes

• Manual configuration:

• Configure DNS

• Generate SSL/TLS certificates

• Configure QPID SSL certificate store

• Configure OpenLDAP schema for Keystone

• Configure NFS

• Configure VLANs

• Configure FWs and LBs

• Satellite PXE-installs base OS

• Pacemaker configuration

• Keystone configuration

• Create basic tenants and roles

• Glance, Neutron, Nova, Cinder, Ceilometer, Heat, Horizon

Quantum to Neutron

Quantum to Neutron

• Quantum (Grizzly) to Neutron (Havana)

• Neutron configured in Active-Passive

• Multiple L3 agents

• Increased instability and complexity

Neutron HA

• Failovers disrupt connectivity:

• Neutron nodes have different MAC addresses

• Unicast vs. Multicast (not used)

• Have to update ARP tables in hosts and switches

• Network namespaces not properly cleaned up
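The ARP point above is commonly addressed with a gratuitous ARP: after a failover, the new active node broadcasts its own IP-to-MAC mapping so hosts and switches refresh their tables. The slides do not say how the dSN deployment updated ARP tables, so the sketch below only shows what such a frame looks like. It builds the bytes and does not send them (sending would need a raw socket and root privileges). The MAC and IP values are hypothetical.

```python
import struct


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def ip_to_bytes(ip):
    """Convert a dotted-quad IPv4 address to 4 raw bytes."""
    return bytes(int(part) for part in ip.split("."))


def gratuitous_arp_frame(mac, ip):
    """Build an Ethernet frame carrying a gratuitous ARP request for `ip`."""
    sender_mac = mac_to_bytes(mac)
    sender_ip = ip_to_bytes(ip)
    # Ethernet header: broadcast destination, our source MAC, EtherType ARP.
    ethernet = b"\xff" * 6 + sender_mac + struct.pack("!H", 0x0806)
    # ARP header: Ethernet, IPv4, 6-byte MACs, 4-byte IPs, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += sender_mac + sender_ip    # sender hardware and protocol address
    arp += b"\x00" * 6 + sender_ip   # target MAC unset, target IP = sender IP
    return ethernet + arp


if __name__ == "__main__":
    frame = gratuitous_arp_frame("fa:16:3e:00:00:01", "192.0.2.1")
    print(len(frame), frame.hex())
```

What makes the ARP gratuitous is that the sender and target IP are the same address: nobody is being asked anything, the node is only announcing itself.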

Kernel oopsen

• Kernel constantly logging oops messages due to Ethernet HW checksum problems

• Floods our logging facility

• Presumed incompatibility between kernel and our Cisco NIC drivers

• Workaround: disable hardware checksum offloading (HW CKSUM) on the affected interfaces

    Nov 2 03:06:41 esjc-ostt-cc05l kernel: qg-8b0190e0-cc: hw csum failure.
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: Pid: 0, comm: swapper Not tainted 2.6.32-504.el6.x86_64 #1
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: Call Trace:
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: <IRQ> [<ffffffff8145cd32>] ? netdev_rx_csum_fault+0x42/0x50
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81454660>] ? __skb_checksum_complete_head+0x60/0x70
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81454681>] ? __skb_checksum_complete+0x11/0x20
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff814df23d>] ? nf_ip_checksum+0x5d/0x130
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffffa0365f91>] ? udp_error+0xb1/0x1e0 [nf_conntrack]
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffffa027b652>] ? ovs_vport_send+0x22/0x90 [openvswitch]
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffffa0360028>] ? nf_conntrack_in+0x138/0xa00 [nf_conntrack]
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffffa027b6ee>] ? ovs_vport_receive+0x2e/0x30 [openvswitch]
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffffa01ce721>] ? ipv4_conntrack_in+0x21/0x30 [nf_conntrack_ipv4]
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8148bdc9>] ? nf_iterate+0x69/0xb0
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81496410>] ? ip_rcv_finish+0x0/0x440
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8148bf86>] ? nf_hook_slow+0x76/0x120
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81496410>] ? ip_rcv_finish+0x0/0x440
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81496ab4>] ? ip_rcv+0x264/0x350
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffffa027cb83>] ? ovs_netdev_frame_hook+0xb3/0x110 [openvswitch]
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8145c88b>] ? __netif_receive_skb+0x4ab/0x750
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8145cbca>] ? process_backlog+0x9a/0x100
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81462083>] ? net_rx_action+0x103/0x2f0
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8107d8b1>] ? __do_softirq+0xc1/0x1e0
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff810b034a>] ? tick_program_event+0x2a/0x30
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8100fc15>] ? do_softirq+0x65/0xa0
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8107d765>] ? irq_exit+0x85/0x90
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81533c0a>] ? smp_apic_timer_interrupt+0x4a/0x60
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: <EOI> [<ffffffff812ea5ee>] ? intel_idle+0xde/0x170
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff812ea5d1>] ? intel_idle+0xc1/0x170
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81425b97>] ? cpuidle_idle_call+0xa7/0x140
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8151061a>] ? rest_init+0x7a/0x80
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81c2af8f>] ? start_kernel+0x424/0x430
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81c2a33a>] ? x86_64_start_reservations+0x125/0x129
    Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81c2a453>] ? x86_64_start_kernel+0x115/0x124
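To judge how badly messages like these flood the logs, and which interfaces are affected, a small counter over the syslog lines is enough. The sketch below assumes the line format shown in the trace above. On Linux, offload settings are typically toggled with `ethtool -K <interface> rx off tx off`, but the slides do not state which exact setting was changed, so that command is mentioned only as the usual mechanism.

```python
import re
from collections import Counter

# Matches lines such as "... kernel: qg-8b0190e0-cc: hw csum failure."
CSUM_FAILURE = re.compile(r"kernel: (?P<iface>\S+): hw csum failure")


def count_csum_failures(lines):
    """Count 'hw csum failure' messages per network interface."""
    counts = Counter()
    for line in lines:
        match = CSUM_FAILURE.search(line)
        if match:
            counts[match.group("iface")] += 1
    return counts


SAMPLE = [
    "Nov 2 03:06:41 esjc-ostt-cc05l kernel: qg-8b0190e0-cc: hw csum failure.",
    "Nov 2 03:06:41 esjc-ostt-cc05l kernel: Call Trace:",
]

if __name__ == "__main__":
    for iface, total in count_csum_failures(SAMPLE).most_common():
        print(iface, total)
```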

Manual workarounds and patches

Keystone vs. UTF-8

• Doesn’t cope well with UTF-8 out of the box

• Had to manually patch source code:

• Not yet integrated upstream

• Patch has to be reapplied after upgrades

/usr/lib/python2.6/site-packages/keystone/__init__.py:

     import sys
    +import default_encoding_utf8
     import pkg_resources

     # If there is a conflicting non egg module,
     # i.e. an older standard system module installed,
     # then replace it with this requirement
     def replace_dist(requirement):
         …
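The contents of `default_encoding_utf8` are not shown in the slides, so the sketch below does not attempt to reproduce the patch. It only illustrates the class of failure involved: Python 2.6 uses ASCII as its default codec, so UTF-8 bytes that are decoded implicitly fail as soon as they contain a non-ASCII character. The helper name is hypothetical.

```python
# A name with a non-ASCII character, as could come from an LDAP directory.
NAME = "Telef\u00f3nica I+D"
RAW = NAME.encode("utf-8")


def decode_with(codec, data):
    """Decode bytes with the given codec, returning None on failure."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    # ASCII (the Python 2 default codec) cannot decode the accented letter.
    print("ascii:", decode_with("ascii", RAW))
    # UTF-8 round-trips the original text.
    print("utf-8:", decode_with("utf-8", RAW))
```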

MongoDB rc.d

• Script starts MongoDB

• Checks that it accepts queries by means of “mongostat”:

• But “mongostat” currently doesn’t work when authentication is mandatory

• Had to patch the rc.d script to comment out the invocation of “mongostat”

MongoDB rc.d

• Init script returns OK even if service is not fully operational:

• https://bugzilla.redhat.com/show_bug.cgi?id=1066408

• Latest errata from Red Hat introduced a syntax error: it uses [ ] for the comparison instead of [[ ]]:

• https://bugzilla.redhat.com/show_bug.cgi?id=1158076
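One way to avoid trusting the init script's return code alone is to probe the service after starting it. The sketch below is an assumption, not the fix tracked in the bug reports above: it only waits until the TCP port accepts connections, which is a weaker check than "accepts queries" but does not depend on `mongostat` or on credentials. The function names and retry values are hypothetical; 27017 is MongoDB's default port.

```python
import socket
import time


def accepts_connections(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(host, port, attempts=5, delay=0.2):
    """Poll the port a few times, returning True as soon as it accepts."""
    for _ in range(attempts):
        if accepts_connections(host, port):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_until_ready("127.0.0.1", 27017)
    print("MongoDB port ready" if ready else "MongoDB port not reachable")
```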

Poor SSL/TLS support

• At the time of our first deployments, SSL/TLS support was poor or missing

• Manual patches and workarounds deployed:

• /usr/lib/python2.6/site-packages/ceilometer/alarm/service.py

• /usr/lib/python2.6/site-packages/ceilometer/service.py

• /usr/lib/python2.6/site-packages/ceilometer/image/glance.py

• /usr/lib/python2.6/site-packages/neutron/agent/metadata/agent.py

Poor SSL/TLS support

• Updated package openstack-keystone, which failed to start when SSL was configured

• Updated package python-django-openstack-auth to allow user authentication when keystone is using SSL

• Updated package python-eventlet to allow access to Glance via HTTPS for several services

Present

Some numbers

• 5 OpenStack private clouds:

• 2 production private clouds, interconnected over a backbone network plus 1 testbed environment

• 1 development private cloud plus 1 testbed environment

Some numbers

• First support case was opened on 2013/02/05

• Since then, we have opened 149 support cases:

• 30 support cases in 2013

• 119 support cases in 2014

• Support case does not necessarily mean “bug”:

• Used for RFE, consulting, advice, etc.

Future

Improvements

• Converge with Red Hat HA reference architecture

• More coverage for our Puppet classes

• Less overlap between Satellite and Puppet

• Missing automated integration test suite:

• Rally, Tempest

Improvements

• Migrate to Icehouse or Juno

• Add a third production private cloud

Better Monitoring

• More visibility into network traffic (Neutron):

• Neutron API, Open vSwitch, netns, L3 agents, DHCP agents, metadata agent, etc.

• Better tools for tracing network packets

• Better instrumentation (latency, throughput)

How to test-drive OpenStack?

• Run it on your laptop or PC:

• DevStack, PackStack

• Run it on a cluster of machines:

• CentOS / Fedora, deploy with Foreman

• Ubuntu, deploy with MaaS, Juju

Credits

The End

Q&A