Kubernetes to scale

[email protected] @micheleorsi

GDG Cloud - London, 11 January 2017

Started with a monolith ...

https://www.flickr.com/photos/southtopia/5702790189

https://www.pexels.com/photo/gray-pebbles-with-green-grass-51168/

... broken into microservices

Micro-problems at scale

● alignment

● real pipelines

● infrastructure

● resilience

● monitoring

● constraints

A year-long endeavour

● build a new, modern infrastructure

● migrate the search (flight/hotel) product there

... without:

● impacting the business

● throwing away our whole datacenter

How we did that: technology

● company framework

● docker

● kubernetes

How we did that: teams and people

https://www.pexels.com/photo/blue-lego-toy-beside-orange-and-white-lego-toy-standing-during-daytime-105822/

Kubernetes: our architecture

[Diagram: two groups of environments. production: APP1-PRODUCTION, APP2-PRODUCTION, APP3-PRODUCTION. nonproduction: APP1-PREVIEW, APP1-DEVELOPMENT, APP1-QA, APP1-STRESSTEST.]


[Diagram, built up over three slides: APP1-PRODUCTION in production contains a deployment, its replica-set and POD1, POD2, POD3; then a secret and a configmap; then an (ingress) with path: app1-production.prd.lmn.intra.]
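The objects in the diagram above (a deployment that owns a replica-set of three pods, fed by a configmap and a secret) can be sketched as a manifest. This is a minimal illustration, not our actual configuration: the image name, the label keys and the `-config`/`-secret` suffixes are hypothetical, and `apps/v1` is the current API group (clusters from early 2017 used a beta group instead).

```python
import json


def build_deployment(app: str, env: str, image: str, replicas: int = 3) -> dict:
    """Return a Kubernetes Deployment manifest as a plain dict.

    The naming scheme (<app>-<env>) mirrors the slides; the configmap and
    secret names are hypothetical placeholders.
    """
    name = f"{app}-{env}"
    labels = {"app": app, "env": env}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # The Deployment creates a ReplicaSet that keeps this many pods alive.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": app,
                            "image": image,
                            # Configuration and credentials are injected as
                            # environment variables from these two objects.
                            "envFrom": [
                                {"configMapRef": {"name": f"{name}-config"}},
                                {"secretRef": {"name": f"{name}-secret"}},
                            ],
                        }
                    ]
                },
            },
        },
    }


manifest = build_deployment("app1", "production", "registry.example/app1:2")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`, which accepts JSON as well as YAML.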


[Diagram: the cluster, with F5, nginx-ingress-ctrl: 80 and PODs at 10.0.0.1 to 10.0.0.6.]

[Diagram: an APP1-PRODUCTION POD in production running the application together with collectd and fluentd.]

Self-healing: our choice for resilience

/liveness:

● when tomcat container is up

● when “active/max” threads < threshold

/readiness:

● all the startup jobs have run

● no termination request has been received

.. ongoing never-ending research ..
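The two health endpoints can be read as two small predicates. The sketch below is one possible interpretation, assuming that both liveness conditions must hold at once; the `AppState` type, its field names and the 0.9 threshold are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    """Snapshot of what the application knows about itself (hypothetical)."""
    container_up: bool
    active_threads: int
    max_threads: int
    startup_jobs_done: bool
    termination_requested: bool


def is_live(state: AppState, threshold: float = 0.9) -> bool:
    """/liveness: container is up and the active/max thread ratio is below the threshold."""
    if not state.container_up or state.max_threads <= 0:
        return False
    return state.active_threads / state.max_threads < threshold


def is_ready(state: AppState) -> bool:
    """/readiness: startup jobs have run and no termination request has arrived."""
    return state.startup_jobs_done and not state.termination_requested


healthy = AppState(True, 20, 200, True, False)
saturated = AppState(True, 195, 200, True, False)
draining = AppState(True, 20, 200, True, True)

for label, state in [("healthy", healthy), ("saturated", saturated), ("draining", draining)]:
    print(label, "live:", is_live(state), "ready:", is_ready(state))
```

Keeping the two checks separate matters: a failing liveness check makes Kubernetes restart the container, while a failing readiness check only takes the pod out of the load-balancing pool, which is what a pod that is shutting down needs.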

Kubernetes: what’s left outside?

● datastores

● distributed caches (early 2017)

● distributed locking

● pub-sub/queues

● logs and metrics storage

When can you test with production traffic?

● zero downtime during rollout

● monitoring in place

● alerting

● centralized logging

● legacy infrastructure to the rescue in case of problems

... failure ... at all different levels ..

https://www.flickr.com/photos/ghost_of_kuji/2763674926

Main problems

● configuration

● infrastructure

● tools

● manual mistakes

● (external) scalability

There’s light .. at the end

https://www.pexels.com/photo/grayscale-photography-of-person-at-the-end-of-tunnel-211816/

Pipeline: a huge step forward

microservice = factory.newDeployRequest().withArtifact("com.lastminute.application1", 2)
lmn_deployCanaryStrategy(microservice, "qa")
lmn_deployStableStrategy(microservice, "preview")
lmn_deployCanaryStrategy(microservice, "production")

pipeline

Monitoring: grafana/graphite/nagios

[Diagram: in the cluster, an APP1-PRODUCTION POD runs the application alongside collectd; graphite, Grafana and nagios complete the monitoring stack.]

icons from http://www.flaticon.com

... benefits

● lead and migration time

● resilience

● root cause analysis

● speed of deployment

● instant scaling

Give me the numbers!

● 36 bare-metal nodes (only for production cluster)

● 5100 req/sec in the new cluster

● 2M metrics/minute flows

● 35 micro-services migrated in 5 months

○ 3 new micro-services migrated per week

○ 10 minutes to create a new environment

● 11 min to roll-out a new version with 55 instances

○ whole pipeline runs in 16 min
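A few derived figures help put these numbers in perspective. The arithmetic below uses only the values on the slide; the per-node and per-instance results are plain averages and assume, purely for illustration, that traffic is spread evenly over all 36 nodes and that rollout time is shared equally by the 55 instances.

```python
# Figures taken from the slide.
NODES = 36
REQ_PER_SEC = 5100
METRICS_PER_MIN = 2_000_000
ROLLOUT_MIN = 11
INSTANCES = 55
PIPELINE_MIN = 16

# Average load per node, assuming an even spread (an assumption, not a measurement).
req_per_node = REQ_PER_SEC / NODES

# Metrics ingestion rate expressed per second.
metrics_per_sec = METRICS_PER_MIN / 60

# Average rollout time per instance, in seconds.
sec_per_instance = ROLLOUT_MIN * 60 / INSTANCES

# Share of the whole pipeline run that is spent rolling out.
rollout_share = ROLLOUT_MIN / PIPELINE_MIN

print(f"requests/sec per node: {req_per_node:.1f}")
print(f"metrics/sec:           {metrics_per_sec:.0f}")
print(f"seconds per instance:  {sec_per_instance:.1f}")
print(f"rollout share of run:  {rollout_share:.0%}")
```

On these averages, each node serves roughly 142 requests per second and a new version reaches one more instance about every 12 seconds.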

Yes, we’re hiring!

THANKS

www.lastminutegroup.com
