michele-orsi
Started with a monolith ...
https://www.flickr.com/photos/southtopia/5702790189
https://www.pexels.com/photo/gray-pebbles-with-green-grass-51168/
... broken into microservices
Micro-problems at scale
● alignment
● real pipelines
● infrastructure
● resilience
● monitoring
● constraints
A year-long endeavour
● build a new, modern infrastructure
● migrate the search (flight/hotel) product there
... without:
● impacting the business
● throwing away our whole datacenter
How we did that: technology
● company framework
● docker
● kubernetes
How we did that: team/people
https://www.pexels.com/photo/blue-lego-toy-beside-orange-and-white-lego-toy-standing-during-daytime-105822/
Kubernetes: our architecture
[Diagram: two clusters. production: APP1-PRODUCTION, APP2-PRODUCTION, APP3-PRODUCTION. nonproduction: APP1-PREVIEW, APP1-DEVELOPMENT, APP1-QA, APP1-STRESSTEST, each stacked with boxes for APP2 and APP3.]
Kubernetes: our architecture
[Diagram: production cluster, APP1-PRODUCTION: deployment, replica-set, POD1, POD2, POD3]
Kubernetes: our architecture
[Diagram: production cluster, APP1-PRODUCTION: deployment, replica-set, secret, configmap, POD1, POD2, POD3]
Kubernetes: our architecture
[Diagram: production cluster, APP1-PRODUCTION: ingress (path: app1-production.prd.lmn.intra), deployment, replica-set, secret, configmap, POD1, POD2, POD3]
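The objects named on these slides can be sketched as plain manifests. The following is a minimal illustration, not the team's actual configuration: the namespace layout, image name, Service name and the `-config`/`-secret` object names are hypothetical, and the `apps/v1` and `networking.k8s.io/v1` API versions are today's stable ones rather than whatever the cluster ran at the time. The slide labels the ingress entry as a path, but it looks like a host name, so it is used as the Ingress host here.

```python
import json

APP = "app1"                              # hypothetical application name
NAMESPACE = "app1-production"             # assumption: one namespace per app and environment
HOST = "app1-production.prd.lmn.intra"    # taken from the slide


def deployment(replicas=3):
    """Deployment that keeps `replicas` pods running via a replica set."""
    labels = {"app": APP}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": APP, "namespace": NAMESPACE},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": APP,
                        "image": "registry.example/app1:2",  # hypothetical image
                        # Configuration and credentials injected as environment variables.
                        "envFrom": [
                            {"configMapRef": {"name": APP + "-config"}},
                            {"secretRef": {"name": APP + "-secret"}},
                        ],
                    }]
                },
            },
        },
    }


def ingress():
    """Ingress that routes the internal host name to a Service (hypothetical) for the app."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": APP, "namespace": NAMESPACE},
        "spec": {
            "rules": [{
                "host": HOST,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": APP, "port": {"number": 80}}},
                }]},
            }]
        },
    }


if __name__ == "__main__":
    for manifest in (deployment(), ingress()):
        print(json.dumps(manifest, indent=2))
```

Building the manifests as data keeps the sketch checkable without a cluster; the replica count of 3 matches POD1 to POD3 in the diagram.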
Kubernetes: our architecture
[Diagram: an F5 in front of the cluster, three nginx-ingress-ctrl instances on port 80, and the APP1-PRODUCTION PODs at 10.0.0.1 to 10.0.0.6]
Kubernetes: our architecture
[Diagram: a POD in the production cluster, shown with application, collectd and fluentd]
Self-healing: our choice for resilience
/liveness:
● when tomcat container is up
● when "active/max" threads < threshold
/readiness:
● all the startup jobs have run
● no termination request has been received
... ongoing, never-ending research ...
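The two probe rules above can be written down as small predicates. This is a sketch under assumptions: the `AppState` fields, the 0.9 threshold and the mapping to 200/503 are illustrative choices, not the values used in the real services. Kubernetes HTTP probes treat any status from 200 up to (but not including) 400 as success, so 503 is one conventional way to signal failure.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    """Hypothetical snapshot of what the probe endpoints would inspect."""
    container_up: bool
    active_threads: int
    max_threads: int
    startup_jobs_done: bool
    termination_requested: bool


def is_live(state, threshold=0.9):
    """Live when the container is up and the active/max thread ratio is below the threshold."""
    if not state.container_up or state.max_threads <= 0:
        return False
    return state.active_threads / state.max_threads < threshold


def is_ready(state):
    """Ready when all startup jobs have run and no termination request has arrived."""
    return state.startup_jobs_done and not state.termination_requested


def http_status(ok):
    """Status code a /liveness or /readiness endpoint could return."""
    return 200 if ok else 503


if __name__ == "__main__":
    state = AppState(True, 50, 200, True, False)
    print(http_status(is_live(state)), http_status(is_ready(state)))
```

Keeping the two checks separate matters: a failing liveness probe gets the container restarted, while a failing readiness probe only takes the pod out of rotation, which is what a pod that has received a termination request needs.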
Kubernetes: what’s left outside?
● datastores
● distributed caches (early 2017)
● distributed locking
● pub-sub/queues
● logs and metrics storage
When can you test with production traffic?
● zero downtime during rollout
● monitoring in place
● alerting
● centralized logging
● legacy infrastructure to the rescue in case of problems
... failure ... at all different levels ..
https://www.flickr.com/photos/ghost_of_kuji/2763674926
Main problems
● configuration
● infrastructure
● tools
● manual mistakes
● (external) scalability
There’s light .. at the end
https://www.pexels.com/photo/grayscale-photography-of-person-at-the-end-of-tunnel-211816/
Pipeline: a huge step forward
microservice = factory.newDeployRequest().withArtifact("com.lastminute.application1", 2)
lmn_deployCanaryStrategy(microservice, "qa")
lmn_deployStableStrategy(microservice, "preview")
lmn_deployCanaryStrategy(microservice, "production")
pipeline
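The slides do not show what `lmn_deployCanaryStrategy` and `lmn_deployStableStrategy` do internally. As a rough illustration of the usual difference between the two ideas, the sketch below upgrades a small canary subset first and rolls back if it fails a health check, while the stable variant upgrades everything at once. All names here (`deploy_canary`, `deploy_stable`, `is_healthy`, the instance dictionary) are hypothetical.

```python
def deploy_canary(instances, new_version, is_healthy, canary_count=1):
    """Upgrade a few instances first; continue only if they pass the health check.

    `instances` maps instance name to running version and is updated in place.
    Returns True when the rollout completed, False when it was rolled back.
    """
    previous = dict(instances)
    names = list(instances)
    canaries, rest = names[:canary_count], names[canary_count:]

    for name in canaries:
        instances[name] = new_version

    if not all(is_healthy(name, new_version) for name in canaries):
        instances.update(previous)  # roll the canaries back
        return False

    for name in rest:
        instances[name] = new_version
    return True


def deploy_stable(instances, new_version):
    """Upgrade every instance without a canary phase."""
    for name in instances:
        instances[name] = new_version
    return True


if __name__ == "__main__":
    pods = {"pod1": 1, "pod2": 1, "pod3": 1}
    ok = deploy_canary(pods, 2, lambda name, version: True)
    print(ok, pods)
```

A real pipeline would also wait for readiness between steps and watch metrics for a while before promoting the canary; the sketch leaves that out to keep the control flow visible.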
Monitoring: grafana/graphite/nagios
[Diagram: cluster with an APP1-PRODUCTION POD (application, collectd), plus graphite, Grafana and nagios]
icons from http://www.flaticon.com
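For readers unfamiliar with how metrics reach graphite, here is a small sketch of Graphite's plaintext protocol, in which each metric is one line of the form `path value timestamp`. The metric path and host name are made up for illustration, port 2003 is Graphite's default plaintext port, and the slides do not say which transport or naming scheme the team's collectd setup actually used.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(host, path, value, port=2003, timeout=2.0):
    """Send a single metric over TCP to a carbon receiver (host is an assumption)."""
    line = graphite_line(path, value)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical metric path; nothing is sent over the network here.
    print(graphite_line("app1.production.pod1.threads.active", 42, 1500000000), end="")
```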
... benefits
● lead and migration time
● resilience
● root cause analysis
● speed of deployment
● instant scaling
Give me the numbers!
● 36 bare-metal nodes (only for production cluster)
● 5100 req/sec in the new cluster
● 2M metrics/minute flows
● 35 micro-services migrated in 5 months
  ○ 3 new micro-services migrated per week
  ○ 10 minutes to create a new environment
● 11 min to roll-out a new version with 55 instances
  ○ whole pipeline runs in 16 min
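To put the figures above in per-unit terms, the arithmetic below derives a few averages. It assumes, purely for illustration, that the 5100 req/sec are spread evenly over the 36 production nodes and that the 11-minute rollout is divided evenly across the 55 instances; the slides do not state either.

```python
nodes = 36
req_per_sec = 5100
metrics_per_min = 2_000_000
rollout_min, instances = 11, 55
services, months = 35, 5

req_per_node = req_per_sec / nodes                   # average load per node
metrics_per_sec = metrics_per_min / 60               # metric flow per second
sec_per_instance = rollout_min * 60 / instances      # rollout time per instance
services_per_month = services / months               # average migration pace

print(f"{req_per_node:.1f} req/sec per node")            # 141.7 req/sec per node
print(f"{metrics_per_sec:.0f} metrics/sec")              # 33333 metrics/sec
print(f"{sec_per_instance:.0f} s per instance")          # 12 s per instance
print(f"{services_per_month:.0f} services per month")    # 7 services per month
```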
Yes, we’re hiring!
THANKS
www.lastminutegroup.com