Upload
michele-orsi
View
243
Download
0
Embed Size (px)
Citation preview
Kubernetes and lastminute.com group: our course towards better scalability and processes
[email protected]@micheleorsi
Rome, 24-25 March 2017
A tech company to the core
Tech department: 300+ people
Applications: ~100
Database: 4 TB data
Servers: 1400 VMs, 300 physical machines
Locations: Chiasso, Milan, Madrid, London, Bengaluru
3
https://www.pexels.com/photo/gray-pebbles-with-green-grass-51168/
"... let’s break into microservices"
A lot of issues
● LONG provisioning time
● LACK OF alignment across environments
● LACK OF alignment across applications
● LACK OF awareness about ops (monitoring, alerting)
7
An year-long endeavour
● build a new, modern infrastructure
● migrate the search (flight/hotel) product there
... without:
● impacting the business● throwing away our whole datacenter
8
Our plan
● same architecture across environments
● a common framework to align software
● centralized monitoring/logging, with alerts
● zero downtime deployment
● automation everywhere
9
How? Teams and peopleNew teams
https://www.pexels.com/photo/blue-lego-toy-beside-orange-and-white-lego-toy-standing-during-daytime-105822/
Our infrastructure and technologyOur infrastructure and technology
https://www.pexels.com/photo/colorful-toothed-wheels-171198/
Docker containers
registry.intra/application:v2-090025032017
BASE OS
JAVA JRE
START/STOP SCRIPTS
JAR APPLICATION
● build once, run everywhere
● externalised configuration
12
Kubernetes
● independent from OS/hosts
● isolated env, managed at scale
● self-healing
● externalised configuration
Omega paper: http://research.google.com/pubs/pub41684.html
13
Kubernetes: physical representation
NODE1
cluster
NODE2
NODE70
...
K8S
DOCKER
FLAN
NE
LD
ET
CD
Ubuntu
K8S
DOCKER
FLAN
NE
LD
ET
CD
Ubuntu
K8S
DOCKER
FLAN
NE
LD
ET
CD
Ubuntu
15
Kubernetes: logical representation
NAMESPACE1CPU 10
MEM 40GB
cluster
NAMESPACE2CPU 20
MEM 80GB
NAMESPACE3CPU 80
MEM 90GB NAMESPACE4CPU 100
MEM 10GB
16
APP3-PRODUCTION
Kubernetes: our architecture
APP2-PRODUCTIONAPP1-PRODUCTION
APP3-PRODUCTIONAPP2-PRODUCTION
APP1-PREVIEW
APP3-PRODUCTIONAPP2-PRODUCTION
APP1-DEVELOPMENT
APP3-PRODUCTIONAPP2-PRODUCTION
APP1-QA
nonproductionproduction
17
APP1-PRODUCTION
Kubernetes: our architecture and choices
POD
collectd
production
applicationfluentdcarbon
18
APP1-PRODUCTION
POD
Monitoring and alerting: grafana + graphite
cluster
graphiteapplication
Grafana 4
icons from http://www.flaticon.com
collectd
carbon
19
Kubernetes: our architecture and choices
APP1-PRODUCTION
deployment
replica-set
app1.lastminute.intra
secret configmap
POD3
POD2
POD1
production
20
Kubernetes: what’s left outside?
● datastores
○ DBs
○ logs
○ metrics
● distributed caches
● distributed locking
● pub-sub
21
Self-healing
ref: https://technologyconversations.com/2016/01/26/self-healing-systems
application
I am fine, thanks
Hey, how are you?
Hey, how are you?
I have problems
23
Kubernetes contract
"When a container is dead I will restart it"
"When a container is ready I will forward traffic to it"
Kubernetes probes: liveness & readiness
Two questions:
● when can I consider my container alive?
● when can I consider my container ready to receive traffic?
spec: containers: livenessProbe: httpGet: path: /liveness
readinessProbe: httpGet: path: /readiness
deployment.yaml
/liveness:
● when tomcat container is up● when ratio active/max threads < threshold
/readiness:
● all the startup jobs have run
.. ongoing never-ending research ..
Our choices: framework - k8s
26
● zero downtime during rollout
● resilience improved
● legacy infrastructure to the rescue in case of problem
2nd try (with production traffic)
27
● temporary team focus on objective
● automation
● Go deeper in docker/kubernetes
Another improvement step
30
Pipeline: a huge step forward
microservice = factory.newDeployRequest().withArtifact("com.lastminute.application1",2)
lmn_deployCanaryStrategy(microservice,"qa") lmn_deployCanaryStrategy(microservice,"preview")lmn_deployCanaryStrategy(microservice,"production")
pipeline
31
Pipeline: a huge step forward
● git push○ continuous integration○ continuous delivery
pulljar
builddocker
(gate)
QAcanary
(gate)
QAstable
(gate)
PREVcanary
(gate)
PREVstable
(gate)
PRODcanary
(gate)
PRODstable
32
nginx ingress controller problem
NGINX
NGINX
NGINX
LB
10.0.0.1
10.0.0.2
10.0.0.3
10.0.0.4
10.0.0.5
10.0.0.6
NGINX
NGINX
NGINX
NGINX
NGINX
34
There’s light .. There’s a light .. at the end
https://www.pexels.com/photo/grayscale-photography-of-person-at-the-end-of-tunnel-211816/
● lead and migration time
● resilience
● root cause analysis
● speed of deployment
● instant and easy scaling
... benefits
36
● 70 physical nodes, 1300 pods, 5200 containers● 20k req/sec in the new cluster● 35 micro-services migrated in 6 months● 10 minutes to create a new environment ● whole pipeline runs in 16 minutes
○ 4 minutes to release 100 instances of a new version● 2M metrics/minute flows
Give me the numbers!
37