Providing resilient and highly-available IT services at ... · Providing resilient and highly-available IT services at the LHCb Experiment. About Me Staff DevOps Engineer/Systems

Hristo Mohamed, IT Infrastructure & Operations Conference, 30 May Athens

Providing resilient and highly-available IT services at the LHCb Experiment

About Me

Staff DevOps Engineer/Systems Administrator at CERN

• Passionate about OSS
• Likes running the latest thing in production
• Gets easily hyped over hipster technology stacks
• Likes talking at conferences
• Also passionate about Roman history

About CERN

CERN - European Organization for Nuclear Research (Conseil européen pour la recherche nucléaire)

• Worldwide collaboration
• 23 Member States
• 2 Associate Member States in the pre-stage to Membership
• 5 Associate Member States
• Observers: the European Union, Japan, JINR, the Russian Federation, UNESCO and the United States of America
• ~2500 staff members
• ~17 500 collaborators
• Birthplace of the World Wide Web
• CernVM File System

Image Wikipedia, Kohonet, CC BY 3.0

27 km circumference circular tunnel

9593 Helium Magnets

Main magnets colder than space: -271.3°C

11245 turns per second

1 billion collisions per second

About LHCb

Over 1000 scientists from over 100 universities worldwide

1000 sensors, each outputting data every 25 ns

About LHCb Online

The Glorious Past

• Over 4000 network-connected devices
• Over 1800 physical machines
• Over 500 virtual machines
• Various and exotic custom-built electronics
• Traditional data center approach

The Bright Future

• Over 8000 network-connected devices
• Over 4000 physical machines
• Over 800 virtual machines (it is 2019, enough resource wasting)
• Self-hosted cloud-native solutions
• Data center in containers

About LHCb Online

• 130 racks, 5200U, 40mW data center
• Running mostly on open-source software
• 95% of our infrastructure runs on CERN CentOS 7
• Windows for a few development cases and exotic electronics
• FreeBSD NFS file servers
• Puppet, Foreman, KVM, Ceph, Kubernetes, Drone.io, Prometheus, Grafana, Icinga2, Elasticsearch, MCollective, Docker, Telegraf, InfluxDB

Image Wikipedia, Larry Ewing, Simon Budig, Garrett LeSage , CC0

Before I start

• SA = System Administrator
• HA = High Availability
• DS = Distributed System
• DNS = Domain NoResponse Service (who noticed the joke?)
• Operations Team = your System Administrators, DevOps Engineers and Site Reliability Engineers

Goals as IT professionals

• Availability
• Reliability
• Durability
• Safety
• Security
• Maintainability

The dreams of everyone in Operations

• 99.999999999999999999% availability
• Homogeneous system with 100% debugged and compatible kernels, drivers and firmware
• Cheap continuous integration
• Even cheaper continuous delivery
• Zero human-related outages
• Easy distributed systems
• Even easier load balancing
• Ah yes, and if possible all of the above for the price of the first car you ever bought

Errare humanum est, sed perseverare diabolicum (To err is human, but to persist in error is diabolical)

• Humans cause the most outages
• Improving your human factor:
• Promote and encourage a blame-free culture
• Encourage and reward postmortems
• Document, document, document
• Have engineers do rewarding work
• Discourage repetitive tasks
• Invest in training

Image Wikipedia, CC BY-SA 3.0

Embracing Heterogeneity

Homogeneous infrastructure looks attractive:

• Lower SA and operational effort
• Streamlined purchasing and high discounts, who doesn't love that?
• Until you start hitting bizarre kernel bugs affecting ALL of your infrastructure and, as an added bonus, AT THE SAME OR A SIMILAR TIME.

Image Wikipedia, Banej, CC BY-SA 3.0

Heterogeneity is cheaper long term

• For how long will prices stay the same?
• Start with heterogeneity in mind from the beginning and plan accordingly
• Avoid the pernicious and costly single-vendor lock-in
• Operational staff are at ease with multiple hardware vendors
• Procurement staff deal with multiple vendors
• Configuration management code is ready from day 1 to handle multiple cases
• Figure out a common metric to classify machines (2018 hardware computes with MetricName=5, 2017 hardware computes with MetricName=3) for your schedulers/load balancers
• Just don't forget to document all those hardware quirks :)
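One way to feed such a common metric into a scheduler or load balancer is to hand out work in proportion to each machine's score. The sketch below is only an illustration: the machine names and the `MACHINES` table are hypothetical, and the scores 5 and 3 simply mirror the MetricName example above.

```python
import random

# Hypothetical inventory: machine name -> common compute score
# (5 for 2018 hardware, 3 for 2017 hardware, as in the example above).
MACHINES = {
    "node-2018-a": 5,
    "node-2018-b": 5,
    "node-2017-a": 3,
}


def share_of_work(machines):
    """Return the fraction of work each machine should get, proportional to its score."""
    total = sum(machines.values())
    return {name: score / total for name, score in machines.items()}


def pick_machine(machines, rng=random):
    """Pick one machine at random, weighted by its score."""
    names = list(machines)
    weights = [machines[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

With these scores the 2017 machine receives 3/13 of the work and each 2018 machine 5/13, regardless of which vendor built them.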

Running Open-source software

• Evaluate your costs accordingly
• OSS can end up costing more in engineering time
• Choose software with an appropriate license to avoid issues in the future
• Wait for a community to develop, or develop one yourself
• Critically review whether in-house development is justified (don't forget the engineering time spent on maintenance)
• Try to adapt existing solutions first

Don’t be afraid of experimenting

• For our new data center we decided against a traditional building in favor of a modular solution
• Lowest PUE (when using free cooling)
• Easy to reuse and can be resold
• Lower or same risk as a brick-and-mortar data center
• Ready to use (provided the foundation and power are already in place)

Let’s talk about high availability

Availability does not scale linearly. Neither does its cost. Nor does its complexity.

Availability   Downtime per year   Downtime per month   Downtime per week   Downtime per day   Downtime per hour
99%            3.65 days           7.2 hours            1.68 hours          14.4 minutes       36 seconds
99.9%          8.76 hours          43.2 minutes         10.1 minutes        1.44 minutes       3.6 seconds
99.99%         52.6 minutes        4.32 minutes         60.5 seconds        8.64 seconds       0.36 seconds
99.999%        5.26 minutes        25.9 seconds         6.05 seconds        0.87 seconds       0.04 seconds
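The downtime budgets follow from simple arithmetic: allowed downtime is the length of the period multiplied by the unavailable fraction. A minimal sketch, assuming a 365-day year and a 30-day month (the function and constant names are illustrative, not from the talk):

```python
# Period lengths in seconds; a 30-day month and 365-day year are assumptions.
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
    "day": 24 * 3600,
    "hour": 3600,
}


def allowed_downtime_seconds(availability_percent, period):
    """Return the downtime budget in seconds for a given availability and period."""
    unavailable_fraction = 1 - availability_percent / 100
    return PERIOD_SECONDS[period] * unavailable_fraction


if __name__ == "__main__":
    for availability in (99, 99.9, 99.99, 99.999):
        budget = allowed_downtime_seconds(availability, "year")
        print(f"{availability}%: {budget / 60:.2f} minutes per year")
```

Each additional nine divides the budget by ten, which is why the cost and complexity of the last few nines grow so quickly.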

Do you really need 99.9%?

• Unrealistic HA cripples innovation
• Those you rely on will probably not have 99.99% reliability
• Unrealistic HA can cost more than it is worth (especially deep into the .9s)
• Did I already mention HA costs a lot? (Engineers are not cheap)
• Not even hospitals shoot for 99.9999% availability on every service
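The point about dependencies can be made concrete. If a service needs every one of its dependencies to be up, and their failures are assumed to be independent, its best-case availability is the product of theirs. A minimal sketch under that independence assumption (the function name is illustrative):

```python
from math import prod


def combined_availability(availabilities):
    """Best-case availability of a service that needs all dependencies up.

    Assumes failures are independent; each value is a fraction such as 0.999.
    """
    return prod(availabilities)


if __name__ == "__main__":
    # Three dependencies at 99.9% each already pull the service below 99.9%.
    print(f"{combined_availability([0.999, 0.999, 0.999]):.6f}")
```

So promising more nines than the services you rely on can deliver is not realistic without adding redundancy around them.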

Increasing HA - Some things are cheap(ish)

• Did I mention documenting? Probably.
• Know the quirks of your public cloud provider
• Have a good dependency overview of your services

Or even worse...

Service Mesh

• Not only useful for running (containerized) microservice workloads
• WILL help you identify dependencies (and whether it is worth resolving and decoupling them)
• Might help you find resource dependencies
• Will help new hires understand a complex system faster
• Will help you define, enforce and monitor proper latency deadline propagation
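Deadline propagation means that the time budget set at the edge travels with the request, so downstream services stop working on a request nobody is waiting for anymore. A service mesh can enforce this at the network layer; the sketch below only illustrates the idea in-process, and the `Deadline` class and the `frontend`/`backend` functions are hypothetical.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the request's time budget has run out."""


class Deadline:
    """An absolute deadline that is passed along the whole call chain."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_seconds

    def remaining(self):
        """Seconds left before the deadline (negative once expired)."""
        return self._expires_at - self._clock()

    def check(self):
        """Fail fast instead of doing work whose result will be discarded."""
        if self.remaining() <= 0:
            raise DeadlineExceeded("request deadline exceeded")


def backend(deadline):
    deadline.check()  # the backend honors the caller's budget, not its own
    return "result"


def frontend(deadline):
    deadline.check()
    return backend(deadline)  # same deadline object is propagated downstream
```

Passing an absolute deadline rather than a fresh per-hop timeout keeps the total latency bounded no matter how many services the request crosses.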

Complex systems will fail

• And not in an expected way
• And you have to be prepared for it
• Allow systems to enter a "lame duck state" before an actual overload
• Learn to serve degraded results instead of failing completely
• Instrument edge (ideally all) services for load shedding
• Capacity planning and performance testing go hand in hand
• Often test a DS past its breaking point and beyond: https://github.com/asatarin/testing-distributed-systems
• Try Chaos engineering? ;-) https://github.com/Netflix/chaosmonkey
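A rough sketch of how lame duck state and load shedding can fit together: the server starts refusing non-critical work at a threshold below its real capacity, and refuses everything once capacity is reached. The `Server` class, the 80% threshold and the `critical` flag are assumptions chosen for illustration, not a description of the LHCb setup.

```python
class Server:
    """Toy admission control with an early lame duck state."""

    def __init__(self, capacity, lame_duck_ratio=0.8):
        self.capacity = capacity
        # Enter lame duck state before the real limit is hit.
        self.lame_duck_threshold = int(capacity * lame_duck_ratio)
        self.in_flight = 0

    def state(self):
        if self.in_flight >= self.capacity:
            return "overloaded"
        if self.in_flight >= self.lame_duck_threshold:
            return "lame_duck"
        return "healthy"

    def try_accept(self, critical=False):
        """Admit a request, shedding non-critical load in lame duck state."""
        state = self.state()
        if state == "overloaded":
            return False
        if state == "lame_duck" and not critical:
            return False
        self.in_flight += 1
        return True

    def finish(self):
        """Mark one in-flight request as done."""
        self.in_flight = max(0, self.in_flight - 1)
```

Exposing `state()` to the load balancer lets clients move traffic elsewhere while the server still has headroom to finish what it already accepted.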

Working in degraded state

• Enter lame duck state early
• It is always better to return a small/bad response than no response
• Keep a cache of frequently computed queries
• Serve degraded results
• Plan for antagonistic neighbors
• Plan capacity knowing that an automatic restart might take more resources than normal runtime
• During planned/unplanned downtime, use CPU time as much as possible for secondary/unplanned tasks
• Try Chaos engineering? ;-) https://github.com/Netflix/chaosmonkey
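Serving degraded results from a cache of frequent queries could look roughly like the sketch below. The `answer` function, the dictionary cache and the `degraded` flag are illustrative assumptions; a real service would also bound the cache and expire stale entries.

```python
def answer(query, compute, cache):
    """Return a fresh result when possible, otherwise a flagged stale one.

    `compute` is any callable that may raise when the backend is unhealthy;
    `cache` is a plain dict of previously computed results.
    """
    try:
        result = compute(query)
    except Exception:
        # Degraded path: a stale or empty answer beats no answer at all.
        return {"result": cache.get(query), "degraded": True}
    cache[query] = result
    return {"result": result, "degraded": False}
```

Marking the response as degraded lets callers decide whether a stale answer is good enough for their purpose.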

On deployments

• Use red/black deployments
• Use deployment windows
• Engineer confidence should be backed up by tests and formal verification
• Procedural correctness should be backed up by the same
• Avoid interventions before off days (weekends, national holidays, etc.)
• Did I mention how we formatted half our hypervisors on a Friday by not following this mantra?
• Ensure any automatic pipeline can run unattended
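The rule about off days is easy to encode as a gate in a deployment pipeline. A minimal sketch, assuming weekends are Saturday and Sunday and that holidays are supplied by the caller (the function name and the holiday set are hypothetical):

```python
from datetime import date, timedelta


def is_deploy_day(day, holidays=frozenset()):
    """Allow deployments only on working days that are followed by a working day."""

    def is_off(d):
        # weekday() is 5 for Saturday and 6 for Sunday.
        return d.weekday() >= 5 or d in holidays

    return not is_off(day) and not is_off(day + timedelta(days=1))


if __name__ == "__main__":
    print(is_deploy_day(date(2019, 5, 30)))  # a Thursday
    print(is_deploy_day(date(2019, 5, 31)))  # a Friday
```

A pipeline can call such a check before its first destructive step and refuse to continue unless someone explicitly overrides it.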

Recovering from failure

• Write failure recovery procedures and train in them
• Avoid and discourage Engineer Freelancing
• Separate tasks (don't let the engineer focused on debugging be bothered by your CEO/CTO/C.* asking WHY? HOW? WHAT?)
• Keep a history of outages and introduce (new) people to them (write postmortems)
• Discourage workarounds and encourage problem reporting

In conclusion

• Nurture engineering talent
• Foster a blame-free culture and use postmortems extensively as a learning tool
• Don't be afraid to try new technologies (software, hardware and infrastructure)
• Always plan for failure and degraded states
• Go beyond normal performance testing
• Have a clear overview of your entire system (clear enough that a non-technical person can trace it)
• Know your cloud provider and hardware quirks
• Automate deployments
• Provide users with visibility during outages and separate tasks accordingly; most people suck at multitasking
