Cloud Resilience Fault Injection for Increased Resilience Jorge Cardoso ([email protected]) Huawei European Research Center Riesstraße 25, 80992 München The Butterfly Effect Project OpenStack Munich - Cloud Resilience & Experiences with OpenStack Wednesday, April 13, 2016 6:30 PM

Cloud Resilience with Open Stack


Page 1: Cloud Resilience with Open Stack


Page 2: Cloud Resilience with Open Stack


FusionSphere from Huawei

#6

Page 3: Cloud Resilience with Open Stack


News from OpenStack

06 April 2016

Page 4: Cloud Resilience with Open Stack


FAILURES ARE INEVITABLE! THE BEST WE CAN DO IS BE PREPARED FOR THEM AND LEARN FROM THEM.

TEST, REPAIR, LEARN & PREDICT!

Page 5: Cloud Resilience with Open Stack


Unplanned downtime is caused by*:

• software bugs: 27%
• hardware: 23%
• human error: 18%
• network failures: 17%
• natural disasters: 8%

*Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.

Page 6: Cloud Resilience with Open Stack


Google's 2007 study found the following annualized failure rates (AFRs) for disk drives:

• 1 year old: 1.7%
• 3 years old: >8.6%

Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proceedings of the 5th USENIX conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
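To make these rates concrete, the following sketch estimates how many drive failures a fleet should expect in a year and how likely it is to see at least one. The fleet size of 1,000 drives and the assumption that drives fail independently are illustrative choices, not figures from the study; the 3-year rate is used as a lower bound because the slide reports it as ">8.6%".

```python
def expected_failures(num_drives: int, afr: float) -> float:
    """Expected number of drive failures in one year for a given AFR."""
    return num_drives * afr


def prob_at_least_one_failure(num_drives: int, afr: float) -> float:
    """Probability of one or more failures in a year.

    Assumes drives fail independently, which is a simplification.
    """
    return 1.0 - (1.0 - afr) ** num_drives


FLEET_SIZE = 1000    # hypothetical fleet size, for illustration only
AFR_1_YEAR = 0.017   # 1.7% for 1-year-old drives (from the slide)
AFR_3_YEARS = 0.086  # lower bound; the slide reports >8.6%

for label, afr in [("1 year old", AFR_1_YEAR), ("3 years old", AFR_3_YEARS)]:
    failures = expected_failures(FLEET_SIZE, afr)
    chance = prob_at_least_one_failure(FLEET_SIZE, afr)
    print(f"{label}: about {failures:.0f} failures/year, "
          f"P(at least one) = {chance:.6f}")
```

Even at the lower rate, a fleet of this size should expect failures every year, which is the point of the slide: failures have to be planned for rather than avoided.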

Page 7: Cloud Resilience with Open Stack


Why does using a cloud infrastructure require advanced approaches for resiliency?

One reason [Netflix]: the lack of control over the underlying hardware, and the inability to configure it to try to ensure 100% uptime.

Page 8: Cloud Resilience with Open Stack


Technology Trends

[Figure: Google Trends chart for the search terms "cloud availability" and "cloud failure"]

Page 9: Cloud Resilience with Open Stack


Netflix: Chaos Monkey

• Chaos Monkey: randomly terminates instances in a cluster
• Chaos Gorilla: simulates an Availability Zone becoming unavailable
• Chaos Kong: simulates an entire region outage
• Latency Monkey: introduces latency to network packets to simulate degradation of the EC2 network
• Janitor Monkey: cleans up unused resources
• Security Monkey: analyzes and notifies on security profile changes

AWS recently recommended that firms using its infrastructure test their resilience by using Chaos Monkey to induce failures.
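The core idea of Chaos Monkey, randomly terminating instances in a cluster, can be illustrated with a small in-memory sketch. This is not Netflix's implementation: the cluster is a plain Python set, the instance names are made up, and removing an element from the set stands in for a real terminate call against a cloud API.

```python
import random


def pick_victim(instances, rng, probability=1.0):
    """Return one randomly chosen instance ID, or None if nothing is hit.

    `probability` is the chance that this round terminates anything at all.
    """
    if not instances or rng.random() >= probability:
        return None
    # Sort first so the choice is reproducible for a seeded generator.
    return rng.choice(sorted(instances))


def chaos_round(cluster, rng, probability=1.0):
    """Terminate at most one random instance of the in-memory cluster."""
    victim = pick_victim(cluster, rng, probability)
    if victim is not None:
        cluster.discard(victim)  # stand-in for a real terminate call
    return victim


cluster = {"web-1", "web-2", "web-3"}  # hypothetical instance names
rng = random.Random(42)                # seeded for a repeatable experiment
victim = chaos_round(cluster, rng)
print(f"terminated {victim}; survivors: {sorted(cluster)}")
```

Seeding the random generator is a deliberate choice here: a fault-injection run that cannot be replayed is hard to learn from.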

Page 10: Cloud Resilience with Open Stack


Netflix: Chaos Monkey

Fewer alerts for ops team

Amazon EC2 and Amazon RDS Service Disruption in the US East Region, April 29, 2011

September 20th, 2015: Amazon’s DynamoDB service experienced an availability issue in their US-EAST-1 Region

Transfer traffic to east region

Page 11: Cloud Resilience with Open Stack


Amazon AWS: GameDay

A program designed to increase resilience by purposely injecting major failures

Discover flaws and subtle dependencies

“That seems totally bizarre on the face of it, but as you dig down, you end up finding some dependency no one knew about previously […] We’ve had situations where we brought down a network in, say, São Paulo, only to find that in doing so we broke our links in Mexico.”

Page 12: Cloud Resilience with Open Stack


Google: DiRT

Google DiRT (Disaster Recovery Test): annual disaster recovery & testing exercise

• 8 years since inception
• Multi-day exercise triggering (controlled) failures in systems and process

Premise

• 30-day incapacitation of headquarters following a disaster
• Other offices and facilities may be affected

When

• “Big disaster”: annually for 3-5 days
• Continuous testing: year-round

Who

• 100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities)
• Business units (Human Resources, Finance, Safety, Crisis response etc.)

http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf

Page 13: Cloud Resilience with Open Stack


Goal

Butterfly Effect System: enables automatic testing and repair of OpenStack and cloud applications

CLOUD APPLICATION

HUAWEI FusionSphere

The system works by intentionally injecting different failures, testing the ability to survive them, and learning how to predict and repair failures preemptively.

Failure

Repair

Test

Page 14: Cloud Resilience with Open Stack


Use Case: OpenStack Resiliency

• Kill cinder database (simulate update failure)
• Introduce delay in messages (full-scale traffic shows where the real bottlenecks are)
• Operation error: OPENSTACK_KEYSTONE_URL = "http://%s:5000/v2.0" % OPENSTACK_HOST
• Operation error: /etc/nova/nova.conf, delete auth_strategy=keystone
• Remove driver to HD, remove access to NFS (simulate hardware failure)

Best way to avoid failure: fail constantly

The main testing framework of OpenStack is called Tempest, an open-source project with more than 2000 tests. It performs only black-box testing (tests only access the public interfaces).
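The operation-error fault on nova.conf (deleting auth_strategy=keystone) can be sketched as a reversible edit of a configuration file. This is an illustrative sketch, not the Butterfly Effect implementation: the function names are hypothetical, and the demo runs on a throwaway file in a temporary directory rather than on /etc/nova/nova.conf.

```python
import tempfile
from pathlib import Path


def remove_option(config_text: str, option: str) -> str:
    """Return the config text without any `option=value` line for `option`."""
    kept = []
    for line in config_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key == option:
            continue  # the injected fault: the option silently disappears
        kept.append(line)
    return "\n".join(kept) + "\n"


def inject_config_fault(path: Path, option: str) -> str:
    """Delete `option` from the file; return the original text for rollback."""
    original = path.read_text()
    path.write_text(remove_option(original, option))
    return original


def repair_config(path: Path, original: str) -> None:
    """Undo the fault by restoring the saved original text."""
    path.write_text(original)


# Demo on a throwaway copy, never on a live configuration file.
with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "nova.conf"
    conf.write_text("[DEFAULT]\nauth_strategy=keystone\ndebug=False\n")
    backup = inject_config_fault(conf, "auth_strategy")
    damaged = conf.read_text()
    repair_config(conf, backup)
    repaired = conf.read_text()

print(damaged)
```

Returning the original text from the injection step keeps fault and repair paired, so every injected fault has a known way back.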

Page 15: Cloud Resilience with Open Stack


Use Case 1: Increasing Reliability

Public Cloud

Damage Pattern

Butterfly Effect

Fix configurations Fix bugs Replace hardware Upgrade memory

Fault Type

Page 16: Cloud Resilience with Open Stack


Use Case 2: Run Book Automation (RBA)

Public Cloud

Incident Management

Is this really an incident?

Major Incident Procedure

Butterfly Effect

Fault Type

Damage Pattern

Recovery Script

Page 17: Cloud Resilience with Open Stack


Components of a Solution

• MONITORING: Nagios, Zabbix, Cacti, StackTach, Synaps, Monasca
• CONFIGURATION AUTOMATION: Ansible, CFEngine, Chef, Puppet, Salt, Heat
• FAULT-INJECTION ENGINES: DestroyStack, FSaaS, ChaosMonkey, AnarchyApe
• FAULT LIBRARIES AND PLANS: pyCallGraph, Intellect, RunDeck, Nose
• DATA VISUALIZATION: Kibana, Graylog2, Grafana
• DAMAGE DETECTION: Tempest, Nose
• DATA STORAGE: ElasticSearch, OpenTSDB, Neo4J, Graphite, Cassandra, Redis
• DATA AGGREGATION: Logstash, Collectd, Flume, Fluentd, Heka, Ceilometer
• MANUAL REPAIR: Bash, Python, Chef, Puppet
• AUTOMATED REPAIR: jCOLIBRI, myCBR, Puppet, Rundeck, (R)?ex, Chef
• DATA PROCESSING: Hadoop, Pig, Hive, Spark, Storm
• OPERATIONS ANALYTICS: Statsd, R, Panda, Weka, Machine Learning
• ALERTING: Errbit, Honeybadger, Nagios, Zabbix, OpenPager, Riemann
• DATA SOURCE: Log files, Collectd Plg, FlumeNG, OpenStack Tbls, Zabbix Agt, Nagios Plg
• DATA TRANSPORT: rsyslog, ZeroMQ

[Diagram: seven numbered steps forming a cycle: Design & Deploy Test Infrastructure; Monitoring Facilities; Design & Execute Fault-Injection Plan; Identify Damages; Predict Future Errors; Automatic Repair; Repair & Learn]

Page 18: Cloud Resilience with Open Stack


Technological Overview

(1) Design & Deploy Test Environment
• Customizable, automated OpenStack deployment
• FusionServer RH2288 + VirtualBox + Vagrant + RDO

(2) Design & Execute Fault-Injection Plan
• Language = Python (no DSL yet)
• Fault Engine = based on BPM
• Fault Plan = Workflow paradigm

(3) Monitoring Facilities
• Monasca (from HP, RackSpace, IBM)
• Visualization with Grafana

(4) Damage Detection
• OpenStack Tempest
• 1200 tests (but only API testing :( )

(5) Repair & Learn …

(6) Predict Future Errors …

(7) Automated Repair …


Page 19: Cloud Resilience with Open Stack


Design & Deploy Test Environment

• Customizable, automated OpenStack deployment
• FusionServer RH2288 + VirtualBox + Vagrant + RDO
• 2 hours to deploy OpenStack infrastructure with 32 VMs

Page 20: Cloud Resilience with Open Stack


Faults to Inject

Disk temporarily unavailable
• unmount a disk
• wait for replicas to regenerate
• remount the disk with the data intact
• wait for replicas to regenerate
• the extra replicas from handoff nodes should get removed

Disk replacement
• unmount a disk
• wait for replicas to regenerate
• delete the disk and remount it
• wait for replicas to regenerate
• extra replicas from handoff nodes should get removed

Expected failure
• damage three disks at the same time (more if the replica count is higher)
• check that the replicas didn’t regenerate even after some time period
• fail if the replicas regenerated
• this tests if the tests themselves are correct

VM failures
• send VM creation request
• find compute node where request was scheduled
• damage the compute server
• check if the VM creation was re-scheduled to another node
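The "disk temporarily unavailable" fault can be written as a small workflow of ordered steps, in line with the deck's description of fault plans as workflows written in Python. The sketch below is only an illustration: `FakeCluster` is a hypothetical in-memory stand-in for a replicated object store, and its replication logic is a simplification, not the behavior of any real storage service.

```python
class FakeCluster:
    """Hypothetical in-memory model of a replicated object store."""

    def __init__(self, disks, replica_count=3):
        self.replica_count = replica_count
        self.mounted = set(disks)
        self.primary = set(disks[:replica_count])  # disks that should hold a replica
        self.holding = set(self.primary)           # disks that currently hold one

    def unmount(self, disk):
        self.mounted.discard(disk)

    def remount(self, disk):
        self.mounted.add(disk)

    def live_replicas(self):
        return self.holding & self.mounted

    def regenerate(self):
        """One replication pass: refill on handoff disks, then drop extras."""
        for disk in sorted(self.mounted - self.holding):
            if len(self.live_replicas()) >= self.replica_count:
                break
            self.holding.add(disk)  # handoff disk takes a replica
        for disk in sorted(self.holding - self.primary):
            if len(self.live_replicas()) > self.replica_count:
                self.holding.discard(disk)  # extra handoff replica removed


def disk_temporarily_unavailable(cluster, disk):
    """Run the fault plan; return a log of (step, live replica count)."""
    steps = [
        ("unmount a disk", lambda: cluster.unmount(disk)),
        ("wait for replicas to regenerate", cluster.regenerate),
        ("remount the disk with the data intact", lambda: cluster.remount(disk)),
        ("wait for replicas to regenerate", cluster.regenerate),
    ]
    log = []
    for name, action in steps:
        action()
        log.append((name, len(cluster.live_replicas())))
    return log


cluster = FakeCluster(["d1", "d2", "d3", "d4"])
log = disk_temporarily_unavailable(cluster, "d1")
for name, replicas in log:
    print(f"{name}: {replicas} live replicas")
```

Recording the replica count after every step gives the damage-detection stage something to check: the final state should again have exactly the configured number of replicas, all on primary disks.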

Page 21: Cloud Resilience with Open Stack


Damage Detection

The main testing framework of OpenStack is called Tempest, an open-source project with more than 2000 tests. It performs only black-box testing (tests only access the public interfaces).

Network tests
• create keypairs
• create security groups
• create networks

Compute tests
• create a keypair
• create a security group
• boot an instance

Swift tests
• create a volume
• get the volume
• delete the volume

Identity tests

Cinder tests

Glance tests

# Print the commands for running Tempest against a workspace named cloud-01
echo "$ tempest init cloud-01"
echo "$ cp tempest/etc/tempest.conf cloud-01/etc/"
echo "$ cd cloud-01"
echo "Next is the full test suite:"
echo "$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))'"
echo "Next is the minimum basic test:"
echo "$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.scenario\.test_minimum_basic)'"
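The `--regex` filter for the full test suite combines a negative lookahead, which rejects any test ID tagged `slow` inside its square brackets, with an anchor that requires the ID to start with `tempest.api` or `tempest.scenario`. The sketch below shows the effect with Python's `re` module, assuming the pattern is applied to each test ID with a regular-expression search. The test IDs are invented for illustration and are not real Tempest tests.

```python
import re

# Pattern for the full test suite, as passed to --regex on the slide.
FULL_SUITE = r"(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))"


def select(test_ids, pattern):
    """Return the test IDs that the pattern selects."""
    regex = re.compile(pattern)
    return [test_id for test_id in test_ids if regex.search(test_id)]


# Hypothetical test IDs, for illustration only.
test_ids = [
    "tempest.api.compute.test_servers.ServersTest.test_list[id-1,smoke]",
    "tempest.scenario.test_minimum_basic.TestMinimumBasic.test_basic[id-2,compute]",
    "tempest.scenario.test_stress.TestStress.test_long[id-3,slow]",
    "other_plugin.tests.test_x.TestX.test_y[id-4]",
]

selected = select(test_ids, FULL_SUITE)
for test_id in selected:
    print(test_id)
```

Only the first two IDs are selected: the third carries the `slow` tag, and the fourth does not start with `tempest.api` or `tempest.scenario`.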

Page 22: Cloud Resilience with Open Stack


Zabbix and ELK

Page 23: Cloud Resilience with Open Stack


Monasca

Overview: Uses the Keystone OpenStack Identity Service for authentication, authorization and multi-tenancy. Monasca integrates with several other OpenStack services, such as Heat for auto-scaling and Ceilometer for monitoring OpenStack resources.

Apache Kafka: A high-throughput distributed messaging system. Kafka is a central component in Monasca and provides the infrastructure for all internal communications between components.

Apache Storm: A free and open source distributed realtime computation system. Apache Storm is used in the Monasca Threshold Engine.

InfluxDB: An open-source distributed time series database with no external dependencies. InfluxDB is one of the supported databases for storing metrics and alarm history.

MySQL: MySQL is one of the supported databases for the Monasca Config Database.

Grafana: An open source, feature rich metrics dashboard and graph editor. Support for Monasca as a data source in Grafana has been added.

Anomaly Detection: The engine implements real-time streaming anomaly detection with two algorithms: Numenta Platform for Intelligent Computing (NuPIC) and the Kolmogorov-Smirnov (K-S) Two Sample Test. Uses Stacktach for realtime streaming.

Performance: 3 HP Proliant SL390s G7 servers + InfluxDB cluster = 25K-30K metrics/sec; monasca-api > 150K metrics/sec for a 3-node cluster with load balancing; for more performance, use the HP Vertica database.

See https://www.openstack.org/assets/presentation-media/Monasca-Deep-Dive-Paris-Summit.pdf

Grafana (compute_instance_create_time)

Anomaly Detection (cpu.user_perc)
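The Kolmogorov-Smirnov two-sample test compares two sets of measurements by the largest gap between their empirical cumulative distribution functions. The sketch below computes that statistic in plain Python and flags a recent window of a metric such as cpu.user_perc as anomalous when the gap exceeds a threshold. It is not Monasca's implementation; the sample values and the threshold of 0.5 are made-up illustrations, and a full test would derive a p-value instead of using a fixed cutoff.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in a + b:
        # Fraction of each sample that is <= x.
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def is_anomalous(reference, recent, threshold=0.5):
    """Flag the recent window if it differs too much from the reference.

    The threshold is an arbitrary illustrative value.
    """
    return ks_statistic(reference, recent) > threshold


reference = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0]  # hypothetical CPU readings
calm = [13.0, 14.0, 12.0, 15.0]
spike = [85.0, 90.0, 88.0, 92.0]

print("calm window anomalous:", is_anomalous(reference, calm))
print("spike window anomalous:", is_anomalous(reference, spike))
```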

Page 24: Cloud Resilience with Open Stack


Application Domains

Page 25: Cloud Resilience with Open Stack


Join the Cause!

Internship positions for MSc students

Fault injection, fault models, fault libraries, fault plans, break and rebuild systems all day long, …

OpenStack Engineers positions

Rapid prototyping of cool ideas: propose it today, code it, and show it running in 3 months…

Innovative PoCs

Solving difficult challenges of real problems using quick and dirty prototyping

Page 26: Cloud Resilience with Open Stack

Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved.
