Survive in Cloud: The Zen of High Availability at Massive Scale in Cloud

Page 3

Mobvista

No.1

950M

320M

200+

TOP 10

Mintegral SDK DAU

China

Countries/Regions

world-wide

DMP’s DAU

60B daily ad requests

Page 4

All in Cloud

[Architecture diagram. Labeled components: Publisher; Advertiser; SDK; API; Manual; RTB; Tracking Service (instances, Spot Fleet, Auto Scaling); Volume Processing Service (instances, Spot Fleet, Auto Scaling); Lambda function; DynamoDB; ElastiCache; SQS; Kinesis; S3; RDS; Offer management; Online DMP; EMR; Redshift*; Big Data & ML; ES; CloudWatch; Metrics & Alarm.]

Page 5

Cloud Computing

Cloud Characteristics

• On-Demand

• Rapid elasticity

• Pay per use

• Uncertain downtime

Service Goals

• Quick Scaling

• Low Cost

• High Reliability

• High Availability

Page 6

Fault Oriented

Page 7

Once you accept that failures will happen, you have the ability to design your system’s reaction to specific failures.

Page 8

Isolated Design

[Diagram: a Micro Kernel with Extension Points, surrounded by plug-ins.]

Page 9

Isolated Deployment

Ordering Service

Cart Service

Checkout Service

Payment Service

Fulfillment Service

Page 10

Reused vs. Isolated

Reused logic structure vs. Isolated physical structure

[Diagram, reused: the Critical Data Collector and the Log Data Collector share a single Data Transform Service.]

[Diagram, isolated: the Critical Data Collector and the Log Data Collector each have their own Data Transform Service.]

Page 11

Redundancy

Page 12

Redundancy

Online Service Standby Service

Load Balancer Load Balancer

Online Redundancy

Page 13

Common Failure Modes

Page 14

Propagated Failure

Load Balancer

QPS 1500

Max QPS 1000

Page 15

Rate Limit
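
A rate limit keeps a burst from pushing a service past its capacity. The sketch below is one common way to do it (a token bucket), not necessarily the limiter used in the deck; the class name `TokenBucket` and its parameters are illustrative assumptions.

```java
import java.util.function.LongSupplier;

/** Token bucket: admits roughly ratePerSecond requests per second, rejects the rest at once. */
final class TokenBucket {
    private final double ratePerSecond;   // steady-state admission rate
    private final double capacity;        // largest burst admitted at once
    private final LongSupplier nanoClock; // injectable clock, e.g. System::nanoTime
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double ratePerSecond, double capacity, LongSupplier nanoClock) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.nanoClock = nanoClock;
        this.tokens = capacity;                        // start with a full bucket
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Returns true when the request may proceed, false when it should be rejected. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        // Add the tokens earned since the last call, capped at the bucket capacity.
        double refill = (now - lastRefillNanos) * ratePerSecond / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + refill);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

With the numbers from the Propagated Failure slide (1500 QPS offered, max QPS 1000), `new TokenBucket(1000, 1000, System::nanoTime)` would admit about 1000 requests per second and reject the excess immediately instead of passing the overload downstream.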

Page 16

Cascading Failure

ServiceD

ServiceE

ServiceB

Service

ServiceA

ServiceC

Client

Page 17

Circuit Breaker

Page 18

Circuit Breaker

ServiceD

ServiceE

ServiceB

Service

ServiceA

ServiceC

Client

Fallback
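
A minimal sketch of a circuit breaker with a fallback, assuming a simple policy: open after a number of consecutive failures, then allow one trial call after a cool-down. The class `CircuitBreaker`, its thresholds and the injected clock are illustrative assumptions, not the API of any library named in the deck.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Opens after N consecutive failures; while open, callers get the fallback without delay. */
final class CircuitBreaker<T> {
    private final int failureThreshold;   // consecutive failures that open the circuit
    private final long coolDownNanos;     // how long the circuit stays open
    private final LongSupplier nanoClock; // injectable clock, e.g. System::nanoTime
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAtNanos = 0L;

    CircuitBreaker(int failureThreshold, long coolDownNanos, LongSupplier nanoClock) {
        this.failureThreshold = failureThreshold;
        this.coolDownNanos = coolDownNanos;
        this.nanoClock = nanoClock;
    }

    // synchronized for brevity; a production breaker would not hold a lock during the call.
    synchronized T call(Supplier<T> service, Supplier<T> fallback) {
        if (open && nanoClock.getAsLong() - openedAtNanos < coolDownNanos) {
            return fallback.get();            // fail fast, leave the sick dependency alone
        }
        try {
            T result = service.get();         // normal call, or a trial call after cool-down
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                open = true;                  // (re)open and restart the cool-down
                openedAtNanos = nanoClock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

Because the caller always receives either a result or the fallback, a failing downstream service stops consuming the caller's threads, which is what keeps the failure from cascading back toward the client.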

Page 19

Slow Response

A quick rejection is better than a slow response.

Pooled resources are exhausted!
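
One way to get a quick rejection is to cap concurrency and refuse work as soon as the cap is reached, rather than queueing callers until a pool runs dry. This is a sketch under that assumption; `Bulkhead` and its method names are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Caps concurrent calls; when the cap is reached, rejects at once instead of waiting. */
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T execute(Supplier<T> task, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {   // non-blocking: never waits for a permit
            return rejected.get();
        }
        try {
            return task.get();
        } finally {
            permits.release();         // always hand the permit back
        }
    }
}
```

The rejected caller learns right away that the service is saturated and can retry elsewhere or degrade, while the permits protect the shared pool from exhaustion.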

Page 20

No Unlimited Waiting

Any blocking operation needs a time limit!
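
A sketch of putting a time limit on a blocking call with the standard `java.util.concurrent` API. The helper `Timeouts.callWithTimeout` and the choice to return a fallback value are assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class Timeouts {
    /** Runs the blocking call on the executor and waits at most timeoutMillis for it. */
    static <T> T callWithTimeout(ExecutorService executor, Callable<T> blockingCall,
                                 long timeoutMillis, T fallback) {
        Future<T> future = executor.submit(blockingCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                 // interrupt the stuck worker thread
            return fallback;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();  // preserve the caller's interrupt status
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                     // the call itself failed
        }
    }
}
```

The same rule applies to every blocking primitive: prefer the overloads that take a timeout (socket connect and read timeouts, `tryLock` with a duration, `poll` with a duration) over the ones that wait forever.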

Page 21

Recovery Oriented

Page 22

“A priori prediction of all failure modes is not possible.”

Page 23

Health Check

• Zombie Process

• Pooled resources exhausted

• Deadlock
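
A process that is still running can be a zombie, so a health check has to probe what the service actually needs. The sketch below assumes a small registry of named probes; `HealthCheck` and the probe names are hypothetical, while `ThreadMXBean.findDeadlockedThreads()` is the standard JVM call for deadlock detection.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

final class HealthCheck {
    private final Map<String, BooleanSupplier> probes = new LinkedHashMap<>();

    HealthCheck register(String name, BooleanSupplier probe) {
        probes.put(name, probe);
        return this;
    }

    /** Returns the name of the first failing probe, or null when every probe passes. */
    String firstFailure() {
        for (Map.Entry<String, BooleanSupplier> entry : probes.entrySet()) {
            boolean healthy;
            try {
                healthy = entry.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                healthy = false;           // a probe that blows up counts as unhealthy
            }
            if (!healthy) {
                return entry.getKey();
            }
        }
        return null;
    }

    /** Probe: the JVM reports no deadlocked threads. */
    static boolean noDeadlock() {
        return ManagementFactory.getThreadMXBean().findDeadlockedThreads() == null;
    }
}
```

A pool-exhaustion probe can be as small as `() -> permits.availablePermits() > 0` for a semaphore-guarded pool. An instance that fails a probe is taken out of rotation and replaced rather than debugged in place.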

Page 24

Recoverable

• Say “NO” to monolithic systems

• Stateless

• Survive when dependent services crash

• Quick restart

Page 25

Let it Crash!

try {
    …
} catch (Throwable t) { /* swallowed: the failure is hidden, nothing crashes */ }
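
The alternative to swallowing errors is to let the worker fail and have something outside it start a fresh one. In the cloud that role is usually played by the platform (for example Auto Scaling replacing an instance); the in-process `Supervisor` below is only a hypothetical miniature of the same idea.

```java
final class Supervisor {
    /**
     * Runs the worker; when it crashes, starts a fresh run, at most maxRestarts times.
     * Returns how many restarts were needed; rethrows the crash once the budget is spent.
     */
    static int supervise(Runnable worker, int maxRestarts) {
        int restarts = 0;
        while (true) {
            try {
                worker.run();
                return restarts;       // clean exit
            } catch (RuntimeException crash) {
                if (restarts == maxRestarts) {
                    throw crash;       // out of budget: escalate instead of hiding it
                }
                restarts++;            // no repair attempt, just start over clean
            }
        }
    }
}
```

This only works when the worker is stateless and quick to restart, which is why those two properties appear on the Recoverable slide.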

Page 26

Negotiate With Client

Server: “I am busy, please slow down.”

Client: “Get back to me after one minute.”
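
A sketch of this negotiation, assuming the busy reply carries a back-off hint that the client honors before retrying. `Reply` and `PoliteClient` are hypothetical names; over HTTP the same idea is commonly expressed with a 429 or 503 status and a `Retry-After` header.

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

/** Server reply: either accepted, or "busy, come back after retryAfterMillis". */
final class Reply {
    final boolean accepted;
    final long retryAfterMillis;

    private Reply(boolean accepted, long retryAfterMillis) {
        this.accepted = accepted;
        this.retryAfterMillis = retryAfterMillis;
    }

    static Reply ok() { return new Reply(true, 0L); }

    static Reply busy(long retryAfterMillis) { return new Reply(false, retryAfterMillis); }
}

final class PoliteClient {
    /**
     * Sends the request up to maxAttempts times. After a busy reply it waits for the
     * period the server asked for; sleeper is injectable so the wait can be simulated.
     */
    static boolean send(Supplier<Reply> request, int maxAttempts, LongConsumer sleeper) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Reply reply = request.get();
            if (reply.accepted) {
                return true;
            }
            if (attempt < maxAttempts) {
                sleeper.accept(reply.retryAfterMillis);   // back off as requested
            }
        }
        return false;
    }
}
```

Letting the busy side set the pace avoids the retry storm that fixed, aggressive client retries create exactly when the server is least able to absorb it.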

Page 27

Chaos Engineering

Page 28

“If something hurts, do it more often!”

Page 29

Chaos under control

Chaos Engineering

• You learn how to fix the things that often break.

• You don’t learn how to fix the things that rarely break.

Terminate host

Inject latency

Inject failure
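
Latency and failure injection can be sketched as a wrapper around a service call. `ChaosInjector`, its rates and the exception it throws are illustrative assumptions, not the tooling used in the deck; terminating a host is left to the cloud provider's API and is not shown.

```java
import java.util.Random;
import java.util.function.Supplier;

/** Wraps a call and, with the configured probabilities, injects a failure or extra latency. */
final class ChaosInjector<T> implements Supplier<T> {
    private final Supplier<T> target;
    private final double failureRate;   // probability in [0.0, 1.0] of an injected failure
    private final double latencyRate;   // probability in [0.0, 1.0] of injected latency
    private final long latencyMillis;   // delay added when latency is injected
    private final Random random;

    ChaosInjector(Supplier<T> target, double failureRate, double latencyRate,
                  long latencyMillis, Random random) {
        this.target = target;
        this.failureRate = failureRate;
        this.latencyRate = latencyRate;
        this.latencyMillis = latencyMillis;
        this.random = random;
    }

    @Override
    public T get() {
        if (random.nextDouble() < failureRate) {
            throw new IllegalStateException("chaos: injected failure");
        }
        if (random.nextDouble() < latencyRate) {
            try {
                Thread.sleep(latencyMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return target.get();
    }
}
```

Keeping the rates configurable is what keeps the chaos under control: an experiment can start with a small rate on a small share of traffic and widen only while the service still meets its SLA.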

Page 30

Chaos Engineering

[Flowchart with the steps: Set expected SLA; Inject Failures; Measure services; decision “meet SLA?”; Improve system.]

Page 31

Chaos Engineering Principles

• Build a Hypothesis around Steady State Behavior

• Vary Real-world Events

• Run Experiments in Production

• Automate Experiments to Run Continuously

• Minimize Blast Radius

http://principlesofchaos.org

Page 32

Higher Resilience, Lower Cost

Page 33

Cost

Scale

Page 34

Spot Instance

Page 35

microservice

stateless

quick restart

fault tolerance

chaos engineering

Reserved Instance

Spot Fleet

Auto Scaling

Fault and Recovery Oriented Architecture

Spot Instance

Page 36

Multi-Clouds Ecosystem

Page 37

Multi-Clouds Foundation

Cloud Connection

Mobvista Cloud Solution

Mobvista Cloud Platform

Spot Instance Mgr Logging Monitoring

CI/CD Pipeline Auto Scaling

High Reliability

AWS API, Ali API, AWS CLI, Ali CLI

Alarm

Cost Optimization

Smart Load Balance

DevOps

Public Cloud Platform: Ali Cloud, AWS Cloud

Mobvista AI Platform: Big Data Platform, Machine Learning Platform

Page 38

Service Decorator

https://github.com/easierway/service_decorators/blob/master/README.md
