Survive in Cloud: The Zen of High Availability at Massive Scale in Cloud
Mobvista
No.1
950M
320M
200+
TOP 10
Mintegral SDK DAU
China
Countries/Regions
world-wide
DMP’s DAU
60B daily ad requests
All in Cloud
[Architecture diagram. Labels: Publisher (SDK, API, Manual), Advertiser (RTB), Offer management, Online DMP, RDS, Tracking Service (instances, Spot Fleet, Auto Scaling), Volume Processing Service (instances, Spot Fleet, Auto Scaling), ElastiCache, SQS, DynamoDB, Lambda function, Kinesis, S3, Big Data & ML (EMR, Redshift*), Metrics & Alarm (CloudWatch, ES).]
Cloud Computing
Cloud Characteristics
• On-demand
• Rapid elasticity
• Pay per use
• Uncertain downtime
Service Goals
• Quick scaling
• Low cost
• Highly reliable
• Highly available
Fault Oriented
Once you accept that failures will happen, you have the ability to design your system’s reaction to specific failures.
Isolated Design
Micro Kernel
[Diagram: a micro kernel with several extension points, each hosting plug-ins.]
Isolated Deployment
[Diagram: Ordering Service, Cart Service, Checkout Service, Payment Service, Fulfillment Service.]
Reused vs. Isolated
Reused logic structure vs. Isolated physical structure
[Diagram: a Critical Data Collector and a Log Data Collector, shown once sharing a single Data Transform Service and once with a separate Data Transform Service each.]
Redundancy
[Diagram: an Online Service and a Standby Service, each behind a Load Balancer.]
Online Redundancy
Common Failure Modes
Propagated Failure
[Diagram: a Load Balancer passes 1500 QPS to a service whose maximum is 1000 QPS.]
Rate Limit
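
The rate-limit idea on this slide can be sketched as a token bucket. This is a minimal illustration, not the deck's implementation: the class name `TokenBucketLimiter`, its parameters, and the injectable clock are assumptions made for this example.

```java
import java.util.function.LongSupplier;

/** Hypothetical token-bucket limiter: admits about ratePerSecond requests per second. */
final class TokenBucketLimiter {
    private final double ratePerSecond;   // refill rate, e.g. 1000 for "Max QPS 1000"
    private final double capacity;        // largest burst admitted at once
    private final LongSupplier nanoClock; // injectable clock (assumption, eases testing)
    private double tokens;
    private long lastRefillNanos;

    TokenBucketLimiter(double ratePerSecond, double capacity, LongSupplier nanoClock) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.nanoClock = nanoClock;
        this.tokens = capacity;
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Returns true if the request may proceed, false if it should be rejected at once. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        // Refill in proportion to elapsed time, never beyond the bucket capacity.
        tokens = Math.min(capacity, tokens + elapsedSeconds * ratePerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

With `new TokenBucketLimiter(1000, 1000, System::nanoTime)`, a burst of 1500 calls to `tryAcquire()` within the same instant admits 1000 and rejects the rest, so the overload is shed at the edge instead of propagating to the service.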
Cascading Failure
[Diagram: a Client and a call graph of Service, ServiceA, ServiceB, ServiceC, ServiceD, and ServiceE.]
Circuit Breaker
[Diagram: the same Client and service call graph, with circuit breakers.]
Fallback
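
A circuit breaker combined with a fallback might look like the sketch below. It is a simplified illustration under stated assumptions: the class `CircuitBreaker`, the consecutive-failure threshold, and the cool-down policy are choices made for this example and are not taken from the slides or from the linked library.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical breaker: opens after N consecutive failures, retries after a cool-down. */
final class CircuitBreaker<T> {
    private final int failureThreshold;
    private final long coolDownNanos;
    private final LongSupplier nanoClock; // injectable clock (assumption, eases testing)
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAtNanos = 0L;

    CircuitBreaker(int failureThreshold, long coolDownNanos, LongSupplier nanoClock) {
        this.failureThreshold = failureThreshold;
        this.coolDownNanos = coolDownNanos;
        this.nanoClock = nanoClock;
    }

    /** Calls primary unless the breaker is open; returns the fallback result otherwise. */
    synchronized T call(Supplier<T> primary, Supplier<T> fallback) {
        // Simplification: the lock serializes all calls, which a real breaker would avoid.
        if (open && nanoClock.getAsLong() - openedAtNanos < coolDownNanos) {
            return fallback.get(); // fail fast, do not touch the broken dependency
        }
        // Closed, or cool-down elapsed (half-open): let this call try the dependency.
        try {
            T result = primary.get();
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                open = true;
                openedAtNanos = nanoClock.getAsLong();
            }
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return open;
    }
}
```

While the breaker is open the failing dependency receives no traffic at all, which is what keeps one broken service from dragging down its callers.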
Slow Response
A quick rejection is better than a slow response.
Pooled resources are exhausted!
No Unlimited Waiting
Any blocking operation needs a time limit!
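
One way to put a time limit on a blocking operation in Java is `Future.get` with a timeout. The helper below is a sketch; the class and method names (`BoundedCall`, `callWithTimeout`) are invented for this example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper that never waits longer than the given limit. */
final class BoundedCall {
    /**
     * Runs the task on the executor and waits at most timeoutMillis.
     * On timeout the task is cancelled (its thread is interrupted) and
     * TimeoutException is rethrown so the caller can reject quickly.
     */
    static <T> T callWithTimeout(ExecutorService executor, Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker so the pooled thread is released instead of leaking.
            future.cancel(true);
            throw e;
        }
    }
}
```

Cancelling on timeout matters as much as the timeout itself: without it the slow task keeps holding a pooled thread, and the pool is still exhausted.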
Recovery Oriented
“A priori prediction of all failure modes is not possible.”
Health Check
• Zombie Process
• Pooled resources exhausted
• Deadlock
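
A health check that catches these three conditions has to exercise the real worker path rather than only confirm that the process exists. The sketch below is one possible shape, assuming the service does its work on an `ExecutorService`; the class name `DeepHealthCheck` and the probe timeout are illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical deep health check for a service backed by a worker pool. */
final class DeepHealthCheck {
    private final ExecutorService workerPool;
    private final long probeTimeoutMillis;

    DeepHealthCheck(ExecutorService workerPool, long probeTimeoutMillis) {
        this.workerPool = workerPool;
        this.probeTimeoutMillis = probeTimeoutMillis;
    }

    boolean isHealthy() {
        // Deadlock: ask the JVM for threads stuck in a monitor lock cycle.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (threads.findMonitorDeadlockedThreads() != null) {
            return false;
        }
        // Zombie process or exhausted pool: a trivial task must get through
        // the same pool that serves real requests, within the time limit.
        Future<Boolean> probe;
        try {
            probe = workerPool.submit(() -> Boolean.TRUE);
        } catch (RejectedExecutionException e) {
            return false;
        }
        try {
            return probe.get(probeTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            probe.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            probe.cancel(true);
            return false;
        }
    }
}
```

A process that still answers a plain "is the port open" check can fail this one, which is the point: the check reports unhealthy as soon as no worker is able to pick up new work.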
Recoverable
• Say “NO” to monolithic systems
• Stateless
• Survive when dependent services crash
• Quick restart
Let it Crash!
try {
    …
} catch (Throwable t) { }
Negotiate With Client
Server: “I am busy, please slow down.”
Client: “Get back to me after one minute.”
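
The exchange above can be sketched as a server that rejects excess work with a retry hint and a client that honors it. Over HTTP the same idea is usually carried by a 429 or 503 status with a `Retry-After` header. All names here (`Reply`, `BusyAwareServer`, `PoliteClient`) are hypothetical and exist only for this illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical reply: either accepted, or rejected with a hint on when to retry. */
final class Reply {
    final boolean accepted;
    final long retryAfterMillis;

    Reply(boolean accepted, long retryAfterMillis) {
        this.accepted = accepted;
        this.retryAfterMillis = retryAfterMillis;
    }
}

/** Hypothetical server that sheds load once too many requests are in flight. */
final class BusyAwareServer {
    private final int maxInFlight;
    private final long retryAfterMillis;
    private final AtomicInteger inFlight = new AtomicInteger();

    BusyAwareServer(int maxInFlight, long retryAfterMillis) {
        this.maxInFlight = maxInFlight;
        this.retryAfterMillis = retryAfterMillis;
    }

    Reply handle(Runnable work) {
        if (inFlight.incrementAndGet() > maxInFlight) {
            inFlight.decrementAndGet();
            return new Reply(false, retryAfterMillis); // "I am busy, please slow down."
        }
        try {
            work.run();
            return new Reply(true, 0L);
        } finally {
            inFlight.decrementAndGet();
        }
    }
}

/** Hypothetical client that waits for the server's hint instead of retrying at once. */
final class PoliteClient {
    static boolean send(BusyAwareServer server, Runnable work, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Reply reply = server.handle(work);
            if (reply.accepted) {
                return true;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(reply.retryAfterMillis); // back off as the server asked
            }
        }
        return false;
    }
}
```

Because the server names the delay, clients do not have to guess a back-off, and a busy server is not hit by immediate retries from every rejected caller.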
Chaos Engineering
“If something hurts, do it more often!”
Chaos under control
• You learn how to fix the things that often break.
• You don’t learn how to fix the things that rarely break.
Terminate host
Inject latency
Inject failure
[Flowchart: Set expected SLA, Inject failures, Measure services, Meet SLA? (Yes/No), Improve system.]
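
The steps in the flowchart (set an expected SLA, inject failures, measure, compare) can be sketched in a few lines. This is a toy illustration under stated assumptions: `ChaosExperiment` and its methods are invented for the example, the SLA is modeled as a minimum success rate, and a real experiment would inject faults at the infrastructure level rather than in process.

```java
import java.util.Random;
import java.util.function.Supplier;

/** Hypothetical in-process chaos experiment. */
final class ChaosExperiment {
    /** Inject failures: wraps a call so it fails with the given probability. */
    static <T> Supplier<T> injectFailures(Supplier<T> call, double failureRate, Random random) {
        return () -> {
            if (random.nextDouble() < failureRate) {
                throw new IllegalStateException("injected failure");
            }
            return call.get();
        };
    }

    /** Measure: runs the call repeatedly and returns the fraction that succeeded. */
    static double measureSuccessRate(Supplier<?> call, int requests) {
        int succeeded = 0;
        for (int i = 0; i < requests; i++) {
            try {
                call.get();
                succeeded++;
            } catch (RuntimeException e) {
                // Counted as a failed request.
            }
        }
        return (double) succeeded / requests;
    }

    /** Meet SLA? Compares the measurement with the expected success rate. */
    static boolean meetsSla(double successRate, double expectedSla) {
        return successRate >= expectedSla;
    }
}
```

A typical use wraps a dependency with `injectFailures`, measures the calling service with and without its fallback in place, and treats any run where `meetsSla` returns false as the input to the "improve system" step.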
Chaos Engineering Principles
• Build a Hypothesis around Steady State Behavior
• Vary Real-world Events
• Run Experiments in Production
• Automate Experiments to Run Continuously
• Minimize Blast Radius
http://principlesofchaos.org
Higher Resilience, Lower Cost
[Chart: Cost versus Scale, comparing Reserved Instance and Spot Instance. Labels: microservice, stateless, quick restart, fault tolerance, chaos engineering, Spot Fleet, Auto Scaling.]
Fault and Recovery Oriented Architecture
Spot Instance
Multi-Clouds Ecosystem
[Layered diagram. Labels: Mobvista AI Platform, Big Data Platform, Machine Learning Platform, Mobvista Cloud Solution, Mobvista Cloud Platform, Spot Instance Mgr, Logging, Monitoring, Alarm, CI/CD Pipeline, Auto Scaling, High Reliability, Cost Optimization, Smart Load Balance, DevOps, Multi-Clouds Foundation, Cloud Connection, AWS API, Ali API, AWS CLI, Ali CLI, Public Cloud Platform, Ali Cloud, AWS Cloud.]
Service Decorator
https://github.com/easierway/service_decorators/blob/master/README.md