Capacity Forecast @ Scale
CDE, Cloud Database Engineering
Netflix
● CDE, Cloud Database Engineering
● Providing data stores as a service
○ Cassandra
○ Dynomite
○ Elasticsearch
○ RDS
Ajay Upadhyay, Cloud Data Architect @ Netflix
Arun Agrawal, Sr. Software Engineer @ Netflix
Who are we?
● Cassandra @ Netflix
● Cassandra footprint
● Capacity planning lifecycle
● Forecasting the capacity
● Q and A
Agenda
• 98% of streaming data is stored in Cassandra
• Data ranges from customer details to Viewing history / streaming bookmarks to billing and payment
Cassandra @ Netflix
Cassandra Footprint
Hundreds C*
Cassandra Footprint
Thousands
Capacity Planning
• Able to predict
– Current usage and available capacity
– Resources needing upgrade
– Life cycle of current configuration
– Appropriate configuration for new and existing App/Service
• Optimize
– Under- or over-utilized resources
– Increased business productivity
Capacity Planning
Avoid:
• Impact on business
• Service or SLA disruption
• Unplanned maintenance
• Firefighting
Life Cycle
[Diagram: capacity planning life cycle, with the stages below]
• Capture Requirement
• Requirement Analysis / Feasibility
• Proxy or Simulate Requirement
• Monitoring / Trending
• New / Increased Traffic
• Optimization
Capture Requirement
– IOPs and SLA
– Maintenance overhead
– Failover
– Access pattern
IOPs and SLA
Questions                          Response
Read OPS/sec [avg, peak]           5k - 10k
Read latency requirement           95th - 20ms, 99th - 100ms
Write OPS/sec [avg, peak]          1k - 2k
Write latency requirement          95th - 20ms, 99th - 100ms
Num columns / row                  100
Avg col size / or avg row size     64k
Num of rows                        100 Mil
TTL [life cycle of data]           365 Days
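The captured numbers translate into a rough disk footprint. The sketch below is only a back-of-the-envelope illustration: it assumes "64k" means 64 KiB per row, and the replication factor of 3 and the 50% headroom for compaction and growth are assumptions, not values from the slides. The function name `estimate_storage_bytes` is hypothetical.

```python
def estimate_storage_bytes(num_rows, avg_row_bytes, replication_factor=3, overhead=0.5):
    """Rough cluster-wide disk footprint in bytes.

    replication_factor and overhead are illustrative assumptions,
    not values taken from the requirement sheet.
    """
    raw = num_rows * avg_row_bytes          # one copy of the data
    replicated = raw * replication_factor   # copies kept across replicas
    return replicated * (1 + overhead)      # headroom for compaction and growth


rows = 100_000_000       # "Num of rows: 100 Mil"
row_bytes = 64 * 1024    # assumes "64k" means 64 KiB per row

raw_tb = rows * row_bytes / 1e12
total_tb = estimate_storage_bytes(rows, row_bytes) / 1e12
print(f"raw data: {raw_tb:.2f} TB")
print(f"with RF=3 and 50% headroom: {total_tb:.2f} TB")
```

The TTL of 365 days bounds how long rows accumulate, so the row count can be read as a steady-state figure under this interpretation.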
[Diagram labels: Gutenberg publisher service, Read / Write, Data store (C*)]
Maintenance Overhead
Type                               Response
Repairs / Compactions              Y/N
Node replacement                   Y
Backup - Full / Incrementals       Y/N
Failover
Questions                          Response
Region failover                    Y/N
SLA in case of region failover     Y/N
Access Pattern
Questions                          Response
Read                               Point read, All row readers, Column slices
Write                              Part existing row, New rows
Proxy/Simulate Traffic
– Proxy existing traffic
– Simulate traffic
– NDBench
– Generate actual / synthetic traffic before final deployment using app
Optimization
• Cache
- Application level
- Fronting cache engine before C*
- Stagger read and write operations if possible
Cluster Sharding
Trend Analysis
Continuous monitoring / trending on usage pattern
New / Increased Traffic
Capacity planning cycle begins
[Life cycle diagram repeated: Capture Requirement, Requirement Analysis / Feasibility, Proxy or Simulate Requirement, Monitoring / Trending, New / Increased Traffic, Optimization]
Capacity Forecasting
Arun Agrawal, Sr. Software Engineer
Demo
Metrics
Atlas
Previous Architecture
Pain Points
• No support for complex relationships
• Hardware failures could trigger alerts, leading to false positives
Winston
• Bridge between Atlas and on-call
• Complex relationship modeling between metrics
• Reduce false positives
• Auto remediation platform
Lessons Learnt
• It might already be too late to fix the system.
• Reactive rather than proactive
Requirements
• Show us the trend for the clusters.
• Warn us of what is coming if the trend continues.
• Give us time to scale the cluster.
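One way to read "warn us of what is coming if the trend continues" is a straight-line extrapolation of a daily usage metric to a threshold. The following is a minimal sketch under that assumption; the function names, the sample values, and the 80% threshold are hypothetical and not taken from the talk.

```python
def linear_trend(values):
    """Least-squares slope and intercept for values observed on days 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two daily observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(values, threshold):
    """Days after the last observation until the trend line reaches threshold.

    Returns None when the trend is flat or falling, and 0.0 when the
    trend line is already at or above the threshold.
    """
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    last_day = len(values) - 1
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - last_day)


# Hypothetical daily disk usage (percent) for one cluster.
usage = [50, 52, 54, 56, 58]
print(days_until(usage, threshold=80))
```

The returned lead time is what gives the owning team "time to scale": a warning can be raised once it drops below however long a scale-up takes.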
Automic (UC4)
Architecture
Aggregation
• Daily
• Instance Level
• Cluster Level

• Instance Failures
• Adding capacity over days
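The daily, instance-level, and cluster-level roll-up can be sketched as below. The choice of statistics is an assumption: the sketch keeps the daily peak per instance and then averages across whichever instances reported that day, which is one way to keep the cluster figure comparable when an instance fails or capacity is added. The talk does not state which aggregates are actually used, and `aggregate_daily` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean


def aggregate_daily(samples):
    """Roll raw samples up to daily instance-level and cluster-level values.

    samples: iterable of (day, cluster, instance, value) tuples.
    Returns (instance_daily, cluster_daily) dictionaries.
    """
    per_instance = defaultdict(list)
    for day, cluster, instance, value in samples:
        per_instance[(day, cluster, instance)].append(value)

    # Instance level: daily peak per instance (assumed statistic).
    instance_daily = {key: max(vals) for key, vals in per_instance.items()}

    per_cluster = defaultdict(list)
    for (day, cluster, _instance), peak in instance_daily.items():
        per_cluster[(day, cluster)].append(peak)

    # Cluster level: mean over the instances that reported that day.
    cluster_daily = {key: mean(vals) for key, vals in per_cluster.items()}
    return instance_daily, cluster_daily


samples = [
    ("2016-09-01", "cass_a", "i-1", 10),
    ("2016-09-01", "cass_a", "i-1", 30),
    ("2016-09-01", "cass_a", "i-2", 50),
]
instance_daily, cluster_daily = aggregate_daily(samples)
print(instance_daily)
print(cluster_daily)
```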
Growth Criteria
f(x) of
– Subscriber
– Netflix content
– # Viewing Sessions
ARIMA
– AR
• Regression on prior values
– I
• Data values are replaced with (x(i) - x(i-1))
– MA
• Linear combination of error terms
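To make the AR and I components concrete, here is a simplified sketch of an ARIMA(1,1,0) forecast: the series is differenced once (the I step), an AR(1) regression is fitted to the differences by least squares, and forecasts are integrated back to the original scale. The MA term is deliberately left out, so this is not a full ARIMA fit, and it is not presented as the method used in the talk. All function names are hypothetical.

```python
def difference(series):
    """I step: replace each value with x(i) - x(i-1)."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(values):
    """AR step: least-squares fit of v[t] = c + phi * v[t-1]."""
    x, y = values[:-1], values[1:]
    n = len(x)
    if n < 2:
        raise ValueError("need at least three values to fit AR(1)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        phi = 0.0  # constant input: fall back to predicting the mean
    else:
        phi = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return mean_y - phi * mean_x, phi


def forecast(series, steps):
    """Forecast `steps` values ahead with AR(1) on first differences."""
    if len(series) < 4:
        raise ValueError("need at least four observations")
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    predictions = []
    for _ in range(steps):
        last_diff = c + phi * last_diff   # next predicted difference
        level = level + last_diff         # undo the differencing
        predictions.append(level)
    return predictions


# Hypothetical daily cluster-level metric with a steady upward trend.
print(forecast([10, 12, 14, 16, 18], steps=3))
```

For real workloads a library implementation that also estimates the MA term would normally be preferred over hand-rolled fitting.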
Future
• Vector Auto Regression
• Automate manual judgement
Resources
– https://www.otexts.org/fpp/8
Q & A
You may not control all the events that happen to you, but you CAN decide not to be reduced by them.
-Maya Angelou