Presenters: Alexander Filipchik and Dustin Pham, Staff Software Engineers at Sony Network Entertainment. Since the launch of the PlayStation 4, many of the PSN features have been delivered using Cassandra. We will be talking about our experience as we launched one of the most popular gaming consoles in the world on well over 300 nodes. • Why we picked Cassandra • Exactly what PSN features for PS4 are powered by Cassandra • The infrastructure used to deploy our clusters • How we monitor system health • How we design, test and deploy • Issues we faced and lessons learned along the way
Launching PS4 with Cassandra
Introduction • Alexander Filipchik – Staff Software Engineer at SNEI
• Dustin Pham – Staff Software Engineer at SNEI
Agenda • Journey towards Cassandra • Cassandra-backed PS4 Features • Ops-y Stuff • Lessons learned
Journey towards Cassandra
Challenges • Small Team • Legacy Support • Hardware Deadline • Scaling @ Peak Time
Why Cassandra • Strong community • Horizontally scalable architecture • Good performance • Cost effective • New adventure :)
PS4 Features backed by Cassandra
Cassandra-backed PS4 features
• What’s New • Video Library • My Library • PS Now • Notifications • LiveArea • Store catalog • Pre-order • PS Plus • Recommendations • Remote Download • Share • Authentication • +more
Ops-y Stuff
Infrastructure • Hosted in cloud and physical DCs • Several hundred nodes and growing • Cluster by feature • Vnodes and Assigned token clusters • Astyanax Client
Stats for PS4 cloud nodes • Data throughput: Gigabytes / sec • Cassandra read/writes: > 200,000 / sec • Data size: tens of terabytes • 10M PS4 and 80M PS3 sold
Clusters • Cluster per Read/Write pattern initially • Now use cluster per feature • Seeds referenced by DNS names • Size Tiered compaction • Manual compactions for some CFs
A typical node • m2.4xl + i2.2xl • 2 ephemeral disks (~ 2 x 800 GB) • Commit log on root partition • Topology managed in the topology file (managed by Chef)
AWS • Nodes are interleaved between AZs – Replication factor spreads data across AZs – Minimizes downtime due to AZ outage
[Diagram: nodes interleaved between Availability Zone A and Availability Zone C]
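The interleaving idea can be sketched as follows. This is a minimal illustration, not the deck's actual topology: the zone names, node names, and the simple clockwise ring walk are assumptions made for the example, and a real cluster relies on its snitch and replication strategy for placement.

```python
from itertools import cycle

def interleave_nodes(node_count, zones):
    """Assign ring positions to zones round-robin, so ring neighbours sit in different AZs."""
    zone_cycle = cycle(zones)
    return [(f"node{i}", next(zone_cycle)) for i in range(node_count)]

def replicas_for(ring, start_index, replication_factor):
    """Walk the ring clockwise from start_index and take the next RF nodes (simplified placement)."""
    return [ring[(start_index + k) % len(ring)] for k in range(replication_factor)]

# Hypothetical zone names; six nodes interleaved across three zones.
ring = interleave_nodes(6, ["az-a", "az-b", "az-c"])
replicas = replicas_for(ring, start_index=4, replication_factor=3)
zones_used = {zone for _, zone in replicas}
print(replicas)
print(zones_used)
```

With as many zones as replicas, every replica set in this sketch spans all zones, so losing one zone still leaves two copies of each row.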
Disk Layout
Pre-Launch: 2 ephemerals in a RAID 0 (AWS m2.4xl, Eph0 + Eph1) • Higher throughput (I/O spreads across 2 devices for reading & writing) • If you lose 1 device, you lose the array!
Launch: 2 ephemerals in a RAID 1 (AWS m2.4xl, Eph0 + Eph1) • Higher throughput for reading (I/O spreads across 2 devices), but not for writing • If you lose 1 device, the array continues in degraded mode • ½ the available space
Current: 2 individual ephemerals (AWS m2.4xl, Eph0 + Eph1) • Higher throughput (I/O spreads across 2 devices for reading & writing) • If you lose 1 device, Cassandra stops (configurable) • No RAID overhead
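The capacity and failure trade-offs of the three layouts can be written out as simple arithmetic. This is only a sketch: the 800 GB figure is the approximate disk size from the "typical node" slide, and the layout labels (`raid0`, `raid1`, `jbod`) are names chosen for this example.

```python
DISK_GB = 800  # approximate size of one ephemeral disk (from the "typical node" slide)
DISKS = 2

def usable_gb(layout):
    """Usable data capacity for each of the three layouts on the slide."""
    if layout == "raid0":   # striping: capacity of both disks
        return DISK_GB * DISKS
    if layout == "raid1":   # mirroring: capacity of one disk
        return DISK_GB
    if layout == "jbod":    # two independent data directories, no RAID
        return DISK_GB * DISKS
    raise ValueError(f"unknown layout: {layout}")

def data_intact_after_one_disk_loss(layout):
    """True only when the node's local data is still complete after one device fails."""
    return layout == "raid1"

for layout in ("raid0", "raid1", "jbod"):
    print(layout, usable_gb(layout), data_intact_after_one_disk_loss(layout))
```

In the non-mirrored layouts the node depends on the cluster's replication factor to recover the lost data.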
Cluster Resizing
Thrift Payload Size • thrift_framed_transport_size_in_mb • thrift_max_message_length_in_mb
Bouncing Nodes • phi_convict_threshold
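To give a feel for what `phi_convict_threshold` controls, here is a simplified sketch of a phi-accrual style check. It assumes heartbeat gaps follow an exponential distribution, so phi is the negative base-10 logarithm of the chance that a heartbeat is merely late. The function names are illustrative and this is not Cassandra's actual failure detector code; 8 is Cassandra's default threshold.

```python
import math

def phi(seconds_since_last_heartbeat, mean_heartbeat_interval):
    """Suspicion level under an exponential model: -log10(exp(-t / mean))."""
    return seconds_since_last_heartbeat / (mean_heartbeat_interval * math.log(10))

def is_convicted(silence, mean_interval, phi_convict_threshold=8.0):
    """A node is marked down once phi exceeds the configured threshold."""
    return phi(silence, mean_interval) > phi_convict_threshold

# With heartbeats arriving about once per second:
print(is_convicted(18, 1.0))        # short silence, default threshold
print(is_convicted(19, 1.0))        # slightly longer silence, default threshold
print(is_convicted(19, 1.0, 12.0))  # same silence, more tolerant threshold
```

Raising the threshold makes the detector tolerate longer silences before marking a node down, which is one way to reduce nodes flapping on a noisy network.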
Inter-DC Latency
Monitor system health • Nagios • Kibana/Elasticsearch • Graphite • AWS Cloudwatch • App level monitoring • Opscenter
App level metrics
Lessons Learned
Fun with Astyanax Client • Cross DC Latencies – Several-second latencies in JP and EE data centers – Astyanax configs to ensure local data centers are used • Imbalanced node traffic – Hashing algorithm (MD5 vs Murmur3) • DNS Caching in the JVM – Stale seed nodes
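The "use the local data center" fix amounts to filtering the candidate hosts before a connection is chosen. The sketch below shows that idea in isolation; the host records, data center labels, and the fallback behaviour are assumptions for illustration and do not reproduce Astyanax's configuration API.

```python
def pick_hosts(hosts, local_dc):
    """Prefer hosts in the local data center; fall back to all hosts only if none are local."""
    local = [h for h in hosts if h["dc"] == local_dc]
    return local if local else list(hosts)

# Hypothetical host list spanning two data centers.
hosts = [
    {"name": "cass-1", "dc": "us-west"},
    {"name": "cass-2", "dc": "jp"},
    {"name": "cass-3", "dc": "us-west"},
]
print([h["name"] for h in pick_hosts(hosts, "us-west")])
print([h["name"] for h in pick_hosts(hosts, "eu")])
```

Whether to fall back to remote hosts at all is a design choice: it trades higher latency for availability when the local data center is unreachable.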
A tale of 2 Nodes
Cluster lessons • A single bad node can raise app latencies significantly • Taking out an entire Cassandra cluster is easy (not so fun) – Compressing data before sending to Cassandra helps a lot • Corrupted SSTable resulted in cascading failure
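Client-side compression before a write can be sketched with the standard library alone. The JSON serialization, the zlib level, and the sample payload are assumptions for this example; the deck does not say which codec or format was used.

```python
import json
import zlib

def pack(value):
    """Serialize to compact JSON, then compress before handing the bytes to the client driver."""
    raw = json.dumps(value, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, 6)

def unpack(blob):
    """Reverse of pack(): decompress, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Hypothetical, repetitive payload of the kind that compresses well.
payload = {"items": [{"title": "example title", "status": "owned"}] * 200}
blob = pack(payload)
raw_size = len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))
print(raw_size, len(blob))
```

Smaller values mean less data on the wire and smaller request payloads, at the cost of some CPU on the application side.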
• Monitoring – Memtable flush frequency – Hinted handoffs – Garbage collection – Compactions – Histograms
• VPNs are a dangerous bottleneck
• Easier to rebuild a node than to fix
• Backup data – Replication factor helps but does not account for data corruption
• Denormalization costs • Disk is cheap but EC2 instances are not • TTL on almost everything • Adjust gc_grace_seconds (the GC grace period) based on TTL times • Transactions? Be creative • Load test with real data
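The interaction between TTLs and the GC grace period can be sketched as follows. The helper gives a conservative upper bound on how long an expired cell may have to wait before a compaction is allowed to drop it; exact purge rules vary by Cassandra version and also depend on when compaction runs. The table name is hypothetical, while `default_time_to_live` and `gc_grace_seconds` are CQL table options.

```python
def latest_purge_eligibility(ttl_seconds, gc_grace_seconds):
    """Conservative bound: seconds after the write until compaction may drop the expired cell."""
    return ttl_seconds + gc_grace_seconds

def alter_table_cql(table, ttl_seconds, gc_grace_seconds):
    """Build the CQL statement that sets both options on a table."""
    return (
        f"ALTER TABLE {table} WITH default_time_to_live = {ttl_seconds} "
        f"AND gc_grace_seconds = {gc_grace_seconds};"
    )

ONE_DAY = 86400
# A 1-day TTL with the default 10-day grace period versus a 1-day grace period.
print(latest_purge_eligibility(ONE_DAY, 10 * ONE_DAY) // ONE_DAY)
print(latest_purge_eligibility(ONE_DAY, ONE_DAY) // ONE_DAY)
print(alter_table_cql("notifications", ONE_DAY, ONE_DAY))
```

A shorter grace period frees disk sooner, but repairs then need to complete within that window so that deleted data does not reappear.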
• Replication strategy: – Read / Write pattern – Data is source of truth or not – Data locality – User level data vs App level data • Cluster wide commands should be staggered – Global repair :(
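Staggering a cluster-wide command can be as simple as giving each node its own start time. The sketch below only computes the schedule; the node names, start time, and two-hour gap are assumptions for the example, and actually running something like `nodetool repair` on each node is left to whatever scheduler is in use.

```python
from datetime import datetime, timedelta

def staggered_schedule(nodes, start, gap):
    """Give each node its own start time so that no two nodes start the command together."""
    return [(node, start + i * gap) for i, node in enumerate(nodes)]

# Hypothetical nodes and timing.
nodes = ["cass-1", "cass-2", "cass-3"]
schedule = staggered_schedule(nodes, datetime(2014, 1, 1, 2, 0), timedelta(hours=2))
for node, when in schedule:
    print(node, when.isoformat())
```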
Tokens • Vnodes vs Assigned Tokens – Increased chattiness on gossip protocol with vnodes – Perceived slowness on repair and cleanup operations on vnode-enabled clusters – Astyanax client does not like vnodes…
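For clusters with assigned tokens, each node gets one evenly spaced token from the partitioner's range. A minimal sketch of that calculation, assuming the standard ranges of the two partitioners (0 to 2^127 for RandomPartitioner, -2^63 to 2^63 - 1 for Murmur3Partitioner); the function names are illustrative:

```python
def random_partitioner_tokens(node_count):
    """Evenly spaced initial tokens for RandomPartitioner (MD5-based)."""
    return [i * (2**127 // node_count) for i in range(node_count)]

def murmur3_tokens(node_count):
    """Evenly spaced initial tokens for Murmur3Partitioner."""
    return [-(2**63) + i * (2**64 // node_count) for i in range(node_count)]

print(random_partitioner_tokens(4))
print(murmur3_tokens(4))
```

With vnodes, each node instead owns many small randomly placed ranges, which removes this manual step at the cost of the gossip and repair overhead noted above.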
Compactions • Compactions are your worst enemy – Larger disk usage = higher CPU & longer compactions • Leveled compaction vs size-tiered compaction – Start up time – CPU tradeoff – IO tradeoff • Updates + Removals eat up disks
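To illustrate why size-tiered compaction ties disk usage to compaction cost, here is a simplified sketch of how SSTables of similar size are grouped into buckets and picked for compaction. It uses the option names `bucket_low`, `bucket_high`, and `min_threshold` with their usual defaults, but ignores other settings such as `min_sstable_size`; the SSTable sizes are made up, and this is not Cassandra's actual implementation.

```python
def bucket_sstables(sizes_mb, bucket_low=0.5, bucket_high=1.5):
    """Group SSTables whose size is within [low, high] times the bucket's average size."""
    buckets = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])  # no similar bucket found: start a new one
    return buckets

def compaction_candidates(sizes_mb, min_threshold=4):
    """Only buckets holding at least min_threshold SSTables are compacted."""
    return [b for b in bucket_sstables(sizes_mb) if len(b) >= min_threshold]

# Hypothetical SSTable sizes in MB.
sizes = [10, 11, 9, 12, 100, 400]
candidates = compaction_candidates(sizes)
print(candidates)
# Worst case, the merged output needs as much free space as the inputs combined.
print([sum(b) for b in candidates])
```

As SSTables grow, each merge rewrites more data and needs more free space, which is where the larger-disk, longer-compaction trade-off comes from.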
We are hiring… sonyentertainmentnetwork.com/careers