Scaling Social Games

Short talk given at the Berlin Hadoop Get Together on 27 January 2011

Scaling social games: “the order of magnitude challenge”

Paolo Negri @hungryblank

Order of magnitude

DAU: daily active users

[Chart: DAU from July to December, y-axis from 0 to 1,000,000]

Flash client (game) HTTP API

Social Games

http://www.flickr.com/photos/stars6/4381851322

Flash client

Social Games

• Game actions need to be persisted and validated

• 1 API call every few secs

HTTP API

Social Games

• 5000 HTTP reqs/sec

• more than 90% writes

• 60K queries/sec
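The two throughput figures above imply an average number of database queries per HTTP request. A minimal worked example, assuming both numbers were measured at the same point in time (the slides do not say so explicitly):

```python
# Figures quoted on the slide; assumed to describe the same moment in time.
HTTP_REQUESTS_PER_SEC = 5_000
QUERIES_PER_SEC = 60_000


def queries_per_request(queries_per_sec, requests_per_sec):
    """Average number of database queries issued per HTTP request."""
    return queries_per_sec / requests_per_sec


ratio = queries_per_request(QUERIES_PER_SEC, HTTP_REQUESTS_PER_SEC)
print(ratio)  # 12.0
```

Under that assumption each API call triggers about a dozen queries on average, which is the ratio the talk goes on to investigate.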

July 2010

HAproxy

Ruby on Rails

MySQL

• ~ 170 000 daily users

• Plain Ruby on Rails app

• Persistency 100% SQL

• 1 haproxy server

• multiple RoR servers

• 4 mysql servers (sharded dataset)

July 2010

• MySQL: slow down

• High queries/request ratio

Queries/request

• Which code is triggering extra queries?

• Why is the ratio lower in our test environment than in live?
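One way to answer the first question is to count the statements each code path issues. The sketch below is not the instrumentation used in the talk: it uses Python's `sqlite3` trace callback as a stand-in for MySQL, and the two request handlers and the "plugin" behaviour are hypothetical.

```python
import sqlite3
from collections import Counter
from contextlib import contextmanager

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, coins INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 10)")

query_counts = Counter()


@contextmanager
def count_queries(label):
    """Attribute every SQL statement run inside the block to `label`."""
    def on_statement(sql):
        query_counts[label] += 1

    conn.set_trace_callback(on_statement)
    try:
        yield
    finally:
        conn.set_trace_callback(None)  # stop tracing


def handle_request_plain():
    # Hypothetical request handler: one read.
    conn.execute("SELECT coins FROM users WHERE id = 1").fetchone()


def handle_request_with_plugin():
    # Hypothetical plugin behaviour: the same row is fetched repeatedly.
    for _ in range(3):
        conn.execute("SELECT coins FROM users WHERE id = 1").fetchone()


with count_queries("plain"):
    handle_request_plain()
with count_queries("with_plugin"):
    handle_request_with_plugin()

print(dict(query_counts))
```

Comparing the counters per label in test and in live would show which code path inflates the ratio.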

Queries/request

Running code of live system: Application + Ruby on Rails + Plugins

Source of extra queries: Plugins

• sharding plugin “breaks” std Rails query cache

• Flash wire protocol plugin generates extra queries

Plugins

• Deceiving “feature for free”

• Might provide the right feature

• But might not meet scaling need

Plugins

• Instant legacy code, even for new projects!

• Once added it’s your code

• Even if it’s maintained, will it follow your needs?

Plugins

• Assess code quality when you add it

• Can you afford to maintain/change it?

Plugins

• We fixed it!

• Queries cut by up to 40% on some requests

Early August

• The MySQL hiccup

• Every 70 minutes, query time spikes 7x

[Chart: query time in ms (y-axis from 0 to 30) between 6:00 and 8:10]

Hiccup causes

• Code (app + plugins + Rails)?

• Some periodic job?

• The devil (AWS)?

Who is periodically blocking MySQL?

Hiccup quick fix

• We shard out the top queried table (40% of all queries)

[Diagram: MySQL servers holding shard 1, shard 2, shard 3, shard 4]

[Diagram: “Top table” shards 1 to 4 split out from “Other tables” shards 1 to 4]

Hiccup quick fix

• MySQL likes it

• “top table” shards will go a long way in the scaling process
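The routing implied by the quick fix can be sketched as follows. This is an illustration only: the server names, the `top_table` name and the modulo-by-user-id scheme are assumptions, since the slides do not say how rows are assigned to shards.

```python
# Hypothetical server names; the slides only show 4 shards per group.
OTHER_TABLE_SHARDS = ["other-1", "other-2", "other-3", "other-4"]
TOP_TABLE_SHARDS = ["top-1", "top-2", "top-3", "top-4"]
TOP_TABLE = "top_table"  # placeholder name for the most queried table


def shard_for(table, user_id):
    """Pick the server that holds `table` rows for `user_id`.

    Assumes modulo sharding on the user id (not confirmed by the talk).
    """
    pool = TOP_TABLE_SHARDS if table == TOP_TABLE else OTHER_TABLE_SHARDS
    return pool[user_id % len(pool)]


print(shard_for("top_table", 123))  # top-4
print(shard_for("items", 123))      # other-4
```

The point of the split is that the 40% of queries hitting the top table no longer compete with the other tables on the same servers.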

Hiccup causes

• Code (app + plugins + Rails)?

• Some periodic job?

• The devil (AWS)?

Who is periodically blocking MySQL?

None of the Above

Hiccup real cause

• MySQL internal behavior that emerges at high volume

• MySQL flushes its buffer

• Under heavy write I/O the flush is blocking

Hiccup solution

• Percona MySQL patches (XtraDB) avoid blocking behavior

• Query time profile gets smooth

• I/O capacity limits now show up as gradual performance decay

Write through cache

• Memcache in front of MySQL

• Evaluated before sharding

• Was discarded

• Because of our read/write ratio

Write through cache

90% of the time we read data in order to modify it

Write through cache

It means 90% of the times

1. read cache

2. write cache

3. write SQL
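The three steps can be made concrete with a small sketch. It uses in-memory stand-ins for memcache and MySQL that only count operations; the class names and the key are hypothetical.

```python
class CountingStore:
    """In-memory stand-in for memcache or MySQL that counts operations."""

    def __init__(self):
        self.data = {}
        self.reads = 0
        self.writes = 0

    def get(self, key):
        self.reads += 1
        return self.data.get(key)

    def set(self, key, value):
        self.writes += 1
        self.data[key] = value


class WriteThroughCache:
    """Every write goes to the cache and to the database."""

    def __init__(self, cache, db):
        self.cache, self.db = cache, db

    def read(self, key):
        value = self.cache.get(key)
        if value is None:  # cache miss: fall back to the database
            value = self.db.get(key)
            if value is not None:
                self.cache.set(key, value)
        return value

    def write(self, key, value):
        self.cache.set(key, value)  # 2. write cache
        self.db.set(key, value)     # 3. write SQL


cache, db = CountingStore(), CountingStore()
store = WriteThroughCache(cache, db)
store.write("user:123:coins", 10)     # warm-up write
cache.writes = db.writes = 0          # reset counters before measuring

coins = store.read("user:123:coins")  # 1. read cache
store.write("user:123:coins", coins + 5)

print(cache.reads, cache.writes, db.reads, db.writes)  # 1 1 0 1
```

The cache saves the SQL read, but every read-modify-write still costs one SQL write, so a write-heavy workload stays bound to the database.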

Write through cache

Bound to:

• Read heavy: memcache performance

• Write heavy: MySQL writes (unless async); is the write-through lib optimized for writes?

MySQL

• Sharding SQL is a painful way to scale

• Data migrations at high load imply downtime

• ACID benefits are all lost, either because of sharding or in the name of performance

Redis

• A persistent cache

• Fast: 60,000 qps on AWS hardware

• Interesting data structures, not only KV

• Already some small-scale experience in house

Redis adoption

• Which data to start from?

• How do we migrate without downtime?

• Which library maps Ruby objects to Redis structures?

Redis adoption

• Which data to start from?

• Best data fit for Redis hashes

• Third most queried table

• a collection of integer fields that need only increment / decrement
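Why this data fits Redis hashes: the Redis `HINCRBY` command increments or decrements an integer field of a hash in a single call. The sketch below mimics that behaviour with an in-memory class so it runs without a Redis server; the class, key and field names are hypothetical.

```python
from collections import defaultdict


class FakeRedisHashes:
    """In-memory stand-in for the part of Redis hashes used here."""

    def __init__(self):
        self._hashes = defaultdict(dict)

    def hincrby(self, key, field, amount=1):
        """Mimic HINCRBY: add `amount` to an integer field.

        A missing field starts at 0; the new value is returned.
        """
        new_value = self._hashes[key].get(field, 0) + amount
        self._hashes[key][field] = new_value
        return new_value

    def hgetall(self, key):
        return dict(self._hashes[key])


r = FakeRedisHashes()
r.hincrby("user:123:counters", "coins", 50)   # earn 50 coins
r.hincrby("user:123:counters", "coins", -20)  # spend 20 coins
r.hincrby("user:123:counters", "energy", 1)

print(r.hgetall("user:123:counters"))  # {'coins': 30, 'energy': 1}
```

With a real Redis client the read-modify-write cycle disappears: the server applies the delta itself, so there is no separate read before each write.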

Redis adoption

• How do we migrate without downtime?

• Migrate one user at a time

• Use a Redis set to keep track of migrated vs. non-migrated users

• No downtime, transparent to users

[Diagram: for User 123, the RoR server reads the original data from MySQL, then writes the migrated data to Redis]

Redis adoption

• How do we migrate without downtime?

• Migration might never complete

• Use SQL + Redis set information to generate a final batch migration
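The per-user migration and the final batch can be sketched as follows. Plain dicts and a set stand in for the MySQL table, the Redis hashes and the Redis set; all names and sample rows are hypothetical.

```python
# Stand-ins: a dict for the SQL table, a dict for Redis hashes,
# and a Python set for the Redis set of migrated user ids.
mysql_rows = {123: {"coins": 10}, 456: {"coins": 7}, 789: {"coins": 0}}
redis_hashes = {}
migrated_users = set()


def migrate_user(user_id):
    """Copy one user's row to Redis, then mark the user as migrated."""
    redis_hashes[user_id] = dict(mysql_rows[user_id])  # read original, write migrated
    migrated_users.add(user_id)


def load_counters(user_id):
    """Called on each request: migrate lazily on first access."""
    if user_id not in migrated_users:
        migrate_user(user_id)
    return redis_hashes[user_id]


def final_batch_migration():
    """Migrate users who never came back; returns the ids handled."""
    pending = sorted(set(mysql_rows) - migrated_users)
    for user_id in pending:
        migrate_user(user_id)
    return pending


load_counters(123)                  # active user is migrated on access
leftover = final_batch_migration()  # inactive users are migrated in bulk
print(leftover)                     # [456, 789]
```

Writing the data before marking the user as migrated means an interrupted migration is simply retried on the next access.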

Redis 1st result

10% of the query load from 4 MySQL servers is moved to 1 Redis server

Redis server load is 0.05

Redis

• Becomes the tool to use

• Migration plan for all write intensive data

• Migrate one “class” at a time

Redis honeymoon end

• Memory usage grows more than data

• Snapshot to disk causes spikes in query time

• Starting new slaves eats memory on the master node

Redis honeymoon end

• Redis machine sized with overabundant RAM

• Rigorous slave/master starting plan

Russian Roulette Feeling

Redis

• Redis team acknowledges persistency/replication problems

• Redis 2.4 diskstore plan starts

1.000.000

And counting...

1.000.000

HAproxy, Ruby on Rails, Persistency

• Painless scaling: just add servers as load grows

• Persistency: painful and troublesome

Infrastructure

• AWS

• Chef - through Scalarium

• Ganglia

Thanks...

wooga is looking for a Business Intelligence Engineer

http://wooga.com/jobs