1,000,000 daily users and no cache (Splash 2011)

  • Upload
    wooga

DESCRIPTION

Online games pose a few interesting challenges for their backends: a single user generates one HTTP call every few seconds, and the balance between data reads and writes is close to 50/50, which makes a write-through cache and other common scaling approaches less effective. We started from a rather classic Ruby on Rails application and, as traffic grew, gradually changed it to meet the required performance. When small changes were no longer enough, we turned parts of our data persistence layer inside out, migrating from SQL to NoSQL without taking downtime of more than a few minutes. Follow the problems we hit, how we diagnosed them, and how we got around limitations. See which tools we found useful and which other lessons we learned by running the system with a team of just two developers, without a sysadmin or operations team as support.

Page 1: 1,000,000 daily users and no cache (Splash 2011)
Page 2: 1,000,000 daily users and no cache (Splash 2011)

Who is that guy?

Jesper Richter-Reichhelm

Twitter: @jrirei

Head of Engineering

wooga

Berlin, Germany

Page 3: 1,000,000 daily users and no cache (Splash 2011)

wooga is the #3 game developer on Facebook

Page 4: 1,000,000 daily users and no cache (Splash 2011)

Wooga has dedicated game teams

Coming soon

Page 5: 1,000,000 daily users and no cache (Splash 2011)
Page 6: 1,000,000 daily users and no cache (Splash 2011)

Flash client sends state changes to backend

[Diagram: Flash client, Ruby backend]

Page 7: 1,000,000 daily users and no cache (Splash 2011)

Social games need to scale quite a bit

400 million PIs / month

14 billion requests / month

100,000 DB operations / second

50,000 DB updates / second

no cache
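
To put these numbers in proportion, a quick back-of-the-envelope calculation helps. The sketch below assumes a 30-day month and evenly spread traffic; neither assumption comes from the talk, and real traffic will peak well above the average.

```ruby
# Back-of-the-envelope figures derived from the slide numbers.
# Assumptions (not from the talk): 30-day month, evenly spread traffic.
REQUESTS_PER_MONTH = 14_000_000_000
DAILY_USERS        = 1_000_000
DAYS_PER_MONTH     = 30
SECONDS_PER_DAY    = 24 * 60 * 60

# Integer division is fine here; we only want rough magnitudes.
requests_per_day          = REQUESTS_PER_MONTH / DAYS_PER_MONTH
avg_requests_per_second   = requests_per_day / SECONDS_PER_DAY
requests_per_user_per_day = requests_per_day / DAILY_USERS

puts "average requests per second:   #{avg_requests_per_second}"
puts "requests per user per day:     #{requests_per_user_per_day}"
```

Under these assumptions the backend averages several thousand requests per second, and each daily user accounts for a few hundred requests per day.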

Page 14: 1,000,000 daily users and no cache (Splash 2011)

A journey to 1,000,000 daily users

Start of the journey

6 weeks of pain

Paradise

Conclusion

Page 15: 1,000,000 daily users and no cache (Splash 2011)

October 2009: wooga's first simulation game

Page 16: 1,000,000 daily users and no cache (Splash 2011)

Instead of PHP we used Ruby

Page 17: 1,000,000 daily users and no cache (Splash 2011)

Our database was MySQL

Two shards: even user ids, odd user ids

Page 19: 1,000,000 daily users and no cache (Splash 2011)

And we went into the cloud

Page 20: 1,000,000 daily users and no cache (Splash 2011)

Master-slave replication for DBs worked fine

[Diagram: 3 app servers, 1 load balancer (lb), 2 db servers]

Page 21: 1,000,000 daily users and no cache (Splash 2011)

We added a few application servers over time

[Diagram: 9 app servers, 1 load balancer (lb), 2 db servers]

Page 22: 1,000,000 daily users and no cache (Splash 2011)

250K daily users and no problems

[Chart: daily users from Apr-10 to Oct-11, y-axis 0 to 2,000,000]

Life was good

Page 23: 1,000,000 daily users and no cache (Splash 2011)

Life was good and I went on a nice vacation

<picture: Jesper in slot canyon>

TO  DO

Page 24: 1,000,000 daily users and no cache (Splash 2011)
Page 25: 1,000,000 daily users and no cache (Splash 2011)

Our bane: MySQL hiccups

[Chart: y-axis 0% to 100%, x-axis 0 to 40; axis names not recoverable]

Page 28: 1,000,000 daily users and no cache (Splash 2011)

A journey to 1,000,000 daily users

Start of the journey

6 weeks of pain

Paradise

Conclusion

Page 29: 1,000,000 daily users and no cache (Splash 2011)

SQL queries generated by RubyAMF gem

AMF responses to Flash client

Wrong config...

... so associated data was included, too

=> Easy to fix

Page 32: 1,000,000 daily users and no cache (Splash 2011)

More traffic using the same cluster

[Diagram: 9 app servers, 1 load balancer (lb), 2 db servers]

Page 33: 1,000,000 daily users and no cache (Splash 2011)

Config tweaks brought us to 300K DAU

[Chart: daily users from Apr-10 to Oct-11, y-axis 0 to 2,000,000]

Config fixes

Page 34: 1,000,000 daily users and no cache (Splash 2011)

ActiveRecord's checks caused 20% extra DB

Checking connection state

MySQL process list full of 'status' calls

=> Fixed by 1 line of code

Page 36: 1,000,000 daily users and no cache (Splash 2011)

I/O on MySQL masters still was the bottleneck

New Relic: 60% of all UPDATEs on 'tiles' table

Page 37: 1,000,000 daily users and no cache (Splash 2011)

Tiles are part of the core game loop

Core game loop
1) plant
2) wait
3) harvest

Page 38: 1,000,000 daily users and no cache (Splash 2011)

We started to shard on model, too

Adding new shards
1) Set up new masters as slaves of old ones
2) Start using new masters
3) Cut replication
4) Truncate

[Diagram: old master, old slave, new master, new slave]
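
The routing that results from sharding on model in addition to user id can be sketched as below. This is a minimal illustration, not the code from the talk: the shard names, the `SHARDED_MODELS` table, and the `shard_for` helper are all hypothetical; only the scheme itself (a separate database for tiles, plus the even/odd user id split) follows the slides.

```ruby
# Hypothetical shard router: picks a database name from model and user id.
# Names are made up for illustration; only the scheme follows the slides.
SHARDED_MODELS = { "Tile" => "tilesdb" }.freeze
DEFAULT_DB     = "db"

def shard_for(model_name, user_id)
  # First dimension: models with their own database (here only tiles).
  base = SHARDED_MODELS.fetch(model_name, DEFAULT_DB)
  # Second dimension: even or odd user id.
  parity = user_id.even? ? "even" : "odd"
  "#{base}_#{parity}"
end

puts shard_for("Tile", 42)
puts shard_for("User", 7)
```

Keeping the lookup in one small function means that adding another model-specific shard only requires a new entry in the table.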

Page 44: 1,000,000 daily users and no cache (Splash 2011)

4 DB masters and a few more servers

[Diagram: app servers, 1 load balancer (lb), 2 tilesdb, 2 db]

Page 45: 1,000,000 daily users and no cache (Splash 2011)

Sharding by model brought us to 400K DAU

[Chart: daily users from Apr-10 to Oct-11, y-axis 0 to 2,000,000]

Shard by model

Page 46: 1,000,000 daily users and no cache (Splash 2011)

We improved our MySQL setup

RAID-0 of EBS volumes

Using XtraDB

Tweaking my.cnf

Page 49: 1,000,000 daily users and no cache (Splash 2011)

Sharding gem circumvented AR's internal cache

ActiveRecord caches SQL queries...

... only in our development environment!

=> Fixed by 2 lines of code

Page 52: 1,000,000 daily users and no cache (Splash 2011)

I/O still was not fast enough

If 2 + 2 is not enough, ...

... perhaps 4 + 4 masters will do?

Page 54: 1,000,000 daily users and no cache (Splash 2011)

It's no fun to handle 8+8 MySQL DBs

[Diagram: app servers, 1 load balancer (lb), 4 tilesdb, 4 db]

Page 56: 1,000,000 daily users and no cache (Splash 2011)

At 500K DAU we were at a dead end

[Chart: daily users from Apr-10 to Oct-11, y-axis 0 to 2,000,000]

Page 58: 1,000,000 daily users and no cache (Splash 2011)

I/O remained bottleneck for MySQL UPDATEs

Each DB master could do about 1000 DB writes/s.

That's not enough!
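
A rough calculation shows why. The sketch below takes the 50,000 DB updates / second quoted earlier in the talk as the target write load (an assumption: that figure belongs to the 1,000,000 daily user mark, not to the 500K DAU point discussed here) and divides it by the per-master write rate.

```ruby
# Rough capacity estimate. Treating 50,000 updates/s as the target load
# is an assumption; the per-master rate is the slide's "about 1000".
TARGET_UPDATES_PER_SECOND = 50_000
WRITES_PER_MASTER_PER_SEC = 1_000

masters_needed = (TARGET_UPDATES_PER_SECOND.to_f / WRITES_PER_MASTER_PER_SEC).ceil

puts "MySQL masters needed: #{masters_needed}"
```

Since each master also needs a slave, the number of MySQL boxes to operate would be twice that, which is far beyond the 8+8 setup that was already no fun to handle.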

Page 60: 1,000,000 daily users and no cache (Splash 2011)

Pick the right tool for the job!

Page 61: 1,000,000 daily users and no cache (Splash 2011)

Redis is fast but goes beyond simple key/value

Redis is a key-value store
Hashes, Sets, Sorted Sets, Lists
Atomic operations like set, get, increment

50,000 transactions/s on EC2
Writes are as fast as reads

Page 63: 1,000,000 daily users and no cache (Splash 2011)

Wooga has dedicated game teams

Page 64: 1,000,000 daily users and no cache (Splash 2011)

Shelf tiles: An ideal candidate for using Redis

Shelf tiles:
{ plant1 => 184, plant2 => 141, plant3 => 130, plant4 => 112, … }

Redis Hash
HGETALL
HGET
HSET
HINCRBY
…
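
To illustrate how the shelf tiles map onto a Redis hash, here is a small in-memory stand-in that mimics the four commands named on the slide. The `FakeRedisHashes` class and the key naming scheme are hypothetical and exist only so the example runs without a Redis server; treating the numbers as counts is likewise an assumption. As in Redis, field values are stored as strings and HINCRBY returns the new integer value.

```ruby
# In-memory stand-in that mimics four Redis hash commands.
# Hypothetical class for illustration; a real app would talk to Redis.
class FakeRedisHashes
  def initialize
    @data = Hash.new { |store, key| store[key] = {} }
  end

  # HSET: set one field of the hash stored at key.
  def hset(key, field, value)
    @data[key][field] = value.to_s
  end

  # HGET: read one field, nil if the key or field is missing.
  def hget(key, field)
    @data.key?(key) ? @data[key][field] : nil
  end

  # HINCRBY: add an integer to a field, treating a missing field as 0.
  def hincrby(key, field, amount)
    new_value = @data[key].fetch(field, "0").to_i + amount
    @data[key][field] = new_value.to_s
    new_value
  end

  # HGETALL: return a copy of all fields and values of the hash.
  def hgetall(key)
    @data.key?(key) ? @data[key].dup : {}
  end
end

redis = FakeRedisHashes.new
key   = "user:42:shelf_tiles" # hypothetical key naming scheme

redis.hset(key, "plant1", 184)
redis.hset(key, "plant2", 141)
redis.hincrby(key, "plant1", -1) # assumed meaning: one plant1 leaves the shelf

puts redis.hgetall(key).inspect
```

One hash per user keeps all shelf tiles under a single key, so a full read is one HGETALL and a single change is one HINCRBY instead of an SQL UPDATE.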

Page 66: 1,000,000 daily users and no cache (Splash 2011)

Migrate on the fly when accessing new model

Page 67: 1,000,000 daily users and no cache (Splash 2011)

Migrate on the fly - but only once

true if id could be added, else false
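
The "only once" guard can be sketched with a set of already migrated ids whose add operation reports whether the id was new. The code below is a minimal in-memory sketch, not the talk's implementation: `MigratedIds`, `migrate_once`, and the plain hashes standing in for the SQL table and the new store are all hypothetical.

```ruby
require "set"

# Stand-in for the set of migrated ids (hypothetical; the slide only says
# the call returns true if the id could be added, else false).
class MigratedIds
  def initialize
    @ids = Set.new
  end

  # true if the id was newly added, false if it was already present
  def add?(id)
    !@ids.add?(id).nil?
  end
end

# Hypothetical migration step: copies a user's tiles from the old rows
# (a plain Hash here) into the new store, but only on the first call.
def migrate_once(user_id, migrated, sql_rows, new_store)
  return false unless migrated.add?(user_id)

  new_store[user_id] = sql_rows.fetch(user_id, {}).dup
  true
end

migrated  = MigratedIds.new
sql_rows  = { 42 => { "plant1" => 184 } }
new_store = {}

puts migrate_once(42, migrated, sql_rows, new_store) # first access migrates
puts migrate_once(42, migrated, sql_rows, new_store) # later accesses skip
```

Because the add and the check are a single operation, two concurrent requests for the same user cannot both decide to migrate.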

Page 68: 1,000,000 daily users and no cache (Splash 2011)

Typical migration throughput over 3 days

Page 69: 1,000,000 daily users and no cache (Splash 2011)

Migrate on the fly - and clean up later

1. Let migration run until everything cools down

2. Migrate the rest manually

3. Remove migration code

4. Wait until no fallback necessary

5. Remove SQL table

Page 74: 1,000,000 daily users and no cache (Splash 2011)

A journey to 1,000,000 daily users

Start of the journey

6 weeks of pain

Paradise (or not?)

Conclusion

Page 75: 1,000,000 daily users and no cache (Splash 2011)

Again: Tiles are part of the core game loop

Core game loop
1) plant
2) wait
3) harvest

Page 76: 1,000,000 daily users and no cache (Splash 2011)

Size matters for migrations

Migration check overload
Migration only on startup

Overlooked an edge case
Only migrate 1% of users
Continue if everything is ok
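
Limiting a migration to 1% of users can be done with a simple gate on the user id. The sketch below is one possible way, assuming bucketing by id modulo 100; the talk does not say how the 1% was chosen, and `migrate_user?` is a hypothetical helper.

```ruby
# Hypothetical rollout gate: migrate only users whose id falls into the
# first `percent` of 100 buckets. Bucketing by id modulo 100 is an assumption.
def migrate_user?(user_id, percent)
  user_id % 100 < percent
end

selected = (1..10_000).count { |id| migrate_user?(id, 1) }

puts "#{selected} of 10000 users selected at 1%"
```

Raising `percent` step by step widens the rollout once everything looks ok. Note that with this bucketing the 1% sample contains only even ids, so on an even/odd sharded setup a hash of the id might spread the sample more evenly.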

Page 78: 1,000,000 daily users and no cache (Splash 2011)

In-memory DBs don't like to dump to disk

Dumping to disk
SAVE is blocking
BGSAVE needs free RAM

Latency increase by 100%

=> BGSAVE on slaves every 15 minutes

Page 81: 1,000,000 daily users and no cache (Splash 2011)

Redis replication starts with a BGSAVE

BGSAVE on master

Slave imports dumped file

=> No RAM means no new slaves

Page 83: 1,000,000 daily users and no cache (Splash 2011)

Redis had a memory fragmentation problem

From 24 GB to 44 GB in 8 days

Page 84: 1,000,000 daily users and no cache (Splash 2011)

Redis had a memory fragmentation problem

From 24 GB to 38 GB in 3 days

Page 85: 1,000,000 daily users and no cache (Splash 2011)

If MySQL is a truck

Fast enough

Disk based

Robust

Page 86: 1,000,000 daily users and no cache (Splash 2011)

If MySQL is a truck, Redis is a race car

Super fast

RAM based

Fragile

Page 87: 1,000,000 daily users and no cache (Splash 2011)

Big and static data in MySQL, rest goes to Redis

Redis: 60 GB data, 50% writes

MySQL: 256 GB data, 10% writes

Photo: http://www.flickr.com/photos/erix/245657047/

Page 88: 1,000,000 daily users and no cache (Splash 2011)

Lots of boxes, but automation helps a lot!

[Diagram: 39 app servers, 2 load balancers (lb), 5 redis, 5 db]

Page 89: 1,000,000 daily users and no cache (Splash 2011)

We reached 1 million daily users!

[Chart: daily users from Apr-10 to Oct-11, y-axis 0 to 2,000,000]

1,000,000 - Big party!

Page 90: 1,000,000 daily users and no cache (Splash 2011)

We started archiving inactive users

[Chart: daily users from Apr-10 to Oct-11, y-axis 0 to 2,000,000]

50% DB reduction

Page 91: 1,000,000 daily users and no cache (Splash 2011)

We even survived a complete data center loss

[Chart: daily users from Apr-10 to Oct-11, y-axis 0 to 2,000,000]

EBS no more!

Page 92: 1,000,000 daily users and no cache (Splash 2011)

We improved our MySQL schema on-the-fly

[Chart: daily users from Apr-10 to Oct-11, y-axis 0 to 2,000,000]

30% DB reduction

Page 93: 1,000,000 daily users and no cache (Splash 2011)

Will we reach 2 million daily users?

[Chart: daily users from Apr-10 to Oct-11, y-axis 0 to 2,000,000]

Page 94: 1,000,000 daily users and no cache (Splash 2011)

A journey to 1,000,000 daily users

Start of the journey

6 weeks of pain

Paradise (or not?)

Conclusion

Page 95: 1,000,000 daily users and no cache (Splash 2011)

You do not know the future

Plan ahead

Learn

Adapt

Page 98: 1,000,000 daily users and no cache (Splash 2011)

Evolution every week

EVOLUTION of software

[Chart: daily users from Apr-10 to Oct-11, y-axis 0 to 2,000,000]

Page 101: 1,000,000 daily users and no cache (Splash 2011)

Evolution every week, Revolution if necessary

EVOLUTION of software

REVOLUTION of software

[Chart: daily users from Apr-10 to Oct-11, y-axis 0 to 2,000,000]

Page 104: 1,000,000 daily users and no cache (Splash 2011)

Each new game is a revolution

Coming soon

Page 109: 1,000,000 daily users and no cache (Splash 2011)

Works for teams and for companies

Page 111: 1,000,000 daily users and no cache (Splash 2011)

Thank you!

Jesper Richter-Reichhelm
@jrirei

slideshare.net/wooga
wooga.com/jobs