SCALING LoL CHAT TO 70 MILLION PLAYERS
Michal Ptaszek, @michalptaszek, Riot Games

Scaling LoL Chat to 70M Players

DESCRIPTION

Would you ever play an online game if you were not able to communicate with your teammates? Isn’t it fun if you can make new friends, arrange pre-made games and celebrate your victories with people you like to play with? Riot Games’ League of Legends handles millions of online players at any given time. Each chat server is responsible for routing over 1 billion real-time events a day. In order to support the overwhelming user base and be prepared for future growth, as well as pave the road for the upcoming features, the chat infrastructure had to be designed and built with the utmost care, so that it would never fail the players. In this talk I would like to present how we achieved linear scalability, improved the overall fault tolerance, created a framework for real-time code upgrades and got ready for the new features we want to ship. I will also discuss in detail why we chose to use Erlang as a foundation for the system, and why we migrated our data from MySQL to Riak.

Page 1: Scaling LoL Chat to 70M Players

SCALING LoL CHAT TO 70 MILLION PLAYERS

Michal Ptaszek, @michalptaszek, Riot Games

Page 2: Scaling LoL Chat to 70M Players

WHAT’S PLANNED

1. GAME

2. CHAT

3. TECH

4. LESSONS LEARNED

5. Q&A

Page 3: Scaling LoL Chat to 70M Players

WHAT IS LEAGUE OF LEGENDS?

2009 LAUNCH

TEAM ORIENTED

100+ CHAMPS

MODERN FANTASY

Page 4: Scaling LoL Chat to 70M Players

CHAT: WHAT IS IT?

MESSAGING SERVICE: Private player chat and group chats.

PRESENCE SERVICE: Friend lists, availability and status.

SOCIAL GRAPH SERVICE: Internal service for store, match history, leagues.

Page 6: Scaling LoL Chat to 70M Players

CHAT BY THE NUMBERS

‣ 67 million monthly players

‣ 27 million daily players

‣ 7.5 million concurrent players

‣ 1 billion events routed per server, per day
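
To put the per-server figure in perspective, the short calculation below converts 1 billion events per day into an average per-second rate. It assumes traffic is spread evenly over the day, which real player traffic is not, so peak rates would be higher than this average. The function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Figure from the slide: events routed per server, per day.
events_per_server_per_day = 1_000_000_000


def average_events_per_second(events_per_day: int) -> float:
    """Average rate, assuming a perfectly even spread over 24 hours."""
    return events_per_day / SECONDS_PER_DAY


rate = average_events_per_second(events_per_server_per_day)
print(f"{rate:,.0f} events/second on average")  # prints "11,574 events/second on average"
```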

Page 7: Scaling LoL Chat to 70M Players

CHAT AT 10K FEET

STABLE, SCALABLE CHAT SERVICE

PROTOCOL, SERVER, DATA STORE

Page 9: Scaling LoL Chat to 70M Players

PROTOCOL: XMPP

‣ Decentralized architecture

‣ Openness

‣ Extensibility

‣ Availability of client libraries

‣ Security

‣ Wide adoption

Page 11: Scaling LoL Chat to 70M Players

SERVER: EJABBERD

‣ Open source Jabber/XMPP server

‣ Relatively nice scalability and performance with default configuration

‣ Wide adoption and active, helpful community

‣ Very good as a starting point for our own server solution

▾ We were aware that one day we would need to start customizing it

‣ Written in the Erlang programming language

Page 12: Scaling LoL Chat to 70M Players

TECHNOLOGY: ERLANG/OTP

Erlang is...

‣ A functional language

‣ Built with concurrency and distribution in mind

‣ Able to scale extremely well

‣ Capable of reloading code on the fly

Which gives us...

‣ A declarative style of programming

‣ An easier way to build our distributed applications

‣ More time to focus on coding

‣ Less downtime

Page 13: Scaling LoL Chat to 70M Players

SERVER: EJABBERD - PHILOSOPHY

ARCHITECTURE: Share nothing approach; enables massive, near linear horizontal scalability.

LET IT CRASH: When something is massively broken - do not fix it!

FAULT TOLERANCE: Implementation of self-healing properties, which bring the system to a well-known, stable state.
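
The "let it crash" idea can be sketched outside Erlang as a supervisor that replaces a failed worker with a fresh one instead of trying to repair its state. This is only an in-process Python analogy under stated assumptions: Erlang/OTP supervisors manage isolated processes, and the `Supervisor` class, `make_echo_worker` factory and the "poison" message below are hypothetical names invented for illustration, not part of ejabberd or the talk.

```python
from typing import Callable, List

Worker = Callable[[str], str]


class Supervisor:
    """Restarts a failed worker from a fresh, known state instead of repairing it."""

    def __init__(self, make_worker: Callable[[], Worker], max_restarts: int = 3):
        self.make_worker = make_worker
        self.max_restarts = max_restarts
        self.restarts = 0
        self.worker = make_worker()

    def handle(self, message: str) -> str:
        try:
            return self.worker(message)
        except Exception:
            self.restarts += 1
            if self.restarts > self.max_restarts:
                raise  # too many failures: escalate instead of looping forever
            # Discard the possibly corrupted worker and start from a clean state.
            self.worker = self.make_worker()
            return "dropped"


def make_echo_worker() -> Worker:
    seen: List[str] = []  # private state, never shared with other workers

    def worker(message: str) -> str:
        if message == "poison":
            raise ValueError("cannot handle message")
        seen.append(message)
        return f"echo {len(seen)}: {message}"

    return worker


sup = Supervisor(make_echo_worker)
results = [sup.handle(m) for m in ["hi", "poison", "hi again"]]
print(results)
```

After the failure, the counter in the reply starts again at 1, which shows that the replacement worker began from its initial state rather than inheriting whatever the crashed one held.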

Page 14: Scaling LoL Chat to 70M Players

SERVER: EJABBERD - ARCHITECTURE

[Architecture diagram. Labels: External Traffic (5223), LB, Ejabberd Server (x2), Internal Traffic, Riak (x2), Secondary Riak Cluster, ETL Queries]

Page 15: Scaling LoL Chat to 70M Players

SERVER: EJABBERD - IMPLEMENTATION

PHASE 1 - MAKE IT WORK

‣ Over time mostly rewritten

‣ Removed unwanted and unneeded parts

‣ Optimized certain flow paths

‣ Make it compatible with industry standards

‣ Wrote over 600 tests to cover it

[Diagram: Alice and Bob exchange four messages: Invite, Accept, Invite, Accept]

Page 16: Scaling LoL Chat to 70M Players

SERVER: EJABBERD - IMPLEMENTATION

PHASE 1 - MAKE IT WORK

‣ Over time mostly rewritten

‣ Removed unwanted and unneeded parts

‣ Optimized certain flow paths

‣ Make it compatible with industry standards

‣ Wrote over 600 tests to cover it

[Diagram: Alice and Bob exchange two messages: Invite, Accept]

Page 17: Scaling LoL Chat to 70M Players

SERVER: EJABBERD - IMPLEMENTATION

PHASE 2: MAKE IT RIGHT

‣ Removed clear bottlenecks

‣ Avoid shared, mutable state

‣ “Make it work, make it right, make it fast”

[Diagram labels: MUC router, MUC rooms (x3), user sessions]

Page 18: Scaling LoL Chat to 70M Players

SERVER: EJABBERD - IMPLEMENTATION

PHASE 2: MAKE IT RIGHT

‣ Removed clear bottlenecks

‣ Avoid shared, mutable state

‣ “Make it work, make it right, make it fast”

[Diagram labels: MUC rooms (x3), user sessions; the MUC router no longer appears]

Page 19: Scaling LoL Chat to 70M Players

SERVER: EJABBERD - IMPLEMENTATION

PHASE 2: MAKE IT RIGHT

‣ Removed clear bottlenecks

‣ Avoid shared, mutable state

‣ “Make it work, make it right, make it fast”

Session Table: JID -> Session Handler

[Diagram labels: session table; Alice, Bob, Charlie]
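
A session table of the kind named on this slide can be sketched as a lookup from a JID to the handler that owns that player's connection. The sketch below is a simplification under stated assumptions: handlers are plain Python callables rather than Erlang processes, the table is a single in-memory dictionary, and the class, method names and example JIDs are hypothetical.

```python
from typing import Callable, Dict, List

SessionHandler = Callable[[str], None]


class SessionTable:
    """Maps a JID (e.g. "alice@example.com") to the handler for that session."""

    def __init__(self) -> None:
        self._sessions: Dict[str, SessionHandler] = {}

    def register(self, jid: str, handler: SessionHandler) -> None:
        self._sessions[jid] = handler

    def unregister(self, jid: str) -> None:
        self._sessions.pop(jid, None)

    def route(self, to_jid: str, payload: str) -> bool:
        """Deliver payload to the session for to_jid; False if the player is offline."""
        handler = self._sessions.get(to_jid)
        if handler is None:
            return False
        handler(payload)
        return True


# Usage: each session appends incoming messages to its own inbox.
inboxes: Dict[str, List[str]] = {"alice@example.com": [], "bob@example.com": []}
table = SessionTable()
for jid, inbox in inboxes.items():
    table.register(jid, inbox.append)

delivered = table.route("bob@example.com", "gg wp")
missed = table.route("charlie@example.com", "hello?")
print(delivered, missed, inboxes["bob@example.com"])
```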

Page 20: Scaling LoL Chat to 70M Players

SERVER: EJABBERD - IMPLEMENTATION

PHASE 3 - MAKE IT FAST

‣ Patched VM and stdlibs

‣ Sacrificed the generic nature of the Erlang/OTP framework in favor of better scalability and fault tolerance

‣ Better traceability and profiling functions

‣ More visibility into the system

‣ Improved logging for code reloading and real time system upgrades

Page 22: Scaling LoL Chat to 70M Players

DATA STORE: RIAK

NOSQL

SCALE: Linearly scalable; no growth headaches

FAULT TOLERANCE: No SPoF; higher uptime

SCHEMA-LESS: Faster feature iterations; more shipped features

‣ Distributed, fault-tolerant, key-value store

‣ Masterless, fully peer-to-peer architecture

‣ AP in CAP theorem, with eventual consistency

‣ Low, predictable latency

‣ Extreme scalability

‣ Multi data center replication
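
One consequence of choosing availability with eventual consistency is that two replicas can accept different writes for the same key, and the application then has to reconcile the conflicting values (Riak calls them siblings). The sketch below shows one simple reconciliation strategy, a set union over friend lists. It is an assumption made for illustration: it does not use the Riak client API, it is not described in the talk as Riot's approach, and a plain union cannot express removals without extra bookkeeping such as tombstones.

```python
from typing import FrozenSet, Iterable


def merge_siblings(siblings: Iterable[FrozenSet[str]]) -> FrozenSet[str]:
    """Reconcile conflicting copies of a friend list by taking their union."""
    merged: FrozenSet[str] = frozenset()
    for sibling in siblings:
        merged |= sibling
    return merged


# Two replicas accepted different writes for the same (hypothetical) key.
replica_a = frozenset({"Bob", "Charlie"})
replica_b = frozenset({"Bob", "Dave"})

merged = merge_siblings([replica_a, replica_b])
print(sorted(merged))  # prints ['Bob', 'Charlie', 'Dave']
```

Union is convenient here because the result does not depend on the order in which the siblings are merged, so every replica converges on the same value.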

Page 23: Scaling LoL Chat to 70M Players

LESSONS LEARNED

UNDERSTAND YOUR SYSTEM

‣ Over 500 real-time counters, rates, histograms collected each minute

‣ Make sure to know counter values for “correct” and “abnormal” conditions

‣ Alerts and logs for long running operations

‣ Integration with Graphite, Zabbix and Nagios
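
Knowing the "correct" value of a counter only helps if something compares each reading against it. The following sketch collects counters for one interval and reports any that fall outside an expected range. The `Metrics` class, the counter names and the ranges are hypothetical; a real deployment would ship these values to a system such as Graphite rather than return strings.

```python
from collections import Counter
from typing import Dict, List, Tuple


class Metrics:
    """Counts events per interval and flags counters outside their normal range."""

    def __init__(self, normal_ranges: Dict[str, Tuple[int, int]]) -> None:
        self.counts: Counter = Counter()
        self.normal_ranges = normal_ranges

    def incr(self, name: str, amount: int = 1) -> None:
        self.counts[name] += amount

    def flush(self) -> List[str]:
        """Return alerts for abnormal counters, then reset for the next interval."""
        alerts = []
        for name, (low, high) in self.normal_ranges.items():
            value = self.counts[name]  # a counter that was never touched reads as 0
            if not low <= value <= high:
                alerts.append(f"{name}={value} outside [{low}, {high}]")
        self.counts.clear()
        return alerts


# Usage: hypothetical per-minute expectations for two counters.
metrics = Metrics({"logins": (1, 100), "db_errors": (0, 0)})
metrics.incr("logins", 5)
metrics.incr("db_errors", 2)
alerts = metrics.flush()
print(alerts)  # prints ['db_errors=2 outside [0, 0]']
```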

Page 24: Scaling LoL Chat to 70M Players

IMPLEMENT FEATURE TOGGLES

LESSONS LEARNED

‣ Safety valve for things that might cause problems

‣ Partial deployments allowing features to be enabled only for certain groups of people

[Diagram: players Alice, Bob and Charlie; the group reordering feature has "whitelist: Bob", so only Bob gets it]
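
A feature toggle with a whitelist, as in the diagram, can be sketched in a few lines. The `FeatureToggle` class and its fields are hypothetical names chosen for illustration, and the toggle lives in memory here, whereas a production system would need to change it at runtime from configuration.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureToggle:
    name: str
    enabled_for_all: bool = False  # flip to True for a full rollout
    whitelist: Set[str] = field(default_factory=set)

    def is_enabled(self, player: str) -> bool:
        return self.enabled_for_all or player in self.whitelist

    def kill(self) -> None:
        """Safety valve: switch the feature off for everyone."""
        self.enabled_for_all = False
        self.whitelist.clear()


group_reordering = FeatureToggle("group_reordering", whitelist={"Bob"})
status = {p: group_reordering.is_enabled(p) for p in ["Alice", "Bob", "Charlie"]}
print(status)  # prints {'Alice': False, 'Bob': True, 'Charlie': False}
```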

Page 25: Scaling LoL Chat to 70M Players

SUPPORT CODE RELOADING

‣ Patching bugs on the fly

‣ Changing server configuration

‣ Collecting data for future analysis

‣ No downtime deploys

LESSONS LEARNED

[Diagram: buggy code, server restart, fixed code; compared with buggy code, fixed code (no restart)]

Page 26: Scaling LoL Chat to 70M Players

GET YOUR LOGGING RIGHT

LESSONS LEARNED

‣ Proper logging and tracing facilities

‣ Debug modes for selected users

‣ Tools for analysis of the collected data

[Diagram labels: Alice; ejabberd.log, slow_db.log, muc_audit.log, roster_audit.log, trace_alice.log; Honu]
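
A debug mode for selected users, with a dedicated trace log like `trace_alice.log`, can be approximated with Python's standard `logging` module. This is a sketch under assumptions: the logger name, the `user` record attribute and the `UserFilter` class are invented for the example, and in-memory streams stand in for the log files so the snippet runs without touching disk.

```python
import io
import logging


class UserFilter(logging.Filter):
    """Passes only records tagged with the traced user."""

    def __init__(self, user: str) -> None:
        super().__init__()
        self.user = user

    def filter(self, record: logging.LogRecord) -> bool:
        return getattr(record, "user", None) == self.user


def build_logger(traced_user: str):
    logger = logging.getLogger("chat_sketch")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False

    main_stream = io.StringIO()  # stands in for the main server log
    main = logging.StreamHandler(main_stream)
    main.setLevel(logging.INFO)  # no debug noise in the main log

    trace_stream = io.StringIO()  # stands in for the per-user trace log
    trace = logging.StreamHandler(trace_stream)
    trace.setLevel(logging.DEBUG)
    trace.addFilter(UserFilter(traced_user))

    logger.addHandler(main)
    logger.addHandler(trace)
    return logger, main_stream, trace_stream


logger, main_log, trace_log = build_logger("alice")
logger.debug("roster fetched", extra={"user": "alice"})
logger.debug("roster fetched", extra={"user": "bob"})
logger.info("server started")

print(repr(main_log.getvalue()))
print(repr(trace_log.getvalue()))
```

Only the traced player's debug records reach the trace stream, so verbose logging for one player does not flood the main log.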

Page 27: Scaling LoL Chat to 70M Players

ALWAYS LOAD TEST YOUR CODE

LESSONS LEARNED

‣ Automatic verification of the latest builds

‣ Collecting historical results for comparison

‣ Measuring the impact of new features and changes to the code

‣ Simulating various failures

Page 28: Scaling LoL Chat to 70M Players

THINGS WILL FAIL

LESSONS LEARNED

‣ Prepare for the worst

‣ It’s just a matter of time before a crash happens

‣ It’s not only our code that fails

‣ At this scale, unlikely events happen every second

Page 29: Scaling LoL Chat to 70M Players

CURRENT SITUATION

CHAT IS DOING GREAT! The quality uptime is over 99% each month, and is increasing, with hundreds of servers deployed all over the world.

SCALE AND PERFORMANCE: Each server offers reliable, low-latency service to the players, routing over 1B events a day with low resource utilization.

CHAT IS EVOLVING: Rolling out Riak worldwide, making LoL Chat available outside of the client, exploring possibilities around using social graph data, and more...

Page 30: Scaling LoL Chat to 70M Players

THANK YOU! ANY QUESTIONS?