SCALING LoL CHAT 
TO 70 MILLION PLAYERS 
Michal Ptaszek, @michalptaszek 
Riot Games
WHAT’S PLANNED
1 GAME
2 CHAT
3 TECH
4 LESSONS LEARNED
5 Q&A
WHAT IS LEAGUE OF LEGENDS? 
2009 LAUNCH
TEAM ORIENTED
100+ CHAMPS
MODERN FANTASY
CHAT: WHAT IS IT?
MESSAGING SERVICE
Private player chat and group chats.
PRESENCE SERVICE
Friend lists, availability and status.
SOCIAL GRAPH SERVICE
Internal service for store, match history, leagues.
CHAT BY THE NUMBERS 
67 million monthly players
27 million daily players
7.5 million concurrent players
1 billion events routed per server, per day
CHAT AT 10K FEET
STABLE, SCALABLE CHAT SERVICE
PROTOCOL
SERVER
DATA STORE
PROTOCOL: XMPP
‣ Decentralized architecture
‣ Openness
‣ Extensibility
‣ Availability of client libraries
‣ Security
‣ Wide adoption
SERVER: EJABBERD 
‣ Open source Jabber/XMPP server 
‣ Relatively nice scalability and performance with default configuration 
‣ Wide adoption and active, helpful community 
‣ Very good as a starting point for our own server solution 
▾ We were aware that one day we would need to start customizing it 
‣ Written in the Erlang programming language
TECHNOLOGY: ERLANG/OTP
Erlang is...
‣ A functional language
‣ Built with concurrency and distribution in mind
‣ Able to scale extremely well
‣ Capable of reloading code on the fly
Which gives us...
‣ A declarative style of programming
‣ An easier way to build our distributed applications
‣ More time to focus on coding
‣ Less downtime
SERVER: EJABBERD - PHILOSOPHY
ARCHITECTURE
Share-nothing approach; enables massive, near-linear horizontal scalability.
FAULT TOLERANCE
Implementation of self-healing properties, which bring the system to a well-known, stable state.
LET IT CRASH
When something is massively broken - do not fix it!
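The "let it crash" idea above can be illustrated with a minimal sketch. This is Python rather than Erlang/OTP, and the names `Supervisor` and `CounterWorker` are hypothetical, not taken from ejabberd or the talk. It only shows the principle: a failed worker is discarded and replaced by a fresh one in a well-known initial state, instead of being repaired in place.

```python
class Supervisor:
    """Restarts a failed worker from a known-good initial state."""

    def __init__(self, make_worker, max_restarts=3):
        self.make_worker = make_worker      # factory that builds a fresh worker
        self.max_restarts = max_restarts    # give up after this many restarts
        self.restarts = 0
        self.worker = make_worker()

    def handle(self, message):
        try:
            return self.worker.handle(message)
        except Exception:
            # Do not try to repair the broken worker: discard it and
            # start a fresh one in its well-known initial state.
            if self.restarts >= self.max_restarts:
                raise
            self.restarts += 1
            self.worker = self.make_worker()
            return None


class CounterWorker:
    """Hypothetical worker that counts messages and crashes on bad input."""

    def __init__(self):
        self.count = 0

    def handle(self, message):
        if message == "poison":
            raise ValueError("corrupted message")
        self.count += 1
        return self.count


sup = Supervisor(CounterWorker)
sup.handle("a")
sup.handle("b")
sup.handle("poison")   # worker crashes and is replaced
sup.handle("c")        # the fresh worker starts counting from 1 again
```

The restart limit is a design choice in this sketch: a worker that keeps crashing is escalated to the caller rather than restarted forever.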
SERVER: EJABBERD - ARCHITECTURE
[Diagram labels: External Traffic (5223); LB; Ejabberd Servers; Internal Traffic; Riak; Secondary Riak Cluster; ETL Queries]
SERVER: EJABBERD - IMPLEMENTATION
PHASE 1 - MAKE IT WORK
‣ Over time mostly rewritten
‣ Removed unwanted and unneeded parts
‣ Optimized certain flow paths
‣ Made it compatible with industry standards
‣ Wrote over 600 tests to cover it
[Diagram: Invite / Accept exchanges between Alice and Bob]
SERVER: EJABBERD - IMPLEMENTATION
PHASE 2 - MAKE IT RIGHT
‣ Removed clear bottlenecks
‣ Avoid shared, mutable state
‣ “Make it work, make it right, make it fast”
[Diagram labels: MUC router; user sessions; MUC rooms]
[Diagram labels, next build of the same slide: user sessions; MUC rooms (no MUC router shown)]
Session Table: JID -> Session Handler
[Diagram: session table with entries for Alice, Bob and Charlie]
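The "JID -> Session Handler" mapping can be pictured with a small sketch. It is written in Python for illustration only; the class `SessionTable`, its methods, and the example JIDs are assumptions and do not reflect ejabberd's actual data structures.

```python
class SessionTable:
    """Maps a JID (e.g. 'alice@example.com') to that user's session handler."""

    def __init__(self):
        self._handlers = {}

    def register(self, jid, handler):
        # Called when a user's session starts.
        self._handlers[jid] = handler

    def unregister(self, jid):
        # Called when the session ends; a missing JID is not an error.
        self._handlers.pop(jid, None)

    def route(self, to_jid, stanza):
        """Deliver a stanza to the recipient's handler; False if offline."""
        handler = self._handlers.get(to_jid)
        if handler is None:
            return False
        handler(stanza)
        return True


# Hypothetical usage: each handler simply appends to an in-memory inbox.
table = SessionTable()
alice_inbox, bob_inbox = [], []
table.register("alice@example.com", alice_inbox.append)
table.register("bob@example.com", bob_inbox.append)
table.route("bob@example.com", "hello from Alice")
```

Routing is a single lookup by JID, so a sender never needs to know where the recipient's session lives.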
SERVER: EJABBERD - IMPLEMENTATION
PHASE 3 - MAKE IT FAST
‣ Patched VM and stdlibs
‣ Sacrificing generic nature of Erlang/OTP framework in favor of better scalability and fault tolerance
‣ Better traceability and profiling functions
‣ More visibility into the system
‣ Improved logging for code reloading and real time system upgrades
NOSQL DATA STORE: RIAK
SCALE: Linearly scalable; no growth headaches
FAULT TOLERANCE: No SPoF; higher uptime
SCHEMA-LESS: Faster feature iterations; more shipped features
‣ Distributed, fault-tolerant, key-value store
‣ Masterless, fully peer-to-peer architecture
‣ AP in CAP theorem, with eventual consistency
‣ Low, predictable latency
‣ Extreme scalability
‣ Multi data center replication
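One practical consequence of an AP store with eventual consistency is that a read may return several conflicting versions of the same value (Riak calls these siblings), and the application has to merge them. The sketch below is a hypothetical illustration in Python. The function name and the friend-list example are assumptions and are not taken from the talk or from the Riak client API.

```python
def merge_friend_lists(siblings):
    """Merge conflicting versions of a friend list by taking their union.

    `siblings` is an iterable of collections of friend names, one per
    conflicting version. Union never loses an added friend, but it can
    bring back a friend that one version had removed; handling removals
    would need extra bookkeeping that this sketch leaves out.
    """
    merged = set()
    for version in siblings:
        merged.update(version)
    return sorted(merged)


# Two replicas accepted different writes while they could not talk to each other.
version_a = ["bob"]
version_b = ["bob", "charlie"]
resolved = merge_friend_lists([version_a, version_b])
```

Union is chosen here because it is commutative and idempotent, so replicas converge regardless of the order in which they see the versions.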
LESSONS LEARNED 
UNDERSTAND YOUR SYSTEM 
‣ Over 500 real-time counters, rates, histograms collected each minute
‣ Make sure to know counter values for “correct” and “abnormal” conditions
‣ Alerts and logs for long running operations
‣ Integration with Graphite, Zabbix and Nagios
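Knowing the "correct" and "abnormal" values for each counter can be expressed as a small sketch: counters are collected per interval and compared against an expected range. The `Metrics` class, the counter names and the ranges are hypothetical. The actual integration with Graphite, Zabbix or Nagios is not shown.

```python
class Metrics:
    """Per-interval counters checked against known-normal ranges."""

    def __init__(self, normal_ranges):
        # normal_ranges: {counter_name: (lowest_normal, highest_normal)}
        self.normal_ranges = dict(normal_ranges)
        self.counters = {}

    def incr(self, name, amount=1):
        self.counters[name] = self.counters.get(name, 0) + amount

    def flush(self):
        """Return (snapshot, alerts) for the elapsed interval and reset."""
        snapshot = self.counters
        self.counters = {}
        alerts = []
        for name, (low, high) in sorted(self.normal_ranges.items()):
            value = snapshot.get(name, 0)
            if not low <= value <= high:
                alerts.append((name, value))
        return snapshot, alerts


metrics = Metrics({"logins": (1, 100), "db_timeouts": (0, 5)})
metrics.incr("logins", 50)
metrics.incr("db_timeouts", 9)
snapshot, alerts = metrics.flush()   # db_timeouts is outside its normal range
```

Note that a counter that should be non-zero but reads zero is also treated as abnormal, which catches silent failures as well as spikes.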
LESSONS LEARNED
IMPLEMENT FEATURE TOGGLES
‣ Safety valve for things that might cause problems
‣ Partial deployments allowing features to be enabled only for certain groups of people
[Diagram: "group reordering" feature with whitelist: Bob; of Alice, Bob and Charlie, only Bob is marked]
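A feature toggle with a whitelist, as in the "group reordering" example, might look like the following sketch. The `FeatureToggle` class and its three modes are assumptions made for illustration; the talk does not describe the actual implementation.

```python
class FeatureToggle:
    """A toggle that is off, on for everyone, or on for a whitelist only."""

    MODES = ("off", "whitelist", "on")

    def __init__(self, mode="off", whitelist=()):
        if mode not in self.MODES:
            raise ValueError("unknown mode: %r" % (mode,))
        self.mode = mode
        self.whitelist = set(whitelist)

    def is_enabled_for(self, user):
        if self.mode == "on":
            return True
        if self.mode == "whitelist":
            return user in self.whitelist
        return False  # "off" acts as the safety valve


group_reordering = FeatureToggle(mode="whitelist", whitelist={"Bob"})
enabled_users = [u for u in ("Alice", "Bob", "Charlie")
                 if group_reordering.is_enabled_for(u)]
```

Setting the mode to "off" disables the feature for everyone, including whitelisted users, which is what makes it usable as a safety valve.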
LESSONS LEARNED
SUPPORT CODE RELOADING
‣ Patching bugs on the fly
‣ Changing server configuration
‣ Collecting data for future analysis
‣ No downtime deploys
[Diagram: buggy code, server restart, fixed code; versus buggy code, fixed code with no restart]
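Erlang's hot code loading has no exact Python equivalent, but `importlib.reload` gives a loose analogue that shows the idea: a running process picks up a fixed version of a module without restarting. The module name `chat_handler` and the function `greet` are hypothetical, and this sketch ignores the hard parts (state migration, in-flight calls) that Erlang/OTP addresses.

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # keep stale cached bytecode out of the demo


def write_module(directory, body):
    (Path(directory) / "chat_handler.py").write_text(body)


def demo_reload():
    """Load a buggy module, patch it on disk, reload it in the same process."""
    with tempfile.TemporaryDirectory() as tmp:
        write_module(tmp, "def greet(name):\n    return 'helo ' + name\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            import chat_handler
            before = chat_handler.greet("Alice")   # buggy output

            # Ship the fix while the process keeps running.
            write_module(tmp, "def greet(name):\n    return 'hello ' + name\n")
            importlib.invalidate_caches()
            importlib.reload(chat_handler)
            after = chat_handler.greet("Alice")    # fixed output
            return before, after
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("chat_handler", None)


before, after = demo_reload()
```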
LESSONS LEARNED
GET YOUR LOGGING RIGHT
‣ Proper logging and tracing facilities
‣ Debug modes for selected users
‣ Tools for analysis of the collected data
[Diagram labels: Alice; ejabberd.log, slow_db.log, trace_alice.log, roster_audit.log, muc_audit.log; Honu]
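A "debug mode for selected users" can be sketched with Python's standard `logging` module: debug records pass only when they concern a traced user, while records at INFO and above always pass. The filter class, the logger name `chat_sketch` and the `user` attribute are assumptions for illustration and do not describe the tooling used in the talk.

```python
import logging


class SelectedUserFilter(logging.Filter):
    """Pass INFO and above for everyone; pass DEBUG only for traced users."""

    def __init__(self, traced_users):
        super().__init__()
        self.traced_users = set(traced_users)

    def filter(self, record):
        if record.levelno >= logging.INFO:
            return True
        # The 'user' attribute is supplied by callers through extra={...}.
        return getattr(record, "user", None) in self.traced_users


class ListHandler(logging.Handler):
    """Collects formatted messages in memory, standing in for a trace file."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(record.getMessage())


def build_logger(traced_users):
    logger = logging.getLogger("chat_sketch")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()
    handler = ListHandler()
    handler.addFilter(SelectedUserFilter(traced_users))
    logger.addHandler(handler)
    return logger, handler


logger, handler = build_logger({"alice"})
logger.debug("roster fetched", extra={"user": "alice"})   # kept: alice is traced
logger.debug("roster fetched", extra={"user": "bob"})     # dropped
logger.info("login", extra={"user": "bob"})               # kept: INFO level
```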
LESSONS LEARNED
ALWAYS LOAD TEST YOUR CODE
‣ Automatic verification of the latest builds
‣ Collecting historical results for comparison
‣ Measuring the impact of new features and changes to the code
‣ Simulating various failures
LESSONS LEARNED
THINGS WILL FAIL
‣ Prepare for the worst
‣ It’s just a matter of time before a crash happens
‣ It’s not only our code that fails
‣ At this scale, unlikely events happen every second
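The claim that unlikely events happen every second can be checked with rough arithmetic. The figure of 1 billion events per server per day comes from the slides. The server count of 100 (the slides only say "hundreds") and the one-in-a-million probability are assumptions picked for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rare_events_per_second(events_per_server_per_day, servers, probability):
    """Expected number of rare events per second across the whole fleet."""
    total_events_per_day = events_per_server_per_day * servers
    return total_events_per_day * probability / SECONDS_PER_DAY


# 1 billion events per server per day (from the slides),
# 100 servers and a one-in-a-million event (both assumed).
rate = rare_events_per_second(1_000_000_000, 100, 1e-6)
print(round(rate, 2))  # about 1.16 occurrences per second
```

Under these assumptions a one-in-a-million condition is hit more than once per second fleet-wide, so rare code paths need the same care as common ones.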
CURRENT SITUATION
CHAT IS DOING GREAT!
The quality uptime is over 99% each month and increasing, with hundreds of servers deployed all over the world.
SCALE AND PERFORMANCE
Each server offers reliable, low latency to players, routing over 1B events a day with low resource utilization.
CHAT IS EVOLVING
Rolling out Riak worldwide, making LoL Chat available outside of the client, exploring possibilities around using social graph data, and more...
THANK YOU! 
ANY QUESTIONS?

 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm System
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 

Scaling LoL Chat to 70M Players

  • 1. SCALING LoL CHAT TO 70 MILLION PLAYERS Michal Ptaszek, @michalptaszek Riot Games
  • 2. WHAT’S PLANNED: 1. GAME; 2. CHAT; 3. TECH; 4. LESSONS LEARNED; 5. Q&A
  • 3. WHAT IS LEAGUE OF LEGENDS? 2009 LAUNCH; TEAM ORIENTED; 100+ CHAMPS; MODERN FANTASY
  • 4. CHAT: WHAT IS IT? MESSAGING SERVICE: Private player chat and group chats. PRESENCE SERVICE: Friend lists, availability and status. SOCIAL GRAPH SERVICE: Internal service for store, match history, leagues.
  • 6. CHAT BY THE NUMBERS: 67 million monthly players; 27 million daily players; 7.5 million concurrent players; 1 billion events routed per server, per day
  • 7. CHAT AT 10K FEET: STABLE, SCALABLE CHAT SERVICE: PROTOCOL, SERVER, DATA STORE
  • 8. CHAT AT 10K FEET: STABLE, SCALABLE CHAT SERVICE: PROTOCOL, SERVER, DATA STORE
  • 9. PROTOCOL: XMPP: Decentralized Architecture; Openness; Extensibility; Availability of Client Libraries; Security; Wide Adoption
  • 10. CHAT AT 10K FEET: STABLE, SCALABLE CHAT SERVICE: PROTOCOL, SERVER, DATA STORE
  • 11. SERVER: EJABBERD ‣ Open source Jabber/XMPP server ‣ Relatively nice scalability and performance with default configuration ‣ Wide adoption and active, helpful community ‣ Very good as a starting point for our own server solution ▾ We were aware that one day we would need to start customizing it ‣ Written in Erlang programming language
  • 12. TECHNOLOGY: ERLANG/OTP. Erlang is... ‣ A functional language ‣ Built with concurrency and distribution in mind ‣ Able to scale extremely well ‣ Capable of reloading code on the fly. Which gives us... ‣ A declarative style of programming ‣ An easier way to build our distributed applications ‣ More time to focus on coding ‣ Less downtime
  • 13. SERVER: EJABBERD - PHILOSOPHY. ARCHITECTURE: Share nothing approach; enables massive, near linear horizontal scalability. FAULT TOLERANCE: Implementation of self-healing properties, which bring the system to a well-known, stable state. LET IT CRASH: When something is massively broken - do not fix it!
  • 14. SERVER: EJABBERD - ARCHITECTURE [Diagram labels: External Traffic (5223); LB; Ejabberd Server (x2); Internal Traffic; Riak (x2); Secondary Riak Cluster; ETL Queries]
  • 15. SERVER: EJABBERD - IMPLEMENTATION. PHASE 1 - MAKE IT WORK ‣ Over time mostly rewritten ‣ Removed unwanted and unneeded parts ‣ Optimized certain flow paths ‣ Made it compatible with industry standards ‣ Wrote over 600 tests to cover it [Diagram: Invite and Accept exchanges between Alice and Bob]
  • 16. SERVER: EJABBERD - IMPLEMENTATION. PHASE 1 - MAKE IT WORK ‣ Over time mostly rewritten ‣ Removed unwanted and unneeded parts ‣ Optimized certain flow paths ‣ Made it compatible with industry standards ‣ Wrote over 600 tests to cover it [Diagram: Invite and Accept exchange between Alice and Bob]
  • 17. SERVER: EJABBERD - IMPLEMENTATION. PHASE 2: MAKE IT RIGHT ‣ Removed clear bottlenecks ‣ Avoided shared, mutable state ‣ “Make it work, make it right, make it fast” [Diagram labels: MUC router; user sessions; MUC rooms]
  • 18. SERVER: EJABBERD - IMPLEMENTATION. PHASE 2: MAKE IT RIGHT ‣ Removed clear bottlenecks ‣ Avoided shared, mutable state ‣ “Make it work, make it right, make it fast” [Diagram labels: user sessions; MUC rooms]
  • 19. SERVER: EJABBERD - IMPLEMENTATION. PHASE 2: MAKE IT RIGHT ‣ Removed clear bottlenecks ‣ Avoided shared, mutable state ‣ “Make it work, make it right, make it fast” Session Table: JID -> Session Handler [Diagram labels: session table; Alice; Bob; Charlie]
  • 20. SERVER: EJABBERD - IMPLEMENTATION. PHASE 3 - MAKE IT FAST ‣ Patched VM and stdlibs ‣ Sacrificed the generic nature of the Erlang/OTP framework in favor of better scalability and fault tolerance ‣ Better traceability and profiling functions ‣ More visibility into the system ‣ Improved logging for code reloading and real time system upgrades
  • 21. CHAT AT 10K FEET: STABLE, SCALABLE CHAT SERVICE: PROTOCOL, SERVER, DATA STORE
  • 22. NOSQL DATA STORE: RIAK. SCALE: Linearly scalable; No growth headaches. FAULT TOLERANCE: No SPoF; Higher uptime. SCHEMA-LESS: Faster feature iterations; More shipped features. ‣ Distributed, fault-tolerant, key-value store ‣ Masterless, fully peer-to-peer architecture ‣ AP in CAP theorem, with eventual consistency ‣ Low, predictable latency ‣ Extreme scalability ‣ Multi data center replication
  • 23. LESSONS LEARNED: UNDERSTAND YOUR SYSTEM ‣ Over 500 real-time counters, rates, histograms collected each minute ‣ Make sure to know counter values for “correct” and “abnormal” conditions ‣ Alerts and logs for long running operations ‣ Integration with Graphite, Zabbix and Nagios
  • 24. LESSONS LEARNED: IMPLEMENT FEATURE TOGGLES ‣ Safety valve for things that might cause problems ‣ Partial deployments allowing features to be enabled only for certain groups of people [Diagram labels: Alice; Bob; Charlie; group reordering feature; whitelist: Bob]
  • 25. LESSONS LEARNED: SUPPORT CODE RELOADING ‣ Patching bugs on the fly ‣ Changing server configuration ‣ Collecting data for future analysis ‣ No downtime deploys [Diagram labels: buggy code; server restart; fixed code]
  • 26. LESSONS LEARNED: GET YOUR LOGGING RIGHT ‣ Proper logging and tracing facilities ‣ Debug modes for selected users ‣ Tools for analysis of the collected data [Diagram labels: Alice; ejabberd.log; slow_db.log; trace_alice.log; roster_audit.log; muc_audit.log; Honu]
  • 27. LESSONS LEARNED: ALWAYS LOAD TEST YOUR CODE ‣ Automatic verification of the latest builds ‣ Collecting historical results for comparison ‣ Measuring the impact of new features and changes to the code ‣ Simulating various failures
  • 28. LESSONS LEARNED: THINGS WILL FAIL ‣ Prepare for the worst ‣ It’s just a matter of time before a crash happens ‣ It’s not only our code that fails ‣ Unlikely events happen every second at this scale
  • 29. CURRENT SITUATION. CHAT IS DOING GREAT! The quality uptime is over 99% each month and is increasing, with hundreds of servers deployed all over the world. SCALE AND PERFORMANCE: Each server offers reliable, low latency to the players, routing over 1B events a day with low resource utilization. CHAT IS EVOLVING: Rolling out Riak worldwide, making LoL Chat available outside of the client, exploring possibilities around using social graph data, and more...
  • 30. THANK YOU! ANY QUESTIONS?
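
The feature-toggle lesson from slide 24 (a safety valve, plus a whitelist for partial deployments) can be sketched roughly as follows. This is a minimal illustration in Python, not the ejabberd/Erlang code used at Riot, which the slides do not show. The `FeatureToggles` class, its method names, and the `group_reordering` feature key are all hypothetical.

```python
class FeatureToggles:
    """In-memory feature switches with optional per-user whitelists.

    Hypothetical sketch: names and storage are assumptions, not taken
    from the slides.
    """

    _EVERYONE = None  # sentinel meaning "enabled for all users"

    def __init__(self):
        # feature name -> set of whitelisted users, or _EVERYONE
        self._rules = {}

    def enable_for(self, feature, users):
        """Partial deployment: enable the feature only for the given users."""
        self._rules[feature] = set(users)

    def enable_for_all(self, feature):
        """Full rollout: enable the feature for every user."""
        self._rules[feature] = self._EVERYONE

    def disable(self, feature):
        """Safety valve: switch the feature off for everyone."""
        self._rules.pop(feature, None)

    def is_enabled(self, feature, user):
        """Return True if the feature is on for this user."""
        if feature not in self._rules:
            return False  # unknown or disabled features are off
        allowed = self._rules[feature]
        return allowed is self._EVERYONE or user in allowed


# Usage mirroring the slide: group reordering is whitelisted for Bob only.
toggles = FeatureToggles()
toggles.enable_for("group_reordering", ["bob"])
bob_sees_feature = toggles.is_enabled("group_reordering", "bob")
alice_sees_feature = toggles.is_enabled("group_reordering", "alice")
```

Treating a missing entry as "off" means a single `disable` call acts as the safety valve, regardless of how far the rollout had progressed.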