Akka Cluster and
Auto-scaling
Ikuo Matsumura
CyberAgent, Inc.
2017/02/26
Akka Cluster
An Akka extension for building a decentralized group of nodes.
This talk covers the pitfalls we hit and the lessons we learned over about 10 months of operation.
• Decentralized cluster membership service
  • no single point of failure or bottleneck
  • distributes actors over multiple JVMs
• Applied to build a sub-system of ad serving
  • Tens of servers
  • About 10 months in operation
Requirements in our case
• Host a large number of entities at low cost
• Fit into the existing Akka application
• Some downtime is acceptable*
We want to deploy many entities at low cost on top of Akka.
Some downtime can be tolerated.
* online machine learning
Our application of Akka Cluster
Persistent actors are deployed with Cluster Sharding.
Commodity services are used for data storage.
[Diagram: frontend nodes, entities nodes, and data stores]
• frontend: existing app, tens of nodes, auto-scaling
• entities: new sub-system, several nodes
• data stores: ElastiCache (Journal), S3 (Snapshot)
Challenges
• Strategy for removing unreachable members
• Lifecycle of journals
I will talk about two challenges we stumbled on during operation.
Membership Lifecycle in Cluster Specification
Overview of the cluster member lifecycle
http://doc.akka.io/docs/akka/2.4/common/cluster.html#Membership_Lifecycle
[Diagram: member states joining, up, leaving, exiting, down, removed, plus unreachable; user actions join and leave]
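The lifecycle in the linked specification can be read as a small state machine. The following is a minimal sketch in plain Python, not Akka code: the names `Status`, `TRANSITIONS` and `step` are hypothetical, and the trigger labels ("leader" for a leader action, "leave" for a user action, "down" for a mark-as-down) are assumptions chosen for illustration. Unreachability is left out here because it is an observation about a member, not one of its statuses.

```python
from enum import Enum


class Status(Enum):
    JOINING = "joining"
    UP = "up"
    LEAVING = "leaving"
    EXITING = "exiting"
    DOWN = "down"
    REMOVED = "removed"


# (current status, trigger) -> next status.
# "leader" is a leader action, "leave" is a user action,
# "down" is a manual or automated mark-as-down.
TRANSITIONS = {
    (Status.JOINING, "leader"): Status.UP,
    (Status.UP, "leave"): Status.LEAVING,
    (Status.LEAVING, "leader"): Status.EXITING,
    (Status.EXITING, "leader"): Status.REMOVED,
    (Status.DOWN, "leader"): Status.REMOVED,
}
# Any member that is not yet down or removed can be marked as down.
for status in (Status.JOINING, Status.UP, Status.LEAVING, Status.EXITING):
    TRANSITIONS[(status, "down")] = Status.DOWN


def step(status, trigger):
    """Return the next status, or raise ValueError for an illegal move."""
    try:
        return TRANSITIONS[(status, trigger)]
    except KeyError:
        raise ValueError(
            f"{trigger!r} is not allowed in status {status.value}"
        ) from None
```

For example, `step(Status.JOINING, "leader")` yields `Status.UP`, and a member marked as down needs one more leader action before it is removed. Note how many of the transitions depend on the leader being able to act.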
Joining and Leader Action
[Diagram, built up over several slides: member states joining, up, down, plus unreachable; arrows for join, leader action, failure detector, and a scale-in trigger that marks a member as down]
• A joining member becomes able to communicate with the other members only after a "leader action".
• When a scale-in occurs, the terminated member stays unreachable.
• Leader actions can then no longer be performed. As a result, new members remain unable to communicate with the other members.
• Marking the member as down, triggered by the scale-in, is therefore required.
• Once the unreachable member is removed, leader actions can resume.
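The sequence above can be reproduced with a small simulation. This is a simplified model in plain Python, not Akka code: `Member`, `leader_can_act` and `run_leader_actions` are hypothetical names, and the blocking rule is reduced to "an unreachable member blocks leader actions unless it is already down or exiting", which is an assumption based on the convergence rule in the Akka cluster specification.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    status: str            # "joining", "up", "down", ...
    reachable: bool = True


def leader_can_act(members):
    """An unreachable member blocks leader actions unless it is
    already marked as down (or is exiting)."""
    return all(m.reachable or m.status in ("down", "exiting") for m in members)


def run_leader_actions(members):
    """Apply leader actions once: drop down members, move joining to up."""
    if not leader_can_act(members):
        return members      # blocked: nothing changes
    survivors = [m for m in members if m.status != "down"]
    for m in survivors:
        if m.status == "joining":
            m.status = "up"
    return survivors


cluster = [Member("a", "up"), Member("b", "up"), Member("new", "joining")]
cluster[1].reachable = False    # "b" was terminated by a scale-in

# Blocked: "new" stays in joining while "b" is unreachable but still up.
status_while_blocked = [m.status for m in run_leader_actions(cluster)]

cluster[1].status = "down"      # the scale-in hook marks "b" as down
resumed = [(m.name, m.status) for m in run_leader_actions(cluster)]
```

In this model `status_while_blocked` is `["up", "up", "joining"]`, and after the mark-as-down `resumed` is `[("a", "up"), ("new", "up")]`: the downed member is gone and the new member has been promoted.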
Leader actions blocked by unreachables
Example log output while leader actions cannot be performed:
Members that are “up” but have not seen the current state
“Leader can currently not perform its duties”
Causes and actions on unreachables
Type of failure | Example | Possible external action
network partitions | - | wait for recovery or abandon a part
machine crashes | scale-in | mark as down
quarantined in Akka remote layer | | restart an actor system
unresponsive process | long GC | restart a JVM
unresponsive process | CPU starvation due to credit shortage in EC2 | re-create an instance
The failure detector cannot distinguish the cause of a failure.
Depending on the cause, recovery has to be driven from outside the cluster.
Split Brain Resolver* (commercial add-on)
• Marks members as “down” when a part of the cluster becomes unreachable for some time
• Strategies
  • Static Quorum
  • Keep Majority - default in Lagom
  • Keep Oldest
  • Keep Referee
* http://doc.akka.io/docs/akka/rp-current/scala/split-brain-resolver.html
The commercial add-on makes fairly comprehensive automatic downing possible.
It kicks in when member status and reachability have not changed for a certain period.
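To make two of the listed strategies concrete, here is a minimal sketch of the decision each side of a partition takes from its own point of view. It is a plain Python model, not the add-on's API: the function names, the string results "down-unreachable" and "down-self", and the use of lexicographic string comparison for the "lowest address" tie-break are all assumptions for illustration.

```python
def static_quorum(reachable, quorum_size):
    """Keep this side only if it still has at least quorum_size members."""
    return "down-unreachable" if len(reachable) >= quorum_size else "down-self"


def keep_majority(reachable, unreachable):
    """Keep the larger side; on a tie keep the side with the lowest address."""
    if len(reachable) != len(unreachable):
        larger = len(reachable) > len(unreachable)
        return "down-unreachable" if larger else "down-self"
    lowest = min(reachable | unreachable)
    return "down-unreachable" if lowest in reachable else "down-self"


# A 5-node cluster split 3 / 2. Each side runs the same rule independently.
side_a = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}
side_b = {"10.0.0.4", "10.0.0.5"}

decision_a = keep_majority(side_a, side_b)   # side_a sees side_b as unreachable
decision_b = keep_majority(side_b, side_a)   # side_b sees side_a as unreachable
```

The point of the design is that both sides reach complementary decisions without talking to each other: here the side with three nodes downs the other two, and the side with two nodes downs itself.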
Reset cluster membership (poor man’s)
Have the surviving nodes rejoin a new cluster.
[Diagram, built up over three slides: seed(s); old cluster (ddata); new cluster (ddata)]
Caution
• Side effects caused by app restart
• ddata is experimental (as of Akka 2.4)
• Use Akka 2.4.8 or higher*
Watch out for side effects of restarts and for the Akka version.
* includes a fix for distributed pub-sub (akka#20847)
To keep cluster membership healthy
1. Trigger mark-as-down (or leave) on scale-in
2. Automate restart/re-creation of the ActorSystem, JVM, or server instance
3. Set up a fallback mechanism such as a split brain resolver, or rejoining a new cluster
Summary of the countermeasures against unreachable members
Challenges
• Strategy for removing unreachable members
• Lifecycle of journals
Next, I will talk about the second challenge.
Journal
[Diagram: entities nodes, ElastiCache (Journal), S3 (Snapshot)]
The journal is the API that corresponds to the event store in Event Sourcing.
We planned to operate the journal like a cache.
Cleanup old journals in Redis plug-in*
In some cases the journal's deleteMessages leaves part of the data behind.
Take care to keep sequenceNr consistent with the snapshots.
key in Redis | removed on deleteMessages
journal:$persistenceId | Yes
journal:$persistenceId.highestSequenceNr | No
* https://github.com/hootsuite/akka-persistence-redis/blob/master/src/main/scala/com/hootsuite/akka/persistence/redis/journal/RedisJournal.scala
Deleting highestSequenceNr could cause an old version of a snapshot to be loaded.
→ Keep only the latest snapshot.
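The interaction between the sequence-number key and snapshots can be illustrated with a small model. This is a sketch in plain Python, with a dict standing in for Redis and a list standing in for the snapshot store; it is not the plug-in's code. The helper names are hypothetical, deletion is simplified to "delete all events", and it assumes that the snapshot chosen on recovery is the one with the highest sequenceNr.

```python
def highest_sequence_nr(store, pid):
    # With no counter key the journal reports 0, so numbering restarts at 1.
    return store.get(f"journal:{pid}.highestSequenceNr", 0)


def persist(store, pid, event):
    seq = highest_sequence_nr(store, pid) + 1
    store.setdefault(f"journal:{pid}", {})[seq] = event
    store[f"journal:{pid}.highestSequenceNr"] = seq
    return seq


def delete_messages(store, pid, also_delete_counter=False):
    # Simplification: removes every event instead of a range.
    store.pop(f"journal:{pid}", None)
    if also_delete_counter:
        store.pop(f"journal:{pid}.highestSequenceNr", None)


def latest_snapshot(snapshots):
    """Pick the snapshot with the highest sequenceNr (assumed selection rule)."""
    return max(snapshots, key=lambda s: s["sequenceNr"])


store, snapshots = {}, []
for event in ["e1", "e2", "e3"]:
    seq = persist(store, "entity-1", event)
snapshots.append({"sequenceNr": seq, "state": "state after e3"})

# Clean up the journal AND the counter key, then keep working.
delete_messages(store, "entity-1", also_delete_counter=True)
seq = persist(store, "entity-1", "e4")          # numbering restarts
snapshots.append({"sequenceNr": seq, "state": "state after e4"})

stale = latest_snapshot(snapshots)              # the older snapshot wins

# Mitigation from the slide: keep only the latest snapshot.
snapshots = snapshots[-1:]
fresh = latest_snapshot(snapshots)
```

In this model the snapshot taken after `e4` carries a lower sequenceNr than the one taken after `e3`, so `stale` is the outdated state. Once older snapshots are discarded, `fresh` is the current one. When the counter key is left in place, as the table describes, numbering simply continues and the problem does not arise.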
Event Sourcing and Ecosystem
“it stores a complete history of the events
associated with the aggregates in your domain”
Reference 3: Introducing Event Sourcing, CQRS Journey[CQJ]
An event store is originally assumed to hold the complete history of events.
Deviating from that assumption also weakens the support you get from the ecosystem (plug-ins).
Summary
• Lessons learned from operating an Akka Cluster app
• Strategy for removing unreachable members
  • scale-in trigger
  • automatic restart/re-creation
  • fallback mechanism: split brain resolver / rejoining
• Lifecycle of journals
  • cost of deviation from Event Sourcing
Put several mechanisms in place for removing unreachable members.
Operating the journal like a cache can be surprisingly hard.
Reference
[CQJ] Exploring CQRS and Event Sourcing, Dominic Betts, Julian Dominguez, Grigori Melnik, Fernando Simonazzi, Mani Subramanian, 2012, https://msdn.microsoft.com/en-us/library/jj554200.aspx
[PSE] Persistence - Schema Evolution, Akka Documentation, http://doc.akka.io/docs/akka/2.4/scala/persistence-schema-evolution.html

More Related Content

What's hot

Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesUnderstanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesLightbend
 
Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)
Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)
Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)petabridge
 
Developing distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka ClusterDeveloping distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka ClusterKonstantin Tsykulenko
 
Putting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka MicroservicesPutting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka MicroservicesLightbend
 
The do's and don'ts with java 9 (Devoxx 2017)
The do's and don'ts with java 9 (Devoxx 2017)The do's and don'ts with java 9 (Devoxx 2017)
The do's and don'ts with java 9 (Devoxx 2017)Robert Scholte
 
What's New on AWS and What it Means to You
What's New on AWS and What it Means to YouWhat's New on AWS and What it Means to You
What's New on AWS and What it Means to YouAmazon Web Services
 
SolrCloud Cluster management via APIs
SolrCloud Cluster management via APIsSolrCloud Cluster management via APIs
SolrCloud Cluster management via APIsAnshum Gupta
 
Akka Cluster in Production
Akka Cluster in ProductionAkka Cluster in Production
Akka Cluster in Productionbilyushonak
 
CloudStack, jclouds, Jenkins and CloudCat
CloudStack, jclouds, Jenkins and CloudCatCloudStack, jclouds, Jenkins and CloudCat
CloudStack, jclouds, Jenkins and CloudCatAndrew Bayer
 
CloudStack, jclouds and Whirr!
CloudStack, jclouds and Whirr!CloudStack, jclouds and Whirr!
CloudStack, jclouds and Whirr!Andrew Bayer
 
Solr security frameworks
Solr security frameworksSolr security frameworks
Solr security frameworksAnshum Gupta
 
Scala play-framework
Scala play-frameworkScala play-framework
Scala play-frameworkAbdhesh Kumar
 
MySQL High Availability -- InnoDB Clusters
MySQL High Availability -- InnoDB ClustersMySQL High Availability -- InnoDB Clusters
MySQL High Availability -- InnoDB ClustersMatt Lord
 
Apache jclouds SF Meetup, July 8, 2013
Apache jclouds SF Meetup, July 8, 2013Apache jclouds SF Meetup, July 8, 2013
Apache jclouds SF Meetup, July 8, 2013Andrew Bayer
 
Akka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformAkka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformLegacy Typesafe (now Lightbend)
 
MySQL Group Replication - an Overview
MySQL Group Replication - an OverviewMySQL Group Replication - an Overview
MySQL Group Replication - an OverviewMatt Lord
 
MySQL Operator for Kubernetes
MySQL Operator for KubernetesMySQL Operator for Kubernetes
MySQL Operator for KubernetesKenny Gryp
 

What's hot (20)

Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesUnderstanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
 
獨體模式
獨體模式 獨體模式
獨體模式
 
Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)
Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)
Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)
 
Developing distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka ClusterDeveloping distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka Cluster
 
Putting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka MicroservicesPutting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka Microservices
 
The do's and don'ts with java 9 (Devoxx 2017)
The do's and don'ts with java 9 (Devoxx 2017)The do's and don'ts with java 9 (Devoxx 2017)
The do's and don'ts with java 9 (Devoxx 2017)
 
What's New on AWS and What it Means to You
What's New on AWS and What it Means to YouWhat's New on AWS and What it Means to You
What's New on AWS and What it Means to You
 
SolrCloud Cluster management via APIs
SolrCloud Cluster management via APIsSolrCloud Cluster management via APIs
SolrCloud Cluster management via APIs
 
Testing at Stream-Scale
Testing at Stream-ScaleTesting at Stream-Scale
Testing at Stream-Scale
 
Akka Cluster in Production
Akka Cluster in ProductionAkka Cluster in Production
Akka Cluster in Production
 
CloudStack, jclouds, Jenkins and CloudCat
CloudStack, jclouds, Jenkins and CloudCatCloudStack, jclouds, Jenkins and CloudCat
CloudStack, jclouds, Jenkins and CloudCat
 
CloudStack, jclouds and Whirr!
CloudStack, jclouds and Whirr!CloudStack, jclouds and Whirr!
CloudStack, jclouds and Whirr!
 
Real World Java 9
Real World Java 9Real World Java 9
Real World Java 9
 
Solr security frameworks
Solr security frameworksSolr security frameworks
Solr security frameworks
 
Scala play-framework
Scala play-frameworkScala play-framework
Scala play-framework
 
MySQL High Availability -- InnoDB Clusters
MySQL High Availability -- InnoDB ClustersMySQL High Availability -- InnoDB Clusters
MySQL High Availability -- InnoDB Clusters
 
Apache jclouds SF Meetup, July 8, 2013
Apache jclouds SF Meetup, July 8, 2013Apache jclouds SF Meetup, July 8, 2013
Apache jclouds SF Meetup, July 8, 2013
 
Akka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformAkka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive Platform
 
MySQL Group Replication - an Overview
MySQL Group Replication - an OverviewMySQL Group Replication - an Overview
MySQL Group Replication - an Overview
 
MySQL Operator for Kubernetes
MySQL Operator for KubernetesMySQL Operator for Kubernetes
MySQL Operator for Kubernetes
 

Viewers also liked

Preparing for distributed system failures using akka #ScalaMatsuri
Preparing for distributed system failures using akka #ScalaMatsuriPreparing for distributed system failures using akka #ScalaMatsuri
Preparing for distributed system failures using akka #ScalaMatsuriTIS Inc.
 
Developing an Akka Edge1-3
Developing an Akka Edge1-3Developing an Akka Edge1-3
Developing an Akka Edge1-3saaaaaaki
 
Akka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldAkka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldKonrad Malawski
 
Deadly Code! (seriously) Blocking & Hyper Context Switching Pattern
Deadly Code! (seriously) Blocking & Hyper Context Switching PatternDeadly Code! (seriously) Blocking & Hyper Context Switching Pattern
Deadly Code! (seriously) Blocking & Hyper Context Switching Patternchibochibo
 
Make your programs Free
Make your programs FreeMake your programs Free
Make your programs FreePawel Szulc
 
Scala Warrior and type-safe front-end development with Scala.js
Scala Warrior and type-safe front-end development with Scala.jsScala Warrior and type-safe front-end development with Scala.js
Scala Warrior and type-safe front-end development with Scala.jstakezoe
 
Van laarhoven lens
Van laarhoven lensVan laarhoven lens
Van laarhoven lensNaoki Aoyama
 
Going bananas with recursion schemes for fixed point data types
Going bananas with recursion schemes for fixed point data typesGoing bananas with recursion schemes for fixed point data types
Going bananas with recursion schemes for fixed point data typesPawel Szulc
 
Reducing Boilerplate and Combining Effects: A Monad Transformer Example
Reducing Boilerplate and Combining Effects: A Monad Transformer ExampleReducing Boilerplate and Combining Effects: A Monad Transformer Example
Reducing Boilerplate and Combining Effects: A Monad Transformer ExampleConnie Chen
 
The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)
The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)
The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)Eugene Yokota
 
7 key recipes for data engineering
7 key recipes for data engineering7 key recipes for data engineering
7 key recipes for data engineeringunivalence
 
Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~
Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~
Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~Masahito Zembutsu
 
ScalaにまつわるNewsな話
ScalaにまつわるNewsな話ScalaにまつわるNewsな話
ScalaにまつわるNewsな話Yosuke Mizutani
 
Big Data Analytics Tokyo
Big Data Analytics TokyoBig Data Analytics Tokyo
Big Data Analytics TokyoAdam Gibson
 
Scala が支える医療系ウェブサービス #jissenscala
Scala が支える医療系ウェブサービス #jissenscalaScala が支える医療系ウェブサービス #jissenscala
Scala が支える医療系ウェブサービス #jissenscalaKazuhiro Sera
 
Arquitectura barroca
Arquitectura barrocaArquitectura barroca
Arquitectura barrocaMaria Carmona
 
Sbtのマルチプロジェクトはいいぞ
SbtのマルチプロジェクトはいいぞSbtのマルチプロジェクトはいいぞ
SbtのマルチプロジェクトはいいぞYoshitaka Fujii
 

Viewers also liked (20)

Preparing for distributed system failures using akka #ScalaMatsuri
Preparing for distributed system failures using akka #ScalaMatsuriPreparing for distributed system failures using akka #ScalaMatsuri
Preparing for distributed system failures using akka #ScalaMatsuri
 
Scala Matsuri 2017
Scala Matsuri 2017Scala Matsuri 2017
Scala Matsuri 2017
 
Developing an Akka Edge1-3
Developing an Akka Edge1-3Developing an Akka Edge1-3
Developing an Akka Edge1-3
 
Akka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldAkka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming World
 
Deadly Code! (seriously) Blocking & Hyper Context Switching Pattern
Deadly Code! (seriously) Blocking & Hyper Context Switching PatternDeadly Code! (seriously) Blocking & Hyper Context Switching Pattern
Deadly Code! (seriously) Blocking & Hyper Context Switching Pattern
 
Make your programs Free
Make your programs FreeMake your programs Free
Make your programs Free
 
Scala Warrior and type-safe front-end development with Scala.js
Scala Warrior and type-safe front-end development with Scala.jsScala Warrior and type-safe front-end development with Scala.js
Scala Warrior and type-safe front-end development with Scala.js
 
Van laarhoven lens
Van laarhoven lensVan laarhoven lens
Van laarhoven lens
 
Pratical eff
Pratical effPratical eff
Pratical eff
 
Going bananas with recursion schemes for fixed point data types
Going bananas with recursion schemes for fixed point data typesGoing bananas with recursion schemes for fixed point data types
Going bananas with recursion schemes for fixed point data types
 
Reducing Boilerplate and Combining Effects: A Monad Transformer Example
Reducing Boilerplate and Combining Effects: A Monad Transformer ExampleReducing Boilerplate and Combining Effects: A Monad Transformer Example
Reducing Boilerplate and Combining Effects: A Monad Transformer Example
 
The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)
The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)
The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)
 
7 key recipes for data engineering
7 key recipes for data engineering7 key recipes for data engineering
7 key recipes for data engineering
 
Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~
Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~
Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~
 
ScalaにまつわるNewsな話
ScalaにまつわるNewsな話ScalaにまつわるNewsな話
ScalaにまつわるNewsな話
 
Big Data Analytics Tokyo
Big Data Analytics TokyoBig Data Analytics Tokyo
Big Data Analytics Tokyo
 
Scala が支える医療系ウェブサービス #jissenscala
Scala が支える医療系ウェブサービス #jissenscalaScala が支える医療系ウェブサービス #jissenscala
Scala が支える医療系ウェブサービス #jissenscala
 
Arquitectura barroca
Arquitectura barrocaArquitectura barroca
Arquitectura barroca
 
究極のPHP本完成
究極のPHP本完成究極のPHP本完成
究極のPHP本完成
 
Sbtのマルチプロジェクトはいいぞ
SbtのマルチプロジェクトはいいぞSbtのマルチプロジェクトはいいぞ
Sbtのマルチプロジェクトはいいぞ
 

Similar to Akka Cluster Auto-scaling Lessons Learned

Cassandra Bootstap from Backups
Cassandra Bootstap from BackupsCassandra Bootstap from Backups
Cassandra Bootstap from BackupsInstaclustr
 
Cassandra Bootstrap from Backups
Cassandra Bootstrap from BackupsCassandra Bootstrap from Backups
Cassandra Bootstrap from BackupsInstaclustr
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systexJames Chen
 
Netflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowNetflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowAdrian Cockcroft
 
Accelerate Your OpenStack Deployment Presented by SolidFire and Red Hat
Accelerate Your OpenStack Deployment Presented by SolidFire and Red HatAccelerate Your OpenStack Deployment Presented by SolidFire and Red Hat
Accelerate Your OpenStack Deployment Presented by SolidFire and Red HatNetApp
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceLucidworks (Archived)
 
Coherence RoadMap 2018
Coherence RoadMap 2018Coherence RoadMap 2018
Coherence RoadMap 2018harvraja
 
Hazelcast Essentials
Hazelcast EssentialsHazelcast Essentials
Hazelcast EssentialsRahul Gupta
 
Openstack study-nova-02
Openstack study-nova-02Openstack study-nova-02
Openstack study-nova-02Jinho Shin
 
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster RecoveryStop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster RecoveryDoKC
 
AtlasCamp 2012 - Testing JIRA plugins smarter with TestKit
AtlasCamp 2012 - Testing JIRA plugins smarter with TestKitAtlasCamp 2012 - Testing JIRA plugins smarter with TestKit
AtlasCamp 2012 - Testing JIRA plugins smarter with TestKitWojciech Seliga
 
MySQL Database Architectures - 2020-10
MySQL Database Architectures -  2020-10MySQL Database Architectures -  2020-10
MySQL Database Architectures - 2020-10Kenny Gryp
 
MySQL InnoDB Cluster - New Features in 8.0 Releases - Best Practices
MySQL InnoDB Cluster - New Features in 8.0 Releases - Best PracticesMySQL InnoDB Cluster - New Features in 8.0 Releases - Best Practices
MySQL InnoDB Cluster - New Features in 8.0 Releases - Best PracticesKenny Gryp
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr PerformanceLucidworks
 
Scylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
Scylla Summit 2022: What’s New in ScyllaDB Operator for KubernetesScylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
Scylla Summit 2022: What’s New in ScyllaDB Operator for KubernetesScyllaDB
 
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...Paul Brebner
 
Running your Java EE 6 applications in the clouds
Running your Java EE 6 applications in the clouds Running your Java EE 6 applications in the clouds
Running your Java EE 6 applications in the clouds Arun Gupta
 
DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...
DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...
DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...DataStax
 
Effective Testing in DSE
Effective Testing in DSEEffective Testing in DSE
Effective Testing in DSEpedjak
 

Similar to Akka Cluster Auto-scaling Lessons Learned (20)

Cassandra Bootstap from Backups
Cassandra Bootstap from BackupsCassandra Bootstap from Backups
Cassandra Bootstap from Backups
 
Cassandra Bootstrap from Backups
Cassandra Bootstrap from BackupsCassandra Bootstrap from Backups
Cassandra Bootstrap from Backups
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
 
Netflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowNetflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search Roadshow
 
Accelerate Your OpenStack Deployment Presented by SolidFire and Red Hat
Accelerate Your OpenStack Deployment Presented by SolidFire and Red HatAccelerate Your OpenStack Deployment Presented by SolidFire and Red Hat
Accelerate Your OpenStack Deployment Presented by SolidFire and Red Hat
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Coherence RoadMap 2018
Coherence RoadMap 2018Coherence RoadMap 2018
Coherence RoadMap 2018
 
Hazelcast Essentials
Hazelcast EssentialsHazelcast Essentials
Hazelcast Essentials
 
AppFabric Velocity
AppFabric VelocityAppFabric Velocity
AppFabric Velocity
 
Openstack study-nova-02
Openstack study-nova-02Openstack study-nova-02
Openstack study-nova-02
 
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster RecoveryStop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
 
AtlasCamp 2012 - Testing JIRA plugins smarter with TestKit
AtlasCamp 2012 - Testing JIRA plugins smarter with TestKitAtlasCamp 2012 - Testing JIRA plugins smarter with TestKit
AtlasCamp 2012 - Testing JIRA plugins smarter with TestKit
 
MySQL Database Architectures - 2020-10
MySQL Database Architectures -  2020-10MySQL Database Architectures -  2020-10
MySQL Database Architectures - 2020-10
 
MySQL InnoDB Cluster - New Features in 8.0 Releases - Best Practices
MySQL InnoDB Cluster - New Features in 8.0 Releases - Best PracticesMySQL InnoDB Cluster - New Features in 8.0 Releases - Best Practices
MySQL InnoDB Cluster - New Features in 8.0 Releases - Best Practices
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr Performance
 
Scylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
Scylla Summit 2022: What’s New in ScyllaDB Operator for KubernetesScylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
Scylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
 
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
 
Running your Java EE 6 applications in the clouds
Running your Java EE 6 applications in the clouds Running your Java EE 6 applications in the clouds
Running your Java EE 6 applications in the clouds
 
DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...
DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...
DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...
 
Effective Testing in DSE
Effective Testing in DSEEffective Testing in DSE
Effective Testing in DSE
 

Recently uploaded

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Akka Cluster Auto-scaling Lessons Learned

  • 1. Akka Cluster and Auto-scaling Ikuo Matsumura CyberAgent, Inc. 2017/02/26
  • 2. Akka Cluster: an Akka extension for building a decentralized group of nodes. This talk covers the pitfalls we hit and the lessons we learned over about 10 months of operation.
      • Decentralized cluster membership service
      • No single point of failure or bottleneck
      • Distributes actors over multiple JVMs
      • Applied to build a sub-system of an ad-serving system
      • Tens of servers
      • In operation for about 10 months
  • 3. Requirements in our case: we want to host many entities on Akka at low cost, and some downtime is acceptable.
      • Host a lot of entities at low cost
      • Fit the existing Akka application
      • Downtime is acceptable to some extent*
      * online machine learning
  • 4. Our application of Akka Cluster: persistent actors are deployed with Cluster Sharding, and commodity services are used for data storage.
      [Diagram: frontend nodes and entity nodes, backed by data stores]
      • frontend: existing app, tens of nodes, auto-scaling
      • entities: new sub-system, several nodes
      • data stores: ElastiCache (journal), S3 (snapshot)
  • 5. Challenges: the two problems we stumbled over in operation.
      • Strategy on unreachables removal
      • Lifecycle of journals
  • 6. Membership Lifecycle in Cluster Specification: an overview of the cluster member lifecycle. http://doc.akka.io/docs/akka/2.4/common/cluster.html#Membership_Lifecycle
      [Diagram: states joining, up, leaving, exiting, removed, down, and unreachable; transitions join and leave]
  • 7-8. Joining and Leader Action: a member can communicate with other members only after a “leader action”.
      [Diagram: states joining, up, down, and unreachable; transitions join and leader action]
  • 9. Joining and Leader Action: when a scale-in occurs, the removed member stays unreachable.
      [Diagram: as above, with the failure detector marking a member unreachable]
  • 10. Joining and Leader Action: leader actions can no longer be performed. As a result, new members remain unable to communicate with the other members.
  • 11. Joining and Leader Action: a scale-in must trigger marking the member as down.
      [Diagram: as above, with a scale-in trigger leading to mark as down]
  • 12. Joining and Leader Action: removing the unreachable members lets leader actions resume.
  • 13. Leader actions blocked by unreachables: example log output when leader actions cannot be performed.
      • Members that are “up” but have not seen the current state
      • “Leader can currently not perform its duties”
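The blocking behaviour described in slides 7-13 can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Akka code: the `Member` class and both functions are hypothetical names introduced here, and the rule is simplified to "the leader can act only if every unreachable member has already been marked as down".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    address: str
    status: str            # "joining", "up", or "down" in this simplified model
    reachable: bool = True


def leader_can_act(members):
    """Simplified convergence rule: an unreachable member blocks the leader
    unless it has already been marked as down."""
    return all(m.reachable or m.status == "down" for m in members)


def run_leader_actions(members):
    """Promote joining members to up and drop downed members,
    but only when the leader is not blocked."""
    if not leader_can_act(members):
        return list(members)  # blocked: joining members stay joining
    return [
        Member(m.address, "up", m.reachable) if m.status == "joining" else m
        for m in members
        if m.status != "down"  # down -> removed
    ]


# Node "b" was terminated by a scale-in and is only flagged unreachable.
blocked = run_leader_actions(
    [Member("a", "up"), Member("b", "up", reachable=False), Member("c", "joining")]
)
# After "b" is marked as down, "c" can be promoted.
resumed = run_leader_actions(
    [Member("a", "up"), Member("b", "down", reachable=False), Member("c", "joining")]
)
```

In the first call the new node "c" stays in joining; in the second it moves to up and "b" is removed.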
  • 14. Causes and actions on unreachables: the failure detector cannot distinguish the cause of an error, so recovery has to be driven from outside the cluster according to the cause.
      Type of failure                     Example                                     Possible external action
      network partitions                  -                                           wait for recovery or abandon a part
      machine crashes                     scale-in                                    mark as down
      quarantined in Akka remote layer                                                restart an actor system
      unresponsive process                long GC                                     restart a JVM
                                          CPU starvation by credit shortage in EC2    re-create an instance
  • 15. Split Brain Resolver* (commercial add-on): enables fairly comprehensive automatic down-marking. It acts when member state and reachability have not changed for a certain time.
      • Marks members as “down” when a part of the cluster becomes unreachable for some time
      • Strategies:
          • Static Quorum
          • Keep Majority (default in Lagom)
          • Keep Oldest
          • Keep Referee
      * http://doc.akka.io/docs/akka/rp-current/scala/split-brain-resolver.html
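To make two of the strategies concrete, here is a minimal sketch of the decision each side of a partition would take under Static Quorum and Keep Majority. It is plain Python and not the add-on's API: the function names and the string addresses are assumptions for illustration, and the tie-break (keep the side that contains the lowest address) follows my reading of the Split Brain Resolver documentation.

```python
def static_quorum(reachable, unreachable, quorum_size):
    """Return the members this side should down.
    Keep this side only if it still has at least quorum_size members."""
    if len(reachable) >= quorum_size:
        return set(unreachable)
    return set(reachable)  # this side gives up and downs itself


def keep_majority(reachable, unreachable):
    """Return the members this side should down.
    The larger side survives; on a tie, the side with the lowest address."""
    reachable, unreachable = set(reachable), set(unreachable)
    if not unreachable:
        return set()
    if len(reachable) != len(unreachable):
        return unreachable if len(reachable) > len(unreachable) else reachable
    lowest = min(reachable | unreachable)
    return unreachable if lowest in reachable else reachable


# Both sides of a 3/2 partition reach the same verdict: {"d", "e"} goes down.
majority_view = keep_majority({"a", "b", "c"}, {"d", "e"})
minority_view = keep_majority({"d", "e"}, {"a", "b", "c"})
```

The point of the design is that each side decides independently, from its own reachability view, and the two decisions still agree on which members are downed.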
  • 16-18. Reset cluster membership (poor man’s): have the surviving nodes rejoin a new cluster.
      [Diagram: seed(s), old cluster (ddata), new cluster (ddata)]
  • 19. Caution: watch out for side effects of restarts and for the Akka version.
      • Side effects caused by app restart
      • ddata is experimental (as of Akka 2.4)
      • Use Akka 2.4.8 or higher*
      * has a fix for distributed pub-sub, akka#20847
  • 20. To keep cluster membership healthy (summary of countermeasures against unreachables):
      1. Trigger mark-as-down (or leave) on scale-in
      2. Automate restart/re-creation of the ActorSystem, JVM, and server instance
      3. Set up a fallback mechanism such as a split brain resolver, or rejoining into a new cluster
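Step 1 can be sketched as a small handler that maps a scale-in notification to the cluster members to mark as down. This is an illustrative outline in plain Python under stated assumptions: the address format `akka.tcp://system@host:port`, the function names, and the `mark_down` callback are all hypothetical. In a real deployment the callback would delegate to the cluster itself, for example via `Cluster(system).down(address)` on a surviving node.

```python
def host_of(member_address):
    """Extract the host from an address like 'akka.tcp://app@10.0.0.2:2551'."""
    host_and_port = member_address.split("@", 1)[1]
    return host_and_port.rsplit(":", 1)[0]


def on_scale_in(terminated_host, member_addresses, mark_down):
    """Mark every member that ran on the terminated host as down.

    mark_down is a caller-supplied callable (hypothetical here) that
    performs the actual down operation against the cluster.
    """
    targets = [a for a in member_addresses if host_of(a) == terminated_host]
    for address in targets:
        mark_down(address)
    return targets


members = [
    "akka.tcp://app@10.0.0.1:2551",
    "akka.tcp://app@10.0.0.2:2551",
    "akka.tcp://app@10.0.0.2:2552",
]
downed = []
result = on_scale_in("10.0.0.2", members, downed.append)
```

Here `downed` ends up holding the two members that were running on 10.0.0.2, while the member on 10.0.0.1 is left alone.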
  • 21. Challenges: next, the second challenge.
      • Strategy on unreachables removal
      • Lifecycle of journals
  • 23. Cleanup old journals in Redis plug-in*: the journal's deleteMessages can leave some data behind, so take care to keep sequenceNr consistent with snapshots.
      Key in Redis                                 Removed on deleteMessages
      journal:$persistenceId                       Yes
      journal:$persistenceId.highestSequenceNr     No
      Deleting highestSequenceNr could cause an old version of a snapshot to be loaded. → Keep only the latest snapshot.
      * https://github.com/hootsuite/akka-persistence-redis/blob/master/src/main/scala/com/hootsuite/akka/persistence/redis/journal/RedisJournal.scala
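One way to see why deleting highestSequenceNr is risky is to simulate it. The sketch below is plain Python, not the Redis plug-in: the `Journal` class, the `also_delete_highest` flag, and the snapshot helpers are hypothetical, and it assumes that recovery picks the snapshot with the highest sequenceNr.

```python
class Journal:
    def __init__(self):
        self.events = []               # stands in for journal:$persistenceId
        self.highest_sequence_nr = 0   # stands in for ...highestSequenceNr

    def persist(self, event):
        self.highest_sequence_nr += 1
        self.events.append((self.highest_sequence_nr, event))
        return self.highest_sequence_nr

    def delete_messages(self, to_sequence_nr, also_delete_highest=False):
        self.events = [(n, e) for n, e in self.events if n > to_sequence_nr]
        if also_delete_highest:
            self.highest_sequence_nr = 0  # counter restarts from scratch


def save_snapshot(snapshots, sequence_nr, state, keep_only_latest=False):
    new = {"sequenceNr": sequence_nr, "state": state}
    return [new] if keep_only_latest else snapshots + [new]


def latest_snapshot(snapshots):
    """Assumed recovery rule: the snapshot with the highest sequenceNr wins."""
    return max(snapshots, key=lambda s: s["sequenceNr"], default=None)


journal = Journal()
for i in range(100):
    journal.persist(i)
snapshots = save_snapshot([], journal.highest_sequence_nr, "old")

# Cleanup that also removes the highest sequence number.
journal.delete_messages(100, also_delete_highest=True)
restarted_nr = journal.persist("x")   # numbering starts again at 1

stale = save_snapshot(snapshots, restarted_nr, "new")
fresh = save_snapshot(snapshots, restarted_nr, "new", keep_only_latest=True)
```

With both snapshots kept, `latest_snapshot(stale)` returns the "old" state because its sequenceNr (100) is higher than the new one (1). Keeping only the latest snapshot, as the slide recommends, avoids that.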
  • 24. Event Sourcing and Ecosystem: an event store is expected to hold the complete history of events; deviating from that weakens the support you get from the ecosystem (plug-ins).
      “it stores a complete history of the events associated with the aggregates in your domain”
      Reference 3: Introducing Event Sourcing, CQRS Journey [CQJ]
  • 25. Summary: put several mechanisms in place for removing unreachable members; operating a journal like a cache can be surprisingly hard.
      • Lessons learned from devops of an Akka Cluster app
      • Strategy on unreachables removal
          • scale-in trigger
          • automatic restart/re-creation
          • fallback mechanism: split brain resolver / rejoining
      • Lifecycle of journals
          • cost of deviation from Event Sourcing
  • 26. References
      [CQJ] Exploring CQRS and Event Sourcing, Dominic Betts, Julian Dominguez, Grigori Melnik, Fernando Simonazzi, Mani Subramanian, 2012, https://msdn.microsoft.com/en-us/library/jj554200.aspx
      [PSE] Persistence - Schema Evolution, Akka Documentation, http://doc.akka.io/docs/akka/2.4/scala/persistence-schema-evolution.html