This document discusses Typesafe's Reactive Platform, which includes tools for monitoring, orchestrating, and improving the resilience of reactive applications. It provides an overview of the platform and its commercial features, and then focuses on two aspects in more depth: improving fault tolerance through Akka's split brain resolver strategies, and system orchestration using ConductR to deploy and manage reactive microservices applications across clusters. The document encourages readers to get started with the platform through a free developer sandbox and to contact Typesafe for additional help and services.
12. Developers empowered
“You allowed us to come up with a design that we could only dream of before.”
“It’s hard to put into words how exciting it has been to work on a project like this.”
“You made programming fun again.”
“You saved my career.”
18. Mitigate Data Loss
Reduce Ops Burden
Improve Cluster Health
[Diagram labels: OPEN CORE; System Orchestration; Application Monitoring; Application Availability; Partition Healing; Security Notifications; Legacy Integration; Expert Support; Certified Build]
19. Protect Servers
Delight Customers
Block Bad Behavior
20. Unlock Data
Revitalize Architecture
Maximize Investments
21. Reduce Risk
Ease Maintenance
Improve Predictability
22. Eliminate Conflicts
Reduce Guesswork
Speed Development
23. Boost Productivity
Mitigate Production Risk
Speed Knowledge Transfer
29. • Network partitions - fundamental problem in distributed systems
• Akka SBR helps make decisions
• Pre-built strategies, when to down nodes in cluster
• Static Quorum (like Zookeeper)
• Keep Majority
• Keep Oldest
• Keep Referee
Akka Split Brain Resolver
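To make the first two strategies on this slide concrete, here is a minimal sketch of the decision each side of a partition would take. It is plain Java written for illustration and is not the Akka Split Brain Resolver API: the class `DowningSketch`, its method names, and the `Decision` enum are all hypothetical. The tie-break for an even split (the side holding the lowest address survives) is stated as an assumption of the sketch.

```java
// Illustrative sketch only: NOT the Akka Split Brain Resolver API.
// Every name here (DowningSketch, Decision, method names) is hypothetical.
final class DowningSketch {

    enum Decision { KEEP_RUNNING, DOWN_SELF }

    // Static quorum: a side survives only if it still sees at least
    // quorumSize members; otherwise it downs itself.
    static Decision staticQuorum(int reachable, int quorumSize) {
        return reachable >= quorumSize ? Decision.KEEP_RUNNING : Decision.DOWN_SELF;
    }

    // Keep majority: a side survives only if it holds more than half of the
    // last known membership. Assumption of this sketch: on an exact tie the
    // side that contains the lowest address survives.
    static Decision keepMajority(int reachable, int totalKnown, boolean holdsLowestAddress) {
        if (2 * reachable > totalKnown) {
            return Decision.KEEP_RUNNING;
        }
        if (2 * reachable == totalKnown && holdsLowestAddress) {
            return Decision.KEEP_RUNNING;
        }
        return Decision.DOWN_SELF;
    }

    public static void main(String[] args) {
        // A 5-node cluster splits into a side of 3 and a side of 2.
        System.out.println("quorum 3, side of 3: " + staticQuorum(3, 3));
        System.out.println("quorum 3, side of 2: " + staticQuorum(2, 3));
        System.out.println("majority, side of 3: " + keepMajority(3, 5, false));
        System.out.println("majority, side of 2: " + keepMajority(2, 5, true));
    }
}
```

Both sides evaluate the same rule independently, which is why the rule must guarantee that at most one side decides to keep running.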
32. Heartbeats
Hey team, `n-1` is down! I’ll take over `A`!
A
What network partitions look like to Ops
33. Heartbeats
A A
Good if n-1 really is down
Bad if n-1 is just very unresponsive
Fundamentally, it is hard to distinguish the two states in distributed systems
What network partitions look like to Ops
Hey team, `n-1` is down! I’ll take over `A`!
42. oldest node
down-if-all-alone
Keep Oldest
A
can’t see oldest node!
oldest node can change,
if “up until now oldest node” leaves the cluster
This is more dynamic than keep-referee.
Akka Split Brain Resolver
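The Keep Oldest rule on this slide, including the down-if-all-alone option, can be sketched as a small decision function. This is a simplified illustration in plain Java, not the Akka Split Brain Resolver implementation: the class `KeepOldestSketch` and its parameters are hypothetical, and the sketch assumes each side knows the last agreed membership size and ignores edge cases such as very small clusters.

```java
// Illustrative sketch only: NOT the Akka Split Brain Resolver API.
// All names and parameters are hypothetical.
final class KeepOldestSketch {

    enum Decision { KEEP_RUNNING, DOWN_SELF }

    // seesOldest     : is the oldest member on this side of the partition?
    // reachable      : number of members this side can still see (including itself)
    // totalKnown     : size of the last agreed membership
    // downIfAllAlone : the option shown on the slide
    static Decision decide(boolean seesOldest, int reachable, int totalKnown, boolean downIfAllAlone) {
        boolean oldestIsAlone = seesOldest
                ? (reachable == 1 && totalKnown > 1)   // this side is just the oldest node
                : (reachable == totalKnown - 1);       // everyone except the oldest is here

        if (downIfAllAlone && oldestIsAlone) {
            // The isolated oldest node steps down; the rest keep running.
            return seesOldest ? Decision.DOWN_SELF : Decision.KEEP_RUNNING;
        }
        // Default rule: the side that contains the oldest node survives.
        return seesOldest ? Decision.KEEP_RUNNING : Decision.DOWN_SELF;
    }

    public static void main(String[] args) {
        System.out.println("oldest + 2 others of 5: " + decide(true, 3, 5, true));
        System.out.println("2 nodes, no oldest:     " + decide(false, 2, 5, true));
        System.out.println("oldest alone of 5:      " + decide(true, 1, 5, true));
        System.out.println("4 nodes, oldest alone:  " + decide(false, 4, 5, true));
    }
}
```

Because the oldest member changes only when it leaves the cluster, both sides can usually agree on who it is without extra coordination.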
43. • No Brainer – Using Akka Cluster, deploy AWS
• Next Steps - read docs, download Reactive Platform
Akka Split Brain Resolver
45. ConductR
• Message-driven apps run on 10s, 100s, 1000s of nodes
• Beyond 3 nodes, challenging for ops
• ConductR eases deployment and management
• Focused on resilience for your system, not infrastructure
47. ConductR
• Hardcore resilience for systems
• Load balancing at scale
• Auto recovery of failed apps/nodes
• Advanced partition resolution
48. ConductR
• Smooth release process
• Sandbox for Dev and Ops
• Immutable, standardized
• Various packaging formats
(Docker, JVM)
49. ConductR
• Keep your existing tools
• Infrastructure agnostic
• Combine with Monitoring
• Consolidated logging
50. Without ConductR
• Build machines
• OS
• App server
• Apps lifecycle
• Add resilience
• Config load balancer
• Config port
With ConductR
• Build nodes w/ ConductR
• OS
• ConductR
• Deploy apps/services to cluster via ConductR
Resilient from the core, not as an add-on
51. ConductR
• No Brainer – Using Akka Cluster, deploy AWS, 3+ nodes
• Next Steps - view interactive demo, enjoy sandbox
60. Sign up to get license ID
• Get Started on Typesafe.com
• Register for a free account
• Apply ID to existing project, or start a new one
Getting Started with RP
61. Use with your new RP project
• Developer sandbox with Docker
• Full deployment evaluation also available
Experiment with ConductR
63. GET IN TOUCH
Help is just a click away. Get in touch
with Typesafe about:
• Production licensing and subscriptions
• Additional services and support
• On-site, expert training
CONTACT US
Editor's Notes
OPENING GAMBIT – TAILOR TO FIT YOUR AUDIENCE
I’m sure you’ve heard the famous quote from Marc Andreessen: "Software is eating the world." Well, if you believe that is true, then the next logical question becomes, "How will your business react?"
Like you, nearly every developer and architect we speak with recognizes the need for their business to react with the speed and agility of a software startup in order to survive. However, the majority of businesses are not equipped to make that happen.
A primary reason is that existing developers are hamstrung by old tools.
That’s where Typesafe comes in. We've created an application development platform that is purpose-built for this era of multicore and cloud computing architectures, so any developer in any size of enterprise can build highly distributed applications that react to user demand, react to web-scale load, and react to inevitable failure - all by design. We call these Reactive systems.
During the next 45 minutes, we’ll set the stage for you to understand how our Platform can help you react like a software startup by sharing a few examples of customers that are leveraging the product to disrupt or enter new markets;
Then we’ll provide you with an overview of the technology so you better understand how we arm developers to build non-blocking, highly distributed applications with tools that delight developers. Next, we’ll dive deep into a handful of Reactive Platform features we’ve built to help you strengthen the resilience of your systems.
Then we’ll close with some actionable next steps to get you started.
At the end of this session, we hope you will agree that getting started on the Reactive Platform is the right choice to enable your business to react like a software startup, to not only survive, but to thrive, in this new world.
During the next 45 minutes, we’ll set the stage for you to understand how our Platform can help you react like a software startup by sharing a few examples of customers that are leveraging the product to disrupt or enter new markets;
Then we’ll walk you through the technology so you better understand how we arm developers to build non-blocking, highly distributed applications with tools that make programming fun again.
We’ll tie it all together by exploring the benefits and results that have been achieved by other enterprises with goals similar to yours including – (swap for your audience: accelerating time to market, integrating with everything and delighting developers).
Then we’ll close with some actionable next steps to get you started.
At the end of this session, we hope you will agree that standardizing on the Reactive Platform is the right choice to enable your business to react like a software startup, to not only survive, but to thrive, in this new world.
There is more and more evidence that Marc Andreessen was right: software truly is eating the world.
Consequently, in today’s market, it’s no longer about the BIG fish eating the LITTLE fish.
It’s about the FAST fish eating the SLOW fish.
Perhaps the most quintessential fast fish success story is Netflix.
Back in 2000, Blockbuster declined several offers to purchase Netflix for a mere $50 million. In 2010, Blockbuster declared bankruptcy.
If companies ignore this, they do so at their own peril.
Because software is eating the world, more and more, you see traditional businesses recasting themselves as technology companies.
We all are familiar with the transition at Amazon to offer AWS.
Similarly, our customer MLB is extending its business model to become a media company offering a streaming video delivery service under a division called MLB Advanced Media. Just this past year, the streaming division brought in over $100 million in revenue.
The streaming platform first found its way into headlines as Fortune reported that HBO signed a contract with MLB Advanced Media to bring the highly-anticipated HBO Go streaming service to fruition with Game of Thrones as the centerpiece.
More recently, the Wall Street Journal reported that MLB Advanced Media was in talks with 40 potential partners, “including many traditional TV networks,” and revealed that Major League Baseball could begin the process of spinning off MLB Advanced Media into a separate company.
Another example of a traditional business recasting itself as a software company is GE.
GE's investment of $1.5 billion in the 'industrial Internet’ is predicted to add $15 trillion to world GDP. The company launched 14 new Industrial Internet predictive technologies to improve outcomes for aviation, oil & gas, transportation, healthcare and energy under its Predix platform, of which our technology plays a part.
There is so much development that lies ahead for GE – and nearly every company in business today – that attracting and retaining top talent has become a critical competitive differentiator.
For the first time in our lifetimes, we are seeing a top-tier company fighting for top talent on prime time television.
Gartner predicts 50% of all software developed in the next 5 years will require a new model.
Gartner refers to this new model in a number of ways: web-scale apps, Microservices Architecture….
We call this new model Reactive.
Reactive software applications are:
Message-Driven—processing messages in parallel, asynchronously, without blocking.
Elastic—scaling predictably and elastically on-demand, across cores, nodes, and clusters;
Resilient—recovering and repairing itself automatically for seamless business continuity.
Responsive—rich and engaging, providing instant feedback based on user interactions.
To fuel this new shift, we spearheaded The Reactive Manifesto with the goal of defining a common vocabulary, both in terms of business values and technical concepts, to make it easier for developers, users, businesses, and vendors to discuss, collaborate, and innovate around this new class of applications.
It seems to be resonating.
The Manifesto has thousands of signatories, has been translated into 5 languages, and has experienced broad industry adoption ....
But we didn’t stop there.
Because Reactive applications are at the heart of a fundamental shift from Data at Rest to Data in Motion, we are driving industry collaboration and innovation, again, with the Reactive Streams specification.
Handling streams of data—especially “live” data whose volume is not predetermined—requires special care in an asynchronous system.
The most prominent issue is that resource consumption needs to be controlled such that a fast data source does not overwhelm the stream destination.
Reactive Streams is an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure on the JVM.
It is slated for inclusion in JDK 9.
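As a rough illustration of non-blocking back pressure, the sketch below uses the `java.util.concurrent.Flow` interfaces and `SubmissionPublisher` that shipped with JDK 9. The subscriber signals demand one element at a time with `request(1)`, so the publisher never pushes more than the subscriber has asked for. The class names `BackPressureSketch` and `OneAtATimeSubscriber` are hypothetical, and this is a minimal example rather than a production pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

// Minimal back-pressure sketch using the JDK 9+ Flow API.
// Class names are hypothetical and chosen for this example only.
final class BackPressureSketch {

    // A subscriber that asks for exactly one element at a time.
    static final class OneAtATimeSubscriber implements Flow.Subscriber<Integer> {
        final List<Integer> received = new ArrayList<>();
        final CountDownLatch done = new CountDownLatch(1);
        private Flow.Subscription subscription;

        @Override public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(1);                 // initial demand: one element
        }
        @Override public void onNext(Integer item) {
            received.add(item);           // "process" the element
            subscription.request(1);      // only now signal demand for the next one
        }
        @Override public void onError(Throwable t) { done.countDown(); }
        @Override public void onComplete() { done.countDown(); }
    }

    // Publishes 1..count and returns what the subscriber received, in order.
    static List<Integer> run(int count) {
        OneAtATimeSubscriber subscriber = new OneAtATimeSubscriber();
        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(subscriber);
            for (int i = 1; i <= count; i++) {
                publisher.submit(i);      // blocks only if the subscriber's buffer is full
            }
        }                                 // close() signals onComplete after pending items
        try {
            subscriber.done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for completion", e);
        }
        return subscriber.received;
    }

    public static void main(String[] args) {
        System.out.println(run(5));
    }
}
```

The key design point is that demand flows upstream: a slow subscriber slows the source down instead of being overwhelmed by it.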
Developers around the world rejoiced.
So, what makes this all possible? Our product, the Reactive Platform.
Like most enterprises with mission-critical projects, you have shared a lot of common challenges that are addressed with our enterprise distribution, the Reactive Platform, which is licensed through our annual subscription.
Similar to other infrastructure vendors, we have a commercial product that offers enterprise features that deliver value above and beyond the open core.
At the heart of our Reactive Platform is Akka, a message-driven middleware or runtime with Scala and Java APIs designed to deal with a tremendous amount of scale, often orders of magnitude higher than other transaction rates you’ve experienced in your entire career. It allows apps to fail and self-heal, delivering exceptional resilience.
Play Framework is the web development framework with Scala and Java APIs that sits on top of Akka. It empowers developers to build completely non-blocking, asynchronous apps with an ease unparalleled on the JVM.
An addition to our Platform, Apache Spark - which is written in Scala and Akka - is quickly becoming a de facto standard for the fast data that’s fueling Reactive applications.
All of these technologies are written in Scala, the functional and object oriented programming language that makes us Reactive and helps developers write code that’s more concise than other options, so apps are less costly to maintain and easier to evolve.
Monitor Message-Driven Apps
Reactive Platform provides instrumentation for monitoring message-driven, actor-based systems. With the Typesafe Monitoring SPI, you have complete flexibility to integrate with third-party or in-house solutions to generate the visibility you need to:
Enhance usability
Identify bottlenecks
Improve performance
ConductR makes the process of deploying a microservices architecture consistent and reliable. In the past, our customers had to use their own custom solutions that were quite complex to get right. With ConductR, instead of having to figure out how to launch and keep running dozens of microservices, you let ConductR provide a ‘platform’ for them all to:
Boost overall system resilience
Streamline rollouts
Increase predictability
Because so many of our enterprise customers were deploying Akka Cluster on Amazon AWS, we created a way for you to Resolve Network Partitions Decisively.
Akka Split Brain Resolver provides a set of advanced recovery strategies for unreachable nodes in Akka Clusters, so you can mitigate network partitions and the data loss they can cause.
This allows you to:
Mitigate data loss
Reduce the burden on your ops team
Improve the health of the clusters in your network
Protect Apps Against Abuse
Protect backend servers from becoming overwhelmed by badly behaved (or overly-enthusiastic) users or bots. Play User Quotas ensure your app remains readily available for high-value users.
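One common way to enforce a per-user quota is a token bucket. The sketch below is a generic illustration of that technique in plain Java; it is not the Play User Quotas API, and the class `TokenBucketSketch`, its constructor parameters, and the explicit clock argument are assumptions made for this example.

```java
// Generic token-bucket sketch: NOT the Play User Quotas API.
// The caller passes the current time explicitly so behaviour is deterministic.
final class TokenBucketSketch {
    private final long capacity;             // maximum burst size
    private final long refillIntervalNanos;  // one token is added per interval
    private long tokens;
    private long lastRefillNanos;

    TokenBucketSketch(long capacity, long refillIntervalNanos, long nowNanos) {
        this.capacity = capacity;
        this.refillIntervalNanos = refillIntervalNanos;
        this.tokens = capacity;               // start with a full bucket
        this.lastRefillNanos = nowNanos;
    }

    // Returns true if the request is within quota, false if it should be rejected.
    synchronized boolean tryAcquire(long nowNanos) {
        long newTokens = (nowNanos - lastRefillNanos) / refillIntervalNanos;
        if (newTokens > 0) {
            tokens = Math.min(capacity, tokens + newTokens);
            lastRefillNanos += newTokens * refillIntervalNanos;
        }
        if (tokens > 0) {
            tokens--;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        long second = 1_000_000_000L;
        // Hypothetical quota: bursts of 2 requests, refilled at 1 request per second.
        TokenBucketSketch quota = new TokenBucketSketch(2, second, 0L);
        System.out.println(quota.tryAcquire(0L));      // first request in the burst
        System.out.println(quota.tryAcquire(0L));      // second request in the burst
        System.out.println(quota.tryAcquire(0L));      // over quota
        System.out.println(quota.tryAcquire(second));  // one token refilled
    }
}
```

In a web application there would typically be one bucket per user or API key, looked up before the request reaches the backend.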
Above and beyond the strong integration capabilities of the open core, Reactive Platform includes a set of Slick database extensions that deliver asynchronous stream processing with non-blocking back-pressure when accessing Oracle, IBM DB2, and MS SQL Server.
Additionally, Play SOAP makes accessing legacy systems Reactive with a message-driven approach. Unlike traditional SOAP clients, Play SOAP provides non-blocking clients as a first class feature so you can:
Unlock data
Revitalize your existing architecture
Maximize your historical investments
Eliminate Incompatibility & Security Risks
Reactive Platform provides automated alerts for security issues (with priority releases for patches) as well as friendly warnings about version incompatibilities and end of life dates, ensuring Dev or Ops can drop in patches without fear of breaking anything so you can
Reduce risk
Ease maintenance
Improve predictability
Reduce Development and Production Guesswork
Reactive Platform is certified for production environments by validating all software—including third-party libraries—against a comprehensive suite of integration test cases that are benchmarked for scale and performance under heavy load so you can:
Eliminate conflicts
Reduce guesswork
Speed development
We are on this journey together and your success is our top priority. That’s why our annual license for Reactive Platform includes development and production support.
Unlike other vendors that barricade their experts behind layers of escalation, your team receives direct access to the creators and committers of our amazing technology so you can
Boost the productivity of your team
Mitigate production risk with business hour or 24/7 support
Speed knowledge transfer for both dev and ops
While the Reactive Platform can be used to build nearly any type of application, we are seeing three recurring use cases in our enterprise customer base.
First and foremost is the need to integrate with everything, especially with the tsunami of connected devices driven by mobile and IoT initiatives.
Next is the transformative shift From Data at Rest to Data in Motion, or Fast Data, which is being fueled by the rapid adoption of Apache Spark.
Finally, and perhaps most importantly, nearly every customer we speak with is looking for ways to break their brittle Monoliths into agile Microservice-based systems, which Gartner has identified as a Top 10 Strategic Trend for the enterprise.
The default behaviour of Akka Cluster is “Manual Downing”: a node needs to issue cluster.down(address). This decision can be driven by external monitoring or by DevOps teams observing the cluster. It is safe, but it involves the most human or automated work.
A naive implementation called “auto-downing” also exists. It is not safe to use in real clusters, and it is definitely not a good choice for apps using Persistence. It is not recommended for production deployments.
Network partitions are a fundamental problem in all distributed systems.
Akka SBR helps to make decisions; it is not a magic wand.
It offers a set of pre-built strategies that decide when to down nodes in a cluster:
Static Quorum (like Zookeeper)
Keep Majority
Keep Oldest
Keep Referee
Cluster sharding is the actual feature people want: “just balance this stuff on my cluster”.
It works by consistent hashing.
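For readers unfamiliar with the technique named in this note, here is a generic consistent-hashing ring in plain Java. It illustrates the general idea only and makes no claim about how Akka Cluster Sharding is implemented internally: the class `HashRingSketch`, the use of CRC32 as the hash function, and the virtual-node count are all assumptions of this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Generic consistent-hashing ring: an illustration of the technique,
// not Akka Cluster Sharding's implementation. All names are hypothetical.
final class HashRingSketch {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    HashRingSketch(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // CRC32 is used only because it ships with the JDK; any stable hash works.
    private static long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Each node is placed on the ring at several points to even out the load.
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i), node); // remove only this node's points
        }
    }

    // An entity belongs to the first node found clockwise from its hash.
    String nodeFor(String entityId) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(entityId));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRingSketch ring = new HashRingSketch(50);
        ring.addNode("node-a");
        ring.addNode("node-b");
        ring.addNode("node-c");
        System.out.println("entity-42 lives on " + ring.nodeFor("entity-42"));
        ring.removeNode("node-c");
        System.out.println("after node-c leaves: " + ring.nodeFor("entity-42"));
    }
}
```

The useful property is that when a node leaves, only the entities that lived on that node move; everything else stays where it was.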
These message-driven fully asynchronous apps can be running on 10s of nodes, 100s of nodes, or even thousands - but for the ops guys anything beyond 3 nodes is a bit of a problem to manage
Consequently, anything beyond 3 nodes is the focus – or sweet spot – for ConductR.
It’s also application focused. In particular, Typesafe RP applications….of course.
By that, I mean it’s not focused on managing infrastructure – like bringing machines up or down, although it can certainly respond to its environment.
Typesafe ConductR is designed to address the challenges of deploying and managing a Reactive system in a distributed environment while prioritizing system responsiveness and uptime for users.
From a developer point of view, you can create a distribution with the sbt native packager.
Bundle it, even as a Docker target, so these bundles can run within Docker containers.
And then we provide a number of tools for the ops guy.
Let’s dive in and see how it works….
Automated cluster seed node establishment
» Automates your cluster startup sequence
Dynamic service discovery
» Gossips across your cluster to discover services with dynamically changing locations
Scalable rolling updates
» Supports multiple versions of applications
No single point of failure
» Decentralized, peer-to-peer, fully resilient
Load balancing at scale
» Maintain resilience with proxying
Automated recovery for failed applications
» Automatically restart the victim somewhere else on the cluster
Advanced network partition resolution
» Resolve issues decisively with automated handling strategies using Akka Split Brain Resolver
DevOps sandbox for smooth releases
» Runs locally to test & debug your app in a cluster before moving into production (like) environments
Immutable, standardized releases
» App bundles and configuration are uniquely fingerprinted for immutable versioning
Various packaging formats
» Supports JVM and Docker-based formats
Infrastructure agnostic
» Deploy to Amazon EC2, bare metal and (soon) Mesos/DCOS; support for Linux distributions including Debian and RHEL/Fedora
Typesafe Monitoring
» Combine ConductR with Typesafe Monitoring to visualize, track and tune your system
Consolidated events and logging
» Integration with ElasticSearch and Kibana
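The "uniquely fingerprinted" releases mentioned in the list above can be pictured as content hashing: the version identifier is derived from the bundle's bytes, so any change to the bundle produces a different identifier. The sketch below shows that idea with SHA-256 from the JDK. It is an assumption-laden illustration, not ConductR's bundle format: the class `BundleFingerprintSketch` and the `name-digest` naming scheme are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Content-fingerprinting sketch: illustrates immutable versioning by hash.
// The naming scheme and class are hypothetical, not ConductR's actual format.
final class BundleFingerprintSketch {

    // Hex-encoded SHA-256 digest of the given content.
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // A release name that changes whenever the content changes.
    static String versionedName(String bundleName, byte[] content) {
        return bundleName + "-" + sha256Hex(content);
    }

    public static void main(String[] args) {
        byte[] v1 = "app build 1".getBytes(StandardCharsets.UTF_8);
        byte[] v2 = "app build 2".getBytes(StandardCharsets.UTF_8);
        System.out.println(versionedName("my-service", v1));
        System.out.println(versionedName("my-service", v2));
    }
}
```

Because the identifier is a function of the content, two nodes that hold a bundle with the same name can be confident they hold the same bytes.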
ConductR is a no brainer if
You are using Akka Cluster
You are using AWS
Especially if running on multiple Availability Zones, where connectivity loss between AZs is common
Attempting to solve with Zookeeper
You are using Typesafe Reactive Platform (Akka, Play, or Scala)
Note: ConductR also supports non-RP apps, including non-JVM applications
You have 3 or more nodes in your deployment
Reactive apps introduce a completely new communication model. This evolution from synchronous to asynchronous apps opens up a ton of new challenges:
(1) Context is lost when crossing asynchronous boundaries.
(2) Stack traces are less useful, as there may be no indication of where errors actually occurred in your own code.
(3) Collecting traces for all asynchronous steps is expensive.
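Challenge (1) is easy to reproduce with nothing but the JDK. In the sketch below, a request id stored in a `ThreadLocal` is invisible to a task that runs on another thread, unless the value is captured explicitly before crossing the asynchronous boundary. The class `AsyncContextSketch` and the `REQUEST_ID` name are hypothetical and exist only for this illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Shows how thread-bound context disappears across an asynchronous boundary.
// Names are hypothetical and chosen for this example only.
final class AsyncContextSketch {
    static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

    // Returns { value seen without propagation, value seen with explicit capture }.
    static String[] demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            REQUEST_ID.set("req-42");

            // The task runs on the worker thread, where the ThreadLocal was never set.
            String lost = CompletableFuture
                    .supplyAsync(() -> String.valueOf(REQUEST_ID.get()), worker)
                    .join();

            // Capturing the value on the calling thread carries it across.
            String captured = REQUEST_ID.get();
            String carried = CompletableFuture
                    .supplyAsync(() -> captured, worker)
                    .join();

            return new String[] { lost, carried };
        } finally {
            REQUEST_ID.remove();
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] result = demo();
        System.out.println("without propagation: " + result[0]);
        System.out.println("with explicit capture: " + result[1]);
    }
}
```

Monitoring tools for asynchronous systems have to do this kind of capture and hand-off automatically, which is part of why tracing every step is expensive.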
Get the big picture in real-time
» You can pinpoint runtime errors with snapshots to help you optimize
Configurable metrics to keep performance overhead low
» Avoid flooding your system and directly manage performance impact by focusing only on relevant actor metrics
Customizable thresholds for actor failures
» Configurable thresholds notify you about load-related effects, SLA metrics tracking, and extraordinary events
Design for performance
» Performance doesn’t just happen. To make sure you can live up to your SLAs, you need to be able to locate and eliminate your bottlenecks so that developers can build and maintain more performant systems.
From bird’s eye view to microscope
» Drill down to the code level with runtime snapshots that provide stack information
Bulletproof resilience by integrating with ConductR
» Run stand-alone or integrate with ConductR for even more visibility into cluster start-up times, etc.
Scala and Java 8 Futures
Tracking Scala and Java 8 Futures in Play helps developers drill down further into their code base and look for optimization points there.
Akka Streams and Data Flows
Highlighting data flows, such as application-specific transactions and user requests, is a feature that will give more insight to Akka Streams and other flow-based tools. This will require some abstraction and a lot of work.
Integration layer for in-house/custom monitoring systems
Based on feedback from early users, we hope to soon produce a Service Provider Interface (SPI) instrumentation layer for integrating Typesafe Monitoring with custom or in-house monitoring systems.
End-to-end web request tracing for Play and Akka HTTP
By following the calls to your web application or REST service all the way from the request coming in to the response going out, you can gain insight into your true response times and where you're spending your time and resources.
Yes, software is eating the world. And, in response, the world is going Reactive.
Bring hardcore resilience to your system.
Take on failures, and watch them self-heal.
Without resiliency, nothing else really matters (plus, we discovered that duct tape doesn’t work on code). ConductR keeps your systems operational in light of all sorts of trouble.
No single point of failure
ConductR won’t bottleneck your system. Unlike other tools, ConductR employs a decentralized, peer-to-peer, fully resilient architecture.
Load balancing at scale
Balance all the loads! Maintain resilience with proxying, ensure each node or cluster gets just the right amount of action and remains responsive.
Automated recovery for failed applications
Of course, your apps never fail. But if it does happen, ConductR will automatically restart the victim somewhere else on the cluster.
Automated node failure handling
Nodes these days, right? When ConductR spots a node failure, self-healing mechanisms are engaged.
Automated network partition resolution
Are nodes in your cluster down, or is it the network? Resolve issues decisively with automated handling strategies using Akka Split Brain Resolver. Silver bullet not included.