My name is Victor Perepelitsky. I'm an R&D Technical Leader at LivePerson, leading the 'Real Time Event Processing Platform' team.
In this Meetup I talked about the journey of creating the platform from scratch - challenges, design decisions, technology choices and more.
Over the last 3 years the team has built the Real Time Event Processing Platform, which is currently running in production with thousands of new and migrated customers. It is built to handle hundreds of thousands of requests per second with low-latency responses (under 30 ms round trip).
I went through different topics and stages of this journey and shared details that led to specific choices and results.
“Stateful or Stateless”, “CEP”, “Rules engine”, “Automated performance testing”, “Locking” and “Timing” were all on the menu.
2. System revolution
How we did it
Victor Perepelitsky
questions: www.meetup.com/ILTechTalks/events/226834931/
slideshare: www.slideshare.net/victorperepelitsky
email: victor.prp@gmail.com
3. LivePerson customer example
[Diagram]
● salesman - exchanges chat lines with the visitor
● visitor from UK - chat lines, get session state, activity events, receives invite
● sales manager - invite UK visitors to chat, see reports
4. LivePerson at a glance
● Account (brand) - a LivePerson customer
● Visitor - an individual who interacts with the business owner’s brand
● Agent - an account representative who may interact with visitors (examples: technical support, sales)
● Admin - an account representative who defines the business goals and normally manages agents in order to reach them effectively
5. LivePerson at a glance
[Diagram]
● agent - chat lines; chat scale (2K req/sec)
● visitor - chat lines, get session state, activity events, receives invite; visitor scale (100K req/sec)
● admin - define business rules, see reports; admin scale (under 100 req/sec)
6. Legacy
[Diagram]
● agent - chat lines; chat scale (2K req/sec)
● visitor - chat lines, get session state, activity events; visitor scale (100K req/sec)
● admin - define business rules, see reports; admin scale (under 100 req/sec)
● legacy components: Real Time Server; Offline and Reporting
7. Legacy - stateful + account sticky
[Diagram]
● sessions from account A and from account B arrive at two web servers
● three RT servers, each holding a fixed set of accounts: (E, F, G), (A, C), (B, D)
9. Legacy - pains
● Hard to scale
● Hard to add new features
● Poor resource utilization
● Poor manageability
● Poor QoS
● Huge friction with customers
10. Let's go back
[Diagram: same as slide 5]
11. Proper system architecture
[Diagram]
● agent - chat lines; chat scale (2K req/sec)
● visitor - chat lines, get session state, activity events; visitor scale (100K req/sec)
● admin - define business rules, see reports; admin scale (under 100 req/sec)
● components: real time, offline, reporting, config
12. The new dream
[Diagram]
● agent - chat lines; chat scale (2K req/sec)
● visitor - chat lines, session state, activity events; visitor scale (100K req/sec)
● admin - business rules, see reports; admin scale (under 1K req/sec)
● components: chat, offline, reporting, config, monitor and engage
* Business App / Extension
13. Monitor and engage = shark
Shark manifesto
● Collects and makes available data about
individuals (visitors) as they interact with the
business owner’s brand (account)
● Acts in real time to engage visitors (chat, ad, call, etc.)
● Is a platform for business logic modules (sharklets), which can be developed and deployed independently
15. Platform requirements
● E2E latency within DC < 30 ms
● Good resource utilization (CPU > 50%)
● Efficient - At least 500 req/sec per node
● Sharklet development lifecycle is independent
● High Availability
○ uptime > 99.99999%
○ data loss < 0.01%
● Resilient - no service downtime when an external resource is unavailable (minimal degradation is allowed)
● Business logic correctness - 99.9%
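To make two of these targets concrete, here is a small arithmetic sketch. It assumes a 365-day year and no capacity headroom; the class and method names are hypothetical and only serve the illustration. It shows that an uptime above 99.99999% leaves a downtime budget of roughly 3 seconds per year, and that at the minimum efficiency of 500 req/sec per node the 100K req/sec visitor scale needs 200 nodes.

```java
class RequirementsMath {

    // Allowed downtime per 365-day year, in seconds, for a given uptime percentage.
    static double allowedDowntimeSecondsPerYear(double uptimePercent) {
        double secondsPerYear = 365.0 * 24 * 60 * 60;
        return secondsPerYear * (100.0 - uptimePercent) / 100.0;
    }

    // Nodes needed for a request rate when each node handles perNode req/sec
    // (rounded up, no headroom assumed).
    static long nodesNeeded(long requestsPerSec, long perNodeRequestsPerSec) {
        return (requestsPerSec + perNodeRequestsPerSec - 1) / perNodeRequestsPerSec;
    }

    public static void main(String[] args) {
        System.out.printf("downtime budget: %.2f s/year%n",
                allowedDowntimeSecondsPerYear(99.99999));
        System.out.println("nodes for 100K req/sec: " + nodesNeeded(100_000, 500));
    }
}
```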
18. Stateless
[Diagram: session 1, session 2, session 3 and session 4 in front of a shared session data store]
Each request potentially requires access to the session data store
19. Facts that helped us to decide
1. Legacy works as “Stateful without HA”
2. A small data loss has a tiny customer
impact (0.01% loss is good enough)
3. Stateless requires much more
resources and initial effort
4. We can add HA store in the future
23. Legacy - successful patterns
1. Requests are processed in memory
2. External resources are accessed
asynchronously to visitor requests
3. Customer Rules and Data (AccountConfig) are kept in memory and may be updated in the background
24. Legacy - pains
1. Order of calls (inside code + rules)
2. Business logic is not made of pluggable components
3. HTTP requests are tightly coupled with the logical layers (hard to move to other protocols such as WebSockets)
26. SYNC - fast CEP, engagements
ASYNC - slow actions, external resource access
[Diagram: web visitor, agent and mobile visitor; facade with three adapters; sharklet A with sync handlers and async handlers; Account Runtime Data; Message BUS; external resource]
27. Shark - The Big Parts
1. Facade - decouples real world protocols
from the logical layers
2. CEP - avoids call order management
3. Sync - very fast in memory processing
4. Async - allows slow actions and external resource access
5. Account Runtime Store - allows in
memory access to customer
configuration
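The split between sync and async handlers could look roughly like the sketch below. This is only an illustration under assumptions: the class `SyncAsyncSketch`, its methods and the string-based config are all hypothetical and are not Shark's actual API. The sync handler answers from memory only, while the async handler performs the slow external call on a separate thread pool and publishes the result into the in-memory store.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SyncAsyncSketch {

    // Stand-in for the in-memory Account Runtime Store (hypothetical shape).
    private final Map<String, String> accountRuntimeStore = new ConcurrentHashMap<>();

    // Sync handler: serves the visitor request from memory only.
    String handleSync(String accountId) {
        return accountRuntimeStore.getOrDefault(accountId, "default-config");
    }

    // Async handler: runs the slow external call off the request path,
    // then publishes the result into the in-memory store.
    CompletableFuture<Void> handleAsync(String accountId, ExecutorService pool) {
        return CompletableFuture
                .supplyAsync(() -> loadFromExternalResource(accountId), pool)
                .thenAccept(config -> accountRuntimeStore.put(accountId, config));
    }

    // Hypothetical slow call (database, remote service, ...).
    private String loadFromExternalResource(String accountId) {
        return "config-for-" + accountId;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        SyncAsyncSketch shark = new SyncAsyncSketch();
        System.out.println(shark.handleSync("A"));   // served before the async load
        shark.handleAsync("A", pool).join();          // wait only for the demo
        System.out.println(shark.handleSync("A"));   // now served from memory
        pool.shutdown();
    }
}
```

In this sketch a visitor request never waits for the external resource; it sees either the previous value or the freshly loaded one.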
33. Drools - we tried to kill it
We had
● played with it - :)
● integrated into shark - :)
● made a POC using LivePerson logic - :)
● tested for performance - :(
44. Fundamental decisions
Stateful or stateless? -> stateful
What are the big parts? -> we have it
Basic technology stack -> chosen
CEP - Technology choice -> DIY (inhouse)
45. Fundamental decisions
Stateful or stateless? -> stateful
What are the big parts? -> we have it
Basic technology stack -> chosen
CEP - Technology choice -> DIY (inhouse)
Locking architecture
46. Locking - The model
[Diagram: nested scopes - the world contains accounts (account A shown), and an account contains several sessions]
47. Locking - Legacy pains
● You must be aware of locking when writing business logic
● A write lock on an account freezes all account operations
● Locking became the bottleneck (not the CPU)
● Bugs
48. Locking - Shark solution
● Read/Write lock for session
● Write business logic only - no locking
awareness
● No write lock on account - copy on write
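A minimal sketch of the two ideas on this slide, assuming plain Java 8 concurrency utilities. The names `AccountData`, `AccountRuntime` and `Session` are hypothetical and not taken from Shark. Account data is an immutable snapshot that is replaced as a whole (copy on write), so readers never need an account lock; the per-session read/write lock sits inside the session, so business logic calling it has no locking awareness.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.UnaryOperator;

class LockingSketch {

    // Immutable snapshot of account data; readers never see a partial update.
    static final class AccountData {
        final Map<String, String> settings;

        AccountData(Map<String, String> settings) {
            this.settings = Collections.unmodifiableMap(new HashMap<>(settings));
        }

        // Copy on write: build a new snapshot instead of mutating this one.
        AccountData with(String key, String value) {
            Map<String, String> copy = new HashMap<>(settings);
            copy.put(key, value);
            return new AccountData(copy);
        }
    }

    // Holds the current snapshot; an update swaps the reference atomically.
    static final class AccountRuntime {
        private final AtomicReference<AccountData> current;

        AccountRuntime(AccountData initial) {
            current = new AtomicReference<>(initial);
        }

        AccountData snapshot() {
            return current.get();
        }

        void update(UnaryOperator<AccountData> change) {
            current.updateAndGet(change);
        }
    }

    // Per-session read/write lock, hidden from the calling business logic.
    static final class Session {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private int pageViews;

        void recordPageView() {
            lock.writeLock().lock();
            try { pageViews++; } finally { lock.writeLock().unlock(); }
        }

        int pageViews() {
            lock.readLock().lock();
            try { return pageViews; } finally { lock.readLock().unlock(); }
        }
    }

    public static void main(String[] args) {
        AccountRuntime runtime = new AccountRuntime(new AccountData(new HashMap<>()));
        AccountData before = runtime.snapshot();          // what a running cycle would hold
        runtime.update(data -> data.with("geo", "UK"));   // background update
        System.out.println(before.settings.containsKey("geo"));
        System.out.println(runtime.snapshot().settings.get("geo"));
    }
}
```

A processing cycle that took `snapshot()` before the update keeps working on its own consistent copy, which is the property the copy-on-write bullet relies on.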
49. SYNC - a single processing cycle uses a consistent copy of account data
ASYNC - updates account data using the copy-on-write pattern
[Diagram: web visitor, agent and mobile visitor; facade with three adapters; sharklet A with sync handlers and async handlers; Account Runtime Data; external resource]
51. Fundamental decisions
Stateful or stateless? -> stateful
What are the big parts? -> we have it
Basic technology stack -> chosen
CEP - Technology choice -> DIY (inhouse)
Locking architecture -> decided
55. Dream = LiveEngage platform
[Diagram: same as slide 12]
56. Rules - from definition to runtime
[Diagram: the admin defines business rules (config); the visitor sends activity events to monitor and engage]
* Business App / Extension
Example rule: if the visitor meets the conditions -> invite to chat
58. What is a rules engine
A rules engine is a pluggable software component that executes business rules.
These rules are externalized or separated from application code.
65. GRF - Generic Rules Framework
Conditions and outcomes are building blocks that can be used to create complex rules
hard coded building blocks
● TimeOnPage
● GeoLocation
● InviteToChat
rule
if (
  timeOnPage(5)
  and
  geoLocation("US")
) execute {
  inviteToChat()
}
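The building-block idea can be sketched in Java 8 with functional interfaces. This is an illustration only, not GRF's real API: the `Visitor` and `Rule` classes are hypothetical, and reading `timeOnPage(5)` as "at least 5 seconds on the page" is an assumption, since the slide does not give the unit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

class GrfSketch {

    // Hypothetical visitor state that conditions are evaluated against.
    static final class Visitor {
        final String country;
        final int secondsOnPage;
        final List<String> outcomes = new ArrayList<>();

        Visitor(String country, int secondsOnPage) {
            this.country = country;
            this.secondsOnPage = secondsOnPage;
        }
    }

    // Condition building blocks (the unit "seconds" is an assumption).
    static Predicate<Visitor> timeOnPage(int seconds) {
        return v -> v.secondsOnPage >= seconds;
    }

    static Predicate<Visitor> geoLocation(String country) {
        return v -> country.equals(v.country);
    }

    // Outcome building block: records the action instead of really inviting.
    static Consumer<Visitor> inviteToChat() {
        return v -> v.outcomes.add("inviteToChat");
    }

    // A rule pairs a (possibly composed) condition with an outcome.
    static final class Rule {
        private final Predicate<Visitor> condition;
        private final Consumer<Visitor> outcome;

        Rule(Predicate<Visitor> condition, Consumer<Visitor> outcome) {
            this.condition = condition;
            this.outcome = outcome;
        }

        // Executes the outcome if the condition holds; returns whether it fired.
        boolean apply(Visitor visitor) {
            if (!condition.test(visitor)) {
                return false;
            }
            outcome.accept(visitor);
            return true;
        }
    }

    public static void main(String[] args) {
        Rule rule = new Rule(timeOnPage(5).and(geoLocation("US")), inviteToChat());
        Visitor visitor = new Visitor("US", 7);
        System.out.println(rule.apply(visitor) + " " + visitor.outcomes);
    }
}
```

Composing conditions with `Predicate.and` keeps each block small and hard coded, while the rule itself stays data that an admin could define.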
66. GRF + CEP = RulesEngine
GeoLocation condition
trigger when (geo data is changed)
evaluate(geo, accountConfig){
if (geo == accountConfig.geo)
TRUE
else
FALSE
}
The condition type implementor defines the evaluation trigger, instead of automatic detection
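One way to picture an implementor-declared trigger is sketched below. All names (`DataKind`, `Condition`, `Engine`) are hypothetical and the map-based session is an assumption made for brevity. Each condition states which kind of data change should trigger it, and the engine evaluates only the conditions registered for the kind that actually changed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class TriggerSketch {

    // Hypothetical kinds of session data whose change can trigger evaluation.
    enum DataKind { GEO, PAGE }

    interface Condition {
        DataKind trigger();   // declared by the implementor, not detected
        boolean evaluate(Map<DataKind, String> session, Map<DataKind, String> accountConfig);
    }

    static final class GeoLocationCondition implements Condition {
        @Override
        public DataKind trigger() {
            return DataKind.GEO;
        }

        @Override
        public boolean evaluate(Map<DataKind, String> session, Map<DataKind, String> accountConfig) {
            String geo = session.get(DataKind.GEO);
            return geo != null && geo.equals(accountConfig.get(DataKind.GEO));
        }
    }

    static final class Engine {
        private final Map<DataKind, List<Condition>> byTrigger = new EnumMap<>(DataKind.class);

        void register(Condition condition) {
            byTrigger.computeIfAbsent(condition.trigger(), k -> new ArrayList<>()).add(condition);
        }

        // Evaluates only the conditions whose trigger matches the changed data.
        List<Boolean> onChange(DataKind changed,
                               Map<DataKind, String> session,
                               Map<DataKind, String> accountConfig) {
            List<Boolean> results = new ArrayList<>();
            for (Condition c : byTrigger.getOrDefault(changed, Collections.emptyList())) {
                results.add(c.evaluate(session, accountConfig));
            }
            return results;
        }
    }

    public static void main(String[] args) {
        Engine engine = new Engine();
        engine.register(new GeoLocationCondition());
        Map<DataKind, String> session = new EnumMap<>(DataKind.class);
        Map<DataKind, String> config = new EnumMap<>(DataKind.class);
        session.put(DataKind.GEO, "UK");
        config.put(DataKind.GEO, "UK");
        System.out.println(engine.onChange(DataKind.PAGE, session, config)); // nothing to evaluate
        System.out.println(engine.onChange(DataKind.GEO, session, config));
    }
}
```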
69. SYNC - detects which conditions should be evaluated and triggers GRF
ASYNC - loads rules into the shark rules engine
[Diagram: web visitor, agent and mobile visitor; facade with three adapters; sharklet A with sync handlers and async handlers; Account Runtime Data; Message BUS; Account Config; Rules Engine]
71. SYNC - CEP, Rules, Report-Sharklet
ASYNC - integrated with account config
[Diagram: web visitor, agent and mobile visitor; facade with three adapters; sharklet A and sharklet B, each shown twice; Rules Engine; Account Config; Account Runtime Data; Message BUS; Account Config Service]
73. The dream comes true
[Diagram]
● agent - chat lines
● visitor - chat lines, session state, activity events
● admin - business rules, see reports
● components: chat, offline, reporting, config, monitor and engage
* Business App / Extension
82. Testing methodology
● Unit test - use it
● Integration test - invest here
● System test - try to minimize effort
● Performance
○ Integration - worth it
○ System - choose your tests
85. Testing methodology
How did we test the platform?
We had
● built main code with tests in mind
● mocked our clients
86. Java 8
● We moved to Java 8 one year ago
● It was easy :)
● Pushed us to
○ more expressive code
○ functional style
○ immutability
Search on YouTube: LivePerson Functional Java 8
87. Notes about G1
● Designed for big heaps; minimizes long pauses
● Is expected to be the default GC in Java 9
● We tested our system with G1 with 12 GB in use and received good results (no long GC pauses)
89. We are happy now
● Horizontal scalability
● Independent and safe business
logic development
● Fast development cycles (platform,
sharklets, data-model)
● Efficient resource utilization
● Fewer bugs (easier to fix)
● Better QoS
● Overall confidence
91. Future challenges and ideas
● Better High availability
● Deployment with no downtime
● Management tools
● 100K accounts
92. Tips
● Define scope and requirements
● Company commitment is a must
● Work with your clients
● Treat test code as if it runs in
production
● Automated perf tests - they help
● Sometimes DIY is the best solution
● Respect legacy - combine old ideas
with new technologies
● Understand the complexity and find
the simplest solution