This slide deck explains a bit how to deal best with state in scalable systems, i.e. pushing it to the system boundaries (client, data store) and trying to avoid state in-between.
Then it picks arbitrarily two scenarios - one in the frontend part and one in the backend part of a system and shows concrete techniques to deal with them.
In the frontend part is examined how to deal with session state of servlet containers in scalable scenarios and introduces the concept of a shared session cache layer. Also an example implementation using Redis is shown.
In the backend part it is examined how to deal with potential data inconsistencies that can occur if maximum availability of the data store is required and eventual consistency is used. The normal way is to resolve inconsistencies manually implementing business specific logic or - even worse - asking the user to resolve it. A pure technical solution called CRDTs (Conflict-free Replicated Data Types) is then shown. CRDTs, based on sound mathematical concepts, are self-stabilizing data structures that offer a generic way to resolve inconsistencies in an eventual consistent data store. Besides some theory also some examples are shown to provide a feeling how CRDTs feel in practice.
7. State …
• … avoids load distribution
• … requires replication
• … makes systems brittle
• … makes scaling down hard
8. General design rule
• Move state either to the frontend and/or to the backend
• Make everything else stateless
• Use caches if that’s too slow (but be aware of their tradeoffs)
Frontend
Client
Backend
Store
Application
Stateful
Stateful
Stateless
9. Frontend client state examples
• Cookies (fondly handcrafted)
• Play framework
• JavaScript client
• HTML5 storage
10. And what if we are bound to
a web framework that
heavily uses session state?
19. What we really want
• Scaling (up) behavior of sticky sessions
• Node independence of session replication
à Isolated nodes with shared session cache
21. Isolated nodes with shared session cache
• Easy load distribution
• Easy up- and down-scaling
• Tolerant against node failures
• Independent scaling of application and cache nodes
23. Scalability vs. Latency
You need to give up some latency for an individual user
to provide acceptable latency for many users
Latencydistribution
Not scalable
Acceptable latency
Single user
Many users
Scalable
Acceptable latency
Latencydistribution
Single user
Many users
27. Example library
• Redis Session Manager for Apache Tomcat
https://github.com/jcoleman/tomcat-redis-session-manager
• A bit outdated, but easy to use
• Usage
• Install and start Redis
• Copy tomcat-redis-session-manager.jar and
jedis-<version>.jar to Tomcat lib directory
• Update Tomcat configuration
• Restart Tomcat
29. Tradeoffs
• Easy to set up and use
• No code required, only configuration
• Works smoothly
• Not best for web frameworks
that store lots of session state
• Does not work well with AJAX
• Does not work with single page applications
à Another reason to prefer ROCA style?
30. General design rule
• Move state either to the frontend and/or to the backend
• Make everything else stateless
• Use caches if that’s too slow (but be aware of their tradeoffs)
Frontend
Client
Backend
Store
Application
Stateful
Stateful
Stateless
31. Challenges of scalable datastores
• Requires relaxed consistency guarantees
• Leads to (temporal) anomalies and inconsistencies
short-term due to replication or long-term due to partitioning
It might happen. Thus, it will happen!
33. Strict Consistency (CA)
• Great programming model
no anomalies or inconsistencies need to be considered
• Does not scale well
best for single node databases
„We know ACID – It works!“
„We know 2PC – It sucks!“
Use for moderate data amounts
34. And what if I need more data?
• Distributed datastore
• Partition tolerance is a must
• Need to give up strict consistency (CP or AP)
35. Strong Consistency (CP)
• Majority based consistency model
can tolerate up to N nodes failing out of 2N+1 nodes
• Good programming model
Single-copy consistency
• Trades consistency for availability
in case of partitioning
Paxos (for sequential consistency)
Quorum-based reads & writes
36. And what if I need more availability?
• Need to give up strong consistency (CP)
• Relax required consistency properties even more
• Leads to eventual consistency (AP)
37. Eventual Consistency (AP)
• Gives up some consistency guarantees
no sequential consistency, anomalies become visible
• Maximum availability possible
can tolerate up to N-1 nodes failing out of N nodes
• Challenging programming model
anomalies usually need to be resolved explicitly
Gossip / Hinted Handoffs
CRDT
38. Conflict-free Replicated Data Types
• Eventually consistent, self-stabilizing data structures
• Designed for maximum availability
• Tolerates up to N-1 out of N nodes failing
State-based CRDT: Convergent Replicated Data Type (CvRDT)
Operation-based CRDT: Commutative Replicated Data Type (CmRDT)
40. Convergent Replicated Data Type
State-based CRDT – CvRDT
• All replicas (usually) connected
• Exchange state between replicas, calculate new state on target replica
• State transfer at least once over eventually-reliable channels
• Set of possible states form a Semilattice
• Partially ordered set of elements where all subsets have a Least Upper Bound (LUB)
• All state changes advance upwards with respect to the partial order
41. Commutative Replicated Data Type
Operation-based CRDT - CmRDT
• All replicas (usually) connected
• Exchange update operations between replicas, apply on target replica
• Reliable broadcast with ordering guarantee for non-concurrent
updates
• Concurrent updates must be commutative
45. State-based G-Counter (grow only)
(Naïve approach)
Data
Integer i
Init
i ≔ 0
Query
return i
Update
increment(): i ≔ i + 1
Merge( j)
i ≔ max(i, j)
46. State-based G-Counter (grow only)
(Naïve approach)
R1
R3
R2
i = 1
U
i = 1
U
i = 0
i = 0
i = 0
I
I
I
M
i = 1
i = 1
M
47. State-based G-Counter (grow only)
(Vector-based approach)
Data
Integer V[] / one element per replica set
Init
V ≔ [0, 0, ... , 0]
Query
return ∑i V[i]
Update
increment(): V[i] ≔ V[i] + 1 / i is replica set number
Merge(V‘)
∀i ∈ [0, n-1] : V[i] ≔ max(V[i], V‘[i])
48. State-based G-Counter (grow only)
(Vector-based approach)
R1
R3
R2
V = [1, 0, 0]
U
V = [0, 0, 0]
I
I
I
V = [0, 0, 0]
V = [0, 0, 0]
U
V = [0, 0, 1]
M
V = [1, 0, 0]
M
V = [1, 0, 1]
49. State-based PN-Counter (pos./neg.)
• Simple vector approach as with G-Counter does not work
• Violates monotonicity requirement of semilattice
• Need to use two vectors
• Vector P to track incements
• Vector N to track decrements
• Query result is ∑i P[i] – N[i]
50. State-based PN-Counter (pos./neg.)
Data
Integer P[], N[] / one element per replica set
Init
P ≔ [0, 0, ... , 0], N ≔ [0, 0, ... , 0]
Query
Return ∑i P[i] – N[i]
Update
increment(): P[i] ≔ P[i] + 1 / i is replica set number
decrement(): N[i] ≔ N[i] + 1 / i is replica set number
Merge(P‘, N‘)
∀i ∈ [0, n-1] : P[i] ≔ max(P[i], P‘[i])
∀i ∈ [0, n-1] : N[i] ≔ max(N[i], N‘[i])
51. Non-negative Counter
Problem: How to check a global invariant with local information only?
• Approach 1: Only dec if local state is > 0
• Concurrent decs could still lead to negative value
• Approach 2: Externalize negative values as 0
• inc(negative value) == noop(), violates counter semantics
• Approach 3: Local invariant – only allow dec if P[i] - N[i] > 0
• Works, but may be too strong limitation
• Approach 4: Synchronize
• Works, but violates assumptions and prerequisites of CRDTs
52. More datatypes
• Register
• Set
• Dictionary (Map)
• Tree
• Graph
• Array
• List
plus several representations for each datatype
53. Limitations of CRDTs
• Very weak consistency guarantees
Strives for „quiescent consistency“
• Eventually consistent
Not suitable for high-volume ID generator or alike
• Not easy to understand and model
• Not all data structures representable
Use if availability is extremely important
54. Further reading on CRDTs
1. Shapiro et al., Conflict-free Replicated Data
Types, Inria Research report, 2011
2. Shapiro et al., A comprehensive study of
Convergent and Commutative Replicated
Data Types,
Inria Research report, 2011
3. Basho Tag Archives: CRDT,
https://basho.com/tag/crdt/
4. Leslie Lamport,
Time, clocks, and the ordering of events in a
distributed system,
Communications of the ACM (1978)
55. Wrap-up
• Scalability means scaling up and down
• State is the enemy of scalability
• Move state to the frontend and backend
• Shared session state layer
• CRDTs
The best way to deal with state is to avoid it