In recent years, with the boom in startups and in technologies such as machine learning, the volume of data that systems need to collect and process has kept growing.
As a result, storing and processing data on a single database node is no longer enough for large systems; multiple nodes must be connected to form a database cluster.
Database clusters in particular, and distributed systems in general, offer many interesting topics to dig into. In this talk, we limit ourselves to surveying how three systems, Redis, Elasticsearch and Cassandra, organize their clusters, and how each of them trades off consistency against availability.
- Speaker: Lộc Võ - Lead Software Engineer @ Grab
3. About Me
● Joined Grab 2 years ago, currently working as the lead engineer of the Segmentation Platform project
● Leads the Research Database group at Grokking Lab
4. About Segmentation Platform (SegP)
● Technology
○ Programming languages: Go, Java, Scala
○ Batch processing (Spark, Scala)
○ Caching (Redis)
○ Message queue (SQS, Kafka)
○ Relational database (MySQL)
○ Non-relational database (Cassandra, DynamoDB, Elasticsearch)
● Team's scope
○ Feature development. Coordinate with business owners to develop a platform for segmentation. Similar to segments.io but for internal users.
○ Batch data processing.
○ Real-time traffic. Build and maintain gRPC APIs to serve online traffic.
5. What we'll discuss in this talk
● CAP theorem
● The cluster architecture of Redis, Elasticsearch and Cassandra
● How the C-A tradeoff is reflected in their designs
7. Consistency
The system is considered consistent if v1 is returned to client 2 whenever the read request (2) happens after the write request (1).
[Diagram: the DB system currently holds v = v0. Client 1 sends (1) Update v=v1; Client 2 sends (2) Get v.]
8. Availability
When a request is sent, the system handles it by running an algorithm, made up of several steps, that was designed for that request.
If the system cannot go through the algorithm designed for that request, it is considered "not available" to that client.
[Diagram, available case: (1) the client sends a request; (2) the system goes through the algorithm defined for this request; (3) the system returns 2xx or 4xx.]
[Diagram, unavailable case: the system cannot go through the algorithm defined for this request and returns 500.]
9. Network partition
A network partition happens when some of the nodes cannot communicate properly with each other and each side believes that the others are offline.
For example, Node 1 cannot communicate with Node 2, so Node 1 thinks that Node 2 is offline. But Node 2 is still alive and still serving requests.
[Diagram: Node 1 and Node 2, with Client 1 and Client 2.]
10. CAP Theorem
A distributed database has three very desirable properties:
1. Tolerance towards Network Partition
2. Consistency
3. Availability
The CAP theorem states: You can have at most two of these properties for any shared-data system
12. What is Redis
- Stands for Remote Dictionary Server
- Is a fast, open-source, in-memory key-value data store for use as a database, cache, message broker, and queue.
- Delivers sub-millisecond response times, enabling millions of requests per second for real-time applications in Gaming, Ad-Tech, Financial Services, Healthcare, and IoT.
- Popular choice for caching, session management, gaming, leaderboards, real-time analytics, geospatial, ride-hailing, chat/messaging, media streaming, and pub/sub apps.
13. Redis cluster - Multi-master
Each key is hashed into one of 16384 hash slots (tokens). Depending on the hash value, the value is read from, or written to, the node assigned that token.
[Diagram: the client holds the key -> value pairs 5 -> "ho chi minh" and 6 -> "ha noi", which hash as 5 -> 18 and 6 -> 8003. Redis node 1 owns tokens 1 -> 8000 and stores 5 -> "ho chi minh"; Redis node 2 owns tokens 8001 -> 16384 and stores 6 -> "ha noi".]
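As a rough illustration of the key-to-node routing on this slide, the sketch below computes a slot for a key and looks up the node that owns it. Redis Cluster documents the slot as CRC16(key) mod 16384, with slots numbered 0 to 16383; the sketch leans on Python's `binascii.crc_hqx`, which uses the same CRC-CCITT (XMODEM) polynomial, and it ignores hash tags. The two-node slot layout and the names `SLOT_RANGES`, `hash_slot` and `node_for` are assumptions made up for this example, not something taken from the talk.

```python
import binascii

NUM_SLOTS = 16384

# Hypothetical two-node layout, loosely mirroring the split on the slide.
SLOT_RANGES = {
    "redis-node-1": range(0, 8192),
    "redis-node-2": range(8192, NUM_SLOTS),
}


def hash_slot(key: str) -> int:
    """CRC16 (XMODEM variant) of the key, modulo the number of slots."""
    return binascii.crc_hqx(key.encode("utf-8"), 0) % NUM_SLOTS


def node_for(key: str) -> str:
    """Return the name of the node whose slot range contains the key's slot."""
    slot = hash_slot(key)
    for node, slots in SLOT_RANGES.items():
        if slot in slots:
            return node
    raise LookupError(f"no node owns slot {slot}")


if __name__ == "__main__":
    for key in ("5", "6"):
        print(key, hash_slot(key), node_for(key))
```

A client that knows the slot-to-node map can send each command straight to the owning node, which is what makes every master writable for its own share of the keys.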
14. Redis cluster - Master/Replica
Redis uses asynchronous replication, with asynchronous replica-to-master acknowledgments of the amount of data processed.
A master can have multiple replicas.
Clients write to the master, but can read from a replica.
[Diagram: Client 1 sends a write command to the Redis master, which sends async updates to two Redis replicas; Client 2 sends a read command to a replica.]
Ref: https://redis.io/topics/replication
15. C-A tradeoff
Redis uses asynchronous replication by default, which means that, by default, it is AP.
If a network partition happens between the master and a replica, we will see inconsistent data.
[Diagram: Client 1 sends a write command to the Redis master, which sends async updates to two Redis replicas; Client 2's read command to a replica returns stale data.]
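The stale read described on this slide can be reproduced with a toy model. The sketch below is not Redis code: `Master`, `Replica`, the `partitioned` flag and `demo` are hypothetical names, and replication is triggered by hand instead of running in the background. It only shows that a write acknowledged by the master can be invisible on a replica while the two are partitioned.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)


class Master:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.backlog = deque()    # writes not yet shipped to the replicas
        self.partitioned = False  # True while the replicas are unreachable

    def set(self, key, value):
        # The client is acknowledged as soon as the master applies the write.
        self.data[key] = value
        self.backlog.append((key, value))
        return "OK"

    def replicate(self):
        # Ship pending writes; during a partition nothing gets through.
        if self.partitioned:
            return
        while self.backlog:
            key, value = self.backlog.popleft()
            for replica in self.replicas:
                replica.data[key] = value


def demo():
    replica = Replica()
    master = Master([replica])
    master.set("v", "v0")
    master.replicate()
    master.partitioned = True          # network partition starts
    master.set("v", "v1")              # still acknowledged: available
    master.replicate()
    during = replica.get("v")          # stale value
    master.partitioned = False         # partition heals
    master.replicate()
    after = replica.get("v")
    return during, after


if __name__ == "__main__":
    print(demo())
```

The master keeps accepting writes throughout, which is the availability side of the tradeoff; the replica catching up only after the partition heals is the consistency cost.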
17. What is Elasticsearch
● Elasticsearch is a distributed, open source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.
● Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic).
18. Roles in Elasticsearch Cluster
● Master node: Manages the overall operation of a cluster and keeps track of the cluster state.
● Data node: Stores and searches data. Performs all data-related operations (indexing, searching, aggregating) on local shards.
● Coordinator node: Delegates client requests to the shards on the data nodes, collects and aggregates the results into one final result, and sends this result back to the client.
[Diagram: Client 1, Client 2, a coordinator node, a master node and two data nodes.]
19. Primary and Replica shards
Steps for primary shards:
● Validate the incoming operation and reject it if structurally invalid.
● Execute the operation locally.
● Forward the operation to each replica in the current in-sync copies set.
● Once all replicas have successfully performed the operation and responded to the primary, the primary acknowledges the successful completion of the request to the client.
[Diagram: the document "user1 auth a1 login from homepage" goes through the coordination node to the destination node. Shards shown: p1, r11, r12, p2, r21, r22. Primary shards forward the operation to their replicas.]
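The four primary-shard steps listed on this slide can be sketched as a small in-memory model. This is an illustration only, not Elasticsearch code: the `Primary` and `Replica` classes, the `reachable` flag and the shape of the returned acknowledgment are assumptions invented for the example, and an unreachable replica simply raises instead of triggering any recovery logic.

```python
class Replica:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.docs = {}

    def apply(self, doc_id, doc):
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.docs[doc_id] = doc


class Primary:
    def __init__(self, in_sync_replicas):
        self.docs = {}
        self.in_sync = list(in_sync_replicas)

    def index(self, doc_id, doc):
        # Step 1: validate the operation and reject it if structurally invalid.
        if not isinstance(doc_id, str) or not isinstance(doc, dict):
            raise ValueError("structurally invalid operation")
        # Step 2: execute the operation locally.
        self.docs[doc_id] = doc
        # Step 3: forward the operation to every replica in the in-sync set.
        for replica in self.in_sync:
            replica.apply(doc_id, doc)
        # Step 4: acknowledge only after all replicas have responded.
        return {"result": "created", "copies": 1 + len(self.in_sync)}


if __name__ == "__main__":
    r11, r12 = Replica("r11"), Replica("r12")
    p1 = Primary([r11, r12])
    print(p1.index("1", {"message": "user1 auth a1 login from homepage"}))
```

In this sketch the client never receives an acknowledgment when a replica cannot be reached, which is the behaviour the talk goes on to describe as leaning toward consistency.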
20. C-A tradeoff
If a network partition happens, the primary shard cannot write to its replica shard, which makes the primary shard unavailable.
By default, Elasticsearch is more CP.
[Diagram: destination node with shards p1, r11, r12, p2, r21, r22. Primary shards forward the operation to their replicas.]
22. What is Cassandra
Apache Cassandra is an open source, distributed NoSQL database that began internally at Facebook and was released as an open-source project in July 2008.
Cassandra delivers continuous availability (zero downtime), high performance, and linear scalability that modern applications require, while also offering operational simplicity and effortless replication across data centers and geographies.
23. Cassandra Ring Cluster
- Each node is assigned a range of tokens.
- A client can connect to any node to write; that node becomes the coordinator node.
- Partition keys are hashed into a token. The coordinator uses the token to determine which node stores the data.
[Diagram: a ring of eight nodes owning the token ranges 1-20, 21-40, 41-60, 61-80, 81-100, 101-120, 121-140 and 141-160. The record "user1 auth a1 login from homepage" hashes to token 44, and the coordinator node forwards it to the destination node.]
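The token lookup on this slide can be sketched with the eight ranges from the diagram. The code below is a toy: `RANGE_ENDS`, `owner_range` and `toy_token` are hypothetical names, and `toy_token` is a stand-in hash chosen only to land in 1-160, not Cassandra's partitioner.

```python
import bisect
import zlib

# Upper bound of each node's token range, as drawn on the slide:
# 1-20, 21-40, 41-60, 61-80, 81-100, 101-120, 121-140, 141-160.
RANGE_ENDS = [20, 40, 60, 80, 100, 120, 140, 160]
RANGE_WIDTH = 20


def toy_token(partition_key: str) -> int:
    """Stand-in hash mapping a partition key to a token in 1..160."""
    return zlib.crc32(partition_key.encode("utf-8")) % RANGE_ENDS[-1] + 1


def owner_range(token: int) -> tuple:
    """Return the (low, high) token range of the node that owns the token."""
    if not 1 <= token <= RANGE_ENDS[-1]:
        raise ValueError(f"token {token} is outside the ring")
    index = bisect.bisect_left(RANGE_ENDS, token)
    high = RANGE_ENDS[index]
    return (high - RANGE_WIDTH + 1, high)


if __name__ == "__main__":
    print(owner_range(44))
    print(owner_range(toy_token("user1")))
```

Because every node holds the same range table, any node can act as coordinator and work out the destination without asking a central directory.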
24. Replication Factor
- Replication Factor (RF) = the number of copies we want to store.
- The replication nodes are determined by the Replication Strategy.
- Simple strategy = the next two nodes on the ring become the replication nodes.
[Diagram: the same ring of eight token ranges, 1-20 through 141-160. The record "user1 auth a1 login from homepage" hashes to token 44; the coordinator node sends it to the owning node and to the replication nodes.]
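A minimal sketch of the simple strategy described on this slide, assuming the eight token ranges from the diagram: the node that owns the token stores the first copy, and the following nodes clockwise on the ring store the rest. `RANGES` and `replica_nodes` are hypothetical names used only for this illustration.

```python
# Token ranges of the eight nodes on the slide, in ring order.
RANGES = [(1, 20), (21, 40), (41, 60), (61, 80),
          (81, 100), (101, 120), (121, 140), (141, 160)]


def replica_nodes(token: int, rf: int) -> list:
    """Return the ranges of the rf nodes that store a copy of the token."""
    if not 1 <= rf <= len(RANGES):
        raise ValueError("rf must be between 1 and the number of nodes")
    start = next(
        (i for i, (low, high) in enumerate(RANGES) if low <= token <= high),
        None,
    )
    if start is None:
        raise ValueError(f"token {token} is outside the ring")
    # Walk clockwise from the owner, wrapping around the end of the ring.
    return [RANGES[(start + step) % len(RANGES)] for step in range(rf)]


if __name__ == "__main__":
    print(replica_nodes(44, 3))
    print(replica_nodes(150, 3))
```

With RF = 3, token 44 lands on the node owning 41-60 plus the next two nodes, matching "next two nodes" on the slide; a token near the end of the ring wraps around to the first nodes.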
25. Data Consistency
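- Client 1 connects to C1 to write. C1 writes the data to three nodes (A, B and C), but the write fails at node B.
- Client 2 connects to C2 to read the same data.
What would happen?
[Diagram: Client 1 writes data with token 44 through coordinator C1; Client 2 reads data with token 44 through coordinator C2. Of the replicas A, B and C, two hold v2 and one still holds v1.]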
26. Consistency Level (Write)
● One
○ Read: Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.
○ Write: A write must be written to the commit log and memtable of at least one replica node.
● Quorum
○ Read: Returns the record after a quorum of replicas has responded.
○ Write: A write must be written to the commit log and memtable on a quorum of replica nodes.
● All
○ Read: Returns the record after all replicas have responded. The read operation will fail if a replica does not respond.
○ Write: A write must be written to the commit log and memtable on all replica nodes in the cluster for that partition.
27. Write with CL=ALL
[Diagram: Client 1 writes data with token 44 through coordinator C1 to replicas A, B and C; two replicas hold v2 and one still holds v1.]
- All replicas succeeded -> success
- 1 replica failed -> failed
Result: Failed
28. Write with CL=QUORUM
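[Diagram: Client 1 writes data with token 44 through coordinator C1 to replicas A, B and C; two replicas hold v2 and one still holds v1.]
Quorum = floor(RF / 2) + 1 = 2 (for RF = 3)
- Two replicas succeeded -> success
- Fewer than two succeeded -> failed
Result: Success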
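The write outcomes on the CL=ALL and CL=QUORUM slides can be checked with a few lines of arithmetic. The sketch below assumes the usual quorum definition of a strict majority, floor(RF / 2) + 1; the function names `quorum`, `required_acks` and `write_succeeds` are made up for this example.

```python
def quorum(rf: int) -> int:
    """Smallest number of replicas that forms a strict majority of rf."""
    return rf // 2 + 1


def required_acks(level: str, rf: int) -> int:
    """Number of replica acknowledgments a consistency level waits for."""
    levels = {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}
    return levels[level.upper()]


def write_succeeds(level: str, rf: int, acks: int) -> bool:
    """True if enough replicas acknowledged the write for this level."""
    return acks >= required_acks(level, rf)


if __name__ == "__main__":
    # The scenario from the slides: RF = 3, node B fails, two replicas ack.
    for level in ("ALL", "QUORUM", "ONE"):
        print(level, write_succeeds(level, rf=3, acks=2))
```

With RF = 3 and node B failing, two acknowledgments arrive: that is short of ALL but enough for QUORUM and ONE, matching the "Failed" and "Success" results on the two slides.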
29. Consistency Level (Read)
● One
○ Read: Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.
○ Write: A write must be written to the commit log and memtable of at least one replica node.
● Quorum
○ Read: Returns the record after a quorum of replicas has responded.
○ Write: A write must be written to the commit log and memtable on a quorum of replica nodes.
● All
○ Read: Returns the record after all replicas have responded. The read operation will fail if a replica does not respond.
○ Write: A write must be written to the commit log and memtable on all replica nodes in the cluster for that partition.
30. Write=QUORUM, Read=One
[Diagram: Client 1 writes data with token 44 through coordinator C1; Client 2 reads data with token 44 through coordinator C2. Of the replicas A, B and C, two hold v2 and one still holds v1.]
Potentially inconsistent read. If client 2 reads from node B, client 2 will receive stale data.
W (Quorum) + R (1) -> eventually consistent
32. Write=All, Read=One
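[Diagram: Client 1 writes data with token 44 through coordinator C1 to replicas A, B and C; Client 2 reads data with token 44 through coordinator C2.]
A write with CL=ALL succeeds only after every replica has acknowledged it, so a read with CL=ONE returns the latest value whichever replica client 2 reads from.
W (All) + R (1) -> consistent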
33. Summary of Read and Write CL
WRITE     READ      Consistent      Read Availability   Write Availability
All       All       Consistent      Low                 Low
Quorum    All       Consistent      Low                 Medium
One       All       Consistent      Low                 High
All       Quorum    Consistent      Medium              Low
Quorum    Quorum    Consistent      Medium              Medium
One       Quorum    Inconsistent    Medium              High
All       One       Consistent      High                Low
Quorum    One       Inconsistent    High                Medium
One       One       Inconsistent    High                High
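The "Consistent" column of this table follows a simple overlap rule: a read is guaranteed to see the latest successful write when the replicas written plus the replicas read exceed the replication factor (W + R > RF), because the two sets must then share at least one node. The sketch below derives the column for RF = 3, which is an assumption carried over from the three-replica examples in the talk; `replicas_needed`, `is_consistent` and `summary` are hypothetical names.

```python
from itertools import product

LEVELS = ("All", "Quorum", "One")


def replicas_needed(level: str, rf: int) -> int:
    """Replicas that must respond for the given consistency level."""
    return {"All": rf, "Quorum": rf // 2 + 1, "One": 1}[level]


def is_consistent(write: str, read: str, rf: int) -> bool:
    """True when the write set and the read set must overlap (W + R > RF)."""
    return replicas_needed(write, rf) + replicas_needed(read, rf) > rf


def summary(rf: int = 3) -> list:
    """Rows of (write level, read level, verdict) in the table's order."""
    rows = []
    for read, write in product(LEVELS, LEVELS):
        verdict = "Consistent" if is_consistent(write, read, rf) else "Inconsistent"
        rows.append((write, read, verdict))
    return rows


if __name__ == "__main__":
    for write, read, verdict in summary():
        print(f"{write:<8}{read:<8}{verdict}")
```

The availability columns move the other way: the more replicas a level waits for, the more likely a single failed node blocks the operation.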