Cassandra勉強会 (Cassandra study group), Shun Nakamura
2. Cassandra
SPOF
Read Write
Cassandra: A Decentralized Structured Storage System, LADIS '09
Avinash Lakshman, Prashant Malik (Facebook)
3. : Cassandra Architecture
1. Consistent Hashing
2. Anti-Entropy
3. Quorum Protocol
4. Write (HintedHandoff), Compaction, Read (Read Repair, Bloom Filters), Delete (Tombstones)
5. Gossip Protocol
6. SEDA
7.
4. : Client Side
Cassandra API
Client Tools
: loadbalance, compact, flush, …
http://lunarium.info/arc/index.php/Cassandra
GUI Google Code
import/export JSON
RPC
Thrift
Avro (Cassandra 0.7)
5. Cassandra: A Decentralized Structured Storage System, LADIS '09
(Cassandra 0.5)
Apache Cassandra Glossary
Cassandra
http://io.typepad.com/glossary.html
Japanese translation: http://mocchira.posterous.com/apache-cassandra-glossarys-japanese-translati
Cassandra
http://www.publickey1.jp/blog/10/cassandra.html
Slideshare
6. 1.
1. Random Partitioning
×
2. Order Preserving Partitioning
3. Collating Order-Preserving Partitioning
×
7. 1. : Random Partitioning
Consistent Hashing
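Token: MD5 hash of the key
Tokens lie on a hash ring from 0 to 2^127
Each node is assigned a Token on the ring
A node holds the Data whose Token falls between the previous node's Token and its own Token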
Zero-hop DHT
[Figure: hash ring with node A. Data: 'key', md5('key') => Token, Replication: 2]
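The ring lookup on this slide can be sketched as follows. This is a minimal illustration and not Cassandra's own partitioner: the class `TokenRingSketch` and its methods are hypothetical names, node Tokens are supplied by the caller, and replicas are simply the next nodes clockwise on the ring.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of a Token ring; names are illustrative only.
class TokenRingSketch {
    // Sorted map from node Token to node name.
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    // Token of a key: absolute value of the MD5 digest read as a signed
    // 128-bit integer, so the result lies in 0..2^127.
    static BigInteger token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    void addNode(String name, BigInteger nodeToken) {
        ring.put(nodeToken, name);
    }

    // Start at the first node whose Token is >= the key's Token (wrapping
    // to the start of the ring) and take the next nodes clockwise.
    List<String> replicasFor(String key, int replication) {
        List<String> result = new ArrayList<>();
        if (ring.isEmpty()) {
            return result;
        }
        BigInteger cur = ring.ceilingKey(token(key));
        if (cur == null) {
            cur = ring.firstKey();
        }
        int wanted = Math.min(replication, ring.size());
        while (result.size() < wanted) {
            result.add(ring.get(cur));
            cur = ring.higherKey(cur);
            if (cur == null) {
                cur = ring.firstKey();
            }
        }
        return result;
    }
}
```

With four nodes and `replicasFor("key", 2)`, the result is the node that owns the key's Token followed by its successor, which corresponds to Replication: 2 in the figure.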
10. 1. : Order Partitioning
Order Preserving Partitioning(OPP)
Hash
Token UTF8
Range Slice
Collating Order-Preserving Partitioning (COPP)
OPP
English(US)
0.5
11. 2.
Coordinator
Coordinator N-1 Successor
3
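Rack Unaware
The N-1 replicas go to the coordinator's successors on the ring
Rack Aware
1 replica goes to another DC; the other N-2 go to a different Rack in the same DC
Datacenter Aware
Replica counts per DC are set in conf/datacenters.properties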
12. 2. : Anti-Entropy
Anti-Entropy( )
CF Merkle Tree
Leaf Row (Hash )
Hash
I/O
Merkle Tree
check
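The Merkle Tree comparison can be sketched as below. This is a simplified illustration under stated assumptions: each leaf is the MD5 hash of one row's serialized form (given here as a plain string), an odd node is paired with itself, and the class name `MerkleSketch` is hypothetical. The point is that two replicas only need to exchange a root hash to learn whether their rows differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a Merkle Tree over rows; not Cassandra's implementation.
class MerkleSketch {
    // MD5 over the concatenation of the given byte arrays.
    static byte[] md5(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (byte[] p : parts) {
                md.update(p);
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Leaves are row hashes; each parent is the hash of its two children.
    static byte[] root(List<String> rows) {
        List<byte[]> level = new ArrayList<>();
        for (String row : rows) {
            level.add(md5(row.getBytes(StandardCharsets.UTF_8)));
        }
        if (level.isEmpty()) {
            return md5();
        }
        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                byte[] left = level.get(i);
                // An odd node at the end of a level is paired with itself.
                byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left;
                next.add(md5(left, right));
            }
            level = next;
        }
        return level.get(0);
    }

    // Replicas are considered in sync when their root hashes match.
    static boolean inSync(List<String> replicaA, List<String> replicaB) {
        return Arrays.equals(root(replicaA), root(replicaB));
    }
}
```

When the roots differ, a fuller implementation would descend into the subtrees whose hashes disagree, so that only the mismatching ranges are exchanged.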
13. 2. : ZooKeeper
Apache ZooKeeper (Facebook?)
Cassandra
Facebook ( )
N-1
local disk ZooKeeper cache
ZooKeeper fault tolerance
ZooKeeper Cassandra Transaction
Cassandra
contrib/mutex/README
14. 3. :Consistency Level(0.6 )
Write: ANY, ONE (1 replica), QUORUM (N/2+1 replicas), DCQUORUM (QUORUM within a DC), ALL
Read: ONE (Return from 1 replica, then ReadRepair), QUORUM, DCQUORUM, ALL
15. 3. : Quorum Protocol
System Eventual Consistency
W + R > N
N: number of replicas
W: number of replicas a Write waits for
R: number of replicas a Read waits for
Quorum
W=R=Quorum(=N/2+1)
W=ONE(=1), R=ALL(=N)
W=ALL, R=ONE
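The three combinations above can be checked against the condition W + R > N with a few lines of arithmetic. The class and method names here are hypothetical and exist only for this illustration.

```java
// Hypothetical helper for the W + R > N condition; illustrative only.
class QuorumSketch {
    // Quorum size for N replicas: N/2 + 1 (integer division).
    static int quorum(int n) {
        return n / 2 + 1;
    }

    // True when every read set must overlap every write set.
    static boolean overlaps(int w, int r, int n) {
        return w + r > n;
    }
}
```

For N = 3, `quorum(3)` is 2, so W = R = Quorum gives 2 + 2 > 3. W = ONE with R = ALL gives 1 + 3 > 3, while W = R = ONE gives 1 + 1 > 3, which is false and therefore only eventually consistent.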
16. 4. :
Data
Proxy
Client
1. Proxy
2. Data
3. Client
Proxy
1. Key Date
Network Proximity
2. Data Message
3. Consistency Level Client
Data
1. Message service.StorageService
2. Proxy
17. 4. :
RowMutationVerbHandler: Write
ReadVerbHandler: Read
RangeSlice, Read Repair, Bootstrap, Gossip
org.apache.cassandra.service.StorageService
18. 4. : Write
,
RowMutationVerbHandler
( I/O )
“Always Writable” Disk I/O Lock free
Data Node commit log
<RowKey, CF> Map (ConcurrentSkipListMap)
async flush
MemTable
sync
[Figure: write path from the Proxy]
• Commit Log (Disk): Serialized RowMutation
• MemTable (Memory): keyed by RowKey
• SSTable (Disk): Read Only once written by a Flush
• Each SSTable Flush writes Indexes, Row Data, and a Bloom Filter
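The write path on this slide (commit log, then MemTable, then Flush to a read-only SSTable) can be sketched in memory as below. Everything here is a simplification and an assumption for illustration: the class `WritePathSketch` is hypothetical, the commit log and SSTables are plain collections rather than files, the flush threshold is a row count, and the flush runs synchronously although the slide describes it as async.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical in-memory model of the write path; not Cassandra code.
class WritePathSketch {
    // Stand-in for the on-disk commit log of serialized mutations.
    final List<String> commitLog = new ArrayList<>();
    // MemTable: a sorted concurrent map keyed by RowKey.
    ConcurrentSkipListMap<String, String> memtable = new ConcurrentSkipListMap<>();
    // Flushed SSTables; never modified after they are added.
    final List<TreeMap<String, String>> sstables = new ArrayList<>();
    final int flushThreshold;

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    synchronized void write(String rowKey, String value) {
        commitLog.add(rowKey + "=" + value); // 1. append to the commit log
        memtable.put(rowKey, value);         // 2. update the MemTable
        if (memtable.size() >= flushThreshold) {
            flush();                         // 3. flush when the MemTable is full
        }
    }

    synchronized void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(new TreeMap<>(memtable));    // sorted snapshot, treated as read only
        memtable = new ConcurrentSkipListMap<>(); // start a fresh MemTable
        commitLog.clear();                        // flushed data no longer needs the log
    }
}
```

No existing SSTable is touched by a write, which is the property behind the "Always Writable" note above.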
19. 4. : Hinted Handoff
Write
( Node, Proxy Node)
Gossip
Hint SystemTable CF
Consistency Level any( )
any Hint
Hinted Handoff Read Repair
[Figure: flowchart. Write Msg, then "With Hint?". No: Commit Log, then Mem/SSTable. Yes: Write Hint (Gossip).]
20. 4. : Compaction
Compaction: SSTable File File
Read ( )
( )
2
Minor Compaction
SSTable
P[bytes] × 4 → Q, Q × 4 → R, R × 4 → S (P = Memtable size)
Major Compaction
CF SSTable
tombstone
JVM GC
21. 4. : Read
ReadVerbHandler
Lock Mem&SSTable
SSTable Read Only Write Lock
0.6
Row Cache: 1 CF
Key Cache: SSTable
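[Figure: read path]
Proxy sends the read to the Closest DataNode, which returns Real Data
• Closest node: checks the Row Cache first
• Other DataNodes, as many as the Consistency Level requires, receive a Digest Query
• They Return a Digest (MD5); a mismatch triggers Read Repair
On a DataNode, the Row is a Merge of the MemTable (Mem) and SSTables (Disk, located through the Key Cache)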
22. 4. :Read Repair
Read Repair( )
Digest
(on ProxyNode)
1. Read Repair
2. ( )
3. ( )
Eventual Consistency
Closest Node Version
…
Read Repair
23. 4. : Bloom Filter
Bloom Filter
Tests whether an element W is in a set D and answers "present" or "absent"
A "present" answer may be a false positive
Cassandra …
SSTable Row Key
Key lookup disk check IO
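24. Bloom Filter
Decide whether W is in D: "present" or "absent"
Step0
Prepare k hash functions F1~Fk
Prepare m-bit arrays ArrayW, ArrayD (initialized to 0)
Set ArrayD[Fi(d) mod m] to 1 foreach (D as d, i = 1,…,k)
Step1
Set ArrayW[Fi(W) mod m] to 1 foreach (i = 1,…,k)
Step2
Compare ArrayW with ArrayD
If every bit set in ArrayW is also set in ArrayD: W is "present" in D
Otherwise: W is "absent" from D
O(k)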
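Steps 0 to 2 can be sketched as a small Bloom filter. This is an illustration under assumptions, not Cassandra's filter: the class `BloomFilterSketch` is hypothetical, and the k hash functions are derived by double hashing from `String.hashCode()`, a choice made only to keep the example short.

```java
import java.util.BitSet;

// Hypothetical Bloom filter sketch; hash functions are illustrative only.
class BloomFilterSketch {
    private final BitSet bits; // plays the role of ArrayD
    private final int m;       // number of bits
    private final int k;       // number of hash functions

    BloomFilterSketch(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // i-th hash Fi(item) mod m, built from two base hashes (double hashing).
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 * 0x9E3779B9) | 1; // second hash, forced odd
        return Math.floorMod(h1 + i * h2, m);
    }

    // Step 0: set the k bits for each member d of D.
    void add(String item) {
        for (int i = 0; i < k; i++) {
            bits.set(index(item, i));
        }
    }

    // Steps 1 and 2: W may be present only if all of its k bits are set.
    boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(item, i))) {
                return false; // definitely absent
            }
        }
        return true; // present, or a false positive
    }
}
```

A `false` result is always correct, so a key lookup can skip the disk read for that SSTable; a `true` result still needs the disk check because of possible false positives.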
25. 4. : Delete
1.
2.
tombstone & JVM GC
tombstone
Tombstone GC
(GC Time: 10 days)
2.
1.
GC Time
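The tombstone mechanism can be sketched as follows. The model is deliberately small and rests on assumptions: the class `TombstoneSketch` is hypothetical, timestamps are in seconds, a tombstone is a cell with a null value, and the grace period is taken to be 10 days, which is assumed here to be the GC Time on the slide.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of deletes with tombstones; not Cassandra code.
class TombstoneSketch {
    // Assumed grace period: 10 days, expressed in seconds.
    static final long GC_GRACE_SECONDS = 10L * 24 * 60 * 60;

    // A cell is a live value, or a tombstone when value is null.
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    final Map<String, Cell> cells = new HashMap<>();

    void put(String key, String value, long now) {
        cells.put(key, new Cell(value, now));
    }

    // A delete does not remove the data in place; it writes a tombstone.
    void delete(String key, long now) {
        cells.put(key, new Cell(null, now));
    }

    String get(String key) {
        Cell c = cells.get(key);
        return (c == null || c.isTombstone()) ? null : c.value;
    }

    // At compaction time, drop tombstones older than the grace period.
    void compact(long now) {
        cells.values().removeIf(c -> c.isTombstone() && now - c.timestamp > GC_GRACE_SECONDS);
    }
}
```

Keeping the tombstone for the grace period gives replicas that missed the delete time to learn about it before the marker disappears.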
26. 5. /
Gossip Protocol
(JOIN,DEAD,AVAIL)
1.
2.
3.
27. 5. /
Cassandra Gossip
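1. Every 1 second, send Gossip to one random live endpoint
2. Send Gossip to one random unreachable endpoint
Probability: unreachableN / (liveN + 1)
3. If step 1 did not Gossip to a Seed, or liveN < SeedN, send Gossip to a Seed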
Seed :
static .
Gossip
ApplicationState(JOIN,DEAD,AVAIL)
HeartBeatState
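One Gossip round, as listed on this slide, can be sketched as a function that picks the round's targets. The class `GossipRoundSketch` and its methods are hypothetical, endpoints are plain strings, and the sketch only chooses targets; it does not exchange ApplicationState or HeartBeatState.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of target selection in one Gossip round.
class GossipRoundSketch {
    // Probability of also gossiping to an unreachable endpoint.
    static double unreachableProbability(int unreachableN, int liveN) {
        return (double) unreachableN / (liveN + 1);
    }

    // Step 3 condition: no Seed contacted yet, or fewer live nodes than Seeds.
    static boolean shouldGossipToSeed(boolean gossipedToSeed, int liveN, int seedN) {
        return !gossipedToSeed || liveN < seedN;
    }

    static List<String> round(List<String> live, List<String> unreachable,
                              List<String> seeds, Random rnd) {
        List<String> targets = new ArrayList<>();
        boolean gossipedToSeed = false;
        // 1. One random live endpoint.
        if (!live.isEmpty()) {
            String target = live.get(rnd.nextInt(live.size()));
            targets.add(target);
            gossipedToSeed = seeds.contains(target);
        }
        // 2. One random unreachable endpoint, with the probability above.
        if (!unreachable.isEmpty()
                && rnd.nextDouble() < unreachableProbability(unreachable.size(), live.size())) {
            targets.add(unreachable.get(rnd.nextInt(unreachable.size())));
        }
        // 3. A Seed, when the condition holds.
        if (!seeds.isEmpty() && shouldGossipToSeed(gossipedToSeed, live.size(), seeds.size())) {
            targets.add(seeds.get(rnd.nextInt(seeds.size())));
        }
        return targets;
    }
}
```

Calling `round` once per second from a scheduler would approximate the loop described above.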
28. 6. : SEDA[1/2]
SEDA(Staged Event-Driven Architecture)
Message Passing
=>
×
×
29. 6. : SEDA[2/2]
Cassandra
Event Queue+Thread Pool
StageManager Thread Pool Executor
public final static String READ_STAGE = "ROW-READ-STAGE";
public final static String MUTATION_STAGE = "ROW-MUTATION-STAGE";
public final static String STREAM_STAGE = "STREAM-STAGE";
public final static String GOSSIP_STAGE = "GS";
public static final String RESPONSE_STAGE = "RESPONSE-STAGE";
public final static String AE_SERVICE_STAGE = "AE-SERVICE-STAGE";
private static final String LOADBALANCE_STAGE = "LOAD-BALANCER-STAGE";
Event Handler
VerbHandler
TCP
UDP
java.util.concurrent
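The "Event Queue + Thread Pool" idea can be sketched with `java.util.concurrent` as below. This is not Cassandra's StageManager: the class `StageSketch` and its methods are hypothetical, and each stage is simply a named `ThreadPoolExecutor` whose work queue acts as the event queue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of SEDA-style stages; illustrative only.
class StageSketch {
    // Stage name -> executor (thread pool plus its own event queue).
    private final Map<String, ExecutorService> stages = new ConcurrentHashMap<>();

    void register(String name, int threads) {
        stages.put(name, new ThreadPoolExecutor(
                threads, threads, 60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>()));
    }

    // Enqueue an event handler on the named stage.
    Future<?> submit(String stage, Runnable handler) {
        ExecutorService executor = stages.get(stage);
        if (executor == null) {
            throw new IllegalArgumentException("unknown stage: " + stage);
        }
        return executor.submit(handler);
    }

    void shutdown() {
        for (ExecutorService executor : stages.values()) {
            executor.shutdown();
        }
    }
}
```

A read handler would then be queued with `submit("ROW-READ-STAGE", handler)` and a write handler with `submit("ROW-MUTATION-STAGE", handler)`, so that a slow stage backs up only its own queue.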
30. 7.Cassandra
YCSB (Yahoo! Cloud Serving Benchmark)
Benchmarking Cloud Serving Systems with YCSB, SoCC '10
http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
Tier1.Performance: v.s.
Tier2.Scalability: v.s.
Operation ( , ,…)
( ,Zipf ,…)
Cassandra
HBase (modeled on Google Bigtable)
MySQL Sharding
PNUTS(Yahoo)