Cassandra Architecture (Cassandra勉強会, Cassandra Study Group)

Cassandra
  SPOF (single point of failure)
  Read / Write
  Source: "Cassandra - A Decentralized Structured Storage System", LADIS '09,
          Avinash Lakshman, Prashant Malik (Facebook)

Outline: Cassandra Architecture
  1. Consistent Hashing
  2. Anti-Entropy
  3. Quorum Protocol
  4. Write: Hinted Handoff, Compaction
     Read: Read Repair, Bloom Filters
     Delete: Tombstones
  5. Gossip Protocol
  6. SEDA
  7. Cassandra (YCSB)

Client Side
  Cassandra API
  Client Tools
    e.g. loadbalance, compact, flush, ...
      http://lunarium.info/arc/index.php/Cassandra
    GUI (Google Code)
      import/export JSON
  RPC
    Thrift
    Avro (Cassandra 0.7)

  "Cassandra - A Decentralized Structured Storage System", LADIS '09
    (cassandra 0.5)
  Apache Cassandra Glossary
    http://io.typepad.com/glossary.html
    Japanese translation: http://mocchira.posterous.com/apache-cassandra-glossarys-japanese-translati
  Cassandra
    http://www.publickey1.jp/blog/10/cassandra.html
  Slideshare

1. Partitioning
     1.  Random Partitioning
     2.  Order Preserving Partitioning
     3.  Collating Order-Preserving Partitioning

1. Random Partitioning
  Consistent Hashing
    Token: MD5 hash
    Hash ring over 0 to 2^127; each node holds a Token
    The Token of the Data decides its position on the ring
  Zero-hop DHT
  (Figure: hash ring with node A; Data: 'key', md5('key') => Token on the ring; Replication: 2)

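The ring lookup above can be sketched as follows. This is a minimal illustration, not Cassandra's code: TokenRingSketch, addNode and nodesFor are hypothetical names, each node is assumed to own exactly one Token, and the Token is taken as the absolute value of the signed MD5 digest so that it falls in 0 to 2^127.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; class and method names are hypothetical, not Cassandra's.
class TokenRingSketch {
    // Ring position (Token) -> node name, kept sorted by Token.
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    // Token of a key: absolute value of its signed 128-bit MD5 digest (0 to 2^127).
    static BigInteger token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(md5.digest(key.getBytes(StandardCharsets.UTF_8))).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Assumption: one Token per node, node names unique.
    void addNode(String name, BigInteger nodeToken) {
        ring.put(nodeToken, name);
    }

    // Walk clockwise from the key's Token and collect up to `replicas` nodes.
    List<String> nodesFor(String key, int replicas) {
        List<String> result = new ArrayList<>();
        if (ring.isEmpty()) return result;
        int wanted = Math.min(replicas, ring.size());
        // First node whose Token is >= the key's Token; null means wrap around.
        Map.Entry<BigInteger, String> e = ring.ceilingEntry(token(key));
        while (result.size() < wanted) {
            if (e == null) e = ring.firstEntry();
            result.add(e.getValue());
            e = ring.higherEntry(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        TokenRingSketch ring = new TokenRingSketch();
        ring.addNode("A", BigInteger.ZERO);
        ring.addNode("B", BigInteger.ONE.shiftLeft(126));
        System.out.println(token("key") + " -> " + ring.nodesFor("key", 2));
    }
}
```

Because every node can compute the same lookup locally, no routing hops are needed, which is the sense of "Zero-hop DHT" on the slide.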
Consistent Hashing
  Drawbacks of basic Consistent Hashing (two points)
  Two remedies
     1.  Assign each node multiple positions on the ring (like in Dynamo)
     2.  Analyze load on the ring and move node positions (like in Chord)
  Cassandra
     loadbalance
Ring, Loadbalance

1. Order Partitioning
  Order Preserving Partitioning (OPP)
    Hash
    Token: UTF8
    Range Slice
  Collating Order-Preserving Partitioning (COPP)
    OPP
    English (US)
    0.5

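2. Replication
  Coordinator
    The Coordinator replicates to its N-1 Successors
  3 placement strategies
    Rack Unaware
      The N-1 successors of the coordinator on the ring
    Rack Aware
      1 replica in another DC, N-2 replicas in the same DC on different Racks
    Datacenter Aware
      DC
      conf/datacenters.properties
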
2. Anti-Entropy
  Anti-Entropy
  Merkle Tree per CF
    Leaf: Row (Hash)
    Parent: Hash of the child Hashes
    I/O
    Check by comparing Merkle Trees

2. ZooKeeper
  Apache ZooKeeper (Facebook?)
    Cassandra
    Facebook
      N-1
      local disk, ZooKeeper, cache
      ZooKeeper fault-tolerance
  ZooKeeper, Cassandra, Transaction
    Cassandra
      contrib/mutex/README

3. Consistency Level (0.6)
  Write
    ANY: 1
    ONE: x 1
    QUORUM: x N/2+1
    DCQUORUM: QUORUM, DC
    ALL
  Read
    ONE
    QUORUM
    ALL
    Return, ReadRepair

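3. Quorum Protocol
  System: Eventual Consistency
  W + R > N
    N: number of replicas
    W: number of replicas that must acknowledge a write
    R: number of replicas that must answer a read
  Quorum examples
    W = R = Quorum (= N/2 + 1)
    W = ONE (= 1), R = ALL (= N)
    W = ALL, R = ONE
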
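The condition W + R > N can be checked mechanically for each combination listed above. A minimal sketch, with hypothetical names (QuorumSketch, quorum, overlaps) and N = 3 chosen only as an example:

```java
// Illustrative sketch only; the class and method names are hypothetical.
class QuorumSketch {
    // Quorum size for N replicas, as on the slide: N/2 + 1 (integer division).
    static int quorum(int n) {
        return n / 2 + 1;
    }

    // True when every read set of size r must share a replica with every write set of size w.
    static boolean overlaps(int w, int r, int n) {
        return w + r > n;
    }

    public static void main(String[] args) {
        int n = 3; // example replication factor, an assumption for the demo
        int q = quorum(n);
        System.out.println("W=R=QUORUM:   " + overlaps(q, q, n));
        System.out.println("W=ONE, R=ALL: " + overlaps(1, n, n));
        System.out.println("W=ALL, R=ONE: " + overlaps(n, 1, n));
        System.out.println("W=ONE, R=ONE: " + overlaps(1, 1, n));
    }
}
```

The last combination (W = ONE, R = ONE) is not on the slide; it is included to show a setting where the read and write sets need not intersect, so a read may miss the latest write.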
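4. Client, Proxy, and Data nodes
  Client
   1.  Request to a Proxy node
   2.  Proxy to Data nodes
   3.  Response to the Client
  Proxy
   1.  Key to Data nodes, ordered by Network Proximity
   2.  Message to the Data nodes
   3.  Response to the Client according to the Consistency Level
  Data
   1.  Message handled via service.StorageService
   2.  Response to the Proxy
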
4. VerbHandlers
  RowMutationVerbHandler: Write
  ReadVerbHandler: Read
  Others: RangeSlice, Read Repair, Bootstrap, Gossip
  See org.apache.cassandra.service.StorageService

4. Write
  RowMutationVerbHandler
    (I/O)
  "Always Writable": Disk I/O, Lock free
  (Figure: write path on a Data Node, request arriving from the Proxy)
    Memory
      MemTable: <RowKey, CF> Map (ConcurrentSkipListMap)
    Disk
      Commit Log (sync): Serialized RowMutation
      SSTable: written by async Flush of the MemTable, Read Only after Flush
        Indexes
        Row Data
        Bloom Filter

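The write path in the figure (Commit Log first, then MemTable, then Flush to a read-only SSTable) can be sketched in memory as follows. Everything here is a simplification under assumptions: WritePathSketch and its methods are hypothetical names, the Commit Log is a plain list rather than a file, the Flush runs inline rather than asynchronously, and the methods are synchronized for brevity even though the slide describes the real path as lock free.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative in-memory sketch; names are hypothetical and nothing touches a real disk.
class WritePathSketch {
    private final List<String> commitLog = new ArrayList<>();   // stands in for the on-disk Commit Log
    private ConcurrentSkipListMap<String, String> memtable = new ConcurrentSkipListMap<>();
    private final List<NavigableMap<String, String>> sstables = new ArrayList<>();
    private final int flushThreshold;

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // 1) append to the Commit Log, 2) update the MemTable, 3) flush once it is large enough.
    synchronized void write(String rowKey, String value) {
        commitLog.add(rowKey + "=" + value);
        memtable.put(rowKey, value);
        if (memtable.size() >= flushThreshold) flush();
    }

    // Freeze the MemTable into a read-only, key-sorted "SSTable" and start a fresh MemTable.
    private void flush() {
        NavigableMap<String, String> frozen =
                Collections.unmodifiableNavigableMap(new TreeMap<String, String>(memtable));
        sstables.add(frozen);
        memtable = new ConcurrentSkipListMap<>();
    }

    // Newest data wins: MemTable first, then SSTables from newest to oldest.
    synchronized String read(String rowKey) {
        String v = memtable.get(rowKey);
        for (int i = sstables.size() - 1; v == null && i >= 0; i--) {
            v = sstables.get(i).get(rowKey);
        }
        return v;
    }

    synchronized int sstableCount() {
        return sstables.size();
    }

    public static void main(String[] args) {
        WritePathSketch store = new WritePathSketch(2);
        store.write("a", "1");
        store.write("b", "2");   // reaches the threshold and triggers a Flush
        store.write("a", "3");   // newer value stays in the MemTable
        System.out.println(store.read("a") + " " + store.read("b") + " " + store.sstableCount());
    }
}
```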
4. Write: Hinted Handoff
  Write
    (Node, Proxy Node)
    Gossip
    Hint: stored in a SystemTable CF
    Consistency Level ANY
      ANY: Hint
  Hinted Handoff, Read Repair
  (Figure: flowchart)
    Write Msg -> Commit Log -> Mem/SSTable -> With Hint?
      No: done
      Yes: Gossip, Write Hint

4. Write: Compaction
  Compaction: merge multiple SSTable Files into one File
    Read
  2 types
    Minor Compaction
      SSTable
      P [bytes] x 4 -> Q, Q x 4 -> R, R x 4 -> S (P = Memtable size)
    Major Compaction
      All SSTables of a CF
      tombstone
  JVM GC

4. Read
  ReadVerbHandler
  Lock: Mem & SSTable
    SSTable is Read Only; Write Lock
  0.6
    Row Cache: 1 CF
    Key Cache: SSTable
  (Figure: read path, Proxy and Closest DataNode)
    DataNode
      Mem: Row Cache, MemTable
      Disk: SSTable (Key Cache), Real Data
      Merge the Data into a Row
    Proxy
      • Closest DataNode: Data query (Row Cache)
      • Other DataNodes: Digest Query
      • Return according to the Consistency Level
      • Digest (MD5): Read Repair

4. Read: Read Repair
  Read Repair
    Digest (on ProxyNode)
       1.  Read Repair
  Eventual Consistency
    Closest Node, Version
    Read Repair

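4. Read: Bloom Filter
  Bloom Filter
    Tests whether word W is in set D: answers "contained" or "not contained"
    A "contained" answer may be a false positive
  In Cassandra
    Built over the Row Keys of an SSTable
    Checked at Key lookup before going to disk, to save IO
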
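Bloom Filter
  Decide whether word W is "contained" or "not contained" in set D
  Step 0
       Prepare k hash functions F1 to Fk
       Prepare m-bit arrays ArrayW and ArrayD (initialized to 0)
       ArrayD[Fi(d) mod m] = 1 for each d in D, i = 1, ..., k
  Step 1
       ArrayW[Fi(W) mod m] = 1 for i = 1, ..., k
  Step 2
       Compare ArrayW and ArrayD
           Every bit set in ArrayW is also set in ArrayD: W is "contained" in D
           Otherwise: W is "not contained" in D
  Cost: O(k)
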
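Steps 0 to 2 can be sketched as below. The hashing scheme is an assumption made for the example (k indexes derived from String.hashCode by double hashing); it is not the hash family Cassandra uses, and BloomFilterSketch is a hypothetical name. The bit set plays the role of ArrayD; ArrayW is never materialized because its k bits are checked one by one.

```java
import java.util.BitSet;

// Illustrative sketch; the hashing scheme is an assumption, not the one Cassandra uses.
class BloomFilterSketch {
    private final BitSet bits;   // ArrayD on the slide
    private final int m;         // number of bits
    private final int k;         // number of hash functions

    BloomFilterSketch(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // F_i(s) mod m, derived from two base hashes (double hashing).
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = (h1 >>> 16) | 1;   // hypothetical second hash, forced odd so it is never zero
        return Math.floorMod(h1 + i * h2, m);
    }

    // Step 0: set k bits for an element d of the set D.
    void add(String d) {
        for (int i = 1; i <= k; i++) bits.set(index(d, i));
    }

    // Steps 1 and 2: compute W's k bit positions and check that all of them are set.
    boolean mightContain(String w) {
        for (int i = 1; i <= k; i++) {
            if (!bits.get(index(w, i))) return false;   // definitely not in D
        }
        return true;                                    // in D, or a false positive
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("row1");
        System.out.println(filter.mightContain("row1"));
        System.out.println(filter.mightContain("row2"));
    }
}
```

A "false" answer is always correct, which is why an SSTable whose filter rejects the key can be skipped without any disk read.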
4. Delete
  Deletion takes two steps (1. and 2.)
    tombstone & JVM GC
      A delete writes a tombstone
      The Tombstone is removed by GC
      (GC Time: 10 days)
      Step 2. happens once GC Time has passed since step 1.

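The two-step delete can be sketched as follows. This is an illustration under assumptions: TombstoneSketch and its members are hypothetical names, timestamps are plain seconds supplied by the caller, and the grace period is taken as 10 days (864000 seconds).

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Illustrative sketch; names are hypothetical. Times are in seconds.
class TombstoneSketch {
    static final long GC_GRACE_SECONDS = 10L * 24 * 60 * 60;   // assumed 10 days

    // A cell is either live (value != null) or a tombstone (value == null).
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    private final Map<String, Cell> cells = new HashMap<>();

    void put(String key, String value, long now) {
        cells.put(key, new Cell(value, now));
    }

    // Step 1: a delete writes a tombstone instead of removing the entry.
    void delete(String key, long now) {
        cells.put(key, new Cell(null, now));
    }

    // Readers treat a tombstone as "not found".
    String get(String key) {
        Cell c = cells.get(key);
        return (c == null || c.isTombstone()) ? null : c.value;
    }

    // Step 2: physically drop tombstones older than the grace period.
    void compact(long now) {
        Iterator<Cell> it = cells.values().iterator();
        while (it.hasNext()) {
            Cell c = it.next();
            if (c.isTombstone() && now - c.timestamp > GC_GRACE_SECONDS) it.remove();
        }
    }

    int storedEntries() {
        return cells.size();
    }

    public static void main(String[] args) {
        TombstoneSketch store = new TombstoneSketch();
        store.put("k", "v", 0);
        store.delete("k", 100);
        System.out.println(store.get("k") + " " + store.storedEntries());
        store.compact(101 + GC_GRACE_SECONDS);
        System.out.println(store.storedEntries());
    }
}
```

Keeping the tombstone for the grace period gives replicas that missed the delete time to learn about it before the marker disappears.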
5. Gossip Protocol
  Gossip Protocol
    State: (JOIN, DEAD, AVAIL)

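5. Cassandra Gossip
  Cassandra Gossip
  1.  Every 1 second, start a Gossip round
  2.  Gossip to 1 random live endpoint
           With probability unreachableN / (liveN + 1), also Gossip to an unreachable endpoint
  3.  If the Gossip in 1. did not go to a Seed, or liveN < SeedN, also Gossip to a Seed
      Seed: static

  Gossip contents
    ApplicationState (JOIN, DEAD, AVAIL)
    HeartBeatState
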
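The target selection for one Gossip round can be sketched from the probability unreachableN / (liveN + 1) and the condition liveN < SeedN given above. GossipRoundSketch and pickTargets are hypothetical names, endpoints are plain strings, and the sketch only chooses targets; it does not exchange any state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative sketch of one gossip round's target selection; names are hypothetical.
class GossipRoundSketch {
    static List<String> pickTargets(List<String> live, List<String> unreachable,
                                    List<String> seeds, Random rnd) {
        List<String> targets = new ArrayList<>();
        boolean gossipedToSeed = false;

        // Gossip to one random live endpoint.
        if (!live.isEmpty()) {
            String t = live.get(rnd.nextInt(live.size()));
            targets.add(t);
            gossipedToSeed = seeds.contains(t);
        }
        // With probability unreachableN / (liveN + 1), also try an unreachable endpoint.
        if (!unreachable.isEmpty()) {
            double p = unreachable.size() / (double) (live.size() + 1);
            if (rnd.nextDouble() < p) {
                targets.add(unreachable.get(rnd.nextInt(unreachable.size())));
            }
        }
        // If no Seed was contacted, or liveN < SeedN, also gossip to a Seed.
        if (!seeds.isEmpty() && (!gossipedToSeed || live.size() < seeds.size())) {
            targets.add(seeds.get(rnd.nextInt(seeds.size())));
        }
        return targets;
    }

    public static void main(String[] args) {
        List<String> live = Arrays.asList("n1", "n2", "s1");
        List<String> unreachable = Arrays.asList("n3");
        List<String> seeds = Arrays.asList("s1");
        System.out.println(pickTargets(live, unreachable, seeds, new Random()));
    }
}
```

Retrying unreachable endpoints with a probability that grows with their share of the cluster lets recovered nodes be rediscovered, and the Seed rule keeps separate groups of nodes from gossiping only among themselves.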
6. SEDA [1/2]
  SEDA (Staged Event-Driven Architecture)
    Message Passing

6. SEDA [2/2]
  Cassandra
     Event Queue + Thread Pool
        StageManager: Thread Pool Executor
               public final static String READ_STAGE = "ROW-READ-STAGE";
               public final static String MUTATION_STAGE = "ROW-MUTATION-STAGE";
               public final static String STREAM_STAGE = "STREAM-STAGE";
               public final static String GOSSIP_STAGE = "GS";
               public static final String RESPONSE_STAGE = "RESPONSE-STAGE";
               public final static String AE_SERVICE_STAGE = "AE-SERVICE-STAGE";
               private static final String LOADBALANCE_STAGE = "LOAD-BALANCER-STAGE";

     Event Handler
        VerbHandler
        TCP
        UDP

  java.util.concurrent

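The "Event Queue + Thread Pool" idea can be sketched with java.util.concurrent alone: an ExecutorService already bundles a work queue with a pool of threads, so one executor per stage name gives a minimal stage registry. StageRegistrySketch and its methods are hypothetical names, not Cassandra's StageManager API; only the two stage names are taken from the slide.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch; StageRegistrySketch is a hypothetical name, not Cassandra's StageManager.
class StageRegistrySketch {
    static final String READ_STAGE = "ROW-READ-STAGE";
    static final String MUTATION_STAGE = "ROW-MUTATION-STAGE";

    // Each stage = one event queue + one thread pool (an ExecutorService provides both).
    private final Map<String, ExecutorService> stages = new ConcurrentHashMap<>();

    void registerStage(String name, int threads) {
        stages.put(name, Executors.newFixedThreadPool(threads));
    }

    // Hand an event to a stage; its handler runs later on that stage's threads.
    <T> Future<T> submit(String stage, Callable<T> handler) {
        ExecutorService pool = stages.get(stage);
        if (pool == null) throw new IllegalArgumentException("unknown stage: " + stage);
        return pool.submit(handler);
    }

    // Stop accepting work so the JVM can exit once queued handlers finish.
    void shutdown() {
        for (ExecutorService pool : stages.values()) pool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        StageRegistrySketch registry = new StageRegistrySketch();
        registry.registerStage(READ_STAGE, 2);
        registry.registerStage(MUTATION_STAGE, 2);
        try {
            Future<String> result = registry.submit(READ_STAGE, () -> "row data");
            System.out.println(result.get());
        } finally {
            registry.shutdown();
        }
    }
}
```

Giving reads and mutations separate pools means a backlog in one stage queues up there instead of starving the other, which is the main point of the staged design.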
7. Cassandra
  e.g. YCSB (Yahoo! Cloud Serving Benchmark)
       "Benchmarking Cloud Serving Systems with YCSB", SoCC '10
       http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
    Tiers
       Tier 1. Performance: latency vs. throughput
       Tier 2. Scalability
    Workloads
       Operation types
       Request distributions (Zipf, ...)
    Systems
       Cassandra
       HBase (modeled on Google Bigtable)
       MySQL Sharding
       PNUTS (Yahoo!)