1. Apache Cassandra
What is Apache Cassandra?
Apache Cassandra is an open-source, non-relational, distributed
database that manages large amounts of data across commodity
servers.
It is a wide-column (column-oriented) database.
It was initially released in July 2008.
In CAP theorem terms, it favors Availability and Partition Tolerance (AP).
2. Why was Apache Cassandra implemented?
Avinash Lakshman and Prashant Malik initially
developed Apache Cassandra at Facebook to power the
Facebook inbox search feature.
3. Components of Apache Cassandra
• Node: A Cassandra node is a place where data is stored.
• Data center: A data center is a collection of related nodes.
• Cluster: A cluster contains one or more data centers.
• Commit log: In Cassandra, the commit log is a crash-recovery mechanism. Every write operation is
written to the commit log.
• Memtable: A memtable is a memory-resident data structure. After the commit log, the data is written
to the memtable. Sometimes a single column family has multiple memtables.
• SSTable: It is a disk file to which the data is flushed from the memtable when its contents reach a
threshold value.
• Bloom filter: A quick, probabilistic data structure for testing whether an element
is a member of a set. It acts as a special kind of cache. Bloom filters are consulted on every read query before an SSTable is accessed.
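The Bloom filter idea above can be illustrated with a small sketch. This is a toy version for explanation only, not Cassandra's implementation: the class name BloomFilter, the sizes num_bits and num_hashes, and the use of SHA-256 are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


# One filter per SSTable, filled with the partition keys that SSTable holds.
sstable_filter = BloomFilter()
for partition_key in ("user:1", "user:2", "user:3"):
    sstable_filter.add(partition_key)

print(sstable_filter.might_contain("user:2"))  # True: the key was added
```

A False answer lets a read skip the SSTable entirely; a True answer only means the SSTable is worth checking.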
5. Write Operations:
i. When a client issues a write request, Cassandra stores the data in
an in-memory structure, the memtable (RAM). The write is also
appended to the commit log (disk), so it survives even if the node
loses power.
ii. The data from the memtable (RAM) is flushed to SSTables (disk),
and a partition index is created that points to the location of the
data on disk. A flush is triggered when the memtable reaches a
configurable threshold or when the commit log limit
commitlog_total_space_in_mb is exceeded.
iii. SSTables are immutable: when a memtable is flushed, existing
SSTables are not overwritten; a new file is created instead. As a
result, a partition may be stored across multiple SSTables, which
are combined when the partition is read.
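The write steps above can be sketched as a toy model. This is not Cassandra's storage engine: the class ToyNode and its flush_threshold parameter (counted in rows, purely for illustration) are assumptions, and the commit log is represented by an in-memory list rather than a file on disk.

```python
class ToyNode:
    """Toy model of the write path; not Cassandra's real storage engine."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical: rows per memtable
        self.commit_log = []   # stands in for the on-disk, append-only log
        self.memtable = {}     # in-memory, partition key -> value
        self.sstables = []     # each flush adds one immutable snapshot

    def write(self, key, value):
        self.commit_log.append((key, value))  # durable record for crash recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # A flush creates a new sorted SSTable; existing ones are never modified.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying


node = ToyNode(flush_threshold=2)
node.write("k1", "v1")
node.write("k2", "v2")      # reaches the threshold and triggers a flush
node.write("k1", "v1-new")  # newer version stays in the memtable for now

print(len(node.sstables))   # 1
print(node.memtable)        # {'k1': 'v1-new'}
```

Note that "k1" now exists both in the flushed SSTable and in the memtable, which is why reads have to merge the two.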
7. Read Operations:
i. The client issues a read request.
ii. The memtable (RAM) is checked for the requested data. If it is
present, the data is read from the memtable and merged with data
from the SSTable files (disk) before the final result is sent to the client.
iii. If the row cache is enabled, it is checked for the data.
iv. Bloom filters are kept in memory (off-heap in current Cassandra
versions) and are checked to find the SSTable files that may hold the
requested partition. Because Bloom filters are probabilistic, they can
return false positives but never false negatives. If the Bloom filter
does not rule out an SSTable, Cassandra next checks the partition
key cache.
v. The partition key cache stores partition index entries in memory.
If the partition key is found in the cache, Cassandra goes directly to
the compression offset map to find the block on disk that holds the
data. If the partition key is not in the cache, the partition summary
is searched to narrow down where the index entry is.
vi. The partition index stores the partition keys together with their
positions in the data file. The position is then used with the
compression offset map to find the exact location on disk where the
data is stored.
vii. The compression offset map holds the exact on-disk location of the
data. Given the position found for the partition key, it points to the
block that has to be read. Once the location is known, the data is
fetched and returned to the client.
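A simplified sketch of the read steps above is given below. It is a toy model under stated assumptions: the function read and the dictionary layout are invented for this example, a plain Python set stands in for the Bloom filter, a dict stands in for the partition index, and the row cache, partition key cache, partition summary and compression offset map are left out.

```python
def read(key, memtable, sstables):
    """Return the newest value for key, or None. Values are (timestamp, data)."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for sstable in sstables:
        # Stand-in for the Bloom filter: skip SSTables that cannot hold the key.
        # (A real Bloom filter may also return false positives.)
        if key not in sstable["filter"]:
            continue
        # Stand-in for the partition index: key -> position in the data file.
        offset = sstable["index"].get(key)
        if offset is not None:
            candidates.append(sstable["data"][offset])
    # Merge step: the version with the latest write timestamp wins.
    return max(candidates, default=None)


memtable = {"k1": (30, "newest")}
sstables = [
    {"filter": {"k1", "k2"}, "index": {"k1": 0, "k2": 1},
     "data": [(10, "old"), (12, "only-on-disk")]},
    {"filter": {"k3"}, "index": {"k3": 0}, "data": [(20, "other")]},
]

print(read("k1", memtable, sstables))  # (30, 'newest')
print(read("k2", memtable, sstables))  # (12, 'only-on-disk')
print(read("k9", memtable, sstables))  # None
```

The point of the filter check is that the second SSTable is never touched for "k1" or "k2", which saves disk reads when a node holds many SSTables.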
9. Features of Apache Cassandra:
Distributed
Scalability
Fault Tolerance
Query Language (CQL)
10. Virtual Nodes:
A virtual node (vnode) is a unit of data ownership within a server.
Older releases default to 256 virtual nodes per server; Cassandra 4.0
lowered the default to 16. Each node has a range of tokens assigned,
and every virtual node uses a sub-range of the tokens of the node it
belongs to. Virtual nodes give the system greater flexibility, so it
is easier for Cassandra to add new nodes to the cluster when we need
them. When tokens are unevenly distributed between nodes, the load
can be rebalanced by changing how many virtual nodes each server
holds.
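The token ring with virtual nodes can be sketched as follows. This is an illustrative model, not Cassandra's partitioner: the functions token, build_ring and owner are hypothetical names, the server names are made up, and MD5 is used here only as a convenient stand-in hash (Cassandra's default partitioner is Murmur3-based).

```python
import bisect
import hashlib


def token(value):
    # Hypothetical token function; Cassandra's default partitioner uses Murmur3.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def build_ring(servers, vnodes_per_server):
    """Return a sorted list of (token, server), one entry per virtual node."""
    ring = []
    for server in servers:
        for i in range(vnodes_per_server):
            ring.append((token(f"{server}-vnode-{i}"), server))
    return sorted(ring)


def owner(ring, partition_key):
    # The owner is the first virtual node at or after the key's token,
    # wrapping around to the start of the ring.
    tokens = [t for t, _ in ring]
    idx = bisect.bisect_left(tokens, token(partition_key)) % len(ring)
    return ring[idx][1]


ring = build_ring(["node-a", "node-b", "node-c"], vnodes_per_server=8)
print(len(ring))               # 24 virtual nodes in total
print(owner(ring, "user:42"))  # one of node-a, node-b, node-c
```

Because each server owns many small token ranges instead of one large one, a newly added server takes over a few ranges from every existing server rather than splitting the load of a single neighbor.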
12. Advantages of Apache Cassandra:
Open source
Peer to Peer Architecture
Scalable
High Efficiency
Tunable Consistency
Schema-less
Easy to Learn and Use
Distributed and Decentralized
Ability to Analyse
13. Disadvantages of Apache Cassandra:
It does not support full ACID transactions or relational features.
Because it handles large amounts of data and many requests,
transactions can slow down, which shows up as latency.
Data is modelled around queries and not structure, resulting in the
same information stored multiple times.
Since Cassandra stores vast amounts of data, users may experience
JVM memory management issues.
It offers no join or subquery support.
Cassandra offers only limited support for aggregates (basic CQL
functions such as COUNT, SUM, MIN, MAX and AVG).
Cassandra was optimized from the start for fast writes; reads got
the short end of the stick and tend to be slower.
Finally, its official documentation from Apache has historically been
limited, so you may need to look for it among third-party companies.