Column-Oriented Databases
In Depth
Dr. Fabio Fumarola
Outline
• Column-Oriented Introduction
• Bigtable:
– Features
– Data Model
– Rows and Column Family
– Timestamp
– API
– Implementation
• Other Open Source solutions
2
Column-Oriented Introduction
• We saw that key-value databases are simple
hash tables, where:
– All access is via the primary key and returns an object as the
result.
• Column-oriented databases are motivated by the
need to model something richer than opaque object
values.
3
Bigtable
• It is used pervasively at Google in several projects:
– Google Analytics, Google Finance, Google Plus,
– Personalized Search, Google Docs, and Google Maps.
• These products use Bigtable for a variety of
demanding workloads ranging from:
– Batch-oriented processing to
– Low-latency serving of data to end users
4
Bigtable: Main Characteristics
• It does not support a full relational data model
• It provides clients with a simple data model that
– Supports dynamic control over data layout and format
– Allows clients to reason about the locality of the data in
the underlying storage
• Data is indexed using row and column names that
can be arbitrary strings
• Data is stored as uninterpreted strings, into which clients
can serialize several kinds of data.
5
Data Model
• A Bigtable is a sparse, distributed, persistent
multi-dimensional sorted map.
• The map is indexed at the first level by a row key,
• and at the second and third levels by a column key and a
timestamp, respectively.
• Each value in the map is an uninterpreted array of bytes:
(row:string, column:string, time:int64) → string
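The mapping above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory dict stands in for the distributed store, and the class name `TinyTable` and its methods are hypothetical, not Bigtable's API.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-in for (row, column, timestamp) -> bytes.
Cell = Tuple[str, str, int]


class TinyTable:
    def __init__(self) -> None:
        self._cells: Dict[Cell, bytes] = {}

    def put(self, row: str, column: str, ts: int, value: bytes) -> None:
        self._cells[(row, column, ts)] = value

    def get(self, row: str, column: str) -> Optional[bytes]:
        """Return the most recent version of a cell, or None if absent."""
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

    def sorted_keys(self) -> List[Cell]:
        """Keys ordered by row, then column, then newest timestamp first."""
        return sorted(self._cells, key=lambda k: (k[0], k[1], -k[2]))


table = TinyTable()
table.put("com.cnn.www", "contents:", 3, b"page v3")
table.put("com.cnn.www", "contents:", 5, b"page v5")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
```

The map is sparse because only cells that were written occupy space; a row may have any subset of columns.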
6
Example of mapping
7
Data Model
• The Bigtable designers settled on this data model after
examining a variety of potential uses of Bigtable.
• The example that drove their decision was storing
web-crawl results and enabling analysis over them.
• The main goal is to “keep a copy of a large collection
of web pages and related information that could be
used by many different projects” (Cache).
8
Webtable example
• URLs are used as row keys,
• various aspects of web pages as column names,
• and the contents of the web pages are stored in the contents:
column under the timestamps when they were fetched.
9
Data Model: Rows
• Row keys are arbitrary strings (up to 64 KB).
• Every read or write of data under a single row key is
atomic, regardless of the number of different
columns being read or written.
• Data is maintained in lexicographic order by row
key.
• The ordered row keys are grouped into row ranges, which are
dynamically partitioned.
10
Data Model: Rows
• Each row range is called a tablet.
• Tablets are the unit of distribution and load
balancing.
• As a result, reads of short row ranges are
efficient and typically require communication
with only a small number of machines.
• For example, in Webtable, pages in the same
domain are grouped together into contiguous
rows by reversing the hostname components
of the URLs.
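A minimal sketch of the hostname-reversal trick, assuming well-formed URLs; the helper name `row_key` is hypothetical. After reversal, lexicographic order places pages of the same domain next to each other.

```python
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a row key with the hostname components reversed."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return reversed_host + parts.path


urls = [
    "http://maps.google.com/index.html",
    "http://www.cnn.com/world",
    "http://mail.google.com/inbox",
]
# Sorting the keys groups the two google.com pages into contiguous rows.
keys = sorted(row_key(u) for u in urls)
```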
11
Data Model: Column Families
• Column keys are grouped into sets called column
families.
• All data stored in a column family is usually of the
same type (values in the same family are
compressed together).
• The design intent is to keep the number of
column families small (in the hundreds at most).
• In contrast, a table may have an unbounded number
of columns.
12
Data Model: Column Families
• A column key is named using the following syntax:
family:qualifier.
• Column family names must be printable, but
qualifiers may be arbitrary strings.
• An example column family for the Webtable is language.
• The table uses only one column key in the language family,
which stores each web page’s language ID.
13
Data Model: Column Families
• Another useful column family for this table is anchor.
• Each column key in this family represents a single anchor.
• The qualifier is the name of the referring site; the cell
content is the link text.
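The family:qualifier syntax can be illustrated with a small helper. The function name `split_column_key` is hypothetical; the sketch splits on the first colon, so a qualifier may itself contain colons.

```python
from typing import Tuple


def split_column_key(column_key: str) -> Tuple[str, str]:
    """Split 'family:qualifier' into its two parts."""
    family, _, qualifier = column_key.partition(":")
    return family, qualifier


# One anchor column: the referring site is the qualifier.
anchor = split_column_key("anchor:cnnsi.com")
# A family with a single column can use an empty qualifier.
contents = split_column_key("contents:")
```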
14
Timestamps
• Each cell in a Bigtable can contain multiple versions
of the same data;
• These versions are indexed by timestamp in
microseconds.
• Bigtable timestamps are 64-bit integers.
• Different versions of a cell are stored in decreasing
timestamp order, so that the most recent versions
can be read first.
15
Timestamps
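• Clients can ask Bigtable to keep only the last n versions of a cell,
or only versions that are new enough.
• Older versions are garbage-collected automatically.
• In the Webtable example, the timestamps of the contents: column
are the times at which the pages were crawled, and only the
three most recent versions of each page are kept.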
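A sketch of the "keep the last n versions" policy, assuming each cell is represented as a list of (timestamp, value) pairs; the function name `keep_latest` is hypothetical.

```python
from typing import List, Tuple

Versions = List[Tuple[int, bytes]]  # (timestamp, value) pairs


def keep_latest(versions: Versions, n: int) -> Versions:
    """Return the n most recent versions, newest first."""
    return sorted(versions, key=lambda tv: tv[0], reverse=True)[:n]


crawls = [(100, b"v1"), (400, b"v4"), (200, b"v2"), (300, b"v3")]
kept = keep_latest(crawls, 3)  # the oldest crawl (timestamp 100) is dropped
```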
16
API
• Supported operations:
– put(key, columnFamily, columnQualifier, value)
– get(key)
– scan(startKey, endKey)
– delete(key)
• It also provides functions for changing cluster, table,
and column family metadata, such as access control
rights.
17
API: Get & Scan
• A Get is simply a Scan limited by the API to one row.
• A Scan fetches zero or more rows of a table.
• By default, a Scan reads the entire table from start to
end.
– Scan()
– Scan(byte[] startRow)
– Scan(byte[] startRow, byte[] stopRow)
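The relationship between Get and Scan can be sketched over a sorted list of row keys. This is an illustrative model, not the client API: the class `SortedRows` is hypothetical, and the stop row is assumed to be exclusive.

```python
import bisect
from typing import Dict, Iterator, List, Optional, Tuple


class SortedRows:
    """Hypothetical stand-in for a table whose rows are kept sorted by key."""

    def __init__(self, rows: Dict[bytes, dict]) -> None:
        self._keys: List[bytes] = sorted(rows)
        self._rows = rows

    def scan(self, start_row: Optional[bytes] = None,
             stop_row: Optional[bytes] = None) -> Iterator[Tuple[bytes, dict]]:
        """Yield rows with start_row <= key < stop_row, in key order."""
        lo = 0 if start_row is None else bisect.bisect_left(self._keys, start_row)
        hi = (len(self._keys) if stop_row is None
              else bisect.bisect_left(self._keys, stop_row))
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

    def get(self, row: bytes) -> Optional[dict]:
        """A get is a scan restricted to exactly one row."""
        # row + b"\x00" is the smallest key strictly greater than row.
        for _, value in self.scan(row, row + b"\x00"):
            return value
        return None


rows = SortedRows({b"a": {"n": 1}, b"b": {"n": 2}, b"c": {"n": 3}, b"d": {"n": 4}})
```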
18
API
• A scan can be limited in several ways:
– Start and stop row
– Caching size: to limit the number of rows
– Batch size: to limit the number of columns
– Max result size: to limit the number of bytes
– Setting the column families and qualifiers to return
• Bigtable can be used with MapReduce, a framework
for running large-scale parallel computations
developed at Google.
19
Building Blocks
• Google File System (GFS)
• Google SSTable
• Chubby, a distributed lock service (ZooKeeper is its open-source counterpart)
20
Google File Systems
• Bigtable uses the distributed Google File System
(GFS) to store log and data files.
21
http://static.googleusercontent.com/media/research.google.com/it//archive/gfs-sosp2003.pdf
Google SSTable
• SSTable is the file format used internally to store Bigtable data.
• An SSTable provides a persistent, ordered, immutable
map from keys to values.
• The operations provided are:
– Look up the value associated with a key
– Iterate over all key/value pairs in a specified key range
• Internally, each SSTable contains a sequence of
blocks (typically 64 KB each)
22
Google SSTable
• A block index (stored at the end of the SSTable) is
used to locate blocks.
• The index is loaded into memory when the SSTable is
opened.
• A lookup can be performed with a single disk seek:
we first find the appropriate block by performing a
binary search in the in-memory index, and then
read that block from disk.
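The single-seek lookup can be sketched as follows. It is a toy model under stated assumptions: the class `TinySSTable` is hypothetical, blocks hold a fixed number of entries rather than 64 KB, a Python list plays the role of the disk, and the index stores the last key of each block.

```python
import bisect
from typing import List, Optional, Tuple

Block = List[Tuple[str, str]]  # sorted (key, value) pairs


class TinySSTable:
    """Hypothetical immutable sorted table split into fixed-size blocks."""

    def __init__(self, items: List[Tuple[str, str]], block_size: int = 2) -> None:
        items = sorted(items)
        # "Disk": a list of blocks; reading one block counts as one seek.
        self._blocks: List[Block] = [items[i:i + block_size]
                                     for i in range(0, len(items), block_size)]
        # In-memory index: the last key of every block.
        self._index: List[str] = [block[-1][0] for block in self._blocks]
        self.block_reads = 0

    def lookup(self, key: str) -> Optional[str]:
        pos = bisect.bisect_left(self._index, key)  # binary search in memory
        if pos == len(self._index):
            return None                             # key is past the last block
        self.block_reads += 1                       # the single block read
        for k, v in self._blocks[pos]:
            if k == key:
                return v
        return None


sst = TinySSTable([("a", "1"), ("b", "2"), ("c", "3"), ("d", "4"), ("e", "5")])
```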
23
Chubby
• It is a highly-available and persistent distributed lock service.
• A Chubby service consists of five active replicas, one of which is elected
master.
• Chubby provides a namespace that consists of directories and
small files.
24
http://static.googleusercontent.com/media/research.google.com/it//archive/chubby-osdi06.pdf
Chubby and Bigtable
• Bigtable uses Chubby for a variety of tasks:
– to ensure that there is at most one active master at any
time;
– to store the bootstrap location of Bigtable data;
– to discover tablet servers and finalize tablet server
deaths;
– to store Bigtable schema information (the column family
information for each table);
– and to store access control lists.
25
Implementation
• A Bigtable implementation has three major components:
– a library that is linked into every client,
– one master server,
– and many tablet servers
• Tablet servers can be dynamically added to (or
removed from) a cluster to accommodate changes in
workloads.
26
Implementation: Master
• It is responsible for:
– assigning tablets to tablet servers,
– detecting the addition and expiration of tablet servers,
– balancing tablet-server load,
– and garbage collection of files in GFS.
• In addition, it handles schema changes such as table
and column family creations.
27
Implementation: Tablet Server
• Each tablet server manages a set of tablets (typically
somewhere between ten and a thousand
tablets per tablet server).
• The tablet server handles read and write requests to
the tablets that it has loaded, and also splits tablets
that have grown too large.
28
Implementation: Clients
• Clients communicate directly with tablet servers for
reads and writes.
• Because Bigtable clients do not rely on the master
for tablet location information, most clients never
communicate with the master.
• As a result, the master is lightly loaded in practice
(differently from HBase).
29
Implementation: Cluster
• A Bigtable cluster stores a number of tables.
• Each table consists of a set of tablets, and each
tablet contains all data associated with a row range.
• Initially, each table consists of just one tablet. As a
table grows, it is automatically split into multiple
tablets, each approximately 100-200 MB in size by
default.
30
Tablets
31
Tablet Location
• Bigtable uses a three-level hierarchy, similar to a B+ tree,
to store tablet location information
32
Tablet Location
• The first level is a file stored in Chubby that contains the
location of the root tablet.
• The root tablet contains the location of all tablets in a special
METADATA table.
33
Tablet Location
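• The second level consists of METADATA tablets, each of which contains
the location of a set of user tablets.
• The METADATA table stores the location of a tablet under a
row key that encodes the tablet's table identifier and its end row.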
34
Tablet Location
• The third level consists of the user tablets, which contain the data.
• The client library caches tablet locations.
• If the client does not know the location of a tablet, or finds that a
cached location is stale, it recursively moves up the tablet location hierarchy.
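A rough sketch of the lookup with client-side caching. Everything here is an assumption made for illustration: the class `LocationClient`, the server names, and the simplification that the cache is keyed by row rather than by tablet range. Each level is modeled as a sorted list of (end row, location) pairs.

```python
import bisect
from typing import Dict, List, Tuple

Level = List[Tuple[str, str]]  # (end_row, location), sorted by end_row


def find(level: Level, row: str) -> str:
    """Return the location of the first tablet whose end row is >= row."""
    ends = [end for end, _ in level]
    return level[bisect.bisect_left(ends, row)][1]


class LocationClient:
    def __init__(self, root: Level, metadata: Dict[str, Level]) -> None:
        self._root = root          # root tablet, located via the Chubby file
        self._metadata = metadata  # METADATA tablets, keyed by their location
        self._cache: Dict[str, str] = {}
        self.lookups = 0           # number of walks through the hierarchy

    def locate(self, row: str) -> str:
        if row in self._cache:     # cached: no hierarchy walk needed
            return self._cache[row]
        self.lookups += 1
        meta_tablet = find(self._root, row)
        server = find(self._metadata[meta_tablet], row)
        self._cache[row] = server
        return server


# "\uffff" acts as a sentinel end row for the last tablet at each level.
client = LocationClient(
    root=[("m", "meta-1"), ("\uffff", "meta-2")],
    metadata={
        "meta-1": [("f", "ts-a"), ("m", "ts-b")],
        "meta-2": [("t", "ts-c"), ("\uffff", "ts-d")],
    },
)
```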
35
Tablet Location
• Bigtable also stores secondary information in the
METADATA table, including a log of all events
pertaining to each tablet.
• This information is helpful for debugging and
performance analysis.
36
Tablet Assignment and Serving
37
Tablet Assignment
• Each tablet is assigned to one tablet server at a time.
• The master keeps track of the set of live tablet
servers and of the current assignment of tablets to them,
including which tablets are unassigned.
• When a tablet is unassigned, and a tablet server with
sufficient room for the tablet is available, the master
assigns the tablet by sending a tablet load request to
the tablet server.
38
Tablet Assignment
• Bigtable uses Chubby to keep track of tablet servers.
• When a tablet server starts, it creates, and acquires
an exclusive lock on, a uniquely-named file in a
specific Chubby directory
• When a tablet server stops serving its tablets, the master is responsible
for detecting this and reassigning those tablets as soon as possible.
39
Tablet Serving
• The persistent state of a tablet is stored in GFS
• Updates are committed to a commit log that stores redo
records
40
• The memtable, a sorted in-memory buffer, stores recently
committed updates
• A sequence of SSTables stores the older
updates
Tablet Recovery
• To recover a tablet, a tablet server reads its metadata
from the METADATA table.
• This metadata contains the list of SSTables that make up the
tablet and a set of redo points, which are pointers into the commit
logs that may contain data for the tablet.
• The server reads the indices of the SSTables into
memory and reconstructs the memtable by applying
all of the updates that have committed since the
redo points.
41
Tablet Serving
• When a read operation arrives at a tablet server, the
operation is executed on a merged view of the
sequence of SSTables and the memtable.
• This operation is efficient because SSTable and
memtable are lexicographically sorted data
structures.
• Incoming read and write operations can continue
while tablets are split and merged.
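Because every source is already sorted, a merged view needs no re-sorting. The sketch below assumes each source is a list of (key, timestamp, value) entries sorted by key and then newest timestamp first; the function name `merged_view` is hypothetical.

```python
import heapq
from typing import Iterator, List, Tuple

Entry = Tuple[str, int, str]  # (key, timestamp, value)


def merged_view(sources: List[List[Entry]]) -> Iterator[Tuple[str, str]]:
    """Merge sorted sources, yielding the newest value for each key."""
    # Order by key, then newest timestamp first. Every source is already
    # sorted this way, so heapq.merge only interleaves them lazily.
    merged = heapq.merge(*sources, key=lambda e: (e[0], -e[1]))
    previous = None
    for key, _ts, value in merged:
        if key != previous:  # the first entry seen for a key is the newest
            yield key, value
            previous = key


memtable = [("a", 9, "a-new"), ("c", 7, "c1")]
sstable_new = [("a", 5, "a-old"), ("b", 4, "b1")]
sstable_old = [("b", 2, "b0"), ("d", 1, "d0")]
view = list(merged_view([memtable, sstable_new, sstable_old]))
```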
42
Compactions
• As write operations execute, the size of the
memtable increases.
• When the memtable size reaches a threshold, the
memtable is frozen, a new memtable is created, and
the frozen memtable is converted to an SSTable and
written to GFS.
• There are three types of compaction: minor,
merging, and major.
43
Minor Compactions
A minor compaction has two goals:
• it shrinks the memory usage of the tablet server,
• it reduces the amount of data that has to be read from
the commit log during recovery if this server dies.
Incoming read and write operations can continue while
compactions occur.
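The write path, the minor compaction, and recovery can be sketched together. This is a simplification under stated assumptions: the class `TinyTablet` is hypothetical, the threshold counts entries rather than bytes, and truncating the log stands in for advancing a redo point.

```python
from typing import Dict, List, Tuple


class TinyTablet:
    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.commit_log: List[Tuple[str, str]] = []  # redo records
        self.memtable: Dict[str, str] = {}
        self.sstables: List[Tuple[Tuple[str, str], ...]] = []  # newest last

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # log first, then apply
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self.minor_compaction()

    def minor_compaction(self) -> None:
        """Freeze the memtable, write it out as an SSTable, start a new one."""
        frozen, self.memtable = self.memtable, {}
        self.sstables.append(tuple(sorted(frozen.items())))
        self.commit_log.clear()  # these updates no longer need replaying

    def recover_memtable(self) -> Dict[str, str]:
        """Rebuild the memtable by replaying the redo records still logged."""
        rebuilt: Dict[str, str] = {}
        for key, value in self.commit_log:
            rebuilt[key] = value
        return rebuilt


tablet = TinyTablet(threshold=2)
tablet.write("a", "1")
tablet.write("b", "2")  # reaches the threshold and triggers a minor compaction
tablet.write("c", "3")
```

After the compaction, recovery only has to replay the single record for "c", which is the second goal listed above.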
44
Merging Compactions
• Every minor compaction generates a new SSTable.
• Read operations might need to merge updates from
an arbitrary number of SSTables.
• The number of SSTables is bounded by periodically
running a merging compaction in the background.
• A merging compaction reads the contents of a few
SSTables and the memtable, and writes out a new
SSTable.
45
Major Compaction
• It is a compaction that rewrites all SSTables into
exactly one SSTable.
• SSTables produced by non-major compactions can
contain special deletion entries.
• A major compaction, on the other hand, produces an
SSTable that contains no deletion information or
deleted data.
• These major compactions allow Bigtable to reclaim
resources used by deleted data.
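A sketch of how a major compaction discards deletion entries, assuming each SSTable is modeled as a dict and `None` is used as a hypothetical deletion marker.

```python
from typing import Dict, List, Optional

TOMBSTONE = None  # hypothetical deletion marker


def major_compaction(sstables: List[Dict[str, Optional[str]]]) -> Dict[str, str]:
    """Merge SSTables (oldest first) into one, dropping deleted data."""
    merged: Dict[str, Optional[str]] = {}
    for sstable in sstables:  # newer tables override older ones
        merged.update(sstable)
    # Keep keys in sorted order and drop every deletion entry.
    return {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}


older = {"a": "1", "b": "2"}
newer = {"b": TOMBSTONE, "c": "3"}  # "b" was deleted after the older table
compacted = major_compaction([older, newer])
```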
46
Refinements
47
Locality groups
• Clients can group multiple column families together
into a locality group.
• A separate SSTable is generated for each locality
group in each tablet.
• Segregating column families that are not typically
accessed together into separate locality groups
enables more efficient reads.
48
Compression
• A user-specified compression format is applied to
each SSTable block.
• Many clients use a two-pass compression scheme:
– The first pass uses Bentley and McIlroy’s scheme, which compresses
long common strings across a large window.
– The second pass uses a fast compression algorithm that
looks for repetitions in a small 16 KB window of the data.
• This scheme applied to Webtable achieved a 10-to-1
reduction in space.
49
Caching for read
To improve read performance, tablet servers use two
levels of caching.
• The Scan Cache is a higher-level cache that caches the
key-value pairs returned by the SSTable interface to the
tablet server code.
• The Block Cache is a lower-level cache that caches
SSTable blocks that were read from GFS.
50
Bloom filters
• The number of disk accesses for reads is reduced by allowing
clients to specify that Bloom filters should be created
for SSTables.
• A Bloom filter allows us to ask whether an SSTable
contains any data for a specified row/column pair.
• http://billmill.org/bloomfilter-tutorial/
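A minimal Bloom filter sketch, assuming SHA-256 with a per-hash prefix as the hash family and one byte per bit for readability; the class and its parameters are illustrative, not what Bigtable uses internally.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item: str) -> Iterator[int]:
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter()
bloom.add("com.cnn.www/anchor:cnnsi.com")  # a row/column pair
```

A negative answer lets the tablet server skip the SSTable without touching disk; a positive answer may be a false positive, so the SSTable still has to be read.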
51
Commit-log
• All writes and deletes are appended to
a single commit log per tablet server, shared by all of its tablets.
• Using one log provides significant performance
benefits during normal operation, but it complicates
recovery.
• To recover the state for a tablet, the new tablet
server needs to reapply the mutations for that tablet
from the commit log written by the original tablet
server.
52
Commit-log
• To avoid duplicating log reads, the log entries are first sorted by
the key ⟨table, row name, log sequence number⟩.
• In the sorted output, all mutations for a particular
tablet are contiguous and can therefore be read
efficiently with one disk seek followed by a
sequential read.
• To parallelize the sorting, the log file is partitioned
into 64 MB segments, and each segment is sorted in parallel
on a different tablet server.
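The effect of the sort can be shown on a handful of entries. This sketch assumes log entries are tuples of (table, row, sequence number, mutation) and groups by table only, as a stand-in for grouping by tablet; the names are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Tuple

LogEntry = Tuple[str, str, int, str]  # (table, row, sequence number, mutation)


def sort_log(entries: List[LogEntry]) -> List[LogEntry]:
    """Sort by (table, row name, log sequence number)."""
    return sorted(entries, key=lambda e: (e[0], e[1], e[2]))


log = [
    ("users", "bob", 3, "set x"),
    ("web", "com.cnn.www", 1, "set y"),
    ("users", "alice", 2, "set z"),
    ("users", "bob", 1, "set w"),
]
sorted_log = sort_log(log)
# After sorting, each table's mutations form one contiguous run.
by_table: Dict[str, List[str]] = {
    table: [entry[3] for entry in group]
    for table, group in groupby(sorted_log, key=lambda e: e[0])
}
```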
53
Exploiting immutability
• The Bigtable system is simplified by the fact that all of the
SSTables generated are immutable.
• As a result, reads from SSTables need no synchronization.
• The only mutable data structure that is accessed by both
reads and writes is the memtable.
• The immutability of SSTables enables us to split tablets
quickly.
• Instead of generating a new set of SSTables for each child
tablet, we let the child tablets share the SSTables of the
parent tablet.
54
Lessons Learned
55
Lessons
• Distributed systems are vulnerable to many types of
failures: memory and network corruption, large clock
skew, hung machines, extended and asymmetric
network partitions, bugs in other systems.
• Delay adding new features until it is clear how they will be used.
56
Lessons
• A practical lesson learned from supporting Bigtable is
the importance of proper system-level monitoring.
• Clarity of Design. “Given both the size of our system
(about 100,000 lines of non-test code), as well as the
fact that code evolves over time in unexpected ways,
we have found that code and design clarity are of
immense help in code maintenance and debugging.”
57
Open Source Column Stores
58
Open Source Column Stores
• HBase (http://hbase.apache.org/)
• Hypertable (http://hypertable.org/)
• Cassandra (http://cassandra.apache.org/)
• Accumulo (http://accumulo.apache.org/)
• Parquet (http://parquet.io/)
59
HBase
Apache HBase™ is the Hadoop database. Use it when
you need random, real-time read/write access to your data.
• Automatic and configurable sharding of tables
• Automatic failover support between RegionServers
• Convenient base classes for backing Hadoop MapReduce jobs
with Apache HBase tables
• Easy-to-use Java API for client access
• To run distributed, it has to run on top of HDFS
• Integrated with MapReduce
60
HBase Architecture
61
Data Location, Assignment and
Serving
62
Coprocessors
63
https://blogs.apache.org/hbase/entry/coprocessor_introduction
Hypertable
• Hypertable is an open source database system
inspired by publications on the design of Google's
BigTable.
• Hypertable runs on top of a distributed file system
such as the Apache Hadoop DFS, GlusterFS, or the
Kosmos File System (KFS). It is written almost entirely
in C++.
64
http://hypertable.com/documentation/
Cassandra
• Big-Table extension:
– All nodes are similar.
– Can be used as a distributed hash-table, with an "SQL-like"
language, CQL (but no JOIN!)
• Data can have expiration (set on INSERT)
• Map/reduce possible with Apache Hadoop
• Rich data model (columns, composites, counters,
secondary indexes, map, set, list)
65
Cassandra Scalability
66
Reading from Cassandra
[Figure: read path. Components shown: client, row cache, key cache, memtable, SSTables]
Writing to Cassandra
[Figure: write path. Components shown: client, commit log (disk), memtable (memory), flush to SSTables; replication factor: 3]
CQL
69
When not to use
70
When not to use
• Column-oriented stores are not suitable for systems:
– that require ACID transactions for writes and reads;
– that need to aggregate data (SUM, AVG): the aggregation has to
be done on the client side;
– that are at an early stage: if the queries change, the
column design may have to change too.
71

More Related Content

What's hot

Cassandra overview
Cassandra overviewCassandra overview
Cassandra overview
Sean Murphy
 
Introducing Technologies for Handling Big Data by Jaseela
Introducing Technologies for Handling Big Data by JaseelaIntroducing Technologies for Handling Big Data by Jaseela
Introducing Technologies for Handling Big Data by Jaseela
Student
 

What's hot (20)

Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
NoSQL
NoSQLNoSQL
NoSQL
 
Cassandra overview
Cassandra overviewCassandra overview
Cassandra overview
 
Database replication
Database replicationDatabase replication
Database replication
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
 
Sql vs NoSQL-Presentation
 Sql vs NoSQL-Presentation Sql vs NoSQL-Presentation
Sql vs NoSQL-Presentation
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented Databases
 
Performance Tuning And Optimization Microsoft SQL Database
Performance Tuning And Optimization Microsoft SQL DatabasePerformance Tuning And Optimization Microsoft SQL Database
Performance Tuning And Optimization Microsoft SQL Database
 
Google Cloud Spanner Preview
Google Cloud Spanner PreviewGoogle Cloud Spanner Preview
Google Cloud Spanner Preview
 
Nosql data models
Nosql data modelsNosql data models
Nosql data models
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
NoSql
NoSqlNoSql
NoSql
 
Oracle Database Introduction
Oracle Database IntroductionOracle Database Introduction
Oracle Database Introduction
 
RDBMS vs NoSQL
RDBMS vs NoSQLRDBMS vs NoSQL
RDBMS vs NoSQL
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sql
 
Introduction to MongoDB and CRUD operations
Introduction to MongoDB and CRUD operationsIntroduction to MongoDB and CRUD operations
Introduction to MongoDB and CRUD operations
 
Object Oriented Database Management System
Object Oriented Database Management SystemObject Oriented Database Management System
Object Oriented Database Management System
 
Introducing Technologies for Handling Big Data by Jaseela
Introducing Technologies for Handling Big Data by JaseelaIntroducing Technologies for Handling Big Data by Jaseela
Introducing Technologies for Handling Big Data by Jaseela
 

Similar to 8. column oriented databases

Bigtable
BigtableBigtable
Bigtable
ptdorf
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
temp2004it
 

Similar to 8. column oriented databases (20)

Bigtable and Boxwood
Bigtable and BoxwoodBigtable and Boxwood
Bigtable and Boxwood
 
GOOGLE BIGTABLE
GOOGLE BIGTABLEGOOGLE BIGTABLE
GOOGLE BIGTABLE
 
The Google Bigtable
The Google BigtableThe Google Bigtable
The Google Bigtable
 
Google - Bigtable
Google - BigtableGoogle - Bigtable
Google - Bigtable
 
Google Big Table
Google Big TableGoogle Big Table
Google Big Table
 
Big table
Big tableBig table
Big table
 
Big table
Big tableBig table
Big table
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
Bigtable
Bigtable Bigtable
Bigtable
 
Bigtable_Paper
Bigtable_PaperBigtable_Paper
Bigtable_Paper
 
Bigtable
BigtableBigtable
Bigtable
 
Stateful streaming and the challenge of state
Stateful streaming and the challenge of stateStateful streaming and the challenge of state
Stateful streaming and the challenge of state
 
Dissecting Scalable Database Architectures
Dissecting Scalable Database ArchitecturesDissecting Scalable Database Architectures
Dissecting Scalable Database Architectures
 
Gcp data engineer
Gcp data engineerGcp data engineer
Gcp data engineer
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 

More from Fabio Fumarola

6 Data Modeling for NoSQL 2/2
6 Data Modeling for NoSQL 2/26 Data Modeling for NoSQL 2/2
6 Data Modeling for NoSQL 2/2
Fabio Fumarola
 
5 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/25 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/2
Fabio Fumarola
 
3 Git
3 Git3 Git
2 Linux Container and Docker
2 Linux Container and Docker2 Linux Container and Docker
2 Linux Container and Docker
Fabio Fumarola
 
1. Introduction to the Course "Designing Data Bases with Advanced Data Models...
1. Introduction to the Course "Designing Data Bases with Advanced Data Models...1. Introduction to the Course "Designing Data Bases with Advanced Data Models...
1. Introduction to the Course "Designing Data Bases with Advanced Data Models...
Fabio Fumarola
 
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
Fabio Fumarola
 

More from Fabio Fumarola (20)

11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
10b. Graph Databases Lab
10b. Graph Databases Lab10b. Graph Databases Lab
10b. Graph Databases Lab
 
10. Graph Databases
10. Graph Databases10. Graph Databases
10. Graph Databases
 
9b. Document-Oriented Databases lab
9b. Document-Oriented Databases lab9b. Document-Oriented Databases lab
9b. Document-Oriented Databases lab
 
8b. Column Oriented Databases Lab
8b. Column Oriented Databases Lab8b. Column Oriented Databases Lab
8b. Column Oriented Databases Lab
 
8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker
 
8. key value databases laboratory
8. key value databases laboratory 8. key value databases laboratory
8. key value databases laboratory
 
6 Data Modeling for NoSQL 2/2
6 Data Modeling for NoSQL 2/26 Data Modeling for NoSQL 2/2
6 Data Modeling for NoSQL 2/2
 
5 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/25 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/2
 
3 Git
3 Git3 Git
3 Git
 
2 Linux Container and Docker
2 Linux Container and Docker2 Linux Container and Docker
2 Linux Container and Docker
 
1. Introduction to the Course "Designing Data Bases with Advanced Data Models...
1. Introduction to the Course "Designing Data Bases with Advanced Data Models...1. Introduction to the Course "Designing Data Bases with Advanced Data Models...
1. Introduction to the Course "Designing Data Bases with Advanced Data Models...
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Hbase an introduction
Hbase an introductionHbase an introduction
Hbase an introduction
 
An introduction to maven gradle and sbt
An introduction to maven gradle and sbtAn introduction to maven gradle and sbt
An introduction to maven gradle and sbt
 
Develop with linux containers and docker
Develop with linux containers and dockerDevelop with linux containers and docker
Develop with linux containers and docker
 
Linux containers and docker
Linux containers and dockerLinux containers and docker
Linux containers and docker
 
08 datasets
08 datasets08 datasets
08 datasets
 
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
 

Recently uploaded

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 

Recently uploaded (20)

Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 

8. column oriented databases

  • 1. Column-Oriented Databases In Depth Ciao ciao Vai a fare ciao ciao Dr. Fabio Fumarola
  • 2. Outline • Column-Oriented Introduction • Bigtable: – Features – Data Model – Rows and Column Family – Timestamp – API – Implementation • Other Open Source solutions 2
  • 3. Column-Oriented Introduction • We analyzed that key-value database are simple hash table, where: – All the access are via primary keys and return an object as result. • Column-Oriented databases are motivated by the necessity to model something more than object- values. 3
  • 4. Bigtable • It is used pervasively at Google in several projects: – Google Analytics, Google Finance, Google Plus, – Personalized Search, Google Docs, and Google Maps. • These products use Bigtable for a variety of demanding workloads ranging from: – Batch-oriented processing to – Low-latency serving of data to end users
  • 5. Bigtable: Main Characteristics • It does not support a full relational data model • It provides clients with a simple data model that – Supports dynamic control over data layout and format – Allows clients to reason about the locality of the data in the underlying storage • Data is indexed using row and column names that can be arbitrary strings • Data is stored as uninterpreted strings, into which clients can serialize several kinds of data.
  • 6. Data Model • A Bigtable is a sparse, distributed and persistent multi-dimensional sorted map. • The map is indexed at the first level by a row key • And at the second and third levels by a column key and a timestamp respectively. • Each value in the map is an array of bytes: (row:string, column:string, time:int64) → string
  • 8. Data Model • They settled on this data model after reasoning about potential uses of Bigtable. • The example that drove their decision was storing crawl results and enabling their analysis. • The main goal is to “keep a copy of a large collection of web pages and related information that could be used by many different projects” (Cache).
  • 9. Webtable example • URLs are used as row keys, • various aspects of web pages are used as column names, • and the contents of the web pages are stored in the contents: column under the timestamps when they were fetched.
  • 10. Data Model: Rows • The row keys are arbitrary strings (up to 64KB) • Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written). • Data is maintained in lexicographic order by row key. • Row keys are grouped into contiguous ranges, and the table is dynamically partitioned by row range
  • 11. Data Model: Rows • Each row range is called a tablet. • Tablets are the unit of distribution and load balancing. • As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. • For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs.
  • 12. Data Model: Column Families • Column keys are grouped into sets called column families. • All data stored in a column family is usually of the same type (values of the same family are compressed together). • The design philosophy is to have a low number of column families (< 101). • In contrast, a table may have an unbounded number of columns.
  • 13. Data Model: Column Families • A column key is named using the following syntax: family:qualifier. • Column family names must be printable, but qualifiers may be arbitrary strings. • An example of a column family is language • It uses only one column key in the language family, and it stores each web page’s language ID.
  • 14. Data Model: Column Families • Another useful column family for this table is anchor; • Each column key in this family represents a single anchor • The qualifier is the name of the referring site; the cell content is the link text.
  • 15. Timestamps • Each cell in a Bigtable can contain multiple versions of the same data; • These versions are indexed by timestamp in microseconds. • Bigtable timestamps are 64-bit integers. • Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.
  • 16. Timestamps • It typically keeps the 3 most recent versions of the same data • Stale data is garbage-collected automatically • In the Webtable example, the timestamps of the crawled pages stored in the contents: column are set to the times at which those page versions were crawled.
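The data model of slides 6 to 16 can be sketched as a small in-memory map. This is only an illustration under stated assumptions, not Bigtable's implementation: the names `SparseTable`, `row_key_for` and `MAX_VERSIONS` are hypothetical, and the limit of three versions simply mirrors the Webtable example.

```python
from collections import defaultdict

MAX_VERSIONS = 3  # assumption: keep only the three newest versions per cell


def row_key_for(host: str) -> str:
    """Reverse hostname components so pages of one domain sort together."""
    return ".".join(reversed(host.split(".")))


class SparseTable:
    """Toy (row, column, timestamp) -> bytes map; hypothetical, for illustration."""

    def __init__(self):
        # row -> column -> list of (timestamp, value), newest first
        self._cells = defaultdict(lambda: defaultdict(list))

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        versions = self._cells[row][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # decreasing timestamps
        del versions[MAX_VERSIONS:]  # garbage-collect stale versions

    def get(self, row: str, column: str):
        """Return the most recent value, or None if the cell is empty."""
        versions = self._cells.get(row, {}).get(column, [])
        return versions[0][1] if versions else None

    def versions(self, row: str, column: str):
        """Return all retained (timestamp, value) pairs, newest first."""
        return list(self._cells.get(row, {}).get(column, []))

    def rows(self):
        """Row keys in lexicographic order, as the table would store them."""
        return sorted(self._cells)
```

Storing a page fetched from maps.google.com under the row key `com.google.maps` places it next to the other google.com pages, which is what makes short row-range reads cheap.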
  • 17. API • Supported operations: – put(key, columnFamily, columnQualifier, value) – get(key) – scan(startKey, endKey) – delete(key) • It also provides functions for changing cluster, table, and column family metadata, such as access control rights.
  • 18. API: Get & Scan • A Get is simply a Scan limited by the API to one row. • A Scan fetches zero or more rows of a table. • By default, a Scan reads the entire table from start to end. – Scan() – Scan(byte[] startRow) – Scan(byte[] startRow, byte[] stopRow)
  • 19. API • A scan can be limited in several ways: – Start and stop row – Caching size: to limit the number of rows – Batch size: to limit the number of columns – Max result size: to limit the number of bytes – Setting the column names and qualifiers • Bigtable can be used with MapReduce, a framework for running large-scale parallel computations developed at Google.
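The four operations of slide 17, and the remark that a Get is a Scan limited to one row, can be illustrated with a sorted in-memory table. This is a minimal sketch and not the Bigtable or HBase client API: `ToyTable` and its method signatures are hypothetical, and the stop key is assumed to be exclusive.

```python
import bisect


class ToyTable:
    """In-memory stand-in for put/get/scan/delete (hypothetical names)."""

    def __init__(self):
        self._keys = []   # row keys, kept in lexicographic order
        self._rows = {}   # row key -> {"family:qualifier": value}

    def put(self, key, column_family, column_qualifier, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = {}
        self._rows[key][f"{column_family}:{column_qualifier}"] = value

    def scan(self, start_key=None, end_key=None):
        """Yield (key, columns) for start_key <= key < end_key, in key order."""
        lo = 0 if start_key is None else bisect.bisect_left(self._keys, start_key)
        hi = len(self._keys) if end_key is None else bisect.bisect_left(self._keys, end_key)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])

    def get(self, key):
        """A get is a scan restricted to exactly one row."""
        for _, columns in self.scan(key, key + "\x00"):
            return columns
        return None

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))
```

Because the keys are sorted, a scan over the prefix `com.` only touches the contiguous slice of rows for that prefix instead of the whole table.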
  • 20. Building Blocks • Google File System (GFS) • Google SSTable • Distributed lock service called Chubby (ZooKeeper)
  • 21. Google File System • Bigtable uses the distributed Google File System (GFS) to store log and data files. http://static.googleusercontent.com/media/research.google.com/it//archive/gfs-sosp2003.pdf
  • 22. Google SSTable • The SSTable file format is used internally to store Bigtable data. • An SSTable provides a persistent, ordered, immutable map from keys to values. • Provided operations are: – Look up the value associated with a key – Iterate over all key/value pairs in a specified key range • Internally, each SSTable contains a sequence of blocks (typically 64KB in size)
  • 23. Google SSTable • A block index (stored at the end of the SSTable) is used to locate blocks. • The index is loaded into memory when the SSTable is opened. • A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then read the appropriate block from disk.
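The single-seek lookup of slide 23 can be sketched as follows. The block layout is an assumption made for illustration: blocks hold two entries each instead of roughly 64KB, the index stores the last key of every block, and `ToySSTable` and `block_reads` are hypothetical names.

```python
import bisect

BLOCK_SIZE = 2  # entries per block; a toy stand-in for ~64KB blocks


class ToySSTable:
    """Immutable sorted key/value file modelled as a list of blocks."""

    def __init__(self, items):
        pairs = sorted(items.items())
        self._blocks = [pairs[i:i + BLOCK_SIZE] for i in range(0, len(pairs), BLOCK_SIZE)]
        # In-memory block index: the last key stored in each block.
        self._index = [block[-1][0] for block in self._blocks]
        self.block_reads = 0  # counts simulated disk reads

    def _read_block(self, n):
        self.block_reads += 1
        return self._blocks[n]

    def lookup(self, key):
        # Binary search in the index: first block whose last key is >= key.
        n = bisect.bisect_left(self._index, key)
        if n == len(self._blocks):
            return None  # key is larger than every stored key, no disk read
        for k, v in self._read_block(n):
            if k == key:
                return v
        return None

    def scan(self, start, end):
        """Iterate over pairs with start <= key < end."""
        n = bisect.bisect_left(self._index, start)
        while n < len(self._blocks):
            for k, v in self._read_block(n):
                if k >= end:
                    return
                if k >= start:
                    yield k, v
            n += 1
```

A point lookup reads at most one block, which is the in-memory analogue of the single disk seek described above.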
  • 24. Chubby • It is a highly-available and persistent distributed lock service. • It consists of five active replicas, one of which is elected as master. • Chubby provides a namespace that consists of directories and small files. http://static.googleusercontent.com/media/research.google.com/it//archive/chubby-osdi06.pdf
  • 25. Chubby and Bigtable • Bigtable uses Chubby for a variety of tasks: – to ensure that there is at most one active master at any time; – to store the bootstrap location of Bigtable data; – to discover tablet servers and finalize tablet server deaths; – to store Bigtable schema information (the column family information for each table); – and to store access control lists.
  • 26. Implementation • It is composed of three major components: – a library that is linked into every client, – one master server, – and many tablet servers • Tablet servers can be dynamically added to (or removed from) a cluster to accommodate changes in workloads.
  • 27. Implementation: Master • It is responsible for: – assigning tablets to tablet servers, – detecting the addition and expiration of tablet servers, – balancing tablet-server load, – and garbage collection of files in GFS. • In addition, it handles schema changes such as table and column family creation.
  • 28. Implementation: Tablet Server • Each tablet server manages a set of tablets (typically somewhere between ten and a thousand tablets per tablet server). • The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large.
  • 29. Implementation: Clients • Clients communicate directly with tablet servers for reads and writes. • Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master. • As a result, the master is lightly loaded in practice (in contrast to HBase).
  • 30. Implementation: Cluster • A Bigtable cluster stores a number of tables. • Each table consists of a set of tablets, and each tablet contains all data associated with a row range. • Initially, each table consists of just one tablet. As a table grows, it is automatically split into multiple tablets, each approximately 100-200 MB in size by default.
  • 32. Tablet Location • A three-level hierarchy, similar to a B+ tree, is used to store tablet location information
  • 33. Tablet Location • The first level is a file stored in Chubby that contains the location of the root tablet. • The root tablet contains the location of all tablets in a special METADATA table.
  • 34. Tablet Location • At the second level, each METADATA tablet contains the location of a set of user tablets. • The METADATA table stores the location of a tablet under a row key that encodes the tablet's table identifier and its end row.
  • 35. Tablet Location • The third level contains the data. • The client library caches tablet locations. • If the client does not know the location of a tablet, or discovers that a cached location is incorrect, it recursively moves up the tablet location hierarchy.
  • 36. Tablet Location • They also store secondary information in the METADATA table, including a log of all events pertaining to each tablet. • This information is helpful for debugging and performance analysis.
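The three-level lookup of slides 32 to 35 can be sketched with sorted range maps. Everything here is a simplification under stated assumptions: `RangeMap` and `LocationClient` are hypothetical names, a dictionary stands in for the Chubby file, tablets are identified only by their end row, and the cache is keyed by row rather than by tablet range.

```python
import bisect


class RangeMap:
    """Sorted (end_row, target) entries: the tablet covering `row` is the
    first entry whose end_row is >= row."""

    def __init__(self, entries):
        self._entries = sorted(entries, key=lambda e: e[0])
        self._ends = [end for end, _ in self._entries]

    def find(self, row):
        i = bisect.bisect_left(self._ends, row)
        return self._entries[i][1]


class LocationClient:
    """Client-side lookup with a location cache (toy model)."""

    def __init__(self, chubby_file):
        self._chubby_file = chubby_file  # level 1: points at the root tablet
        self._cache = {}                 # row -> tablet server
        self.lookups = 0                 # number of full hierarchy walks

    def locate(self, row):
        if row in self._cache:
            return self._cache[row]
        self.lookups += 1
        root_tablet = self._chubby_file["root"]   # level 1: Chubby file
        metadata_tablet = root_tablet.find(row)   # level 2: METADATA tablet
        server = metadata_tablet.find(row)        # level 3: user tablet location
        self._cache[row] = server
        return server
```

The cache is what keeps most reads from touching the hierarchy at all: only a cold or stale entry triggers the walk from the Chubby file downwards.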
  • 37. Tablet Assignment and Serving
  • 38. Tablet Assignment • Each tablet is assigned to one tablet server at a time. • The master keeps track of the set of live tablet servers and of the current assignment of tablets to tablet servers, including which tablets are unassigned. • When a tablet is unassigned, and a tablet server with sufficient room for the tablet is available, the master assigns the tablet by sending a tablet load request to the tablet server.
  • 39. Tablet Assignment • Bigtable uses Chubby to keep track of tablet servers. • When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory • When a tablet server stops serving its tablets, the master is responsible for detecting this and reassigning those tablets as soon as possible.
  • 40. Tablet Serving • The persistent state of a tablet is stored in GFS • Updates are committed to a commit log that stores redo records • The memtable (in memory) stores the recently committed updates • The SSTables store the older updates
  • 41. Tablet Recovery • It is done by a tablet server reading the tablet's metadata from the METADATA table. • This metadata contains the list of SSTables that comprise the tablet and a set of redo points, which are pointers into the commit logs that may contain data for the tablet. • The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.
  • 42. Tablet Serving • When a read operation arrives at a tablet server, the operation is executed on a merged view of the sequence of SSTables and the memtable. • This operation is efficient because SSTables and the memtable are lexicographically sorted data structures. • Incoming read and write operations can continue while tablets are split and merged.
  • 43. Compactions • As write operations execute, the size of the memtable increases. • When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an (immutable) SSTable and written to GFS. • There are three types of compactions: Minor, Merging, and Major.
  • 44. Minor Compactions • A minor compaction has two goals: • it shrinks the memory usage of the tablet server, • it reduces the amount of data that has to be read from the commit log during recovery if this server dies. • Incoming read and write operations can continue while compactions occur.
  • 45. Merging Compactions • Every minor compaction generates a new SSTable. • Read operations might need to merge updates from an arbitrary number of SSTables. • The number of SSTables is bounded by periodically executing a merging compaction in the background. • A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable.
  • 46. Major Compaction • It is a compaction that rewrites all SSTables into exactly one SSTable. • SSTables produced by non-major compactions can contain special deletion entries. • A major compaction, on the other hand, produces an SSTable that contains no deletion information or deleted data. • These major compactions allow Bigtable to reclaim resources used by deleted data.
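The three compaction types of slides 43 to 46 can be sketched on a toy tablet. This is an illustrative model under stated assumptions rather than Bigtable's code: `ToyTablet`, `TOMBSTONE` and the entry-count threshold are hypothetical, plain dictionaries stand in for SSTables, and the merging compaction is assumed to pick the newest SSTables.

```python
TOMBSTONE = object()  # deletion marker, kept until a major compaction


class ToyTablet:
    def __init__(self, memtable_limit=2):
        self.memtable = {}      # mutable, receives all writes
        self.sstables = []      # immutable dicts, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def delete(self, key):
        self.write(key, TOMBSTONE)

    def minor_compaction(self):
        """Freeze the memtable and write it out as a new SSTable."""
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def merging_compaction(self, count=2):
        """Merge the `count` newest SSTables and the memtable into one SSTable."""
        merged = {}
        for table in self.sstables[-count:] + [self.memtable]:
            merged.update(table)  # newer entries override older ones
        self.sstables[-count:] = [dict(sorted(merged.items()))]
        self.memtable = {}

    def major_compaction(self):
        """Rewrite everything into one SSTable and drop deleted data."""
        self.merging_compaction(count=len(self.sstables))
        live = {k: v for k, v in self.sstables[0].items() if v is not TOMBSTONE}
        self.sstables = [live]

    def read(self, key):
        """Read from a merged view: memtable first, then newest SSTable first."""
        for table in [self.memtable] + self.sstables[::-1]:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None
```

Note that the deletion marker survives the merging compaction, because an older SSTable outside the merge might still hold the deleted key; only the major compaction, which sees all data, can drop it safely.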
  • 48. Locality groups • Clients can group multiple column families together into a locality group. • A separate SSTable is generated for each locality group in each tablet. • Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads.
  • 49. Compression • A user-specified compression format is applied to each SSTable block. • A commonly used approach is a two-pass scheme: – The first pass uses Bentley and McIlroy’s scheme, which compresses long common strings across a large window – The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data. • This scheme applied to Webtable achieved a 10-to-1 reduction in space.
  • 50. Caching for read • To improve read performance, tablet servers use two levels of caching. • The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. • The Block Cache is a lower-level cache that caches SSTable blocks that were read from GFS.
  • 51. Bloom filters • The number of disk accesses is reduced by allowing clients to specify that Bloom filters should be created for SSTables. • A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. • http://billmill.org/bloomfilter-tutorial/
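How a Bloom filter saves disk accesses (slide 51) can be sketched as below. This is a generic Bloom filter, not the one used inside Bigtable: the class and function names, the SHA-256-based hashing, the filter size and the `row/column` key encoding are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive num_hashes positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))


def read_cell(sstables, row, column):
    """Check (filter, data) pairs in order, skipping SSTables whose filter
    rules the key out. Returns (value, number of simulated disk reads)."""
    key = f"{row}/{column}"
    disk_reads = 0
    for bloom, data in sstables:
        if not bloom.might_contain(key):
            continue  # no disk access needed for this SSTable
        disk_reads += 1
        if key in data:
            return data[key], disk_reads
    return None, disk_reads
```

A filter never produces false negatives, so skipping an SSTable is always safe; a false positive only costs one unnecessary read.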
  • 52. Commit-log • All write and delete operations are persisted to a single commit log per tablet server, shared by all the tablets it serves. • Using one log provides significant performance benefits during normal operation, but it complicates recovery. • To recover the state for a tablet, the new tablet server needs to reapply the mutations for that tablet from the commit log written by the original tablet server.
  • 53. Commit-log • To avoid duplicating log reads, the commit log entries are first sorted by the key ⟨table, row name, log sequence number⟩. • In the sorted output, all mutations for a particular tablet are contiguous and can therefore be read efficiently with one disk seek followed by a sequential read. • To parallelize the sorting, the log file is partitioned into 64 MB segments, and each segment is sorted in parallel on a different tablet server.
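The effect of sorting the shared commit log (slide 53) can be sketched as follows. The tuple layout `(table, row, sequence_number, mutation)` and the function names are assumptions made for illustration, and a tablet is identified by a half-open row range.

```python
import bisect
from operator import itemgetter


def sort_commit_log(entries):
    """Sort entries by (table, row name, log sequence number)."""
    return sorted(entries, key=itemgetter(0, 1, 2))


def tablet_slice(sorted_log, table, start_row, end_row):
    """Return the contiguous run of entries for one tablet, that is the rows
    of `table` with start_row <= row < end_row, found by binary search."""
    keys = [(t, r) for t, r, _, _ in sorted_log]
    lo = bisect.bisect_left(keys, (table, start_row))
    hi = bisect.bisect_left(keys, (table, end_row))
    return sorted_log[lo:hi]
```

After sorting, a recovering tablet server reads one contiguous slice in sequence order instead of filtering the whole interleaved log.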
  • 54. Exploiting immutability • The Bigtable system is simplified by the fact that all of the SSTables generated are immutable. • For example, no synchronization of accesses is needed when reading from SSTables. • The only mutable data structure that is accessed by both reads and writes is the memtable. • The immutability of SSTables enables us to split tablets quickly. • Instead of generating a new set of SSTables for each child tablet, we let the child tablets share the SSTables of the parent tablet.
  • 56. Lessons • Distributed systems are vulnerable to many types of failures: memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems. • Add new features only when it is clear how they will be used.
  • 57. Lessons • A practical lesson learned from supporting Bigtable is the importance of proper system-level monitoring. • Clarity of Design. “Given both the size of our system (about 100,000 lines of non-test code), as well as the fact that code evolves over time in unexpected ways, we have found that code and design clarity are of immense help in code maintenance and debugging.”
  • 58. Open Source Column Stores
  • 59. Open Source Column Stores • HBase (http://hbase.apache.org/) • Hypertable (http://hypertable.org/) • Cassandra (http://cassandra.apache.org/) • Accumulo (http://accumulo.apache.org/) • Parquet (http://parquet.io/)
  • 60. HBase • Apache HBase™ is the Hadoop database to use when you need random, realtime read/write access to your data. • Automatic and configurable sharding of tables • Automatic failover support between RegionServers. • Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables. • Easy-to-use Java API for client access. • To run in distributed mode, it has to run on top of HDFS • Integrated with MapReduce
  • 62. Data Location, Assignment and Serving
  • 64. Hypertable • Hypertable is an open source database system inspired by publications on the design of Google's Bigtable. • Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++. http://hypertable.com/documentation/
  • 65. Cassandra • Bigtable extension: – All nodes are similar. – Can be used as a distributed hash table, with an "SQL-like" language, CQL (but no JOIN!) • Data can have an expiration (set on INSERT) • Map/reduce possible with Apache Hadoop • Rich data model (columns, composites, counters, secondary indexes, map, set, list)
  • 68. Writing to Cassandra • [Figure: a client write goes to the commit log (disk) and to the memtable (memory); the memtable is flushed to SSTables; replication factor: 3]
  • 70. When not to use
  • 71. When not to use • Column-oriented databases are not suitable for systems: – that require ACID transactions for writes and reads, – that need to aggregate data (SUM and AVG), since the aggregation has to be done on the client side, – that are at an early stage: if the queries change, the column design may have to change too.

Editor's Notes

  1. Gossip protocol
  2. Commit log for durability: sequential write. Memtable: no disk access (no reads or seeks). SSTables are written sequentially to the disk. The operational design integrates nicely with the operating system page cache. Because Cassandra does not modify the data, dirty pages that would have to be flushed are not even generated.