1. 10: Taxonomy of Data and Storage
Zubair Nabi
zubair.nabi@itu.edu.pk
April 20, 2013
Zubair Nabi 10: Taxonomy of Data and Storage April 20, 2013 1 / 27
2. Outline
1 Datasets
2 Storage
3 Beyond RDBMS
4 NoSQL Taxonomy
5 NewSQL
10. Introduction
Data is everywhere and is the driving force behind our lives
The address book on your phone is data
So is the newspaper that you read every morning
Everything you see around you is a potential source of data that
might be useful for some application
We use this data to share information and make more informed
decisions about different events
Datasets can be classified on the basis of their structure:
1 Structured
2 Unstructured
3 Semi-structured
16. Structured Data
Formatted in a universally understandable and identifiable way
In most cases, structured data is formally specified by a schema
Your phone's address book is structured because it has a schema
consisting of name, phone number, address, email address, etc.
Most traditional databases contain structured data revolving around
data laid out across columns and rows
Each field also has an associated type
Possible to search for items based on their data types
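A minimal sketch of what a schema provides, using the address book example. The Contact class, its field names, and the find_by helper are illustrative assumptions, not part of the lecture.

```python
from dataclasses import dataclass, fields

@dataclass
class Contact:
    # Hypothetical schema: every record has the same named, typed fields.
    name: str
    phone: str
    email: str
    age: int

def find_by(contacts, field_name, value):
    """Return contacts whose given field equals value (the field must be in the schema)."""
    valid = {f.name for f in fields(Contact)}
    if field_name not in valid:
        raise KeyError(f"unknown field: {field_name}")
    return [c for c in contacts if getattr(c, field_name) == value]

book = [
    Contact("Ayesha", "555-0101", "ayesha@example.com", 31),
    Contact("Bilal", "555-0102", "bilal@example.com", 27),
]
matches = find_by(book, "age", 27)
schema = {f.name: f.type for f in fields(Contact)}
```

Because the schema is known up front, a lookup on an unknown field can be rejected before any data is scanned.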
20. Unstructured Data
Data without any conceptual definition or type
Can vary from raw text to binary data
Processing unstructured data requires parsing and tagging on the fly
In most cases, consists of simple log files
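Parsing and tagging on the fly can be sketched as follows. The log line format, the regular expression, and the tag function are assumptions made for illustration only.

```python
import re

# Hypothetical log format: "<YYYY-MM-DD> <HH:MM:SS> <LEVEL> <free-text message>"
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.*)$")

def tag(line):
    """Parse one raw line into tagged fields, or return None if it does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    date, time, level, message = m.groups()
    return {"date": date, "time": time, "level": level, "message": message}

raw = [
    "2013-04-20 10:15:02 ERROR disk /dev/sda1 is full",
    "garbage without any structure",
    "2013-04-20 10:15:07 INFO retrying write",
]
records = [r for r in map(tag, raw) if r is not None]
```

The structure lives in the parser rather than in the data, so a different application could apply a different pattern to the same raw lines.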
23. Semi-structured Data
Occupies the middle of the spectrum between structured and
unstructured data
For instance, while raw binary data has no structure, audio and video
files carry structured metadata, such as author and time of creation
Can also be described as having a self-describing structure
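A small sketch of a self-describing record: structured metadata wrapped around an opaque payload. The field names and values are hypothetical, and JSON is only one possible encoding.

```python
import json

# Hypothetical media record: structured metadata around an opaque binary payload.
record = {
    "metadata": {"author": "A. Author", "created": "2013-04-20", "format": "audio/wav"},
    "payload_hex": bytes([0x52, 0x49, 0x46, 0x46]).hex(),  # stand-in for raw audio bytes
}
text = json.dumps(record)

# Self-describing: a reader discovers the field names from the data itself,
# without consulting an external schema.
decoded = json.loads(text)
field_names = sorted(decoded["metadata"])
payload = bytes.fromhex(decoded["payload_hex"])
```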
29. Database Management Systems (DBMS)
Used to store and manage data
Support for large amounts of data
Ensure concurrency, sharing, and locking
Provide security, to enable fine-grained access control
Ability to keep working in the face of failure
35. Relational Database Management Systems (RDBMS)
The most popular and predominant storage system in use
Data in different files is connected by using a key field
Data is laid out in different tables, with a key field that identifies each
row
The same key field is used to connect one table to another
For instance, one table might have the customer ID as key and the
customer's details as data; another table might have the same key but
different data, say the customer's purchases; yet another table with
the same key might hold a breakdown of the customer's preferences
Examples include Oracle Database, MS SQL Server, MySQL, IBM
DB2, and Teradata
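The customer example can be sketched with Python's built-in sqlite3 module standing in for a full RDBMS. The table names, column names, and rows are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (customer_id INTEGER REFERENCES customers(customer_id), item TEXT);
    INSERT INTO customers VALUES (1, 'Sara'), (2, 'Hina');
    INSERT INTO purchases VALUES (1, 'laptop'), (1, 'mouse'), (2, 'phone');
""")

# The shared key field (customer_id) connects a row in one table to rows in another.
rows = conn.execute("""
    SELECT c.name, p.item
    FROM customers AS c JOIN purchases AS p ON c.customer_id = p.customer_id
    WHERE c.customer_id = ?
    ORDER BY p.item
""", (1,)).fetchall()
```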
40. Structured Query Language (SQL)
Non-procedural language used for data retrieval and manipulation in
RDBMS
Adds a layer of abstraction over relational algebra, which enables set
operations, selections, etc.
Due to its declarative nature, users operate in terms of their expected
output while the underlying system decides the actual query execution
plan
Instructions consist of a specific SQL statement and additional
parameters and operands
For instance, the SELECT statement retrieves records that match a
condition, INSERT adds a record, and so on
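A short sketch of the declarative style, again using sqlite3 as a stand-in: the query states what is wanted, and the engine can be asked which plan it chose. The contacts table and its rows are hypothetical, and the plan output differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [("Sara", "Lahore"), ("Hina", "Karachi"), ("Omar", "Lahore")],
)

# The statement describes the expected output, not how to compute it.
query = "SELECT name FROM contacts WHERE city = ? ORDER BY name"
names = [row[0] for row in conn.execute(query, ("Lahore",))]

# Ask the engine how it intends to execute the same query.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("Lahore",)).fetchall()
```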
46. RDBMS and Structured Data
As structured data follows a predefined schema, it naturally maps on to
a relational database system
The schema defines the type and structure of the data and its relations
Schema design is an arduous process and needs to be done before
the database can be populated
Another consequence of a strict schema is that it is non-trivial to
extend it
For instance, adding a new attribute to an existing row necessitates
adding a new column to the entire table
This can be extremely expensive for tables with millions of rows
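The schema-extension problem can be sketched with sqlite3. The table and the twitter column are hypothetical, and the actual cost of adding a column depends on the database engine, so this only shows the logical effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers (name) VALUES (?)", [("Sara",), ("Hina",), ("Omar",)])

# Only one customer has this attribute, but the column is added to the whole table.
conn.execute("ALTER TABLE customers ADD COLUMN twitter TEXT")
conn.execute("UPDATE customers SET twitter = ? WHERE name = ?", ("@sara", "Sara"))

rows = conn.execute("SELECT name, twitter FROM customers ORDER BY customer_id").fetchall()
```

Every row now carries the new attribute, even though it is empty (NULL) for all customers but one.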
53. RDBMS and Semi- and Un-structured Data
Unstructured data has no notion of schema while semi-structured data
only has a weak one
Data within such datasets also has an associated type
In fact, types are application-centric: It might be possible to interpret a
field as a float in one application and as a string in another
While it is possible, with human intervention, to glean structure from
unstructured data, it is an extremely expensive task
Structureless data generated by real-time sources can change the
number of attributes and their types on the fly
An RDBMS would require the creation of a new table each time such a
change takes place
Therefore, unstructured and semi-structured data do not fit the
relational model
62. Motivation
Different semantics:
RDBMS provide ACID semantics:
1 Atomic: The entire transaction either succeeds or fails
2 Consistent: Data within the database remains consistent after each
transaction
3 Isolated: Transactions are sandboxed from each other
4 Durable: Transactions are persistent across failures and restarts
Overkill for most user-facing applications
Most applications are more interested in availability and are willing
to sacrifice strict consistency, settling for eventual consistency
This basically available, soft state, eventually consistent (BASE) model
enables applications to function even in the face of partial failure
High Throughput: Most NoSQL databases sacrifice consistency for
availability, which leads to higher throughput (in some cases by an
order of magnitude)
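Atomicity, the first ACID property, can be sketched with sqlite3. The accounts table, the CHECK constraint, and the transfer helper are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (owner TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE owner = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE owner = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is rolled back
balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
```

In the failed transfer the credit to bob has already been applied when the debit violates the constraint, and the rollback undoes it, so no partial transfer is ever visible.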
65. Motivation (2)
Horizontal Scalability: To cater for more data, NoSQL stores can be
scaled out by just adding more machines, and the underlying system
automatically redistributes the data
Commodity Hardware: A large number of RDBMS require specialized
and proprietary hardware for operation. In contrast, NoSQL databases
function over commodity off-the-shelf hardware
Programming Language Support: Over the years programming
languages have started providing abstractions for database support
(LINQ, etc.) while bypassing SQL. NoSQL databases provide
abstractions that directly map onto the language abstractions leading
to tighter coupling
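Redistribution when a machine is added can be sketched with simple hash placement. The node names, key format, and modulo scheme are assumptions; this is not the placement algorithm of any particular store.

```python
import hashlib

def owner(key, nodes):
    """Pick the node responsible for key using a stable hash (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

def place(keys, nodes):
    """Map every node to the list of keys it stores."""
    layout = {node: [] for node in nodes}
    for key in keys:
        layout[owner(key, nodes)].append(key)
    return layout

keys = [f"user:{i}" for i in range(1000)]
old_nodes = ["node-a", "node-b", "node-c"]
new_nodes = old_nodes + ["node-d"]  # one machine added

before = place(keys, old_nodes)
after = place(keys, new_nodes)
moved = sum(1 for k in keys if owner(k, old_nodes) != owner(k, new_nodes))
```

Modulo placement moves a large share of the keys whenever the node count changes. Systems such as Dynamo use consistent hashing to keep that movement small.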
66. Motivation (3)
The Rise of Cloud Computing: Cloud Computing applications require
horizontal scalability and low administration overhead. Both
requirements are naturally satisfied by NoSQL stores
71. Introduction
NoSQL databases can be classified on the basis of:
1 Data Model: How data is represented
2 Scalability: How scalable the system is
3 Query Model: What type of API it exposes
4 Persistence: How persistent the data is
74. Classification by Data Model
Based on the data model, NoSQL databases can roughly be divided
into three categories:
1 Key/value Stores: A map/dictionary allowing put/get semantics per
key
2 Document Stores: Complex data structures to encapsulate document
key/value pairs
3 Column-Oriented Stores: Data laid out by column
79. Key/value Stores
Data is stored within a large hash map
Simple get/put API
Favour scalability over consistency
Limit on the size of the key
Examples include Amazon’s Dynamo, LinkedIn’s Voldemort, Redis,
and Memcached
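The get/put model can be sketched as a toy in-memory store. The class name, the 64-byte key limit, and the sample key are illustrative assumptions and do not reflect the API of any of the systems listed above.

```python
class KeyValueStore:
    """Toy in-memory key/value store with a put/get API and a key-size limit."""

    def __init__(self, max_key_bytes=64):  # the limit is an arbitrary illustrative value
        self._data = {}  # a hash map from key to opaque value
        self._max_key_bytes = max_key_bytes

    def put(self, key, value):
        if len(key.encode("utf-8")) > self._max_key_bytes:
            raise ValueError("key too large")
        self._data[key] = value  # the store never inspects the value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:42", b'{"name": "Sara"}')
value = store.get("user:42")
missing = store.get("user:43")
```

Values are opaque to the store, so any lookup other than by exact key has to be done by the application.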
84. Document Stores
Key/value semantics but based on documents
A document encapsulates data in a standard format, such as JSON,
XML, PDF, etc.
Documents themselves can be heterogeneous
Documents can also be retrieved based on their content
Examples include Apache CouchDB and MongoDB
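A toy document store shows heterogeneous documents and retrieval by content. The DocumentStore class and its find method are hypothetical and are not the CouchDB or MongoDB API.

```python
import json

class DocumentStore:
    """Toy document store: JSON documents stored by id, plus lookup by content."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, document):
        self._docs[doc_id] = json.dumps(document)  # kept in a standard format

    def get(self, doc_id):
        return json.loads(self._docs[doc_id])

    def find(self, **criteria):
        """Return ids of documents whose top-level fields match all criteria."""
        hits = []
        for doc_id, raw in self._docs.items():
            doc = json.loads(raw)
            if all(doc.get(k) == v for k, v in criteria.items()):
                hits.append(doc_id)
        return sorted(hits)

store = DocumentStore()
store.put("d1", {"type": "post", "author": "sara", "tags": ["nosql"]})
store.put("d2", {"type": "comment", "author": "sara", "post": "d1"})  # different fields
store.put("d3", {"type": "post", "author": "omar"})
by_author = store.find(author="sara")
posts_by_sara = store.find(type="post", author="sara")
```

Unlike a plain key/value store, the store understands the document format, which is what makes content-based lookup possible.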
89. Column-Oriented Stores
Data is stored and processed by column
Useful for read-mostly and read-intensive data
Data within the same column is of the same type enabling
opportunities for efficient compression
Columns are stored separately so they can be loaded in parallel
Examples include Google’s Bigtable (Apache HBase is its open source
clone) and Apache Cassandra (originally developed at Facebook)
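The column layout and its compression opportunity can be sketched in a few lines. The sample rows and the run-length encoding are assumptions chosen for illustration; real column stores use a range of encodings.

```python
rows = [
    ("Sara", "Lahore", 31),
    ("Hina", "Lahore", 27),
    ("Omar", "Lahore", 45),
    ("Zara", "Karachi", 38),
]

# Column layout: one homogeneous list per attribute instead of one tuple per record.
names, cities, ages = (list(col) for col in zip(*rows))

def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded

compressed_cities = run_length_encode(cities)
average_age = sum(ages) / len(ages)  # touches only the ages column
```

An aggregate over one attribute reads a single column, and repeated values within a column compress well because they share a type.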
97. Introduction
NewSQL: a hybrid of traditional RDBMS and NoSQL
Combines the scalability and performance of NoSQL with the ACID
guarantees of RDBMS
Use SQL as the primary language
Ability to scale out and run over commodity hardware
Classified into:
1 New Databases: Designed from scratch
2 New MySQL Storage Engines: Keep MySQL as interface but replace
the storage engine
3 Transparent Clustering: Add pluggable features to existing databases
to ensure scalability
103. New Databases
1 Query Distribution:
Each node holds a subset of the data
Queries are split and shipped to nodes that own the data
Examples include Google’s Spanner and NuoDB
2 Pull Data:
A central node (possibly replicated) holds all data
A set of processing nodes receives queries and pulls in required data
from the central node
Examples include VMware’s SQLFire
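Query distribution can be sketched as a scatter-gather over shards. The shard contents and function names are hypothetical, and this shows only the general idea, not how Spanner or NuoDB work internally.

```python
# Hypothetical shards: each node holds a subset of (customer_id, amount) rows.
shards = {
    "node-a": [(1, 20), (4, 5)],
    "node-b": [(2, 30), (5, 15)],
    "node-c": [(3, 10)],
}

def local_sum(rows, min_amount):
    """Sub-query run on one node: partial sum and count over its local rows."""
    picked = [amount for _, amount in rows if amount >= min_amount]
    return sum(picked), len(picked)

def distributed_average(shards, min_amount):
    """Coordinator: ship the sub-query to every shard, then merge the partial results."""
    partials = [local_sum(rows, min_amount) for rows in shards.values()]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

avg = distributed_average(shards, 10)
```

Each node returns a small partial result rather than its rows, so the data stays where it is stored. In the pull-data design the processing nodes would fetch the rows from the central node instead.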