Distributed RDBMSs provide many scalability, availability and performance advantages.
This presentation examines steps to create a customized data distribution policy for your RDBMS that best suits your application’s needs to provide maximum scalability.
We will discuss:
1. The different approaches to data distribution
2. How to create your own data distribution policy, whether you are scaling an existing application or creating a new app.
3. How ScaleBase can help you create your policy
Distributed RDBMS: Data Distribution Policy: Part 2 - Creating a Data Distribution Policy
1. Distributed RDBMS
Data Distribution Policy: Part 2
Creating a data distribution policy
October 2014
3. Why Is a Distributed Relational Database Good?
Distributed relational databases are a perfect match for
Cloud computing models and distributed Cloud
infrastructure.
They are the way forward for delivering web scale
applications and keeping ACID properties.
• Social apps
• Games
• Many concurrent users
• High transaction throughput
• Very large data volumes
4. What Is a Data Distribution Policy? – Recap
A data distribution policy describes the rules under which
data is distributed across a distributed RDBMS (a virtual
database made up of many database instances, or “shards”).
A good data distribution policy aims to:
1. Maintain full relational database integrity
2. Distribute workloads in an even and predictable manner
3. Minimize the number of joins across the array of
database instances
4. Yield database scalability
5. Two Broad Types of Data Distribution Policy
1. Arbitrary Distribution: Data is distributed across
database instances without any consideration for or
understanding of specific application requirements.
Arbitrary distribution is often used by NoSQL database
technologies.
2. Policy-Based Distribution: Data is distributed across
database instances in a way that specifically
understands all application requirements, data
relationships, transaction flows, and how the data is
used in reads and writes by the application.
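The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table names, the four-shard layout and the CRC32-based routing functions are hypothetical and are not taken from ScaleBase or any specific NoSQL product.

```python
import zlib

NUM_SHARDS = 4  # assumed number of database instances


def stable_hash(value: str) -> int:
    """Deterministic hash (the built-in hash() is salted per process)."""
    return zlib.crc32(value.encode("utf-8"))


def arbitrary_shard(table: str, primary_key: int) -> int:
    """Arbitrary distribution: each row is placed by its own primary key."""
    return stable_hash(f"{table}:{primary_key}") % NUM_SHARDS


def policy_shard(user_id: int) -> int:
    """Policy-based distribution: each row is placed by the user that owns it."""
    return stable_hash(f"user:{user_id}") % NUM_SHARDS


# Hypothetical rows: (table, primary key, owning user_id)
rows = [
    ("users", 7, 7),
    ("orders", 1001, 7),
    ("items", 50001, 7),
    ("items", 50002, 7),
]

arbitrary = {arbitrary_shard(table, pk) for table, pk, _ in rows}
policy = {policy_shard(user_id) for _, _, user_id in rows}

print("shards touched (arbitrary):   ", sorted(arbitrary))
print("shards touched (policy-based):", sorted(policy))
```

With the policy-based function, all rows that belong to one user resolve to a single shard by construction. With the arbitrary function, the same rows may be spread over several shards, so a join across them may need to cross database instances.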
6. Two Broad Types of Data Distribution Policy
Arbitrary Data Distribution Policy
Pros:
• Unsophisticated
• Predetermined (no forethought required)
Cons:
• No intelligence about business, schema, use cases
• Leads to excessive use of database nodes
• Leads to excessive use of network
Declarative Data Distribution Policy
Pros:
• Ensures that a specific transaction finds all the data it needs in one
specific database
• Aligns with schema and DB structure
• Highly efficient and scalable
• Anticipates future requirements and growth assumptions
Cons:
• Requires forethought and analysis
7. Distributed Databases: NoSQL vs. DRDBMS
• NoSQL databases abandoned the relational model to get
the scalability benefits of a distributed database. NoSQL
and document store type databases can use arbitrary
data distribution because their data model does not
provide for joins, sequential integrity or ACID.
• However, today RDBMSs can get massive web scale and
keep the time-tested relational database model, ACID and
SQL if you use a declarative, policy-based data
distribution approach.
• Academia has written about various types of distributed
relational databases for decades. But today they are a
reality. Declarative, policy-based data distribution is the
way forward.
8. Two Distributed RDBMS Use Cases
There are two typical development
and database scenarios in which
relational databases can evolve
into modern distributed relational
databases:
1. Scaling an existing application
2. Designing scalability in a new
application
9. Scaling an Existing Application:
Key Observations and Measurements
Problem: A monolithic MySQL
database is suffering from
scalability issues:
• inconsistent performance
• inconsistent availability
• transaction throughput bottlenecks
Solution: A distributed MySQL
database that retains its relational
principles by applying a declarative,
policy-based data distribution
process.
10. Scaling an Existing Application:
Key Observations and Measurements
In today’s public, private and hybrid cloud world that
leverages distributed infrastructure, for an existing
database reaching its scalability limits, scaling up – getting
bigger hardware – is a counterintuitive, temporary and
expensive approach.
A good data distribution policy:
1. Transforms a monolithic single-instance MySQL database into a
distributed MySQL database that retains its relational principles.
2. Aligns with the application's current database structure and
commands. Related data within various tables is identified and
amassed to stay localized in a single database instance.
3. Ensures “reads” and “writes” can be completed successfully using
only data from within one database instance.
11. Determining your Data Distribution Policy:
Reads and Writes
Reads (Queries):
• Examine the data that is accessed in joins, sub-queries
or unions to find what data should be kept
together in one database instance. This data usually comes
from related tables that share the same foreign keys.
Writes (Transactions):
• Additions to the database need to be placed in the
appropriate partitioned database instance (or shard) with
their related data.
• A transaction is more efficient when it is confined to a
single database instance. This practice eliminates the
need for a distributed transaction with two-phase commit.
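One way to act on the "Reads" advice is to follow the foreign keys between tables and group every table with the root table it ultimately depends on. The sketch below assumes a small hypothetical schema (users, orders, items, reviews, countries); the FOREIGN_KEYS mapping and both helper functions are illustrative names, not part of any ScaleBase tool.

```python
# Hypothetical schema: child table -> (parent table, foreign key column)
FOREIGN_KEYS = {
    "orders": ("users", "user_id"),
    "items": ("orders", "order_id"),
    "reviews": ("users", "user_id"),
}


def root_table(table: str) -> str:
    """Walk foreign keys upward until reaching a table with no parent."""
    seen = set()
    while table in FOREIGN_KEYS:
        if table in seen:
            raise ValueError(f"foreign key cycle detected at {table!r}")
        seen.add(table)
        table = FOREIGN_KEYS[table][0]
    return table


def colocation_groups(tables):
    """Group tables by the root table their rows ultimately belong to."""
    groups = {}
    for table in tables:
        groups.setdefault(root_table(table), []).append(table)
    return groups


groups = colocation_groups(["users", "orders", "items", "reviews", "countries"])
for root, members in groups.items():
    print(root, "->", members)
```

In this example the tables rooted at users form one group that should be kept together, while countries stands alone because no foreign key links it to the others.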
12. Distribution Example: Reads and Writes
Reads (Queries):
• After identifying the ‘users’ in a database, the next step
is to identify the ‘orders’ related to those
‘users’, then the ‘items’ related to those ‘orders’.
Writes (Transactions):
• An ‘order’ is made up of many ‘items’, which are
consequently added to the same shard as the ‘order’.
Efficiency dictates that we want to ensure that data
can be either read together, such as in queries, or
written together, such as in transactions.
“The data that plays together, should stay together.”
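The users, orders and items example can be sketched with three in-memory SQLite databases standing in for shards. SQLite is used here only so the example is self-contained; the shard count, the CRC32 routing function and the table layout are assumptions made for illustration.

```python
import sqlite3
import zlib

NUM_SHARDS = 3  # assumed number of shards


def shard_for(user_id: int) -> int:
    """Route by the owning user so an order and its items share a shard."""
    return zlib.crc32(str(user_id).encode("utf-8")) % NUM_SHARDS


def new_shard() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE items (item_id INTEGER PRIMARY KEY,"
        " order_id INTEGER NOT NULL, sku TEXT NOT NULL)"
    )
    return conn


shards = [new_shard() for _ in range(NUM_SHARDS)]


def place_order(user_id: int, order_id: int, skus: list) -> None:
    """Write the order and its items in one local transaction on one shard."""
    conn = shards[shard_for(user_id)]
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, user_id))
        conn.executemany(
            "INSERT INTO items (order_id, sku) VALUES (?, ?)",
            [(order_id, sku) for sku in skus],
        )


def order_with_items(user_id: int, order_id: int) -> list:
    """Read the order joined to its items using a single shard."""
    conn = shards[shard_for(user_id)]
    return conn.execute(
        "SELECT o.order_id, i.sku FROM orders o"
        " JOIN items i ON i.order_id = o.order_id"
        " WHERE o.order_id = ? ORDER BY i.item_id",
        (order_id,),
    ).fetchall()


place_order(42, 1, ["book", "pen"])
print(order_with_items(42, 1))
```

Because the order and its items are written to the shard chosen for the user, the write is a single local transaction and the join is answered by one database instance.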
13. Scaling an Existing Application:
Denormalization – Not Recommended
A distribution key is the field according to which data is
directed. If a table does not contain the distribution key, the
routing process can become very difficult.
• Denormalization adds the distribution key to the tables in
which it is missing. However, this creates many
additional problems along the way. It is not
recommended.
• ScaleBase’s cascading key lookup solution easily
removes the need for denormalization whilst efficiently
resolving any data placement issues.
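The general idea of resolving a distribution key through parent tables, instead of copying it into every table, can be sketched as follows. This presentation does not describe how ScaleBase implements its cascading key lookup, so the code below is only a generic illustration; the PARENT and TABLES structures and the distribution_key function are hypothetical.

```python
# Hypothetical relationship metadata: child table -> (foreign key column, parent table)
PARENT = {
    "items": ("order_id", "orders"),
    "orders": ("user_id", "users"),
}

# Hypothetical parent rows that are already stored, indexed by primary key
TABLES = {
    "orders": {
        1001: {"order_id": 1001, "user_id": 7},
        1002: {"order_id": 1002, "user_id": 9},
    },
}


def distribution_key(table: str, row: dict, key: str = "user_id"):
    """Climb from a row to its parents until a row carrying the key is found."""
    while key not in row:
        if table not in PARENT:
            raise LookupError(f"{table!r} has no path to {key!r}")
        fk_column, parent = PARENT[table]
        row = TABLES[parent][row[fk_column]]
        table = parent
    return row[key]


# The items row has no user_id column; the key is found through its order.
new_item = {"item_id": 50001, "order_id": 1001, "sku": "book"}
print(distribution_key("items", new_item))
```

The items table keeps its original, normalized shape, and the routing layer still learns that the new row belongs with user 7.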
14. Scaling an Existing Application:
Null Columns
The fields that determine where to route the data and
commands cannot be empty (i.e. null) or updated during the
life of the row. To ensure this:
• Every piece of data must be “born” with a distribution key that it keeps
for the course of its entire life.
• It is not enough to simply have the distribution key column in all
tables; it needs to be populated, as part of the data in the table, as
well.
• A row can be inserted into a table, updated many times and deleted.
• It is vital that every row is inserted into the database with a populated
distribution key.
• If a row is inserted into the database with a ‘null’ shard key, it cannot
be placed into the distributed database.
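These two rules (never null, never updated) can be checked in the application before a row reaches the database. The following is a minimal sketch, assuming rows are plain dictionaries and that user_id is the distribution key; DistributionKeyError, check_insert and check_update are hypothetical names.

```python
class DistributionKeyError(ValueError):
    """Raised when a row would violate the distribution key rules."""


def check_insert(row: dict, key: str = "user_id") -> dict:
    """A row must be 'born' with a populated distribution key."""
    if row.get(key) is None:
        raise DistributionKeyError(f"cannot route row without {key!r}: {row}")
    return row


def check_update(old_row: dict, changes: dict, key: str = "user_id") -> dict:
    """The distribution key must stay the same for the life of the row."""
    if key in changes and changes[key] != old_row[key]:
        raise DistributionKeyError(f"{key!r} cannot change after insert")
    return {**old_row, **changes}


order = check_insert({"order_id": 1001, "user_id": 7, "status": "new"})
order = check_update(order, {"status": "shipped"})
print(order)

for bad_row in ({"order_id": 1002, "user_id": None}, {"order_id": 1003}):
    try:
        check_insert(bad_row)
    except DistributionKeyError as exc:
        print("rejected:", exc)
```

Updates to other columns pass through unchanged; only a missing key on insert or a changed key on update is rejected.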
15. Automating Data Distribution Analysis:
ScaleBase’s Analysis Genie
If you want to add linear scalability to an existing
MySQL database, you can use ScaleBase’s
free SaaS tool, Analysis Genie.
• The Analysis Genie will help you define the
best data distribution policy tailored to your
application’s unique requirements.
• The results are based on a guided analysis of
the nature of your data, data relationships,
and the functional use of your data.
• You can iterate with different policies in a
simulated environment to achieve the highest
application / distributed database efficiency.
16. Designing Scalability in a New Application
New web-facing apps have to anticipate millions of users,
high-transaction rates, and ever-larger data volumes.
• The same data distribution principles applied to existing
applications also apply to new applications and
databases.
• Data is stored and accessed together on the same
database, whether it is for “reads” or “writes”.
17. Designing Scalability in a New Application
(Continued)
When designing a data
distribution policy, the
distribution key should be
selected according to how data
will be distributed.
You can then denormalize,
adding the distribution key to
every table, or distribute by
understanding the link between
the tables within each shard
from the beginning of the
design process.
18. Designing Scalability in a New Application
When designing a database, ask yourself about the life-cycle
of the rows of your data.
• Are they born with a populated distribution key?
Designing your application in a way that makes sure this is
taken care of avoids the unpleasant situation of null shard
keys.
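In a new application, the schema itself can guarantee that rows are born with a distribution key. The sketch below takes the denormalized route (the key is present in every table) and declares it NOT NULL. SQLite is used as a stand-in so the example runs on its own; the table and column names are assumptions, and a MySQL schema would use its own DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users  (user_id  INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         user_id  INTEGER NOT NULL);
    CREATE TABLE items  (item_id  INTEGER PRIMARY KEY,
                         order_id INTEGER NOT NULL,
                         user_id  INTEGER NOT NULL);  -- distribution key
    """
)


def insert_item(order_id, user_id) -> bool:
    """Return True if the row was stored, False if the schema rejected it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO items (order_id, user_id) VALUES (?, ?)",
                (order_id, user_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(insert_item(1001, 7))     # populated distribution key
print(insert_item(1002, None))  # null distribution key
```

The first insert is accepted and the second is rejected by the NOT NULL constraint, so a row with a null shard key never enters the table.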
19. Massive Database Scalability With ScaleBase
Analysis tools are not appropriate for new applications, because
a new application does not yet have anything to track.
For this reason we’ve created a special guide:
• Building a New Application with Massive Database
Scalability – Getting Started with ScaleBase
This document demonstrates how to build a new application
that plans for massive database scalability right from the
start.
It provides a walkthrough of how to create a simple,
straightforward RDBMS data distribution policy.
20. Additional Distributed RDBMS Resources
To develop a custom-made data distribution policy for your
RDBMS and application, we also recommend the following
resources:
• Four Table Types You Need To Know To Scale Your
Relational Database
• Distributed Databases and Cascading Tables
• Discover your Application Scalability Score with
ScaleBase Analysis Genie
• Optimizing Sharding Policies to Scale Out MySQL –
Choosing the Best Data Distribution Policy (whitepaper)
21. ScaleBase Software
• ScaleBase is a distributed database built on MySQL and
optimized for the cloud. It deploys in minutes so your
database can handle an unlimited number of users,
humongous volumes of data, and faster transactions.
• It dynamically optimizes workloads and availability by
logically distributing data across public, private, and geo-distributed
clouds.
22. ScaleBase Software
“What differentiates ScaleBase is its ability to
add scalability without the need to migrate to
new database architecture or make any changes
to existing applications”
- Matt Aslett, The 451 Group
“ScaleBase allows us to effectively scale, without
downtime, and without having to rewrite our
application.”
- Sheeri Cabral, Mozilla
23. Try ScaleBase Today
ScaleBase software is available for free:
• ScaleBase Website
• Amazon Marketplace
• Rackspace Marketplace
• IBM Cloud Marketplace
• ScaleBase’s free online Analysis Genie service
An AWS Marketplace Guide and an AWS Getting Started
Tutorial are available from the documentation section of the
ScaleBase website.
Contact ScaleBase
sales@scalebase.com
24. Data Distribution Policy: Parts 1 and 3
Data Distribution Policy Part 1:
• What a data distribution policy is
• The challenges faced when data is distributed via sharding
• What defines a good data distribution policy
• The best way to distribute data for your application and
workload
Data Distribution Policy Part 3:
• Three stages of your data distribution policy’s lifecycle.
• Adapting the distributed RDBMS to match application changes.
• Ensuring that your distributed relational database is flexible and
elastic enough to accommodate endless growth and change.