RDS for MySQL, No BS Operations and Patterns

Laine Campbell, CEO PalominoDB

The Party Line

Relational Database Service
Fully Managed
Simple to Deploy
Easy to Scale
Reliable
Cost Effective

Fully Managed

Ignore the man behind the curtain
Backups
Provisioning
Patching
Performance Management
Failover
Replication

Fully Managed

Backups
Snapshot Based - Same as EBS
Snapshots cause spikes in latency
Avoided in Multi-AZ
Snapshots are taken from master
Or the standby in Multi-AZ
Set up automatic schedules
Point in Time Recovery via binlogs
User executed snapshots

RDS Backups

Can I snapshot a replica?
Nope. Backup from your master.

Of course, you can promote a replica, then snapshot it
for testbeds.

RDS Backups

I like RDS Backups
When using Multi-AZ

AND

When loads are minimal

It's like unicorns are flying my binlogs to heaven

Fully Managed

Provisioning
Rapid Master Launches
Master in a few minutes (or it's free?)
Standby in a different AZ? Push a button!

Rapid Replica Builds
Need more replicas? Push a button!

RDS Provisioning

Provisioning your master
Standalone - no failover or redundancy

Multi-AZ - standby in a separate availability zone

Pick your Version

Pick your maintenance window

RDS Provisioning

Overview of AZ and Regions
Amazon Regions equate to data-centers in different
geographical regions. (99.5% SLA based on more than
one AZ being unavailable)

Availability zones are isolated from one another in the
same region to minimize impact of failures.

RDS does not interact across regions.

RDS Provisioning

Can multiple AZs save me?
Amazon states that AZs do not share:

● Cooling
● Network
● Security
● Generators
● Facilities

RDS Provisioning

Apr, 2011 - US East Region EBS Failed
* Incorrect network failover.
* Saturated intra-node communications.
* Cascading failures impacted EBS in all AZs.

Jul, 2012 - US East Partial Impact
* Electrical storms impacted multiple sites.
* Failover of metadata DB took too long.
* EBS I/O was frozen to minimize corruption.

RDS Provisioning

They can reduce risk.

Cross-AZ latency can vary by as much as 3x (too slow
to run MySQL Cluster across AZs).

A Multi-AZ failover can degrade performance
when minimal latency is required.

Multi-AZ Failover

[Figure from AWS Docs]

RDS Provisioning
Multi-AZ Magical Failover
Replicates via unicorn express

Fails over quite often, with up to 30
seconds of downtime

You do not get to choose your failover AZ

Typical I/O write impact for synchronous replication
aka unicorn express

Multi-AZ Failover

[Figure from AWS Blog]

RDS Provisioning

Pick Your Version
MySQL 5.1 or MySQL 5.5

:( No MariaDB :(
:( No XtraDB :(
:( No Drizzle :(
:( No TokuDB :(

RDS Provisioning

Pick Your Maintenance Window
A 30-minute window in which software patching can occur.
Can be different for different instances.
You need to plan ahead for instances to be out of
service.

RDS Provisioning
They'll shut off my DB????

RDS Provisioning

Auto Minor Version Upgrade
If you choose no, you will not experience automatic
upgrades (and thus downtime).
Some critical security patches can still be done.
RDS team is fairly good about communicating
upgrades.

RDS Provisioning

Basic Instance Types
Micro - 630 MB RAM, 2 ECU - Low I/O
Small - 1.7 GB RAM, 1 ECU - Med I/O
Large - 7.5 GB RAM, 4 ECU - High I/O
XLarge - 15 GB RAM, 8 ECU - High I/O

RDS Provisioning

Fancy Instance Types
High Mem XL - 17.1 GB RAM, 6.5 ECU - High I/O
High Mem 2XL - 34 GB RAM, 13 ECU - High I/O
High Mem 4XL - 68 GB RAM, 26 ECU - High I/O

RDS Provisioning

Storage Provisioning
From 5 GB to 3 TB
At 300 GB, EBS Volumes start to get striped.
Striping = better performance
Provisioned IOPS (up to 30,000)
= more stable I/O and costs more too!

RDS Provisioning

Virtual Private Cloud (VPC)
Allows you to create your own virtual network
simulating traditional DC networks.

You must create a DB Subnet Group in VPC

VPC Subnets cannot cross availability zones.

VPC security group allows access control to your DB

RDS Provisioning

Virtual Private Cloud (VPC)
Mixed architectures, with some instances in a VPC and some
outside it, create major issues.

Auto-scaling becomes difficult.

Don't do it!

RDS Provisioning

Database Security Groups
Controls all MySQL access to RDS instances.

Defaults to "deny all"

Access can be granted by IP Range and EC2 sec groups.

RDS Provisioning

Database Security Groups
Don't grant access to 10.x.x.x; use a security group.

IPs entered with CIDR - Classless Inter-Domain Routing

Make sure you understand CIDR! (or you may have
unwelcome visitors!)
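
To make the CIDR warning concrete, here is a minimal Python sketch using the standard-library ipaddress module. The address ranges are illustrative and not taken from any real deployment. It shows how many addresses a rule admits and whether a given client falls inside it.

```python
import ipaddress


def addresses_admitted(cidr: str) -> int:
    """Return how many IP addresses a CIDR block covers."""
    return ipaddress.ip_network(cidr, strict=True).num_addresses


def is_allowed(client_ip: str, cidr: str) -> bool:
    """True if client_ip falls inside the CIDR block."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(cidr)


if __name__ == "__main__":
    for rule in ("10.0.0.0/8", "10.1.2.0/24", "10.1.2.3/32"):
        print(f"{rule:>14} admits {addresses_admitted(rule):>10,} addresses")
    # A /8 rule lets in any host whose address starts with 10.
    print(is_allowed("10.200.9.9", "10.0.0.0/8"))
    print(is_allowed("10.200.9.9", "10.1.2.0/24"))
```

A rule for 10.0.0.0/8 covers the whole 10.x.x.x private range, which on shared infrastructure can include hosts that are not yours.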

RDS Provisioning

Parameter Groups
Defines parameters used by your RDS instances.

There is a "default" group, which you cannot modify.

One or more RDS instances can map to an individual
parameter group.

RDS Provisioning

Parameter Group Best Practices
Don't ever use the default group.

The default group doesn't allow dynamic parameter
changes. Everything requires a restart.

Build different groups for each MySQL master/replica
grouping.

RDS Provisioning

Parameter Group Best Practices
Use different parameter groups for masters vs.
replicas.

Consider using different parameter groups for
different replica types (app query, ad hoc, ETL)

Remember to use test environments. Test!!!

RDS Provisioning

Why different parameter groups?
Granularity - Do you want to apply the same
parameter to everything in the cluster?

● Read Only?
● Slow Logging?
● innodb_flush_method

RDS Provisioning

Provisioning your Replicas
Does not have to be the same instance type as the
master.

Pick your availability zone (great for mapping replicas
to app servers in the same AZ.)

Don't forget to apply a different parameter group
than your master.

RDS Provisioning

Provisioning your Replicas
Adding a replica impacts your master performance.
(If not in multi-az)

You can only launch in serial - and it can take a
non-trivial amount of time to launch.

Adding many replicas can take a while. Script it!
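
One way to script serial replica builds is sketched below. This is an illustration under assumptions, not the tooling used in the talk: it assumes a boto3 RDS client (or any object with the same two methods), and every identifier passed in is hypothetical.

```python
from typing import Iterable, List


def build_replicas_serially(rds, source_id: str, replica_ids: Iterable[str],
                            instance_class: str, availability_zone: str) -> List[str]:
    """Create read replicas one at a time, waiting for each before the next.

    `rds` is assumed to be boto3.client("rds") or a stand-in exposing
    create_db_instance_read_replica() and get_waiter().
    """
    built = []
    for replica_id in replica_ids:
        rds.create_db_instance_read_replica(
            DBInstanceIdentifier=replica_id,
            SourceDBInstanceIdentifier=source_id,
            DBInstanceClass=instance_class,        # may differ from the master
            AvailabilityZone=availability_zone,    # e.g. same AZ as the app servers
        )
        # Block until this replica is available before starting the next one.
        rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)
        built.append(replica_id)
    return built
```

Usage would look like `build_replicas_serially(boto3.client("rds"), "my-master", ["my-replica-1", "my-replica-2"], "db.m1.large", "us-east-1a")`. Applying a replica-specific parameter group would be a separate step after each build.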

RDS Provisioning

What can I do with my replica?
Send queries to it

Promote it to a master

Poke it with a stick

Use it for special purposes (mysqldump, ETL, ad hoc)

RDS Provisioning

Sending queries to the replica?
Set up Route53 cnames - weighted round robin.

Internal elastic load balancer in the VPC.

VPC/Route53 does not do a mysql health check.

HAProxy can be leveraged.
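
Since Route53 and the load balancer do not check MySQL health themselves, an external check has to decide whether a replica should receive reads. The sketch below shows one possible decision rule in Python. It assumes the caller has already fetched one SHOW SLAVE STATUS row as a dictionary. The function name and the 30-second lag threshold are hypothetical.

```python
from typing import Mapping, Optional


def replica_is_healthy(status: Optional[Mapping[str, object]],
                       max_lag_seconds: int = 30) -> bool:
    """Decide from one SHOW SLAVE STATUS row whether a replica should take reads."""
    if not status:
        return False  # no row: replication is not configured on this host
    if status.get("Slave_IO_Running") != "Yes":
        return False  # not receiving binlog events from the master
    if status.get("Slave_SQL_Running") != "Yes":
        return False  # not applying events (e.g. stopped on an error)
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        return False  # NULL lag means replication is not running
    return int(lag) <= max_lag_seconds
```

A wrapper exposing this result over HTTP is one common way to plug such a check into HAProxy.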

RDS Provisioning

Replica-to-Master Promotion
This is a great way to build a test environment.

Can be leveraged for rolling migrations

But a replica can't have a replica! Must promote
first!

RDS Provisioning

Replica promotion for failover
This can be used instead of Multi-AZ. Why?

When using sync_binlog=0, a master failover in Multi-AZ
may strand your replicas.

Old log doesn't close correctly. Replica cannot
proceed. And you can't move to the next log!

RDS Provisioning

All of my replicas must be rebuilt!

A Day in the Life

What does an RDS DBA do?

A Day in the Life

Need a replica?
Push a button or call an API.

Need to create a test environment?
Promote a replica, call an API.

New Cluster?

A Day in the Life

Need a backup?

Need to recover a database?

New Cluster?

A Day in the Life

Need to do a query review?
You don't have access to the logs at the file
system level.

You can look in the console or via API for some
initial diagnostics.

A Day in the Life

Query Reviews
Need to do a REAL query review?
Log to the csv table - slow_log
mysql -u user -p -h host.rds.amazonaws.com -D mysql -s -r -e "SELECT CONCAT(
  '# Time: ', DATE_FORMAT(start_time, '%y%m%d %H%i%s'), '\n',
  '# User@Host: ', user_host, '\n',
  '# Query_time: ', TIME_TO_SEC(query_time),
  ' Lock_time: ', TIME_TO_SEC(lock_time),
  ' Rows_sent: ', rows_sent,
  ' Rows_examined: ', rows_examined, '\n',
  sql_text, ';' ) FROM mysql.slow_log" > /tmp/mysql.slow_log.log

pt-query-digest --limit 100% /tmp/mysql.slow_log.log > /tmp/query-digest.txt
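
To show what each exported record looks like, here is a small Python sketch that mirrors the CONCAT expression above for a single row. The function name and the sample values are illustrative only. It assumes the row has already been fetched from mysql.slow_log, with the TIME columns represented as timedelta values.

```python
from datetime import datetime, timedelta


def format_slow_log_entry(start_time: datetime, user_host: str,
                          query_time: timedelta, lock_time: timedelta,
                          rows_sent: int, rows_examined: int,
                          sql_text: str) -> str:
    """Render one mysql.slow_log row in the text layout built by the query above."""
    return (
        # MySQL '%y%m%d %H%i%s' corresponds to strftime '%y%m%d %H%M%S'.
        f"# Time: {start_time.strftime('%y%m%d %H%M%S')}\n"
        f"# User@Host: {user_host}\n"
        # TIME_TO_SEC yields whole seconds, so sub-second queries show as 0.
        f"# Query_time: {int(query_time.total_seconds())}"
        f" Lock_time: {int(lock_time.total_seconds())}"
        f" Rows_sent: {rows_sent}"
        f" Rows_examined: {rows_examined}\n"
        f"{sql_text};"
    )


if __name__ == "__main__":
    print(format_slow_log_entry(
        datetime(2013, 4, 23, 14, 5, 9), "app[app] @ [10.0.0.5]",
        timedelta(seconds=3), timedelta(0), 1, 1000, "SELECT 1"))
```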

A Day in the Life

Query Reviews
No Microsecond Patch
Using long_query_time=0 logs all queries,
but they are recorded with a time of 0.
You have no accurate profile of query time for queries under 1 second.

You also can't run tcpdump on the MySQL instance.
We often rely on tcpdump when logging everything would drop
performance on the DB instance to unacceptable levels.

WHICH IT CAN

A Day in the Life

Need to rotate logs?
call mysql.rds_rotate_slow_log;

call mysql.rds_rotate_general_log;

A Day in the Life

Need to kill a process?
call mysql.rds_kill_query (99);

kills the current query for this thread.

call mysql.rds_kill (99);

kills the thread.

A Day in the Life

Managing Replication
Need to stop replication? Break it yourself!

call mysql.rds_skip_repl_error;

Skips the current replication error.

A Day in the Life

Reviewing Status Trends
Global Status History

An event snapshots global status into
mysql.rds_global_status_history.

You can trend this into many tools.
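
Trending cumulative status counters usually means converting snapshots into rates. The sketch below is a minimal illustration, assuming the snapshots for one counter (for example Questions) have already been read out as (timestamp, value) pairs. The function name and input shape are hypothetical, not part of RDS.

```python
from datetime import datetime
from typing import List, Tuple

Sample = Tuple[datetime, int]  # (snapshot time, cumulative counter value)


def per_second_rates(samples: List[Sample]) -> List[Tuple[datetime, float]]:
    """Convert cumulative counter snapshots into per-second rates."""
    rates = []
    ordered = sorted(samples)
    for (t0, v0), (t1, v1) in zip(ordered, ordered[1:]):
        elapsed = (t1 - t0).total_seconds()
        if elapsed <= 0 or v1 < v0:
            # Duplicate timestamp, or the counter reset (e.g. after a restart).
            continue
        rates.append((t1, (v1 - v0) / elapsed))
    return rates
```

The resulting (timestamp, rate) pairs can be pushed to whichever trending tool you already use.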

Monitoring MySQL

Cloudwatch
CPUUtilization
Database Connections
FreeStorageSpace
Network In/Out
Read/Write IOPs
Read/Write Bytes
Read/Write Latency

Monitoring MySQL

Where are the MySQL Metrics?
Cloudwatch doesn't expose them.

You can use: Cacti, Graphite, Zabbix, etc... for
trending.

Monitoring MySQL

Can I alert on cloudwatch metrics?
Cloudwatch allows you to set up your alerts.

But you probably want all metrics and alerts in the
same system, don't you?

Monitoring MySQL

Also cloudwatch is unreliable
It often doesn't poll at every interval.

Can miss/skip important events.

Monitoring MySQL

What can I use?
Nagios can poll mysql directly

Poll from graphite

Some things that suck

Moving data in and out
Want to do a dump and load upgrade?

Want to migrate to a new region?

Want to do multi-layer replication?

Migrations/Upgrades out of RDS
Take a replica out of service.
Dump your data.
Upgrade your binaries.
Load your data.
Give replicas to your replica.
Failover reads, then writes.
MINIMAL DOWNTIME

Migrations/Upgrades in RDS
Dump a bunch of tables.
Load deltas via tons of scripting.
Keep the deltas on each table minimal.
Take a few hours downtime.
Sync the delta.
Test.
Go live and drink a lot.

This also applies to:
Moving data between regions.

Migration to EC2 from RDS.

Migrating to a datacenter from AWS
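
The "load deltas via scripting" step can be pictured with the following sketch. It is an illustration under strong assumptions, not the scripts used in the talk: every table is assumed to have a primary key `id` and an `updated_at` column, and in-memory SQLite databases stand in for the source and target MySQL servers so the example runs anywhere. Deletes are not handled.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in database with one hypothetical table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)")
    return db


def sync_delta(source, target, table: str, high_water: str) -> str:
    """Copy rows changed since `high_water` to the target; return the new mark.

    `table` is interpolated into the SQL, so it must come from a trusted list.
    """
    rows = source.execute(
        f"SELECT id, payload, updated_at FROM {table} "
        "WHERE updated_at > ? ORDER BY updated_at", (high_water,)
    ).fetchall()
    for row in rows:
        # SQLite upsert; on MySQL this would be REPLACE INTO or
        # INSERT ... ON DUPLICATE KEY UPDATE.
        target.execute(
            f"INSERT OR REPLACE INTO {table} (id, payload, updated_at) "
            "VALUES (?, ?, ?)", row)
    target.commit()
    return rows[-1][2] if rows else high_water


if __name__ == "__main__":
    src, dst = make_demo_db(), make_demo_db()
    src.execute("INSERT INTO orders VALUES (1, 'a', '2013-04-01 00:00:00')")
    mark = sync_delta(src, dst, "orders", "")
    print("synced up to", mark)
```

Running the sync repeatedly before the cutover keeps each delta small, which is the point of the procedure above. The final run happens inside the downtime window.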

Patterns for RDS
Prototyping and Testing:
Rapid build and destroy.

Short lifecycles.

Quick testing lifecycles.

Patterns for RDS
Moderate Uptime SLAs:
Region-level SLA is 99.5% across two AZs
(43.8 hours of downtime per year)

Add in failover times for multi-AZ master
(6 more hours)

Expect around 4 days of downtime without
multi-region
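
The downtime figures above follow from simple arithmetic, sketched here in Python. The 99.5% SLA and the 6 hours of failover time are the slide's numbers. Those two items alone sum to roughly two days, so the four-day expectation is the speaker's own estimate and is not derived by this sketch.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def allowed_downtime_hours(sla_percent: float) -> float:
    """Hours per year a service may be down while still meeting the SLA."""
    return HOURS_PER_YEAR * (100 - sla_percent) / 100


if __name__ == "__main__":
    region_hours = allowed_downtime_hours(99.5)
    failover_hours = 6.0  # Multi-AZ master failovers, from the slide
    total_hours = region_hours + failover_hours
    print(f"Region SLA allowance: {region_hours:.1f} h/year")
    print(f"With failovers:       {total_hours:.1f} h/year "
          f"({total_hours / 24:.1f} days)")
```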

Patterns for RDS
That doesn't include:
Downtime from bad queries

Downtime from user error

Downtime from upgrades/migrations

Patterns for RDS
Relaxed Latency Requirements:
Multi-AZ can introduce cross-AZ latency
without AZ specific architectural design.

EBS storage can introduce unpredictable
latency without P-IOPS

Snapshots of master, replica builds and multi-AZ
failovers can impact write latency.

Patterns for RDS
Relaxed Latency Requirements:
If you use write-through cache, this can be mitigated

If you use significant caching, this can be mitigated

If you use AZ aware design, this can be mitigated

Patterns for RDS
Dataset Specifics:
Small datasets can allow for rapid region migrations

Read only datasets can also allow for this

Data you don't mind losing can also allow for this

Patterns for RDS
No DBA(s):
You still need DBAs to design, tune and configure.

But RDS does reduce some DBA overhead.

With investment in automation, this overhead is not
significant.

Still, automation requires money/hours. If you have
no budget, RDS is a good way to start.

War Stories
Obama for America:
US-East Region

Multi-AZ

5 Clusters, 30 Instances

Provisioned IOPs, 1 TB Storage

Obama for America
Data Growth:
Opsview had no visibility into the OS, and thus we
were surprised regularly by storage growth. Had
to build custom plugins.
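
A custom storage check of this kind might project when free space runs out. The sketch below is hypothetical and is not the plugin built for the campaign. It assumes you already collect (day, free GB) samples, for example from the CloudWatch FreeStorageSpace metric converted to GB, and fits a straight line through them.

```python
from typing import List, Optional, Tuple


def days_until_full(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Least-squares projection of when free space reaches zero.

    samples: (day_number, free_gb) points.
    Returns days from the last sample until free space hits zero,
    or None if there is too little data or free space is not shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        return None  # all samples taken on the same day
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope >= 0:
        return None  # free space is flat or growing
    intercept = mean_y - slope * mean_x
    zero_day = -intercept / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, zero_day - last_day)
```

An alert threshold on the returned value (say, fewer than 14 days) gives warning before a storage upgrade and its downtime window become urgent.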

Upgrading storage or instance size in multi-AZ
can cause an unpredictable downtime window.

Downtime is small, but the whole process can take
30 minutes and you don't know when the REAL
downtime will occur.

Obama for America
Hurricane Sandy:
Hurricane Sandy was poised to strike Virginia and
US East.

Luckily we had built out EC2 and data migration
scripts.

Took 3 days solid for the whole team to build out
US-West region.

Obama for America
Human Error:
While doing rolling DDL, sql_log_bin was disabled at the
global level on the master. (Damn you 5.5!!!!)

No access to binlogs made troubleshooting very
challenging.

An hour of troubleshooting because we blamed the
disk and had no visibility.

Had to rebuild all replicas in serial overnight once

Obama for America
Migration to P-IOPs:
Things that make you go hmmm....

War Stories
Call of Duty, Black Ops 2:
5 Clusters, 25 instances.

US East

Multi-AZ

Provisioned IOPs

CoD Black Ops 2
Hurricane Sandy:
Data migration scripts were not set up for continuous
replication.

Had to draw a line in the sand on when to move
data.

Any additional data would be lost if cutover
occurred.

CoD Black Ops 2
Multi-AZ Failover:
Writes required sync_binlog=0

Master failed over to standby.

All replicas stopped replicating.

DBA couldn't “change master”

Read load swarmed the master while we rebuilt.

CoD Black Ops 2
Provisioned IOPs:
Came out, super exciting!

Let's migrate!

Oh, no push button migration.

2 Senior DBAs, 3 weeks to build migration scripts
and test/migrate.

Q&A
Laine Campbell, CEO PalominoDB

http://www.slideshare.net/lainecampbell
