RDS for MySQL provides a managed MySQL database in the cloud. It automates backups, provisioning, patching, and failover. The convenience comes with limitations: a restricted choice of MySQL versions and engines, limited control over when maintenance happens, and downtime for migrations and upgrades. Workloads with high availability or low latency requirements need careful planning. Overall, RDS reduces DBA overhead but still requires expertise for design, tuning, and automation.
2. The Party Line
Relational Database Service
Fully Managed
Simple to Deploy
Easy to Scale
Reliable
Cost Effective
3. Fully Managed
Ignore the man behind the curtain
Backups
Provisioning
Patching
Performance Management
Failover
Replication
4. Fully Managed
Backups
Snapshot Based - Same as EBS
Snapshots cause spikes in latency
Avoided in Multi-AZ
Snapshots are taken from master
Or the standby in Multi-AZ
Set up automatic schedules
Point in Time Recovery via binlogs
User executed snapshots
5. RDS Backups
Can I snapshot a replica?
Nope. Backup from your master.
Of course, you can promote a replica, then snapshot it
for testbeds.
6. RDS Backups
I like RDS Backups
When using Multi-AZ
AND
When loads are minimal
It's like unicorns are flying my binlogs to heaven
7. Fully Managed
Provisioning
Rapid Master Launches
Master in a few minutes (or it's free?)
Standby in a different AZ? Push a button!
Rapid Replica Builds
Need more replicas? Push a button!
8. RDS Provisioning
Provisioning your master
Standalone - no failover or redundancy
Multi-AZ - standby in a separate availability zone
Pick your Version
Pick your maintenance window
10. RDS Provisioning
Overview of AZ and Regions
Amazon Regions equate to groups of data centers in different
geographical regions. (99.5% SLA, which counts a region as
down only when more than one AZ is unavailable)
Availability zones are isolated from one another in the
same region to minimize impact of failures.
RDS does not interact across regions.
11. RDS Provisioning
Can multiple AZs save me?
Amazon states AZs do not share :
● Cooling
● Network
● Security
● Generators
● Facilities
12. RDS Provisioning
Can multiple AZs save me?
Apr, 2011 - US East Region EBS Failed
* Incorrect network failover.
* Saturated intra-node communications.
* Cascading failures impacted EBS in all AZs.
Jul, 2012 - US East Partial Impact
* Electrical storms impacted multiple sites.
* Failover of metadata DB took too long.
* EBS I/O was frozen to minimize corruption.
13. RDS Provisioning
Can multiple AZs save me?
They can reduce risk.
Cross AZ latency can vary as much as 3x. (too slow
to allow mysql cluster across AZs)
A multi-az failover can create a degraded
performance condition when minimal latency is
required.
15. RDS Provisioning
Multi-AZ Magical Failover
Replicates via unicorn express
Fails over quite often, with up to 30
seconds of downtime
You do not get to choose your failover AZ
Typical I/O write impact for synch replication
aka unicorn express
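A minimal sketch of how an application might ride out a failover window like the one described above. It assumes a hypothetical `connect` callable that raises `ConnectionError` while the endpoint is unavailable. The 30-second budget comes from the slide; the function name, the delay values, and the injectable `sleep` and `clock` are illustrative only.

```python
import time


def connect_with_retry(connect, budget_seconds=30.0, first_delay=0.5,
                       max_delay=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call connect() until it succeeds or the retry budget is spent.

    connect is a hypothetical callable that returns a connection or raises
    ConnectionError while the endpoint is mid-failover.
    """
    deadline = clock() + budget_seconds
    delay = first_delay
    while True:
        try:
            return connect()
        except ConnectionError:
            remaining = deadline - clock()
            if remaining <= 0:
                raise  # budget exhausted: surface the last error
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

Capping the backoff keeps the client from sleeping well past the point where the standby has already taken over.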
17. RDS Provisioning
Pick Your Version
MySQL 5.1 or MySQL 5.5
:( No MariaDB :(
:( No XtraDB :(
:( No Drizzle :(
:( No TokuDB :(
18. RDS Provisioning
Pick Your Maintenance Window
A 30-minute window in which software patching can occur
Can be different for different instances
You need to plan ahead for instances to be out of
service.
20. RDS Provisioning
Auto Minor Version Upgrade
If you choose no, you will not experience automatic
upgrades (and thus downtime).
Some critical security patches can still be applied.
RDS team is fairly good about communicating
upgrades.
21. RDS Provisioning
Basic Instance Types
Micro - 630 MB RAM, 2 ECU - Low I/O
Small - 1.7 GB RAM, 1 ECU - Med I/O
Large - 7.5 GB RAM, 4 ECU - High I/O
XLarge - 15 GB RAM, 8 ECU - High I/O
22. RDS Provisioning
Fancy Instance Types
High Mem XL - 17.1 GB RAM, 6.5 ECU - High I/O
High Mem 2XL - 34 GB RAM, 13 ECU - High I/O
High Mem 4XL - 68 GB RAM, 26 ECU - High I/O
23. RDS Provisioning
Storage Provisioning
From 5 GB to 3 TB
At 300 GB, EBS Volumes start to get striped.
Striping = better performance
Provisioned IOPS (up to 30,000)
= more stable I/O and costs more too!
24. RDS Provisioning
Virtual Private Cloud (VPC)
Allows you to create your own virtual network
simulating traditional DC networks.
You must create a DB Subnet Group in VPC
VPC Subnets cannot cross availability zones.
VPC security group allows access control to your DB
25. RDS Provisioning
Virtual Private Cloud (VPC)
Mixed architectures with some VPC and some non-VPC
components create major issues.
Auto-scaling becomes difficult.
Don't do it!
26. RDS Provisioning
Database Security Groups
Controls all MySQL access to RDS instances.
Defaults to "deny all"
Access can be granted by IP Range and EC2 sec groups.
27. RDS Provisioning
Database Security Groups
Don't grant access to 10.x.x.x, use a security group.
IPs entered with CIDR - Classless Inter-Domain Routing
Make sure you understand CIDR! (or you may have
unwelcome visitors!)
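To make the CIDR warning concrete, here is a small sketch using Python's standard `ipaddress` module. It shows how many addresses a given block admits and whether a client address would be let in. The address ranges and function names are illustrative, not taken from any real deployment.

```python
import ipaddress


def describe_cidr(cidr):
    """Return (first address, last address, address count) for a CIDR block."""
    net = ipaddress.ip_network(cidr, strict=True)
    return str(net[0]), str(net[-1]), net.num_addresses


def is_allowed(client_ip, allowed_cidrs):
    """True if client_ip falls inside any of the allowed CIDR blocks."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(c) for c in allowed_cidrs)


if __name__ == "__main__":
    for cidr in ("10.0.0.0/8", "10.1.2.0/24", "10.1.2.15/32"):
        first, last, count = describe_cidr(cidr)
        print(f"{cidr:>14}  {first} - {last}  ({count} addresses)")
```

A /8 covers every 10.x.x.x address, which is why granting it is far broader than granting a single security group or a /32.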
28. RDS Provisioning
Parameter Groups
Defines parameters used by your RDS instances.
There is a "default" group that you can modify.
One or more RDS instances can map to an individual
parameter group.
29. RDS Provisioning
Parameter Group Best Practices
Don't ever use the default group.
The default group doesn't allow dynamic parameter
changes. Everything requires a restart.
Build different groups for each mysql master/replica
grouping.
30. RDS Provisioning
Parameter Group Best Practices
Use different parameter groups for masters vs.
replicas.
Consider using different parameter groups for
different replica types (app query, ad hoc, ETL)
Remember to use test environments. Test!!!
31. RDS Provisioning
Why different parameter groups?
Granularity - Do you want to apply the same
parameter to everything in the cluster?
● Read Only?
● Slow Logging?
● innodb_flush_method
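One way to think about per-role parameter groups is as a shared baseline plus a small set of overrides for each role. The sketch below models that with plain dictionaries. The role names and every value are hypothetical, and it does not claim that each parameter is modifiable on RDS; it only illustrates the granularity argument.

```python
# Hypothetical baseline shared by every instance in a cluster.
BASE = {
    "innodb_flush_method": "O_DIRECT",
    "slow_query_log": 0,
    "read_only": 0,
}

# Hypothetical per-role overrides; each role maps to its own parameter group.
ROLE_OVERRIDES = {
    "master": {},
    "app-replica": {"read_only": 1},
    "adhoc-replica": {"read_only": 1, "slow_query_log": 1},
}


def parameters_for(role):
    """Merge the baseline with one role's overrides (overrides win)."""
    if role not in ROLE_OVERRIDES:
        raise KeyError(f"unknown role: {role}")
    return {**BASE, **ROLE_OVERRIDES[role]}


def diff_from_base(role):
    """Return only the parameters that differ from the baseline."""
    merged = parameters_for(role)
    return {k: v for k, v in merged.items() if BASE.get(k) != v}
```

Keeping the overrides small makes it obvious at review time which settings are role-specific and which apply to the whole cluster.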
33. RDS Provisioning
Provisioning your Replicas
Does not have to be the same instance type as the
master.
Pick your availability zone (great for mapping replicas
to app servers in the same AZ.)
Don't forget to apply a different parameter group
than your master.
34. RDS Provisioning
Provisioning your Replicas
Adding a replica impacts your master performance.
(If not in multi-az)
You can only launch in serial - and it can take a
non-trivial amount of time to launch.
Adding many replicas can take a while. Script it!
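A minimal sketch of the "script it" advice, assuming two caller-supplied callables: `create_replica` and `wait_until_available`. Both names are hypothetical. In practice they might wrap the RDS API, for example boto3's `create_db_instance_read_replica` call and its `db_instance_available` waiter, but check that against the SDK documentation before relying on it.

```python
def launch_replicas_serially(source_id, replica_ids, create_replica, wait_until_available):
    """Create read replicas one at a time, waiting for each before the next.

    create_replica(source_id, replica_id) and wait_until_available(replica_id)
    are hypothetical callables supplied by the caller.
    """
    launched = []
    for replica_id in replica_ids:
        create_replica(source_id, replica_id)
        wait_until_available(replica_id)  # blocks; replicas build in serial
        launched.append(replica_id)
    return launched
```

Injecting the two callables keeps the loop testable without touching a real account.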
35. RDS Provisioning
What can I do with my replica?
Send queries to it
Promote it to a master
Poke it with a stick
Use it for special purposes (mysqldump, ETL, ad hoc)
36. RDS Provisioning
Sending queries to the replica?
Set up Route53 cnames - weighted round robin.
Internal elastic load balancer in the VPC.
VPC/Route53 does not do a mysql health check.
HAProxy can be leveraged.
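Because DNS-level weighting does not check MySQL health, some teams do the selection in the application or a proxy. The sketch below shows weighted selection that skips unhealthy hosts. The hostnames, weights, and the `is_healthy` callable are all hypothetical; a real check might run `SELECT 1` and inspect replication lag.

```python
import random


def pick_replica(weights, is_healthy, rng=random):
    """Pick one replica host by weight, skipping hosts that fail the check.

    weights: dict of host -> positive weight (hypothetical hostnames).
    is_healthy: hypothetical callable host -> bool.
    """
    healthy = {h: w for h, w in weights.items() if w > 0 and is_healthy(h)}
    if not healthy:
        raise RuntimeError("no healthy replicas available")
    hosts = list(healthy)
    return rng.choices(hosts, weights=[healthy[h] for h in hosts], k=1)[0]
```

This is random weighted selection rather than strict round robin; over many requests the traffic split approaches the configured weights.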
37. RDS Provisioning
Replica Promotion to Master
This is a great way to build a test environment.
Can be leveraged for rolling migrations
But a replica can't have a replica! Must promote
first!
38. RDS Provisioning
Replica promotion for failover
This can be used instead of Multi-AZ. Why?
When using sync_binlog=0, a master failover in multi-az
may strand your replicas.
Old log doesn't close correctly. Replica cannot
proceed. And you can't move to the next log!
41. A Day in the Life
What does an RDS DBA do?
Need a replica?
Push a button or call an API.
Need to create a test environment?
Promote a replica, call an API.
New Cluster?
Push a button or call an API.
42. A Day in the Life
What does an RDS DBA do?
Need a backup?
Push a button or call an API.
Need to recover a database?
Push a button or call an API.
New Cluster?
Push a button or call an API.
43. A Day in the Life
Need to do a query review?
You don't have access to the logs at the file
system level.
You can look in the console or via API for some
initial diagnostics.
44. A Day in the Life
Query Reviews
Need to do a REAL query review?
Log to the csv table - slow_log
mysql -u user -p -h host.rds.amazonaws.com -D mysql -s -r -e "SELECT CONCAT( '# Time: ', DATE_FORMAT(start_time, '%y%m%d %H%i%s'), '\n', '# User@Host: ', user_host, '\n', '# Query_time: ', TIME_TO_SEC(query_time), ' Lock_time: ', TIME_TO_SEC(lock_time), ' Rows_sent: ', rows_sent, ' Rows_examined: ', rows_examined, '\n', sql_text, ';' ) FROM mysql.slow_log" > /tmp/mysql.slow_log.log
pt-query-digest --limit 100% /tmp/mysql.slow_log.log > /tmp/query-digest.txt
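The export above rebuilds slow-log-style text from table rows. As a sketch of the same formatting done client-side, the function below renders one row the way the CONCAT() does, including its `%y%m%d %H%i%s` timestamp. The function name and argument order are illustrative, and the row values are assumed to have been fetched already.

```python
from datetime import datetime


def format_slow_log_row(start_time, user_host, query_secs, lock_secs,
                        rows_sent, rows_examined, sql_text):
    """Render one mysql.slow_log row as slow-log-style text.

    Mirrors the CONCAT() in the query above; MySQL's %i (minutes) and
    %s (seconds) correspond to strftime's %M and %S.
    """
    return (
        f"# Time: {start_time.strftime('%y%m%d %H%M%S')}\n"
        f"# User@Host: {user_host}\n"
        f"# Query_time: {query_secs} Lock_time: {lock_secs}"
        f" Rows_sent: {rows_sent} Rows_examined: {rows_examined}\n"
        f"{sql_text};"
    )
```

Formatting in the client can be handy when the rows are pulled by a scheduled job rather than a one-off shell command.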
45. A Day in the Life
Query Reviews
No Microsecond Patch
Setting long_query_time=0 logs all queries
But each one is recorded with a time of 0
You have no accurate profile for queries that take < 1 sec.
You also can't use tcpdump on the MySQL instance.
We often use tcpdump when logging everything would drop
performance on the DB instance to unacceptable levels.
WHICH IT CAN
46. A Day in the Life
Need to rotate logs?
call mysql.rds_rotate_slow_log;
call mysql.rds_rotate_general_log;
47. A Day in the Life
Need to kill a process?
call mysql.rds_kill_query (99);
kills the current query for this thread.
call mysql.rds_kill (99);
kills the thread.
48. A Day in the Life
Managing Replication
Need to stop replication? Break it yourself!
call mysql.rds_skip_repl_error;
Skips the current replication error.
49. A Day in the Life
Reviewing Status Trends
Global Status History
A scheduled event snapshots global status into
mysql.rds_global_status_history.
You can trend this into many tools.
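Most status variables are cumulative counters, so trending usually means converting snapshots into rates. The sketch below does that for one variable. It assumes the rows have already been fetched and reduced to `(unix_timestamp, counter_value)` pairs; it makes no assumption about the history table's actual column layout.

```python
def per_second_rates(samples):
    """Turn cumulative counter snapshots into per-second rates.

    samples: list of (unix_timestamp, counter_value) in time order.
    A counter that goes backwards is treated as a restart and skipped.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or v1 < v0:
            continue  # bad ordering or counter reset after a restart
        rates.append((t1, (v1 - v0) / elapsed))
    return rates
```

The resulting `(timestamp, rate)` pairs can be pushed to whichever trending tool is already in use.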
51. Monitoring MySQL
Where are the MySQL Metrics?
Cloudwatch doesn't expose them.
You can use: Cacti, Graphite, Zabbix, etc... for
trending.
52. Monitoring MySQL
Can I alert on cloudwatch metrics?
Cloudwatch allows you to set up your alerts.
But you probably want all metrics and alerts in the
same system, don't you?
55. Some things that suck
Moving data in and out
Want to do a dump and load upgrade?
Want to migrate to a new region?
Want to do multi-layer replication?
56. Some things that suck
Migrations/Upgrades out of RDS
Take a replica out of service.
Dump your data.
Upgrade your binaries.
Load your data.
Give replicas to your replica.
Failover reads, then writes.
MINIMAL DOWNTIME
58. Some things that suck
Migrations/Upgrades in RDS
Dump a bunch of tables.
Load deltas via tons of scripting.
Keep the deltas on each table minimal.
Take a few hours downtime.
Sync the delta.
Test.
Go live and drink a lot.
59. Some things that suck
This also applies to:
Moving data between regions.
Migration to EC2 from RDS.
Migrating to a datacenter from AWS
61. Patterns for RDS
Moderate Uptime SLAs:
Region-level SLA is 99.5% across two AZs
(43.8 hours of downtime per year)
Add in failover times for multi-AZ master
(6 more hours)
Expect around 4 days of downtime without
multi-region
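The downtime figures above follow from simple arithmetic, sketched below. The 99.5% SLA and the 6-hour failover allowance are taken from the slide; the function name is illustrative. The two itemized figures sum to roughly 50 hours, or about two days, so the slide's "around 4 days" is the author's own estimate and is not derived here.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours(sla_percent, hours=HOURS_PER_YEAR):
    """Hours per year an SLA of sla_percent allows a service to be down."""
    return hours * (100.0 - sla_percent) / 100.0


region = downtime_hours(99.5)   # region-level SLA from the slide
failover = 6.0                  # multi-AZ failover allowance from the slide
total = region + failover
print(f"region: {region:.1f} h, total: {total:.1f} h ({total / 24:.1f} days)")
```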
62. Patterns for RDS
That doesn't include:
Downtime from bad queries
Downtime from user error
Downtime from upgrades/migrations
63. Patterns for RDS
Relaxed Latency Requirements:
Multi-AZ can introduce cross-AZ latency
without AZ specific architectural design.
EBS storage can introduce unpredictable
latency without P-IOPS
Snapshots of master, replica builds and multi-AZ
failovers can impact write latency.
64. Patterns for RDS
Relaxed Latency Requirements:
If you use write-through cache, this can be mitigated
If you use significant caching, this can be mitigated
If you use AZ aware design, this can be mitigated
65. Patterns for RDS
Dataset Specifics:
Small datasets can allow for rapid region migrations
Read only datasets can also allow for this
Data you don't mind losing can also allow for this
66. Patterns for RDS
No DBA(s):
You still need DBAs to design, tune and configure.
But RDS does reduce some DBA overhead.
With investment in automation, this overhead is not
significant.
Still, automation requires money/hours. If you have
no budget, RDS is a good way to start.
67. War Stories
Obama for America:
US-East Region
Multi-AZ
5 Clusters, 30 Instances
Provisioned IOPS, 1 TB Storage
68. Obama for America
Data Growth:
Opsview had no visibility into the OS, and thus we
were regularly surprised by storage growth. Had
to build custom plugins.
Upgrading storage or instance size in multi-AZ
can cause an unpredictable downtime window.
Downtime is small, but the whole process can take
30 minutes and you don't know when the REAL
downtime will occur.
69. Obama for America
Hurricane Sandy:
Hurricane Sandy was poised to strike Virginia and
US East.
Luckily we had built out EC2 and data migration
scripts.
Took 3 days solid for the whole team to build out
US-West region.
70. Obama for America
Human Error:
While doing rolling DDL, sql_log_bin disabled at the
global level on master. (Damn you 5.5!!!!)
No access to binlogs made troubleshooting very
challenging.
An hour of troubleshooting because we blamed the
disk and had no visibility.
Had to rebuild all replicas in serial overnight once
72. War Stories
Call of Duty, Black Ops 2:
5 Clusters, 25 instances.
US East
Multi-AZ
Provisioned IOPS
73. CoD Black Ops 2
Hurricane Sandy:
Data migration scripts not set up for continuous
replication.
Had to draw a line in the sand on when to move
data.
Any additional data would be lost, if cutover
occurred.
74. CoD Black Ops 2
Multi-AZ Failover:
Writes required sync_binlog=0
Master failed over to standby.
All replicas stopped replicating.
DBA couldn't “change master”
Read load swarmed the master while we rebuilt.
75. CoD Black Ops 2
Provisioned IOPS:
Came out, super exciting!
Let's migrate!
Oh, no push button migration.
2 Senior DBAs, 3 weeks to build migration scripts
and test/migrate.
76. Q&A
Laine Campbell, CEO PalominoDB
http://www.slideshare.net/lainecampbell