Amazon RDS for Database Failover

Amazon RDS with Multi-AZ failover is a commonly used solution.  However, it is important to understand its true capabilities when deployed. This is especially important for online e-commerce applications, where application downtime can translate directly into lost revenue.

The Amazon RDS user guide covers Multi-AZ deployments at:  http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html. It calls out several key issues:

  1. The failover mechanism automatically changes the DNS record of the DB instance to point to the standby DB instance.
  2. Failover times are typically 60-120 seconds.
  3. DB instances using Multi-AZ deployments may have increased write and commit latency compared to a Single-AZ deployment, due to the synchronous data replication that occurs.
  4. The high-availability feature is not a scaling solution for read-only scenarios; you cannot use a standby replica to serve read traffic.
  5. Amazon Aurora stores copies of the data in a DB cluster across multiple Availability Zones in a single region, regardless of whether the instances in the DB cluster span multiple Availability Zones.

What are the implications from an application perspective?

  1. If an application doesn’t reconnect AND re-resolve DNS after a connection failure, then even if RDS performs the failover correctly, the application may stay offline until it is restarted.
  2. With a failover time of 60-120 seconds (or more), most application connections will time out and result in application failure anyway.  This is not an insignificant amount of time for an application to deal with, even if it does everything right.  Idle timeouts are often set lower than this to reduce the number of idle connections, so the failover mechanism conflicts with general operations.
  3. Synchronous writes are great for guaranteeing no loss of data, but they impact performance.  This is why Multi-AZ deployments provided by AWS are still single-region deployments.  If you wish to use multi-region setups, the recommended practice is to use read replicas that leverage asynchronous replication, but this doesn’t provide an automatic failover mechanism.
  4. Because the standby cannot serve read traffic, scaling reads requires separate read replicas, which are detailed at https://aws.amazon.com/rds/details/read-replicas/.
  5. This implies that Aurora has only a single data store per region, which may (details are unclear) result in a single point of failure.  Of note, Amazon region-level failures have occurred:  https://aws.amazon.com/message/2329B7/.
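
To make the first two implications concrete, here is a minimal Python sketch of a reconnect loop that looks the hostname up again on every attempt and keeps retrying for longer than the typical failover window. The names connect_with_failover and resolve_host, the 180 second budget and the backoff values are all illustrative assumptions, not part of RDS or of any database driver. The sketch also assumes the driver's connect call raises OSError on failure (many drivers raise their own exception types instead), and that the operating system or runtime is not caching the old DNS answer.

```python
import socket
import time


def resolve_host(hostname, port):
    """Ask the system resolver for the current addresses of hostname.

    Nothing is cached here, so a changed DNS record is picked up on the
    next call (subject to OS-level or resolver-level caching).
    """
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return [info[4][0] for info in infos]


def connect_with_failover(hostname, port, connect, resolve=resolve_host,
                          max_wait=180.0, delay=1.0, max_delay=15.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Retry connect(address, port), re-resolving hostname before each attempt.

    `connect` is a hypothetical callable supplied by the caller, for example
    a thin wrapper around a database driver. It is assumed to raise OSError
    when the connection cannot be established.
    """
    deadline = clock() + max_wait
    last_error = None
    while True:
        try:
            # Re-resolve on every attempt so a DNS change made by the
            # failover is noticed without restarting the application.
            for address in resolve(hostname, port):
                try:
                    return connect(address, port)
                except OSError as exc:
                    last_error = exc
        except socket.gaierror as exc:
            # The lookup itself failed; treat it like a failed attempt.
            last_error = exc
        if clock() + delay > deadline:
            raise TimeoutError(
                "no connection to %s within %s seconds" % (hostname, max_wait)
            ) from last_error
        sleep(delay)
        delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

With a real driver, `connect` could be something like `lambda address, port: socket.create_connection((address, port), timeout=5)` or a wrapper that translates the driver's own errors into OSError. Note that the retry budget has to exceed the failover time, so any idle or request timeouts upstream of this loop need to be sized with that in mind.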

At Heimdall, we believe that if your business is important enough to design in redundancy, you should do it right.  With Heimdall, if you provision your DB and a read replica in nearby but different regions, such as Amazon us-west-1 and us-west-2, and leverage Heimdall’s failover orchestration capabilities for replica promotion, you can account for region-level RDS failure. At the same time, failover times are 10x faster than what the Multi-AZ failover provides natively.  In addition, as Heimdall provides the connection management for the application, it can hold the connections while the failover is happening, preventing timeout failures that may otherwise render the application offline.

Remember that the cloud provider's SLA only covers keeping its own service online for its customers.  Often the application fails during a DB failure and remains offline, but the SLA is not violated.  Amazon will mark this as an “application error”. So Amazon does not consider the SLA violated, yet your application is down? That does not make sense.

In our next blog, we will review the MySQL failover solution. Until then…
