Database scaling strategies can take two approaches. The first and simplest is database vertical scaling which relies on larger database servers to process more requests in parallel. While simple to implement, this approach eventually reaches a limit, either due to disk or network IO limitations or simply the cost of larger servers. The second approach, database horizontal scaling, relies on adding more database servers to a cluster to fulfill requests and splitting traffic between them.
For horizontal scaling, the main challenge is maintaining data consistency through multiple servers. And, support requires application code changes. To maintain consistency without sacrificing performance, database clusters use a single read/write server (write master), and multiple read-only servers (read replicas), and split the traffic via “read/write split”. Any changes made to the write master are synchronized to the read replicas. Data changes are only visible on the read slaves after a period of time, usually a few milliseconds, but potentially much longer. This period is termed the “replication lag”.
Typically, read/write split to multiple data sources is configured in the application. These data sources are referenced by the application explicitly in order to access the write server or read replicas. This introduces several challenges for the developer:
- Replication lag may be an issue. Hence, read-replicas are only used if the lag will not impact the results of the operation;
- It adds a burden of testing the application under high replication lag loads, which can cause intermittent and difficult to diagnose bugs in QA.
- If there is a topology change, it adds to the risk of causing an outage, as the write server may go offline, the read server promoted, but the application data source configuration no longer matches the actual cluster.
Even in cluster systems that support multiple active write servers, such as Galera clusters, it is often best to maintain a single central write server; this allows optimizations to be used that improve write performance versus trying to use all writeable servers as write servers. The benefit of multiple write servers is speed and ease of failover without additional orchestration.
Comparing solutions in the market:
Below are some read/write split tools available today:
However, there are drawbacks to the use of the above solutions. Many are not replication-lag-aware. Routing a read to a read slave will be done unconditionally. Some may track lag, and disable using the read slave if the lag becomes too high, but are not intelligent enough to determine WHEN it is safe for a specific query to be routed to the read-slave. For example, even if replication lag is high, a select against a table that hasn’t been written to for an hour would still be safe. None of the above solutions are designed to fully offload the decision making process from the developer when read/write splits are used. They just add a layer of abstraction so that the data sources are not configured at the application level. The user needs read/write split automation.
Automating Read/Write Split with Heimdall Data
The Heimdall Data Access Layer (HDAL) supports SQL caching, database failover and query routing. Query routing includes the automation of read/write split. For deployment, no code changes nor adding “hints” to the queries as comments are required. Each query is parsed to determine what tables are associated, and tracks the last write time for any given table. This information is used with the replication lag tracking, allowing the HDAL to automatically determine what queries are safe to execute on the read slaves.
For example, if the replication lag is detected to be 500ms, and a table was last written to one second (1000ms) ago, then it would be deemed safe to use the read replicas. If however, the last write was 250ms ago, then the write master would be used to complete the read operation, guaranteeing data freshness.
For further protection, HDAL can also add a fixed lag window value, which can be configured in a static manner. This allows the replication lag window to spike on the short-term without impacting the freshness of the data being returned and to add an additional layer of protection. Further control over the read/write split can be applied at a rule level based on matching regular expressions, if necessary.
Finally, in cases where the replication lag is not a concern, there is also the ability to create a read/write split rule for particular queries that should unconditionally be read from the read/write server. This may be as a result of a particular user connecting to the database as well, i.e. to prevent a reporting job from putting load on the write master.
Limitations and Risks
While the Heimdall solution has been designed to provide a strong out of the box capability for handling read/write split, it isn’t perfect due to technological limitations. There are a number of conditions that could potentially break the read routing capability. These include:
- All writes to the database need to be routed through Heimdall, so that the last write time can be tracked on a per-table basis;
- Triggers on table writes that modify other tables will be hidden from Heimdall, so require additional configuration to associated the changed table with writes to the first table;
- Stored procedures and views hide what tables are read from and written to, so when used, additional rules need to be applied to match the calls to the back-end tables used for proper tracking.
To properly handle a case where a write on one table has a side-effect on another table, a rule such as:
This will associate tableA with tableB and tableC, such that a write to tableA will be flagged as a write against tableB and tableC as well. Likewise, call statements to stored procedures and views can be tagged in the same way to associated the underlying tables for tracking.
Further work to automate the association between stored procedures, views and triggers is ongoing to further automate the configuration, simplifying it even further.
The Heimdall Read/Write split solution supports the following databases solutions:
- PostgreSQL Streaming Replication
- MySQL Master/Slave Replication
- MySQL Galera Master/Master Replication
- MS SQL Server Availability Groups w/ standby read-only servers
One database scaling technique is to use a database proxy, which does not require application changes. The other way is to refactor your code to a particular database vendor; while this may give you the highest benefit in the end, there is database vendor lock-in and development investment. You will have to weigh the cost of selecting the “ultimate solution” (refactor code) or “just good enough (proxy)”.
For more information about how Heimdall Data can help your deployment improve its application-to-database performance and scale, contact us at email@example.com today!