How can Endless Aisles maximize brand loyalty, revenue, and experience? It’s clear by now, consumers …
Subsetting data during replication can improve the overall performance of your sync scenario and reduce the size of target databases. This article will explain some of the ways SymmetricDS will allow you to subset your data and how to consider the performance implications with each case.
A common replication use case involves a centralized database that is synchronizing multiple client (remote) databases. A simple configuration would be to just send all the data to all locations. However there are often tables that do not need to be fully sent to each client node. This is when subsetting should be considered to avoid cluttering your network and client databases with unnecessary data that is not used.
Subsetting is the process to make decisions at the central node regarding which data should be sent to each client.
In general subsetting can be best accomplished in SymmetricDS through a variety of routers. Routing in SymmetricDS is the process by which changes are assigned batches and target nodes so that they can be replicated. There are 3 out of the box routers (a fourth coming in 3.11) that will likely cover all your subsetting needs and will be discussed with the pro’s and con’s of each below as well as a few alternatives.
Column match routers are the fastest and and simplest subsetting router to configure. They simply require a column on the table being replicated that will contain the value of the node ID each row is to be sent. This router can also be reused for all tables that utilize the same column name for subsetting.
For example if you have 10 tables that each have a LOCATION_ID column, a column match router (router expression LOCATION_ID=:NODE_ID) can be used to replicate all of these tables. This will result in only rows that match a registered node ID being sent to the appropriate client.
Lookup table routers use a dependency or relationship table to look up the target node’s external id. This allows a single row to have multiple targets if there are multiple matching rows in the lookup table. This does involve some additional database work because the necessary information used during routing must be queried from an additional table. In order to improve performance though the full lookup table is loaded into memory each time the routing job runs. This can eliminate more than one call to the database to lookup all data using this router. However since the table is loaded into memory it is not a great fit for large lookup tables (millions of rows).
NOTE in the example below the REP with ID of 1 is sent to all 3 target nodes based on the rules in the lookup table.
Sometimes the subsetting logic is more complicated than a single column or a single lookup table. As a result more advanced queries are needed to determine where a row of data should be sent. In these cases a subselect router might be an option. Be cautious while configuring a subselect router though as every row change that occurs will initiate an additional subselect database query to be ran to determine the target node(s). This might not be of much impact if the number of changes are trickling in and your systems are designed to handle the additional load. However with transactions that create a large number of changes on a single table that is configured for subselecting this could slow down the replication time of changes arriving at their destination (see convert to reload router below).
If the routers above are unable to meet your business case for subsetting there are more custom approaches that can be used. The scope of these will not be addressed in this blog article but more details are available in the documentation. In short these options allow for users to provide a programmatic approach to routing through various coding languages.
As a solution to the overhead that is caused using a subselect a new approach was designed in 3.11. The new convert to reload router supports use cases where a subselect is required (multiple tables involved in subsetting) and there are many changes coming in at once. This might happen with an overnight batch process that loads in a great deal of changes in a single run that all require complex logic in order to subset.
The convert to reload will capture the primary keys of these changes only and load them into a temp table. The subselect query can then be provided to run a single time to join the temp table to the necessary business tables so that all the data can assigned an appropriate target node with a single query instead of one per row. This will then generate a reload event for each of these rows in SymmetricDS to sync only the changes.