Subsetting Data During Replication

Subsetting data during replication can improve the overall performance of your sync scenario and reduce the size of target databases.   This article will explain some of the ways SymmetricDS will allow you to subset your data and how to consider the performance implications with each case.

What is data subsetting?

A common replication use case involves a centralized database that is synchronizing multiple client (remote) databases.   A simple configuration would be to just send all the data to all locations.   However there are often tables that do not need to be fully sent to each client node.   This is when subsetting should be considered to avoid cluttering your network and client databases with unnecessary data that is not used.

Subsetting is the process to make decisions at the central node regarding which data should be sent to each client.   

How does subsetting work in SymmetricDS?

In general subsetting can be best accomplished in SymmetricDS through a variety of routers.   Routing in SymmetricDS is the process by which changes are assigned batches and target nodes so that they can be replicated.    There are 3 out of the box routers (a fourth coming in 3.11) that will likely cover all your subsetting needs and will be discussed with the pro’s and con’s of each below as well as a few alternatives. 

Column Match Router

Column match routers are the fastest and and simplest subsetting router to configure.   They simply require a column on the table being replicated that will contain the value of the node ID each row is to be sent.   This router can also be reused for all tables that utilize the same column name for subsetting.

For example if you have 10 tables that each have a LOCATION_ID column, a column match router (router expression LOCATION_ID=:NODE_ID) can be used to replicate all of these tables.   This will result in only rows that match a registered node ID being sent to the appropriate client.

Pros

  • Best performance – Since the column used to subset the data is part of the captured data it is immediately available during the routing process without overhead.
  • Simple Configuration – A single key value pair is used as the router expression to determine which column is used.

Cons

  • Schema Dependent – Requires a node id column on any table that is to be subsetted.   Such a column may not already exist.
  • One to One Mapping – A single row may only be routed to either one node ID or one node group.   Sometimes more complicated logic is required to subset and this router will not be sufficient.

 

Lookup Table Router

Lookup table routers use a dependency or relationship table to look up the target node’s external id.   This allows a single row to have multiple targets if there are multiple matching rows in the lookup table.    This does involve some additional database work because the necessary information used during routing must be queried from an additional table.   In order to improve performance though the full lookup table is loaded into memory each time the routing job runs.   This can eliminate more than one call to the database to lookup all data using this router.   However since the table is loaded into memory it is not a great fit for large lookup tables (millions of rows).

Pros

  • Performance – Will perform better than a subselect router if the lookup table is not too large.   
  • Simple Configuration – Not quire a simple as column match router but instead requires just 4 key value pairs as the router expression to determine the target node(s).
  • Many To Many Mapping – Since this is based on an additional relationship table the same source row can be mapped to multiple target nodes in the relationship/lookup table.

Cons

  • Database Access – Does require an additional query of the lookup table and the time to load it into memory with each routing pass. 
  • Large Lookup Tables – Since the entire lookup table is loaded into memory with each execution of the routing table, extremely large tables (millions or rows) may not perform or require more memory for SymmetricDS to run properly.

NOTE in the example below the REP with ID of 1 is sent to all 3 target nodes based on the rules in the lookup table.

Subselect Router 

Sometimes the subsetting logic is more complicated than a single column or a single lookup table.  As a result more advanced queries are needed to determine where a row of data should be sent.   In these cases a subselect router might be an option.   Be cautious while configuring a subselect router though as every row change that occurs will initiate an additional subselect database query to be ran to determine the target node(s).    This might not be of much impact if the number of changes are trickling in and your systems are designed to handle the additional load.   However with transactions that create a large number of changes on a single table that is configured for subselecting this could slow down the replication time of changes arriving at their destination (see convert to reload router below).

Pros

  • Flexibility – Allows users to provide a custom sql statement to determine target node(s).   This might involve joining against multiple other tables.   
  • SQL Based – Sometimes it is easier to think about how data should move in your use case through a SQL statement.

Cons

  • Performance – The provided SQL statement will be run for every row that changes and is configured to use a subselect router.

Beanshell/Java/Extension Based Router

If the routers above are unable to meet your business case for subsetting there are more custom approaches that can be used.   The scope of these will not be addressed in this blog article but more details are available in the documentation.   In short these options allow for users to provide a programmatic approach to routing through various coding languages.

BONUS : Convert to Reload Router NEW in 3.11

As a solution to the overhead that is caused using a subselect a new approach was designed in 3.11.   The new convert to reload router supports use cases where a subselect is required (multiple tables involved in subsetting) and there are many changes coming in at once.   This might happen with an overnight batch process that loads in a great deal of changes in a single run that all require complex logic in order to subset. 

The convert to reload will capture the primary keys of these changes only and load them into a temp table.   The subselect query can then be provided to run a single time to join the temp table to the necessary business tables so that all the data can assigned an appropriate target node with a single query instead of one per row.    This will then generate a reload event for each of these rows in SymmetricDS to sync only the changes. 

Pros

  • Flexibility – Allows users to provide a custom sql statement to determine target node(s).   This might involve joining against multiple other tables.   
  • SQL Based – Sometimes it is easier to think about how data should move in your use case through a SQL statement.
  • Faster than subselect as it does not issue a sql statement per change but instead uses a temp table so that a single query can be issued

Cons

  • Maintenance – The user is responsible for creating the temp table and purging it