
  • Choosing The Right Replication Approach

    There are three primary methodologies used in database replication: log based, trigger based, and time based. Determining which of these approaches is best for your use case can be difficult without understanding the limitations of each type. This article identifies the differences, pros, and cons of each approach.

    Log Based

    Log based replication is probably the most common and popular approach that many tools support in order to capture changes on a source database. It is often viewed as the least invasive approach since the database itself remains unchanged and only the log files provided by the database are used to determine changes. In general this is true; however, there is some potential overhead if database logging must be turned on or set up. Turning on or setting up logging on an active production environment might lead to unexpected changes in performance. If your production database is already configured with transaction logging enabled, then there is no additional performance impact on the applications already using the database.

    Some other factors should be considered while setting up a log based replication model. The first is the turnover of the log files: it is important to understand how frequently the logs fill up and are removed so that new logs can be created. If your log replication tool is not running, or is not able to keep up with the load at peak times before logs are rolled over, changes could be lost during these intervals.

    Another thing to consider in a log based approach is the number of tables that would contribute to the logging versus how many tables are being configured for replication. For example, if there are 500 tables in the database that are potentially taking on changes but only 10 of these tables need to be replicated, there might be a great deal of overhead in processing the logs for a small set of tables compared to the total set of tables contributing to the logging.   

    Pros 

    • Potentially low or no overhead on the existing database.
    • No need for alterations to an existing database structure.

    Cons

    • If logging is not turned on already, there may be a performance impact once it is activated.
    • If logs fill up and are removed to clear space for new logs to be written, there is the potential for data loss.
    • If the number of tables involved in replication is significantly less than the total table set, the performance of the replication tool might be affected.

     

    Trigger Based

    Trigger based replication is built on the concept of utilizing database triggers to capture each change as part of the transaction in which the change occurs. This is often viewed as more invasive than log based replication because the source tables need to be altered to include the triggers. It is important to understand that in this approach a replication tool is responsible for maintaining these triggers in the event they are not present or the source table is changed.

    One of the key values in this approach is that the capture of a change is directly tied to the user acting on the system, because the same transaction that commits the user's change also ensures the trigger fires successfully. Either all changes are captured, or, if there is an issue, they roll back along with the user's change.

    Another factor to consider with trigger based solutions is the storage of all these changes. Most trigger based approaches write directly to a set of tables within the database that are created and maintained by the replication tool. As long as the database is sized to scale, these changes remain present in the database itself and are not exposed to the data loss that the log based approach risks when log files roll over.

    Pros

    • Transactional guarantee that data is committed to the replication tables once it is committed to the application tables.
    • Less threat of data loss, as long as the database is scaled to grow as more data is collected.

    Cons

    • Involves altering existing application tables to support replication
    • Performance could vary based on each replication tool's implementation of the change capture triggers.

     

    Time based

    Time based replication consists of selecting changes out of the source directly through queries. Although this requires the fewest permissions (select access) on the source database, there are some assumptions that need to hold in order for it to work properly. First, there needs to be a column that represents the last time a row was changed (usually a "last updated" column). If there is no way to construct a query against each table to determine what has changed, this approach will not work properly.

    If all the necessary tables for replication carry data about when they were last modified, the next thing to identify is whether or not deletes need to be captured. In most straightforward time based replication, deletes are not supported. The source database would need to perform what is known as "soft deletes", where a deleted column is toggled to indicate whether the record is still in use. By updating this soft delete column from false to true, and still utilizing the last updated column, the change can be processed through time based replication. Otherwise, a true delete on the source will not show up when selecting changes because the record is no longer present to be retrieved.

     

    Pros

    • Only select access is required on the source database.
    • No additional tables or alterations to existing tables are necessary.

    Cons

    • Deletes are not supported unless the time based process reads in all tables on the source and target and performs a full comparison. This is probably not an option on an active system, or if you need continuous replication rather than a nightly job.
    • All tables participating in the time based replication must have a column that can be used to determine the last change. 

  • Keeping Firebird Generators Up to Date in your DR Database

    SymmetricDS can be used to replicate Firebird databases to the cloud for Disaster Recovery. Setup is fairly straightforward. Select the tables you want to sync and the direction you want to sync them and, voilà, you have a backup database. While this is all well and good, if the database relies on id generators for primary key generation, then when failing over to the slave database you will probably end up with generators that no longer reflect the next value to insert.

    You can solve this problem by adding load transforms to your tables that use generators. The following example is a Java Extension Point that can be used as a column transform that keeps a named generator up to date.

    First, you want to add a new extension point. You do this from the Configure > Extensions screen. Make sure you select the ISingleNewAndOldValueColumnTransform interface.

    Next, you need to add the Java code that represents the implementation of the interface. You do this by selecting the extension and pressing the Edit Script button.

    Following is the implementation of the column transform. It checks that the table and column it is associated with do NOT already have a bigger value in the target table. If they do not, it assumes that it is safe to update the generator to the current id value. Note that it gets the name of the generator from the transform expression, which will need to be configured when the transform is set up on the target table(s).
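
    A rough sketch of what the script body might look like is shown below. The helper methods used (getTransformExpression, getTargetColumnName, getTableName), the SQL, and the exact method signature required by ISingleNewAndOldValueColumnTransform are assumptions based on the SymmetricDS transform API and Firebird syntax, so treat this as illustrative rather than a drop-in script.

    // Sketch only: an ISingleNewAndOldValueColumnTransform implementation receives the
    // platform, column, data, newValue, and oldValue used below from its transform method.
    String generatorName = column.getTransformExpression(); // generator name from the transform expression
    long incomingId = Long.parseLong(newValue);             // id value being loaded into the target

    // Only advance the generator if the target table does not already hold a larger id.
    long maxExisting = platform.getSqlTemplate().queryForLong(
            "select coalesce(max(" + column.getTargetColumnName() + "), 0) from " + data.getTableName());

    if (incomingId > maxExisting) {
        // Firebird statement that sets a named generator to a specific value.
        platform.getSqlTemplate().update("set generator " + generatorName + " to " + incomingId);
    }

    // Pass the id value through to the target column unchanged.
    return newValue;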

    Now we have a registered column transform extension that we can use. It is time to set up the transform on a table. For this example, the table transform is going to be on the NOTE table. We will assume that the NOTE table has already been configured to synchronize. Use the Auto Create button to select the NOTE table. We are using an IMPLIED table transform, which means all of the columns that are not explicitly named will be passed through.

    Next we need to edit the table transform to make the transform a LOAD transform. LOAD transforms execute on the target node. EXTRACT transforms execute on the source node.

    Finally, we need to add our column transform on our ID column. Select the table transform and press the Edit Columns button. Add a column and select ID for the source and target columns. Drop down the transform type and select our custom transform. The custom transform expects the transform expression to be the name of the generator.

    At this point, everything is configured. It is time to test!

    This has been another example of the flexibility of SymmetricDS. The exact same pattern can be used to keep Oracle and Postgres sequences in sync.
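
    As a hedged illustration of that, on Postgres the only line in the sketch above that would need to change is the one that advances the generator; the Firebird set generator statement would be swapped for Postgres's setval function (called through a select, since setval returns the new value):

    // Postgres equivalent of the Firebird "set generator" statement; the sequence
    // name still comes from the transform expression.
    platform.getSqlTemplate().queryForLong("select setval('" + generatorName + "', " + incomingId + ")");

    After this call the next value handed out by the sequence is above the ids already copied to the DR database.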

  • Setting Up Time Based Replication With SymmetricDS

    Time Based Replication in SymmetricDS

    If the source data to be replicated resides in a read-only (select only) database, common replication approaches like log based or trigger based are not an option. Through some additional configuration and a few prerequisites on the source database, SymmetricDS can still be used for replication in a time based approach.

     

    What is Time Based Replication?

    Time based replication is when data is selected out of the source directly and used to replicate to the target(s). Unlike log based or trigger based replication, where each change is captured precisely, time based replication requires some assumptions or additional overhead to determine the changes. If, for example, the source table(s) have a column that indicates the last update time, this can be used to obtain changes (although not deletes, discussed below). Another approach might be to select all data out of the source and compare it to the target. Although this seems more precise, it might not be feasible on larger data sets or databases that incur a constant stream of changes.

    If you are limited to read-only access to the source data, though, the time based approach might be your only option to replicate changes to a target database.

     

    Limitations

    First, we need to identify the limitations of time based replication to determine if this approach will work with your source database.

     

    Does it contain a last updated column?  

    Do the tables to be replicated contain a column that holds an accurate date-time of the last time a row was modified? If not, it might not be possible to determine the changes on this source database without a full data comparison between source and target. For this article, we will only focus on the cases where such a last updated column exists on the source tables.

     

    Do deletes matter?

    Deletes will not be captured in this approach unless they are soft deletes, where a column (perhaps named “deleted”) indicates the row is no longer active. In that case a delete is just an update to a column, and it will replicate. If you need to fully replicate true deletes, that is outside the scope of this article; a full comparison would need to be done against the source and target.

     

    SymmetricDS Configuration

    Prerequisites

    • Set up the primary/registration node as the target node
    • Set up the select only source node as an “extract only node”

    Setting up tables

    Note: there will be two sets of table triggers configured in SymmetricDS for each table. One set is used for a full initial load to send all data in the source tables. The second uses an initial load select to select only data within the time based interval.

    • Create a new channel named “time-based” with default settings.
    • Create a new channel named “time-based-reload” with default settings AND check the reload channel box.

    • Use the “Auto Create” wizard on the configure tab and the table triggers screen.
    • Select tables for Time based replication
    • Select the “time-based” channel
    • Provide a suffix “-time-based”

    • On the configure tab and the table triggers screen type in the filter “-time-based” to filter only the triggers just created.
    • Use the select all to perform a bulk edit.

    • While bulk editing, turn off the insert, update, and delete capture on these triggers.
    • While bulk editing, set the reload channel for all of these tables to “time-based-reload”.

     

    • On the configure tab and the table routing screen type in the filter “-time-based”.
    • Use the select all to perform a bulk edit.

    • While bulk editing, provide an initial load select based on the last updated column and the desired frequency (see Table 1 below).

    Table 1 - Examples of a where clause that goes back 1 minute on several common dialects

    Oracle:    last_updated_time >= (sysdate - (1/1440))
    MSSQL:     last_updated_time > DATEADD(MINUTE, -1, GETDATE())
    MySQL:     last_updated_time >= DATE_SUB(NOW(), INTERVAL 1 MINUTE)
    Postgres:  last_updated_time >= (NOW() - INTERVAL '1 minute')
    H2:        last_updated_time > DATEADD('MINUTE', -1, CURRENT_TIMESTAMP)

     

      

    Custom Job

    Create a custom job that is configured and scheduled to periodically pull updates from the source. Each time it runs, the job inserts a table reload request on the time-based reload channel, which re-extracts each table's initial load select (the time based where clause above) and sends the matching rows to the target.

    • Choose the source nodes group
    • Choose BSH as the job type
    • Schedule the job to run based on the interval you chose above while setting up the table routing (every 1 minute for example).
    • On the edit script section of the job, provide the following script. Replace the source and target node values accordingly.

    import java.util.Date;
    import org.jumpmind.symmetric.model.TableReloadRequest;

    TableReloadRequest reloadRequest = new TableReloadRequest();
    reloadRequest.setChannelId("time-based-reload");
    reloadRequest.setSourceNodeId("YOUR SOURCE NODE ID");
    reloadRequest.setTargetNodeId("YOUR TARGET NODE ID");
    reloadRequest.setRouterId("ALL");
    reloadRequest.setTriggerId("ALL");
    reloadRequest.setCreateTime(new Date());

    engine.getDataService().insertTableReloadRequest(reloadRequest);



  • SymmetricDS Transformation Pipeline

    Often when setting up data replication it is necessary to redirect or transform the source tables and columns into a different set of target tables and columns. SymmetricDS offers this transformation feature as part of the real time data replication workflow. Below we will review the places within the workflow where this transformation can take place and how to configure it.

    Overview

    Many ETL based tools on the market are great at providing a solution for data manipulation and transformation as part of a scheduled process. However, if you need to perform near real time transformations on changes as they occur, SymmetricDS can be a powerful asset, continuously performing these transformations as part of the replication process without any additional nightly batch jobs.

    Pipeline

    First we need to understand the components that make up the replication pipeline and when they are executed. The image below diagrams all the components that can potentially be used as part of the replication pipeline. In order for these to execute properly, the catalog, schema, and table information configured must match as the batch works through the pipeline. If any of these configuration values (catalog, schema, table) do not line up with the prior step, the transforms or filters will not execute.

     

    Routing

    Routing is the process that checks for newly captured changes and determines where they need to go by assigning them to a batch. Target catalog, schema, and table information can be set at the router level. This is often the simplest way to change the target information for multiple tables: setting the target catalog and schema at the router affects all tables that flow through this router (table routing) and provides a single configuration point.

     

    TIP: Setting target information on the router works well when you just need to change the catalog and/or schema but not the table names. If you also need to transform table names, it is better done through transforms.

      

    Extract Transforms

    These transforms fire on the source if the source catalog, schema, and table match the values that came out of routing.   There are 3 ways to hook up an extract transform.

    • Routing values are blank for target catalog, schema, and table.   Then the extract transform would also have blank entries for source catalog and schema.  Source table on the transform could contain a table name or a wildcard (*) to hit all tables.

                  

    • Routing has the "use source catalog and schema" set.   Then the extract transform source catalog, schema, and table would need to match the values in the trigger (source catalog, schema, and table)

                 

    • Routing directly sets any of the values for catalog, schema, or table.   Then the extract transform source catalog and schema would need to match these values set on the router. 

                 

    Load Transforms 

    These transforms fire on the target side if there is a match. There are 2 ways to hook up a load transform:

    • When an extract transform is present.   Then the extract transform target values need to match the load transforms source values.

            

    • No extract transform present. Then the load transform hooks in just like an extract transform, following one of the 3 router combinations explained above for the extract transforms.

    Load Filters

    Load filters must provide a target catalog, schema, and table combination that matches the upstream settings in the pipeline in order to execute. Load filters can provide a variety of customizations on the data once executed. Below is an example of how you could set the target catalog and schema on the target side rather than through a router or transforms. This might be useful if only the target knows where its data should be loaded (remember, routing occurs on the source). Through the use of a wildcard (*) as the target table, a single load filter can support all tables rather than setting up a transform for each table.

     

    Example: Beanshell load filter "Before Write" to set the catalog and schema.

    // Check that the table is not a SYM_* runtime table
    if (!table.getName().startsWith(engine.getTablePrefix())) {
        table.setCatalog("catalog1");
        table.setSchema("dbo");
    }
    return true;


    Example: Similar to the one above, but the values are now dynamic and can be set in each target's engine properties file (target.catalog and target.schema); an example of the property entries follows the script.

    // Check that the table is not a SYM_* runtime table
    if (!table.getName().startsWith(engine.getTablePrefix())) {
        table.setCatalog(engine.getParameterService().getString("target.catalog"));
        table.setSchema(engine.getParameterService().getString("target.schema"));
    }
    return true;
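
    For reference, a minimal example of the corresponding entries in a target node's engine properties file (the values shown are placeholders):

    target.catalog=catalog1
    target.schema=dbo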