Common Data Replication Mistakes

There are some common mistakes when implementing a data replication solution that should be avoided. As I’ve helped companies use SymmetricDS effectively for database replication as part of their data integration strategy, I’ve seen the same mistakes and misunderstandings. In this article, I’ll explain some of these pitfalls and give you some tips on what to do instead.

Get Close to Your Database

Most native database protocols were designed for a local area network (LAN). Connecting to the database over a wide area network (WAN) can result in slow and unreliable replication due to network errors, outages, and timeouts. Instead, deploy the replication engine either on the same server as the database or one nearby. For example, if your database is running on Relational Database Services (RDS) in Amazon Web Services (AWS), then install the replication server to Elastic Compute Cloud (EC2) in the same region.

Choose a replication engine that is designed to work across the WAN. It should let you install an engine for each database, and then the two engines should communicate with each other. For example, SymmetricDS can be installed multiple times, and each instance of the engine uses an efficient data protocol to communicate with others over HTTP/S. Next, the engine’s design should tolerate and overcome network errors. SymmetricDS retries communication and uses the transactionality of the database to guarantee data is loaded.

Know Thy Data

While it would be easy to turn on data replication for all tables, columns, and rows in the database, this approach is inefficient and can lead to problems down the road. It may work fine for a while when data changes are light, but a large change event can suddenly create a backlog and wreak havoc. By understanding your data, you can configure what data really needs to sync.

Do you really need all tax rules for all jurisdictions at every point of sale terminal, or do you just need the rules for the jurisdiction of that retail location? Should all sales receipts be synced from the central office to every retail location, or should receipts sync in one direction and be retrieved for returns when needed? Even specific columns might be excluded from replication. One popular retail software package adds a new retail store by loading product items into the database for it, but it also touches the “last modified” date of the master item table, causing all items to be sent to all store locations. That worked fine for a few locations, but try scaling that to a few thousand locations. Excluding that column from change capture saves both bandwidth and time.

Conflicted Data

Sure, there’s a conflict manager in many data replication engines, including SymmetricDS, but it might not work the way you think it does. If client A inserts a row with an ID of 1, and client B also inserts a row with an ID of 1, the conflict manager picks the row that wins and overwrites the other one. No, it doesn’t rename one of the rows to be an ID of 2. You need to design your application and database so it’s conducive to replication.

To handle this situation, some people use a globally unique identified (GUID), which is a random 128-bit bit number, as the primary key to get unique rows. Another approach is to control the sequence number so that each database has its own set of numbers, such as client A starts at 1 and client B starts at 2 using increments of 2. Or, just add a column indicating where the row came from and make it part of the primary key. Some replication engines like SymmetricDS even let you configure a transform to populate the column. This makes sense for clients replicating in one direction to a central database, where the source node ID is populated as the data loads.

Even More Conflicted

A more insidious conflict is two clients updating the same row and same column. The conflict manager can’t merge the results because the same column is being updated. One client will lose its update. However, there is a clever trick that can be done if your data is a counter and if the replication engine has transformations. Let’s say you’re replicating an inventory on-hand count. SymmetricDS has access to the old and new values, and it can transform data before it loads. You can configure a transform that applies the difference to the column instead of the absolute value. For example, client A changes the count from 10 to 5, and client B changes the count from 10 to 11. Instead of trying to update the column to the value of 5 or 11, the transform decrements the current value by 5 and increments by 1, thus merging the changes, regardless of order.

Heading for Overhead

Sending subsets of data during replication is a powerful feature, but be aware of how much overhead is added to accomplish it. The column match router in SymmetricDS lets you use a column within the row to figure out which databases will receive the row, which is very fast. But what about when the column you need exists on another table? That’s when you run a SQL query to find it, which is called a subselect router in SymmetricDS. The subselect is acceptable for a small number of rows, but if you have an initial load with millions of rows, suddenly you’ve generated too many queries on the database. For SymmetricDS, the easiest way to improve performance here is to specify an initial load SQL query on the trigger router, which uses the database to join and run the query once for all rows.

Wrapping Up

Those are some of the common ways I’ve seen data replication misused to create an unreliable system. If any of that sounds familiar, at least you’ve got company. Hopefully, you’ve found some ideas here to help take the misery out of replication. At JumpMind, we’ve been through many data integration projects, so let us know if you want to discuss or compare notes.