Tips for Fast Initial Load

For a large database, the conventional initial load that extracts each table, one by one, can run for days. Not everyone can wait for that, so here are some tips to improve initial load performance on SymmetricDS.

Parallel Extraction

Use more than one reload channel to extract tables in parallel. During configuration, each table is assigned a channel for change data capture and a reload channel for the initial load. By grouping tables and assigning them to one of multiple reload channels, we can achieve parallel extraction. Try to balance the reload channels so each one has about the same amount of data to extract, either using row counts or table size. Most databases have a system table you can query to get a list of tables and their size. For example, on Oracle you can run the following SQL statement:

select table_name, num_rows * avg_row_len from user_tables order by 2 desc

By sorting the tables from largest to smallest, it’s easy to make equal-sized groups. Once you’ve decided on reload channels, go to the Configure -> Table Triggers screen or sym_trigger table to set the reload channel for each table.

One caveat is that long running extract queries can cause the rollback space of a database to fill up, so you might need to increase rollback space or reduce the number of parallel extraction threads. To control how many channels can extract in parallel, change the following parameter:

initial.load.extract.thread.per.server.count=20

Parallel Loading

Now that you have multiple reload channels for extraction, you can make these channels load in parallel as well. Threads for loading are controlled at the channel level. Go to the Configure -> Channels screen or the sym_channel table and update the “queue” field. Each queue has a thread that processes channels by their priority order. You can use any queue name you want to create new queues. For example, give each reload channel a queue named after the channel, and you’ll get all reload channels to load in parallel. There is a limit on the total number of threads associated with push and pull synchronization, so you may need to adjust those parameters also:

pull.thread.per.server.count=10
push.thread.per.server.count=10

No Constraints

When loading data into tables, the database must check that constraints are satisfied and update indexes for each row. It’s faster to defer constraints and indexes until after all the data is loaded. You should keep primary keys, but drop (or disable) all other constraints and indexes. If you’re using SymmetricDS to create tables, then the following parameters will create tables without foreign keys or indexes:

create.table.without.foreign.keys=true
create.table.without.indexes=true

Not only does this help the load run faster, but it’s also necessary to avoid foreign key errors on some systems. When we created multiple reload channels, unless we were careful to keep dependent tables on the same channel, it will break foreign key order. It also solves a timing issue for busy systems that are being changed while the load runs. (For example, users can make changes to table A while it’s being extracted, and when table B extracts, it will find references to rows in A that weren’t extracted.) Of course, the tables become consistent after the load is complete and the changes catch up, but you need foreign keys disabled to allow that to happen.

After the load is complete, and changes have caught up to the point in time when you started the load, set the parameters back to false, then use the Manage -> Nodes screen to Send -> Table Schema (or use the “symadmin send-schema” command) to create all the indexes and foreign keys.

This technique of deferring constraints and indexes works so well that it was automated as a default feature in SymmetricDS version 3.10. Manual manipulation with the parameters as described here, or managing the DDL yourself works for earlier versions.

Bulk Loader

Most database platforms have a bulk loader in SymmetricDS that uses platform-specific features for faster loading. Bulk loaders are included for Oracle, SQL-Server, PostgreSQL, MySQL, and other databases. Bulk loading is enabled at the channel level. From the Configure -> Channels screen or the sym_channel table, edit the reload channels and set the data loader type to “bulk”. If a bulk loader isn’t available for your database, it will use platform-agnostic JDBC batch updates instead. The bulk loaders perform best when the target tables are empty. If any error occurs, such as an existing row is found, it will fall back to using the default loader, which will slow down the load.

Disable the Default Channel

For active databases, disable the channels used for change data capture right before starting the initial load. When the source database is being changed, it can create a backlog of many small batches. By disabling the channel, we’re telling the router service to let the changes be captured, but don’t put them into batches yet. After the initial load is complete, enable the channels again, and the router service will create the batches needed to catch up. Instead of a large number of small batches, there will be a small number of large batches. Not only does this help the system run faster, but it also gives you a simplified view of batches in the system as you watch the progress of the initial load.

Conclusion

We can speed up the initial load by tuning a combination of settings. We use reload channels for parallel extract threads and channel queues for parallel load threads. We use a bulk loader and defer indexes and constraints to load data into target tables quickly. For busy systems, disabling the default channel during the load ensures a speedy recovery of changes afterwards. I hope these tuning tips help you with a fast and successful initial load of data.