Hot Backup Copy (Replication) fails and retries for some large clients

SUMMARY

A large Hot Backup Copy (Replication) job may fail while in progress and restart from the beginning. The restarted job may then succeed, or it may continue to fail and retry.

ISSUE

Hot backup copy jobs (formerly called replication) may process normally for most clients but fail and retry for others. The retried jobs may complete normally or may continue to restart. This is especially common with very large backups.

A specific type of network timeout, where certain results of database requests against the target were not returned properly, can cause a postCheckOffset error in logs. In some cases, detecting and retrying the transaction may result in success.

For more information on Hot Backup Copy (Replication) issues, see Unitrends KB 5030 - Hot Backup Copy (Replication) Overview, Setup and Error Troubleshooting.

RESOLUTION

Upgrade the sources and target to release 9.0.0-13 or later and then retry replication.

To verify the replication target is on release 9.0.0-13 or higher, run the following command from a command shell on the source appliance:

psql -c "select * from bp.managers"
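Once the release string for the target has been read from the command output, the comparison against 9.0.0-13 can be scripted. The following is a minimal sketch, not a Unitrends tool: it assumes the release is written in the form major.minor.patch-build (as in "9.0.0-13"), and the function names parse_release and has_retry_fix are hypothetical.

```python
import re

# Minimum release named in this article, as (major, minor, patch, build).
MINIMUM_RELEASE = (9, 0, 0, 13)


def parse_release(text):
    """Turn a release string such as '9.0.0-13' into a tuple of integers.

    Assumption: releases follow the major.minor.patch-build form used in
    this article. Any other form raises ValueError rather than guessing.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized release string: {text!r}")
    return tuple(int(part) for part in match.groups())


def has_retry_fix(release):
    """Return True if the release is 9.0.0-13 or later."""
    # Tuples compare element by element, so 9.0.0-13 < 9.0.1-2 < 10.0.0-1.
    return parse_release(release) >= MINIMUM_RELEASE
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "9.0.0-9" as later than "9.0.0-13".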

If you are a Unitrends Cloud customer and find your target to be on a release prior to 9.0.0-13, please contact Unitrends Support.

This issue is difficult to troubleshoot as it requires access to the target, reset of log levels, increased log retention, and reproduction of the issue live. If an upgrade to release 9.0.0-13 or later has not resolved backup copy failures, please contact Unitrends Support.

CAUSE

In release 9.0.0-13 and later, reliability improvements allow for proper retries, especially on network connections where reliability may be a concern. Note: these issues may not be directly related to the local network, ISP connection, or hardware in use. Although the fix for this issue is on the replication target, not the source, both ends must be upgraded to benefit from it.

Confirming that this error occurred requires access to the vaultServer logs on the replication target.

When reviewing the vaultServer logs for a failed job, output similar to the following may be seen:

vaultServer.c:2916: blockCheckOffset: call putBackupsHashValues return 0, backup_no 798, epoch 1461976349, number entries 1024.

Log level 2 for VaultServer is required to completely troubleshoot this condition.
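When a vaultServer log from the target is available, lines like the one shown above can be picked out with a short script. This is a hedged sketch only: the pattern is derived from the single sample line in this article, the actual log format may vary between releases, and the function name find_block_check_entries is hypothetical.

```python
import re

# Pattern built from the sample log line in this article. It is an
# assumption that other occurrences follow the same wording.
BLOCK_CHECK_PATTERN = re.compile(
    r"blockCheckOffset: call putBackupsHashValues return (?P<ret>-?\d+), "
    r"backup_no (?P<backup_no>\d+), epoch (?P<epoch>\d+)"
)


def find_block_check_entries(lines):
    """Collect the return value, backup number, and epoch from matching lines.

    `lines` is any iterable of strings, for example an open log file.
    Lines that do not match the pattern are skipped.
    """
    entries = []
    for line in lines:
        match = BLOCK_CHECK_PATTERN.search(line)
        if match:
            entries.append({key: int(value) for key, value in match.groupdict().items()})
    return entries
```

Passing an open file handle for the log processes it line by line, so large logs do not need to be loaded into memory at once.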
