WAN backups failing

SUMMARY

This article covers backups that run over a WAN, across an ISP link, or through firewalls.

ISSUE

Backups through a layer 3 gateway or firewall, and especially over an ISP-hosted link, will fail with "network connection stopped" errors or with verify failures (where both checksums are not present, indicating that the backup did not complete). Backup timeouts may also be reported.

For more detailed information on backup failures and performance issues, see Common Backup Failures and Performance Issues.

RESOLUTION

WAN backups are currently supported only over dedicated network links that do not traverse firewalls or layer 3 gateway hops. All connectivity between appliances and agents must be across flat VLANs. In some cases this can be accomplished over WAN links using hardware VPN, point-to-point routes, MPLS, or equivalent systems. Connectivity that a traceroute does not report as a single hop is not supported. For example, separated VLANs and NAT connections are not supported.
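The single-hop requirement can be checked from saved traceroute output. The following is a minimal sketch, not a supported tool: it assumes Unix-style traceroute output in which each hop line begins with the hop number, and the function names `count_hops` and `is_single_hop` are hypothetical.

```python
import re


def count_hops(traceroute_output):
    """Count hop lines in traceroute output (assumed Unix-style format)."""
    hops = 0
    for line in traceroute_output.splitlines():
        # Hop lines begin with the hop number, e.g. " 1  10.0.0.5  0.41 ms".
        # The header line begins with the word "traceroute" and is skipped.
        if re.match(r"^\s*\d+\s+\S", line):
            hops += 1
    return hops


def is_single_hop(traceroute_output):
    """True when the path is reported as exactly one hop."""
    return count_hops(traceroute_output) == 1
```

A path that shows two or more hop lines crosses at least one layer 3 device and falls outside the supported configuration described above.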

Recommendations / Best Practices
1. WAN backup support covers only the Image Agent. Specific changes have been made to tune the TCP socket timeouts and reconnection processes so that sockets stay open even during inactivity or connection instability. Backups of VMs, applications, or other operating systems over WAN are not supported.

2. For each individual protected server, the size and change rate must not exceed the following limits for the given network speed:

Network speed              Max individual server or combined backup thread size   Max daily change rate
Slower than T3             Not supported
T3 (45 Mbps)               300 GB                                                 5%
Fast Ethernet (100 Mbps)   800 GB                                                 7%
Gigabit (1 Gbps)           Most reasonable systems                                Most standard change rates

* The speeds above assume that 100% of the bandwidth is allocated without filters, throttles, or packet inspection. If your network caps threaded speeds below your connection speed, use your maximum sustainable transmission speed in place of the network speed above.

3. To minimize the amount of data sent over the WAN / MAN, incremental forever must be the default strategy, leveraging the change journal for incremental backups. This avoids scanning the file system for changes and minimizes idle time on the connection, which can cause routers to drop it.

4. The Technical Audit provided to sales must capture the WAN / MAN bandwidth available (uplink and downlink) after QoS and any throttling the customer may have, along with a list of the systems to be protected across the link and their sizes.

5. The round-trip time (RTT), as reported by ping, must be 20 ms or less. This information must be captured in the Technical Audit.

6. Packet loss must be less than 1.5%.

7. If possible, any QoS settings that capture packet loss, retransmits, network latency, or jitter must be captured in the Technical Audit (TA).

8. It is recommended that full system restores and BMR be performed locally at the site of the appliance rather than over the WAN. BMR or IR across a WAN link is unlikely to succeed, and attempts to resolve failures cannot be escalated to senior support. Customers will be asked to perform these operations locally to the appliance.

9. If the customer has a high-latency WAN, it may be advisable to reduce MUX concurrency, either through appropriate backup scheduling or by changing the concurrency count for the device.
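The numeric limits in recommendations 2, 5, and 6 can be combined into a quick pre-flight check when filling in the Technical Audit. This is a minimal sketch under stated assumptions: the function name `check_wan_backup` and its parameters are hypothetical, the link speed passed in should be the maximum sustainable speed, and a link that falls between two tiers is held to the limits of the slower tier.

```python
# Limits transcribed from the table in recommendation 2, fastest tier first:
# (minimum link speed in Mbps, max size in GB, max daily change rate in %).
# None means the table gives no numeric limit for that tier.
LIMITS = [
    (1000, None, None),  # Gigabit
    (100, 800, 7.0),     # Fast Ethernet
    (45, 300, 5.0),      # T3, using the document's nominal 45 Mbps
]
MAX_RTT_MS = 20.0          # recommendation 5: 20 ms or less
MAX_PACKET_LOSS_PCT = 1.5  # recommendation 6: less than 1.5%


def check_wan_backup(link_mbps, size_gb, change_pct, rtt_ms, loss_pct):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # Pick the fastest tier whose minimum speed the link satisfies.
    tier = next((t for t in LIMITS if link_mbps >= t[0]), None)
    if tier is None:
        problems.append("link is slower than T3 (45 Mbps): not supported")
    else:
        _, max_size_gb, max_change_pct = tier
        if max_size_gb is not None and size_gb > max_size_gb:
            problems.append(f"size {size_gb} GB exceeds {max_size_gb} GB")
        if max_change_pct is not None and change_pct > max_change_pct:
            problems.append(f"change rate {change_pct}% exceeds {max_change_pct}%")
    if rtt_ms > MAX_RTT_MS:
        problems.append(f"RTT {rtt_ms} ms exceeds {MAX_RTT_MS} ms")
    if loss_pct >= MAX_PACKET_LOSS_PCT:
        problems.append(f"packet loss {loss_pct}% is not below {MAX_PACKET_LOSS_PCT}%")
    return problems
```

For example, `check_wan_backup(100, 500, 5, 12, 0.2)` returns an empty list, while a 400 GB server on a T3 link is flagged for exceeding the 300 GB limit.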

CAUSE

Factors affecting backups / restores
There are a number of factors that affect the performance of a backup over a WAN / MAN:

1. Network resilience:
To perform backups of any sizeable nature, the network connection between the protected asset and the backup appliance must not drop during the execution of the backup. The current backup method does not support fail/resume capabilities, which means the connection must stay alive for the duration of the backup.

2. Network latency:
If the network latency between the protected asset and the backup appliance is high, there is a chance that the connection may be reset due to timeouts.

3. Change rate:
The amount of data being transferred over the WAN / MAN must be minimized to decrease the probability of failure due to network connection drops.

4. OS being backed up:
Some operating systems being protected are more resilient to inactivity on the TCP channels than others. For example, Linux variants have been more resilient for protection over WAN/MAN than Windows.
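Factors 1 and 3 are linked: the more changed data there is to send, the longer the connection has to stay up without dropping. A rough estimate of that window can be sketched as below. The function name `daily_transfer_hours` and the `efficiency` parameter (the fraction of the link actually available to the backup) are assumptions for illustration, and the result ignores protocol overhead, compression, and deduplication.

```python
def daily_transfer_hours(size_gb, change_pct, link_mbps, efficiency=1.0):
    """Estimate hours needed to send one day's changed data over the link."""
    changed_gb = size_gb * change_pct / 100.0
    changed_megabits = changed_gb * 8 * 1000  # decimal GB to megabits
    seconds = changed_megabits / (link_mbps * efficiency)
    return seconds / 3600.0


# Table maximums from recommendation 2:
t3_hours = daily_transfer_hours(300, 5, 45)     # 300 GB at 5% over T3
fast_hours = daily_transfer_hours(800, 7, 100)  # 800 GB at 7% over Fast Ethernet
```

At the table maximums this works out to roughly 44 minutes for the T3 row and roughly 75 minutes for the Fast Ethernet row, assuming the full link speed is available. Halving `efficiency` doubles the time the connection must stay alive.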

NOTES

What is not applicable
This document does NOT apply to endpoint backups, which have non-persistent connections. The protection paradigm for endpoints differs from that for servers: the endpoint has to initiate or resume a backup when it is within network reach of the backup server. This is the opposite of how servers, hypervisors, and similar systems are protected, where the backup appliance controls job scheduling.
