SUMMARY
A starting point for diagnosing backup failures
ISSUE
A backup has failed. There are many conditions from permissions, connectivity, storage limitations, environment impacts, 3rd party interference, misconfiguration, and more that could result in a backup failure. This article will attempt to address some basic concepts and common issues that will assist in resolution.
Hard backup failures are typically associated with a conclusive error that can be identified with minimal effort. In rarer cases, backups may be marked failed due to timeouts caused by many other factors, or because appliance storage is insufficient to land a backup after waiting a reasonable time for space to be prepared. In all these cases, backup failure information is readily available in one of the following locations, each of which includes filtering tools to make it easy to locate your failed job:
- Reports > Backup > Weekly Summary (a handy report for quickly identifying which scheduled jobs did and did not complete successfully. Itemizes any queued job)
- Dashboard > Backup Summary Widget > Click Errors to load the failures report for the last 7 days > Click your failure (itemizes only failures, but for all queued jobs)
- Reports > Backup > Backup Failures > Set a date range > Click the job to review its log. (same report as above but with a flexible range instead of 7 days)
- Jobs > Recent Jobs Tab > Select a job, Click Details (last 7 days of jobs including hot and cold copies as well)
- Reports > Backup > Backup History > Set a range > Click the job to review its log. (only includes backups that began, meaning a connection was made and work started, but contains more details.)
RESOLUTION
Did a job ever officially begin?
Such errors will be found in the backup failures report, schedule failure email output, or recent jobs summary, but they will not be itemized in backup history, as no backup began in these cases. Most of these revolve around basic connectivity problems like those below:
What can cause the "Host appears to be up, but had a network related error" messages?
- This error can result from an unsuccessful agent upgrade, an agent that is not running, or interference from a software firewall. Check the troubleshooting steps in Host appears to be up, but had a network related error (-255).
- A "Could not contact client after 5 attempts" error means we were unable to ping the asset. This could be a DNS error, the client may have a new IP, or routes through firewalls may be blocked. See Windows Backup Error: Task failed: Could not contact client after 5 attempts.
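The connectivity causes listed above (DNS, a changed IP, blocked routes) can be checked quickly from another machine on the same network. The following is a minimal Python sketch and not a Unitrends tool. The function names are illustrative, and using port 1743 (the agent port mentioned later in this article) as the port to probe is an assumption to adjust for your environment.

```python
import socket


def resolve_host(hostname):
    """Return the sorted unique IP addresses a name resolves to ([] on failure)."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def diagnose(hostname, port=1743):
    """Return a short, human-readable summary of basic reachability.

    The default port is an assumption taken from this article, not a
    guarantee of what your agent version listens on.
    """
    addresses = resolve_host(hostname)
    if not addresses:
        return f"{hostname}: name does not resolve (check DNS or a changed IP)"
    if port_is_open(hostname, port):
        return f"{hostname}: resolves to {addresses} and port {port} is reachable"
    return f"{hostname}: resolves to {addresses} but port {port} is blocked or closed"
```

Calling `diagnose()` with the client's name as registered on the appliance shows whether the problem is name resolution or a blocked port, which helps decide which of the linked articles to follow first.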
Did a job begin but fail within a few minutes without transferring data?
Typically a clear error will be reported in the backup summary report for such an event. Review the data presented. Common issues are noted here, but if your error is not covered, note it and check our KB system, as most errors are documented there. If you engage Unitrends Support, provide this error output to your Customer Engineer to save time.
What can cause snapshot related errors like "USNAPError! Writer has failed PrepareBackup request" or "VSS Errors"?
- For Windows file level, image level, and application backups, the Microsoft VSS agents are invoked during the backup. VSS is a critical service to ensure locked files can be protected and every backup vendor depends on it. If VSS is non-functional a backup cannot proceed.
- If VSS errors are confirmed, restart the VSS service or reboot the client to reset VSS. Note: restarting some VSS services may impact other operations, and no backups or snapshots can be running during the restart. Proceed with caution and plan for possible downtime if the service does not restart successfully. See Volume Shadow Copy Service Errors.
- Check to ensure that you have sufficient free disk space on the volume that is failing. As a general guide, a minimum of 10% free space should always be maintained, but Microsoft recommends 30% free on AD NTDS volumes or wherever a database or its logs are stored.
- Exchange VSS errors are numerous. Information about a Microsoft-provided VSSTester process is included in Exchange Backup - VSS_WS_FAILED_AT_FREEZE
- If the application is SQL Server, additional information can be found in Unitrends VSS SQL agent errors and warning messages
- Incremental backups depend on finding new and modified files, and Windows Replicas and incremental forever backups also depend on knowing what has been deleted, moved, or had underlying permission changes. This information is captured in the NTFS USN Journal, also called the change journal. The most common journal issue is that more change occurs in Windows between two backups than the journal size can accommodate. If the journal has wrapped, a new full backup will be required.
- This issue is typically solved by increasing journal size and performing a new full, but may also be solved by performing more frequent backups or identifying the nature of high change and ensuring that such change is expected.
- For instructions on how to increase the change journal, see Journal wrapping or Journal Overflow / Expand journal
- For an overview of how the NTFS change journal works, see Change Journal Records.
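The free-space guideline in the list above (10% in general, 30% on AD NTDS or database and log volumes) is simple to check with a script. The following is a minimal Python sketch using only the standard library. The function names and the `hosts_database` flag are illustrative, and the thresholds are the ones quoted in this article rather than values read from any product.

```python
import shutil

GENERAL_MIN_FREE = 0.10   # general guideline quoted in this article
DATABASE_MIN_FREE = 0.30  # guideline for AD NTDS, database, or log volumes


def free_fraction(total_bytes, free_bytes):
    """Return free space as a fraction of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def meets_guideline(total_bytes, free_bytes, hosts_database=False):
    """Return True if the volume meets the applicable free-space guideline."""
    threshold = DATABASE_MIN_FREE if hosts_database else GENERAL_MIN_FREE
    return free_fraction(total_bytes, free_bytes) >= threshold


def check_volume(path, hosts_database=False):
    """Check the volume containing `path` (for example "C:\\" on Windows)."""
    usage = shutil.disk_usage(path)
    return meets_guideline(usage.total, usage.free, hosts_database)
```

Run `check_volume()` on the client against each volume included in the failing backup. A result of `False` is a prompt to free space before retrying, not proof that space is the cause of the VSS error.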
Did a job run for a while receiving data and then result in failure?
What can cause a backup to fail with "FATAL: Incomplete backup received from client."?
This error generally means we were able to communicate with the client and pass pre-checks, but did not receive the expected data. Either the backup ended abruptly before completing, or it completed but with an unacceptable amount of skipped data. The details of the report will usually help clarify the specific condition, but if the agent was terminated on the client or the client rebooted during a backup, the appliance would not be aware of it.
- Check the Backup Summary report and/or client catalog files for skipped file information and resolve the condition that caused the files to be skipped. See How to find skipped files in a "Yellow" (warning) backup. This article also applies when backups are red/failed.
- Ensure other processes do not interrupt scheduled backups, and that no other backup products or snapshots are used.
- Ensure the asset is not rebooted or otherwise impacted during scheduled backups.
- On Hyper-V, Xen, or AHV, ensure exclusively an Agent or a VM backup is used, not both.
In some cases we time out completing standard operations, or a firewall between our appliance and the protected asset closes ports.
- See Network Connection Stopped / Increasing Master.ini Timeout Values on a Windows Server. Note: increasing these timeouts is considered a workaround. It should never be necessary on a healthy server, and requiring this change indicates that a critical or severe condition exists with Windows or your hardware that needs to be addressed directly. Your client system may be in danger of imminent crash or corruption if this change is required!
- Exchange backups may have similar pressures. See Exchange backup failing for "Network connection stopped". This is often due to Exchange consistency checks failing, which is a production emergency, or to failure of log truncation, which may also indicate Exchange database or log corruption. It is common for Exchange databases that are larger than Microsoft's recommended 100-200GB maximum size for databases that do not have at least 3 DAG copies. It may also indicate a lack of sufficient IO or other resources in the Exchange server if the timeouts are related to VSS operations that are not outright failing.
- Ensure backups are done across local LAN connections without firewall hops. If hops are detected, enable additional network adapters in the Unitrends appliance and use proper network route configuration to ensure chokepoints between VLANs do not impact backup operations. If the appliance cannot be directly connected to the VLAN the asset resides in, deploy an alternate appliance in that VLAN instead, or, work with your firewall administrator to resolve.
- For backups across WAN or MAN links, see WAN backups failing
- Ensure your server is not running out of free space.
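One quick way to spot a likely hop between the appliance and a protected asset, as discussed in the list above, is to check whether both addresses fall in the same subnet. The following Python sketch uses the standard `ipaddress` module. The function name and the example addresses are hypothetical, and the check only looks at addressing: it cannot detect a transparent firewall on the same segment.

```python
import ipaddress


def same_subnet(appliance_ip, client_ip, prefix_length):
    """Return True if the client address lies inside the appliance's subnet.

    `prefix_length` is the appliance interface's subnet prefix, e.g. 24
    for a 255.255.255.0 mask.
    """
    network = ipaddress.ip_interface(f"{appliance_ip}/{prefix_length}").network
    return ipaddress.ip_address(client_ip) in network
```

For example, `same_subnet("192.168.10.5", "192.168.20.7", 24)` returns `False`, which means traffic between the two is routed and may cross a firewall or VLAN chokepoint.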
This indicates that backup operations for agent-based file or application backups did complete, but port 1743, which is required for wrap-up communication at the end of the operation, was no longer open. This is most common when connecting to clients through firewalls or L3 network segments, but can also indicate a lack of RAM resources in Windows systems, resulting in Windows terminating idle WinSock connections.
- Ensure no network hops exist between client and Unitrends Appliance.
- Ensure your client system is not using pagefile resources prior to backup launch.
- Limit load on the server during backups, especially older 2008/2003 servers.
- Address IO performance issues in your servers, which are often exacerbated by pagefile activity.
- Ensure antivirus processes are set with proper exemptions to ignore wbps.exe processes. See Recommended Anti-Virus exclusions for Unitrends backup
- If the server runs SQL or other database engines, ensure basic SQL tuning has been done to lock SQL processes to available RAM and ensure SQL does not spill into pagefile or kernel memory space.
This error is seen when the predicted or actual space needed to land a backup exceeds the free space the appliance can make available. Note that appliance space is reserved to the D2D folder, and the product will always keep some limited additional free disk space available for internal processes and restore operations. This error does not mean your disk is full, only that the appliance cannot hold both the data it has already been asked to retain and the new backup at the same time. For new data to come in, old data must go out in most cases, and this error occurs because that operation was not possible or could not complete in time.
- See Error: No more space on device to address this condition.
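The point that this error does not mean the disk is full can be shown with simple arithmetic. The following Python sketch is an illustrative model only. It is not the appliance's actual retention or purge logic, and every name and parameter in it is hypothetical.

```python
def can_land_backup(free_bytes, reclaimable_bytes, incoming_bytes, reserve_bytes=0):
    """Illustrative model: does an incoming backup fit once old data is purged?

    free_bytes        -- space currently free on the backup storage
    reclaimable_bytes -- space that could be freed by removing old backups
                         that retention settings allow to go
    incoming_bytes    -- predicted size of the new backup
    reserve_bytes     -- space kept back for internal and restore operations
    """
    usable = free_bytes + reclaimable_bytes - reserve_bytes
    return incoming_bytes <= usable
```

With 2 units free, 1 unit reclaimable, and a 4-unit incoming backup, the function returns `False` even though the storage is not full. In this model, the result changes only if more old data is allowed to go out or the incoming backup gets smaller.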
TASKS
For many errors, Unitrends Development staff continuously works to improve the identification of issues, and in many cases can work around non-critical errors with updated code. Please always ensure you are running the latest available appliance release and matching agent version where one exists. Updated agents may at least provide improved error messaging, and upgrading alone may solve many problems.
Though older agents are allowed to continue functioning on a newer appliance release, that is intended only to provide backward compatibility for the short period between an appliance update and a safe maintenance window to install updated agents on clients, as we understand some agent updates may require new full backups. We allow you to run on an older agent until you can accommodate that change, but do insist on agent updates being done as quickly as possible. Some appliance and/or agent updates are so critical that we will require a new full immediately on appliance update whether or not the agent is updated, and users are requested to check release notes for those conditions before upgrading. Unitrends Staff will only officially support the current appliance running a current agent and does not support back releases other than to either assist in upgrading to a current release, or for P1 restore emergencies, after which an update will still be requested to be completed for the future.
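When auditing many clients, a short script can flag agents that lag behind the appliance release. The following Python sketch assumes that both versions are plain dotted numeric strings that can be compared directly. That assumption, the function names, and the version numbers used below are hypothetical, so confirm the real version format and the agent and appliance pairing in the release notes.

```python
def parse_version(text, width=4):
    """Turn a dotted numeric string such as "10.4.8" into a padded tuple."""
    parts = [int(part) for part in text.strip().split(".")]
    # Pad with zeros so "10.4" and "10.4.0" compare as equal.
    return tuple(parts + [0] * (width - len(parts)))


def agent_is_current(agent_version, appliance_version):
    """Return True if the agent is at or above the appliance version."""
    return parse_version(agent_version) >= parse_version(appliance_version)
```

Comparing tuples of integers avoids the trap of plain string comparison, where "10.4.8" would sort after "10.4.10".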
If you are encountering backup failures that are not expressly described or easily resolved above, and you do not find supporting information in other KBs for your issue, and are running an older agent release, a good step is to try to update and perform a new full backup on the latest agent. If this does not resolve your issue, contact support to assist in further troubleshooting. In some cases, updating an agent may itself lead to failure where backups were successful on a prior agent.
DO NOT DOWNGRADE AGENTS until after you have consulted with Unitrends Support, as critical log information and any custom agent settings in use may be lost during the uninstallation process. If you suspect an updated agent is directly leading to backup failures that were not encountered on an older agent, inform support immediately. In many cases the root cause may already be known or resolved. Downgrading agents may increase complications with replication, DRaaS processes, standby replicas, CDA testing, and more, and is not recommended.
NOTES
There are many other error conditions that can occur. By reviewing the Backup Summary messages and researching apparent errors in our KB system you will typically find resolutions to those errors. When in doubt, contact Unitrends Support as quickly as possible, including any relevant Backup Summary or client log information.