How does Deduplication use D2D storage?

SUMMARY

Describes how Unitrends native de-duplication works and what it means for storage use.

ISSUE

This KB describes the worst-case situation: the desired retention is very low, the amount of protected customer capacity is very high, the data de-duplicates poorly (i.e., the data is unique), and a master/differential or master/incremental backup strategy is used rather than an incremental forever strategy. In this extreme case, Unitrends’ Adaptive Deduplication™ algorithm will optimize for higher compression and for fast backup and recovery (including both VMware HOS-based instant recovery and Windows physical server and GOS-based instant recovery) using our landing zone architecture.

Description

Unitrends’ adaptive deduplication is the process by which duplicate data blocks are removed from backups. With deduplication, backup sizes decrease as duplicate blocks are removed, thereby increasing the number of backups that can be stored on the system, also referred to as on‐system retention.

Cause

Unitrends Native Adaptive De-duplication is enabled. This causes the creation and maintenance of the SIS (Single Instance Storage) folder inside each D2D storage device on the backup system and begins active de-duplication of available backups. Initially, storage utilization will increase as the de-duplication process begins and builds the single instance storage. Once de-duplication is enabled, storage use may be higher than normal until sufficient backups exist in the D2D device.

After additional new backups exist for client systems or applications, older backups in the D2D are de-duplicated based on the contents of the SIS container, reducing the size of older backups. Backups are de-duplicated at the block level across the entire contents of the D2D device, but a minimum number of backups must exist for this process to operate normally.
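
The block-level behavior described above can be illustrated with a minimal sketch. It is not the Unitrends implementation: the article does not state the block size, hash function, or on-disk layout of the SIS folder, so the 4 KiB blocks, SHA-256 digests, and the `ingest` and `reconstitute` helpers below are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split a backup image into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def ingest(backup: bytes, sis: dict[str, bytes]) -> list[str]:
    """Hash every block, store unseen blocks, and return the block recipe."""
    recipe = []
    for block in split_blocks(backup):
        digest = hashlib.sha256(block).hexdigest()  # assumed hash function
        sis.setdefault(digest, block)  # only new, unique blocks consume space
        recipe.append(digest)
    return recipe


def reconstitute(recipe: list[str], sis: dict[str, bytes]) -> bytes:
    """Rebuild a de-duplicated backup from its recipe and the block store."""
    return b"".join(sis[digest] for digest in recipe)


# Hypothetical data: two backups that share one block.
sis_store: dict[str, bytes] = {}
backup_a = b"A" * 8192 + b"B" * 4096  # two identical "A" blocks, one "B" block
backup_b = b"A" * 4096 + b"C" * 4096  # shares the "A" block with backup_a

recipe_a = ingest(backup_a, sis_store)
recipe_b = ingest(backup_b, sis_store)
```

In this sketch the five blocks across both backups reduce to three stored blocks, and either backup can be rebuilt from its recipe. The rebuild step is the reconstitution work that makes recovery of de-duplicated backups slower than recovery of intact ones.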

Resolution

A few things need to be understood about the SIS and de-duplication process.

1) New backups that enter the D2D as successful backups are scanned by the dedupe process. The backups are hashed at the block level and compared with the existing contents of the SIS folder. All new blocks are added to the SIS. Thus, when a backup of a new client is run for the first time, the minimum storage used by that client’s backups is TWO TIMES its compressed size, assuming all of its blocks are new and unique.

2) New backups (marked “last” in the backups report) are never de-duplicated. This improves the performance of archive and vaulting operations and allows rapid recovery of full backups. Because the most recent full backups are the most likely to be recovered, and because reconstituting de-duplicated backups during recovery is performance intensive, the current backups are never de-duplicated.

3) The second time a master backup is run for the same client, its new backup must land in the D2D device. Until that moment, the existing backup for that client was the “last master”, so it has not yet been de-duplicated, and all of its blocks have also been copied to the SIS folder. Your storage use before de-duplication begins on a client is therefore THREE TIMES its compressed size, because at this instant you have THREE copies of every block in your compressed backups: one each in the old and new backup, and a hashed block in the SIS folder. This is the minimum required storage space for de-duplication to function normally. If you do not meet this requirement, de-duplication will self-disable and the SIS folder will slowly dissolve away as new backups are run and old backups are purged to increase retention in storage-constrained environments.

4) After a second successful backup for a client is run, the older backup set is no longer protected as “last” and is de-duplicated as a background task. Technically, dedupe is always running: every time any new data enters the SIS folder, all existing backups are checked for matching SIS blocks. A new backup can therefore result in de-duplication not only of older backups for the same client, but also of backups for other clients, further improving the de-duplication ratio.

5) Once the threshold for de-duplication (3 times the compressed size of base capacity) is met, de-duplication performance improves. Each new backup takes the space of a previously de-duplicated backup, so the net additional weekly storage growth becomes the total size of the unique new blocks.

6) To improve replication performance in version 7.0, the SIS database is leveraged to determine unique blocks, but they are compared against the target’s own independent SIS storage, which can contain blocks from multiple sources and far more clients. Only blocks not already present on the target are sent across the wire during replication, and backups are individually reconstituted based on the SIS structure of the target itself.
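
The storage arithmetic in steps 1, 3 and 5 above can be sketched as a small calculation. This is an illustrative model of the worst case only, assuming every block is unique and the second master is the same size as the first. The function names and the sample sizes are hypothetical and do not come from the product.

```python
def footprint_first_backup(compressed_size: float) -> float:
    """Step 1: the new backup plus a copy of every block in the SIS folder."""
    return 2 * compressed_size


def footprint_second_master(compressed_size: float) -> float:
    """Step 3: old master + new master + SIS blocks, before dedupe catches up."""
    return 3 * compressed_size


def meets_dedupe_minimum(d2d_capacity: float, compressed_size: float) -> bool:
    """True if the device can hold the three-copy peak described in step 3."""
    return d2d_capacity >= footprint_second_master(compressed_size)


# Hypothetical example: 2 TB of compressed, fully unique client data.
compressed_tb = 2.0
first_peak_tb = footprint_first_backup(compressed_tb)    # 4.0 TB
second_peak_tb = footprint_second_master(compressed_tb)  # 6.0 TB
fits_small_device = meets_dedupe_minimum(5.0, compressed_tb)  # False
fits_large_device = meets_dedupe_minimum(8.0, compressed_tb)  # True
```

Under these assumptions, a 5 TB D2D device falls below the three-copy minimum for 2 TB of compressed data, which is the condition under which step 3 says de-duplication will self-disable.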

Third-Party Sources