Information Technology Navigator

Tips, Advice & Insights from Technology Pros

Dedupe Architecture Considerations

Posted by Vin Choinski

Tue, Jul 28, 2009

As with any new target device you might be incorporating into your backup environment, you are usually doing it for one of the following reasons: to increase capacity and/or throughput, or to improve manageability. The basic connectivity considerations that you would apply to any backup device such as tape, disk or optical still hold true for deduplication targets.  Because deduplication solutions offer some unique benefits, it is very important that you don't overlook critical architectural components in the quest to best leverage this technology.

For those who still question the maturity of the technology, I would remind you that its underpinnings are loosely based on journal file system concepts, which I was first exposed to in the open systems world in the early 1990s through the Digital UNIX AdvFS file system. (Remember those guys?)  This type of file system abstracts the address level of the file system from the data level, creating pointers from the address level to common data sets at the data level.  First used for pointer-based snapshots, this concept has since been leveraged for deduplication, storing only unique data blocks at the data level and using the address level as a reference.
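To make the address level / data level split concrete, here is a minimal sketch in Python. It is a toy, not how any particular product works: the `DedupeStore` class, the fixed 4 KB block size and the use of SHA-256 fingerprints are all assumptions chosen for illustration (real appliances often use variable-size chunks and their own fingerprinting).

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


class DedupeStore:
    """Toy store: an address level of pointers over a data level of unique blocks."""

    def __init__(self):
        self.data_level = {}     # fingerprint -> block bytes, each unique block kept once
        self.address_level = {}  # backup name -> ordered list of fingerprints (pointers)

    def write(self, name, payload):
        pointers = []
        for offset in range(0, len(payload), BLOCK_SIZE):
            block = payload[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            # Store the block only if this fingerprint has not been seen before.
            self.data_level.setdefault(fingerprint, block)
            pointers.append(fingerprint)
        self.address_level[name] = pointers

    def read(self, name):
        # Rebuild the backup by following its pointers into the data level.
        return b"".join(self.data_level[fp] for fp in self.address_level[name])

    def stored_bytes(self):
        return sum(len(block) for block in self.data_level.values())


store = DedupeStore()
monday = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
tuesday = b"A" * BLOCK_SIZE + b"C" * BLOCK_SIZE  # only the second block changed
store.write("monday.bak", monday)
store.write("tuesday.bak", tuesday)
```

Two backups of two blocks each are written, but the data level ends up holding only three blocks, because the unchanged block is stored once and referenced twice.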

There are two predominant deployment configurations in common use across the industry today: source based and target based.  Source-based deduplication occurs on the client side and deduplicates data on the host before it sends the information across the TCP/IP layer. This can make it a good candidate for servers at a remote office that might not have an optimal WAN connection to the final data storage location. Most source-based deduplication products started off as standalone backup applications with proprietary agent code, which makes them difficult, if not impossible, to integrate with heterogeneous backup environments.  Where backup application integration exists, it is usually the result of one of the big storage manufacturers acquiring the code, and it works only with that manufacturer's applications.

Target-based deduplication, on the other hand, is typically appliance based and designed to plug into the backend of the existing backup infrastructure, much like a classic backup device such as an automated tape library.  Target-based appliances are designed to fit into the existing backup solution paradigm and can usually be presented as either a VTL (virtual tape library) or a disk backup target.  Unlike source-based solutions, target deduplication occurs at the device level after normal backup data is sent to the backup server; this provides no reduction in data at the TCP/IP level, making it a better choice for data center deduplication where LAN bandwidth is not an issue.
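The practical difference between the two approaches is where the duplicate data gets dropped, and therefore how much crosses the wire. The sketch below compares the two under simplifying assumptions: the function names are hypothetical, blocks are a fixed 4 KB, and the source-based client is assumed to send a 32-byte SHA-256 fingerprint for every block plus the block itself only when the far side has not seen it. Real products negotiate this differently.

```python
import hashlib


def fingerprint(block):
    return hashlib.sha256(block).digest()  # 32 bytes per block


def wire_bytes_target(blocks):
    """Target-based: every block crosses the network; the appliance dedupes afterwards."""
    return sum(len(block) for block in blocks)


def wire_bytes_source(blocks, known):
    """Source-based: send every fingerprint, but only the blocks the far side lacks."""
    sent = 0
    for block in blocks:
        fp = fingerprint(block)
        sent += len(fp)           # the fingerprint is always sent
        if fp not in known:
            sent += len(block)    # the block is sent only when it is new
            known.add(fp)
    return sent


blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
already_stored = {fingerprint(b"B" * 4096)}  # the far side already holds block B

target_total = wire_bytes_target(blocks)
source_total = wire_bytes_source(blocks, set(already_stored))
```

In this toy run the target-based path moves all 16,384 bytes, while the source-based path moves four fingerprints and one new block, 4,224 bytes in total. That gap is what matters on a thin WAN link and matters far less on a data center LAN.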

Both source- and target-based solutions can reduce or eliminate physical tape from a backup environment by leveraging the same deduplication mechanism at the replication layer. If you can replicate data offsite to a second appliance, then you can eliminate the need to create a physical tape for offsite storage. If only the deduplication delta has to be replicated to make a complete offsite copy, you will contain your site-to-site network connectivity costs.
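Replicating only the deduplication delta amounts to shipping the blocks the offsite appliance does not already hold. A minimal sketch, assuming each appliance can be modeled as a dictionary of fingerprint to block; the `replication_delta` and `replicate` helpers and the `fp-*` keys are hypothetical names for illustration, not any vendor's interface:

```python
def replication_delta(primary, offsite):
    """Return the blocks present on the primary appliance but missing offsite."""
    return {fp: block for fp, block in primary.items() if fp not in offsite}


def replicate(primary, offsite):
    """Copy the missing blocks offsite and return how many bytes were shipped."""
    delta = replication_delta(primary, offsite)
    offsite.update(delta)
    return sum(len(block) for block in delta.values())


primary = {"fp-a": b"A" * 4096, "fp-b": b"B" * 4096, "fp-c": b"C" * 4096}
offsite = {"fp-a": b"A" * 4096, "fp-b": b"B" * 4096}  # yesterday's replica

first_run = replicate(primary, offsite)   # ships only the one new block
second_run = replicate(primary, offsite)  # nothing left to ship
```

After the first run the offsite copy is complete even though only one block's worth of data crossed the site-to-site link.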

Remember, the amount of data you manage in a deduplicated environment is greatly affected by the retention period of the backup data. Long-term retention can add significantly to the cost of the overall solution.  Keeping daily, weekly and even monthly data on deduplicated storage is common; for longer retention periods, hybrid tape/dedupe solutions may be the best answer. Always consider any additional licensing cost for integration into an existing backup infrastructure and support for your client base (e.g. unique operating systems or database agent support).
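A rough back-of-the-envelope model shows why retention drives capacity. The formula below is an assumption made for illustration, not a vendor sizing rule: the first backup is stored in full, and every later retained backup adds only its changed data, expressed as a fixed daily change percentage. The 10,000 GB full backup and 2% change rate are made-up inputs; real change rates and dedupe ratios vary widely by data type.

```python
def stored_gb(full_gb, daily_change_pct, retained_backups):
    """Estimate deduplicated capacity: one full copy plus the changed data of each later backup."""
    if retained_backups < 1:
        return 0
    changed_per_backup = full_gb * daily_change_pct // 100
    return full_gb + (retained_backups - 1) * changed_per_backup


# Hypothetical environment: 10,000 GB full backup, 2% of it changing per day.
thirty_days = stored_gb(10_000, 2, 30)
one_year = stored_gb(10_000, 2, 365)
```

Under these assumptions, 30 days of daily backups needs about 15,800 GB, while a full year needs about 82,800 GB. Capacity grows with every extra day retained, which is where moving the oldest data to tape can start to make sense.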