A Closer Look at Data Deduplication

Data deduplication is a method for reducing storage needs by eliminating redundant data so that only one unique instance of the data is actually retained on the storage media. Reducing the amount of data that needs to be transmitted across the network, especially to the cloud, can increase backup speeds and save storage costs.

Although new approaches to achieving data deduplication are continually being developed, here are the basic methods.

With file-level deduplication, an incoming file is compared against the files already in storage. If the file is unique, it is stored and the index is updated; if it is not, only a pointer to the existing file is stored. The result is that only one instance of the file is saved, and subsequent copies are replaced with pointers. The drawback of this approach is that even the slightest change makes a file unique, so the whole file must be saved again.
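The idea can be sketched in a few lines of Python. This is an illustrative model, not any vendor's implementation; the class and method names are invented, and a SHA-256 digest of the file's contents stands in for the "is this file unique?" comparison.

```python
import hashlib

class FileLevelDedupStore:
    """Minimal sketch of file-level (single-instance) storage.

    `store` keeps one copy per unique file; `index` maps each
    logical filename to the digest of its stored content, acting
    as the pointer described above.
    """

    def __init__(self):
        self.store = {}   # digest -> file bytes (one instance each)
        self.index = {}   # filename -> digest (pointer to stored copy)

    def put(self, name: str, data: bytes) -> bool:
        """Store a file; return True if its content was new."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.store
        if is_new:
            self.store[digest] = data   # unique: keep the one instance
        self.index[name] = digest       # duplicate: pointer only
        return is_new

    def get(self, name: str) -> bytes:
        """Follow the pointer back to the stored instance."""
        return self.store[self.index[name]]
```

Note how the drawback shows up directly: changing a single byte produces a new digest, so the entire file is stored again.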

Block-level deduplication segments data streams into blocks and inspects each block to determine whether it has been encountered before. If the block is unique, it is stored and its unique identifier is added to the index; repeated blocks are replaced with pointers rather than stored again, saving disk space. A drawback of this approach is that as the index of blocks and pointers grows, lookups slow down significantly.
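A rough sketch of the fixed-size-block variant, again with invented names and a toy block size (real systems typically use blocks in the kilobyte range). The returned "recipe" of block identifiers is what lets the original stream be rebuilt from the shared block store:

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; real blocks are far larger

def dedup_blocks(stream: bytes, store: dict) -> list:
    """Split `stream` into fixed-size blocks, storing only unseen
    blocks. Returns the list of block digests needed to rebuild it."""
    recipe = []
    for i in range(0, len(stream), BLOCK_SIZE):
        block = stream[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:    # unique block: store one copy
            store[digest] = block
        recipe.append(digest)      # repeated block: pointer only
    return recipe

def rebuild(recipe: list, store: dict) -> bytes:
    """Reassemble the original stream by following the pointers."""
    return b"".join(store[d] for d in recipe)
```

Because `store` is shared across all streams, repeated blocks anywhere in the backup set are kept only once; the cost is that every block requires a lookup in an ever-growing index.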

Another approach relies on byte-level deduplication, which compares data streams byte by byte against previously stored data. This granular approach yields a higher level of accuracy. The drawback is that this processing is usually performed after the backup has completed, and it requires a reserve of disk cache.
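A naive illustration of the byte-by-byte comparison, assuming the simplest possible delta scheme (record only the runs of bytes that differ from the previously stored stream, plus any appended tail). The function names are hypothetical and real products use far more sophisticated delta encoding:

```python
def byte_delta(prev: bytes, new: bytes) -> list:
    """Compare `new` to `prev` byte by byte; return a list of
    (offset, changed_bytes) runs where they differ."""
    deltas = []
    i, n = 0, min(len(prev), len(new))
    while i < n:
        if prev[i] != new[i]:
            start = i
            while i < n and prev[i] != new[i]:
                i += 1
            deltas.append((start, new[start:i]))  # differing run
        else:
            i += 1
    if len(new) > len(prev):
        deltas.append((n, new[n:]))               # appended tail
    return deltas

def apply_delta(prev: bytes, deltas: list) -> bytes:
    """Reconstruct the new stream from the prior one plus the deltas."""
    out = bytearray(prev)
    for offset, data in deltas:
        out[offset:offset + len(data)] = data
    return bytes(out)
```

Even this toy version shows why the approach is accurate but expensive: producing the deltas requires the full prior stream to be available for comparison, which is why it is typically done post-backup against cached data.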

A more recent approach uses zone-level deduplication, which breaks data into larger “zones” and then compares them at the byte level. This approach offers a compelling advantage: because zones are processed independently, all resources can be brought to bear as data grows: processor, memory, and bandwidth as well as disk. If the data doubles, triples, or quadruples, those resources can be scaled in proportion, so the backup window stays at a fixed length even as data grows.
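A highly simplified sketch of the zone-level idea, under assumptions of my own (the names and the tiny zone size are invented): zones are matched by a cheap fingerprint, and a candidate match is then verified byte for byte before being treated as a duplicate. Because each zone is handled independently, the loop below could be spread across multiple processors or appliances, which is the basis of the scaling advantage described above:

```python
import hashlib

ZONE_SIZE = 8  # tiny for illustration; real zones are typically megabytes

def zone_dedup(stream: bytes, zones: dict) -> list:
    """Split `stream` into large zones; match by fingerprint, then
    confirm byte-for-byte before replacing a zone with a pointer."""
    recipe = []
    for i in range(0, len(stream), ZONE_SIZE):
        zone = stream[i:i + ZONE_SIZE]
        fp = hashlib.sha256(zone).hexdigest()
        if fp in zones and zones[fp] == zone:  # byte-level verification
            recipe.append(fp)                  # duplicate: pointer only
        else:
            zones[fp] = zone                   # new zone: store it
            recipe.append(fp)
    return recipe
```
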

The choice of data deduplication method hinges on several factors, including the type of data (structured or unstructured), the backup environment, and your organization’s specific requirements, some of which may be mandated by laws and regulations.

DataLink offers backup and restore solutions to fit any business need and can recommend the best data deduplication approach for you.  Contact us today: 410.729.0440 or sales@DataLinkTech.com