Q&A: Diligent CTO demystifies data deduplication
By Deni Connor, Network World, 05/04/2007
Diligent Technologies is among the pioneers of data deduplication technology, which helps enterprises reduce redundant copies of data and, in turn,
shrink storage requirements and shorten backup times. Neville Yates, Diligent’s CTO, talked with Network World Senior Editor Deni Connor about the varying deduplication technologies used with today’s virtual tape libraries (VTL).
So what is deduplication?
Deduplication is a means by which data is examined and compared to existing data. If it is the same, it is filtered out and
the existing data is referenced. Deduplication is very prominent in applications such as backup that cause a lot of duplication
as a byproduct of how they work. These applications are prime targets for deduplication technology.
What forms of deduplication are there?
There are three approaches to deduplication talked about in the market today. One of them is the offering from Diligent
called HyperFactor, which looks at data in an agnostic form and searches the datastream for similarity. Once similarity
is found, a computational difference is performed, guaranteeing that what is filtered out is exactly the same as what is
referenced. Only new data is stored.
Another one uses hash technology or hash algorithms, whereby data is sliced into digestible pieces -- perhaps 8Kbytes
in size -- and a hash is computed for each piece and stored along with the data. If the same signature or hash turns up
when it is computed on a new datastream, that match suggests the data already exists and can be referenced. It doesn't
need to be stored again, thereby reducing the amount of storage consumed.
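The hash-based approach described above can be sketched roughly as follows. This is a minimal illustration, not any vendor's implementation: the fixed 8 KB chunk size comes from the example in the interview, while SHA-256, the in-memory `store` dictionary, and the function names `dedup_store` and `restore` are assumptions made for the sketch.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # 8 KB, the example size mentioned in the interview


def dedup_store(stream: bytes, store: dict) -> list:
    """Slice the stream into fixed-size chunks and keep each unique chunk once.

    Returns the ordered list of chunk hashes needed to rebuild the stream.
    """
    recipe = []
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()  # hash choice is an assumption
        if digest not in store:
            store[digest] = chunk  # new data: store it
        recipe.append(digest)      # known data: only reference it
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild a stream from its list of chunk hashes."""
    return b"".join(store[digest] for digest in recipe)


# Two hypothetical backups that share their first chunk.
store = {}
monday = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
tuesday = b"A" * CHUNK_SIZE + b"C" * CHUNK_SIZE
monday_recipe = dedup_store(monday, store)
tuesday_recipe = dedup_store(tuesday, store)
```

Note that this sketch trusts a matching hash to mean matching data, which is the property the interview contrasts with a byte-level computational difference.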
The third looks inside the datastream at its logical content. It assumes that a file of a particular name
is most likely a good candidate for comparison with a file of exactly the same fully qualified name,
meaning directory, directory tree, etc., and then a computational difference is done between the two files.
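As a rough sketch of this content-aware idea, the code below keeps the last version of each file keyed by its full path and stores only a delta against it. The `catalog` dictionary, the delta format, and the function names are hypothetical, and Python's `difflib.SequenceMatcher` stands in for whatever differencing a real product would use.

```python
from difflib import SequenceMatcher

catalog = {}  # hypothetical index: fully qualified path -> last backed-up contents


def make_delta(old: bytes, new: bytes) -> list:
    """Describe `new` as copies from `old` plus literal new bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reference existing data
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))  # store only the new bytes
        # "delete" needs no entry: those old bytes are simply not copied
    return ops


def apply_delta(old: bytes, ops: list) -> bytes:
    """Rebuild the new version from the old version and a delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def backup_file(path: str, contents: bytes) -> list:
    """Delta the file against the previous file with the same full path."""
    previous = catalog.get(path)
    catalog[path] = contents
    if previous is None:
        return [("data", contents)]  # first time seen: store everything
    return make_delta(previous, contents)


# Hypothetical file backed up twice with a one-character change.
v1 = b"host=db01\nport=5432\nuser=backup\n"
v2 = b"host=db01\nport=5433\nuser=backup\n"
backup_file("/etc/app/db.conf", v1)
delta = backup_file("/etc/app/db.conf", v2)
new_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
```

Here `new_bytes` counts how much literal data the second backup had to store; the rest is expressed as references into the earlier version.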
So there are three fundamental approaches and many different ways of implementing those approaches.
What are the different ways deduplication has been implemented?
One of the implementation differences in those approaches is whether you receive all of the data and lay it down on disk and
then sometime in the future read it back in to deduplicate it, or whether during the receipt of the data you
process it inline and in real time to achieve the deduplication.
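The difference between the two modes can be illustrated with a small sketch. It assumes simple hash-based chunking for both targets purely for illustration; the class names, the `peak_bytes` counter, and the in-memory "disk" are hypothetical and not taken from any product.

```python
import hashlib

CHUNK = 8 * 1024  # illustrative chunk size


def chunks(data: bytes):
    for i in range(0, len(data), CHUNK):
        yield data[i:i + CHUNK]


class InlineTarget:
    """Deduplicates while receiving; only unique chunks are ever stored."""

    def __init__(self):
        self.store = {}
        self.peak_bytes = 0

    def receive(self, data: bytes):
        for c in chunks(data):
            self.store.setdefault(hashlib.sha256(c).digest(), c)
        used = sum(len(c) for c in self.store.values())
        self.peak_bytes = max(self.peak_bytes, used)


class PostProcessTarget:
    """Lands raw data first, then deduplicates in a later background pass."""

    def __init__(self):
        self.landing = []  # raw backups exactly as received
        self.store = {}
        self.peak_bytes = 0

    def _track(self):
        used = sum(len(d) for d in self.landing)
        used += sum(len(c) for c in self.store.values())
        self.peak_bytes = max(self.peak_bytes, used)

    def receive(self, data: bytes):
        self.landing.append(data)  # no heavy lifting during the backup
        self._track()

    def post_process(self):
        while self.landing:  # the deferred heavy lifting
            for c in chunks(self.landing.pop(0)):
                self.store.setdefault(hashlib.sha256(c).digest(), c)
            self._track()


backup = b"x" * (CHUNK * 4)  # hypothetical, highly redundant backup
inline, post = InlineTarget(), PostProcessTarget()
for _ in range(2):
    inline.receive(backup)
    post.receive(backup)
post.post_process()
```

Both targets end up holding the same unique chunks, but the post-processing target must have room for the raw data until its background pass has run, which is one way to picture the capacity-planning concern raised in the interview.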
Those are called inline and post-processing?
That is correct.
You say that Diligent uses the HyperFactor approach. Who are some of the vendors that use hash algorithms?
Hashing or some derivative thereof is used by Quantum/ADIC, Data Domain and FalconStor. HyperFactor is our own IP. Content-aware is something that is being pursued by Sepaton.
What are the advantages and disadvantages of inline deduplication and post-processing?
Inline deduplication, first of all, is difficult to achieve in terms of performance. But if you do achieve it, it is advantageous
because once you have finished the job, the job is done -- there is no heavy lifting left and you don't have to worry about capacity
planning for any background tasks and what resources might be available to support them. In contrast, with post-processing,
none of the heavy lifting is done while the data is being received, so end users need to concern
themselves with the amount of effort needed to do the post-processing.
Comments (1)
Wow
By Anonymous on March 16, 2008, 11:36 pm
I guess people just totally forgot Avamar (now EMC). I know for certain that Avamar's "commonality factoring" is hash based, inline deduplication.