Dedupe to tape is “crazy bad” if the architecture is crazy
Posted by Preston on 2009-10-26
Over at Backup Central, Curtis Preston says he’s convinced that dedupe to tape according to the CommVault model is a good idea, in a “crazy good” way rather than a “crazy bad” way. To summarise Curtis’ argument (and thereby establish my understanding of it), the process is:
- Day to day recovery of deduped tape backup would be crazy (I agree with this)
- Design the system so that you still facilitate most recoveries from dedupe on disk (I have no issue with this)
- Periodically effectively stage out the dedupe data to tape (first objection)
- Long-term recoveries are done from tape written in dedupe format (holy cow that’s insane!)
So, let’s look at why I think this is “crazy bad” by examining each point.
Point one – day to day recovery of deduped tape backup would be crazy
Fully agreed. I’d liken recovery from deduped data on tape to recovery of highly fragmented files from a block level backup. Block level backup products (e.g., EMC’s SnapImage) allow you to bypass the inefficiencies of the filesystem on dense structures to do a block by block backup. This can deliver fantastic time savings. For. Backup.
For recovery, file level reconstruction from block level backups can suck in a terribly horrendous way. File level reconstruction from block level backups requires recovery of the required blocks into a cache, and then the files are put back together. If your files are heavily fragmented (which is often the case on dense filesystems), the number of reads from tape required – and the amount of seeking required – is very high. Real world example: 400 GB dense filesystem (about 40,000,000 files) had full backups reduced from 15 hours to 4 hours using block level backup. Recovery of the entire filesystem took less than 4 hours – recovery of a 40 GB directory took 12 hours. Having a very large cache is one way to get around this, but that starts to get costly (and in my experience is frequently poached).
Recovery from deduped data on tape will very likely suck just as badly.
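To put rough numbers on why scattered reads hurt, here is a back-of-envelope model in Python. It treats restore time as streaming time plus one tape reposition per fragment. Every figure in it (the 80 MB/s throughput, the 30 second reposition, the fragment counts) is an assumption picked for illustration, not a measurement from the example above or from any particular drive.

```python
GB = 10**9

def tape_restore_seconds(data_bytes, fragments,
                         throughput_bps=80e6,   # assumed streaming rate, bytes/sec
                         seek_seconds=30.0):    # assumed average reposition time
    """Estimate restore time: time to stream the data plus one reposition per fragment."""
    streaming = data_bytes / throughput_bps
    seeking = fragments * seek_seconds
    return streaming + seeking

# Same 40 GB of data, read as one contiguous run versus 1,000 scattered pieces.
contiguous = tape_restore_seconds(40 * GB, fragments=1)
scattered = tape_restore_seconds(40 * GB, fragments=1000)

print(f"contiguous: {contiguous / 3600:.2f} hours")
print(f"scattered:  {scattered / 3600:.2f} hours")
```

Under these assumptions the amount of data is identical in both cases; the difference comes entirely from repositioning. Real drives, caches and tape layouts will behave differently, so treat this as a way of reasoning about the shape of the problem rather than a prediction.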
Point two – design the system so that you facilitate most recoveries from dedupe on disk
Again, fully agreed. So far I’m in complete agreement with Curtis and CommVault. This point can be said of any backup design – design your system so that the most frequently performed recoveries are done from the fastest backup medium.
Point three – Periodically effectively stage out all dedupe data to tape
This is the crazy part, and not crazy good, but out and out crazy bad. To quote Curtis on this:
If you’re going to dedupe to tape, you first have to dedupe to disk. You create what they call a silo on disk, which is a full backup and a set of deduped incrementals based on (and deduped against) that full backup. The retention on that silo should be long enough to satisfy most of your operational restore requests. (Typically that’s 30 days, but it could be longer in your environment.)
What’s so crazy-bad about this?
Now, I’ll profess that I don’t know for sure which way this is being done, but it reads that new full backups are generated periodically in the dedupe environment, allowing the previous dependency chains of fulls + incrementals to be transferred out to tape. (Based on my reading of the CommVault marketing documentation, which refers to “reducing” the number of fulls required for retention cycles, this appears to be an accurate assessment.)
So this means that every X days (whatever your period-between-fulls is going to be) you have to do new fulls. Now while this isn’t so much of an issue in regular backups, in dedupe backups it’s a known fact that the initial full backups are hideously slow. This can be worn by most organisations when it’s a once-off. Every month? Even every 3 months or 6 months? Far less likely.
Point four – Long-term recoveries are done from tape written in dedupe format
Obviously some of my objections to this have already been expressed in my comments for point one, but to continue with my objections, let’s look at what Curtis has to say on this point as well:
But I also agree that if I typically do all my restores from within the last 30 days, and someone asks me for a 31 day-old file, it’s generally going to be the type of restore where the fact that it might take several minutes to complete is not going to be a huge deal. (In the case that you did need to do a large restore from a deduped tape set, you could actually bring it back in to disk in its entirety before you initiate the restore.)
Now, I agree that recovery of longer term backups can be done from slower media in most instances.
There’s a difference between “slower media” and “a snail just overtook our data recovery”.
In the first case, I don’t believe that recovery from deduped data on tape will be in the order of “several minutes” … I think this would turn out to be a highly optimistic rather than terribly realistic time-frame. I would need to see a large number of real world instances of short recovery times to really believe this will be in an order of “several minutes”. Yes, I’m going on a gut feeling, but I feel it’s somewhat justified.
In the second case … “you could actually bring it back in to disk in its entirety” … how much storage do you want to be using here? If we’re talking about bringing back the entire “silo”, that’s a lot of storage to bring back – I’d suggest it’s going to be comparable to, but orders of magnitude worse than, say, recovering a 1TB virtual machine fileserver to a separate location in order to pull out a 100KB Excel spreadsheet. Let’s be accurate about this: recovering the entire silo would mean recovering all deduped backups – most notably a full of your entire environment.
If we’re talking about recovering just portions of the data on tape, then again, it’s going to be like the file-level recovery from block-level backup issue previously described, and we’ll be back to square one.
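The partial-recovery problem can be sketched with a toy dedupe store in Python. This is emphatically not CommVault’s on-tape format (which I haven’t seen); it is a minimal illustration, assuming fixed-size chunks and an append-only store where a chunk’s list position stands in for its position on tape. The names ToySilo and repositions are made up for this sketch.

```python
import hashlib

CHUNK = 4  # tiny fixed chunk size, purely for illustration


class ToySilo:
    """Append-only chunk store; a chunk's index stands in for its position on tape."""

    def __init__(self):
        self.chunks = []   # unique chunks, in the order they were first written
        self.index = {}    # chunk digest -> position in self.chunks

    def backup(self, data: bytes):
        """Store data and return its recipe: the chunk positions needed to rebuild it."""
        recipe = []
        for i in range(0, len(data), CHUNK):
            piece = data[i:i + CHUNK]
            digest = hashlib.sha256(piece).hexdigest()
            if digest not in self.index:
                # New chunk: append it to the end of the "tape".
                self.index[digest] = len(self.chunks)
                self.chunks.append(piece)
            recipe.append(self.index[digest])
        return recipe

    def restore(self, recipe):
        """Rebuild data by reading each chunk position in recipe order."""
        return b"".join(self.chunks[p] for p in recipe)


def repositions(recipe):
    """Count reads that do not follow directly on from the previous chunk position."""
    return sum(1 for a, b in zip(recipe, recipe[1:]) if b != a + 1)


silo = ToySilo()
full = silo.backup(b"AAAABBBBCCCCDDDD")    # file as seen by the full backup
other = silo.backup(b"EEEEFFFFGGGG")       # unrelated data written in between
incr = silo.backup(b"AAAAXXXXCCCCDDDD")    # later version, one chunk changed

print("full recipe:", full, "repositions:", repositions(full))
print("incr recipe:", incr, "repositions:", repositions(incr))
```

The later version of the file is stored very cheaply (one new chunk), but restoring it means hopping between chunks written at different times. Scale the chunk count up to a real filesystem and that hopping becomes the tape seeking described above.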
In Summary
I’ve got to be entirely blunt here – CommVault’s approach reminds me of the old (crude) expression (made as “G Rated” as possible):
“You can’t polish a poo, but you can roll it in gold dust”.
If the supporting architecture is crazy, it doesn’t matter that it can do something “nifty” – particularly if that something “nifty” will result in significantly slower recoveries (even in limited circumstances).
Yes, it’s undoubtedly the case that the CommVault approach will reduce the amount of data stored on tape, which will result in some cost savings. However, penny pinching in backup environments has a tendency to result in recovery impacts – often significant recovery impacts. For example, NetBackup gives “media savings” by not enforcing dependencies. Yes, this can result in saving money here and there on media, but can result in being unable to do complete filesystem recoveries approaching the end of a total retention period, which is plain dumb.
The CommVault approach, while saving some money on tape, will significantly expand recovery times (or require large cache areas and still take a lot of recovery time). Saving money is good. Wasting a little time during longer-term recoveries is likely to be perceived as being OK – until there’s a pressing need. Wasting a lot of time during longer-term recoveries is rarely going to be perceived as being OK.
The other saying that springs to mind is: The road to hell is paved with good intentions.
If I’m correct in my understanding of how the CommVault dedupe-to-tape strategy works based on a review of the CommVault marketing material (typically for any vendor, slim information) and Curtis’ summary, I can only say that their approach is not crazy good as Curtis concludes, but crazy bad.
4 Responses to “Dedupe to tape is “crazy bad” if the architecture is crazy”
Matt Simmons said
I’m glad I’m not the only person who feels this way.
Dedupe to tape: Are you crazy? | Standalone Sysadmin said
[...] It should probably be noted that Preston wrote about this too. The difference is, of course, that he knows what he’s talking about… Send this [...]
Daniel King said
Isn’t this along the same lines as Parallelism in Network?
“The faster to backup = slower to restore.”
I like the concept of de-dupe, but until we have a way of fixing the inherent issue of rehydration, we aren’t going to go anywhere fast.
Preston said
It’s similarish to the parallelism problem in NetWorker (when going direct to tape, certainly) – but I think has more parallels (no pun intended) with file level recovery from block level backup.
Regardless, rehydration is something that is too quickly forgotten, yes.
For what it’s worth, I’ve got another article coming up in a day or so about my overall concerns with dedupe.