NetWorker Blog

Commentary from a long-term NetWorker consultant and Backup Theorist

  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).
  • Enterprise Systems Backup and Recovery

    If you find this blog interesting, and either have an interest in or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on features, policies, procedures and the human element to ensuring that your company has a suitable and working backup system rather than just a bunch of copies made by unrelated software, hardware and processes.

Dedupe to tape is “crazy bad” if the architecture is crazy

Posted by Preston on 2009-10-26

Over at Backup Central, Curtis Preston says he’s convinced that dedupe to tape according to the CommVault model is a good idea, in a “crazy good” way rather than a “crazy bad” way. To summarise Curtis’ argument (and thereby establish my understanding of it), the process is:

  1. Day to day recovery of deduped tape backup would be crazy (I agree with this)
  2. Design the system so that you still facilitate most recoveries from dedupe on disk (I have no issue with this)
  3. Periodically effectively stage out the dedupe data to tape (first objection)
  4. Long-term recoveries are done from tape written in dedupe format (holy cow that’s insane!)

So, let’s look at why I think this is “crazy bad” by examining each point.

Point one – day to day recovery of deduped tape backup would be crazy

Fully agreed. I’d liken recovery from deduped data on tape to recovery of highly fragmented files from a block level backup. Block level backup products (e.g., EMC’s SnapImage) allow you to bypass the inefficiencies of the filesystem on dense structures to do a block by block backup. This can deliver fantastic time savings. For. Backup.

For recovery, file level reconstruction from block level backups can suck in a terribly horrendous way. File level reconstruction from block level backups requires recovery of the required blocks into a cache, and then the files are put back together. If your files are heavily fragmented (which is often the case on dense filesystems), the number of reads from tape required – and the amount of seeking required – is very high. Real world example: 400 GB dense filesystem (about 40,000,000 files) had full backups reduced from 15 hours to 4 hours using block level backup. Recovery of the entire filesystem took less than 4 hours – recovery of a 40 GB directory took 12 hours. Having a very large cache is one way to get around this, but that starts to get costly (and in my experience is frequently poached).
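To put a rough shape on why fragmentation hurts so much on tape, here is a minimal back-of-the-envelope sketch. It assumes restore time is simply one tape reposition per contiguous extent plus streaming time for the data. The function name, the seek time and the streaming rate are all hypothetical figures picked for illustration, not measurements from the example above.

```python
def restore_seconds(data_bytes, extents, seek_seconds, bytes_per_second):
    """Rough restore time: one tape reposition per extent, plus streaming time."""
    return extents * seek_seconds + data_bytes / bytes_per_second


GB = 10**9

# Assumed figures, for illustration only.
SEEK = 30.0           # average tape reposition, in seconds
RATE = 100 * 10**6    # streaming rate, 100 MB/s

# The same 40 GB read as one contiguous run versus 1,000 scattered extents.
contiguous = restore_seconds(40 * GB, extents=1, seek_seconds=SEEK, bytes_per_second=RATE)
fragmented = restore_seconds(40 * GB, extents=1000, seek_seconds=SEEK, bytes_per_second=RATE)

print(f"contiguous: {contiguous / 3600:.2f} h")
print(f"fragmented: {fragmented / 3600:.2f} h")
```

Under these assumptions the streaming time is identical in both cases, so the whole difference comes from repositioning. A real drive, cache and block layout will behave differently, but the seek term dominates quickly as the extent count grows.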

Recovery from deduped data on tape will very likely suck just as badly.

Point two – design the system so that you facilitate most recoveries from dedupe on disk

Again, fully agreed. So far I’m in complete agreement with Curtis and CommVault. This point can be said of any backup design – design your system so that the most frequently performed recoveries are done from the fastest backup medium.

Point three – Periodically effectively stage out all dedupe data to tape

This is the crazy part, and not crazy good, but out and out crazy bad. To quote Curtis on this:

If you’re going to dedupe to tape, you first have to dedupe to disk.  You create what they call a silo on disk, which is a full backup and a set of deduped incrementals based on (and deduped against) that full backup. The retention on that silo should be long enough to satisfy most of your operational restore requests.  (Typically that’s 30 days, but it could be longer in your environment.)

What’s so crazy-bad about this?

Now, I’ll profess that I don’t know for sure which way this is being done, but it reads that new full backups are generated periodically in the dedupe environment, allowing the previous dependency chains of fulls + incrementals to be transferred out to tape. (Based on my reading of the CommVault marketing documentation, which refers to “reducing” the number of fulls required for retention cycles, this appears to be an accurate assessment.)

So this means that every X days (whatever your period-between-fulls is going to be) you have to do new fulls. Now while this isn’t so much of an issue in regular backups, in dedupe backups it’s a known fact that the initial full backups are hideously slow. This can be worn by most organisations when it’s a once-off. Every month? Even every 3 months or 6 months? Far less likely.
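To make the dependency chain concrete, here is a toy sketch of a "silo" as I understand the idea: one chunk store shared by a full and the backups deduped against it. This is not CommVault's implementation. The `Silo` class, the fixed 4-byte chunking and the backup names are all assumptions for illustration, and each backup is modelled as a complete image for simplicity.

```python
import hashlib

CHUNK = 4  # tiny fixed chunk size so the example stays readable


def chunks(data: bytes):
    """Split data into fixed-size chunks."""
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]


class Silo:
    """Hypothetical silo: a shared chunk store plus one recipe per backup."""

    def __init__(self):
        self.store = {}    # digest -> chunk bytes, each chunk written once
        self.backups = {}  # backup name -> ordered list of digests

    def backup(self, name, data):
        """Store a backup and return how many new chunks had to be written."""
        new = 0
        recipe = []
        for chunk in chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.store:
                self.store[digest] = chunk
                new += 1
            recipe.append(digest)
        self.backups[name] = recipe
        return new

    def restore(self, name):
        """Rehydrate a backup by looking up every chunk in its recipe."""
        return b"".join(self.store[d] for d in self.backups[name])


silo = Silo()
full_new = silo.backup("full", b"AAAABBBBCCCCDDDD")   # first backup: every chunk is new
incr_new = silo.backup("incr1", b"AAAABBBBXXXXDDDD")  # later backup: one chunk changed

# Chunks the later backup needs that were only ever written by the full.
shared = set(silo.backups["incr1"]) & set(silo.backups["full"])
print(full_new, incr_new, len(shared))
```

The later backup writes only one new chunk, but it cannot be restored without three chunks written by the full. That is why the full and everything deduped against it have to go to tape as a unit, and why a fresh full is needed before the old chain can be moved out.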

Point four – Long-term recoveries are done from tape written in dedupe format

Obviously some of my objections to this have already been expressed in my comments for point one, but to continue with my objections, let’s look at what Curtis has to say on this point as well:

But I also agree that if I typically do all my restores from within the last 30 days, and someone asks me for a 31 day-old file, it’s generally going to be the type of restore where the fact that it might take several minutes to complete is not going to be a huge deal.  (In the case that you did need to do a large restore from a deduped tape set, you could actually bring it back in to disk in its entirety before you initiate the restore.)

Now, I agree that recovery of longer term backups can be done from slower media in most instances.

There’s a difference between “slower media” and “a snail just overtook our data recovery”.

In the first case, I don’t believe that recovery from deduped data on tape will be in the order of “several minutes” … I think this would turn out to be a highly optimistic rather than terribly realistic time-frame. I would need to see a large number of real world instances of short recovery times to really believe this will be in an order of “several minutes”. Yes, I’m going on a gut feeling, but I feel it’s somewhat justified.

In the second case … “you could actually bring it back in to disk in its entirety” … how much storage do you want to be using here? If we’re talking bringing back the entire “silo”, that’s a lot of storage to bring back – I’d suggest it’s going to be comparable to, but orders of magnitude worse than, say, recovering a 1TB virtual machine fileserver to a separate location in order to pull out a 100KB Excel spreadsheet. Let’s be accurate about this: recovering the entire silo would mean recovering all deduped backups – most notably a full of your entire environment.
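As a quick sanity check on the scale involved, the virtual machine comparison can be worked through directly. This sketch uses the 1 TB and 100 KB figures from the comparison (in decimal units). The 100 MB/s staging rate is an assumption, and a whole silo would be larger again.

```python
TB = 10**12
KB = 10**3

staged_bytes = 1 * TB     # what has to be brought back to disk
wanted_bytes = 100 * KB   # what the user actually asked for

# How many bytes are staged for every byte that is actually wanted.
amplification = staged_bytes // wanted_bytes

# Assumed sustained rate from tape to disk, for illustration only.
RATE = 100 * 10**6        # 100 MB/s
staging_hours = staged_bytes / RATE / 3600

print(f"amplification: {amplification:,}x")
print(f"staging time at the assumed rate: {staging_hours:.1f} h")
```

Even with streaming at the assumed rate and no seeking at all, the staging step alone runs to hours rather than minutes, and needs the full terabyte of scratch disk before the restore proper can start.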

If we’re talking about recovering just portions of the data on tape, then again, it’s going to be like the file-level recovery from block-level backup issue previously described, and we’ll be back to square one.

In Summary

I’ve got to be entirely blunt here – CommVault’s approach reminds me of the old (crude) expression (made as “G Rated” as possible):

“You can’t polish a poo, but you can roll it in gold dust”.

If the supporting architecture is crazy, it doesn’t matter that it can do something “nifty” – particularly if that something “nifty” will result in significantly slower recoveries (even in limited circumstances).

Yes, it’s undoubtedly the case that the CommVault approach will reduce the amount of data stored on tape, which will result in some cost savings. However, penny pinching in backup environments has a tendency to result in recovery impacts – often significant recovery impacts. For example, NetBackup gives “media savings” by not enforcing dependencies. Yes, this can save money here and there on media, but it can also leave you unable to do complete filesystem recoveries approaching the end of a total retention period, which is plain dumb.

The CommVault approach, while saving some money on tape, will significantly expand recovery times (or require large cache areas and still take a lot of recovery time). Saving money is good. Wasting a little time during longer-term recoveries is likely to be perceived as being OK – until there’s a pressing need. Wasting a lot of time during longer-term recoveries is rarely going to be perceived as being OK.

The other saying that springs to mind is: The road to hell is paved with good intentions.

If I’m correct in my understanding of how the CommVault dedupe-to-tape strategy works, based on a review of the CommVault marketing material (which, as is typical for any vendor, is slim on information) and Curtis’ summary, I can only say that their approach is not crazy good as Curtis concludes, but crazy bad.


4 Responses to “Dedupe to tape is “crazy bad” if the architecture is crazy”

  1. I’m glad I’m not the only person who feels this way.

  2. [...] It should probably be noted that Preston wrote about this too. The difference is, of course, that he knows what he’s talking about… Send this [...]

  3. Daniel King said

    Isn’t this along the same lines as Parallelism in Network?

    “The faster to backup= slower to restore.”

    I like the concept of de-dupe, but until we have a way of fixing the inherent issue of rehydration, we aren’t going to go anywhere fast.

    • Preston said

      It’s similarish to the parallelism problem in NetWorker (when going direct to tape, certainly) – but I think has more parallels (no pun intended) with file level recovery from block level backup.

      Regardless, rehydration is something that is too quickly forgotten, yes.

      For what it’s worth, I’ve got another article coming up in a day or so about my overall concerns with dedupe.
