NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist

  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).
  • Enterprise Systems Backup and Recovery

    If you find this blog interesting, and either have an interest in or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on features, policies, procedures and the human element to ensuring that your company has a suitable and working backup system rather than just a bunch of copies made by unrelated software, hardware and processes.

How important is it to clone?

Posted by Preston on 2009-01-29

This isn’t a topic that’s restricted just to NetWorker. It really does apply to any backup product that you’re using, regardless of the terminology involved. (E.g., for NetBackup, we’re talking duplication).

When talking to a broad audience I don’t like to make broad generalisations, but in the case of cloning, I will, and it’s this:

If your production systems backups aren’t being cloned, your backup system isn’t working.

Yes, that’s a very broad generalisation, and I tend to hear a lot of reasons why backups can’t be cloned/duplicated – time factors, cost factors, even assertions that it isn’t necessary. There may even be instances where this actually is correct – but thus far, I’ve not been convinced by anyone who isn’t cloning their production systems backups that they don’t need to.

I always think of backups as insurance – it’s literally what they are. In fact, my book is titled on that premise. So, on that basis, if you’re not cloning, it’s like taking out an insurance policy from a company that in turn doesn’t have an underwriter – i.e., they can’t guarantee being able to deliver on the insurance if you need to make a claim.

Would you really take out insurance with a company that can’t provide a guarantee they can honour a legitimate claim?

So, let’s dissect the common arguments as to why cloning typically isn’t done:

Money

This is the most difficult one, and to me it suggests that the business, overall, doesn’t appreciate the role of backup. It means that the IT department is solely responsible for sourcing funding from its own budget to facilitate backup.

It means the company doesn’t get backup.

Backup is not an IT function. It’s a corporate governance function, or an operating function. It’s a function that belongs to every department. Returning to insurance, therefore, it’s something that must be funded by every department, or rather, the company as a whole. The finance department, for instance, doesn’t solely provide, out of its own departmental budget, the funding for insurance for a company. Funding for such critical, company wide expenditure comes from the entire company operating budget.

So, if you don’t have the money to clone, you have the hardest challenge – you need to convince the business that it, not IT, is responsible for backup budget, and cloning is part of that budget.

Time/Backup Window

If you’re not cloning because of the time it takes to do so, or the potential increase to the backup window (or that the backup window is already too long), then you’ve got a problem.

Typically such a problem has one of two solutions:

  • Revisit the environment – are there architectural changes that can be made to improve the processes? Are there procedural changes that can be made to improve the processes? Are backup windows arbitrary rather than meaningful? Consider the environment at hand – it may be that the solution is there, waiting to be implemented.
  • Money – sometimes the only way to make the time available is to spend money on the environment. If you’re worried about being able to spend money on the environment, revisit the previous comment on money.

Backup to another site

This is probably the most insidious reason that might be invoked for not needing to clone. It goes something like this:

We back up our production datacentre to storage/media in the business continuance/disaster recovery site. Therefore we don’t need to clone.

This argument disturbs me. It’s false for two very, very important reasons:

  • If your storage/media fails in the business continuance/disaster recovery site, you’ve lost your historical backups anyway – think, for example, of Sarbanes-Oxley retention requirements.
  • If your production site fails, you only have one copy of your data left – on the backups. Not good.

In summary

There are true business imperatives why you should be cloning. At least for production systems, your backups should never represent a single point of failure to your environment, and need to be developed and maintained on the premise that they represent insurance. As such, not having a backup of your backup may be one of the worst business decisions that you could make.

Non-group cloning

If you’re looking to manage cloning outside of NetWorker groups but not wanting to write scripts, I’d suggest you check out IDATA Tools, a suite of utilities I helped to design and continue to write; included in the tools suite is a utility called sslocate, which is expressly targeted at assisting with manual cloning operations.
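
To give a feel for what script-based manual cloning involves (independently of sslocate, whose internals aren’t shown here), here’s a minimal shell sketch: query for recent save sets that exist in only one copy, then pass them to nsrclone as a single batch. The server name and pool are hypothetical placeholders, and the mminfo/nsrclone options are recalled from NetWorker 7.x usage – verify them against the man pages for your release.

```shell
#!/bin/sh
# Hypothetical sketch of manual (non-group) cloning.
# "backupserver" and "Offsite Clone" are placeholder names; check all
# flags against the mminfo/nsrclone man pages for your release.

SERVER=backupserver
CLONE_POOL="Offsite Clone"

# mminfo: -q filters save sets (comma-separated terms are ANDed),
# -r picks report fields. Here: save sets from the last day that
# currently have only a single copy.
SSIDS=$(mminfo -s "$SERVER" -q 'savetime>=yesterday,copies=1' -r ssid | sort -u)

if [ -z "$SSIDS" ]; then
    echo "Nothing to clone."
    exit 0
fi

# nsrclone: -b names the destination clone pool, -S takes save set IDs.
# Word-splitting of $SSIDS is intentional (one argument per ssid).
nsrclone -s "$SERVER" -b "$CLONE_POOL" -S $SSIDS
```

One advantage of selecting by save set ID rather than by volume is that re-running the script naturally picks up anything that failed on the first pass, since those save sets will still report only one copy.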


10 Responses to “How important is it to clone?”

  1. Preston — love the blog: the topic, the tone and the depth — keep up the good work!

    — Chuck

  2. Mike Dutch said

    Hi Preston,

    Your first two points hit the mark but the argument about backup to another site hasn’t convinced me.

    First, it sounds like you’re using “historical backups” in the sense of archive. While backup and archive both make copies of data, they have independent lifecycles and purposes. It’s comparing apples and oranges — both of which are good ;-)

    Second, saying “not good” when you’ve had a site failure isn’t exactly relevant to the point being discussed. It’s equivalent to saying “not good” when all your backups are destroyed. True enough but it says nothing about how well protected the “other copies” are.

  3. nsrd said

    Hi Mike,

    You raised two concerns about my argument that backing up to another site isn’t sufficient reason to avoid cloning. I’ll cover both off.

    In the first, when I said “historical backups”, I didn’t mean archives – backups and archives are indeed not the same thing. What I was referring to is the long term retention backups, such as say, monthly backups that are kept for a period of years, rather than weeks or days.

    Typically when you back up to another site, you’re not backing up to standalone tape drives – the vast majority of the time it’s some form of mass storage system, regardless of whether that’s a PTL, VTL, or disk backup unit (DBU). So this means that you’ll have more than just the most recent backup stored “online” or “nearline” at the disaster recovery/business continuity site. (Let’s refer to it as the BCS.)

    A good backup system should keep a particular percentage of backup history online – the actual amount will depend on the frequency of recovery requests* – thus, if the BCS, say, burns to the ground and you don’t clone, there’s a high likelihood you’ll have lost backups prior to the most recent backup as well, and that more than likely means losing data that you’ll want to recover at some point. Hence, unless you, say, remove every tape from your BCS as soon as the backup completes, you’re going to lose data in a BCS failure in this scenario.

    However, that still doesn’t protect you from media failure. Say you do remove every piece of media from the BCS as soon as you’ve finished writing to it. What happens if you recall that media from its offsite storage location to the BCS, load it into a drive to recover data that, say, the financial regulators require you to recover, and the drive chews the tape – or the tape was dropped by the storage people and fails to mount?

    This brings us to my second point, the “Not good” comment when it comes to dealing with a production site failure. Forgive my Australian tendency to understate problems. “Terrifying beyond all possible belief” is, I guess, the expression I would use if I were being completely open – that’s to describe the notion of having my production site disappear and having to rely on any of the following scenarios:

    (a) A cold BCS that needs to be bootstrapped from backups where there is only one copy

    (b) A ‘warm’ BCS that has been kept mostly in sync but still needs some recoveries to be done where there is only one copy

    (c) A ‘hot’ BCS that has been kept fully in sync but _suddenly_ I’m in a position where all my eggs are in the one basket, particularly if there are regulatory reasons why backups and originals can’t be kept together. (E.g., a common reason to back up prod to the BCS is to get data “immediately offsite”. If you don’t have any provisioning in your backup solution for cloning, and you’re suddenly running off your BCS with no production site to back up to, you may have a solution that is not looked upon too favorably by the powers that be.)

    The ultimate problem with failing to clone, though, is the risk of cascading failures: the potential for the media you’re recovering from to experience a fault mid-recovery, invalidating the media and preventing the recovery from succeeding.


    * Something I cover in my book, “Enterprise Systems Backup and Recovery: A Corporate Insurance Policy.”

  4. The thing is, the cloning code in NetWorker is really limited. Unless you have a really, really simple setup, the built-in “clone at group completion” is not going to get it done for you. Anyone who is serious about creating duplicates ends up scripting it.

    What really stinks is that it is ridiculously hard to get parallel cloning working. I have it at our site because we had EMC engaged and they provided custom scripts to do it. It works great and we could not get our cloning done without it. That said, cloning from multiple source drives to multiple target drives should just be built into the product.

    At least, it should be if cloning is so very crucial.

    • nsrd said

      Scott, you raise valid points – cloning in NetWorker isn’t always as admin-friendly as it could be. I do think that at least some of the time this happens it’s because the solution hasn’t been sized/scoped correctly. That is, it’s not uncommon to see, say, physical tape libraries purchased with only just enough drives to facilitate backup, and maybe one extra drive that is meant to somehow enable cloning to work successfully. Please note I’m not saying this is your experience, just that it is a common one.

      While I’m not a big fan of VTLs, I do think that when looking at hardware aspects to solutions, having a VTL + PTL, where you backup first to the VTL then clone to the PTL, does really give cloning a boost in NetWorker, so long as the VTL is configured optimally. (By optimally I mean lots of virtual drives, and lots of small virtual tapes.)

      But looking at the software side of it, the NetWorker side, yes, there’s still a lot to be desired. Your experience – that group cloning wasn’t sufficient – isn’t uncommon. It best suits smaller sites, or sites with no overlapping groups. Indeed, many sites don’t use group cloning at all, and do all cloning via scripts; in fact, the company I work for produces software to assist in this. If nothing else, this points to the value of having a framework-based backup product such as NetWorker that allows for site customisation to such a level.

      That being said, NetWorker 7.5 has introduced some additional features into nsrclone, such as the ability to specify how many copies you want made, etc., which will (to a degree) make scripting of cloning operations easier. It would be nice if some of these additional features made it into the group cloning criteria, but I’m not sure whether that will happen or not.

      Being able to generate multiple copies of a backup simultaneously is something that NetWorker is sadly lacking at the moment. I suggest if you feel strongly on this, you should send an email to networker_usability@emc.com – this is an email address that really is monitored actively by the product managers, and feedback to this email address is vital. Having periodically talked to several of the product managers, and having known former EMC (and Legato) NetWorker product managers, I know that the best way to get features added, or planned features expedited, is for real customer prompting.

  5. The main thing nsrclone is lacking is the ability to clone from multiple tapes simultaneously. E.g., I have a list of tapes (or, frankly, savesets generated via mminfo) to clone – XXX, YYY, ZZZ – and I have 6 available tape drives for cloning operations. Without (very) extraordinary measures, NetWorker will use 1 source device and 1 target device and clone serially. It will not use 3 sources and 3 targets even if those resources are available.

    Obviously, I have this functionality in my environment. But it should be built into the product. If I buy enough tape drives to complete backups in my backup window, the rest of the day and all of those tape drives can be used for cloning operations (reserving a few drives for ad-hoc backups and recoveries, of course!)

    And yes, I agree with your thoughts on VTL as a first backup target — particularly if you have really LARGE datasets and you care about recovery time. You really don’t want the high multiplexing ratio when you back up that you would normally have with high speed/high density tape media.

  6. nsrd said

    Hi Scott,

    It’s probably important to note, however, that nsrclone is designed to be single-threaded, for want of a better term. If you want multiple cloning operations to run simultaneously, you can run multiple nsrclone instances, which I gather is how the custom scripts designed for your site work.

    Similarly, you can achieve multi-drive cloning out of standard post-group cloning by ensuring that you have multiple groups running, and potentially different pools. That perhaps isn’t so easy to achieve, but it is still an option.
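
    A rough sketch of that multiple-instances approach: partition the candidate save set list and background one nsrclone per chunk. The server, pool, and stream count below are hypothetical, the flags are recalled from NetWorker 7.x usage, and – as this thread illustrates – whether the instances actually clone in parallel depends on device availability and on your NetWorker release:

```shell
#!/bin/sh
# Hypothetical sketch: partition save sets across several concurrent
# nsrclone instances. SERVER, CLONE_POOL and STREAMS are placeholder
# values; verify all flags against your release's man pages.

SERVER=backupserver
CLONE_POOL="Offsite Clone"
STREAMS=3   # ideally one stream per free source/target device pair

# Gather candidate save set IDs (single-copy save sets from the last day).
mminfo -s "$SERVER" -q 'savetime>=yesterday,copies=1' -r ssid |
    sort -u > "/tmp/ssids.$$"

# Round-robin the IDs into $STREAMS chunk files.
i=0
while read -r ssid; do
    echo "$ssid" >> "/tmp/ssid-chunk.$((i % STREAMS)).$$"
    i=$((i + 1))
done < "/tmp/ssids.$$"

# One backgrounded nsrclone per chunk; wait for all to finish.
for chunk in /tmp/ssid-chunk.*."$$"; do
    [ -f "$chunk" ] || continue
    nsrclone -s "$SERVER" -b "$CLONE_POOL" -S $(cat "$chunk") &
done
wait
rm -f "/tmp/ssids.$$" /tmp/ssid-chunk.*."$$"
```

    In practice you’d size STREAMS to the number of free source/target device pairs, since any extra streams simply end up waiting on media.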

  7. If you run nsrclone manually and you launch a second instance, cloning from different sources to a different pool (or even the same target pool), you WILL NOT get parallel cloning. The second instance will wait for the first to finish. There is much more to it than that.

    EMC’s internal critical account team has developed a solution that must be customized on a per-customer basis. nsrclone by design is meant to be run serially. Getting it to do otherwise takes substantial effort.

    The second idea doesn’t work either.

    By default, NetWorker will queue up subsequent nsrclone operations, whether launched from the CLI or via the automated mechanism built into group completion.

    Been there, done it, do not have the t-shirt.

    Parallel cloning should be a feature that is easily enabled within the product. If cloning matters to you and you have a large, busy environment, you can’t get the job done without it.

  8. nsrd said

    Hi Scott,

    What you’re describing – nsrclone being totally linear when multiple instances run with separate resource requirements – is totally contrary to my 12+ years of use of NetWorker.

    Even running up a quick test environment in my lab, I was immediately able to have 2 nsrclone processes simultaneously reading from disk backup units and writing out to alternate media. If nsrclone didn’t support this form of parallel operation, you wouldn’t be able to have multiple groups cloning simultaneously.

    Maybe if you set up multiple cloning jobs from within NMC as manual clones this serialisation occurs, but my question is why use a GUI to initiate a command line activity?

    If you’re experiencing this sort of issue, there’s some bug or misconfiguration at play in your environment.

  9. […] If you’re dealing with potentially damaged media, the file/record numbers can sometimes be your saving grace. Say you’re doing a recovery and mid-way through the recovery NetWorker aborts saying there’s an error at file number 33 on the tape. At this point, you should at this point have a clone. Honestly, you really, really should have a clone. […]
