NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist

  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).
  •  


     


     

  • Enterprise Systems Backup and Recovery

    If you find this blog interesting, and either have an interest in or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on features, policies, procedures and the human element to ensuring that your company has a suitable and working backup system rather than just a bunch of copies made by unrelated software, hardware and processes.
  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).
  •  


     


     

  • Twitter

    Error: Twitter did not respond. Please wait a few minutes and refresh this page.

What is a zero error policy?

Posted by Preston on 2009-08-11

In my book, I recommend that all businesses should adopt a zero error policy in regards to backup. I personally think that zero error policies are the only way that a backup system should be run. To be perfectly frank, anything less than a zero error policy is irresponsible in data protection.

Now, the problem with talking about zero error policies is that many people get excited about the wrong things when it comes to them. That is, they either focus on:

  • This will be too expensive!

or

  • Who gets into trouble when errors DO occur?

Not only are these attitudes not helpful, but they’re not necessary either.

Having a zero error policy requires the following three rules:

  1. All errors shall be known.
  2. All errors shall be resolved.
  3. No error shall be allowed to continue to occur indefinitely.

You may think that rule (2) implies rule (3), and it does, but rule (3) gives us a special case/allowance for noting that some errors are permitted, in the short term, if there is a sufficient reason.

The actual purpose of the zero error policy is to ensure that any error or abnormal report from the backup system is treated as something requiring investigation and resolution. If this sounds like a lot of work, there’s a couple of key points to make:

  • When switching from any other policy to a zero error policy, yes, there will be a settling-in period that takes more time and effort, but once the initial hurdle has been cleared there should not be a significant ongoing drain of resources;
  • Given the importance of successful backups (i.e., being able to successfully recover when required), the work that is required is not only important, but very easily arguably necessary and ethically required.

Let’s step through those three rules.

All errors shall be known

Recognising that there must be limits to the statement “all errors shall be known”, we take this to mean that if an error is reported it will be known about. The most simple interpretation of this is that all savegroup completion reports must be read. For the purposes of a NetWorker backup environment, any run-time backup error is going to appear in the savegroup completion report, and so reading the report and checking on a per-host basis is the most appropriate action.

There are some logical consequences of this requirement:

  1. Backups reports shall be checked.
  2. Recoveries shall be tested.
  3. An issue register shall be maintained.
  4. Backup logs shall be kept for at least the retention period of the backups they are for.

Note: By “…all savegroup completion reports must be read”, I’m not suggesting that you can’t automatically parse results – however, there’s a few rules that have to be carefully followed on this. Discussed more in my book, the key rule however is that when adopting both automated parsing and a zero error policy, one must configure the system such that any unknown output/text is treated as an error. I.e., anything not catered for at time of writing of an automated parser must be flagged as a potential error so that it is either dealt with or added to the parsing routine.

All errors shall be resolved

Errors aren’t meant to just keep occurring. Here’s some reasonably common errors within a NetWorker environment:

  • System fails backup every night because it’s been decommissioned.
  • System fails backup every night because it’s been incorrectly configured for inclusive backups and a filesystem/saveset is no longer present.
  • File open errors on Windows systems.
  • Errors about files changing during backup on Linux/Unix systems.

There’s not a single error in the above list (and I could have made it 5x longer) that can’t be resolved. The purpose of stating “all errors shall be resolved” is to discourage administrators (either backup or individual system administrators) from leaving errors unchallenged.

Every error represents a potential threat to the backup system, in one of two distinct ways:

  1. Real errors represent a recovery threat.
  2. Spurious errors may discourage the detection of a real error.

What’s a spurious error? That’s one where the fault condition is known. E.g., “that backup fails every night because one of the systems has been turned off”. In most cases, spurious errors are going to either come down to at best a domain error (“I didn’t fix that because it’s someone else’s problem”) or at worst, laziness (“I haven’t found the <1 minute required to turn off the backup for a decommissioned system”).

Spurious errors, I believe, are actually as bad, if not worse, than the real errors. While we work to protect our systems against real errors, it’s a fact of life and systems administration that they will periodically occur. Systems change, minor bugs may surface, environmental factors may play a part, etc. The role of the backup administrator therefore is to be constantly vigilant in detecting errors, taking preventative actions where applicable, and corrective actions where necessary.

Allowing spurious errors to continually occur within a backup system is however inappropriate, and runs totally contrary to good administration practices. The key problem is that if you come to anticipate that particular backups will have failures, you become lax in your checking, and thus may skip over real errors that creep in. As an example, consider the “client fails because it has been decommissioned” scenario. In NetWorker terms, this may mean that a particular savegroup completes every day with a status of “1 client failed”. So, every day, an administrator may note that the group had 1 failed client and not bother to check the rest of the report, since that failed client is expected. But what if another administrator had decommissioned that client? What if that client is no longer in the group, but another client is now being reported as failed every day?

That’s the insidious nature of spurious errors.

No error shall be allowed to continue indefinitely

No system is perfect, so we do have to recognise that some errors may have a life-span greater than a single backup job. However, in order for a zero error policy to work properly, we must give time limits to any failure condition.

There are two aspects to this rule – one is the obvious, SLA style aspect, to do with the length at which an error is allowed to occur before it is escalated and/or must be resolved. (E.g., “No system may have 3 days of consecutive backup failures”).

The other aspect to this rule that can be more challenging to work with is dealing with those “expected” errors. E.g., consider a situation where the database administrators are trialling upgrades to Oracle on a development server. In this case, it may be known that the development system’s database backups will fail for the next 3 days. In such instances, to correctly enable zero-error policies, one must maintain not only an issues register, but an expected issues register – that is, noting which errors which are going to happen, and when they should stop happening*.

Summarising

Zero error policies are arguably not only a functional but ethical requirement of good backup administration. While they may take a little while to implement, and may formalise some of the work processes involved in the backup system, these should not be seen as a detriment. Indeed, I’d go so far as to suggest that you can’t actually have a backup system without a zero error policy. That is, without a zero error policy you can still get backups/recoveries, but with less degrees of certainty – and the more certainty you can build into a backup environment, the more it becomes a backup system.

[Ready for more? Check out the next post on this topic, Zero Error Policy Management.]


* In the example given, we could in theory use the “scheduled backup” feature of a client instance to disable backups for that particular client. However, that feature has a limitation in that there’s no allowances for automatically turning scheduled backups on again at a later date. Nevertheless, it’s a common enough scenario that it serves the purpose of the example.

About these ads

7 Responses to “What is a zero error policy?”

  1. Preston –

    Excellent post.

    The principles that you’ve outlined extend to far more than backups. In the general case, any process or application that fails for any reason needs to have the same principles applied.

    I’d guess that in the last couple decades, the cases where I ignored a ‘transient’ error came back to haunt me more often than not.

    –Mike

    • Preston said

      Hi Mike,

      Thanks for the feedback; I’d certainly agree that going for a zero error policy should be the aim of many areas, not just in backup/data protection fields. Moving just a little beyond backup, it certainly could be equally argued that all aspects of system administration should follow this approach as well.

      And yes, I too have learned the need for a zero error policy the hard way, from ignoring errors – usually thinking I was too busy, then later being much, much busier trying to correct the consequences of the error I’d previously ignored.

      Cheers,

      Preston.

  2. OK, then — being a total newbie to this theory, I had a look at my errors again, and they are usually errors along the lines of files changing during active backups.

    How do I make this to NOT be an error? I can’t stop the file from changing (or rather, I am not in a position to halt the process generating the change) and I do want to back up the file in one form or another.

    So what do I do?

    • Preston said

      Hi David,

      There’s a few things that you can do here. First step is to differentiate between the errors and the warnings. The errors are going to be hard failures – e.g., a client being unable to backup at all. The warnings are soft failures – they let the backup continue, but they still need to be considered.

      Expanding on the zero error policy a little, we can permit ‘soft failures’ to continue to occur so long as it’s documented they happen (in case anyone else needs to work with the system) and there’s acknowledged acceptance of the limitations created by those soft failures.

      In this case, with warnings of files changing, you could:

      If they are files necessary to the recovery of the host – It’s likely that backups will need to be ‘fixed’ – e.g., on Windows systems, use VSS so that the backup is done as an instant point-in-time snapshot.
      If they aren’t files necessary to the recovery of the host – e.g., tmp files – then you may consider noting these as permitted ‘soft errors’. This would mean updating an issues or errors register to indicate that those sorts of errors are potentially going to occur, but they’re known to not require investigation. (“Client X will generate open file errors for files with extensions .TMP, .LST and .TEMP. Open file errors for those extensions ONLY are permitted.”)

      (Obviously part of the process involved in having a zero error policy is a more formal approach to operational documentation and procedures on site.)

  3. […] What is a zero error policy? – […]

  4. […] by Preston In the first article on the subject, What is a zero error policy?, I established the three rules that need to be followed to achieve a zero error policy, […]

  5. […] There is no such thing as 100% certainty, but the closest you can get to it is by maintaining a zero error policy. In essence, by maintaining a zero error policy, you become immediately aware of any issues that […]

Sorry, the comment form is closed at this time.

 
Follow

Get every new post delivered to your Inbox.

%d bloggers like this: