What is a zero error policy?
Posted by Preston on 2009-08-11
In my book, I recommend that all businesses should adopt a zero error policy with regard to backup. I personally think that a zero error policy is the only way a backup system should be run. To be perfectly frank, anything less than a zero error policy is irresponsible in data protection.
Now, the problem with talking about zero error policies is that many people get excited about the wrong things when it comes to them. That is, they either focus on:
- This will be too expensive!
- Who gets into trouble when errors DO occur?
Not only are these attitudes not helpful, but they’re not necessary either.
Having a zero error policy requires the following three rules:
- All errors shall be known.
- All errors shall be resolved.
- No error shall be allowed to continue to occur indefinitely.
You may think that the second rule implies the third, and it does, but the third rule gives us a special allowance: some errors are permitted, in the short term, if there is a sufficient reason.
The actual purpose of the zero error policy is to ensure that any error or abnormal report from the backup system is treated as something requiring investigation and resolution. If this sounds like a lot of work, there are a couple of key points to make:
- When switching from any other policy to a zero error policy, yes, there will be a settling-in period that takes more time and effort, but once the initial hurdle has been cleared there should not be a significant ongoing drain of resources;
- Given the importance of successful backups (i.e., being able to successfully recover when required), the work that is required is not only important, but arguably both necessary and ethically required.
Let’s step through those three rules.
All errors shall be known
Recognising that there must be limits to the statement “all errors shall be known”, we take this to mean that if an error is reported it will be known about. The simplest interpretation of this is that all savegroup completion reports must be read. For the purposes of a NetWorker backup environment, any run-time backup error is going to appear in the savegroup completion report, and so reading the report and checking on a per-host basis is the most appropriate action.
There are some logical consequences of this requirement:
- Backup reports shall be checked.
- Recoveries shall be tested.
- An issue register shall be maintained.
- Backup logs shall be kept for at least the retention period of the backups they are for.
Note: By “…all savegroup completion reports must be read”, I’m not suggesting that you can’t automatically parse results. However, there are a few rules that have to be carefully followed when doing so. They are discussed more in my book, but the key rule is that when adopting both automated parsing and a zero error policy, one must configure the system such that any unknown output/text is treated as an error. I.e., anything not catered for when the automated parser was written must be flagged as a potential error so that it is either dealt with or added to the parsing routine.
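As a rough illustration of that key rule, here is a minimal Python sketch of a parser that only accepts lines it positively recognises. The two patterns and the sample report lines are invented for the example and are not real NetWorker output; in practice you would build the pattern lists from your own savegroup completion reports.

```python
import re

# Hypothetical patterns: real savegroup completion report lines will differ,
# so these must be replaced with patterns taken from your own reports.
KNOWN_OK = [
    re.compile(r"^\S+: .* completed successfully$"),
]
KNOWN_ERROR = [
    re.compile(r"^\S+: .* failed\b"),
]


def classify_line(line):
    """Return 'ok', 'error' or 'unknown' for a single report line."""
    if any(p.search(line) for p in KNOWN_ERROR):
        return "error"
    if any(p.search(line) for p in KNOWN_OK):
        return "ok"
    # Anything the parser was not written to handle is NOT assumed to be fine.
    return "unknown"


def lines_needing_attention(report_text):
    """Return (status, line) for every non-blank line not known to be OK."""
    flagged = []
    for line in report_text.splitlines():
        line = line.strip()
        if not line:
            continue
        status = classify_line(line)
        if status != "ok":
            flagged.append((status, line))
    return flagged
```

The important design choice is the final `return "unknown"`: the parser has an allow-list of acceptable output rather than a deny-list of errors, so new or unexpected text surfaces for a human to either resolve or add to the patterns.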
All errors shall be resolved
Errors aren’t meant to just keep occurring. Here are some reasonably common errors within a NetWorker environment:
- System fails backup every night because it’s been decommissioned.
- System fails backup every night because it’s been incorrectly configured for inclusive backups and a filesystem/saveset is no longer present.
- File open errors on Windows systems.
- Errors about files changing during backup on Linux/Unix systems.
There’s not a single error in the above list (and I could have made it 5x longer) that can’t be resolved. The purpose of stating “all errors shall be resolved” is to discourage administrators (either backup or individual system administrators) from leaving errors unchallenged.
Every error represents a potential threat to the backup system, in one of two distinct ways:
- Real errors represent a recovery threat.
- Spurious errors may discourage the detection of a real error.
What’s a spurious error? That’s one where the fault condition is known. E.g., “that backup fails every night because one of the systems has been turned off”. In most cases, spurious errors are going to either come down to at best a domain error (“I didn’t fix that because it’s someone else’s problem”) or at worst, laziness (“I haven’t found the <1 minute required to turn off the backup for a decommissioned system”).
Spurious errors, I believe, are actually as bad as, if not worse than, the real errors. While we work to protect our systems against real errors, it’s a fact of life and systems administration that they will periodically occur. Systems change, minor bugs may surface, environmental factors may play a part, etc. The role of the backup administrator therefore is to be constantly vigilant in detecting errors, taking preventative actions where applicable, and corrective actions where necessary.
Allowing spurious errors to continually occur within a backup system is however inappropriate, and runs totally contrary to good administration practices. The key problem is that if you come to anticipate that particular backups will have failures, you become lax in your checking, and thus may skip over real errors that creep in. As an example, consider the “client fails because it has been decommissioned” scenario. In NetWorker terms, this may mean that a particular savegroup completes every day with a status of “1 client failed”. So, every day, an administrator may note that the group had 1 failed client and not bother to check the rest of the report, since that failed client is expected. But what if another administrator had decommissioned that client? What if that client is no longer in the group, but another client is now being reported as failed every day?
That’s the insidious nature of spurious errors.
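To make the “1 client failed” trap concrete, the following Python sketch compares failures by client name rather than by count. The client names and the idea of a hand-maintained list of expected failures are assumptions made for the illustration, not part of NetWorker itself.

```python
def unexpected_failures(failed_clients, expected_failures):
    """Failed clients that nobody has recorded as expected, sorted by name."""
    return sorted(set(failed_clients) - set(expected_failures))


def stale_expectations(failed_clients, expected_failures):
    """Clients still listed as expected failures that did not fail today."""
    return sorted(set(expected_failures) - set(failed_clients))


# Hypothetical data: the count is 1 on both days, but the client has changed.
expected = ["oldserver"]
monday_failed = ["oldserver"]
tuesday_failed = ["newdb"]

monday_new = unexpected_failures(monday_failed, expected)
tuesday_new = unexpected_failures(tuesday_failed, expected)
tuesday_stale = stale_expectations(tuesday_failed, expected)
```

On the Tuesday, a count-based check still sees one failure, whereas the name-based check reports `newdb` as a new problem and `oldserver` as an expectation that is no longer true.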
No error shall be allowed to continue indefinitely
No system is perfect, so we do have to recognise that some errors may have a life-span greater than a single backup job. However, in order for a zero error policy to work properly, we must give time limits to any failure condition.
There are two aspects to this rule. One is the obvious, SLA-style aspect: the length of time an error is allowed to persist before it is escalated and/or must be resolved (e.g., “No system may have 3 days of consecutive backup failures”).
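An SLA of this kind is simple to check mechanically. The Python sketch below assumes you already have, for each client, a list of daily results ordered oldest first, with `True` meaning the backup succeeded; that data layout and the client names are assumptions for the example.

```python
def trailing_failures(history):
    """Count consecutive failures at the end of a history (oldest first)."""
    count = 0
    for succeeded in reversed(history):
        if succeeded:
            break
        count += 1
    return count


def sla_breaches(histories, max_consecutive=3):
    """Return clients whose most recent failures have reached the SLA limit."""
    return sorted(
        client
        for client, history in histories.items()
        if trailing_failures(history) >= max_consecutive
    )


# Hypothetical daily results per client, oldest first.
histories = {
    "web01": [True, False, False, False],  # three failures in a row
    "db01": [False, False, True],          # recovered on the last run
    "app01": [False, True, False, False],  # two failures in a row
}
breaches = sla_breaches(histories)
```

Only the run of failures at the end of the history counts, so a client that failed earlier but has since succeeded is not escalated.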
The other aspect to this rule, which can be more challenging to work with, is dealing with those “expected” errors. E.g., consider a situation where the database administrators are trialling upgrades to Oracle on a development server. In this case, it may be known that the development system’s database backups will fail for the next 3 days. In such instances, to correctly enable zero-error policies, one must maintain not only an issues register, but an expected issues register; that is, a record noting which errors are going to happen, and when they should stop happening*.
Zero error policies are arguably not only a functional but also an ethical requirement of good backup administration. While they may take a little while to implement, and may formalise some of the work processes involved in the backup system, these should not be seen as a detriment. Indeed, I’d go so far as to suggest that you can’t actually have a backup system without a zero error policy. That is, without a zero error policy you can still get backups/recoveries, but with less certainty, and the more certainty you can build into a backup environment, the more it becomes a backup system.
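One possible shape for an expected issues register is sketched below in Python. The `ExpectedIssue` record, its field names, the client names and the dates are all hypothetical; the point is only that every expected error carries an end date, after which it is treated like any other failure.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExpectedIssue:
    client: str
    reason: str
    expires: date  # last day on which this failure is still acceptable


def triage(failed_clients, register, today):
    """Split today's failures into (tolerated, needs_action) lists."""
    active = {entry.client for entry in register if today <= entry.expires}
    failed = set(failed_clients)
    tolerated = sorted(c for c in failed if c in active)
    needs_action = sorted(c for c in failed if c not in active)
    return tolerated, needs_action


# Hypothetical register entry for the Oracle upgrade trial described above.
register = [
    ExpectedIssue("oradev01", "Oracle upgrade trial", date(2009, 8, 14)),
]
during_trial = triage(["oradev01", "mail01"], register, date(2009, 8, 12))
after_trial = triage(["oradev01"], register, date(2009, 8, 15))
```

Because the expiry is checked on every run, an “expected” failure cannot quietly become a permanent one: once the date passes, the same client shows up in the list that needs action.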
[Ready for more? Check out the next post on this topic, Zero Error Policy Management.]
* In the example given, we could in theory use the “scheduled backup” feature of a client instance to disable backups for that particular client. However, that feature has a limitation in that there is no allowance for automatically turning scheduled backups on again at a later date. Nevertheless, it’s a common enough scenario that it serves the purpose of the example.