Zero Error Policy Management
Posted by Preston on 2009-08-25
In the first article on the subject, What is a zero error policy?, I established the three rules that need to be followed to achieve a zero error policy, viz:
- All errors shall be known.
- All errors shall be resolved.
- No error shall be allowed to continue to occur indefinitely.
As a result of various questions and discussions I’ve had about this, I want to expand on the zero error approach to backups to discuss management of such a policy.
Saying that you’re going to implement a zero error policy – indeed, wanting to implement a zero error policy, and actually implementing are significantly different activities. So, in order to properly manage a zero error policy, the following three components must be developed, maintained and followed:
- Error classification.
- Procedures for dealing with errors.
- Documentation of the procedures and the errors.
In various cases I’ve seen companies try to implement a zero error policy by following one or two of the above, but they’ve never succeeded unless they’ve implemented all three.
Let’s look at each one individually.
Classification is at the heart of many activities we perform. In data storage, we classify data by its importance and its speed requirements, and assign tiers. In systems protection, we classify systems by whether they’re operational production, infrastructure support production, development, Q&A, test, etc. Stepping outside of IT, we routinely do things by classification – we pay bills in order of urgency, or we go shopping for the things we need sooner rather than the things we’re going to run out of in three months time, etc. Classification is not only important, but it’s also something we do (and understand the need for) naturally – i.e., it’s not hard to do.
In the most simple sense, errors for data protection systems can be broken down into three types:
- Critical errors – If error X occurs then data loss occurs.
- Hard errors – If error X occurs and data loss occurs, then recoverability cannot be achieved.
- Soft errors – If error X occurs and data loss occurs, then recoverability can still be achieved, but with non-critical data recoverability uncertain.
Here’s a logical follow-up from the above classification – any backup system designed such that it can cause a critical error has been incorrectly designed. What’s an example of a critical error? Consider the following scenario:
- Database is shutdown at 22:00 for cold backups by scheduled system task
- Cold backup runs overnight
- Database is automatically started at 06:00 by scheduled system task
Now obviously our preference would be to use a backup module, but that’s actually not the risk of critical error here: it’s the divorcing of the shutdown/startup from the actual filesystem backup. Why does this create a “critical error” situation, you may ask? On any system where exclusive file locking takes place, if for any reason the backup is still running when the database is started, corruption is likely to occur. (For example, I have seen Oracle databases on Windows destroyed by such scenarios.)
So, a critical error is one where the failure in the backup process will result in data loss. This is an unacceptable error; so, not only must we be able to classify critical errors, but all efforts must be made to ensure that no scenarios which permit critical errors are ever introduced to a system.
Moving on, a hard error is one where we can quantify that if the error occurs and we subsequently have data loss (recovery required), then we will not be able to facilitate that recovery to within our preferred (or required) windows. So if a client completely fails to backup overnight, or one filesystem on the client fails, then we would consider that to be a hard error – the backup did not work and thus if there is a failure on that client we cannot use that backup to recover.
A soft error, on the other hand, is an error that will not prevent core recovery from happening. These are the most difficult to classify. Using NetWorker as an example, you could say that these will often be the warnings issued during the backups where the backup still manages to complete. Perhaps the most common example of this is files being open (and thus inaccessible) during backup. However, we can’t (via a blanket rule) assume that any warning is a soft error – it could be a hard error in disguise.
To use language as an example, a syntax error is one which is immediately obvious. A semantic error is one where the meaning is not obvious. Thus, syntax errors cause an immediate failure, whereas semantic errors usually cause a bug.
Taking that analogy back to soft vs hard errors, and using our file-open example, you can readily imagine a scenario where files open during backup could constitute a hard or a soft error. In the case of a soft error, it may refer to temporary files that are generated by a busy system during backup processing. Such temporary files may have no relevance to the operational state of a recovered system, and thus the recoverability of the temporary files does not affect the recoverability* of the system as a whole. On the other hand, if critical data files are missed due to being open at the time of the backup, then the recoverability of the system as a whole is compromised.
So, to achieve a zero error policy, we must be able to:
- Classify critical errors, and ensure situations that can lead to them are designed out of the solution.
- Classify hard errors.
- Classify soft errors and be able to differentiate them from hard errors.
One (obvious) net result of this is that you must always check your backup results. No ifs, no buts, no maybes. For those who want to automatically parse backup results, as mentioned in the first article, it also means you must configure the automatic parser such that any unknown result is treated as an error for examination and either action or rule updating.
[Note: An interesting newish feature in NetWorker was the introduction of the “success threshold” option for backup groups. Set to “Warning”, by default, this will see savesets that generated warnings (but not hard errors) flagged as successful. The other option is “Success”, which means that in order for a saveset to be listed as a successful saveset, it must complete without warning. One may be able to argue that in an environment where all attempts have been made to eliminate errors, and the environment operates under a zero-error policy, then this option should be changed from the default to the more severe option.]
Procedures for dealing with errors
The ability to classify an error as critical, hard, or soft is practically useless unless procedures are established for dealing with the errors. Procedures for dealing with errors will be driven, at first, by any existing SLAs within the organisation. I.e., the SLA for either maximum amount of data loss or recovery time will drive the response to any particular error.
That response however shouldn’t be an unplanned reaction. That is, there should be procedures which define:
- By what time backup results will be checked.
- To whom (job title), to where (documentation), and by when critical and hard errors shall be reported.
- To where (documentation) soft errors shall be reported.
- For each system that is backed up, responses to hard errors. (E.g., some systems may require immediate re-run of the backup, whereas others may require the backup to be re-run later, etc.)
Note that this isn’t an exhaustive list – for instance, it’s obvious that any critical errors must be immediately responded to, since data loss has occurred. Equally it doesn’t take into account routine testing, etc., but the above procedures are more for the daily procedures associated with enacting a zero error policy.
Now, you may think that that the above requirements don’t constitute the need for procedures – that the processes can be followed informally. It may seem a callous argument to make, but in my experience in data protection, informal policies lead to laxity in following up those policies. (Or: if it isn’t written down, it isn’t done.)
Obviously when checks aren’t done it’s rarely for a malicious reason. However, knowing that “my boss would like a status report on overnight backups by 9am” is elastic – and so if we’re feeling there’s other things we need to look at first, we can choose to interpret that as “would like by 9am, but will settle for later”. If however there’s a procedure that says “management must have backup reports by 9am”, it takes away that elasticity. Where that is important is it actually helps in time management – tasks can be done in a logical and process required order, because there’s a definition of importance of activities within the role. This is critically important – not only for the person who has to perform the tasks, but also for those who would otherwise feel that they can assign other tasks that interrupt these critical processes. You’ve heard that a good offense is a good defense? Well, a good procedure is also a good defense – against lower priority interruptions.
Documentation of the procedures and the errors
There are two acutely different reasons why documentation must be maintained (or three, if you want to start including auditing as a reason). So, to rephrase that, there are three acutely different reasons why documentation must be maintained. These are as follows:
- For auditing and compliance reasons it will be necessary to demonstrate that your company has procedures (and documentation for those procedures) for dealing with backup failures.
- To deal with sudden staff absence – it may be as simple as someone not being able to make it in on time, or it could be the backup administrator gets hit by a bus and will be in traction in the hospital for two weeks (or worse).
- To assist any staff member who does not have an eidetic memory.
In day to day operations, it’s the third reason that’s the most important. Human memory is a wonderfully powerful search and recall tool, yet it’s also remarkably fallible. Sometimes I can remember seeing the exact message 3 years prior in an error log from another customer, but forget that I’d asked a particular question only a day ago and ask it again. We all have those moments. And obviously, I also don’t remember what my colleagues did half an hour ago if I wasn’t there with them at the time.
I.e., we need to document errors because that guarantees us being able to reference them later. Again – no ifs, no buts, no maybes. Perhaps the most important factor in documenting errors in a data protection environment though is documenting in a system that allows for full text search. At bare minimum, you should be able to:
- Classify any input error based on:
- System (server and client)
- Application (if relevant)
- Error type – critical, hard, soft
- Conduct a full text search (optionally date restricted):
- On any of the methods used to classify
- On the actual error itself
The above scenario fits nicely with Wiki systems, so that may be one good scenario, but there are others out there that can be equally used.
The important thing though is to get the documentation done. What may initially seem time consuming when a zero error policy is enacted will quickly become quick and automatic; combined with the obvious reduction in errors over time in a zero error policy, the automatic procedural response to errors will actually streamline the activities of the backup administrator.
That documentation obviously, on a day to day basis, provides the most assistance to the person(s) in the ongoing role of backup administrator. However, in any situation where someone else has to fill in, this documentation becomes even more important – it allows them to step into the role, data mine for any message they’re not sure of and see what the local response was if a situation had happened before. Put yourself into the shoes of that other person … if you’re required to step into another person’s role temporarily, do you want to do it with plenty of supporting information, or with barely anything more than the name of the system you have to administer?
Just like when I first discussed zero error policies, you may be left thinking at the end of this that it sounds like there’s a lot of work involved in managing a zero error policy. It’s important to understand however that there’s always effort involved in any transition from a non-managed system to a managed system (i.e., from informal policies to formal procedures). However, for the most part this extra work mainly comes in at the institution of the procedures – namely in relation to:
- Determining appropriate error categorisation techniques
- Establishing the procedures
- Establishing the documentation of the procedures
- Establishing the documentation system used for the environment
Once these activities have been done, day to day management and operation of the zero error policy becomes a standard part of the job, and therefore doesn’t represent a significant impact to work. That’s for two key reasons: once these components are in place then following them really doesn’t take a lot of extra time, and that time that it does take is actually factored into the job, so the extra time taken can hardly be considered wasteful or frivolous.
At both a personal and ethical level, it’s also extremely satisfying to be able to answer the question, “How many errors slipped through the net today?” with “None”.
One Response to “Zero Error Policy Management”
Sorry, the comment form is closed at this time.