NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist

  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).
  • Enterprise Systems Backup and Recovery

    If you find this blog interesting, and either have an interest in or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on the features, policies, procedures and human element of ensuring that your company has a suitable, working backup system rather than just a bunch of copies made by unrelated software, hardware and processes.

Archive for the ‘Policies’ Category

15 crazy things I never want to hear again

Posted by Preston on 2009-12-14

Over the years I’ve dealt with a lot of different environments, and a lot of different usage requirements for backup products. Most of these fall into the “appropriate business use” category. Some fall into the “hmmm, why would you do that?” category. Others fall into the “please excuse my brain – it’s just scuttled off into the corner to hide – tell me again” category.

This is not about the people, or the companies, but about the crazy ideas that sometimes take hold within companies and should be watched for. While I could have expanded this list to cover a raft of other things outside of backups, I’ve forced myself to keep it to the backup process.

In no particular order then, these are the crazy things I never want to hear again:

  1. After the backups, I delete all the indices, because I maintain a spreadsheet showing where files are, and that’s much more efficient than proprietary databases.
  2. We just backup /etc/passwd on that machine.
  3. But what about /etc/shadow? (My stupid response to the above statement, blurted out after my brain stalled in response to statement #2.)
  4. Oh, hadn’t thought about that (In response to #3).
  5. Can you fax me some cleaning cartridge barcodes?
  6. To save money on barcodes at the end of every week we take them off the tapes in the autochanger and put them on the new ones about to go in.
  7. We only put one tape in the autochanger each night. We don’t want <product> to pick the wrong tape.
  8. We need to upgrade our tape drives. All our backups don’t fit on a single tape any more. (By same company that said #7.)
  9. What do you mean if we don’t change the tape <product> won’t automatically overwrite it? (By same company that said #7 and #8.)
  10. Why would I want to match barcode labels to tape labels? That’s crazy!
  11. That’s being backed up. I emailed Jim a week ago and asked him to add it to the configuration. (Shouted out from across the room: “Jim left last month, remember?”)
  12. We put disk quotas on our academics, but due to government law we can’t do that to their mail. So when they fill up their home directories, they zip them up and email it to themselves then delete it all.
  13. If a user is dumb enough to delete their file, I don’t care about getting it back.
  14. Every now and then on a Friday afternoon my last boss used to delete a filesystem and tell us to have it back by Monday as a test of the backup system.
  15. What are you going to do to fix the problem? (Final question asked by an operations manager after explaining (a) robot was randomly dropping tapes when picking them from slots; (b) tapes were covered in a thin film of oily grime; (c) oh that was probably because their data centre was under the area of the flight path where planes are advised to dump excess fuel before landing; (d) fuel is not being scrubbed by air conditioning system fully and being sucked into data centre; (e) me reminding them we just supported the backup software.)

I will say that #1 and #15 are my personal favourites for crazy statements.

Posted in Backup theory, General Technology, Policies, Quibbles | Tagged: | 1 Comment »

How complex is your backup environment?

Posted by Preston on 2009-12-07

Something I’ve periodically mentioned to various people over the years is that when it comes to data protection, simplicity is King. This can be best summed up with the following rule to follow when designing a backup system:

If you can’t summarise your backup solution on the back of a napkin, it’s too complicated.

Now, the first reaction a lot of people have to that is “but if I do X and Y and Z and then A and B on top, then it’s not going to fit, but we don’t have a complex environment”.

Well, there are two answers to that:

  1. We’re not talking a detailed technical summary of the environment, we’re talking a high level overview.
  2. If you still can’t give a high level overview on the back of a napkin, it is too complicated.

Another way to approach the complexity issue, if you happen to have a phobia about using the back of a napkin: if you can’t give a 30 second elevator summary of your solution, it’s too complicated.

If you’re struggling to think of why it’s important you can summarise your solution in such a short period of time, or such limited space, I’ll give you a few examples:

  1. You need to summarise it in a meeting with senior management.
  2. You need to summarise it in a meeting with your management and a vendor.
  3. You’ve got 5 minutes or less to pitch getting an upgrade budget.
  4. You’ve got a new assistant starting and you’re about to go into a meeting.
  5. You’ve got a new assistant starting and you’re about to go on holiday.
  6. You’ve got consultant(s) (or contractors) coming in to do some work and you’re going to have to leave them on their own.
  7. The CIO asks “so what is it?” as a follow-up question when (s)he accosts you in the hallway and asks, “Do we have a backup policy?”

I can think of a variety of other reasons, but the point remains – a backup system should not be so complex that it can’t be easily described. That’s not to ever say that it can’t either (a) do complex tasks or (b) have complex components, but if the backup administrator can’t readily describe the functioning whole, then the chances are that there is no functioning whole, just a whole lot of mess.

Posted in Backup theory, General thoughts, Policies | Tagged: , , , , , | Comments Off on How complex is your backup environment?

The 5 Golden Rules of Recovery

Posted by Preston on 2009-11-10

You might think, given that I wrote an article a while ago about the Procedural Obligations of Backup Administrators, that it wouldn’t be necessary to explicitly spell out any recovery rules – but this isn’t quite the case. It’s handy to have a “must follow” list of rules for recovery as well.

In their simplest form, these rules are:

  1. How
  2. Why
  3. Where
  4. When
  5. Who

Let’s look at each one in more detail:

  1. How – Know how to do a recovery, before you need to do it. The worst forms of data loss typically occur when a backup routine is put in place that is untried on the assumption that it will work. If a new type of backup is added to an environment, it must be tested before it is relied on. In testing, it must be documented by those doing the recovery. In being documented, it must be referenced by operational procedures*.
  2. Why – Know why you are doing a recovery. This directly affects the required resources. Are you recovering a production system, or a test system? Is it for the purposes of legal discovery, or because a database collapsed?
  3. Where – Know where you are recovering from and to. If you don’t know this, don’t do the recovery. You do not make assumptions about data locality in recovery situations. Trust me, I know from personal experience.
  4. When – Know when the recovery needs to be completed by. This isn’t always answered by the why factor – you actually need to know both in order to fully schedule and prioritise recoveries.
  5. Who – Know that whoever requested the recovery is authorised to request it. (In order to know this, there should be operational recovery procedures – forms and company policies – that indicate authorisation.)

If you know the how, why, where, when and who, you’re following the golden rules of recovery.


* Or to put it another way – documentation is useless if you don’t know it exists, or you can’t find it!

Posted in Backup theory, NetWorker, Policies | Tagged: , , , , | Comments Off on The 5 Golden Rules of Recovery

Laptop/Desktop Backups as easy as 1-2-3!

Posted by Preston on 2009-10-23

When I first mentioned probe based backups a while ago, I suggested that they’re going to be a bit of a sleeper function – that is, I think they’re being largely ignored at the moment because people aren’t quite sure how to make use of them. My take however is that over time we’re going to see a lot of sites shifting particular backups over to probe groups.

Why?

Currently a lot of sites shoe-horn ill-fitting backup requirements into rigid schedules. This results in frequent violations of the best-practice approach to backup of Zero Error Policies. Here’s a prime example: at sites that need to do laptop and/or desktop backups using NetWorker, the administrators are basically resigned to failure rates in such groups of 50% or more, depending on how many machines are not connected to the network at any given time.

This doesn’t need to be the case – well, not any more thanks to probe based backups. So, if you’ve been scratching your head looking for a practical use for these backups, here’s something that may whet your appetite.

Scenario

Let’s consider a site where there is a group of laptops and desktops integrated into the NetWorker backup environment. However, there’s never a guarantee of which machines will be connected to the network at any given time. Therefore administrators typically configure laptop/desktop backup groups to start at, say, 10am, on the premise that the most systems are likely to be available at that time.

Theory of Resolution

Traditional time-of-day start backups aren’t really appropriate to this scenario. What we want is a situation where the NetWorker server waits for those infrequently connected clients to be connected, then runs a backup at the next opportunity.

Rather than having a single group for all clients and accepting that the group will suffer significant failure rates, split each irregularly connected client into its own group, and configure a backup probe.

The backup system will loop probes of the configured clients during nominated periods in the day/night at regular intervals. When the client is connected to the network and the probe successfully returns that (a) the client is running and (b) a backup should be done, the backup is started on the spot.

Requirements

In order to get this working, we’ll need the following:

  • NetWorker 7.5 or higher (clients and server)
  • A probe script – one per operating system type
  • A probe resource – one per operating system type
  • A 1:1 mapping between clients of this type and groups.

Practical Application

Probe Script

This is a command which is installed on the client(s), in the same directory as the “save” or “save.exe” binary (depending on OS type), and whose name must start with either nsr or save. I’ll be calling my script:

nsrcheckbackup.sh

I don’t write Windows batch scripts. Therefore, I’ll give an example as a Linux/Unix shell script, with an overview of the program flow. Anyone who wants to write a batch script version is welcome to do so and submit it.

The “proof of concept” algorithm for the probe script works as follows:

  • Establish a “state” directory in the client nsr directory called bckchk. I.e., if the directory doesn’t exist, create it.
  • Establish a “README” file in that directory for reference purposes, if it doesn’t already exist.
  • Determine the current date.
  • Check for a previous date file. If there was a previous date file:
    • If the current date equals the previous date found:
      • Write a status file indicating that no backup is required.
      • Exit, signaling that no backup is required.
    • If the current date does not equal the previous date found:
      • Write the current date to the “previous” date file.
      • Write a status file indicating that the current date doesn’t match the “previous” date, so a new backup is required.
      • Exit, signaling that a backup is required.
  • If there wasn’t a previous date file:
    • Write the current date to the “previous” date file.
    • Write a status file indicating that no previous date was found so a backup will be signaled.
    • Exit, signaling backup should be done.

Obviously, this is a fairly simplistic approach, but it is suitable for a proof of concept demonstration. If you wish to make the logic more robust for production deployment, my first suggestion would be to build in mminfo checks to determine, even if the dates match, whether there has been a backup “today”. If there hasn’t, that would override and force a backup to start. Additionally, if users can connect via VPN and the backup server can communicate with connected clients, you may want to introduce some logic into the script to deny probe success over the VPN.
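
As a quick sketch of that first suggestion, a probe fragment might resemble the following. Treat it as an assumption-laden sketch: the server name is illustrative, the query syntax should be checked against the mminfo man page for your NetWorker version, and it assumes mminfo exits non-zero when a query returns no matches.

#!/bin/bash
# Hypothetical mminfo check: has this client already had a saveset
# registered today? Assumes mminfo exits non-zero when the query
# returns no matches - verify against your NetWorker version.
SERVER=backupserver          # illustrative server name
CLIENT=`hostname`

if mminfo -s "$SERVER" -q "client=$CLIENT,savetime>=today" > /dev/null 2>&1
then
   exit 1    # a saveset already exists today - no backup required
else
   exit 0    # nothing found today - signal that a backup is required
fi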

If you want an OS-independent script for this, you may wish to code it in Perl, but I’ve held off doing that in this case simply because a lot of sites have reservations about installing Perl on Windows systems. (Sigh.)

Without any further guff, here’s the sample script:

preston@aralathan ~
$ cat /usr/sbin/nsrcheckbackup.sh
#!/bin/bash

PATH=$PATH:/bin:/sbin:/usr/sbin:/usr/bin
CHKDIR=/nsr/bckchk

README=`cat <<EOF
==== Purpose of this directory ====

This directory holds state file(s) associated with the probe based
laptop/desktop backup system. These state file(s) should not be
deleted without consulting the backup administrator.
EOF
`

# Create the state directory if it doesn't already exist.
if [ ! -d "$CHKDIR" ]
then
   mkdir -p "$CHKDIR"
fi

# Quoting $README preserves the line breaks from the heredoc above.
if [ ! -f "$CHKDIR/README" ]
then
   echo "$README" > "$CHKDIR/README"
fi

DATE=`date +%Y%m%d`
COMPDATE=`date "+%Y%m%d %H%M%S"`
LASTDATE="none"
STATUS="$CHKDIR/status.txt"
CHECK="$CHKDIR/datecheck.lck"

# Probe exit status convention: 0 = backup required, 1 = no backup.
if [ -f "$CHECK" ]
then
   LASTDATE=`cat "$CHECK"`
else
   echo "$DATE" > "$CHECK"
   echo "$COMPDATE Check file did not exist. Backup required" > "$STATUS"
   exit 0
fi

if [ -z "$LASTDATE" ]
then
   echo "$COMPDATE Previous check was null. Backup required" > "$STATUS"
   echo "$DATE" > "$CHECK"
   exit 0
fi

if [ "$DATE" = "$LASTDATE" ]
then
   echo "$COMPDATE Last backup was today. No action required" > "$STATUS"
   exit 1
else
   echo "$COMPDATE Last backup was not today. Backup required" > "$STATUS"
   echo "$DATE" > "$CHECK"
   exit 0
fi

As you can see, there’s really not a lot to this in the simplest form.

Once the script has been created, it should be made executable and (for Linux/Unix/Mac OS X systems) placed in /usr/sbin.

Probe Resource

The next step, within NetWorker, is to create a probe resource. This will be shared by all the probe clients of the same operating system type.

A completed probe resource might resemble the following:

[Screenshot: Configuring the probe resource]

Note that there’s no path in the above probe command – that’s because NetWorker requires the probe command to be in the same location as the save command.

Once this has been done, you can either configure the client or the probe group next. Since the client has to be reconfigured after the probe group is created, we’ll create the probe group first.

Creating the Probe Groups

The first step in creating the probe groups is to come up with a naming standard so that they can be easily identified in relation to all other standard groups within the overall configuration. There are two approaches you can take towards this:

  • Preface each group name with a keyword (e.g., “probe”) followed by the host name the group is for.
  • Name each group after the client that will be in the group, but set a comment along the lines of say, “Probe Backup for <hostname>”.

Personally, I prefer the second option. That way you can sort by comment to easily locate all probe based groups, while the group name clearly states up front which client it is for.

When creating a new probe based group, there are two tabs you’ll need to configure – Setup and Advanced – within the group configuration. Let’s look at each of these:

[Screenshot: Probe group configuration – Setup Tab]

You’ll see from the above that I’m using the convention where the group name matches the client name, and the comment field is configured appropriately for easy differentiation of probe based backups.

You’ll need to set the group’s Autostart value to Enabled. Also, the Start Time field has relevance exactly once for probe based backups – it still seems to define the first start time of the probe. After that, the probe backups follow the interval and start/finish times defined on the second tab.

Here’s the second tab:

[Screenshot: Probe group configuration – Advanced Tab]

The key thing on this obviously is the configuration of the probe section. Let’s look at each option:

  • Probe based group – Checked
  • Probe interval – Set in minutes. My recommendation is to give each group a different number of minutes. (Or at least reduce the number of groups that have exactly the same probe interval.) That way, over time as probes run, there’s less likelihood of multiple groups starting at the same time. For instance, in my test setup, I have 5 clients, set to intervals of 90 minutes, 78 minutes, 104 minutes, 82 minutes and 95 minutes*.
  • Probe start time – Time of day that probing starts. I’ve left this on the defaults, which may be suitable for desktops, but for laptops, where there’s a very high chance of machines being disconnected overnight, you may wish to start probing closer to the start of business hours.
  • Probe end time – Time of day that NetWorker stops probing the client. Same caveats as per the probe start time above.
  • Probe success criteria – Since there’s only one client per group, you can leave this at all.
  • Time since successful backup – How many days NetWorker should allow probing to run unsuccessfully before it forcibly starts a backup. If set to zero, it will never force a backup. Since taking the screenshot I’ve actually changed that value to 3 on my configured clients. Set yours to a site-optimal value. Note that since the aim is to run only one backup every 24 hours, setting this to “1” is probably not all that logical an idea.

(The last field, “Time of the last successful backup” is just a status field, there’s nothing to configure there.)

If you have schedules enforced out of groups, you’ll want to set the schedule up here as well.

With this done, we’re ready to move onto the client configuration!

Configuring the Client for Probe Backups

There are two changes required here. In the General tab of the client properties, move the client into the appropriate group:

[Screenshot: Adding the client to the correct group]

In the “Apps & Modules” tab, identify the probe resource to be used for that client:

[Screenshot: Configuring the client probe resource]

Once this has been done, you’ve got everything configured, and it’s just a case of sitting back and watching the probes run and trigger backups of clients as they become available. You’ll note, in the example above, that you can still use savepnpc (pre/post commands) with clients that are configured for probe backups. The pre/post commands will only be run if the backup probe confirms that a backup should take place.

Wrapping Up

I’ll accept that this configuration can result in a lot of groups if you happen to have a lot of clients that require this style of backup. However, that isn’t the end of the world. Reducing the number of errors reported in savegroup completion notifications does make the life of backup administrators easier, even if there’s a little administrative overhead.

Is this suitable for all types of clients? E.g., should you use this to shift away from standard group based backups for the servers within an environment? The answer to that is a big “unlikely”. I really see this as something more suitable for companies that are using NetWorker to back up laptops and/or desktops (or a subset thereof).

If you think no-one does this, I can think of at least five of my customers alone who have requirements to do exactly this, and I’m sure they’re not unique.

Even if you don’t particularly need to enact this style of configuration for your site, what I’m hoping is that by demonstrating a valid use for probe based backup functionality, I may get you thinking about where it could be used at your site for making life easier.

Here’s a few examples I can immediately think of:

  • Triggering a backup+purge of Oracle archived redo logs that kicks in once the used capacity of the filesystem the logs are stored on exceeds a certain percentage (e.g., 85% – see the sketch after this list).
  • Triggering a backup when the number of snapshots of a fileserver exceeds a particular threshold.
  • Triggering a backup when the number of logged in users falls below a certain threshold. (For example, on development servers.)
  • Triggering a backup of a database server whenever a new database is added.
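
As a quick sketch of the first of those ideas, a probe might look like the following. The mount point and the 85% threshold are illustrative values only – adapt both to your environment.

#!/bin/bash
# Hypothetical probe: signal a backup (exit 0) when the filesystem
# holding Oracle archived redo logs exceeds a capacity threshold.
# /u01/arch and 85 are illustrative values only.
LOGFS=/u01/arch
THRESHOLD=85

# Take the "Capacity" column from POSIX df output, minus the % sign.
USED=`df -P "$LOGFS" | awk 'NR==2 { sub("%", "", $5); print $5 }'`

if [ "$USED" -ge "$THRESHOLD" ]
then
   exit 0    # over threshold - backup (and purge) required
else
   exit 1    # under threshold - no backup required
fi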

Trust me: probe based backups are going to make your life easier.


* There currently appears to be a “feature” with probe based backups where changes to the probe interval only take effect after the next “probe start time”. I need to do some more review on this and see whether it’s (a) true and (b) warrants logging a case.

Posted in Backup theory, NetWorker, Policies, Scripting | Tagged: , , , | Comments Off on Laptop/Desktop Backups as easy as 1-2-3!

How much aren’t you backing up?

Posted by Preston on 2009-10-05

Do you have a clear picture of everything that you’re not backing up? For many sites, the answer is not as clear cut as they may think.

It’s easy to quantify the simple stuff – QA or test servers/environments that literally aren’t configured within the backup environment.

It’s also relatively easy to quantify the more esoteric things within a datacentre – PABXs, switch configurations, etc. (Though in a well run backup environment, there’s no reason why you can’t configure scripts that, as part of the backup process, log onto such devices and retrieve their configurations.)

It should also be very, very easy to quantify what data on any individual system that you’re not backing up – e.g., knowing that for fileservers you may be backing up everything except for files that have a “.mp3” extension.

What most sites find difficult to quantify is the quasi-backup situations – files and/or data that they are backing up, but which are useless in a recovery scenario. Now, many readers of that last sentence will probably think of one of the more immediate examples: live database files that are being “accidentally” picked up in the filesystem backup (even if they’re being backed up elsewhere, by a module). Yes, such a backup does fall into this category, but there are other types of backups which are even less likely to be considered.

I’m talking about information that you only need during a disaster recovery – or worse, a site disaster recovery. Let’s consider an average Unix (or Linux) system. (Windows is no different, I just want to give some command line details here.) If a physical server goes up in smoke, and a new one has to be built, there’s a couple of things that have to be considered pre-recovery:

  • What was the partition layout?
  • What disks were configured in what styles of RAID layout?

In an average backup environment, this sort of information isn’t preserved. Sure, if you’ve got say, HomeBase licenses (taking the EMC approach), or using some other sort of bare metal recovery system, and that system supports your exact environment*, then you may find that such information is preserved and is available.

But what about the high percentage of cases where it’s not?

This is where the backup process needs to be configured/extended to support the generation of system or disaster recovery information. It’s all very well, for instance, to say that for a Linux machine you can just recover “/etc/fstab”, but what if you can’t remember the size of the partitions referenced by that file system table? Or, what if you aren’t there to remember what the size of the partitions were? (Memory is a wonderful yet entirely fallible and human-dependent process. Disaster recovery situations shouldn’t be bound by what we can or can’t remember about the systems, and so we have to gather all the information required to support disaster recovery.)

On a running system, there’s all sorts of tools available to gather this sort of information, but when the system isn’t running, we can’t run the tools, so we need to run them in advance, either as part of the backup process or as a scheduled, checked-upon function. (My preference is to incorporate it into the backup process.)

For instance, consider that Linux scenario – we can quickly assemble the details of all partition sizes on a system with one simple command – e.g.:

[root@nox ~]# fdisk -l

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        2089    16779861   fd  Linux raid autodetect
/dev/sda2            2090        2220     1052257+  82  Linux swap / Solaris
/dev/sda3            2221       19457   138456202+  fd  Linux raid autodetect
/dev/sda4           19458      121601   820471680    5  Extended
/dev/sda5           19458       19701     1959898+  82  Linux swap / Solaris
/dev/sda6           19702      121601   818511718+  fd  Linux raid autodetect

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         250     2008093+  82  Linux swap / Solaris
/dev/sdb2             251      121601   974751907+  83  Linux

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1      121601   976760001   83  Linux

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sdd1   *           1        2089    16779861   fd  Linux raid autodetect
/dev/sdd2            2090        2220     1052257+  82  Linux swap / Solaris
/dev/sdd3            2221       19457   138456202+  fd  Linux raid autodetect
/dev/sdd4           19458      121601   820471680    5  Extended
/dev/sdd5           19458       19701     1959898+  82  Linux swap / Solaris
/dev/sdd6           19702      121601   818511718+  fd  Linux raid autodetect

That wasn’t entirely hard. Scripting that to occur at the start of the backup process isn’t difficult either. For systems that have RAID, there’s another, equally simple command to extract RAID layouts as well – again, for Linux:

[root@nox ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda3[0] sdd3[1]
 138456128 blocks [2/2] [UU]

md2 : active raid1 sda6[0] sdd6[1]
 818511616 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdd1[1]
 16779776 blocks [2/2] [UU]

unused devices: <none>

I don’t want to consume reams of pages discussing what you should be gathering for each operating system. The average system administrator for any individual platform should, with a cup of coffee (or other preferred beverage) in hand, be able to sit down and in under 10 minutes jot down the sorts of information that would need to be gathered in advance of a disaster to assist in the total system rebuild of a machine they administer.

Once these information gathering steps have been determined, they can be inserted into the backup process as a pre-backup command. (In NetWorker parlance, this would be via a savepnpc “pre” script. Other backup products will equally feature such options.) Once the information is gathered, a copy should be kept on the backup server as well as in an offsite location. (I’ll give you a useful cloud backup function now: it’s called Google Mail. Great for offsiting bootstraps and system configuration details.)
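
As a minimal sketch of such a gathering step for a Linux client – the output directory and file names here are assumptions, and the only real requirement is that the output lands somewhere that is itself backed up:

#!/bin/bash
# Hypothetical pre-backup gathering script for a Linux client.
# Writes partition and RAID layout into a directory that forms part
# of the normal filesystem backup, so the information rides along
# with every backup. Hook it in via your pre-backup mechanism
# (e.g., a savepnpc "pre" script in NetWorker).
SYSINFO=/etc/sysinfo
mkdir -p "$SYSINFO"

# Partition tables for all disks
fdisk -l > "$SYSINFO/fdisk-l.txt" 2>/dev/null

# Software RAID layout
cat /proc/mdstat > "$SYSINFO/mdstat.txt" 2>/dev/null

# Mounted filesystems and their sizes
df -Pk > "$SYSINFO/df-Pk.txt"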

When it comes to disaster recovery, such information can take the guess work or reliance on memory out of the equation, allowing a system or backup administrator in any (potentially sleep-deprived) state, with any level of knowledge about the system in question, to conduct the recovery with a much higher degree of certainty.


* Due to what they offer to do, bare metal recovery (BMR) products tend to be highly specific in which operating system variants, etc., they support. In my experience a significantly higher number of sites don’t use BMR than do.

Posted in Architecture, Backup theory, Linux, NetWorker, Policies, Scripting | Tagged: , , , , | 2 Comments »

Zero Error Policy Management

Posted by Preston on 2009-08-25

In the first article on the subject, What is a zero error policy?, I established the three rules that need to be followed to achieve a zero error policy, viz:

  1. All errors shall be known.
  2. All errors shall be resolved.
  3. No error shall be allowed to continue to occur indefinitely.

As a result of various questions and discussions I’ve had about this, I want to expand on the zero error approach to backups to discuss management of such a policy.

Saying that you’re going to implement a zero error policy – indeed, wanting to implement a zero error policy – and actually implementing one are significantly different activities. So, in order to properly manage a zero error policy, the following three components must be developed, maintained and followed:

  1. Error classification.
  2. Procedures for dealing with errors.
  3. Documentation of the procedures and the errors.

In various cases I’ve seen companies try to implement a zero error policy by following one or two of the above, but they’ve never succeeded unless they’ve implemented all three.

Let’s look at each one individually.

Error Classification

Classification is at the heart of many activities we perform. In data storage, we classify data by its importance and its speed requirements, and assign tiers. In systems protection, we classify systems by whether they’re operational production, infrastructure support production, development, QA, test, etc. Stepping outside of IT, we routinely do things by classification – we pay bills in order of urgency, or we go shopping for the things we need sooner rather than the things we’re going to run out of in three months’ time, etc. Classification is not only important, but it’s also something we do (and understand the need for) naturally – i.e., it’s not hard to do.

In the most simple sense, errors for data protection systems can be broken down into three types:

  • Critical errors – If error X occurs then data loss occurs.
  • Hard errors – If error X occurs and data loss occurs, then recoverability cannot be achieved.
  • Soft errors – If error X occurs and data loss occurs, then recoverability can still be achieved, but with non-critical data recoverability uncertain.

Here’s a logical follow-up from the above classification – any backup system designed such that it can cause a critical error has been incorrectly designed. What’s an example of a critical error? Consider the following scenario:

  • Database is shutdown at 22:00 for cold backups by scheduled system task
  • Cold backup runs overnight
  • Database is automatically started at 06:00 by scheduled system task

Now obviously our preference would be to use a backup module, but that’s actually not the risk of critical error here: it’s the divorcing of the shutdown/startup from the actual filesystem backup. Why does this create a “critical error” situation, you may ask? On any system where exclusive file locking takes place, if for any reason the backup is still running when the database is started, corruption is likely to occur. (For example, I have seen Oracle databases on Windows destroyed by such scenarios.)

So, a critical error is one where the failure in the backup process will result in data loss. This is an unacceptable error; so, not only must we be able to classify critical errors, but all efforts must be made to ensure that no scenarios which permit critical errors are ever introduced to a system.

Moving on, a hard error is one where we can quantify that if the error occurs and we subsequently have data loss (recovery required), then we will not be able to facilitate that recovery to within our preferred (or required) windows. So if a client completely fails to backup overnight, or one filesystem on the client fails, then we would consider that to be a hard error – the backup did not work and thus if there is a failure on that client we cannot use that backup to recover.

A soft error, on the other hand, is an error that will not prevent core recovery from happening. These are the most difficult to classify. Using NetWorker as an example, you could say that these will often be the warnings issued during the backups where the backup still manages to complete. Perhaps the most common example of this is files being open (and thus inaccessible) during backup. However, we can’t (via a blanket rule) assume that any warning is a soft error – it could be a hard error in disguise.

To use programming languages as an analogy: a syntax error is immediately obvious, whereas a semantic error is one where the mistake in meaning is not obvious. Thus, syntax errors cause an immediate failure, whereas semantic errors usually cause a bug.

Taking that analogy back to soft vs hard errors, and using our file-open example, you can readily imagine a scenario where files open during backup could constitute a hard or a soft error. In the case of a soft error, it may refer to temporary files that are generated by a busy system during backup processing. Such temporary files may have no relevance to the operational state of a recovered system, and thus the recoverability of the temporary files does not affect the recoverability* of the system as a whole. On the other hand, if critical data files are missed due to being open at the time of the backup, then the recoverability of the system as a whole is compromised.

So, to achieve a zero error policy, we must be able to:

  1. Classify critical errors, and ensure situations that can lead to them are designed out of the solution.
  2. Classify hard errors.
  3. Classify soft errors and be able to differentiate them from hard errors.

One (obvious) net result of this is that you must always check your backup results. No ifs, no buts, no maybes. For those who want to automatically parse backup results, as mentioned in the first article, it also means you must configure the automatic parser such that any unknown result is treated as an error for examination and either action or rule updating.

[Note: An interesting newish feature in NetWorker was the introduction of the “success threshold” option for backup groups. Set to “Warning”, by default, this will see savesets that generated warnings (but not hard errors) flagged as successful. The other option is “Success”, which means that in order for a saveset to be listed as a successful saveset, it must complete without warning. One may be able to argue that in an environment where all attempts have been made to eliminate errors, and the environment operates under a zero-error policy, then this option should be changed from the default to the more severe option.]
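
To illustrate the parsing rule from a couple of paragraphs back, here’s a minimal sketch of an “unknown output is an error” filter. The patterns file and report path are illustrative only, and a production parser would be considerably more thorough:

#!/bin/bash
# Hypothetical savegroup-report filter illustrating the rule that
# anything not matched by a known-good pattern must be escalated
# for human review. Usage: checkreport.sh <report-file>
REPORT="$1"
KNOWN=/nsr/res/known-good-patterns.txt   # one egrep pattern per line

# Print every line that does NOT match a known-good pattern; any
# output at all means the report needs human attention.
UNMATCHED=`egrep -v -f "$KNOWN" "$REPORT"`

if [ -n "$UNMATCHED" ]
then
   echo "Unrecognised output - treat as an error and review:"
   echo "$UNMATCHED"
   exit 1
fi
exit 0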

Procedures for dealing with errors

The ability to classify an error as critical, hard, or soft is practically useless unless procedures are established for dealing with the errors. Procedures for dealing with errors will be driven, at first, by any existing SLAs within the organisation. I.e., the SLA for either maximum amount of data loss or recovery time will drive the response to any particular error.

That response however shouldn’t be an unplanned reaction. That is, there should be procedures which define:

  1. By what time backup results will be checked.
  2. To whom (job title), to where (documentation), and by when critical and hard errors shall be reported.
  3. To where (documentation) soft errors shall be reported.
  4. For each system that is backed up, responses to hard errors. (E.g., some systems may require immediate re-run of the backup, whereas others may require the backup to be re-run later, etc.)

Note that this isn’t an exhaustive list – for instance, it’s obvious that any critical errors must be immediately responded to, since data loss has occurred. Equally it doesn’t take into account routine testing, etc., but the above procedures are more for the daily procedures associated with enacting a zero error policy.

Now, you may think that the above requirements don’t constitute the need for procedures – that the processes can be followed informally. It may seem a callous argument to make, but in my experience in data protection, informal policies lead to laxity in following up those policies. (Or: if it isn’t written down, it isn’t done.)

Obviously when checks aren’t done it’s rarely for a malicious reason. However, knowing that “my boss would like a status report on overnight backups by 9am” is elastic – and so if we feel there are other things we need to look at first, we can choose to interpret that as “would like by 9am, but will settle for later”. If however there’s a procedure that says “management must have backup reports by 9am”, it takes away that elasticity. That’s important because it actually helps with time management – tasks can be done in a logical, process-required order, because there’s a definition of the importance of activities within the role. This is critically important – not only for the person who has to perform the tasks, but also for those who would otherwise feel they can assign other tasks that interrupt these critical processes. You’ve heard that a good offence is a good defence? Well, a good procedure is also a good defence – against lower priority interruptions.

Documentation of the procedures and the errors

There are two distinctly different reasons why documentation must be maintained (or three, if you want to start including auditing as a reason). So, to rephrase that, there are three distinctly different reasons why documentation must be maintained. These are as follows:

  1. For auditing and compliance reasons it will be necessary to demonstrate that your company has procedures (and documentation for those procedures) for dealing with backup failures.
  2. To deal with sudden staff absence – it may be as simple as someone not being able to make it in on time, or it could be the backup administrator gets hit by a bus and will be in traction in the hospital for two weeks (or worse).
  3. To assist any staff member who does not have an eidetic memory.

In day to day operations, it’s the third reason that’s the most important. Human memory is a wonderfully powerful search and recall tool, yet it’s also remarkably fallible. Sometimes I can remember seeing the exact message 3 years prior in an error log from another customer, but forget that I’d asked a particular question only a day ago and ask it again. We all have those moments. And obviously, I also don’t remember what my colleagues did half an hour ago if I wasn’t there with them at the time.

I.e., we need to document errors because that guarantees us being able to reference them later. Again – no ifs, no buts, no maybes. Perhaps the most important factor in documenting errors in a data protection environment though is documenting in a system that allows for full text search. At bare minimum, you should be able to:

  1. Classify any input error based on:
    • Date/Time
    • System (server and client)
    • Application (if relevant)
    • Error type – critical, hard, soft
    • Response
  2. Conduct a full text search (optionally date restricted):
    • On any of the methods used to classify
    • On the actual error itself

The above requirements fit nicely with wiki systems, so a wiki may be one good option, but there are other tools out there that can be used equally well.

The important thing though is to get the documentation done. What may initially seem time consuming when a zero error policy is enacted will quickly become routine and automatic; combined with the obvious reduction in errors over time under a zero error policy, the automatic procedural response to errors will actually streamline the activities of the backup administrator.

That documentation obviously, on a day to day basis, provides the most assistance to the person(s) in the ongoing role of backup administrator. However, in any situation where someone else has to fill in, this documentation becomes even more important – it allows them to step into the role, data mine for any message they’re not sure of and see what the local response was if a situation had happened before. Put yourself into the shoes of that other person … if you’re required to step into another person’s role temporarily, do you want to do it with plenty of supporting information, or with barely anything more than the name of the system you have to administer?

Wrapping Up

Just like when I first discussed zero error policies, you may be left thinking at the end of this that it sounds like there’s a lot of work involved in managing a zero error policy. It’s important to understand however that there’s always effort involved in any transition from a non-managed system to a managed system (i.e., from informal policies to formal procedures). However, for the most part this extra work mainly comes in at the institution of the procedures – namely in relation to:

  • Determining appropriate error categorisation techniques
  • Establishing the procedures
  • Establishing the documentation of the procedures
  • Establishing the documentation system used for the environment

Once these activities have been done, day to day management and operation of the zero error policy becomes a standard part of the job, and therefore doesn’t represent a significant impact to work. That’s for two key reasons: once these components are in place, following them really doesn’t take a lot of extra time, and the time it does take is actually factored into the job, so the extra time taken can hardly be considered wasteful or frivolous.

At both a personal and ethical level, it’s also extremely satisfying to be able to answer the question, “How many errors slipped through the net today?” with “None”.

Posted in Backup theory, NetWorker, Policies | Tagged: , , , , , , , | 1 Comment »

What is a zero error policy?

Posted by Preston on 2009-08-11

In my book, I recommend that all businesses should adopt a zero error policy in regards to backup. I personally think that zero error policies are the only way that a backup system should be run. To be perfectly frank, anything less than a zero error policy is irresponsible in data protection.

Now, the problem with talking about zero error policies is that many people get excited about the wrong things when it comes to them. That is, they either focus on:

  • This will be too expensive!

or

  • Who gets into trouble when errors DO occur?

Not only are these attitudes not helpful, but they’re not necessary either.

Having a zero error policy requires the following three rules:

  1. All errors shall be known.
  2. All errors shall be resolved.
  3. No error shall be allowed to continue to occur indefinitely.

You may think that rule (2) implies rule (3), and it does, but rule (3) gives us a special case/allowance for noting that some errors are permitted, in the short term, if there is a sufficient reason.

The actual purpose of the zero error policy is to ensure that any error or abnormal report from the backup system is treated as something requiring investigation and resolution. If this sounds like a lot of work, there are a couple of key points to make:

  • When switching from any other policy to a zero error policy, yes, there will be a settling-in period that takes more time and effort, but once the initial hurdle has been cleared there should not be a significant ongoing drain of resources;
  • Given the importance of successful backups (i.e., being able to successfully recover when required), the work that is required is not only important, but arguably both necessary and ethically required.

Let’s step through those three rules.

All errors shall be known

Recognising that there must be limits to the statement “all errors shall be known”, we take this to mean that if an error is reported, it will be known about. The simplest interpretation of this is that all savegroup completion reports must be read. For the purposes of a NetWorker backup environment, any run-time backup error is going to appear in the savegroup completion report, and so reading the report and checking on a per-host basis is the most appropriate action.

There are some logical consequences of this requirement:

  1. Backups reports shall be checked.
  2. Recoveries shall be tested.
  3. An issue register shall be maintained.
  4. Backup logs shall be kept for at least the retention period of the backups they are for.

Note: By “…all savegroup completion reports must be read”, I’m not suggesting that you can’t automatically parse results – however, there’s a few rules that have to be carefully followed on this. Discussed more in my book, the key rule however is that when adopting both automated parsing and a zero error policy, one must configure the system such that any unknown output/text is treated as an error. I.e., anything not catered for at time of writing of an automated parser must be flagged as a potential error so that it is either dealt with or added to the parsing routine.

All errors shall be resolved

Errors aren’t meant to just keep occurring. Here’s some reasonably common errors within a NetWorker environment:

  • System fails backup every night because it’s been decommissioned.
  • System fails backup every night because it’s been incorrectly configured for inclusive backups and a filesystem/saveset is no longer present.
  • File open errors on Windows systems.
  • Errors about files changing during backup on Linux/Unix systems.

There’s not a single error in the above list (and I could have made it 5x longer) that can’t be resolved. The purpose of stating “all errors shall be resolved” is to discourage administrators (either backup or individual system administrators) from leaving errors unchallenged.

Every error represents a potential threat to the backup system, in one of two distinct ways:

  1. Real errors represent a recovery threat.
  2. Spurious errors may discourage the detection of a real error.

What’s a spurious error? That’s one where the fault condition is known. E.g., “that backup fails every night because one of the systems has been turned off”. In most cases, spurious errors are going to either come down to at best a domain error (“I didn’t fix that because it’s someone else’s problem”) or at worst, laziness (“I haven’t found the <1 minute required to turn off the backup for a decommissioned system”).

Spurious errors, I believe, are actually as bad, if not worse, than the real errors. While we work to protect our systems against real errors, it’s a fact of life and systems administration that they will periodically occur. Systems change, minor bugs may surface, environmental factors may play a part, etc. The role of the backup administrator therefore is to be constantly vigilant in detecting errors, taking preventative actions where applicable, and corrective actions where necessary.

Allowing spurious errors to continually occur within a backup system is however inappropriate, and runs totally contrary to good administration practices. The key problem is that if you come to anticipate that particular backups will have failures, you become lax in your checking, and thus may skip over real errors that creep in. As an example, consider the “client fails because it has been decommissioned” scenario. In NetWorker terms, this may mean that a particular savegroup completes every day with a status of “1 client failed”. So, every day, an administrator may note that the group had 1 failed client and not bother to check the rest of the report, since that failed client is expected. But what if another administrator had decommissioned that client? What if that client is no longer in the group, but another client is now being reported as failed every day?

That’s the insidious nature of spurious errors.

No error shall be allowed to continue indefinitely

No system is perfect, so we do have to recognise that some errors may have a life-span greater than a single backup job. However, in order for a zero error policy to work properly, we must give time limits to any failure condition.

There are two aspects to this rule – one is the obvious, SLA style aspect, to do with the length at which an error is allowed to occur before it is escalated and/or must be resolved. (E.g., “No system may have 3 days of consecutive backup failures”).

The other aspect to this rule that can be more challenging to work with is dealing with those “expected” errors. E.g., consider a situation where the database administrators are trialling upgrades to Oracle on a development server. In this case, it may be known that the development system’s database backups will fail for the next 3 days. In such instances, to correctly enable zero error policies, one must maintain not only an issues register, but an expected issues register – that is, noting which errors are going to happen, and when they should stop happening*.

Summarising

Zero error policies are arguably not only a functional but ethical requirement of good backup administration. While they may take a little while to implement, and may formalise some of the work processes involved in the backup system, these should not be seen as a detriment. Indeed, I’d go so far as to suggest that you can’t actually have a backup system without a zero error policy. That is, without a zero error policy you can still get backups/recoveries, but with a lesser degree of certainty – and the more certainty you can build into a backup environment, the more it becomes a backup system.

[Ready for more? Check out the next post on this topic, Zero Error Policy Management.]


* In the example given, we could in theory use the “scheduled backup” feature of a client instance to disable backups for that particular client. However, that feature has a limitation in that there’s no allowance for automatically turning scheduled backups on again at a later date. Nevertheless, it’s a common enough scenario that it serves the purpose of the example.

Posted in Backup theory, NetWorker, Policies | Tagged: , , , | 7 Comments »

NetWorker vs Scratch pools

Posted by Preston on 2009-08-03

A common question asked in NetWorker forums is “how do I configure a scratch pool in NetWorker?” This comes mainly from people who have exposure to Veritas NetBackup, which does media management in a different way to NetWorker.

In NetWorker, once a piece of media has been assigned to a pool, the only way it can be assigned to another pool is by relabelling. I.e., moving media from one pool to another in NetWorker (a) is data-destructive and (b) requires physical writes to the media. (By “data-destructive”, I mean that whatever was previously written to the media is lost when the volume is relabelled into another pool.)

NetBackup on the other hand, which mainly categorises media by the retention policy used for images written to that media, allows media to be shifted from one ‘pool’ to another. Hence, in NetBackup, new media is typically added to a ‘scratch’ pool when it is brought into the environment, and allocated out of that pool into the required pool on an as-needed basis.

The reason NetWorker doesn’t really need a scratch pool is two-fold:

  • By enabling auto-media-management, you can have media added to a tape library, left unlabelled, and then labelled on an ‘as-required’ basis by NetWorker automatically into the required pool, and,
  • You can configure each pool to effectively be a scratch pool for every other pool with only a few settings, allowing recyclable media to be moved into whatever pool requires new media.

Let’s look at these two features.

First, auto media management. Despite a lot of misconceptions, this fulfils only one purpose. NetWorker, by default, will not under any circumstance use a tape which has not been previously labelled. I.e., it requires you to manually label each piece of media – either via NMC, or nsrjb. (There’s a reason for this – it prevents the overwriting of data that might belong to another product or another backup not recognised by NetWorker.)
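
For reference, manual labelling from the command line looks something like the following – the slot range and pool name are examples only:

nsrjb -L -S 1-10 -b Daily

That labels the media in slots 1 through 10 of the autochanger into the “Daily” pool.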

If you know that the only media that will ever go into a tape library however is media you want to use with NetWorker, you can turn on auto media management. This is done on a per library basis, by editing the library properties in NMC and clicking a single check-box. Once done, any new media added to the system will be automatically labelled only when absolutely required, and labelled into the pool you want.

Here’s the setting:

[Screenshot: Auto media management checkbox (“Media Management”, first checkbox)]

The second requirement is to enable media which has become recyclable to be recycled from one pool into another.

To do this, you need to look at your pool properties. The settings we want to adjust are “Recycle from other pools” and “Recycle to other pools”:

[Screenshot: Recycling to/from other pools]

These options are pretty straightforward:

  • Recycle from other pools – Allow recyclable media in other pools to be recycled into this pool.
  • Recycle to other pools – Allow recyclable media in this pool to be recycled into other pools when required.

Using both auto media management and recycle from/to other pools, you can achieve the purpose of scratch pools without actually using them – NetWorker will allocate new media to pools as required, and move media between pools as they become available and are required.

…But our barcodes mean something!

Say you have a “system” whereby all clone media has a barcode starting with “C” and all backup media starts with “B”. How do you use the above functionality with such a system? The answer is simple:

Drop the system; you shouldn’t be doing it.

That might sound cavalier, but I’ll be perfectly blunt: any site that confuses barcode labels with data content is using NetWorker incorrectly. NetWorker has good query, reporting and media management functionality, and you should be using it to determine what is on each tape. Not only that, you place artificial restrictions on media availability when you do this, and invite situations where, say, someone dials in at 3am because the system has run out of backup media and “for an emergency” relabels empty “clone” volumes as “backup”.

Posted in NetWorker, Policies | Tagged: , | Comments Off on NetWorker vs Scratch pools

Of cascading failures and the need to test

Posted by Preston on 2009-07-22

Over at Daily WTF, there’s a new story that has two facets of relevance to any backup administrator. Titled “Bourne Into Oblivion“, the key points for a backup administrator are:

  • Cascading failures.
  • Test, test, test.

In my book, I discuss both the implications of cascading failures and the need to test within a backup environment. Indeed, my ongoing attitude is that if you want to assume something about an untested backup, assume it’s failed. (Similarly, if you want to make an assumption about an unchecked backup, assume it failed too.)

While cascading failures in backup normally come down to situations such as “the original failed, and the clone failed too”, this article points out a more common form of data loss through cascading failures – an original failure coupled with a backup failure.

In the article, a shell script includes the line:

rm -rf $var1/$var2

Any long-term Unix user will shudder to think of what can happen with the above script. (I’d hazard a guess that a lot of Unix users have themselves written scripts such as the above, and suffered the consequences. What we can hope for in most situations is that we do it on well backed up personal systems rather than corporate systems with inadequate data protection!)

Something I’ve seen at several sites however is the unfortunate coupling of the above shell script with the execution of said script on a machine that has read/write network mounts of a host of other filesystems across the corporate network. (Indeed, the first system administration group I ever worked with told me a horror story about a script with a similar command run from a system with automounts enabled under /a.)

The net result in the story at Daily WTF? Most of a corporate network wiped out by a script run with the above command where a new user hadn’t populated either $var1 or $var2, making the script instead:

rm -rf /

You could almost argue that there’s already been a cascading failure in the above – allowing scripts to be written that have the potential for that much data loss and allowing said scripts to be run on systems that mount many other systems.
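
As an aside, bash itself offers a simple guard against this class of failure: the ${var:?} expansion aborts the script with an error if the variable is unset or empty, so the command can never collapse to “rm -rf /”:

rm -rf "${var1:?var1 is not set}/${var2:?var2 is not set}"

That single change would have stopped the script dead rather than letting it walk the entire filesystem.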

The true cascading failure however was that the backup media was unusable, having been repeatedly overwritten rather than replaced. Whether this meant that the backups ran after the above incident, or that the backups couldn’t recover all required data (e.g., running an incremental on top of a tape with a previous incremental on top of a tape with a previous full, each time overwriting the previous data), or that the tapes were literally unusable due to high overuse (or indeed, all 3), the results were the same – data loss coupled with recovery loss.

With backups not being tested periodically, such errors (in some form) can creep into any environment. Obviously in the case in this article, there’s also the problem that either (a) procedures were not established regarding rotation of media or (b) procedures were not followed.

The short of it: any business that collectively thinks that either formalisation of backup processes or the rigorous checking of backups is unnecessary is just asking for data loss.

Posted in Policies, Scripting | Tagged: , | Comments Off on Of cascading failures and the need to test

Directives and change control

Posted by Preston on 2009-07-15

It’s easy to change NetWorker directives. A few clicks here and there if you use NMC, then a couple of lines of text rattled off into the right fields, and suddenly you’ve made anywhere from small, precise changes to massive changes to a backup.

It’s for this reason that I think that modifying directives within the backup configuration should be considered important enough that they warrant their own change control processes. (I’ve previously talked about the backup administrator needing to be part of the change control authorisation process – this is another aspect however.)

Now, don’t get me wrong – despite what former employees may think, I’m not keen on excessive levels of red tape. In fact, I think a smart system should be designed at all times to minimise administrative overheads while ensuring that all accounting is still correctly done.

That being said, directives are, for want of a better term, dangerous. Mis-used, they can result in recovered systems being unusable – in data loss.
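
To see why, consider how terse a directive is. Something like the following – a hypothetical client-side directive, with purely illustrative path and patterns – silently excludes matching files from every backup of the client, recursively from the root:

<< / >>
+skip: *.tmp *.mp3

Two lines, no warnings at backup time – and potentially a very unpleasant surprise at recovery time.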

With this in mind, like other aspects of the backup system (adding clients, removing clients, adjusting savesets etc.), adjusting directives or applying directives to clients should also form part of change control.

Whenever directives are being changed, or applied, the following questions should be asked:

  • What is not working as desired?
  • What is the solution required?
  • What are the minimal steps required to make those changes?
  • How can system recoverability following the changes be tested?

It’s that final point that often goes missing with directives. Once, a long time ago (long enough to be NetWorker 5.5.3), a customer providing backup services to a host of companies set up a “zero error policy”, but due to budget and time constraints merely kept adjusting directives to remove from the backup any file that couldn’t be opened/read during the backup process. The end result was unrecoverable systems.

By placing directive maintenance into the realm of change control, we don’t seek to add more red tape to the backup system, but more thought, and more consideration of the consequences of changes that may adversely affect data and systems recovery.

Posted in NetWorker, Policies | Tagged: , | Comments Off on Directives and change control