NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist

  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).

  • Enterprise Systems Backup and Recovery

    If you find this blog interesting, and either have an interest in or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on features, policies, procedures and the human element to ensuring that your company has a suitable and working backup system rather than just a bunch of copies made by unrelated software, hardware and processes.

Posts Tagged ‘Saveset’

Quibbles – The maddening shortfall of ADV_FILE

Posted by Preston on 2009-11-25

Everyone who has worked with ADV_FILE devices knows this situation: a disk backup unit fills, and the saveset(s) being written hang until you clear up space, because as we know savesets in progress can’t be moved from one device to another:

[Image: Savesets hung on full ADV_FILE device until space is cleared]

Honestly, what makes me really angry (I’m talking Marvin the Martian really angry here) is that if a tape device fills and another tape of the same pool is currently mounted, NetWorker will continue to write the saveset on the next available device:

[Image: Saveset moving from one tape device to another]

What's more, if it fills and there's a drive that doesn't currently have a tape mounted, NetWorker will mount a new tape in that drive and continue the backup, in preference to dismounting the full tape and loading a fresh volume in the current drive.

There’s an expression for the behavioural discrepancy here: That sucks.

If anyone wonders why I say VTLs shouldn’t need to exist, but I still go and recommend them and use them, that’s your number one reason.

Posted in NetWorker, Quibbles | 2 Comments »

Quibbles – Why can’t you clone or stage incomplete savesets?

Posted by Preston on 2009-10-27

NetWorker has an irritating quirk where it doesn't allow you to clone or stage incomplete savesets. I can understand the rationale behind it – the data isn't completely usable – but that rationale is wrong.

If you don’t think this is the case, all you have to do to test is start a backup, cancel it mid-way through a saveset, then attempt to clone that saveset. Here’s an example:

[root@tara ~]# save -b Big -q -LL /usr
Oct 25 13:07:15 tara logger: NetWorker media: (waiting) Waiting for 1
writable volume(s) to backup pool 'Big' disk(s) or tape(s) on tara.pmdg.lab
<backup running, CTRL-C pressed>
(interrupted), exiting
[root@tara ~]# mminfo -q "volume=BIG995S3"
 volume        client       date      size   level  name
BIG995S3       tara.pmdg.lab 10/25/2009 175 MB manual /usr
[root@tara ~]# mminfo -q "volume=BIG995S3" -avot
 volume        client           date     time         size ssid      fl   lvl name
BIG995S3       tara.pmdg.lab 10/25/2009 01:07:15 PM 175 MB 14922466  ca manual /usr
[root@tara ~]# nsrclone -b Default -S 14922466
5876:nsrclone: skipping aborted save set 14922466
5813:nsrclone: no complete save sets to clone

Now, you may be wondering why I’m hung up on not being able to clone or stage this sort of data. The answer is simple: sometimes the only backup you have is a broken backup. You shouldn’t be punished for this!

Overall, NetWorker has a fairly glowing pedigree in terms of enforced data viability:

  • It doesn’t recycle savesets until all dependent savesets are also recyclable;
  • It’s damn aggressive at making sure you have current backups of the backup server’s bootstrap information;
  • If there's any index issue, it forces a full backup of the affected savesets even if they've been backed up before;
  • It won’t overwrite data on recovery unless you explicitly tell it to;
  • It lets you recover from incomplete savesets via scanner/uasm (a rough sketch follows this list)!

and so on.
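
To illustrate that last point: reusing the aborted /usr saveset from the example above (SSID 14922466), a scanner/uasm extraction would look roughly like the sketch below. The device path and relocation target are purely illustrative, and the exact scanner and uasm options should be checked against the man pages for your release:

[root@tara ~]# scanner -S 14922466 /dev/nst0 | uasm -rv -m /usr=/tmp/restore/usr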

So, logically, it makes little sense to refuse to clone/stage incomplete savesets.

There may be programmatic reasons why NetWorker doesn’t permit cloning/staging incomplete savesets, but these aren’t sufficient reasons. NetWorker’s pedigree of extreme focus on recoverability remains tarnished by this inability.

Posted in NetWorker, Quibbles | 2 Comments »

Avoiding 2GB saveset chunks

Posted by Preston on 2009-08-19

Periodically a customer will report to me that a client is generating savesets in 2GB chunks. That is, they get savesets like the following:

  • C:\ – 2GB
  • <1>C:\ – 2GB
  • <2>C:\ – 2GB
  • <3>C:\ – 1538MB

Under much earlier versions of NetWorker, this was expected; these days, it really shouldn’t happen. (In fact, if it does happen, it should be considered a potential error condition.)

The release notes for 7.4.5 suggest that if you’re currently experiencing chunking in the 7.4.x series, going to 7.4.5 may very well resolve the issue. However, if that doesn’t do the trick for you, the other way of doing it is to switch from nsrauth to oldauth authentication on the backup server for the client exhibiting the problem.

To do this, you need to fire up nsradmin against the client process on the server and adjust the NSRLA record. Here's an example session, using a NetWorker backup server called 'tara':

[root@tara ~]# nsradmin -p 390113 -s tara
NetWorker administration program.
Use the "help" command for help, "visual" for full-screen mode.
nsradmin> show type:; name:; auth methods:
nsradmin> print type: NSRLA
                        type: NSRLA;
                        name: tara.pmdg.lab;
                auth methods: "0.0.0.0/0,nsrauth/oldauth";

So, what we want to do is adjust the ‘auth methods’ for the client that is chunking data, and we want to switch it to using ‘oldauth’ instead. Assuming we have a client called ‘cyclops’ that is exhibiting this problem, and we want to only adjust cyclops, we would run the command:

nsradmin> update auth methods: "cyclops,oldauth","0.0.0.0/0,nsrauth/oldauth"
                auth methods: "cyclops,oldauth", "0.0.0.0/0,nsrauth/oldauth";
Update? y
updated resource id 4.0.186.106.0.0.0.0.42.47.135.74.0.0.0.0.192.168.50.7(7)

Once this has been done, it’s necessary to stop and restart the NetWorker services on the backup server for the changes to take effect.
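
On a Linux backup server, for instance, that typically looks something like the following (the init script name and location vary by platform and NetWorker release):

[root@tara ~]# /etc/init.d/networker stop
[root@tara ~]# /etc/init.d/networker start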

So the obvious follow up questions and their answers are:

  • Why would you need to change the security model from nsrauth to oldauth to fix this problem? It seems that in some instances the nsrauth security/authentication model can cause issues with particular clients, forcing a reversion to chunking; switching to the oldauth method prevents this behaviour.
  • Should you just change every client to using oldauth? No – oldauth is being retired over time, and nsrauth is more secure, so it’s best to only do this as a last resort. Indeed, if you can upgrade to 7.4.5 that may be the better solution.

[Edit – 2009-10-27]

If you’re on 7.5.1, then in order to avoid chunking you need to be at least on 7.5.1.5 (that’s cumulative patch cluster 5 for 7.5.1.); if you’re one of those sites experiencing recovery problems from continuation/chunked savesets, you are going to need 7.5.1.6. Alternatively, you’ll need LGTsc31925 for whatever platform/release of 7.5.1 that you’re running.

Posted in NetWorker, Security | Comments Off on Avoiding 2GB saveset chunks

Sub-saveset checkpointing would be good

Posted by Preston on 2009-07-29

Generally speaking I don’t have a lot of time for NetBackup, primarily due to the lack of dependency checking. That’s right, a backup product that doesn’t ensure that fulls are kept for as long as necessary to guarantee recoverability of dependent incrementals isn’t something I enjoy using.

That being said, there are some nifty ideas within NetBackup that I’d like to see eventually make their way into NetWorker.

One of those nifty ideas is the notion of image checkpointing. To use the NetWorker vernacular, this would be sub-saveset checkpointing. The notion of checkpointing is to allow a saveset to be restarted from a point as close to the failure as possible rather than from the start. E.g., your backup may be 20GB into a 30GB filesystem and a failure occurs. With image checkpointing turned on in NetBackup, the backup won’t need to re-run the entire 20GB previously done, but will pick up from the last point in the backup that a checkpoint was taken.

I’m not saying this would be easy to implement in NetWorker. Indeed, if I were to be throwing a bunch of ideas into a group of “Trivial”, “Easy”, “Hmmm”, “Hard” and “Insanely Difficult” baskets, I’d hazard a guess that the modifications required for sub-saveset checkpointing would fall at least into the “Hard” basket.

To paraphrase a great politician though, sometimes you need to choose to do things not because they’re easy, but because they’re hard.

So, first – why is sub-saveset checkpointing important? Well, as data sizes increase, and filesystems continue to grow, having to restart the entire saveset because of a failure "somewhere" within the stream is increasingly inefficient. For the most part we work through these issues, but as filesystems continue to grow in size and complexity, it becomes harder to hit backup windows when failures occur.

Secondly – how might sub-saveset checkpointing be done? Well, NetWorker is already capable of doing this – sort of – via chunking, or fragments. Long term NetWorker users will be well aware of this: savesets used to have a maximum size of 2GB, so if you were backing up a 7 GB filesystem called "/usr", you'd get:

/usr
<1>/usr
<2>/usr
<3>/usr

In the above, “/usr” was considered the “parent” of “<1>/usr”, “<1>/usr” was the parent of “<2>/usr”, and so on. (Parent? man mminfo – read about pssid.)
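
As a purely illustrative example (reusing the 'tara' lab client from earlier posts), a query along the following lines shows that chaining, with each child chunk's pssid pointing back at its parent's ssid:

[root@tara ~]# mminfo -avot -q "client=tara.pmdg.lab,name=/usr" -r "name,ssid,pssid,level"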

Now, I'm not suggesting a whole-hearted return to this model – it's a pain in the proverbial to parse and calculate saveset sizes, etc., and I'm sure there are other inconveniences to it. However, it does offer an entry point to the model we're looking for – if a backup needs to restart from a checkpoint, it could continue via a chunked/fragmented saveset.

The difficulty lies in differentiating between the "broken" part of the parent saveset chunk and the "correct" part of the child saveset chunk, which would likely require an extension to at least the media database. However, I think it's achievable: the media database already contains details about segments within savesets (i.e., file/record markers, etc.), so in theory it should be possible to include a "bad" flag, allowing a chunk of data at the end of a saveset chunk to be declared bad and indicating to NetWorker that it needs to move on to the next child chunk.

It’s fair to say that most people would be happy with needing to go through a media database upgrade (i.e., a change to the structure as part of starting a new version of NetWorker) in order to get sub-saveset checkpointing.

Posted in Architecture, NetWorker | Comments Off on Sub-saveset checkpointing would be good

Basics – Changing saveset browse/retention times

Posted by Preston on 2009-02-16

Hi!

The text of this article has been moved, and can now be found at its permanent home, the NetWorker Information Hub. You can read it here.

Posted in Basics, NetWorker | 10 Comments »

Instantiating savesets

Posted by Preston on 2009-01-25

Following a recent discussion I’ve been having on the NetWorker Mailing List, I thought I should put a few details down about clone IDs.

If you don’t clone your backups (and if you don’t: why not?), you may not have really encountered clone IDs very much. They’re the shadowy twin of the saveset ID, and serve a fairly important purpose.

From hereon in, I’ll use the following nomenclature:

  • SSID = Save Set ID
  • CLID = CLone ID

“SSID” is pretty much the standard NetWorker terminology for saveset ID, but usually clone ID is just written as “clone ID” or “clone-id”, etc., which gets a bit tiresome after a while.

Every saveset in NetWorker is tagged with a unique SSID. However, every copy of a saveset is tagged with the same SSID, but a different CLID.

You can see this when you ask mminfo to show both:

[root@nox ~]# mminfo -q "savetime>=18 hours ago,pool=Staging,client=archon,
name=/Volumes/TARDIS" -r volume,ssid,cloneid,nsavetime
 volume        ssid          clone id  save time
Staging-01     3962821973  1228135765 1228135764
Staging-01.RO  3962821973  1228135764 1228135764

(If you must know, being a fan of Doctor Who, all my Time Machine drives are called “TARDIS” – and no, I don’t backup my Time Machine copies with NetWorker, it would be a truly arduous and wasteful thing to do; I use my Time Machine drives for other database dumps from my Macs.)

In this case we’re not only seeing the SSID and CLID, but also a special instance of the SSID/CLID combination – that which is assigned for disk backup units. In the above example, you’ll note that the CLID associated with the read-only (.RO) version of the disk backup unit is exactly one less than the CLID associated with the read-write version of the disk backup unit. This is done by NetWorker for a very specific reason.

So, you might wonder then what the purpose of the CLID is, since we use the SSID to identify an individual saveset, right?

I had hunted for ages for a really good analogy on SSID/CLIDs, and stupidly the most obvious one never occurred to me. One of the NetWorker Mailing List’s most helpful posters, Davina Treiber, posted the (in retrospect) obvious and smartest analogy I’ve seen – comparing savesets to books in a library. To paraphrase, while a library may have multiple copies of the same book (with each copy having the same ISBN – after all, it’s the same book), they will obviously need to keep track of the individual copies of the book to know who has which copy, how many copies they have left, etc. Thus, the library would assign an individual copy number to each instance of the book they have, even if they only have one instance.

This, quite simply, is the purpose of the CLID – to identify individual instances of a single saveset. This means that you can, for example, do any of the following (and more!):

  • Clone a saveset by reading from a particular cited copy.
  • Recover from a saveset by reading from a particular cited copy.
  • Instruct NetWorker to remove from its media database reference to a particular cited copy.

In particular, in the final example, if you know that a particular tape is bad, and you want to delete that tape, you only want NetWorker to delete reference to the saveset instances on that tape – you wouldn’t want to also delete reference to perfectly good copies sitting on other tapes. Thus you would refer to SSID/CLID.
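
Reusing the SSID/CLID pair from the mminfo output above purely for illustration, forgetting just that one instance would look like this, leaving any other instances of SSID 3962821973 untouched in the media database:

# nsrmm -d -S 3962821973/1228135764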

I’ve not been using the terminology SSID/CLID randomly. When working with NetWorker in a situation where you either want to, or must specify a specific instance of a saveset, you literally use that in the command. E.g.,:

# nsrclone -b "Daily Clone" -S 3962821973/1228135764

Would clone the saveset 3962821973 to the “Daily Clone” pool, using the saveset instance (CLID) 1228135764.

The same command could be specified as:

# nsrclone -b "Daily Clone" -S 3962821973

However, this would mean that NetWorker would pick which instance of the saveset to read from in order to clone the nominated saveset. The same thing happens when NetWorker is asked to perform a recovery in standard situations (i.e., non-SSID based recoveries).

So, how does NetWorker pick which instance of a saveset should be used to facilitate a recovery? The algorithm used goes a little like this:

  • If there are instances online, then the most available instance is used.
  • If there are multiple instances equally online, then the instance with the lowest CLID is requested.
  • If all instances are offline, then the instance with the lowest CLID not marked as offsite is requested.

The first point may not immediately make sense. Most available? If, say, you have 2 copies on tape, and one tape is in a library while the other is physically mounted in a tape drive and not in use, the tape in the drive will be used.

For the second point, consider disk backup units – adv_file type devices. In this case, both the RW and the RO “version” of the saveset (remembering, there’s only one real physical copy on disk, NetWorker just mungs some details to make it appear to the media database that there’s 2 copies) are equally online – they’re both mounted disk volumes. So, to prevent recoveries automatically running from the RW “version” of the saveset on disk, when the instances are setup, the “version” on the RO portion of the disk backup unit is assigned a CLID one less than the CLID of the “version” on the RW device.

Thus, we get “guaranteed” recovery/reading from the RO version of the disk backup unit. In normal circumstances, that is. (You can still force recovery/reading from the RW version if you so desire.)
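
For example, a saveset recovery that explicitly nominates the RW instance from the mminfo output above would look something like this (illustrative only – in normal circumstances you'd have no reason to do it):

# recover -S 3962821973/1228135765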

In the final point, if all copies are equally offline, NetWorker previously just requested the copy with the lowest CLID. This works well in a tape only environment – i.e.:

  • Backup to tape
  • Clone backup to another tape
  • Send clone offsite
  • Keep ‘original’ onsite

In this scenario, NetWorker would ask for the 'original' by virtue of it having the lowest CLID. However, a CLID is only generated when each copy of the saveset is created. Thus, consider the backup to disk scenario:

  • Backup to disk
  • Clone from disk to tape
  • Send clone offsite
  • Later, when disk becomes full or savesets are too old, stage from disk to tape (see the sketch after this list)
  • Keep new “originals” on-site.
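
That staging step, for reference, is also driven by saveset instance; a rough sketch, with a hypothetical destination pool name:

# nsrstage -b "Onsite Tape" -m -S 3962821973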

This created a problem – in this scenario, if you went to do a recovery after staging, then NetWorker would (annoyingly for many!) request the clone version of the saveset. This either meant requesting it to be pulled back from the offsite location, or doing a SSID/CLID recovery, or marking the clone SSID/CLID as suspect, or mounting the "original". However you looked at it, it was a lot of work that you really shouldn't have needed to do.

NetWorker 7.3.x, however, introduced the notion of an offsite flag; this isn't the same as setting the volume location to offsite. It's literally a new flag:

# nsrmm -o offsite 800841

Would mark the volume 800841 in the media database as not being onsite – i.e., as having a less desirable availability for recovery/read operations.
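
To clear it again later – say, when the clone volume is brought back onsite – the same option is used; I believe the keyword is notoffsite, but verify against the nsrmm man page for your release:

# nsrmm -o notoffsite 800841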

The net result is that in this situation, even if the offsite clone has a lower CLID, if it is flagged as offsite, but there’s a clone with a higher CLID not flagged as offsite, NetWorker will bypass that normal “use the lowest CLID” preference to instead request the onsite copy.

It would certainly be preferable however if a future version of NetWorker could have read priority established as a flag for pools; that way, rather than having to bugger around with the offsite flag (which, incidentally, can only be set/cleared from the command line, and can’t be queried!), an administrator could nominate “This pool has highest recovery priority, whereas this pool has lower recovery priority”. That way, NetWorker would pick the lowest CLID in the highest recovery priority pool.

(I wait, and hope.)

Posted in NetWorker, Scripting | 4 Comments »