NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist

  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).
  •  


     


     

  • Enterprise Systems Backup and Recovery

    If you find this blog interesting, and either have an interest in or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on features, policies, procedures and the human element to ensuring that your company has a suitable and working backup system rather than just a bunch of copies made by unrelated software, hardware and processes.
  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).
  •  


     


     

  • Twitter

    Error: Twitter did not respond. Please wait a few minutes and refresh this page.

Archive for the ‘Quibbles’ Category

15 crazy things I never want to hear again

Posted by Preston on 2009-12-14

Over the years I’ve dealt with a lot of different environments, and a lot of different usage requirements for backup products. Most of these fall into the “appropriate business use” categories. Some fall into the “hmmm, why would you do that?” category. Others fall into the “please excuse my brain it’s just scuttled off into the corner to hide – tell me again” category.

This is not about the people, or the companies, but the crazy ideas that sometimes get hold within companies that should be watched for. While I could have expanded this list to cover a raft of other things outside of backups, I’ve forced myself to just keep it to the backup process.

In no particular order then, these are the crazy things I never want to hear again:

  1. After the backups, I delete all the indices, because I maintain a spreadsheet showing where files are, and that’s much more efficient than proprietary databases.
  2. We just backup /etc/passwd on that machine.
  3. But what about /etc/shadow? (My stupid response to the above statement, blurted after by brain stalled in response to statement #2)
  4. Oh, hadn’t thought about that (In response to #3).
  5. Can you fax me some cleaning cartridge barcodes?
  6. To save money on barcodes at the end of every week we take them off the tapes in the autochanger and put them on the new ones about to go in.
  7. We only put one tape in the autochanger each night. We don’t want <product> to pick the wrong tape.
  8. We need to upgrade our tape drives. All our backups don’t fit on a single tape any more. (By same company that said #7.)
  9. What do you mean if we don’t change the tape <product> won’t automatically overwrite it? (By same company that said #7 and #8.)
  10. Why would I want to match barcode labels to tape labels? That’s crazy!
  11. That’s being backed up. I emailed Jim a week ago and asked him to add it to the configuration. (Shouted out from across the room: “Jim left last month, remember?”)
  12. We put disk quotas on our academics, but due to government law we can’t do that to their mail. So when they fill up their home directories, they zip them up and email it to themselves then delete it all.
  13. If a user is dumb enough to delete their file, I don’t care about getting it back.
  14. Every now and then on a Friday afternoon my last boss used to delete a filesystem and tell us to have it back by Monday as a test of the backup system.
  15. What are you going to do to fix the problem? (Final question asked by an operations manager after explaining (a) robot was randomly dropping tapes when picking them from slots; (b) tapes were covered in a thin film of oily grime; (c) oh that was probably because their data centre was under the area of the flight path where planes are advised to dump excess fuel before landing; (d) fuel is not being scrubbed by air conditioning system fully and being sucked into data centre; (e) me reminding them we just supported the backup software.)

I will say that numbers #1 and #15 are my personal favourites for crazy statements.

Advertisements

Posted in Backup theory, General Technology, Policies, Quibbles | Tagged: | 1 Comment »

Quibbles – The maddening shortfall of ADV_FILE

Posted by Preston on 2009-11-25

Everyone who has worked with ADV_FILE devices knows this situation: a disk backup unit fills, and the saveset(s) being written hang until you clear up space, because as we know savesets in progress can’t be moved from one device to another:

Savesets hung on full ADV_FILE device until space is cleared

Honestly, what makes me really angry (I’m talking Marvin the Martian really angry here) is that if a tape device fills and another tape of the same pool is currently mounted, NetWorker will continue to write the saveset on the next available device:

Saveset moving from one tape device to another

What’s more, if it fills and there’s a drive that currently does have a tape mounted, NetWorker will mount a new tape in that drive and continue the backup in preference to dismounting the full tape and reloading a volume in the current drive.

There’s an expression for the behavioural discrepancy here: That sucks.

If anyone wonders why I say VTLs shouldn’t need to exist, but I still go and recommend them and use them, that’s your number one reason.

Posted in NetWorker, Quibbles | Tagged: , , , , , | 2 Comments »

Quibbles – Why can’t I rename clients in the GUI?

Posted by Preston on 2009-11-16

For what it’s worth, I believe that the continuing lack of support for renaming clients as a function within NMC, (as opposed to the current, highly manual process), represents an annoying and non-trivial gap in functionality that causes administrators headaches and undue work.

For me, this was highlighted most recently when a customer of mine needed to shift their primary domain, and all clients had been created using the fully qualified domain name. All 500 clients. Not 5, not 50, but 500.

The current mechanisms for renaming clients may be “OK” if you only rename one client a year, but more and more often I’m seeing sites renaming up to 5 clients a year as a regular course of action. If most of my customers are doing it, surely they’re not unique.

Renaming clients in NetWorker is a pain. And I don’t mean a “oops I just trod on a pin” style pain, but a “oh no, I just impaled my foot on a 6 inch rusty nail” style pain. It typically involves:

  • Taking care to note client ID
  • Recording the client configuration for all instances of the client
  • Deleting all instances of the client
  • Rename the index directory
  • Recreate all instances of the client, being sure on first instance creation to include the original client ID

(If the client is explicitly named in pool resources, they have to be updated as well, first clearing the client from those pools and then re-adding the newly “renamed” client.)

This is not fun stuff. Further, the chance for human error in the above list is substantial, and when we’re playing with indices, human error can result in situations where it becomes very problematic to either facilitate restores or ensure that backup dependencies have appropriate continuity.

Now, I know that facilitating a client rename from within a GUI isn’t easy, particularly since the NMC server may not be on the same host as the NetWorker server. There’s a bunch of (potential pool changes), client resource changes, filesystem changes and the need to put in appropriate rollback code so that if the system aborts half-way through it can revert at least to the old client name.

As I’ve argued in the past though, just because something isn’t easy doesn’t mean it shouldn’t be done.

Posted in NetWorker, Quibbles | Tagged: , , , , , | Comments Off on Quibbles – Why can’t I rename clients in the GUI?

Quibbles – Why can’t you clone or stage incomplete savesets?

Posted by Preston on 2009-10-27

NetWorker has an irritating quirk where it doesn’t allow you to clone or stage incomplete savesets. I can understand the rationale behind it – it’s not completely usable data, but that rationale is wrong.

If you don’t think this is the case, all you have to do to test is start a backup, cancel it mid-way through a saveset, then attempt to clone that saveset. Here’s an example:

[root@tara ~]# save -b Big -q -LL /usr
Oct 25 13:07:15 tara logger: NetWorker media: (waiting) Waiting for 1
writable volume(s) to backup pool 'Big' disk(s) or tape(s) on tara.pmdg.lab
<backup running, CTRL-C pressed>
(interrupted), exiting
[root@tara ~]# mminfo -q "volume=BIG995S3"
 volume        client       date      size   level  name
BIG995S3       tara.pmdg.lab 10/25/2009 175 MB manual /usr
[root@tara ~]# mminfo -q "volume=BIG995S3" -avot
 volume        client           date     time         size ssid      fl   lvl name
BIG995S3       tara.pmdg.lab 10/25/2009 01:07:15 PM 175 MB 14922466  ca manual /usr
[root@tara ~]# nsrclone -b Default -S 14922466
5876:nsrclone: skipping aborted save set 14922466
5813:nsrclone: no complete save sets to clone

Now, you may be wondering why I’m hung up on not being able to clone or stage this sort of data. The answer is simple: sometimes the only backup you have is a broken backup. You shouldn’t be punished for this!

Overall, NetWorker has a fairly glowing pedigree in terms of enforced data viability:

  • It doesn’t recycle savesets until all dependent savesets are also recyclable;
  • It’s damn aggressive at making sure you have current backups of the backup server’s bootstrap information;
  • If there’s any index issue it’ll end up forcing a full backup for savesets even if it’s backed them up before;
  • It won’t overwrite data on recovery unless you explicitly tell it to;
  • It lets you recover from incomplete savesets via scanner/uasm!

and so on.

So, logically, there makes little sense in refusing to clone/stage incomplete savesets.

There may be programmatic reasons why NetWorker doesn’t permit cloning/staging incomplete savesets, but these aren’t sufficient reasons. NetWorker’s pedigree of extreme focus on recoverability remains tarnished by this inability.

Posted in NetWorker, Quibbles | Tagged: , , , | 2 Comments »

Quibbles – Directive Management Redux

Posted by Preston on 2009-10-20

A while ago, I made a posting about a long-running annoyance I have with directive management in NetWorker.

This time I want to expand slightly upon it, thanks mainly to some recent discussions with customers that pointed out an obvious and annoying additional lack of flexibility in directive management.

That’s to do with the complete inability to apply directives – particularly the skip or the null directive, against “special” savesets. By “special” I’m referring to savesets that aren’t part of standard filesystem backups yet are still effectively just a bunch of files.

Such as say:

  • SYSTEM FILES:
  • SYSTEM STATE:
  • ASR:

(And so on.)

In short, NetWorker provides no way of skipping these savesets while still using the “All” special saveset for Windows clients. You can’t do any of the following:

  • Hand craft a server-side directive
  • Hand craft a client-side directive
  • Use the directive management option in the client GUI (winworkr) to create a directive to skip these styles of savesets.

OK, the last point is just slightly inaccurate. Yes, you can create the directive using this method – but:

  • The created directive is not honoured, either when left as is, or by transferring to a more standard directive;
  • The created directive is “lost” when you next load winworkr’s directive management option. Clearly it lets you create directives that aren’t valid and it subsequently won’t deal with.

Why does this suck? For a very important reason – in some situations you don’t want to have to back these up, or you can’t back them up. For instance, on certain OS levels and bitness using clusters, you will get an error if you try to backup the ASR: saveset.

This creates a requirement to either:

  1. Accept that you’ll get an error every day in your backup report (completely unacceptable)
  2. Switch from exclusionary backups to inclusionary backups (highly unpalatable and risky)

Clearly then the option is the second, not the first. This though is effectively removing an error by introducing poor backup systems management into the environment.

It would be nice if this problem “went away”.

Posted in NetWorker, Quibbles | Tagged: , , , | 1 Comment »

Vendors! Listen up! Stop talking about archive when you mean HSM

Posted by Preston on 2009-09-22

When it comes to backup and data protection, I like to think of myself as being somewhat of a stickler for accuracy. After all, without accuracy, you don’t have specificity, and without specificity, you can’t reliably say that you have what you think you have.

So on the basis of wanting vendors to be more accurate, I really do wish vendors would stop talking about archive when they actually mean hierarchical storage management (HSM). It confuses journalists, technologists, managers and storage administrators, and (I must admit to some level of cynicism here) appears to be mainly driven from some thinking that “HSM” sounds either too scary or too complex.

HSM is neither scary nor complex – it’s just a variant of tiered storage, which is something that any site with 3+ TB of presented primary production data should be at least aware of, if not actively implementing and using. (Indeed, one might argue that HSM is the original form of tiered storage.)

By “presented primary production”, I’m referring to available-to-the-OS high speed, high cost storage presented in high performance LUN configurations. At this point, storage costs are high enough that tiered storage solutions start to make sense. (Bear in mind that 3+ TB of presented storage in such configurations may represent between 6 and 10TB of raw high speed, high cost storage. Thus, while it may not sound all that expensive initially, the disk-to-data ratio increases the cost substantially.) It should be noted that whether that tiering is done with a combination of different speeds of disks and levels of RAID, or with disk vs tape, or some combination of the two, is largely irrelevant to the notion of HSM.

Not only is HSM easy to understand and shouldn’t have any fear associated with it, the difference between HSM and archive is also equally easy to understand. It can even be explained with diagrams.

Here’s what archive looks like:

The archive process and subsequent data access

The archive process and subsequent data access

So, when we archive files, we first copy them out to archive media, then delete them from the source. Thus, if we need to access the archived data, we must read it back directly from the archive media. There is no reference left to the archived data on the filesystem, and data access must be managed independently from previous access methods.

On the other hand, here’s what the HSM process looks like:

The HSM process and subsequent data access

The HSM process and subsequent data access

So when we use HSM on files, we first copy them out to HSM media, then delete (or truncate) the original file but put in its place a stub file. This stub file has the same file name as the original file, and should a user attempt to access the stub, the HSM system silently and invisibly retrieves the original file from the HSM media, providing it back to the end user. If the user saves the file back to the same source, the stub is replaced with the original+updated data; if the user doesn’t save the file, the stub is left in place.

Or if you’re looking for an even simpler distinction: archive deletes, HSM leaves a stub. If a vendor talks to you about archive, but their product leaves a stub, you can know for sure that they actually mean HSM.

Honestly, these two concepts aren’t difficult, and they aren’t the same. In the never ending quest to save user bytes, you’d think vendors would appreciate that it’s cheaper to refer to HSM as HSM rather than Archive. Honestly, that’s a 4 byte space saving alone, every time the correct term is used!

[Edit – 2009-09-23]

OK, so it’s been pointed out by Scott Waterhouse that the official SNIA definition for archive doesn’t mention having to delete the source files, so I’ll accept that I was being stubbornly NetWorker-centric on this blog article. So I’ll accept that I’m wrong and (grudgingly yes) be prepared to refer to HSM as archive. But I won’t like it. Is that a fair compromise? :-)

I won’t give up on ILP though!

Posted in Architecture, Backup theory, General Technology, General thoughts, Quibbles | Tagged: , , | 6 Comments »

Quibbles – Directive Management

Posted by Preston on 2009-09-03

I’m a big fan of careful management of directives – for instance, I always go by the axiom that it’s better to backup a little bit too much and waste some tape than it is to not backup enough and not be able to recover.

That being said, I’m also a big fan of correct use of directives within NetWorker – skipping files that 100% are not required, adjusting preferences for the way files are backed up (e.g., logasm), etc., are quite important to getting well running backups.

So needless to say it bugs the hell out of me that after all this time, you still can’t include a directive within a directive.

Or rather, you can, but it’s through a method called “copy and paste”, which as we know, doesn’t lend itself too well to auto updating functionality.

So the current directive format is:

<< path >>
[+]asm: criteria

For example, you might want directives for a system such as:

<< / >>
+skip: *.mp3

<< /home/academics >>
forget

<< /home/students >>
+skip: *.mov *.m4v *.wma *.dv

Now, it could be that for every Unix system, you also want to include Unix Standard Directives. Currently the only way to do this is to create a new directive where you’ve copied and pasted in all the Unix Standard Directives then added in your above criteria.

This, to use the appropriate technical term, is a dogs breakfast.

The only logical way, the way which obviously hasn’t been developed yet for NetWorker but falls into the category of “why the hell not?” would be support for include statements. That way, it could be embedded into the directive itself.

For example, what I’m talking about is that we should be able to do the following:

<< / >>
include: Unix Standard Directives

<< / >>
+skip: *.mp3

<< /home/academics >>
forget

<< /home/students >>
+skip: *.mov *.m4v *.wma *.dv

Now wouldn’t that be nice? Honestly, how hard could it be?

NB: The correct answer to “how hard could it be?” is actually “I don’t care.” That is, there’s some things that should be done regardless of whether they’re easy to do.

Posted in NetWorker, Quibbles | Tagged: , | 4 Comments »

Quibbles – Cloning and Staging

Posted by Preston on 2009-08-13

For the most part cloning and staging within NetWorker are pretty satisfactory, particularly when viewed from a combination of automated and manual operations. However, one thing that constantly drives me nuts is the inane reporting of status for cloning and staging.

Honestly, how hard can it be to design cloning and staging to accurately report the following at all times:

Cloning X of Y savesets, W GB of Z GB

or

Staging X of Y savesets, W GB of Z GB

While there have been various updates to cloning and staging reporting, and sometimes it at least updates how many savesets it has done, it continually breaks when dealing with the total amount staged/cloned in as much as it resets whenever a destination volume is changed.

Oh, and while I’m begging for this change, I will request one other – include in daemon.raw whenever cloning/staging occurs, the full list of ssid/cloneids that have been cloned or staged, and the client/saveset details for each one – not just minimal details when a failure occurs. It’s called auditing.

Posted in Quibbles | Tagged: , , , , | Comments Off on Quibbles – Cloning and Staging

Quibbles – A better NetWorker init script would be nice…

Posted by Preston on 2009-07-16

Coming primarily from a Unix background, I’ve remained disappointed for 10+ years that NetWorker’s init script has barely changed in that time. Or rather, the only things that have really changed in the script are checks for additional software – e.g., Legato License Manager.

It’s frustrating, in a pesky sort of way, that in all this time the engineers at EMC have never bothered to implement a restart function within the script.

For just about every Unix platform in NetWorker, the only arguments that /etc/init.d/networker takes are:

  • stop – stop the NetWorker services
  • start – start the NetWorker services

That’s it. Once, a long time ago, I got frustrated, and hacked on the NetWorker init script to include a restart option, one that worked with the following logic:

  1. Issued a stop command.
  2. Waited 30 seconds.
  3. Checked to see if there were any NetWorker daemons still running.
  4. If there were NetWorker daemons still running, warn the user and abort.
  5. If there were no NetWorker daemons still running, issue a start command.

Over time, I got tired of inserting this hack back into the init scripts after every upgrade or reinstall, and in time even gave up keeping it around. Call it laziness or apathy on my part, but whatever you want to label it, the same applies to first Legato, then EMC engineering for not adding this absolute basic and practically expected functionality after all these years.

Is there an RFE for this? I don’t know, but logically there should be no need for one. As systems have matured, the restart option has effectively become a default/expected setting on most platforms for init scripts. NetWorker has sadly lagged behind and still requires the administrator to manually run the stop and the start, one after the other.

A minor quibble, I know, but nevertheless a quibble.

Posted in NetWorker, Quibbles, Scripting | Tagged: , | 3 Comments »

Quibbles – nsrmmd vs process IDs

Posted by Preston on 2009-05-22

OK, there’s not a lot about NetWorker that drives me nuts. I think I’ve done only one other “Quibbles” topic here so far, but I’ve reached the point on this one where I’d like to vent some exasperation.

There are times – not often, but they occasionally happen – where for some reason or another, a device will lock up and become unresponsive. When this reaches a point where the only way to recover is to either kill the controlling nsrmmd process or restarting NetWorker, things get tough.

The reason for this is that NetWorker does not, anywhere, provide a mapping between each nsrmmd, the device it controls and the process ID for that device.

Honestly, this is one of these basic administrative usability issues for which there is no excuse that it hasn’t been resolved and available for the last 5 years, if not the last 10 years. It comes down to either laziness or apathy – people have been asking for it long enough that with all the changes done to nsrmmd over the years, it should have been added a long time ago.

What do you think?

Posted in NetWorker, Quibbles | Tagged: | 5 Comments »