NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist


Archive for the ‘Features’ Category

Validcopies hazardous to your sanity

Posted by Preston on 2009-12-04

While most of NetWorker 7.6’s enhancements centre on virtualisation or (urgh) cloud, there’s also a bunch of smaller updates that are of interest.

One of those new features is the validcopies flag, something I unfortunately failed to check out in beta testing. It looks like it could use some more work, but the theory is a good one. The idea behind validcopies is that we can use it in VTL-style situations to determine not only whether we’ve got an appropriate number of copies, but also whether those copies are valid – i.e., usable by NetWorker for recovery purposes.

It’s a shame it’s too buggy to be used.

Here’s an example where I back up to an ADV_FILE type device:

[root@tara ~]# save -b Default -e "+3 weeks" -LL -q /usr/share
57777:save:Multiple client instances of tara.pmdg.lab, using the first entry
save: /usr/share  1244 MB 00:03:23  87843 files
completed savetime=1259366579

[root@tara ~]# mminfo -q "name=/usr/share,validcopies>1"
 volume        client       date      size   level  name
Default.001    tara.pmdg.lab 11/28/2009 1244 MB manual /usr/share
Default.001.RO tara.pmdg.lab 11/28/2009 1244 MB manual /usr/share

[root@tara ~]# mminfo -q "name=/usr/share,validcopies>1" -r validcopies
6095:mminfo: no matches found for the query

[root@tara ~]# mminfo -q "name=/usr/share,validcopies>1" -r validcopies,copies
 validcopies copies
 2     2
 2     2

I have a few problems with the above output, and am working through the bugs in validcopies with EMC. Let’s look at each of those items and see what I’m concerned about:

  1. We don’t have more than one valid copy just because a saveset is sitting on an ADV_FILE device. The RW (Default.001) and RO (Default.001.RO) volumes are two views of the same file on disk, so if the purpose of the “validcopies” flag is to count the number of unique recoverable copies, we do not have 2 copies for each instance on ADV_FILE. There should be logic there to avoid counting copies on ADV_FILE devices twice for valid copy counts.
  2. As you can see from the last two commands, the results returned differ depending on the report options. This is inappropriate, to say the least. We get no validcopies reported at all if we query for validcopies alone, but 2 validcopies reported if we query for both validcopies and copies.

Verdict from the above:

  • Don’t use validcopies for disk backup units.
  • Don’t report on validcopies only, or you’ll skew your results.
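
If you do need to report on valid copies in the interim, the workaround that falls out of the above is to always request copies alongside validcopies and compare the two per saveset. Here’s a minimal sketch of that idea (the query is the one from my example above; adjust the name to suit):

#!/bin/sh
# Workaround sketch: validcopies on its own returns no rows, so pull
# copies and validcopies together and flag savesets where they differ.
mminfo -q "name=/usr/share" -r "ssid,copies,validcopies" |
  awk 'NR > 1 && $2 != $3 { print "saveset " $1 ": only " $3 " of " $2 " copies valid" }'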

Let’s move on to VTLs though – we’ll clone the saveset I just generated on the ADV_FILE device over to the VTL:

[root@tara ~]# mminfo -q "volume=Default.001.RO" -r ssid,cloneid
 ssid         clone id
4279265459  1259366578

[root@tara ~]# nsrclone -b "Big Clone" -v -S 4279265459/1259366578
5874:nsrclone: Automatically copying save sets(s) to other volume(s)
6216:nsrclone:
Starting cloning operation...
Nov 28 11:29:42 tara logger: NetWorker media: (waiting) Waiting for 1 writable volume(s)
to backup pool 'Big Clone' tape(s) or disk(s) on tara.pmdg.lab
5884:nsrclone: Successfully cloned all requested save sets
5886:nsrclone: Clones were written to the following volume(s):
 BIG998S3

[root@tara ~]# mminfo -q "ssid=4279265459" -r validcopies
 0

[root@tara ~]# mminfo -q "ssid=4279265459" -r copies,validcopies
 copies validcopies
 3          3
 3          3
 3          3

In the above instance, if we query just by the saveset ID for the number of valid copies, NetWorker happily tells us “0”. If we query for copies and validcopies, we get 3 of each.

So, what does this say to me? Steer away from ‘validcopies’ until it’s fixed.

(On a side note, why does the offsite parameter remain Write Only? We can’t query it through mminfo, and I’ve had an RFE in since the day the offsite option was introduced into nsrmm. Why this is “hard” or taking so long is beyond me.)

Posted in Features, NetWorker, Scripting | Comments Off on Validcopies hazardous to your sanity

Staging and Connectivity Loss

Posted by Preston on 2009-10-16

For a while now I’ve been working with EMC support on an issue that’s only likely to strike sites that have intermittent connectivity between the server and storage nodes, and that stage from ADV_FILE on the storage node to ADV_FILE on the server.

The crux of the problem is that if you’re staging from storage node to server and comms between the sites are lost for long enough that NetWorker:

  • Detects the storage node nsrmmd processes have failed, and
  • Attempts to restart the storage node nsrmmd processes, and
  • Fails to restart the storage node nsrmmd processes

Then you can end up in a situation where the staging aborts in an ‘interesting’ way. The first hint of the problem is that you’ll see a message such as the following in your daemon.raw:

68975 10/15/2009 09:59:05 AM  2 0 0 526402000 4495 0 tara.pmdg.lab nsrmmd filesys_nuke_ssid: unable to unlink /backup/84/05/notes/c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342 on device `/backup’: No such file or directory

(The above was rendered for your convenience.)

However, if you look for the cited file, you’ll find that it doesn’t exist. That’s not quite the end of the matter though. Unfortunately, while the saveset file that was being staged didn’t stay on disk, its media database details did. So in order to restart staging, it becomes necessary to first locate the saveset in question and delete the media database entry for the (failed) server disk backup unit copy. Interestingly, the stale entry is only ever to be found on the RW volume, not the RO volume:

[root@tara ~]# mminfo -q "ssid=c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342"
 volume        client       date      size   level  name
Tara.001       fawn      10/15/2009 1287 MB manual  /usr/share
Fawn.001       fawn      10/15/2009 1287 MB manual  /usr/share
Fawn.001.RO    fawn      10/15/2009 1287 MB manual  /usr/share
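
Once you’ve identified the stale copy, the cleanup is to delete just that clone instance from the media database, not the whole saveset. Here’s a sketch using the volumes from this example – the awk simply grabs the first data row, so eyeball the query output before deleting anything:

#!/bin/sh
# Sketch: remove only the stale server-side copy (on Tara.001 here) so
# staging can be restarted; the storage node copies are left intact.
STALE=$(mminfo -q "volume=Tara.001,name=/usr/share" -r "ssid,cloneid" |
  awk 'NR == 2 { print $1 "/" $2 }')
nsrmm -d -y -S "${STALE}"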

We had hoped that it was fixed in 7.5.1.5, but my tests aren’t showing that to be the case. Regardless, it’s certainly present in 7.4.x as well, and (given the nature of it) has quite possibly been around for a while longer than that.

As I said at the outset, this isn’t likely to affect many sites, but it is something to be aware of.

Posted in Features, NetWorker | 1 Comment »

A cautionary tale about changing pools

Posted by Preston on 2009-10-08

Sometimes I feel like a NetWorker old-timer. (When I don’t feel like a NetWorker old-timer, it doesn’t change the fact that I am.) These days, given the huge architectural gulf between those releases and the current ones, I’d suggest that anyone who has been using NetWorker since the v4.x or v5.x days is a NetWorker “old timer”. Since I’ve been using it from the trailing edge of the v3.x days, and heavily from v4.x, that puts me well into that territory.

One of the things NetWorker old timers will remember is that for a lot of its history, it was impossible to change anything to do with pools while a backup was running. If, for instance, you had a backup going to a Monthly pool and you wanted to configure a new group that would go to the Daily pool, you had to wait for all backups to complete before you could add that group to the Daily pool, even though there was no overlap in pool interests.

When the restriction was first relaxed, the NetWorker GUIs would prompt with a warning when pool changes were made during backup activities, indicating that it wasn’t recommended but providing a proceed/OK button to force the change. These days, NetWorker is more permissive, but not always more forgiving.

A customer experienced a problem recently where he’d configured and started a new group, only to realise that he’d not configured it to go to the correct pool. Rather than stopping the group and making the pool change, he hoped he could change the pool settings and have the group start requesting media from the correct pool instead of the Default pool. Once the change was made though, the group kept on asking for media in the Default pool, so he stopped the group, waited a few minutes, and restarted.

NetWorker kept asking for media in the Default pool. The NMC pool configuration pane clearly showed that the group was now configured for the correct pool, but, plain as day, the group still wanted to write to the Default pool.

Stopping and starting NetWorker didn’t seem to help either.

When he logged the case and explained the pool-change-during-backup, I immediately thought back to how NetWorker previously wouldn’t have allowed such a change to happen, and how for an interim period it would allow the change only after issuing a warning. But what if, I thought, there’s still some locking that can happen which would cause a screw-up if the pool were changed for a group while the group was already requesting media?

So I suggested two courses of action to the customer:

  • EMC engineering’s hated solution: Stop NetWorker, clean out /nsr/tmp, restart, and see if that fixes it.
  • Stop the backup, take the group back out of the pool, restart and allow it to write to Default, then put the group back in the correct pool and run the backup again.

In this case, the customer chose the first option – cleaning out /nsr/tmp. While it wasn’t tested, I equally suspect that the second option would have worked too.

There is a lesson in this: avoid making changes to pools for data that is already actively being written (or trying to be written) to media. Even though it’s technically supported, operationally it can still cause issues.

Posted in Features, NetWorker, Support | Comments Off on A cautionary tale about changing pools

Client side compression and saveset sizes

Posted by Preston on 2009-09-30

Hi!

The text of this article can now be read at the NetWorker Information Hub. Click here to read it.

Posted in Features, NetWorker | 1 Comment »

Basics – Updates vs Upgrades

Posted by Preston on 2009-08-18

After 13+ years of using NetWorker, I still tend to use the terms ‘upgrade’ and ‘update’ interchangeably (or, to be more precise, I mainly use the term ‘upgrade’).

However, there is, and always has been, a difference between the two terms in NetWorker nomenclature, and it’s useful knowing it in case you’re being asked to qualify your environment to a support person.

Here’s what they mean for NetWorker:

  • An upgrade is transitioning from one licensed feature set to a more advanced licensed feature set. For example, you might upgrade from NetWorker, Network Edition to NetWorker, Power Edition. Previously (when tiered licensing was still used for Windows modules), you might upgrade from say, Exchange Module Tier 1 to Exchange Module Tier 2. Alternatively, you can buy an upgrade to slot capacity for an Autochanger license (e.g., upgrading from a 1-64 slot license to a 1-128 slot license).
  • An update is where you change the version of a NetWorker product. E.g., you update from NetWorker 7.4.4 to NetWorker 7.4.5, or from NetWorker 7.3 to NetWorker 7.5.1. You would equally update from Oracle Module 4.5 to Oracle Module 5.

Since in both support and data protection it’s useful to avoid ambiguities, understanding the difference between these two terms can be important.

Posted in Basics, Features, NetWorker | Comments Off on Basics – Updates vs Upgrades

NetWorker 7.5 and Oracle Module 5

Posted by Preston on 2009-05-07

Since I have more than a passing interest in databases, I always try to keep apprised of the Oracle module for NetWorker. It therefore surprised me a few days ago to see that v5 of the module had been released in March. I guess my excuse is that March was an insanely busy month for me between work and travel. (Well, that’s my excuse, and I’m sticking to it.)

So yesterday I downloaded v5 of the module (for Linux), and spun it up. This is a version I really, really like.

Now, here’s a few bullet points before I get to the most impressive feature:

  • No longer supports Oracle 9i or lower; if you need to back up older, unsupported versions of Oracle, you have to use an older version of the module.
  • Requires features that exist only in NetWorker 7.5.x as the underlying client.
  • Must have the regular NetWorker client installed and running in order for the module software to correctly install and activate.
  • Can work with a 7.4.x NetWorker server, with the exception that what I’m about to describe below doesn’t work with a 7.4 server.
  • Now has a client configuration wizard that works within NMC and makes Oracle backup configuration a breeze.

Honestly, if you’re about to do a new NetWorker install into a site that has Oracle, skip everything else and install 7.5.1. That is, this is one of those compelling reasons for moving to 7.5.x.

The Oracle client configuration wizard is integrated into NMC’s wizards. Right-click on a client in the configuration panel, choose “Client Backup Configuration -> New”, and you’re off and running:

Oracle Client Configuration Step 1

Oracle Client Configuration Step 2

Note that you won’t reach this point if you’ve disabled ‘nsrauth’ authentication on the backup server. I had done so on my lab server as a test on Monday, and spent half an hour trying to work out a … rather inexact … error message.

Oracle Client Configuration Step 3

Oracle Client Configuration Step 4

The above step is where things get fun. Note that if you are given these details, you no longer even need to log onto the client to set up an nsrnmo script. This is the start of A Really Good Thing.

Also, I should note: in the above screen shot, because I was using a temporary database installed just for a few tests, and I was in a rush, I used the sys account for connecting to the target database. No, you shouldn’t ever do that – create a backup user and use that account, please.
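
For reference, creating such an account only takes a minute. A purely illustrative sketch – the user name and password below are mine, not anything the wizard mandates, and RMAN target connections in this era of Oracle require SYSDBA:

# Illustrative only: create a dedicated backup account instead of using sys.
# Run on the database server as a user that can connect "/ as sysdba".
sqlplus / as sysdba <<EOF
CREATE USER rman_backup IDENTIFIED BY "a_strong_password";
GRANT SYSDBA TO rman_backup;
EOF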

Note that Oracle, and the Oracle Listener, must both be running on the client in order to clear the above step.

After the above, we then start to get into the ‘regular’ client configuration options:

Oracle Client Configuration Step 5

Oracle Client Configuration Step 6

Oracle Client Configuration Step 7

This summary screen shows you what you’re going to get as far as the configuration is concerned – including the RMAN script that has been automatically generated for you:

Oracle Client Configuration Step 8

Confirmation of sweet success:

Oracle Client Configuration Step 9

The finished client in NMC:

Oracle Client Configuration Step 10

Once configured, you’re ready to start backing up straight away. Honestly, it couldn’t be simpler.

As a closing note, I know some other backup products have had Oracle backup wizards for some time, so I’m not claiming EMC is the first with this style of setup, but I do think it’s a great feature to see included now.

Posted in Databases, Features, NetWorker | 17 Comments »

7.5(.1) changed behaviour – deleting savesets from adv_file

Posted by Preston on 2009-04-25

Yesterday I wanted to delete a few savesets from a lab server I’d upgraded from 7.4.4 to 7.5.1.

Wanting to go about it quickly, I did the following:

  • I used “nsrmm -dy -S ssid” for each saveset ID I wanted to delete, to erase it from the media database.
  • I used “nsrstage -C -V volumeName” for the disk backup unit volumes to run a cleaning operation.

Imagine my surprise when, instead of seeing a chunk of space being freed up, I got a lot of the following notifications:

nsrd adv_file warning: Failed to fetch the saveset(ss_t) structure for ssid 1890993582

I got one of these for every saveset I deleted. And since I’d run a lot of tests, that was a lot of savesets. The corresponding result was that they all remained on disk. What had been a tried and true method of saveset deletion under 7.4.x and below appears to be far less useful under 7.5.1.

In the end I had to run a comparison between media database content and disk backup unit content – i.e.:

# mminfo -q "volume=volName" -r "ssid(60)"

To extract the long saveset IDs, which are in effect the names of the files stored on disk, then:

# find /path/to/volume -type f -print

Then, for each filename, check whether it exists in the media database, and if it doesn’t, manually delete it. This is not something the average user should do without talking to their support people, by the way – but, well, I am support people, and it was a lab server…
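
For the curious, that comparison can be scripted. A rough sketch, assuming GNU find, with volName and /backup/volName as illustrative placeholders – review the output by hand before rm’ing anything:

#!/bin/sh
# Sketch: list files on the disk backup unit that have no matching media
# database entry. The saveset files are named by their long saveset IDs.
mminfo -q "volume=volName" -r "ssid(60)" | awk 'NR > 1 { print $1 }' | sort -u > /tmp/mdb.ssids
find /backup/volName -type f -printf '%f\n' | sort -u > /tmp/disk.files
# Anything listed here is on disk but not in the media database:
comm -13 /tmp/mdb.ssids /tmp/disk.files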

This change is worrying enough that I’ll be running up a couple of test servers using multiple operating systems (the above happened on Linux) to see whether it’s reproducible, or whether there was just, say, some freaky accident with the media database on my lab machine.

I’ll update this post accordingly.

[Update – 2009-04-27]

Have done some more tests of 7.5.1 on various Linux servers, comparing results to 7.4.4. This is definitely changed behaviour, and I don’t like it, given that it’s very common for backup administrators to delete one or two savesets here and there from disk. I’m chatting to EMC about it.

In the interim, here’s a workaround I’ve come up with – instead of using nsrmm -d to delete the saveset, run:

# nsrmm -w now -e now -S ssid

To mark the saveset as immediately recyclable. Then run “nsrim -X” to force a purge. That will work. If you have scripts that manually delete savesets from disk backup units, though, you should act now to update them.
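
If your scripts currently call nsrmm -d per saveset, the updated form is a one-line change plus a final purge. A sketch, where SSIDS is an illustrative list of saveset IDs:

#!/bin/sh
# Sketch: the 7.5.1-safe replacement for deleting savesets from disk.
for ssid in $SSIDS; do
    nsrmm -y -w now -e now -S "$ssid"   # mark recyclable instead of nsrmm -d
done
nsrim -X                                # force the purge/reclaim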

[Update – 2009-04-30]

It would appear as well that if you delete and then attempt to reclaim space, NetWorker will set the “scan required” flag on the volume. Assuming you’re 100% OK with what you’ve manually deleted and then purged from disk using rm, you can probably safely clear the flag (nsrmm -o notscan). If you’re feeling paranoid, unmount the volume, scan it, then clear the flag; there’s a sketch of that below.
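
That paranoid route, sketched with an illustrative device path and volume name (scanner -i re-reads the volume back into the media database and indexes before you clear the flag):

#!/bin/sh
# Sketch of the cautious path after manually rm'ing files from an ADV_FILE volume.
nsrmm -u -y volName          # unmount the volume
scanner -i /backup/volName   # re-read the volume's contents into the databases
nsrmm -o notscan volName     # clear the "scan required" flag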

[Update – 2009-05-06]

Confirmed this isn’t present in vanilla 7.5; it appears to have been introduced in 7.5.1.

[Update – 2009-06-16]

Cumulative patches for 7.5.1 have been released; according to EMC support, these patches include the fixes for this issue, allowing a return to normal operations. If you’re having this issue, make sure you touch base with EMC support or your EMC support partner to get access to the patches. (Note: I’ve not had a chance to review the cumulative patches, so I can’t vouch for them yet.)

[Update – 2009-08-11]

I forgot to update this earlier: the cumulative patches (7.5.1.2, in the case of what I received) did properly incorporate the fix for this issue.

Posted in Features, NetWorker | 7 Comments »