NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist

  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).

  • Enterprise Systems Backup and Recovery

    If you find this blog interesting, and either have an interest in or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on features, policies, procedures and the human element of ensuring that your company has a suitable and working backup system rather than just a bunch of copies made by unrelated software, hardware and processes.

Posts Tagged ‘Recovery’

Recovery reporting comes to NetWorker

Posted by Preston on 2009-12-02

One of the areas where administrators have rightly been able to criticise NetWorker is the lack of reporting and auditing options for recoveries. While some information has always been retrievable from the daemon logs, it’s only ever been basic, and it depends on keeping the logs. (Which you should, of course, always do.)
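
For what it’s worth, extracting even that basic information has usually meant trawling the logs by hand – something along these lines on a 7.4 or later server (a rough sketch; the log path varies by platform, and older releases keep a plain-text daemon.log rather than daemon.raw):

# nsr_render_log /nsr/logs/daemon.raw | grep -i recover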

NetWorker 7.6, however, does bring in recovery reporting, which starts to address those criticisms. In the enterprise reporting area, you’ll now find the following section:

  • NetWorker Recover
    • Server Summary
    • Client Summary
    • Recover Details
    • Recover Summary over Time

Of these reporting options, I think the average administrator will want the bottom two the most, unless they operate in an environment where clients are billed for recoveries.

Let’s look at the Recover Summary over Time report:

Recover summary over time

This presents a fairly simple summary of the recoveries that have been done on a per-client basis, including the number of files recovered, the amount of data recovered and the breakdown of successful vs failed recovery actions.

I particularly like the Recover Details report though:

Recover Details report

As you can see, we get a per-user breakdown of recovery activities – when they were started, how long they took, how much data was recovered, and so on.

These reports are a brilliant and much-needed addition to NetWorker’s reporting capabilities, and I’m pleased to see EMC has finally put them into the product.

There’s probably one thing still missing that I can see administrators wanting – file lists for recovery sessions. Hopefully 7.(6+x) will see that report option added.

Posted in NetWorker | 2 Comments »

The 5 Golden Rules of Recovery

Posted by Preston on 2009-11-10

You might think, given that I wrote an article a while ago about the Procedural Obligations of Backup Administrators, that it wouldn’t be necessary to explicitly spell out any recovery rules – but this isn’t quite the case. It’s handy to have a “must follow” list of rules for recovery as well.

In their simplest form, these rules are:

  1. How
  2. Why
  3. Where
  4. When
  5. Who

Let’s look at each one in more detail:

  1. How – Know how to do a recovery before you need to do it. The worst forms of data loss typically occur when a backup routine is put in place untested, on the assumption that it will work. If a new type of backup is added to an environment, it must be tested before it is relied on. In testing, it must be documented by those doing the recovery. In being documented, it must be referenced by operational procedures*.
  2. Why – Know why you are doing a recovery. This directly affects the required resources. Are you recovering a production system, or a test system? Is it for the purposes of legal discovery, or because a database collapsed?
  3. Where – Know where you are recovering from and to. If you don’t know this, don’t do the recovery. You do not make assumptions about data locality in recovery situations. Trust me, I know from personal experience.
  4. When – Know when the recovery needs to be completed by. This isn’t always answered by the why factor – you actually need to know both in order to fully schedule and prioritise recoveries.
  5. Who – Know that whoever requested the recovery is authorised to do so. (In order to know this, there should be operational recovery procedures – forms and company policies – that indicate authorisation.)

If you know the how, why, where, when and who, you’re following the golden rules of recovery.


* Or to put it another way – documentation is useless if you don’t know it exists, or you can’t find it!

Posted in Backup theory, NetWorker, Policies | Comments Off on The 5 Golden Rules of Recovery

Recovery is NOT #3

Posted by Preston on 2009-10-13

There’s a Top 10 Reasons marketing document from EMC now explaining why you should be using NetWorker.

While this is obviously a marketing document, it has one serious flaw above and beyond any issues people may have with standard marketing documents: it puts recovery at the wrong position in the list.

The first reason, according to the document, is about driving costs out of the environment. Sure, that’s a reason for deployment of an integrated solution, but it’s not the number 1 reason. The second reason is apparently due to closed backup windows. That is a good second reason, but it doesn’t have priority over the real number one reason.

The third reason they cite on the document is:

If your backups don’t recover, neither will your business.

NetWorker is the leader in recovery performance – up to 100 percent faster than the alternatives. It delivers fast, secure, and reliable data and server recovery to ensure you meet your required service levels.

Who in their right mind would put this at #3 on a list of reasons? Backup is recovery – or rather, it’s all about recovery. Whoever put this marketing document together was wrong on one very key point:

Recovery is and has always been Number One.

(If you think that’s wrong, just go talk to all the Sidekick users…)

Posted in Backup theory, NetWorker | Comments Off on Recovery is NOT #3

Trickle recoveries aren’t recoveries

Posted by Preston on 2009-09-25

In environments with satellite offices, a commonly described “backup” technique is what I’d generically call a “trickle backup”. This uses some form of asynchronous replication (either block or file), or some other form of file/block deduplication, to achieve very small backups (after the first) back to a central site.

These are inevitably done for one or both of the following two reasons:

  1. Staff at the satellite office do not have the technical skills to manage local media or backup storage nodes/media servers.
  2. The WAN bandwidth is too little (or too costly) to do full-scale backups.

Remembering my definition of recoverable in The 7 Procedural Obligations of Backup Administrators, I’d like to suggest that for most situations, there’s no such thing as a trickle recovery. To reiterate, my definition of recoverable is:

  1. The item that was backed up can be retrieved from the backup media.
  2. The item that is retrieved from the backup media is usable as a replacement to the data that was backed up.
  3. The item can be retrieved within the required window.

Trickle backups, if not considered properly, cease to be valid backups if they violate item 3 above.

Whenever trickle backups are considered, there must be rigorous planning conducted (including discussions with appropriate stakeholders) to determine what recovery methods will be considered valid. Let’s discuss briefly what might need to be considered:

  1. Will individual file level recovery back across the WAN connection be possible?
  2. What will be the maximum amount of data that can be recovered across the WAN connection?
  3. How will larger data recoveries be facilitated?
  4. How will complete system recoveries be facilitated?
  5. Can all recoveries be completed within SLAs?
  6. Are HR and IT policies in place to prevent situations where satellite office recovery requirements may be abandoned or delayed due to staff shortages or local workloads?

Unless all 6 of those questions have answers that are compatible with SLAs and business requirements, there is no valid satellite backup system in place.

Posted in Backup theory | Comments Off on Trickle recoveries aren’t recoveries

Basics – Directed Recoveries

Posted by Preston on 2009-09-21

Hi,

This blog post can now be read at the NetWorker Information Hub.

Posted in Basics, NetWorker | 2 Comments »

My worst recovery ever

Posted by Preston on 2009-07-31

Everyone makes mistakes. That’s part of being human. Indeed, I’d suggest that anyone who expects you to never make mistakes in your job is perhaps either demented or living at right angles to reality.

I believe the best, the most realistic and useful thing we can aim for is to never make the same mistake twice.

Thus, below I present my “worst recovery ever” as an example of a mistake that I certainly don’t intend to ever have happen to me again. It happened in my last job, and since then I’ve changed the way I work when it comes to recoveries.

It was Friday afternoon – about 3pm, in fact. There was a training course running, and for once the training network was behaving itself. As our management had (once) bought into the notion of network computing, our training environment was sufficiently convoluted that the Sun-Rays connected to our production backup server + fileserver + SunRay server, and from there ran RDP sessions to a VMware server.

One of our engineers who had little to do with NetWorker (ironically, he was hired to be my field replacement for NetWorker, but circumstances changed when he was hired) ran one of those notoriously bad RedHat updates that for a while killed glibc and a bunch of other system files if you didn’t happen to be in North America. So after much diagnosing and discussing, it was necessary to do an OS recovery. The only problem was that the OS on his laptop was so hosed that you couldn’t start any more login sessions, so you were stuck with what was there.

I started the recovery, selected files and viewed volumes. However, some of the files we needed were on media that was outside of the library, and because we were backing up to disk, then cloning, then staging, NetWorker wanted the (offsite) clones, not the onsite originals. His laptop didn’t have administration rights on the backup server, so I ssh’d across to the backup server, set the appropriate tapes to have a ‘suspect’ status, then ran up the recovery again. I selected the root filesystem (/) and kicked off the recovery.

About 2-3 minutes later someone came to ask me about why they couldn’t access the fileserver any more. Checking, I couldn’t log in either. It was odd – the training course was still working, but nothing new would work.

And then it hit me. I’d never logged out of the Sun system before I kicked off the recovery. Not only that, but I’d run “recover -c linuxClient” on the Sun server.

That’s right, I was recovering a Linux filesystem on top of a Solaris system – including /dev, including all the base binaries. And because it was a recovery designed to overwrite a clobbered operating system, I’d told it to force-overwrite everything it came across.

I was …unhappy… with myself. I aborted the recovery, obviously, but the damage had already been done. No-one else could do anything, and I couldn’t start the recovery of the backup server because it was also the Sun-Ray server, and there were a bunch of paying students wrapping up their training course. So I had to wait for the training course to complete before I could even start the recovery. Practically everyone else got an early mark, since there wasn’t much they could do.

The recovery turned out to be somewhat problematic. Because the Solaris OS was hosed, no new processes could be started on that – the only solution, after much consideration, was an OS reinstall. At some point though the OS disks for that server had been taken out to a customer site because that customer had lost their disks, or something along those lines, so an earlier revision Solaris installer disk was used. However, it turned out that earlier revision disk wasn’t compatible with that hardware, and these were the days when Solaris took forever to install when you didn’t have tonnes of RAM. So by 10pm that night, I’d given up on being able to install the OS that night. Murphy’s law strikes as much with recoveries as it does with anything else in IT, so of course I’d not slept a wink the night before, and I lived one and a half hours away from the office by train at the best of times – at that time of the night, 3 hours at least.

I fell into bed around 2am but, fretting over the recovery, didn’t sleep much either. I’d recently purchased a Sun workstation myself, so I knew I had an install disk that would work – which was why I’d chosen to come home rather than download Solaris from the office. Later that morning saw me heading back into the office to keep going, and of course my Solaris installer disk was now faulty, so the OS install hit bad sectors on the disk during the process and I hit my head repeatedly against the column in my office.

Lady luck then struck. Or at least wandered by and waved. There was a detached mirror left over from some disk swapping done about 2-3 months earlier.

Finally I managed to boot the system from the previously attached mirror, get that mirror re-syncing, and installed NetWorker. Once the mirrors had resynced the recovery started in earnest, and the system finally came back.

By that stage though it was about 5pm on Saturday afternoon. I was wrecked from having not slept for a couple of days, not to mention still bloody angry with myself for the entire chain of events.

So I vowed I’d never make the same mistake again.

How, I hear you ask, will I prevent myself from ever making this mistake again? I check, check, check. In real recovery situations, I never run a recovery command without first checking what host I’m logged into. It takes an extra 10 seconds, maybe even less, but it guarantees that I don’t get that sick feeling in the pit of my stomach that comes with FUBAR’ing one system while trying to recover another.
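
In practice that check is nothing more elaborate than confirming the hostname before typing anything destructive – a trivial sketch of the habit (the hostnames here are made up):

# uname -n
linuxClient
# recover -s backupServer -c linuxClient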

If I had done that simple check all those years ago, I would have had a lovely, quiet weekend.

Posted in NetWorker | Comments Off on My worst recovery ever

Mistakes you don’t want to make

Posted by Preston on 2009-07-25

Many years ago, a company switched from ArcServe to NetWorker. They did so around the time they made their end of year backups, the ones that they intended to keep ‘forever’ for legal requirements.

Fast-forward several years, and a request came in to recover Lotus Notes backups from those original end of year archives. That’s when the support call came through. You see, those end of year archives were done on a standalone tape drive, not a tape library, and both tapes had, say, ‘YEAR2002’ written on the label. There was a little “1” noted on the first label, and a little “2” noted on the second. For convenience, we’ll call them the first and second tapes.

When they put the first tape into the library for recovery, their first issue was getting NetWorker to mount the tape, since it didn’t have a barcode. Some non-GUI commands later, the tape was in the drive, but NetWorker wouldn’t keep it mounted – every time they tried, NetWorker threw up an error saying that it was expecting tape YEAR2002 with a particular volume ID, not YEAR2002 with a different volume ID that wasn’t in the media database. The second YEAR2002 tape would mount, but NetWorker couldn’t perform the recovery because not all of the media was available.

So, here’s what happened:

  • The manual backup was run of a bunch of systems and Lotus Notes.
  • A tape was labelled YEAR2002 within NetWorker, and the backup ran until the tape filled up.
  • A new tape was put into the tape drive, and since they had no exposure to NetWorker, they labelled that tape as YEAR2002 as well and the backup went on its way.

I’ll qualify here – the Lotus Notes backup was done using the module.

Now here’s the thing – while NetWorker relies on the volume ID being unique, it also relies on the volume label being unique. It won’t support two volumes in the media database at the same time with the same label. It gets pretty strident about that if you try to label one tape with another tape’s label, but I guess if you’re new to NetWorker it might just seem like there’s a bunch of confirmation boxes you have to click before you can label your next tape.
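
If you’re ever in doubt about whether a label is already in use, it’s worth querying the media database before labelling anything – something along these lines, run on (or pointed at) the backup server:

# mminfo -m -q "volume=YEAR2002"

If that returns a media record, the label is already taken.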

So the net result was that the backup was written to two pieces of media that couldn’t co-exist in the media database at the same time. Scanning the first necessitates removing the second from the media database, and because this wasn’t a filesystem backup, the limitations around recovering from partial savesets couldn’t be stepped around.

For a regular filesystem backup, as a last resort, this still would not be impossible to recover from – using scanner and uasm you can still suck the data off the tape(s) without NetWorker needing both in the media database. Tedious, and not as good as just being able to select data in a recovery program, but it’s better than no recovery at all. But you can’t use scanner and uasm for a non-filesystem recovery.

(You also can’t write a new tape label to a fresh tape, then dd the NetWorker data after the label on the other tape onto the newly labelled tape. The volume ID (or some other unique volume identification system) is written into the savestream, and transferring that savestream onto another volume sees NetWorker reject it if you subsequently attempt to scan it.)

Net result? Data that could not be recovered short of sending it off to a specialist forensics data recovery company.

NetWorker’s fault? No. There is, after all, only so much that software can do to prevent you from shooting yourself in the foot.

Posted in NetWorker | Comments Off on Mistakes you don’t want to make

Backups are not about being miserly

Posted by Preston on 2009-05-18

Recently, Australia’s largest grocery chain followed some of the other chains and started offering unit pricing on its products. For example, packaged food shows not only its actual RRP but also the price per 100g. That way, you can look at, say, two blocks of cheese and work out which one is technically the better value, even if one is larger than the other.

This has reminded me of how miserly some companies can be with backup. While it’s something I cover in my book, it’s also something that’s worth explaining in a bit of detail.

To set this up, I want to use LTO-4 media as the backup destination, and look at one of those areas of systems that miserly companies looking to save a buck here and there frequently skip from backups. That, of course, is the operating system. It’s all too common to see backup configurations that back up data areas on servers, but leave the operating system unprotected because “that can be rebuilt”. That sort of argument is often a penny-wise/pound-foolish approach that fails to take into account the purpose of backup – recovery.

Sure, useless backups are a waste of money. For instance, if you back up an Oracle database using the NetWorker module, but also let filesystem backups pick up the datafiles from the running database, then you’re not only backing up the database twice – the non-module copy is useless, because it can’t be recovered from.
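
As an aside, the usual way of stopping the filesystem backup from picking those datafiles up is a local directive in the datafile directory – a sketch only, with a hypothetical path (check the directive syntax against the documentation for your release):

# cat /u01/oradata/.nsr
+skip: *.dbf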

However, are operating system backups a waste of money or time? My argument is that except in circumstances where it is architecturally illogical or unnecessary, they’re neither a waste of money nor a waste of time. Let’s look at why…

At the time of writing, a casual search of “LTO-4 site:.au best price” in Google yields within the first 10 results LTO-4 media as low as $80 for RRP. That’s RRP, which often has little correlation with bulk purchases, but miserly companies don’t make bulk media purchases, so we’ll work off that pricing.

Now, LTO-4 media has a native capacity of 800 GB. Rather than go fuzzy with any numbers, we’ll assume native capacity only for this example. So, at $80 for 800 GB, we’re talking about $0.10 per GB – 10c per GB.

So, our $80/800GB cartridge has a “unit cost” of 10c/GB, which sounds pretty cheap. However, that’s probably not entirely accurate. Let’s say that we’ve got a busy site and in order to facilitate backups of operating systems as well as all the other data, we need another LTO-4 tape drive. Again, looking around at list prices for standalone drives (“LTO-4 drive best price site:.au”) I see prices starting around the $4,500 to $5,000 mark. We should expect to see the average drive (with warranty) last for at least 3 years, so that’s $5,000 for 1,095 days, or $4.56 per day of usage. Let’s round that to $5 per day to account for electricity usage.

So, we’re talking about 10c per GB plus $5 per day. Let’s even round that up to $6 per day to account for staff time in dealing with any additional load caused by operational management of operating system backups.

I’ll go on the basis that the average operating system install is about 1.5GB, which means we’re talking about 15c to back that up as a base rate, plus our daily charge ($6). If you had say, 100 servers, that’s 150GB for full backups, or $15.00 for the full backups plus another $6 on that day. Operating system incremental backups tend to be quite small – let’s say a delta of 20% to be really generous. Over the course of a week then we have:

  • Full: 150GB at $15 + $6 for daily usage.
  • Incremental: 6 x (30GB at $3 + $6 for daily usage).
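
Adding that up (a quick back-of-the-envelope check, using the per-GB and per-day figures assumed above):

# echo "150*0.10 + 6*30*0.10 + 7*6" | bc
75.00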

In total, I make that out to be $75 a week, or $3,900 a year, for operating system backups to be folded into your current data backups. Does that seem a lot of money? Think of this: if you’re not backing up operating system data, this usually means that you’re working on the basis that “if it breaks, we’ll rebuild the server”.

I’d suggest to you that in most instances your staff will spend at least 4 hours trying to fix the average problem before the business decision is made to rebuild a server. Even with say, fast provisioning, we’re probably looking at 1 hour for full server reinstall/reprovision, current revision patching, etc. So that equals 5 hours of labour. Assuming a fairly low pay rate for Australian system administrators, we’ll assume you’re paying your sysadmins $25 per hour. So a 5 hour attempted fix + rebuild will cost you $125 in labour. Or will it? Servers are servers because they typically provide access or services for more than one person. Let’s assume 50 staff are also unable to work effectively while this is going on, and their average salary is even as low as $20 per hour. That’s $5,000 for their labour, or being more fair, we’ll assume they’re only 50% affected, so that’s $2,500 for their wasted labour.

How many server rebuilds does it take a year for operating system backups to suddenly be not only cost-effective but also a logically sound business decision? Even when we factor in say, an hour of effort for problem diagnosis plus recovery when actually backing up the operating system regions, there’s still a significant difference in price.
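
Working it through with the figures above – roughly $2,625 per rebuild ($125 in sysadmin labour plus $2,500 in lost staff productivity) against about $3,900 a year for the backups:

# echo "scale=1; 3900 / (125 + 2500)" | bc
1.4

In other words, somewhere around one and a half server rebuilds a year is the break-even point.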

Now, I’m not saying that any company that chooses not to backup operating system data is being miserly, but I will confidently assert that most companies who choose not to backup their operating system data are being miserly. To be more accurate, I’d suggest that if the sole rationale for not doing such a backup is “to save money” (rather than “from an architectural standpoint it is unnecessary”) then it is likely that a company is wasting money, not saving it.

Posted in Backup theory, Policies | 4 Comments »

Recovering with scanner and uasm

Posted by Preston on 2009-04-22

There are some types of recoveries that fall into the “last ditch effort” category – everything else has been tried, but the data just won’t come back. I would have to say that in 99.99% of cases this is due to one of the following two things:

  1. The data was never properly backed up in the first place.
  2. Cloning wasn’t done, and the only piece of media that holds a particular backup is broken.

Assuming the data was actually backed up, as a last thing to try*, you can recover filesystem data using scanner and uasm. This is documented in the man page for scanner – and also available as a PDF to Windows administrators in the command reference guide.

As per the man page, the way of running scanner and uasm to facilitate a recovery is as follows:

# scanner -S ssid device | uasm -r -v -m /source=/dest

or, if you prefer to avoid using the pipe,

# scanner -S ssid device -x uasm -r -v -m /source=/dest

Where in each of those commands:

  • ssid is the saveset ID you want to recover from
  • device is the path to the device where the volume holding that saveset can be accessed
  • /source is the path as it appears in the saveset – the path that was originally backed up
  • /dest is the path you want the recovered data written to instead

Out of these, the “/source” and “/dest” arguments probably make the least sense until you see an example.

Say for instance you have a backup of “/home” that you want to recover this way. Chances are, you don’t want to recover it on top of an existing /home; thus, you’d use “-m /home=/tmp/recovery”, or something like that. This instructs uasm to replace the “/home” path in the incoming data stream with “/tmp/recovery” when writing the data back out.

However, it’s not just a saveset ID argument that scanner takes in this mode; in fact, you can pass through most arguments in relation to restricting what is scanned – that is, clients, saveset names, and even file/record numbers.

If you’re dealing with potentially damaged media, the file/record numbers can sometimes be your saving grace. Say you’re doing a recovery and mid-way through NetWorker aborts, saying there’s an error at file number 33 on the tape. At this point, you should have a clone. Honestly, you really, really should have a clone.

Assuming though that for some reason you don’t have a clone, there’s a good chance your data is hosed** – recover, nwrecover, winworkr are not going to get you past that point.

Scanner might. There’s no guarantee, but it might. If the media error is limited to just a single portion of the tape, you may find that you can run a scanner/uasm combination that starts at the next file marker and hope that it gets your data back. Assuming we’re needing to get /home back for the client ‘baal’ on /dev/nst0***, this would make the command

# scanner -c baal -N /home -f 34 /dev/nst0 | uasm -r -v -m /home=/tmp/recovery

If you need to get even more data back, you could even use a starting record number. Say the recovery failed at file number 33, media record 1138; you might try the following:

# scanner -c baal -N /home -f 33 -r 1139 /dev/nst0 | uasm -r -v -m /home=/tmp/recovery

However, in each case, remember that the extent to which scanner and uasm will be able to recover your data will be limited to the amount of physical damage on the tape.

As I mentioned before, there’s no guarantee scanner is going to … ahem, save your bacon … if things have reached this point – but what is there to lose by trying?


* This is also a valid technique where you have a decommissioned server and you just need to urgently pull back a few files without doing a full rebuild.

** And it’s not NetWorker’s fault if media or devices go bad and it’s the only copy of the data!

*** If you’re needing to do this because of a media failure, make sure you use a different tape drive for your “last attempt”, just in case it was the tape drive that caused the failure.

Posted in NetWorker | 1 Comment »

Basics – no_striped_recover

Posted by Preston on 2009-04-03

With the introduction of the advanced file type (adv_file) device in NetWorker, changes were made to support striped recoveries. In a striped recovery, if all the savesets required for the recovery are online, NetWorker commences parallel reads, speeding up the process considerably. This applies to both file- and tape-based devices. In both theory and practice it usually works great, but there is at least one key exception I’m aware of.

For many releases of NetWorker, striped recovery can fail on Linux if more media needs to be mounted than there are devices to read from. For instance, if you have a recovery that needs to read data from 4 tapes, but you only have 3 tape drives available, in many instances of NetWorker on Linux you’ll get the situation where NetWorker will mount 2 or 3 of the tapes, but then appear to just “hang” the recovery before it starts.

Thankfully, there’s actually a relatively easy solution.

Within the /nsr/debug directory, you can create the file:

no_striped_recover

At that point, NetWorker will revert to the traditional recovery style – reading in sequence from each volume, starting at the oldest saveset required and coming forward to the newest saveset required, pulling the requisite chunks of data from each saveset.

If you’re wondering, the content of the file is irrelevant; thus, you can simply:

# touch /nsr/debug/no_striped_recover

If the recovery is actually running, you’ll need to cancel it and run it again – note that you do not have to restart the NetWorker server though.
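
Once the recovery is done, it’s presumably worth removing the flag file again so that striped recoveries return for everything else:

# rm /nsr/debug/no_striped_recover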

Posted in Basics, NetWorker | Comments Off on Basics – no_striped_recover