NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist


Posts Tagged ‘performance’

NetWorker 7.6 Performance Tuning Guide – No longer an embarrassment

Posted by Preston on 2009-11-21

While NetWorker 7.6 is not available for download as of the time I write this, the documentation is available on PowerLink. For those of you chomping at the bit to at least read up on NetWorker 7.6, now is the time to wander over to PowerLink and delve into the documentation.

The last couple of releases of NetWorker have been interesting for me when it comes to beta testing. In particular, I’ve let colleagues delve into VCB functionality, etc., and I’ve stuck to “niggly” things – e.g., checking for bugs that have caused us and our customers problems in earlier versions, focusing on the command line, etc.

For 7.6 I also decided to revisit the documentation, particularly in light of some of the comments that regularly appear on the NetWorker mailing list about the sorry state of the Performance Tuning and Optimisation Guide.

It’s pleasing, now that the documentation is out, to read the revised and up-to-date version of the Performance Tuning Guide. Regular critics of the guide, for instance, will be pleased to note that FDDI does not appear once. Not once.

Does it contain every possible useful piece of information that you might use when trying to optimise your environment? No, of course not – nor should it. Everyone’s environment will differ in a multitude of ways. Any random system patch can affect performance. A single dodgy NIC can affect performance. A single misconfigured LUN or SAN port can affect performance.

Instead, the document now focuses on providing a high level overview of performance optimisation techniques.

Additionally, recommendations and figures have been updated to reflect current technology. For instance:

  • There’s a plethora of information on PCI-X vs PCI Express (PCIe).
  • RAM guidelines for the server, based on the number of clients, have been updated.
  • NMC finally gets a mention as a resource hog! (Obviously, those aren’t the words used, but that’s the implication for larger environments. I’ve been increasingly encouraging larger customers to put NMC on a separate host for this reason.)
  • There’s a whole chunk on client parallelism optimisation, both for the clients and the backup server itself.

I don’t think this document is perfect, but if we’re looking at the old document vs the new, and the old document scored a 1 out of 10 on the relevancy front, this at least scores a 7 or so, which is a vast improvement.

Oh, one final point – with the documentation now explicitly stating:

The best approach for client parallelism values is:

– For regular clients, use the lowest possible parallelism settings to best balance between the number of save sets and throughput.

– For the backup server, set highest possible client parallelism to ensure that index backups are not delayed. This ensures that groups complete as they should.

Often backup delays occur when client parallelism is set too low for the NetWorker server. The best approach to optimize NetWorker client performance is to eliminate client parallelism, reduce it to 1, and increase the parallelism based on client hardware and data configuration.

(My emphasis)

Isn’t it time that the default client parallelism value were decreased from the ridiculously high 12 to 1, and we got everyone to actually think about performance tuning? I was overjoyed when I’d originally heard that the (previous) default parallelism value of 4 was going to be changed, then horrified when I found out it was being revised up, to 12, rather than down to 1.
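If you do want to wind client parallelism right back yourself rather than waiting for the default to change, it can be done from the command line as well as from NMC. Here’s a minimal sketch using nsradmin; “orilla” and “cyclops” are placeholder names for the backup server and client, so substitute your own and verify the session against your NetWorker release:

  # Connect to the backup server's resource database:
  nsradmin -s orilla

  # Then, within the nsradmin session:
  #   nsradmin> . type: NSR client; name: cyclops
  #   nsradmin> update parallelism: 1
  #   nsradmin> quit
  # (Confirm the update when prompted.)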

Anyway, if you’ve previously dismissed the Performance Tuning Guide as being hopelessly out of date, it’s time to go back and re-read it. You might like the changes.

Posted in NetWorker | 4 Comments »

Routine filesystem checks on disk backup units

Posted by Preston on 2009-09-28

On Linux, filesystems typically have two settings that govern when a complete check is forced at boot time. These are:

  • Maximum number of mounts before a check
  • Interval between checks

The default settings, while reasonably suitable for smaller partitions, are very unsuitable for the large partitions you find in disk backup units. In fact, if you don’t pay particular attention to these settings, you may find after a routine reboot that your backup server (or storage node) takes hours to become available. It’s not unheard of for even a sub-20TB DBU environment (say, 10 x 2TB filesystems) to spend several hours completing mandatory filesystem checks after what should have been a routine reboot.

There are two approaches that you can take to this:

  • If you want to leave the checks enabled, it’s imperative to ensure that at most one disk backup unit filesystem will be checked at any one time after a reboot; this will at least limit the duration of any check-on-reboot. Thus, ensure you:
    • Configure each filesystem with a maximum-mounts-before-check value that differs from every other filesystem, and,
    • Configure the interval (in days) between checks to differ significantly from filesystem to filesystem.
  • If you don’t want periodic filesystem checks to ever interfere with the reboot process, you need to:
    • Ensure that, following a non-graceful restart of the server, the DBU filesystems are unmounted and checked before any new backup or recovery activity takes place, and,
    • Ensure that there are processes – planned maintenance windows, if you will – for manually running the filesystem checks that are otherwise being skipped.

Neither option is particularly “attractive”. In the first case, if you cherish uptime or don’t need to reboot your backup server often, you can still end up with multiple filesystems needing checks on reboot because they’ve all exceeded their days-between-checks parameter. In the second instance, you’re having to insert human-driven processes into what should normally be a routine operating system function. In particular, with the manual option there must be a process in place for shutting down NetWorker and running the checks even in the middle of the night if an OS crash occurs.

Actually, the above list is a little limited – there are a couple of other options you can consider as well, though they’re a little more left-field:

  • Build the timings for complete filesystem checks into the change control process, in case they do happen, or
  • Build into the change control process or reboot procedure for the backup server/storage nodes the requirement to temporarily disable filesystem checks (using, say, tune2fs – sketched below) so that you know the reboot won’t be costly in terms of time.
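To make that a little more concrete, here’s a rough sketch of both approaches using tune2fs on ext3 filesystems; the device names are placeholders, and the actual values should obviously be tailored to your environment:

  # Stagger the forced checks so that no two DBU filesystems fall due
  # on the same reboot:
  tune2fs -c 25 -i 90  /dev/sdb1   # check after 25 mounts or 90 days
  tune2fs -c 30 -i 120 /dev/sdc1   # check after 30 mounts or 120 days
  tune2fs -c 35 -i 150 /dev/sdd1   # check after 35 mounts or 150 days

  # Or disable the automatic checks entirely, and rely on manually
  # scheduled e2fsck runs during planned maintenance windows instead:
  tune2fs -c 0 -i 0 /dev/sdb1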

Personally, I’m looking forward to btrfs – in reality, a modern filesystem such as that should solve most, if not all, of the problems discussed above.

Posted in Linux, NetWorker | Comments Off on Routine filesystem checks on disk backup units

Aside – The exciting world of SSD

Posted by Preston on 2009-09-23

The Register has some coverage at the moment of Intel demonstrating a (highly customised/optimised) 7-disk SSD configuration which delivered 1 million IOPS on a desktop system. As the article says, regardless of the level of tweaking required to get there, this is a fabulous example of the world that is to come with SSD.

Clearly this is still a while off from regular commercial use for the “average” business, but regardless, it’s a fascinating development.

When you look at these sorts of figures, it’s easy to see why most storage vendors are getting onto the SSD bandwagon and declaring SSDs to be the “zeroth tier” in storage performance.

Posted in Aside | Comments Off on Aside – The exciting world of SSD

Manual backups and the “quiet” option

Posted by Preston on 2009-08-07

When you run a manual backup in NetWorker (e.g., via the “save” command), NetWorker will by default give you a file-by-file listing of what is being backed up. In theory this is helpful for manual backups, because typically we do manual backups to debug issues, not as part of the production backup process.

If you want to do a manual backup and make it perform as well as possible, there’s an option you need to use: quiet. For the save command, it’s “-q”; for the GUI, it means bringing up a command prompt and learning how to use save*. You can’t (currently, at least) turn off the file-by-file listing in the NetWorker user backup program.
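By way of example (a sketch only: “mars” stands in for your backup server name, and /data for whatever you want backed up):

  # Default manual backup: every file saved is echoed to the terminal.
  save -s mars /data

  # The same backup with the quiet option: no per-file listing.
  save -s mars -q /data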

So, backing up from a Solaris system to a Linux NetWorker server, using gigabit ethernet and the same backup device (ADV_FILE) each time, here are some examples of the impact of viewing the per-file progress of the backup. (Each backup was run three times, with the run-times averaged.)

  1. Backing up 77,822 files:
    • Without per-file listing: 53 minutes, 11 seconds.
    • With per-file listing: 55 minutes, 35 seconds.
  2. Backing up 3,710,475 files:
    • Without per-file listing: 1 hour, 38 minutes, 30 seconds.
    • With per-file listing: 1 hour, 56 minutes, 9 seconds.

The time taken to print each file will depend on the performance of the console device you’re using. The above tests were run from a GigE ssh session from another host to the Sun client. (This problem also affects recoveries: I remember once running a recovery via a Sun serial console where I waited 6 hours for it to complete, only to discover when all the files stopped printing that it had actually finished hours earlier.)

The simple fact is: the more intensively you want to watch the status of a backup (or, for that matter, a recovery), the more of a direct impact you have on its performance.


* Honestly, you should anyway – see here for a good reason.

Posted in NetWorker | Comments Off on Manual backups and the “quiet” option

Aside – Is NetWorker fast enough for my needs?

Posted by Preston on 2009-07-30

Most days my blog stats show at least one search coming into the blog along the lines of “how fast is NetWorker”, etc. It’s understandable: a lot of people selling products other than NetWorker try to push old FUD that it’s not fast enough. Equally, a lot of people who are considering NetWorker are understandably curious as to whether it will be fast enough to suit their needs.

I thought I should write a (brief) piece on this.

To cut to the chase, NetWorker is as fast as your hardware will allow. Yes, there are obviously some software limitations, but that’s true of any backup product.

Looking at the facts though, we can refer back as far as 2003, when NetWorker broke the (let’s call it) “land speed record” for backup by achieving backup performance of 10TB per hour. Most companies now would still be happy with 10TB an hour, but obviously that performance metric was bound by the devices and infrastructure available at the time. These days, it would come out much faster.

I’m currently struggling to find the original Legato piece about this performance record, but my recollection is that it was:

  • Averaging 10TB/h
  • Achieving 2.86GB/s (that’s gigabytes per second, not gigabits per second)
  • Using real customer data

I did find the (very brief) SGI announcement about the speed achieved here. I also found a Sun/Legato presentation here (search for “10TB/h”), and a “press clipping” here.

The net result? Well, I’m not claiming every environment will get that sort of speed, but what I will reasonably confidently assert is that NetWorker will scale to meet your needs, so long as you have budget.

Backup performance isn’t really a p–ssing competition that you want to get into – in reality, if you want to worry about “speeds and feeds”, look at restore performance. NetWorker does admirably there – that 10TB/h filesystem backup restored at 4.5TB/h, and a block level backup run at 7.2TB/h restored at 7.9TB/h.

So the next time someone tries to tell you that “NetWorker isn’t fast enough to be enterprise”, remember one thing: they’re wrong.

Posted in Aside, NetWorker | Comments Off on Aside – Is NetWorker fast enough for my needs?

Aside – That old spindle problem

Posted by Preston on 2009-07-20

When I used to be a system administrator, back when 2GB hard drives were the norm, I remember an Oracle system that only needed about 4GB of storage space, but our Oracle-savvy system administrator configured an environment with around 30 x 2GB drives.

That was my first introduction to spindles and their relationship to performance.

As the years have gone by, and drive capacities have increased, the spindle problem only briefly appeared to go away – not due to capacity, but due to increasing rotational speed and other improved performance characteristics of drives.
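To put some very rough numbers on it (rules of thumb only, not measurements): a single drive delivers somewhere in the order of 100 random IOPS regardless of whether it holds 2GB or 2TB, so aggregate random IO capability scales with the number of spindles rather than with capacity:

  # Back-of-envelope arithmetic only; per-drive IOPS figures are rough.
  echo $(( 30 * 100 ))   # ~30 small drives at ~100 IOPS each => ~3,000 IOPS
  echo $((  2 * 100 ))   # the same capacity on 2 big drives  => ~200 IOPS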

However, perhaps even more so than high performance databases, virtualisation is forcing system administrators to become reacquainted with spindle configuration issues; multi-core systems supporting dozens of virtualised servers create IO problems even within relatively small environments.

If you’re interested in reading more about spindle issues, you may want to check out this article on Search Storage – Get More IOPS per dollar with SSD, 2.5″ Drives. Regardless of the vendors discussed, it’s a good overview of the spindle problem, and if you’re struggling with virtualised (or even database) IO performance and weren’t previously aware of spindle issues, it’s a useful introduction. (As practically all storage vendors are moving into high performance options involving SSDs and/or large numbers of 2.5″ drives, the article is relevant whatever your preferred storage platform.)

(If you’re looking for a backup angle to this posting, consider the following: as you virtualise and/or put in applications and systems with increasingly high demands for IOPS, you affect not only primary production operations, but also protection, maintenance and management functions, including (but not limited to) backups. Sometimes performance issues that exist, but are not yet plaguing production operations, manifest first in backup situations, where there are high-intensity, prolonged sequential reads.)

Posted in Architecture, Aside | 4 Comments »

HSM implications for backup

Posted by Preston on 2009-06-11

If you’ve been following this blog for a while, you’ll know that one key ongoing performance issue I keep coming back to is the cost associated with walking dense filesystems as part of backups.

One area that people sometimes don’t take into consideration is the implications of backing up filesystems that use HSM – Hierarchical Storage Management. In an HSM environment, files are migrated from primary to secondary (or even tertiary) storage based on age and access times. To make this seamless to the user, a small stub file with the same name is left behind on the filesystem, so if a user attempts to access the file, a read from HSM storage is triggered.

So, in order to free up space on primary disk, big files are migrated, with tiny stub files left behind. Over time, more big files are removed, and more tiny files left behind. You may see where I’m heading: this high number of little files can result in performance issues for the backup. Obviously, HSM systems are configured so that they recognise backup agents and the stub is backed up rather than the original file being pulled back, so we’re not concerned about, say, backing up 4TB for a 1TB filesystem with HSM; instead, our concern is that the cost of walking a big filesystem with an inordinately large number of small files will seriously impede the backup process.
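If you want to get a feel for just how expensive that walk is on one of your own filesystems, independent of any actual data transfer, a crude test (a sketch only; substitute your own mount point) is to time a traversal that touches the metadata but reads no file contents:

  # Walk the filesystem without reading any file data; on an HSM-managed
  # filesystem full of stubs, this is where much of the backup time goes.
  time find /path/to/hsm/filesystem -type f | wc -l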

If you’re planning HSM, think very carefully about how you’re going to back up the resulting filesystem.

(Coming soon: demonstrations of the impact of dense filesystems on backup performance.)

Posted in Backup theory, NetWorker | 2 Comments »

Is your backup server fast enough?

Posted by Preston on 2009-02-26

Is your backup server a modern, state-of-the-art machine with high speed disk, significant IO throughput capabilities and ample RAM, so as not to be a bottleneck in your environment?

If not, why?

Given the nature of what it does – supporting systems via backup and recovery – your backup server is, by extension, “part of” your most critical production server(s). I’m not saying that your backup server should be more powerful than any of your production servers, but I am saying that it shouldn’t be a limiting factor on the performance of those production servers.

Let me give you an example – the NetWorker index region. Using Unix for convenience, we’re talking about /nsr/index. This region should be either on drives as fast as those of your fastest production systems, or on something that is still suitably fast.

For instance, in much smaller companies I’ve often seen the production servers with SCSI drives or SCSI JBODs, while the backup server is just a machine with a couple of mirrored SATA drives.

In larger companies, you’ll have the backup server connected to the SAN with the rest of the production systems, but while the production systems get access to 15,000 RPM SCSI drives, the backup server instead gets 7,200 RPM SATA drives (or worse, previously, 5,400 RPM ATA drives).

This is a flawed design for one very important reason: for every file you back up, you need to generate and maintain index data. That is, NetWorker server disk IO occurs in conjunction with backups*.

More importantly, when it comes time to do a recovery and indices must be accessed, do you want to pull index records for, say, 20,000,000 files from slow disk drives or fast disk drives?
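As a quick sanity check (a sketch only, run on the backup server itself), it’s worth seeing just how much index data you’re actually carrying for each client:

  # Show how much client file index data each client consumes; the bigger
  # these figures, the more the speed of the index disks matters.
  du -sh /nsr/index/*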

(Now, as we move towards flash drives for critical performance systems, I’m not going to suggest that if you’re using flash storage for key systems you should also use it for backup systems. There is always a price point at which you have to start scaling back what you want versus what you need. However, in those instances I’d suggest that if you can afford flash drives for critical production systems, you can afford 15,000 RPM SCSI drives for the backup server’s /nsr/index region.)

Where the cost of higher speed drives becomes an issue, another option is to scale back the speed of the individual drives but use more spindles, even if the actual space used on each drive is less than its capacity**.

In that case, for instance, you might have 15,000 RPM drives for your primary production servers, while the backup server’s /nsr/index region resides quite successfully on 7,200 RPM SATA drives, so long as they’re arrayed (no pun intended) in such a way that there are sufficient spindles to make reading back data fast. Equally, in such a situation, hardware RAID (or software RAID on systems with sufficient CPUs and cores to equal or exceed hardware RAID performance) will allow for faster processing of data for writing (e.g., RAID-5 or RAID-3).

In the end, your backup server should be like a butler (or a personal assistant, if you prefer the term) – always there, always ready and able to assist with whatever it is you want done, but never, ever an impediment.


* I see this as a similar design flaw to, say, using 7,200 RPM drives as a copy-on-write snapshot area for 15,000 RPM drives.
** Ah, back in the ‘old’ days, where a database might be spread across 40 x 2GB drives, using only 100 MB from each drive!

Posted in NetWorker, Policies | 2 Comments »