NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist

  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).
  • Enterprise Systems Backup and Recovery

    If you find this blog interesting, and either have an interest in or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on features, policies, procedures and the human element to ensuring that your company has a suitable and working backup system rather than just a bunch of copies made by unrelated software, hardware and processes.

Posts Tagged ‘backup to disk’

Merits of target based deduplication

Posted by Preston on 2009-11-12

There’s no doubt that we have to get smarter about storage. While I’m probably somewhat excessive in my personal storage requirements, I currently have 13TB of storage attached to my desktop machine alone. If I can do that at the desktop, think of what it means at the server level…

As disk capacities continue to increase, we have to work more towards intelligent use of storage rather than continuing the practice of just bolting on extra TBs whenever we want because it’s “easier”.

One of the things we can do to manage storage requirements more intelligently, for both operational and support production systems, is to deploy deduplication where it makes sense.

That being said, the real merits of target based deduplication become most apparent when we compare it to source based deduplication, which is where the majority of this article will now take us.

A lot of people are really excited about source level deduplication, but like so many areas in backup, it’s not a magic bullet. In particular, I see proponents of source based deduplication start waving magic wands consisting of:

  1. “It will reduce the amount of data you transmit across the network!”
  2. “It’s good for WAN backups!”
  3. “Your total backup storage is much smaller!”

While each of these claims is true, they all come with big ‘buts’. From the outset, I don’t want it said that I’m vehemently opposed to source based deduplication; however, I will say that target based deduplication often has greater merits.

The first item shouldn’t always be seen as a glowing recommendation. Indeed, it should only come into play if the network is a primary bottleneck – and that’s far more likely to be the case for WAN based backups than for regular backups.

In regular backups, while there may be some benefit to reducing the amount of data transmitted, what you’re often not told is that this reduction comes at a cost: increased processor and/or memory load on the clients. Source based deduplication naturally has to shift some of the processing load onto the client – otherwise the data would be transmitted anyway and simply thrown away at the other end, and proponents couldn’t claim that you transmit less data by backing up this way.
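To see where that extra client-side work comes from, here’s a minimal sketch of fixed-size chunk hashing, the sort of processing a source based deduplication agent has to perform before it can decide what not to send. It’s purely illustrative – it isn’t modelled on any particular vendor’s implementation, and real products use smarter (and more expensive) variable-length chunking.

    import hashlib

    def dedupe_chunks(path, chunk_size=128 * 1024, seen=None):
        """Read a file in fixed-size chunks and hash each one.

        Returns (chunks_total, chunks_new). Only the 'new' chunks would be
        transmitted by a source based deduplication client; the hashing and
        index lookups are the extra CPU/memory load carried by the client.
        """
        seen = set() if seen is None else seen
        chunks_total = chunks_new = 0
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                chunks_total += 1
                digest = hashlib.sha1(chunk).digest()
                if digest not in seen:
                    seen.add(digest)
                    chunks_new += 1   # only this chunk would cross the network
        return chunks_total, chunks_new

    # e.g. total, new = dedupe_chunks('/var/log/messages')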

So number one, if someone is blithely telling you that you’ll push less data across your network, ask yourself the following questions:

(a) Do I really need to push less data across the network? (I.e., is the network the bottleneck at all?)

(b) Can my clients sustain a 10% to 15% load increase in processing requirements during backup activities?

This makes the first advantage of source based deduplication somewhat less tangible than it normally comes across as.

Onto the second proposed advantage of source based deduplication – faster WAN based backups. Undoubtedly, this is true, since we don’t have to ship anywhere near as much data across the network. However, consider that we back up in order to recover. You may be able to reduce the amount of data you send across the WAN during backups, but unless you plan very carefully you may put yourself into a situation where recoveries aren’t all that useful. That is, you need to be careful to avoid trickle based recoveries. This often means it’s necessary to put a source based deduplication node in each WAN connected site, with those nodes replicating to a central location. What’s the problem with this? Well, none from a recovery perspective – but it can considerably blow out the cost. Again, informed decisions are very important to counter-balance source based deduplication hyperbole.
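To put some rough numbers on why trickle based recoveries hurt, here’s a back-of-the-envelope calculation; the link speed, data size and efficiency figure are assumptions for illustration only.

    def recovery_hours(data_gb, link_mbit_per_sec, efficiency=0.7):
        """Rough time to pull data_gb back across a WAN link.

        'efficiency' allows for protocol overhead and link contention; all
        figures here are illustrative assumptions, not measurements.
        """
        usable_mbit = link_mbit_per_sec * efficiency
        seconds = (data_gb * 8 * 1024) / usable_mbit
        return seconds / 3600

    # Recovering a 500 GB file server over a 20 Mbit/s link: ~81 hours,
    # i.e. well over three days before the data is back on site.
    print(round(recovery_hours(500, 20), 1))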

Finally – “your total backup storage is much smaller!”. This is true, but it’s equally an advantage of target based deduplication; while the deduplication ratios may vary between the two approaches, the storage savings are substantial either way.

Now let’s look at a couple of other factors of source based deduplication that aren’t always discussed:

  1. Depending on the product you choose, you may get less OS and database support than you’re getting from your current backup product.
  2. The backup processes and clients will change, sometimes quite considerably, depending on whether your vendor supports integrating deduplication backups with your current backup environment, or whether you need to change products entirely.

It’s when we look at those two concerns that target based deduplication really starts to shine. You still get deduplication, but with significantly less disruption to your environment and your processes.

Regardless of whether target based deduplication is integrated into the backup environment as a VTL, or as a traditional backup to disk device, you’re not changing how the clients work. That means whatever operating systems and databases you’re currently backing up, you’ll be able to continue to back up, and you won’t end up in the (rather unpleasant) situation of having different products for different parts of your backup environment – that’s hardly a holistic approach. It may also be the case that the hosts where you’d get the most out of deduplication aren’t eligible for source based deduplication at all; again, something that won’t happen with target based deduplication.

The changes for integrating target based deduplication into your environment are quite small – you just change where you’re sending your backups, and let the device(s) handle the deduplication, regardless of what operating system, database, application or type of data is being sent. Now that’s seamless.

Equally so, you don’t need to change your backup processes for your current clients – if it’s not broken, don’t fix it, as the saying goes. While this can be seen by some as an argument for stagnation, it’s not; change for the sake of change is not always appropriate, whereas predictability and reliability are very important factors to consider in a data protection environment.

Overall, I prefer target based deduplication. It integrates better with existing backup products, reduces the number of changes required, and does not place restrictions on the data you’re currently backing up.


Posted in Backup theory, NetWorker | 5 Comments »

NetWorker on Linux – Ditching ext3 for xfs

Posted by Preston on 2009-11-05

Recently, when I made an exasperated post about lengthy ext3 check times and how much I was looking forward to btrfs, Siobhán Ellis pointed out that there was already a filesystem available for Linux that met a lot of my needs – particularly in the backup space, where I’m after:

  • Being able to create large filesystems that don’t take exorbitantly long to check
  • Being able to avoid checks on abrupt system resets
  • Speeding up the removal of files when staging completes or large backups abort

That filesystem of course is XFS.

I’ve recently spent some time shuffling data around and presenting XFS filesystems to my Linux lab servers in place of ext3, and I’ll fully admit that I’m horribly embarrassed I hadn’t thought to try this out earlier. If anything, I’m stuck looking for the right superlative to describe the changes.

Case in point – I was (and indeed still am) doing some testing where I need to generate >2.5TB of backup data from a 32-bit Windows client for a single saveset. As you can imagine, not only does this take a while to generate, but it also takes a while to clear from disk. I had got about 400 GB into the saveset the first time I was testing when I realised I’d made a mistake with the setup, so I needed to stop and start again. On an ext3 filesystem, it took more than 10 minutes after cancelling the backup before the saveset had been fully deleted. It may have taken longer – I gave up waiting at that point, went to another terminal to do something else and lost track of how long it actually took.

It was around that point that I recalled having XFS recommended to me for testing purposes, so I downloaded the extra packages required to use XFS within CentOS and reformatted the ~3TB filesystem as XFS.

The next test that I ran aborted due to a (!!!) comms error 1.8TB through the backup. Guess how long it took to clear the space? No, seriously, guess – because I couldn’t log onto the test server fast enough to actually see the space clearing. The backup aborted, and the space was suddenly back again. That’s a 1.8TB file deleted in seconds.

That’s the way a filesystem should work.

I’ve since done some nasty mid-operation power-cycle tests (in VMs), and the XFS filesystems come back up practically instantaneously – no extended check sessions that make you want to cry in frustration.

If you’re backing up to disk on Linux, you’d be mad to use anything other than XFS as your filesystem. Quite frankly, I’m kicking myself that I didn’t do this years ago.
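If you want a rough feel for the difference on your own hardware, a sketch like the following will do it – the size and path are placeholder assumptions, and all it does is write a large file with real (non-sparse) blocks and then time its removal.

    import os
    import time

    def time_large_delete(path, size_gb=10, block=8 * 1024 * 1024):
        """Write size_gb of real (non-sparse) data to path, then time its removal.

        The path and size are placeholders; point it at a scratch area on the
        filesystem you want to test, and compare the result across ext3 and XFS.
        """
        buf = os.urandom(block)
        with open(path, 'wb') as f:
            for _ in range(size_gb * 1024 ** 3 // block):
                f.write(buf)
            f.flush()
            os.fsync(f.fileno())
        start = time.time()
        os.unlink(path)
        return time.time() - start

    # e.g. print(time_large_delete('/d/backup1/deltest.dat', size_gb=10))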

Posted in Linux, NetWorker | 8 Comments »

Design considerations for ADV_FILE devices

Posted by Preston on 2009-06-06

Introduction

When choosing to deploy backup to disk using ADV_FILE devices (instead of, say, VTLs), there are some design considerations you should keep in mind. It’s easy to just go in and start creating devices willy-nilly, with the usual consequence being poor performance and insufficient maintenance windows at some later date.

NetWorker doesn’t care what sort of physical devices (in either layout or connectivity) you place your ADV_FILE devices on; on one of my lab servers, for instance, I have 3 x 1TB USB2 drives connected, each providing approximately 917GB of formatted disk backup capacity. Now, this is something I’d not recommend or even contemplate deploying in a production environment – but as I said, it’s a lab server, so my goal is to have copious amounts of space cheaply, not high performance.

There are three layers of design factors you need to take into consideration:

  • Physical LUN layout/connectivity
  • Presented filesystem types and sizes
  • Ongoing maintenance

If you deploy disk backup without thinking about these three factors – without planning them – then at some point you’re going to come a cropper. So, let’s go through each of them.

Physical LUN layout/connectivity

Except in lab environments where you can afford, at any point, to lose all content on disk backup units, you’ll need to have some form of redundancy on the disk backup units. It’s easy for businesses to … resent … having to spend money on redundancy, and I’m afraid that no-one will be able to make a coherent argument to me that it’s appropriate to run production backups to unprotected disk.

Assuming therefore that sanity prevails, and redundancy is designed into the system, care and consideration has to be given to laying out LUNs and connectivity in such a way as to maximise throughput.

Probably the single best metric to design against is this: the physical layout and connectivity must allow reads from the disk backup units to exceed the performance of however many tape drives will be written to during cloning. That is, if your intent is to be able to clone from disk backup to at least 2 x LTO-3 drives simultaneously, your design needs to sustain read performance of around 320 MB/s. Obviously, the design should also allow for simultaneous writes (i.e., backups) while achieving those cloning objectives.
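As a worked example of that metric – assuming LTO-3’s commonly quoted throughput of around 160 MB/s with compression, and an arbitrary figure for concurrent incoming backups:

    def required_read_mb_s(tape_drives, drive_mb_s=160):
        """Read rate the disk backup units must sustain to keep the
        nominated number of tape drives streaming during cloning."""
        return tape_drives * drive_mb_s

    def required_aggregate_mb_s(tape_drives, backup_write_mb_s, drive_mb_s=160):
        """Cloning reads plus concurrent backup writes: the combined load the
        underlying LUNs and connectivity have to be designed for."""
        return required_read_mb_s(tape_drives, drive_mb_s) + backup_write_mb_s

    # 2 x LTO-3 drives for cloning, plus (say) 200 MB/s of incoming backups:
    print(required_read_mb_s(2))            # 320 MB/s of reads
    print(required_aggregate_mb_s(2, 200))  # 520 MB/s aggregate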

This need for speed affects both physical connectivity of disk as well as the layout of the LUNs presented to the host, and by layout I refer to both RAID level and number of spindles.

Presented filesystem types and sizes

Depending on the operating system being used for the backup host, the actual filesystem type selection may be somewhat limited. For example, on Windows NT based systems, there’s a very strong chance you’ll be using NTFS. (Obviously, Veritas Storage Foundation might be another option.) For Unix style operating systems, there will usually be a few more choices.

Within NetWorker, individual savesets are written as monolithic files to ADV_FILE devices. This means you don’t necessarily need to support, say, millions of files on the ADV_FILE devices, but you do need to support very large individual files.

My first concern therefore is to ensure that the filesystem selected is fast when it comes to a lesser considered activity – checking and error correction following a crash or unexpected reboot. To give you a simplistic example, when considering non-extent based filesystems, making a choice between journalled and non-journalled should be a “no-brainer”. So long as data integrity is not an issue*, you should always ensure that you pick the fastest checking/healing filesystem that also meets operational performance requirements.

Moving on to size, I usually follow the rule that any ADV_FILE device should be large enough to hold two copies of the largest saveset that could conceivably be written to it. Obviously there’ll be exceptions, and due to various design considerations this may mean there are some savesets you’ll have to consider sending direct to tape (either physical or virtual), but it’s a good starting rule.
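Expressed trivially in code – assuming you know, or can estimate, the largest saveset likely to land on a device:

    def minimum_advfile_size_gb(largest_saveset_gb, copies=2):
        """Rule of thumb from above: each ADV_FILE device should be able to
        hold at least two copies of the largest saveset it might receive."""
        return largest_saveset_gb * copies

    def savesets_for_direct_to_tape(saveset_sizes_gb, device_size_gb):
        """Savesets for which the rule can't hold on the given device size --
        candidates for going direct to (physical or virtual) tape instead."""
        return [s for s in saveset_sizes_gb
                if minimum_advfile_size_gb(s) > device_size_gb]

    # e.g. with 4 TB ADV_FILE devices, a 2.5 TB saveset breaks the rule:
    print(savesets_for_direct_to_tape([300, 800, 2500], 4096))   # -> [2500]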

You also have to keep in mind the selection criteria NetWorker uses for picking the next volume to be written to. For instance, in standard configurations it’s a good idea to set “target sessions” to 1 on all disk backup devices. That way, new savesets achieve as close as possible to a round-robin distribution.

However, bear in mind that when all devices are idle and a new round of backups starts, NetWorker always picks the oldest labelled, non-empty volume to write to first, and works backwards from there. This, unfortunately, is (for want of a better description) a stupid selection criterion for backup to disk. (It’s entirely appropriate for backup to tape.) The implication is that your disk backup units will typically “fill” in order of oldest labelled through to most recently labelled, and the first labelled disk backup unit often gets a lot more attention than the others. Thus, if you’re going to have disk backup units of differing sizes, try to keep the “oldest” ones the largest, and remember that if you relabel a disk backup unit, it jumps to the back of the queue.
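A toy model of that selection behaviour (this is not NetWorker code – just the “oldest labelled volume first” rule described above, with everything else ignored) shows how skewed the fill pattern can become:

    import random

    def simulate_fill(device_sizes_gb, savesets=100):
        """Toy model: every saveset goes to the oldest labelled volume that
        still has room, per the selection behaviour described above. Real
        NetWorker behaviour also depends on sessions, pools and staging --
        this is only meant to show the skew, not to be accurate.
        """
        used = [0.0] * len(device_sizes_gb)       # index 0 = oldest labelled
        for _ in range(savesets):
            size = random.uniform(5, 50)          # arbitrary saveset sizes (GB)
            for i, capacity in enumerate(device_sizes_gb):
                if used[i] + size <= capacity:
                    used[i] += size
                    break
        return [round(u) for u in used]

    # Three equally sized 2 TB devices: the oldest fills almost completely
    # before the second sees much data, and the third may stay empty.
    print(simulate_fill([2048, 2048, 2048]))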

Ultimately, it’s a careful balancing act you have to maintain – if you make your disk backup units too small, some savesets may never fit on them at all, or the units may fill too frequently during backups, forcing staging.

On the other hand, if you make the disk backup units too large, you may find yourself in an unpleasant situation where the host owning the disk backup devices takes an unacceptably long time checking filesystems when it comes up following particular reboots. This is not something to be taken lightly: consider how a comprehensive and uninterruptible check of a 10TB filesystem on reboot might impact an SLA requiring recovery of Tier-1 data to start within 15 minutes of the request being made!

Not only that, given the serial nature of certain disk backup operations (e.g., cloning or staging), you can’t afford a situation where recoveries can’t run for say, 8 hours, because 10TB of data is being staged or cloned**.

Thus, for a variety of reasons, it’s quite unwise to design a system with a single, large/monolithic ADV_FILE device. Disk backup volumes should be spread across as many ADV_FILE devices as possible within the hardware configuration.

Ongoing maintenance

For backup systems that need 24×7 availability, there should be one rule here to follow: your design must support at least one disk backup unit being offline at any time.

Such a design allows backup, recovery, cloning and staging operations to continue even in the event of maintenance. These maintenance operations would include, but not be limited to, any of the following:

  • Evacuation of disk backup units to replace underlying disks and increase capacity (e.g., replacing 5 x 500GB disks with 5 x 1TB disks, etc.)
  • Evacuation of disk backup units to reformat the hosting filesystem to compensate for degraded performance from gradual fragmentation***.
  • Large-scale ad-hoc backups outside of the regular backup routine that require additional space.
  • Connectivity path failure or even (in a SAN), tray failure.

(In short, if you can’t perform maintenance on your disk backup environment, then it’s not designed correctly.)

In summary

It’s possible you’ll look at this list of considerations and want to throw your hands up in defeat, thinking that ADV_FILE backups are too difficult. That’s certainly not the point. If anything, it’s quite the opposite – ADV_FILE backups are too easy, in that they allow you to start backing up without having considered any of the above details, and it’s that ease of use that ultimately gets people into trouble.

If planned correctly from the outset, however, ADV_FILE devices will serve you well.


* Let’s face it – there shouldn’t be any filesystem where you have to question data integrity! However, I’ve occasionally seen some crazy “bleeding edge” designs – e.g., backing up to ext3 on Linux before it was (a) officially released as a stable filesystem or (b) supported by EMC/Legato.

** This is one of the arguments for VTLs within NetWorker – by having lots of small virtual tapes, the chances of a clone or stage operation blocking a recovery are substantially reduced. While I agree this is the case, I also feel it’s an artificial need based on implemented architecture rather than theoretical architecture.

*** The frequency with which this is required will of course greatly depend on the type of filesystem the disk backup units are hosted on.

Posted in Architecture, Backup theory, NetWorker | 12 Comments »