NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist

  • This blog has moved!

    This blog has now moved to Please jump across to the new site for the latest articles (and all old archived articles).



  • Enterprise Systems Backup and Recovery

    If you find this blog interesting, and either have an interest in or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on features, policies, procedures and the human element of ensuring that your company has a suitable, working backup system rather than just a bunch of copies made by unrelated software, hardware and processes.

Archive for the ‘Backup theory’ Category

Pictorial representation of management happiness over OpEx cost savings

Posted by Preston on 2009-12-16

There is a profoundly important backup adage that should be front and centre in the mind of any backup administrator, operator, manager, etc. This is:

It is always better to backup a little more than not quite enough.

This isn’t an invitation to wildly waste backup capacity on useless copies that can never be recovered from – or needlessly generate unnecessary backups. However, it should serve as a constant reminder that if you keep shaving important stuff out of your backups, you’ll eventually suffer a Titanic issue.

Now, people at this point often tell me that (a) they're being told to reduce the amount of data being backed up, (b) it makes their manager happy to be able to report less OpEx budget required, (c) some combination of the two, or (d) they're reluctant to ask for additional budget for storage media.

The best way to counter these oppressive and dangerous memes is to draw up a graph of how happy your manager(s) will be over saving a few hundred dollars here and there on media versus potential recovery issues. To get you started, I’ve drawn up one which covers a lot of sites I’ve encountered over the years:

Manager happiness vs state of environment and needless cost savings on backup

You see, it's easy to be happy about saving a few dollars here and there on backup media in the here and now, when your backups are running and you don't need to recover.

However, as soon as the need for a recovery starts to creep in, previous happiness over saving a few hundred dollars rapidly evaporates in direct proportion to the level of the data loss. There might be minimal issues to a single lost document or email, but past that things start to get rather hairy. In fact, it’s very easy to switch from 100% management happiness to 100% management disgruntlement within the space of 24 hours in extreme situations.

You, as a backup administrator, may already be convinced of this. (I would hope you are.) Sometimes though, other staff or managers may need reminding that they too may be judged by more senior management on the recoverability of systems under their supervision, so this graph applies equally to them. That continues right up the chain, further reinforcing the fact that backups are an activity that belongs to the entire company, not just IT, and therefore a financial concern that needs to be budgeted for by the entire company.

Posted in Architecture, Backup theory | Tagged: , , , | 4 Comments »

15 crazy things I never want to hear again

Posted by Preston on 2009-12-14

Over the years I’ve dealt with a lot of different environments, and a lot of different usage requirements for backup products. Most of these fall into the “appropriate business use” categories. Some fall into the “hmmm, why would you do that?” category. Others fall into the “please excuse my brain it’s just scuttled off into the corner to hide – tell me again” category.

This is not about the people, or the companies, but about the crazy ideas that sometimes take hold within companies and should be watched for. While I could have expanded this list to cover a raft of other things outside of backups, I've forced myself to keep it to the backup process.

In no particular order then, these are the crazy things I never want to hear again:

  1. After the backups, I delete all the indices, because I maintain a spreadsheet showing where files are, and that’s much more efficient than proprietary databases.
  2. We just backup /etc/passwd on that machine.
  3. But what about /etc/shadow? (My stupid response to the above statement, blurted out after my brain stalled in response to statement #2)
  4. Oh, hadn’t thought about that (In response to #3).
  5. Can you fax me some cleaning cartridge barcodes?
  6. To save money on barcodes at the end of every week we take them off the tapes in the autochanger and put them on the new ones about to go in.
  7. We only put one tape in the autochanger each night. We don’t want <product> to pick the wrong tape.
  8. We need to upgrade our tape drives. All our backups don’t fit on a single tape any more. (By same company that said #7.)
  9. What do you mean if we don’t change the tape <product> won’t automatically overwrite it? (By same company that said #7 and #8.)
  10. Why would I want to match barcode labels to tape labels? That’s crazy!
  11. That’s being backed up. I emailed Jim a week ago and asked him to add it to the configuration. (Shouted out from across the room: “Jim left last month, remember?”)
  12. We put disk quotas on our academics, but due to government law we can’t do that to their mail. So when they fill up their home directories, they zip them up and email it to themselves then delete it all.
  13. If a user is dumb enough to delete their file, I don’t care about getting it back.
  14. Every now and then on a Friday afternoon my last boss used to delete a filesystem and tell us to have it back by Monday as a test of the backup system.
  15. What are you going to do to fix the problem? (Final question asked by an operations manager after explaining (a) robot was randomly dropping tapes when picking them from slots; (b) tapes were covered in a thin film of oily grime; (c) oh that was probably because their data centre was under the area of the flight path where planes are advised to dump excess fuel before landing; (d) fuel is not being scrubbed by air conditioning system fully and being sucked into data centre; (e) me reminding them we just supported the backup software.)

I will say that numbers #1 and #15 are my personal favourites for crazy statements.

Posted in Backup theory, General Technology, Policies, Quibbles | Tagged: | 1 Comment »

This holiday season, give your inner geek a gift

Posted by Preston on 2009-12-11

As it approaches that time for giving, it’s worth pointing out that with just a simple purchase, you can simultaneously give yourself and me a present. I’m assuming regular readers of the blog would like to thank me, and the best thanks I could get this year would be to get a nice spike in sales in my book before the end of the year.

“Enterprise Systems Backup and Recovery: A Corporate Insurance Policy” is a book aimed not just at companies that are only now starting to look at implementing a comprehensive backup system. It’s equally aimed at companies who are already doing enterprise backup and need that extra direction to move from a collection of backup products to an actual backup system.

What’s a backup system? At the most simple, it’s an environment that is geared towards recovery. However, it’s not just having the right software and the right hardware – it’s also about having:

  • The right policies
  • The right procedures
  • The right people
  • The right attitude

Most organisations actually do pretty well in relation to getting the right software and the right hardware. However, that’s only about 40% of achieving a backup system. It’s the human components – that remaining 60% – that are far more challenging and important to get right. For instance, at your company:

  • Are backups seen as an “IT” function?
  • Are backups assigned to junior staff?
  • Are results not checked until there’s a recovery required?
  • Are backups only tested in an ad hoc manner?
  • Are recurring errors that aren’t really errors tolerated?
  • Are procedures for requesting recoveries ad hoc?
  • Are backups thought of after systems are added or expanded?
  • Are backups highly limited to “save space”?
  • Is the backup server seen as a “non-production” server?

If the answer to even a single one of those questions is yes, then your company doesn’t have a backup system, and your ability to guarantee recoverability is considerably diminished.

Backup systems, by integrating the technical and the human aspect of a company, provide a much better guarantee of recoverability than a collection of untested random copies that have no formal procedures for their creation and use.

And if the answer to even a single one of those questions is yes, you’ll get something useful and important out of my book.

So, if you’re interested in buying the book, you can grab it from Amazon using this link, or from the publisher, CRC Press, using this link.

Posted in Backup theory, General thoughts | Tagged: , , | 4 Comments »

Funny attitude adjustments

Posted by Preston on 2009-12-08

It’s funny sometimes seeing attitude adjustments that come from various companies as they’re acquired by others.

One could never say that EMC has been a big fan of tape (I’ve long since given up any hopes of them actually telling the 100% data protection story and buying a tape company), but at least they’ve tended to admit that tape is necessary over the years.

So this time the attitude adjustment now seems to be coming from Data Domain as they merge into the backup and recovery division at EMC following the acquisition. Over at SearchStorage, we have an article by Christine Cignoli called “Data deduplication goes mainstream, but tape lives on”, which has this insightful quote:

Even Shane Jackson, director of product marketing at Data Domain, agrees. “We’ve never gone to the extreme of ‘tape is dead,'” he said. “As an archive medium, keeping data for seven years for HIPAA compliance in a box on a shelf is still a reasonable thing to do.”

That’s interesting, I could have sworn I have a Data Domain bumper sticker that says this:

Data Domain Bumper Sticker

Now don’t get me wrong, I’m not here to rub salt into Data Domain’s wounds, but I would like to take the opportunity to point out that tape has been killed off more times than the iPhone. So next time an up-and-coming company trumpets its “tape is dead” story, and some bright-eyed, eager and naïve journalist reports on it, remember that they always come around … eventually.

Posted in Backup theory, General Technology | Tagged: , | 2 Comments »

How complex is your backup environment?

Posted by Preston on 2009-12-07

Something I’ve periodically mentioned to various people over the years is that when it comes to data protection, simplicity is King. This can be best summed up with the following rule to follow when designing a backup system:

If you can’t summarise your backup solution on the back of a napkin, it’s too complicated.

Now, the first reaction a lot of people have to that is “but if I do X and Y and Z and then A and B on top, then it’s not going to fit, but we don’t have a complex environment”.

Well, there are two answers to that:

  1. We’re not talking a detailed technical summary of the environment, we’re talking a high level overview.
  2. If you still can’t give a high level overview on the back of a napkin, it is too complicated.

Another way to approach the complexity issue, if you happen to have a phobia about using the back of a napkin: if you can’t give a 30 second elevator summary of your solution, it’s too complicated.

If you’re struggling to think of why it’s important you can summarise your solution in such a short period of time, or such limited space, I’ll give you a few examples:

  1. You need to summarise it in a meeting with senior management.
  2. You need to summarise it in a meeting with your management and a vendor.
  3. You’ve got 5 minutes or less to pitch getting an upgrade budget.
  4. You’ve got a new assistant starting and you’re about to go into a meeting.
  5. You’ve got a new assistant starting and you’re about to go on holiday.
  6. You’ve got consultant(s) (or contractors) coming in to do some work and you’re going to have to leave them on their own.
  7. The CIO asks “so what is it?” as a follow-up question when (s)he accosts you in the hallway and asks, “Do we have a backup policy?”

I can think of a variety of other reasons, but the point remains – a backup system should not be so complex that it can’t be easily described. That’s not to say that it can’t (a) do complex tasks or (b) have complex components, but if the backup administrator can’t readily describe the functioning whole, then chances are there is no functioning whole, just a whole lot of mess.

Posted in Backup theory, General thoughts, Policies | Tagged: , , , , , | Comments Off on How complex is your backup environment?

Enhancing NetWorker Security: A theoretical architecture

Posted by Preston on 2009-11-18

It’s fair to say that no one backup product can be all things to all people. More generally, it’s fair to say that no product can be all things to all people.

Security has had a somewhat interesting past in NetWorker; much of the attention to security has, for a lot of the time, been to do with (a) defining administrators, (b) ensuring clients are who they say they are, and (c) providing access controls for directed recoveries.

There are a bunch of areas, though, where NetWorker security has remained somewhat lacking. Not 100% lacking, just not complete. For instance, user accounts that are accessed for the purposes of module backup and recovery frequently need higher levels of authority than standard users. Equally, some sites want their <X> admins to be able to control as much as possible of the <X> backups, but not to have any administrator privileges over the <Y> backups. I’d like to propose an idea that, if implemented, would both improve security and make NetWorker more flexible.

The change would be to allow the definition of administrator zones. An “administrator zone” would be a subset of a datazone. It would consist of:

  1. User groups:
    • A nominated “administrator” user group.
    • A nominated “user” user group.
    • Any other number of nominated groups with intermediate privileges.
  2. A collection of the following:
    • Clients
    • Groups
    • Pools
    • Devices
    • Schedules
    • Policies
    • Directives
    • etc

These obviously would still be accessible in the global datazone for anyone who is a datazone administrator. Conceptually, this would look like the following:

Datazone with subset "administrator" zones

The first thing this should point out to you is that administrator zones could, if desired, overlap. For instance, in the above diagram we have:

  1. Minor overlap between Windows and Unix admin zones (e.g., they might both have administrative rights over tape libraries).
  2. Overlap between Unix and Oracle admin zones.
  3. Overlap between Windows and Oracle admin zones.
  4. Overlap between Windows and Exchange admin zones.
  5. Overlap between Windows and MSSQL admin zones.

Notably though, the DMZ Admin zone indicates that you can have some zones that have no overlap/commonality with other zones.

There’d need to be a few rules established in order to make this work. These would be:

  1. Only the global datazone can support “<x>@*” user or group definitions in a user group.
  2. If there is overlap between two zones, then the user will inherit the rights of the highest authority they belong to. I.e., if a user is editing a shared feature between the Windows and Unix admin zones, and is declared an admin in the Unix zone, but only an end-user in the Windows zone, then the user will edit that shared feature with the rights of an admin.
  3. Similarly to the above, if there’s overlap between privileges at the global datazone level and a local administrator zone, the highest privileges will “win” for the local resource.
  4. Resources can only be created and deleted by someone with data zone administrator privileges.
  5. Updates for resources that are shared between multiple administrator zones need to be “approved” by an administrator from each administrator zone that overlaps or a datazone administrator.
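To make rules 2 and 3 concrete, here's a minimal sketch of how "highest authority wins" resolution might work across overlapping zones. This is purely illustrative Python – NetWorker exposes no such API, and every name here (the `Privilege` levels, `effective_privilege`, the example zones) is hypothetical:

```python
from enum import IntEnum

class Privilege(IntEnum):
    """Ordered privilege levels: a higher value means more authority."""
    USER = 1
    INTERMEDIATE = 2
    ADMIN = 3

def effective_privilege(user, resource, zones, global_admins):
    """Resolve a user's effective privilege over a (possibly shared) resource.

    zones is a list of (zone_name, resources, members) tuples, where
    members maps a user name to that user's Privilege within the zone.
    Returns None if the user has no rights over the resource at all.
    """
    # Rule 3: global datazone administrators always win.
    if user in global_admins:
        return Privilege.ADMIN
    # Rule 2: across overlapping zones, the highest authority applies.
    levels = [members[user]
              for _name, resources, members in zones
              if resource in resources and user in members]
    return max(levels) if levels else None

# Example: 'sam' is a Unix admin but only a Windows end-user; when editing
# the shared tape library resource, the Unix admin rights apply.
zones = [
    ("Unix", {"tape-library", "unix-clients"}, {"sam": Privilege.ADMIN}),
    ("Windows", {"tape-library", "win-clients"}, {"sam": Privilege.USER}),
]
```

The point of modelling it this way is that privilege resolution stays a pure function of zone membership – nothing needs to be stored per-resource.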

Would this be perfect? Not entirely – for instance, it would still require a datazone administrator to create the resources that are then allocated to an administrator zone for control. However, this would prevent a situation occurring where an unprivileged user with “create” options could go ahead and create resources they wouldn’t have authority over. Equally, in an environment that permits overlapping zones, it’s not appropriate for someone from one administrator zone to delete a resource shared by multiple administrator zones. Thus, for safety’s sake, administrator zones should only concern themselves with updating existing resources.

How would the approval process work for edits of resources that are shared by overlapping zones? To start with, the resource that has been updated would continue to function “as is”, and a “copy” would be created (think of it as a temporary resource), with a notification used to trigger a message to the datazone administrators and the other, overlapping administrators. Once the appropriate approval has been done (e.g., an “edit” process in the temporary resource), then the original resource would be overwritten with the temporary resource, and the temporary resource removed.
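As a thought experiment, the copy-approve-overwrite flow just described could be modelled like this (again purely illustrative Python; the class and its behaviour are hypothetical, not anything NetWorker provides):

```python
class SharedResourceEdit:
    """Models the proposed workflow for editing a resource shared by
    overlapping admin zones: the live resource keeps functioning as-is
    while a temporary copy holds the pending change, which is applied
    only once every overlapping zone (or a datazone admin) approves."""

    def __init__(self, resource, change, approvers_needed):
        self.resource = resource                # live resource, untouched for now
        self.pending = {**resource, **change}   # temporary copy with the edit applied
        self.approvers_needed = set(approvers_needed)
        self.approved_by = set()

    def approve(self, zone):
        """Record an approval; apply the edit once all approvals are in."""
        self.approved_by.add(zone)
        if self.approvers_needed <= self.approved_by:
            self.resource.update(self.pending)  # overwrite original with the copy
            return True                         # edit has gone live
        return False                            # still awaiting other zones
```

The key property is that a half-approved edit never affects the running configuration – the original resource is only touched at the final approval.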

So what sort of extra resources would we need to establish this? Well, we’ve already got user groups, which is a starting point. The next step is to define an “admin zone” resource, which has fields for:

  1. Administrator user group.
  2. Standard user group.
  3. “Other” user groups.
  4. Clients
  5. Groups
  6. Pools
  7. Schedules
  8. Policies
  9. Directives
  10. Probes
  11. Lockboxes
  12. Notifications
  13. Labels
  14. Staging Policies
  15. Devices
  16. Autochangers
  17. etc.

In fact, pretty much every resource except for the server resource itself, and licenses, should be eligible for inclusion in a localised admin zone. In its most basic form, you might expect to see the following:

nsradmin> print type: NSR admin zone; name: Oracle
                    type: NSR admin zone;
                    name: Oracle;
          administrators: Oracle Admins;
                   users: Oracle All Users;
       other user groups: ;
                 clients: delphi, pythia;
                  groups: Daily Oracle FS, Monthly Oracle FS,
                          Daily Oracle DB, Monthly Oracle DB;
                   pools: ;
               schedules: Daily Oracle, Monthly Oracle;
                policies: Oracle Daily, Oracle Monthly;
              directives: pythia exclude oracle, delphi exclude oracle;

To date, NetWorker’s administration focus has been far more global. If you’re an administrator, you can do anything to any resource. If you’re a user, you can’t do much with any resource. If you’ve been given a subset of privileges, you can use those privileges against all resources touched by those privileges.

An architecture that worked along these lines would allow for much more flexibility in terms of partial administrative privileges in NetWorker – zones of resources and local administrators for those resources would allow for more granular control of configuration and backup functionality, while still keeping NetWorker configuration maintained at the central server.

Posted in Architecture, Backup theory, NetWorker, Security | Tagged: , , , , | 2 Comments »

Merits of target based deduplication

Posted by Preston on 2009-11-12

It goes without saying that we have to get smarter about storage. While I’m probably somewhat excessive in my personal storage requirements, I currently have 13TB of storage attached to my desktop machine alone. If I can do that at the desktop, think of what it means at the server level…

As disk capacities continue to increase, we have to work more towards intelligent use of storage rather than continuing the practice of just bolting on extra TBs whenever we want because it’s “easier”.

One of the things that we can do to more intelligently manage storage requirements for either operational or support production systems is to deploy deduplication where it makes sense.

That being said, the real merits of target based deduplication become most apparent when we compare it to source based deduplication, which is where the majority of this article will now take us.

A lot of people are really excited about source level deduplication, but like so many areas in backup, it’s not a magic bullet. In particular, I see proponents of source based deduplication start waving magic wands consisting of:

  1. “It will reduce the amount of data you transmit across the network!”
  2. “It’s good for WAN backups!”
  3. “Your total backup storage is much smaller!”

While each of these facts is true, they all come with big “buts”. From the outset, I don’t want it said that I’m vehemently opposed to source based deduplication; however, I will say that target based deduplication often has greater merits.

For the first item, this shouldn’t always be seen as a glowing recommendation. Indeed, it should only come into play if the network is a primary bottleneck – and that’s more likely going to be the case if doing WAN based backups as opposed to regular backups.

In regular backups, while there may be some benefit to reducing the amount of data transmitted, what you’re often not told is that this reduction comes at a cost – namely, increased processor and/or memory load on the clients. Source based deduplication naturally has to shift some of the processing load back onto the client – otherwise the data would be transmitted and then thrown away. (And otherwise proponents couldn’t argue that you’ll transmit less data by using source based backup.)

So number one, if someone is blithely telling you that you’ll push less data across your network, ask yourself the following questions:

(a) Do I really need to push less data across the network? (I.e., is the network the bottleneck at all?)

(b) Can my clients sustain a 10% to 15% load increase in processing requirements during backup activities?

This makes the first advantage of source based deduplication somewhat less tangible than it normally comes across as.

Onto the second proposed advantage of source based deduplication – faster WAN based backups. Undoubtedly, this is true, since we don’t have to ship anywhere near as much data across the network. However, consider that we backup in order to recover. You may be able to reduce the amount of data you send across the WAN to backup, but unless you plan very carefully you may put yourself into a situation where recoveries aren’t all that useful. That is – you need to be careful to avoid trickle based recoveries. This often means that it’s necessary to put a source based deduplication node in each WAN connected site, with those nodes replicating to a central location. What’s the problem with this? Well, none from a recovery perspective – but it can considerably blow out the cost. Again, informed decisions are very important to counter-balance source based deduplication hyperbole.

Finally – “your total backup storage is much smaller!” This is true, but it’s equally an advantage of target based deduplication; while the rates may have some variance, the savings are still great regardless.

Now let’s look at a couple of other factors of source based deduplication that aren’t always discussed:

  1. Depending on the product you choose, you may get less OS and database support than you’re getting from your current backup product.
  2. The backup processes and clients will change. Sometimes quite considerably, depending on whether your vendor supports integration of deduplication backup with your current backup environment, or whether you need to change the product entirely.

When we look at those above two concerns is when target based deduplication really starts to shine. You still get deduplication, but with significantly less interruption to your environment and your processes.

Regardless of whether target based deduplication is integrated into the backup environment as a VTL, or whether it’s integrated as a traditional backup to disk device, you’re not changing how the clients work. That means whatever operating systems and databases you’re currently backing up you’ll be able to continue to backup, and you won’t end up in the (rather unpleasant) situation of having different products for different parts of your backup environment. That’s hardly a holistic approach. It may also be the case that the hosts where you’d get the most out of deduplication aren’t eligible for it – again, something that won’t happen with target based deduplication.

The changes for integrating target based deduplication into your environment are quite small – you just change where you’re sending your backups to, and let the device(s) handle the deduplication, regardless of what operating system or database or application or type of data is being sent. Now that’s seamless.

Equally so, you don’t need to change your backup processes for your current clients – if it’s not broken, don’t fix it, as the saying goes. While this can be seen by some as an argument for stagnation, it’s not; change for the sake of change is not always appropriate, whereas predictability and reliability are very important factors to consider in a data protection environment.

Overall, I prefer target based deduplication. It integrates better with existing backup products, reduces the number of changes required, and does not place restrictions on the data you’re currently backing up.

Posted in Backup theory, NetWorker | Tagged: , , , , | 5 Comments »

The 5 Golden Rules of Recovery

Posted by Preston on 2009-11-10

You might think, given that I wrote an article a while ago about the Procedural Obligations of Backup Administrators, that it wouldn’t be necessary to explicitly spell out any recovery rules – but this isn’t quite the case. It’s handy to have a “must follow” list of rules for recovery as well.

In their simplest form, these rules are:

  1. How
  2. Why
  3. Where
  4. When
  5. Who

Let’s look at each one in more detail:

  1. How – Know how to do a recovery, before you need to do it. The worst forms of data loss typically occur when a backup routine is put in place that is untried on the assumption that it will work. If a new type of backup is added to an environment, it must be tested before it is relied on. In testing, it must be documented by those doing the recovery. In being documented, it must be referenced by operational procedures*.
  2. Why – Know why you are doing a recovery. This directly affects the required resources. Are you recovering a production system, or a test system? Is it for the purposes of legal discovery, or because a database collapsed?
  3. Where – Know where you are recovering from and to. If you don’t know this, don’t do the recovery. You do not make assumptions about data locality in recovery situations. Trust me, I know from personal experience.
  4. When – Know when the recovery needs to be completed by. This isn’t always answered by the why factor – you actually need to know both in order to fully schedule and prioritise recoveries.
  5. Who – Know that whoever requested the recovery is authorised to do so. (In order to know this, there should be operational recovery procedures – forms and company policies – that indicate authorisation.)

If you know the how, why, where, when and who, you’re following the golden rules of recovery.

* Or to put it another way – documentation is useless if you don’t know it exists, or you can’t find it!

Posted in Backup theory, NetWorker, Policies | Tagged: , , , , | Comments Off on The 5 Golden Rules of Recovery

Dedupe: Leading Edge or Bleeding Edge?

Posted by Preston on 2009-10-28

If you think you can’t go a day without hearing something about dedupe, you’re probably right. Whether it’s every vendor arguing the case that their dedupe offerings are the best, or tech journalism reporting on it, or pundits explaining why you need it and why your infrastructure will just die without it, it seems that it’s equally the topic of the year along with The Cloud.

There is (from some at least) an argument that backup systems should be “out there” in terms of innovation; I question that, inasmuch as I believe the term bleeding edge exists for a reason – it’s much sharper, it’s prone to accidents, and if you have an accident at the bleeding edge, well, you’ll bleed.

So, I always argue that there’s nothing wrong with leading edge in backup systems (so long as it is warranted), but bleeding edge is a far riskier proposition – not just in terms of potentially wasted investment, but due to the side effects of that wasted investment. If a product is outright bleeding edge, then having it involved in data protection is a particularly dangerous proposition. (Only when technology is a mix of bleeding edge and leading edge can you even start to make the argument that it should be considered in the data protection sphere.)

Personally I like the definitions of Bleeding Edge and Leading Edge in the article at Wikipedia on Technology Lifecycle. To quote:

Bleeding edge – any technology that shows high potential but hasn’t demonstrated its value or settled down into any kind of consensus. Early adopters may win big, or may be stuck with a white elephant.

Leading edge – a technology that has proven itself in the marketplace but is still new enough that it may be difficult to find knowledgeable personnel to implement or support it.

So the question is – is deduplication leading edge, or is it still bleeding edge?

To understand the answer, we first have to consider that there’s actually 5 classified stages to the technology lifecycle. These are:

  1. Bleeding edge.
  2. Leading edge.
  3. State of the art.
  4. Dated.
  5. Obsolete.

What we have to consider is – what happens when a technology exhibits attributes of more than one classification or stage of technology? To me, working in the conservative field of data protection, I think there’s only one answer: it should be classified by the “least mature” or “most dangerous” stage that it exhibits attributes for.

Thus, deduplication is still bleeding edge.

Why dedupe is still bleeding edge

Clearly there are attributes of deduplication which are leading edge. It has, in field deployments, proven itself to be valuable in particular instances.

However, there are attributes of deduplication which are definitely still bleeding edge. In particular, the distinction for bleeding edge (to again quote from the Wikipedia article on Technology Lifecycle) is that it:

…shows high potential but hasn’t demonstrated its value or settled down into any kind of consensus.

(My emphasis added.)

Clearly in at least some areas, deduplication has demonstrated its value – my rationale for it still being bleeding edge though is the second (and equally important) attribute: I’m not convinced that deduplication has sufficiently settled down into any kind of consensus.

Within deduplication, you can:

  • Dedupe primary data (less frequent, but talk is growing about this)
  • Dedupe virtualised systems
  • Dedupe archive/HSM systems (whether literally, or via single instance storage, or a combination thereof)
  • Dedupe NAS
  • For backup:
    • Do source based dedupe:
      • At the file level
      • At a fixed block level
      • At a variable block level
    • Do target based dedupe:
      • Post-backup, maintaining two pools of storage, one deduplicated, one normal. Most frequently accessed data is typically “hydrated”, whereas the deduped storage is longer term/less frequently accessed data.
      • Inline (at ingest), maintaining only one deduplicated pool of storage
    • For long term storage of deduplicated backups:
      • Replicate, maintaining two deduplicated systems
      • Transfer out to tape, usually via rehydration (the slightly better term for “undeduplicating”)
      • Transfer deduped data out to tape “as is”

Does this look like any real consensus to you?

One comfort we can take from all these disparate dedupe options is that there's clearly a lot of innovation going on. The fundamentals behind dedupe are also tried and trusted – we use them every time we compress a file or a bunch of files. At heart it's just scanning for common blocks and reducing the data to the smallest possible amount.
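As a toy illustration of that common-block principle (and only an illustration – real dedupe products use far smarter segmenting and indexing), you can carve a file into fixed-size blocks, hash each block, and compare the number of unique hashes to the total number of blocks:

```shell
#!/bin/sh
# Toy fixed-block dedupe estimate: split a sample file into 4 KiB blocks,
# hash each block, and count unique hashes versus total blocks.
TMP=`mktemp -d` || exit 1

# Sample data: four identical 4 KiB blocks of zeroes, plus one block of "A"s.
dd if=/dev/zero of="$TMP/sample" bs=4096 count=4 2>/dev/null
head -c 4096 /dev/zero | tr '\0' 'A' >> "$TMP/sample"

split -b 4096 "$TMP/sample" "$TMP/blk."
TOTAL=`ls "$TMP"/blk.* | wc -l | tr -d ' '`
UNIQUE=`md5sum "$TMP"/blk.* | awk '{print $1}' | sort -u | wc -l | tr -d ' '`

# Five blocks, but only two unique - a dedupe system would store just the two.
echo "blocks=$TOTAL unique=$UNIQUE"
rm -rf "$TMP"
```

Here the "dedupe saving" is 5:2; on real data the interesting question is how often that ratio justifies the added complexity.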

It’s also an intelligent and logical method of moving forward in storage – i.e., we’ve reached a point in storage where both companies that purchase storage, and the vendors that provide it, are moving towards using storage more efficiently rather than just continuing to buy it. This trend started with the development of SAN and NAS, so dedupe is just the logical continuation of those storage centralisation/virtualisation paths. More so, the trend towards more intelligent use of technology is not new – consider even recent changes in products from the CPU manufacturers. Targeting Intel as a prime example, for years their primary development strategy was “fast, faster, fastest.” However, that strategy ended up hitting a brick wall – it doesn’t matter how fast an individual processor is if you actually need to do multiple things at once. Hence multi-core really hit the mainstream. Previously reserved in multi-CPU environments for high end workstations and servers, it’s now common for any new computer to come with multiple cores. (Heck, I have 2 x Quad Core processors in the machine I’m writing this article on. The CPU speeds are technically slower than my lab ESX server, but with multi-core, multi-threading, it smacks the ESX server out of the lab every time on performance. It’s more intelligent use of the resources.)

So dedupe is about shifting away from big, bigger, biggest storage to smart, smarter and smartest storage.

We’re certainly not at smartest yet.

We’re probably not even at smarter yet.

As an overall implementation strategy, deduplication is practically in its infancy in terms of actual industry-state vs potential industry-state. You can do it on your primary production data, or your virtualised systems, or your archived data, or your secondary NAS data, or your backups; but so far there have been few tangible, usable advances towards being able to use it throughout your entire data lifecycle in a way that is compatible and transparent regardless of the vendor or product in use.

For dedupe to be able to make that leap fully out of bleeding edge territory, it needs to make some inroads into complete data lifecycle deduplication – starting at the primary data level and finishing at backups and archives.

(And even when we can use it through the entire product lifecycle, we’ll still be stuck with working out what to do with it once it’s been generated, for longer term storage. Do we replicate between sites? Do we rehydrate to tape or do we send out the deduped data to tape? Obviously based on recent articles I don’t (yet) have much faith in the notion of writing deduped data to tape.)

If you think that there isn’t a choice for long term storage – that it has to be replication, and dedupe is a “tape killer”, think again. Consider smaller sites with constrained budget, consider sites that can’t afford dedicated disaster recovery systems, and consider sites that want to actually limit their energy impact. (I.e., sites that understand the difference in energy savings between offsite tapes and MAID for long term data storage.)

So should data protection environments implement dedupe?

You might think, based on previous comments, that my response to this is going to be a clear-cut no. That's not quite correct, however. You see, because dedupe falls into both leading edge and bleeding edge, it is something that can be implemented in specific environments, under specific circumstances.

That is, the suitability of dedupe for an environment can be evaluated on a case by case basis, so long as sites are aware that when implementing dedupe they’re not getting the full promise of the technology, but just specific windows on the technology. It may be that companies:

  • Need to reduce their backup windows, in which case source-based dedupe could be one option (among many).
  • Need to reduce their overall primary production data, in which case single instance archive is a likely way to go.
  • Need to keep more data available for recovery in VTLs (or for that matter on disk backup units), in which case target based dedupe is the likely way to go.
  • Want to implement more than one of the above, in which case they will be buying disparate technologies that don't share common architectures or operational management systems.

I’d be mad if I were to say that dedupe is still too immature for any site to consider – yet equally I’d charge that anyone who says that every site should go down a dedupe path, and that every site will get fantastic savings from implementing dedupe is equally mad.

Posted in Backup theory, General Technology, General thoughts | Tagged: , , , , , , , , | 1 Comment »

Laptop/Desktop Backups as easy as 1-2-3!

Posted by Preston on 2009-10-23

When I first mentioned probe based backups a while ago, I suggested that they're going to be a bit of a sleeper function – that is, I think they're being largely ignored at the moment because people aren't quite sure how to make use of them. My take, however, is that over time we're going to see a lot of sites shifting particular backups over to probe groups.


Currently a lot of sites shoe-horn ill-fitting backup requirements into rigid schedules. This results in frequent violations of that best-practice approach to backup, the Zero Error Policy. Here's a prime example: at sites that need to do laptop and/or desktop backups using NetWorker, administrators are basically resigned to failure rates of 50% or more in such groups, depending on how many machines are not connected to the network at the time.

This doesn’t need to be the case – well, not any more thanks to probe based backups. So, if you’ve been scratching your head looking for a practical use for these backups, here’s something that may whet your appetite.


Let’s consider a site where there are group of laptops and desktops that are integrated into the NetWorker backup environment. However, there’s never a guarantee of which machines may be connected to the network at any given time. Therefore administrators typically configure laptop/desktop backup groups to start at say, 10am, on the premise that the most systems are likely to be available at that time.

Theory of Resolution

Traditional time-of-day start backups aren't really appropriate for this scenario. What we want is for the NetWorker server to wait for those infrequently connected clients to be connected, and then run a backup at the next opportunity.

Rather than having a single group for all clients and accepting that the group will suffer significant failure rates, split each irregularly connected client into its own group, and configure a backup probe.

The backup server will probe the configured clients at regular intervals during nominated periods of the day/night. When a client is connected to the network and its probe successfully reports that (a) the client is running and (b) a backup should be done, the backup is started on the spot.


In order to get this working, we’ll need the following:

  • NetWorker 7.5 or higher (clients and server)
  • A probe script – one per operating system type
  • A probe resource – one per operating system type
  • A 1:1 mapping between clients of this type and groups.

Practical Application

Probe Script

This is a command which is installed on the client(s) in the same directory as the "save" or "save.exe" binary (depending on OS type), and whose name must start with either nsr or save. I'll be calling my script:

I don’t write Windows batch scripts. Therefore, I’ll give an example as a Linux/Unix shell script, with an overview of the program flow. Anyone who wants to write a batch script version is welcome to do so and submit it.

The “proof of concept” algorithm for the probe script works as follows:

  • Establish a “state” directory in the client nsr directory called bckchk. I.e., if the directory doesn’t exist, create it.
  • Establish a “README” file in that directory for reference purposes, if it doesn’t already exist.
  • Determine the current date.
  • Check for a previous date file. If there was a previous date file:
    • If the current date equals the previous date found:
      • Write a status file indicating that no backup is required.
      • Exit, signaling that no backup is required.
    • If the current date does not equal the previous date found:
      • Write the current date to the “previous” date file.
      • Write a status file indicating that the current date doesn’t match the “previous” date, so a new backup is required.
      • Exit, signaling that a backup is required.
  • If there wasn’t a previous date file:
    • Write the current date to the “previous” date file.
    • Write a status file indicating that no previous date was found so a backup will be signaled.
    • Exit, signaling backup should be done.

Obviously, this is a fairly simplistic approach, but is suitable for a proof of concept demonstration. If you were wishing to make the logic more robust for production deployment, my first suggestion would be to build in mminfo checks to determine (even if the dates match), whether there has been a backup “today”. If there hasn’t, that would override and force a backup to start. Additionally, if users can connect via VPN and the backup server can communicate with connected clients, you may want to introduce some logic into the script to deny probe success over the VPN.
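To sketch that mminfo enhancement – with the caveats that this is an assumption-laden sketch: it presumes the client can query the backup server directly, that the output of hostname matches the NetWorker client name, and that your mminfo accepts a "savetime>=today" query:

```shell
#!/bin/sh
# Hedged sketch of the mminfo override suggested above. backup_done_today
# returns 0 if the media database already shows at least one saveset for
# this client today, 1 otherwise.
backup_done_today() {
    _client=`hostname`
    # mminfo prints one ssid per matching saveset; no matches means no output.
    _count=`mminfo -q "client=$_client,savetime>=today" -r ssid 2>/dev/null | wc -l | tr -d ' '`
    [ "$_count" -gt 0 ]
}

# In the probe script, where the dates match, you would then only skip the
# backup if the media database agrees, e.g.:
#   if [ "$DATE" = "$LASTDATE" ] && backup_done_today; then exit 1; fi
```

Treat the query string as a starting point only, and test it from a connected client with mminfo on the command line first.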

If you wanted an OS-independent script for this, you might code it in Perl, but I've held off doing that in this case simply because a lot of sites have reservations about installing Perl on Windows systems. (Sigh.)

Without any further guff, here’s the sample script:

preston@aralathan ~
$ cat /usr/sbin/
#!/bin/sh
#
# Probe script: signal NetWorker to run a backup (exit 0) if none has been
# flagged yet today; otherwise signal that no backup is required (exit 1).

CHKDIR=/nsr/bckchk
CHECK="$CHKDIR/datecheck.lck"
STATUS="$CHKDIR/status"        # records the most recent probe decision

if [ ! -d "$CHKDIR" ]; then
   mkdir -p "$CHKDIR"
fi

if [ ! -f "$CHKDIR/README" ]; then
   cat > "$CHKDIR/README" <<EOF
==== Purpose of this directory ====

This directory holds state file(s) associated with the probe based
laptop/desktop backup system. These state file(s) should not be
deleted without consulting the backup administrator.
EOF
fi

DATE=`date +%Y%m%d`
COMPDATE=`date "+%Y%m%d %H%M%S"`

if [ ! -f "$CHECK" ]; then
   echo $DATE > "$CHECK"
   echo "$COMPDATE Check file did not exist. Backup required" > "$STATUS"
   exit 0
fi

LASTDATE=`cat "$CHECK"`
if [ -z "$LASTDATE" ]; then
   echo "$COMPDATE Previous check was null. Backup required" > "$STATUS"
   echo $DATE > "$CHECK"
   exit 0
fi

if [ "$DATE" = "$LASTDATE" ]; then
   echo "$COMPDATE Last backup was today. No action required" > "$STATUS"
   exit 1
else
   echo "$COMPDATE Last backup was not today. Backup required" > "$STATUS"
   echo $DATE > "$CHECK"
   exit 0
fi
As you can see, there’s really not a lot to this in the simplest form.

Once the script has been created, it should be made executable and (for Linux/Unix/Mac OS X systems), be placed in /usr/sbin.

Probe Resource

The next step is to create a probe resource within NetWorker. This will be shared by all the probe clients of the same operating system type.

A completed probe resource might resemble the following:

Configuring the probe resource

Note that there’s no path in the above probe command – that’s because NetWorker requires the probe command to be in the same location as the save command.

Once this has been done, you can either configure the client or the probe group next. Since the client has to be reconfigured after the probe group is created, we’ll create the probe group first.

Creating the Probe Groups

The first step in creating the probe groups is to come up with a naming standard so that they can be easily identified amongst all the other standard groups within the overall configuration. There are two approaches you can take towards this:

  • Preface each group name with a keyword (e.g., “probe”) followed by the host name the group is for.
  • Name each group after the client that will be in the group, but set a comment along the lines of say, “Probe Backup for <hostname>”.

Personally, I prefer the second option. That way you can sort by comment to easily locate all probe based groups but the group name clearly states up front which client it is for.

When creating a new probe based group, there are two tabs you’ll need to configure – Setup and Advanced – within the group configuration. Let’s look at each of these:

Probe group configuration – Setup Tab

You’ll see from the above that I’m using the convention where the group name matches the client name, and the comment field is configured appropriately for easy differentiation of probe based backups.

You’ll need to set the group to having an autostart value of Enabled. Also, the Start Time field does have relevance exactly once for probe based backups – it still seems to define the first start time of the probe. After that, the probe backups will follow the interval and start/finish times defined on the second tab.

Here’s the second tab:

Probe Group - Advanced Tab

The key thing on this obviously is the configuration of the probe section. Let’s look at each option:

  • Probe based group – Checked
  • Probe interval – Set in minutes. My recommendation is to have each group a different number of minutes. (Or at least reduce the number of groups that have exactly the same probe interval.) That way over time as probes run, there’s less likelihood of multiple groups starting at the same time. For instance, in my test setup, I have 5 clients, set to intervals of 90 minutes, 78 minutes, 104 minutes, 82 minutes and 95 minutes*.
  • Probe start time – Time of day that probing starts. I've left this on the defaults, which may be suitable for desktops; but for laptops, where there's a very high chance of machines being disconnected overnight, you may wish to start probing closer to the start of business hours.
  • Probe end time – Time of day that NetWorker stops probing the client. Same caveats as per the probe start time above.
  • Probe success criteria – Since there’s only one client per group, you can leave this at all.
  • Time since successful backup – How many days NetWorker should allow probing to run unsuccessfully before it forcibly starts a backup. If set to zero it will never force a backup to run. (Since taking the screen-shot I've actually changed that value to 3 on my configured clients.) Set yours to a site-optimal value. Note that since the aim is to run only one backup every 24 hours, setting this to "1" is probably not all that logical an idea.

(The last field, “Time of the last successful backup” is just a status field, there’s nothing to configure there.)

If you have schedules enforced out of groups, you’ll want to set the schedule up here as well.
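For sites that script their configuration, the two tabs above can also be expressed as an nsradmin input file. This is a sketch only – "archimedes" is a hypothetical client name, and the attribute names are transcribed from the NMC fields shown above, so check them against your own server first:

```
create type: NSR group;
       name: archimedes;
       comment: "Probe Backup for archimedes";
       autostart: Enabled;
       probe based group: Yes;
       probe interval: 90;
       probe start time: "08:00";
       probe end time: "21:00";
       probe success criteria: all;
       time since successful backup: 3
```

As with any nsradmin-driven change, run it against a test group first and confirm the result in NMC.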

With this done, we’re ready to move onto the client configuration!

Configuring the Client for Probe Backups

There’s two changes required here. In the General tab of the client properties, move the client into the appropriate group:

Adding the client to the correct group

In the “Apps & Modules” tab, identify the probe resource to be used for that client:

Configuring the client probe resource

Once this has been done, you’ve got everything configured, and it’s just a case of sitting back and watching the probes run and trigger backups of clients as they become available. You’ll note, in the example above, that you can still use savepnpc (pre/post commands) with clients that are configured for probe backups. The pre/post commands will only be run if the backup probe confirms that a backup should take place.

Wrapping Up

I’ll accept that this configuration can result in a lot of groups if you happen to have a lot of clients that require this style of backup. However, that isn’t the end of the world. Reducing the number of errors reported in savegroup completion notifications does make the life of backup administrators easier, even if there’s a little administrative overhead.

Is this suitable for all types of clients? E.g., should you use this to shift away from standard group based backups for the servers within an environment? The answer to that is a big "unlikely". I really do see this as something more suitable for companies that are using NetWorker to back up laptops and/or desktops (or a subset thereof).

If you think no-one does this, I can think of at least five of my customers alone who have requirements to do exactly this, and I’m sure they’re not unique.

Even if you don’t particularly need to enact this style of configuration for your site, what I’m hoping is that by demonstrating a valid use for probe based backup functionality, I may get you thinking about where it could be used at your site for making life easier.

Here’s a few examples I can immediately think of:

  • Triggering a backup+purge of Oracle archived redo logs that kick in once the used capacity of the filesystem the logs are stored on exceed a certain percentage (e.g., 85%).
  • Triggering a backup when the number of snapshots of a fileserver exceed a particular threshold.
  • Triggering a backup when the number of logged in users falls below a certain threshold. (For example, on development servers.)
  • Triggering a backup of a database server whenever a new database is added.
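The first idea above is easy to sketch as a probe helper. This is a hedged proof of concept only – the mount point and threshold in the usage comment are purely illustrative, and a production version would want the same mminfo-style safeguards discussed earlier:

```shell
#!/bin/sh
# Sketch of a capacity-triggered probe: report "backup required" once a
# filesystem (e.g. the Oracle archived redo log area) passes a threshold.

# Returns 0 (backup required) if the filesystem's used capacity exceeds the
# given percentage, 1 otherwise. df -P guarantees one record per filesystem,
# with column 5 holding the used capacity (e.g. "92%").
fs_over_threshold() {
    _fs="$1"
    _limit="$2"
    _used=`df -P "$_fs" 2>/dev/null | awk 'NR==2 { sub(/%/, "", $5); print $5 }'`
    [ -n "$_used" ] && [ "$_used" -gt "$_limit" ]
}

# As the probe script's final act you might then run, say:
#   if fs_over_threshold /u01/arch 85; then exit 0; else exit 1; fi
```

The same pattern – measure something, compare it to a threshold, exit 0 or 1 – covers the snapshot-count and logged-in-user ideas as well.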

Trust me: probe based backups are going to make your life easier.

* There currently appears to be a "feature" with probe based backups where changes to the probe interval only take effect after the next "probe start time". I need to do some more review on this to see whether it's (a) true, and (b) warrants logging a case.

Posted in Backup theory, NetWorker, Policies, Scripting | Tagged: , , , | Comments Off on Laptop/Desktop Backups as easy as 1-2-3!