NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist

  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).

  • Enterprise Systems Backup and Recovery

    If you find this blog interesting, and either have an interest in or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on features, policies, procedures and the human element to ensuring that your company has a suitable and working backup system rather than just a bunch of copies made by unrelated software, hardware and processes.

Archive for the ‘Architecture’ Category

Pictorial representation of management happiness over OpEx cost savings

Posted by Preston on 2009-12-16

There is a profoundly important backup adage that should be front and centre in the mind of any backup administrator, operator, manager, etc. This is:

It is always better to backup a little more than not quite enough.

This isn’t an invitation to wildly waste backup capacity on useless copies that can never be recovered from – or needlessly generate unnecessary backups. However, it should serve as a constant reminder that if you keep shaving important stuff out of your backups, you’ll eventually suffer a Titanic issue.

Now, people at this point often tell me either that (a) they’re being told to reduce the amount of data being backed up or (b) it makes their manager happy to be able to report less OpEx budget required or (c) some combination of the two, or (d) they’re reluctant to ask for additional budget for storage media.

The best way to counter these oppressive and dangerous memes is to draw up a graph of how happy your manager(s) will be over saving a few hundred dollars here and there on media versus potential recovery issues. To get you started, I’ve drawn up one which covers a lot of sites I’ve encountered over the years:

[Figure: Manager happiness vs state of environment and needless cost savings on backup]

You see, it’s easy to be happy about saving a few dollars here and there on backup media in the here and now, when your backups are running and you don’t need to recover.

However, as soon as the need for a recovery starts to creep in, previous happiness over saving a few hundred dollars rapidly evaporates in direct proportion to the level of the data loss. There might be minimal issues to a single lost document or email, but past that things start to get rather hairy. In fact, it’s very easy to switch from 100% management happiness to 100% management disgruntlement within the space of 24 hours in extreme situations.

You, as a backup administrator, may already be convinced of this. (I would hope you are.) Sometimes though, other staff or managers may need reminding that they too may be judged by more senior management on recoverability of systems under their supervision, so this graph equally applies to them. That continues right up the chain, further reinforcing the fact that backups are an activity which belongs to the entire company, not just IT, and therefore they are a financial concern that needs to be budgeted for by the entire company.

Posted in Architecture, Backup theory | 4 Comments »

EMC, Data Domain, VTLs and Disk Backup

Posted by Preston on 2009-11-30

With their recent acquisition of Data Domain, some people at EMC have become table thumping experts overnight on why it’s absolutely imperative that you backup to Data Domain boxes as disk backup over NAS, rather than a fibre-channel connected VTL.

Their argument seems to come from the numbers – the wrong numbers.

The numbers constantly quoted are number of sales of disk backup Data Domain vs VTL Data Domain. That is, some EMC and Data Domain reps will confidently assert that by the numbers, a significantly higher percentage of Data Domain for Disk Backup has been sold than Data Domain with VTL. That’s like saying that Windows is superior to Mac OS X because it sells more. Or to perhaps pick a little less controversial topic, it’s like saying that DDS is better than LTO because there have been more DDS drives and tapes sold than there have ever been LTO drives and tapes.

I.e., an argument by those numbers doesn’t wash. It rarely has, it rarely will, and nor should it. (Otherwise we’d all be afraid of sailing too far from shore because that’s how it had always been done before…)

Let’s look at the reality of how disk backup currently stacks up in NetWorker. And let’s preface this by saying that if backup products actually started using disk backup properly tomorrow, I would be the first to shout “Don’t let the door hit your butt on the way out” to every VTL on the planet. As a concept, I wish VTLs didn’t have to exist, but in the practical real world, I recognise their need and their current ascendency over ADV_FILE. I have, almost literally at times, been dragged kicking and screaming to that conclusion.

Disk Backup, using ADV_FILE type devices in NetWorker:

  • Can’t move a saveset from a full disk backup unit to a non-full one; you have to clear the space first.
  • Can’t simultaneously clone from, stage from, backup to and recover from a disk backup unit. No, you can’t do that with tape either, but when disk backup units are typically in the order of several terabytes, and virtual tapes are in the order of maybe 50-200 GB, that’s a heck of a lot less contention time for any one backup.
  • Use tape/tape drive selection algorithms for deciding which disk backup unit gets used in which order, resulting in worst case capacity usage scenarios in almost all instances.
  • Can’t accept a saveset bigger than the disk backup unit. (It’s like, “Hello, AMANDA, I borrowed some ideas from you!”)
  • Can’t be part-replicated between sites. If you’ve got two VTLs and you really need to do back-end replication, you can replicate individual pieces of media between sites – again, significantly smaller than entire disk backup units. When you define disk backup units in NetWorker, that’s the “smallest” media you get.
  • Are traditionally space wasteful. NetWorker’s limited staging routines encourage clumps of disk backup space by destination pool – e.g., “here’s my daily disk backup units, I use them 30 days out of 31, and those over there that occupy the same amount of space (practically) are my monthly disk backup units, I use them 1 day out of 31. The rest of the time they sit idle.”
  • Have poor staging options (I’ll do another post this week on one way to improve on this).

If you get a table thumping sales person trying to tell you that you should buy Data Domain for Disk Backup for NetWorker, I’d suggest thumping the table back – you want the VTL option instead, and you want EMC to fix ADV_FILE.

Honestly EMC, I’ll lead the charge once ADV_FILE is fixed. I’ll champion it until I’m blue in the face, then suck from an oxygen tank and keep going – like I used to, before the inadequacies got too much. Until then though, I’ll keep skewering that argument of superiority by sales numbers.

Posted in Architecture, NetWorker | 3 Comments »

Storage Tiering vs ILM

Posted by Preston on 2009-11-24

Over at StorageNerve, and on Twitter, Devang Panchigar has been asking Is Storage Tiering ILM or a subset of ILM, but where is ILM? I think it’s an important question with some interesting answers.

Devang starts with defining ILM from a storage perspective:

1) A user or an application creates data and possibly over time that data is modified.
2) The data needs to be stored and possibly be protected through RAID, snaps, clones, replication and backups.
3) The data now needs to be archived as it gets old, and retention policies & laws kick in.
4) The data needs to be search-able and retrievable NOW.
5) Finally the data needs to be deleted.

I agree with items 1, 3, 4 and 5 – as per previous posts, for what it’s worth, I believe that 2 belongs to a sister activity which I define as Information Lifecycle Protection (ILP) – something that Devang acknowledges as an alternative theory. (I liken the logic to separation between ILM and ILP to that between operational production servers and support production servers.)

The above list, for what it’s worth, is actually a fairly astute/accurate summary of the involvement of the storage industry thus far in ILM. Devang rightly points out that Storage Tiering (migrating data between different speed/capacity/cost storage based on usage, etc.), doesn’t address all of the above points – in particular, data creation and data deletion. That’s certainly true.

What’s missing from ILM from a storage perspective are the components that storage can only peripherally control. Perhaps that’s not entirely accurate – the storage industry can certainly participate in the remaining components (indeed, particularly in NAS systems it’s absolutely necessary, as a prime example) – but it’s more than just the storage industry. It’s operating system vendors. It’s application vendors. It’s database vendors. It is, quite frankly, the whole kit and caboodle.

What’s missing in the storage-centric approach to ILM is identity management – or to be more accurate in this context, identity management systems. The brief outline of identity management is that it’s about moving access control and content control out of the hands of the system, application and database administrators, and into the hands of human resources/corporate management. So a system administrator could have total systems access over an entire host and all its data but not be able to open files that (from a corporate management perspective) they have no right to access. A database administrator can fully control the corporate database, but can’t access commercially sensitive or staff salary details, etc.

Most typically though, it’s about corporate roles, as defined in human resources, being reflected from the ground up in system access options. That is, when human resources set up a new employee as having a particular role within the organisation (e.g., “personal assistant”), that triggers the appropriate workflows to set up that person’s accounts and access privileges for IT systems as well.

If you think that’s insane, you probably don’t appreciate the purpose of it. System/app/database administrators I talk to about identity management frequently raise trust (or the perceived lack thereof) involved in such systems. I.e., they think that if the company they work for wants to implement identity management, it doesn’t trust the people who are tasked with protecting the systems. I won’t lie, I think in a very small number of instances, this may be the case. Maybe 1%, maybe as high as 2%. But let’s look at the bigger picture here – we, as system/application/database administrators currently have access to such data not because we should have access to such data but because until recently there have been very few options in place to limit data access to only those who, from a corporate governance perspective, should have access to that data. As such, most system/app/database administrators are highly ethical – they know that being able to access data doesn’t equate to actually accessing that data. (Case in point: as the engineering manager and sysadmin at my last job, if I’d been less ethical, I would have seen the writing on the wall long before the company fell down under financial stresses around my ears!)

Trust doesn’t wash in legal proceedings. Trust doesn’t wash in financial auditing. Particularly in situations where accurate logs aren’t maintained in an appropriately secured manner to prove that person A didn’t access data X. The fact that the system was designed to permit A to access X (even as part of A’s job) is in some financial, legal and data sensitivity areas, significant cause for concern.

Returning to the primary point though, it’s about ensuring that the people who have authority over someone’s role within a company (human resources/management) have control over the processes that configure the access permissions that person has. It’s also about making sure that those workflows are properly configured and automated so there’s no room for error.

So what’s missing – or what’s only at the barest starting point, is the integration of identity/access control with ILM (including storage tiering) and ILP. This, as you can imagine, is not an easy task. Hell, it’s not even a hard task – it’s a monumentally difficult task. It involves a level of cooperation and coordination between different technical tiers (storage, backup, operating systems, applications) that we rarely, if ever see beyond the basic “must all work together or else it will just spend all the time crashing” perspective.

That’s the bit that gives the extra components – control over content creation and destruction. The storage industry on its own does not have the correct levels of exposure to an organisation in order to provide this functionality of ILM. Nor do the operating system vendors. Nor do the database vendors or the application vendors – they all have to work together to provide a total solution on this front.

I think this answers (indirectly) Devang’s question/comment on why storage vendors, and indeed, most of the storage industry, has stopped talking about ILM – the easy parts are well established, but the hard parts are only in their infancy. We are after all seeing some very early processes around integrating identity management and ILM/ILP. For instance, key management on backups, if handled correctly, can allow for situations where backup administrators can’t by themselves perform the recovery of sensitive systems or data – it requires corporate permissions (e.g., the input of a data access key by someone in HR, etc.) Various operating systems and databases/applications are now providing hooks for identity management (to name just one, here’s Oracle’s details on it.)

So no, I think we can confidently say that storage tiering in and of itself is not the answer to ILM. As to why the storage industry has for the most part stopped talking about ILM, we’re left with one of two choices – it’s hard enough that they don’t want to progress it further, or it’s sufficiently commercially sensitive that it’s not something discussed without the strongest of NDAs.

We’ve seen in the past that the storage industry can cooperate on shared formats and standards. We wouldn’t be in the era of pervasive storage we currently are without that cooperation. Fibre-channel, SCSI, iSCSI, FCoE, NDMP, etc., are proof positive that cooperation is possible. What’s different this time is the cooperation extends over a much larger realm to also encompass operating systems, applications, databases, etc., as well as all the storage components in ILM and ILP. (It makes backups seem to have a small footprint, and backups are amongst the most pervasive of technologies you can deploy within an enterprise environment.)

So we can hope that the reason we’re not hearing a lot of talk about ILM any more is that all the interested parties are either working on this level of integration, or even making the appropriate preparations themselves in order to start working together on this level of integration.

Fingers crossed people, but don’t hold your breath – no matter how closely they’re talking, it’s a long way off.

Posted in Architecture, General Technology, General thoughts, Security | 2 Comments »

Enhancing NetWorker Security: A theoretical architecture

Posted by Preston on 2009-11-18

It’s fair to say that no one backup product can be all things to all people. More generally, it’s fair to say that no product can be all things to all people.

Security has had a somewhat interesting past in NetWorker; for a lot of the time, much of the attention to security has been to do with (a) defining administrators, (b) ensuring clients are who they say they are and (c) providing access controls for directed recoveries.

There’s a bunch of areas though that have remained somewhat lacking in NetWorker for security. Not 100% lacking, just not complete. For instance, user accounts that are accessed for the purposes of module backup and recovery frequently need higher levels of authority than standard users. Equally so, some sites want their <X> admins to be able to control as much as possible of the <X> backups, but not to be able to have any administrator privileges over the <Y> backups. I’d like to propose an idea that, if implemented, would both improve security and make NetWorker more flexible.

The change would be to allow the definition of administrator zones. An “administrator zone” would be a subset of a datazone. It would consist of:

  1. User groups:
    • A nominated “administrator” user group.
    • A nominated “user” user group.
    • Any other number of nominated groups with intermediate privileges.
  2. A collection of the following:
    • Clients
    • Groups
    • Pools
    • Devices
    • Schedules
    • Policies
    • Directives
    • etc

These obviously would still be accessible in the global datazone for anyone who is a datazone administrator. Conceptually, this would look like the following:

[Figure: Datazone with subset "administrator" zones]

The first thing this should point out to you is that administrator zones could, if desired, overlap. For instance, in the above diagram we have:

  1. Minor overlap between Windows and Unix admin zones (e.g., they might both have administrative rights over tape libraries).
  2. Overlap between Unix and Oracle admin zones.
  3. Overlap between Windows and Oracle admin zones.
  4. Overlap between Windows and Exchange admin zones.
  5. Overlap between Windows and MSSQL admin zones.

Notably though, the DMZ Admin zone indicates that you can have some zones that have no overlap/commonality with other zones.

There’d need to be a few rules established in order to make this work. These would be:

  1. Only the global datazone can support “<x>@*” user or group definitions in a user group.
  2. If there is overlap between two zones, then the user will inherit the rights of the highest authority they belong to. I.e., if a user is editing a shared feature between the Windows and Unix admin zones, and is declared an admin in the Unix zone, but only an end-user in the Windows zone, then the user will edit that shared feature with the rights of an admin.
  3. Similarly to the above, if there’s overlap between privileges at the global datazone level and a local administrator zone, the highest privileges will “win” for the local resource.
  4. Resources can only be created and deleted by someone with data zone administrator privileges.
  5. Updates for resources that are shared between multiple administrator zones need to be “approved” by an administrator from each administrator zone that overlaps or a datazone administrator.
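
To make rules 2 and 3 a little more concrete, here is a minimal Python sketch of how "highest authority wins" might be resolved for a shared resource. Nothing here is NetWorker code: the `AdminZone` class, the `Privilege` levels, the `effective_privilege` function and the user and resource names are all hypothetical, and intermediate user groups are left out for brevity.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Privilege(IntEnum):
    """Hypothetical privilege levels; a higher value means more authority."""
    NONE = 0
    USER = 1
    ADMIN = 2


@dataclass
class AdminZone:
    """Hypothetical model of one administrator zone within a datazone."""
    name: str
    admins: set = field(default_factory=set)
    users: set = field(default_factory=set)
    resources: set = field(default_factory=set)

    def privilege_for(self, user):
        if user in self.admins:
            return Privilege.ADMIN
        if user in self.users:
            return Privilege.USER
        return Privilege.NONE


def effective_privilege(user, resource, zones, datazone_admins=frozenset()):
    """Return the highest privilege the user holds over the resource.

    Rule 3: datazone-level administrator rights win for any resource.
    Rule 2: otherwise the highest privilege across every zone that
    contains the resource applies.
    """
    if user in datazone_admins:
        return Privilege.ADMIN
    held = [z.privilege_for(user) for z in zones if resource in z.resources]
    return max(held, default=Privilege.NONE)


# Illustrative data only: "library1" is shared by both zones.
unix = AdminZone("Unix", admins={"alice"}, users={"bob"},
                 resources={"library1", "delphi"})
windows = AdminZone("Windows", admins={"bob"}, users={"alice"},
                    resources={"library1", "athena"})
zones = [unix, windows]

print(effective_privilege("alice", "library1", zones).name)
print(effective_privilege("alice", "athena", zones).name)
```

In this sketch alice edits the shared library with admin rights (from the Unix zone) but only has user rights over the Windows-only client, which matches the example given in rule 2.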

Would this be perfect? Not entirely – for instance, it would still require a datazone administrator to create the resources that are then allocated to an administrator zone for control. However, this would prevent a situation occurring where an unprivileged user with “create” options could go ahead and create resources they wouldn’t have authority over. Equally, in an environment that permits overlapping zones, it’s not appropriate for someone from one administrator zone to delete a resource shared by multiple administrator zones. Thus, for safety’s sake, administrator zones should only concern themselves with updating existing resources.

How would the approval process work for edits of resources that are shared by overlapping zones? To start with, the resource that has been updated would continue to function “as is”, and a “copy” would be created (think of it as a temporary resource), with a notification used to trigger a message to the datazone administrators and the other, overlapping administrators. Once the appropriate approval has been done (e.g., an “edit” process in the temporary resource), then the original resource would be overwritten with the temporary resource, and the temporary resource removed.
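
The approval flow described above could be sketched roughly as follows. This is only an illustration under some assumptions of mine: the class and method names are hypothetical, the editor's own zone is treated as having implicitly approved the change, the "setting" attribute is made up, and the notification step is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class PendingEdit:
    """Temporary copy of a resource awaiting approval (hypothetical)."""
    resource_name: str
    proposed: dict
    required_zones: set
    approved_zones: set = field(default_factory=set)
    datazone_approved: bool = False

    def is_approved(self):
        # Either a datazone administrator signs off, or every other
        # overlapping zone does.
        return self.datazone_approved or self.required_zones <= self.approved_zones


class SharedResourceStore:
    """Holds live resources plus pending edits to shared ones."""

    def __init__(self, resources, sharing):
        self.resources = resources   # resource name -> attribute dict
        self.sharing = sharing       # resource name -> set of zone names
        self.pending = {}

    def propose(self, name, changes, editor_zone):
        # The live resource keeps functioning "as is"; the edit lives in a copy.
        proposed = {**self.resources[name], **changes}
        required = set(self.sharing[name]) - {editor_zone}
        self.pending[name] = PendingEdit(name, proposed, required)
        self._apply_if_approved(name)

    def approve(self, name, zone=None, datazone_admin=False):
        edit = self.pending[name]
        if datazone_admin:
            edit.datazone_approved = True
        elif zone in edit.required_zones:
            edit.approved_zones.add(zone)
        self._apply_if_approved(name)

    def _apply_if_approved(self, name):
        edit = self.pending.get(name)
        if edit is not None and edit.is_approved():
            # Overwrite the original and remove the temporary resource.
            self.resources[name] = edit.proposed
            del self.pending[name]


store = SharedResourceStore(
    resources={"library1": {"setting": "old"}},
    sharing={"library1": {"Unix", "Windows"}},
)
store.propose("library1", {"setting": "new"}, editor_zone="Unix")
before = store.resources["library1"]["setting"]   # still the original value
store.approve("library1", zone="Windows")
after = store.resources["library1"]["setting"]    # now the proposed value
```

A resource that belongs to only one zone has no other zones to ask, so in this sketch its edit is applied immediately.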

So what sort of extra resources would we need to establish this? Well, we’ve already got user groups, which is a starting point. The next step is to define an “admin zone” resource, which has fields for:

  1. Administrator user group.
  2. Standard user group.
  3. “Other” user groups.
  4. Clients
  5. Groups
  6. Pools
  7. Schedules
  8. Policies
  9. Directives
  10. Probes
  11. Lockboxes
  12. Notifications
  13. Labels
  14. Staging Policies
  15. Devices
  16. Autochangers
  17. etc.

In fact, pretty much every resource except for the server resource itself, and licenses, should be eligible for inclusion into a localised admin group. In its most basic form, you might expect to see the following:

nsradmin> print type: NSR admin zone; name: Oracle
type: NSR admin zone;
name: Oracle;
administrators: Oracle Admins;
users: Oracle All Users;
other user groups: ;
clients: delphi, pythia;
groups: Daily Oracle FS, Monthly Oracle FS,
Daily Oracle DB, Monthly Oracle DB;
pools: ;
schedules: Daily Oracle, Monthly Oracle;
policies: Oracle Daily, Oracle Monthly;
directives: pythia exclude oracle, delphi exclude oracle;
...etc...

To date, NetWorker’s administration focus has been far more global. If you’re an administrator, you can do anything to any resource. If you’re a user, you can’t do much with any resource. If you’ve been given a subset of privileges, you can use those privileges against all resources touched by those privileges.

An architecture that worked along these lines would allow for much more flexibility in terms of partial administrative privileges in NetWorker – zones of resources and local administrators for those resources would allow for more granular control of configuration and backup functionality, while still keeping NetWorker configuration maintained at the central server.

Posted in Architecture, Backup theory, NetWorker, Security | 2 Comments »

Why /nsr/tmp is wrong

Posted by Preston on 2009-11-02

On both Windows and Unix platforms, NetWorker maintains a “tmp” directory within nsr.

This directory contains a variety of information, from output received by savegroup completion notifications to lock/state files for certain NetWorker resources.

To explain why /nsr/tmp is wrong, let me first tell you a little story about the first system administration team I joined. They rigorously followed RFC-1178, and it’s ever since then that I’ve also done my best to follow that RFC – I’ve even written an article here on the blog about choosing appropriate names for backup servers. Sometime before I joined the team, they were in the process of setting up a replacement DNS server for the local datacentre. There was either a dispute about what to name it, or it was only meant to hang around for a short while, but for whatever reason, it was named tmp.

I worked in the group from 1996 through to 2000, and from what I heard, it wasn’t until several years after I left that tmp was decommissioned.

One of the most valuable lessons I took away is to name things appropriately. The DNS server tmp was not named appropriately. Thus, the name tmp or temp should be used only for transient data or systems. (To this day I never give machines names along the lines of ‘tmp’; the closest I’ll go is naming them after synonyms to do with trash or garbage – meaning that I’m fully aware that at any moment they can be blown away.)

To return to our topic, /nsr/tmp is wrong because it’s misnamed. Temporary files only make up some of its content. Other files, state files, can hang around between restarts of NetWorker and (particularly if NetWorker was incorrectly shut down) give backup administrators really bad days. In fact, the “magical random” nature of /nsr/tmp is so well known that it’s actually started to really bug EMC engineering. My understanding is that engineering want the contents of /nsr/tmp captured any time an EMC support representative tells someone to shutdown+delete+restart so that if it does fix the problem, they can try to debug why and remove the need.

The problem with shutdown+delete+restart is that in doing so, you clear out other information as well. Selectively deleting “the right file” can sometimes be a bit of a needle in a hay stack operation, and I suspect that debugging these deletes post-event will either be frustratingly slow or a bit like whack-a-mole.
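
If you do end up doing a shutdown+delete+restart, capturing the directory first is cheap. The following is a minimal Python sketch of one way to do that using only the standard library; the function name and the destination path are my own assumptions, not anything supplied by EMC, and it assumes it is run after NetWorker has been shut down and before anything is deleted.

```python
import tarfile
import time
from pathlib import Path


def snapshot_directory(source, dest_dir):
    """Archive `source` into a timestamped .tar.gz under `dest_dir`.

    Returns the path of the archive. `dest_dir` should sit outside
    `source`, otherwise the archive would try to include itself.
    """
    source = Path(source)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"{source.name}-{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        # Store entries relative to the directory name, not the full path.
        tar.add(source, arcname=source.name)
    return archive


# Hypothetical usage (paths are examples only):
#   snapshot_directory("/nsr/tmp", "/var/tmp/nsr-tmp-captures")
```

The resulting archive can then be handed to support if the delete does turn out to fix the problem.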

Architecturally, to include both state and temporary files in the same common directory structure is silly. Having a few extra directories in the ‘nsr’ base directory on the other hand is a minor change. I’d suggest that more improvements might be made by first actually splitting /nsr/tmp into:

  • /nsr/lck – Resource lock files
  • /nsr/tmp – Real temporary files (e.g., savegroup output text)
  • /nsr/state – State files (if necessary)

That way /nsr/tmp will actually start to obey the Principle of Least Astonishment.

Posted in Architecture, NetWorker | Comments Off

Dedupe to tape is “crazy bad” if the architecture is crazy

Posted by Preston on 2009-10-26

Over at Backup Central, Curtis Preston says he’s convinced that dedupe to tape according to the CommVault model is a good idea, in a “crazy good” way rather than a “crazy bad” way. To summarise Curtis’ argument (and thereby establish my understanding of it), the process is:

  1. Day to day recovery of deduped tape backup would be crazy (I agree with this)
  2. Design the system so that you still facilitate most recoveries from dedupe on disk (I have no issue with this)
  3. Periodically effectively stage out the dedupe data to tape (first objection)
  4. Long-term recoveries are done from tape written in dedupe format (holy cow that’s insane!)

So, let’s look at why I think this is “crazy bad” by examining each point.

Point one – day to day recovery of deduped tape backup would be crazy

Fully agreed. I’d liken recovery from deduped data on tape to recovery of highly fragmented files from a block level backup. Block level backup products (e.g., EMC’s SnapImage) allow you to bypass the inefficiencies of the filesystem on dense structures to do a block by block backup. This can deliver fantastic time savings. For. Backup.

For recovery, file level reconstruction from block level backups can suck in a terribly horrendous way. File level reconstruction from block level backups requires recovery of the required blocks into a cache, and then the files are put back together. If your files are heavily fragmented (which is often the case on dense filesystems), the number of reads from tape required – and the amount of seeking required – is very high. Real world example: 400 GB dense filesystem (about 40,000,000 files) had full backups reduced from 15 hours to 4 hours using block level backup. Recovery of the entire filesystem took less than 4 hours – recovery of a 40 GB directory took 12 hours. Having a very large cache is one way to get around this, but that starts to get costly (and in my experience is frequently poached).
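
To put those figures in perspective, here is the arithmetic as a small Python sketch. It uses only the numbers quoted above; since the full recovery took "less than 4 hours", treating it as exactly 4 hours makes the full-recovery rate a lower bound, and the variable names are mine.

```python
# Figures from the real world example above.
filesystem_gb = 400
full_backup_hours_before = 15
full_backup_hours_after = 4
full_recovery_hours = 4        # "less than 4 hours", so this is an upper bound
directory_gb = 40
directory_recovery_hours = 12

backup_speedup = full_backup_hours_before / full_backup_hours_after

# Effective throughput in GB per hour for each kind of recovery.
full_rate = filesystem_gb / full_recovery_hours          # at least this fast
directory_rate = directory_gb / directory_recovery_hours

# How many times slower the partial recovery was, per GB recovered.
slowdown = full_rate / directory_rate

print(f"Backup speed-up from block level backup: {backup_speedup:.2f}x")
print(f"Full filesystem recovery: at least {full_rate:.0f} GB/hour")
print(f"Directory recovery: about {directory_rate:.1f} GB/hour")
print(f"Per-GB slowdown for the partial recovery: at least {slowdown:.0f}x")
```

In other words, recovering a tenth of the data took three times as long, which works out to roughly a thirty-fold drop in effective throughput.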

Recovery from deduped data on tape will very likely suck just as badly.

Point two – design the system so that you facilitate most recoveries from dedupe on disk

Again, fully agreed. So far I’m in complete agreement with Curtis and CommVault. This point can be said of any backup design – design your system so that the most frequently performed recoveries are done from the fastest backup medium.

Point three – Periodically effectively stage out all dedupe data to tape

This is the crazy part, and not crazy good, but out and out crazy bad. To quote Curtis on this:

If you’re going to dedupe to tape, you first have to dedupe to disk.  You create what they call a silo on disk, which is a full backup and a set of deduped incrementals based on (and deduped against) that full backup. The retention on that silo should be long enough to satisfy most of your operational restore requests.  (Typically that’s 30 days, but it could be longer in your environment.)

What’s so crazy-bad about this?

Now, I’ll profess that I don’t know for sure which way this is being done, but it reads that new full backups are generated periodically in the dedupe environment, allowing the previous dependency chains of fulls + incrementals to be transferred out to tape. (Based on my reading of the CommVault marketing documentation, which refers to “reducing” the number of fulls required for retention cycles, this appears to be an accurate assessment.)

So this means that every X days (whatever your period-between-fulls is going to be) you have to do new fulls. Now while this isn’t so much of an issue in regular backups, in dedupe backups it’s a known fact that the initial full backups are hideously slow. This can be worn by most organisations when it’s a once-off. Every month? Even every 3 months or 6 months? Far less likely.

Point four – Long-term recoveries are done from tape written in dedupe format

Obviously some of my objections to this have already been expressed in my comments for point one, but to continue with my objections, let’s look at what Curtis has to say on this point as well:

But I also agree that if I typically do all my restores from within the last 30 days, and someone asks me for a 31 day-old file, it’s generally going to be the type of restore where the fact that it might take several minutes to complete is not going to be a huge deal.  (In the case that you did need to do a large restore from a deduped tape set, you could actually bring it back in to disk in its entirety before you initiate the restore.)

Now, I agree that recovery of longer term backups can be done from slower media in most instances.

There’s a difference between “slower media” and “a snail just overtook our data recovery”.

In the first case, I don’t believe that recovery from deduped data on tape will be in the order of “several minutes” … I think this would turn out to be a highly optimistic rather than terribly realistic time-frame. I would need to see a large number of real world instances of short recovery times to really believe this will be in an order of “several minutes”. Yes, I’m going on a gut feeling, but I feel it’s somewhat justified.

In the second case … “you could actually bring it back in to disk in its entirety” … how much storage do you want to be using here? If we’re talking bringing back the entire “silo”, that’s a lot of storage to bring back  – I’d suggest it’s going to be comparable to but orders of magnitude worse than say, recovering a 1TB virtual machine fileserver to a separate location in order to pull out a 100KB Excel spreadsheet. Let’s be accurate about this: recovering the entire silo would mean recovering all deduped backups – most notably a full of your entire environment.

If we’re talking about recovering just portions of the data on tape, then again, it’s going to be like the file-level recovery from block-level backup issue previously described, and we’ll be back to square one.

In Summary

I’ve got to be entirely blunt here – CommVault’s approach reminds me of the old (crude) expression (made as “G Rated” as possible):

“You can’t polish a poo, but you can roll it in gold dust”.

If the supporting architecture is crazy, it doesn’t matter that it can do something “nifty” – particularly if that something “nifty” will result in significantly slower recoveries (even in limited circumstances).

Yes, it’s undoubtedly the case that the CommVault approach will reduce the amount of data stored on tape, which will result in some cost savings. However, penny pinching in backup environments has a tendency to result in recovery impacts – often significant recovery impacts. For example, NetBackup gives “media savings” by not enforcing dependencies. Yes, this can result in saving money here and there on media, but it can also result in being unable to do complete filesystem recoveries approaching the end of a total retention period, which is plain dumb.

The CommVault approach, while saving some money on tape, will significantly expand recovery times (or require large cache areas and still take a lot of recovery time). Saving money is good. Wasting a little time during longer-term recoveries is likely to be perceived as being OK – until there’s a pressing need. Wasting a lot of time during longer-term recoveries is rarely going to be perceived as being OK.

The other saying that springs to mind is: The road to hell is paved with good intentions.

If I’m correct in my understanding of how the CommVault dedupe-to-tape strategy works, based on a review of the CommVault marketing material (as is typical for any vendor, slim on information) and Curtis’ summary, I can only say that their approach is not crazy good as Curtis concludes, but crazy bad.

Posted in Architecture, General thoughts | 4 Comments

NetWorker Resource Relationships

Posted by Preston on 2009-10-07

With NetWorker having many components that can link together in a variety of ways, it’s not always easy (particularly for newcomers) to have a mental map of how all those components interact. Having made repeated stabs over the years at coming up with a coherent diagram showing those relationships, I have a frustrated understanding of the difficulty of drawing them.

Lately I decided to take a slightly different approach – to reduce the level of the diagram to the bare basic components so as to try to give a big overview rather than every possible detail. It’s highly likely I’ve left stuff off, and my diagramming skills aren’t the best – but hopefully if you’re not sure of how everything fits together in NetWorker it may help to improve your mental map of it.

[Figure: NetWorker Resource Relationships]

For the most part, I’ve tried to stick to components that are defined resource types within NetWorker. A couple of notable exceptions are “Volume” and “Level” … neither of these is a defined resource as per the NetWorker resource database, but knowing where they appear in usage helps to fill in a few gaps that would otherwise be confusing.

Posted in Architecture, NetWorker | Comments Off

How much aren’t you backing up?

Posted by Preston on 2009-10-05

Do you have a clear picture of everything that you’re not backing up? For many sites, the answer is not as clear cut as they may think.

It’s easy to quantify the simple stuff – QA or test servers/environments that literally aren’t configured within the backup environment.

It’s also relatively easy to quantify the more esoteric things within a datacentre – PABXs, switch configurations, etc. (Though in a well run backup environment, there’s no reason why you can’t configure scripts that, as part of the backup process, log on to such devices and retrieve the configuration, etc.)

It should also be very, very easy to quantify what data on any individual system you’re not backing up – e.g., knowing that for fileservers you may be backing up everything except for files that have a “.mp3” extension.
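
As a rough sketch of how that sort of exclusion could be quantified on a Linux fileserver, something like the following would total up the bytes being skipped. It assumes GNU find (for -iname and -printf); the function name and the example path are purely illustrative and not part of any backup product.

```shell
#!/bin/sh
# Total the size, in bytes, of files under a directory whose names match a
# pattern that the backup configuration excludes.
# Assumption: GNU find is available (needed for -iname and -printf).
excluded_bytes() {
    dir=$1
    pattern=$2
    find "$dir" -type f -iname "$pattern" -printf '%s\n' |
        awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# Example (hypothetical path):
#   excluded_bytes /export/home '*.mp3'
```

Run periodically and logged, a figure like this turns "we skip MP3s" into a number that can be reported alongside what is being backed up.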

What most sites find difficult to quantify are the quasi-backup situations – files and/or data that they are backing up, but which are useless in a recovery scenario. Now, many readers of that last sentence will probably think of one of the more immediate examples: live database files that are being “accidentally” picked up in the filesystem backup (even if they’re being backed up elsewhere, by a module). Yes, such a backup does fall into this category, but there are other types of backups which are even less likely to be considered.

I’m talking about information that you only need during a disaster recovery – or worse, a site disaster recovery. Let’s consider an average Unix (or Linux) system. (Windows is no different, I just want to give some command line details here.) If a physical server goes up in smoke, and a new one has to be built, there are a couple of things that have to be considered pre-recovery:

  • What was the partition layout?
  • What disks were configured in what styles of RAID layout?

In an average backup environment, this sort of information isn’t preserved. Sure, if you’ve got, say, HomeBase licenses (taking the EMC approach), or are using some other sort of bare metal recovery system, and that system supports your exact environment*, then you may find that such information is preserved and is available.

But what about the high percentage of cases where it’s not?

This is where the backup process needs to be configured/extended to support generation of system or disaster recovery information. It’s all very good for instance, for a Linux machine to say that you can just recover “/etc/fstab”, but what if you can’t remember the size of the partitions referenced by that file system table? Or, what if you aren’t there to remember what the size of the partitions were? (Memory is a wonderful yet entirely fallible and human-dependent process. Disaster recovery situations shouldn’t be bound by what we can or can’t remember about the systems, and so we have to gather all the information required to support disaster recovery.)

On a running system, there are all sorts of tools available to gather this sort of information, but when the system isn’t running, we can’t run the tools, so we need to run them in advance, either as part of the backup process or as a scheduled, checked-upon function. (My preference is to incorporate it into the backup process.)

For instance, consider that Linux scenario – we can quickly assemble the details of all partition sizes on a system with one simple command – e.g.:

[root@nox ~]# fdisk -l

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        2089    16779861   fd  Linux raid autodetect
/dev/sda2            2090        2220     1052257+  82  Linux swap / Solaris
/dev/sda3            2221       19457   138456202+  fd  Linux raid autodetect
/dev/sda4           19458      121601   820471680    5  Extended
/dev/sda5           19458       19701     1959898+  82  Linux swap / Solaris
/dev/sda6           19702      121601   818511718+  fd  Linux raid autodetect

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         250     2008093+  82  Linux swap / Solaris
/dev/sdb2             251      121601   974751907+  83  Linux

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1      121601   976760001   83  Linux

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sdd1   *           1        2089    16779861   fd  Linux raid autodetect
/dev/sdd2            2090        2220     1052257+  82  Linux swap / Solaris
/dev/sdd3            2221       19457   138456202+  fd  Linux raid autodetect
/dev/sdd4           19458      121601   820471680    5  Extended
/dev/sdd5           19458       19701     1959898+  82  Linux swap / Solaris
/dev/sdd6           19702      121601   818511718+  fd  Linux raid autodetect

That wasn’t entirely hard. Scripting that to occur at the start of the backup process isn’t difficult either. For systems that have RAID, there’s another, equally simple command to extract RAID layouts as well – again, for Linux:

[root@nox ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda3[0] sdd3[1]
 138456128 blocks [2/2] [UU]

md2 : active raid1 sda6[0] sdd6[1]
 818511616 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdd1[1]
 16779776 blocks [2/2] [UU]

unused devices: <none>

I don’t want to consume reams of pages discussing what, for each operating system, you should be gathering. The average system administrator for any individual platform, with a cup of coffee (or other preferred beverage) in hand, should be able to sit down and in under 10 minutes jot down the sorts of information that would need to be gathered in advance of a disaster to assist in the total system rebuild of a machine they administer.

Once these information gathering steps have been determined, they can be inserted into the backup process as a pre-backup command. (In NetWorker parlance, this would be via a savepnpc “pre” script. Other backup products will equally feature such options.) Once the information is gathered, a copy should be kept on the backup server as well as in an offsite location. (I’ll give you a useful cloud backup function now: it’s called Google Mail. Great for offsiting bootstraps and system configuration details.)
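
As a minimal sketch of what such a pre-backup script might look like for the two Linux commands shown earlier: the function name, output directory and file names below are all hypothetical, and how your backup product actually invokes the script is left to that product’s documentation.

```shell
#!/bin/sh
# Hypothetical pre-backup helper: capture partition and software RAID layout
# into plain text files so the filesystem backup that follows picks them up.
gather_dr_info() {
    outdir=$1
    mkdir -p "$outdir" || return 1

    # Partition tables for all disks. fdisk needs root to read the devices,
    # so note a failure in the output file rather than aborting the backup.
    if command -v fdisk >/dev/null 2>&1; then
        fdisk -l > "$outdir/fdisk.txt" 2>&1 ||
            echo "fdisk -l exited non-zero (not running as root?)" >> "$outdir/fdisk.txt"
    else
        echo "fdisk not found on this host" > "$outdir/fdisk.txt"
    fi

    # Linux software RAID (md) layout, if the md driver is loaded.
    if [ -r /proc/mdstat ]; then
        cat /proc/mdstat > "$outdir/mdstat.txt"
    else
        echo "/proc/mdstat not present on this host" > "$outdir/mdstat.txt"
    fi

    # Record which host this came from and when it was captured (UTC).
    printf '%s %s\n' "$(uname -n)" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$outdir/captured.txt"
}

# Example (hypothetical directory, which must sit inside a backed-up filesystem):
#   gather_dr_info /var/lib/dr-info
```

The script records failures in the output files instead of exiting, on the reasoning that missing disaster recovery details shouldn’t stop the backup itself; whether that is the right trade-off for your site is a policy decision.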

When it comes to disaster recovery, such information can take the guesswork or reliance on memory out of the equation, allowing a system or backup administrator in any (potentially sleep-deprived) state, with any level of knowledge about the system in question, to conduct the recovery with a much higher degree of certainty.


* Due to what they offer to do, bare metal recovery (BMR) products tend to be highly specific in which operating system variants, etc., they support. In my experience a significantly higher number of sites don’t use BMR than do.

Posted in Architecture, Backup theory, Linux, NetWorker, Policies, Scripting | 2 Comments

Vendors! Listen up! Stop talking about archive when you mean HSM

Posted by Preston on 2009-09-22

When it comes to backup and data protection, I like to think of myself as being somewhat of a stickler for accuracy. After all, without accuracy, you don’t have specificity, and without specificity, you can’t reliably say that you have what you think you have.

So on the basis of wanting vendors to be more accurate, I really do wish vendors would stop talking about archive when they actually mean hierarchical storage management (HSM). It confuses journalists, technologists, managers and storage administrators, and (I must admit to some level of cynicism here) appears to be mainly driven by some thinking that “HSM” sounds either too scary or too complex.

HSM is neither scary nor complex – it’s just a variant of tiered storage, which is something that any site with 3+ TB of presented primary production data should be at least aware of, if not actively implementing and using. (Indeed, one might argue that HSM is the original form of tiered storage.)

By “presented primary production”, I’m referring to available-to-the-OS high speed, high cost storage presented in high performance LUN configurations. At this point, storage costs are high enough that tiered storage solutions start to make sense. (Bear in mind that 3+ TB of presented storage in such configurations may represent between 6 and 10TB of raw high speed, high cost storage. Thus, while it may not sound all that expensive initially, the disk-to-data ratio increases the cost substantially.) It should be noted that whether that tiering is done with a combination of different speeds of disks and levels of RAID, or with disk vs tape, or some combination of the two, is largely irrelevant to the notion of HSM.

Not only is HSM easy to understand, with no reason to have any fear associated with it, but the difference between HSM and archive is equally easy to understand. It can even be explained with diagrams.

Here’s what archive looks like:

[Figure: The archive process and subsequent data access]

So, when we archive files, we first copy them out to archive media, then delete them from the source. Thus, if we need to access the archived data, we must read it back directly from the archive media. There is no reference left to the archived data on the filesystem, and data access must be managed independently from previous access methods.

On the other hand, here’s what the HSM process looks like:

[Figure: The HSM process and subsequent data access]

So when we use HSM on files, we first copy them out to HSM media, then delete (or truncate) the original file but put in its place a stub file. This stub file has the same file name as the original file, and should a user attempt to access the stub, the HSM system silently and invisibly retrieves the original file from the HSM media, providing it back to the end user. If the user saves the file back to the same source, the stub is replaced with the original+updated data; if the user doesn’t save the file, the stub is left in place.

Or if you’re looking for an even simpler distinction: archive deletes, HSM leaves a stub. If a vendor talks to you about archive, but their product leaves a stub, you can know for sure that they actually mean HSM.
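
To make the distinction concrete, here’s a toy sketch in shell. It is in no way how any real archive or HSM product is implemented (real HSM recalls happen transparently when the stub is accessed); the function names and the plain-text stub format are made up purely for illustration.

```shell
#!/bin/sh
# Toy illustration only: "archive deletes, HSM leaves a stub".

# Archive: copy the file to the archive location, then delete the source.
# Nothing is left behind on the original filesystem.
archive_file() {
    src=$1
    dest_dir=$2
    cp -p "$src" "$dest_dir/" && rm -f "$src"
}

# HSM: copy the file to the HSM location, then replace the source with a
# small stub of the same name that records where the data went.
hsm_migrate() {
    src=$1
    dest_dir=$2
    cp -p "$src" "$dest_dir/" &&
        printf 'STUB -> %s/%s\n' "$dest_dir" "$(basename "$src")" > "$src"
}

# Recall: follow the stub and put the original data back in its place.
# Here it is an explicit call; a real HSM system does this invisibly.
hsm_recall() {
    stub=$1
    target=$(sed -n '1s/^STUB -> //p' "$stub")
    [ -n "$target" ] && cp -p "$target" "$stub"
}
```

After archive_file the original path no longer exists, so access has to go via the archive location; after hsm_migrate the original path is still there, and hsm_recall restores the data under the same name.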

Honestly, these two concepts aren’t difficult, and they aren’t the same. In the never ending quest to save user bytes, you’d think vendors would appreciate that it’s cheaper to refer to HSM as HSM rather than Archive. Honestly, that’s a 4 byte space saving alone, every time the correct term is used!

[Edit - 2009-09-23]

OK, so it’s been pointed out by Scott Waterhouse that the official SNIA definition for archive doesn’t mention having to delete the source files, so I’ll accept that I was being stubbornly NetWorker-centric on this blog article. So I’ll accept that I’m wrong and (grudgingly yes) be prepared to refer to HSM as archive. But I won’t like it. Is that a fair compromise? :-)

I won’t give up on ILP though!

Posted in Architecture, Backup theory, General Technology, General thoughts, Quibbles | 6 Comments

Think backup belongs in ILM? Think again

Posted by Preston on 2009-09-12

In my opinion (and after all, this is my blog), there’s a fundamental misconception in the storage industry that backup is a part of Information Lifecycle Management (ILM).

My take is that backup has nothing to do with ILM. Backup instead belongs to a sister (or shadow) activity, Information Lifecycle Protection – ILP. The comparison between the two is somewhat analogous to the comparison I made in “Backup is a Production Activity” between operational production systems and infrastructure support production systems; that is, one is directly related to the operational aspects of the data, and the other exists to support the data.

Here’s an example of what Information Lifecycle Protection would look like:

[Figure: Information Lifecycle Protection]

Obviously there’s some simplification going on in the above diagram – for instance, I’ve encapsulated any online storage based fault-protection into “RAID”, but it does serve to get the basic message across.

If we look at say, Wikipedia’s entry on Information Lifecycle Management, backup is mentioned as being part of the operational aspects of ILM – this is actually a fairly standard definition of the perceived position of backup within ILM; however, standard definition or not, I have to disagree.

At its heart, ILM is about ensuring correct access and lifecycle retention policies for data: neither of these core principles encapsulates the activities in information lifecycle protection. ILP, on the other hand, is about making sure the data remains available to meet the ILM policies. If you think this is a fine distinction to make, you’re not necessarily wrong. My point is not that there’s a huge difference, but that there’s an important difference.

To me, it all boils down to a fundamental need to separate access from protection/availability, and the reason I like to maintain this separation is how it affects end users, and the level of awareness they need to have for it. In their day-to-day activities, users should have an awareness of ILM – they should know what they can and can’t access, they should know what they can and can’t delete, and they should know where they will need to access data from. They shouldn’t however need to concern themselves with RAID, they shouldn’t need to concern themselves with snapshots, they shouldn’t need to concern themselves with replication, and they shouldn’t need to concern themselves with backup.

NOTE: I do, in my book, make it quite clear that end users have a role in backup in that they must know that backup doesn’t represent a blank cheque for them to delete data willy-nilly, and that they should know how to request a recovery; however, in their day to day job activities, backups should not play a part in what they do.

Ultimately, that’s my distinction: ILM is about activities that end-users do, and ILP is about activities that are done for end-users.

Posted in Architecture, Backup theory, General Technology, General thoughts | 2 Comments

 