NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist

  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).

  • Enterprise Systems Backup and Recovery

    If you find this blog interesting, and either have an interest in or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on features, policies, procedures and the human element to ensuring that your company has a suitable and working backup system rather than just a bunch of copies made by unrelated software, hardware and processes.

Archive for the ‘Architecture’ Category

Pictorial representation of management happiness over OpEx cost savings

Posted by Preston on 2009-12-16

There is a profoundly important backup adage that should be front and centre in the mind of any backup administrator, operator, manager, etc. This is:

It is always better to back up a little more than not quite enough.

This isn’t an invitation to wildly waste backup capacity on useless copies that can never be recovered from, or to generate needless backups. However, it should serve as a constant reminder that if you keep shaving important stuff out of your backups, you’ll eventually suffer a Titanic-sized issue.

Now, people at this point often tell me that (a) they’re being told to reduce the amount of data being backed up, or (b) it makes their manager happy to be able to report less OpEx budget required, or (c) some combination of the two, or (d) they’re reluctant to ask for additional budget for storage media.

The best way to counter these oppressive and dangerous memes is to draw up a graph of how happy your manager(s) will be over saving a few hundred dollars here and there on media versus potential recovery issues. To get you started, I’ve drawn up one which covers a lot of sites I’ve encountered over the years:

Manager happiness vs state of environment and needless cost savings on backup

You see, it’s easy to be happy about saving a few dollars here and there on backup media in the here and now, when your backups are running and you don’t need to recover.

However, as soon as the need for a recovery starts to creep in, previous happiness over saving a few hundred dollars rapidly evaporates in direct proportion to the level of the data loss. There might be minimal fallout from a single lost document or email, but past that, things start to get rather hairy. In fact, it’s very easy to switch from 100% management happiness to 100% management disgruntlement within the space of 24 hours in extreme situations.

You, as a backup administrator, may already be convinced of this. (I would hope you are.) Sometimes though, other staff or managers may need reminding that they too may be judged by more senior management on the recoverability of systems under their supervision, so this graph equally applies to them. That continues right up the chain, further reinforcing the fact that backups are an activity that belongs to the entire company, not just IT, and therefore a financial concern that needs to be budgeted for by the entire company.

Posted in Architecture, Backup theory | 4 Comments »

EMC, Data Domain, VTLs and Disk Backup

Posted by Preston on 2009-11-30

With their recent acquisition of Data Domain, some people at EMC have become table-thumping experts overnight on why it’s absolutely imperative that you back up to Data Domain boxes as disk backup over NAS, rather than as a fibre-channel connected VTL.

Their argument seems to come from the numbers – the wrong numbers.

The numbers constantly quoted are the number of sales of disk backup Data Domain vs VTL Data Domain. That is, some EMC and Data Domain reps will confidently assert that, by the numbers, a significantly higher percentage of Data Domain for Disk Backup has been sold than Data Domain with VTL. That’s like saying that Windows is superior to Mac OS X because it sells more. Or to pick a perhaps less controversial topic, it’s like saying that DDS is better than LTO because there have been more DDS drives and tapes sold than there have ever been LTO drives and tapes.

I.e., an argument by those numbers doesn’t wash. It rarely has, it rarely will, and nor should it. (Otherwise we’d all be afraid of sailing too far from shore because that’s how it had always been done before…)

Let’s look at the reality of how disk backup currently stacks up in NetWorker. And let’s preface this by saying that if backup products actually started using disk backup properly tomorrow, I would be the first to shout “Don’t let the door hit your butt on the way out” to every VTL on the planet. As a concept, I wish VTLs didn’t have to exist, but in the practical real world, I recognise their need and their current ascendancy over ADV_FILE. I have, almost literally at times, been dragged kicking and screaming to that conclusion.

Disk Backup, using ADV_FILE type devices in NetWorker:

  • Can’t move a saveset from a full disk backup unit to a non-full one; you have to clear the space first.
  • Can’t simultaneously clone from, stage from, back up to and recover from a disk backup unit. No, you can’t do that with tape either, but when disk backup units are typically in the order of several terabytes, and virtual tapes are in the order of maybe 50-200 GB, that’s a heck of a lot less contention time for any one backup.
  • Use tape/tape drive selection algorithms for deciding which disk backup unit gets used in which order, resulting in worst case capacity usage scenarios in almost all instances.
  • Can’t accept a saveset bigger than the disk backup unit. (It’s like, “Hello, AMANDA, I borrowed some ideas from you!”)
  • Can’t be part-replicated between sites. If you’ve got two VTLs and you really need to do back-end replication, you can replicate individual pieces of media between sites – again, significantly smaller than entire disk backup units. When you define disk backup units in NetWorker, that’s the “smallest” media you get.
  • Are traditionally space wasteful. NetWorker’s limited staging routines encourage clumps of disk backup space by destination pool – e.g., “here’s my daily disk backup units, I use them 30 days out of 31, and those over there that occupy (practically) the same amount of space are my monthly disk backup units, I use them 1 day out of 31. The rest of the time they sit idle.”
  • Have poor staging options (I’ll do another post this week on one way to improve on this).

If you get a table-thumping salesperson trying to tell you that you should buy Data Domain for Disk Backup for NetWorker, I’d suggest thumping the table back – you want the VTL option instead, and you want EMC to fix ADV_FILE.

Honestly EMC, I’ll lead the charge once ADV_FILE is fixed. I’ll champion it until I’m blue in the face, then suck from an oxygen tank and keep going – like I used to, before the inadequacies got too much. Until then though, I’ll keep skewering that argument of superiority by sales numbers.

Posted in Architecture, NetWorker | 3 Comments »

Storage Tiering vs ILM

Posted by Preston on 2009-11-24

Over at StorageNerve, and on Twitter, Devang Panchigar has been asking “Is Storage Tiering ILM or a subset of ILM, but where is ILM?” I think it’s an important question with some interesting answers.

Devang starts with defining ILM from a storage perspective:

1) A user or an application creates data and possibly over time that data is modified.
2) The data needs to be stored and possibly be protected through RAID, snaps, clones, replication and backups.
3) The data now needs to be archived as it gets old, and retention policies & laws kick in.
4) The data needs to be search-able and retrievable NOW.
5) Finally the data needs to be deleted.

I agree with items 1, 3, 4 and 5 – as per previous posts, for what it’s worth, I believe that 2 belongs to a sister activity which I define as Information Lifecycle Protection (ILP) – something that Devang acknowledges as an alternative theory. (I liken the separation between ILM and ILP to that between operational production servers and support production servers.)

The above list, for what it’s worth, is actually a fairly astute/accurate summary of the involvement of the storage industry thus far in ILM. Devang rightly points out that Storage Tiering (migrating data between different speed/capacity/cost storage based on usage, etc.) doesn’t address all of the above points – in particular, data creation and data deletion. That’s certainly true.

What’s missing from ILM from a storage perspective are the components that storage can only peripherally control. Perhaps that’s not entirely accurate – the storage industry can certainly participate in the remaining components (indeed, in NAS systems its participation is absolutely necessary, as a prime example) – but it involves more than just the storage industry. It’s operating system vendors. It’s application vendors. It’s database vendors. It is, quite frankly, the whole kit and caboodle.

What’s missing in the storage-centric approach to ILM is identity management – or to be more accurate in this context, identity management systems. The brief outline of identity management is that it’s about moving access control and content control out of the hands of the system, application and database administrators, and into the hands of human resources/corporate management. So a system administrator could have total systems access over an entire host and all its data but not be able to open files that (from a corporate management perspective) they have no right to access. A database administrator can fully control the corporate database, but can’t access commercially sensitive or staff salary details, etc.

Most typically though, it’s about corporate roles, as defined in human resources, being reflected from the ground up in system access options. That is, when human resources set up a new employee in a particular role within the organisation (e.g., “personal assistant”), that triggers the appropriate workflows to set up the person’s accounts and access privileges on IT systems as well.
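
As a rough illustration of the idea only (this isn’t any particular identity management product, and every role and system named below is hypothetical), the point is that the role recorded by HR, rather than a request to an administrator, is what drives the access an account gets provisioned with:

# Sketch only: role definitions owned by HR/management drive what access an
# account is provisioned with. Roles and systems here are made up.
ROLE_ENTITLEMENTS = {
    "personal assistant": {"email", "calendar", "departmental file share"},
    "dba":                {"email", "database administration console"},
    "unix sysadmin":      {"email", "root on managed Unix hosts"},
}

def provision(employee, role):
    """Triggered when HR assigns or changes a role - not by an admin request."""
    for system in sorted(ROLE_ENTITLEMENTS.get(role, set())):
        # A real identity management system would call the provisioning
        # workflow for each directory/application here.
        print("grant %s access to: %s" % (employee, system))

provision("jdoe", "personal assistant")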

If you think that’s insane, you probably don’t appreciate the purpose of it. System/app/database administrators I talk to about identity management frequently raise the issue of trust (or the perceived lack thereof) in such systems. That is, they think that if the company they work for wants to implement identity management, it must not trust the people tasked with protecting the systems. I won’t lie – in a very small number of instances this may be the case. Maybe 1%, maybe as high as 2%. But let’s look at the bigger picture here: we, as system/application/database administrators, currently have access to such data not because we should have access to it, but because until recently there have been very few options in place to limit data access to only those who, from a corporate governance perspective, should have access to that data. That said, most system/app/database administrators are highly ethical – they know that being able to access data doesn’t equate to actually accessing that data. (Case in point: as the engineering manager and sysadmin at my last job, if I’d been less ethical, I would have seen the writing on the wall long before the company fell down under financial stresses around my ears!)

Trust doesn’t wash in legal proceedings. Trust doesn’t wash in financial auditing. That’s particularly so where accurate logs aren’t maintained in an appropriately secured manner to prove that person A didn’t access data X. The fact that the system was designed to permit A to access X (even as part of A’s job) is, in some financial, legal and data sensitivity areas, significant cause for concern.

Returning to the primary point though, it’s about ensuring that the people who have authority over someone’s role within a company (human resources/management) have control over the processes that configure the access permissions that person has. It’s also about making sure that those workflows are properly configured and automated so there’s no room for error.

So what’s missing – or what’s only at the barest starting point, is the integration of identity/access control with ILM (including storage tiering) and ILP. This, as you can imagine, is not an easy task. Hell, it’s not even a hard task – it’s a monumentally difficult task. It involves a level of cooperation and coordination between different technical tiers (storage, backup, operating systems, applications) that we rarely, if ever see beyond the basic “must all work together or else it will just spend all the time crashing” perspective.

That’s the bit that gives the extra components – control over content creation and destruction. The storage industry on its own does not have the correct levels of exposure to an organisation in order to provide this functionality of ILM. Nor do the operating system vendors. Nor do the database vendors or the application vendors – they all have to work together to provide a total solution on this front.

I think this answers (indirectly) Devang’s question/comment on why storage vendors, and indeed most of the storage industry, have stopped talking about ILM – the easy parts are well established, but the hard parts are only in their infancy. We are, after all, seeing some very early processes around integrating identity management and ILM/ILP. For instance, key management on backups, if handled correctly, can allow for situations where backup administrators can’t by themselves perform the recovery of sensitive systems or data – it requires corporate permissions (e.g., the input of a data access key by someone in HR). Various operating systems and databases/applications are now providing hooks for identity management (to name just one, here’s Oracle’s details on it).

So no, I think we can confidently say that storage tiering in and of itself is not the answer to ILM. As to why the storage industry has for the most part stopped talking about ILM, we’re left with one of two choices – it’s hard enough that they don’t want to progress it further, or it’s sufficiently commercially sensitive that it’s not something discussed without the strongest of NDAs.

We’ve seen in the past that the storage industry can cooperate on shared formats and standards. We wouldn’t be in the era of pervasive storage we currently are without that cooperation. Fibre-channel, SCSI, iSCSI, FCoE, NDMP, etc., are proof positive that cooperation is possible. What’s different this time is the cooperation extends over a much larger realm to also encompass operating systems, applications, databases, etc., as well as all the storage components in ILM and ILP. (It makes backups seem to have a small footprint, and backups are amongst the most pervasive of technologies you can deploy within an enterprise environment.)

So we can hope that the reason we’re not hearing a lot of talk about ILM any more is that all the interested parties are either working on this level of integration, or even making the appropriate preparations themselves in order to start working together on this level of integration.

Fingers crossed people, but don’t hold your breath – no matter how closely they’re talking, it’s a long way off.

Posted in Architecture, General Technology, General thoughts, Security | 2 Comments »

Enhancing NetWorker Security: A theoretical architecture

Posted by Preston on 2009-11-18

It’s fair to say that no one backup product can be all things to all people. More generally, it’s fair to say that no product can be all things to all people.

Security has had a somewhat interesting past in NetWorker; much of the attention to security for a lot of that time has been to do with (a) defining administrators, (b) ensuring clients are who they say they are, and (c) providing access controls for directed recoveries.

There are a bunch of areas though where NetWorker security has remained somewhat lacking. Not 100% lacking, just not complete. For instance, user accounts that are used for module backup and recovery frequently need higher levels of authority than standard users. Equally, some sites want their <X> admins to be able to control as much as possible of the <X> backups, but not to have any administrator privileges over the <Y> backups. I’d like to propose an idea that, if implemented, would both improve security and make NetWorker more flexible.

The change would be to allow the definition of administrator zones. An “administrator zone” would be a subset of a datazone. It would consist of:

  1. User groups:
    • A nominated “administrator” user group.
    • A nominated “user” user group.
    • Any other number of nominated groups with intermediate privileges.
  2. A collection of the following:
    • Clients
    • Groups
    • Pools
    • Devices
    • Schedules
    • Policies
    • Directives
    • etc

These obviously would still be accessible in the global datazone for anyone who is a datazone administrator. Conceptually, this would look like the following:

Datazone with subset "administrator" zones

The first thing this should point out to you is that administrator zones could, if desired, overlap. For instance, in the above diagram we have:

  1. Minor overlap between Windows and Unix admin zones (e.g., they might both have administrative rights over tape libraries).
  2. Overlap between Unix and Oracle admin zones.
  3. Overlap between Windows and Oracle admin zones.
  4. Overlap between Windows and Exchange admin zones.
  5. Overlap between Windows and MSSQL admin zones.

Notably though, the DMZ Admin zone indicates that you can have some zones that have no overlap/commonality with other zones.

There’d need to be a few rules established in order to make this work. These would be:

  1. Only the global datazone can support “<x>@*” user or group definitions in a user group.
  2. If there is overlap between two zones, then the user will inherit the rights of the highest authority they belong to (see the sketch after this list). I.e., if a user is editing a shared feature between the Windows and Unix admin zones, and is declared an admin in the Unix zone but only an end-user in the Windows zone, then the user will edit that shared feature with the rights of an admin.
  3. Similarly, if there’s overlap between privileges at the global datazone level and a local administrator zone, the highest privileges will “win” for the local resource.
  4. Resources can only be created and deleted by someone with datazone administrator privileges.
  5. Updates for resources that are shared between multiple administrator zones need to be “approved” by an administrator from each administrator zone that overlaps or a datazone administrator.
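
To make rules 2 and 3 concrete, here’s a minimal sketch of how a “highest authority wins” resolution might be evaluated for a shared resource. It’s purely illustrative Python – none of this is existing NetWorker functionality, and all of the zone, user and resource names are hypothetical:

# Illustrative only: resolve a user's effective rights over a resource shared
# between overlapping admin zones (rules 2 and 3 above).
PRIVILEGE_RANK = {"user": 0, "intermediate": 1, "administrator": 2}

def effective_privilege(user, resource_zones, zone_membership, global_admins):
    """Return the highest privilege the user holds across the global datazone
    and every admin zone that contains the resource."""
    if user in global_admins:        # global datazone rights always count (rule 3)
        return "administrator"
    best = None
    for zone in resource_zones:      # take the highest authority held (rule 2)
        role = zone_membership.get(zone, {}).get(user)
        if role is not None and (best is None or
                                 PRIVILEGE_RANK[role] > PRIVILEGE_RANK[best]):
            best = role
    return best if best is not None else "no access"

# Example: a tape library shared by the Windows and Unix admin zones.
membership = {
    "Unix":    {"lucy": "administrator"},
    "Windows": {"lucy": "user"},
}
print(effective_privilege("lucy", ["Unix", "Windows"], membership, global_admins=set()))
# -> administrator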

Would this be perfect? Not entirely – for instance, it would still require a datazone administrator to create the resources that are then allocated to an administrator zone for control. However, this would prevent a situation occurring where an unprivileged user with “create” options could go ahead and create resources they wouldn’t have authority over. Equally, in an environment that permits overlapping zones, it’s not appropriate for someone from one administrator zone to delete a resource shared by multiple administrator zones. Thus, for safety’s sake, administrator zones should only concern themselves with updating existing resources.

How would the approval process work for edits of resources that are shared by overlapping zones? To start with, the resource that has been updated would continue to function “as is”, and a “copy” would be created (think of it as a temporary resource), with a notification used to trigger a message to the datazone administrators and the other, overlapping administrators. Once the appropriate approval has been done (e.g., an “edit” process in the temporary resource), then the original resource would be overwritten with the temporary resource, and the temporary resource removed.
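
A minimal sketch of that approval flow might look like the following – again purely illustrative, with resources modelled as plain dictionaries and the notification mechanism reduced to a stub:

# Illustrative sketch of the proposed approval flow for a resource shared by
# overlapping admin zones. Resources are plain dictionaries; notify() is a stub.
import copy

PENDING = {}   # temporary copies awaiting approval, keyed by resource name

def notify(recipients, message):
    for recipient in recipients:
        print("notify %s: %s" % (recipient, message))

def propose_edit(name, resource, changes, sharing_zones):
    """Create a temporary copy with the edit applied and notify approvers."""
    pending = copy.deepcopy(resource)
    pending.update(changes)
    PENDING[name] = pending
    notify(sharing_zones + ["datazone administrators"],
           "edit to '%s' awaiting approval" % name)
    return pending

def approve_edit(name, resource):
    """Overwrite the original with the approved copy, then drop the copy."""
    resource.update(PENDING.pop(name))
    return resource

# Example: two admin zones share a pool definition.
pool = {"name": "Daily", "retention": "1 Month"}
propose_edit("Daily", pool, {"retention": "3 Months"}, ["Unix admins", "Windows admins"])
approve_edit("Daily", pool)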

So what sort of extra resources would we need to establish this? Well, we’ve already got user groups, which is a starting point. The next step is to define an “admin zone” resource, which has fields for:

  1. Administrator user group.
  2. Standard user group.
  3. “Other” user groups.
  4. Clients
  5. Groups
  6. Pools
  7. Schedules
  8. Policies
  9. Directives
  10. Probes
  11. Lockboxes
  12. Notifications
  13. Labels
  14. Staging Policies
  15. Devices
  16. Autochangers
  17. etc.

In fact, pretty much every resource except for the server resource itself, and licenses, should be eligible for inclusion into a localised admin zone. At its most basic, you might expect to see the following:

nsradmin> print type: NSR admin zone; name: Oracle
type: NSR admin zone;
name: Oracle;
administrators: Oracle Admins;
users: Oracle All Users;
other user groups: ;
clients: delphi, pythia;
groups: Daily Oracle FS, Monthly Oracle FS,
Daily Oracle DB, Monthly Oracle DB;
pools: ;
schedules: Daily Oracle, Monthly Oracle;
policies: Oracle Daily, Oracle Monthly;
directives: pythia exclude oracle, delphi exclude oracle;
...etc...

To date, NetWorker’s administration focus has been far more global. If you’re an administrator, you can do anything to any resource. If you’re a user, you can’t do much with any resource. If you’ve been given a subset of privileges, you can use those privileges against all resources touched by those privileges.

An architecture that worked along these lines would allow for much more flexibility in terms of partial administrative privileges in NetWorker – zones of resources and local administrators for those resources would allow for more granular control of configuration and backup functionality, while still keeping NetWorker configuration maintained at the central server.

Posted in Architecture, Backup theory, NetWorker, Security | 2 Comments »

Why /nsr/tmp is wrong

Posted by Preston on 2009-11-02

On both Windows and Unix platforms, NetWorker maintains a “tmp” directory within nsr.

This directory contains a variety of information, from the output gathered for savegroup completion notifications to lock/state files for certain NetWorker resources.

To explain why /nsr/tmp is wrong, let me first tell you a little story about the first system administration team I joined. They rigorously followed RFC 1178, and ever since then I’ve done my best to follow that RFC too – I’ve even written an article here on the blog about choosing appropriate names for backup servers. Sometime before I joined the team, they were in the process of setting up a replacement DNS server for the local datacentre. There was either a dispute about what to name it, or it was only meant to hang around for a short while, but for whatever reason, it was named tmp.

I worked in the group from 1996 through to 2000, and from what I heard, it wasn’t until several years after I left that tmp was decommissioned.

One of the most valuable lessons I took away is to name things appropriately. The DNS server tmp was not named appropriately. Thus, the name tmp or temp should be used only for transient data or systems. (To this day I never give machines names along the lines of ‘tmp’; the closest I’ll go is naming them after synonyms for trash or garbage – meaning that I’m fully aware that at any moment they can be blown away.)

To return to our topic, /nsr/tmp is wrong because it’s misnamed. Temporary files only make up some of its content. Other files – state files – can hang around between restarts of NetWorker and (particularly if NetWorker was incorrectly shut down) give backup administrators really bad days. In fact, the “magical random” nature of /nsr/tmp is so well known that it’s actually started to really bug EMC engineering. My understanding is that engineering want the contents of /nsr/tmp captured any time an EMC support representative tells someone to shutdown+delete+restart, so that if it does fix the problem, they can try to debug why and remove the need.

The problem with shutdown+delete+restart is that in doing so, you clear out other information as well. Selectively deleting “the right file” can sometimes be a bit of a needle-in-a-haystack operation, and I suspect that debugging these deletes post-event will either be frustratingly slow or a bit like whack-a-mole.

Architecturally, to include both state and temporary files in the same common directory structure is silly. Having a few extra directories in the ‘nsr’ base directory on the other hand is a minor change. I’d suggest that more improvements might be made by first actually splitting /nsr/tmp into:

  • /nsr/lck – Resource lock files
  • /nsr/tmp – Real temporary files (e.g., savegroup output text)
  • /nsr/state – State files (if necessary)

That way /nsr/tmp will actually start to obey the Principle of Least Astonishment.

Posted in Architecture, NetWorker | Comments Off on Why /nsr/tmp is wrong

NetWorker Resource Relationships

Posted by Preston on 2009-10-07

With NetWorker having many components that can link together in a variety of ways, it’s not always easy (particularly for newcomers) to have a mental map of how all those components interact. Having made repeated stabs over the years at coming up with a coherent diagram showing those relationships, I have a keen – and frustrated – appreciation of how difficult they are to draw.

Lately I decided to take a slightly different approach – to reduce the level of the diagram to the bare basic components so as to try to give a big overview rather than every possible detail. It’s highly likely I’ve left stuff off, and my diagramming skills aren’t the best – but hopefully if you’re not sure of how everything fits together in NetWorker it may help to improve your mental map of it.

NetWorker Resource Relationships

For the most part, I’ve tried to stick to components that are defined resource types within NetWorker. A couple of notable exceptions are “Volume” and “Level” … neither of these are defined resources as per the NetWorker resource database, but knowing where they appear in usage helps to fill in a few gaps that would otherwise be confusing.

Posted in Architecture, NetWorker | Comments Off on NetWorker Resource Relationships

How much aren’t you backing up?

Posted by Preston on 2009-10-05

Do you have a clear picture of everything that you’re not backing up? For many sites, the answer is not as clear cut as they may think.

It’s easy to quantify the simple stuff – QA or test servers/environments that literally aren’t configured within the backup environment.

It’s also relatively easy to quantify the more esoteric things within a datacentre – PABXs, switch configurations, etc. (Though in a well-run backup environment, there’s no reason why you can’t configure scripts that, as part of the backup process, log onto such devices and retrieve the configuration, etc.)

It should also be very, very easy to quantify what data on any individual system that you’re not backing up – e.g., knowing that for fileservers you may be backing up everything except for files that have a “.mp3” extension.

What most sites find difficult to quantify are the quasi-backup situations – files and/or data that they are backing up, but which would be useless in a recovery scenario. Now, many readers of that last sentence will probably think of one of the more immediate examples: live database files that are being “accidentally” picked up in the filesystem backup (even if they’re being backed up elsewhere, by a module). Yes, such a backup does fall into this category, but there are other types of backups which are even less likely to be considered.

I’m talking about information that you only need during a disaster recovery – or worse, a site disaster recovery. Let’s consider an average Unix (or Linux) system. (Windows is no different, I just want to give some command line details here.) If a physical server goes up in smoke, and a new one has to be built, there’s a couple of things that have to be considered pre-recovery:

  • What was the partition layout?
  • What disks were configured in what styles of RAID layout?

In an average backup environment, this sort of information isn’t preserved. Sure, if you’ve got, say, HomeBase licenses (taking the EMC approach), or are using some other sort of bare metal recovery system, and that system supports your exact environment*, then you may find that such information is preserved and available.

But what about the high percentage of cases where it’s not?

This is where the backup process needs to be configured/extended to support generation of system or disaster recovery information. It’s all very well, for instance, to say that on a Linux machine you can just recover “/etc/fstab”, but what if you can’t remember the size of the partitions referenced by that filesystem table? Or what if you aren’t there to remember what size the partitions were? (Memory is a wonderful yet entirely fallible and human-dependent process. Disaster recovery situations shouldn’t be bound by what we can or can’t remember about the systems, and so we have to gather all the information required to support disaster recovery.)

On a running system, there are all sorts of tools available to gather this sort of information, but when the system isn’t running, we can’t run the tools – so we need to run them in advance, either as part of the backup process or as a scheduled, checked-upon function. (My preference is to incorporate it into the backup process.)

For instance, consider that Linux scenario – we can quickly assemble the details of all partition sizes on a system with one simple command – e.g.:

[root@nox ~]# fdisk -l

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        2089    16779861   fd  Linux raid autodetect
/dev/sda2            2090        2220     1052257+  82  Linux swap / Solaris
/dev/sda3            2221       19457   138456202+  fd  Linux raid autodetect
/dev/sda4           19458      121601   820471680    5  Extended
/dev/sda5           19458       19701     1959898+  82  Linux swap / Solaris
/dev/sda6           19702      121601   818511718+  fd  Linux raid autodetect

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         250     2008093+  82  Linux swap / Solaris
/dev/sdb2             251      121601   974751907+  83  Linux

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1      121601   976760001   83  Linux

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sdd1   *           1        2089    16779861   fd  Linux raid autodetect
/dev/sdd2            2090        2220     1052257+  82  Linux swap / Solaris
/dev/sdd3            2221       19457   138456202+  fd  Linux raid autodetect
/dev/sdd4           19458      121601   820471680    5  Extended
/dev/sdd5           19458       19701     1959898+  82  Linux swap / Solaris
/dev/sdd6           19702      121601   818511718+  fd  Linux raid autodetect

That wasn’t entirely hard. Scripting that to occur at the start of the backup process isn’t difficult either. For systems that have RAID, there’s another, equally simple command to extract RAID layouts as well – again, for Linux:

[root@nox ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda3[0] sdd3[1]
 138456128 blocks [2/2] [UU]

md2 : active raid1 sda6[0] sdd6[1]
 818511616 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdd1[1]
 16779776 blocks [2/2] [UU]

unused devices: <none>

I don’t want to consume reams of pages discussing what, for each operating system, you should be gathering. The average system administrator for any individual platform should, with a cup of coffee (or other preferred beverage) in hand, be able to sit down and in under 10 minutes jot down the sorts of information that would need to be gathered in advance of a disaster to assist in the total rebuild of the operating system on a machine they administer.

Once these information gathering steps have been determined, they can be inserted into the backup process as a pre-backup command. (In NetWorker parlance, this would be via a savepnpc “pre” script. Other backup products will equally feature such options.) Once the information is gathered, a copy should be kept on the backup server as well as in an offsite location. (I’ll give you a useful cloud backup function now: it’s called Google Mail. Great for offsiting bootstraps and system configuration details.)
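
As a rough example of what that pre-backup gathering step might look like, here’s a small sketch. The commands collected and the output location are assumptions you’d adapt to your own platforms, and you’d reference the script from your savepnpc “pre” command (or your backup product’s equivalent):

#!/usr/bin/env python
# Sketch of a pre-backup "gather disaster recovery details" step for Linux.
# The commands and the output path are assumptions - adjust per environment.
import socket
import subprocess
import time

COMMANDS = [
    ("partition layout", ["fdisk", "-l"]),
    ("software RAID",    ["cat", "/proc/mdstat"]),
    ("filesystem table", ["cat", "/etc/fstab"]),
]

def gather(output_path):
    with open(output_path, "w") as out:
        out.write("# DR configuration for %s at %s\n"
                  % (socket.gethostname(), time.ctime()))
        for label, cmd in COMMANDS:
            out.write("\n== %s (%s) ==\n" % (label, " ".join(cmd)))
            try:
                out.write(subprocess.check_output(cmd).decode("utf-8", "replace"))
            except Exception as exc:   # keep gathering even if one command fails
                out.write("collection failed: %s\n" % exc)

if __name__ == "__main__":
    # Write somewhere the filesystem backup will pick up; keep an offsite copy too.
    gather("/etc/dr-config.txt")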

When it comes to disaster recovery, such information can take the guess work or reliance on memory out of the equation, allowing a system or backup administrator in any (potentially sleep-deprived) state, with any level of knowledge about the system in question, to conduct the recovery with a much higher degree of certainty.


* Due to what they offer to do, bare metal recovery (BMR) products tend to be highly specific about which operating system variants, etc., they support. In my experience a significantly higher number of sites don’t use BMR than do.

Posted in Architecture, Backup theory, Linux, NetWorker, Policies, Scripting | 2 Comments »

Vendors! Listen up! Stop talking about archive when you mean HSM

Posted by Preston on 2009-09-22

When it comes to backup and data protection, I like to think of myself as being somewhat of a stickler for accuracy. After all, without accuracy, you don’t have specificity, and without specificity, you can’t reliably say that you have what you think you have.

So on the basis of wanting vendors to be more accurate, I really do wish vendors would stop talking about archive when they actually mean hierarchical storage management (HSM). It confuses journalists, technologists, managers and storage administrators, and (I must admit to some level of cynicism here) appears to be mainly driven by a belief that “HSM” sounds either too scary or too complex.

HSM is neither scary nor complex – it’s just a variant of tiered storage, which is something that any site with 3+ TB of presented primary production data should be at least aware of, if not actively implementing and using. (Indeed, one might argue that HSM is the original form of tiered storage.)

By “presented primary production”, I’m referring to available-to-the-OS high speed, high cost storage presented in high performance LUN configurations. At this point, storage costs are high enough that tiered storage solutions start to make sense. (Bear in mind that 3+ TB of presented storage in such configurations may represent between 6 and 10TB of raw high speed, high cost storage. Thus, while it may not sound all that expensive initially, the disk-to-data ratio increases the cost substantially.) It should be noted that whether that tiering is done with a combination of different speeds of disks and levels of RAID, or with disk vs tape, or some combination of the two, is largely irrelevant to the notion of HSM.

Not only is HSM easy to understand (and nothing to be scared of), the difference between HSM and archive is equally easy to understand. It can even be explained with diagrams.

Here’s what archive looks like:

The archive process and subsequent data access

So, when we archive files, we first copy them out to archive media, then delete them from the source. Thus, if we need to access the archived data, we must read it back directly from the archive media. There is no reference left to the archived data on the filesystem, and data access must be managed independently from previous access methods.

On the other hand, here’s what the HSM process looks like:

The HSM process and subsequent data access

So when we use HSM on files, we first copy them out to HSM media, then delete (or truncate) the original file but put in its place a stub file. This stub file has the same file name as the original file, and should a user attempt to access the stub, the HSM system silently and invisibly retrieves the original file from the HSM media, providing it back to the end user. If the user saves the file back to the same source, the stub is replaced with the original+updated data; if the user doesn’t save the file, the stub is left in place.

Or if you’re looking for an even simpler distinction: archive deletes, HSM leaves a stub. If a vendor talks to you about archive, but their product leaves a stub, you can know for sure that they actually mean HSM.
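
If it helps, the distinction can even be captured in a few lines of toy code. This is purely conceptual – a real HSM product hooks into the filesystem or operating system so the recall is invisible to the user – but it shows exactly where the stub comes in:

# Toy illustration of the difference only - real HSM products intercept file
# access at the filesystem/OS layer so recall is transparent to users.
import os
import shutil

def archive(path, archive_dir):
    """Archive: copy out, then delete. No reference is left on the filesystem."""
    shutil.copy2(path, archive_dir)
    os.remove(path)

def hsm_migrate(path, hsm_dir):
    """HSM: copy out, then replace the file with a stub of the same name."""
    shutil.copy2(path, hsm_dir)
    with open(path, "w") as stub:
        stub.write("HSM-STUB: recall from %s\n" % hsm_dir)

def hsm_open(path, hsm_dir):
    """On access, a stub silently triggers a recall of the original file."""
    with open(path) as f:
        if f.read(9) == "HSM-STUB:":
            shutil.copy2(os.path.join(hsm_dir, os.path.basename(path)), path)
    return open(path)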

Honestly, these two concepts aren’t difficult, and they aren’t the same. In the never-ending quest to save bytes, you’d think vendors would appreciate that it’s cheaper to refer to HSM as HSM rather than Archive. Honestly, that’s a 4 byte space saving alone, every time the correct term is used!

[Edit – 2009-09-23]

OK, so it’s been pointed out by Scott Waterhouse that the official SNIA definition for archive doesn’t mention having to delete the source files, so I’ll accept that I was being stubbornly NetWorker-centric on this blog article. So I’ll accept that I’m wrong and (grudgingly yes) be prepared to refer to HSM as archive. But I won’t like it. Is that a fair compromise? :-)

I won’t give up on ILP though!

Posted in Architecture, Backup theory, General Technology, General thoughts, Quibbles | 6 Comments »

Think backup belongs in ILM? Think again

Posted by Preston on 2009-09-12

In my opinion (and after all, this is my blog), there’s a fundamental misconception in the storage industry that backup is a part of Information Lifecycle Management (ILM).

My take is that backup has nothing to do with ILM. Backup instead belongs to a sister (or shadow) activity, Information Lifecycle Protection – ILP. The comparison between the two is somewhat analogous to the comparison I made in “Backup is a Production Activity” between operational production systems and infrastructure support production systems; that is, one is directly related to the operational aspects of the data, and the other exists to support the data.

Here’s an example of what Information Lifecycle Protection would look like:

Information Lifecycle Protection

Obviously there’s some simplification going on in the above diagram – for instance, I’ve encapsulated any online storage based fault-protection into “RAID”, but it does serve to get the basic message across.

If we look at say, Wikipedia’s entry on Information Lifecycle Management, backup is mentioned as being part of the operational aspects of ILM – this is actually a fairly standard definition of the perceived position of backup within ILM; however, standard definition or not, I have to disagree.

At its heart, ILM is about ensuring correct access and lifecycle retention policies for data: neither of these core principles encapsulates the activities in information lifecycle protection. ILP, on the other hand, is about making sure the data remains available to meet the ILM policies. If you think this is a fine distinction to make, you’re not necessarily wrong. My point is not that there’s a huge difference, but that there’s an important difference.

To me, it all boils down to a fundamental need to separate access from protection/availability, and the reason I like to maintain this separation is how it affects end users, and the level of awareness they need to have for it. In their day-to-day activities, users should have an awareness of ILM – they should know what they can and can’t access, they should know what they can and can’t delete, and they should know where they will need to access data from. They shouldn’t however need to concern themselves with RAID, they shouldn’t need to concern themselves with snapshots, they shouldn’t need to concern themselves with replication, and they shouldn’t need to concern themselves with backup.

NOTE: I do, in my book, make it quite clear that end users have a role in backup in that they must know that backup doesn’t represent a blank cheque for them to delete data willy-nilly, and that they should know how to request a recovery; however, in their day to day job activities, backups should not play a part in what they do.

Ultimately, that’s my distinction: ILM is about activities that end-users do, and ILP is about activities that are done for end-users.

Posted in Architecture, Backup theory, General Technology, General thoughts | 2 Comments »

Wishlist – Server/Storage Node for Mac OS X

Posted by Preston on 2009-08-13

For some time I’ve wished NetWorker would support both storage node and server functions on Mac OS X. When I had a PPC 17″ PowerBook, this mostly came from the glacially slow performance of running Linux within VirtualPC so as to run up a NetWorker server for testing. (Windows-within-Virtual PC was a dead-loss: the then-current version of NetWorker would not even start within VirtualPC.)

Since Apple made the jump to Intel machines, running a NetWorker server for lab work within a virtual machine has been far more efficient, given that now it’s just virtualisation rather than emulation. However, I’ve been thinking for a while that given the performance options available on Mac OS X, and the amount of data frequently stored on Mac OS X machines, not supporting at least a storage node is foolish.

Now that I have a Mac Pro, my personal belief is that it’s crazy not to support Mac OS X both as server and storage node.

Why, you may ask, would I think this? Is it just some weird combination of “Mac Fan Boy” and “NetWorker Fan Boy” in me wanting them joined at the hip like some bizarre Doctor Frankenstein experiment?

[Here’s an aside. Why is it that people who defend Apple, and Macs, are immediately declared to be Apple Fan Boys, when PC/Windows users who just as vehemently defend their own platforms declare themselves ‘realistic’? There’s only one answer: sad hypocrisy. Defending one platform is “hysterical frothing-at-the-mouth buy-in to the reality distortion effect of Steve Jobs”, whereas equally defending another platform is “logical”. Please, spare me.]

So, jumping off that little soap box, I do actually have a method to my madness here. I honestly think that, bang for buck, the Mac Pro (using Apple’s high end machines as a reference point) represents the sort of significant processing and expansion capability often sought in backup servers. I snapped up a bargain previous-generation Mac Pro that features Intel Xeon 5400-series CPUs rather than the current top-of-the-line Nehalem based processors, but this machine has serious processing power. Apple calls it a workstation because of its ability to handle complex graphics – but in reality it’s basically a server in a nice shell. With 8 x 3.2GHz cores and (currently) 12GB of RAM, this is a machine that just absolutely flies at data throughput. With expansion of up to 32GB of RAM, Mac Pros represent in one shiny shell more than enough processing power to run a backup server/storage node for any sized business*.

For companies that are space-conscious, there’s the “server” version of the Mac Pro, the Xserve, which is quite a powerful host in a 1RU enclosure.

Given the client software has already been ported to Mac OS X, the hard work has effectively already been done; server and storage node options are not going to take a significant amount of development effort.

Is there justification in porting server and storage node to Mac OS X? The cynical part of me wants to answer that there’s a hell of a lot better justification in porting server/storage node to Mac OS X than there was in porting the client to Linux PPC, but undoubtedly that would have been done to service some large-scale deal for EMC – i.e., there would have been significant business-incentive to do so.

Is there a business incentive for supporting more than client capabilities on Mac OS X? Well, market share is continuing to grow, as evidenced by Microsoft breaking what is almost universally acknowledged as the golden rule of advertising**. Then there’s the high frequency of use of Mac OS X systems in academia. This may not seem a compelling business case for EMC now, but let’s think a little longer-term – as more and more people become exposed to Macs again during their education (either secondary or tertiary), that exposure is going to influence them in their buying decisions as they move into employment. I.e., short of some catastrophic collapse***, Apple is going to see market share continue to increase – not flatten, not drop, but continue to increase.

In the short term though, another compelling reason is found where Apple’s market share is at its highest – multimedia: graphic design, advertising, etc., all of which involve large amounts of data storage. While there’s some support for client software within Mac OS X, the backup server market in that arena is owned almost exclusively by Retrospect. (Retrospect is a good product, but it is still reasonably limited – definitely a workgroup, rather than an enterprise, product.) In short, it seems mad that machines that routinely hold tens of terabytes or more of data are denied storage node/dedicated storage node capabilities.

Now, some might argue that the lack of support for Sybase DBAnywhere (which powers GST, the back-end to NMC) would be sufficient cause to stop at the client (or at most, the storage node); after all, if you can’t run the GST/NMC server on the backup server, what’s the point? I have two (I believe valid) responses to this: first, it’s reasonably common to see separation between NMC/GST services and the NetWorker server, not only in environments that have multiple backup servers, but also just for reducing the potential for one service impacting the other. Secondly, there are already examples of NetWorker server platforms that don’t have an accompanying NMC/GST server option – Solaris/AMD springs to mind immediately, and I know there are other examples as well.

I do honestly think the point is rapidly approaching (if it is not already here) where there are more compelling reasons to port NetWorker server and storage node to Mac OS X than there are for not doing so. Architecturally, data storage volumes, increasing market share and an existing client port all point to there being solid reasons for doing so.


* Note that I’m not saying that they are capable of being the sole backup server for a massive company; just like any other platform, in larger environments, the three-tier approach is always required.

** That rule is that the number one company in an industry should never refer to the number two company in the industry in their advertising.

*** With a market cap that now periodically bounces above Google’s, this seems somewhat unlikely now. (Apple’s market cap now exceeds the combined market cap of HP and Dell.)

Posted in Architecture, NetWorker | Comments Off on Wishlist – Server/Storage Node for Mac OS X