NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist

  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).

  • Enterprise Systems Backup and Recovery

    If you find this blog interesting, and either have an interest in or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on features, policies, procedures and the human element of ensuring that your company has a suitable and working backup system, rather than just a bunch of copies made by unrelated software, hardware and processes.

Posts Tagged ‘usability’

Uptime is an inappropriate metric

Posted by Preston on 2009-10-09

I’m going to start with a statement that may make some people angry. Some might suggest that I’m just goading the blogosphere for traffic, but that’s not my intent.

System administrators and managers who focus on keeping uptime as high as possible for no other reason than having good numbers are usually showing an arrogant disrespect to the users of their systems.

There – I said it. (I believe I am now required to walk the Midrange plank and dive into the Mainframe sea of Mediocrity.) As a “long-term” Unix admin and user, I found it galling when I first realised that the midrange environment has always had the wrong attitude towards uptime. Uptime for the sake of uptime, that is. These days, I use a term you might more expect to hear in the mainframe world: negotiated uptime.

You see, there’s uptime, and there’s system usefulness. That’s the significant difference between raw uptime and negotiated uptime. Confusing the two only achieves one thing: unhappy users.

Here are just a few examples of rigid adherence to ‘uptime’ gone wrong:

  • Systems performing badly for days at a time while system administrators hunt for the cause, when they know a suitable workaround would be to reboot the system overnight – or even during the time most users take a lunch break.
  • Systems that don’t get patched for months at a time, because the patching would require a reboot and that would affect uptime. (“If it’s not broken, don’t fix it” may excuse skipping regular patches, but very, very rarely security patches.)
  • Applications that don’t get upgraded, despite obvious (or even required!) fixes in new releases, because the application administrators don’t like restarting the application.

I’ll go so far as to say that uptime, measured at the individual server level, is irrelevant and inappropriate. Uptime should never be about the servers, or even the applications – it’s about the services delivered to the business.

Clusters typically represent a far healthier approach to uptime: a recognition that one or more nodes can fail so long as the service continues to be delivered. There are clusters (particularly OpenVMS clusters) known to have been presenting services for a decade or more, all the while receiving OS upgrades and hardware replacements, and undoubtedly weathering single-node failures and changes.

The healthiest approach to uptime, however, is to recognise that individual system or application uptime is irrelevant. What should be measured is the net availability of services as experienced by users. All the uptime stats, SNMP monitoring stats, etc., in the world are irrelevant compared with how useful an actual IT service is to a business function.

The challenge, of course, is that availability is significantly harder to measure than uptime. Uptime is, after all, dead simple – on any Unix platform there’s a single command, ‘uptime’, to get you that measurement. Presumably on Windows there are easy ways to get that information too. (Without having tried to measure it there myself, I know you can usually at least get the last boot time out of the event logs.) Half of why it’s simple is the ease with which the statistic can be gathered.

The other half is that uptime is a boolean statistic. A host is either up or down (and while it’s transitioning between the two, it’s usually considered ‘down’ until it’s fully ‘up’). Availability, on the other hand, is not a purely boolean measurement.

Services, however, can be up but not available. That is, they can be technically running, yet not practically usable.

Here’s an old example I like to dredge out regarding availability. An engineering company invested staggering amounts of money in putting together a highly customised implementation of SAP. Included in this implementation was a fairly in-depth timesheet module that did all sorts of calculations on the fly for timesheet entry.

Over the time I spent administering this system, complaints grew practically week by week that come Friday (when timesheets had to be entered by, causing a load rush at the end of each week), the SAP server was getting slower and slower. Memory was upgraded, the hard drive layout was tweaked, and so on, but in the end the system just kept getting slower and slower.

Eventually it was determined that the problem wasn’t in the OS or the hardware, but in the SQL coding in the timesheet system. You see, every time a user went to add a new timesheet entry, a logical error in the SQL code would first retrieve every timesheet entry that employee had ever made, all the way back to when the system was commissioned or the employee started. As you can imagine, as the months and years went by, this amounted to a lot of very heavy selects every week.

With that corrected, users reacted with awe – they thought the system had been massively upgraded, but instead it had just been a (relatively minor) SQL tweak.
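
To make the shape of that bug concrete, here’s a hypothetical, self-contained illustration of the query pattern involved. The table and column names are invented for the example, and sqlite3 simply stands in for whatever database the SAP implementation actually used:

    # Toy schema standing in for the timesheet tables described above.
    sqlite3 /tmp/timesheets.db \
      "CREATE TABLE IF NOT EXISTS entries (employee_id TEXT, week_ending TEXT, hours REAL);"

    # The flawed pattern: every new entry first dragged back the employee's
    # entire history, a result set that grows without bound over the years.
    sqlite3 /tmp/timesheets.db \
      "SELECT * FROM entries WHERE employee_id = 'E1001';"

    # The corrected pattern: scope the select to the data the calculation
    # actually needs (the current week), keeping the query cost roughly constant.
    sqlite3 /tmp/timesheets.db \
      "SELECT * FROM entries WHERE employee_id = 'E1001' AND week_ending = '2009-10-09';"

The difference is trivial on paper, which is much the point: the fix was small, but the availability impact was enormous.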

What does this have to do with availability, you may be wondering? Well, everything.

You see, the SAP server was up for lengthy periods of time. The application was also up for lengthy periods of time. Yet the service – timesheets, and more generally the entire SAP database – was increasingly unavailable. For many users, entering a week of timesheets took two or more elapsed hours: initiating a new entry, waiting an infuriating number of minutes for the system to respond, and often switching away to something else and completing the entry later. By no stretch of the imagination could that service be said to be available.

So how do you measure availability? Well, the act of measuring is perhaps the more challenging part, and it will need to be handled on a service-type by service-type basis. (E.g., measuring web services will differ from measuring desktop services, which will differ from measuring local server services, and so on.)

The key step is defining useful, quantifiable metrics. A metric such as “users should get snappy response” is vague, useless and (regrettably) all too easy to write down. The real metrics are timing- or accuracy-based. (Accuracy metrics are mainly useful for systems with analogue-style inputs.) Sticking to timing-based metrics for simplicity, measuring availability comes down to having specific timings associated with events. The following are closer to being valid metrics:

  • All searches should start presenting data to the user within 3 seconds, and finish within 8 seconds.
  • Confirmation of successful data input should take place within 0.3 seconds.
  • A 20-page text-only document should complete printing within 11 seconds.
  • Scrubbing through raw digital media should occur with no more than a 0.2 second lag between mouse position and displayed frame.

(Using weight scales as an example, an analogue metric might be that the scales will be accurate to within 10 grams.)
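
Once metrics like these are agreed upon, they can be probed and logged automatically. Here’s a minimal sketch of checking the first metric in the list above; it assumes a hypothetical intranet search URL, and that curl and bc are available on the monitoring host:

    #!/bin/sh
    # Probe the (hypothetical) search service against the "start presenting
    # data within 3 seconds" metric; the URL and log path are placeholders.
    URL="http://intranet.example.com/search?q=test"
    LOG=/var/log/search-availability.log
    THRESHOLD=3

    # curl's %{time_starttransfer} reports seconds until the first byte of the
    # response arrives, i.e. when data starts being presented to the user.
    TTFB=$(curl -s -o /dev/null -w '%{time_starttransfer}' "$URL")

    if [ "$(echo "$TTFB > $THRESHOLD" | bc)" -eq 1 ]; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') BREACH search ttfb=${TTFB}s" >> "$LOG"
    else
        echo "$(date '+%Y-%m-%d %H:%M:%S') OK search ttfb=${TTFB}s" >> "$LOG"
    fi

Run from cron every few minutes, something like this starts building a picture of availability as users actually experience it, rather than as uptime reports it.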

While such metrics are more challenging to quantify than boolean statistics, they allow the usability and availability of a system to be properly measured. Without them, uptime figures are little more than fool’s gold.


Posted in General Technology, General thoughts | 3 Comments »

NetWorker Single Server Edition

Posted by Preston on 2009-07-27

Truth be told, I don’t have any real involvement in EBS (Enterprise Backup Software) these days. If you’re unaware of it, EBS is Sun’s rebadged version of EMC NetWorker. When I did have involvement with it, it was back in the days when it was called Solstice Backup.

One of the things I liked about Solstice Backup was that a “Single Server” edition basically came bundled with every new copy of Solaris. That edition supported one tape drive, no tape library, and could only be used to back up the backup server itself. Yes, single server edition effectively meant a decentralised backup environment, but its purpose wasn’t to send everyone down the blitheringly idiotic path of decentralised backups. Instead, it had two purposes, viz.:

(a) to provide a basic but very reliable way of backing up servers, and,

(b) to give companies an introduction to enterprise backup software.

You see, you could jump from single server edition to workgroup edition, or network edition, just by replacing the licenses. Your configuration would remain in place, meaning all you had to do was start extending that configuration to cater for the expanded functionality of the product. Your existing backups remained recoverable. Your existing backups would continue to run. You could just do more.
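
As a rough sketch of how painless that upgrade path was: swapping editions essentially came down to loading a new base enabler on the server. The enabler code below is a placeholder, and the exact options should be confirmed against the nsrcap man page for your release:

    # On the backup server, load the new (e.g. workgroup or network edition)
    # base enabler; the code shown is a placeholder only.
    nsrcap -v -c xxxxxx-xxxxxx-xxxxxx
    # Existing client, device and group resources - and existing backups -
    # remain untouched; only the licensed capabilities of the datazone change.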

I can’t say for sure whether EBS still offers a single server edition – but it’s not really relevant to what I’m about to say, so it doesn’t matter one way or the other.

To this day I think it’s a shame that (Legato first, and now) EMC hasn’t come up with an OEM model for NetWorker allowing a single server edition to be included in one or more of the major enterprise Linux distributions – e.g., RedHat or Novell/SuSE. Obviously such a model would require an appropriate support system: when effectively giving the product away for free (i.e., as part of a base system), the OEM/OS partner would need adequate training to do first-level support as part of its regular support work. But as we’ve seen with Sun and Solstice Backup Single Server Edition, that can be done. It’s a great way of getting a foot in the door, and in my personal experience at least, many companies that actually took the time to configure single server edition ended up upgrading to at least the Workgroup, if not the Network, edition of NetWorker. Note what I said there: companies that actually took the time. I.e., there’s no guarantee that every company will want to go ahead with configuring it – particularly given the size of the current NetWorker documentation*. In other words, there were, and still are, some impediments to easy, untrained roll-outs of NetWorker.

Those impediments to making NetWorker more approachable for rapid roll-out, with easy instructions, in a ‘single server’ environment are, however, readily quantifiable and easily resolvable. To prove that I’m not talking out of my butt, I’ll do my best to quantify the “top 5” items that would be necessary:

  1. Documentation – Rumblings on the NetWorker mailing list aside, NetWorker documentation has significantly improved over the last year. There’s been a big push to get useful documentation – hence the technical upgrade guides, the “continuous improvement” that’s going into PowerLink articles**, etc. However, quick start guides are still needed.
  2. Server merge functionality – If you’re going to do a single server edition, it’s necessary to support merging multiple NetWorker server media databases, configuration files and indices into a single datazone. That’s to allow for companies that might initially start down the path of a few standalone servers before realising they need to consolidate and have grown-up backups.
  3. Backup to disk + tape – In this day and age, single server edition should support, say, 1TB of disk backup in a single device plus a tape drive. That allows for basic cloning/staging and support for high speed devices, but doesn’t give away so much functionality that it discourages purchase of a full license. (Indeed, I’m inclined to suggest that it’s high time EMC included support for 1TB of disk backup space in all the base NetWorker licenses.)
  4. Manual Backup in NMC – This would take effort, but it’s something that would feed into all versions of NetWorker, so it would be worth it, and would give NetWorker better selling points. I’m not talking about running a group manually – I’m talking about browsing a client (the wizard in 7.5 supports this, after all) and manually selecting files for backup, as is currently available in the Windows user program and used to be available in nwbackup. It should be available in NMC.
  5. Recovery in NMC – As above, and even more important than the above: we should see the complete ditching of the (filesystem) client GUIs – nwrecover and winworkr – with NMC supporting recovery as a standard option within its own GUI instead.

Will the above points take time? Yes. Are they worth it? Yes. Will they carry through to other versions (Workgroup/Network/Power)? Well, point 3 is irrelevant to those versions, but all the other points are very relevant to all tiers of NetWorker, so implementing them will certainly help continued adoption of NetWorker – not only that, they’re all highly logical.


* Having a Getting Started With NetWorker guide would probably help in that sort of scenario too. (Yes, I’m getting closer to formalising what I’m going to do on that front.)

** Yes, there are some outdated PowerLink articles regarding NetWorker, but that’s true for any product that’s been around for as long as NetWorker. The point is, there are active and ongoing efforts to improve the documentation in PowerLink. Credit where credit is due.

Posted in Architecture, Aside, General thoughts, NetWorker | Comments Off on NetWorker Single Server Edition

Basics – Adding new clients in 7.4.4 and higher

Posted by Preston on 2009-05-13

One of the policy changes made in NetWorker 7.4.4 (and which applies to 7.5.x as well) concerns the default client parallelism given to newly created clients.

I have to say, and I’ll be blunt here: I find the policy change quite inappropriate.

In a post-7.4.4 world, NetWorker defaults to giving new clients a parallelism of 12. I’d always thought the previous default of 4 was a terrible setting, being too high for a modern environment; you can imagine, then, what I thought when I found the new default was 12.

There’s a good reason why I find this inappropriate. In fact, it’s implicitly covered in my book by the sheer number of pages I devote to discussing how to plan client parallelism settings. In short, client parallelism is typically not something you should set blindly. Unless you already have very clear ideas about a client’s filesystem/LUN layout, processing capabilities, bandwidth and so on, in my opinion you must start with a parallelism of 1 and work your way up on the basis of clear and considered performance testing.

Given the amount of effort that’s been put into the latest NetWorker releases for VMware integration – i.e., the Virtual Client Connection license, etc. – it seems a less than logical choice to increase the default parallelism setting rather than decrease it, when you know that over time the number of virtualised hosts being backed up is only going to increase.

This is obviously just a small inconvenience, but if you’ve not picked up on this yet, you should be aware of it when you start working with these newer versions of NetWorker.
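
In the meantime, winding a freshly created client back to a parallelism of 1 only takes a moment. Here’s a minimal sketch using nsradmin; the client name is a placeholder, and the exact syntax should be checked against the nsradmin man page for your release:

    # Build an nsradmin input script that selects the new client resource and
    # resets its parallelism, then run it non-interactively with -i.
    printf '%s\n' \
      '. type: NSR client; name: newclient.example.com' \
      'update parallelism: 1' > /tmp/fix-parallelism.nsradmin

    nsradmin -i /tmp/fix-parallelism.nsradmin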

What the real solution is

For what it’s worth, I actually don’t think the solution is to change the default client parallelism setting to 1, but to start maintaining a “defaults” component within the NetWorker server resource where local administrators can configure default settings for a plethora of new resources to be created (most typically clients, groups and pools).

For example, you might have options where you can specify the following defaults for any new client:

  • Parallelism
  • Priority*
  • Group
  • Schedule
  • Browse Policy
  • Retention Policy
  • Remote Access
  • etc.

These all have their own defaults, but it’s time to move past the point where NetWorker dictates standard defaults, and make all of these settings modifiable by the administrator. I realise that when the server bootstraps itself, it still needs to fall back on standard defaults, and that’s fine. However, once the server is up and running, being able to modify these defaults would be a Time Saving Feature.

This would reduce the amount of work administrators have to do when creating new resources – let’s face it, most of us spend most of our time in new resource creation changing the “default” settings. It would also reduce the number of human errors introduced when adding to the configuration in a hurry. This sort of “defaults” component would preferably run as a wizard in NMC on first install, with administrators asked whether they want to re-run it upon updates.
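
Until something like that exists, sites can approximate it with a small wrapper around nsradmin that bakes locally agreed defaults into client creation. The sketch below is hypothetical: the attribute values are examples only, the named group, schedule and policies must already exist in your datazone, and the syntax should be verified against your release:

    #!/bin/sh
    # Usage: newclient.sh client-name
    # Creates a NetWorker client resource using site defaults rather than the
    # shipped ones. Every value below is an example; adjust to local policy.
    CLIENT="$1"

    printf '%s\n' \
      "create type: NSR client; name: ${CLIENT}; parallelism: 1; group: Daily; schedule: Default; browse policy: Month; retention policy: Year" \
      > /tmp/newclient.nsradmin

    nsradmin -i /tmp/newclient.nsradmin

It’s no substitute for the real feature, but it captures the idea: the defaults become something administrators define once, rather than something they correct on every new client.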


* Adding priority to this might suggest a need to have the priority field work better than it has of late…

Posted in Aside, Basics, General thoughts, NetWorker, Policies | 2 Comments »