NetWorker Blog

Commentary from a long term NetWorker consultant and Backup Theorist

  • This blog has moved!

    This blog has now moved to nsrd.info/blog. Please jump across to the new site for the latest articles (and all old archived articles).
  • Enterprise Systems Backup and Recovery

    If you find this blog interesting, and either have an interest in or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on the features, policies, procedures and human elements needed to ensure that your company has a suitable, working backup system rather than just a bunch of copies made by unrelated software, hardware and processes.

Posts Tagged ‘availability’

Uptime is an inappropriate metric

Posted by Preston on 2009-10-09

I’m going to start with a statement that may make some people angry. Some might suggest that I’m just goading the blogosphere for traffic, but that’s not my point.

System administrators and managers who focus on keeping uptime as high as possible, for no reason other than having good numbers, are usually showing arrogant disrespect for the users of their systems.

There – I said it. (I believe I am now required to walk the Midrange plank and dive into the Mainframe sea of Mediocrity.) As a “long-term” Unix admin and user, I found it galling when I first realised that the midrange environment has always had the wrong attitude towards uptime – uptime for the sake of uptime, that is. These days, I use a term you’d sooner expect to hear in the mainframe world: negotiated uptime.

You see, there’s uptime, and there’s system usefulness. That’s the significant difference between raw uptime and negotiated uptime. Confusing the two only achieves one thing: unhappy users.

Here are just a few examples of rigid adherence to ‘uptime’ gone wrong:

  • Systems performing badly for days at a time while system administrators hunt for the cause, even though they know a suitable workaround would be to reboot the system overnight – or even during the time that most users take a lunch break.
  • Systems that don’t get patched for months at a time because patching would require a reboot, and that would affect uptime. (“If it’s not broken, don’t fix it” can excuse deferring regular patches, but very, very rarely security patches.)
  • Applications that don’t get upgraded, despite obvious (or even required!) fixes in new releases, because the application administrators don’t like restarting the application.

I’ll go so far as to say that uptime, measured at the individual server level, is irrelevant and inappropriate. Uptime should never be about the servers, or even the applications – it’s about the services for the business.

Clusters typically represent a far healthier approach to uptime: a recognition that one or more nodes can fail so long as the service continues to be delivered. There are clusters (particularly OpenVMS clusters) that are known to have been presenting services for a decade or more, all the while continuing to get OS upgrades and hardware replacements and undoubtedly having single node failures/changes.

The healthiest approach to uptime, however, is to recognise that individual system or application uptime is irrelevant. What should be measured is the availability of services as experienced by users. All the uptime stats, SNMP monitoring stats, etc., in the world are irrelevant compared with how useful an actual IT service is to a business function.

The challenge, of course, is that availability is significantly harder to measure than uptime. Uptime is, after all, dead simple – on any Unix platform a single command, ‘uptime’, gives you that measurement. Windows presumably offers easy ways to get the same information too; without much experience measuring it there myself, I know you can (usually) at least get the last boot time out of the event logs. Half of why uptime is a simple measurement is the ease with which the statistic can be gathered.
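
By way of a purely illustrative sketch – assuming nothing more exotic than /proc/uptime on Linux and the Windows tick counter, rather than the event logs – gathering the statistic programmatically takes only a few lines of Python:

    import platform

    def get_uptime_seconds():
        """Return seconds since boot - trivially easy to gather."""
        system = platform.system()
        if system == "Linux":
            # /proc/uptime holds seconds since boot as its first field
            with open("/proc/uptime") as f:
                return float(f.read().split()[0])
        if system == "Windows":
            # GetTickCount64 returns milliseconds since boot
            import ctypes
            tick = ctypes.windll.kernel32.GetTickCount64
            tick.restype = ctypes.c_ulonglong
            return tick() / 1000.0
        raise NotImplementedError("add a lookup for " + system)

    if __name__ == "__main__":
        print("Up for %.0f seconds" % get_uptime_seconds())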

The other half of why uptime is an easy measurement is that it’s a boolean statistic: a host is either up or down (and while transitioning between the two it’s usually considered ‘down’ until it’s fully ‘up’). What makes availability harder to measure is that it isn’t just a set of boolean measurements.

Services, however, can be up but not available – that is, technically available, yet not practically so.

Here’s an old example I like to dredge out regarding availability. An engineering company invested staggering amounts of money in putting together a highly customised implementation of SAP. Included in this implementation was a fairly in-depth timesheet module that did all sorts of calculations on the fly for timesheet entry.

Over the time this system was administered, complaints grew practically on a week-by-week basis that come Friday (the deadline for entering timesheets, and therefore a load rush at the end of each week), the SAP server was getting slower and slower. Memory was upgraded, the hard drive layout was tweaked, and so on, but the system just kept getting slower.

Eventually it was determined that the problem wasn’t in the OS or the hardware, but in the SQL coding of the timesheet module. You see, every time a user went to add a new timesheet entry, a logical error in the SQL code would first retrieve every timesheet entry that employee had ever made – back to when the system was commissioned or the employee started, whichever came later. As you can imagine, as the months and years went by, this amounted to a lot of heavy selects every week.

With that corrected, users reacted with awe – they thought the system had been massively upgraded, but instead it had just been a (relatively minor) SQL tweak.
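
The real SQL is long gone (and was never mine to reproduce), but the shape of the problem looked something like the following sketch, with entirely hypothetical table and column names:

    # Illustrative sketch only - hypothetical schema, not the real SAP code.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE timesheet_entries (
                        employee_id INTEGER,
                        entry_date  TEXT,
                        hours       REAL)""")
    conn.executemany("INSERT INTO timesheet_entries VALUES (?, ?, ?)",
                     [(42, "2008-01-14", 8.0), (42, "2009-10-05", 7.5)])

    # What the module was effectively doing on every new entry:
    # pulling the employee's entire timesheet history back to day one.
    slow = """SELECT entry_date, hours
              FROM timesheet_entries
              WHERE employee_id = ?"""

    # What it needed: only the rows for the period being entered.
    fast = """SELECT entry_date, hours
              FROM timesheet_entries
              WHERE employee_id = ?
                AND entry_date BETWEEN ? AND ?"""

    print(conn.execute(slow, (42,)).fetchall())
    print(conn.execute(fast, (42, "2009-10-05", "2009-10-09")).fetchall())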

What does this have to do with availability, you may be wondering? Well, everything.

You see, the SAP server was up for lengthy periods of time. The application was also up for lengthy periods of time. Yet the service – timesheets, and more generally the entire SAP database – was increasingly unavailable. For many users, entering a week’s timesheet took two or more elapsed hours: initiating a new entry, waiting an infuriatingly long number of minutes for the system to respond, and often completing the entry later, after having switched away to something else while waiting. By no stretch of the imagination could that service be said to be available.

So how do you measure availability? The act of measuring is perhaps the more challenging part, and it has to be handled on a service-type by service-type basis: measuring web services will be different from measuring desktop services, which will be different from measuring local server services, and so on.

The key step is defining useful, quantifiable metrics. A metric such as “users should get a snappy response” is vague, useless and (regrettably) all too easy to define. The real metrics are timing- and accuracy-based. (Accuracy metrics are mainly useful for systems with analogue-style inputs.) Sticking to timing-based metrics for simplicity, measuring availability comes down to having specific timings associated with events. The following are closer to being valid metrics (a sketch of measuring the first of them follows the list):

  • All searches should start presenting data to the user within 3 seconds, and finish within 8 seconds.
  • Confirmation of successful data input should take place within 0.3 seconds.
  • A 20 page text-only document should complete printing within 11 seconds.
  • Scrubbing through raw digital media should occur with no more than a 0.2 second lag between mouse position and displayed frame.

(Using weight scales as an example, an analogue metric might be that the scales will be accurate to within 10 grams.)
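
To make the first of those metrics concrete, a simple availability probe might look something like this – the URL, the thresholds in code, and the very idea of polling an HTTP search endpoint are assumptions purely for illustration:

    # Hypothetical probe for the "search" metric above; URL and endpoint
    # are illustrative assumptions, not a real system.
    import time
    import urllib.request

    SEARCH_URL = "http://intranet.example.com/search?q=timesheet"
    FIRST_BYTE_LIMIT = 3.0   # seconds - "starts presenting data"
    COMPLETE_LIMIT = 8.0     # seconds - "finishes"

    def search_meets_metric():
        start = time.monotonic()
        try:
            with urllib.request.urlopen(SEARCH_URL, timeout=COMPLETE_LIMIT) as resp:
                resp.read(1)                        # first byte of results
                first_byte = time.monotonic() - start
                resp.read()                         # the rest of the results
                complete = time.monotonic() - start
        except OSError:                             # timeouts, connection errors
            return False
        return first_byte <= FIRST_BYTE_LIMIT and complete <= COMPLETE_LIMIT

    if __name__ == "__main__":
        print("Search metric met:", search_meets_metric())

Availability then becomes the percentage of such probes over a reporting period that meet the metric, rather than a boolean statement that the server answered.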

While metrics are more challenging to quantify than boolean statistics, they allow the usability and availability of a system to be properly measured. Without accurate metrics, uptime is like digging for fool’s gold.


Posted in General Technology, General thoughts | Tagged: availability | 3 Comments »

Aside – A valuable lesson in SLAs

Posted by Preston on 2009-05-29

I’ve recently discovered a site with the prosaic name of “DailyWTF”. Obviously aimed at technical people, it frequently covers some of the more nonsensical happenings in IT, and I thoroughly recommend visiting it periodically.

I was amused to read this story about SLAs regarding uptime today – it reminded me of a company I was once involved with that promised a 1 hour restoration time for backups, yet sent media to an offsite location 1.5 hours away as soon as backups completed, without keeping clones on site.
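
To put rough numbers on it – everything here except the 1.5 hour transport time is an assumption for illustration – the best case already breaches the SLA before a single byte is read back:

    # Illustrative arithmetic only; all figures except the 1.5 hour
    # offsite transport time are assumptions.
    sla_hours = 1.0
    transport_hours = 1.5   # recalling media from the offsite location
    load_hours = 0.25       # assumed: mounting and inventorying the media
    restore_hours = 0.5     # assumed: actually reading the data back

    best_case = transport_hours + load_hours + restore_hours
    print("Best case restore: %.2f hours against a %.1f hour SLA" % (best_case, sla_hours))
    print("SLA achievable?", best_case <= sla_hours)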

This raises the obvious point so frequently missed – ensure that SLAs are achievable.

Posted in Aside | Tagged: availability | 1 Comment »