
When I was at University, a philosophy lecturer remarked rather sagely that University is the last place people can go to learn for the sake of learning.

That’s sort of correct, but not always so. People can fumble through their jobs on a day to day basis learning what they have to, but they can also work along the basis of trying to soak up as much information as they can along the way. I’m not always a knowledge sponge – particularly if my caffeine quota is on the light side for the day, but I like to think I learn the odd thing here and there.

In the spirit of knowledge acquisition, here are a few smaller things I’ve learned recently:

  • When simulating network connectivity problems, there’s a big difference between yanking the network cable and shutting down the network interface. (I was doing the interface shutdown, another person was doing the network cable unplug – and our results didn’t correlate.) Lesson: When escalating a case to vendor support, always spell out how you’re simulating the “comms failure” a customer is having.
  • The ‘bigasm’ utility starts to fall in a heap and becomes extremely unreliable once you exceed about 2100 GB of data generated for a single file. Lesson: When setting out to generate 2.3+ TB of backup data, create a bunch of files and have a bigasm directive to generate a smaller amount of data per file.
  • When setting up tests that will take a couple of days to run, always triple check what you’re about to do before you start it. Lesson: If you make a typo of 250 files at 100 GB each instead of 250 files at 10 GB each, bigasm/NetWorker won’t interpolate what you really meant.
  • There’s a hell of a difference between Solaris 10 AMD release 2 and release 8. Lesson: If wanting to get a Solaris 10 AMD 64-bit OS working in Parallels Desktop for Mac v5 with networking, go for release 8. It will save many forehead bruises.
  • ext3 is about as “modern” a filesystem as I am an elite sportsperson. Lesson: If wanting to achieve decent operational activities with backup to disk under Linux, use XFS instead of ext3.
  • All eSATA is not created equal. Lesson: When using a motherboard SATA -> eSATA converter, make sure the dual drive dock you order doesn’t work as a port multiplier.
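The two bigasm lessons above come down to simple arithmetic that is worth checking before a multi-day test starts. Here is a minimal sketch of that sanity check in Python. The function names, the 10% tolerance and the treatment of 2100 GB as a hard ceiling are assumptions made for illustration; none of this is part of NetWorker or bigasm.

```python
import math

# Approximate per-file size beyond which bigasm was observed to become
# unreliable (taken from the lesson above; treated here as a hard limit).
MAX_RELIABLE_FILE_GB = 2100


def plan_files(total_gb, per_file_gb):
    """Return how many files of per_file_gb are needed to reach total_gb."""
    if per_file_gb <= 0:
        raise ValueError("per_file_gb must be positive")
    if per_file_gb > MAX_RELIABLE_FILE_GB:
        raise ValueError("per-file size exceeds the reliable ceiling")
    return math.ceil(total_gb / per_file_gb)


def check_plan(file_count, per_file_gb, expected_total_gb, tolerance=0.10):
    """Return True if the planned total is within tolerance of the target."""
    actual_total = file_count * per_file_gb
    return abs(actual_total - expected_total_gb) <= tolerance * expected_total_gb


if __name__ == "__main__":
    print(plan_files(2300, 10))          # files needed for 2.3 TB at 10 GB each
    print(check_plan(250, 10, 2500))     # the intended test
    print(check_plan(250, 100, 2500))    # the typo: ten times too much data
```

Running a check like this against the numbers you are about to type in is a cheap way of doing the "triple check" before committing to a couple of days of run time.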

You might think, given that I wrote an article a while ago about the Procedural Obligations of Backup Administrators, that it wouldn’t be necessary to explicitly spell out any recovery rules – but this isn’t quite the case. It’s handy to have a “must follow” list of rules for recovery as well.

In their simplest form, these rules are:

  1. How
  2. Why
  3. Where
  4. When
  5. Who

Let’s look at each one in more detail:

  1. How – Know how to do a recovery, before you need to do it. The worst forms of data loss typically occur when a backup routine is put in place that is untried on the assumption that it will work. If a new type of backup is added to an environment, it must be tested before it is relied on. In testing, it must be documented by those doing the recovery. In being documented, it must be referenced by operational procedures*.
  2. Why – Know why you are doing a recovery. This directly affects the required resources. Are you recovering a production system, or a test system? Is it for the purposes of legal discovery, or because a database collapsed?
  3. Where – Know where you are recovering from and to. If you don’t know this, don’t do the recovery. You do not make assumptions about data locality in recovery situations. Trust me, I know from personal experience.
  4. When – Know when the recovery needs to be completed by. This isn’t always answered by the why factor – you actually need to know both in order to fully schedule and prioritise recoveries.
  5. Who – Know that whoever requested the recovery is authorised to do so. (In order to know this, there should be operational recovery procedures – forms and company policies – that indicate authorisation.)

If you know the how, why, where, when and who, you’re following the golden rules of recovery.
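One way to make the five rules hard to skip is to treat them as required fields on a recovery request. The following is only a sketch of that idea in Python: the class name, field types and example values are hypothetical and are not taken from any NetWorker tooling.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RecoveryRequest:
    """One field per rule; None or empty means 'not yet known'."""
    how: Optional[str] = None    # the documented, tested procedure to follow
    why: Optional[str] = None    # reason for the recovery (affects resources)
    where: Optional[str] = None  # source and destination of the recovery
    when: Optional[str] = None   # when the recovery must be completed by
    who: Optional[str] = None    # the authorised requester

    def missing(self):
        """Return the names of the rules that have not been answered."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def ready(self):
        """A recovery should only start once every rule is answered."""
        return not self.missing()


if __name__ == "__main__":
    request = RecoveryRequest(how="documented procedure", why="database collapsed")
    print(request.ready())
    print(request.missing())
```

The point of the design is that an unanswered "where" or "who" stops the recovery before it starts, rather than being discovered halfway through.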


* Or to put it another way – documentation is useless if you don’t know it exists, or you can’t find it!

20 years…

I remember mentioning earlier this year that there are few events I remember in stunning detail. Tiananmen Square, the Challenger Disaster and 9/11 are up high in that list.

The fall of the Berlin Wall stands equally high.

I remember sitting up late one night with my mother watching people standing atop the Berlin Wall, striking it with sledgehammers and fists and everything they had at their disposal, tearing down a fading remnant of the Cold War.

We are about to hit the 20th Anniversary of the fall of the Berlin Wall. I find it strange to think that there are adults now who grew up with the wall a memory rather than a real thing. I find it equally fantastic.

When IT people discuss Mean Time Between Failure (MTBF), the most common focus is on disk drives. We all know the basics for instance – the more drives you put in an array, the lower the cumulative MTBF, etc.
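The rule of thumb about drive counts can be put into numbers with a very simple model. The sketch below assumes identical drives that fail independently at a constant rate, in which case the mean time until the first drive in the array fails is the single-drive MTBF divided by the number of drives. It says nothing about RAID protection or data loss, and the function name and figures are illustrative only.

```python
def array_mtbf_hours(drive_mtbf_hours, drive_count):
    """Mean time to the first drive failure in an array.

    Assumes identical, independent drives with a constant failure rate,
    so the failure rates simply add up across the array.
    """
    if drive_count < 1:
        raise ValueError("drive_count must be at least 1")
    if drive_mtbf_hours <= 0:
        raise ValueError("drive_mtbf_hours must be positive")
    return drive_mtbf_hours / drive_count


if __name__ == "__main__":
    # Illustrative figure only: a drive rated at 1,000,000 hours.
    for count in (1, 10, 100):
        print(count, array_mtbf_hours(1_000_000, count))
```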

What impact does virtualisation have on MTBF though? Are there any published studies? I suspect not yet.

I’ll be clear from the outset: I like virtualisation.

Just because I like it though doesn’t stop me from questioning how many sites (particularly smaller ones) implement it, and the risk they carry of an effectively decreased MTBF by putting too many eggs in one basket.

Consider for instance a small business that decides, as part of an infrastructure refresh, to replace their current fileserver, directory server, mail server, database server and internet gateway server with a single VMware ESX server. (We’ll assume of course that they do not virtualise their backup server – something you should never do.)

So, instead of having five primary production servers, each of which has some chance of experiencing a catastrophic failure, we now have one primary production server which can still experience catastrophic failure. I’m not talking at the OS layer here (though that’s still relevant), but at the hardware layer.

Let’s be honest with ourselves – this is IT, and things can go wrong in IT just as they can anywhere else.

Now, in a small business such as the above, it can be argued that the loss of any one server is likely to cause a fair to serious inconvenience, but in each case, other functions are likely to still be carried out while the hardware is being repaired. If people can’t email, they may be able to catch up on some documentation or file related work. If people can’t access the database, they may be able to process things manually while still emailing, etc.

If all five servers go down at once, that’s a significantly more challenging proposition.
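To put rough numbers on the eggs-in-one-basket problem, here is a small Python sketch. It assumes each physical server is down with the same probability at any given moment, that separate servers fail independently, and that the single virtualisation host is no more reliable than any one of the servers it replaces. All three are simplifying assumptions, and the probability used is made up for illustration.

```python
def outage_probabilities(p_down, server_count):
    """Compare separate physical servers with one consolidated host.

    p_down is the (assumed) probability that a given physical server
    is down at some moment; failures are assumed to be independent.
    """
    if not 0.0 <= p_down <= 1.0:
        raise ValueError("p_down must be between 0 and 1")
    if server_count < 1:
        raise ValueError("server_count must be at least 1")
    return {
        # At least one service is unavailable on separate hardware.
        "separate_any_down": 1.0 - (1.0 - p_down) ** server_count,
        # Every service is unavailable at once on separate hardware.
        "separate_all_down": p_down ** server_count,
        # Every service is unavailable at once on a single host.
        "consolidated_all_down": p_down,
    }


if __name__ == "__main__":
    for name, value in outage_probabilities(0.01, 5).items():
        print(name, value)
```

Under these assumptions, consolidation makes a partial outage less likely but makes a total outage vastly more likely, which is the trade-off the second server and shared storage are there to address.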

Anyone with exposure to virtualisation, high availability/redundancy or data protection should see what is needed here – a second server, shared storage and the ability to have guest systems moved from one virtualisation server to the other. (In smaller companies it may be achieved instead by just having a standby server with storage that can be accessed by the other host if necessary.)

However, it’s clear there’s more to running a virtualised environment than just whacking a big server in and virtualising the hosts that are already in the computer room.

Companies that are now just starting to adopt virtualisation may feel that it’s a mature enough industry that the time is ripe for jumping in – and they’re right. In fact, it’s been mature enough for long enough that virtualisation is practically old hat.

Regardless of the maturity of virtualisation though, it doesn’t change the fact that you’re still at the mercy of hardware failures (or other critical virtualisation-host failures), and you still have to design your systems to provide the level of protection that (a) you can afford and (b) is necessary. When doing cost comparisons, it’s not appropriate to compare, say, the cost of replacing 5 servers with another 5 servers vs replacing 5 servers with 1 beefier server – virtualised services should never be about putting all the eggs in just one basket.

Without that consideration, it’s too easy to see MTBF for your computing environment fall through the floor – and blame virtualisation technology instead of the real culprit: the practical implementation.

…that the folks over at OpenOffice have concentrated so much on mimicking interfaces rather than trying to come up with their own interface.

If this is the best that can be achieved, I’m not surprised that OpenOffice takes longer to launch and gets uglier with each iteration. This is not interface design. It reminds me of the Kill-o-Zap gun from The Hitchhiker’s Guide to the Galaxy, which was described thusly in the book:

The designer of the gun had clearly not been instructed to beat about the bush. ‘Make it evil,’ he’d been told. ‘Make it totally clear that this gun has a right end and a wrong end. Make it totally clear to anyone standing at the wrong end that things are going badly for them. If that means sticking all sorts of spikes and prongs and blackened bits all over it then so be it. This is not a gun for hanging over the fireplace or sticking in the umbrella stand, it is a gun for going out and making people miserable with.’

I keep hoping this is a joke, but I can’t see anything that suggests it’s anything other than something serious. It’s not a mouse – it’s an obscenity of undesign.

Recently we’ve been seeing a lot of people upgrading to Windows 2008 SP2 without first checking the release notes and compatibility guides, which state that NetWorker doesn’t yet support this release.

I fully agree that this represents monumental slowness on the part of EMC … there’s absolutely no excuse – none whatsoever – for them not to be on the relevant developer programmes and partner programmes for all the supported operating systems so they get access to the new releases before they come out and then make sure there’s either hot-fixes or cumulative updates available to support new operating systems.

They don’t have to be on the same day, but it’s foolish short-sightedness at best that they don’t support a new OS release within say, 2 weeks of it hitting the general public, given that partner and developer programmes will give access to it for months in advance of that point.

Now, back to my original point – if you’re planning on rolling out a new service pack to an operating system, please take a few minutes to read the release notes or software compatibility guides – or ask your support team to fill you in, and if it’s not supported, roll out to a test client first so you can confirm the impact to your backup environment.

It takes two to tango – EMC needs to improve their response to new operating systems and new major updates to operating systems, but it’s equally important for people to remember to check these things before they upgrade, not after they get the first backup (or worse! recovery) error.

(Do I do these checks all the time? No – only in lab environments. It’s my job to identify bugs and issues before my customers find them as much as possible.)

Recently when I made an exasperated posting about lengthy ext3 check times and looking forward to btrfs, Siobhán Ellis pointed out that there was already a filesystem available for Linux that met a lot of my needs – particularly in the backup space, where I’m after:

  • Being able to create large filesystems that don’t take exorbitantly long to check
  • Being able to avoid checks on abrupt system resets
  • Speeding up the removal of files when staging completes or large backups abort

That filesystem of course is XFS.

I’ve recently spent some time shuffling data around and presenting XFS filesystems to my Linux lab servers in place of ext3, and I’ll fully admit that I’m horribly embarrassed I hadn’t thought to try this out earlier. If anything, I’m stuck looking for the right superlative to describe the changes.

Case in point – I was (and indeed still am) doing some testing where I need to generate >2.5TB of backup data from a Windows 32-bit client for a single saveset. As you can imagine, not only does this take a while to generate, but it also takes a while to clear from disk. I had got about 400 GB into the saveset the first time I was testing and realised I’d made a mistake with the setup so I needed to stop and start again. On an ext3 filesystem, it took more than 10 minutes after cancelling the backup before the saveset had been fully deleted. It may have taken longer – I gave up waiting at that point, went to another terminal to do something else and lost track of how long it actually took.

It was around that point that I recalled having XFS recommended to me for testing purposes, so I downloaded the extra packages required to use XFS within CentOS and reformatted the ~3TB filesystem to XFS.

The next test that I ran aborted due to a (!!!) comms error 1.8TB through the backup. Guess how long it took to clear the space? No, seriously, guess – because I couldn’t log onto the test server fast enough to actually see the space clearing. The backup aborted, and the space was suddenly back again. That’s a 1.8TB file deleted in seconds.
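If you want to put a number on this for your own filesystems rather than watching the free space, a small timing helper is enough. This is a rough Python sketch, not how NetWorker removes savesets. The function name is hypothetical, and the result only means something for a large file whose blocks have really been written, since a sparse file has almost nothing to free.

```python
import os
import time


def timed_delete(path):
    """Delete a file and return (size_in_bytes, seconds_taken).

    The timing covers only the unlink call itself, which is where a
    filesystem spends its time releasing the blocks of a large file.
    """
    size = os.path.getsize(path)
    start = time.perf_counter()
    os.remove(path)
    elapsed = time.perf_counter() - start
    return size, elapsed


if __name__ == "__main__":
    import sys

    # Usage: python timed_delete.py /path/to/large/test/file
    size, seconds = timed_delete(sys.argv[1])
    print(f"removed {size} bytes in {seconds:.3f} seconds")
```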

That’s the way a filesystem should work.

I’ve since done some (in VMs) nasty power-cycle mid-operation tests and the XFS filesystems come back up practically instantaneously – no extended check sessions that make you want to cry in frustration.

If you’re backing up to disk on Linux, you’d be mad to use anything other than XFS as your filesystem. Quite frankly, I’m kicking myself that I didn’t do this years ago.

This is no time to talk about time. We don’t have the time! (Star Trek: First Contact)

Are you one of the unlucky IT staff who don’t record timesheets? If you are, here’s what you can do to (a) convince yourself that you’re unlucky, and (b) convince your management that they need to change, quickly!

When I was first a system administrator, I had to do timesheets. Each week I’d grumble about having to do them, but compared to others in my team who would frequently be months or more behind, I’d usually get them in “on time”.

In my next system administration job, I didn’t have to do timesheets and initially thought I was blessed. After a while I noticed though that a lack of timesheets made for some very odd management attitudes. I’ll get to those in a minute – they’re at the crux of what I want to discuss.

Then I moved out of system administration into consulting, and pretty much ever since then I’ve had to do timesheeting. In fact, I ran a team of up to 20+ engineers for a couple of years and they can probably all attest to the fact that I used to bug them weekly to make sure they had their timesheets up to date.

You might say that I’m a believer.

I’ll do some initial qualifications here – I’m not a believer in a micromanagement timesheet approach that requires accurate measurement of time in 6 minute increments. Anything less than 15 minute intervals is entirely wasteful and creates too much administrative overhead. (Alternatively, if time recording should be done in increments of less than 15 minutes, then it should only be done in systems where the work flow, and the recording, can be done automatically. E.g., in call centres, etc.)

There are two common complaints that IT staff tend to have about timesheets:

  1. They’re a waste of time that could be better spent doing “real” work.
  2. The perception that they’re used by management to micromanage or even “punish” staff.

In any normal work environment, neither of these is remotely true.

Let’s look at each one to see why it’s wrong, and point out the counter argument to it.

They’re a waste of time that could be better spent doing “real” work.

I used to get this argument from time to time – heck, when I was just doing standard system administration work, I used to give this argument all the time. The simple fact of the matter is – it’s wrong.

When we work for someone, our work duties entail … what we’re told they entail. If part of that is timesheet entry, then that’s real work too. Call it “meta work” if you will … it’s “work about work”, but it’s still work.

They’re not a waste of time either. If a timesheet system is properly run, managed and reported from, it exists to allow the company to determine how resources are currently being utilised and how they’re trending. It also allows management to see how long tasks take. For IT staff themselves, that’s by and large the most liberating aspect of timesheets.

Particularly in companies that do consulting, timesheets fulfill another very important role – they allow the company to see how profitable they are, which in turn allows them to plan work cycles, which (ultimately for us) means they know they can pay us. This Is A Good Thing.

They’re used by management to micromanage or even “punish” staff

OK, there’s no backing away from this one – in some companies this has been the case. But those companies are relatively rare. Most companies want this information so they can work out how much it costs them (for in business, time is money) to do particular tasks – most good companies want this information so they can plan for personnel growth – i.e., to help normalise hours. Since (in Australia at least) it seems that practically 99% of IT people are on salaries rather than wages, this helps people get their personal lives back by having sufficient staff to meet business activity requirements.

Why you should be thankful for timesheets

Recall before I said that timesheets are liberating, because they track how long tasks take? I’ll bet at least 50% of people who read that statement chortle and think I’m off my rocker. Trust me, I’m not.

Regardless of whether you’re a consultant, or an administrator, or an operator, having empirical evidence of how long an activity takes can provide significant shielding from unrealistic expectations. Both in my own experience and in talking to people who complain about unrealistic management expectations, there’s a very common link between managers who set extremely unrealistic deadlines for IT staff – they tend to work in companies that don’t do timesheeting.

Why? Because the management don’t know how long things take. They don’t know how long an operating system install takes. They don’t know how long it takes to set up a lab AD server. They don’t know how long it takes to fix networking between branch offices. They don’t know how long it takes to do anything – not because they’re deliberately ignorant, but because they have no data available. That’s why, for instance, at 4pm on a Friday afternoon they’ll flick someone an email about needing to have a full Oracle development server ready by Monday morning when the hardware has been waiting idle for 2 weeks.

Guess how you make management understand how long things take? Timesheets.

If you work in an environment where timesheets aren’t done, stop kidding yourself that you’re lucky. You’re not.

On both Windows and Unix platforms, NetWorker maintains a “tmp” directory within nsr.

This directory contains a variety of information, from output received by savegroup completion notifications to lock/state files for certain NetWorker resources.

To explain why /nsr/tmp is wrong, let me first tell you a little story about the first system administration team I joined. They rigorously followed RFC-1178, and it’s ever since then that I’ve also done my best to follow that RFC – I’ve even written an article here on the blog about choosing appropriate names for backup servers. Sometime before I joined the team, they were in the process of setting up a replacement DNS server for the local datacentre. There was either a dispute about what to name it, or it was only meant to hang around for a short while, but for whatever reason, it was named tmp.

I worked in the group from 1996 through to 2000, and from what I heard, it wasn’t until several years after I left that tmp was decommissioned.

One of the most valuable lessons I took away is to name things appropriately. The DNS server tmp was not named appropriately. Thus, the name tmp or temp should be used only for transient data or systems. (To this day I never give machines names along the lines of ‘tmp’; the closest I’ll go is naming them after synonyms to do with trash or garbage – meaning that I’m fully aware that at any moment they can be blown away.)

To return to our topic, /nsr/tmp is wrong because it’s misnamed. Temporary files only make up some of its content. Other files, state files, can hang around between restarts of NetWorker and (particularly if NetWorker was incorrectly shut down) give backup administrators really bad days. In fact, the “magical random” nature of /nsr/tmp is so well known that it’s actually started to really bug EMC engineering. My understanding is that engineering want the contents of /nsr/tmp captured any time an EMC support representative tells someone to shutdown+delete+restart so that if it does fix the problem, they can try to debug why and remove the need.

The problem with shutdown+delete+restart is that in doing so, you clear out other information as well. Selectively deleting “the right file” can sometimes be a bit of a needle in a hay stack operation, and I suspect that debugging these deletes post-event will either be frustratingly slow or a bit like whack-a-mole.
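Capturing the directory before anything is deleted does not need special tooling. The sketch below archives a directory into a timestamped tar.gz using only the Python standard library. It is an illustration, not an EMC-supplied procedure: the function name and the archive naming scheme are assumptions, and the destination must sit outside the directory being captured.

```python
import tarfile
import time
from pathlib import Path


def snapshot_directory(source_dir, archive_dir):
    """Archive source_dir into a timestamped .tar.gz under archive_dir.

    archive_dir must not be inside source_dir, otherwise the archive
    would try to include itself. Returns the path of the new archive.
    """
    source = Path(source_dir)
    if not source.is_dir():
        raise NotADirectoryError(str(source))
    destination = Path(archive_dir)
    destination.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = destination / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(str(archive_path), "w:gz") as archive:
        # Store entries relative to the directory name, not the full path.
        archive.add(str(source), arcname=source.name)
    return archive_path


if __name__ == "__main__":
    import sys

    # Usage: python snapshot.py <directory to capture> <where to store it>
    print(snapshot_directory(sys.argv[1], sys.argv[2]))
```

Taking the copy before the delete step means there is still something to examine afterwards if the restart does fix the problem.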

Architecturally, to include both state and temporary files in the same common directory structure is silly. Having a few extra directories in the ‘nsr’ base directory on the other hand is a minor change. I’d suggest that more improvements might be made by first actually splitting /nsr/tmp into:

  • /nsr/lck – Resource lock files
  • /nsr/tmp – Real temporary files (e.g., savegroup output text)
  • /nsr/state – State files (if necessary)

That way /nsr/tmp will actually start to obey the Principle of Least Astonishment.

A short while ago I’d posted a brief piece about Cisco shopping for Tandberg. Unfortunately I sometimes leap before I look, and it occurred to me a short while ago that Cisco was shopping for Tandberg, not Tandberg Data.

Apologies to all.
