Tuesday, April 29, 2014

7 habits of highly successful Unix admins


You can spend 50-60 hours a week managing your Unix servers and responding to your users' problems and still feel as if you're not getting much done, or you can adopt some good work habits that will both make you more successful and prepare you for the next round of problems.

Unix admins generally work a lot of hours, juggle a large set of priorities, get little credit for their work, come across as arrogant to admins of other persuasions, tend to prefer elegant solutions to even the simplest of problems, take great pride in their ability to apply regular expressions to any challenge that comes their way, and are inherently lazy -- or at least they're constantly on the lookout for ways to type fewer characters, even when they're doing the most routine work.
While skilled and knowledgeable, they could probably get a whole lot more done and get more credit for their work if they adopted some habits akin to those popularized in the 1989 book by Stephen R. Covey -- The 7 Habits of Highly Effective People. In that light, here are some habits for highly successful Unix administration.

Habit 1: Don't wait for problems to find you

One of the best ways to avoid emergencies that can throw your whole day out of kilter is to be on the alert for problems in their infancy. I have found it valuable to install scripts on my servers that report unusual log entries, check performance and disk space statistics, report application failures or missing processes, and email me when anything looks "off". There are two risks: getting so much of this kind of email that you stop reading it, and failing to notice when the messages stop arriving or start landing in your spam folder. Noticing which messages *aren't* arriving is not unlike noticing who from your team of 12 or more people hasn't shown up for a meeting.
If you are proactive, you are likely to spot a number of problems long before they turn into outages and before your users notice the problems or find that they can no longer get their work done.
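As an illustration of the kind of check described above, here is a minimal sketch of a disk space report. The function name `disk_usage_alerts`, the 90% threshold, and the `usage_fn` parameter are my own choices for the example, not part of any standard tool; emailing the result is left out.

```python
import shutil

def disk_usage_alerts(mount_points, threshold_pct=90.0, usage_fn=shutil.disk_usage):
    """Return one warning line for every mount point at or above threshold_pct."""
    alerts = []
    for mount in mount_points:
        usage = usage_fn(mount)  # object with .total and .used, in bytes
        if usage.total == 0:
            continue  # skip filesystems that report no capacity
        pct_used = 100.0 * usage.used / usage.total
        if pct_used >= threshold_pct:
            alerts.append(f"{mount}: {pct_used:.1f}% used")
    return alerts
```

Run from cron with a list such as `["/", "/var", "/home"]` (hypothetical mount points), a script like this stays quiet when all is well and produces output only when something needs attention, which helps keep the volume of email down.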
It's also extremely beneficial if you have the resources needed to plan for disaster. Can you fail over a service if one of your primary servers goes down? Can you rely on your backups to rebuild a server environment quickly? Do you test your backups periodically to be sure they are complete and usable? Preparing disaster recovery plans for critical services (e.g., knowing that the mail service can be migrated to the spare server in the data center and that the NIS+ service has been set up with a replica) can keep you from scrambling and wasting a lot of time when the pressure is on.
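One way to test a backup periodically is sketched below, assuming the backup is a tar archive (optionally compressed). The function name `verify_backup` and the idea of passing a list of paths you expect to find are assumptions made for this example. A check like this complements an occasional full restore to a scratch machine; it does not replace one.

```python
import tarfile

def verify_backup(archive_path, expected_paths):
    """Return a list of problems found in a tar backup; an empty list means none were found."""
    seen = set()
    try:
        with tarfile.open(archive_path) as archive:
            for member in archive:
                seen.add(member.name)
                if member.isfile():
                    # Read each file to the end so truncation or corruption surfaces here.
                    extracted = archive.extractfile(member)
                    while extracted.read(1024 * 1024):
                        pass
    except (tarfile.TarError, OSError, EOFError) as exc:
        # These cover the most common read failures, not every possible one.
        return [f"cannot read {archive_path}: {exc}"]
    return [f"missing from backup: {path}" for path in expected_paths if path not in seen]
```

Note that member names inside a tar archive are usually stored without a leading slash, so the expected paths would look like `etc/hosts` rather than `/etc/hosts`.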

Habit 2: Know your tools and your systems

Probably the best way to recognize that one of your servers is in trouble is to know how that server looks under normal conditions. If a server typically uses 50% of its memory and starts using 99%, you're going to want to know what is different. What process is running now that wasn't before? What application is using more resources than usual?
Be familiar with a set of tools for looking into performance issues, memory usage, etc. I use and encourage others to use the sar command routinely, both to see what's happening now on a system and to look back in time to get an idea when the problems began. One of the scripts that I run on my most critical servers sends me enough data that I can get a quick view of the last week or two of performance measures.
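Knowing what "normal" looks like can be made concrete with a small comparison against past measurements. The sketch below assumes you already have a list of historical samples (for example, memory-used percentages pulled from your performance reports); the function name `is_unusual` and the three-standard-deviation cutoff are arbitrary choices for illustration.

```python
from statistics import mean, stdev

def is_unusual(history, current, max_sigma=3.0):
    """Return True when current is more than max_sigma standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough data to say what "normal" is
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline  # perfectly flat history: any change is unusual
    return abs(current - baseline) / spread > max_sigma
```

With a history hovering around 50%, a reading of 99% is flagged while a reading of 51% is not.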
It's also a good idea to be practiced with all of the commands that you might need to run when a problem occurs. Can you construct a find command that helps you identify suspect files, large files, or files with permissions problems? Knowing how to use a good debugger can also be a godsend when you need to analyze a process. Knowing how to check network connections can be just as important when your systems might be under attack.
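For comparison with a find command, here is a rough Python sketch of the same kind of search: it walks a directory tree and reports regular files that are large or world-writable. The function name `find_suspect_files` and the 100 MB default are assumptions for the example, and "world-writable" is only one of many possible permissions problems.

```python
import os
import stat

def find_suspect_files(root, min_size_bytes=100 * 1024 * 1024):
    """Yield (path, reason) for large or world-writable regular files under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat so symbolic links are not followed
            except OSError:
                continue  # file vanished or cannot be examined; skip it
            if not stat.S_ISREG(info.st_mode):
                continue  # ignore symlinks, sockets, and other non-regular files
            if info.st_size >= min_size_bytes:
                yield path, "large"
            if info.st_mode & stat.S_IWOTH:
                yield path, "world-writable"
```

A file that is both large and world-writable is reported twice, once for each reason.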

Habit 3: Prioritize, prioritize, prioritize

Putting first things first is something of a no-brainer when it comes to how you organize your work, but sometimes selecting which priority problem qualifies as "first" may be more difficult than it seems. To properly prioritize your tasks, you should consider the value to be derived from the fix. For me, this often involves how many people are affected by the problem, but it also involves who is affected. Your CEO might have to be counted as equivalent to 1,000 people on your development staff. Only you (or your boss) can make this decision. You also need to consider how much they're affected. Does the problem imply that they can't get any work done at all or is it just an inconvenience?
Another critical element in prioritizing your tasks is how long a problem will take to resolve.
Unless the problem that I'm working on is related to an outage, I try to "whack out" those that are quick to resolve. For me, this is analogous to the "ten items or fewer" checkout at the supermarket. If I can resolve a problem in a matter of minutes and then get back to the more important problem that is likely to take me the rest of the day to resolve, I'll do it.
You can devise your own numbering system for calculating priorities if you find this "trick" to be helpful, but don't let it get too complicated. Maybe your "value" ratings should only go from 1 (low) to 5 (critical), your number of people might go from 1 (an individual) to 5 (everybody), and your time required might be 1 (weeks), 2 (days), 3 (hours) or 4 (minutes). Whatever scheme you choose, some way to quantify and defend your priorities is always a good idea.
value * # people affected * time req'd = priority (highest # = highest priority)
Problem #1: 3 * 2 * 2 = 12
Problem #2: 5 * 1 * 4 = 20
Problem #2 would get to the top of your list in this scenario.
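The same arithmetic can be written as a few lines of Python. This is only a sketch of the scoring scheme above; the function and variable names are made up for the example, and the range checks simply mirror the 1-to-5 and 1-to-4 scales suggested in the text.

```python
def priority(value, people_affected, time_required):
    """Multiply the three ratings; a higher product means a higher priority."""
    if not 1 <= value <= 5:
        raise ValueError("value must be 1 (low) to 5 (critical)")
    if not 1 <= people_affected <= 5:
        raise ValueError("people_affected must be 1 (an individual) to 5 (everybody)")
    if not 1 <= time_required <= 4:
        raise ValueError("time_required must be 1 (weeks) to 4 (minutes)")
    return value * people_affected * time_required

# The two problems from the example above, as (value, people, time) ratings.
problems = {"problem #1": (3, 2, 2), "problem #2": (5, 1, 4)}
ranked = sorted(problems, key=lambda name: priority(*problems[name]), reverse=True)
```

Here `ranked` lists problem #2 (score 20) ahead of problem #1 (score 12).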

Habit 4: Perform post mortems, but don't get lost in them

Some Unix admins get far too carried away with post mortems. It's a good idea to know why you ran into a problem, but finding out may not be worth many hours of your time. If a problem you encountered was a very serious, high-profile problem, and could happen again, you should probably spend the time to understand exactly what happened. Far less serious problems might not warrant that kind of scrutiny, so you should probably put a limit on how much time you devote to understanding the cause of a problem that was fairly easily resolved and had no serious consequences.
If you do figure out why something broke, not just what happened, it's a good idea to keep some kind of record that you or someone else can find if the same thing happens months or years from now. As much as I'd like to learn from the problems I have run into over the years, I have too many times found myself facing a problem and saying "I've seen this before ..." and yet not remembered the cause or what I had done to resolve the problem. Keeping good notes and putting them in a reliable place can save you hours of time somewhere down the line.
You should also be careful to make sure your fix really works. You might find a smoking gun only to learn that what you thought you fixed still isn't working. Sometimes there's more than one gun. Try to verify that any problem you address is completely resolved before you write it off.
Sometimes you'll need your end user to help with this. Sometimes you can su to that user's account and verify the fix yourself (always my choice).

Habit 5: Document your work

In general, Unix admins don't like to document the things that they do, but some things really warrant the time and effort. I have built some complicated tools and enough of them that, without some good notes, I would have to retrace my steps just to remember how one of these processes works. For example, I have some processes that involve Visual Basic scripts that run on a Windows virtual server and send data files to a Unix server that reformats the files using Perl, preparing them to be ingested into an Oracle database. If someone else were to take over responsibility for this setup, it might take them a long time to understand all the pieces, where they run, what they're doing, and how they fit together. In fact, I sometimes have to stop and ask myself "wait a minute; how does this one work?" Some of the best documentation that I have prepared for myself outlines the processes and where each piece is run, displays data samples at each stage in the process and includes details of how and when each process runs.

Habit 6: Fix the problem AND explain

Good Unix admins will always be responsive to the people they are supporting, acknowledge the problems that have been reported and let their users know when they're working on them. If you take the time to acknowledge a problem when it's reported, inform the person reporting the problem when you're actually working on the problem, and let the user know when the problem has been fixed, your users are likely to feel a lot less frustrated and will be more appreciative of the time you are spending helping them. If, going further, you take the time to explain what was wrong and why the problem happened, you may allow them to be more self-sufficient in the future and they will probably appreciate the insights that you've provided.

Habit 7: Make time for yourself

As I've said in other postings, you are not your job. Taking care of yourself is an important part of doing a good job. Don't chain yourself to your desk. Walk around now and then, take mental breaks, and keep learning -- especially things that interest you. If you look after your well-being, renew your energy, and step away from your workload for brief periods, you're likely to be both happier and more successful in all aspects of your life.
