What can Systems Administrators learn from Fighter Pilots? This sounds like a ridiculous question at face value. Flying a fighter plane is completely different from running a network of servers, right? I propose it's not as different as you might think.
There is a military strategy developed by USAF Colonel John Boyd called the "OODA loop." According to Col. Boyd, decision-making occurs in a recurring cycle of Observe-Orient-Decide-Act.
Here are the four steps of the OODA loop in more detail.
- Observe: First we must observe something. That observation can be "My flight computer just alerted me that a missile is locked on" or "My cell phone is ringing, and caller ID says it's a customer."
- Orient: Next we need to orient ourselves. In the case of the missile lock, we need to acknowledge "That alarm means someone is going to shoot me down," or in the case of the phone call, "If my customer is calling, I need to answer the phone and find out what they need."
- Decide: We then need to decide what to do: for example, take evasive action, or reboot the server.
- Act: Finally we need to actually take that action: bank hard right, or reboot the server. At this point the loop resets. Boyd says the faster you can go through this loop, the easier it will be to get from the reaction side to the action side of the bubble.
There are two sides of a dogfight: action and reaction. The pilot who is in reaction mode will typically lose the dogfight. To win, a pilot must force his opponent into reacting to him.
Applying this to systems administration requires only a small leap of faith. A Systems Administrator is usually managing a complex group of networking infrastructure, servers, and services. If that Administrator is always reacting to problems, they will always be evading missiles.
Like the fighter pilot, the Systems Administrator must perform an action, or series of actions, to get onto the action side of the bubble. Here are a few examples of how a Systems Administrator can be proactive.
First, while working on a project, a server gets updated to a newer version of the operating system. During this upgrade the Systems Administrator notices a flaw in the way a service was configured: the configuration worked on the older version of the OS but no longer works after the upgrade. The Administrator reacts by fixing the configuration on the upgraded server. But being proactive, the Administrator also goes through the other servers that run that service and have not yet been upgraded, and reconfigures them to use the new, improved configuration. By proactively fixing the future problem, the Administrator has now eliminated something that later, when upgrading the OS on those other servers, he would have had to react to.
Next, part of a Systems Administrator's job is typically to take trouble tickets from users. The worst of these is when a customer, client, or co-worker calls in saying a server is down. After reacting and bringing the server back online, the Systems Administrator looks for evidence of why the server failed. For this example, let's say the system ran out of memory and crashed once it had no RAM or swap space left to write objects to. A proactive Systems Administrator will build a monitoring system that pays attention to the system metrics. This system has alerting rules, configurable by the Administrator, that trigger a warning when 70% of swap is used, an error when 80% of swap is used, and a critical warning when 90% of swap is used. With this new monitoring system in place, the Administrator can now proactively restart services that are consuming memory before the server fails. This prevents a service outage and spares the users from experiencing downtime.
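As a rough illustration of the alerting rules described above, here is a minimal Python sketch. It is not taken from any particular monitoring product: the function names `swap_used_percent` and `classify_swap` and the `SWAP_THRESHOLDS` table are made up for this example, and it assumes a Linux host where swap figures come from the `SwapTotal` and `SwapFree` fields of `/proc/meminfo`.

```python
from typing import Optional

# Thresholds from the example above: percent of swap in use -> severity.
# Listed from most to least severe so the first match wins.
# (Hypothetical table; tune the numbers for your own environment.)
SWAP_THRESHOLDS = [
    (90.0, "critical"),
    (80.0, "error"),
    (70.0, "warning"),
]


def swap_used_percent(meminfo_text: str) -> float:
    """Return the percentage of swap in use, given /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token after the colon is the numeric value (kB).
            fields[name.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    if total == 0:
        return 0.0  # no swap configured, so nothing can be "used"
    return 100.0 * (total - free) / total


def classify_swap(percent_used: float) -> Optional[str]:
    """Map a swap usage percentage to a severity, or None if below all thresholds."""
    for threshold, severity in SWAP_THRESHOLDS:
        if percent_used >= threshold:
            return severity
    return None
```

On a Linux server you could feed `swap_used_percent` the contents of `/proc/meminfo` on a schedule and send a notification whenever `classify_swap` returns something other than `None`. A real monitoring system would also track trends and avoid repeating the same alert, which this sketch leaves out.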
Finally, as Systems Administrators our skill set should be constantly evolving. It's not uncommon to learn new techniques or methods while on the job. When a newly learned method or technique is superior to the previous one, it is a good idea to apply it to previous installations. Even though nothing has technically failed, it is usually worth the time and effort to make the changes. Another benefit of backporting these changes to the other installations is that it reduces the number of configurations you need to understand, dramatically simplifying the troubleshooting process if something bad does happen.
In short, Colonel Boyd's strategy is to work through the Observe-Orient-Decide-Act loop quickly and efficiently enough to get onto the action side of the action/reaction bubble: process the available information faster than the opponent and act in a way that forces the opponent to react. For Systems Administrators, this means proactively looking for changes we can make to prevent problems from ever occurring. It also means using tools to monitor our networks so that, as often as possible, we know about problems before our end users do.
Have you found being proactive to be beneficial overall? What techniques have you used to be proactive in your systems administration?