The pithy phrase “Months of boredom punctuated by moments of extreme terror” is thought to have originated early in the 20th century as a way to describe the experience of an infantryman living in battlefield trenches during World War I. This is also a good analogy to describe what it’s like to be a network admin working in the trenches of managing network configurations.

One can almost get lulled into a false sense that the network is impervious to downtime. But when (not if) the network does go down (and it’s usually due to human error), then it’s a frantic scramble to find the problem and fix it fast.

This article is the first in a three-part series that will look at five best practices for improving your network configuration management. You may be thinking, “If network configuration is months of boredom, then why should I read this?” The short answer is, “So you never have to experience those moments of extreme terror!”

Today, we will be talking about why network configuration management is so difficult and what’s at stake. In the second and third installments, we’ll talk about five best practices that will help you improve network uptime and save you time and money in the process.

The High Cost of Network Downtime

First, let’s talk about what’s at stake and why you should care. As alluded to, most network downtime is caused by human error. In fact, studies confirm that 80% of the incidents resulting in network downtime can be traced back to someone making a mistake. Eliminate the mistakes, and your network becomes all the more stable.

But why get uptight about a little network downtime? You could think of it this way: 60 minutes or less could save you $100K or more! But there is more to it than that. You also need to consider the adverse impact that network downtime will have on mission objectives. Unlike in the commercial sector, a network failure in the public sector, such as on a defense network, could put people at risk or even cost lives.

Studies examining the impact of network downtime also confirm that each hour of network downtime costs an organization $100K to $300K in direct and indirect costs. These expenses arise from regulatory and service-level agreement (SLA) non-compliance, mission risk, damage to reputation, and lost revenue, data, and productivity. These costs are very real and adversely affect mission and financial results.

Most organizations have SLAs that they strive to meet. If you have an average 99.9% (or “three nines”) network uptime, you experience about nine hours of downtime each year at a conservative cost of nearly $900K.

If you could cut downtime by 80% simply by eliminating human error (raising uptime to roughly 99.98%), you would eliminate about seven hours of downtime each year. That would not only yield conservative savings of about $700K, but also vastly improve mission success and reduce loss. So, from this discussion it should be apparent that there is strong motivation for improving network uptime.
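The downtime arithmetic can be checked with a short calculation. The sketch below is only illustrative: it assumes a 365-day year, uses the low end ($100K) of the hourly cost range cited above, and treats the 80% human-error share as fully avoidable. The function names are made up for this example, and the exact results may differ slightly from the rounded figures in the text.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(uptime_percent: float) -> float:
    """Hours of downtime per year implied by an uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def downtime_cost(hours: float, cost_per_hour: float = 100_000) -> float:
    """Cost of the given downtime at an assumed hourly rate (low-end estimate)."""
    return hours * cost_per_hour


# "Three nines" baseline from the article
baseline_hours = annual_downtime_hours(99.9)

# Assumption: all human-error downtime (80% of the total) is eliminated
avoided_hours = baseline_hours * 0.80
remaining_hours = baseline_hours - avoided_hours
improved_uptime = 100 * (1 - remaining_hours / HOURS_PER_YEAR)

print(f"Baseline downtime:  {baseline_hours:.2f} h/year")
print(f"Baseline cost:      ${downtime_cost(baseline_hours):,.0f}")
print(f"Downtime avoided:   {avoided_hours:.2f} h/year")
print(f"Estimated savings:  ${downtime_cost(avoided_hours):,.0f}")
print(f"Improved uptime:    {improved_uptime:.2f}%")
```

Swapping in your own uptime target and hourly cost gives a quick first estimate of what is at stake for your network.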

The Complexity of Network Configuration

Why is network configuration so prone to error? Think of it as the equivalent of a high-tech perfect storm. You have complex networks consisting of hundreds or even thousands of routers and switches. These devices vary widely in capability, age, and vendor. Each device must be expertly configured as a single entity using an instruction set consisting of tens to hundreds of command statements.

For even a small network of 500 devices, assuming a configuration file of 100 lines, this means there are 50,000 lines of configuration instructions that must be tested, deployed, and protected.

But there’s more! Also consider that for Cisco devices alone, there are over 17,000 command statements that an admin must know how to properly use. Finally, consider that these configurations are maintained manually by admins with varying degrees of skill and expertise. This means that any error in any statement on any device spells inevitable trouble, including likely non-compliance with mandatory DISA STIG or NIST/FISMA standards.

This creates tremendous incentive to institute rigorous management controls to keep device configurations pristine. When you have configurations that are working properly, you don’t want them changing unless there is justification, and the changes are made only by qualified persons. When considering what management processes to implement, you can often find good inspiration by reviewing a proven IT governance model like ITIL.
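One simple way to notice when a working configuration has changed is to compare the running configuration against an approved baseline. The following is a minimal sketch using only the Python standard library. The device lines, variable names, and helper functions are hypothetical and chosen for illustration; a real workflow would retrieve the configurations from the devices themselves and tie any detected change to an approval process.

```python
import difflib
import hashlib

# Hypothetical approved baseline and currently running configuration
APPROVED = """
hostname edge-rtr-01
no ip http server
service password-encryption
"""

RUNNING = """
hostname edge-rtr-01
ip http server
service password-encryption
"""


def fingerprint(config_text: str) -> str:
    """SHA-256 digest of a config, ignoring trailing whitespace on each line."""
    normalized = "\n".join(line.rstrip() for line in config_text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def config_drift(approved: str, running: str) -> list[str]:
    """Unified diff between the approved and running configs (empty if identical)."""
    return list(
        difflib.unified_diff(
            approved.strip().splitlines(),
            running.strip().splitlines(),
            fromfile="approved",
            tofile="running",
            lineterm="",
        )
    )


if fingerprint(APPROVED) != fingerprint(RUNNING):
    print("Configuration has drifted from the approved baseline:")
    for line in config_drift(APPROVED, RUNNING):
        print(line)
else:
    print("Configuration matches the approved baseline.")
```

The fingerprint gives a cheap yes/no check across many devices, while the diff shows exactly which statements changed so that a qualified person can review them.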

All of this sets the stage for the five best practices in the second and third parts of this article. In preparation for a detailed discussion on these practices, let’s briefly review them:

  1. Inventory and profile network systems
  2. Develop and deploy standardized device configurations
  3. Protect configurations from changes
  4. Audit configurations for compliance
  5. Use change controls to manage updates

Summary

We’ve talked about how the leading cause of network downtime is simple human error, and that these preventable errors account for about 80% of network outages, at an annual cost that can approach $1 million, as well as failed mission objectives and higher DISA STIG and NIST/FISMA compliance costs. We’ve also explored some compelling reasons why network configuration management is so difficult. Any mistake in any one of the 17,000-plus instructions issued to a device in your network can produce serious problems.

The solution is to implement effective management controls to help protect device configurations. In our next discussion, we will take a closer look at five best practices that can help eliminate human error and, in turn, improve network reliability. See you next time!