I have written in recent weeks that most networking purchases are dominated by what I call the Year One Problem. That is to say, networks need to continuously add capacity at ever-increasing rates while budgets remain flat to slightly up (if you are lucky). This creates a dynamic where the only criterion that matters in any individual purchasing decision is the price-per-usable-capacity. Not surprisingly, we have seen a wave of white box fever rush over our industry.

But sitting just over the horizon is the Year Three Problem. The fact that operating expense dominates total cost over the life of a device is not lost on anyone. Those CIOs, VPs, and Lead Architects who want to maintain their job security for years to come have to solve the Year One Problem while simultaneously extracting operating expense out of the network. This brings workflow automation into play, which is why SDN has been all the rage for the past 18 months or longer (depending on when you got engaged with it).

Imagine successfully navigating both the Year One and Year Three gauntlets only to find out that it was all for naught. Enter the Year Five Problem.

While some people would only admit it sheepishly, the reality is that for most companies, network architectures are fairly static (and oftentimes more static than we would like to think). Once a direction is chosen, that architecture will exist for many years, serving as the central point for incremental growth. New capacity comes in the form of new devices that look remarkably like the old devices. Over time, older generations might be replaced with newer generations, but for the most part, the network continues as it always has.

This actually makes a lot of sense. Evaluating new architectures is a painful process at best. Having to go through that selection process every 2 or 3 years would be a very expensive way to run a network. Combine that with the politics that come with new architecture decisions and you would end up with an exceedingly difficult operating environment.

But the risk that companies run is that their architecture becomes part of their operating inertia. Things get done the way they do because that’s the way they have always been done. And the difficulty here is that if those initial decisions did not consider the eventual scale at which the entire thing must operate, the data center could be in for some rough sailing.

So where does this show up?

Most decisions in the data center start with capacity. When you select an architecture, you expect that it will service your current capacity requirements adjusted for some future vision of growth. But as you grow, what happens to the devices that you deployed initially? Will they continue to perform? Or must they be replaced?

Juniper has actually done a masterful job of this on the provider side. With their T-series routers, they have provided a base chassis designed to scale as their customers have grown. It’s hard to imagine, but they released a product in 2007 (meaning work began in 2004/5) that has been field upgradable through today. The line card design on the T640 (now the T4k) was such that customers could upgrade without having to re-architect the network. That kind of forward thinking has served T-series customers well.

How could this idea of architectural longevity be applied in the data center? It’s probably not a line card play (ToRs are cheap enough that you replace the entire thing). But how does the data center interconnect scale?

Some architectures handle this by aggregating more and more traffic through higher-powered core switches. The challenge here is that if capacity is forever increasing, the eventual solution simply cannot be aggregation. At some point, there will not be enough ports, power, or bandwidth to support such a model. And even if the physical limitations don’t come into play, the fact that building out requires more and more building up means that for every unit of usable capacity at the edge, you end up buying roughly three units of switching capacity once the aggregation and core tiers are counted. Forget the CapEx; the operational load of tiered architectures is exactly why fabric-based solutions have grown in popularity. Making an inherently tiered architecture cheaper to buy doesn’t change the long-term economics.
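
To make that multiplier concrete, here is a minimal back-of-the-envelope sketch. The three-tier structure and the oversubscription ratios are illustrative assumptions, not figures from any particular product; the point is simply that in a tiered design, every unit of edge traffic is switched once per tier it crosses, so you pay for it at every tier.

```python
# Back-of-the-envelope sketch of the multiplier above. In a tiered
# aggregation design, traffic that leaves the rack is switched once per
# tier, so capacity has to be purchased at every tier it crosses. The
# tier count and oversubscription ratios are illustrative assumptions.

def raw_capacity_multiplier(oversub_per_tier=(1, 1, 1)):
    """Units of switching capacity purchased per unit of usable edge capacity.

    oversub_per_tier[i] is the oversubscription ratio applied at tier i
    (access/ToR, aggregation, core); 1 means non-blocking."""
    return sum(1.0 / ratio for ratio in oversub_per_tier)

# Non-blocking three-tier design: every usable unit is bought three times.
print(raw_capacity_multiplier((1, 1, 1)))   # 3.0
# Oversubscribing the upper tiers lowers the multiplier, but only by
# trading away usable east-west bandwidth.
print(raw_capacity_multiplier((1, 3, 6)))   # ~1.5
```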

What is happening in our industry is that we are caught in the grasp of this Year One Problem. It is so profound and difficult to stay on top of that it has clouded some of our judgment. How can you even imagine architecting your network in a meaningfully better way when you can barely afford to keep the lights on as it is? And given this dynamic, how does anyone plan for the future?

We have been collectively accumulating data center debt through relatively poor architectural hygiene. I should add that this is through no fault of our own. Technology evolves more slowly than we would like, the day-to-day cannot be ignored, and change is never easy. This is not the product of a series of bad decisions but rather the amalgamation of patchwork fixes required to keep the network running.

The challenge, though, is that there are no easy fixes for poor hygiene. If you put on 50 extra pounds, you don’t shed the weight by hopping on the treadmill for 17 hours. You have to dutifully diet and exercise every day until the pounds come off. Similarly, when you are in the throes of architectural pain, you don’t need a massive rebuild (or at least you don’t have to rely on one). You can dutifully work the debt off.

And it all starts in Year One.

By making forward-looking decisions about new gear that will allow you to drive OpEx down in a meaningful way, you can free up overall expense to give you some architectural flexibility. But this only works if the devices you buy in Year One are thoughtfully designed to accommodate your Year Five scale. That means you ought to be adding a whole line of questions to your selection process.

  • Does the hardware scale? If so, how specifically do you free up new capacity?
  • What are the operational implications of expanding the architecture to very large topologies?
  • Is scaling achieved purely by adding new capacity (i.e., building out and up)? Or can you get more out of your existing capacity (i.e., utilization)?
  • How large can the architecture grow before the number of interconnect ports is excessively high? And what is that threshold for you? (There is a rough sketch of this math after the list.)
  • Does the hardware decision change as manufacturing advances change the economics (single-mode vs. multi-mode fiber, for example)?
  • What happens when the number of applications doubles?
  • Can the architecture support geographical separation? Or is that a brand new data center?
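
On the interconnect-port question, a minimal sketch under assumed parameters can make the threshold concrete. The 48-port leaf, 32-port spine, and 3:1 oversubscription figures below are illustrative assumptions, not vendor specs; the point is that a simple two-tier leaf/spine design has a hard ceiling, and a meaningful share of every port you buy goes to the fabric rather than to servers.

```python
# Rough sizing of a two-tier leaf/spine fabric: how many servers fit before
# you run out of spine ports, and what fraction of purchased ports is spent
# on fabric interconnect rather than on servers. All radices and ratios are
# illustrative assumptions.

def two_tier_limits(leaf_radix=48, spine_radix=32, oversub=3):
    uplinks_per_leaf = leaf_radix // (oversub + 1)          # e.g. 12 uplinks
    server_ports_per_leaf = leaf_radix - uplinks_per_leaf   # e.g. 36 server ports
    max_leaves = spine_radix            # every leaf connects to every spine
    spines = uplinks_per_leaf           # one spine per leaf uplink
    max_servers = max_leaves * server_ports_per_leaf
    interconnect_ports = 2 * max_leaves * uplinks_per_leaf  # both ends of each link
    total_ports = max_leaves * leaf_radix + spines * spine_radix
    return max_servers, interconnect_ports / total_ports

max_servers, share = two_tier_limits()
print(f"two-tier ceiling: {max_servers} servers, "
      f"{share:.0%} of purchased ports spent on interconnect")
# Beyond that ceiling you either buy bigger spines or add a tier, and every
# added tier raises both the interconnect share and the capacity multiplier.
```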

In Year One, the answers might not be clear, but the questions have to be asked. Diligent scrutiny is really the only way to check that architectural inertia is not the sole driving force behind your data center strategy.