The data explosion in IT systems like social networking and media sharing is well known. However, a similar data scale problem impacts distributed operational systems from industrial automation to Navy ship control. These demanding environments require very high-performance infrastructure. Specialized designs that work at smaller scales cannot handle the data deluge. Technologies are evolving to rise to this challenge, but their use is not always straightforward.
The real problem is that most design mistakes don’t show up until the infrastructure is stressed. By the time you find out whether your infrastructure is good, it is often too late to change. But all is not lost: there are ten commonly made mistakes that you can avoid when designing a distributed system.
1. Glossing over your data model
The most basic mistake designers can make is to not understand their data model. Some don’t realize they have a data model. By data model, we mean the set of data types or schema in the system, coupled with an understanding of the sharing requirements of that data. Every distributed system design should start with a clear analysis of this information exchange.
Without a data model design, all the system’s applications will develop their own models. Sending information between entities becomes an exercise in translating types and communication patterns at every step. In an IT system with the luxury of an enterprise service bus (ESB), that may be acceptable. But it does not work in systems that need top performance.
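As a minimal sketch of what an explicit data model can look like, the following Python snippet declares one shared type and records its sharing requirements next to it. The type name `VehiclePosition`, its fields, and the `SharingPolicy` attributes are hypothetical and chosen only for illustration; a real system would define its own.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class VehiclePosition:
    """Hypothetical shared data type: one schema used by every application."""
    vehicle_id: str
    latitude_deg: float
    longitude_deg: float
    timestamp_s: float


@dataclass(frozen=True)
class SharingPolicy:
    """Hypothetical sharing requirements that accompany a data type."""
    topic: str
    max_rate_hz: float
    reliable: bool


# The data model: each type is paired with how it is shared.
DATA_MODEL = {
    VehiclePosition: SharingPolicy(
        topic="vehicle/position", max_rate_hz=10.0, reliable=False
    ),
}


def encode(sample):
    """Tag a sample with its topic; no per-application translation is needed."""
    policy = DATA_MODEL[type(sample)]
    return {"topic": policy.topic, "data": asdict(sample)}
```

Because every application imports the same definitions, a sender and a receiver agree on field names, units, and delivery expectations without a translation layer between them.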
2. Misunderstanding your data flow
Every distributed system works by flowing information between components. Some designers think about interfaces and not about volumes. A system built without flow clarity will suffer choke points, waste resources, and sacrifice reliability.
The greatest risk is to choose an architecture that doesn’t fit the flow. That leads to server overload, scalability problems, and startup sequencing issues. The most painful realization is to discover your fundamental architecture will not work in the final system.
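Thinking about volumes can start with simple arithmetic. The sketch below compares the load on a central server that relays every sample to every subscriber with the load on a network where each sample is multicast once. The function names and the example figures are assumptions made for illustration, and the model ignores protocol overhead.

```python
def brokered_load_bps(publishers, subscribers, rate_hz, sample_bytes):
    """Bits per second through a central server that relays every sample."""
    inbound = publishers * rate_hz * sample_bytes * 8
    outbound = publishers * subscribers * rate_hz * sample_bytes * 8
    return inbound + outbound


def multicast_load_bps(publishers, rate_hz, sample_bytes):
    """Bits per second on the network when each sample is multicast once."""
    return publishers * rate_hz * sample_bytes * 8


if __name__ == "__main__":
    # Hypothetical flow: 100 publishers, 20 subscribers, 10 Hz, 200-byte samples.
    print(brokered_load_bps(100, 20, 10, 200))
    print(multicast_load_bps(100, 10, 200))
```

With these assumed figures, the relaying server handles 33.6 Mbit/s while the multicast flow puts 1.6 Mbit/s on the network. A gap of that size is the kind of choke point that is much cheaper to find on paper than in the final system.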
3. Uncontrolled state
Unstructured data flows or data models cause many problems, including forcing every application to store and maintain its own state. The unstructured distributed application has a lot of state, and usually none of it is understood.
All this unmanaged state quickly leads to inconsistency. Inconsistent systems are brittle; they break in many ways since there is no robust way to agree on the truth. This problem is not unique to distributed systems. For instance, storage faces similar challenges. If you construct your storage system from a variety of special files, static variables, and other caches, it quickly becomes unstable. The database arose as an infrastructure technology that provides one key benefit: an accessible source of consistent truth.
Databases offer a clean conceptual interface to data: create, read, update, and delete (sometimes called CRUD). By enforcing structure and simple rules that control the data model, databases ensure consistency and greatly ease system integration. This “data-first” approach makes a database a “data-centric” technology.
Data-centric middleware offers a similar benefit for distributed systems. Like a database, data-centric middleware imposes known structure on the transmitted data. The infrastructure, and all of its associated tools and services, can now access the data through CRUD-like operations. Clear rules govern access to the data, how data in the system changes, and when participants get updates. As in databases, this accessible source of consistent truth greatly eases system integration. Thus, the data-centric distributed DataBus does for data in motion what the database does for data at rest.
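The idea of CRUD-like access to data in motion can be sketched in a few lines. The `DataSpace` class below is a hypothetical in-process stand-in, not the API of any real middleware: it keeps one current value per topic and key, and it notifies subscribers whenever that value changes or is removed.

```python
class DataSpace:
    """Hypothetical in-memory data space with CRUD-like operations."""

    def __init__(self):
        self._instances = {}   # (topic, key) -> current value
        self._listeners = {}   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """Register callback(key, value) for every change on a topic."""
        self._listeners.setdefault(topic, []).append(callback)

    def _notify(self, topic, key, value):
        for callback in self._listeners.get(topic, []):
            callback(key, value)

    def write(self, topic, key, value):
        """Create or update an instance, then tell the subscribers."""
        self._instances[(topic, key)] = value
        self._notify(topic, key, value)

    def read(self, topic, key):
        """Return the current value, or None if the instance does not exist."""
        return self._instances.get((topic, key))

    def dispose(self, topic, key):
        """Delete an instance; subscribers see None as the new value."""
        self._instances.pop((topic, key), None)
        self._notify(topic, key, None)
```

The single shared store is what gives every participant the same answer to "what is the current value?", which is the consistency benefit the database analogy describes.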
4. Ignoring evolution
Some designers consider the system implementation to be static. In reality, distributed systems are anything but static. They evolve over time in many dimensions:
- Your data model will not stay fixed;
- Versions of your software change;
- Applications will join and leave the system;
- Existing software systems and independently developed modules will be integrated.
The first rule for easing evolution is to avoid unnecessary coupling. For instance, the model view controller (MVC) design prevents linking unrelated concepts in graphical display systems. The real power comes from knowing which data to access for what purpose. Then, the application puts only the related data into the same structure. Separating concerns enables easier evolution.
Distributed systems face an even greater challenge. In a typical distributed system, the crush of integrating many diverse applications will eventually overwhelm all such strategies. A strategy to deal with this is needed and the middleware can help considerably. For instance, the Object Management Group’s (OMG) Data Distribution Service (DDS) standard has an evolving specification for extensible types (XTypes). XTypes allow types to evolve over time while supporting automatic interoperability. For instance, adding or changing a few fields does not preclude information transfer.
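The following sketch illustrates the general idea of tolerating small type changes. It is not the XTypes algorithm and uses no DDS API: a reader built against a newer version of a type accepts data from an older writer by filling in defaults, and it ignores fields it does not know. The `TrackV2` type and the `decode_tolerant` helper are hypothetical.

```python
from dataclasses import MISSING, dataclass, fields


@dataclass
class TrackV2:
    """Hypothetical type, version 2: heading_deg was added after version 1."""
    track_id: str
    speed_mps: float
    heading_deg: float = 0.0


def decode_tolerant(cls, payload):
    """Build cls from a dict, ignoring unknown fields and defaulting new ones."""
    kwargs = {}
    for f in fields(cls):
        if f.name in payload:
            kwargs[f.name] = payload[f.name]
        elif f.default is MISSING and f.default_factory is MISSING:
            # A field with no default cannot be filled in safely.
            raise ValueError(f"required field missing: {f.name}")
    return cls(**kwargs)
```

A version 1 payload such as `{"track_id": "t1", "speed_mps": 3.0}` decodes with `heading_deg` set to its default, and a payload carrying an extra field from a later version decodes with that field dropped.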
XTypes provide a critical first functionality for system evolution, but can only go so far. They can help match changing types. However, they cannot bridge major type changes, protect fields from export, or match differing communication patterns. For instance, consider an application that is designed to call a “get info” method. This is a “request reply” communication pattern; the caller expects the recipient to reply with the desired information. What happens if the recipient server changes to a “publish subscribe” model so it can efficiently support many more requesters? Bridging that gap is critical.
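One way to picture such a bridge is an adapter that subscribes to the published data, caches the latest sample, and answers the legacy "get info" call from that cache. The sketch below assumes this caching approach; the `InfoBridge` class and its method names are hypothetical, and it leaves out details such as how stale a cached value may be.

```python
class InfoBridge:
    """Hypothetical adapter: serves request-reply callers from published data."""

    def __init__(self):
        self._latest = {}   # key -> most recent published sample

    def on_sample(self, key, value):
        """Subscriber side: called whenever the server publishes an update."""
        self._latest[key] = value

    def get_info(self, key):
        """Legacy side: reply immediately with the last published value."""
        if key not in self._latest:
            raise LookupError(f"no sample published yet for {key!r}")
        return self._latest[key]
```

The legacy caller keeps its request-reply interface, while the server publishes each update once regardless of how many requesters exist.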
This type of integration problem is well solved in enterprise IT with an Enterprise Service Bus (ESB) technology. Similar service bus technologies are now arising in the real-time embedded space and will permit designers to integrate and maintain complex, evolving systems.
5. Assuming all data communications should be reliable
Eighty percent of networking code is for handling error conditions. A complex, distributed system has many ways to fail. You must understand the likely failure modes, the consequences of those failures, and how you will react to failures. With that in hand, you can begin to understand what you mean by “reliable.”
Most enterprise networking works on top of Transmission Control Protocol (TCP), which imposes a strict definition of reliability: every byte will eventually be delivered. This is far too restrictive for a system that must also respond reliably in time. There’s an “uncertainty principle” of networking that trades off delivery reliability for timing. The simplest example: if you must retry every dropped packet indefinitely, you can make no guarantee about timing.
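The trade-off between delivery and timing can be made concrete with a bounded retry loop. In this sketch, time is simulated with integer milliseconds rather than a real clock, and the function name and parameters are hypothetical. The sender gives up once the next attempt would start after the deadline, so it can promise a bound on time but not on delivery.

```python
def send_with_deadline(send_once, deadline_ms, retry_interval_ms):
    """Retry send_once() until it succeeds or the deadline passes.

    Returns (delivered, attempts, elapsed_ms). Time is simulated: each
    failed attempt advances the clock by retry_interval_ms.
    """
    attempts = 0
    elapsed_ms = 0
    while elapsed_ms <= deadline_ms:
        attempts += 1
        if send_once():
            return True, attempts, elapsed_ms
        elapsed_ms += retry_interval_ms
    # Deadline reached: report failure instead of retrying forever.
    return False, attempts, elapsed_ms
```

Removing the deadline check turns this into the "retry indefinitely" policy: every message is eventually delivered if the link recovers, but nothing can be said about when.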
Thus, real consideration of distributed reliability requires considering factors like retries, persistence, and delivery timing guarantees. The Army’s “Blue Force Tracker” application provides a real example. Its goal is to track hundreds of thousands of vehicles and other assets so friendly forces can identify them. The legacy system used a transactional messaging technology that ensured every update was delivered and acknowledged by every recipient. That design required eleven servers with 88 total cores to track 12,000 assets. A data-centric publish subscribe design that leveraged multicast and a “NACK” (negative acknowledgment) reliability protocol handles 250,000 tracks on a single core. The main difference is that the new system matched the reliability requirements to system needs, rather than assuming reliable delivery implied reliable operation. The Blue Force Tracker system gained 25x performance by matching the communications and reliability paradigm.
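To illustrate why a NACK-style protocol is lighter than acknowledging every update, the sketch below shows the receiver side of a generic negative-acknowledgment scheme. It is not the protocol used in the Blue Force Tracker system; the `NackReceiver` class is hypothetical. The receiver stays silent while sequence numbers arrive in order and asks for repairs only when it detects a gap.

```python
class NackReceiver:
    """Hypothetical receiver that requests retransmission only for gaps."""

    def __init__(self):
        self.next_expected = 0   # lowest sequence number not yet seen
        self.missing = set()     # sequence numbers already requested

    def on_packet(self, seq):
        """Process one packet; return the sequence numbers to NACK."""
        if seq in self.missing:
            self.missing.discard(seq)   # a requested repair arrived
            return []
        if seq < self.next_expected:
            return []                   # duplicate, nothing to do
        gap = list(range(self.next_expected, seq))
        self.missing.update(gap)
        self.next_expected = seq + 1
        return gap
```

When nothing is lost, the receiver sends no feedback at all, so the sender's work does not grow with every update and every recipient the way per-message acknowledgments do.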
6. Saving system integration strategy for the end
The schedule will never again look as good as it does today. For most systems, it’s because system integration is left at the end of the program, and (worse) approached with no strategy. The only way to prepare for system integration is to consider system implications at every stage. The only way to reduce system integration effort is to design the system from the first day to adapt to other technologies.
This can be difficult because systems, especially large systems, are complex. The key to reducing complexity is to reduce coupling. Coupled systems have many interactions. Those interactions are fundamentally the source of complexity. There are many approaches that can help.
First, you can control the behavior of applications on the network. The OMG DDS standard is the only distributed specification that models and rigorously enforces behaviors. It does that with a unique Quality of Service (QoS) request-offered protocol. Each module offers and requests interaction parameters such as data rates, reliability, and liveliness. The DDS middleware matches offers with requests. When a match is found, the middleware then enforces the resulting contract.
This level of known behavior greatly eases eventual integration. The process of thinking through all these requirements requires time, but dealing with these issues early greatly reduces the overall system development and integration time compared to discovering all these interactions at the end.
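The request-offered idea described above can be sketched as a compatibility check between what a publisher offers and what a subscriber requests. The `Qos` type and `is_compatible` function are hypothetical simplifications covering only two parameters, and they are not the DDS API. The rule assumed here is that an offer matches when it is at least as strong as the request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Qos:
    """Hypothetical, simplified QoS with two interaction parameters."""
    reliable: bool      # True: repair lost samples; False: best effort
    deadline_s: float   # maximum period between successive updates


def is_compatible(offered, requested):
    """Return True when the offer is at least as strong as the request."""
    # A reliable offer satisfies any request; best effort satisfies
    # only a best-effort request.
    reliability_ok = offered.reliable or not requested.reliable
    # The publisher must promise updates at least as often as requested.
    deadline_ok = offered.deadline_s <= requested.deadline_s
    return reliability_ok and deadline_ok
```

For example, a reliable publisher offering an update every 0.1 seconds matches a subscriber that requests best-effort data at least once per second, while a best-effort publisher does not match a subscriber that requests reliable delivery. Finding such mismatches when applications connect, rather than during final integration, is the point of writing the contract down.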
7. Thinking performance isn’t important
According to apocryphal legend, when Michelangelo was asked how he managed to carve David out of a block of stone, he replied, “I just cut away everything that doesn’t look like David.” This is a good metaphor for performance. To build a fast system, just avoid sources of delay.
Of course, that’s easier said than done. In my experience, the main cause of performance problems is leaving performance as something to be done “later.” Tuning a slow architecture rarely helps. Companies should strive for performance from day one. A system with fast infrastructure is fundamentally more adaptable than one without.
8. Failing to anticipate the business drivers
No system lives in isolation. In today’s world, systems are vastly more connectable. Many designers realize this, but don’t appreciate the huge pressure to mine that connectivity for profit and optimization in the near future. Gartner predicts ineffective management of operational technology with corporate IT infrastructure will risk serious failures in over 50 percent of asset-dependent enterprises through 2015.
The business drivers of your application in the future will go beyond making your real-time system perform to its current specifications. Tomorrow’s IT connections will use information tapped from your system to do things like predictive maintenance or energy management. These optimizations will drive the success of your application.
Today’s designer can prepare for this trend by looking into integration technologies that support the common IT standards, such as SOAP, REST, JMS and databases. These evolving interfaces will enable clean IT integration.
9. Assuming all implementations of a standard are equivalent
All cars comply with street-legal laws. Still, a Porsche, a Mack truck, and a Prius are hardly equivalent substitutes in most applications. Many system designers make a similar mistake. Implementations of standards differ greatly. The same JMS or DDS API can overlie code of vastly different performance, capability, or quality. Important visible differences include success in similar applications, performance characteristics, tools, configurability, vendor training and services available, support, and extensions.
10. Going it alone
All designers have experience with distributed technology. In fact, it’s safe to assume that 100 percent of architects building distributed systems are not building their first distributed system. This can work for you, especially if it gives you an appreciation for the real challenges of making a system work.
The fact is, hundreds of applications have been written. The sum experience of all their successes and challenges is embodied in the emerging COTS technologies. Services engineers with decades of experience on dozens of systems can guide you to success. Do-it-yourself solutions are still important, but there are many unique challenges in distributed systems, and the variety and capability of commercial solutions have exploded. This technology evolution certainly merits serious consideration.
These common pitfalls are certainly not the sole source of error. However, considering these hard-won lessons makes good business and technical sense. Every system designer should examine and understand his or her data model and data flow. Early considerations must include understanding the distributed state and performance and proactively anticipating business drivers, especially those that impact system integration. Staying on top of new technologies can greatly ease the work.
Fundamentally, most distributed systems require an architectural approach, rather than a more haphazard evolving design. The initial investment pays dividends as the system evolves. In many cases, this “ounce of prevention” approach is the difference between success and painful redesign.