This post is the second in a series on distributed system architectural debt. If you haven’t read the first post, I recommend starting there to learn why you acquire architectural debt and how feature-driven architecture development is a main contributor. There I also cover “abstraction debt”, the first of the four types of architectural debt I cover in this series.
The second type of distributed system architectural debt is “protocol debt”. Without going into detailed definitions, an API exposes an interface into a system to operate on stored data, retrieve data, or trigger actions; in other words, an external system can do stuff remotely. A protocol, however, defines a two-way flow of messaging between two or more entities in a system. When you design a protocol, you typically want to establish a messaging flow that can be reused: while the content carried by the protocol can change, you want the protocol itself to be fairly generic and stable. Good protocol design takes time and experience, and the value of a protocol is often a function of the number of systems using it. We like to do protocol design in a standardisation body: we are able to validate that the problem is worth solving with a protocol, we get many experienced protocol designers working on it, and the result is more likely to be widely used (if it becomes a standard). I have written earlier about standardisation and the challenges of working through standards bodies, and sometimes we must develop our own internal protocols, either in the hope of getting them standardised later or as a replacement for what could have been a standard.
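To make the API/protocol distinction concrete, here is a minimal sketch (all names hypothetical, not from any real system) of what “generic and stable” means in practice: the message envelope and the request/response flow are fixed and reusable, while the payload is free to evolve.

```python
from dataclasses import dataclass, field

# Hypothetical protocol envelope: version, message type, and transaction id
# are the stable, generic part; the payload contents may change freely
# from one use of the protocol to the next.
@dataclass
class Message:
    version: int          # protocol version, agreed once between peers
    msg_type: str         # e.g. "REQUEST" or "RESPONSE"
    transaction_id: str   # correlates a request with its response
    payload: dict = field(default_factory=dict)

def make_response(request: Message, payload: dict) -> Message:
    """One reusable flow rule: every REQUEST gets exactly one RESPONSE
    carrying the same transaction id."""
    return Message(request.version, "RESPONSE", request.transaction_id, payload)

req = Message(1, "REQUEST", "tx-42", {"action": "schedule"})
resp = make_response(req, {"status": "ok"})
assert resp.transaction_id == req.transaction_id
```

The envelope and the correlation rule are what every system implements once; new features only change what goes into `payload`.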
In fact, the way we develop protocols internally is very similar to how it is done in standardisation bodies like the IETF: through the involvement of many engineers, iterations, and consensus (except that we also do joint development and integration testing early on). Thus, systematic use of protocols as a tool for reducing architectural debt has high costs and takes time. Sometimes the problem must be solved through less labour-intensive approaches: a protocol defined by just a few of the engineers who should have been involved, a simpler API, or, perhaps most often, reusing a protocol we already have by overlaying meaning, for example by redefining an existing field to mean something new. The result is ambiguity, more corner cases that can go wrong, and more states or scenarios that are undefined and open to interpretation. The protocol also gets more difficult to understand and implement. This debt is very difficult to get rid of, especially when it comes from abusing an existing protocol, as that is rarely reversible. The problem also compounds: the whole point of a protocol is to codify it in a document with enough clarity that an engineer can implement from the document alone and then do interoperability testing with other implementations to weed out bugs. Any document will have some ambiguity, but as protocol debt adds more, an implementer will increasingly need help from an engineer who has already done an implementation to get it right. Even then, new corner cases will be discovered, and more testing is necessary for each new feature or implementation. The ability to improve the protocol through iterations is reduced. With protocols defined in a standardisation body it is harder to justify overlaying new meaning, as we typically want the new meaning to be an acceptable addendum to the standard. With internally defined protocols, we have no such barrier.
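A hypothetical sketch of what overlaying meaning looks like in code, and why it creates undefined territory. The field name and sentinel value here are invented purely for illustration:

```python
# A 'priority' field originally defined as 0-9 (higher is more urgent)
# is reused so that the value 255 now secretly means "this is a retry".
# Every consumer of the protocol must know about the trick.

LEGACY_MAX_PRIORITY = 9
RETRY_SENTINEL = 255  # overlaid meaning, not in the original definition

def interpret_priority(priority: int) -> str:
    if priority == RETRY_SENTINEL:
        return "retry"                 # the new, overlaid meaning
    if 0 <= priority <= LEGACY_MAX_PRIORITY:
        return f"priority-{priority}"  # the original meaning
    # Undefined territory: values 10-254 are now open to interpretation.
    return "undefined"
```

An implementation unaware of the overlay will treat 255 as out of range, or worse, clamp it to 9 and silently drop the retry semantics. That ambiguity is exactly the kind that only an engineer with a prior implementation can warn you about.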
Another type of protocol debt is insufficient protocol implementations. If you implement a SIP stack, there is a large number of RFCs to implement, and then there are a number of things you need in a working SIP system that are not clearly specified in any standard. If you implement only the MUSTs, and you test only against your own implementation, you get something that is compliant with the RFCs, but that doesn’t mean you have implemented something that is interoperable. That may be fine if you are not really looking for interoperability outside your company and have just one team working on a SIP stack. If you have multiple teams maintaining SIP stacks, then you have exactly the same problem as in a multi-vendor environment: you need interoperability. There are two approaches to getting it: either you do end-to-end testing for all the use cases you need to support, or you do the same type of iterative interoperability testing done in the industry through events like SIPit. We do both, but while my preference is clearly for the latter, there are enough people who feel that generic interoperability testing is not necessary.
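A small illustration of “compliant but not interoperable”: RFC 3261 allows compact header names (`v` for Via, `f` for From, and so on). A stack that only parses the long forms it emits itself will pass self-testing against its own traffic, yet fail against a fully compliant peer that sends compact forms. The parsing code below is a simplified sketch, not a real SIP parser:

```python
# A subset of the compact header forms defined in RFC 3261.
COMPACT_FORMS = {"v": "Via", "f": "From", "t": "To",
                 "i": "Call-ID", "m": "Contact", "l": "Content-Length"}

def parse_header_naive(line: str) -> tuple[str, str]:
    # "Insufficient" parsing: the header name is taken verbatim, so a
    # compact form like "f" never matches a lookup for "From".
    name, _, value = line.partition(":")
    return name.strip(), value.strip()

def parse_header_interop(line: str) -> tuple[str, str]:
    # Normalise compact forms so both spellings map to one canonical name.
    name, value = parse_header_naive(line)
    return COMPACT_FORMS.get(name.lower(), name), value
```

Both parsers handle the stack's own long-form traffic identically; only interoperability testing against a different implementation exposes the gap.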
The third type of architectural debt, “layering debt”, is a multiplication factor on abstraction and protocol debt. Layering debt occurs when the same basic problem is solved by different people, but with different requirements or focus. Any good protocol design is based on layering, and layering is also useful in any large-scale system with multiple software components for splitting the responsibility of solving a certain task. However, there is no such thing as a “correct layering”; depending on what you are trying to achieve, a different type of layering may be the right choice. In video conferencing, scheduling a video meeting is a problem where layering is necessary. You have a calendar on your smartphone, you may have Outlook with an Exchange backend, you have a web site for scheduling meetings, you want a virtual conference room to meet in, you want to send out an invite with meeting details, and so on. The scheduling functionality can be solved through many different types of layering (or no layering at all), where each system in the architecture takes responsibility for one or more of the required functions. When two different layering models are used to solve the problem (in our case we have, for example, webex-type scheduling and video conferencing-type scheduling), it is very hard to bring them together, as the layers are not compatible. Typically, a problem is solved in one layer in one implementation and in a different layer in the other, or the same thing is solved in slightly different ways in the same layer. When we bring the two implementations together without a clean-up in layering, we get cross-layer, cross-implementation dependencies, or what can be referred to as “duct-tape and piece of string architecture”.
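A toy sketch (all names invented) of two layering models for scheduling, and the glue that appears when they are bridged without a clean-up. In one model the scheduling layer owns room allocation; in the other, the conferencing layer allocates rooms lazily at join time:

```python
class Conferencing:
    """Stand-in for a conferencing backend that hands out virtual rooms."""
    def __init__(self):
        self.count = 0
    def allocate_room(self):
        self.count += 1
        return f"room-{self.count}"

class Calendar:
    """Stand-in for a calendar backend."""
    def __init__(self):
        self.events = []
    def create_event(self, when, room):
        self.events.append((when, room))

# Model A ("webex-type"): the scheduling layer owns room allocation.
def schedule_a(cal, conf, when):
    room = conf.allocate_room()        # scheduler reaches down for a room
    cal.create_event(when, room)

# Model B ("video conferencing-type"): rooms are allocated lazily at
# join time, so the scheduling layer never sees them.
def schedule_b(cal, when):
    cal.create_event(when, room=None)

def join_b(conf, event):
    when, room = event
    # Bridging glue: a Model A event arrives with a room pre-allocated,
    # so Model B's join path must special-case it to avoid allocating a
    # second room. That special case is a cross-layer, cross-implementation
    # dependency, not a clean layering.
    return room if room is not None else conf.allocate_room()
```

Neither model is wrong on its own; the debt appears only when both coexist and every code path must know which model produced a given event.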
Finally, the fourth type of architectural debt is “flow debt”. Layering debt typically leads to flow debt, as cross-layer dependencies are difficult to resolve without introducing multi-step API calls and multi-hop protocol flows. When a certain feature or user experience relies on such multi-step flows, each step has a number of error situations that may occur, and the probability of failure or undesired side effects increases with every step added to the process. Flow debt is the mother of tangled, difficult-to-debug bugs that are typically discovered late in the release cycle and directly impact timelines and release dates.
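The compounding effect can be put in rough numbers. Assuming, purely for a back-of-the-envelope estimate, that steps fail independently with the same per-step reliability:

```python
def flow_success_probability(p_step: float, n_steps: int) -> float:
    """If each step succeeds with probability p_step and steps fail
    independently, an n-step flow succeeds with probability p_step**n."""
    return p_step ** n_steps

# Even very reliable steps compound: at 99.9% per step, a 2-step flow
# succeeds about 99.8% of the time, a 10-step flow only about 99.0%.
for n in (2, 5, 10):
    print(n, round(flow_success_probability(0.999, n), 4))
```

Real flows are worse than this model suggests, since steps are rarely independent and each added hop also adds partial-failure states that need their own cleanup paths.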
Being aware of these distributed architecture debts will allow you to manage and evolve your architecture by proactively establishing future architecture targets. These targets must have sound layering, the right abstraction level, good protocol craftsmanship, and carefully designed flows. However, there is a cost: you actually have to allocate engineering time to refactor things that seemingly work fine (at the time), and each of the existing flavours needs to evolve towards the target. Getting buy-in across the organisation to invest in distributed architectural debt reduction can be hard; you typically see an increased focus on “let’s just get things done” as distributed architectural debt grows. This organisational gearing effect is the topic of the next post in this series.