In my previous post, I asked the question “Can you get 4,000+ engineers to work together?” The answer, in the traditional sense of “together”, is an obvious no! However, just as ants build an anthill not by talking to each other, but by each following a set of rules or behaviours specific to its role and by using simple signals, you can get far more than 4,000 engineers to work together by having architectural principles and rules that everybody follows. In the classic waterfall engineering organisation, such rules are codified in documents: technical specifications, solution requirement documents, architectural design documents and so on. Once the overall design is agreed, implementation can begin. In an agile product organisation, many people long for the days when such documents were used. The truth is that architecture by specs never really worked that well unless the problem to solve was really well understood. Preferably you had a problem that had been solved before, so each iteration would improve the architecture and process. If the process of implementing the specification itself gave significant new understanding of the problem and how to solve it, then the specification had to be changed, and big changes had a big impact on timelines, scope, and the implementation approach.
You never get things right the first time, so improvement through iterations is key. If every problem is solved as if it is new, then there is little reuse of understanding from earlier problems you have solved. Thus, the key to improvement and great products and services is to identify patterns or operations that are repeatable, make these explicit, and then improve them through iterations. In a waterfall organisation, the process itself and the specifications become something that is repeatable (when for example you build a car), but in an agile software organisation with a high degree of change, with no elaborate specifications and with a minimum of process, other sets of repeatable patterns or operations must be found.
However, identifying what is repeatable is not enough. You must isolate what is repeatable as well. If not, you are not able to improve, as everything must change in sync in every iteration. You isolate what is repeatable through abstractions. Abstractions can be specified through layering, APIs, protocols, modules, services, and so on. All should have clean and improvable interfaces and implementations. These abstractions are absolutely key to the ability to iterate and improve. The challenge in an agile organisation is that you typically build out the architecture as you iterate, i.e. you don’t start out with a clean, abstracted architectural design. So, as you evolve, you need a lot of refactoring and tight management of the overall system architecture. If you don’t invest enough in managing your architecture, you end up with a tangled mess instead. In a large organisation doing agile, you would be wise to organise your internal software development more like the way the Internet has been developed: through independent teams developing based on a set of standards codified by organisations like W3C and IETF.
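To make the idea of isolating what is repeatable a little more concrete, here is a minimal sketch in Python. Every name in it (`MessageSender`, `FirstIterationSender`, `SecondIterationSender`, `notify_all`) is a hypothetical illustration and not taken from any real product. It assumes the repeatable operation is “send a message to a recipient” and shows how an explicit interface lets one implementation be improved in a later iteration without the callers changing in sync.

```python
from typing import List, Protocol


class MessageSender(Protocol):
    """The abstraction: the one repeatable operation callers depend on."""

    def send(self, recipient: str, text: str) -> str: ...


class FirstIterationSender:
    """First attempt: passes its input through unchanged."""

    def send(self, recipient: str, text: str) -> str:
        return f"to={recipient};body={text}"


class SecondIterationSender:
    """Improved in a later iteration: normalises its input first."""

    def send(self, recipient: str, text: str) -> str:
        return f"to={recipient.strip().lower()};body={text.strip()}"


def notify_all(sender: MessageSender, recipients: List[str], text: str) -> List[str]:
    """Caller code: written once against the interface, never against an implementation."""
    return [sender.send(r, text) for r in recipients]


if __name__ == "__main__":
    print(notify_all(FirstIterationSender(), ["alice"], "hi"))
    print(notify_all(SecondIterationSender(), [" Alice "], " hi "))
```

The caller `notify_all` is identical in both runs; only the implementation behind the interface was iterated on.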
Going back to the product level: technical debt and design debt within individual products and services are increasingly well understood and managed. In agile, debt is systematically reduced as part of the development process. However, it is still difficult to get product management and management to understand this aspect of software development. This is particularly true for the product marketing community (as opposed to product managers) and MBA managers, as they have limited engineering background. Engineers know that not all your resources can be spent on new features, but I often see them hide refactoring and technical debt reduction in conservative project estimates instead of showing management the real costs of evolving and maintaining a code base for short-term and long-term health.
My problem, working on the architecture level across a 4,000+ engineering organisation, is that we also add debt outside the products and services we build: in how we make them work together. This is what I call distributed system architectural debt, or just architectural debt for short. Understanding, planning, and executing on managing this kind of debt is extremely difficult, and even more so when there is limited understanding of technical and design debt within each product organisation. It tends to add up…
We (Collaboration Technology Group in Cisco) have 4,000+ engineers who use a mix of waterfall and various flavours of agile (and we do hardware, software, and services). We release our products at different cadences and thus often not in sync, and in order to make sure we move forward, each team needs to be able to plan, execute, and control its own projects. How the products interact evolves over time, we add new features and functionality all the time, and there is no upfront understanding of a “right way” for products to interact and work with each other. This freedom is necessary to foster innovation and get new features to market quickly. However, our customers expect everything to work together in all the different ways they can work together (and sometimes other ways as well), an impossible task as we cannot test every permutation.
So, in order to get feature velocity, we need to allow the same technical problem to be solved in slightly different ways in different parts of the organisation. Absolute coordination requirements would absolutely kill any progress. However, once we have three, four, or maybe even five different ways of doing the same thing, the number of combinations that should work grows, and where we should have had one test for a generic functionality (common to all five ways), we have, let’s say, 5 x 3 = 15 different tests to run if each of the five approaches to the same thing interacts with three other things. So, if we don’t want to slow down feature velocity by requiring upfront coordination, we need a “best approach wins” mechanism and a migration towards a single, shared method. This works only in theory, because each of these teams tried to solve a slightly different problem, had limited time to do proper generalisation and design, and ended up with a design solution that does not really fit the other teams’ needs. Voilà, we are stuck in “no best approach exists”, with five different development trajectories and product management asking for new features on top of a design that was shaky in the beginning. I call this feature-driven architecture. The consequence is of course that the feature velocity we wanted in the first place is now slowed down, not by coordination, but by testing and because it gets increasingly difficult to make everything tick.
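The 5 x 3 arithmetic above can be sketched as a small test matrix. This is only an illustration in Python under simple assumptions: the team and component names are hypothetical, and every approach is assumed to need exactly one test against every component it interacts with.

```python
from itertools import product


def interaction_tests(approaches, peers):
    """Return one (approach, peer) test case per combination that must work."""
    return list(product(approaches, peers))


# Hypothetical names: five team-specific solutions to the same problem...
approaches = ["team_a", "team_b", "team_c", "team_d", "team_e"]
# ...each of which interacts with three other components.
peers = ["calendar", "directory", "endpoint"]

divergent = interaction_tests(approaches, peers)
converged = interaction_tests(["shared_method"], peers)

print(len(divergent))  # 15
print(len(converged))  # 3
```

In this simple counting, a single shared method still needs one test per component it talks to. The point is that the count stops multiplying with every additional way of doing the same thing.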
A typical large organisation response to the problem of multiple designs and multiple products with overlap is to synchronise and agree on a shared approach through a warped, time-consuming, regression-to-the-mean consensus process. In the absence of clear priorities, everything is equally important and everybody has an equal position at the table. The relative priority between something important to an existing revenue stream and something important to a new product or service is typically not defined. There is also a lack of technical people who understand more than their own problem area, and there is a lack of strong, technical people with the authority in the organisation to actually make a technical decision that has important business impacts. And as architectural decisions often have clear business impacts, product management needs to be consulted as well. There is a lack of strong, technical product managers, so they focus on features for their product or service and have little to contribute on the fundamental architectural choices that have big business impacts later on, potentially across all the products and services.
Thus, 10, 20, 30 or even more people are brought together in large meetings to make sure all stakeholders are involved. Typically, engineers/architects work in groups, product managers work in groups, and some of them come together in either face-to-face workshops or large sync meetings. Eventually (if important enough), they prepare a consensus-driven read-out to directors and VPs. Since consensus was the goal, no real options have been identified and no pro et contra evaluations have been done, and the result is a mixed bag of something that nobody is really satisfied with. The directors and VPs grow impatient with the lack of progress and finally make a decision far outside their competence area and with limited insights into the impacts of their decision. This will haunt them later, and they will change the decision later down the road.
Am I exaggerating? Well, a bit, but the truth is that any clean, easy-to-execute decision will have somebody on the losing side, somebody who didn’t get their product or their design chosen as the path forward. Making that decision requires leadership and guts, and it is often too easy to avoid those tough calls. Knowing when consensus is called for and when a trade-off must be made is one of the most important technical leadership abilities in a large organisation. However, the traditional approach with “one, strong technical chief” who will make the calls does not work, as the number of evaluations to be done is too large for a single person to address. The only way to scale is to set clear business objectives and corresponding architectural plans to allow experts locally in each team to make the right decisions.
But if you don’t get clear plans and trade-offs, what are you left with? Well, we will have to design for the consensus we achieved, both on the business objectives and the architecture. This is always too big to be addressed in the next development cycle. Time is short, so let’s focus on what we need out the door in the next 6-12 months! Okay! Back to feature-driven architecture!
You may say: ask product management, they should know where you are going! Well, as mentioned above, a typical product manager (not the one we wish we had) will go: “in phase 1, we need features a, b, c, and d, but we need to design for phase 2, where we will need …” and then they add everything they think they might need. Unless there is very clear guidance from upper management on what NOT to do, product management will hedge their asks and ask for pretty much everything they can think of. The result is that the architects have to prioritise which architectural problems to solve with very limited guidance. But we cannot do everything, so again, how do we prioritise? The answer is very often that we choose the stuff that we need to get done to get features in phase 1 out the door and then we do a little bit on some of the stuff we think may be important in phase 2.
Allowing features, rather than the generic architectural problem to be solved, to drive design introduces “distributed system architectural debt”. The debt that comes most directly from feature-driven architecture is “abstraction debt”. Getting the right level of abstraction is very difficult when you design APIs. Too little abstraction and your API only allows “set PIN code for this type of conference for this type of user when it is scheduled”. Too much abstraction and your API is a limited and meaningless interface: “set variable Y with value X”. Both are responses to feature-driven architecture, as the wider context of the problem to solve is not known, only the feature. The problem with setting the right level of abstraction is that you really need to understand the generic problem you are trying to solve. When you are working agile and you are feature-driven, you may not know how your API will be used later. Using a particular methodology, like REST, forces a certain design (e.g. identification of resources and exposure of data for each). This enforcement helps because you have implicitly done the systematic groundwork for a good API. However, REST (or other approaches) is not the best choice for all APIs (in particular for hardware products), and you rarely have the luxury of ignoring all the existing APIs already implemented in your portfolio (unless you are a start-up or you are allowed to start from scratch).
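The three levels of abstraction described above can be contrasted in a short Python sketch. All names here (`Conference`, `ConferenceStore` and its methods) are hypothetical and invented for illustration; they do not describe any existing API. The resource-oriented variant is just one possible middle ground, assuming that “conference” has been identified as a resource with a small set of exposed fields.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Conference:
    conference_id: str
    kind: str                  # e.g. "scheduled" or "ad_hoc" (assumed values)
    pin: Optional[str] = None


class ConferenceStore:
    def __init__(self) -> None:
        self._conferences: Dict[str, Conference] = {}
        self._variables: Dict[str, Any] = {}

    def add(self, conference: Conference) -> None:
        self._conferences[conference.conference_id] = conference

    # Too little abstraction: one feature baked into the interface.
    def set_pin_for_scheduled_conference_for_guest_user(self, conference_id: str, pin: str) -> None:
        conference = self._conferences[conference_id]
        if conference.kind != "scheduled":
            raise ValueError("only scheduled conferences are supported")
        conference.pin = pin

    # Too much abstraction: accepts anything, so the interface says nothing.
    def set_variable(self, name: str, value: Any) -> None:
        self._variables[name] = value

    # Resource-oriented middle ground: a named resource with known fields.
    def update_conference(self, conference_id: str, **changes: Any) -> Conference:
        conference = self._conferences[conference_id]
        unknown = set(changes) - {"kind", "pin"}
        if unknown:
            raise KeyError(f"unknown conference fields: {sorted(unknown)}")
        for name, value in changes.items():
            setattr(conference, name, value)
        return conference


store = ConferenceStore()
store.add(Conference(conference_id="c1", kind="ad_hoc"))
updated = store.update_conference("c1", pin="1234")
print(updated.pin)  # 1234
```

The first method cannot serve an ad hoc conference at all, the second tells a caller nothing about what is valid, and the third can absorb a new field without a new method while still rejecting fields it does not know.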
In my next post, I will dig deeper into the consequences of feature-driven architecture and cover the three other types of distributed architectural debt I have identified: protocol debt, layering debt, and flow debt.