I still have one or two posts I want to do on a multistream-enabled, scalable video architecture, but for some time I have also wanted to write a series on the technical scale challenges in a large, distributed engineering organisation. The fundamental problem is that the classic command-and-control approach with centralised planning, coordination, and program management breaks down when you get to a certain size in terms of key stakeholders, number of projects and engineers, time zones, and technical complexity. Decisions must be taken locally in each project, as that’s where the experts are. However, when they are, the teams base their decisions on a limited understanding of what is going on elsewhere, and they optimise for what they know: their own product or sub-system of the bigger deliverable. You need to replace the classic top-down, end-to-end managed mega-projects with something that can scale, but without losing oversight. You need processes for aligning what to deliver (product management), how to deliver (engineering resources and expertise), how to enable delivery (design and architecture), and how to get it done (project management). My daily work is in design and architecture, but the better the other three work, the easier my job gets.
We have 4,000+ engineers distributed across the world in four business units (BUs) with individual P&Ls (profit and loss, i.e. full business result responsibility). Each BU is headed by a VP/GM, with more than 150 directors and vice presidents reporting to them. We have a multi-billion dollar business and a triple-digit number of products, counting legacy and new. Meet Collaboration Technology Group (CTG), Cisco Systems: the engineering unit making voice, video, messaging, presence, conferencing, contact center, and a number of applications, services, and products to support companies’ need for communication, coordination, and collaboration. And by the way, these are delivered and sold for on-premise installations, as cloud services, and for service providers to offer services to their customers. And, as Cisco has limited sales directly to customers, we have a four-digit number of channel partners that need to understand our products and services and be able to make money when they sell them. And finally, there is the five-digit Cisco sales force, who need to understand and be able to sell our products. All of these serve hundreds of thousands of customers. Conclusion: if you are a Cisco CTG engineer, you participate in one of the most advanced and complex product ecosystems that exist. How do you make products that excel? And, my personal, daily challenge: how do you evolve and manage the underlying architecture that is needed for these products to work together and deliver the expected user experience and customer value?
Getting all these products to work together when they are all evolving at the same time and at a high innovation pace is staggeringly complex! First, take the problems for just one single product: you want to deliver short-term customer value, you want to make sure your code is sound and ready for adding future value and features, and you invest by adding code today that is the foundation for customer value planned in future releases. Experience tells you that you need a software architecture with modules that you evolve according to a plan. If not, you end up architecting and designing feature by feature. In that case, your code gets messy, and you keep adding new features through hacks and then hacks on hacks. Your technical debt increases and your velocity goes down.
That is the situation with only one product, but what if your product also interacts with a whole slew of other products in many different configurations and solutions? And what if all these configurations and solutions have slightly different requirements for how the same functionality is supposed to work in each? If you are a good architect, you try to generalise and plan your software architecture to cater for variance, and you try to use the same APIs, protocols, and interfaces in all the configurations and solutions. But in order to do this, you need insight into where you are going, so you can plan and see where there are similarities and where you need dedicated code for each context you operate in. But what if you have 100+ products and they all do agile in various forms, so you have little transparency into where they are going, and the innovation pace is high? And add to that a group of executives and managers who change course often because the industry is changing rapidly? Even in an organisation with perfect strategic planning capabilities, predictability is low, so you end up designing, planning, and then re-designing and re-planning repeatedly.
Well, you get many results, but one of the most tangled and complex engineering challenges is what I call distributed system architectural debt. This is a kind of debt that one team cannot fully understand or factor in. Each permutation and interaction between products needs to be tested separately to make sure it works, and each is typically designed, implemented, and tested by separate groups with limited interaction with the groups developing other products. The test matrices balloon, and introducing a new feature creates ripple effects, not only for the test scenario where the feature is used, but across the board, because the code that was touched implements an interface to a number of other products that will thus be impacted. When designing a new feature that needs to work in a certain way across several products, the number of scenarios to evaluate becomes too large for a single architect to understand and plan for, even if transparency were high. The result is that more and more critical solution bugs and design flaws show up late in the development process, after code complete.
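To get a feel for how quickly such a test matrix grows, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every pair of products can interact and that each product ships in a fixed number of supported versions. The function name and the numbers are hypothetical, not actual CTG figures.

```python
from math import comb


def interaction_test_cases(products: int, versions_per_product: int = 1) -> int:
    """Count pairwise product interactions to test.

    Illustrative assumptions: every pair of products can interact, and
    every combination of supported versions in a pair must be covered.
    """
    pairs = comb(products, 2)  # unordered product pairs: n * (n - 1) / 2
    return pairs * versions_per_product ** 2


if __name__ == "__main__":
    for n in (5, 20, 100):
        print(n, interaction_test_cases(n), interaction_test_cases(n, 3))
```

Even under these simplifying assumptions, 100 products give 4,950 pairs to consider, and 44,550 if each product has three supported versions. That is before counting interactions between three or more products, or the different deployment models (on-premise, cloud, service provider).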
In subsequent posts in this series, I will address this fundamental problem in architecting large interdependent systems from several different angles:
- What is distributed system architectural debt and how can it be broken down into pieces that can be managed?
- What are the priorities and forces guiding architectural evolution in a large engineering organisation, and how do they interact, often in conflict with each other?
- Why are sound abstractions and classic architectural layering an obvious, though highly difficult, approach to reducing architectural debt and building a sound architecture?
- What are the organisational and people challenges when dealing with architectural debt?
- How do you move from managing architectural debt to using architecture as a vehicle for innovation?
If you have perspectives and suggestions for what to address in this series, I’d love to hear from you!