In my previous post, I started a series on the next-generation multistream video architecture. In this post I will explain how elasticity completely changes the way you should view audio and video flows in your network. But first, what does a real-time video communication solution consist of?

1. The endpoints, or clients, in front of each of the participants. They used to be pretty dumb, but in the next-generation multistream video architecture they need to be at least well-behaved participants, and they should be fairly smart if you want really great meeting experiences, as the next-generation architecture allows more tailoring and dynamic control of the experience.
2. A multiparty bridge, aka an MCU, if you want to do meetings with more than two participants.
3. A call control infrastructure that allows you to connect everybody, along with people in home offices and people outside your company: partners, customers, etc.
4. A network that will support an excellent experience (without sacrificing everything else in your network), not only for a single meeting, but for all the meetings that everybody would like to do (and take into consideration that if you do it right, people will do more meetings on video, so you need to grow the capacity).
5. Management of all of the above, making sure everything is configured correctly and that you can detect and solve problems.
To explain elasticity, let’s first focus on the endpoints/clients and the multiparty bridge. If you make a regular phone call with audio only, there are only a few things you can do to change the quality and how it is transported in the network (assuming your microphone/audio source quality is constant). The most important is to choose how it is encoded (i.e. the codec). This will capture the sound at a certain quality, compress it, and give you a required bandwidth. If you don’t have that bandwidth available, packets will arrive late or be dropped, and you get a bad experience. There is not much elasticity there, so if you have a 1 Mbit/s pipe to an office and you need 64 kbit/s per call in each direction, you’d better make sure you have that bandwidth, plus overhead, available for the duration of the call. Hence, you will configure Quality of Service (QoS) in your network, and you may want to introduce CAC (Call Admission Control) in your call control to reserve bandwidth and reject calls when bandwidth is not available.
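The inelastic audio case above lends itself to very simple admission logic. Here is a minimal sketch of what CAC amounts to on a fixed-size link; the capacity, per-call bandwidth, and overhead figures are illustrative assumptions, not values from any real deployment:

```python
# Hypothetical Call Admission Control (CAC) check on a fixed-size link.
# All the numbers below are illustrative assumptions.

LINK_CAPACITY_KBITS = 1000   # the 1 Mbit/s office pipe from the text
CALL_BANDWIDTH_KBITS = 64    # fixed-rate audio codec, one direction
OVERHEAD_FACTOR = 1.2        # assumed IP/UDP/RTP header overhead

def admit_call(active_calls: int) -> bool:
    """Admit a new call only if the pipe can carry it, overhead included."""
    needed = (active_calls + 1) * CALL_BANDWIDTH_KBITS * OVERHEAD_FACTOR
    return needed <= LINK_CAPACITY_KBITS
```

With roughly 77 kbit/s per call including overhead, this 1 Mbit/s pipe fits 13 concurrent calls and the 14th is rejected. Because the audio stream has no elasticity, a hard yes/no decision like this is the best the system can do.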
Enter video, and you don’t need 64 kbit/s, you need 768 kbit/s, or 1.2, 3, or even 4 Mbit/s, depending on a number of factors. It is tempting to follow the same kind of thinking and use Quality of Service and CAC, now adjusted for the additional bandwidth requirements (and your 1 Mbit/s pipe upgraded…). Well, here’s where elasticity makes that approach similar to trying to get a bunch of three-year-old kids in a kindergarten to do what you want when they can only hear you over a speaker phone: they don’t behave the way you want them to, or the way you think they do. An individual’s perception of the quality of a live video stream is based on a large number of factors that can be optimised, tweaked, and compensated for. As a starting point, the video stream from the camera must be encoded at a chosen resolution and frame rate, but how do you choose which ones? And if you start dynamically changing resolution and frame rate during a call to optimise the experience, the bandwidth requirements will change dramatically, and yes, dynamically. You can also choose a target bandwidth and reduce the encoding quality (roughly perceived as “fuzziness”) to hit that bandwidth. Of course, this too can be done dynamically. This is the fundamental starting point for elasticity; now you just need to determine when and how to change dynamically. In a point-to-point call, the two peers (endpoints) negotiate the codec and lots of parameters in SDP when you set up the call. However, there are ways to change this within a call as well, and RTCP (the control stream associated with the media stream) provides feedback mechanisms that allow each peer to send back information about lost media packets and other things that can be used to adjust the media stream from the other side. Here is where the fourth component in the solution, the network, becomes critical. Each peer in a session will have a desire for a certain resolution, frame rate, bandwidth, codec, and so on.
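To make the RTCP feedback loop concrete, here is a minimal sketch of loss-driven bandwidth adaptation: the sender nudges its encoder's target bitrate up or down based on the loss fraction the far end reports. The thresholds and step sizes are my own illustrative assumptions, not taken from any standard or product:

```python
# Sketch of loss-driven bitrate adaptation in the spirit of the RTCP
# feedback loop described above. Thresholds and steps are assumptions.

def adapt_bitrate(current_kbits: float, loss_fraction: float) -> float:
    """Adjust the encoder's target bitrate from reported packet loss.

    loss_fraction is the fraction of media packets the far end reported
    lost since the last RTCP receiver report (0.0 .. 1.0).
    """
    if loss_fraction > 0.10:
        # Heavy loss: back off multiplicatively to relieve the network.
        return current_kbits * (1.0 - 0.5 * loss_fraction)
    if loss_fraction < 0.02:
        # Clean channel: probe upward additively for spare capacity.
        return current_kbits + 50.0
    # Moderate loss: hold the current rate steady.
    return current_kbits
```

Note the asymmetry: back off fast, probe up slowly. That shape is what makes many elastic streams able to share a link without starving each other, and it is also exactly why static per-call CAC reservations stop matching what the streams actually consume.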
But the network transports not only this session, but all ongoing sessions, and the network is shared with other applications as well. There may be limited bandwidth on one hop, and packets get dropped. Thus, bandwidth adaptation, adapting and choosing the “right” bandwidth for the underlying network conditions, depends on a large number of factors beyond what the guy at the other end can display on his monitor or smartphone. These factors include the current, though fluctuating, state of the underlying network, which is typically not available to the endpoints/clients and the bridge.
Technically, this is a tremendous architectural challenge, as we don’t want the application layer (here, the media setup and control) too tightly coupled with the network layer; if we do, the underlying network and the application (audio/video solution) need to be upgraded together and probably have to be delivered by the same vendor. However, we do want feedback between the network and application layers to make sure that bandwidth is adapted to the real-time state of the network. This feedback should be based on network indications that, when available, can be used to improve the adaptation. A related challenge is how you allocate resources in your network, not only to video calls, but between all your applications, where video calls are just one of many important applications. This is typically equally or even more important, because real-time audio/video is not the only business-critical application.
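One way to picture this loose coupling is a narrow, vendor-neutral interface: the media engine consumes abstract congestion indications rather than talking to any network-specific API, and simply falls back to end-to-end RTCP feedback when no indications arrive. The `Indication` type and the ECN-marking example below are my own illustration of the idea, not a real API:

```python
# Sketch of decoupling the media layer from the network layer: the
# media engine reacts to abstract, vendor-neutral congestion hints.
# The Indication type and reaction factor are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Indication:
    """A vendor-neutral congestion hint from the network layer."""
    kind: str        # e.g. "ecn_marked", "loss", "delay"
    severity: float  # 0.0 (none) .. 1.0 (severe)

class MediaEngine:
    def __init__(self, target_kbits: float):
        self.target_kbits = target_kbits

    def on_indication(self, ind: Indication) -> None:
        # React to whatever hints are available; if none arrive, the
        # engine simply relies on end-to-end RTCP feedback instead.
        if ind.kind == "ecn_marked" and ind.severity > 0.0:
            self.target_kbits *= (1.0 - 0.3 * ind.severity)
```

The point of the narrow interface is that the network side can be upgraded, or swapped for another vendor's gear, without touching the media engine, while still letting richer network state improve the adaptation when it is available.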
Ok, I have just explained the basic factors of elasticity for a single audio/video stream; in later posts you will see that multistream (each participant sends more than one camera source) and simulcast (each participant sends multiple, layered qualities in a bundle) add to the elastic nature of the call. Intelligent bandwidth adaptation will be the cornerstone of any deployment of communication solutions where video is used to bring people face to face (i.e. not a plaything in the corner), and it is the first pillar of enabling elasticity in the scalable next-generation video architecture. The second pillar is media resilience, i.e. how to compensate for lost or slow media packets, as this will happen occasionally, even with perfect bandwidth adaptation. The third pillar is Quality of Service: implementing your policies across all your applications, not only video calls, so that you leverage all available bandwidth without hard per-application bandwidth reservations; applications co-exist and dynamically adapt under normal conditions, and degradation is dynamic and “fair” based on your policies. All this depends on the elastic nature of video media and advanced mechanisms for bandwidth adaptation, and if any of these pillars is missing, you will struggle to deploy scalable real-time audio/video collaboration.
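As a small preview of the extra elasticity simulcast brings: when a sender encodes several layered qualities, a bridge (or receiver) can simply pick the best layer that fits the bandwidth currently available on that leg, instead of renegotiating the stream. A toy sketch, where the layer names and bitrates are illustrative assumptions:

```python
# Toy illustration of simulcast layer selection. The layers and their
# bitrates below are assumed values for the sake of the example.

SIMULCAST_LAYERS_KBITS = [
    ("180p", 150),
    ("360p", 500),
    ("720p", 1500),
]

def pick_layer(available_kbits: float) -> str:
    """Choose the highest-quality layer that fits the bandwidth."""
    best = "audio-only"  # fall back below the lowest video layer
    for name, kbits in SIMULCAST_LAYERS_KBITS:
        if kbits <= available_kbits:
            best = name
    return best
```

Each receiver can make this choice independently, which is precisely what makes a multiparty call elastic end to end rather than pinned to the weakest participant's link.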
Finally, as I have spent quite a few cycles on standards and interoperability in this blog, a few comments on the standardisation efforts. A fully interoperable multistream video architecture is a goal worth pursuing to realise the full potential of any-to-any, high-fidelity collaboration. However, the reality is that on each architectural layer (from codec, media, and stream negotiation at the bottom, to capability exchange and implementation of great user experiences controlled through endpoint/client-to-infrastructure communication), we have a large number of standardisation efforts ongoing, and some yet to be started. For elasticity specifically, there are some standardised mechanisms for adaptation and resilience, but they are not sufficient to realise a really great experience, so vendors are adding their own proprietary mechanisms. This works well in a homogeneous environment, but across vendors it will be more difficult to scale a great experience.
In an upcoming post, I will dig into the hype around SVC, the difference between multistream and simulcast, and how they relate to elasticity.