In my last post on the hybrid bridge, I promised to cover its inner workings in more detail. This post is part of my series on the next-generation multi stream video architecture. You may get more out of it if you read my last post first, but I have tried to write it in such a way that you don’t have to!
First of all, you need to understand the basics of what happens inside a multiparty bridge (the product or service that creates the live meeting experience in audio or video conferences). The simplest version, the audio bridge, receives one audio stream from each participant, encoded in some format (you all know MP3, though it is not the most likely choice for live meetings). The bridge then mixes some of the participants’ streams together: either only the speaker’s stream if you have a broadcast, or typically the 4-5 streams with the loudest sound. The new, mixed audio stream is then sent back to all the participants. But how does the bridge know who the loudest speakers are? There are two ways: either each participant’s audio stream carries information about the audio level, or the bridge has to decode all the audio streams and calculate the audio level for each participant. Most audio devices participating in multiparty conferences add this information to the RTP headers of each audio packet (see RFC 6464 if you are interested in the details). This allows the bridge to select the 4-5 loudest audio streams, decode them, mix them, and encode a single audio stream that is sent back to all participants. This also scales well to large conferences; the only added complexity is that each of the loudest speakers must get a tailored stream where their own audio has been removed. You don’t really want to listen to yourself!
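To make this concrete, here is a minimal sketch of loudest-speaker selection and the tailored “mix-minus” streams, assuming each packet carries an RFC 6464 audio level (one byte whose lower 7 bits encode the level in -dBov, where 0 is loudest and 127 is silence). The types and function names are mine, purely for illustration, not a real RTP stack API:

```python
# Minimal sketch of loudest-speaker selection in an audio bridge.
# Assumes each RTP packet carries an RFC 6464 audio level: one byte
# whose lower 7 bits give the level in -dBov (0 = loudest, 127 = silence).
# All names here are illustrative, not a real API.

from dataclasses import dataclass

@dataclass
class Participant:
    id: str
    level_dbov: int  # last reported level: 0 (loud) .. 127 (silent)

def parse_rfc6464_level(ext_byte: int) -> int:
    """Extract the 7-bit audio level; the top bit is a voice-activity flag."""
    return ext_byte & 0x7F

def loudest(participants, n=4):
    """Pick the n loudest participants (lowest -dBov value)."""
    return sorted(participants, key=lambda p: p.level_dbov)[:n]

def build_mixes(participants, n=4):
    """One shared mix for the quiet listeners, plus a tailored
    'mix-minus' per active speaker so nobody hears themselves."""
    speakers = loudest(participants, n)
    shared_mix = [p.id for p in speakers]
    mix_minus = {p.id: [s.id for s in speakers if s.id != p.id]
                 for p in speakers}
    return shared_mix, mix_minus
```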
<Sidenote>Encryption is an added complexity. Audio levels live in RTP header extensions, and with SRTP (Secure RTP) those headers can be either authenticated only or encrypted as well. In both cases, the bridge needs a key to create the new headers for each participant (typically one key per participant), and if the headers are encrypted, a key is also needed to decrypt the RTP packet before audio levels can be read. This has consequences for scale (crypto processing is typically CPU heavy), but more importantly for security: the bridge must hold the key(s) to decrypt the loudest speakers and then encrypt the mixed audio. Unless there is one shared key for the whole conference, the bridge needs to encrypt all the audio streams separately, and this does not scale well, especially for video streams, which carry far more data than audio streams. The key access also means that if a hacker gets into a bridge, all conferences may be breached. This points to another interesting possibility with a multi stream architecture: each audio stream could potentially be switched through without the bridge needing the encryption key (as long as the audio levels are available without a key).</Sidenote>
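As a small illustration of that last point, here is a hedged sketch (a hypothetical helper, not a real SRTP API) of when a bridge can rank speakers without holding any media key:

```python
# Sketch of the trade-off: if the RFC 6464 level sits in a header
# extension that is authenticated but not encrypted, a switching
# bridge can rank speakers without holding any media key.
# Hypothetical helper, not a real SRTP API.

from typing import Optional

def readable_level(ext_byte: int, header_encrypted: bool,
                   bridge_has_key: bool) -> Optional[int]:
    """Return the audio level if the bridge can access it."""
    if not header_encrypted:
        return ext_byte & 0x7F   # level in the clear: no key needed
    if bridge_has_key:
        # decrypt the header extension first (omitted), then read it
        return ext_byte & 0x7F
    return None  # encrypted headers and no key: cannot rank speakers
```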
Let’s add video to the conference. Basically, the same thing happens: the audio is used to determine who should be most visible (or preferred) and thus which video streams should be mixed together (or switched through as the loudest participants in a switched conference). Again, you don’t want to see yourself, so when video streams are mixed together to show multiple participants in one view, all the people keeping quiet can get the same mixed stream, while the loudest speakers need one each so they don’t see themselves. One interesting consequence is that if we restrict the participants’ freedom and only allow them one view (the one everybody shares), the bridge can send the same stream to 1 or 100 participants without doing additional transcoding. This technique is referred to as shared encode, and this and other shared encode methods can be used to optimise transcoding (side note: this optimisation is easier with newer hardware or general purpose processors). If, on the other hand, you allow all participants to choose who they want to see and how, you may in the worst case need to encode a separate stream for every single participant (being engineers, we have referred to this feature as the “prettiest woman in the conference” feature, as people will have different tastes for who they want to see, but not that different…).
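A back-of-the-envelope sketch of how layout freedom drives encode count; the split and the numbers are illustrative, not from a real bridge:

```python
# Rough model of encodes needed for one video conference "tick".
# Illustrative only; real bridges juggle many more variables.

def encodes_needed(speakers: int, custom_viewers: int) -> int:
    """speakers: loudest participants, each needing a view without
    themselves; custom_viewers: quiet participants who insisted on
    a personal layout. Everyone else reuses one shared encode."""
    shared = 1
    return shared + speakers + custom_viewers

# 100 participants, 4 loudest speakers, nobody customises:
print(encodes_needed(speakers=4, custom_viewers=0))    # 5 encodes
# worst case: every quiet participant wants a personal layout
print(encodes_needed(speakers=4, custom_viewers=96))   # ~1 per participant
```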
Now we have a fundamental choice to make: either we switch streams with no transcoding, but require the participants to be homogeneous in what they send and receive (I have covered the reasons in earlier posts in this series), or we allow variation in the capabilities of the endpoints joining the conference and thus do transcoding as well. If we only switch, we still need to allow “outsiders” in (aka interop), but we typically do that through a gateway that removes the variance on the switched conference side. The effect is that it is easier to optimise the experience for “native” participants, while the “outsider” participants get an inferior experience.
The hybrid bridge supports both switching and transcoding (and of course, I assume multi stream here). For each participant, the bridge can allocate either transcoding resources or switching resources. Or the bridge can treat a participant as switched for the purpose of sending streams back, but transcode the incoming video from that participant, maybe because it is an H.265-encoded stream or a single stream over a mobile network with bad connectivity. The bridge can then make a full-size, optimised stream when the participant is the loudest speaker, but also create a thumbnail video stream for participants that want to show them in a PIP (Picture in Picture), without requiring extra capabilities on the mobile device (or using extra bandwidth to transfer both resolutions). There are many possibilities and corner cases that can be handled, but the important consequence is that all participants get an optimised user experience, including the classic, non-multi stream participant that may opt for a 3×3 tile thumbnail presentation of the participants.
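A minimal sketch of that per-participant decision, with hypothetical inputs and rules; real logic would weigh codecs, bandwidth, and layout needs far more carefully:

```python
# Sketch of per-participant mode selection in a hybrid bridge.
# The enum and decision rules are hypothetical, to illustrate that
# the inbound and outbound paths can be decided independently.

from enum import Enum, auto

class Mode(Enum):
    SWITCHED = auto()    # forwarded without transcoding
    TRANSCODED = auto()  # decoded and re-encoded by the bridge

def choose_modes(supports_multi_stream: bool,
                 sends_common_codec: bool,
                 network_is_poor: bool):
    """Return (inbound_mode, outbound_mode), seen from the bridge.

    Note the mix-and-match: a participant can receive switched
    streams while their own incoming video is transcoded, e.g. an
    H.265 sender or a single shaky mobile stream.
    """
    outbound = (Mode.SWITCHED if supports_multi_stream
                else Mode.TRANSCODED)
    inbound = (Mode.SWITCHED
               if sends_common_codec and not network_is_poor
               else Mode.TRANSCODED)
    return inbound, outbound

# A multi stream mobile client on a bad network:
print(choose_modes(True, True, True))  # (TRANSCODED, SWITCHED)
```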
And here we are at the most important property of the hybrid bridge: it does “interop” as an integrated part of the conference architecture, not through a gateway. The hybrid architecture is a consequence of the belief that in real life there will be far fewer “perfect” conferences than mixed, non-perfect ones. A “perfect” conference here is one with no non-multi stream/switched-capable participants, no mix of large-screen room systems and mobile participants, a single codec shared by all participants, and no participants with large variance in underlying network quality. The contrary belief is that if you have a service or eco-system with mostly fast-upgradable (on-the-fly) mobile and/or PC-based participants, you can push upgrades fast and make sure all participants have the same capabilities. You may also assume that you have little variance in user experience needs, as you only need to adapt to mobile, tablet, and PC screens. You can then do interop through a gateway function, because you don’t really want to optimise the experience for “non-compliant” participants; they should just switch to a native experience or be satisfied with the inferior one.
Both of these beliefs are valid. However, they meet at one very important point: it is difficult to predict who wants to talk to whom, where, and how. The “full interop is needed” belief assumes that everybody wants to talk to everybody using any device/client. The latter hinges on the belief that it is possible to convince people who have access to devices/clients outside your eco-system to download your client and “convert” (possibly just for this meeting). WebRTC makes this more plausible, as the threshold to temporarily “convert” is lower. However, the jury is still out on that one, and the industry is still executing on both beliefs. (Meanwhile, my rantings from 2010 on whether video conferencing is a PBX-type or mobile-type of industry may amuse you…)
An interesting effect of choosing a hybrid architecture is that it is close to impossible to predict how many resources a participant or a conference will occupy on the bridge. The cost of adding the 101st transcoded participant to a conference where 97 people share the same encode is very low. In fact, if you have a non-multi stream, fully transcoded (classic) conference going, a switched participant will actually require more incremental resources than another transcoded participant. The reason is that the new switched participant needs, let’s say, the four loudest speakers’ videos in 720p30, and we must assume that some of these streams have to be transcoded. The new transcoded participant can just start receiving the shared encode view (in both cases the participant’s own incoming stream needs to be decoded, scaled, and re-encoded).
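To illustrate the asymmetry, here is a toy cost model with made-up unit costs; only the relative shape of the numbers matters:

```python
# Toy model of incremental bridge cost, echoing the example above.
# Unit costs are made up; only the relative shape matters.

DECODE, SCALE, ENCODE = 1.0, 0.3, 2.0  # arbitrary cost units

def cost_add_transcoded(shares_layout: bool = True) -> float:
    """New transcoded viewer: decode and scale their incoming stream
    into the mix; no extra encode if they accept the shared layout."""
    return DECODE + SCALE + (0.0 if shares_layout else ENCODE)

def cost_add_switched(loudest: int = 4) -> float:
    """New switched viewer in an otherwise transcoded conference:
    each of the loudest speakers may now need an extra tailored
    720p30 encode produced from their transcoded stream."""
    return DECODE + SCALE + loudest * ENCODE

print(cost_add_transcoded())  # 1.3 units
print(cost_add_switched())    # 9.3 units
```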
There is plenty of room for optimisation, but this makes it difficult to give an exact number of “ports” that a hybrid bridge can support (as it “depends”). Resource usage can fluctuate during an ongoing conference, both because participants are leaving and joining, but also because they may change layout, the underlying network conditions may change (ref media elasticity), and many other things. These properties have consequences for the ability to reserve resources, for the calculation of available resources on bridges (when selecting a bridge where a new conference should be created), and of course for pricing. But it is important to understand that the hybrid architecture and its flexibility can only be realised on newer or generic hardware, where multi-purpose processors allow resources to be allocated more flexibly than on older media processing platforms. That is why you have started to see new pricing models that are not directly tied to resource usage, but rather to capabilities or number of participants, independent of how they connect.
Due to this newly acquired flexibility, I predict a wave of innovations (or probably many smaller waves) in the real-time conferencing arena. The real winners will be the people having meetings (i.e. all of us!), as these technologies will enable truly great user experiences!