It’s official. Acano is now a part of Cisco: http://cs.co/acrtb16. Together with my counterparts at Acano, I have been working on technical evaluations and architecture for Cisco and Acano conferencing moving forward (in something called a “clean room” where confidential information is shared). From now on we can share and engage freely across the organisations, and we are now part of the same team. It has been a great experience working with old colleagues and seeing that we share the same visions and thoughts for what the future of conferencing should be. I really look forward to working on the new stuff we will do together!
I promised in my last post on the hybrid bridge to cover more details on its inner workings. This post is part of my series on the next-generation multistream video architecture. You may get more out of it if you read my last post first, but I have tried to write it in such a way that you don’t have to!
First of all, you need to understand the basics of what happens inside a multiparty bridge (the product or service that creates the live meeting experience in audio or video conferences). The simplest version, the audio bridge, receives one encoded audio stream from each of the participants (you all know mp3, though it is not the most likely format for live meetings). The bridge then mixes some of the participants together, either only the speaker if you have a broadcast, or typically the 4-5 streams with the loudest sound. The new, mixed audio stream is then sent back to all the participants. But how does the bridge know who the loudest speakers are? Well, there are two ways: either each participant’s audio stream contains information about the audio level, or the bridge needs to decode all the audio streams and calculate the audio level for each participant. Most audio devices participating in multiparty conferences add this information (see RFC 6464 if you are interested in the details) in the RTP headers of each audio packet. This allows the bridge to select the 4-5 loudest audio streams, decode them, mix them, and encode a single audio stream that is sent back to all participants. This also scales well to large conferences; the only added complexity is that each of the loudest speakers must get a tailored stream where their own audio has been removed. You don’t really want to listen to yourself!
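To make this concrete, here is a minimal Python sketch (my own illustration, not code from any product) of the two mechanisms just described: reading the RFC 6464 audio level out of an RFC 5285 one-byte RTP header extension, and selecting the loudest speakers for tailored mixes. The extension ID and all the names are assumptions made for the example.

```python
from typing import Dict, List, Optional

# Assumption: ID 1 was negotiated in SDP for urn:ietf:params:rtp-hdrext:ssrc-audio-level
AUDIO_LEVEL_EXT_ID = 1

def parse_audio_level(ext_payload: bytes, ext_id: int = AUDIO_LEVEL_EXT_ID) -> Optional[int]:
    """Walk the elements of an RFC 5285 one-byte RTP header extension and
    return the RFC 6464 audio level: 0-127 dB below 0 dBov (lower = louder)."""
    i = 0
    while i < len(ext_payload):
        b = ext_payload[i]
        if b == 0:                            # padding between elements
            i += 1
            continue
        eid, length = b >> 4, (b & 0x0F) + 1  # 4-bit ID, 4-bit length minus one
        if eid == 15:                         # reserved ID: stop processing
            break
        if eid == ext_id:
            return ext_payload[i + 1] & 0x7F  # mask off the V (voice activity) bit
        i += 1 + length
    return None

def pick_loudest(levels: Dict[str, int], k: int = 4) -> List[str]:
    """The k loudest participants. RFC 6464 levels are attenuation values,
    so the smallest numbers win."""
    return sorted(levels, key=levels.get)[:k]

def mixes_for(selected: List[str]) -> Dict[str, set]:
    """Each selected speaker gets a tailored mix without their own audio;
    everyone else can share a single mix of all k speakers."""
    return {spk: set(selected) - {spk} for spk in selected}

levels = {"alice": 12, "bob": 55, "carol": 30, "dave": 90, "erin": 25}
print(pick_loudest(levels))   # ['alice', 'erin', 'carol', 'bob']
```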
Side note: Encryption is an added complexity. Audio levels live in RTP header extensions, and with SRTP (Secure RTP) those headers can be either encrypted or just authenticated. In both cases, the bridge needs a key to create the new headers for each participant (typically one key per participant), and if the headers are encrypted, a key is also needed to decrypt the RTP packet before audio levels can be read. This has consequences for scale (crypto processing is typically CPU heavy), but more importantly for security: the bridge needs to hold the key(s) to decrypt the loudest speakers and then encrypt the mixed audio. Unless there is one shared key for the whole conference, the bridge needs to encrypt all the audio streams separately. This does not scale well, especially for video streams, which carry a lot more data than audio streams. The key access also means that if a hacker gets into a bridge, all conferences may be breached. However, this points to another interesting possibility with a multistream architecture: each audio stream could potentially be switched through without the bridge needing the encryption key at all (as long as the audio levels are available without a key).
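A toy model of the scaling point in that side note, with illustrative numbers of my own (the function and the counts are invented for this sketch, not measured from any bridge):

```python
def encrypts_per_mixed_frame(n_participants: int, k_loudest: int,
                             shared_conference_key: bool) -> int:
    """SRTP encrypt operations per round of audio mixing. Each of the k
    loudest speakers needs a tailored (self-removed) mix. With one shared
    conference key, all remaining listeners can reuse a single encrypted
    packet; with per-participant keys, every listener needs its own pass."""
    tailored = k_loudest
    common = 1 if shared_conference_key else n_participants - k_loudest
    return tailored + common

# 100 participants, 4 loudest speakers:
print(encrypts_per_mixed_frame(100, 4, shared_conference_key=True))   # 5
print(encrypts_per_mixed_frame(100, 4, shared_conference_key=False))  # 100
```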
Let’s add video to the conference. Basically, the same thing happens: the audio is used to determine who should be most visible (or preferred), and thus which video streams should be mixed together (or switched through as the loudest participants in a switched conference). Again, you don’t want to see yourself, so when video streams are mixed together to show multiple participants in one view, all the people keeping quiet can get the same mixed stream, while the loudest speakers need one each so they don’t see themselves. One interesting consequence is that if we restrict the participants’ freedom and only allow them one view (the one everybody shares), the bridge can send the same stream to 1 or 100 participants without doing additional transcoding. This technique is referred to as a shared encode, and this and other shared encode methods can be used to optimise transcoding (side note: this optimisation is easier with newer hardware or general purpose processors). If you, on the other hand, allow all participants to choose who they want to see and how, you may in the worst case need to encode a separate stream for every single participant (being engineers, we have referred to this as the “prettiest woman in the conference” feature, as people will have different tastes for who they want to see, but not that different…).
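In code, the difference between the two layout policies might look like this (a back-of-the-envelope sketch; the function and numbers are mine):

```python
def encodes_needed(n_participants: int, k_on_screen: int,
                   free_layout_choice: bool) -> int:
    """Upper bound on simultaneous video encodes in a transcoding bridge.
    With a single shared layout, the quiet majority reuses one encode and
    only the visible speakers need tailored (self-removed) versions. With
    free per-participant layouts, the worst case is one encode each."""
    if free_layout_choice:
        return n_participants        # everyone picks a different "prettiest" view
    return 1 + k_on_screen           # one shared encode + k tailored encodes

print(encodes_needed(100, 4, free_layout_choice=False))  # 5
print(encodes_needed(100, 4, free_layout_choice=True))   # 100
```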
Now we have a fundamental choice to make: either we switch streams with no transcoding, but require the participants to be homogeneous in what they send and receive (I have covered the reasons in earlier posts in this series), or we allow variation in the capabilities of the endpoints joining the conference, and thus do transcoding as well. If we only switch, we still need to allow “outsiders” in (aka interop), but we typically do that through a gateway that removes variance on the switched conference side. The effect is that it is easier to optimise the experience for “native” participants, while the “outsider” participants get an inferior experience.
The hybrid bridge supports both switching and transcoding (and of course, I assume multistream here). For each participant, the bridge can allocate either transcoding resources or switching resources. Or the bridge can treat a participant as a switched participant for the purpose of sending streams back, but transcode the incoming video from that participant, maybe because it is an H.265-encoded stream or a single stream over a mobile network with bad connectivity. The bridge can then make a full-sized, optimised stream for when the participant is the loudest speaker, but also create a thumbnail video stream for participants that want to show them in a PIP (Picture in Picture), without requiring extra capabilities on the mobile device (or using extra bandwidth to transfer both resolutions). There are many possibilities and corner cases that can be handled, but the important consequence is that all participants get an optimised user experience, including the classic, non-multistream participant that may opt for a classic 3×3 tile thumbnail presentation of the participants.
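Here is a rough sketch of that per-participant decision, under my own assumptions about what the bridge might look at (none of this is an actual product API):

```python
from dataclasses import dataclass

@dataclass
class Participant:
    multistream: bool    # can receive multiple switched streams
    codec_ok: bool       # sends a codec the switched side uses (e.g. not H.265-only)
    network_ok: bool     # uplink stable enough to switch through untouched

def plan_media(p: Participant) -> dict:
    """Decide receive- and send-side handling independently, as the hybrid
    bridge does: a participant can get switched streams out while its
    incoming video is transcoded, in which case the bridge also produces
    the derived streams the sender cannot (full-size and thumbnail)."""
    receive = "switch" if p.multistream else "transcode"   # composited layout otherwise
    send = "switch" if (p.codec_ok and p.network_ok) else "transcode"
    derived = ["fullsize_encode", "thumbnail_encode"] if send == "transcode" else []
    return {"receive": receive, "send": send, "derived_streams": derived}

# e.g. a multistream-capable mobile client on a bad network, sending H.265:
print(plan_media(Participant(multistream=True, codec_ok=False, network_ok=False)))
```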
And here we are at the most important property of the hybrid bridge: it does “interop” as an integrated part of the conference architecture, not through a gateway. The hybrid architecture is a consequence of the belief that in real life there will be far fewer “perfect” conferences than mixed, non-perfect conferences. A “perfect” conference here is one with no non-multistream/switch-capable participants, where you don’t have a mix of both large-screen room systems and mobile participants, where all participants share the same codec, and where no participants have large variances in underlying network quality. The contrary belief is that if you have a service or eco-system with mostly quickly upgradable (on-the-fly) mobile and/or PC-based participants, you can push upgrades fast and make sure all participants have the same capabilities. You may also assume that you have little variance in user experience needs, as you only need to adapt to mobile, tablet, and PC screens. You can then do interop through a gateway function, because you don’t really want to optimise the experience for “non-compliant” participants; they should just switch to a native experience or be satisfied with the inferior one.
Both of these beliefs are valid. However, they meet at one very important point: it is difficult to predict who wants to talk to whom, where, and how. The “full interop is needed” belief assumes that everybody wants to talk to everybody using any device or client. The latter belief hinges on the idea that it is possible to convince people who have access to devices/clients outside your eco-system to download your client and “convert” (possibly just for one meeting). WebRTC makes this more plausible, as the threshold to temporarily “convert” is lower. However, the jury is still out on that one, and the industry is still executing on both beliefs. (Meanwhile, my rantings from 2010 on whether video conferencing is a PBX-type or mobile-type of industry may amuse you…)
An interesting effect of choosing a hybrid architecture is that it is close to impossible to predict how many resources a participant or a conference will occupy on the bridge. The cost of adding the 101st transcoded participant to a conference where 97 people share the same encode is very low. And in fact, if you have a non-multistream, fully transcoded (classic) conference going, a switched participant will actually require more incremental resources than adding another transcoded participant. The reason is that the new switched participant needs, let’s say, the four loudest speaker videos in 720p30, and you have to assume that some of those streams must be transcoded. The new transcoded participant, on the other hand, can just start receiving the shared encode view (in both cases, the incoming stream from the new participant still needs to be decoded, scaled, and re-encoded).
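A toy calculation of those incremental costs, with made-up relative weights (decode = 1 unit, encode = 3 units), just to show the asymmetry:

```python
def incremental_units(kind: str, shared_encode_exists: bool,
                      speakers_already_switched: int, k: int = 4) -> int:
    """Toy resource units for adding one participant to a running
    conference. A transcoded joiner who can reuse an existing shared
    encode costs almost nothing; the first switched joiner forces every
    loudest-speaker stream that only exists composited to be produced
    as a separate 720p30 encode."""
    DECODE, ENCODE = 1, 3                       # arbitrary relative weights
    if kind == "transcoded":
        return DECODE + (0 if shared_encode_exists else ENCODE)
    missing = max(0, k - speakers_already_switched)
    return missing * (DECODE + ENCODE)          # transcode each missing speaker stream

print(incremental_units("transcoded", True, 0))  # 1:  the cheap 101st participant
print(incremental_units("switched", True, 0))    # 16: pricier than another transcode
```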
There is plenty of room for optimisations, but all of this makes it difficult to give an exact number of “ports” that a hybrid bridge can support (as it “depends”). Resource usage can also fluctuate during an ongoing conference: participants leave and join, they may change layout, the underlying network conditions may change (ref. media elasticity), and many other things. These properties have consequences for the ability to reserve resources, for calculating available resources on bridges (when selecting a bridge where a new conference should be created), and of course for pricing. But it is important to understand that the hybrid architecture and its flexibility can only be realised on newer or generic hardware, where multi-purpose processors allow resources to be allocated more flexibly than on older media processing platforms. That is why you have started to see new pricing models that are not directly related to resource usage, but rather to capabilities or number of participants, independent of how they connect.
Due to this newly acquired flexibility, I predict a wave of innovations (or probably many smaller waves) in the real-time conferencing arena. The real winners will be the people having meetings (i.e. all of us!), as these technologies will enable truly great user experiences!
My last post on SVC and related technologies packed a lot of complex concepts into a fairly short post. If you didn’t quite follow, don’t worry, here is a quick summary of the conclusions:
Video streams come in many different resolutions and qualities depending on the type of video endpoint or client you have, and at the receiving end, the experience you are able to get depends on the capabilities of your endpoint or client (from a smartphone all the way up to a multi-screen telepresence endpoint). Abilities or constraints in the network may also limit what you can send or receive, both the maximum ability and temporary problems that reduce the network’s ability to support a given user experience. So, what your endpoint sends (e.g. from a high-quality HD camera) may not be what the other end actually wants to use (e.g. on a small iPhone screen), and even if the other end can use it, the network may not be able to deliver it.
We are looking for that magical user experience that makes people go: wow, this is better than face to face. That implies far more than just sharp, high-resolution video. However, the ability to create a better-than-face-to-face experience includes sharp, high-resolution video streams (yes, plural), and to get there, all participants should send the very best quality they can, and each participant should be able to see, not only one, but all other participants in the quality they can show. Impossible (see my previous post for details on why), but we want to get as close as possible. In order to get there (beyond just isolated islands), I argued in my last post that we need the media streams to be elastic, in such a way that they can adapt to bandwidth constraints, recover from problems/packet loss, and fit into policy constraints to avoid impacting other important network traffic. If media streams have these characteristics, then we can allow multiple streams going back and forth to each participant, support pro-active selection of streams from each participant, and dynamically adapt the resolutions and qualities to create the user experience we are looking for.
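To illustrate what that elasticity can mean in practice, here is a greedy toy sketch (the layer names and bitrates are invented for the example) of a receiver’s streams being fitted to a fluctuating bandwidth budget:

```python
# Available qualities per stream in kbps; illustrative numbers only.
LAYERS = [("720p30", 1500), ("360p30", 600), ("180p15", 150)]

def fit_streams(requested: int, budget_kbps: int) -> list:
    """Greedy elasticity sketch: give every requested stream the best
    quality that still fits the receiver's estimated bandwidth budget,
    stepping down (and eventually dropping streams) as the budget
    shrinks, and stepping back up when it recovers."""
    chosen = []
    for _ in range(requested):
        for name, cost in LAYERS:
            if cost <= budget_kbps:
                chosen.append(name)
                budget_kbps -= cost
                break
    return chosen

print(fit_streams(4, 4000))  # ['720p30', '720p30', '360p30', '180p15']
print(fit_streams(4, 1000))  # ['360p30', '180p15', '180p15'] and one stream dropped
```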
In this post I will cover in more detail the multiparty bridge (aka MCU) component of a real-time video communication solution supporting multistream. First, an overview of the types of multiparty bridges will be useful. Most people used to video conferencing know the traditional transcoded multiparty bridge, where each participant sends audio and video to the bridge based on what they are able to send, and each participant receives from the bridge what they are capable of receiving. It is the bridge that is responsible for supporting all the codecs and capabilities each endpoint supports, as well as for translating and transcoding between the participants. Of course, in a classic multiparty meeting, each participant will receive a composited image of several participants from the bridge, typically created based on who the last loudest speakers were. The actual layout may be tailored per individual, i.e. the bridge will compose a video stream for each participant with multiple participants shown, but with the receiver’s own video removed. This approach gives perfect isolation of the participants from each other. Each can send and receive at its optimum capabilities, and if one participant is experiencing packet loss, the bridge will detect this, communicate it back to the participant with problems (so that corrections can be made), and conceal the packet loss from the other participants.
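A small sketch of that isolation property (an illustration only; the class, the threshold, and the numbers are my own):

```python
class TranscodedLeg:
    """One participant's leg into a transcoding bridge. Every leg has its
    own decoder/encoder pair, so problems on this leg are handled here
    and concealed from all the other participants."""
    def __init__(self, name: str, bitrate_kbps: int = 1500):
        self.name = name
        self.bitrate_kbps = bitrate_kbps

    def on_receiver_report(self, loss_fraction: float) -> None:
        # React to loss reported by *this* receiver only: slow this leg's
        # encoder down and refresh it; nobody else notices a thing.
        if loss_fraction > 0.05:
            self.bitrate_kbps = int(self.bitrate_kbps * 0.8)
            print(f"{self.name}: keyframe sent, rate now {self.bitrate_kbps} kbps")

leg = TranscodedLeg("mobile-user")
leg.on_receiver_report(0.12)   # only this leg adapts
```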
The other extreme is best effort switching. All participants negotiate an optimal shared codec and a resolution that all are capable of sending and receiving. To make sure everybody can participate, the least common denominator is used, thus reducing the quality of the user experience for everybody if one of the participants is only capable of viewing, let’s say, 360p, while the others could have used 720p. And if one participant starts to get packet loss, all the other participants will notice it and start sending corrections back. This approach offers no isolation at all.
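The least-common-denominator negotiation could be sketched like this (again a toy example with invented structures):

```python
def negotiate(participants: list):
    """Best effort switching setup: everyone must send what everyone can
    receive, so the conference runs at the intersection of the codec sets
    and the lowest common resolution."""
    codecs = set(participants[0]["codecs"])
    for p in participants[1:]:
        codecs &= set(p["codecs"])
    if not codecs:
        raise ValueError("no common codec: someone cannot join at all")
    height = min(p["max_height"] for p in participants)
    return sorted(codecs)[0], height

conf = [{"codecs": ["h264", "vp8"], "max_height": 720},
        {"codecs": ["h264", "vp8"], "max_height": 720},
        {"codecs": ["h264"], "max_height": 360}]
print(negotiate(conf))  # ('h264', 360): one weak client drags everybody down
```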
Transcoded conferencing sounds perfect until I tell you that transcoding is expensive (you need servers), introduces latency (a time lag that may be uncomfortable, like on a satellite link), and may reduce the quality of the image. On the other hand, best effort switching is perfect as long as all the participants are equal and the network is perfect, but it cannot support variance in endpoint capabilities (or desires for different resolution streams), and it is sensitive to network problems. So, given the pros and cons of each approach, what is the right technology choice for the future multiparty bridge? Before I address that question, we need to revisit multistreaming and how it relates to bridge approaches.
Contrary to many people’s beliefs, multistreaming has nothing to do with switching or transcoding (though you typically need multistreaming to do efficient switching). Multistreaming means sending and receiving more streams to and from each participant, allowing the endpoint or client to be more intelligent and create great user experiences locally, based on screen size, number of screens, room context, and user preferences. If a highly capable endpoint receives one high-definition stream from each participant (let’s say the four loudest speakers) and low-resolution thumbnails for the rest, a multitouch endpoint (like the Cisco DX80) could let the user swipe through live video streams from each of the participants, and pinch, zoom, and organise how and where each of the participants and thumbnails are shown on the screen. Borders between video streams can be sharper, text can be overlaid where it belongs (like the name of the participant), and sound can be played out of the right speaker depending on whether the participant speaking is positioned on the right or left side of the screen(s). The streams coming from the multiparty bridge will be multiplexed together over the network (i.e. transported over the same RTP/RTCP ports), but the streams themselves can be switched through the bridge with no transcoding, or they can be transcoded; the endpoint doesn’t really care, as long as it gets streams that are optimised to its needs (through quality/resolution selection and media elasticity). Indeed, instead of receiving many thumbnails of passive participants, the endpoint could, in addition to HD streams for each of the four loudest speakers, request a transcoding multiparty bridge to create a composited stream with all the participants laid out in a 2×2 or 8×1 layout or whatever the endpoint may need to create a great user experience. And of course, if one of the participants starts to get packet loss, the transcoding bridge will conceal this.
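What could such a request from the endpoint look like? Here is a hypothetical sketch (the message shape is entirely made up for illustration; real signalling protocols differ):

```python
def build_subscription(loudest: list, everyone: list, multitouch: bool) -> list:
    """Illustrative request an endpoint might send to a hybrid bridge:
    full 720p30 for the four loudest speakers, plus either individual
    live thumbnails (to swipe/pinch/arrange locally) or one
    bridge-composited grid of the rest as a single extra stream."""
    top = loudest[:4]
    subs = [{"source": p, "quality": "720p30"} for p in top]
    rest = [p for p in everyone if p not in top]
    if multitouch:
        subs += [{"source": p, "quality": "thumbnail"} for p in rest]
    else:
        subs.append({"sources": rest, "quality": "360p30", "compose": "2x2"})
    return subs

people = ["alice", "bob", "carol", "dave", "erin", "frank"]
print(build_subscription(["alice", "bob", "carol", "dave"], people, multitouch=True))
```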
We have seen quite a few best effort switching products over the last few years, each trying to solve the problem of media elasticity, not through isolation, but through various other techniques. However, they all suffer from one fundamental problem: switching and each of these special techniques require support in RTP/RTCP and/or signalling, so the solution ends up being proprietary, and any inclusion of a third-party participant either significantly reduces the quality or forces you to use a gateway to isolate those participants. Unfortunately, this gives both third-party participants and native participants an inferior user experience.
In a next-generation multiparty bridge, we thus want the isolation capabilities of a transcoding bridge, the low latency and quality of a switched bridge (and the low cost), and we want to send and receive multiple streams to create the very best user experience. And this is why you have started to hear about hybrid multiparty bridges. Hybrid bridges support transcoding, multistream (as in SVC/Lync/simulcast++), and switching. A hybrid bridge can support standard SIP/H.264 participants through transcoding, switch streams to participants that support switching, and send multiple streams to participants that are capable of decoding and rendering multiple streams locally. Note that a hybrid bridge can both transcode and switch to participants in the same conference. SVC support in itself does not make a bridge a hybrid: supporting SVC-based endpoints like Lync can easily be done through transcoding alone. Likewise, the video stream sent to a Lync client can be transcoded (and composited), or multiple streams can be sent to Lync (multiplexed in Microsoft SVC). Still, as explained earlier, these streams can be transcoded as well (and often will be, as other participants may not be sending the resolution that the Lync client needs). It is only when video streams (RTP) are switched through the bridge without being touched that you are seeing true switching. In a hybrid conference, some of the participants will support switching and send right-sized thumbnails for the other participants to use, while others will not, and the bridge will have to transcode their streams down to thumbnail size and send them to the simulcast-capable participants.
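That last point, in sketch form (illustrative only; the function and layer names are mine):

```python
def outgoing_thumbnail(sender_layers: list) -> dict:
    """How a hybrid bridge sources one sender's thumbnail for the
    simulcast-capable receivers: if the sender already simulcasts a
    right-sized layer, switch it through untouched (true switching, the
    RTP is never touched); otherwise decode, scale, and encode one."""
    if "thumbnail" in sender_layers:
        return {"mode": "switch", "layer": "thumbnail"}
    return {"mode": "transcode", "ops": ["decode", "scale", "encode"]}

print(outgoing_thumbnail(["720p30", "thumbnail"]))  # switched through untouched
print(outgoing_thumbnail(["720p30"]))               # the bridge fills the gap
```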
If of interest, I may cover more of the inner workings of a hybrid bridge in a later post. Another interesting aspect of the hybrid bridge that I may cover later is the ability to distribute conferences across multiple nodes by using switched media trunks between each bridge node, a so-called cascade (but without the bad user experience associated with the transcoded bridge cascade). Not that you need a hybrid bridge to do this kind of cascade; the architecture is similar to internal bridge architectures, but with switched multistream technologies, things typically get easier to do, as the fundamental building blocks are in place.