My last post on SVC and related technologies packed a lot of complex concepts into a fairly short post. If you didn’t quite follow, don’t worry, here is a quick summary of the conclusions:
Video streams come in many different resolutions and qualities depending on the type of video endpoint or client you have, and on the receiving end, the experience you are able to get depends on the capabilities of your endpoint or client (from a smartphone all the way up to a multi-screen telepresence endpoint). Abilities or constraints in the network may also limit what you can send or receive, both as a hard ceiling and as temporary problems that reduce the network's ability to support a given user experience. So, what your endpoint sends (e.g. from a high-quality HD camera) may not be what the other end actually wants to use (e.g. on the small screen of an iPhone), and even if you can use it, your network may not be able to deliver it to you.
We are looking for that magical user experience that makes people go: wow, this is better than face to face. This implies far more than just sharp, high-resolution video. However, the ability to create a better-than-face-to-face experience includes sharp, high-resolution video streams (yes, plural), and to get there, all participants should send the very best quality they can, and each participant should be able to see, not only one, but all other participants in the best quality their endpoint can show. Impossible (see my previous post for details on why), but we want to get as close as possible. In order to get there (beyond just isolated islands), I argued in my last post that we need the media streams we are using to be elastic, in the sense that they can adapt to bandwidth constraints, recover from problems like packet loss, and fit into policy constraints to avoid impacting other important network traffic. If media streams have these characteristics, then we can allow multiple streams to go back and forth to each participant, support proactive selection of streams from each participant, and dynamically adapt resolutions and qualities to create the user experience we are looking for.
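To make the elasticity idea concrete, here is a minimal sketch of how a receiver might pick the best stream quality that fits its current bandwidth budget. The layer names and bitrates are illustrative assumptions of mine, not taken from any particular product or codec.

```python
# Minimal sketch of media elasticity: a sender offering a few
# simulcast-style quality layers, and a selector that picks the best
# layer fitting the receiver's current bandwidth budget. Layer names
# and bitrates below are illustrative assumptions only.

LAYERS = [  # (name, width, height, bitrate in kbps), best first
    ("1080p", 1920, 1080, 4000),
    ("720p",  1280,  720, 2000),
    ("360p",   640,  360,  600),
    ("thumb",  320,  180,  150),
]

def pick_layer(available_kbps):
    """Return the highest-quality layer that fits the budget."""
    for name, w, h, kbps in LAYERS:
        if kbps <= available_kbps:
            return name
    return None  # not even a thumbnail fits; fall back to audio only

# On packet loss or congestion, the receiver lowers its budget and the
# stream adapts instead of failing outright.
print(pick_layer(2500))  # -> "720p"
print(pick_layer(300))   # -> "thumb"
```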
In this post I will cover in more detail the multiparty bridge (aka MCU) component of a real-time video communication solution supporting multistream. First, an overview of the types of multiparty bridges will be useful. Most people used to video conferencing know the traditional transcoded multiparty bridge, where each participant sends audio and video to the bridge based on what they are able to send, and each participant receives from the bridge what they are capable of receiving. The bridge is responsible for supporting all the codecs and capabilities each endpoint can send and receive, as well as translating and transcoding between participants. Of course, in a classic multiparty meeting, each participant will receive a composited image of several participants from the bridge, typically built from the most recent loudest speakers. The actual layout may be tailored per individual, i.e. the bridge will compose a video stream for each participant showing multiple participants, but with the receiver's own video removed. This approach gives perfect isolation of the participants from each other. Each can send and receive at its optimum capabilities, and if one participant is experiencing packet loss, the bridge will detect this, communicate it back to the participant with problems (so that corrections can be made), and conceal the packet loss from the other participants.
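As a rough illustration of that per-individual layout tailoring, here is a small sketch of how a bridge might pick which participants go into the composite sent to each receiver. The names and the four-slot layout are hypothetical; a real bridge tracks audio energy over time rather than working from a static list.

```python
# Hedged sketch: which participants appear in the composited layout a
# transcoding bridge sends to each receiver -- the most recent loudest
# speakers, with the receiver's own video removed. The participant
# names and the four-slot layout are assumptions for illustration.

def layout_for(receiver, speakers_by_loudness, slots=4):
    """Pick up to `slots` participants to composite for `receiver`."""
    chosen = []
    for p in speakers_by_loudness:      # loudest first
        if p == receiver:
            continue                    # never show the receiver their own video
        chosen.append(p)
        if len(chosen) == slots:
            break
    return chosen

speakers = ["alice", "bob", "carol", "dave", "erin"]
print(layout_for("bob", speakers))   # ['alice', 'carol', 'dave', 'erin']
print(layout_for("erin", speakers))  # ['alice', 'bob', 'carol', 'dave']
```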
The other extreme is best effort switching. All participants negotiate an optimal shared codec and a resolution that all are capable of sending and receiving. To make sure everybody can participate, the least common denominator is used, which reduces the quality of the user experience for everybody if one of the participants is only capable of viewing, let's say, 360p, while the others could have used 720p. If one participant starts to get packet loss, all the other participants will notice it and start sending corrections back. This approach offers no isolation at all.
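To show how the least common denominator drags everybody down, here is a minimal sketch of that negotiation, assuming a deliberately simplified capability model (codec sets and a maximum resolution per participant) that I made up for illustration.

```python
# Sketch of least-common-denominator negotiation in a best-effort
# switched conference: the shared codec and resolution are the best
# ones that *every* participant supports. Capability sets below are
# illustrative assumptions.

CODEC_PREFERENCE = ["h264-high", "h264-baseline"]

def negotiate(participants):
    """participants: list of (supported_codecs, max_resolution) tuples."""
    shared_codec = next(
        (c for c in CODEC_PREFERENCE
         if all(c in codecs for codecs, _ in participants)),
        None,
    )
    shared_resolution = min(res for _, res in participants)
    return shared_codec, shared_resolution

room = [
    ({"h264-high", "h264-baseline"}, 720),
    ({"h264-high", "h264-baseline"}, 720),
    ({"h264-baseline"}, 360),  # one weak client...
]
# ...drags everyone down:
print(negotiate(room))  # -> ('h264-baseline', 360)
```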
Transcoded conferencing sounds perfect until I tell you that transcoding is expensive (you need servers), introduces latency (a time lag that can feel as uncomfortable as a satellite link), and may reduce image quality. On the other hand, best effort switching is perfect as long as all the participants are equal and the network is perfect, but it cannot support variance in endpoint capabilities (or desires for different resolution streams) and is sensitive to network problems. So, given the pros and cons of each approach, what is the right technology choice for the future multiparty bridge? Before I address that question, we need to revisit multistreaming and how it relates to bridge approaches.
Contrary to many people’s beliefs, multistreaming has nothing to do with switching or transcoding (though you typically need multistreaming to do efficient switching). Multistreaming is about sending and receiving more streams to and from each participant, allowing the endpoint or client to be more intelligent and create great user experiences locally, based on screen size, number of screens, room context, and user preferences. If a highly capable endpoint receives one high-definition stream from each participant (let’s say the four loudest speakers) and low-resolution thumbnails for the rest, a multitouch endpoint (like the Cisco DX80) could let the user swipe through live video streams from each of the participants, and pinch, zoom, and organise how and where each of the participants and thumbnails are shown on the screen. Borders between video streams can be sharper, text can be overlaid where it belongs (like the name of the participant), and sound can be played out of the right speaker depending on whether the speaking participant is positioned on the right or left side of the screen(s). The streams coming from the multiparty bridge will be multiplexed together over the network (i.e. transported over the same RTP/RTCP ports), but the streams themselves can be switched through the bridge with no transcoding, or they can be transcoded; the endpoint doesn’t really care as long as it gets streams that are optimised to its needs (through quality/resolution selection and media elasticity). Indeed, instead of receiving many thumbnails of passive participants, the endpoint could, in addition to HD streams for each of the four loudest speakers, request that a transcoding multiparty bridge create a composited stream with all the participants laid out in a 2×2 or 8×1 layout, or whatever the endpoint may need to create a great user experience. And of course, if one of the participants starts to get packet loss, the transcoding bridge will conceal it.
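As a sketch of what an endpoint's stream selection might look like, the snippet below requests HD from the four loudest speakers and thumbnails from everyone else. The request format is purely illustrative; real systems express this through signalling and RTP/RTCP extensions.

```python
# Hypothetical sketch of a multistream endpoint's stream request:
# full HD for the four loudest speakers, thumbnails for the rest.
# Names, qualities, and the dict-based request format are assumptions.

def build_stream_request(me, participants, speakers_by_loudness, hd_slots=4):
    loudest = [p for p in speakers_by_loudness if p != me][:hd_slots]
    request = []
    for p in participants:
        if p == me:
            continue  # never ask for our own video back
        quality = "720p" if p in loudest else "thumbnail"
        request.append({"source": p, "quality": quality})
    return request

everyone = ["alice", "bob", "carol", "dave", "erin", "frank"]
by_loudness = ["carol", "alice", "frank", "bob", "erin", "dave"]
for entry in build_stream_request("dave", everyone, by_loudness):
    print(entry)  # carol/alice/frank/bob at 720p, erin as thumbnail
```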
We have seen quite a few best effort switching products over the last few years, each trying to solve the problem of media elasticity, not through isolation, but through various other techniques. However, they all suffer from one fundamental problem: switching and each special technique require support in RTP/RTCP and/or signalling, so the solution ends up being proprietary, and any inclusion of a third-party participant either significantly reduces the quality, or you must use a gateway to isolate these participants. Unfortunately, this gives both third-party participants and native participants an inferior user experience.
In a next-generation multiparty bridge, we thus want the isolation capabilities of a transcoding bridge, the low latency and quality of a switched bridge (and low cost), and we want to send and receive multiple streams to create the very best user experience. And this is why you have started to hear about hybrid multiparty bridges. Hybrid bridges support transcoding, multistream (as in SVC/Lync/simulcast++), and switching. A hybrid bridge can support standard SIP/H.264 participants through transcoding, switch streams to participants that support switching, and send multiple streams to participants that are capable of decoding and rendering multiple streams locally. Note that a hybrid bridge can both transcode and switch to participants in the same conference. SVC support in itself does not make a bridge a hybrid: supporting SVC-based endpoints like Lync can easily be done through transcoding alone. Likewise, the video stream sent to a Lync client can be transcoded (and composited), or multiple streams can be sent to Lync (multiplexed in Microsoft SVC). Still, as explained earlier, these streams can be transcoded as well (and often will be, as other participants may not be sending the resolution that the Lync client needs). It is only when video streams (RTP) are switched through the bridge without being touched that you are seeing true switching. In a hybrid conference, some of the participants will support switching and send right-sized thumbnails for the other participants to use, while others won’t, and the bridge will have to transcode their streams down to thumbnail size before sending them to the simulcast-capable participants.
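To make the switch-versus-transcode distinction concrete, here is a hedged sketch of the per-stream decision a hybrid bridge faces. The capability model is a deliberate simplification of mine: a simulcast-capable sender already offers several qualities and can be switched, while a single-stream SIP/H.264 sender must be transcoded.

```python
# Minimal sketch of the per-stream decision in a hybrid bridge:
# forward the RTP untouched when the sender already produces what the
# receiver wants, otherwise transcode. The quality strings are
# simplified assumptions for illustration.

def forward(sender_offers, receiver_wants):
    """sender_offers: set of qualities the sender already produces
    (e.g. simulcast layers); receiver_wants: the requested quality."""
    if receiver_wants in sender_offers:
        return "switch"     # RTP passes through untouched: cheap, low latency
    return "transcode"      # decode and re-encode: isolation, at a cost

# A simulcast sender already offers a thumbnail, so it can be switched;
# a plain SIP/H.264 sender offers one stream and must be transcoded.
print(forward({"720p", "360p", "thumbnail"}, "thumbnail"))  # -> switch
print(forward({"720p"}, "thumbnail"))                       # -> transcode
```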
If of interest, I may cover more of the inner workings of a hybrid bridge in a later post. Another interesting aspect of the hybrid bridge that I may cover later is the ability to distribute conferences across multiple nodes by using switched media trunks between each bridge node, a so-called cascade (but without the bad user experience associated with the transcoded bridge cascade). Not that you need a hybrid bridge to do this kind of cascade; the architecture is similar to internal bridge architectures. But with switched multistream technologies, this typically gets easier to do, as the fundamental building blocks are already in place.