Have you heard about TIP, the TelePresence Interoperability Protocol? What about the CLUE working group at the IETF? If not, you have surely heard about SVC, right? If you have been exposed to the video conferencing/telepresence industry, you have probably also heard about how switching technology is turning the old MCU video conferencing approach upside down. Add the WebRTC hype on top of that, and I can understand why even very technical people get confused. How does all this come together, and why does it really matter?
The video conferencing world used to be simple: each participant would send an audio and a video stream to a conferencing box, called an MCU, which would process each stream individually, mix them together, and send a composited audio and video stream back to each participant, thus creating a meeting experience. It was simple, but it was not easy to get a good experience, because each set of audio/video streams was fairly static in its bandwidth requirements and very sensitive to network issues, and the user experience depended heavily on the capabilities of the MCU. The processing required to create a good experience is costly, especially at the higher resolutions a good experience demands.
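To make the old mixing-MCU model concrete, here is a deliberately simplified Python sketch. All the names (`Participant`, `decode`, `compose_layout`, `encode`) are illustrative stand-ins, not any real MCU's API; the point is that every incoming stream is decoded, one layout is composed, and the result is re-encoded once per receiver, which is why the processing bill grows with both the number of participants and the resolution.

```python
from dataclasses import dataclass

# Toy model of the classic mixing MCU: every participant sends one stream,
# the MCU decodes all of them, composes one layout, and re-encodes a mixed
# stream per receiver. All names here are illustrative, not a real MCU API.

@dataclass
class Participant:
    id: str
    max_resolution: tuple   # (width, height), negotiated up front and fairly static
    max_bitrate_kbps: int

def decode(participant):
    # Stand-in for decoding the participant's incoming frame (CPU heavy).
    return f"frame-from-{participant.id}"

def compose_layout(frames):
    # Stand-in for compositing decoded frames into e.g. a 2x2 grid.
    return "|".join(frames)

def encode(composite, resolution, bitrate_kbps):
    # Stand-in for re-encoding the composite for one receiver.
    return f"{composite} @ {resolution[0]}x{resolution[1]}, {bitrate_kbps} kbps"

def mcu_mixing_round(participants):
    decoded = {p.id: decode(p) for p in participants}
    for receiver in participants:
        # Composite typically excludes the receiver's own video.
        others = [f for pid, f in decoded.items() if pid != receiver.id]
        composite = compose_layout(others)
        # One full encode per receiver: this per-port processing is what
        # makes the mixing MCU expensive, especially at high resolutions.
        print(f"to {receiver.id}: {encode(composite, receiver.max_resolution, receiver.max_bitrate_kbps)}")

mcu_mixing_round([
    Participant("alice", (1280, 720), 1500),
    Participant("bob", (640, 360), 512),
    Participant("carol", (1920, 1080), 2500),
])
```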
The first thing we did to improve the experience and reduce cost was to make sure that each participant would only get what it could digest. In the beginning this was fairly static: you had to pre-set your maximum bandwidth and resolution capabilities up front, and the Codian bridges (later TANDBERG, and now Cisco) would set aside enough resources to give all participants the best possible experience (“a port is a port”, a fairly wasteful approach, but given the hardware restrictions of the time, the best way to deliver that experience). We then introduced dynamic, and increasingly intelligently controlled, bandwidth adaptation, as well as mechanisms for recovering from network issues.
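As an illustration of what dynamic bandwidth adaptation means in practice, here is a minimal loss-based rate controller in Python. The thresholds and step sizes are invented for the example and are not the actual algorithm used in the Codian/TANDBERG/Cisco bridges; the point is simply that the sender backs off when the receiver reports packet loss and probes back up when the path is clean.

```python
# Minimal sketch of loss-based bandwidth adaptation: back off quickly when
# the receiver reports packet loss, probe back up slowly when the path is
# clean. Thresholds and step sizes are illustrative only.

def adapt_bitrate(current_kbps, loss_fraction, floor_kbps=128, ceiling_kbps=4000):
    if loss_fraction > 0.10:
        # Heavy loss: cut the rate hard to let the network recover.
        target = current_kbps * 0.5
    elif loss_fraction > 0.02:
        # Mild loss: ease off.
        target = current_kbps * 0.85
    else:
        # Clean path: probe upward gently toward the negotiated maximum.
        target = current_kbps * 1.05
    return int(min(max(target, floor_kbps), ceiling_kbps))

# Example: a 1500 kbps stream hit by a burst of 12% loss, then recovering.
rate = 1500
for loss in [0.12, 0.04, 0.0, 0.0]:
    rate = adapt_bitrate(rate, loss)
    print(rate)   # 750, 637, 668, 701
```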
Then Cisco TelePresence introduced something totally new: multiple cameras and multiple screens per participant (three of each) and super-high resolution. Also, instead of mixing and using CPU processing to create the meeting experience, the bridge basically switched the streams through (a media relay), so that all participants received the three audio/video streams from the last speaker. The near face-to-face experience was incredible, but the sensitivity to network issues was the same… And because of the switching, all participants had to be capable of handling exactly the same audio and video resolutions and qualities, so interoperability was a key problem, including interoperability across types of endpoints (say, lightweight PC clients or cheaper units that could not handle the high-resolution telepresence streams). On top of that, no signalling standard really took more than one camera and one screen into account, so there was no way of exchanging information about capabilities. That is the gap the TelePresence Interoperability Protocol filled (and still fills).
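To contrast with the mixing MCU sketched above, here is a rough Python sketch of a switching bridge (media relay) under the same toy assumptions; the names and stream labels are made up. The speaker's encoded streams are forwarded untouched to everyone else, which is cheap, but it also means every receiver must cope with exactly what the speaker sends.

```python
# Minimal sketch of a switching bridge (media relay): instead of decoding and
# mixing, it forwards the current speaker's encoded streams unchanged to every
# other participant. Because nothing is transcoded, every receiver must be
# able to handle exactly the format the speaker sends, which is why capability
# exchange (the problem TIP addressed for multi-camera systems) matters.

def switch_round(participants, active_speaker_id):
    speaker = next(p for p in participants if p["id"] == active_speaker_id)
    for receiver in participants:
        if receiver["id"] == active_speaker_id:
            continue
        # Forward all of the speaker's streams (e.g. three camera streams
        # for a triple-screen TelePresence room) without re-encoding.
        for stream in speaker["streams"]:
            print(f"relay {stream} -> {receiver['id']}")

switch_round(
    [
        {"id": "room-A", "streams": ["A-left-1080p", "A-center-1080p", "A-right-1080p"]},
        {"id": "room-B", "streams": ["B-left-1080p", "B-center-1080p", "B-right-1080p"]},
        {"id": "laptop-C", "streams": ["C-cam-360p"]},
    ],
    active_speaker_id="room-A",
)
# laptop-C now receives three 1080p streams it may not be able to decode or
# display: exactly the interoperability gap described above.
```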
Fast forward, and the industry has evolved in several key areas. First of all, each video stream is now much more “elastic”, i.e. it adapts more quickly and intelligently to network issues and other constraints, thus optimising the experience. Also, the concept of a video stream is no longer simple, thanks to SVC and simulcast technologies (I will write more about this in a later post). Furthermore, mobile devices, web browsers, and cheaper video-capable devices became increasingly important to include in any meeting. And we really started to understand what we could do to improve the user experience by allowing the endpoints/clients to be more intelligent and to interact with the conferencing unit (bridge/MCU) to request exactly what they need to create the very best experience. Finally, there is a fundamental technical challenge in creating a cheaper, more scalable conferencing experience through technologies like switching and simulcast/SVC while, at the same time, embracing mobility and a very heterogeneous set of participants (across companies, across hardware, across vendors, and across trust boundaries).
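To show how simulcast (and, in a similar spirit, SVC) eases that tension, here is a hedged Python sketch: the sender encodes the same video at a few qualities, each receiver declares what it can handle, and the bridge simply forwards the best-fitting encoding without transcoding. The layer set and the selection rule are illustrative, not taken from any specific product.

```python
# Sketch of receiver-driven layer selection with simulcast: the bridge never
# transcodes, it just picks which of the sender's encodings to forward to
# each receiver. Layer definitions and thresholds are illustrative only.

SIMULCAST_LAYERS = [
    {"name": "hi",  "height": 720, "kbps": 1500},
    {"name": "mid", "height": 360, "kbps": 600},
    {"name": "lo",  "height": 180, "kbps": 200},
]

def pick_layer(receiver_max_height, receiver_available_kbps):
    # Choose the highest layer that fits both the screen and the bandwidth,
    # falling back to the lowest layer if nothing fits.
    for layer in SIMULCAST_LAYERS:
        if layer["height"] <= receiver_max_height and layer["kbps"] <= receiver_available_kbps:
            return layer
    return SIMULCAST_LAYERS[-1]

receivers = [
    {"id": "boardroom",  "max_height": 1080, "kbps": 4000},
    {"id": "laptop",     "max_height": 720,  "kbps": 800},
    {"id": "smartphone", "max_height": 360,  "kbps": 300},
]
for r in receivers:
    layer = pick_layer(r["max_height"], r["kbps"])
    print(f"{r['id']}: forward '{layer['name']}' ({layer['height']}p @ {layer['kbps']} kbps)")
# boardroom gets 720p, laptop gets 360p, smartphone gets 180p, all from the
# same sender and all without any transcoding on the bridge.
```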
Ironically, the multi-camera, multi-screen TelePresence area has not really evolved that much. It was a high-end technology that was ground-breaking at the time, but the innovation continued at the lower end. I believe this will change. We had to innovate on how to handle a single audio/video stream and offer new user experiences, because most video users actually have only one camera and one screen. However, if you do have multiple cameras (and content sources), there are a number of interesting possibilities for improving the feeling of “better than being there” in meeting rooms, lecture halls, conferences, and other scenarios where a large number of people share the same experience.
In subsequent posts, I will cover some of these areas and show how they may come together to make global high-quality live video as pervasive as phone calls. I refer to this as the next-gen multistream video architecture. Next up: how elasticity changes how you need to view real-time video in your network.
One last note: if you haven’t seen what difference high-quality audio and video make in a one-to-one or larger meeting (compared to web video and its like), you will not understand why we would even bother to build out a multistream video architecture. And if that is the case, you may not find these posts too interesting.