In my previous post in this series, I covered the three pillars of elastic media: bandwidth adaptation (optimising media throughput based on a receiver’s needs, the user experience, and the network), media resilience (robust behaviour when there is packet loss), and quality of service (policies implementing the relative priority of different traffic in your network). Here I will dive deeper into how a stream is not really a stream anymore, and how new methods of encoding and packaging streams make these three pillars increasingly complex to get right.
SVC, Scalable Video Coding, an encoding standard, has received a lot of attention. The standard itself is just one of many standards and technologies that must work together to realise the underlying concept, but, as is often the case, people like three-letter acronyms for simplicity, so SVC is often used to refer to the underlying methods as well. However, the basic concept of SVC seems simple and very appealing: within a single media stream, pack together a base layer with a minimum quality and then add additional layers with incremental information to build a higher-quality video stream. The problem is that there are many ways of layering and packaging, and how you package (encode) a media stream has a huge impact on what the receiving end is able to do with the streams and what it needs to be capable of handling technically (in terms of resources, decoding, compositing, and rendering on the screen). This in turn determines how much freedom you have to create really great user experiences, or how restricted you are.
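To make the layering idea concrete, here is a minimal Python sketch of a stream bundle. The layer names, resolutions, and bitrates are made up for illustration and do not come from the SVC specification; the point is simply that an enhancement layer is worthless without the layers it depends on.

```python
from dataclasses import dataclass, field

@dataclass
class Layer:
    name: str            # e.g. "base", "temporal-30", "spatial-720p"
    resolution: str      # e.g. "360p"
    fps: int
    kbps: int            # rough bitrate of this layer alone (illustrative)
    depends_on: list = field(default_factory=list)

# One hypothetical bundle: a base layer plus two dependent enhancement layers.
bundle = [
    Layer("base", "360p", 15, 400),
    Layer("temporal-30", "360p", 30, 200, depends_on=["base"]),
    Layer("spatial-720p", "720p", 30, 900, depends_on=["base", "temporal-30"]),
]

def decodable(received, bundle):
    """A layer is only useful if every layer it depends on was also received."""
    by_name = {layer.name: layer for layer in bundle}
    return {name for name in received
            if all(dep in received for dep in by_name[name].depends_on)}

# If the base layer is lost, the enhancement layers are useless on their own:
print(decodable({"temporal-30", "spatial-720p"}, bundle))  # -> set()
print(decodable({"base", "temporal-30"}, bundle))          # -> {'base', 'temporal-30'}
```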
SVC is great if you have a very uniform set of clients, without any non-SVC (AVC) clients participating, and where all the participants have fairly similar needs with respect to the streams they require to render a great user experience. One of the problems is that only the base layer in SVC is (encoding) compatible with standard AVC clients. Also, if there is large variance in the needs of the participants (some want 1080p60, some 720p30, and others only 360p15), how do you package the right SVC stream bundle? A 180p15 base layer with 360p30 and 720p30 enhancement layers? Or should you use a 360p15 base layer and only add temporal enhancement layers (more about temporal qualities later)? (The p30/p15 at the end signifies the number of frames per second in the video stream.)
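As a rough sketch of the packaging dilemma (this is my own naive illustration, not any encoder’s actual algorithm), imagine simply making the base layer the lowest quality anyone asked for and adding an enhancement step for every other distinct request:

```python
def naive_ladder(requests):
    """requests: list of (height, fps) tuples, e.g. (1080, 60)."""
    distinct = sorted(set(requests))          # lowest request first
    return distinct[0], distinct[1:]          # base layer, enhancement steps

requests = [(1080, 60), (720, 30), (360, 15), (360, 15)]
base, enhancements = naive_ladder(requests)
print(f"base layer: {base[0]}p{base[1]}")              # base layer: 360p15
for height, fps in enhancements:
    print(f"enhancement step up to: {height}p{fps}")   # 720p30, then 1080p60
# Whether the encoder (especially a hardware encoder) can actually produce this
# combination is a different question entirely.
```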
And where do you optimise the bandwidth adaptation? Only for the base layer? What if one participant gets packet loss on the base layer, while another gets packet loss only on the enhancement layers? A side note here, but equally relevant: hardware may also restrict the combinations you can offer. If you use hardware encoding in a camera, you may be able to get 720p30, 360p30, and 180p30 simultaneously, but if you want 1080p30, you may only be able to get 360p30 alongside it due to hardware limitations.
In theory, in a multiparty conference, each participant should only receive the base layer and the enhancement layers it needs (and note that in a switched conference it will receive a base layer and enhancement layers for multiple participants). Add the pillars of a scalable, elastic next-generation video architecture, and you need the network leg to each participant to adjust bandwidth based on network conditions and compensate for packet loss, and you need a simple way of prioritising the right streams.
So, as you may be starting to realise, the appealing simplicity of SVC may only be apparent. The problem of creating a scalable video architecture is multi-dimensional: you can optimise for one factor, for a handful, or try to strike a balance. But the basic idea of packaging multiple sub-parts of a video stream into a “stream bundle” is a key element of SVC (see RFC 6190 for details) that is very useful and can be applied regardless of how you optimise. You can either use layers, with dependencies between the sub-streams, or you can have individual streams that are each independent of the others but still use SVC packetization to bundle them together into one RTP/RTCP stream. We refer to this technique of bundling multiple qualities of the same source as simulcast. The fundamental technical trade-off is how much dependency to add between the sub-streams: the more dependency you have between the streams, the harder it is to support elasticity.
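To see why dependency hurts elasticity, here is a hedged Python sketch of what a switching server might do per network leg; the layer names, bitrates, and logic are illustrative, not taken from any product or RFC.

```python
def forward_layered(intact_layers):
    """Dependent layers: each layer needs all the layers below it."""
    usable = []
    for layer in ["base-360p15", "temporal-360p30", "spatial-720p30"]:
        if layer in intact_layers:
            usable.append(layer)
        else:
            break  # a gap breaks everything above it
    return usable

def forward_simulcast(intact_streams, leg_budget_kbps):
    """Independent streams: forward the best one that fits this leg's budget."""
    streams = {"360p15": 400, "720p30": 1200, "1080p60": 3500}  # illustrative kbps
    candidates = [s for s in intact_streams if streams[s] <= leg_budget_kbps]
    return max(candidates, key=lambda s: streams[s], default=None)

# Base layer lost on this leg: the dependent bundle has nothing left to forward.
print(forward_layered({"temporal-360p30", "spatial-720p30"}))            # []
# With independent streams, the leg simply gets the best stream that still fits.
print(forward_simulcast({"360p15", "720p30"}, leg_budget_kbps=1500))     # 720p30
```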
To understand this in more detail, let’s look at how a video stream is built up. A video stream can be categorised along three basic dimensions: resolution, quality, and motion. Resolution is simple: the picture has either low resolution fitting a small screen or high resolution fitting, let’s say, an HD screen. Resolution is measured by how many pixels fit vertically (counted as the number of lines), thus 180p (p stands for progressive, a detail not relevant to this discussion), 360p, 720p (HD), and 1080p (Full HD). Quality is a factor of how much information you have about each pixel. Simplified, it’s a bit like drawing with a pencil versus charcoal: with different amounts of pressure you can make a very accurate pencil drawing, while with charcoal you get more of a rough black-and-white drawing with less detail. Information about each pixel needs to be described and sent over the network, so the more information you have, the more bandwidth you use. (Side note: when taking raw data from a camera and creating information about a pixel, a process referred to as encoding, you also need more processing power and time to create detailed information.) Finally, motion is simply the number of frames shown in rapid succession per second. The more frames per second, the better and more natural the motion, but the more information is needed. To reduce the data needed to describe a pixel across many frames, the encoding process uses information about the previous frame and only describes the change from one frame to the next. If such delta information is lost in the network (packet loss), it will be difficult for the receiving end to construct the next frame, resulting in strange visual effects and artifacts on the screen.
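Here is a small, simplified illustration of that last point: when frames only carry deltas, one lost packet can make every following frame undecodable until the next keyframe arrives. The “I”/“P” frame model below is deliberately stripped down.

```python
def decodable_frames(frames, lost):
    """frames: list of 'I' (keyframe) or 'P' (delta frame); lost: set of indices."""
    ok, have_reference = [], False
    for i, kind in enumerate(frames):
        if i in lost:
            have_reference = False   # the chain of deltas is now broken
            continue
        if kind == "I":
            have_reference = True    # a keyframe restarts the chain
        if have_reference:
            ok.append(i)
    return ok

stream = ["I", "P", "P", "P", "P", "P", "P", "I", "P", "P"]
# Losing frame 2 also ruins frames 3-6, until the keyframe at index 7 recovers:
print(decodable_frames(stream, lost={2}))   # -> [0, 1, 7, 8, 9]
```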
SVC and simulcast are basically a set of technologies that allow us to slice and dice multiple resolutions (“spatial quality”), qualities (“fidelity quality”), and motions (“temporal quality”) from a single camera into a “single packaged stream bundle”. Back to the trade-off again: you can take a layered approach where you start out with a base layer, e.g. at 360p, and then add another layer describing what is missing to get to a higher resolution, let’s say 720p. Or you can basically send two separate video streams packaged together, one full 360p and one full 720p. The first approach gives you reduced bandwidth but less elasticity and flexibility, while the second approach increases the bandwidth but allows you more elasticity, including the ability to react to packet loss.
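Some back-of-the-envelope numbers make the trade-off tangible; the bitrates below are ones I picked for illustration, not measurements of any particular encoder.

```python
full_360p = 600    # kbps for a self-contained 360p stream (illustrative)
full_720p = 1500   # kbps for a self-contained 720p stream (illustrative)
enh_720p  = 1100   # kbps for a 720p enhancement layer on top of the 360p base

layered   = full_360p + enh_720p    # base + enhancement
simulcast = full_360p + full_720p   # two independent streams

print(f"layered bundle:   {layered} kbps")    # 1700 kbps
print(f"simulcast bundle: {simulcast} kbps")  # 2100 kbps
# The layered bundle is cheaper on the uplink, but if the base layer suffers
# packet loss the enhancement layer is useless; the simulcast 720p stream
# still decodes on its own.
```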
As stated earlier, this layering with dependency was initially what made SVC so appealing, as it is conceptually easy to understand and you seem to save a lot of bandwidth (see Jeff Schertz’s post for a great, detailed technical analysis). However, it was also something that people felt fitted really well with switched conferences (as opposed to transcoded conferences), because you could just switch through the layers you needed instead of decoding, scaling, composing, and then re-encoding (aka transcoding), and you would get both the best experience and the ability to support a heterogeneous set of participants.
Based on this, many people have suggested that the transcoded multiparty conference, or classic MCU, is dead. This sounds really great, as transcoding takes resources and thus costs more than switching. And yes, it is really great when the participants all have fairly similar screen sizes and user experience needs, when the underlying network is either very robust or everyone has the same conditions, when all the participants encode and decode SVC the same way, and when they all have fairly similar hardware capabilities, so they can all offer a fairly similar number of encoded streams and decode the same types of streams. But if you have H.264 AVC-only clients, or introduce H.265 to reduce bandwidth, or have both high-end meeting room systems and mobile phones, or have participants coming in from other networks, partners, customers, and the Internet, you need a more robust real-time video architecture with a combination of transcoding and switching, where some streams are transcoded and some are just switched. The reason is that with this hybrid multiparty conference approach, you can better optimise the stream packaging for the user experience and at the same time get media elasticity. Ironically, the more media elasticity you have, the more you can rely on a switched conference and avoid transcoding, thus making independent streams without base+enhancement layering more appealing. SVC’s dependency layering thus makes pure switched conferencing harder to implement (again, this is based on the assumption that your environment is not entirely based on a single vendor’s SVC implementation).
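A hedged sketch of the hybrid idea, per receiving participant: switch the stream straight through when the receiver can consume what the sender offers, and fall back to transcoding otherwise. The field names and capability checks below are illustrative only, not any product’s API.

```python
def plan_leg(sender, receiver):
    """Decide, for one sender-to-receiver leg, whether switching is enough."""
    codec_ok = sender["codec"] in receiver["decodes"]
    quality_ok = any(q in receiver["wants"] for q in sender["offers"])
    return "switch" if codec_ok and quality_ok else "transcode"

sender = {"codec": "H.264-SVC", "offers": ["360p15", "720p30"]}
receivers = [
    {"name": "soft client", "decodes": ["H.264-SVC", "H.264-AVC"], "wants": ["720p30"]},
    {"name": "AVC-only room system", "decodes": ["H.264-AVC"], "wants": ["1080p60"]},
]
for r in receivers:
    print(r["name"], "->", plan_leg(sender, r))
# soft client -> switch
# AVC-only room system -> transcode
```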
I know your head is spinning; so is mine… but let’s add a final dimension. In a switched conference, all participants will receive multiple, simulcast video streams from the other participants (typically the last four speakers), and the streams will be rendered and presented to the people in front of the endpoint based on the endpoint’s screen size, number of screens, and so on. Also, if you have one or more presentation sources in a meeting, or a meeting room with two or more cameras to cover a larger room, or maybe the whiteboard and the speaker at the same time, or an immersive video room with three HD cameras and three HD screens, then each of these streams from the same participant can be simulcast in many different qualities, and some of the streams may be turned off, muted, or need information about physical positioning in the room. We have now described a simulcast, multi-source, multi-screen conference experience (with the multi-source component covered by the TIP/CLUE standards). The next-generation multistream video architecture must support such a highly complex and dynamic audio/video collaboration experience, where each participant both encodes multiple qualities of multiple camera sources and at the same time receives multiple streams from multiple remote participants. In my post on elasticity, I touched on how new standards are required on every single layer of the conferencing stack to offer a great user experience across vendors. A tight and very interactive dialogue is necessary between call control development teams, endpoint teams, multiparty conference server teams, and network people to get the trade-offs right. A consequence of this is that only vendors with an end-to-end portfolio will be able to really take the innovation lead, as all components need to evolve in sync. We will probably not see entirely walled gardens, but there is a risk that we will continue down the path of limited interoperability using a gateway between vendors.
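To get a feel for the bookkeeping this implies (the structure below is my own illustration, not TIP or CLUE syntax), think of every participant advertising several sources in several qualities, and every receiver subscribing to the subset it can actually render:

```python
# What each participant advertises: sources, each available in several qualities.
advertised = {
    "room-A": {"camera-left":  ["180p15", "360p30", "720p30"],
               "camera-right": ["180p15", "360p30", "720p30"],
               "presentation": ["720p5", "1080p5"]},
    "mobile-B": {"camera": ["180p15", "360p15"]},
}

# What one dual-screen endpoint might subscribe to (last speakers plus content):
subscriptions = [
    ("room-A", "camera-left", "720p30"),
    ("room-A", "presentation", "1080p5"),
    ("mobile-B", "camera", "360p15"),
]

for participant, source, quality in subscriptions:
    assert quality in advertised[participant][source], "quality not offered"
    print(f"forward {participant}/{source} at {quality}")
```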
In my next post, I cover switched and transcoded conference bridges and why you are starting to see new hybrid bridges.