At the recent IETF meeting in Prague, one of the hot areas was RTCWEB (Real-Time Communication over the web). The short explanation is that the browser should become a full audio/video client. To make that possible, the browser needs access to hardware and OS resources so that a full, high-quality video client can run live inside it. This is of course something that Google (Google Talk) and Skype are very interested in enabling.
There are two technical areas of particular interest (skip to the last paragraph if you are not too technically inclined…). The first is the interface between the audio/video (RTC) client running in the browser (probably downloaded as part of a web site in the form of JavaScript or Java) and the browser itself. An RTC client must be able to request audio and video from the browser (and the browser must know how to access the microphone and camera in the OS). More importantly, for this to scale, the browser must be able to set up an audio/video session directly with another browser (that of the person you are calling), because the alternative, going through a server on the Internet, is costly. The current browser security model only allows a direct connection to the server from which you downloaded the web page containing the Java/JavaScript, so an important part of the work will be to figure out how to allow browser-to-browser sessions without introducing security issues. It seems that everybody agrees that the browser’s interface (API) used by the JavaScript/Java RTC client should be standardized, so that you can write one RTC client embedded in a web page and have it run in many different browsers. Specifying such an API is within the W3C’s domain, and the W3C is involved in the RTCWEB work.
The second interface is more controversial and makes for more interesting discussions, possibly with strategic consequences. The RTC client running inside the browser needs to communicate with a server or network to make and receive calls. It needs to register somewhere (“here I am, I’m ready to receive calls!”) and it needs to be able to make calls (“please connect me with the following address!”). The question is: should this interface be standardized, or should the RTC client just communicate with the central server using a proprietary protocol? The argument for the latter is that it will stimulate innovation, and that the API above and the way browser-to-browser media sessions are set up are the only elements that should be standardized. However, mechanisms like ICE are likely to be of use in setting up a session, and media parameters etc. must be communicated as part of that setup. So what has already been solved using SIP must be mapped into the protocol between the RTC client and the server that connects the calls, whether that protocol is proprietary or not. Although I’m sympathetic to the argument that using SIP between the RTC client and the server creates a lot of overhead if you only need a small part of it for your application, it seems a bit unnecessary that every RTC client implementer has to come up with a new way of setting up a session and exchanging the necessary parameters.
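To make the register/call exchange a bit more concrete, here is a minimal sketch in Python of what such a client-to-server interface has to cover, whatever protocol ends up carrying it. Everything in it is an assumption made for illustration: the `Registrar` class, the message dictionaries and the codec names are hypothetical and are not taken from SIP, ICE or any RTCWEB draft.

```python
from dataclasses import dataclass, field


@dataclass
class Registrar:
    """Toy in-memory call server (hypothetical, not a real protocol)."""

    # Maps a registered address to the codecs that client supports.
    clients: dict = field(default_factory=dict)

    def register(self, address: str, codecs: list) -> dict:
        # "Here I am, I'm ready to receive calls!"
        self.clients[address] = list(codecs)
        return {"type": "registered", "address": address}

    def call(self, caller: str, callee: str, offered_codecs: list) -> dict:
        # "Please connect me with the following address!"
        if callee not in self.clients:
            return {"type": "rejected", "reason": "not registered"}
        # Keep the caller's preference order, drop codecs the callee lacks.
        common = [c for c in offered_codecs if c in self.clients[callee]]
        if not common:
            return {"type": "rejected", "reason": "no common codec"}
        return {"type": "accepted", "from": caller, "to": callee, "codec": common[0]}
```

Even this toy version has to agree on media parameters before a call can be connected, which is the kind of problem the paragraph above describes as already solved using SIP.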
At a higher level, though, what if a JavaScript RTC client could leverage a full SIP stack in the browser? Doesn’t that mean that any high-school student could add value on top instead of writing a lot of low-level client/server code? If that functionality is not available, both client-side and server-side code must be developed before the RTC capabilities of the browser can be used. On the other hand, in order to use a SIP stack in the browser, you would need a SIP service running on your server. If you don’t have SIP in the browser, you have to invent your own client-server protocol, for example simple XML-based exchanges from the JavaScript client to your server to set up a session.
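As a rough illustration of how small such a home-grown XML exchange could be, here is a sketch in Python using the standard library's `xml.etree.ElementTree`. The `<call>` and `<codec>` element names and both helper functions are invented for this example; a real client would be written in JavaScript and would need to carry more than a list of codecs.

```python
import xml.etree.ElementTree as ET


def build_call_request(caller: str, callee: str, codecs: list) -> str:
    """Client side: serialize a call request (hypothetical message format)."""
    root = ET.Element("call", {"from": caller, "to": callee})
    for name in codecs:
        # One <codec> child per codec, in order of preference.
        ET.SubElement(root, "codec").text = name
    return ET.tostring(root, encoding="unicode")


def parse_call_request(xml_text: str) -> dict:
    """Server side: read the same message back into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "from": root.get("from"),
        "to": root.get("to"),
        "codecs": [codec.text for codec in root.findall("codec")],
    }
```

The catch is that both halves have to be written and kept in sync by whoever builds the service, which is exactly the client and server work a shared stack in the browser would avoid.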
Pondering this a bit, I get the feeling that leaving the RTC client-server interface unstandardized would spur innovation for small, light-weight applications, but that any real audio/video/IM client implementation would require quite a bit of work to get right. In reality, this favors larger vendors with lots of resources. As you can probably choose to adopt a light-weight client/server protocol even if there is a full SIP stack in the browser (or at least, it should be possible to design it in a way that makes this possible), I don’t really see the downside (beyond standardization body cycles) of standardizing how SIP should work between a browser and a web server, probably over port 443/80.
So, what are the consequences for companies like Cisco? We would really welcome video capabilities within the browser. For example, our PrecisionUSB camera could be used as a high-quality video source. Of course, having a large installed base, we would like to ensure compatibility with existing SIP systems and avoid having to interoperate through a gateway. Gateways are always expensive (especially media gateways) and reduce the functionality that can flow across them. I would love to see the possibility of creating a browser-based client, but if the quality ends up as a least common denominator (i.e., it is not possible to add codecs to improve quality or to extend beyond the APIs/protocols to improve functionality), I feel we haven’t gained much. Indeed, enabling a basic real-time communication experience through free services would be beneficial to everybody, but if we do it in a way that allows high-school students as well as large corporations to innovate on top of it, I believe we have much more to gain.