The document discusses considerations for server-side WebRTC infrastructure. It describes how WebRTC uses STUN and TURN servers to handle NAT traversal so clients can establish direct peer-to-peer connections. However, media servers and WebRTC gateways are also important to provide value-added functions like conferencing, recording, transcoding and interoperating WebRTC with existing VoIP networks. The document compares different approaches for multi-party video, including mesh, MCU, SFU and simulcast, and how servers can optimize resource usage for large scale conferencing.
I have been involved with WebRTC for more than 3 years.
I was formerly at Acme Packet, where I worked on their WebRTC launch.
After the Oracle acquisition I later worked with Doug and the Oracle Communications team on their WebRTC Session Controller.
I have been at Dialogic for 16 months focused on WebRTC and their media server business
In addition, I am a blogger and editor at webrtcHacks – a blog for WebRTC developers that features technical content, demos, code, and commentary for the developer community.
It has grown much more popular than I ever imagined with more than 20,000 visitors a month.
Dialogic has more than 25 years of history providing telephony infrastructure and enabling technology.
Dialogic’s portfolio includes:
Rich media processing boards, enabling technology and platforms like our software media server
Class 4 softswitches and gateways
And mobile signaling products
Today I am going to talk about server-side infrastructure for WebRTC.
This includes
Signaling servers
Servers to help with NAT traversal
Media servers for processing media
And Gateways for interconnecting to existing networks
Let’s start with signaling servers.
WebRTC is often called a peer-to-peer technology.
This is not entirely true.
While WebRTC media often is delivered in a peer-to-peer architecture, a server is required to help setup the initial connection.
WebRTC standards today have very basic requirements for signaling.
The only thing the signaling system needs to do is relay Session Description Protocol – or SDP.
SDP is an existing format, carried over from the SIP world, used to negotiate the parameters of the media session.
Of course most real-world systems require more complex signaling to handle other functions such as:
User identification and authentication
Access control and security
Push notifications to help conserve battery on mobile devices
Federation to interwork with other identity and authentication systems
And many, many other features your particular application might already have or need to be a real service
These items are beyond the scope of WebRTC, but certainly not beyond the scope of what is needed in many applications.
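To make the point that the signaling layer's only mandatory job is relaying SDP, here is a minimal in-memory relay sketch. It is purely illustrative (the class name and callback shape are my own); a real deployment would sit behind WebSockets with authentication, but the core relay logic is this small.

```python
# Minimal signaling relay: its only required job is to shuttle SDP
# offers/answers (and ICE candidates) between named peers, untouched.
# Hypothetical sketch -- not a production signaling server.

class SignalingRelay:
    def __init__(self):
        self.peers = {}  # peer_id -> delivery callback (e.g. a socket send)

    def register(self, peer_id, deliver):
        """Register a peer with a callback that delivers messages to it."""
        self.peers[peer_id] = deliver

    def relay(self, sender, recipient, message):
        """Forward an SDP offer/answer or ICE candidate without inspecting it."""
        if recipient not in self.peers:
            raise KeyError(f"unknown peer: {recipient}")
        self.peers[recipient]({"from": sender, **message})

# Usage: Alice sends Bob an offer; the relay never parses the SDP body.
inbox = []
relay = SignalingRelay()
relay.register("bob", inbox.append)
relay.relay("alice", "bob", {"type": "offer", "sdp": "v=0\r\n..."})
```

Everything else on the list above (identity, access control, push, federation) is application logic layered around this relay, which is exactly why WebRTC leaves it unspecified.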
Next I will transition to the NAT traversal problem.
It is obviously not ok to ask your users to change their network connection or adjust their browser to make a call go through.
Let’s do a quick recap of how NATs work, and why this is a problem for VoIP.
NATs take one address space and convert it to another
10.10.1.1 to 200.2.20.2 in the left-side example above
In order for 2 points to communicate with each other they need to know the address.
The challenge in this case is which address do you use? The one on the local NAT or the external one?
The client only knows its local address, but it needs to know its external address so the other client knows how to reach it.
This is not a new problem to Voice Over IP systems.
Existing VoIP systems largely use SBCs to deal with this by relaying the media through the SBC
and using the SBC's intelligence to figure out the addresses.
WebRTC deals with it in a new, and very different, way using a protocol known as Interactive Connectivity Establishment (ICE).
ICE requires two kinds of servers – STUN and TURN.
STUN stands for Session Traversal Utilities for NAT.
This technique is simple.
The client sends a STUN message to a STUN server.
The STUN server responds with the external IP address it sees.
That way the client knows both what its internal and external IP addresses are.
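The STUN exchange above is just a tiny binary protocol. The sketch below builds a Binding request header and encodes/decodes the XOR-MAPPED-ADDRESS attribute that carries the external address back, following the RFC 5389 wire format (the helper names are my own; real clients would also send this over UDP and validate the transaction ID).

```python
import struct, socket, os

MAGIC_COOKIE = 0x2112A442  # fixed value from RFC 5389

def build_binding_request():
    """STUN Binding request: type 0x0001, zero-length body, 20-byte header."""
    txn_id = os.urandom(12)
    return struct.pack("!HHI", 0x0001, 0, MAGIC_COOKIE) + txn_id

def encode_xor_mapped_address(addr, port):
    """Encode an IPv4 XOR-MAPPED-ADDRESS attribute value."""
    xport = port ^ (MAGIC_COOKIE >> 16)
    xaddr = struct.unpack("!I", socket.inet_aton(addr))[0] ^ MAGIC_COOKIE
    return struct.pack("!BBH", 0, 0x01, xport) + struct.pack("!I", xaddr)

def decode_xor_mapped_address(attr_value):
    """Decode the attribute the server returns. The port and address are
    XORed with the magic cookie so NATs can't rewrite them in transit."""
    _, family, xport = struct.unpack("!BBH", attr_value[:4])
    assert family == 0x01  # IPv4 only in this sketch
    port = xport ^ (MAGIC_COOKIE >> 16)
    xaddr, = struct.unpack("!I", attr_value[4:8])
    addr = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
    return addr, port
```

The XOR step exists because some NATs rewrite any bytes that look like their own IP address; obscuring the address keeps the response intact.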
STUN is simple, lightweight, and very inexpensive to operate.
However, some firewalls are very restrictive and STUN does not always work.
In these scenarios you need a TURN server.
TURN stands for Traversal Using Relays around NAT.
TURN acts a lot like an SBC, relaying media.
The rule of thumb for TURN is that it is needed 10 to 15% of the time.
This really varies depending on the network and environment you are in.
Unlike STUN, TURN consumes a lot of bandwidth relaying media, so it is more expensive to operate.
If not engineered properly, TURN can also increase latency, hurting voice quality.
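To make the cost difference concrete, here is a back-of-envelope TURN sizing calculation. All the input numbers are assumptions for illustration, not measurements; only the 10-15% rule of thumb comes from the text above.

```python
# Back-of-envelope TURN capacity sizing (assumed figures, not benchmarks).
concurrent_calls = 1000
relay_fraction = 0.15          # the 10-15% rule of thumb
call_bitrate_mbps = 1.0        # per direction, per video call (assumed)

relayed_calls = concurrent_calls * relay_fraction
# TURN forwards both directions, so each relayed call costs ~2x the bitrate.
relay_bandwidth_mbps = relayed_calls * call_bitrate_mbps * 2

print(relay_bandwidth_mbps)  # 300.0 Mbps through the TURN server
```

A STUN-only deployment for the same 1,000 calls handles only a few small request/response packets per call, which is why STUN is cheap and TURN is the line item to engineer for.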
Now let’s transition to some of the most difficult challenges of WebRTC – dealing with media.
As we discussed in the previous segments, media in WebRTC is normally sent directly between peers.
However, media can also be relayed by a server as we just showed in the TURN example.
There are many reasons beyond TURN for requiring a media server. These include:
Traditional video conferencing multi-point control unit (MCU) for bridging multiple parties
Transcoding from one audio or video codec to another
Interworking WebRTC media with standard VoIP media
Recording a stream or conversation
Analyzing or processing a stream in real time, such as inserting an image or video, performing call analytics, or simply adding DTMF
Any kind of person-to-machine or machine-to-machine interaction that might not involve another person at all, like today's IVRs and speech recognition systems or the emerging computer vision systems for future applications
One advantage of today’s fast processors and the web model is that processing can be done in the client or server in many cases.
However, there are important trade-offs.
For example, bandwidth is not always ubiquitous or free – especially in mobile environments.
Server-side media processing can help reduce bandwidth requirements for clients.
In addition, CPU is often expensive.
This is especially true on mobile devices where CPU processing usually means high battery consumption.
Again, aggregating some or all this processing in the cloud is often a more efficient and user friendly method.
To give an example, let’s talk about parties:
Multi-party video conferences.
In the typical WebRTC mesh design, an additional bi-directional stream is added for each party.
Each end-point must fully encode and decode the stream for each party.
This actually works very well if there are only a couple of parties – usually not more than 3 or 4.
However, this methodology quickly fails when you add multiple parties.
The clients quickly become overloaded and you run out of bandwidth.
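The scaling failure is easy to see with a little arithmetic: in a mesh with n parties, each client must encode and upload n-1 streams (and decode n-1 more), and the conference as a whole carries n*(n-1) directed streams.

```python
# Why mesh breaks down: per-client and total stream counts grow
# with the party count. Simple counting, no assumptions beyond
# one bi-directional stream per pair of participants.

def mesh_streams(n):
    per_client = n - 1      # streams each participant must encode/upload
    total = n * (n - 1)     # directed streams across the whole conference
    return per_client, total

for n in (3, 5, 10):
    print(n, mesh_streams(n))
# 3 parties -> 2 uplinks each, 6 total; 10 parties -> 9 uplinks each, 90 total
```

Nine simultaneous video encodes on a laptop or phone, plus nine uplink streams, is where the "overloaded client, exhausted bandwidth" failure comes from.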
The better approach is to centralize and mix all the media in an MCU and send a single stream, or a small subset of streams, to each device.
This is very client friendly since each client only gets one adapted stream for its specific capabilities.
The downside of the MCU approach is that it is very processor intensive on the server, especially when dealing with HD video.
The reason is each stream needs to be individually encoded and decoded.
A more efficient, higher-capacity approach is a technique we call encoder sharing.
If several devices are receiving the same stream, rather than fully encode each one, you can dramatically increase capacity by encoding only once and sharing that stream.
Since encoding requires significantly more processing than decoding, we have found this can increase capacity by 30 to 50%.
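A toy cost model shows where the saving comes from. The relative cost units below are assumptions for illustration (encoding taken as 3x the CPU of decoding); the real 30-50% gain cited above depends on the actual stream mix and codecs.

```python
# Rough model of encoder sharing (assumed relative CPU units, not benchmarks).
ENCODE_COST, DECODE_COST = 3.0, 1.0

def mcu_cost(receivers, shared):
    """CPU to serve one source stream to `receivers` viewers."""
    decode = DECODE_COST                      # decode the source once
    encodes = 1 if shared else receivers      # shared: encode once, fan out
    return decode + encodes * ENCODE_COST

naive, shared = mcu_cost(4, False), mcu_cost(4, True)
print(naive, shared)  # 13.0 vs 4.0 in this toy model
```

The catch is that sharing only applies when several receivers can accept the identical encoded stream (same codec, resolution, and bitrate), so the realized gain is smaller than the toy model suggests.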
A newer approach is known as a Selective Forwarding Unit (SFU).
In this architecture, each client sends only one stream to the SFU.
The SFU then redirects the stream to only the end points that want to see it.
The main task for the SFU is managing the encryption and decryption of the streams.
No server-side encoding or decoding is required, so the SFU can handle a lot of clients.
An enhancement to this approach is known as simulcast.
Rather than just sending one stream, each client sends 2 or more streams – usually one high bitrate and one low bitrate.
Oftentimes the high-bitrate stream – i.e. HD video – is forwarded only for the active talker, and the low-bitrate streams are forwarded for the others.
If a low power or bandwidth limited device is connected then the SFU can forward just the low-bitrate stream.
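The forwarding decision an SFU makes per receiver can be sketched in a few lines. This is a hypothetical selection rule of my own, capturing only the policy described above (high bitrate for the active talker, low bitrate otherwise or for constrained receivers), with no actual media handling.

```python
# Hypothetical simulcast layer selection for an SFU. Each sender
# uploads a "high" and a "low" bitrate stream; the SFU picks which
# one to forward to each receiver. Sketch of the policy only.

def select_layer(sender, active_talker, receiver_constrained):
    """Pick which simulcast layer of `sender` to forward to a receiver."""
    if receiver_constrained:
        return "low"                    # low-power or bandwidth-limited device
    return "high" if sender == active_talker else "low"

# Usage: the active talker goes out in HD, everyone else in thumbnail quality.
print(select_layer("alice", "alice", False))  # high
print(select_layer("bob", "alice", False))    # low
```

Because the SFU only picks between pre-encoded streams, it keeps the "no server-side transcoding" property that makes the SFU model scale.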
In fact, this is how Google Hangouts works today when you use it with Chrome.
We had a recent blog post on webrtcHacks that reverse engineers hangouts to see how it works.
It is interesting to see Google needed to implement a lot of proprietary mechanisms to make simulcast work.
And this is the main drawback of this approach today: there is no standard way to do it yet, which is why it only works with Google Chrome and others cannot easily replicate what Google has done.
There is one additional approach called Scalable Video Coding or SVC.
Like simulcast, SVC sends multiple streams of varying quality from each client and a centralized SFU does the routing.
Unlike simulcast where independent streams are sent, SVC uses a layering approach in a single stream.
Like simulcast, the mechanisms for signaling the SFU are not standardized and wide-scale, WebRTC-based systems have yet to emerge.
Popular WebRTC blogger Tsahi Levant-Levi of bloggeek.me actually made a nice summary of this topic in a whitepaper he wrote that you can download for reference.
That covers conferencing, but I also wanted to touch on Recording too.
I think this is best illustrated with a case study.
We have actually seen a lot of demand for various video recording solutions.
In this example, a cable service provider wanted to leverage their set top boxes to allow anyone to easily stream and record videos from either a mobile app or the set top box. Many of their older customers are not big smart phone users and prefer the set top box interface.
The WebRTC recording solution fit in with their movement to a web-oriented architecture and provided a lot of flexibility.
They have a diverse network infrastructure with many set top boxes only supporting older codecs, so there was also a transcoding need.
The last topic I would like to discuss is Gateways.
There are a lot of components that are involved with WebRTC gateways:
The STUN and TURN server pieces we discussed earlier
Another piece is what we call the HTTP-to-SIP (H2S) component
this converts whatever proprietary web signaling mechanism is used to SIP and back.
Some groups have started to look at standards around this piece, but there is no strict standard definitions for how this should be done today.
The next piece is the Media Gateway
this handles WebRTC’s mandatory DTLS encryption and converts it to SDES or no encryption.
It also helps with some port multiplexing techniques WebRTC uses to aid with NAT traversal
Next is the transcoder - this converts codecs commonly used in WebRTC (OPUS and VP8) to codecs more commonly used in existing VoIP systems
Most existing VoIP systems also have some sort of SBC to help with SIP security protections and SIP header interworking.
API Gateway
Also, much like the SBC for SIP, the web interfaces need some kind of control interface.
This is usually accomplished via an API gateway that controls access to the API calls that a client can make.
Lastly, unlike a SIP system, since there is no standard signaling there is no such thing as a standardized client.
Therefore, the WebRTC Gateway needs to provide a client interface – usually via an SDK or widget for web development environments.
Similarly, an additional SDK is usually needed for mobile environments
It is important to note that there is no standard way of configuring these elements.
Deployment models will vary considerably based on the size of the deployment, the vendors involved, and the equipment that is already there.
I showed many different server examples. To conclude, I would like to show you a view of a real-world WebRTC architecture from a major US service provider.
Some key features:
As traffic increases, it makes more sense to specialize some elements – like the Secure WebSocket (WSS) server to help handle WebSocket-based signaling
Multiple app servers exist – they communicate with each other as needed using REST API’s
Identity servers handle OAuth, OpenID, and local ID authentication
STUN/TURN is used to traverse strict firewalls and NATs
An API manager controls all ‘signaling’ communication into the network and protects the web service core from attacks
Firewalls handle generic, non-service-specific attacks and port scanning
In conclusion, there are at least 5 kinds of servers that are directly WebRTC related.
As I just mentioned, multiple kinds of servers does not necessarily mean they are packaged and sold that way.
In addition, there are many other servers – like web servers and identity servers – that are often already present and can be leveraged.
While this seems complex, often these elements are evolutions of existing VoIP gear.