Building a low-latency live streaming platform is certainly possible nowadays, but not without its challenges and compromises. Traditionally, HLS-style delivery keeps latency high because chunks are between 5 and 10 seconds long. Pushing that down to 2 seconds, generally seen as the minimum viable chunk size, can then cause problems estimating bandwidth, thus breaking ABR.
Tackling these challenges are a host of technologies such as CMAF, Low-Latency HLS (LHLS) and Apple’s LLHLS, but this talk takes a different approach to deliver streams with only 3-4 seconds of latency.
Michelle Munson from Eluvio explains that theoretically you could stream chunks in realtime and the delay would be the propagation time over the internet. In reality, though, encoding and transcoding delays add up, plus the CDN can gradually add drift, taking the signal up to 15 seconds behind. ABR is tricky when delivering chunks in a streamed manner because the standard method of determining available bandwidth, measuring the download time, breaks when all chunks arrive in real time.
Tackling this, Michelle introduces the decentralised fabric which Eluvio have put together. It uses dispersed nodes to hold data, acting in some ways as a CDN, but the trick here is that the nodes work together to share video. Each node can transcode just in time and can also create playlists on demand from the distributed metadata in response to client requests. Being able to bring things together dynamically and on the fly removes a lot of latency pinch points from the system.
The result is a system which can deliver content from the encoder to the nodes in around 250ms, then a further 50ms or so for distribution. To make ABR easier, the player works one segment behind live so it always has a whole segment to download as quickly as it can, thus enabling ABR to work normally.
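To see why a whole, already-complete segment matters for ABR, here is a minimal sketch of download-time bandwidth estimation. It is not Eluvio's implementation; the function names, the bitrate ladder and the 0.8 safety factor are all assumptions made for illustration. If a 2-second segment trickles in over 2 seconds, the measurement only ever reports the encoded bitrate; if the segment is fully available, the download time reflects the real network.

```python
def estimate_throughput_bps(segment_bytes: int, download_seconds: float) -> float:
    """Throughput in bits per second from one completed segment download."""
    if download_seconds <= 0:
        raise ValueError("download_seconds must be positive")
    return segment_bytes * 8 / download_seconds


def choose_rendition(ladder_bps: list[int], throughput_bps: float, safety: float = 0.8) -> int:
    """Pick the highest bitrate that fits within a safety fraction of the throughput.

    Falls back to the lowest rendition when nothing fits.
    """
    budget = throughput_bps * safety
    candidates = [b for b in sorted(ladder_bps) if b <= budget]
    return candidates[-1] if candidates else min(ladder_bps)


# Hypothetical ladder: 0.8, 2.5 and 5 Mbps renditions.
ladder = [800_000, 2_500_000, 5_000_000]
# A 1 MB segment fetched in half a second suggests roughly 16 Mbps is available.
measured = estimate_throughput_bps(1_000_000, 0.5)
print(choose_rendition(ladder, measured))
```

Real players usually smooth this estimate over several segments rather than trusting a single download.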
Michelle finishes by highlighting the results of testing both over time and at scale. The results show that node load stayed low and even in both scenarios, delivering 3.5 seconds of latency.
If you’ve ever tried to implement your own player, you’ll know there’s a big gap between understanding the HLS/DASH spec and getting an all-round great player. Finding the best, most elegant, ways of dealing with problems like buffer exhaustion takes thought and experience. The same is true for low-latency playback.
Fortunately, Akamai’s Will Law is here to give us the benefit of his experience implementing his own and helping customers monitor the performance of their players. At the end of the day, the player is the ‘kingpin’ of streaming, comments Will. Without it, you have no streaming experience. All other aspects of the stream can be worked around or mitigated, but if the player’s not working, no one watches anything.
Will’s first tip is to implement ‘segment abandonment’. This is when a video player foresees that downloading the current segment is taking too long; if it continues, it will run out of video to play before the segment has arrived. A well-programmed player will spot this and try to continue the download of this segment from another server or CDN. However, Will says that many will simply continue to wait for the download and, in the meantime, the download will fail.
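The decision behind segment abandonment can be sketched in a few lines. This is an illustrative heuristic rather than anything shown in the talk: the function name, its parameters and the half-second safety margin are assumptions. It projects the remaining download time from the rate seen so far and compares it with the media still buffered.

```python
def should_abandon(
    bytes_received: int,
    bytes_total: int,
    elapsed_seconds: float,
    buffer_seconds: float,
    margin_seconds: float = 0.5,
) -> bool:
    """Return True when the current download is unlikely to finish before the buffer runs dry.

    buffer_seconds is the media currently buffered ahead of the playhead.
    """
    if bytes_received <= 0 or elapsed_seconds <= 0:
        # No rate to extrapolate from yet: give up only once the buffer is nearly empty.
        return buffer_seconds <= margin_seconds
    rate = bytes_received / elapsed_seconds              # bytes per second so far
    remaining_seconds = (bytes_total - bytes_received) / rate
    return remaining_seconds > buffer_seconds - margin_seconds


# A quarter of a 1 MB segment after 1 second, with only 2 seconds buffered: abandon.
print(should_abandon(250_000, 1_000_000, 1.0, 2.0))
```

On a True result the player would retry the segment from another server or CDN, or drop to a lower rendition.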
Tip two is about ABR switching in low-latency, chunked transfer streams. The playback buffer needs to be longer than the chunk duration. Without this precaution, there will not be enough time for the player to make the decision to switch down layers. Will shows a diagram of how a 3-second playback buffer can recover as long as it uses 2-second segments.
Will’s next two suggestions concern startup. The first is to put your initial chunk in the manifest by base64-encoding it. This makes the manifest larger but removes the round-trip which would otherwise be used to request the chunk. This can significantly improve the startup performance as the RTT could be a quarter of a second, which is a big deal for low-latency streams and anyone who wants a short time-to-play. Similarly, advises Will, make those initial requests in parallel. Don’t wait for the init file to be downloaded before requesting the media segment.
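Both startup ideas can be illustrated briefly. The sketch below assumes the embedded chunk is carried as a base64 data URI and that the caller supplies its own `fetch` function; the helper names are hypothetical, and whether a particular player accepts a data URI in its manifest is something to verify for your own stack.

```python
import base64
from concurrent.futures import ThreadPoolExecutor


def init_segment_data_uri(init_bytes: bytes, mime_type: str = "video/mp4") -> str:
    """Encode an init segment as a base64 data URI that can sit inside a manifest."""
    encoded = base64.b64encode(init_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def decode_data_uri(uri: str) -> bytes:
    """Player side: recover the init segment bytes without a network request."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URI")
    return base64.b64decode(payload)


def fetch_in_parallel(fetch, urls):
    """Issue all requests at once; results come back in the order of `urls`."""
    if not urls:
        return []
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        return list(pool.map(fetch, urls))


# A stand-in 300-byte init segment; base64 grows it by a third.
uri = init_segment_data_uri(b"\x00" * 300)
print(len(uri))
```

The size penalty is the usual base64 overhead of about 33%, which is small next to a saved round-trip for a typical init segment.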
Whilst many of the points in this talk focus on the player itself, Will says it’s wise for the player to provide metrics back to the CDN, hidden in the request headers or query args. This data can help the CDN serve media smarter. For instance, the player could send over the segment duration to the CDN. Knowing how long the segment is, the CDN can compare this to the download time to understand if it’s serving the data too slowly. Perhaps the simplest idea is for the player to pass back a GUID which the CDN can put in the logs. This helps identify which of the millions of lines of logs are relevant to your player so you can run your own analysis on a player-by-player level.
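As a rough illustration of passing hints in query args, the sketch below appends a session GUID and the segment duration to a segment URL. The parameter names `sid` and `segdur`, the function name and the example URL are all hypothetical; a real deployment would use whatever keys the CDN has agreed to log or act on.

```python
import uuid
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_player_hints(url: str, session_id: str, segment_duration_s: float) -> str:
    """Append player-side hints to a segment URL, keeping any existing query args."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("sid", session_id))                     # GUID the CDN can write to its logs
    query.append(("segdur", f"{segment_duration_s:g}"))   # segment duration in seconds
    return urlunsplit(parts._replace(query=urlencode(query)))


# One GUID per playback session, reused on every request the player makes.
session_id = str(uuid.uuid4())
print(add_player_hints("https://cdn.example.com/live/seg42.m4s?token=abc", session_id, 2.0))
```

With the GUID in every log line, filtering the CDN logs down to a single playback session becomes a simple string match.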
Will’s other points include advice on how to avoid starting playing at the lowest bandwidth and working up. This doesn’t look great and is often unnecessary. The player could run its own speed test or the CDN could advise based on the initial requests. He advises never trusting the system clock; use an external clock instead.
Regarding playback latency, it pays to be wise when starting out. If you blindly start an HLS stream, then your latency will be variable within the duration of a segment. Will advocates HEAD requests to try to see when the next chunk is available and only then starting playback. Another technique is to vary your playback rate so you can ‘catch up’. The benefit of using rate adjustment is that you can ask all your players to be at a certain latency behind realtime so they can be close to synchronous.
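Rate adjustment is easy to picture as a small proportional controller. The sketch below is an assumption-laden illustration, not the logic of any particular player: the gain, the 10% rate limit and the dead band are made-up values chosen to keep the speed change subtle.

```python
def catch_up_rate(
    current_latency_s: float,
    target_latency_s: float,
    gain: float = 0.05,
    max_adjust: float = 0.1,
    dead_band_s: float = 0.1,
) -> float:
    """Return a playback rate that nudges the player towards the target latency.

    Above 1.0 speeds up (player is too far behind live); below 1.0 slows down.
    """
    error = current_latency_s - target_latency_s
    if abs(error) <= dead_band_s:
        return 1.0  # close enough: play at normal speed
    adjust = max(-max_adjust, min(max_adjust, gain * error))
    return 1.0 + adjust


# Five seconds behind live with a three-second target: play about 10% fast.
print(catch_up_rate(5.0, 3.0))
```

If every player is given the same target latency, they all converge on roughly the same point behind live, which is what makes near-synchronous playback possible.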
Two great tips which are often overlooked: Request multiple GOPs at once. This helps open up the TCP windows giving you a more efficient download. For mobile, it can also help the battery, allowing you to more efficiently cycle the radio on and off. Will mentions that when it comes to GOPs, for some applications it’s important to look at exactly how long your GOP should be. Usually aligning it with an integer number of audio frames is the way to choose your segment duration.
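The alignment arithmetic is worth working through once. The sketch below assumes audio frames of 1024 samples (typical for AAC, but an assumption here) and finds the shortest duration containing a whole number of both video and audio frames; the function name is hypothetical. It needs Python 3.9 or later for `math.lcm`.

```python
from fractions import Fraction
from math import gcd, lcm


def shortest_aligned_duration(
    fps: Fraction, sample_rate: int, samples_per_audio_frame: int = 1024
) -> Fraction:
    """Shortest duration (seconds) holding a whole number of video AND audio frames."""
    video = 1 / Fraction(fps)                               # one video frame, in seconds
    audio = Fraction(samples_per_audio_frame, sample_rate)  # one audio frame, in seconds
    # LCM of two fractions in lowest terms: lcm of numerators over gcd of denominators.
    return Fraction(
        lcm(video.numerator, audio.numerator),
        gcd(video.denominator, audio.denominator),
    )


# 25 fps video with 48 kHz audio aligns every 8/25 s: 8 video frames, 15 audio frames.
unit = shortest_aligned_duration(Fraction(25), 48000)
print(unit, float(unit * 6))
```

Any multiple of that unit keeps segment boundaries clean, which is why durations such as 1.92 seconds (six units) turn up for 25 fps content.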
The talk finishes with an appeal to move to using CMAF containers for streaming as they allow you to deliver HLS and DASH streams from the same media segments and move to a common DRM. Will says that CBCS encrypted content is now becoming nearly all-pervasive. Finally, Will gives some tips on how players are best to analyse which CDN to use in multi-CDN environments.
Is it possible to improve on CMAF’s offer of an ultra-low-latency, scalable protocol with good viewer experience? This is what HESP, the High-Efficiency Streaming Protocol, promises. With almost instant channel change times and sub-second latency, it’s worth taking a look at this protocol created by THEOPlayer to understand where it might work in your workflows.
Presented by Pieter-Jan Speelmans and Johan Vounckx from THEO, we hear some more detail surrounding HESP’s inception. Quality, latency and bitrate are often referred to as a triangle where if you improve one or even two, the remaining factor will get worse to compensate. HESP plays in the triangle connecting ‘viewer experience’, ‘low latency’ and ‘scalability’. If you compare WebRTC with CMAF, you see that WebRTC prioritises low-latency streaming but suffers in terms of scalability. CMAF, being 2-5 seconds higher latency, has much better scalability but the channel zapping times are high which affects viewer experience as well as overall latency. HESP, contends Pieter-Jan, actually improves all three. It’s able to do this because it’s not extending existing protocols which weren’t designed to meet all these requirements; rather, it’s bringing in new techniques which shift the whole equation.
THEOPlayer has created the HESP Alliance which is devoted to standardising the HESP technology through the IETF or another avenue, promoting adoption through marketing and the creation of tools, certification and management of intellectual property. The talk outlines the decoder royalties which can be payable per subscriber, per subscriber per hour, or per device.
Looking at the technical details, we find out that you can actually start playing an HESP stream without downloading the manifest. While HESP does have manifest files, they change very infrequently. If one is changed at short notice, the server can ask players to download the new one by embedding a message in the stream. The channel zapping speed is achieved using two streams, an initialisation stream and a continuation stream. The initialisation stream contains just I and P frames, allowing you to start playing immediately. The continuation stream is intended to be the low-bitrate stream used after the establishment of the stream.
HESP uses two modes: Maximal Gain and Maximal Compatibility. Maximal Gain aims to have the lowest latency, lowest bandwidth and lowest zapping times. It has long segments with 1-frame chunks containing one I or P frame. The Maximal Compatibility mode, however, allows you to reuse Low-Latency DASH and LLHLS streams and uses 6-second segments with 200msec chunks including B frames.
THEOPlayer claim 7x less delivery delay, 20x lower zapping times and a 20% bandwidth saving over CMAF, with broad compatibility across many TVs, Android, iOS, web and streaming devices.
AV1 seems to be shaking off its reputation for slow encoding, now only 2x slower than HEVC. How practical, then, is it to put AV1 into a real-time codec aiming for sub-second latency? This is exactly what the Alliance for Open Media are working on as parts of AV1 are perfectly suited for the use case.
Dr Alex from CoSMo Software took the podium at the Alliance for Open Media Research Symposium to lay out the whys and wherefores of updating WebRTC to deliver AV1. He started by outlining the different requirements of real-time vs VoD. With non-live content, encoding time is often unrestricted allowing for complex encoding methods to achieve lower bitrates. Even live CMAF streams aiming to achieve a relatively low 3-second latency have time enough for much more complex encoding than real-time. Encoding, ingest, storage and delivery can all be separated into different parts of the workflow for VoD, whereas real-time is forced to collapse logical blocks down as much as possible. Unsurprisingly, Dr Alex outlines latency as the most important driver in the WebRTC use case.
When streaming, ABR isn’t quite as simple as with chunked formats. The different bit rate streams need to be generated at the encoder to save any transcoding delays. There are two ways of delivering these streams. One is to deliver them as separate streams, the other is to deliver only one, layered stream. The latter method is known as Scalable Video Coding (SVC) which sends a base layer of a low-resolution version of the video which can be decoded on its own. Within that stream is also the information which builds on top of that video to create a higher-resolution version of the same stream. You can have multiple layers and hence provide information for 3, 4 or more streams.
Managing which streams get to the decoder is done through an SFU (Selective Forwarding Unit) which is a server to which WebRTC clients connect to receive just the stream, or parts of a stream, they need for their current bandwidth capability. It’s important to remember that, compared to video conferencing solutions based on WebRTC, streaming using WebRTC scales linearly. Whilst it’s difficult to hold a meeting with 50 people in a room, it’s possible to optimise what video is sent to everyone by only showing the last 5 speakers in full resolution, the others as thumbnails. Such optimisations are not available for video distribution, rather SFUs and media servers need to be scaled and cascaded. This should be simple, but testing can be difficult, and it’s necessary to ensure quality and network resilience at scale.
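The forwarding decision an SFU makes for a layered stream can be sketched very simply. The following is a toy model, not the behaviour of any real SFU: the `Frame` type, the function names and the bitrates are hypothetical, and each frame is assumed to carry a layer index with 0 as the base layer. The idea is only that the server picks the highest layer a client's bandwidth allows and drops everything above it without transcoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    layer: int       # 0 is the base layer; higher layers refine it
    size_bytes: int


def highest_layer_for(cumulative_bps: list[int], available_bps: float) -> int:
    """cumulative_bps[i] is the bitrate needed to decode layers 0..i (ascending)."""
    top = 0  # the base layer is always forwarded
    for layer, needed in enumerate(cumulative_bps):
        if needed <= available_bps:
            top = layer
    return top


def forward(frames: list[Frame], top_layer: int) -> list[Frame]:
    """Keep only the frames a client at `top_layer` can use."""
    return [f for f in frames if f.layer <= top_layer]


# Three layers needing 0.3, 1.0 and 2.5 Mbps in total; the client has 1.2 Mbps.
top = highest_layer_for([300_000, 1_000_000, 2_500_000], 1_200_000)
stream = [Frame(0, 100), Frame(1, 200), Frame(2, 400), Frame(0, 100)]
print(top, len(forward(stream, top)))
```

Because the choice is made per client, each viewer can receive a different subset of the same encoded stream.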
Cisco have already demonstrated the first real-time AV1-based WebRTC system, though without SVC support. Work is ongoing to deliver improvements to RTP encapsulation of AV1 in WebRTC. One example is Decoding Target Information, which embeds information about frames that can be read without needing to decode the video itself. This information explains how important each frame is and how it relates to the rest of the video. Such metadata can be used by the SFU or the decoder to understand which frames to drop and which to send or decode.
Dr Alex Gouaillard
Video Codec Working Group – Real-time subgroup, Alliance for Open Media
Founder, Director & CEO, CoSMo Software Consulting Pte. Ltd.
Co-founder & CTO, Millicast