Delivering streaming at scale exposes the weaknesses at every point of your workflow, so even for those of us who are not streaming at maximum scale, there are many lessons to be learnt. CBS Sports Digital delivered the Super Bowl using the principles of ‘practice, practice, practice’, keeping the solution as simple as possible and putting the mitigation of problems ahead of solving them.
Taylor Busch walks us through their solution, explaining how it supported their key principles and highlighting the technology used. Starting with acquisition, he covers the SDI fibre delivery to a backup facility as well as the AWS Direct Connect links for their Elemental Live encoders. The origin servers were in two different regions and both received data from both sets of encoders.
CBS used ‘output locking’, which ensures that the TS segments are aligned even across different encoders. It works by respecting the timecode in the SDI and helps in encoder failover situations. QVBR encoding is a method of encoding up to a quality level rather than to a fixed bitrate such as ‘7000 kbps’. QVBR makes more efficient use of bandwidth since, when a scene doesn’t require many bits, they aren’t sent. This variability, even if you run in capped mode to limit the bandwidth of particularly complex scenes, can look like a failing encoder to some systems, so the fact that the feed is now in ‘VBR’ mode needs to be understood by all the departments and companies who monitor it.
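The idea behind output locking can be sketched in a few lines: if every encoder derives its segment boundaries from the SDI timecode rather than from its own start time, two encoders cut segments at the same frames. This is a minimal sketch, not Elemental's implementation. The function names, the 30 fps integer frame rate, the non-drop-frame timecode and the 6-second segment length are all assumptions made for illustration.

```python
def timecode_to_frames(tc: str, fps: int) -> int:
    """Convert a non-drop-frame 'HH:MM:SS:FF' timecode to a frame count."""
    hh, mm, ss, ff = (int(part) for part in tc.split(":"))
    return ((hh * 60 + mm) * 60 + ss) * fps + ff


def segment_index(tc: str, fps: int = 30, segment_seconds: int = 6) -> int:
    """Segment number a frame belongs to, derived only from its timecode."""
    return timecode_to_frames(tc, fps) // (fps * segment_seconds)


def is_segment_boundary(tc: str, fps: int = 30, segment_seconds: int = 6) -> bool:
    """True when a new segment should start on this frame."""
    return timecode_to_frames(tc, fps) % (fps * segment_seconds) == 0


# Two encoders that join the feed at different moments still agree on
# which segment a given frame belongs to, because neither uses its own clock.
encoder_a = segment_index("01:00:05:29")
encoder_b = segment_index("01:00:05:29")
assert encoder_a == encoder_b
```

Because the segment number depends only on the timecode, a player switched from one encoder to the other mid-stream would find the same segment boundaries on both.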
Advertising is famously important for the Super Bowl, so Taylor gives an overview of how they used the CableLabs ESAM protocol and SCTE to receive information about, and trigger, the adverts. This combined SCTE-104, ESAM and SCTE-35 as well as allowing clients to use VAST for tracking. Extra caching was provided by Fastly’s Media Shield, which tests for problems with manifests, origin servers and encoders. This fed a multi-CDN setup using four CDNs which could be switched between. There is a decision point for requests to determine which CDN should answer.
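A decision point of this kind could look something like the following sketch, which picks one CDN per request by weight and skips any marked unhealthy. The talk summary does not describe CBS's actual logic, so the `Cdn` class, the weights, the health flag and the CDN names are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Cdn:
    name: str            # hypothetical label, not a real provider
    weight: float        # share of traffic this CDN should receive
    healthy: bool = True


def choose_cdn(cdns, rng=random):
    """Pick a CDN for one request, weighted, ignoring unhealthy ones."""
    candidates = [c for c in cdns if c.healthy and c.weight > 0]
    if not candidates:
        raise RuntimeError("no healthy CDN available")
    weights = [c.weight for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


fleet = [
    Cdn("cdn-a", 40),
    Cdn("cdn-b", 30),
    Cdn("cdn-c", 20),
    Cdn("cdn-d", 10, healthy=False),  # switched out of rotation
]
print(choose_cdn(fleet).name)
```

Switching traffic away from a struggling CDN then becomes a matter of changing a weight or a health flag rather than redeploying anything.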
Taylor then looks at the tools, such as Mux’s dashboard, which they used to spot problems in the system; both NOC-style tools and multiviewers. They set up three war rooms which looked at different aspects of the system: connectivity, APIs etc. This allowed them to focus on what should be communicated, keeping ‘noise’ down to give people the space they needed to do their work whilst still providing the information required. Taylor then opens up to questions from the floor.
Delivering personalised video at scale, live or otherwise, is a tradeoff between speed and complexity. In this lightning talk at Demuxed 2019, Kyle Boutette from Cloudflare explains the benefits of running code on the ‘edge’.
Kyle starts by highlighting the reason to use CDNs; they take the management of a whole fleet of servers off your hands allowing you to concentrate on delivering a video service and deploying the technology to do just that. This works really well and CDNs are the backbone of most of the large sites on the internet. Some companies build their own whilst some use Cloudflare or Amazon CloudFront among the many CDNs out there. Apart from dealing with the admin of the servers, CDNs are careful to provide servers as close to your users as practical which helps in reducing latency.
The problem that Kyle exposes is that any personalisation needs to be done either on the player itself or on the server. The former requires implementing the same features on many platforms. The latter destroys the value of the CDN, since the central server(s) must calculate the new information and send it to the CDN, bringing us back to square one.
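Running the personalisation on the edge avoids both problems: the logic is written once and executes close to the viewer. As a rough illustration of the kind of small transformation that could run there, the sketch below appends a per-viewer token to every segment URI in an HLS media playlist. It is written in Python for readability only; a real edge runtime would have its own language and API, and the function name and `token` query parameter are hypothetical.

```python
def personalise_playlist(playlist: str, viewer_token: str) -> str:
    """Append a per-viewer token to each URI line of an HLS playlist.

    In HLS, lines starting with '#' are tags or comments and blank
    lines are ignored; every other line is a URI.
    """
    out = []
    for line in playlist.splitlines():
        if line and not line.startswith("#"):
            separator = "&" if "?" in line else "?"
            line = f"{line}{separator}token={viewer_token}"
        out.append(line)
    return "\n".join(out)


cached = "#EXTM3U\n#EXTINF:6.0,\nseg1.ts\n#EXTINF:6.0,\nseg2.ts?x=1"
print(personalise_playlist(cached, "abc"))
```

The playlist itself can stay cached unchanged; only the cheap per-request rewrite differs between viewers.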
Netflix take to the stage at Demuxed to tell us about the work they’ve been doing to understand and reduce latency by looking at the queue management of their managed switches. As Tony Orme mentioned yesterday, we need buffers in IP systems to allow synchronous parts to interact. Here, we’re looking at how the core network fabric’s buffers can get in the way of the main video flows.
Te-Yuan Huang from Netflix explains their work in investigating buffers and how best to use them. She talks about the flows that occur due to the buffer models of standard switches, i.e. accepting packets until the buffer is full and then dropping everything else that comes in until the buffer has drained. There is an alternative approach, Active Queue Management (AQM), one form of which is FQ-CoDel, which drops packets based on probability before the buffer is full. By carefully choosing the probability, you can actually improve buffer handling and reduce the impact it has on latency.
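The contrast between the two drop policies can be sketched as below. Note that this is a generic RED-style illustration of probabilistic early dropping as summarised above, not FQ-CoDel itself, which keeps per-flow queues and bases its drop decisions on how long packets have spent in the queue. The thresholds and the `max_p` value are assumptions chosen for the example.

```python
import random


def tail_drop(queue_len: int, capacity: int) -> bool:
    """Standard switch behaviour: drop only once the buffer is full."""
    return queue_len >= capacity


def early_drop_probability(queue_len: int, min_threshold: int,
                           max_threshold: int, max_p: float = 0.1) -> float:
    """Drop probability that ramps up as the queue grows (RED-style)."""
    if queue_len < min_threshold:
        return 0.0
    if queue_len >= max_threshold:
        return 1.0
    fill = (queue_len - min_threshold) / (max_threshold - min_threshold)
    return max_p * fill


def early_drop(queue_len: int, min_threshold: int, max_threshold: int,
               max_p: float = 0.1, rng=random) -> bool:
    """Decide whether to drop an arriving packet before the buffer fills."""
    p = early_drop_probability(queue_len, min_threshold, max_threshold, max_p)
    return rng.random() < p


# With a 100-packet buffer, tail drop does nothing at 60 packets queued,
# whereas early drop has already started signalling senders to slow down.
print(tail_drop(60, 100), early_drop_probability(60, 20, 100))
```

Dropping a few packets early prompts TCP senders to back off before the queue, and therefore the queueing delay, reaches its maximum.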
Te-Yuan shows us results from tests that her team has done which show that FQ-CoDel does, indeed, reduce latency. After showing us the data, she summarises by saying that FQ-CoDel improves playback and QoE.
Bruce Spang interned at Netflix and studied the phenomenon of unexpected latency variation within the Netflix caches they deploy at ISPs to reduce latency and bandwidth usage. He starts by introducing us to the TCP buffering models, looking at how they work and what they are trying to achieve, with the aim of identifying how big the buffer is supposed to be. The reason this is important is that if it’s a big buffer, you may find that data takes a long time to leave the buffer when it gets full, thus adding latency to the packets as they travel through. Too small, of course, and packets have to be dropped. This creates more rebuffering, which impacts the ABR choice, leading to lower quality.
Bruce was part of an experiment that studied whether the buffer model in use behaved as expected. Whilst he found that it did most of the time, video performance varied, which was undesirable. To explain this, he details the testing they did and the finding that, as you would expect, latency increases during congested periods. Moreover, he showed that a 500MB buffer had more latency than a 50MB one.
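The link between buffer size and latency is simple arithmetic: a packet arriving at a full buffer has to wait for everything ahead of it to drain. Taking the 500MB and 50MB figures above as buffer sizes, and assuming a hypothetical 10 Gb/s link (the link rate is not given in the talk summary), the worst-case queueing delay can be estimated as follows.

```python
def max_queueing_delay_ms(buffer_bytes: int, link_bits_per_second: int) -> float:
    """Time to drain a completely full buffer, in milliseconds."""
    return buffer_bytes * 8 * 1000 / link_bits_per_second


MB = 1_000_000
LINK_RATE = 10_000_000_000  # assumed 10 Gb/s link, for illustration only

for size_mb in (50, 500):
    delay = max_queueing_delay_ms(size_mb * MB, LINK_RATE)
    print(f"{size_mb} MB buffer: up to {delay:.0f} ms of queueing delay")
```

Under these assumptions the larger buffer can add up to 400 ms when full against 40 ms for the smaller one, which is consistent with the direction of the result Bruce reports.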
To account for unexpected behaviour, such as long-tail content having lower latency than popular content, Bruce explains how he looked under the hood of the router to see how VOQs are used to create queues of traffic and how they work. Having seen the relatively simple logic behind the system, Bruce talks about the results they’ve achieved working with the vendor to improve the buffering logic.