Video: Providing better video experiences for the next billion users

What’s the best way for a billion people all on mobile networks to have a universally great streaming experience? It’s not trivial, and no service is perfect, but Facebook set out to find out what problems existed and find ways to fix them. This video explains their approach and solutions.

Denise Noyes from Facebook spoke at Demuxed 2020 about their work in India over the year. For Facebook, India is unique for this research as it represents such a large number of people almost universally using Android phones and mobile data. Not only does this allow them to understand the low-bitrate performance of video, but the Android penetration level simplifies comparisons.

The problems that Denise and her colleagues identified were gaps in the bitrate ladders where the ABR ladder either wasn’t well optimised or didn’t go low enough. There were also some ABR logic/decisions that were seen to be causing problems along with server delays from the CDN and internal congestion within the app. The research looked at ‘average bad sessions per user’ rather than the overall number of bad sessions which would be skewed by how many videos people generally watched.
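
To see why a per-user view differs from an overall count, here is a minimal sketch. It assumes one possible reading of the metric (each user's bad-session rate, averaged across users); the talk summary does not give Facebook's exact definition, and the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def bad_session_metrics(sessions):
    """sessions: iterable of (user_id, is_bad) pairs.

    Returns (overall_bad_rate, mean_per_user_bad_rate).
    The per-user definition is an assumption made for illustration.
    """
    per_user = defaultdict(lambda: [0, 0])  # user_id -> [bad, total]
    for user_id, is_bad in sessions:
        per_user[user_id][0] += int(is_bad)
        per_user[user_id][1] += 1

    total_bad = sum(bad for bad, _ in per_user.values())
    total = sum(count for _, count in per_user.values())
    overall_rate = total_bad / total

    # Every user counts once, however many videos they watched.
    per_user_rate = sum(bad / count for bad, count in per_user.values()) / len(per_user)
    return overall_rate, per_user_rate


# Hypothetical data: one heavy viewer with few problems and three light
# viewers for whom half of all sessions are bad.
sessions = [("heavy", i < 2) for i in range(100)]
sessions += [(user, i == 0) for user in ("a", "b", "c") for i in range(2)]
overall, per_user = bad_session_metrics(sessions)
```

With this data the overall rate is dominated by the heavy viewer and looks healthy, whilst the per-user figure shows that most users are having a poor experience.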

Covid had a bearing on the research as it was being conducted through in-person interviews within India. The teams had to come home, but the relevance of the research was acutely highlighted as networks in other countries worsened in response to rising traffic, bringing them closer to the Indian example.

Denise’s team worked with colleagues throughout the company to create improvements across the whole network and delivery stack. On the encoding front, they decreased the lowest encoding level to 100kbps. This doesn’t look amazing, as seen by the metric score, but it’s better than buffering and can be watchable depending on the content. The GOP size was also increased from 2 seconds to 5. Longer GOPs are known to improve compression efficiency, in this case by up to 8%, but there is a tradeoff to pay in latency and in how frequently you can move up/down the ABR ladder. Facebook found that the tradeoffs were worth the improvement for the viewers.

Denise introduces FB-MOS, Facebook’s objective model of the subjective MOS (Mean Opinion Score) metric. The lower the number, the worse the video looks. Facebook have used the fact that encoding resolution ‘A’ at both, say, 400kbps and 200kbps can look better than encoding resolution ‘A’ at 400kbps and using a lower resolution ‘B’ for the 200kbps encode. This has led to the ABR ladder having 360p at two bitrates and 480p at two bitrates.

That FB-MOS score comes in handy for avoiding the lowest rungs of the ABR ladder. As their MOS score is quite low, the player will only choose them if it really has no other choice. Otherwise, if it isn’t able to go up the ladder, it will prefer to settle on a higher quality version. Conversely, they have also implemented logic to limit who gets the highest bandwidth streams since most users would prefer to spend less on data than get that disproportionately low improvement in quality.
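
The two behaviours described here, avoiding low-scoring rungs and capping the top of the ladder, can be sketched as follows. This is only an illustration under stated assumptions, not Facebook's player logic: the ladder values, the `min_mos` threshold, the `slack` tolerance and the `max_kbps` cap are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    height: int   # vertical resolution
    kbps: int     # encoded bitrate
    mos: float    # hypothetical quality score, higher is better


# Hypothetical ladder: two bitrates each at 360p and 480p, lowest rung 100kbps.
LADDER = [
    Rung(240, 100, 1.8),
    Rung(360, 200, 2.6),
    Rung(360, 400, 3.2),
    Rung(480, 600, 3.6),
    Rung(480, 1000, 4.0),
    Rung(720, 2500, 4.4),
]


def _rate(rung):
    return rung.kbps


def pick_rung(ladder, est_kbps, min_mos=2.5, slack=0.8, max_kbps=None):
    """Pick a rung for the estimated bandwidth (all thresholds are assumptions)."""
    # Optional cap for viewers who would rather save data than gain a
    # small amount of quality.
    usable = [r for r in ladder if max_kbps is None or r.kbps <= max_kbps] or list(ladder)
    good = [r for r in usable if r.mos >= min_mos]

    # 1. Highest acceptable-quality rung that fits the estimate.
    fits = [r for r in good if r.kbps <= est_kbps]
    if fits:
        return max(fits, key=_rate)

    # 2. Hold the lowest acceptable rung if the estimate is only a little short.
    if good:
        floor = min(good, key=_rate)
        if est_kbps >= slack * floor.kbps:
            return floor

    # 3. Last resort: a low-scoring rung, or the smallest one available.
    affordable = [r for r in usable if r.kbps <= est_kbps]
    return max(affordable, key=_rate) if affordable else min(usable, key=_rate)
```

For example, `pick_rung(LADDER, 170)` holds the 200kbps rung rather than dropping to 100kbps, whilst `pick_rung(LADDER, 5000, max_kbps=1000)` stops at the 1000kbps rung despite ample bandwidth.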

In playback, Denise explains that they have reduced the impact of occasional anomalies on the bandwidth estimation and adjusted prefetching to prefetch the first chunk of all videos it would like to prefetch before getting the next chunk. This has reduced the chance that someone is able to choose a video which hasn’t yet been buffered and hence have to wait for it to start.
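
As a rough illustration of both ideas, the sketch below uses a median over recent throughput samples, so a single anomalous measurement has little effect, and orders prefetch requests so that the first chunk of every video is requested before any second chunk. The summary does not say which estimator Facebook uses, so the median, the window size and the function names are assumptions.

```python
from statistics import median


def robust_bandwidth_kbps(samples_kbps, window=5):
    """Median of the most recent samples; one outlier barely moves it."""
    return median(samples_kbps[-window:])


def prefetch_order(video_ids, chunks_per_video=2):
    """Breadth-first order: chunk 0 of every video, then chunk 1, and so on."""
    return [
        (video_id, chunk_index)
        for chunk_index in range(chunks_per_video)
        for video_id in video_ids
    ]
```

With samples of 800, 820, 90, 810 and 790kbps, a plain average would be dragged down to 662kbps by the single stall, whereas the median stays at 800kbps.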

Lastly, Denise covers the work done at the network layer, which saw a move from HTTP/2 to QUIC. We see how the removal of head-of-line blocking has helped and that not only has the move to QUIC brought an overall improvement in performance but, as congestion increased, QUIC traffic has shown a disproportionate improvement.

Denise concludes highlighting that this work across the network stack with wide collaboration has not only delivered the desired results but is a vital approach for any company looking to make marked improvements in customer experience.

Watch now!
Speaker

Denise Noyes
Software Developer,
Facebook

Video: Measuring Video Quality with VMAF – Why You Should Care

VMAF, from Netflix, has become a popular tool for evaluating video quality since its launch as an Open Source project in 2017. Coming out of research from the University of Southern California and The University of Texas at Austin, it’s seen as one of the leading ways to automate video assessment.

Netflix’s Christos Bampis gives us a brief overview of VMAF’s origins and its aims. VMAF came about because other metrics such as MS-SSIM and, in particular, PSNR aren’t close enough indicators of quality. Indeed, Christos shows that when it comes to animated content (i.e. anime and cartoons) subjective scores can be very high, but the PSNR score can be the same as the PSNR score of a live-action video clip which humans rate a lot lower, subjectively. Moreover, in less extreme examples, Christos explains, PSNR is often 5% or so away from the actual subjective score in either direction.
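
For reference, PSNR is purely a function of the mean squared error between the reference and the distorted picture, which is why it knows nothing about how visible an error is. A minimal sketch, assuming 8-bit samples passed in as flat lists (the function name and sample values are illustrative only):

```python
import math


def psnr(reference, distorted, max_value=255):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical pictures
    return 10 * math.log10(max_value ** 2 / mse)


# Every sample is off by one level, so the MSE is exactly 1.
score = psnr([100, 100, 100, 100], [101, 99, 101, 99])
```

The same error energy gives the same score whether it lands in busy foliage or on a smooth sky, which is the weakness VMAF sets out to address.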

To a simple approximation, VMAF is a method of bringing out the spatial and temporal information from a video frame in a way which emphasises the types of things humans are attuned to such as contrast masking. Christos shows an example of a picture where artefacts in the trees are much harder to see than similar artefacts on a colour gradient such as a sky or still water. The feature extraction methods take account of situations like this and their outputs are then fed into a trained model which maps them to the scores that humans would have given. The idea is that, when trained on many examples, the model can correctly predict a human’s score given a set of data extracted from a picture. Christos shows examples of how well VMAF out-performs PSNR in gauging video quality.

Challenges are in focus in the second half of the talk. What are the things which still need working on to improve VMAF? Christos zooms in on two: design dimensionality and noise. By design dimensionality, he means how can VMAF be extended to be more general, delivering a number which has a consistent meaning in different scenarios? As the VMAF model has been trained on AVC, how can we deal with different artefacts which are seen with different codecs? Do we need a new model for HDR content instead of SDR and how should viewing conditions, whether ambient light or resolution and size of the display device, be brought into the metric? The second challenge Christos highlights is noise as he reveals VMAF tends to give lower scores than it should to noisy sources. Codecs like AV1 have film-grain synthesis tools and these need to be evaluated, so behaving correctly in the presence of video noise is important.

The talk finishes with Christos outlining that VMAF’s applicability to the industry is only increasing with new codecs coming out such as LCEVC, VVC, AV1 and more. Such diversity in the codec ecosystem wasn’t an obvious prediction in 2014 when the initial research work was started. Christos underlines the fact that VMAF is a continually evolving metric which is Open Source and open to contributions. The Q&A covers failure cases, super-resolution and how to interpret close-call results which are only 1% different.

Watch now!
Download the presentation
Speaker

Christos Bampis
Senior Software Engineer,
Netflix

Video: Encoding Vs Compute Efficiency in Video Coding

Ioannis Katsavounidis from Facebook joins us to talk us through his work finding the best balance between computation and encoding. He explains how encoding has moved from real-time, hardware-based encoding in the late 80s and 1990s through to file encoding, chunk-based encoding and now shot-based encoding. Each of these stages has brought opportunities to speed up encoding, but there has always been a fundamental reason why encoding can’t simply be sped up by the advance of IT.

Moore’s law posits that the number of transistors in a chip doubles roughly every two years. Whilst this held true until recent years, transistor count has only ever been a proxy for processing power. For many years now, the way to keep the computational ability of CPUs high has been not to increase clock-speed as it was twenty years ago, but to add cores to the chip. As each core acts as its own CPU, this gives the ability to execute code in parallel with a thread of code running separately on each core. Whilst 12-20 cores are typical for servers, there are CPUs which deliver up to 128 cores.

Ioannis explains why DCT-based codecs are resistant to multi-threaded encoding by showing how some of the encoding decisions are based on the previously decoded video frame, so the encoder needs to decode the video before it has the information it needs to make the next encode decisions. An example of this is motion estimation, where you need to understand what a macroblock looks like in order to determine if and how it can be moved to form part of the macroblock currently being encoded.

It turns out that some of the information you need to calculate can be found from the original video. Whilst this doesn’t provide full parallelisation, it does help in freeing some of the computation to be done in parallel thus reducing the length of time spent on the linear encoding stage. As the design of the codec itself is limited in its ability to be parallelised, the best way to speed up encoding has been to split up the original video and encode these, now separate, sections independently.

Speeding up video encoding has therefore focused on splitting up the video into different sections and encoding those in parallel rather than trying to parallelise the encoding itself. Encoding each frame separately is one way to do this, but sacrifices encoding efficiency. Splitting each frame up into sections (tiles or slices) is another way, though this also sacrifices either quality or bitrate. The most successful encoding parallelisation has been chunked encoding. As streaming applications use chunks, typically around 2 seconds nowadays, there’s no reason not to just cut your video up into small sections and encode those separately; the whole of this video focuses on non-live video.

If there’s a shot change in the middle of your chunk, this is likely to look very bad since the motion estimation will fail to produce good results and there may not be enough bitrate budget to compensate. Therefore it’s best to drop in an IDR frame at the shot change or to actually change your video chunks to match shot changes. Simply encoding these chunks in parallel would speed up the encoding, however, it misses an opportunity to optimise quality vs bitrate.
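
Aligning chunks to shot changes can be sketched as below. This is a simplified illustration rather than the method from the talk: it assumes the shot-change timestamps are already known, and the function name and the 5-second maximum chunk length are hypothetical.

```python
def shot_aligned_chunks(duration, shot_changes, max_chunk=5.0):
    """Return (start, end) pairs in seconds that never straddle a shot change.

    Shots longer than max_chunk are split further so each piece can still
    be encoded in parallel.
    """
    cuts = sorted({t for t in shot_changes if 0 < t < duration})
    edges = [0.0, *cuts, float(duration)]

    chunks = []
    for start, end in zip(edges, edges[1:]):
        t = start
        while end - t > max_chunk:
            chunks.append((t, t + max_chunk))
            t += max_chunk
        chunks.append((t, end))
    return chunks


chunks = shot_aligned_chunks(12.0, [3.0, 4.5])
```

Each resulting chunk starts on a shot boundary or a forced split, so it can begin with an IDR frame and be handed to a separate worker. A real system would probably also merge very short shots, which this sketch does not attempt.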

Ioannis explains an experiment to determine the best operating point for chunks. He does that by reminding us that all encoders have certain ‘speed’ settings which control how much computation, and therefore time, is required for each encode. A fast preset such as ‘veryfast’ in x264 will encode very quickly, but the quality will be worse for a given bitrate compared to the ‘veryslow’ preset. Ioannis’s experiment encoded each chunk at every speed setting for a variety of resolutions and bitrates. Each encode was then analysed for quality using PSNR, MS-SSIM and VMAF.

From Ioannis’s work, we can see how the bitrate setting affects both the encode time and the quality, and we can observe that the slower speeds tend to have minimal quality advantages for the significant extra time involved in the encoding. Each curve has a steep part and a shallow part, and the envelope of the most efficient operating points is known as the ‘convex hull’. Choosing a setting on the convex hull, around the transition between the steep and shallow parts, gives the best balance between quality and encoding time and is where, says Ioannis, most people should aim to operate.
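
One common way to formalise this is the upper convex hull of the (encode time, quality) points: operating points which fall below the hull cost more time than a better alternative for the quality they deliver. The sketch below assumes that interpretation, and the sample figures are invented for illustration rather than taken from the talk.

```python
def _cross(o, a, b):
    """Z component of (a - o) x (b - o); negative means a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def efficient_operating_points(points):
    """points: (encode_seconds, quality) pairs, one per encoder setting.

    Returns the upper convex hull from the fastest point up to the
    highest-quality point, ordered by encode time.
    """
    hull = []
    for p in sorted(set(points)):
        # Drop earlier points that lie on or below the line to the new point.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)

    # Anything slower than the best-quality point is never worth choosing.
    best = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[: best + 1]


# Hypothetical measurements: (seconds per chunk, quality score).
measurements = [(1, 80), (2, 88), (3, 88.5), (4, 91), (6, 92), (12, 94), (30, 95)]
frontier = efficient_operating_points(measurements)
```

Here the setting at (3, 88.5) drops out because mixing its neighbours gives better quality for the same time. The remaining points show the steep gains at the fast end and the diminishing returns at the slow end.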

The talk finishes with a summary of the conclusions which can be drawn from this work looking at the use of convex-hull which we’ve just discussed, the best type of parallel processing, whether oversubscription of CPU cores is helpful or not and an interesting observation that it’s often the metrics which put a significant burden on encoding rather than the video encoding itself, particularly for lower resolutions.

Watch now!
Speakers

Ioannis Katsavounidis
Research Scientist,
Facebook

Video: Scaling up Anime with Machine Learning and Smart Real Time Algorithms

For too long, video has been dominated by natural scenes, and compression has been about optimising for skin tones. Recently we have seen technologies which take care of other types of video, such as computer-generated screen content like games, as seen in VVC, and animation-optimised upscalers, as we explore in this talk.

Anime, a Japanese genre of animation, is not very different, from an objective point of view, from most cartoons; the drawing style is black lines on relatively simple, solid areas of colour. Anime itself is a clearly distinct genre whose fans are much more sensitive to quality, but for codecs and scalers, 2D animation in general is a style that easily shows artefacts.

Up- and down-scaling is the process of making an image, say 1920 pixels wide and 1080 high, larger (for instance 3840×2160) or smaller (say, SD resolution). Achieving this without jagged edges or blurriness is difficult; conventional maths can do a decent job but often leaves something to be desired. Christopher Kennedy from Crunchyroll explains the testing he’s done looking at a super resolution upscaling technique which uses machine learning to improve the quality of upscaled anime video.
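
As an example of the conventional maths involved, here is a minimal sketch of bilinear scaling for a single-channel image held as a list of rows. It is written in plain Python for clarity rather than speed, and it assumes pixel-centre alignment with edge clamping, which is one of several conventions in use.

```python
def bilinear_resize(img, out_h, out_w):
    """Resize a 2D list of numbers to out_h x out_w using bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])

    def source_coord(i, out_size, in_size):
        # Map the centre of output pixel i into input coordinates, then clamp.
        s = (i + 0.5) * in_size / out_size - 0.5
        s = min(max(s, 0.0), in_size - 1)
        lo = int(s)
        hi = min(lo + 1, in_size - 1)
        return lo, hi, s - lo

    out = []
    for y in range(out_h):
        y0, y1, fy = source_coord(y, out_h, in_h)
        row = []
        for x in range(out_w):
            x0, x1, fx = source_coord(x, out_w, in_w)
            # Blend horizontally on both rows, then vertically between them.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# A hard black-to-white edge becomes a soft ramp when doubled in width.
ramp = bilinear_resize([[0, 100]], 1, 4)
```

The hard edge turns into intermediate values of 25 and 75, which is exactly the softening of line art that makes simple scalers look blurry on animation.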

Waifu2x is an open-source algorithm which uses Convolutional Neural Networks (CNNs) to scale images and remove artefacts. To start with, Christopher explains the background of traditional algorithmic upscaling, discussing the fact that better-looking algorithms take longer, so TVs often choose the fastest, leading them to look pretty bad if fed SD video. It is better for the streaming provider to spend the time doing an upconversion to 4K to allow the viewer a better final quality on their set.

Machine Learning needs a training set and one thing which has contributed to waifu2x’s success in Anime is that it has been trained only on examples of anime leaving it well practised in improving this type of image. Christopher presents the results of his tests comparing standard bilinear and bicubic scaling with waifu2x showing the VMAF, PSNR and SSIM scores.

Finishing off the video, Christopher talks about the time waifu2x takes to run and the cost of running it in the cloud, and he shares some of the command lines he used.

Watch now!
Speaker

Christopher Kennedy
Staff Video Engineer,
Crunchyroll