Video: Encoding Vs Compute Efficiency in Video Coding

Ioannis Katsavounidis from Facebook joins us to talk us through his work finding the best balance between computation and encoding efficiency. He explains how encoding has moved from real-time, hardware-based encoding in the late 80s and 1990s through to file encoding, chunk-based encoding and now shot-based encoding. Each of these stages has brought opportunities to speed up encoding, but there has always been a fundamental reason why encoding can’t simply be sped up by the general advance of computing hardware.

Moore’s law posits that the number of transistors in chips doubles roughly every two years. Whilst this has continued to hold until recent years, transistor count has only ever been a proxy for processing power. For many years now, the way to keep increasing the computational ability of CPUs has been not to raise clock speeds, as it was twenty years ago, but to add cores to the chip. As each core acts as its own CPU, this gives the ability to execute code in parallel, with a separate thread of code running on each core. Whilst 12-20 cores are typical for servers, there are CPUs which deliver up to 128 cores.

Ioannis explains why DCT-based codecs are resistant to multi-threaded encoding by showing how some encoding decisions depend on the previously decoded video frame, so the encoder needs to decode the video before it has the information required for the next encoding decisions. An example of this is motion estimation, where you need to know what a macroblock looks like in the reconstructed reference frame in order to determine if and how it can be moved to predict the macroblock currently being encoded.
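
To make that dependency concrete, here is a minimal block-matching sketch in Python (purely illustrative, not code from the talk); the key point is that the reference frame it searches must be the reconstructed output of the previous encode, which is what forces the serial order.

```python
import numpy as np

def motion_search(reference, block, top, left, search_range=8):
    """Full-search block matching: find the motion vector in `reference`
    that best predicts `block` using a SAD criterion."""
    h, w = block.shape
    best_sad, best_mv = float("inf"), (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > reference.shape[0] or x + w > reference.shape[1]:
                continue
            sad = np.abs(reference[y:y + h, x:x + w].astype(int) - block.astype(int)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

# `reference` must be the reconstructed (encoded-then-decoded) previous frame,
# not the original source frame, so frame N+1 cannot start its search until
# frame N has been fully encoded and decoded.
```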

It turns out that some of the information needed can be calculated from the original video. Whilst this doesn’t provide full parallelisation, it does free up some of the computation to be done in parallel, reducing the time spent in the serial encoding stage. As the design of the codec itself limits how far it can be parallelised, the best way to speed up encoding has been to split the original video up and encode these, now separate, sections independently.

Speeding up video encoding has therefore focused on splitting the video into different sections and encoding those in parallel rather than trying to parallelise the encoding itself. Encoding each frame separately is one way to do this, but it sacrifices encoding efficiency. Splitting each frame into sections (tiles or slices) is another way, though this also sacrifices either quality or bitrate. The most successful approach to encoding parallelisation has been chunked encoding. As streaming applications already use chunks, typically around 2 seconds nowadays, there’s no reason not to cut your video into small sections and encode those separately; the whole of this talk focuses on non-live video.
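
As a rough illustration of chunked parallel encoding (a sketch, not Ioannis’s pipeline; it assumes ffmpeg with libx264 is installed and that the source has already been split into the hypothetical files chunk_0000.mp4 onwards):

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

def encode_chunk(chunk_path):
    """Encode one pre-split chunk with x264; chunks are independent,
    so each one can run on its own core or even its own machine."""
    out_path = chunk_path.replace(".mp4", "_enc.mp4")
    subprocess.run(
        ["ffmpeg", "-y", "-i", chunk_path,
         "-c:v", "libx264", "-preset", "slow", "-crf", "23", out_path],
        check=True)
    return out_path

chunks = [f"chunk_{i:04d}.mp4" for i in range(10)]   # hypothetical chunk files
with ProcessPoolExecutor(max_workers=4) as pool:
    encoded = list(pool.map(encode_chunk, chunks))    # encode in parallel
```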


If there’s a shot change in the middle of your chunk, it’s likely to look very bad, since motion estimation will fail to produce good results and there may not be enough bitrate budget to compensate. It’s therefore best to drop in an IDR frame at the shot change, or better still to align your chunk boundaries with the shot changes themselves. Simply encoding these chunks in parallel would speed up the encoding; however, it misses an opportunity to optimise quality vs bitrate.
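
A crude sketch of how chunk boundaries might be aligned to shot changes, using a simple luma-difference threshold in place of a real shot detector (the threshold and the frame format are assumptions for illustration):

```python
import numpy as np

def find_shot_changes(frames, threshold=30.0):
    """Return frame indices where the mean absolute luma difference from the
    previous frame exceeds `threshold` -- a stand-in for real shot detection."""
    cuts, prev = [], None
    for idx, frame in enumerate(frames):      # frames: iterable of 2-D luma arrays
        if prev is not None and np.abs(frame.astype(int) - prev.astype(int)).mean() > threshold:
            cuts.append(idx)
        prev = frame
    return cuts

# Chunk boundaries can then be snapped to the detected cuts so each chunk
# starts with an IDR frame exactly at a shot change.
```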

Ioannis explains an experiment to determine the best operating point for chunks. He starts by reminding us that all encoders have ‘speed’ settings which control how much computation, and therefore time, is spent on each encode. The ‘veryfast’ preset in x264 will encode at the highest speed possible, but the quality will be worse for a given bitrate than with the ‘veryslow’ preset. Ioannis’s experiment encoded each chunk at every speed setting for a variety of resolutions and bitrates. Each encode was then analysed for quality using PSNR, MS-SSIM and VMAF.
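
The shape of that experiment can be sketched roughly as below (not the actual test harness; it assumes an ffmpeg build with libx264 and the libvmaf filter, and a hypothetical source file chunk_0001.mp4):

```python
import itertools, re, subprocess, time

PRESETS = ["ultrafast", "veryfast", "fast", "medium", "slow", "veryslow"]
BITRATES = ["1000k", "3000k", "6000k"]

def encode_and_measure(src, preset, bitrate):
    """Encode `src` with one preset/bitrate pair, time it, then score the
    result against the source with ffmpeg's libvmaf filter."""
    out = f"out_{preset}_{bitrate}.mp4"
    start = time.perf_counter()
    subprocess.run(["ffmpeg", "-y", "-i", src, "-c:v", "libx264",
                    "-preset", preset, "-b:v", bitrate, out], check=True)
    encode_time = time.perf_counter() - start
    result = subprocess.run(["ffmpeg", "-i", out, "-i", src,
                             "-lavfi", "libvmaf", "-f", "null", "-"],
                            capture_output=True, text=True)
    match = re.search(r"VMAF score: ([\d.]+)", result.stderr)
    return encode_time, float(match.group(1)) if match else None

results = {(p, b): encode_and_measure("chunk_0001.mp4", p, b)
           for p, b in itertools.product(PRESETS, BITRATES)}
```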

From Ioannis’ work, we can see how the bitrate setting affects both the encode time and the quality, and we can observe that the slower speeds tend to offer minimal quality advantages for the significant extra encoding time involved. Each curve has a steep part and a shallow section, with the transition between them known as the ‘convex hull’. Choosing a setting on the convex hull portion of the line gives the optimal balance between quality and encoding time and is where, says Ioannis, most people should aim to operate.
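
Selecting the efficient operating points from such measurements can be pictured as a simple frontier pass over (encode time, quality) pairs; a minimal sketch of the idea, with made-up numbers:

```python
def efficient_points(points):
    """Given (encode_time, quality) pairs, keep only those where no other
    point is both faster and at least as good -- the frontier the talk
    refers to as the convex hull."""
    frontier = []
    for t, q in sorted(points):              # walk from fastest to slowest
        if not frontier or q > frontier[-1][1]:
            frontier.append((t, q))           # keep only genuine quality gains
    return frontier

# Example: (seconds, VMAF) per preset for one chunk -- illustrative values only
measurements = [(3, 82.0), (6, 88.5), (14, 91.0), (40, 91.6), (120, 91.9)]
print(efficient_points(measurements))
```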

The talk finishes with a summary of the conclusions which can be drawn from this work: the use of the convex hull we’ve just discussed, the best type of parallel processing, whether oversubscription of CPU cores is helpful or not, and the interesting observation that it’s often the quality metrics, rather than the video encoding itself, which put a significant burden on the computation, particularly at lower resolutions.

Watch now!
Speakers

Ioannis Katsavounidis
Research Scientist,
Facebook

Video: Optimal Design of Encoding Profiles for Web Streaming

With us since 1998, ABR (Adaptive Bitrate) streaming has allowed players to select a stream appropriate for their device and bandwidth. But in this video, we hear that over 20 years on, we’re still developing ways to understand and optimise the performance of ABR ladders for delivery, finding the best balance of size and quality.

Brightcove’s Yuriy Reznik takes us deep into the theory, but starts with the basics of what ABR is and why we use it. He covers how it delivers a whole series of separate streams at different resolutions and bitrates. Whilst that works well, he quickly starts to show the downsides of ‘static’ ABR profiles. These are where a provider decides that all assets will be encoded with the same fixed set of 6 or 7 bitrates, even though some titles, such as cartoons, will require less bandwidth than sports programmes. This is where per-title and other encoding techniques come in.

Netflix coined the term ‘per-title encoding’, which has since also been called content-aware encoding. This takes the content itself into consideration when determining the bitrate to encode at. Using automatic processes to determine the objective quality of sample encodes, it is able to determine the optimum bitrate.
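
In its simplest form, the per-title decision can be pictured as picking the cheapest trial encode that clears a quality target; a toy sketch with made-up scores, not Netflix’s actual method:

```python
# Made-up VMAF scores from trial encodes of one title at a few bitrates (kbps)
trial_scores = {1000: 78.0, 2000: 88.0, 3000: 93.0, 4500: 95.5, 6000: 96.2}

def per_title_bitrate(scores, target_quality=93.0):
    """Pick the lowest bitrate whose trial encode meets the quality target;
    a cartoon clears the bar at a lower bitrate than a sports clip would."""
    qualifying = [b for b, q in sorted(scores.items()) if q >= target_quality]
    return qualifying[0] if qualifying else max(scores)

print(per_title_bitrate(trial_scores))   # 3000 for this made-up title
```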

Content & network-aware encoding takes the network delivery into account as part of the optimisation, as well as the quality of the final video itself. It’s able to estimate the likelihood of a stream being selected for playback based upon its bitrate. The trick is combining these two factors simultaneously to find the optimum balance of bitrate and quality.
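
One way to picture that combination is an expected-quality calculation: each rung’s quality is weighted by the probability that a client’s bandwidth lands it on that rung. The figures and the selection rule below are illustrative assumptions, not Yuriy’s model:

```python
# Made-up ladder of (bitrate_kbps, quality) and a toy client-bandwidth distribution
ladder = [(1000, 78.0), (3000, 93.0), (6000, 96.2)]
bandwidth_pmf = {1500: 0.2, 3500: 0.3, 5000: 0.3, 8000: 0.2}   # P(client bandwidth)

def expected_quality(ladder, bandwidth_pmf):
    """Weight each rendition's quality by the probability it gets played,
    assuming the client simply picks the highest bitrate that fits."""
    total = 0.0
    for bw, p in bandwidth_pmf.items():
        playable = [(b, q) for b, q in ladder if b <= bw]
        if playable:
            total += p * max(playable)[1]   # quality of the top rung that fits
    return total

print(expected_quality(ladder, bandwidth_pmf))
```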

The last element to add, in order to make this ABR optimisation as realistic as practical, is to take into account the way people actually view the content. Looking at a real example from the US Open, we see how, on PCs, the viewing window can be many different sizes, and the probability of each size being used can be calculated. Furthermore, we know there is some intelligence in the players: they won’t take in a stream with a resolution much bigger than the browser viewport.

Yuriy starts the final section of his talk by explaining that he brought in another quality metric, from Westerink & Roufs, which allows him to estimate how people perceive video which has been encoded at a certain resolution, scaled to a fixed intermediate resolution for decoding and then scaled again to the size of the browser window.

The result of adding this further check is that fewer points on the ladder tend to be better, giving a higher overall quality value: only a few resolutions are needed to achieve good average quality, and going much beyond three typically adds little for web playback.

Yuriy finishes by introducing SSIM modeling of the noise of an encoder at different bitrates. Bringing together all of these factors, modelled as equations, allows him to suggest optimal ABR ladders.
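
The modelling step might look something like the sketch below: fit a parametric quality-vs-bitrate curve to a handful of measurements so the whole curve can be used in the ladder optimisation. The saturating form and the data points are assumptions for illustration, not the model from the talk:

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up (bitrate_kbps, SSIM) measurements for one title and resolution
rates = np.array([500.0, 1000.0, 2000.0, 4000.0, 8000.0])
ssim = np.array([0.90, 0.94, 0.965, 0.978, 0.985])

def ssim_model(rate, a, b):
    """Assumed saturating form SSIM(R) = 1 - a * R**(-b)."""
    return 1.0 - a * rate ** (-b)

params, _ = curve_fit(ssim_model, rates, ssim, p0=(10.0, 0.7))
print(params, ssim_model(3000.0, *params))   # predicted SSIM at an unseen bitrate
```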

Watch now!
Speaker

Yuriy Reznik
Technology Fellow and Head of Research,
Brightcove

Video: No-Reference QoE Assessment: Knowledge-based vs. Learning-based

Automatic assessment of video quality is essential for creating encoders, selecting vendors, choosing operating points and, for online streaming services, in ongoing service improvement. But getting a computer to understand what looks good and what looks bad to humans is not trivial. When the computer doesn’t have the source video to compare against, it’s even harder.

In this talk, Dr. Ahmed Badr from SSIMWAVE looks at how video quality assessment (VQA) works and goes into detail on No-Reference (NR) techniques. He starts by making the case for VQA as an extension to, and often a replacement for, subjective scoring by people. Subjective scoring is clearly time-consuming, can be expensive due to the involvement of people (and their time) and requires specific viewing conditions; when done well, a whole, carefully prepared room is required. So when it comes to analysing all the video created by a TV station, or automating per-title encoding optimisation, we know we have to remove the human element.

Ahmed moves on to discuss the challenges of No-Reference VQA, such as identifying intended blur or noise. NR VQA is a two-step process: the first step extracts features from the video; these features are then mapped to a quality score by a model, which can be built with a machine learning/AI process, the technique Ahmed analyses next. The first task is to assemble a carefully chosen dataset of videos. Then it’s important to choose a metric to train against, for instance MS-SSIM or VMAF, so that the learning algorithm can get the feedback it needs to improve. The last two elements are choosing what you are optimising for, technically called a loss function, and then choosing an AI model to use.
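
The mapping step can be pictured with a small scikit-learn sketch: a regressor trained to map per-clip features to a target metric such as VMAF. Everything here (the synthetic features and labels, the random-forest model and the MAE loss) is an assumption for illustration, not SSIMWAVE’s method:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset: each row is a feature vector extracted from one
# clip (e.g. blur, blockiness, noise estimates), labelled with a quality score.
rng = np.random.default_rng(0)
features = rng.random((500, 8))
labels = 100 * features.mean(axis=1) + rng.normal(0, 2, 500)

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
model = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)

# Mean absolute error stands in for whatever loss function is actually chosen
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```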

The dataset you create needs to be aimed at exploring a certain aspect, or range of aspects, of video. It could be that you want to optimise for sports, but if you need a broad array of genres, reducing compression or scaling artefacts may be the main theme of the dataset. Ahmed talks about the millions of video samples that they have collated and how they’ve used them to create their metric, SSIMPLUS, which can work both with a reference and without.

Watch now!
Speaker

Dr. Ahmed Badr
SSIMWAVE

Video: Extension to 4K resolution of a Parametric Model for Perceptual Video Quality

Measuring video quality automatically is invaluable and, for many uses, essential. But as video evolves with higher frame rates, HDR, a wider colour gamut (WCG) and higher resolutions, we need to make sure the automatic evaluations evolve too. Called ‘objective metrics’, these computer-based assessments go by names such as PSNR, DMOS and VMAF. One use for these metrics is to automatically analyse an encoded video to determine whether it looks good enough or should be re-encoded, allowing the bitrate to be optimised for quality. Rafael Sotelo, from the Universidad de Montevideo, explains how his university helped work on an update to Predicted MOS to do just this.

MOS is the Mean Opinion Score, a result derived from a group of people watching content in a controlled environment. They vote to say how they feel about the content and the data, when combined, gives an indication of the quality of the video. The trick is to enable a computer to predict what people will say. Rafael explains how this is done, looking at some of the maths behind the predicted score.

In order to test any ‘upgrades’ to an objective metric, you need to test it against people’s actual scores, known as ‘subjective testing’. So Rafael explains how he set up his viewing environments in both Uruguay and Italy to be compliant with BT.500, a standard which describes the viewing conditions a room should provide so that viewers can properly appreciate the pros and cons of the content. For instance, it specifies how dim the room should be, how reflective the screens can be and how they should be calibrated. The guidelines don’t cover HDR, 4K etc., so the team devised an extension to the standard in order to carry out the testing.

With all of this work done, Rafael shows us the benefits of using this extended metric and the results achieved.

Watch now!
Speakers

Rafael Sotelo
Director, ICT Department
Universidad de Montevideo