We know AI is going to stick around. Whether it’s called AI, Machine Learning or Deep Learning, it all stacks up to the same thing: we’re breaking away from fixed algorithms, where one equation ‘does it all’, to a much more nuanced approach with better results. This is true across many industries. Within the broadcast industry, one way it can be used is in video and audio compression. Want to make an image smaller? Downsample it with a Convolutional Neural Network and it will look better than Lanczos. No surprise, then, that this is coming in full force to a compression technology near you.
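To make the idea concrete, here’s a minimal sketch in PyTorch of what ‘learned downsampling’ means: a strided convolution stands in for a fixed Lanczos kernel. This is illustrative only, not any specific codec’s network, and the weights here are untrained; in practice they would be learned against a reconstruction or perceptual loss.

```python
import torch
import torch.nn as nn

class LearnedDownsampler(nn.Module):
    """A toy 2x downscaler whose kernels are learned rather than fixed."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            # stride=2 halves the spatial dimensions, like a 2x downsample
            nn.Conv2d(32, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

frame = torch.rand(1, 3, 1080, 1920)      # one RGB HD frame
print(LearnedDownsampler()(frame).shape)  # torch.Size([1, 3, 540, 960])
```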
In this talk from Comcast’s Dan Grois, we hear about the ongoing work to super-charge the recently released VVC by replacing its functional blocks with neural-network-based technologies. VVC has already achieved 40-50% improvements over HEVC and, from the work Dan’s involved with, we hear that further gains from neural networks are looking promising.
Dan explains that deep neural networks recognise images in layers, much as the brain does: one area is sensitive to lines and edges, another to objects, another to faces and so on. A Deep Neural Network works in a similar way, with early layers detecting simple features and later layers combining them into more complex ones.
During the development of VVC, Dan explains, neural network techniques were considered but deemed too memory- or computationally-intensive. Six years on from the inception of VVC, these techniques are now practical and are likely to result in a VVC version 2 with further compression improvements.
Dan enumerates the tests so far, swapping out each of the functional blocks in turn: intra- and inter-frame prediction, up- and down-scaling, in-loop filtering etc. He even shows where each would sit in the encoder. Some blocks show improvements of less than 5%, but added together there are significant gains to be had. Whilst this update to VVC is still in the early stages, it seems clear that it will provide real benefits for those that can implement these improvements which, Dan highlights at the end, are likely to require more memory and computation than the current version of VVC. For some, this will be well worth the savings.
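As a hedged sketch of one of the swaps Dan describes, the snippet below shows the shape of a CNN-based in-loop filter: a small residual network that predicts corrections to the reconstructed frame. This is an illustration of the technique, not the JVET proposal itself, and the network size is invented.

```python
import torch
import torch.nn as nn

class CNNLoopFilter(nn.Module):
    """Predicts a residual to add to the decoded frame (luma only here)."""
    def __init__(self, channels=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, decoded):
        # Residual connection: output = decoded frame + learned correction
        return decoded + self.body(decoded)

luma = torch.rand(1, 1, 64, 64)   # one 64x64 block of reconstructed luma
print(CNNLoopFilter()(luma).shape)  # torch.Size([1, 1, 64, 64])
```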
“Enhance!” the captain shouts as the blurry image on the main screen becomes sharp and crisp again. This was sci-fi – and this still is sci-fi – but super-resolution techniques are showing that it’s really not that far-fetched. Able to increase the sharpness of video, machine learning can enable upscaling from HD to UHD as well as increasing the frame rate.
Bitmovin’s Adithyan Ilangovan is here to explain the success they’ve seen with super-resolution. Though he concentrates on upscaling, this is just as relevant to improving downscaling. Here are our previous articles covering super-resolution.
Adithyan outlines two main enablers of super-resolution which allow it to displace traditional methods such as bicubic and Lanczos. The first is the maturing of machine learning, which now has a good foundation of libraries and documentation, making it fairly accessible to a wide audience of coders. The second is the proliferation of GPUs and, particularly in mobile devices, neural engines. Using the GPUs integrated into CPUs or sitting in desktop PCIe slots allows the analysis to be done locally, without transferring great amounts of video to the cloud solely for processing or identification. And if your workflow is in the cloud, it’s now easy to rent GPUs and FPGAs to handle such workloads.
Using machine learning doesn’t only allow for better upscaling on a frame-by-frame basis; it can also form a view of the whole file, or at least the whole scene. With a better understanding of the type of video it’s analysing (cartoon, sports, computer screen etc.), it can tune the upscaling algorithm to deal with that content optimally.
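A toy illustration of that routing idea follows: classify the scene type first, then hand the frames to an upscaler tuned for that content. The classifier output and checkpoint names are placeholders, not Bitmovin’s implementation.

```python
CONTENT_MODELS = {
    "anime":   "sr_anime.pt",    # hypothetical: trained on line art, flat shading
    "sports":  "sr_sports.pt",   # hypothetical: fast motion, grass textures
    "screen":  "sr_screen.pt",   # hypothetical: text and UI edges
    "generic": "sr_generic.pt",
}

def pick_upscaler(scene_type: str) -> str:
    """Return the checkpoint best matched to the detected content type."""
    return CONTENT_MODELS.get(scene_type, CONTENT_MODELS["generic"])

print(pick_upscaler("anime"))   # sr_anime.pt
```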
Anime has seen a lot of tuning for super-resolution. Due to anime’s long history, there are many old cartoons which are both noisy and low resolution, still enjoyed today, which would benefit from more resolution to match the screens we now routinely use.
Adithyan finishes by asking how we can best take advantage of super-resolution. Codecs such as LCEVC use it directly within the codec itself, but for systems that have pre- and post-processing around the encoder, Adithyan suggests it’s viable to consider reducing the bitrate to cut CDN costs, knowing that, with super-resolution at the decoder, the video quality can be maintained.
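The economics are easy to sanity-check with back-of-envelope arithmetic. All the figures below are invented for illustration, not from the talk:

```python
# Assumed numbers: 1M views of one hour each, and two delivery plans --
# ship UHD directly, or ship HD and upscale in the player.
views = 1_000_000
hours_per_view = 1.0
gb_per_hour = {"uhd_direct": 7.0, "hd_plus_sr": 2.5}   # assumed ladder rates
cdn_cost_per_gb = 0.02                                  # assumed $/GB

for plan, rate in gb_per_hour.items():
    cost = views * hours_per_view * rate * cdn_cost_per_gb
    print(f"{plan}: ${cost:,.0f}")
# uhd_direct: $140,000
# hd_plus_sr: $50,000
```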
Artificial Intelligence (AI) and Machine Learning (ML) dominate many discussions, and for good reason: they usually reduce both time and cost. In the broadcast industry there are some obvious areas where the technology will, and already does, help. But what’s the timetable? Where are we now? And what are we trying to achieve with the technology?
Edmundo Hoyle from TV Globo explains how they have transformed thumbnail selection for their OTT service from a manual process, taking an editor 15 minutes per video, to an automated process using machine learning. A good thumbnail is relevant, is a clear picture and contains no nudity or weapons. Edmundo explains that they tackled this in a three-step process. The first step uses NLP analysis of the episode summary to understand what’s relevant and to match that against the subtitles (closed captions). Doing this identifies times in the video which should be examined more closely for thumbnails.
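Here’s a hedged sketch of that first step: score each subtitle cue against the episode summary with TF-IDF cosine similarity, and keep the timestamps of the best-matching cues as candidate regions. TV Globo’s actual NLP pipeline (and its Portuguese-language models) will differ; the threshold and example text are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

summary = "The detective confronts the mayor about the missing funds."
cues = [  # (timestamp in seconds, subtitle text)
    (12.5,  "Good morning, everyone."),
    (340.0, "Mayor, we need to talk about the missing funds."),
    (598.2, "Let's get lunch."),
]

vec = TfidfVectorizer()
matrix = vec.fit_transform([summary] + [text for _, text in cues])
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()

# Keep timestamps whose cue is sufficiently similar to the summary
candidates = [t for (t, _), s in zip(cues, scores) if s > 0.2]
print(candidates)   # [340.0] -- seconds worth scanning for thumbnails
```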
The durations identified by this process are then analysed for blur-free frames (amongst other metrics to detect clear videography), giving candidate pictures which may still contain problematic imagery. These are sent to the AWS service Rekognition, which returns information on whether faces, guns or nudity are present in the frame. Edmundo finishes by showing the results, which are, in general, very positive. The final choice of thumbnail is still moderated by editors, but the process is much more streamlined: since it selects four options, editors are much less likely to have to find an image manually. Edmundo closes by explaining the chief causes of an image being rejected, which are all relatively easy to improve upon and tend to involve a person looking down or away from the camera.
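A sketch of those two steps under stated assumptions: variance of the Laplacian as a simple sharpness score (a common stand-in; the talk mentions several metrics), then Amazon Rekognition’s moderation labels. It needs opencv-python, boto3 and AWS credentials; the file path and threshold are hypothetical.

```python
import cv2
import boto3

def sharpness(path: str) -> float:
    """Higher variance of the Laplacian = stronger edges = less blur."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def frame_is_safe(path: str) -> bool:
    """Reject frames Rekognition flags for nudity, violence etc."""
    rek = boto3.client("rekognition")
    with open(path, "rb") as f:
        resp = rek.detect_moderation_labels(Image={"Bytes": f.read()})
    return len(resp["ModerationLabels"]) == 0

frame = "candidate_frame.jpg"          # hypothetical path
if sharpness(frame) > 100.0 and frame_is_safe(frame):
    print("offer to editors as a thumbnail option")
```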
We’ve seen before on The Broadcast Knowledge the idea of super-resolution, which involves upscaling images/video using machine learning; the result is better than using standard linear filters like Lanczos. This has been covered in a talk from Mux’s Nick Chadwick about LCEVC. Yiannis Andreopoulos from iSize talks next about the machine learning they use to improve video, applying some of these same principles to pre-treat, or as they call it ‘pre-code’, video before it’s encoded with a standard MPEG encoder (whether that be AVC, HEVC or the upcoming VVC). Yiannis explains how they are able to work out the best resolutions to encode at and scale the image intelligently. This delivers significant gains across all the metrics, leading to bandwidth reduction. Furthermore, he outlines a feedback system which maintains the structure of the video, stopping it becoming too blurry, which can be a consequence of being too subservient to the drive to reduce bitrate by simplifying the picture. It can also protect itself from going too far down the sharpness path and only chasing metric gains. He concludes by outlining future plans.
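To illustrate the balancing act Yiannis describes, here is a toy loss function, emphatically not iSize’s actual precoder, that rewards bitrate-friendly smoothing while penalising loss of structure, so a pre-treated frame can’t drift into blur just to chase rate savings. All the weights and the rate proxy are invented for the sketch.

```python
import torch
import torch.nn.functional as F

def precode_loss(precoded, original, w_struct=1.0, w_rate=0.1):
    fidelity = F.mse_loss(precoded, original)

    # Structure term: keep the gradient (edge) field close to the original's
    def grads(x):
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]
    gx_p, gy_p = grads(precoded)
    gx_o, gy_o = grads(original)
    structure = F.mse_loss(gx_p, gx_o) + F.mse_loss(gy_p, gy_o)

    # Crude rate proxy: high-frequency energy tends to cost bits to encode
    rate_proxy = gx_p.abs().mean() + gy_p.abs().mean()

    return fidelity + w_struct * structure + w_rate * rate_proxy

x = torch.rand(1, 1, 64, 64)
print(precode_loss(x * 0.9, x).item())
```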
Grant Franklin Totten then steps up to explain how Al Jazeera have used AI/machine learning to help automate editorial compliance processes. He introduces the idea of ‘Contextual Video Metadata’, which adds a level of context to what would otherwise be stand-alone metadata. To understand this, we need to learn more about what Al Jazeera is trying to achieve.
As a news organisation, Al Jazeera has many aspects of reporting to balance. They are particularly focused on detecting bias and fake news and on good fact-checking, and they are using AI and machine learning to support this. They have both textual and video-based methods of detecting fake news. As an example of their search for bias, they have implemented voice detection and analysed MPs’ speaking time in Ireland. Irish law requires equal speaking time, yet Al Jazeera can easily show that some MPs get far more time than others. Another challenge is detecting incorrect on-screen text, with the example given of accidentally naming Trump as Obama on a lower-third graphic. Using OCR, NLP and face recognition, they can flag issues with the hope that they can be corrected before TX. In terms of understanding, for example, who is president, Al Jazeera is in the process of refining their knowledge graph to capture the information they need to check against.
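As a minimal sketch of the lower-third check, assuming pytesseract for the OCR step and a tiny, hypothetical knowledge-graph lookup (Al Jazeera’s production pipeline is far richer): read the on-screen name and compare the displayed role against a trusted source before TX.

```python
import pytesseract
from PIL import Image

KNOWLEDGE_GRAPH = {                        # hypothetical stand-in
    "donald trump": "us president",
    "barack obama": "former us president",
}

def lower_third_matches(frame_path: str, displayed_role: str) -> bool:
    """Return True if the OCR'd name's known role matches the caption."""
    name = pytesseract.image_to_string(Image.open(frame_path)).strip().lower()
    known_role = KNOWLEDGE_GRAPH.get(name, "")
    return displayed_role.lower() == known_role

# Flag for review if the graphic captions Obama as "US President":
# if not lower_third_matches("lower_third.png", "US President"): alert_editor()
```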
AI and machine learning (ML) aren’t going anywhere. This talk shines a light on two areas where they’re particularly helpful in broadcast. You can count on hearing about significant improvements in AI and ML’s effectiveness in the next few years, and about their march into other parts of the workflow. Watch now!
Per-title encoding is a common method of optimising quality and compression by changing the encoding options on a file-by-file basis. Although some would say the arrival of per-scene encoding is the death knell for per-title encoding, either is much better than the traditional approach of applying exactly the same settings to every video.
This talk with Mux’s Nick Chadwick and Ben Dodson looks at what per-title encoding is and how to go about doing it. The initial work involves doing many encodes of the same video and analysing each for quality. This allows you to work out which resolutions and bitrates to encode at to deliver the best video.
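A hedged sketch of that brute-force survey step: encode the source at a grid of resolutions and bitrates with ffmpeg, then score each output with VMAF. It assumes an ffmpeg build with libx264 and libvmaf; the grid, paths and source size are illustrative, and the JSON layout can vary between libvmaf versions.

```python
import subprocess, itertools, json

SRC = "source.mp4"   # hypothetical 1080p source
resolutions = ["1920x1080", "1280x720", "854x480"]
bitrates = ["6000k", "3000k", "1500k", "750k"]

results = []
for res, br in itertools.product(resolutions, bitrates):
    out = f"enc_{res}_{br}.mp4"
    subprocess.run(["ffmpeg", "-y", "-i", SRC, "-s", res, "-b:v", br,
                    "-c:v", "libx264", out], check=True)
    # Upscale the encode back to source size and compare against it with VMAF
    subprocess.run(["ffmpeg", "-i", out, "-i", SRC,
                    "-lavfi",
                    "[0:v]scale=1920:1080:flags=bicubic[d];"
                    "[d][1:v]libvmaf=log_fmt=json:log_path=vmaf.json",
                    "-f", "null", "-"], check=True)
    with open("vmaf.json") as f:
        score = json.load(f)["pooled_metrics"]["vmaf"]["mean"]
    results.append((res, br, score))   # one rate-quality point per encode
```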
Ben Dodson explains the way they implemented this at Mux using machine learning. This was done by getting computers to ‘watch’ videos and extract metadata. That metadata can then be used to inform the encoding parameters without the computer watching the whole of a new video.
Nick takes some time to explain Mux’s ‘convex hulls’, which give a shape to the content’s performance at different bitrates and help visualise the optimum encoding parameters for the content. Moreover, we see that using this technique we can explore how to change resolution to create the best encode. This doesn’t always mean reducing the resolution; there are some surprising circumstances where it makes sense to start at high resolutions, even for low bitrates.
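The sketch below shows a Pareto-frontier simplification of that convex-hull idea: from the (bitrate, quality) survey points, keep only encodes that no cheaper-or-equal encode beats on quality, giving the upper frontier used to pick ladder rungs. The data is invented and this is not Mux’s implementation.

```python
points = [  # (kbps, VMAF) across mixed resolutions -- invented numbers
    (750, 62), (1500, 74), (1500, 70), (3000, 84), (3000, 88), (6000, 93),
]

def upper_frontier(pts):
    """Keep points not dominated by a cheaper-or-equal, better-quality one."""
    pts = sorted(pts, key=lambda p: (p[0], -p[1]))  # by rate, best quality first
    frontier, best_q = [], float("-inf")
    for rate, quality in pts:
        if quality > best_q:            # quality must improve as rate rises
            frontier.append((rate, quality))
            best_q = quality
    return frontier

print(upper_frontier(points))
# [(750, 62), (1500, 74), (3000, 88), (6000, 93)]
```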
The next stage after per-title encoding is to segment the video and encode each segment differently. Nick explores this and explains how to deliver different resolutions throughout the stream, seamlessly switching between them. Ben takes over and explains how this can be implemented and how to choose the segment boundaries correctly, again using a machine learning approach to analysis and decision making.
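One simple way to find candidate segment boundaries, offered here as a hedged sketch rather than Mux’s ML-driven method, is ffmpeg’s scene-change score via the select filter, keeping timestamps where the score exceeds a threshold:

```python
import subprocess, re

cmd = ["ffmpeg", "-i", "source.mp4",
       "-vf", "select='gt(scene,0.4)',showinfo",   # 0.4 is an assumed threshold
       "-f", "null", "-"]
proc = subprocess.run(cmd, capture_output=True, text=True)

# showinfo logs details (incl. pts_time) of each selected frame to stderr
boundaries = [float(t) for t in re.findall(r"pts_time:([0-9.]+)", proc.stderr)]
print(boundaries)   # e.g. [12.88, 47.04, 93.2] -- candidate cut points
```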