JPEG XS is a low-latency, light-compression codec often called a ‘mezzanine’ codec. Encoding within milliseconds, JPEG XS can compress full-bandwidth signals by 4x or more, leaving scope for several generations of compression without significant degradation. Its low latency and resilience to generation loss make it ideal for enabling remote production.
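To give a feel for what a 4x ratio means, the arithmetic below estimates the active-picture bitrate of a 1080p60, 10-bit 4:2:2 signal and the rate left after 4:1 compression. The choice of format and the function name are assumptions for illustration only, and blanking, audio and ancillary data are ignored.

```python
def active_video_bitrate(width, height, fps, bit_depth, samples_per_pixel):
    """Bits per second for the active picture only (no blanking, no audio)."""
    return width * height * fps * bit_depth * samples_per_pixel

# 4:2:2 sampling averages 2 samples per pixel (1 luma + 0.5 Cb + 0.5 Cr).
uncompressed = active_video_bitrate(1920, 1080, 60, 10, 2)
compressed = uncompressed / 4  # the 4:1 ratio mentioned above

print(f"uncompressed: {uncompressed / 1e9:.2f} Gb/s")  # 2.49 Gb/s
print(f"at 4:1:       {compressed / 1e6:.0f} Mb/s")    # 622 Mb/s
```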
John Dale from Media Links joins us to look at what’s being done within the Video Services Forum (VSF) to ensure interoperability. As a new standard, JPEG XS has yet to be implemented, or is still being implemented, in many companies’ products. This is therefore the perfect time to be looking at how to standardise interconnects.
Running JPEG XS over MPEG TS is one approach, which is being written up in ‘VSF TR-07’ (Technical Recommendation 7) and is close to completion. It defines capabilities for 2K, 4K and 8K video with and without HDR. The video formats have been split into capability sets, meaning that a vendor can comply with the specification by stating which subset(s) it can cope with. All formats up to 1080p60 are under capability set ‘A’ with ‘B’ covering UHD resolutions. After this work, the group will look at JPEG XS over ST 2110-22 instead of MPEG TS. This is yet to start and will reuse much of the TR-07 work.
As we saw in this week’s AV1 panel, AV1 encoding times have dropped into a practical range and it’s starting to gain traction. One of the codec’s key differentiators, shared only with VVC, is the inclusion by default of tools aimed at encoding screens and computer graphics rather than natural video.
Zoe Liu, CEO of Visionular, talks at RTE2020 about these special abilities of AV1 to encode screen content. The video starts with a refresher on AV1 in general, its arrival on the scene from the Alliance for Open Media and the en/decoder ecosystem around it, such as SVT-AV1, which we talked about two days ago, dav1d, rav1e etc., as well as a look at the hardware encoders being readied from the likes of Samsung.
Turning her focus to screen content, Zoe explains that screen content is different for a number of reasons. For content like this presentation, much of the video stays static a lot of the time, then there is a peak as the slide changes. This gives rise to the idea of allowing for variable frame rates but also optimising for the depth of the colour palette. Motion on screens can be smoother and also has more distinct patterns in the form of identical letters. This seems to paint a very specific picture of what screen content is, when we all know that it’s very variable and usually has mixed uses. However, having tools to capture these situations as they arise is critical for the times when it matters and it’s these coding tools that Zoe highlights now.
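The idea of optimising for the depth of the colour palette can be sketched in a few lines: a block of text or UI often contains only a handful of distinct colours, so it can be described as a short palette plus a map of indices. This is a minimal illustration rather than AV1’s actual palette tool; the function names and the limit of 8 colours are assumptions made for the example.

```python
def palette_encode(block, max_colours=8):
    """Return (palette, index_map) if the block has few enough colours, else None."""
    palette = sorted({pixel for row in block for pixel in row})
    if len(palette) > max_colours:
        return None  # too many colours: fall back to ordinary prediction
    lookup = {colour: i for i, colour in enumerate(palette)}
    index_map = [[lookup[pixel] for pixel in row] for row in block]
    return palette, index_map

def palette_decode(palette, index_map):
    """Rebuild the block by looking each index up in the palette."""
    return [[palette[i] for i in row] for row in index_map]

# A 4x4 block of black text (0) on a white background (255).
text_block = [
    [255, 255, 0, 255],
    [255, 0, 0, 255],
    [255, 255, 0, 255],
    [255, 255, 0, 255],
]
palette, indices = palette_encode(text_block)
assert palette_decode(palette, indices) == text_block
```

With only two colours in the block, each pixel needs a single bit of index rather than a full sample value, which is where the saving comes from.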
One common technique is to partition the screen into variable-sized blocks and AV1 brings more partition shapes than in HEVC. Motion compensation has been the mainstay of MPEG encoding for a long time. AV1 also uses motion compensation and for the first time brings in motion vectors which allow for rotation and zooming. Zoe explains the different modes available including compound motion modes of which there are 128.
Capitalising on the repetitive nature screen content can have, Intra Block Copy (IntraBC) is a technique used to copy part of a frame to other parts of the frame. Similar to motion vectors which point to other frames, this helps replication within the frame. This is used as part of the prediction and therefore can be modified before the decode is finished, allowing for small variations. Alongside Palette Mode, CfL (Chroma from Luma) is a predictor for colour based on the luma signal and some signalling from the encoder.
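As a rough sketch of the Intra Block Copy idea, the function below searches the part of the frame that has already been coded for an exact copy of the current block and returns the displacement to it. It is a simplification under stated assumptions: the function name is hypothetical, only blocks lying entirely in the rows above the current block count as ‘already coded’, and only exact matches are accepted, whereas a real encoder would also code a residual to allow the small variations mentioned above.

```python
def find_intra_block_copy(frame, top, left, size):
    """Search the already-coded area of the same frame for a copy of a block.

    Returns a (dy, dx) displacement to the matching block, or None.
    """
    target = [row[left:left + size] for row in frame[top:top + size]]
    width = len(frame[0])
    # Only candidate blocks that end above the current block are searched.
    for y in range(0, top - size + 1):
        for x in range(0, width - size + 1):
            candidate = [row[x:x + size] for row in frame[y:y + size]]
            if candidate == target:
                return (y - top, x - left)
    return None

# The 2x2 pattern at the top-left repeats at the bottom-right,
# much as an identical letter might repeat on a slide.
frame = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 3, 4],
]
vector = find_intra_block_copy(frame, top=2, left=2, size=2)
print(vector)  # (-2, -2)
```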
Zoe highlights two areas where screen content reacts badly to encoding tools that are normally beneficial. One is temporal filtering, which is usually associated with 8% gains in efficiency at the encoder but can make motion vectors much more complicated in screen content and hurt compression efficiency. Similarly, when partitioning, smaller sizes often work well for natural video, but the opposite is true for screen content.
The talk finishes with Zoe explaining how Visionular’s own AV1 implementation performed on standardised 4K against other implementations, their implementation of scalable video coding for RTC and the overall compression improvements.
With two years of development and deployments under its belt, AV1 is still emerging onto the codec scene. That’s not to say that it’s not in use billions of times a year, but compared to the incumbents, there’s still some distance to go. Known as very slow to encode and computationally impractical, today’s panel is here to say that’s old news and AV1 is now a real-time codec.
Brought together by Jill Boyce of Intel, we hear from Amazon, Facebook, Google, Twitch, Netflix and Tencent in this panel. Intel and Netflix have been collaborating on the SVT-AV1 encoder and decoder framework for two years. The goal of SVT-AV1 was to be a high-performance and scalable encoder and decoder, using parallelisation to achieve this aim.
Yueshi Shen from Amazon and Twitch is first to present, explaining that for them, AV1 is a key technology in the 5G era. They have put together a 1440p, 120fps games demo which has been enabled by AV1. They feel that this resolution and framerate will be a critical feature for Twitch in the next two years as computer games increasingly extend beyond typical broadcast boundaries. Another key feature is achieving an end-to-end latency of 1.5 seconds which, he says, will partly be achieved using AV1. His company has been working with SoC vendors to accelerate the adoption of AV1 decoders, as their proliferation is key to a successful transition to AV1 across the board. Simultaneously, AWS has been adding AV1 capability to MediaConvert and is planning to continue AV1 integration in other turnkey content solutions.
David Ronca from Facebook says that AV1 gives them the opportunity to reduce video egress bandwidth whilst also helping increase quality. For them, SVT-AV1 has brought using AV1 into the practical domain and they are able to run AV1 payloads in production as well as launch a large-scale decoder test across a large set of mobile devices.
Matt Frost represents Google Chrome and Android’s point of view on AV1. Early adopters, having been streaming partly using AV1 since 2018 at resolutions small and large, they have recently added support in Duo, their Android video-conferencing application. As with all such services, the pandemic has shown how important they can be and how important it is that they can scale. Their move to AV1 streaming has had favourable results, which is the start of the return on their investment in the technology.
Google’s involvement with the Alliance for Open Media (AOM), along with the other founding companies, was born out of a belief that in order to achieve the scales needed for video applications, the only sensible future was with cheap-to-deploy codecs, so it made a lot of sense to invest time in the royalty-free AV1.
Andrey Norkin from Netflix explains that they believe AV1 will bring a better experience to their members. Netflix has been using AV1 in streaming since February 2020 on Android devices using a software decoder. This has allowed them to get better quality at lower bitrates than VP9, and they are testing AV1 on other platforms. Intent on only using 10-bit encodes across all devices, Andrey explains that this mode gives the best efficiency. As well as being founding members of AOM, Netflix has also developed AVIF, which is an image format based on AV1. According to Andrey, they see better performance than most other formats out there. As AVIF works better with text on pictures than other formats, Netflix are intending to use it in their UI.
Tencent’s Shan Liu explains that they are part of AOM because video compression is key for most Tencent businesses in their vast empire. Tencent Cloud has already launched an AV1 transcoding service and supports AV1 in VoD.
The panel discusses low-latency use of AV1, with Dave Ronca explaining that, with the performance improvements of the encoder and decoders alongside the ability to tune the decode speed of AV1 by turning certain tools on and off, real-time AV1 is now possible. Amazon is paying attention to low-end, sub-$300 handsets, according to Yueshi, as they believe this will be where the most 5G growth will occur, so they cite recent tests showing AV1 decoding using only 3.5 cores on a mobile SoC as encouraging, since it’s standard to have 8 or more. They have now moved to researching battery life.
The panel finishes with a Q&A touching on encoding speed, the VVC and LCEVC codecs, the Sisvel AV1 patent pool, the next ramp-up in deployments and the roadmap for SVT-AV1.
Speakers
AWS & Twitch
Video Infrastructure Team,
Product Manager, Chrome Media Technologies,
Emerging Technologies Team
Dr Shan Liu
Chief Scientist & General Manager,
Tencent Media Lab
Ioannis Katsavounidis from Facebook joins us to talk us through his work finding the best balance between computation and encoding. He explains how encoding has moved from real-time, hardware-based encoding in the late 80s and 1990s through to file encoding, chunk-based encoding and now shot-based encoding. Each of these stages has brought opportunities to speed up encoding, but there has always been a fundamental reason why encoding can’t simply be sped up by the advance of IT.
Moore’s law posits that the number of transistors in chips doubles roughly every two years. Whilst this has continued to be true until recent years, transistors have always been a proxy for processing power. For many years now, the way to keep the computational ability of CPUs high has been not to increase clock speed, as it was twenty years ago, but to add cores to the chip. As each core acts as its own CPU, this gives the ability to execute code in parallel with a thread of code running separately on each core. Whilst 12-20 cores are typical for servers, there are CPUs which deliver up to 128 cores.
Ioannis explains why DCT-based codecs are resistant to multi-threaded encoding by showing how some of the encoding decisions are based on the previously decoded video frame, so the encoder needs to decode the video before it has the information it needs to make the next encode decisions. An example of this is motion estimation, where you need to understand what a macroblock looks like in order to determine if and how it can be moved to form part of the macroblock currently being encoded.
It turns out that some of the information you need to calculate can be found from the original video. Whilst this doesn’t provide full parallelisation, it does help in freeing some of the computation to be done in parallel thus reducing the length of time spent on the linear encoding stage. As the design of the codec itself is limited in its ability to be parallelised, the best way to speed up encoding has been to split up the original video and encode these, now separate, sections independently.
Speeding up video encoding has therefore focused on splitting up the video into different sections and encoding those in parallel rather than trying to parallelise the encoding itself. Encoding each frame separately is one way to do this, but sacrifices encoding efficiency. Splitting each frame up into sections (tiles or slices) is another way, though this also sacrifices either quality or bitrate. The most successful encoding parallelisation has been chunked encoding. As streaming applications use chunks, typically around 2 seconds nowadays, there’s no reason not to just cut your video up into small sections and encode those separately; the whole of this video focuses on non-live video.
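Chunked encoding can be sketched as follows: cut the timeline into fixed-length frame ranges and hand each one to a separate worker. This is only an outline under stated assumptions. `encode_chunk` is a hypothetical stand-in that returns a filename instead of running a real encoder, and threads are used on the assumption that each worker would simply launch and wait for an external encoder process.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(total_frames, fps, chunk_seconds=2):
    """Return (start, end) frame ranges, end exclusive, of about chunk_seconds each."""
    step = int(fps * chunk_seconds)
    return [(start, min(start + step, total_frames))
            for start in range(0, total_frames, step)]

def encode_chunk(frame_range):
    """Hypothetical stand-in for launching a real encoder on one chunk."""
    start, end = frame_range
    return f"chunk_{start:06d}_{end:06d}.bin"

# 10 seconds of 30fps video becomes five independent 2-second jobs.
chunks = split_into_chunks(total_frames=300, fps=30)
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order, so outputs line up with the timeline.
    outputs = list(pool.map(encode_chunk, chunks))

print(len(chunks), outputs[0])  # 5 chunk_000000_000060.bin
```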
If there’s a shot change in the middle of your chunk, this is likely to look very bad since the motion estimation will fail to produce good results and there may not be enough bitrate budget to compensate. Therefore it’s best to drop in an IDR frame at the shot change or to actually change your video chunks to match shot changes. Simply encoding these chunks in parallel would speed up the encoding, however, it misses an opportunity to optimise quality vs bitrate.
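Matching chunks to shot changes might look like the sketch below, which cuts at every shot boundary and then splits any shot that is still longer than a maximum chunk length. The function name, the frame numbers and the 60-frame limit are assumptions for illustration, and the shot-change positions are taken as given rather than detected.

```python
def shot_aligned_chunks(total_frames, shot_changes, max_chunk_frames):
    """Cut at every shot change, then split shots longer than max_chunk_frames."""
    boundaries = sorted({0, total_frames, *shot_changes})
    chunks = []
    for shot_start, shot_end in zip(boundaries, boundaries[1:]):
        for start in range(shot_start, shot_end, max_chunk_frames):
            chunks.append((start, min(start + max_chunk_frames, shot_end)))
    return chunks

# Shot changes at frames 45 and 130 in a 200-frame clip.
aligned = shot_aligned_chunks(200, shot_changes=[45, 130], max_chunk_frames=60)
print(aligned)
# [(0, 45), (45, 105), (105, 130), (130, 190), (190, 200)]
```

Because each chunk is encoded independently, every chunk begins with a keyframe, so no chunk has to predict across a shot change.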
Ioannis explains an experiment to determine the best operating point for chunks. He starts by reminding us that all encoders have certain ‘speed’ settings which control how much computation, and therefore time, is required for each encode. The ‘very fast’ setting in x264 will encode at very high speed, but the quality will be worse for a given bitrate compared to the ‘very slow’ setting. Ioannis’s experiment encoded each chunk at every speed setting for a variety of resolutions and bitrates. Each encode was then analysed for quality using PSNR, MS-SSIM and VMAF.
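Of the three metrics, PSNR is the simplest to show. The sketch below computes it from the mean squared error between reference and encoded samples, assuming 8-bit values in flat lists; the function name and sample data are illustrative and not taken from the talk.

```python
import math

def psnr(reference, distorted, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-sized sample lists."""
    if len(reference) != len(distorted):
        raise ValueError("inputs must have the same number of samples")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)

# Every sample is off by one level, so the mean squared error is 1.
score = psnr([100, 100, 100, 100], [101, 99, 101, 99])
print(f"{score:.2f} dB")  # 48.13 dB
```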
From Ioannis’ work, we can see how the bitrate setting affects both the encode time and the quality and we can observe that the slower speeds tend to have minimal quality advantages for the significant extra time involved in the encoding. Each curve has a steep part and a shallow section with the transition between known as the ‘convex hull’. Choosing a setting on the convex hull portion of the line is the optimal balance between quality and encoding time and is where, says Ioannis, most people should aim to operate.
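One common way to find such operating points is to take the upper convex hull of the (encode time, quality) measurements, which discards any setting that costs more time without a matching gain in quality. The sketch below is an assumption about how this could be done, not the method used in the talk, and the measurements are invented purely for illustration.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def upper_convex_hull(points):
    """Upper convex hull of (encode_time, quality) points, left to right."""
    hull = []
    for p in sorted(set(points)):
        # Drop the last hull point while it lies on or below the line to p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def efficient_operating_points(points):
    """Hull points up to the best quality; later ones only cost more time."""
    hull = upper_convex_hull(points)
    best = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[:best + 1]

# Invented (seconds, quality score) results for six speed settings.
measurements = [(1, 70), (2, 80), (3, 78), (4, 85), (8, 86), (16, 86.5)]
best_points = efficient_operating_points(measurements)
print(best_points)
# [(1, 70), (2, 80), (4, 85), (8, 86), (16, 86.5)]
```

In this made-up data the setting that took 3 seconds is dropped, because a faster setting already achieved higher quality.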
The talk finishes with a summary of the conclusions which can be drawn from this work looking at the use of convex-hull which we’ve just discussed, the best type of parallel processing, whether oversubscription of CPU cores is helpful or not and an interesting observation that it’s often the metrics which put a significant burden on encoding rather than the video encoding itself, particularly for lower resolutions.
Views and opinions expressed on this website are those of the author(s) and do not necessarily reflect those of SMPTE or SMPTE Members.
This website is presented for informational purposes only. Any reference to specific companies, products or services does not represent promotion, recommendation, or endorsement by SMPTE