MPEG-H 3D Audio is an object-based audio coding standard. Object audio keeps parts of the audio as separate sound elements, allowing them to be moved around the soundfield, unlike traditional audio where everything is mixed down into a static stereo or surround mix. The advantage of keeping some of the audio separate is that it can be adapted to nearly any set of speakers, whether a single pair or an array of 25 + 4. This makes it a great cinema and home-theatre format, but one which also works really well on headphones.
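As a rough illustration of how an object’s position can be turned into speaker feeds at playback time, the sketch below uses a standard sine/cosine constant-power pan law for a plain stereo pair. This is not MPEG-H’s renderer, which handles far more complex layouts; the function name, the ±30° speaker positions and the sign convention (negative azimuth to the left) are assumptions for illustration.

```python
import math


def constant_power_pan(azimuth_deg: float, speaker_angle_deg: float = 30.0) -> tuple[float, float]:
    """Return (left_gain, right_gain) for an object at azimuth_deg.

    Assumed convention: negative azimuth is to the listener's left, positive to
    the right, with the two speakers at -speaker_angle_deg and +speaker_angle_deg.
    """
    # Clamp the object to the arc between the two speakers
    a = max(-speaker_angle_deg, min(speaker_angle_deg, azimuth_deg))
    # Map azimuth to a pan position p in [0, 1]: 0 = fully left, 1 = fully right
    p = (a + speaker_angle_deg) / (2 * speaker_angle_deg)
    # Sine/cosine law keeps left**2 + right**2 == 1, so loudness stays constant as it moves
    return math.cos(p * math.pi / 2), math.sin(p * math.pi / 2)


for az in (-30, 0, 15, 30):
    left, right = constant_power_pan(az)
    print(f"azimuth {az:+d} deg: L={left:.3f} R={right:.3f}")
```

Because the gains are computed from position metadata rather than baked into the mix, the same object could be re-rendered for a different speaker layout by swapping in a different gain calculation.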
In this video, Yannik Grewe from Fraunhofer IIS gives an overview of the benefits of MPEG-H and how it is put together internally. The benefit most people will notice is immersive content: MPEG-H allows a better representation of the surround sound effect, with options for interactivity. Personalisation is another big benefit, where the listener can, for example, select a different language. Under-appreciated but very important is the accessibility functionality, where a dialogue-friendly version of the audio can be selected or an extra audio description track added.
Yannik moves on to a demo of software that lets you place audio objects within a room relative to the listener. He then shows that MPEG-H changes the traditional audio workflow only by adding an authoring stage, which checks that the audio is correct and adds metadata to it. It’s this metadata that does most of the work in defining the MPEG-H audio.
Within the MPEG-H metadata, Yannik explains, there is some overall scene information, which includes details about reproduction and setup, loudness and dynamic range control, as well as the number of objects. Under that lie components such as a surround sound ‘bed’ alongside a number of separate audio tracks for speech. These components can be put into an either-or group, whereby only one can be chosen at a time. This is ideal for audio that is not intended to be played simultaneously with another, such as alternative languages. Metadata control means you can offer many versions of the audio with no changes to the audio itself. Yannik concludes by introducing us to the MPEG-H Production Format (MPF).
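A minimal sketch of the hierarchy described above, assuming a hypothetical data model: the class names, fields and the ‘Dialogue+’ component are illustrative only and are not the MPEG-H metadata syntax. It shows how an either-or group lets the listener switch versions purely through metadata, without touching the audio components themselves.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    kind: str  # e.g. "bed" or "object"


@dataclass
class SwitchGroup:
    """Either-or group: exactly one member is active at a time."""
    name: str
    members: list[Component]
    default: str


@dataclass
class Scene:
    loudness_lufs: float  # scene-level info; not used further in this sketch
    beds: list[Component] = field(default_factory=list)
    switch_groups: list[SwitchGroup] = field(default_factory=list)

    def active_components(self, choices: dict[str, str]) -> list[str]:
        """Resolve what plays from the listener's per-group choices; audio is untouched."""
        active = [c.name for c in self.beds]
        for group in self.switch_groups:
            wanted = choices.get(group.name, group.default)
            if wanted not in [m.name for m in group.members]:
                raise ValueError(f"{wanted!r} is not in group {group.name!r}")
            active.append(wanted)  # exactly one member per group
        return active


scene = Scene(
    loudness_lufs=-23.0,
    beds=[Component("5.1 bed", "bed")],
    switch_groups=[SwitchGroup(
        name="dialogue",
        members=[Component("English", "object"), Component("German", "object"),
                 Component("Dialogue+", "object")],  # hypothetical dialogue-friendly mix
        default="English",
    )],
)

print(scene.active_components({}))                      # ['5.1 bed', 'English']
print(scene.active_components({"dialogue": "German"}))  # ['5.1 bed', 'German']
```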
Finally, Yannik takes us through the open-source software which is available to create, manage and test your MPEG-H audio setup.
VVC was finalised in mid-2020 after five years of work. AVC is still going strong and is on its 26th version, so it’s clear there’s still plenty of work ahead for those involved in VVC. The Fraunhofer Heinrich Hertz Institute (HHI) has been heavily involved in AVC, HEVC and now VVC, and holds patents in all three. For VVC it is, for the first time, developing a free, open-source encoder and decoder for the standard.
In this video from OTTVerse.com, editor Krishna Rao speaks to Benjamin Bross and Adam Więckowski, both from Fraunhofer HHI. Benjamin has previously featured on The Broadcast Knowledge talking at Mile High Video about VVC; that talk, given before the codec’s release, is a great one to check out if you’re not familiar with this new codec.
They start by discussing how the institute is funded: by the German government, by money received from its patents and similar work, and by the companies for whom it carries out research. One benefit of government involvement is that all the papers they produce are free to access. Their funding model allows them to research problems very deeply, which has a number of benefits. Benjamin points to their research into CABAC, a very efficient but complex entropy coding technique. When they supported introducing it into AVC (which, remember, is 19 years old), it was very hard to find equipment that could use it, and certainly no computers could. Fast forward to today and phones, computers and pretty much all encoders take advantage of this technique to keep bitrates down, so that ability to look ahead has paid off. Secondly, giving an example from VVC, Benjamin explains they looked at using machine learning to help optimise one of the tools. This proved too difficult to implement, but it could be replaced by matrix multiplication, and that is how it was implemented. This matrix multiplication, he emphasises, couldn’t have been developed without first going into the depths of the complex machine learning.
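As a rough illustration of what replacing a learned model with a matrix multiplication can look like, the sketch below predicts a 4×4 block from the samples above and to its left using a single matrix-vector product. This most likely refers to VVC’s matrix-based intra prediction (MIP) tool, which was derived from neural-network-based intra prediction proposals from HHI; however, the averaging step, the hand-built (untrained) weights and all function names here are simplifications and assumptions. The real tool uses trained matrices specified in the standard, with an extra interpolation stage for larger blocks.

```python
def reduce_boundary(samples: list[float], out_len: int) -> list[float]:
    """Average neighbouring boundary samples down to out_len values."""
    step = len(samples) // out_len
    return [sum(samples[i * step:(i + 1) * step]) / step for i in range(out_len)]


def toy_matrix(size: int = 4) -> list[list[float]]:
    """Hand-built (NOT trained) weights: one row per predicted sample, 4 columns.

    Columns are [top_left_half, top_right_half, left_upper_half, left_lower_half].
    Samples nearer the top edge lean on the top boundary, and vice versa.
    Each row sums to 1, so a flat boundary predicts a flat block.
    """
    rows = []
    for r in range(size):
        for c in range(size):
            wt = (size - r) / ((size - r) + (size - c))  # weight on the top boundary
            row = [0.0, 0.0, 0.0, 0.0]
            row[0 if c < size // 2 else 1] += wt          # averaged top sample above this column
            row[2 if r < size // 2 else 3] += 1.0 - wt    # averaged left sample beside this row
            rows.append(row)
    return rows


def matrix_predict(top, left, matrix, size=4):
    """Predict a size x size block as a single matrix-vector product."""
    boundary = reduce_boundary(top, 2) + reduce_boundary(left, 2)        # 4 inputs
    flat = [sum(w * b for w, b in zip(row, boundary)) for row in matrix]  # size*size outputs
    return [flat[r * size:(r + 1) * size] for r in range(size)]


block = matrix_predict(top=[200, 200, 200, 200], left=[0, 0, 0, 0], matrix=toy_matrix())
for row in block:
    print([round(v) for v in row])
```

Once the weights are fixed, prediction is just multiply-and-add, which is far cheaper and more predictable for hardware than running a neural network per block.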
Krishna suggests there must be a lot of ‘push back’ from chip manufacturers, which Benjamin acknowledges, though he says people are just doing their jobs. It’s vitally important, he continues, for chip manufacturers to keep chip costs down, or nothing would actually end up in real products. Whilst discussions can get quite heated, he says, the point of the international standardisation process is to get input from all the industries at the beginning so that the outcome is an efficient, implementable standard. Only by achieving that does everyone benefit for years to come.
The conversation then moves on to the open-source initiative developing VVenC and VVdeC. These are separate from the reference implementation, VTM, although the reference software was used as the base for development. Adam and Benjamin explain that the idea behind these free implementations is to create standard software which any company can take and use in its own products. VVenC and VVdeC are optimised for speed, unlike reference implementations. Fraunhofer expects people to take this software and adapt it to suit their product, say for 360-degree video. This is similar to x264 and x265, which are open-source implementations of AVC and HEVC. Public participation is welcomed and has already been seen within the GitHub project.
Adam talks through a slide showing how newer versions of VVenC have improved both speed and compression efficiency, with more versions on their way. They talk about how some VVC features can’t really be seen from normal rate-distortion (RD) plots, giving the example of open vs closed GOP encoding. Open GOP encoding has traditionally not been usable for ABR streaming, but with VVC that’s now a possibility. Whilst it’s early days and few have yet put the new type of keyframes that enable this through their paces, they expect to start seeing good results.
The conversation then moves on to encoding complexity and the potential to use video pre-processing to help the encoder. Benjamin points out that whilst encoding time increases to reach VVC’s lowest bitrates, matching the best compression HEVC can achieve is actually quicker with VVC. Looking to the future, he says that some encoding tools scale linearly and some exponentially. He hopes to use machine learning to understand the video and help narrow down the ‘search space’ for certain tools, as it’s the search space that grows exponentially. If you can narrow that search significantly, using these techniques becomes practical. Lastly, they say the hope is to get VVenC and VVdeC into FFmpeg, at which point a whole suite of powerful pre- and post-filters becomes available to everyone.
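To see why narrowing the search space matters, here is a deliberately simplified model of an exhaustive recursive block-partitioning search, loosely modelled on VVC’s quad, binary and ternary splits (4, 2, 2, 3 and 3 sub-blocks). It assumes every sub-block is searched to the same remaining depth, which real encoders such as VVenC do not do; the function name and the ‘pruned’ split list are hypothetical, standing in for a machine-learning classifier that rules out unlikely split types.

```python
def evaluations(depth: int, children_per_split: list[int]) -> int:
    """Count block evaluations in a simplified exhaustive recursive split search.

    Each block is evaluated once unsplit; then every allowed split type is tried
    and each resulting sub-block is searched recursively one level further down.
    """
    if depth == 0:
        return 1  # smallest block: only the unsplit option remains
    per_child = evaluations(depth - 1, children_per_split)
    return 1 + sum(children_per_split) * per_child


ALL_SPLITS = [4, 2, 2, 3, 3]  # quad, two binary, two ternary (sub-blocks per split)
PRUNED = [4, 2]               # hypothetical: a classifier keeps only the two likeliest splits

for depth in range(1, 5):
    print(depth, evaluations(depth, ALL_SPLITS), evaluations(depth, PRUNED))
```

With all five split types the count grows by roughly 14× per level (15, 211, 2955, 41371), whereas keeping only two grows by about 6× (7, 43, 259, 1555), which is the kind of saving that could make an expensive tool practical.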
AI continues its march into streaming with this new approach to optimising encoder settings to keep the bitrate down and improve quality for viewers. Using what is more appropriately called ‘machine learning’, computers learn how to characterise video, avoiding hundreds of encodes whilst determining the best way to encode video assets.
Daniel Silhavy from Fraunhofer FOKUS takes the stand at Mile High Video 2020 to detail the latest technique in per-title and per-scene encoding. Daniel starts by outlining the problem with fixed ABR ladders: efficiencies can be gained by being flexible with both resolution and bitrate.
Netflix were the best-known pioneers of per-title encoding, where many, many encodes are done for each video asset to determine the best overall bitrates to choose. This is great because it allows animation to be treated differently from action films or sports, so efficiency is gained.
However, per-title delivers an average benefit. There are still parts of the video which are simple and could take a reduced bitrate, and parts whose complexity isn’t accounted for. When the bitrate is higher than necessary to achieve a certain VMAF score, Daniel calls this ‘wasted quality’: bits were spent making the quality better than it needed to be. Whilst better quality sounds like a boon, it’s not always possible for viewers to see it, hence setting a target VMAF at a lower level.
Naturally, rather than varying the resolution mix and bitrate for each file, it would be better to do it for each scene. Working this way, variations in complexity can be quickly accounted for. This can also be done without machine learning, but more encodes are needed. The rest of the talk looks at using machine learning to take a short-cut through some of that complexity.
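The idea of ‘wasted quality’ and per-scene selection can be shown with a small worked example. The rate/VMAF figures below are invented for illustration, the scenes are assumed to be of equal length, and the helper names are hypothetical.

```python
# Hypothetical (bitrate in kbps, VMAF) measurements for three equal-length scenes
SCENES = {
    "talking head": [(1000, 90.0), (2000, 95.0), (3000, 97.0), (4000, 98.0)],
    "crowd":        [(1000, 80.0), (2000, 88.0), (3000, 93.0), (4000, 95.0)],
    "sports":       [(1000, 70.0), (2000, 82.0), (3000, 89.0), (4000, 93.5)],
}
TARGET_VMAF = 93.0
PER_TITLE_KBPS = 3000  # one bitrate used for the whole title at this rung


def vmaf_at(curve, kbps):
    """Look up the measured VMAF at a given bitrate."""
    return dict(curve)[kbps]


def cheapest_meeting_target(curve, target):
    """Lowest measured bitrate whose VMAF reaches the target."""
    for kbps, vmaf in sorted(curve):
        if vmaf >= target:
            return kbps
    return max(kbps for kbps, _ in curve)  # target unreachable: use the top bitrate


for name, curve in SCENES.items():
    score = vmaf_at(curve, PER_TITLE_KBPS)
    wasted = max(0.0, score - TARGET_VMAF)  # quality above target that viewers may not see
    pick = cheapest_meeting_target(curve, TARGET_VMAF)
    print(f"{name}: per-title VMAF {score}, wasted {wasted}, per-scene pick {pick} kbps")
```

At the single per-title bitrate of 3000 kbps, the talking-head scene scores 97 against a target of 93 (4 VMAF points of wasted quality), while the sports scene falls short at 89. Choosing per scene gives 2000, 3000 and 4000 kbps: the same average bitrate, but every scene now meets the target.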
The standard workflow is to perform a complexity analysis on the video, working out VMAF scores at various bitrate and resolution combinations. From these, the ‘convex hull’ is estimated, allowing the best parameters to be determined; these then feed into the production encoding stage.
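A minimal sketch of the convex hull step, assuming hypothetical test-encode results: keep the best resolution at each bitrate, then compute the upper convex hull of the (bitrate, VMAF) points with Andrew’s monotone chain algorithm. Production systems may interpolate between measured points or use other quality metrics; the data and function names here are illustrative.

```python
# Hypothetical test encodes: (resolution, bitrate in kbps, VMAF)
ENCODES = [
    ("540p", 500, 60.0), ("540p", 1000, 72.0), ("540p", 2000, 78.0),
    ("720p", 500, 55.0), ("720p", 1000, 70.0), ("720p", 2000, 82.0), ("720p", 3000, 85.0),
    ("1080p", 1000, 63.0), ("1080p", 2000, 80.0), ("1080p", 3000, 86.0), ("1080p", 4000, 92.0),
]


def cross(o, a, b):
    """z-component of (a - o) x (b - o); negative means a clockwise (right) turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_hull(encodes):
    """Upper convex hull of (bitrate, VMAF) points, labelled with resolution."""
    best = {}
    for res, kbps, vmaf in encodes:
        # Keep only the best-scoring resolution at each bitrate
        if kbps not in best or vmaf > best[kbps][1]:
            best[kbps] = (res, vmaf)
    points = sorted((kbps, vmaf, res) for kbps, (res, vmaf) in best.items())
    hull = []
    for p in points:
        # Drop the previous point while it lies on or below the line to p
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


for kbps, vmaf, res in upper_hull(ENCODES):
    print(kbps, vmaf, res)
```

Here the 3000 kbps point is dropped because it lies below the line joining its neighbours, and the hull switches from 540p to 720p to 1080p as the bitrate rises.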
Machine learning can replace the section which predicts the best bitrate-resolution pairs. Fed with some details on complexity, it can avoid multiple encodes and deliver a list of parameters to the encoding stage. It can also receive feedback from the player, allowing further optimisation of this prediction module.
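The talk doesn’t detail the model used, so as a stand-in this sketch uses the simplest possible learned predictor: 1-nearest-neighbour lookup. Each previously analysed scene is described by two hypothetical complexity features and paired with the ladder a full convex-hull search found for it; a new scene reuses the ladder of its closest match, so no trial encodes are needed. All feature values and ladders are invented.

```python
import math

# Hypothetical training set: (spatial complexity, temporal complexity) of scenes
# already analysed with full trial encodes, paired with the ladder found for each.
TRAINING = [
    ((20.0, 5.0), [("540p", 400), ("720p", 900), ("1080p", 1800)]),
    ((45.0, 20.0), [("540p", 800), ("720p", 1600), ("1080p", 3200)]),
    ((70.0, 40.0), [("540p", 1200), ("720p", 2500), ("720p", 4000)]),
]


def predict_ladder(features, training):
    """1-nearest-neighbour: reuse the ladder of the most similar known scene."""
    _, ladder = min(training, key=lambda item: math.dist(item[0], features))
    return ladder


print(predict_ladder((65.0, 35.0), TRAINING))  # closest to the third, high-complexity scene
```

Player feedback could, for instance, be folded in by adding verified (features, ladder) pairs to the training list; that is an assumption for illustration, not a description of Fraunhofer’s system.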
Daniel shows a demo of this working, where we see that the end result has fewer rungs on the ABR ladder, a lower-resolution top rung and fewer resolutions in general, some repeated at different bitrates. This is in line with the findings of Facebook, which we covered last week, who found that removing their ‘one bitrate per resolution’ rule could improve viewers’ experience. In total, for an example Fraunhofer received from a customer, they saw a 53% reduction in the storage needed.
We’ve seen AI entering our lives in many ways over the past few years and we know that this will continue. Artificial intelligence and machine learning are so widely applicable that they will touch all aspects of our lives before too many more years have passed. So it’s natural for us to look at the broadcast industry and ask “How will AI help us?” We’ve already seen machine learning entering codecs and video processing, showing that up/downscaling can be done better by machine learning than with traditional ‘static’ algorithms such as bicubic, Lanczos and nearest-neighbour. This webinar examines the other side of things: how can we use the data available within our supply chains and from our viewers to drive efficiencies and opportunities for better monetisation?
There isn’t a strong consensus on the difference between AI and machine learning. One view is that artificial intelligence is the broader term for smart computing. Others say that AI has a more real-time feedback mechanism than machine learning (ML). ML is the process of giving a computer a large set of data and some basic abilities so that it can learn for itself. A great example of this is the AI network monitoring services that look at all the traffic flowing through your organisation and learn how people use the network. They can then look for unusual activity and alert you. Doing this without fixed thresholds (which really wouldn’t work for network use) is not feasible for humans, but computers are up to the task.
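Commercial monitoring products don’t publish their models and are far more sophisticated, but the core idea of learning what ‘normal’ looks like, rather than hard-coding a limit, can be sketched with a per-hour statistical baseline. The traffic figures, the three-standard-deviation rule and the function names are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical history: (hour of day, observed traffic in Mbit/s)
HISTORY = [(3, v) for v in (10, 12, 11, 9, 13)] + [(14, v) for v in (400, 420, 390, 410, 405)]


def learn_baseline(history):
    """Learn the mean and standard deviation of traffic for each hour of the day."""
    by_hour = defaultdict(list)
    for hour, mbps in history:
        by_hour[hour].append(mbps)
    # stdev needs at least two samples per hour
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_unusual(baseline, hour, mbps, k=3.0):
    """Flag traffic more than k standard deviations from that hour's learned mean."""
    mu, sigma = baseline[hour]
    return abs(mbps - mu) > k * max(sigma, 1e-9)  # guard against a zero spread


baseline = learn_baseline(HISTORY)
print(is_unusual(baseline, 14, 400.0), is_unusual(baseline, 3, 400.0))
```

400 Mbit/s is normal at 14:00 but flagged at 03:00, something a single fixed threshold set for daytime traffic would miss.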
For conversations such as this, it usually doesn’t matter how the computer achieves it, whether by AI, ML or otherwise. The points are: how can you simplify content production? How can you get better insights into the data you have? How can you speed up manual tasks?
David Short from IET Media moderates this session. He is joined by Steve Callanan, whose company WIREWAX is working to revolutionise video creation, asset management and interactive video services, and by Hanna Lukashevich from Fraunhofer IDMT (Institute for Digital Media Technology), who uses machine learning to understand and create music and sound. Grant Franklin Totten completes the panel with his experience at Al Jazeera, which has been working on using AI in broadcast since 2018 as a way to help maintain editorial and creative compliance, as well as for detecting fake news and checking for bias.
Moderator: David Short
IET Media Technical Network
Hanna Lukashevich
Head of Semantic Music Technologies,
Fraunhofer IDMT
Grant Franklin Totten
Head of Media & Emerging Platforms,
Al Jazeera Media Network
Views and opinions expressed on this website are those of the author(s) and do not necessarily reflect those of SMPTE or SMPTE Members.
This website is presented for informational purposes only. Any reference to specific companies, products or services does not represent promotion, recommendation, or endorsement by SMPTE