MPEG-H 3D Audio is an object-based audio coding standard. Object audio keeps parts of the audio as separate sound samples allowing them to be moved around the soundfield, unlike traditional audio where everything is mixed down into a static mix whether stereo or surround. The advantage of keeping some of the audio separate is that it can be adapted to nearly any set of speakers whether it be a single pair or an array of 25 + 4. This makes it a great cinema and home-theatre format but one which also works really well in headphones.
In this video, Yannik Grewe from Fraunhofer IIS gives an overview of the benefits of MPEG-H and the way in which it’s put together internally. The major benefit which will be noticed by most people is immersive content as it allows a better representation of the surround sound effect with options for interactivity. Personalisation is another big benefit where the listener can, for example, select a different language. Under-appreciated, but very important is the accessibility functionality available where dialogue-friendly versions of the audio can be selected or an extra audio description track can be added.
Yannik moves on, giving a demo of software that allows you to place audio objects within a room relative to the listener. He then shows that MPEG-H changes the traditional audio workflow only by adding an authoring stage, which checks that the audio is correct and adds metadata to it. It’s this metadata that will do most of the work in defining the MPEG-H audio.
Within the MPEG-H metadata, Yannik explains, there is some overall scene information which includes details about reproduction and setup, loudness and dynamic range control as well as the number of objects. Under that lie components such as a surround sound ‘bed’ with a number of separate audio tracks for speech. Each of these components can be made into an either-or group whereby only one can be chosen at a time. This is ideal for audio that is not intended to be played simultaneously with another. Metadata control means you can actually offer many versions of audio with no changes to the audio itself. Yannik concludes by introducing us to the MPEG-H Production Format (MPF).
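To make the either-or group idea concrete, here is a minimal Python sketch. It is not the MPEG-H metadata syntax or any real MPEG-H API; the class names (`Component`, `SwitchGroup`, `Scene`) and the track names are hypothetical and exist only to show how metadata can select between versions without touching the audio itself.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One audio element in the scene. Names and kinds are illustrative only."""
    name: str
    kind: str  # "bed" or "object"


@dataclass
class SwitchGroup:
    """An either-or group: exactly one member is active at a time."""
    name: str
    members: list
    selected: int = 0

    def select(self, member_name):
        names = [m.name for m in self.members]
        if member_name not in names:
            raise ValueError(f"{member_name!r} is not in group {self.name!r}")
        self.selected = names.index(member_name)

    def active(self):
        return self.members[self.selected]


@dataclass
class Scene:
    """Components that always play, plus groups where the listener picks one."""
    always_on: list = field(default_factory=list)
    switch_groups: list = field(default_factory=list)

    def active_components(self):
        chosen = [group.active() for group in self.switch_groups]
        return [c.name for c in self.always_on + chosen]


bed = Component("surround_bed", "bed")
dialogue = SwitchGroup("dialogue", [
    Component("commentary_en", "object"),
    Component("commentary_de", "object"),
    Component("audio_description", "object"),
])
scene = Scene(always_on=[bed], switch_groups=[dialogue])

default_mix = scene.active_components()
dialogue.select("commentary_de")  # a metadata change only; no audio is re-encoded
german_mix = scene.active_components()
```

The same stored audio yields two different presentations here, which is the point Yannik makes about metadata control.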
Finally, Yannik takes us through the open-source software which is available to create, manage and test your MPEG-H audio setup.
Immersive audio is pretty much standard for premium sports coverage and can take many forms. Typically, immersive audio is explained as ‘better than surround sound’ and is often delivered to the listener as object audio such as AC-4. Delivering audio as objects allows the listener’s decoder to place the sounds appropriately for their specific setup, whether they have 3 speakers, 7, a ceiling bouncing array or just headphones. This video looks at how these can be carefully manipulated to maximise the immersiveness of the experience and is available as a binaural version.
This conversation from SVG Europe, hosted by Roger Charlesworth, brings together three academics who are applying their research to live, on-air sports. First to speak is Hyunkook Lee who discusses how to capture 360 degree sound fields using microphone arrays. In order to capture audio from all around, we need to use multiple microphones but, as Hyunkook explains, any difference in location between microphones can lead to a phase difference in the audio. This delay in the audio between two microphones gives us the spatial sound of the audio, just as the spacing of our ears helps us understand the soundscape. This effect can be considered separately in the vertical and horizontal domain, the latter being important.
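As a rough illustration of how microphone spacing turns into a delay and a phase difference, the sketch below uses the standard far-field approximation (delay = spacing × sin(angle) / speed of sound). The 0.5 m spacing, the source angle and the 1 kHz test frequency are assumptions chosen for the example, not figures from Hyunkook's talk.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate value for air at around 20 °C


def arrival_delay_s(spacing_m, source_angle_deg):
    """Delay between two microphones for a distant source.

    source_angle_deg is measured from straight ahead of the pair,
    so 0 gives no delay and 90 gives the maximum delay.
    """
    return spacing_m * math.sin(math.radians(source_angle_deg)) / SPEED_OF_SOUND_M_S


def phase_difference_deg(delay_s, freq_hz):
    """Phase offset, wrapped to 0-360 degrees, that the delay causes at one frequency."""
    return (360.0 * freq_hz * delay_s) % 360.0


# Assumed example: a pair spaced 0.5 m apart, source fully to one side.
delay = arrival_delay_s(0.5, 90)
phase_at_1k = phase_difference_deg(delay, 1000.0)
print(f"delay: {delay * 1000:.2f} ms, phase at 1 kHz: {phase_at_1k:.0f} degrees")
```

Because the same delay is a different phase offset at every frequency, spacing decisions affect the whole spectrum rather than a single tone.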
Talking about vertical separation, Hyunkook discusses the ‘Pitch-Height’ effect whereby the pitch of the sound affects our perception of its height rather than any delays between different sound sources. Modulating the amplitude, however, can be effective. Bringing multiple, slightly delayed versions of the same audio together into one mix produces comb filtering of the audio. As such, a high-level microphone used to capture ambience can colour the overall audio. Hyunkook shows that this colouration can be mitigated by reducing the upper sound by 7 dB which can be done by angling the audio up. He finishes by playing binaural recordings made on his microphone arrays. A binaural version of this video is also available.
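The comb filtering described above can be sketched with a simple two-path model: a signal summed with one delayed copy of itself, scaled by a gain. This is a textbook simplification rather than Hyunkook's measurement setup, and the 1 ms delay is an assumed value. It shows why turning the delayed copy down by 7 dB makes the notches much shallower.

```python
import math


def db_to_gain(db):
    """Convert a level change in dB to a linear amplitude gain."""
    return 10 ** (db / 20)


def comb_magnitude_db(freq_hz, delay_s, gain):
    """Level of (direct + gain * delayed copy) relative to the direct sound alone."""
    angle = 2 * math.pi * freq_hz * delay_s
    real = 1.0 + gain * math.cos(angle)
    imag = -gain * math.sin(angle)
    magnitude = max(math.hypot(real, imag), 1e-6)  # floor avoids log10(0) in a full notch
    return 20 * math.log10(magnitude)


def notch_frequencies(delay_s, count):
    """Frequencies where the delayed copy arrives exactly out of phase."""
    return [(2 * k + 1) / (2 * delay_s) for k in range(count)]


delay = 0.001  # assumed 1 ms path difference between the two microphones
notches = notch_frequencies(delay, 3)  # roughly 500, 1500 and 2500 Hz

equal_notch_db = comb_magnitude_db(notches[0], delay, 1.0)
attenuated_notch_db = comb_magnitude_db(notches[0], delay, db_to_gain(-7))
attenuated_peak_db = comb_magnitude_db(1.0 / delay, delay, db_to_gain(-7))
ripple_db = attenuated_peak_db - attenuated_notch_db
print(f"equal levels: notch at {equal_notch_db:.0f} dB (limited by the floor)")
print(f"copy at -7 dB: notch {attenuated_notch_db:.1f} dB, ripple {ripple_db:.1f} dB")
```

With equal levels the notch is effectively total cancellation; with the copy 7 dB down, the response only ripples by a few dB, which is far less audible as colouration.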
Second up is Ben Shirley, who talks about supporting the sound supervisor’s mix with AI. Ben highlights that a sound supervisor will not just be in charge of the main programme mix, but also the comms system. As such, if that breaks, which could endanger the wider production, their attention will have to go to that rather than mixing. Whilst this may not be so much of an issue with simpler games, when producing high-end mixes with object audio, this is a very skilled job which requires constant attention. Naturally, the more immersive an experience is, the more obvious it is when mistakes happen. The solution created by Ben’s company is to use AI to create a pitch effects mix which can be used as a sustaining feed which covers moments when the sound supervisor can’t give the attention needed, but also allows them more flexibility to work on the finer points of the mix rather than ‘chasing the ball’.
The AI-trained system is able to create a constant-power mix of the on-pitch audio. By analysing the many microphones, it’s also able to detect ball kicks which aren’t close to any microphones and, indeed, may not be ‘heard’ by those mics at all. When it detects the vestiges of a ball kick, it has the ability to pull from a large range of ball kick sounds and insert one on the fly in place of the real ball kick which wasn’t usefully recorded by any mic. This comes into its own, says Ben, when used with VR or 360-degree audio. Part of what makes immersive audio special is the promise of customising the sound to your needs. What does that mean? The most basic meaning is that it understands how many speakers you have and where they are, meaning that it can create a downmix which will correctly place the sounds for you. Ideally, you would be able to add your own compression to accommodate listening at a ‘constant’ volume when dynamic range isn’t a good thing, for instance, listening at night without waking up the neighbours. Ben’s example is that in-stadium, people don’t want to hear the commentary as they don’t need to be told what to think about each shot.
Last in the order is Felix Krückels who talks about his work in German football to better use the tools already available to deal with object audio in a more nuanced way, improving the overall mix by using existing plugins. Felix starts by showing how the closeball/field of play mic contains a lot of the audio that the crowd mics contain. In fact, Felix says the closeball mic contains 90% of the crowd sound. When mixing that into stereo and also 5.1, we see that the spill in the closeball mic can cause colouration. Some stadia have dedicated left and right crowd mics. Felix then talks about personalisation in sound, giving the example of watching in a pub where there will be lots of local crowd noise so having a mix with lots of in-stadium crowd noise isn’t helpful. Much better, in that environment, to have clear commentary and ball effects with a lower-than-normal ambience. Felix plays a number of examples to show how using plugins to vary the delays can help produce the mixes needed.
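For readers unfamiliar with the term, ‘constant-power’ means the microphone gains are scaled so that their squares always add up to one, so the overall level stays steady as the mix shifts from mic to mic. The sketch below illustrates only that normalisation. It is not the system Ben describes, and the per-microphone weights are hypothetical stand-ins for however a real system decides which mics matter at a given moment.

```python
import math


def constant_power_gains(weights):
    """Turn non-negative per-microphone weights into gains whose squares sum to 1."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [math.sqrt(w / total) for w in weights]


# Hypothetical weights for four pitch-side microphones at two moments in play.
ball_near_mic_two = constant_power_gains([0.1, 0.7, 0.2, 0.0])
ball_between_mics = constant_power_gains([0.0, 0.5, 0.5, 0.0])

for gains in (ball_near_mic_two, ball_between_mics):
    power = sum(g * g for g in gains)
    print([round(g, 3) for g in gains], "total power:", round(power, 3))
```

Whichever microphones dominate, the summed power is the same, which is what keeps the effects bed from pumping as the action moves.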
Often not discussed, audio is essential to television and film so as the pixels get better, so should the sound. All aspects of audio are moving forward with more processing power at the receiver, better compression at the sender and a seismic shift in how audio is handled, even in the consumer domain. It’s fair to say that Dolby have been busy.
Larry Schindel from Linear Acoustic is here thanks to the SBE to bring us up to date on what’s normally called ‘Next Generation Audio’ (NGA). He starts from the basics looking at how audio has been traditionally delivered by channels. Stereo sound is delivered as two channels, one for each speaker, with the sound engineer choosing how the audio is split between them. With the move to 5.1 and beyond, this continued with the delivery of 6, 8 or even more channels of audio. The trouble is this was always fixed at the time it went through the sound suite. Mixing sound into channels makes assumptions on the layout of your speakers. Sometimes it’s not possible to put your speakers in the ideal position and your sound suffers.
Dolby Atmos has heralded a mainstream move to object-based audio where sounds are delivered with information about their position in the sound field as opposed to the traditional channel approach. Object-based audio leaves the downmixing to the receiver which can be set to take into account its unique room and speaker layout. It represents a change in thinking about audio, a move from thinking about the outputs to the inputs. Larry introduces Dolby Atmos and details the ways it can be delivered and highlights that it can work in a channel or object mode.
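To show what ‘leaving the downmixing to the receiver’ can look like, here is a deliberately simplified Python sketch. It pans an object between the two nearest speakers on a horizontal ring using constant-power gains. Real renderers such as those for Dolby Atmos work in three dimensions with more sophisticated methods; the function name, the layouts and the angle convention (degrees, 0 is straight ahead) are assumptions made for this illustration.

```python
import math


def render_object(azimuth_deg, layout):
    """Return one gain per speaker for an object at azimuth_deg.

    layout maps speaker name to its azimuth in degrees; angles must be distinct.
    """
    if len(layout) < 2:
        raise ValueError("this sketch needs at least two speakers")
    names = sorted(layout, key=lambda n: layout[n] % 360)
    angles = [layout[n] % 360 for n in names]
    target = azimuth_deg % 360
    gains = {name: 0.0 for name in layout}
    for i, start in enumerate(angles):
        j = (i + 1) % len(names)
        span = (angles[j] - start) % 360 or 360.0  # arc to the next speaker round the ring
        offset = (target - start) % 360
        if offset <= span:
            position = offset / span  # 0 at speaker i, 1 at speaker j
            gains[names[i]] = math.cos(position * math.pi / 2)
            gains[names[j]] = math.sin(position * math.pi / 2)
            break
    return gains


stereo = {"L": 30, "R": -30}
surround = {"L": 30, "R": -30, "C": 0, "Ls": 110, "Rs": -110}

# The same object metadata (an azimuth) rendered for two different rooms.
front_on_stereo = render_object(0, stereo)
front_on_surround = render_object(0, surround)
side_on_surround = render_object(70, surround)
```

The object itself never changes. A sound placed straight ahead is shared equally between left and right on the stereo pair, but goes to the centre speaker alone when one is available.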
Larry then looks at where you can get media with Dolby Atmos. Cinemas are an obvious starting point, but there is a long list of streaming and pay-TV services which use it, too. Larry talks about the upcoming high-profile events which will be covered in Dolby Atmos showing that delivering this enhanced experience is something being taken seriously by broadcasters across the board.
For consumers, they still have the problem of getting the audio in the right place in their awkward, often small, rooms. Larry looks at some of the options for getting great audio in the home which include speakers which bounce sound off the ceiling and soundbars.
One of the key technologies for delivering Dolby Atmos is Dolby AC-4, the improved audio codec taking compression a step further from AC-3. We see that data rates have tumbled: for example, 5.1 surround on AC-3 would be 448 kbps, but can now be done in 144 kbps with AC-4. Naturally, it supports channel and object modes and Larry explains how it can deliver a base mix with other audio elements over the top for the decoder to place, allowing better customisation. This can include other languages or audio description/video description services. Importantly, AC-4, like Dolby E, can be sent so that it doesn’t overlap video frames allowing it to accompany routed audio. Without this awareness of video, any time a video switch was made, the audio would become corrupted and there would be a click.
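To put those two data rates in perspective, the short calculation below works out the saving and the data carried per hour. It assumes 1 kbps means 1,000 bits per second and counts audio payload only, ignoring any container or transport overhead.

```python
def megabytes_per_hour(kbps):
    """Audio payload per hour in megabytes, taking 1 kbps as 1,000 bits per second."""
    bits_per_hour = kbps * 1000 * 3600
    return bits_per_hour / 8 / 1_000_000


ac3_kbps = 448  # 5.1 on AC-3, as quoted in the talk
ac4_kbps = 144  # 5.1 on AC-4, as quoted in the talk

saving_percent = (1 - ac4_kbps / ac3_kbps) * 100
ac3_mb = megabytes_per_hour(ac3_kbps)
ac4_mb = megabytes_per_hour(ac4_kbps)
print(f"saving: {saving_percent:.1f}%")
print(f"per hour: {ac3_mb:.1f} MB on AC-3 against {ac4_mb:.1f} MB on AC-4")
```

That is a little over two thirds of the bitrate removed for the same channel count.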
Dolby Atmos and AC-4 stand on their own and are widely applicable to much of the broadcast chain. Larry finishes this presentation by mentioning that Dolby AC-4 will be the audio of choice for ATSC 3.0. We’ve covered ATSC 3.0 extensively here at The Broadcast Knowledge so if you want more detail than there is in this section of the presentation, do dig in further.
The Broadcast Knowledge exists to help individuals up-skill whatever your starting point. Videos like this are far too rare giving an introduction to a large number of topics. For those starting out or who need to revise a topic, this really hits the mark particularly as there are many new topics.
John Mailhot takes the lead on SMPTE 2110 explaining that it’s built on separate media (essence) flows. He covers how synchronisation is maintained and also gives an overview of the many parts of the SMPTE ST 2110 suite. He talks in more detail about the audio and metadata parts of the standard suite.
Eric Gsell discusses digital archiving and the considerations which come with deciding what formats to use. He explains colour space, the CIE model and the colour spaces we use such as 709, 2100 and P3 before turning to file formats. With the advent of HDR video and displays which can show bright video, Eric takes some time to explain why this could represent a problem for visual health as we don’t fully understand how the displays and the eye interact with this type of material. He finishes off by explaining the different ways of measuring the light output of displays and their standardisation.
Yvonne Thomas talks about the cloud, starting by explaining the difference between platform as a service (PaaS), infrastructure as a service (IaaS) and similar cloud terms. As cloud migrations are forecast to grow significantly, Yvonne looks at the drivers behind this and the benefits that it can bring when used in the right way. Using the cloud, Yvonne shows, can be an opportunity for improving workflows and adding more feedback and iterative refinement into your products and infrastructure.
Looking at video deployments in the cloud, Yvonne introduces video codecs AV1 and VVC, both, in their own way, successors to HEVC/H.265, as well as the two transport protocols SRT and RIST which exist to reliably send video with low latency over lossy networks such as the internet. To learn more about these protocols, check out this popular talk on RIST by Merrick Ackermans and this SRT Overview.
Rounding off the primer is Linda Gedemer from Source Sound VR who introduces immersive audio, measuring sound output (SPL) from speakers and looking at the interesting problem of forward speakers in cinemas. They have long been behind the screen, which has meant the screens have to be perforated to let the sound through, which interferes with the sound itself. Now that cinema screens are changing to be solid screens, not completely dissimilar to large outdoor video displays, the speakers are having to move. But with them now out of the line of sight, how can we keep the sound in the right place for the audience?
This video is a great summary of many of the key challenges in the industry and works well for beginners and those who just need to keep up.
Views and opinions expressed on this website are those of the author(s) and do not necessarily reflect those of SMPTE or SMPTE Members.
This website is presented for informational purposes only. Any reference to specific companies, products or services does not represent promotion, recommendation, or endorsement by SMPTE