Using AI and Machine Learning to Clear the Fog of Dialogue

We’ve come a long way in streaming content and broadcast technology.

We now watch content in 4K HDR at high frame rates, on massive screens no thicker than a pencil. And it’s not just about the picture. Today’s audio is streamed in high-fidelity, multi-channel and object-based surround sound.

If everything is so great, why are so many people having so much trouble hearing and understanding dialogue on their televisions? DTS recently surveyed 1,200 people and found that a whopping 84% have experienced issues understanding dialogue while watching TV shows and movies. That has led 77% of those same respondents to use captions or subtitles at some point, with nearly one in three (30%) saying they always or often have them turned on. No matter how you look at it, that’s a lot of people having trouble with dialogue intelligibility.

It’s true that subtitle use can point to other, more subjective issues like hearing loss, which is experienced in different ways and to varying degrees. Language and dialect differences can also lead to more subtitle use. But those issues alone don’t account for how much subtitle usage has increased, especially when we consider that it actually requires more brain power to watch a program with subtitles than without. How can we explain why this is happening, and how can we address the problem of dialogue intelligibility directly?

Multiple factors feed into the problem

There’s no doubt that televisions have gotten remarkably better, and notably thinner, over the years. Bigger, flatter televisions, however, pose challenges for overall speaker quality and can have adverse effects on the audio those speakers output. Even audio processing technologies like virtual surround sound and dynamic range compression, which attempt to improve sound quality, can inadvertently affect dialogue if not tuned properly.
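As a rough illustration of the dynamic range compression point, here is a minimal wideband compressor sketch in Python; it is not any particular television’s implementation, and attack/release smoothing is omitted for brevity. Because the gain reduction is computed from everything in the mix at once, a loud effect or music swell pulls the quieter dialogue beneath it down as well.

```python
import numpy as np

def simple_compressor(mix, threshold_db=-20.0, ratio=4.0):
    """Naive wideband compressor driven by the full mix (illustrative only).

    Gain reduction is derived from the instantaneous level of the whole
    mix, so loud effects trigger reduction that also attenuates any
    dialogue playing at the same time.
    """
    eps = 1e-9
    level_db = 20 * np.log10(np.abs(mix) + eps)            # instantaneous level
    overshoot = np.maximum(level_db - threshold_db, 0.0)   # dB above threshold
    gain_db = -overshoot * (1.0 - 1.0 / ratio)             # downward compression
    return mix * 10 ** (gain_db / 20)
```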

Beyond the television itself, post-production practices can also play a part in dialogue intelligibility. More and more soundtracks are being mixed for multiple channels of audio, with one channel primarily devoted to dialogue. These immersive sound mixes are often downmixed to stereo when played over a regular television, and in the downmix the combined channels can mask the dialogue and make it more difficult to understand.
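To make that masking mechanism concrete, here is a minimal sketch of a common ITU-style 5.1-to-stereo fold-down (the exact coefficients vary by standard and decoder, so treat the numbers as illustrative). The centre channel, which carries most of the dialogue, ends up sharing the same two output channels as the music and effects.

```python
import numpy as np

def downmix_5_1_to_stereo(fl, fr, c, lfe, sl, sr):
    """Fold a 5.1 mix down to stereo with typical ITU-style coefficients.

    Each argument is a mono numpy array. The centre channel (mostly
    dialogue) is spread into both stereo channels at roughly -3 dB,
    where it now competes directly with music and effects.
    """
    c_gain = 0.707   # ~-3 dB for the centre (dialogue) channel
    s_gain = 0.707   # ~-3 dB for the surround channels
    left = fl + c_gain * c + s_gain * sl
    right = fr + c_gain * c + s_gain * sr
    # The LFE channel is typically dropped in a stereo fold-down.
    peak = max(np.max(np.abs(left)), np.max(np.abs(right)), 1.0)
    return left / peak, right / peak   # normalise to avoid clipping
```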

Finally, even the consumer’s home plays a part. Changes in the average home mean living spaces are likely to be more reverberant than before. An abundance of appliances like refrigerators, dishwashers, air conditioners and dryers creates more in-home noise pollution. Environmental sounds, both interior and exterior, can mask important dialogue, especially if that dialogue is reproduced at lower levels.

Solutions are falling short

While DSP-based (digital signal processing) solutions have become available in televisions more recently, they have not adequately solved the issue of dialogue intelligibility. Traditional dialogue enhancement algorithms boost the audio frequencies associated with speech and have been shown to improve intelligibility for people with mild-to-moderate hearing loss. But when background sounds are mixed in with the dialogue, both the dialogue and the non-dialogue get the same boost, so the situation is not much better.
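A simplified sketch of that kind of frequency-based enhancement, assuming a fixed boost of the roughly 1–4 kHz speech-presence band (the band and filter are illustrative choices, not a specific product’s tuning). Because the filter only knows about frequency, any music or effects energy in that band is boosted right along with the dialogue.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def naive_dialogue_boost(mix, sample_rate=48000, boost_db=6.0):
    """Boost the ~1-4 kHz speech band of the *full* mix (illustrative only).

    The filter cannot tell speech from background, so anything else
    occupying the same band is lifted by the same amount.
    """
    sos = butter(4, [1000, 4000], btype="bandpass", fs=sample_rate, output="sos")
    speech_band = sosfilt(sos, mix)
    gain = 10 ** (boost_db / 20)                 # dB -> linear
    boosted = mix + (gain - 1.0) * speech_band   # add the extra band energy
    return boosted / max(np.max(np.abs(boosted)), 1.0)
```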

Delivering bespoke television mixes to the home that are more dialogue-forward and have a lower dynamic range between quiet and loud sounds is one possible solution. The issue with this approach is that it’s a one-size-fits-all answer to a problem that’s more subjective, improving the experience for some while diminishing it for others.

Making the dialogue track available as a separate audio stream is another approach worth considering. It would allow the consumer to control the dialogue-to-background ratio based on their preferences, hearing needs or environmental noise. Unfortunately, this approach requires content creators themselves to adopt new workflows, which they may or may not be willing to do. And it doesn’t address the issue of legacy content, which would somehow have to be retrofitted to deliver the same benefits.
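If the dialogue did arrive as its own stream, the playback side becomes simple. A minimal sketch, assuming two time-aligned stems (the `dialogue` and `background` arrays and the default boost value are hypothetical):

```python
import numpy as np

def render_with_dialogue_control(dialogue, background, dialogue_boost_db=6.0):
    """Recombine separately delivered dialogue and background stems.

    With the dialogue carried as its own stream, the viewer (or the TV,
    reacting to room noise) can decide how far forward the dialogue sits.
    """
    gain = 10 ** (dialogue_boost_db / 20)
    out = gain * dialogue + background
    return out / max(np.max(np.abs(out)), 1.0)   # normalise to avoid clipping
```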

AI + machine learning can make the difference

AI and machine learning are more than just industry buzzwords. Recent developments in audio machine learning have made it possible to separate audio content into component parts. That means unmixing tech can now separate dialogue from non-dialogue components within any type of content, without the need to involve the content creators at all.
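The article doesn’t describe the underlying model, but a common shape for this kind of separation in the research literature is mask-based: a network predicts, for every time–frequency bin of the mixture’s spectrogram, how much of the energy is speech. A minimal sketch, with `estimate_speech_mask` standing in for such a trained model (a placeholder, not a real API):

```python
import numpy as np
from scipy.signal import stft, istft

def separate_dialogue(mix, estimate_speech_mask, sample_rate=48000):
    """Mask-based dialogue/background separation on a spectrogram (sketch).

    `estimate_speech_mask` is a placeholder for a trained neural network
    that returns, for every time-frequency bin, a value in [0, 1] giving
    the fraction of energy believed to be speech.
    """
    _, _, spec = stft(mix, fs=sample_rate, nperseg=1024)
    mask = estimate_speech_mask(np.abs(spec))             # placeholder model call
    _, dialogue = istft(mask * spec, fs=sample_rate, nperseg=1024)
    _, background = istft((1.0 - mask) * spec, fs=sample_rate, nperseg=1024)
    return dialogue, background
```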

These techniques don’t just work for new content. The processes to separate dialogue can be used on any audio content ever made in the past century – going all the way back to 1927 and the first “talkie”, The Jazz Singer. The best part is that once the dialogue has been separated, it can be processed with minimal consequence to the original artistic intent. What’s more, the amount of processing applied can be controlled by the user, based on their preferences, the noise in their environment or even the nature of the mix.

So where is the technology that actually brings these techniques and processes to life? Well, it’s coming soon. This exciting new tech will, for the first time, enable dialogue to be processed independently from the rest of the soundtrack, thus enabling viewers to focus less on trying to understand dialogue and more on being wowed and entertained by what’s being said.

Stay tuned for more news about this exciting development in dialogue intelligibility and stay up to speed by signing up for the DTS Home Entertainment newsletter here.

 
