Interviews, insight & analysis on digital media & marketing

Riding the wave of AI audio evolution

By Dr. Timo P Kunz, Co-founder & CEO of AudioStack

According to a recent eMarketer study, by 2024 a fifth of all time spent with digital media in the US will go to digital audio. While growth is not as explosive as it was ten years ago, the audio sector is still expanding, and this growth has been generating increasing interest in audio as a channel, along with growing media share and budgets. Because of the way audio has traditionally been produced and broadcast, these budgets have been rather small compared to those of other channels such as video.

Thanks to digital audio, addressability, the ability to reach specific target audiences, especially with dynamic content, has been solved: audiences can be identified far more precisely than in traditional over-the-air radio broadcasting.

However, that still leaves us with the linear process of creating audio: a script has to be written for a product or service, a speaker then reads the script in a studio, and the recording is mixed and polished by an engineer until it sounds perfect. Enter synthetic media production. Thanks to vast advancements in machine learning and AI in recent years, the process of creating audio can now be accelerated significantly, meaning audio can be created faster than real time (i.e. production takes less time than reading the script aloud would).

AI has also revolutionised the creative process, integrating seamlessly into traditional workflows, e.g. helping with script creation, or combining human voice or music recordings with synthetically generated assets. That is where AI's real strength comes into play: working hand in hand with humans.

On top of that, audio has become modular and dynamic. This means that a speaker, music track or script can be changed at any point in the process, instead of having to go back to the beginning to cast another speaker or record a new music track in the studio. This comes in handy when you want to adjust a Christmas campaign for summer or exchange product offerings on the fly.

The biggest opportunities in AI audio 

For the first time, audio creation can be accelerated to a point where use cases that were prohibitive for over a century are now possible. Examples include real-time audio generation, podcast and news generation based on personal preferences, and dynamic advertising driven by real-time data such as weather, local product stock or music. On top of that, this myriad of possibilities can easily be applied to voiceovers in video applications.

In a nutshell, audio is becoming so addressable, flexible and scalable, that media creators, publishers and advertisers are rethinking how they generate their content.

This is exactly what AudioStack does: it offers different audio production workflows that let enterprises create audio ads from scratch in seconds or build thousands of versions of an audio track programmatically. These workflows can also be combined, allowing for highly customised processes, and can be integrated into any media creation system.

As an example, Australian agency Creative Fix used AudioStack to build real-time news ads for News Corp. In a groundbreaking campaign, headlines and subheadlines taken from articles on news.com.au were used to programmatically build 30-second ads that were seamlessly inserted into matching podcasts across categories such as breaking news, finance, entertainment, lifestyle and tech. These ads had a maximum lifespan of 12 hours, meaning synthetic media was the only feasible way to produce each spot.

The challenges in the market

Over the last five years, voice quality has been the biggest showstopper for synthetic media. Lack of emotion, mistakes in pronunciation and limited dynamic range (how varied words and sentences sound) have put a hard stop on any use case where the information itself wasn't the key element of the production. This is being overcome by ever-better text-to-speech and speech-to-speech technology, as well as growing investment in the industry, creating a flywheel effect. Using AudioStack, large corporations such as McDonald's and Porsche have started to air 100% synthetic audio ads and even hybrid TV commercials.

Another challenge inherent to synthetic media has been the lack of accountability for companies and their users when creating media assets, especially with respect to privacy, identity and copyright. However, the industry has matured, and solid regulations are being put in place by the EU and US governments. This has led AI media companies to be more transparent about how they use training data, or to delete it automatically. Contracts now also include more specific and prohibitive clauses about voice identity and model usage, ensuring a more responsible and ethical use of AI in audio.

AudioStack decided to become SOC 2 compliant and built auditability into its systems in order to create a robust, audited permission management system that protects customers' IP.

The last challenge is a certain lack of deliverability for dynamic audio assets. Because it has never before been possible to create audio content quickly and dynamically, adtech platforms haven't felt the need to develop tooling that would let users automatically create dynamic audio content and serve it to a specific audience at scale. AudioStack actively pursues and develops these technologies together with leading players in the adtech field, such as AdsWizz and Acast. The goal is to make it easy for advertisers to air the right content to the right audience, at the right moment or in the right context, no matter how diverse it might be. Sounds good, doesn't it?