By Appu Shaji, CEO, Mobius Labs
How does computer vision for video work? At first glance the process appears almost identical to the one used to analyse static images. Every second of video contains a set number of frames, depending on the frame rate: let's say 24 per second, a very common format. The software simply examines each frame in the same way as it would a photograph.
As the old saying goes, it's a bit more complicated than that. To start with, take the sheer number of frames involved. Simple maths tells us that a five-minute, 24 fps video clip contains 7,200 images. Now imagine a video stock agency with thousands of hours of footage in its archive. The figures are astronomical.
But hang on a moment. Isn’t the whole point of artificial intelligence and computer vision that it can manage huge media volumes in a fraction of the time that it takes a human being, and with fewer errors brought on by fatigue or boredom? True, but as any software engineer will tell you, efficiency is everything and computer vision for video is no exception.
Frame rates and fast searches
So how does the latest technology make the entire process more efficient and useful? The first step is to reduce the number of frames analysed to just a few per second, enough to detect significant changes in the contents of the frame, the positions of objects or facial expressions.
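That sampling step is simple to sketch. The function below is only an illustration of the idea, keeping roughly a target number of frames per second by taking every Nth frame; the names and the 2 fps default are invented for this example, not any particular product's API.

```python
def sample_frame_indices(total_frames: int, source_fps: float,
                         target_fps: float = 2.0) -> list[int]:
    """Pick which frame indices to analyse, keeping roughly
    `target_fps` frames per second instead of every frame."""
    if target_fps >= source_fps:
        return list(range(total_frames))
    stride = round(source_fps / target_fps)  # e.g. 24 fps -> every 12th frame
    return list(range(0, total_frames, stride))

# A five-minute clip at 24 fps: 7,200 frames shrink to 600 samples.
indices = sample_frame_indices(total_frames=7_200, source_fps=24, target_fps=2)
```

Even this crude decimation cuts the workload by an order of magnitude before any recognition model runs.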
A good start, but it doesn’t take into account the natural rhythm of storytelling, which can switch dramatically even in a thirty-second advertisement. As any editor will tell you, video and film are cut to tell a story: establishing shots, medium shots, close-ups and reverse angles for dialogue. During a car chase you might see dozens of cuts in a minute, compared with five or six for a rom-com conversation.
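One common way to respect that rhythm is to detect shot boundaries first, for example by comparing colour histograms of consecutive frames and flagging sharp changes as cuts. The toy version below works on one-dimensional "frames" of pixel intensities; it is a sketch of the general technique, not the method any specific vendor uses.

```python
def histogram(frame: list[int], bins: int = 4, max_val: int = 256) -> list[int]:
    """Crude intensity histogram of a frame (a flat list of pixel values)."""
    h = [0] * bins
    for px in frame:
        h[min(px * bins // max_val, bins - 1)] += 1
    return h

def shot_boundaries(frames: list[list[int]], threshold: float = 0.5) -> list[int]:
    """Indices where the histogram changes sharply, i.e. likely cuts."""
    cuts = []
    for i in range(1, len(frames)):
        a, b = histogram(frames[i - 1]), histogram(frames[i])
        diff = sum(abs(x - y) for x, y in zip(a, b)) / max(sum(a), 1)
        if diff > threshold:
            cuts.append(i)
    return cuts

# Two dark frames, then a hard cut to two bright frames.
frames = [[10, 20, 30]] * 2 + [[200, 210, 220]] * 2
```

A fast-cut car chase produces many such boundaries per minute; a two-hander conversation produces few, and that cut density is itself a useful signal about the kind of scene.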
Take this one step further and it becomes possible to identify what’s going on during a particularly hectic episode. With enough training data, computer vision can tell the difference between a car chase, a superhero combat sequence or a rooftop pursuit.
But things get really interesting when the software can recognise emotions. Computer vision for video takes this feature, already used with static images, to a whole new level. Take that rom-com conversation for example. During this scene the viewer follows a fast-changing set of emotions on the faces of the protagonists: anger, confusion, amusement, affection. The expressions, as well as the language, tell a story that you’ve probably seen hundreds of times!
Now imagine changing the order: amusement, affection, confusion, anger. It’s a different story altogether. In other words, it’s not enough to ask the software for a clip that simply contains these four emotions; it also needs to understand the order and context in which they appear.
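That ordering requirement is easy to illustrate: a clip should only match if the query emotions appear as an ordered subsequence of its detected emotion timeline, not merely as a set. A minimal sketch, with invented tag names for illustration:

```python
def matches_in_order(timeline: list[str], query: list[str]) -> bool:
    """True if `query` appears as an ordered subsequence of `timeline`."""
    remaining = iter(timeline)
    # `e in remaining` consumes the iterator up to the first match,
    # so later query emotions must appear later in the timeline.
    return all(e in remaining for e in query)

romcom = ["anger", "confusion", "amusement", "affection"]

matches_in_order(romcom, ["anger", "confusion", "amusement", "affection"])
# True: same emotions, same order
matches_in_order(romcom, ["amusement", "affection", "confusion", "anger"])
# False: same emotions, wrong order
```

A set-based search would return both queries as matches; only the order-aware check distinguishes the two stories.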
Getting a feel for narrative
Computer vision, at least for now, can’t feel emotions. But, critically, it can recognise the context of a particular expression. More recently, the best software can also interpret tone of voice and recognise language, using all these elements to ‘understand’ the narrative and find similar examples in a library of hundreds of clips.
Let’s return to the example of a video stock library or a marketing agency that has hundreds of hours of short clips, thousands of actors and dozens of brands. Now, imagine a campaign that celebrates the end of the pandemic, with a brief to show the emotions shared by grandparents and grandchildren or long-separated partners. Computer vision now enables you to choose the participants and the sequence of feelings that you want to convey. It won’t find you the perfect clip straight off, but it will narrow down the search dramatically.
You can slice and dice this approach any way you want. Search for a famous actor drinking a particular brand of fizzy drink, as the emotions pass from uncomfortable (thirsty) to happy, to satisfied. Or an athlete crossing the line in first place and then reaching for the same brand with a look of relief on their face.
Managing campaigns, boosting revenues
The latest computer vision technology not only supports emotion recognition out of the box but enables the user to train the software and create bespoke tags, based on a relatively small number of video clips. This enables a stock agency, for example, to introduce greater nuance into its searches and make it even easier for customers to find and purchase the clips they need.
What about the future? For now, the technology is largely reactive, categorising existing content and optimising the search process for customers and employees. But there’s no reason why computer vision shouldn’t aid the creative process. Imagine being able to describe the narrative and emotions, leaving the software to come up with a variety of storyboards that meet the brief. Or automatically generating a rough cut from a set of video rushes that can be further refined by an expert.
As we’ve seen with static images, computer vision still needs a ‘human in the loop’ to train the software and enhance the results. And we are still far from a situation where technology can create content from scratch that gets anywhere near the quality of a human specialist. But it still has a role to play. The computer vision for video story has only just started, but its future looks to be as thrilling as the next superhero blockbuster.