
How content classification tools can unlock the value of publishers’ data

By Matt Shearer, Director of Product Information, Data Language

Since the advent of the internet, publishers have repeatedly had their business models overturned by the pace of change. Some have adapted to the digital world, but many are struggling to move from a document-centric landscape to a data-led one. They may understand the challenges facing them in a competitive and disruptive world, but many publishers remain unsure how to unlock the value of their data and content.

Many are looking to AI for answers, now that it has evolved from science fiction into practical solutions for challenging problems. One area where AI has already been successfully deployed in publishing is content classification: AI automatically applies metadata – descriptive data that identifies assets and information across your business using common terms – to content.

At its best, metadata enables accurate business intelligence to support decisions on when and what to publish, actively creating new opportunities in targeted advertising and content recommendations for users. A particularly powerful way to structure metadata is to use a knowledge graph. Put simply, a knowledge graph is a data architecture that describes the important things in your business domain (things like people, places, and organisations), and, importantly, how they relate to each other.
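To make that concrete, here is a minimal sketch in Python of a knowledge graph held as plain subject-predicate-object triples. The entities and relationships are invented for illustration; they are not drawn from any real publisher’s data.

```python
# A toy knowledge graph stored as subject-predicate-object triples.
# All names here are illustrative placeholders.
triples = [
    ("Jane Smith", "works_for", "Example Corp"),
    ("Example Corp", "headquartered_in", "London"),
    ("Jane Smith", "born_in", "Manchester"),
]

def facts_about(entity):
    """Return every triple in which the entity appears as subject or object."""
    return [t for t in triples if entity in (t[0], t[2])]

print(facts_about("Jane Smith"))
# [('Jane Smith', 'works_for', 'Example Corp'), ('Jane Smith', 'born_in', 'Manchester')]
```

In production this structure usually lives in a graph database queried with something like SPARQL, but the underlying idea is exactly this: named things connected by named relationships.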

Google started the widespread use of the term “knowledge graph” in 2012, but the structural approach was made popular in publishing circles by the likes of the BBC, from 2010 with BBC Sport through to 2013 with BBC News. These graphs enabled content creators to move away from an “article”-led world (reflecting physical paper publishing or simple taxonomies) into a data-led world where the metadata carries context and meaning.

The BBC, and its machines, were then able to ask more questions of the metadata and let content and information flow around the answers, drawn from multiple parts of the knowledge graph. With a knowledge graph, a single piece of metadata applied to one item of content could be used to connect that content with other groupings and contexts automatically, even several logical steps away from that original tag.

In sport, this approach meant that content could be automatically organised by events, teams, people, scores and outcomes. Metadata specifying only a single athlete was enough to organise news articles about that player into “news about X team” or “news about X competition”. In news, it meant that individual people could be connected with organisations, political events, sectors, stories and more. For example, tagging an article with Chris Whitty can lead it to be associated with the UK Government, the Department of Health and Social Care, public health, vaccinations and more. The value this adds in terms of searchability and SEO is immediately obvious.
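A rough sketch of that “one tag, many groupings” idea, using an invented sports fragment (the player, team and competition names are placeholders, not a real dataset):

```python
from collections import deque

# Illustrative sports fragment of a knowledge graph.
triples = [
    ("Player X", "plays_for", "Team Y"),
    ("Team Y", "competes_in", "Competition Z"),
]

def expand_tags(seed_tag, max_hops=2):
    """Follow graph edges outwards from a single tag to find related groupings."""
    found, frontier = {seed_tag}, deque([(seed_tag, 0)])
    while frontier:
        entity, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for subj, _pred, obj in triples:
            if subj == entity and obj not in found:
                found.add(obj)
                frontier.append((obj, hops + 1))
    return found - {seed_tag}

# An article tagged only with "Player X" can now surface automatically
# under the team page and the competition page.
print(sorted(expand_tags("Player X")))  # ['Competition Z', 'Team Y']
```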

Rolling forward to the 2020s, publishers who want to liberate as much value as possible from their content must look for a flexible, efficient and evolvable way to classify it. A combination of knowledge graph and AI classification is an increasingly popular way of achieving this: the knowledge graph provides flexibility and structure, and AI provides the scale. Forget the old “classify once, publish and forget” – publishers now need to create, classify, connect, distribute, measure and then repurpose multiple times.

How best to move in this direction, without expensive and painful “big bang” projects?

The answer is to start small and focus on core differentiators. One effective and agile approach is to start with a shared map in the form of a “domain model”, which maps a business’s important entities and how they relate to one another. Trust me, even in the same general business area, these domain models are always different – they are particular to each organisation. The domain model is best created in a series of workshops including team members from as many functions across the organisation as possible, and it works by driving a shared understanding of the organisation’s information structure. This enables organisations to prioritise the areas of the domain model that represent their core differentiators and drive value, and it provides a basis for where to start with a content classification initiative.
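As a purely illustrative example, a first slice of a domain model coming out of such a workshop might be captured as simply as this; the entity types and relationship names are hypothetical, and yours will differ:

```python
# A hypothetical slice of a publisher's domain model.
domain_model = {
    "entities": ["Article", "Journalist", "Person", "Organisation", "Topic"],
    "relationships": [
        ("Article", "written_by", "Journalist"),
        ("Article", "mentions", "Person"),
        ("Article", "mentions", "Organisation"),
        ("Article", "about", "Topic"),
        ("Person", "affiliated_with", "Organisation"),
        ("Topic", "broader_than", "Topic"),
    ],
}

# Prioritise the parts of the model that touch the entities the business
# considers its core differentiator.
core_entities = {"Article", "Topic"}
priority = [
    r for r in domain_model["relationships"]
    if r[0] in core_entities or r[2] in core_entities
]
print(priority)
```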

The actual selection of AI and knowledge graph technologies is the stuff of many books and blog posts, but a few painful gotchas to avoid are long setup times, expensive regular training cycles, the need for term mapping from your tags to the AI’s tags, and generic AI classifiers.

My top tip? Look for a lightweight SaaS, API-based route to get you started as quickly as possible; one that is production-ready, that works on your own custom metadata, and that does not require expensive data science resources to support the AI training process.
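To give a feel for what that looks like in practice, here is a hypothetical sketch of calling such a classification API. The endpoint, authentication scheme and payload fields below are assumptions for illustration, not any real product’s interface.

```python
import requests

# Hypothetical SaaS classification endpoint and credentials.
API_URL = "https://classifier.example.com/v1/classify"
API_KEY = "YOUR_API_KEY"

article = {
    "text": "Full article body goes here...",
    # Candidate tags drawn from your own knowledge graph, not a generic taxonomy.
    "candidate_tags": ["public health", "UK Government", "vaccinations"],
}

response = requests.post(
    API_URL,
    json=article,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. a ranked list of suggested tags with confidence scores
```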

It’s clear that the combination of knowledge graph and AI-powered content classification tools is growing in popularity because of the benefits it offers publishers. The lasting challenge for publishers will simply be to resist the knee-jerk temptation to invest huge amounts of capital into building a bespoke system that could have been purchased off-the-shelf.