
Q&A: why real-time web data is a new source of competitive intelligence

New Digital Age meets Aleksandras Šulženko, Product Owner at Oxylabs.io…

Aleksandras Šulženko started his career in the web intelligence industry as an account manager at Oxylabs, overseeing the daily data-gathering operations and challenges of some of the world’s biggest data-driven brands. This experience inspired him to shift his career path towards product development in order to shape the most effective services for web intelligence collection. As Product Owner for innovative web data gathering solutions, Aleksandras today continues to contribute to Oxylabs’ mission of helping companies of all sizes reach their full potential by harnessing the power of data.

Gathering real-time public web data for business intelligence has been a recent topic of discussion. It is a new competitive asset for some companies, but little information is available about the use cases of such data. How do businesses employ real-time web intelligence?

Public web data is used by a growing number of companies. For example, the latest research by Oxylabs and Censuswide, which surveyed over 1,000 key decision-makers in financial services companies, found that almost half of them (44%) plan to invest most heavily in web scraping in the coming years. This is no surprise, since a quarter (26%) of respondents said web scraping had the greatest impact on revenue compared with other data-gathering methods.

Financial and ecommerce companies are the frontrunners in competitive web intelligence, but others are catching up, too. The internet offers a plethora of public data that is perfect for mining unique business insights and boosting decision-making and sales. One well-known use case is travel fare aggregation and comparison – services such as Skyscanner couldn’t exist without web scraping technologies, and we wouldn’t be able to catch those perfect flight deals, as it is simply impossible to monitor so many different airlines manually.

Ecommerce companies gather real-time price and competitor intelligence to optimize dynamic pricing and assortment or to monitor the supply chain. You’ve probably noticed that prices on major marketplaces can change several times per day – this is possible only with the help of public competitor intelligence. Financial and investment firms rely on unique insights derived from alternative data to find the most profitable investment opportunities. Marketing agencies gather public web intelligence, such as consumer sentiment data, to understand economic trends or buyer behavior and preferences.

There are many other use cases, including search ranking optimization, cybersecurity, illegal content detection, and anti-counterfeiting. Digitalization of both business and everyday life means that there’s data for almost anything scattered around the internet. It is publicly available to all of us; however, the volumes are so extreme that organizations trying to make sense of web data need state-of-the-art technologies to gather, clean, and process it. This is where companies like Oxylabs come into the picture, offering public web intelligence collection solutions.

Gathering data at such a scale must require enormous resources. How do companies extract web data – in-house or by outsourcing it to third-party vendors?

Some companies, for example, cybersecurity firms that work with sensitive information, prefer to scrape data in-house. However, they need a robust proxy infrastructure to distribute requests and bypass geo-blocks and anti-scraping measures.
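
To give a sense of what “distributing requests” across a proxy infrastructure looks like at its simplest, here is a minimal Python sketch that rotates requests through a small proxy pool. The proxy addresses are placeholders, and a production setup would also handle retries, geo-targeting, and fingerprinting:

```python
# Minimal sketch of distributing requests across a proxy pool.
# The proxy addresses below are placeholders, not real endpoints.
import itertools
import requests

PROXY_POOL = [
    "http://user:pass@proxy1.example:8080",
    "http://user:pass@proxy2.example:8080",
    "http://user:pass@proxy3.example:8080",
]

def fetch_with_rotation(urls):
    """Send each request through the next proxy in the pool."""
    proxies = itertools.cycle(PROXY_POOL)
    results = {}
    for url in urls:
        proxy = next(proxies)
        try:
            resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
            results[url] = resp.status_code
        except requests.RequestException as err:
            results[url] = f"failed: {err}"
    return results

if __name__ == "__main__":
    print(fetch_with_rotation(["https://example.com/page1", "https://example.com/page2"]))
```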

For businesses that need to gather public web data but don’t have the resources to do it in-house, ready-made scraping solutions are the most cost-effective choice. Oxylabs offers Scraper APIs designed for different targets, including search engines and major marketplaces.

They allow gathering web data at scale with minimal coding. Our Scraper APIs guarantee a 100% success rate, delivering data from almost any site as raw HTML or a structured JSON document. Companies that gather web data in-house must overcome various technical difficulties that consume both time and money: managing a proxy infrastructure, running headless browsers, maintaining scraping and parsing pipelines that can break down due to constant changes in web page layout, and generating custom fingerprints to bypass anti-scraping measures. We handle all of these tasks on our end so that our clients receive accurate real-time data and can focus immediately on analysis.
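
As a rough illustration of the request–response pattern such services use (the endpoint, credentials, and payload fields below are hypothetical placeholders, not Oxylabs’ documented API), a scraper-API call might look something like this in Python:

```python
# Illustrative sketch of calling a scraper-API-style service that returns
# either raw HTML or parsed JSON. The endpoint, credentials, and payload
# fields are hypothetical placeholders.
import requests

API_ENDPOINT = "https://scraper-api.example.com/v1/queries"  # placeholder
CREDENTIALS = ("API_USERNAME", "API_PASSWORD")               # placeholder

def scrape(url, parse=True):
    """Ask the service to fetch a page and, optionally, parse it to JSON."""
    payload = {"url": url, "parse": parse}
    resp = requests.post(API_ENDPOINT, json=payload, auth=CREDENTIALS, timeout=60)
    resp.raise_for_status()
    # Assumed response shape: JSON when parsing is requested, raw HTML otherwise.
    return resp.json() if parse else resp.text

if __name__ == "__main__":
    print(scrape("https://www.example.com/product/123"))
```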

For companies that still prefer to use their own crawlers and scrapers but need a solution that helps handle common technical difficulties, we have created an artificial intelligence (AI) and machine learning (ML) powered Web Unblocker. It can be integrated as a simple proxy, bypassing sophisticated anti-bot systems, performing proxy management and JavaScript rendering, generating browser fingerprints, solving CAPTCHAs, and validating responses.
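
“Integrated as a simple proxy” means the scraper keeps sending ordinary HTTP requests while the unblocking service does the heavy lifting behind a single proxy endpoint. A minimal sketch, assuming a placeholder host, port, and credentials rather than any real product configuration:

```python
# Sketch of the "integrate it as a simple proxy" pattern: the scraper sends
# ordinary requests, and the unblocking service behind the proxy endpoint
# handles retries, fingerprints, rendering, etc.
# The host, port and credentials below are placeholders.
import requests

UNBLOCKER_PROXY = "http://USERNAME:PASSWORD@unblocker.example:60000"  # placeholder

def fetch(url):
    """Route the request through the unblocking proxy endpoint."""
    proxies = {"http": UNBLOCKER_PROXY, "https": UNBLOCKER_PROXY}
    resp = requests.get(url, proxies=proxies, timeout=60)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    html = fetch("https://www.example.com/")
    print(html[:500])
```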

What are the main challenges of gathering real-time web data?

Gathering public web data is a challenging process in general. Firstly, to gather any web data, you need to figure out which URLs you want to access. This can be done either by generating URLs (if they follow a certain pattern) or by crawling a site to discover which URLs it contains. Once you have the URLs, you can attempt to fetch their content from the web.
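
A minimal sketch of those first two steps – generating URLs from a known pattern and fetching their raw content – might look like this in Python (the URL pattern is a made-up example, and real crawlers add politeness, retries, and proxies):

```python
# Sketch of URL generation and fetching. The URL pattern is a made-up example.
import requests

def generate_urls(template, pages):
    """Build URLs from a pattern, e.g. paginated category listings."""
    return [template.format(page=page) for page in range(1, pages + 1)]

def fetch_all(urls):
    """Download the raw HTML for each URL (no politeness or retry logic here)."""
    pages = {}
    for url in urls:
        resp = requests.get(url, timeout=10)
        if resp.ok:
            pages[url] = resp.text
    return pages

if __name__ == "__main__":
    urls = generate_urls("https://example.com/category/headphones?page={page}", pages=3)
    html_by_url = fetch_all(urls)
    print(list(html_by_url))
```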

The content will usually be in HTML format, so the next step is to parse the HTML into a simpler data structure, such as JSON or CSV, containing only the data points of interest. In the case of real-time data, complexity adds up as there is no room for error: the system must be up and running at all times.
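
To make the parsing step concrete, here is a small sketch that turns a snippet of HTML into a JSON record containing only the fields of interest; the CSS selectors are hypothetical and would differ for every target page type:

```python
# Sketch of the parsing step: raw HTML in, a small JSON record out.
# The sample HTML and CSS selectors are hypothetical.
import json
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<html><body>
  <h1 class="product-title">Example Headphones X100</h1>
  <span class="price">€89.99</span>
  <div class="rating" data-score="4.6"></div>
</body></html>
"""

def parse_product(html):
    """Extract only the data points of interest from the page."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.select_one(".product-title").get_text(strip=True),
        "price": soup.select_one(".price").get_text(strip=True),
        "rating": float(soup.select_one(".rating")["data-score"]),
    }

if __name__ == "__main__":
    print(json.dumps(parse_product(SAMPLE_HTML), ensure_ascii=False, indent=2))
```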

One of the biggest challenges is gathering accurate data, as inaccurate content comes in many different forms. Some scraping responses might seem legitimate even though they contain CAPTCHAs or, even worse, false information from so-called honeypots. Websites can also track and block scrapers based on fingerprints, which include the IP address, HTTP headers, cookies, JavaScript fingerprint attributes, and other data.

Anti-scraping measures and browser fingerprinting are becoming increasingly sophisticated. To avoid unwanted interruptions, companies have to play with different parameter combinations for different sites, which again increases the complexity of their data gathering solution. Fortunately, assembling fingerprints that bypass a particular anti-scraping solution can be automated and optimized with the help of machine learning, a functionality already included in Oxylabs’ products.
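
At the HTTP level, “playing with different parameter combinations” can be as simple as rotating self-consistent header profiles so requests do not all share one fingerprint; real fingerprinting also covers TLS and JavaScript attributes, which this sketch (with abbreviated, made-up profiles) leaves out:

```python
# Sketch of rotating header profiles so requests don't share one fingerprint.
# The user-agent strings are abbreviated, made-up examples.
import random
import requests

HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",  # abbreviated
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",  # abbreviated
        "Accept-Language": "de-DE,de;q=0.8,en;q=0.5",
    },
]

def fetch_with_profile(url):
    """Pick one coherent header profile per request."""
    headers = random.choice(HEADER_PROFILES)
    return requests.get(url, headers=headers, timeout=10)

if __name__ == "__main__":
    print(fetch_with_profile("https://example.com/").status_code)
```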

By the way, getting blocked by an anti-scraping solution does not mean that web scraping is a bad or illegitimate action. With anti-scraping measures, websites simply try to protect their servers from request overload and from the actions of irresponsible or malicious actors. Distinguishing between those malicious actors and legitimate scrapers would be exceedingly difficult, so administrators often apply a blanket ban to both. Sometimes, data is also locked behind location – many sites show different content in different countries. However, if a company is collecting competitor intelligence, for example product prices, it needs to gather public data from various locations, which would be impossible without an extensive proxy network.
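
As an illustration of location-dependent collection, the same public page can be requested through country-specific proxy exit nodes and the responses compared; the proxy hostnames below are placeholders:

```python
# Sketch of collecting the same public page from several locations by
# routing requests through country-specific proxy endpoints (placeholders).
import requests

COUNTRY_PROXIES = {
    "US": "http://user:pass@us.proxy.example:8080",
    "DE": "http://user:pass@de.proxy.example:8080",
    "JP": "http://user:pass@jp.proxy.example:8080",
}

def fetch_per_country(url):
    """Return the page body as seen from each country's exit node."""
    results = {}
    for country, proxy in COUNTRY_PROXIES.items():
        resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
        results[country] = resp.text
    return results

if __name__ == "__main__":
    pages = fetch_per_country("https://example.com/product/123")
    for country, body in pages.items():
        print(country, len(body))
```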

When parsing data, the main challenge is adapting to the constant layout changes of the web pages. This requires constant maintenance of parsers – a task that is not particularly difficult but highly time-consuming, especially if the company is scraping many different page types.
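
One common way to soften that maintenance burden is to give the parser fallback selectors per field, so a redesign degrades the output instead of silently breaking it; a minimal sketch with hypothetical selectors:

```python
# Sketch of fallback selectors: try several known selectors per field so the
# parser degrades gracefully when a page layout changes. Selectors are hypothetical.
from bs4 import BeautifulSoup

PRICE_SELECTORS = [".price-current", ".price", "span[itemprop='price']"]

def extract_price(html):
    """Return the first selector that matches, or None so monitoring can flag it."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node:
            return node.get_text(strip=True)
    return None  # every known selector failed -> the parser needs maintenance

if __name__ == "__main__":
    print(extract_price('<span class="price">€19.99</span>'))
```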

Yet another interesting challenge when gathering public data from ecommerce marketplaces is product mapping. Imagine a company that needs to gather prices and reviews of five different models of Samsung headphones. In different online marketplaces, such products can be listed in different departments and subcategories or have slightly different product names. This makes it difficult to track the same product across multiple ecommerce sites, even with the use of scraping.
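
A naive first pass at product mapping is to normalise listing titles and match them by string similarity; real systems combine this with identifiers such as GTINs, structured attributes, and ML-based matching. A small sketch with made-up listings:

```python
# Sketch of naive product mapping: normalise titles from two marketplaces
# and pair them by string similarity. Listings and threshold are illustrative.
from difflib import SequenceMatcher

def normalise(title):
    """Lowercase and strip simple punctuation so titles are comparable."""
    return " ".join(title.lower().replace("-", " ").replace(",", " ").split())

def map_products(listings_a, listings_b, threshold=0.6):
    """Pair each listing from marketplace A with its best match in B."""
    mapping = {}
    for a in listings_a:
        best, score = None, 0.0
        for b in listings_b:
            ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
            if ratio > score:
                best, score = b, ratio
        if score >= threshold:
            mapping[a] = best
    return mapping

if __name__ == "__main__":
    a = ["Samsung Galaxy Buds2 Pro, Graphite"]
    b = ["SAMSUNG Galaxy Buds 2 Pro Wireless Earbuds - Graphite", "Sony WH-1000XM5"]
    print(map_products(a, b))
```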

Are there any use cases for employing alternative data beyond the business sector?

Even among businesses, public web intelligence collection has started to gain traction only recently. NGOs, the public sector, and academia are still lagging behind, but the interest in public web data is growing there, too. There are ‘avant-garde’ players, such as the Bank of Japan, that do interesting social and economic research based on alternative data analysis.

Academics in fields such as psychology have also started to uncover the benefits of web data, scraping public comments and forums for aggregated data to analyze human behavior. Nonprofit organizations often have really interesting research topics that lend themselves to employing web scraping technology for the common good. For example, Oxylabs has been working on a pro bono initiative with the Communications Regulatory Authority in Lithuania to create an AI-powered tool for fighting illegal content online (mainly related to child sexual abuse).

We see a huge untapped potential in such use cases, but they need more support to raise visibility and awareness. We have launched a free-of-charge initiative called “Project 4β”, the aim of which is to transfer technical expertise and grant universities and NGOs free access to web intelligence collection tools.

In your opinion, what will drive the web intelligence industry forward in the upcoming years?

Without a doubt, ML and AI technologies. They allow automating recurring web scraping patterns, reducing developer workload and the risk of human error. Oxylabs’ Web Unblocker, which I mentioned earlier, is largely built on ML algorithms that help perform complicated tasks such as proxy management, dynamic fingerprinting, and response recognition.

Interestingly, web scraping is also one of the main drivers behind AI and ML developments. ML requires massive amounts of training data to improve algorithmic predictions and accuracy, and buying ready-made datasets from third-party providers is often not enough for modern ML technology. This is where publicly available web data comes in. The two fields therefore positively reinforce each other.