by Andrius Palionis, VP of Enterprise Solutions at Oxylabs.io
Web scraping has been used to generate revenue, open up new avenues for competition, and even establish completely new businesses. Yet there is another side of web scraping, arguably no less important, that is less spoken of: its use by watchdogs, investigators, and protectors of democracy.
Such is the nature of any tool. Web scraping is unique in that it lets even relatively small teams collect enormous amounts of public information, enabling them to work with datasets that would otherwise be prohibitively resource-intensive to gather.
What is web scraping?
Web scraping is the process of automated data collection from publicly available sources. Most of these are regular websites we encounter on a daily basis. Developers create scripts that automatically browse through pages and download the data the bots encounter.
Such data is often unstructured and hard to work with, as web pages are designed to look good in browsers, not to feed data analysis. Parsers, tools that turn jumbled data into a semi-structured or structured format, are often employed to make everything intelligible.
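As a rough illustration, here is a minimal sketch of both steps in Python, using the requests and BeautifulSoup libraries. The URL and the page structure (div.product blocks with name and price fields) are hypothetical stand-ins for whatever site is actually being scraped.

```python
# A minimal scrape-then-parse sketch. The URL and the CSS selectors
# below are hypothetical; a real scraper targets a specific site's markup.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"  # hypothetical listing page

# Fetch the raw HTML, identifying the bot via the User-Agent header.
response = requests.get(URL, headers={"User-Agent": "research-bot/1.0"}, timeout=10)
response.raise_for_status()

# Parse the jumbled markup into a navigable tree, then pull out
# the fields of interest as structured records.
soup = BeautifulSoup(response.text, "html.parser")
records = []
for product in soup.select("div.product"):  # assumed page structure
    records.append({
        "name": product.select_one(".name").get_text(strip=True),
        "price": product.select_one(".price").get_text(strip=True),
    })

print(records)  # e.g. [{"name": "Widget", "price": "$9.99"}, ...]
```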
The most interesting part, however, is how that data is used. Businesses often use pricing and product data for intelligence and strategic purposes. But the list of possibilities doesn't end with revenue generation.
NGOs, universities, and many other non-profit organizations use web scraping too. It hasn't yet established itself as a primary data collection method in these fields, but it is getting there, with applications in macroeconomic research and other areas of study.
Now web scraping is breaking through to a new audience: watchdogs. While it has been employed to fight digital crime, it can also help expose the old, offline kind of criminal activity.
Starting with the big fish
The Billion Prices Project originally started as an alternative way to measure inflation. While its method may not be perfect, since it captures only price fluctuations on ecommerce platforms rather than the changing value of the currency, subsequent research has shown that its conclusions track official metrics fairly closely.
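To make the idea concrete, here is a toy sketch of how an inflation estimate might be derived from scraped online prices. The snapshot data and the simple average-of-price-relatives index are illustrative assumptions, not the project's actual methodology.

```python
# A toy price index built from scraped daily price snapshots.
# All data below is made up for illustration.

# prices[date][product] -> observed online price (hypothetical)
prices = {
    "2024-01-01": {"milk": 1.00, "bread": 2.00, "eggs": 3.00},
    "2024-02-01": {"milk": 1.05, "bread": 2.10, "eggs": 3.30},
}

base, current = prices["2024-01-01"], prices["2024-02-01"]
shared = base.keys() & current.keys()  # only products observed in both periods

# Index = mean of per-product price relatives, scaled so the base period = 100.
index = 100 * sum(current[p] / base[p] for p in shared) / len(shared)
print(f"Price index: {index:.1f}")  # ~106.7, i.e. roughly 6.7% inflation
```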
An unintended consequence of measuring inflation this way is the ability to cross-check official statistics when they are meddled with. A few new projects were born out of Billion Prices for exactly that reason, namely ones measuring inflation in Argentina and Venezuela.
Both countries were known to 'creatively interpret' inflation figures before publishing them, creating a mismatch between the reported data and the real-world effects on the population. These were likely attempts by the governments to hold on to power by downplaying the damage their policies were doing to the economy.
Without web scraping, none of this would have been possible. The Billion Prices Project has shown that the technique can serve the noble purpose of maintaining truth and transparency, even when powerful entities try to suppress them.
Smaller fish are worthy as well
But web scraping doesn't have to be reserved for fighting government oppression or other grand battles. Plenty of smaller problems, no less harmful, can be tackled with it as well.
Copy, Paste, Legislate is probably one of the most prominent examples of web scraping used for the common good. Investigative journalists used web scrapers to reveal that special interest groups (i.e., lobbyists) were attempting to push the same laws through all 50 US state legislatures. The investigation went on to win a prize for investigative journalism.
Reuters, on the other hand, used web scraping to uncover an underground market where adopted children were being bought and sold. Tracking tools surfaced scattered ads from those looking to make adoptees "disappear" into a new home. Reuters' reporting eventually led to several convictions for kidnapping.
Web scraping can even help government institutions. For example, the shadow economy is a pressing matter in nearly every country. A significant part of it runs on cash payments, which are often not logged anywhere, leaving plenty of room for money laundering and undeclared income.
While large-scale money laundering likely happens at an institutional level, smaller-scale activity can be found online. Part of the shadow economy moves through classified ad platforms, where web scraping can be used to great effect.
Government institutions tasked with tracking undeclared income can use web scraping to estimate the size of the shadow economy and even identify potential offenders. The nature of classified ads (namely, that the people involved must leave contact information) lends itself well to the process.
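As a hedged illustration of what such an analysis could look like, the sketch below flags contact numbers that appear behind an unusually large number of listings. The ad records and the threshold are entirely hypothetical; a real investigation would apply far more careful criteria.

```python
# A sketch of flagging potentially undeclared commercial activity in
# classified ads. The records and the cutoff of 20 listings are hypothetical.
from collections import Counter

# Assume the ads were already scraped and parsed into records like these.
ads = [
    {"title": "iPhone 13, new in box", "phone": "600-00001"},
    {"title": "iPhone 14, sealed",     "phone": "600-00001"},
    {"title": "Used sofa",             "phone": "600-00002"},
    # ... thousands more records
]

listings_per_seller = Counter(ad["phone"] for ad in ads)

# A private person selling a sofa posts once; a single contact number
# behind dozens of new-in-box electronics listings looks like a business.
SUSPICIOUS_THRESHOLD = 20  # hypothetical cutoff
flagged = [phone for phone, n in listings_per_seller.items()
           if n >= SUSPICIOUS_THRESHOLD]
print(f"{len(flagged)} contact numbers exceed {SUSPICIOUS_THRESHOLD} listings")
```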
Conclusion
These are just a few examples of how web scraping has benefited, or can benefit, the public good. The power of automated data collection is that it enables relatively small groups to gather large sets of publicly available information.
As such, web scraping should not be thought of as something only large corporations do to further their own goals. It opens up completely new possibilities for research, crime detection, and the pursuit of social justice.