Gathering web data at scale is currently at the centre of several legal cases, with ongoing lawsuits against Google, Midjourney, OpenAI, and other tech giants. These legal battles have led people to question the legal status of web scraping and have strengthened misconceptions surrounding this relatively new industry.
According to Oxylabs, this negative attention now risks overshadowing the benefits web scraping can bring to organisations and society at large.
Denas Grybauskas, Head of Legal at Oxylabs, commented: “Many have been quick to pounce on the negativity surrounding web data collection, clouding the good examples of its use. Gathering public web intelligence can benefit many projects, including investigative journalism and scientific research. For example, public data from social media sites and forums has been widely used in different sociology and psychology projects and even helped to predict COVID-19 outbreaks.”
“Web intelligence is used by travel fare aggregators and price comparison sites that help millions of people make better-informed decisions when shopping online. Web scraping is also vital for cybersecurity companies that monitor the activities of cybercriminals. It wouldn’t be an overstatement to say that without web intelligence, a lot of use cases we rely on daily would be impossible. However, as AI technology continues to evolve, consuming an ever-growing amount of public data, raising awareness about ethical web scraping has become especially important.”
To combat illegal data gathering, promote common standards, and share know-how about ethical practices, leading web intelligence organizations formed the Ethical Web Data Collection Initiative. The consortium aims to build trust around web data collection and educate industry players and the general public about its possibilities. Additionally, Oxylabs is spreading its expertise and ethical practices through such pro bono initiatives as Project 4β, which specifically targets universities and NGOs.
“Through 4β, we aim to transfer technological knowledge and support scientific research on big data”, Grybauskas added. “For example, we partnered with The Communications Regulatory Authority of the Republic of Lithuania to battle against child endangerment by deploying web scraping technology and AI-driven recognition tools that can detect harmful digital content units.”
According to Grybauskas, web scraping is a young industry, so it naturally involves legal grey areas and can be tricky to navigate. Because of this complexity, it is often portrayed unfairly, with the many benefits it brings overlooked.
“The most frequent mistake people make when scraping is failing to evaluate the nature of the data they plan to fetch and to adhere to the terms set by the website. Legitimate scrapers focus on collecting public data that is open to everyone. However, even publicly available data can sometimes contain personal information or content subject to copyright laws. It is vital to encourage anyone gathering web data to consult legal practitioners before scraping.
“On the other hand, ongoing legal cases may bring more clarity to different aspects of online data gathering at scale, which would be beneficial not only to data-as-a-service companies and web intelligence providers but also to further AI research and development.”
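One concrete ethical practice implied above is checking a site's published terms before fetching anything. A minimal sketch of that idea, using Python's standard-library `urllib.robotparser` to test URLs against a site's robots.txt rules (the rules and the "oxybot" user-agent string here are illustrative, not from the article):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "oxybot") -> bool:
    # Parse the site's robots.txt rules and check whether the
    # given URL may be fetched by this (hypothetical) user agent.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example rules: everything under /private/ is off-limits to all agents.
rules = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(rules, "https://example.com/products"))   # True
print(allowed_by_robots(rules, "https://example.com/private/x"))  # False
```

In practice a scraper would fetch the live robots.txt with `RobotFileParser.set_url()` and `read()`; robots.txt is only one signal, and it does not replace reviewing a site's terms of service or the legal status of the data itself.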