To explore the changing landscape of web scraping and ethical data acquisition, NDA recently sat down with Neil Emeigh, CEO of Rayobyte.
Initiating his venture out of sheer frustration with the quality of available proxies back in 2013, Emeigh transformed his enterprise into the largest US-based proxy provider. Today, he is a recognized authority on proxies and scraping, with a passion for ethical data acquisition.
In the upcoming discussion on “Web Scraping in 2023 and Beyond” at OxyCon, Neil will bring his rich experience and insights, addressing the convergence of technology, ethics, and business in the realm of data scraping.
Can you tell us about your journey from starting a one-man proxy operation to becoming the CEO of Rayobyte?
In the early 2010s, I was running my own web scraping project and became frustrated with the proxy provider options. Not only were the IP addresses provided often of low quality, but many providers sourced their proxies via unethical or illegal methods, so you never knew what you were getting. I figured I could provide a better alternative, so with the help of a couple of team members who are still with us today, I founded Blazing SEO (which has since rebranded as Rayobyte).
What inspired you to focus on ethical data acquisition and usage?
Our company’s foundation has always been deeply rooted in ethics. In the proxy industry, it’s all too common for companies to simply embed consent somewhere deep within their Terms & Conditions and consider their responsibility fulfilled. Regrettably, many residential proxy network participants are unaware that their IP addresses are being utilized, a practice I’ve always found unsettling.
To counter this practice, we go the extra mile to ensure that individuals are informed when their device is being used as a proxy, acquiring their explicit consent. This principle is embodied in our supplier acquisition product, Cash Raven.
Looking ahead, I believe businesses worldwide will become increasingly conscientious, scrutinizing the origins of IPs and their acquisition methods before finalizing agreements with providers.
What are some of the most notable advancements in web scraping technology and methodologies that you’ve observed in recent years?
One of the most notable changes is that websites have become much harder to scrape. When I first launched the company, our primary focus was serving SERP scraping clients. Back then, datacenter proxies were the preferred method; now it’s nearly impossible to scrape Google at scale with them, and doing so has become prohibitively expensive for the average business owner.
To navigate these challenges, we and many of our clients have enhanced our anti-bot evasion measures. This includes improving fingerprint technology, developing advanced browser-based automation, and transitioning from datacenter proxies to more sophisticated residential proxies.
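To make that concrete, here is a minimal, purely illustrative sketch of browser-based automation routed through a residential proxy using Playwright; the proxy endpoint, credentials, and target URL are hypothetical placeholders, not Rayobyte's infrastructure.

```python
# Illustrative sketch (not Rayobyte's actual stack): browser-based automation
# routed through a residential proxy with Playwright. The proxy endpoint,
# credentials, and target URL below are hypothetical placeholders.
from playwright.sync_api import sync_playwright

PROXY = {
    "server": "http://residential.example-proxy.com:8000",  # hypothetical endpoint
    "username": "user",
    "password": "pass",
}

with sync_playwright() as p:
    # A real browser executes JavaScript and presents a realistic fingerprint,
    # which plain HTTP clients cannot do.
    browser = p.chromium.launch(proxy=PROXY, headless=True)
    page = browser.new_page()
    page.goto("https://example.com", timeout=30_000)
    print(page.title())
    browser.close()
```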
With advancements like ChatGPT, integrating AI into web scraping has become more accessible. Even regular developers can now leverage AI in their scraping processes. More frameworks, libraries, and no-code solutions are making it easier than ever for developers (and non-developers!) to scrape, which wasn’t true eight years ago. Now a data scientist with basic Python experience can build a full-fledged, scalable web scraper on their own!
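As a rough illustration of how low that barrier has become, here is a minimal scraper of the kind a Python-literate data scientist could assemble with requests and BeautifulSoup; the proxy URL and target page are hypothetical placeholders.

```python
# Minimal illustrative scraper: fetch pages through a proxy and extract titles.
# The proxy URL and target site are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

PROXIES = {
    "http": "http://user:pass@proxy.example.com:8000",
    "https": "http://user:pass@proxy.example.com:8000",
}

def scrape_titles(urls):
    """Return the <title> text of each URL, fetched through the proxy."""
    titles = []
    for url in urls:
        resp = requests.get(url, proxies=PROXIES, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        titles.append(soup.title.string if soup.title else "")
    return titles

if __name__ == "__main__":
    print(scrape_titles(["https://example.com"]))
```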
What industries or sectors do you believe will benefit the most from web scraping and alternative data usage in the coming years?
Instead of singling out specific sectors, I’d propose a broader perspective: Which industries and businesses wouldn’t gain an advantage from web scraping? Its applications have expanded far beyond just securing limited-edition sneakers or monitoring prices. It’s challenging to identify any sector that wouldn’t derive value from the actionable insights harvested from web-scraped data.
Our main challenge moving forward is crafting an ethical framework. By doing so, we can encourage more companies to confidently tap into the vast potential of alternative data. If we navigate this collectively, the pertinent question shifts from who will benefit to who won’t.
How do you see the relationship between web scraping and artificial intelligence evolving in the future?
At Rayobyte, we handle billions of requests monthly. Given this immense scale, to make sense of this data and to use it to our customers’ advantage, we must tap into automation and AI. Our approach at Rayobyte focuses on evaluating traffic patterns entering our system, with two primary objectives.
The first is Automatic Abuse Prevention. Abuse poses one of the most significant challenges for proxy providers. Mishandled, especially within a residential network, it can wreak considerable havoc. At Rayobyte, we adopt an uncompromisingly ethical and vigilant approach towards usage. We integrate a combination of predefined rules, triggers, and AI to discern abusive traffic.
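As a purely illustrative sketch (not Rayobyte's actual detection logic), a rule-and-threshold screening layer might look something like the following; the rules, thresholds, and field names are invented for the example, with an ML model scoring the traffic on top of it in practice.

```python
# Purely illustrative rule-and-threshold abuse screening. The deny-list,
# thresholds, and field names are invented for this example and are not
# Rayobyte's actual detection logic.
BLOCKED_DOMAINS = {"bank.example.com", "login.example.net"}  # hypothetical deny-list
MAX_REQUESTS_PER_MINUTE = 600
MAX_ERROR_RATE = 0.5

def is_abusive(traffic_sample: dict) -> bool:
    """traffic_sample has 'domain', 'requests_per_minute', and 'error_rate' keys."""
    if traffic_sample["domain"] in BLOCKED_DOMAINS:
        return True   # predefined rule: disallowed target
    if traffic_sample["requests_per_minute"] > MAX_REQUESTS_PER_MINUTE:
        return True   # volume trigger
    if traffic_sample["error_rate"] > MAX_ERROR_RATE:
        return True   # error-rate trigger
    return False      # in practice, an ML model would also score the traffic
```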
Our second objective is Intelligent Routing. Within a rotating pool of proxies, including residential, ISP, and datacenter, the foremost responsibility of a proxy provider is ensuring the proxy’s functionality, meaning that it’s accessible and hasn’t been banned. Drawing from historical performance data of IPs, a provider can discern the ideal one for traffic routing. With billions of requests each month, this vast dataset offers a fertile ground for training ML algorithms. These algorithms, when trained well, can optimize routing based on success rates rather than sticking to conventional static methods like round-robin routing.
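To illustrate the difference, here is a simplified sketch contrasting static round-robin rotation with success-rate-weighted selection; the class names and statistics are invented for the example, and a production router would draw on far richer features (target site, ban history, time of day).

```python
# Simplified sketch: static round-robin rotation versus success-rate-weighted
# proxy selection. Names and numbers are invented for illustration only.
import itertools
import random

class RoundRobinRouter:
    """Conventional static rotation that ignores proxy performance."""
    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)

    def pick(self):
        return next(self._cycle)

class SuccessRateRouter:
    """Pick proxies in proportion to their observed success rate."""
    def __init__(self, proxies):
        # Track (successes, attempts) per proxy; start with an optimistic prior.
        self.stats = {p: [1, 1] for p in proxies}

    def pick(self):
        proxies = list(self.stats)
        weights = [succ / attempts for succ, attempts in self.stats.values()]
        return random.choices(proxies, weights=weights, k=1)[0]

    def record(self, proxy, success: bool):
        # Feed the outcome of each request back into the routing decision.
        self.stats[proxy][0] += int(success)
        self.stats[proxy][1] += 1
```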
From the perspective of a proxy provider, these are our key strategies. However, the intrigue amplifies when one dives into the domain of web scraping. Given the continuous updates websites make to their anti-bot defenses, it becomes inefficient to manually oversee configurations at a large scale. ML’s ability to learn, adapt, and modify configurations in real time allows a scraping firm to remain one step ahead of evolving web dynamics.