Bridging the Gaps in Big Data and AI Industries

Interview with Vidas Bacevičius, Solutions Engineer at Oxylabs

Software developers, AI innovators, and key decision-makers are flocking to Berlin from across the globe. The city hosts the 10th installment of WeAreDevelopers World Congress, where the software industry’s brightest meet to network, discuss, and share practical tips.

Among them, Vidas Bacevičius will delve into how web scraping and AI form an increasingly symbiotic relationship, with the two technologies supporting each other’s development. Today, Vidas shares his thoughts on these two industries and how bridging gaps is the key feature of his career at Oxylabs, where he works to adapt technical innovation for real-life business applications.

Vidas, your presentation is entitled “Scrape, Train, Predict: The Lifecycle of Data for AI Applications.” Can you tell us what hides behind it?

Yes. Behind it is the relationship between web scraping and AI, and how they work hand-in-hand to improve each other. My goal is to share practical knowledge on how AI can be utilised to enhance web scraping results and, at the same time, how web scraping improves AI applications. In other words, it’s about the challenges each faces and how they are answered.

What are these challenges?

In the presentation, I explore two primary issues that companies and developers encounter when collecting public web data. The first one is blocking. Web scraping, in basic terms, is the automated extraction of public data from webpages. To do that, you use web scrapers, software tools made specifically for this purpose. Websites employ various anti-bot measures to prevent all non-organic activity, which means that these scrapers can also be blocked, even though they aren’t doing anything illegal or harmful. Therefore, we need to make our scraper’s traffic look like organic user activity to avoid being blocked. The second challenge is bad content, that is, unstructured output that needs to be parsed and structured to become usable.
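To make the second challenge concrete, here is a minimal sketch of turning raw HTML into structured records using only Python's standard library. The markup, the class names "title" and "price", and the helper names are hypothetical and chosen purely for illustration; they do not describe any Oxylabs tooling or any real website.

```python
from html.parser import HTMLParser

# Hypothetical scraped markup; real pages are usually far messier.
SAMPLE_HTML = """
<div class="item"><h2 class="title">Blue Kettle</h2><span class="price">19.99</span></div>
<div class="item"><h2 class="title">Red Toaster</h2><span class="price">34.50</span></div>
"""


class ProductParser(HTMLParser):
    """Collects text from elements whose class names a field we want."""

    FIELDS = {"title", "price"}  # assumed class names, not from the interview

    def __init__(self):
        super().__init__()
        self.records = []    # finished records
        self._current = {}   # record being built
        self._field = None   # field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.FIELDS:
                self._field = name

    def handle_data(self, data):
        if self._field and data.strip():
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if self._field is None:
            return
        self._field = None
        # Emit the record once every expected field has been seen.
        if self.FIELDS <= self._current.keys():
            self.records.append(self._current)
            self._current = {}


def parse_products(html):
    """Return a list of {'title': ..., 'price': ...} dicts found in html."""
    parser = ProductParser()
    parser.feed(html)
    return parser.records


if __name__ == "__main__":
    for record in parse_products(SAMPLE_HTML):
        print(record)
```

A hand-written parser like this breaks as soon as a site changes its layout, which is one reason parsing is a candidate for the AI-assisted approaches discussed later in the interview.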

Web scraping is a relatively young industry, now coming into the spotlight due to its connection with the AI boom. What brought you to one of this industry’s flagship companies?

While obtaining my degree in computer science, I learned about the tech world and became a full-stack developer. I first came to Oxylabs as a system administrator. However, within a year and a half, I realised that I had a knack for explaining technology to non-tech people, and so I transitioned to a solutions engineer role. Now, my job involves bridging the gap between technical teams and client-facing teams, such as sales and account management, and helping create the custom solutions that my clients need.

Web scraping itself is fascinating because every case is like a puzzle. It’s always very interesting to talk with developers about solving these puzzles, from relatively simple challenges with particular HTTP requests to using advanced AI solutions, such as machine learning techniques, to overcome the various obstacles that scrapers encounter.

How does AI help to overcome the challenges you talk about in your presentation?

I’ll give one example from my presentation. Sometimes websites return an HTTP status code 200, which means that everything is alright with the response, even though your scraper was actually blocked. In this case, instead of checking everything manually, we can train a model to check the response for signs of blocked content.
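As a rough illustration of that idea, the sketch below trains a tiny naive Bayes text classifier to flag response bodies that look like block pages even though the status code was 200. This is an assumption about how such a check could be built, not the model described in the presentation; the class name, the training snippets, and the labels are all invented for the example, and a real model would need far more data.

```python
import math
import re
from collections import Counter

# Made-up training snippets: (response body text, label).
TRAINING = [
    ("Access denied. Please verify you are a human.", "blocked"),
    ("Unusual traffic detected. Complete the captcha to continue.", "blocked"),
    ("Please enable JavaScript and cookies to continue. Checking your browser.", "blocked"),
    ("Blue Kettle 19.99 in stock. Add to cart. Free delivery.", "ok"),
    ("Red Toaster 34.50 customer reviews and product description.", "ok"),
    ("Latest news and articles about cooking and kitchen products.", "ok"),
]


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


class BlockPageClassifier:
    """Multinomial naive Bayes with add-one smoothing over two labels."""

    LABELS = ("blocked", "ok")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, examples):
        for text, label in examples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))

    def _log_score(self, tokens, label):
        counts = self.word_counts[label]
        vocab = set().union(*self.word_counts.values())
        total = sum(counts.values())
        # Log prior plus smoothed log likelihood of each token.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokens:
            score += math.log((counts[tok] + 1) / (total + len(vocab)))
        return score

    def is_blocked(self, body):
        tokens = tokenize(body)
        return self._log_score(tokens, "blocked") > self._log_score(tokens, "ok")


classifier = BlockPageClassifier()
classifier.train(TRAINING)

if __name__ == "__main__":
    # The body of a response that came back with status 200.
    print(classifier.is_blocked("Please verify you are a human to continue"))
```

In a scraper, a check like this would run on every response body with status 200, so that soft blocks can trigger a retry instead of being stored as valid data.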

AI tools also help with data parsing and various techniques for circumventing anti-bot measures, such as mimicking organic mouse movements when scraping.

Would you say that the usage of AI in web scraping is growing?

Definitely yes. We continue to experiment with AI in various web scraping tasks, finding new ways to adapt it to our specific purposes. The better AI gets, the more we experiment and innovate with it, so this trend will most likely continue far into the future.

What about the other side of the coin – how does web scraping support AI?

Web scraping is one of the main ways for AI developers to get training data. More importantly, it allows you to get specifically the data that you need. Most large language models (LLMs) are trained on the same data sources, including historical data. This makes it harder to differentiate yourself from the competitors.

With web scraping, you can choose your sources based on the data you need and unblock the websites from which you want to take it. This gives a competitive edge, as you can develop unique AI models using the public data you seek out and collect from online sources. Earlier, I mentioned bridging the gap between tech and non-tech teams. Similarly, this is how we bridge the gap between AI models and the big data that they need. Web scraping enables data control, which means that rather than relying on what is available, you can make available what is needed to bring your vision to life.

Thank you, Vidas. And those who are visiting WeAreDevelopers World Congress, stop by Stage 8 at 3 pm on Friday, July 11th, to discuss the details of how these two industries, AI and web scraping, advance each other.