Meta has emerged from the Metaverse to become a major player on the AI court. As such, the company has its own team of web crawlers that scrape pages for data, unless a site’s Robots.txt file tells them to stay away. Or, at least, we thought as much. According to some new reports, it seems that Meta’s new crawlers aren’t afraid of any robots, as they may be bypassing that protocol.
Major corporations have been using web crawlers to dive into and scrape data from websites across the internet for years. However, people have made their stance clear: they do not want companies scraping their data without consent. Of course, the companies all obey our wishes and steer clear of websites that use the Robots.txt file to keep them out… right?
These are major corporations we’re talking about. Obviously, they’ve found ways of spitting in the faces of the people who trust them. There have been reports of companies like Perplexity, OpenAI, and Anthropic all finding ways to scrape sites that have the Robots.txt file in place.
What is Robots.txt?
In case you don’t know what this file is, Robots.txt is a plain text file that tells web crawlers which parts of a site they’re not allowed to scrape. It’s been in commission since the mid-90s, so it has its roots in the rise of the search engine age. The consensus was that, if you had the file on your site, you’d be safe from web crawlers of all sorts. We’re sure that, over the course of nearly 30 years, some company has come up with some way of getting around it. Maybe it wouldn’t have been front-page news a few years ago, but things have changed since the whole AI boom.
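To give you an idea of how simple it is, here’s a minimal sketch of a Robots.txt file that asks every crawler to stay away from an entire site (the asterisk matches any crawler’s user agent):

User-agent: *
Disallow: /

That’s really all there is to it. The catch is that nothing enforces those two lines; a crawler has to choose to read the file and honor it.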
Now that we know how companies are getting data to fuel their AI models, any company bypassing Robots.txt is looked at with a cold eye. And it should be. There are people out there who simply want to avoid having their data scraped. Knowing that companies are blatantly ignoring their wishes is a huge slap in the face.
Meta’s new web crawlers might ignore the Robots.txt file
If you think that Meta is a perfect angel when it comes to data acquisition, you’d be mistaken. Like the other companies that bypass the file, a new report points to a duo of Meta crawlers that might also sidestep Robots.txt to train its AI models.
As discovered by Originality.AI, Meta launched two new crawlers sometime in July. One is called Meta-ExternalFetcher, and the other is called Meta-ExternalAgent. Meta launched two crawlers because they perform two different functions.
Meta stated that Meta-ExternalAgent is “for use cases such as training AI models or improving products by indexing content directly.” So, it’s pretty standard stuff from the sound of it. It will travel to different websites and scrape the data from them to help train the company’s Llama models.
The second crawler, Meta-ExternalFetcher, doesn’t seem to scrape information from sites directly. Instead, it’s dedicated to fetching web links. We’re not sure exactly what those links will be used for, but the bot appears to mostly serve the Meta AI assistant. On paper, this one sounds like the tamer of the two.
Sneaking past the robot
While neither crawler sounds all that unusual on its own, the pair is notable for a few reasons. Firstly, Meta’s own documentation states that Meta-ExternalFetcher “may bypass robots.txt rules.” So, based on the company’s wording, we can’t outright state that the company is ignoring the file, but it’s fair to assume as much. This is Meta we’re talking about. The company has had its fair share of run-ins with the law over how it gathers user data.
Second, one thing noted by Business Insider is that Meta-ExternalAgent actually serves two purposes. It both scrapes sites for AI training data and indexes them. That’s pretty odd, as most crawlers perform a single task. As odd as it sounds, this could be a tactic to pressure sites into letting Meta’s crawler in.
If you want a search engine to surface your website when someone does a relevant search, then you need that engine to index your site. So, if you want your site to appear when someone searches on a Meta platform, you’ll need to let Meta’s crawler index it.
Launching a crawler that both scrapes and indexes means that, if you want the company to index your site, you seemingly also have to let it scrape your data for AI training. At least, that’s how it appears. If that’s true, then that’s a new low for Meta.
What Meta has to say
A Meta spokesperson responded to the claims being made against the company. They said that the company employs multiple crawlers in order “to make it easier for publishers to indicate their preferences.”
The spokesperson also told Business Insider via email, “Like other companies, we train our generative AI models on content that is publicly available online.” They continued, “We recognize that some publishers and web domain owners want options when it comes to their websites and generative AI.”
Lastly, the spokesperson said that the company launched multiple crawlers to avoid “bundling all use cases under a single agent, providing more flexibility for web publishers.”
This just makes us question why the Meta-ExternalAgent crawler both indexes and scrapes. In any case, if you’re worried about these new crawlers, Meta provided some information on how to avoid them.
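For what it’s worth, the standard way to opt out is still the Robots.txt file. A rough sketch of the relevant lines, assuming the user-agent tokens match the crawler names above (Meta’s own documentation has the exact strings to use), would look something like this:

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Meta-ExternalFetcher
Disallow: /

Just remember that these lines are requests, not locks, and Meta has already said at least one of these crawlers may bypass them.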
We need a new way to stop companies from scraping data
This brings to light a pretty major issue throughout the tech industry. Many site owners were only just hearing about Robots.txt last year, as we all learned how companies acquired their training data. So, they added the file and slept well that night knowing their sites were safe from being scraped. Then the stories started rolling in about companies bypassing it anyway. Is nothing sacred?
The fact of the matter is that we need something new to keep crawlers away from the data on our sites. Robots.txt has been useful, but it’s roughly 30 years old. We shouldn’t be trusting a method that’s been around since before the original iPod. Companies have already found their way around it, to the point where it’s not really useful anymore. If major companies like OpenAI have already dodged it, then it doesn’t serve much more of a purpose than a placebo.
There needs to be something better put in place that actually blocks the crawlers. Not only that, but we need the government’s help to force companies not to bypass it. At this point, since companies can just casually sidestep the file, major corporations are basically on the honor system. That’s a thought to keep you up at night.
Hopefully, we see a new system come along sooner rather than later. That is, if it’s not already too late.