Resisting the crawl

24 minute read

In recent months we have seen many legislative attacks on our identities (by the UK Supreme Court, Donald Trump’s executive orders, Hungary’s constitutional amendment), on freedom of speech, and on the freedom to gather and demonstrate (the arrest of Mahmoud Khalil, the Harvard funding freeze for failing to “limit activism on campus”). These are just a few examples off the top of my head; I could not possibly be exhaustive about all the explicit violence spurred and carried out. This terrifyingly comprehensive, world-spanning attack on gender, immigrants, journalists, organisers, and activists by neo-liberal, corpo-bureaucratic institutions, politicians, and governments may very well be only the intro to a period of catastrophe. If you have found this piece, I doubt you haven’t experienced the choking, static panic of hypernormalisation: the feeling of political and societal dissociation from reality as a constant stream of news and live social-media feeds on tragedy, cruelty, hate, and lately unabashed state violence unfolds without much objection or response from power and leaders. Maybe complicit silence and shuffling. Crawling up from between your shoulder blades to encircle your throat: the rage of witnessing atrocity without agency.

Well, now you can experience this on the internet too. Yay! Late-stage capitalism clings to its current fad of techno-societal innovation: AI. An interface to software based on natural language, a statistical tool to model any function, said to be intelligence. It extends the Silicon Valley model of supply capture into outright theft, and of building user reliance into direct manipulation. AI is forced into use via intrusive plugins, assistants, text completion, and summaries. It is capturing and internalising people’s creativity to further accelerate the unprecedented disconnection between people we all experience. And, importantly, to make its creators and financiers a lot of money. And people do not like it.

XKCD 1838 Machine Learning

In recent months a more technical attack has been unfolding against open-source creators in the form of aggressive AI scraper bots. This article aims to tease apart the logic of AI scraping, contextualise it within the political climate, and analyse the tactics of resistance springing up against it. So what are these spiders, ro/bots, creepy crawlies? What do they do? How are they used by AI companies? How can someone trap them, squash them, and keep their intellectual property safe? And how does this fit in with the other actions of the ailing capitalist machine?

Crawlers

(Web) crawlers/spiders/ants/bots are a staple of the internet: scripts that access websites and follow links to find all that is out there on the web. I will use robots and bots from now on for simplicity. The data bots collect enables search and navigation of the web by indexing its text. This way we don’t have to keep manual records of every useful website online, but can build search engines like ALIWEB and later Google and DuckDuckGo. But almost as soon as the first crawlers were created, this programmatic use of the web became a problem. Robots clogged up server bandwidth with requests for documents, got stuck on densely interlinked pages, and ended up slowing down connections for human users. So the creators of the first robots and web pages came together to fix the problem, creating robots.txt in an email thread of mostly academics, robot developers, and webmasters.

This communal agreement worked! Website owners simply list in a plain .txt file what robots may access; all else is off limits. This gave people who ran their own sites control over the use of their work and its visibility. Search engine developers, in turn, designed their robots to respect the standard, so they work more efficiently and gather useful data. Most crucially for everyone, the limited infrastructure of the early internet remained functional. Much like other standards behind our digital existence, the mutual incentive to cooperate and the elegant simplicity of the standard made it part of the backbone of the internet for the past three decades.
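For illustration, a minimal robots.txt (the bot name and paths here are made-up examples):

```text
# https://example.com/robots.txt
User-agent: *        # rules for every robot
Disallow: /private/  # do not crawl this path

User-agent: BadBot   # rules for one specific robot
Disallow: /          # banned from the whole site
```

Rules are grouped per user agent, matched by prefix; a crawler reads the file once, then simply skips any URL a Disallow line covers.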

But robots.txt is not official, nor systematised as a legal framework or security requirement. This can actually be viewed as a benefit, since there are many reasons to crawl the web. One important use is archiving knowledge for the future, championed by the Internet Archive. With the mission of providing Universal Access to All Knowledge, they have been downloading the internet since 1996. In 2017 they publicly stopped honouring robots.txt, first for U.S. government and military websites, then for all sites. In their announcement they cited continuity reasons: changes to robots.txt affected both the recorded history of sites and their inclusion in the archive, while their goal is to preserve the internet as a user would have seen it, unaffected by robots.txt restrictions. This was easy for them to do and to negotiate with all concerned precisely because there is no oversight on scraping. Researchers also frequently use robots to collect data comprehensively for studies, and the malleability of robots.txt is what allows for this. Common Crawl aims to give researchers and creators access to the kind of open data tech giants enjoy.

Why the standard broke

But then came AI, fundamentally changing the interest in and use of publicly available information. When people complain about AI theft, bots are the instrument of that theft. But why steal? Why lobby aggressively for “fair use” to train on copyrighted data? AI companies by and large own, or hold a substantial stake in, the internet’s architecture. OpenAI has a complex relationship with Microsoft; Amazon and Google both own shares in Anthropic. One of the most embedded players, Google, has been indexing websites for decades, while Facebook has been capturing social media content en masse through multiple channels. Surely they could rely on their vast datasets to train AI. Further, these companies have recently signed licensing agreements with publishers like Axel Springer, Reuters, and TIME, in controversial deals much contested by authors. Yet the hunger of AI seems endless. And this is how Meta got caught torrenting books, like college students, for AI training through AWS instances, and is being sued over it.

And there are some very good reasons for AI developers and companies to argue they need fair use and access to more and more data. The main one is technical in nature, a result of information theory. Neural scaling laws describe how good a model is, measured by performance metrics like accuracy and precision, as a function of inputs such as model size, training time, dataset size, and compute cost. Data is valuable because it contains more information on whatever process we are modelling, so the more the better. Researchers from Google’s DeepMind found that when the size of the model doubles, the number of tokens in the training text must also double. Extrapolating current trends, another paper estimated that datasets the size of all publicly available human text will be used up for Large Language Model training sometime between 2026 and 2032.
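The DeepMind result (the “Chinchilla” scaling law of Hoffmann et al.) comes from fitting a parametric loss of roughly this form, where $N$ is the parameter count and $D$ the number of training tokens:

```latex
% Chinchilla-style parametric loss fit
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \alpha \approx 0.34,\ \beta \approx 0.28
```

Minimising $L$ under a fixed compute budget $C \approx 6ND$ with $\alpha \approx \beta$ gives compute-optimal $N$ and $D$ that grow in equal proportion: double the model, double the tokens.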

Models also need more data for the sake of diversity and quality; you can’t just train on 1 million copies of the Bee Movie script. Diversity allows the model to generalise better to external data in the wild, live in production, in the hands of users. It also helps to apply the model to a variety of tasks: essays, assignments, financial reports, poems, rejection emails, and so on and so forth.

By far the best argument for fair use, however, may be bias reduction, as bias-reduction and transparency standards move towards methods that require sensitive attributes rather than excluding controversial features. The reasoning goes: if you want models to reflect a variety of viewpoints, including those built on years of research and work as well as the aggressively pushed conspiracies and slop of conservative think tanks, we need your data. This shifts responsibility for the model’s output from its creators to the creators of the data, and strips the latter of their agency to direct the use of their work, since there is no guarantee of inclusion, nor against later fine-tuning that excludes their work.

The most laughable argument is the one centring national security, positing that the lack of such considerations in non-US states gives them an edge in development. Yet the biggest innovation outside the US, DeepSeek, has been technological in nature, and non-US companies have ample experience collecting data via dubious methods. Once again this sidesteps the main question of knowledge creation and reveals the anxieties of global politicking in its place. If anything, generative AI sabotages the domestic innovation taking place. The claim that giving everyone more access to fundamental data would increase competition can be dissected along similar lines: innovation and competitive advantage have so far come from compute and technological improvements. Moreover, this bustling startup space is highly unlikely to materialise anyhow, as it is the established AI companies that have the most information and infrastructure for taking advantage of training on copyrighted works.

The methods of the heist

The economics

Machine Learning models of the current paradigm are complex, statistically optimised functions, mapping from any input to any output. They are artefacts of the data they encode, reflecting and processing the given information through high-dimensional transforms. In their current form, AI models are unitary data representations: indivisible, steerable, and certainly more controllable than the free and complex creators behind their training data. As data representations they reproduce their inputs, including copyrighted works, learn the inherent randomness in the data, pick up spurious correlations, and hallucinate. But they do provide access to a wide variety of data, and can be refined to present information fairly reliably, given just one more billion dollars, pinky promise, say the AI bros.

Large tech companies now have years of experience with the two-sided platform model of operation: capture the supply of a service, be it retail goods, web pages, or taxis, and use technology, vast sums of venture capital, and state assistance to vertically integrate supply chains in an effort to offer something consumers cannot refuse. Some of these companies have even turned a profit using their monopoly status and are now facing antitrust lawsuits, a badge of honour in big tech. AI is a continuation of this process. Its creators are trying to capture existing knowledge, aggressively ignoring existing regulations and practices; integrating and optimising production through the technology of machine learning; and aggressively pushing the result onto consumers, what we call enshittification, to enshrine yet another monopoly position. In this the AI industry is no different from the tech giants whose steps it follows, Google, Amazon, and Uber, and the various associated small start-up ventures to be bought up, or to go bust when the bubble bursts.

But the AI bros are ambitious, trying to capture a market, writing, art, and creativity, that is much more human than commerce, transportation, or food delivery. Food preparation and transport are, of course, great human institutions of sharing, attention, and care, which we all partake in by necessity; the extensive capital involvement and economies of scale certainly aided the capture of those spaces by tech giants. Great art and creativity, quite the opposite, are often born of need, desperation, and lack, in the face of forgetting, time, and adversity. Try as it might, AI will never match up to the pen, paper, spray cans, love, and anger we already possess; it doesn’t grasp the basis.

XKCD 1838 Machine Learning

The attack on open-source

But in contrast with search engines, AI internalises knowledge production for corporate goals in an ever more blatant capture of the knowledge economy. Google has over the years slowly enclosed its search results with paid listings, places, FAQs, its own sites, and now AI. What they call zero-click search keeps clicks on owned or affiliated sites, reducing traffic to organic, unpaid results. AI now sits at the top of this push, providing an answer before all other results or quoted FAQs. It restricts interaction with original content in its original setting, stripping away the hyperlinks and contextual information of the original site. This not only breaks people’s ability to explore and learn more, it drives reliance on the product, on Google.

And the drop in engagement has a price: it has been observed to decrease participation on Stack Overflow and other forums for co-creation frequently used for training. Now your content can be accessed, mixed together with a bunch of statistically related stuff, through a tool completely outside of your control. This reduces direct revenue to creators, whether through merch, ad sales, or access. No wonder people call it theft.

But now the creators most open with their work, in an effort to increase engagement, gain trust, and ensure the use and utility of their work, face yet another attack from AI companies, as multiple open-source sites complain of relentless crawling by robots collecting AI training data.

Wikipedia sounded the alarm after an incident prompted by Jimmy Carter’s death, when crawlers hammered the included 1.5-hour video of his presidential debate against actor Ronald Reagan. In the article, the Wikimedia Foundation behind Wikipedia states that 65% of their most expensive traffic comes from robots. They reason that while only 35% of total pageviews come from robots, bots access pages in bulk and explore the site tree deeper than human web browsers, so their requests are more likely to be forwarded to the core datacenter. As they put it: “Our content is free, our infrastructure is not: We need to act now to re-establish a healthy balance, so we can dedicate our engineering resources to supporting and prioritizing the Wikimedia projects, our contributors and human access to knowledge.”

Diaspora, a distributed social network, also experienced a large wave of traffic from AI crawlers, and saw the robots evade rate limits by switching to residential IPs and to non-bot UA strings. “This is literally a DDoS on the entire internet,” writes maintainer Dennis Schubert. One sysadmin resorted to blocking .br, that is all of Brazil, entirely legitimate users and all, from pagure.io, an open-source code hosting system. The bots then moved on to Koji, the software used to test and build Fedora packages, causing an outage and build failures. GNOME’s GitLab traffic was dominated by robots, more than 95% of all traffic as measured by a proof-of-work implementation. KDE’s GitLab had similar issues, resulting in an outage mid-March due to scrapers from Alibaba IP ranges.

These open-source projects are often instrumental to fundamental infrastructure; they are free and therefore accessible and widespread. They are community projects with limited resources, maintained through participation rather than commercial steam. Because they survive by sharing their resources openly, these spaces are at the front of the clash between AI and creators and creatives. So what is to happen to free software, open source, creation for the sake of sharing?

Resisting the crawl

So what can we do? Luckily, AI is not HAL or Skynet; it will not “wake up”, its LEDs turning red. Plus, it runs on multiple redundant, sterile cloud servers with 24/7 security. With bats, and bombs, and buckets of water we won’t get far.

Two distinct community responses have developed to the apparent crisis: more accessible and precise preference-signalling schemes, giving authors rather than just domain owners control over their works; and a host of technologies revamping old ideas from fighting email spam and catching worms. These methods offer forms of communication, refusal, and active resistance through cost increases, forming the diversity of tactics essential for effective social contestation of digitality. The next two sections discuss practical resources and proposals to block AI crawlers and the robots used to train AI. Take what you can and add what you can.

Revamping Preference Signalling

To give authors agency and control over the use, purpose, and goals of their works, a variety of extensions to the robots.txt standard have been proposed (this section’s main resource). These establish new standards and multiple layers of protection, including agent-specific implementations, HTTP headers, HTML meta tags, and XML sitemaps, as well as displayed legal notices, and they extend the Disallow/Allow directives to specify preferences. Another proposal, from a Google engineer, suggests content-layer control, with specific rules per piece of content for AI training. Yet another proposes adding new properties to robots.txt to express even more granular control over the use of content. Further proposals, such as those by Guardian News & Media and Spawning, would establish additional standards specifically for AI, in an effort to differentiate between training, RAG, and search.
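In practice, much of today’s opt-out signalling reuses the plain robots.txt mechanism with AI-specific user-agent tokens. The tokens below are ones published by the respective companies; the meta tag is one of the proposed extensions, honoured only by some crawlers:

```text
# robots.txt: opt out of AI training crawlers
User-agent: GPTBot            # OpenAI's training crawler
Disallow: /

User-agent: Google-Extended   # opts out of Gemini training, not Search
Disallow: /

User-agent: CCBot             # Common Crawl, widely used for training sets
Disallow: /
```

```html
<!-- proposed page-level signal; support varies by crawler -->
<meta name="robots" content="noai, noimageai">
```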

A different approach is to embed the information in the content itself, as XMP metadata in images and video. One GitHub project for Text and Data Mining (TDM) in Europe proposes the use of structured HTTP headers and consistently expands signalling to APIs and cloud services. The TDM·AI proposal offers yet another protocol for binding metadata to media assets.
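Under the TDMRep draft, for example, a server could express the EU text-and-data-mining opt-out in plain HTTP response headers (header names per the TDMRep proposal; the policy URL is a placeholder):

```text
HTTP/1.1 200 OK
Content-Type: text/html
tdm-reservation: 1
tdm-policy: https://example.com/tdm-policy.json
```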

There are, though, several issues with preference-signalling standards. First, they rely on AI companies complying with the wishes of authors; second, on adoption by users. A recent row on Bluesky illustrates this nicely. When they too announced a proposed new standard for preference signalling, users en masse angrily replied that they wanted to opt out, and that the default setting should disallow AI training. This despite the fact that Bluesky, built on an open protocol, is already accessible for scraping and use in AI training. Some users even confused the proposal with a change in Bluesky’s stance on AI training. People feel angry and disempowered in the face of AI and the wild complexities of how their data is used and processed. Narrowing down the various proposed standards and informing creators on the practicalities of signal monitoring is crucial work still to be done.

Enforcement from the bottom-up

But where there are problems, need, anger, and frustration, there will soon be innovation. In recent months old and new techniques for blocking access have resurfaced to limit AI companies’ access to web resources.

Proof of work and community efforts

One disgruntled developer gave up, moved their Gitea server behind a VPN, and got to work. Xe Iaso soon released Anubis, which weighs the soul of incoming HTTP requests using proof-of-work. Using simple puzzles, Anubis relies on the compute resources most browsers have but stripped-down bots do not. GNOME sysadmin Bart Piotrowski implemented Anubis and saw that only 3.2% of requests passed the challenge, suggesting most traffic was automated. Many more people were inspired: Anubis quickly accumulated 6.5k stars and 62 contributors on GitHub.
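Anubis’s actual scheme is more involved, but the core proof-of-work idea can be sketched in a few lines (a hypothetical illustration, not Anubis’s code): the server hands out a seed and a difficulty, the client burns CPU finding a matching nonce, and the server verifies the answer with a single hash.

```python
import hashlib
import itertools

def make_challenge(seed: str, difficulty: int) -> dict:
    # Server side: hand out a per-visitor seed and a difficulty, i.e.
    # the number of leading zero hex digits the solution hash must have.
    return {"seed": seed, "difficulty": difficulty}

def solve(challenge: dict) -> int:
    # Client side: brute-force a nonce. Cheap for one human page view,
    # ruinously expensive for a bot fetching millions of pages.
    target = "0" * challenge["difficulty"]
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge['seed']}:{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(challenge: dict, nonce: int) -> bool:
    # Server side again: checking a submitted nonce costs one hash.
    digest = hashlib.sha256(f"{challenge['seed']}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * challenge["difficulty"])

challenge = make_challenge("random-per-visitor-seed", 4)
nonce = solve(challenge)         # the visitor's browser does this work
assert verify(challenge, nonce)  # the server checks it in microseconds
```

The asymmetry is the whole point: solving scales exponentially with difficulty, verifying stays constant, so the cost lands on whoever makes the requests.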

A different approach is Spawning’s Kudurru, a collaborative effort to identify and block crawlers en masse by monitoring popular AI datasets. Similar open lists of AI web scrapers enable a common response to the unfolding situation.
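Such blocklists are typically wired straight into the web server; a minimal nginx sketch (the user-agent tokens are published crawler names, but consult a maintained list for current ones):

```nginx
# inside a server {} block: refuse known AI crawler user agents
if ($http_user_agent ~* "(GPTBot|ClaudeBot|CCBot|Bytespider|Amazonbot)") {
    return 403;
}
```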

Into the pits

Content farms like eHow aim to generate clicks by publishing an endless stream of articles, up to millions of pieces per month. They are a great resource for text if you are an AI researcher. But content farms have other uses too: for instance, catching naughty robots. Anti-spam researcher John Levine owns the internet’s lamest content farm for this reason. It consists of 6,859,000,000 web sites, each a single page holding 9 links, under randomly generated names, to further sites on the farm. Intended to waste spammers’ time, last April it snared OpenAI’s GPTBot, which got stuck in the pit of links, following Amazon’s robot a couple of months earlier.

This spam-fighting technique may have been the inspiration for Aaron, who was upset to the point of creation after Facebook’s robot exceeded 30 million hits on his site. In January he released Nepenthes; named after the carnivorous pitcher plant, it is designed to devour all errant robots that happen to fall inside it. Nepenthes randomly generates an endless number of pages with links back into the pit, filled with Markov-babble content that statistically resembles sensible text but is utter garbage. Aaron describes Nepenthes as deliberately malicious, aggressive malware, intended to cause harm and increase AI scraping costs, stating in the Ars Technica piece on the topic:

“Ultimately, it’s like the Internet that I grew up on and loved is long gone,” Aaron told Ars. “I’m just fed up, and you know what? Let’s fight back, even if it’s not successful. Be indigestible. Grow spikes.”

Some are certainly inspired to fight back. The project received praise and visibility from the likes of Cory Doctorow and Jürgen Geuter. Others had further ideas: Gergely Nagy, “algernon”, developed Iocaine, named after the poison in The Princess Bride, which works similarly but with a twist, a reverse proxy that shadows the real content whenever a crawler comes calling, set up to deploy multiple nodes of these bullshit mazes as needed. Marcus Butler meanwhile developed Quixotic for generating bullshit to deter scraping. Web-infrastructure company Cloudflare also joined the fight with AI Labyrinth, available even on the free plan, which automatically deploys an AI-generated set of linked pages in response to inappropriate bot activity, in addition to the option of blocking crawlers outright. The random content it serves, however, is not malicious statistical babble but neutral scientific facts, to avoid misinformation.
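The shared trick behind these mazes is easy to sketch (a toy illustration, not any of the above projects’ code): every URL deterministically yields a page of word salad plus links to further pages in the maze, so a crawler never runs out of “content”.

```python
import hashlib
import random

# Tiny vocabulary for the word salad; tools like Nepenthes use Markov
# chains trained on real text, so their babble looks statistically natural.
WORDS = ("the a crawler model token data page link language statistical "
         "pitcher plant endless maze garbage training text").split()

def babble(seed: str, n_words: int = 60) -> str:
    # Deterministic pseudo-random text: the same URL always renders the
    # same page, so the maze looks like static content to a bot.
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n_words))

def maze_page(path: str, n_links: int = 9) -> str:
    # Each page links to n_links further pages whose names are derived
    # by hashing the current path, so the link graph never ends.
    links = "".join(
        '<a href="/maze/{}">more</a>'.format(
            hashlib.sha1(f"{path}:{i}".encode()).hexdigest()[:12])
        for i in range(n_links)
    )
    return f"<html><body><p>{babble(path)}</p>{links}</body></html>"
```

A real deployment only has to route some URL prefix to a handler like `maze_page` and, as Nepenthes does, drip-feed the response slowly to maximise the crawler’s wasted time.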

Running these systems does come with drawbacks. Tarpits cost money to run, just as serving robots does. But tarpit creators see this as a reasonable price for slowing down and sabotaging unrestricted AI training. To quote Aaron from the Ars piece:

“The amount of power that AI models require is already astronomical, and I’m making it worse. And my view of that is, OK, so if I do nothing, AI models, they boil the planet. If I switch this on, they boil the planet. How is that my fault?”

As to the effectiveness of the tactic: Gergely Nagy experienced a 94% drop in bot traffic after deploying Iocaine, and the uptake of other solutions suggests tarpits do deter pesky bots. Whether they manage to poison models is less certain. The cost of running bots is low compared to training and inference, and AI firms have long experience filtering and scrutinising the data they collect. But as the tactics evolve, it certainly gives them more to think about, and more resources to waste while waiting for profitability.

Diversity of tactics and the noise to come

None of these methods will destroy generative AI. It is, after all, just a large pile of linear algebra; the methods and the maths are already developed. Much like stem-cell research, it will not go politely back into its box, no matter the bans or public outcries. But people, in their anger, are recognising the vision of power behind AI: the concentration and appropriation of knowledge. AI as a statistical tool is highly reflective of the social and political climate of the day, moving towards centralisation and authoritarianism. The fight against the robots is a powerful example of the diversity of tactics the current political environment requires: find an idea, draft it, and watch people remix it, go off on tangents, and build collective resistance out of individual and personal struggles for freedom.

To quote Albert Camus: “Je me révolte, donc nous sommes”, I rebel, therefore we exist. So go the musings of the French philosopher on the history of rebellion, pointing out the community-inducing nature of revolt. In the fractured politics of today, amid the induced entropy and confusion of AI, small acts of refusal and rebellious innovation are laying the building blocks of a more sustainable internet and future, maintained not by the directives of big tech, but by the refusal and community of creators.