Bots are overwhelming websites with their hunger for AI data

by Bender on 6/17/25, 9:26 PM with 26 comments
by pleeb on 6/18/25, 12:16 AM

I run a fairly large forum, and I've been getting emails from Linode that the CPU usage has been going over 90% multiple times a day. Users have been complaining that the site has been taking up to five or six seconds to load. I checked the logs, and I kept getting hit with hundreds of connections a second from specific addresses, so I set up rate limiting with Cloudflare.

I thought everything was going well after that, until suddenly it started getting even worse. I started realizing that instead of one IP hitting the site a hundred times per second, it was now hundreds of IPs hitting the site slightly below the throttling threshold I had set up.
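
To illustrate the evasion described above, here is a minimal sketch of a sliding-window rate limiter. It is not Cloudflare's implementation or configuration; the class, the helper names, the limit of 10 requests per second, and the documentation-reserved addresses are all assumptions made for the example. It replays one noisy address and then a swarm of 200 addresses that each stay just under the limit.

```python
import ipaddress
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit, window=1.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # key -> timestamps of recent allowed requests

    def allow(self, key, now):
        recent = self.hits[key]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


def subnet_key(ip, prefix=24):
    """Collapse an IPv4 address to its enclosing network, e.g. a /24."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def count_allowed(limiter, requests, key_fn=lambda ip: ip):
    """Replay (ip, timestamp) pairs and count how many get through."""
    return sum(limiter.allow(key_fn(ip), t) for ip, t in requests)


# Hypothetical traffic: one address sending 100 requests within a single second.
single = [("203.0.113.7", i / 100) for i in range(100)]
# Hypothetical swarm: 200 addresses, each sending 9 requests in that second.
swarm = [(f"198.51.100.{h}", j / 9) for j in range(9) for h in range(1, 201)]

single_allowed = count_allowed(SlidingWindowLimiter(10), single)
swarm_per_ip = count_allowed(SlidingWindowLimiter(10), swarm)
swarm_per_subnet = count_allowed(SlidingWindowLimiter(10), swarm, subnet_key)
print(single_allowed, swarm_per_ip, swarm_per_subnet)
```

With a per-IP key, the single address is cut down to 10 requests, but all 1800 swarm requests pass. Keying on the /24 instead lets only 10 through in this toy case. A real swarm is usually spread over many unrelated networks, and subnet-level limits can also catch legitimate users behind shared addresses, so this is a way to see the problem rather than a fix for it.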

by johnea on 6/17/25, 9:34 PM

This is an ever-growing problem.

The model of the web host paying for all bandwidth was somewhat aligned with traditional usage models, but the wave of scraping for training data is disrupting this logic.

I remember reading, maybe 10 years ago, that backend website communications (ads and demographic data sharing) had surpassed the bandwidth consumed by actual users. But even in that case, the traffic was still primarily linked to the website hosts.

Whereas with the recent scraping frenzy, the traffic is purely client side, not initiated by actual website users, and not particularly beneficial to the website host.

One has to wonder what percentage of web traffic is now generated by actual users, versus host backend data sharing, versus the mammoth new wave of scraping.

by rglover on 6/17/25, 9:55 PM

> Some of the bots identify themselves, but some don't. Either way, the respondents say that robots.txt directives – voluntary behavior guidelines that web publishers post for web crawlers – are not currently effective at controlling bot swarms.

Is anybody tracking the IP ranges of bots or anything similar that's reliable?

It seems like they're taking the "what are you gonna do about it" approach to this.

Edit: Yes [1]

[1] https://github.com/FabrizioCafolla/openai-crawlers-ip-ranges
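
If a list of crawler IP ranges like that is available, matching a request address against it is straightforward. A rough sketch with Python's standard `ipaddress` module follows. The ranges below are documentation-reserved placeholders, not real crawler ranges, and `is_known_crawler` is a hypothetical helper name; a real deployment would load and refresh the ranges from whatever source it trusts.

```python
import ipaddress

# Placeholder CIDR blocks (reserved for documentation), standing in for a
# published crawler range list. These are NOT real crawler ranges.
CRAWLER_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.64/26", "2001:db8::/32")
]


def is_known_crawler(ip):
    """Return True if `ip` falls inside any of the listed ranges."""
    addr = ipaddress.ip_address(ip)
    # Membership is simply False when the address and network versions differ.
    return any(addr in net for net in CRAWLER_RANGES)


for candidate in ("192.0.2.44", "198.51.100.200", "2001:db8::1"):
    print(candidate, is_known_crawler(candidate))
```

This only catches crawlers that come from the published ranges; traffic from residential proxies or unlisted addresses would pass straight through.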

by renegat0x0 on 6/18/25, 5:40 AM

An additional bad outcome is that all content could go behind logins and paywalls. What then? You will have to hand over personal data and an email address in every corner of the web just to log in.

There are also good crawlers that index sites, like Google or Marginalia, which make your page discoverable. If you lock everything away from the web, well, it disappears from the web.

by millipede on 6/17/25, 10:50 PM

Information is valuable; we just weren't charging for it. AI is just bringing the market for knowledge back into equilibrium.

by dehrmann on 6/18/25, 1:04 AM

Who's doing this at such a high volume? Most of the data is static enough that there isn't value in frequent crawls, crawls are (probably) more expensive than caching, and small shops and hobbyists don't have the resources to move the needle.

by superkuh on 6/17/25, 9:32 PM

While catchy, that headline kind of misses the point. It should be "Corporations are overwhelming websites with their hunger for AI data". They're the ones doing it, and corporations are by far the most damaging non-human persons (especially since they are formed nowadays to abstract away liability for the damage they cause).

This is not some new enemy called "bots". This is the same old non-human legal persons that polluted our physical world, repeating the pattern in the digital one. Bots run by actual human persons are not the problem.

by CSMastermind on 6/17/25, 9:37 PM

What's the solution here? Metered usage based on network traffic that gets shared with the website owners?

Otherwise everything moves behind a paywall?

by josefritzishere on 6/17/25, 10:05 PM

I think the solution is criminal penalties.

by darekkay on 6/17/25, 10:19 PM

ai.robots.txt contains a big list of AI crawlers to block, either through robots.txt or via server rules:

https://github.com/ai-robots-txt/ai.robots.tx
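
As a rough sketch of the robots.txt route, the snippet below builds a blocking rule for a list of crawler user agents and checks it with Python's standard `urllib.robotparser`. The two agent names are only examples, `build_robots_txt` is a hypothetical helper, and the URL is a placeholder; the project above maintains the actual list. This only affects crawlers that choose to honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Example agent names only; a real list would come from a maintained source.
AI_AGENTS = ["GPTBot", "CCBot"]


def build_robots_txt(agents):
    """Emit one rule group that disallows the whole site for the given agents."""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(AI_AGENTS)
print(robots_txt)

# Check the rules the way a well-behaved crawler would.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/thread/123"  # placeholder URL
print(parser.can_fetch("GPTBot", url))       # listed agent
print(parser.can_fetch("Mozilla/5.0", url))  # ordinary browser user agent
```

Crawlers that ignore robots.txt or do not identify themselves need server-side rules instead, which is the other option mentioned above.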

by cryptonector on 6/18/25, 4:35 PM

This is going to drive all blogs to GitHub gists and such.

by tartoran on 6/17/25, 9:32 PM

RIP internet. It will soon make no sense to share something with the world unless you're in it for profit. But who's gonna pay for it?