You may now add me to the list of FOSS folks directly impacted by unethical AI scrapers effectively performing Denial of Service attacks.
My wife informed me this morning that our billing system had been knocked offline.
The reason? #Amazon, #OpenAI, and similar bot scraping traffic blew up my access logs to the point of filling that server's entire drive. They're constantly scraping and re-scraping my FreeBSD wiki.
In January 2023, freebsdwiki.net--which has not been actively updated for more than a decade, and whose last update AT ALL occurred in 2021--logged 812,706 lines in its httpd-access.log.
In January 2024, that was up to 2,690,389 lines.
As of January 2025, it was up to 7,607,518 lines. And last month--March 2025--it was up to an eye-watering 18,168,053 lines.
Here's a breakdown of user-agent strings seen in March 2025. The top two user agents are Amazonbot and GPTBot; between the two of them, they account for 35% of all traffic to the site.
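(For anyone who wants to run the same kind of breakdown on their own logs, a quick-and-dirty tally like the sketch below is enough. It assumes the stock combined log format and a placeholder log path; it's the idea, not the exact tooling behind the chart.)

```python
#!/usr/bin/env python3
# Quick-and-dirty user-agent tally for a combined-format access log.
# Sketch only: the log path and the bot names singled out here are
# illustrative, not a complete list.
import re
from collections import Counter

LOG = "/var/log/httpd-access.log"  # placeholder path

# combined format ends with: "referer" "user-agent"
ua_re = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')

counts, total = Counter(), 0
with open(LOG, errors="replace") as f:
    for line in f:
        m = ua_re.search(line)
        if not m:
            continue
        total += 1
        ua = m.group("ua")
        if "Amazonbot" in ua:
            counts["Amazonbot"] += 1
        elif "GPTBot" in ua:
            counts["GPTBot"] += 1
        else:
            counts[ua.split("/")[0][:40] or "(empty)"] += 1

for name, n in counts.most_common(20):
    print(f"{n:>10}  {100 * n / total:5.1f}%  {name}")
```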
That doesn't sound as bad as the chart makes it look... but even the chart doesn't capture the full story.
THIS is what Amazon's scraper traffic looks like: a never-ending series of RecentChanges pulls, several times a second, from multiple IP addresses.
This is an INSANELY difficult load to manage, because it isn't really cacheable and it hits the database HARD.
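(You can see the shape of it in your own logs easily enough. Something like this sketch, again assuming the combined log format, counts the per-second RecentChanges pulls and the size of the IP pool behind them.)

```python
#!/usr/bin/env python3
# Sketch: how many times per second is RecentChanges being pulled,
# and from how many distinct IPs? Assumes the combined log format
# (client IP first, timestamp in [brackets], request in quotes).
import re
from collections import Counter

line_re = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)"')

per_second, ips = Counter(), set()
with open("/var/log/httpd-access.log", errors="replace") as f:  # placeholder path
    for line in f:
        m = line_re.match(line)
        if not m or "RecentChanges" not in m.group("req"):
            continue
        per_second[m.group("ts")] += 1   # log timestamps have 1-second resolution
        ips.add(m.group("ip"))

print(f"distinct IPs pulling RecentChanges: {len(ips)}")
for ts, n in per_second.most_common(5):
    print(f"{n} pulls during the second starting {ts}")
```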
I hope you didn't expect to see different tactics from OpenAI's gptbot. It's not rotating IP addresses... but I betcha if I block 20.171.207.122 today, tomorrow I see a giant pool of IP addresses behind gptbot just like the one I see now behind amazonbot.
Let's talk about the next few entries on the list:
Semrushbot: this is for an "SEO index"; it's been a plague on the net for years.
MJ12bot is for a UK-based engine called "Majestic" which maps links while ignoring actual content. Another very long-term, prolific abuser, though one dwarfed by the current AI scraping.
Petalbot: this one is for "Petal Search," which is used exclusively on Huawei phones. Doesn't even have a website!
Barkrowler is another SEO engine focused on links, not content.
After that, we've got several more bots, several demonstrably fake lookalike "chrome" or "safari" agents, and even the first several human-LOOKING user agents are obviously fake... unless you believe, for example, that an elderly version of Microsoft Edge accounts for a greater share of traffic than ALL versions of Safari, Chrome, and Firefox combined.
Bytespider is notable here for pretending it's "mobile Safari" running on Android. Sure Jan.
Anyway, here's another graphical view into the problem: look at what happened in December 2024.
The "unique visitors" stat went up by an order of magnitude between Nov 24 and Dec 24... and not only has it not dropped since, it's gotten actively WORSE.
Not only has the "unique visitors" stat stayed in the six figures level, the *bandwidth* has gone up from typically around 700MiB per month to a whopping TWENTY-SIX GiB per month.
If you're wondering why the stats for February show "zero" visitors and uniques, well, that stats run fired off just prior to the cron job that COMPRESSES all the logfiles--so I'm pretty sure the script failed ENOSPC there, recovering for a little while once the last few humongous logs were gzipped.
Are the scrapers hitting my tech blog also? You'd better believe it--but since the blog lacks the tooling MediaWiki offers to show just what's changed, they aren't scraping anywhere NEAR as frequently.
After all, they don't want to burn their OWN processing time actually parsing things before they ingest into the latest model, now do they...?
This does raise an interesting question, though... just what the fuck does Amazonbot think it's doing HERE?
@jimsalter Are there explanations for *why* the scrapers seem to want to re-scrape so often? Also why does it feel like this is only being discussed on our side of this fence? Maybe I've missed some article where Facebook employees are explaining why they're re-indexing so often, or how they're working on building a better scraper?
@bobthcowboy so, you're wondering why the supervillains aren't monologuing?
I mean, they're NOT GOOD PEOPLE. Good people don't behave like this. There's really no benefit of the doubt to extend here.
If you really want to listen to supervillain monologues, you can absolutely find plenty of video of the likes of Sam Altman declaring that anything being done in pursuit of AI is worth it, no matter the cost, period.
@jimsalter No, you're right. I'm often involved in hiring in my roles and I actually decided ~10 years ago that I would never hire someone who worked at Facebook at that time or later. And like, there's people right here on Fosstodon who still work for Big Tech, despite the general state of that scene. I don't get it and I try to understand what the motivation is aside from greed or naivety.
@bobthcowboy I won't say I wouldn't hire anybody who worked for Facebook or, say, Microsoft... there was a period in my life when I was actively being recruited by those orgs, and it was DIFFICULT soul-searching figuring out whether the best thing to do was say hell no, or to say yes and try to change the orgs from within.
I would definitely be probing such candidates on morals and ethics during an interview, though.
@jimsalter @bobthcowboy these orgs are too big to change anything from within, at least when you're just some engineer.
@durchaus @bobthcowboy I came to the same conclusion, which is why I never went that route even when headhunted.
Once I got more overtly political, the headhunting attempts largely stopped. I'm okay with THAT, too.
@bobthcowboy @jimsalter Apparently it's cheaper to just fetch the stuff off the internet every time you need it instead of caching it locally.
@jimsalter wonder if there is a way to add Google AdSense or something to capitalize on views from these scrapes
@jimsalter a long time ago, when this web thing was relatively new, I attended the same institution as some grad students learning to scrape. One in particular would recompile his regex for every page he scraped, never reuse it, never release it, repeatedly running a shared host out of memory. Your story does not surprise me at all.
@jimsalter I've got my popcorn out.
At this stage I resorted to using Fail2Ban for bots that transgress where robots.txt says they shouldn't go, but the lists can get so long that it starts to impact performance.
@spineless_echidna this is one of many, many reasons I am constantly advising people to avoid fail2ban. :)
@jimsalter is there a decent alternative? I'm starting to consider rolling out nepenthes or other tarpits in a parallel VM, but I wonder at what cost.
I guess I could re-write the URL on a per-user-agent basis and just re-direct them to the tarpit. If they're smart enough to detect it and blacklist my genuine addresses, it's still a win.
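Roughly this shape, I suppose (a Python/WSGI sketch of the idea rather than the actual web-server rewrite I'd deploy; the UA patterns and tarpit path are placeholders):

```python
# Sketch of the per-user-agent rewrite idea, as WSGI middleware.
# The UA patterns and the tarpit location are placeholders; the real
# thing would more likely live in the web server's rewrite rules.
import re

BOT_UA = re.compile(r"Amazonbot|GPTBot|Bytespider|SemrushBot|MJ12bot", re.I)
TARPIT = "/tarpit/"

def tarpit_redirector(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        path = environ.get("PATH_INFO", "")
        if BOT_UA.search(ua) and not path.startswith(TARPIT):
            # Send matching bots off to the tarpit instead of the real content.
            start_response("307 Temporary Redirect", [("Location", TARPIT)])
            return [b""]
        return app(environ, start_response)
    return middleware
```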
@spineless_echidna I'm also strongly considering Nepenthes or similar. I'd really, really like to actively POISON those bots with false data, not just hamper them with a tarpitted workload.
You want to train your models? Step right the fuck up, have I ever got some training data for YOU...
@Tubsta @jimsalter @spineless_echidna pissing off the people who know how to do things is always a winning strategy. *Grabs Popcorn*
@oxyhyxo @Tubsta @spineless_echidna might be interesting to run one of those horrible spammer tools that tries to bypass blocks by using a thesaurus on every third word, against some readily-available public domain text that can be ethically sourced.
I'm trying to think of cheap ways to poison the scraped data, too subtly to detect until it makes it into the actual model.
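Cheapest version I can come up with so far is something like this sketch (assumes NLTK with the WordNet corpus downloaded; a real attempt would need to be far more subtle to dodge data-cleaning filters):

```python
# Sketch of the "thesaurus every Nth word" poisoner. Assumes NLTK
# with the WordNet corpus installed (nltk.download("wordnet")).
import random
from nltk.corpus import wordnet as wn

def synonym(word):
    """Return a random WordNet 'synonym' for word, or the word itself."""
    lemmas = {lemma.replace("_", " ")
              for syn in wn.synsets(word)
              for lemma in syn.lemma_names()
              if lemma.lower() != word.lower()}
    return random.choice(sorted(lemmas)) if lemmas else word

def poison(text, every=3):
    """Swap roughly every Nth word for a near-but-not-quite synonym."""
    words = text.split()
    return " ".join(synonym(w) if i % every == every - 1 else w
                    for i, w in enumerate(words))

print(poison("The quick brown fox jumps over the lazy dog"))
```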
@jimsalter @Tubsta @spineless_echidna thesaurus but 2 or 3 definitions deep. Swap the odd noun for a verb.
@oxyhyxo @Tubsta @spineless_echidna you're definitely picking up what I'm putting down.
@jimsalter @Tubsta @spineless_echidna aka put whatever they're scraping through a grammatical woodchipper
@oxyhyxo @Tubsta @spineless_echidna I wonder how hard it would be to just proxy a scraper direct to output from a hosted ChatGPT or similar instance... "incest" in the training data is ROUGH on LLMs.
@oxyhyxo @Tubsta @spineless_echidna I'm not looking to put it through a grammatical woodchipper so much as subtly change the actual MEANING, in ways that are less likely to be detected before they're ingested and potentially do major damage to the model itself.
Models interpret words as vectors. Consider an engineering problem, in which you only "slightly" modify the vectors of a portion of the moving parts inside a machine...
@Tubsta @jimsalter @spineless_echidna and it's not like every sysadmin ever doesn't have a degree of BOFH about them
@jimsalter @spineless_echidna They are truly a scourge to the internet as a whole. People focus on the energy consumption of the GPUs, but the constant bot strain on websites everywhere causes everyone to use more energy and bandwidth.
@jimsalter @spineless_echidna While realizing it’s not perfect for the smaller percentage that lie with old browser versions, I’ve found it *extremely* effective to have a regex block of ~2-3 dozen of the worst behaving offenders that provide no value to me — classic annoyances like MJ12bot and Turnitinbot, to newer Bytespider, GPTBot, PerplexityBot, etc. — that is imported for all my sites in nginx & instantly returns 403 for all reqs. Fast and effective for the largest, low-hanging fruit.
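Not my actual nginx config, but the core of it is nothing more than this (sketched in Python with example patterns; my real list is a few dozen entries long):

```python
# The deny-list idea in miniature: one compiled regex of known-bad
# user agents, checked per request and answered with an instant 403.
# Patterns here are examples only, not the full list.
import re

BAD_UA = re.compile(
    r"MJ12bot|Turnitinbot|Bytespider|GPTBot|PerplexityBot|SemrushBot|Amazonbot",
    re.IGNORECASE,
)

def should_block(user_agent: str) -> bool:
    """Return True if this request should get an immediate 403."""
    return bool(BAD_UA.search(user_agent or ""))

assert should_block("Mozilla/5.0 (compatible; GPTBot/1.0)")
assert not should_block("Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0")
```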
@matt_garber @spineless_echidna OpenAI and Amazon have both already been caught lying about their user agents and using residential IP proxies when blocked (which raises the question: how are they getting ACCESS to pools of residential IP addresses?).
I'm looking into deliberately poisoning their models with trash data, personally. You want my CPU cycles and bandwidth? You got 'em, hope you enjoy 'em...
@jimsalter @spineless_echidna Yeah, that’s fair enough, and the poisoned dataset methods are enticing, too. FWIW, from the sample of sites I block the UAs for, some fairly high traffic, I haven’t yet seen a *commensurate* offsetting increase in bogus UAs vs. all the requests I still see coming in from their “genuine” crawler identifier (getting denied), so that’s still my primary method, and it doesn’t rely on IPs at all. (I also rate limit everything dynamic to keep reqs/s reasonable.)
@jimsalter @spineless_echidna Re: the performance of doing something with fail2ban or equivalent, the key I learned *if you’re going to* use things like f2b/sshguard is *not* using the default iptables rules configuration where a new rule gets added for each entry, but having a config that uses IP sets (or now, nftables sets) — the performance penalty of adding individual rules goes away, ipsets default to 65k address or CIDR-range entries and can be increased, and block lookups are O(1).
@jimsalter Do you want me to try to track this down at its source?
My contacts are beginning to grow stale with me being 2+ years out but I could try.
@feoh absolutely, if you can manage it. I'm guessing it's following an anchor tag it found on some script kiddie's site, from some attempt said skiddie was making to exploit something along the lines of log4j or similar, but at this point, who the hell knows for sure?
@jimsalter No luck yet but (pardon if this is obvious) did you see this?
There's a specific AmazonBot contact addy in there.
@jimsalter weird referrals? newdumpspdf looks like some exam dump website. Wonder if you had some post that looked like it was certificate or training related and ended up scraped by some weird website
@jimsalter first, I'm surprised that the scraper bots are so honest w.r.t. their user-agent string. This way, it should be very easy to just block them based on their user agent.
Second, why do they have to update so often? Google and other search engines don't scrape as much, right? Scraping data for AI isn't actually that different from search engine scraping, in either content or frequency. So why?
@durchaus they're only "honest" until blocked, either via robots.txt or more direct tactics (like blackholing IP ranges).
Once the gig is demonstrably up, they hide the UA, use residential IP addresses as proxies, and so on.
@jimsalter Welcome to my world. We've blocked the legit AI bot user-agents only to find the sleazy Alibaba LLC/TenCent/Huawei bots slamming the sites with older Chrome agents (120 or older)
@jimsalter When I told gptbot to piss off in my robots.txt, the next day I suddenly started receiving massive (and roughly equal) traffic from supposed Firefox and Chrome users who can apparently read as fast as a bot and know which files in my CMS to look for that aren't public facing.
Cloudflare is helping some, but it's like playing a game of whack-a-mole.
We need legislation to make robots.txt legally enforceable. It should be a no trespassing sign.
@Rairii sure is. This is a mediawiki site, and if you believe AI crawlers don't know EXACTLY what a mediawiki is and hunger SPECIFICALLY for its data... welp.
@jimsalter @Rairii Seeing the same thing on my partner's and my silly MediaWiki site as well that we use for planning stuff. It has barely 50 pages :/
@jimsalter do they ignore robots.txt?
@dan @jimsalter of course they do.
@jimsalter @opticron @dan That's what this tool does: https://github.com/TecharoHQ/anubis
"According to a comprehensive recent report from LibreNews, some open source projects now see as much as 97 percent of their traffic originating from AI companies' bots, dramatically increasing bandwidth costs, service instability, and burdening already stretched-thin maintainers."
Wild.
@jimsalter @hallo Have you at uberspace noticed this on your servers?
@hamoid @jimsalter Oh yes we have. And we still have a lot of work to do because of it. We have a blog post in German about it: https://blog.uberspace.de/2024/08/bad-robots/