Fosstodon @fosstodon

Recent searches

Search options

Only available when logged in.

You may now add me to the list of FOSS folks directly impacted by unethical AI scraping effectively performing Denial of Service attacks.

My wife informed me this morning that our billing system had been knocked offline.

The reason? #Amazon, #OpenAI, and similar bot scraping traffic blew up my access logs to the point of filling that server's entire drive. They're constantly scraping and re-scraping my FreeBSD wiki.

https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/

man sitting in sofa in a flooded living room, feets in water, writing on a laptop

Ars Technica · Mar 25Open source devs say AI crawlers dominate traffic, forcing blocks on entire countriesBy Benj Edwards

Apr 02, 2025, 10:25 PM·

132boosts·87favorites

**Jim Salter** @jimsalter · Apr 2 *

Apr 2 *

Jim Salter @jimsalter

In January 2023, freebsdwiki.net--which has not been actively updated for more than a decade, and whose last update AT ALL occurred in 2021--received 812,706 lines in its httpd-access.log.

In January 2024, that was up to 2,690,389 lines.

As of January 2025, it was up to 7,607,518 lines. And last month--March 2025--it was up to an eye-watering 18,168,053 lines.

**Jim Salter** @jimsalter · Apr 2 *

Apr 2 *

Jim Salter @jimsalter

Here's a breakdown of user-agent strings seen in March 2025. The top two user agents are Amazonbot and Gptbot. Between the two of them, they account for 35% of all traffic to the site.

That doesn't sound as bad as the chart makes it look... but even the chart doesn't capture the full story.

**Jim Salter** @jimsalter · Apr 2

Apr 2

Jim Salter @jimsalter

THIS is what Amazon's scraper traffic looks like: a never ending series of RecentChanges pulls, several times a second, from multiple IP addresses.

This is an INSANELY difficult load to manage, because it isn't really cachable and hits the database HARD.

**Jim Salter** @jimsalter · Apr 2

Apr 2

Jim Salter @jimsalter

I hope you didn't expect to see different tactics from OpenAI's gptbot. It's not rotating IP addresses... but I betcha if I block 20.171.207.122 today, tomorrow I see a giant pool of IP addresses behind gptbot just like the one I see now behind amazonbot.

**Jim Salter** @jimsalter · Apr 2

Apr 2

Jim Salter @jimsalter

Let's talk about the next few entries on the list:

Semrushbot: this is for an "SEO index", it's been a plague on the net for years.

MJ12bot is for a UK-based engine called "Majestic" which maps links, ignoring actual content. Another very long term, prolific abuser, dwarfed by the current AI scraping.

Petalbot: this one is for "Petal search," which is exclusively used on Huawei phones. Doesn't even have a website!

Barkrowler is another SEO engine focused on links, not content.

**Jim Salter** @jimsalter · Apr 2

Apr 2

Jim Salter @jimsalter

After that, we've got several more bots, several demonstrably fake lookalike "chrome" or "safari" agents, and even the first several human-LOOKING user agents are obviously fake... unless you believe, for example, that an elderly version of Microsoft Edge has a greater percent of the traffic share than ALL versions of Safari, Chrome, and Firefox combined.

Bytespider is notable here for pretending it's "mobile Safari" running on Android. Sure Jan.

Recent searches

Search options

Administered by:

Server stats:

Back