214723 author: tk@bbs.kawa-kun.com 03 Apr 2025 19:23
tags: #ai #llm #llms

139194 author: khobochka@mastodon.social 27 Dec 2024 10:25
https://pod.geraspora.de/posts/17342163
tags: #chatgpt #llms

#LLMs are a fucking scourge. Perceiving their training infrastructure as anything but a horrific all-consuming parasite destroying the internet (and wasting real-life resources at a grand scale) is delusional.

#ChatGPT isn't a fun toy or a useful tool, it's a _someone else's_ utility built with complete disregard for human creativity and craft, mixed with malicious intent masquerading as "progress", and should be treated as such.

... Summing up the top UA groups, it looks like my server is doing 70% of all its work for these fucking LLM training bots that don’t do anything except for crawling the fucking internet over and over again.

Oh, and of course, they don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not. They also don’t give a single flying fuck about robots.txt, because why should they. And the best thing of all: they crawl the stupidest pages possible. Recently, both ChatGPT and Amazon were - at the same time - crawling the entire edit history of the wiki. And I mean that - they indexed every single diff on every page for every change ever made. Frequently with spikes of more than 10 req/s. Of course, this made MediaWiki and my database server very unhappy, causing load spikes, and effective downtime/slowness for the human users.

If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.
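
For anyone wanting to repeat the "summing up the top UA groups" step on their own server, here is a minimal sketch. It assumes access logs in the common "combined" format, where the user agent is the last double-quoted field on each line. The marker list and the function names are illustrative assumptions, not taken from the quoted post.

```python
import re
from collections import Counter

# Assumed substrings for bucketing user agents; replace with whatever
# actually shows up in your own logs.
BOT_MARKERS = ("GPTBot", "Amazonbot")

# In the combined log format the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def ua_group(line):
    """Return the bucket name for one access-log line."""
    match = UA_PATTERN.search(line)
    if match is None:
        return "unparsed"
    user_agent = match.group(1).lower()
    for marker in BOT_MARKERS:
        if marker.lower() in user_agent:
            return marker
    return "other"


def group_shares(lines):
    """Return each bucket's share of requests as a fraction of all lines."""
    counts = Counter(ua_group(line) for line in lines)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: n / total for group, n in counts.items()}
```

Since the post notes that blocked crawlers switch to a non-bot UA string, anything disguised that way lands in "other", so the bot share computed here is best read as a lower bound. It also counts requests only, not the server work each request causes.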

36686 author: danstowell@mastodon.social 30 Aug 2024 09:50
tags: #alttext #commoncrawl #llms

Fediverse images & #alttext will certainly be scraped by groups to train their AIs on image-text correspondence. I'm sure it will be happening already. (Yes, many tools can already generate crappy alttext, but high-quality paired data is *valuable* in ML.) Thanks to the precedent set by #commoncrawl and #LLMs, our copyrights and licence terms will be ignored even when explicitly asserted. Case law is not strong enough (nor international).