36687 author: danstowell@mastodon.social 30 Aug 2024 09:59
tags: #commoncrawl #creativecommons

It's worth pointing out the role of #commoncrawl in all of this. Their aim was "beneficial": instead of every research group scraping the web separately (hammering all our servers), they decided to do it once as a public pool of data for research. But:
(a) they did nothing to help respect authors' licensing (e.g. "no-derivatives"/"share-alike" #creativecommons choices);
(b) they hide behind US "fair use" law, but they do nothing to ensure the data will only be used for fair-use purposes.

36686 author: danstowell@mastodon.social 30 Aug 2024 09:50
tags: #alttext #commoncrawl #llms

Fediverse images & #alttext will certainly be scraped by groups to train their AIs on image-text correspondence. I'm sure it will be happening already. (Yes, many tools can already generate crappy alttext, but high-quality paired data is *valuable* in ML.) Thanks to the precedent set by #commoncrawl and #LLMs, our copyrights and licence terms will be ignored even when explicitly asserted. Case law is not strong enough (nor international).