
Editor’s note: This work is part of AI Watchdog, The Atlantic’s ongoing investigation into the generative-AI industry.The Common Crawl Foundation is little known outside of Silicon Valley. For more than a decade, the nonprofit has been scraping billions of webpages to build a massive archive of the internet. This database—large enough to be measured in petabytes—is made freely available for research. In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models. In the process, my reporting has found, Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites. And the foundation appears to be lying to publishers about this—as well as masking the actual contents of its archives.Common Crawl has not said much publicly about its support of… ...[TheTopNews] Read More.
1 week ago





