{"id":732620,"date":"2026-06-10T21:25:15","date_gmt":"2026-06-10T18:25:15","guid":{"rendered":"https:\/\/buradabiliyorum.com\/en\/publishers-push-common-crawl-to-stop-collecting-content-for-ai-training\/"},"modified":"2026-06-10T21:25:15","modified_gmt":"2026-06-10T18:25:15","slug":"publishers-push-common-crawl-to-stop-collecting-content-for-ai-training","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/publishers-push-common-crawl-to-stop-collecting-content-for-ai-training\/","title":{"rendered":"Publishers push Common Crawl to stop collecting content for AI training"},"content":{"rendered":"<h2 class=\"subhead\" itemprop=\"alternativeHeadline\"><span class=\"ez-toc-section\" id=\"Could_AI_lose_a_key_source_of_training_data_Major_publishers_want_Common_Crawl_to_stop_collecting_and_sharing_their_content\"><\/span>Could AI lose a key source of training data? Major publishers want Common Crawl to stop collecting and sharing their content.<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<div class=\"bialty-container\">\n<p>Digital Content Next (DCN) sent the Common Crawl Foundation a cease-and-desist letter demanding that it stop scraping and distributing protected publisher content.<\/p>\n<p>The U.S.
trade group, which represents major digital publishers (e.g., the AP, the New York Times, NBCUniversal, Bloomberg, NPR, and Fox), also asked Common Crawl to remove DCN members\u2019 content from its datasets, including paywalled and subscriber-only news articles.<\/p>\n<p><strong>Publishers question opt-outs.<\/strong> DCN\u2019s lawyers raised concerns about whether Common Crawl honored publisher opt-out requests and removed older content when asked.<\/p>\n<ul class=\"wp-block-list\">\n<li>The letter said Common Crawl had, in some cases, told publishers it was complying, only to say later that technical costs and delays prevented full removal. DCN\u2019s lawyers said they were reviewing whether those statements were inaccurate or misleading.<\/li>\n<li>Common Crawl publishes a registry of sites that have opted out of scraping. The list includes many large news publishers.<\/li>\n<\/ul>\n<p id=\"h-dcn-alleges-infringement\"><strong>DCN alleges infringement.<\/strong> The letter argued that copyright law is not an opt-out system. DCN said Common Crawl \u201cflagrantly infringed\u201d publisher copyrights by creating and distributing datasets containing protected content without permission or compensation.<\/p>\n<ul class=\"wp-block-list\">\n<li>The group also said Common Crawl made that content available to companies developing AI tools and large language models.<\/li>\n<li>DCN CEO Jason Kint said the legal notice challenges the idea that online content can be collected, stored, and reused simply because it is accessible.<\/li>\n<\/ul>\n<p id=\"h-common-crawl-pushes-back\"><strong>Common Crawl pushes back.<\/strong> Executive Director Rich Skrenta denied that CCBot bypasses paywalls to scrape websites.
He also denied misleading publishers after The Atlantic reported in November that some content from publishers that had requested removal remained available.<\/p>\n<ul class=\"wp-block-list\">\n<li>\u201cWhen a publisher asks us to remove previously crawled material, we respond promptly and initiate a removal process that reflects the technical design of our dataset,\u201d Skrenta said.<\/li>\n<\/ul>\n<p id=\"h-why-we-care\"><strong>Why we care.<\/strong> This fight could shape how much publisher content AI search engines can use without permission. If courts or settlements impose stricter consent requirements, AI responses may rely more on licensed sources and less on the open web.<\/p>\n<p id=\"h-ai-training-stakes\"><strong>AI training stakes.<\/strong> Since 2008, Common Crawl has scraped billions of webpages to build a free public archive. Its datasets have been widely used to train AI models. The New York Times\u2019 2023 copyright lawsuit against OpenAI cited Common Crawl as making up 60% of GPT-3\u2019s training data, <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/pressgazette.co.uk\/media_law\/common-crawl-ai-news-publishers-scraping-cease-and-desist-letter\/\">Press Gazette<\/a> reported.<\/p>\n<ul class=\"wp-block-list\">\n<li>A 2024 Mozilla Foundation paper said that, in its current form, generative AI likely would not have been possible without Common Crawl.<\/li>\n<li>Common Crawl has been working on open standards for AI crawling preferences, Skrenta said this week.
DCN\u2019s letter asks for a harder line: stop scraping protected publisher content and remove member content already in the datasets.<\/li>\n<\/ul>\n<div class=\"ttd-topics-display\">\n<div class=\"ttd-topics-content\">\n<h5><span class=\"ez-toc-section\" id=\"Topics_on_this_page\"><\/span>Topics on this page<span class=\"ez-toc-section-end\"><\/span><\/h5>\n<div class=\"ttd-topics-links\">Common Crawl Foundation, Artificial intelligence, Jason Kint, Large language model, Rich Skrenta, Associated Press, Bloomberg Television, CCBot, Copyright infringement, Data scraping, Digital Content Next, Fox Corporation, Generative AI, GPT-3, Mozilla Foundation, National Public Radio, NBCUniversal, OpenAI, The Atlantic, The New York Times, Web scraping<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/searchengineland.com\/publishers-common-crawl-content-ai-training-479831\" target=\"_blank\" >Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Could AI lose a key source of training data? Major publishers want Common Crawl to stop collecting and sharing their content.
Digital Content Next (DCN) sent the Common Crawl Foundation a cease-and-desist letter demanding that it stop scraping and distributing protected publisher content. The U.S. trade group, which represents major digital publishers (e.g., the AP,&#8230;<\/p>\n","protected":false},"author":1,"featured_media":732621,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/searchengineland.com\/wp-content\/seloads\/2026\/06\/common-crawl-ai-training.png","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[],"class_list":["post-732620","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/732620","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=732620"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/732620\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/732621"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=732620"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=732620"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=732620"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}