Publishers Target Common Crawl In Fight Over AI Training Data

Danish media outlets have demanded that the nonprofit web archive Common Crawl remove copies of their articles from past data sets and stop crawling their websites immediately. This request was issued amid growing outrage over how artificial intelligence companies like OpenAI are using copyrighted materials.

Common Crawl plans to comply with the request, first issued on Monday. Executive director Rich Skrenta says the organization is “not equipped” to fight media companies and publishers in court.

The Danish Rights Alliance (DRA), an association representing copyright holders in Denmark, spearheaded the campaign. It made the request on behalf of four media outlets, including Berlingske Media and the daily newspaper Jyllands-Posten. The New York Times made a similar request of Common Crawl last year, prior to filing a lawsuit against OpenAI for using its work without permission. In its complaint, the New York Times highlighted how Common Crawl’s data was the most “highly weighted data set” in GPT-3.

Thomas Heldrup, the DRA’s head of content protection and enforcement, says that this new effort was inspired by the Times. “Common Crawl is unique in the sense that we’re seeing so many big AI companies using their data,” Heldrup says. He sees its corpus as a threat to media companies attempting to negotiate with AI titans.

Although Common Crawl has been essential to the development of many text-based generative AI tools, it was not designed with AI in mind. Founded in 2007, the San Francisco–based organization was best known prior to the AI boom for its value as a research tool. “Common Crawl is caught up in this conflict about copyright and generative AI,” says Stefan Baack, a data analyst at the Mozilla Foundation who recently published a report on Common Crawl’s role in AI training. “For many years it was a small niche project that almost nobody knew about.”

Prior to 2023, Common Crawl did not receive a single request to redact data. Now, in addition to the requests from the New York Times and this group of Danish publishers, it’s also fielding an uptick in requests that have not been made public.

In addition to this sharp rise in demands to redact data, Common Crawl’s web crawler, CCBot, is also increasingly thwarted from accumulating new data from publishers. According to the AI detection startup Originality AI, which often tracks the use of web crawlers, more than 44 percent of the top global news and media sites block CCBot. Apart from BuzzFeed, which began blocking it in 2018, most of the prominent outlets Originality AI analyzed (including Reuters, the Washington Post, and the CBC) began spurning the crawler only in the last year. “They’re being blocked more and more,” Baack says.
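For readers curious about the mechanics: the article does not say how each outlet blocks CCBot, but the usual route is a site's robots.txt file, which well-behaved crawlers consult before fetching pages. The sketch below is a minimal illustration using Python's standard library. The robots.txt contents, the `is_allowed` helper, and the example.com URL are all hypothetical and are not taken from any publisher named here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn away Common Crawl's crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example.com/news/some-story"  # placeholder URL
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", story))
    print("Other crawler allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", story))
```

A rule like this only stops future crawling by bots that honor it. It does nothing about copies already sitting in past data sets, which is why the publishers are also asking for redactions.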

Common Crawl’s quick compliance with this kind of request is driven by the realities of keeping a small nonprofit afloat. Compliance does not equate to ideological agreement, though. Skrenta sees this push to remove archival materials from data repositories like Common Crawl as nothing short of an affront to the internet as we know it. “It’s an existential threat,” he says. “They’ll kill the open web.”

He’s not alone in his concerns. “I’m very troubled by efforts to erase web history and especially news,” says journalism professor Jeff Jarvis, a staunch Common Crawl defender. “It’s been cited in 10,000 academic papers. It’s an incredibly valuable resource.” Common Crawl collects recent examples of research conducted using its data sets; newer highlights include a report on internet censorship in Turkmenistan and research into fine-tuning online fraud detection.

Common Crawl’s evolution from a low-key tool beloved by data nerds and ignored by everyone else to a newly controversial AI helpmate is part of a larger clash over copyright and the open web. A growing contingent of publishers, as well as some artists, writers, and other creative types, are fighting efforts to crawl and scrape the web, sometimes even when those efforts are noncommercial, like Common Crawl’s ongoing project. Any project that could potentially be used to feed AI’s appetite for data is under scrutiny.

In addition to a slew of lawsuits alleging copyright infringement filed against the generative AI world’s major players, copyright activists are also pushing for legislation to put guardrails on data training, forcing AI companies to pay for what they use. Additional scrutiny of Common Crawl and other popular data sets like LAION-5B has revealed that, in hoovering up data from all over the internet, these corpuses have inadvertently archived some of its darkest corners. (LAION-5B was temporarily taken down in December 2023 after an investigation by Stanford researchers found that the data set included child sexual abuse materials.)

The Danish Rights Alliance has a notably hard-charging approach to AI and copyright issues. Earlier this year, it led a campaign to file Digital Millennium Copyright Act (DMCA) takedown notices—which alert companies to potentially infringing content hosted on their platforms—for book publishers whose work had been uploaded to OpenAI’s GPT Store without their permission. Last year, it spearheaded an effort to remove a popular generative AI training set known as Books3 from the internet. As a whole, the Danish media is remarkably organized in its fight against AI companies using media as training data without first licensing it; a collective of major newspapers and TV stations has recently threatened to sue OpenAI unless it provides compensation for the use of their work in its training data.

If enough publishers and news outlets opt out of Common Crawl, it could have a significant impact on academic research in a range of disciplines. It could also have another unintended consequence, Baack argues. He thinks that putting an end to Common Crawl might primarily impact newcomers and smaller projects in addition to academics, entrenching today’s power players in their current dominant positions and calcifying the field. “If Common Crawl is damaged so much that it’s not useful anymore as a training data source, I think we’d basically be empowering OpenAI and other leading AI companies,” he says. “They have the resources to crawl the web themselves now.”

About Kate Knibbs
