Wikipedia vs. AI slop: The volunteer army saving Big Tech’s training data


As Wikipedia takes on a critical role in shaping artificial intelligence tools, its content moderators are feeling the pressure and responsibility of their task.

On January 15, the Wikimedia Foundation — the nonprofit that runs Wikipedia — announced a string of partnerships with major AI companies, including Amazon, Meta, Microsoft, Mistral AI, and Perplexity, “to integrate human-governed knowledge into their platforms at scale.” The large language models of these AI giants will now gain access to information across several Wikimedia projects, including free encyclopedias in 350 languages, Wikibooks in 75 languages, and Wiktionary in more than 190 languages.

Regional-language Wikipedia editors now find themselves doing double duty: feeding AI systems with credible knowledge while also guarding their languages against AI-generated misinformation and low-quality content. Their work could determine how well billions of people can access accurate AI in their own languages.

“Local contributors understand linguistic subtleties and cultural context that AI simply cannot replicate,” Pranayraj Vangari, a film director and theatre research scholar who has contributed 10,700 articles as a volunteer editor on Telugu Wikipedia over the last 13 years, told Rest of World. “Without strong human involvement, AI could actually widen the gap between English and regional-language Wikipedias instead of narrowing it.” 

When software engineer Ravi Chandra Enaganti started publishing Telugu articles in 2007, the language had barely any representation on the internet. Spoken by 96 million people in southern India, Telugu remains underrepresented in AI and natural language processing applications. Enaganti told Rest of World that Wikipedia’s latest partnership could help his language overcome another hurdle on the internet.

“Since Wikipedia is built on verified, reliable sources, it is actually a good thing that AI tools scrape our data rather than non-credible sources,” he said. “I want to ensure the ‘brain’ of the internet is trained on quality content.”

Over 284,000 editors make more than one edit per month on English-language Wikipedia; 30,000 of them make at least five edits per month. Several European-language Wikipedias, like the ones in French and Spanish, also have tens of thousands of editors. Widely spoken Asian languages like Marathi, Telugu, and Tamil, however, have only a few hundred editors each.

AI struggles to answer even basic questions in regional languages, according to Netha Hussain, a physician and researcher working at the Sahlgrenska University Hospital in Gothenburg, Sweden. She has contributed over 300 English-language articles and 100 in Malayalam since 2010. Hussain focuses on fighting misinformation in health care and adding content on underrepresented topics such as women’s health and women’s biographies.

“As an editor, it is now time for me to not only write articles but also focus more of my time on finding and fixing knowledge gaps, strengthening verifications, [and] maintaining neutrality,” she told Rest of World.

The Wikimedia Foundation is relying on its global community of volunteer editors to carve its own path.

“Over many years, these volunteers have developed sophisticated rules and tools for identifying content that does not belong on Wikipedia — taken together, it’s like Wikipedia’s immune system,” Marshall Miller, senior director of product at the Wikimedia Foundation, told Rest of World. “Right now, these volunteers are evolving the immune system to adapt to the newest challenge of low-quality content coming from irresponsibly used generative AI.”

In many cases, editors are fighting AI with AI tools, Miller said.

“Every language Wikipedia community has set up different processes to address AI-generated content on Wikipedia,” he said. “While there are some language communities that leverage AI and machine translation tools to grow content in regional languages, others are focused on ensuring AI content is flagged early on their language Wikipedias.”

All three contributors told Rest of World they use AI tools like ChatGPT and Gemini to gather ideas for what to write next, or to improve the structure and linguistic flow of their content, but none of them relies entirely on chatbots to draft articles or fact-check. While AI tools can help with back-end tasks, filter out vandalism, ease discovery, and offer translation, they also pose challenges: AI-generated secondary sources can end up being cited on Wikipedia as reliable ones, and newer contributors might mass-produce low-quality content.

“I have come across AI-generated articles filled with generic language, unreliable or fake citations, and confident-sounding claims without proper evidence,” Vangari said.

Since 2024, volunteers have flagged more than 4,800 articles with suspected AI-generated content, Wikimedia said. About 5% of newly created English-language Wikipedia pages in a single month contained some AI-generated text, an October 2024 Princeton University study found. But given the size of the volunteer-editor community, it’s also easier to vet and delete slop in English. Regional-language editors are facing the same problem with fewer resources.

In the long run, there’s also the danger that Wikipedia, whose content is being used to train AI models, begins to draw on AI-generated sources itself. That feedback loop could lead to model collapse, in which models trained on their own outputs degrade over time.

“We should have robust tools, policies, and practices to mitigate these challenges on Wikipedia,” Hussain said.


