AIs are being fed AI output by the people who are supposed to feed them original input

Workers hired via crowdsource services like Amazon Mechanical Turk are using large language models to complete their tasks – which could have negative knock-on effects on AI models in the future.

Data is critical to AI. Developers need clean, high-quality datasets to build machine learning systems that are accurate and reliable. Compiling valuable, top-notch data, however, can be tedious. Companies often turn to third-party platforms such as Amazon Mechanical Turk to instruct pools of cheap workers to perform repetitive tasks – such as labeling objects, describing situations, transcribing passages, and annotating text.

Their output can be cleaned up and fed into a model to train it to reproduce that work on a much larger, automated scale.

AI models are thus built on the backs of human labor: people toiling away, providing mountains of training examples for AI systems that corporations can use to make billions of dollars.

But an experiment conducted by researchers at the École polytechnique fédérale de Lausanne (EPFL) in Switzerland has concluded that these crowdsourced workers are using AI systems – such as OpenAI’s chatbot ChatGPT – to perform odd jobs online.

Training a model on its own output is not recommended. If crowd workers lean on chatbots for their submissions, we could see AI models being trained on data generated not by people, but by other AI models – perhaps even the same models. That could lead to disastrous output quality, more bias, and other unwanted effects.
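To get an intuition for why, here is a toy simulation – a sketch, not anything from the study – in which the "model" is just a Gaussian distribution: fit it to data, sample from the fit, refit to the samples, and repeat. Each generation inherits the previous one's sampling error, so the fitted parameters drift away from the original data's.

```python
# Toy illustration (not the study's method) of training on your own output:
# fit a Gaussian to data, sample from the fit, refit to those samples, repeat.
# Sampling error compounds from one generation to the next, so the fit
# gradually drifts away from the original, human-generated data.
import numpy as np

rng = np.random.default_rng(0)
human_data = rng.normal(loc=0.0, scale=1.0, size=50)  # the "real" data

mu, sigma = human_data.mean(), human_data.std()
print(f"gen  0: mu={mu:+.3f} sigma={sigma:.3f}")
for generation in range(1, 31):
    synthetic = rng.normal(mu, sigma, size=50)     # the model's own output
    mu, sigma = synthetic.mean(), synthetic.std()  # retrain on that output
    if generation % 5 == 0:
        print(f"gen {generation:2d}: mu={mu:+.3f} sigma={sigma:.3f}")
```

A real language model fails in subtler ways, but the mechanism – errors compounding generation after generation – is the same.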

The experiment

The academics recruited 44 Mechanical Turk serfs to summarize the abstracts of 16 medical research papers, and estimated that 33 to 46 percent of passages of text submitted by the workers were generated using large language models. Crowd workers are often paid low wages – using AI to automatically generate responses allows them to work faster and take on more jobs to increase pay.

The Swiss team trained a classifier to predict whether submissions from the Turkers were human- or AI-generated. The academics also logged their workers’ keystrokes to detect whether the serfs copied and pasted text onto the platform, or typed in their entries themselves. There’s always the chance that someone uses a chatbot and then manually types in the output – but that’s unlikely, we suppose.
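The paper's exact classifier isn't reproduced here, but the basic idea can be sketched in a few lines of scikit-learn. Everything below is an assumption for illustration – toy data, TF-IDF features, logistic regression – not the EPFL team's actual pipeline: gather submissions known to be human-written and known to be LLM-generated, fit a classifier, then score new submissions.

```python
# Minimal sketch of a synthetic-text detector: TF-IDF word/bigram features
# feeding a logistic regression. The training data here is toy; a real
# detector needs labeled human and LLM-written summaries of the same tasks.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human = [
    "couldnt read the whole thing but the drug seems to work ok on mice",
    "study says the treatment helped some patients, small sample though",
    "they tried a new therapy and tumors shrank a bit, needs more testing",
]
synthetic = [
    "This study demonstrates that the novel treatment significantly improved patient outcomes.",
    "The findings suggest the proposed therapy offers a promising avenue for future research.",
    "Overall, the results indicate a statistically significant reduction in tumor growth.",
]
texts = human + synthetic
labels = [0] * len(human) + [1] * len(synthetic)  # 0 = human, 1 = LLM

detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # word and bigram features
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)

# Probability that an unseen submission was machine-generated.
new = ["The results underscore the efficacy of the intervention across all cohorts."]
print(detector.predict_proba(new)[0][1])
```

Combined with the keystroke logs – a pasted answer with no typing is a strong tell – this kind of classifier is what let the team estimate how much of the submitted text was synthetic.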

“We developed a very specific methodology that worked very well for detecting synthetic text in our scenario,” Manoel Ribeiro, co-author of the study and a PhD student at EPFL, told The Register this week.

[…]

Large language models will get worse if they are increasingly trained on fake content generated by AI and collected from crowdsource platforms, the researchers argued. Outfits like OpenAI keep exactly how they train their latest models a closely guarded secret, and may not rely heavily on things like Mechanical Turk, if at all. That said, plenty of other models may rely on human workers, who may in turn use bots to generate training data – and that is a problem.

Mechanical Turk, for one, is marketed as a provider of “data labeling solutions to power machine learning models.”

[…]

As AI continues to improve, it’s likely that crowdsourced work will change. Ribeiro speculated that large language models could replace some workers at specific tasks. “However, paradoxically, human data may be more precious than ever and thus it may be that these platforms will be able to implement ways to prevent large language model usage and ensure it remains a source of human data.”

Who knows – humans might even end up collaborating with large language models to generate responses too, he added.

Source: Today’s AI is artificial artificial artificial intelligence • The Register

It’s like a photocopy of a photocopy of a photocopy…
