The crowdworkers you hire to provide human data may use AI to produce it.

What's new: Researchers at École Polytechnique Fédérale de Lausanne found that written material supplied by workers hired via Amazon Mechanical Turk showed signs of having been generated by ChatGPT.

How it works: 44 Mechanical Turk workers summarized medical research abstracts in roughly 100 words. The authors analyzed each summary for evidence that it had been generated by ChatGPT. The analysis relied on two methods:

  • The authors fine-tuned e5-base to distinguish summaries written by humans before the experiment from summaries the authors generated by prompting ChatGPT with the same Mechanical Turk instructions (a sketch of this setup appears after this list).
  • They also logged the workers’ keystrokes. Keystrokes that matched the submitted text counted as evidence that a person had written it, while keystrokes that indicated copying and pasting suggested the submission had been generated.

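The paper describes the detector only at a high level, but the first method amounts to fine-tuning an embedding model as a binary text classifier. The minimal sketch below shows one way to set that up with Hugging Face Transformers; the checkpoint name, placeholder data, and hyperparameters are illustrative assumptions, not the authors’ implementation.

```python
# Minimal sketch (not the authors' code): fine-tune e5-base as a binary
# human-vs-ChatGPT classifier using Hugging Face Transformers and Datasets.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder data: summaries people wrote before the experiment (label 0)
# and summaries generated by prompting ChatGPT with the task instructions (label 1).
human_summaries = ["..."]
chatgpt_summaries = ["..."]

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base")
model = AutoModelForSequenceClassification.from_pretrained("intfloat/e5-base", num_labels=2)

dataset = Dataset.from_dict({
    "text": human_summaries + chatgpt_summaries,
    "label": [0] * len(human_summaries) + [1] * len(chatgpt_summaries),
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True).train_test_split(test_size=0.2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chatgpt-detector", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()

# At inference time, a softmax over the two logits gives the probability that a
# new summary was generated by ChatGPT.
```

The second method can be approximated by a simple heuristic: if the characters a worker actually typed cover only a small fraction of the submitted text, the submission was probably pasted. The event format and threshold below are assumptions for illustration, not the authors’ logger.

```python
# Toy heuristic (not the authors' method): flag a submission as likely pasted
# when typed characters cover less than min_coverage of its length.
def likely_pasted(keystrokes, submission, min_coverage=0.8):
    typed = sum(1 for key in keystrokes if len(key) == 1)  # count printable key presses only
    return typed < min_coverage * len(submission)

keystrokes = list("The stu") + ["Control", "v"]  # typed a few characters, then pasted
print(likely_pasted(keystrokes, "The study found that the treatment reduced mortality."))  # True
```
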
Results: The authors analyzed 46 summaries written by 44 workers. The classifier found 21 summaries that showed a 50 percent or greater likelihood of having been written by ChatGPT and 15 that showed a 98 percent or greater likelihood. 41 of the summaries involved copying and pasting.

Yes, but: The researchers studied 46 summaries, a rather small sample. Furthermore, summarization is labor-intensive for humans but well within the capabilities of large language models. Other crowdsourced tasks may not be so easy to automate.

Behind the news: Mechanical Turk, which Amazon launched in 2005, has played an outsize role in machine learning. Many of the field’s most important datasets, including ImageNet, were built using crowdsourced labor.

Why it matters: Machine learning engineers often use services like Mechanical Turk to collect and annotate training data on the assumption that humans are doing the work. If a significant number of crowdworkers instead rely on AI, it raises questions about the quality of the data and the validity of the output from models trained on it. Recent work found that, as the amount of model-generated content in a training set increases, the trained model’s performance decreases.

We're thinking: Training on machine-generated data seems likely to degrade model performance unless you’re training a smaller model to mimic a larger one (known as model distillation). For example, it’s hard to imagine a language model trained only on the output of ChatGPT surpassing ChatGPT, whereas one trained on human data might. The lack of transparency about which data comes from humans and which comes from machines presents a huge challenge for AI practitioners.
